3/5 Word Vectors

Today, we'll introduce another approach to transforming textual data into numerical data: with word vectors, we try to represent semantic relationships as linear ones. First, we try to explain the method in words. Then, we'll let the experiments speak for themselves.

Statistics on Semantic Relationships

So, what do we mean by this? Well, semantic means pertaining to meaning. Therefore, a vector should somehow represent meaning. But we train our system on text only, given a corpus, so you can't just tell the system that "apple" signifies an apple. (There are so-called world models that are trained on all sorts of data, as a human is, but we'll not get into that here.)

What you can do instead is to take a contextual approach: a word is described by how often it occurs in the context of which other words. If these statistics are the same for two words, the words are interchangeable, so we view them as having the same meaning.

Context Windows

Well now, given a corpus, when can we say that a word occurs in the context of another? We take a crude approach: we fix the context window length, an integer hyperparameter. Then we say that a word is in the context of another if their distance in the text is at most this number.
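To make this concrete, here is a minimal sketch. It assumes the corpus is already split into a list of tokens and that distance is counted in token positions; both are assumptions, and the name `context_pairs` is made up for illustration.

```python
from collections import Counter

def context_pairs(tokens, window):
    """Return all (word, context_word) pairs whose positions differ by at most `window`."""
    pairs = []
    for i, word in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a word is not its own context
                pairs.append((word, tokens[j]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = context_pairs(tokens, window=1)

# Co-occurrence statistics: how often each word occurs in the context of which words.
counts = Counter(pairs)
```

With `window=1`, only direct neighbours count, so `("cat", "sat")` is a pair while `("the", "sat")` is not.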

Skip-Gram with Negative Sampling

We'll discuss the word vector learning algorithm Skip-Gram with Negative Sampling (SGNS) [1] in its most basic form. For alternatives, see the comparison paper [2]. We will be using pre-trained word vectors that were obtained with many refinements [3].

Pre-Training

This is the first time we'll use a pre-trained model. This means that some parameters are obtained by training on a more general task. In this particular case, we'll be using word vectors that were trained on a big, generic corpus.

Negative Sampling

The word vectors are obtained via Negative Sampling. This is our first Unsupervised Task: we don't have explicit targets; rather, the model has to fit the distribution of the data.

In Negative Sampling, our dataset is a relation: $R\subseteq\mathscr X_1\times\mathscr X_2$. Examples include:

Entries are words, the relation is occurring in the same context.
Entries are words and contexts, the relation is the word occurring in the context.
Entries are questions and answers, the relation is the answer answering the question.
Entries are photographs, the relation is being photographs of the same person.
Entries are 3d coordinates, the relation is being vertices of the same 3d object.

The model should create embedding vectors of the objects: $$ \mathscr X_i\xrightarrow{f_i}\mathbf R^d,\,i=1,2 $$ such that the embedding vectors of related entries align in the following sense: given

an anchor $x_1\in\mathscr X_1$,
a positive example $x_2^p\in\mathscr X_2:(x_1,x_2^p)\in R$ and
negative examples $x_2^{n_j}\in\mathscr X_2:(x_1,x_2^{n_j})\notin R$ for $1\le j\le k$, where the number $k$ of negative examples is a hyperparameter,

we want $$ f_1(x_1)\cdot f_2(x_2^p)>f_1(x_1)\cdot f_2(x_2^{n_j}),\,1\le j\le k. $$

We can use cross-entropy as loss, with the scalar products being the logits: $$ \ell=\log\left( \exp\left( f_1(x_1)\cdot f_2(x_2^p) \right) + \sum_{j=1}^k\exp\left( f_1(x_1)\cdot f_2(x_2^{n_j}) \right) \right) - f_1(x_1)\cdot f_2(x_2^p) $$
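As a sanity check, the loss can be written out in NumPy. This is only a sketch: the function name `negative_sampling_loss` is made up, and the embedding vectors are passed in directly as arrays instead of being computed by models $f_1$ and $f_2$.

```python
import numpy as np

def negative_sampling_loss(anchor, positive, negatives):
    """Cross-entropy loss with the positive example as the correct class.

    anchor:    shape (d,)   -- f_1(x_1)
    positive:  shape (d,)   -- f_2(x_2^p)
    negatives: shape (k, d) -- f_2(x_2^{n_j}), j = 1..k
    """
    # Logits are the scalar products; index 0 belongs to the positive example.
    logits = np.concatenate(([anchor @ positive], negatives @ anchor))
    # log-sum-exp, shifted by the maximum for numerical stability
    m = logits.max()
    log_sum_exp = m + np.log(np.exp(logits - m).sum())
    return log_sum_exp - logits[0]

# If all logits are equal, the loss is log(k + 1); here k = 2.
loss_value = negative_sampling_loss(np.zeros(3), np.ones(3), np.ones((2, 3)))
```

The loss is always positive. It gets close to zero once the positive scalar product is much larger than all the negative ones.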

We only let gradients flow through the anchor, that is, we take an SGD step with respect to $f_1$ only. But we then repeat this with the roles of $f_1$ and $f_2$ reversed, provided they are different.
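A gradient step on the anchor alone could look as follows. This is a sketch under a simplifying assumption: the anchor embedding is treated as a free parameter vector, as in a lookup table, so the step updates that vector directly. The names `anchor_sgd_step` and `lr` are made up for illustration.

```python
import numpy as np

def loss(anchor, positive, negatives):
    """Cross-entropy loss; the scalar products are the logits, the positive is class 0."""
    logits = np.vstack([positive, negatives]) @ anchor
    m = logits.max()
    return m + np.log(np.exp(logits - m).sum()) - logits[0]

def anchor_sgd_step(anchor, positive, negatives, lr=0.1):
    """One SGD step on the anchor embedding; positive and negatives stay fixed."""
    candidates = np.vstack([positive, negatives])  # shape (k + 1, d), row 0 is the positive
    logits = candidates @ anchor
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax over the k + 1 candidates
    probs[0] -= 1.0                                # softmax minus one-hot target
    grad = candidates.T @ probs                    # gradient of the loss w.r.t. the anchor
    return anchor - lr * grad

anchor = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])
negatives = np.array([[0.0, 1.0], [-1.0, 0.0]])

new_anchor = anchor_sgd_step(anchor, positive, negatives)
loss_before = loss(anchor, positive, negatives)
loss_after = loss(new_anchor, positive, negatives)
```

The step moves the anchor towards the positive embedding and away from the negative ones, so the loss goes down. Reversing the roles would mean calling the same function with an embedding from the other side as the anchor.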

Document Vectors from Word Vectors

Given a document $$ D=(t_1,\dotsc,t_l), $$ we get a sequence of word vectors $$ s=(f(t_1),\dotsc,f(t_l))\in\mathbf R^{l\times d}, $$ where $d$ is the embedding dimension. How should we get one feature vector out of this? The basic approach we'll use for now is to take averages componentwise: $$ \mathbf x:=\frac{1}{l}\sum_{i=1}^lf(t_i). $$
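The averaging step can be sketched as below. The toy dictionary `toy_vectors` stands in for real pre-trained vectors, and the function name `document_vector` is made up. Two choices here go beyond the formula and are assumptions: tokens without a vector are skipped, so the average runs over known tokens only, and a document with no known tokens gets the zero vector.

```python
import numpy as np

def document_vector(tokens, word_vectors, dim):
    """Componentwise average of the word vectors of the tokens in a document."""
    # Assumption: tokens without a vector are ignored.
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        # Assumption: fall back to the zero vector if no token is known.
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy 2-dimensional "word vectors", for illustration only.
toy_vectors = {
    "good": np.array([1.0, 0.0]),
    "movie": np.array([0.0, 1.0]),
    "bad": np.array([-1.0, 0.0]),
}

x = document_vector(["good", "movie"], toy_vectors, dim=2)
```

Here `x` is the midpoint of the vectors for "good" and "movie". Note that averaging ignores word order: any permutation of the document gives the same feature vector.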

CORRUPTION ALERT

Note that the pre-trained vectors we use were trained on corpora largely crawled from the internet up to June 2017. The paper introducing the IMDB dataset [4] was published in 2011. Therefore, it is possible that the pre-training corpus includes information on the validation set of the IMDB dataset. So, validating on IMDB is not a valid way of testing generalization performance here. You need to make sure that your validation data didn't leak into your training data. The safest approach, when possible, is to check that your dataset became public after the pre-training corpus was crawled. In this spirit, in the lab today, we'll use a dataset that was published in 2018 [5]:
https://huggingface.co/datasets/dair-ai/emotion

References

[1] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado and Jeff Dean: Distributed Representations of Words and Phrases and their Compositionality, 2013. Advances in Neural Information Processing Systems 26 (NIPS 2013). https://papers.nips.cc/paper_files/paper/2013/hash/9aa42b31882ec039965f3c4923ce901b-Abstract.html

[2] Omer Levy, Yoav Goldberg and Ido Dagan: Improving Distributional Similarity with Lessons Learned from Word Embeddings, 2015. Transactions of the Association for Computational Linguistics (TACL), volume 3, pages 211--225. https://aclanthology.org/Q15-1016/

[3] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch and Armand Joulin: Advances in Pre-Training Distributed Word Representations, 2018. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). https://aclanthology.org/L18-1008/

Dataset References

[4] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts: Learning Word Vectors for Sentiment Analysis, 2011. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL 2011), pages 142--150. https://aclanthology.org/P11-1015/

[5] Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu and Yi-Shin Chen: CARER: Contextualized Affect Representations for Emotion Recognition, 2018. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687--3697. https://aclanthology.org/D18-1404/