2/28 Binary Text Classification with Latent Semantic Analysis
Today, we'll take a dip in Natural Language Processing (NLP).
The IMDB Dataset
The IMDB Dataset has
- movie review texts as inputs and
- whether the review is positive or negative as labels.
Both novelties require special treatment. We will see some simple methods today.
Text to Vectors: tf-idf
We'll still want to use an affine transformation as our model. That requires vectors as input. So, first, we need to transform the review texts to numerical vectors.
A simple method that you can still use as a quick baseline is tf-idf (term frequency-inverse document frequency). This approach has many variations; we will describe its basic form.
Corpus, Documents, Terms
Let us first fix notation:
- We are given a corpus, that is, a sequence of documents \(D=(d_1,\dotsc,d_n)\). In the present case, the documents are the reviews.
- We decompose each document \(d_i\) as a sequence of terms \(d_i=(t_1^i,\dotsc,t_{l_i}^i)\).
- We are free to decide what we term a term:
  - They can be words. This is what we will do today.
  - They can be \(n\)-grams, that is, sequences of \(n\) consecutive words (note that the \(n\) in \(n\)-gram is different from the \(n\) for the dataset length). For example, a bigram is a pair of consecutive words.
  - They can be subwords. We will see more of this when we get to transformers, the state-of-the-art NLP models.
- Note that the documents have varying numbers of terms. This is a technical issue we will need to be mindful of later.
Vocabulary
The vocabulary is the set
$$ V=\{t_j^i:1\le i\le n,\,1\le j\le l_i\} $$ of all terms appearing in the corpus. For \(t\in V\) and \(1\le i\le n\), we will write \(t\in d_i\) if there exists \(1\le j\le l_i\) such that \(t_j^i=t\).
tf and idf
Now for each document \(d_i\) and term \(t\in V\), the value \(\mathop{\text{tf-idf}}(t, d_i, D)\) is the product of two values:
- Term frequency (tf) is the fraction of the terms in the document that are equal to the selected term: $$ \mathop{\text{tf}}(t,d_i)=\frac{|\{1\le j\le l_i:t_j^i=t\}|}{l_i} $$
- Inverse document frequency (idf) [1] is the natural logarithm of the inverse of the fraction of documents in the corpus that contain the given term: $$ \mathop{\text{idf}}(t, D)=\log\frac{n}{|\{1\le i\le n:t\in d_i\}|} $$
That is, in tf-idf the term frequencies in a document are weighted by how rare the terms are in the corpus. Thus, rarer terms get larger weight. It is also common to introduce stop words, that is, to exclude very common words such as "a" or "the".
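To make the two definitions concrete, here is a minimal sketch in plain Python that follows the formulas above literally. The toy corpus and the function names `tf`, `idf`, and `tf_idf` are made up for illustration; libraries typically use variations of these formulas.

```python
import math

# Toy corpus (hypothetical): each document is already split into word terms.
corpus = [
    ["a", "great", "movie"],
    ["a", "boring", "movie"],
    ["great", "great", "fun"],
]

def tf(term, doc):
    """Fraction of the positions in `doc` that hold `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Natural log of (number of documents / number of documents containing `term`).

    Assumes `term` is in the vocabulary, so at least one document contains it.
    """
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

print(tf("great", corpus[2]))              # 2 of the 3 terms
print(idf("great", corpus))                # log(3 / 2)
print(idf("fun", corpus))                  # log(3 / 1): rarer term, larger weight
print(tf_idf("great", corpus[2], corpus))
```

Note that a term appearing in every document gets idf \(\log 1=0\), so it contributes nothing to the features.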
tf-idf matrix
Let's make the vocabulary into a sequence $$ \mathscr V=(t_1,\dotsc,t_m). $$
Then we can form the tf-idf matrix, the \(n\times m\) matrix of the tf-idf values of each document-term pair: $$ \mathop{\text{tf-idf}}(D)=(\mathop{\text{tf-idf}}(t_j,d_i,D):1\le i\le n, 1\le j\le m) $$
One may wonder: can we use this as the feature matrix?
- Doing this for a train split of size 22 500, we get a vocabulary of 71 531 entries. With datatype float32, storing this matrix densely would take up \(22\,500\cdot71\,531\cdot4=6\,437\,790\,000\) bytes, that is, roughly 6.4 gigabytes.
- On the other hand, this matrix is very sparse: it has only 3 098 615 nonzero values, which is about 0.19% of the number of entries.
- Sparse matrices are stored in a special way that requires space linear in the number of nonzero entries. See here for more info:
https://docs.scipy.org/doc/scipy/reference/sparse.html
To go back to the question: it would be a vast waste of resources to use the vocabulary entries as feature dimensions of our model.
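The storage figures above can be checked with a few lines of arithmetic. The sparse estimate below assumes the compressed sparse row (CSR) layout with 32-bit indices, which is one common choice; the actual footprint depends on the format and index type used.

```python
n_docs = 22_500
vocab_size = 71_531
nnz = 3_098_615          # number of nonzero tf-idf values

bytes_per_float32 = 4
bytes_per_int32 = 4      # assumption: 32-bit column indices and row pointers

# Dense storage: one float32 per matrix entry.
dense_bytes = n_docs * vocab_size * bytes_per_float32

# CSR storage: one value and one column index per nonzero entry,
# plus one row pointer per row and one extra at the end.
csr_bytes = nnz * (bytes_per_float32 + bytes_per_int32) + (n_docs + 1) * bytes_per_int32

density = nnz / (n_docs * vocab_size)

print(dense_bytes)            # 6437790000
print(csr_bytes)              # 24878924
print(f"{density:.4%}")       # share of nonzero entries
```

Under these assumptions the sparse representation needs about 25 megabytes instead of several gigabytes.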
Dimension Reduction: Truncated SVD
We apply a technique called Truncated Singular Value Decomposition (Truncated SVD) [2, Remark 5.1] to reduce the number of feature dimensions.
In general, for an \(n\times m\) matrix \(X\), there exists a Singular Value Decomposition (SVD) \(X=U\Sigma V^T\) where:
- \(\Sigma\) is an \(n\times m\) diagonal matrix with nonnegative numbers in the diagonal, and
- \(U\) and \(V\) are orthogonal matrices (their columns are orthonormal) of shapes \(n\times n\) and \(m\times m\), respectively.
In Truncated SVD:
- We select a hyperparameter \(d\): the reduced feature dimensions.
- Let \(\Sigma'\) denote the \(d\times d\) matrix of the rows and columns of \(\Sigma\) that intersect at the \(d\) largest values.
- Let \(U'\) and \(V'\) denote the collection of the corresponding columns of \(U\) and \(V\). That is, they are \(n\times d\) and \(m\times d\) matrices, respectively.
- Then \(U'\Sigma'(V')^T\) is a good approximation of \(X\).
- So, we replace the \(n\times m\) feature matrix \(X\) with the \(n\times d\) feature matrix \(XV'\).
In NLP, the approach of using a low-rank approximation of a term-occurrence matrix is called Latent Semantic Analysis (LSA).
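The steps of Truncated SVD can be sketched with NumPy on a small dense matrix. This is only an illustration: the random matrix `X` is a stand-in for the tf-idf matrix, and for a large sparse matrix one would rather use an iterative solver (for example `scipy.sparse.linalg.svds` or scikit-learn's `TruncatedSVD`) than a full SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))   # stand-in for an n x m feature matrix (n=6, m=4)
d = 2                    # reduced feature dimension (hyperparameter)

# Thin SVD; NumPy returns the singular values sorted in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

U_d = U[:, :d]           # n x d: columns for the d largest singular values
S_d = np.diag(s[:d])     # d x d
V_d = Vt[:d, :].T        # m x d

X_approx = U_d @ S_d @ V_d.T   # rank-d approximation of X
X_reduced = X @ V_d            # n x d reduced feature matrix

print(X_reduced.shape)
print(np.allclose(X_reduced, U_d @ S_d))   # X V' coincides with U' Sigma'
print(np.linalg.norm(X - X_approx))        # Frobenius error of the approximation
```

The Frobenius error equals the square root of the sum of the squared discarded singular values, which is why keeping the largest ones gives a good approximation.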
WARNING! Preprocessing and Splits
So, we have a preprocessing pipeline that takes a corpus and converts it to an \(n\times d\) feature matrix. Now we'll train a model on these features.
Note that the model will only understand the text through the \(d\) feature dimensions we got. Therefore:
- You need to run the aforementioned pipeline on the train set:
  - Run tf-idf on the corpus to get tf-idf values for each document-term pair.
  - Run Truncated SVD to transform the sparse matrix with high feature dimension to a dense matrix with low feature dimension.
- When preprocessing the validation and test splits, use the same
  - vocabulary sequence \(\mathscr V=(t_1,\dotsc,t_m)\),
  - idf values \(\mathop{\text{idf}}(t, D)\), and
  - reduction matrix \(V'\)

  that you obtained from the train set. If you encounter a term that is not in the vocabulary, don't count it in.
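The fit-on-train, reuse-elsewhere pattern can be sketched in plain Python for the tf-idf step. The helper names `fit_tfidf` and `transform_tfidf` and the toy splits are hypothetical. The sketch also assumes one particular reading of "don't count it in": unknown terms are dropped before counting, so they do not contribute to the document length either.

```python
import math

def fit_tfidf(train_docs):
    """Fix the vocabulary sequence and the idf values using the train split only."""
    vocab = sorted({term for doc in train_docs for term in doc})
    n = len(train_docs)
    idf = {t: math.log(n / sum(1 for doc in train_docs if t in doc)) for t in vocab}
    return vocab, idf

def transform_tfidf(docs, vocab, idf):
    """Compute tf-idf rows for any split, reusing the fitted vocabulary and idf."""
    rows = []
    for doc in docs:
        # Assumption: out-of-vocabulary terms are dropped before counting,
        # so they affect neither the term counts nor the document length.
        known = [t for t in doc if t in idf]
        if known:
            rows.append([known.count(t) / len(known) * idf[t] for t in vocab])
        else:
            rows.append([0.0] * len(vocab))
    return rows

train_docs = [["good", "movie"], ["bad", "movie"], ["good", "fun"]]
test_docs = [["good", "unseen", "movie"]]

vocab, idf = fit_tfidf(train_docs)                  # fitted on train only
X_train = transform_tfidf(train_docs, vocab, idf)
X_test = transform_tfidf(test_docs, vocab, idf)     # same vocabulary and idf

print(vocab)
print(X_test[0])
```

The same applies to the reduction step: the matrix \(V'\) computed from the train features is kept and the test features are multiplied by that same \(V'\).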
Binary Classification
Bernoulli Distribution
Recall that for classification problems with \(l\) possible outcomes, we used a categorical distribution \((p_1,\dotsc,p_l)\). In binary classification, we have \(l=2\). Therefore, we get \(p_2 = 1 - p_1\). We see that it is enough to calculate \(p_1\).
It is customary to name the two outcomes the negative and positive outcomes and use the Bernoulli distribution as the model, where there is only one parameter: the probability \(p\) of the positive outcome.
Sigmoid and Logit Functions
So, our affine transformation will have 1-dimensional output. We'll transform this real number to a probability with the sigmoid (logistic) function $$ p=\sigma(z)=\frac{1}{1+e^{-z}}. $$ The inverse of this function is the logit function $$ z=\log\frac{p}{1-p}. $$ Its values, the logits, stand for logistic units.
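As a quick check that the two functions are inverses of each other, here is a minimal sketch using the standard logistic sigmoid \(1/(1+e^{-z})\). The function names are chosen for illustration.

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number to a probability in (0, 1).

    Caveat: this naive form can overflow in math.exp for very negative z.
    """
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: maps a probability in (0, 1) back to a real number."""
    return math.log(p / (1.0 - p))

for z in (-2.0, 0.0, 3.0):
    p = sigmoid(z)
    print(z, p, logit(p))   # logit(sigmoid(z)) recovers z up to rounding
```

In particular, a logit of 0 corresponds to a probability of exactly 1/2.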
Binary Accuracy
Note that to calculate accuracy, we need to transform the probability to a prediction. Usually, we say the model predicts an entry as positive if the probability is larger than 50%. Note the following:
- This agrees with the multiclass case: \(p>1-p\) if and only if \(p>1/2\).
- It is equivalent to ask if the logit is positive.
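The following sketch computes binary accuracy directly from logits and checks on an example that thresholding the probability at 1/2 gives the same predictions as thresholding the logit at 0. The logits and labels are made-up values, and `binary_accuracy` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_accuracy(logits, labels):
    """Fraction of entries where the prediction (logit > 0) matches the 0/1 label."""
    predictions = [1 if z > 0 else 0 for z in logits]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

logits = [2.1, -0.3, 0.7, -1.5]   # hypothetical model outputs
labels = [1, 1, 1, 0]

# Same predictions whether we threshold the probability or the logit.
same = [sigmoid(z) > 0.5 for z in logits] == [z > 0 for z in logits]
print(same)

print(binary_accuracy(logits, labels))   # 3 of 4 predictions are correct
```

Working with logits avoids computing the sigmoid at all when only predictions are needed.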
Binary Cross-Entropy
We need to make a similar change to cross-entropy. Let us use 0 as the negative and 1 as the positive label. Then we can calculate cross-entropy as $$ \ell(\mathbf x, y; \theta)=-\bigl[y\log(\sigma(p_\theta(\mathbf x)))+(1-y)\log(1-\sigma(p_\theta(\mathbf x)))\bigr]. $$
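A minimal sketch of this loss for a single example, written with the conventional leading minus sign so that the loss is nonnegative. Here `z` plays the role of the model output before the sigmoid; the function name is chosen for illustration, and a production implementation would usually compute the loss from the logit in a numerically stable way.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(z, y):
    """Loss for one example with logit z and label y in {0, 1}."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(binary_cross_entropy(0.0, 1))   # log(2): the model is undecided
print(binary_cross_entropy(3.0, 1))   # confident and right: small loss
print(binary_cross_entropy(3.0, 0))   # confident and wrong: large loss
```

Only one of the two terms is active for each example: the first when \(y=1\), the second when \(y=0\).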