2/28 Binary Text Classification with Latent Semantic Analysis

Today, we'll take a dip in Natural Language Processing (NLP).

The IMDB Dataset

The IMDB Dataset has

movie review texts as inputs and
whether the review is positive or negative as labels.

Both novelties require special treatment. We will see some simple methods today.

Text to Vectors: tf-idf

We'll still want to use an affine transformation as our model. That requires vectors as input. So, first, we need to transform the review texts to numerical vectors.

A simple method that you can still use as a quick baseline is tf-idf (term frequency-inverse document frequency). This approach has many variations; we will describe its basic form.

Corpus, Documents, Terms

Let us first fix notation:

We are given a corpus, that is a sequence of documents $D=(d_1,\dotsc,d_n)$. In the present case, the documents are the reviews.
We decompose each document $d_i$ as a sequence of terms $d_i=(t_1^i,\dotsc,t_{l_i}^i)$.
1. We are free to decide what we term a term:
  1. They can be words. This is what we will do today.
  2. They can be $n$-grams, that is, sequences of $n$ consecutive words (note that the $n$ in $n$-gram is different from the $n$ for dataset length). For example, a bigram is a pair of consecutive words.
  3. They can be subwords. We will see more of this when we get to transformers, the state-of-the-art NLP models.
2. Note that the documents have varying numbers of terms. This is a technical issue we will need to be mindful of later.
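
The first two choices of term above can be sketched in a few lines of Python. This is only an illustration: the helper names `words` and `ngrams` are made up for this note, and splitting on whitespace after lowercasing is an assumed, deliberately naive tokenizer.

```python
def words(text):
    """Split a document into lowercase word terms (naive whitespace tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """Return all sequences of n consecutive tokens as tuples."""
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]


review = "A truly great movie"
unigrams = words(review)
bigrams = ngrams(unigrams, 2)

print(unigrams)  # ['a', 'truly', 'great', 'movie']
print(bigrams)   # [('a', 'truly'), ('truly', 'great'), ('great', 'movie')]
```

Note that a document with $l$ words yields $l-n+1$ $n$-grams, so the sequence lengths still vary between documents.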

Vocabulary

The vocabulary is the set

$$ V=\{t_j^i:1\le i\le n,\,1\le j\le l_i\} $$ of all terms appearing in the corpus. For $t\in V$ and $1\le i\le n$, we will write $t\in d_i$ if there exists $1\le j\le l_i$ such that $t_j^i=t$.

tf and idf

Now for each document $d_i$ and term $t\in V$, the value $\mathop{\text{tf-idf}}(t, d_i, D)$ is the product of two values:

Term frequency (tf) is the relative frequency of the term in the document, that is, the share of positions in $d_i$ that are occupied by $t$: $$ \mathop{\text{tf}}(t,d_i)=\frac{|\{1\le j\le l_i:t_j^i=t\}|}{l_i} $$
Inverse document frequency (idf) [1] is the natural logarithm of the inverse of the fraction of documents in the corpus that contain the given term: $$ \mathop{\text{idf}}(t, D)=\log\frac{n}{|\{1\le i\le n:t\in d_i\}|} $$

That is, in tf-idf the term counts in a document are normalized by how common the terms are in the corpus. Thus, rarer terms get larger weight. It is also common to introduce stop words, that is, to exclude very common words such as "a" or "the".
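
As a small illustration of the two definitions, here is a sketch in plain Python on a made-up three-document corpus. The function names `tf`, `idf`, and `tf_idf` and the toy reviews are assumptions of this example, not part of the IMDB pipeline.

```python
import math

# Toy corpus: each document is already split into word terms.
corpus = [
    "the movie was great".split(),
    "the movie was boring".split(),
    "the plot was great fun".split(),
]


def tf(term, doc):
    """Share of positions in doc that hold the given term."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """Natural log of (number of documents / number of documents containing term).

    Only defined for terms of the vocabulary, i.e. terms that occur somewhere.
    """
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)


print(tf("great", corpus[0]))  # 0.25
print(idf("the", corpus))      # 0.0
print(idf("boring", corpus) > idf("great", corpus))  # True
```

A term that occurs in every document, like "the" here, gets idf 0 and therefore tf-idf 0 everywhere, while "boring", which occurs in a single document, gets the largest idf.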

tf-idf matrix

Let's make the vocabulary into a sequence $$ \mathscr V=(t_1,\dotsc,t_m). $$

Then we can form the tf-idf matrix, the $n\times m$ matrix of the tf-idf values of each document-term pair: $$ \mathop{\text{tf-idf}}(D)=(\mathop{\text{tf-idf}}(t_j,d_i,D):1\le i\le n, 1\le j\le m) $$

One may wonder: can we use this as feature matrix?

Doing this for a train split of size 22 500, we get a vocabulary of 71 531 entries. With datatype float32, storing this matrix would take up $22\,500\cdot71\,531\cdot4=6\,437\,790\,000$ bytes, that is close to 6.5 gigabytes.
On the other hand, this matrix is very sparse: it only has 3 098 615 nonzero values, which is about 0.19% of the number of entries.
1. Sparse matrices are stored in a special way that requires space linear in the number of nonzero entries. See here for more info:
  https://docs.scipy.org/doc/scipy/reference/sparse.html
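
The arithmetic can be checked with a few lines of Python. The sparse estimate assumes the compressed sparse row (CSR) layout with 32-bit indices: one value and one column index per nonzero entry, plus one offset per row and a final one. The actual footprint depends on the format and index type you pick.

```python
n_docs, vocab_size, nnz = 22_500, 71_531, 3_098_615
bytes_per_float32 = 4
bytes_per_int32 = 4  # assumed index width

# Dense storage: one float32 per entry.
dense_bytes = n_docs * vocab_size * bytes_per_float32

# Fraction of entries that are nonzero.
density = nnz / (n_docs * vocab_size)

# CSR estimate: value + column index per nonzero, plus n_docs + 1 row offsets.
csr_bytes = nnz * (bytes_per_float32 + bytes_per_int32) + (n_docs + 1) * bytes_per_int32

print(dense_bytes)       # 6437790000
print(f"{density:.4%}")  # 0.1925%
print(csr_bytes)         # 24878924
```

Under these assumptions the sparse representation needs roughly 25 megabytes instead of about 6.4 gigabytes.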

To go back to the question: it would be a vast waste of resources to use the vocabulary entries as feature dimensions of our model.

Dimension Reduction: Truncated SVD

We apply a technique called Truncated Singular Value Decomposition (Truncated SVD) [2, Remark 5.1] to reduce the number of feature dimensions.

In general, for an $n\times m$ matrix $X$, there exists a Singular Value Decomposition (SVD) $X=U\Sigma V^T$ where:

$\Sigma$ is an $n\times m$ rectangular diagonal matrix with nonnegative numbers, the singular values, in the diagonal, and
$U$ and $V$ are orthogonal matrices (that is, their columns are orthonormal) of shapes $n\times n$ and $m\times m$, respectively.

In Truncated SVD:

We select a hyperparameter $d$: the reduced feature dimension.
Let $\Sigma'$ denote the $d\times d$ matrix of the rows and columns of $\Sigma$ that intersect at the $d$ largest singular values.
Let $U'$ and $V'$ denote the collection of the corresponding columns of $U$ and $V$. That is, they are $n\times d$ and $m\times d$ matrices, respectively.
Then $U'\Sigma'(V')^T$ is a good approximation of $X$.
So, we replace the $n\times m$ feature matrix $X$ with the $n\times d$ feature matrix $XV'$.
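
The steps above can be sketched with NumPy on a small random matrix standing in for $X$. This is a toy illustration: it computes a full SVD with `np.linalg.svd` and then truncates, which is not how one would treat a large sparse matrix (iterative routines such as `scipy.sparse.linalg.svds` exist for that case). The names `U_d`, `S_d`, and `V_d` are hypothetical stand-ins for $U'$, $\Sigma'$, and $V'$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))  # toy stand-in for an n x m tf-idf matrix
d = 2                   # reduced feature dimension

# Reduced SVD; NumPy returns the singular values in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

U_d = U[:, :d]        # n x d
S_d = np.diag(s[:d])  # d x d
V_d = Vt[:d].T        # m x d

X_approx = U_d @ S_d @ V_d.T  # rank-d approximation of X
X_reduced = X @ V_d           # n x d feature matrix

# Because the columns of V are orthonormal, X V' equals U' Sigma'.
assert np.allclose(X_reduced, U_d @ S_d)

print(X_reduced.shape)  # (6, 2)
```

The Frobenius norm of the approximation error `X - X_approx` equals the square root of the sum of the squares of the discarded singular values, which is one way to judge a choice of $d$.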

In NLP, the approach of using a low-rank approximation of a term-occurrence matrix is called Latent Semantic Analysis (LSA).

WARNING! Preprocessing and Splits

So, we have a preprocessing pipeline that takes a corpus and converts it to an $n\times d$ feature matrix. Now we'll train a model on these features.

Note that the model will only understand the text through the $d$ feature dimensions we got. Therefore:

Fit the aforementioned pipeline on the train set only:
1. Run tf-idf on the corpus to get tf-idf values for each document-term pair.
2. Run Truncated SVD to transform the sparse matrix with high feature dimension to a dense matrix with low feature dimension.
When preprocessing the validation and test splits, reuse the same
1. vocabulary sequence $\mathscr V=(t_1,\dotsc,t_m)$,
2. idf values $\mathop{\text{idf}}(t, D)$, and
3. reduction matrix $V'$
that you obtained from the train set. If you encounter a term that is not in the vocabulary, ignore it.
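
A minimal sketch of this fit/transform separation for the tf-idf part, in plain Python. The function names `fit` and `transform` are hypothetical, and dropping unknown terms before computing the document length is one possible reading of "not counting them". The reduction step is left out; it would reuse the matrix $V'$ obtained from the train set in the same way.

```python
import math
from collections import Counter


def fit(train_docs):
    """Learn the vocabulary order and the idf values from the train split only."""
    n = len(train_docs)
    df = Counter(t for doc in train_docs for t in set(doc))
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    return vocab, idf


def transform(docs, vocab, idf):
    """Build tf-idf rows with the fitted vocabulary; unseen terms are dropped."""
    rows = []
    for doc in docs:
        known = [t for t in doc if t in idf]
        counts = Counter(known)
        length = max(len(known), 1)  # guard against documents with no known term
        rows.append([counts[t] / length * idf[t] for t in vocab])
    return rows


train = [["great", "movie"], ["boring", "movie"]]
test = [["great", "plot"]]  # "plot" never occurs in the train split

vocab, idf = fit(train)
train_rows = transform(train, vocab, idf)
test_rows = transform(test, vocab, idf)

print(vocab)              # ['boring', 'great', 'movie']
print(len(test_rows[0]))  # 3
```

The test rows have the same number of columns as the train rows, so the same $V'$ and the same model can be applied to them.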

Binary Classification

Bernoulli Distribution

Recall that for classification problems with $l$ possible outcomes, we used the categorical distribution $(p_1,\dotsc,p_l)$. In binary classification, we have $l=2$. Therefore, we get $p_2 = 1 - p_1$. We see that it is enough to calculate $p_1$.

It is customary to name the two outcomes the negative and positive outcomes and use the Bernoulli distribution as the model, where there is only one parameter: the probability $p$ of the positive outcome.

Sigmoid and Logit Functions

So, our affine transformation will have 1-dimensional output. We'll transform this real number $z$ to a probability with the sigmoid (logistic) function $$ p=\sigma(z)=\frac{1}{1+e^{-z}}. $$ The inverse of this function is the logit function $$ z=\log\frac{p}{1-p}. $$ Its values, the logits, stand for logistic units.
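
A short sketch checking that the two functions are inverses of each other, with the sigmoid written as $1/(1+e^{-z})$. The plain `math.exp` version below is fine for moderate values of $z$; it is not hardened against overflow for very negative inputs.

```python
import math


def sigmoid(z):
    """Map a real number (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Map a probability in (0, 1) back to a real number (log-odds)."""
    return math.log(p / (1.0 - p))


# Round trip: logit(sigmoid(z)) recovers z up to floating-point error.
for z in (-2.0, 0.0, 3.5):
    assert math.isclose(logit(sigmoid(z)), z, abs_tol=1e-9)

print(sigmoid(0.0))  # 0.5
```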

Binary Accuracy

To calculate accuracy, we need to transform the probability to a prediction. Usually, we say the model predicts an entry as positive if the probability is larger than 50%. Note the following:

This agrees with the multiclass case: $p>1-p$ if and only if $p>1/2$.
It is equivalent to ask if the logit is positive: $\sigma(z)>1/2$ if and only if $z>0$.
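
As a sketch, binary accuracy can be computed directly from the logits, without applying the sigmoid at all. The function name `binary_accuracy` and the toy numbers are made up for this example.

```python
def binary_accuracy(logits, labels):
    """Share of entries where the prediction (logit > 0) matches the 0/1 label."""
    predictions = [1 if z > 0 else 0 for z in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


logits = [2.3, -0.7, 0.4, -1.5]
labels = [1, 0, 0, 0]

print(binary_accuracy(logits, labels))  # 0.75
```

Here the third entry has a positive logit but a negative label, so three of the four predictions are correct.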

Binary Cross-Entropy

We need to make a similar change to cross-entropy. Let us use 0 as the negative and 1 as the positive label. Then we can calculate cross-entropy as $$ \ell(\mathbf x, y; \theta)=-\bigl[y\log(\sigma(p_\theta(\mathbf x)))+(1-y)\log(1-\sigma(p_\theta(\mathbf x)))\bigr]. $$
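
A minimal sketch of the binary cross-entropy of a single entry, with the leading minus sign that makes it a nonnegative loss to minimize. The second function is an algebraically equivalent form computed directly from the logit $z$; it is added here as an assumption about what is convenient in practice, since it avoids taking the logarithm of a probability that has been rounded to 0 or 1. Both function names are hypothetical.

```python
import math


def bce_from_probability(p, y):
    """Binary cross-entropy of the predicted probability p for the label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_from_logit(z, y):
    """Same loss computed from the logit z, rewritten as
    max(z, 0) - z*y + log(1 + exp(-|z|)) so that exp never overflows."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# The two forms agree for moderate logits.
for z in (-3.0, 0.5, 4.0):
    for y in (0, 1):
        assert math.isclose(bce_from_probability(sigmoid(z), y), bce_from_logit(z, y))

# An undecided model (p = 1/2) pays log 2 regardless of the label.
print(math.isclose(bce_from_probability(0.5, 1), math.log(2)))  # True
```

The loss is small when the probability assigned to the true label is close to 1 and grows without bound as that probability approaches 0.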