2/19 Stochastic Gradient Descent

with Replacement (Random Sampling)

As the MNIST train set has 60 000 entries and the test set has 10 000, we randomly split off 10 000 entries from the train set to form the validation set. This leaves 50 000 entries for training.

That is, if we perform gradient descent with this training set, then in each step we have to calculate the gradient $\nabla_\theta\ell(\mathbf x, y; \theta)$ at all 50 000 dataset entries $(\mathbf x, y)$.

Now, recall that we view the train set as a collection of random samples from the joint distribution of $(X, Y)$. So, we can try taking each training step on a small random sample from the train set instead. If you know bootstrapping, this is like that: we approximate sampling from the distribution by sampling from the dataset.

It turns out that this approach:

Works!
Does not just work, it often works better: the noise introduced by the random samples can help the optimization process escape local minima.

The size of the small samples we take is the minibatch size, yet another hyperparameter. It is also called batch size, but I recommend the former, as

minibatch refers only to the batch of entries you use in one training step, while
batch can mean batches of entries used for other purposes.

Beware: we need to take a different small sample in each training step. Training repeatedly on the same small sample causes the model to overfit to it.
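To make the procedure concrete, here is a minimal sketch of gradient descent on random samples. It is not tied to MNIST: the toy linear regression data, the mean squared error loss, and the names `minibatch_gradient` and `sgd_random_sampling` are assumptions chosen for illustration, as are the values of the hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a train set (an assumption): 1 000 entries, 3 features.
X = rng.normal(size=(1000, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=1000)


def minibatch_gradient(theta, X_batch, y_batch):
    """Gradient of the mean squared error over one minibatch."""
    residual = X_batch @ theta - y_batch
    return 2.0 * X_batch.T @ residual / len(y_batch)


def sgd_random_sampling(X, y, rng, minibatch_size=32, learning_rate=0.05, n_steps=500):
    """Gradient descent where each step uses a freshly drawn random sample."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        # A new sample in every step; indices are drawn with replacement.
        idx = rng.integers(0, len(y), size=minibatch_size)
        theta -= learning_rate * minibatch_gradient(theta, X[idx], y[idx])
    return theta


theta_hat = sgd_random_sampling(X, y, rng)
print(theta_hat)  # should land close to true_theta
```

Because `rng.integers` draws the indices independently, the same entry can appear more than once, even within a single minibatch. This is the "with replacement" in the heading.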

without Replacement (Random Reshuffling)

In practice, the version of Stochastic Gradient Descent used in most cases is a different one. Here, we repeat the following:

Shuffle the train set.
Perform a gradient descent step on each successive subset of minibatch size. The last minibatch can be smaller.

One pass of this outer loop is called an epoch. That is, during an epoch we take $\left\lceil\frac{\text{train set size}}{\text{minibatch size}}\right\rceil$ training steps.

Training length is often measured in epochs. Note that during an epoch, multiple gradient descent steps are taken, each one changing the parameters.
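The epoch structure can be sketched as follows. The helper name `reshuffled_minibatches` and the minibatch size of 64 are assumptions for illustration, and the actual training step is left as a comment.

```python
import math

import numpy as np


def reshuffled_minibatches(n_entries, minibatch_size, rng):
    """Yield index arrays that cover every entry exactly once, in random order."""
    permutation = rng.permutation(n_entries)  # shuffle the train set
    for start in range(0, n_entries, minibatch_size):
        # The slice is shorter for the last minibatch if the sizes do not divide.
        yield permutation[start:start + minibatch_size]


rng = np.random.default_rng(0)
n_entries, minibatch_size = 50_000, 64

for epoch in range(2):
    batches = list(reshuffled_minibatches(n_entries, minibatch_size, rng))
    for idx in batches:
        pass  # one gradient descent step on the entries indexed by idx goes here
    steps_per_epoch = len(batches)

print(steps_per_epoch)                      # 782
print(math.ceil(n_entries / minibatch_size))  # 782
print(len(batches[-1]))                     # 16
```

With 50 000 entries and a minibatch size of 64, an epoch consists of 781 full minibatches and one last minibatch of 16 entries, which gives the 782 steps predicted by the ceiling formula.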

One may wonder whether random reshuffling really works better than random sampling. A recent theoretical paper on this question is [1].

Broadcasting

Recall that we can view a tensor of shape $(n_1,\dotsc,n_k)$ as a function $$ [n_1]\times\dotsb\times[n_k]\to K $$ where $K$ is a set of numbers, either in the mathematical sense (such as $\mathbb R$) or in the computer science sense (such as 32-bit floats).

Broadcasting means that when you perform a binary operation on two tensors, you can leave out some dimensions from one or the other, and the operation will be conducted as if the tensors were constant along those dimensions.

Take tensors $a$ and $b$, of respective shapes $(m_1,\dotsc,m_k)$ and $(n_1,\dotsc,n_l)$. Up to swapping the two tensors, we can assume $k\le l$.

1. If $k<l$, then we change the shape of $a$ by padding it with 1's from the left until $k=l$.
2. For each $1\le i\le l$ (with $m_i$ now denoting the sizes of the padded shape):
    1. If $m_i=n_i$, then the operation is performed normally along that dimension.
    2. If exactly one of the sizes $m_i$ and $n_i$ is 1, then the corresponding tensor is viewed as constant along that dimension.
    3. If the sizes $m_i$ and $n_i$ are different and neither is 1, then the two tensors are not broadcastable and the operation is forbidden.
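The rules above can be sketched as a small function that computes the shape of the result. The name `broadcast_shape` is hypothetical and the function only illustrates the rules; NumPy provides `np.broadcast_shapes` for this purpose, which is used here as a cross-check.

```python
import numpy as np


def broadcast_shape(shape_a, shape_b):
    """Return the result shape under the rules above, or raise ValueError."""
    # Step 1: pad the shorter shape with 1's from the left.
    length = max(len(shape_a), len(shape_b))
    a = (1,) * (length - len(shape_a)) + tuple(shape_a)
    b = (1,) * (length - len(shape_b)) + tuple(shape_b)

    # Step 2: compare the sizes dimension by dimension.
    result = []
    for m, n in zip(a, b):
        if m == n:
            result.append(m)
        elif m == 1:
            result.append(n)  # a is treated as constant along this dimension
        elif n == 1:
            result.append(m)  # b is treated as constant along this dimension
        else:
            raise ValueError(f"shapes {shape_a} and {shape_b} are not broadcastable")
    return tuple(result)


print(broadcast_shape((3,), (2, 3)))        # (2, 3)
print(broadcast_shape((5, 1, 4), (3, 1)))   # (5, 3, 4)
print(np.broadcast_shapes((5, 1, 4), (3, 1)))  # (5, 3, 4)
```

Padding both shapes to the common length removes the need to swap the tensors so that $k\le l$.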

Let's see whether the following operations are legal and, if so, what their results are:

1. [1, 2, 3] + [1]
2. [1, 2, 3] + [[1]]
3. [[1, 2, 3]] * [2, 3]
4. [[1, 2, 3]] * [1, 2, 3]
5. [1, 2, 3] == 2
6. [1, 2, 3] - [[1, 2, 3], [4, 5, 6]]
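One way to check your answers is to run the operations in NumPy, assuming the nested lists above are read as NumPy arrays. The variable names `r1` to `r6` are only for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])

r1 = a + np.array([1])    # shapes (3,) and (1,)   -> [2 3 4]
r2 = a + np.array([[1]])  # shapes (3,) and (1, 1) -> [[2 3 4]], shape (1, 3)

try:
    # Shapes (1, 3) and (2,): the last sizes are 3 and 2, and neither is 1.
    r3 = np.array([[1, 2, 3]]) * np.array([2, 3])
except ValueError as error:
    r3 = None
    print("operation 3 is forbidden:", error)

r4 = np.array([[1, 2, 3]]) * a             # -> [[1 4 9]]
r5 = a == 2                                # a scalar has shape () -> [False True False]
r6 = a - np.array([[1, 2, 3], [4, 5, 6]])  # -> [[0 0 0], [-3 -3 -3]]

for name, value in [("1", r1), ("2", r2), ("4", r4), ("5", r5), ("6", r6)]:
    print(name, value.shape, value.tolist())
```

Only operation 3 is forbidden. In all other cases, the smaller tensor is repeated along the dimensions it lacks or along those where its size is 1.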

For more information, see the NumPy documentation on broadcasting:
https://numpy.org/doc/stable/user/basics.broadcasting.html

References

[1] Pierfrancesco Beneventano: On the Trajectories of SGD Without Replacement. arXiv:2312.16143