2/21 Repeated Experiments
Reporting Results of Randomized Experiments
Last time, we introduced Stochastic Gradient Descent. This brought randomness into our experiments: with different random seeds, the minibatches are different, which in turn changes how the parameters are updated.
Therefore, given the results of one experiment, one may ask: am I seeing the result of a super lucky draw, or can I expect about the same performance if I rerun the experiment with a different random seed? To answer such questions, we need to run the experiment multiple times with different random seeds and summarize the results.
Estimating the Mean via the Law of Large Numbers
For \(i=1,2,\dots\), let \(X_i\) denote a performance metric of the \(i\)-th experiment, say the validation accuracy of a logistic regression model. Suppose that these variables are independent and identically distributed (i.i.d.), with mean \(\mu\) and finite variance \(\sigma^2\).
Then by the Law of Large Numbers, the sample mean \(\bar X_n:=\frac{X_1+\dotsb+X_n}{n}\) converges almost surely to \(\mu\) as \(n\to\infty\). That is, the average performance \(\bar X_n\) over \(n\) independent experiments becomes a better and better estimate of the true expected performance \(\mu\) as we increase the number \(n\) of experiments.
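The convergence can be illustrated with a small simulation. This is only a sketch: the values `MU` and `SIGMA` are made up for illustration, and the "experiments" are draws from a normal distribution rather than real training runs.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

MU = 0.9      # hypothetical true expected accuracy
SIGMA = 0.02  # hypothetical standard deviation across seeds


def sample_mean(n: int) -> float:
    """Average of n simulated i.i.d. performance values."""
    return sum(random.gauss(MU, SIGMA) for _ in range(n)) / n


# The error |sample mean - MU| tends to shrink as n grows.
errors = {n: abs(sample_mean(n) - MU) for n in (10, 1_000, 100_000)}
for n, err in errors.items():
    print(f"n={n:>6}: |sample mean - mu| = {err:.5f}")
```

For a single run the error need not decrease monotonically in \(n\); the Law of Large Numbers only describes the limiting behavior.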
But how close can these approximations get as a function of \(n\)?
Getting Confidence Intervals
Student \(t\)-Distributions
To answer this, we make the additional assumption that the random variables \(X_i\) have distribution \(\mathscr N(\mu, \sigma^2)\), that is, the normal distribution with mean \(\mu\) and variance \(\sigma^2\). Let $$ S_n^2=\frac{1}{n-1}\sum_{i=1}^n(X_i-\bar X_n)^2 $$ denote the so-called sample variance, and let \(S_n=\sqrt{S_n^2}\) denote the sample standard deviation. Then the expression $$ T_n:=\frac{\bar X_n-\mu}{\sqrt{S^2_n / n}} $$ has Student \(t\)-distribution with \(n-1\) degrees of freedom. We get a relation between the precision of the estimate and our confidence in said precision: for \(c>0\), $$ \mathbf P(|\bar X_n-\mu|<c\sqrt{S^2_n / n}) =\mathbf P(|T_n|<c)=\alpha. $$ As the middle term can be expressed via the cumulative distribution function of a known distribution, the value of \(c\) that corresponds to a given \(\alpha\) can be computed numerically. For example, for \(\alpha=95\%\) and \(n=10\), we get \(c\approx2.262\).
That is, if we repeated the procedure of running \(n\) experiments many times, then in an \(\alpha\) proportion of the repetitions we would have \(|\bar X_n-\mu|<cS_n/\sqrt n\). The interval $$ (\bar X_n-cS_n/\sqrt n,\bar X_n+cS_n/\sqrt n) $$ is called a confidence interval. It is customary to report such a result as \(\bar X_n\pm cS_n/\sqrt n\) (and to state somewhere that the confidence intervals have level \(\alpha\)).
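As a minimal sketch of the computation, the snippet below builds such an interval for \(n=10\) runs using the constant \(c\approx2.262\) quoted above. The accuracies are simulated stand-ins, not real results, and the variable names are chosen only for this illustration.

```python
import math
import random
import statistics

random.seed(0)

# Stand-in for 10 validation accuracies from 10 seeds (simulated, not real results).
accuracies = [random.gauss(0.92, 0.01) for _ in range(10)]

n = len(accuracies)
mean = statistics.mean(accuracies)
s_n = statistics.stdev(accuracies)  # sample std, uses the 1/(n-1) normalization
c = 2.262                           # value quoted above for alpha = 95% and n = 10
half_width = c * s_n / math.sqrt(n)
low, high = mean - half_width, mean + half_width

print(f"{mean:.4f} ± {half_width:.4f} (95% confidence interval)")
```

For other values of \(n\) or \(\alpha\), the constant \(c\) has to be recomputed from the \(t\)-distribution with \(n-1\) degrees of freedom.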
A Central Limit Theorem
Note that above, we assumed that each of the variables \(X_i\) has a normal distribution. On the other hand, even without this assumption, by the Lindeberg--Lévy Central Limit Theorem, the sequence of random variables \(\sqrt n(\bar X_n-\mu)\) converges to the normal distribution \(\mathscr N(0,\sigma^2)\) in distribution, that is, for each \(x\in\mathbf R\) we have $$ \mathbf P(\sqrt n(\bar X_n-\mu) < x) \xrightarrow{n\to\infty} \mathbf P(\mathscr N(0, \sigma^2) < x). $$
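Hence, for large \(n\), the sample mean \(\bar X_n\) is approximately normally distributed with mean \(\mu\) and variance \(\sigma^2/n\), even if the \(X_i\) are not normal. If you want to write confidence intervals for normal distributions, you can use the so-called 68-95-99.7 rule: we have $$ \mathbf P(|\mathscr N(\mu, \sigma^2) - \mu| < \sigma) \approx 68.27\% $$ $$ \mathbf P(|\mathscr N(\mu, \sigma^2) - \mu| < 2\sigma) \approx 95.45\% $$ $$ \mathbf P(|\mathscr N(\mu, \sigma^2) - \mu| < 3\sigma) \approx 99.73\%. $$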
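The three percentages can be checked numerically. The sketch below uses the standard identity \(\mathbf P(|\mathscr N(\mu,\sigma^2)-\mu|<k\sigma)=\operatorname{erf}(k/\sqrt2)\); the helper name `prob_within` is introduced only for this illustration.

```python
import math


def prob_within(k: float) -> float:
    """P(|Z - mu| < k * sigma) for a normal variable Z, via the error function."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * prob_within(k):.2f}%")
# within 1 sigma: 68.27%
# within 2 sigma: 95.45%
# within 3 sigma: 99.73%
```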
Writing an Ensemble Training Loop
An ensemble is a collection of models trained on the same task. Ensembles have many uses, such as:
- Performance metrics with confidence intervals, as we just saw.
- Aggregating the predictions of an ensemble usually gives a better result than the individual models; we can see this in Homeworks 1 and 3.
- We can train the models in an ensemble with different hyperparameters to see which perform better; we shall see this next week.
When we write the ensemble training loop, we shall focus on making it efficient: instead of training the ensemble models one by one, we will vectorize the process. This brings a vast speed increase on a GPU, and it is also beneficial when training on a CPU thanks to SIMD vectorization.
Broadcasting Rules in Matrix Multiplication
Let \(a\) and \(b\) be tensors of shapes \((m_1,\dotsc,m_k)\) and \((n_1,\dotsc,n_l)\), respectively, with \(k,l\ge2\). Special broadcasting rules apply for matrix multiplication \(a\cdot b\): we need
- The batch shapes \((m_1,\dotsc,m_{k-2})\) and \((n_1,\dotsc,n_{l-2})\) to be broadcastable, and
- The shapes \((m_{k-1},m_k)\) and \((n_{l-1},n_l)\) to allow matrix multiplication, that is, \(m_k=n_{l-1}\).
Let \((r_1,\dotsc,r_p)\) denote the broadcast batch shape, where \(p=\max(k,l)-2\). Then the matrix product \(a\cdot b\) will have
- shape \((r_1,\dotsc,r_p,m_{k-1},n_l)\) and
- for all index \(p\)-tuples \((i_1,\dotsc,i_p)\in[r_1]\times\dotsb\times[r_p]\), we have \((a\cdot b)_{i_1,\dotsc,i_p} = \tilde a_{i_1,\dotsc,i_p}\cdot \tilde b_{i_1,\dotsc,i_p}\), where \(\tilde a\) and \(\tilde b\) denote \(a\) and \(b\) with their batch dimensions broadcast to \((r_1,\dotsc,r_p)\).
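The rules above can be tried out on a small example. The sketch below uses NumPy, whose `@` operator broadcasts batch dimensions in this way; this is an assumption about tooling, and the notebook may use a different tensor library. The shapes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

a = rng.normal(size=(3, 1, 2, 4))  # batch shape (3, 1), matrix shape (2, 4)
b = rng.normal(size=(5, 4, 6))     # batch shape (5,),   matrix shape (4, 6)

prod = a @ b
print(prod.shape)  # (3, 5, 2, 6)

# Entry (i, j) of the batch is the ordinary product of the matching matrices;
# a has size 1 on the second batch axis, so its index there is always 0.
i, j = 2, 3
same = np.allclose(prod[i, j], a[i, 0] @ b[j])
print(same)
```

Here the batch shapes \((3,1)\) and \((5)\) broadcast to \((3,5)\), and the matrix shapes \((2,4)\) and \((4,6)\) give \((2,6)\).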
Use Case Today: Getting Logits from Ensembles of Weight Matrices and Bias Vectors
In today's notebook, we will train an ensemble of \(e\) logistic regression models on the MNIST dataset.
Weight Matrix Ensemble
Instead of a weight matrix, we'll have a weight matrix ensemble or weight tensor \(W\) of shape \((e,d,c)\) where:
- \((e,)\) is the ensemble shape.
- \(d\) is the feature dimension. In the preprocessed MNIST dataset, this is \(28\cdot28=784\), plus 1 if you are adding a column of 1s.
- \(c\) is the number of labels. In MNIST, this is 10.
To apply the corresponding ensemble of linear transformations to a feature matrix \(X\) of shape (n, d), we can make use of broadcasting: the matrix product \(X\cdot W\) will have shape (e, n, c), where for each \(0\le i<e\), the matrix \((X\cdot W)_i\) is the matrix product \(X\cdot W_i\).
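A minimal sketch of this product, again written with NumPy as an assumed stand-in for the notebook's tensor library. The ensemble size and minibatch size are arbitrary, and the data is random rather than MNIST.

```python
import numpy as np

rng = np.random.default_rng(0)
e, n, d, c = 4, 32, 784, 10  # ensemble size, minibatch size, features, labels

X = rng.normal(size=(n, d))     # stand-in for a minibatch of flattened images
W = rng.normal(size=(e, d, c))  # one weight matrix per ensemble member

logits = X @ W                  # X is broadcast along the ensemble dimension
print(logits.shape)             # (4, 32, 10)

# Each slice agrees with the product for the corresponding single model.
matches = all(np.allclose(logits[i], X @ W[i]) for i in range(e))
print(matches)
```

The single broadcast product replaces a Python loop over the \(e\) models, which is what makes the ensemble training loop vectorized.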
Bias Vector Ensemble
If we do not add a column of 1's to the feature matrices, then we need to make use of a bias vector ensemble. This will be a tensor \(\mathbf b\) of shape (e, c).
To add this to the matrix product \(X\cdot W\) of shape (e, n, c), using broadcasting rules, we can reshape \(\mathbf b\) to have shape (e, 1, c), and use broadcasted addition. Alternatively, you can make the shape of the tensor \(\mathbf b\) (e, 1, c) already, so that you don't have to reshape before each addition.
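The reshaping step can be sketched as follows, with NumPy as an assumed stand-in for the notebook's tensor library and a random array `XW` in place of the actual product \(X\cdot W\).

```python
import numpy as np

rng = np.random.default_rng(0)
e, n, c = 4, 32, 10

XW = rng.normal(size=(e, n, c))  # stand-in for the product X @ W
b = rng.normal(size=(e, c))      # one bias vector per ensemble member

logits = XW + b[:, None, :]      # b viewed as shape (e, 1, c), broadcast over n
print(logits.shape)              # (4, 32, 10)

# Without the extra axis, broadcasting would try to align (e, c) with (n, c),
# which fails here because e != n.
```

Note that if \(e\) happened to equal \(n\), adding `b` without the extra axis would not raise an error but would silently add the wrong biases, which is one reason to store \(\mathbf b\) with shape (e, 1, c) from the start.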
Ensemble Dataloader
Given a tensor \(X\) of shape \((n_1,\dotsc,n_l)\) and an index tensor \(I\) of shape \((m_1,\dotsc,m_k)\) where all entries of \(I\) are in the interval \([-n_1,n_1)\), the tensor \(X[I]\) will have shape \((m_1,\dotsc,m_k,n_2,\dotsc,n_l)\), where for \((i_1,\dotsc,i_k)\in[m_1]\times\dotsb\times[m_k]\), we have \(X[I][i_1,\dotsc,i_k]=X[I[i_1,\dotsc,i_k]]\).
That is, the tensor \(X[I]\) is the composite $$ [m_1]\times\dotsb\times[m_k]\times[n_2]\times\dotsb\times[n_l] $$ $$ \xrightarrow{(i_1,\dotsc,i_k,j_2,\dotsc,j_l) \mapsto(I[i_1,\dotsc,i_k],j_2,\dotsc,j_l)} [-n_1,n_1)\times[n_2]\times\dotsb\times[n_l]\xrightarrow{X}K, $$ where \(K\) denotes the set in which the entries of \(X\) take their values. In more detail, you can think of the first map as the Cartesian product of the maps $$ [m_1]\times\dotsb\times[m_k]\xrightarrow I[-n_1,n_1) $$ $$ \text{ and } [n_2]\times\dotsb\times[n_l]\xrightarrow{\mathrm{id}\text{ (identity)}} [n_2]\times\dotsb\times[n_l]. $$
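The indexing rule can be checked on a small example. The sketch below uses NumPy integer-array indexing, which follows the shape rule described above; the concrete shapes and index values are arbitrary.

```python
import numpy as np

X = np.arange(6 * 2 * 3).reshape(6, 2, 3)  # shape (n_1, n_2, n_3) = (6, 2, 3)
I = np.array([[0, 5, -1, 2],
              [3, 3, 1, -6]])              # shape (2, 4), entries in [-6, 6)

out = X[I]
print(out.shape)  # (2, 4, 2, 3)

# out[i, j] is the slice of X selected by the index I[i, j].
same = np.array_equal(out[1, 3], X[I[1, 3]])
print(same)
```

Negative entries count from the end of the first dimension, so the indices 5 and -1 select the same slice of `X` here.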
Here, we want a dataloader that outputs minibatches of shape (ensemble_size, minibatch_size,). Therefore, we want not one vector of shuffled dataset indices, but a tensor \(I\) of shape (e, n) such that for each index \(0\le i<e\), the vector \(I_i\) is an independent shuffle of the index vector \([n]\). Thus, we can't directly use one permutation. We turn to another method:
- Generate a tensor \(U\) of shape (e, n) of values drawn from \(\mathscr U([0, 1))\), the uniform distribution on the interval \([0, 1)\).
- We can let the index tensor \(I\) be the argsort tensor of \(U\) along the last dimension, that is, one such that for each ensemble index \(0\le i<e\), we have $$ U_{i, I_{i, 0}}\le U_{i, I_{i, 1}} \le\dotsb\le U_{i, I_{i, n-1}}. $$ Since the rows of \(U\) are independent and their entries are i.i.d., each row \(I_i\) is a uniformly random permutation of \([n]\), independently of the other rows.
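The two steps above can be sketched as follows, using NumPy as an assumed stand-in for the notebook's tensor library. The sizes and the name `batch_idx` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
e, n = 4, 10  # ensemble size and dataset size (small values for illustration)

U = rng.random((e, n))      # i.i.d. uniform values on [0, 1)
I = np.argsort(U, axis=-1)  # row i lists the indices that sort U[i]

# Every row of I is a permutation of 0, ..., n-1.
is_perm = all(sorted(row.tolist()) == list(range(n)) for row in I)

# Reading U in the order given by I yields nondecreasing rows.
sorted_rows = np.take_along_axis(U, I, axis=-1)
is_sorted = bool(np.all(np.diff(sorted_rows, axis=-1) >= 0))

# A minibatch of indices for all ensemble members at once: shape (e, minibatch_size).
batch_idx = I[:, :5]
print(is_perm, is_sorted, batch_idx.shape)
```

Slicing consecutive column blocks of `I` then gives each ensemble member its own sequence of minibatches from its own shuffle.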