Homework 1

It is general experience in Machine Learning that training multiple models and aggregating their predictions (e.g., by averaging) often yields better performance than the individual models achieve on their own. We can already witness this effect. As least squares is a deterministic algorithm, what we can vary is the subset of the training set that each linear regression model is fitted to.

With the 85%-15% train-test split as in class, the train set has 3550 entries. Let's take subsets of this with 3000 entries.
Create two lists. You'll collect the average MSE of the individual models in one and the MSE of the ensemble in another.
Loop over the number $m$ of models you train, from 2 to 100:
1. Take $m$ random subsets of the train set, each with 3000 entries. You can use the same method as in Notebook 2/7: you are basically taking train splits of the train split.
2. On each random subset, fit a linear regression model, thus getting weight vectors $\mathbf w_i:0\le i<m$.
3. For each $0\le i<m$, we get predicted test targets $\mathbf z_\mathrm{test}^{(i)}=X_\mathrm{test}\mathbf w_i$.
   1. Calculate the MSE of these individually and record their average in the average MSE list.
   2. We can form ensemble predictions as the average of the individual predictions: $$ \mathbf z_\mathrm{test}^\mathrm{ensemble}=\frac{1}{m}\sum_{i=0}^{m-1}\mathbf z_\mathrm{test}^{(i)}. $$ Calculate the MSE using the ensemble predictions and record it in the ensemble MSE list.
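
As an illustration of steps 1 to 3 for a single value of $m$, here is a minimal sketch. It runs on synthetic data, not the course dataset: the feature count, test size, noise level, and the helper name `ensemble_step` are all assumptions made for the example. It samples each subset without replacement using `rng.choice`, which may differ from the method used in Notebook 2/7.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the course data (assumption, not the real dataset).
n_train, n_test, n_features = 3550, 600, 8
w_true = rng.normal(size=n_features)
X_train = rng.normal(size=(n_train, n_features))
y_train = X_train @ w_true + rng.normal(scale=0.5, size=n_train)
X_test = rng.normal(size=(n_test, n_features))
y_test = X_test @ w_true + rng.normal(scale=0.5, size=n_test)


def ensemble_step(m, subset_size=3000):
    """Return (average individual MSE, ensemble MSE) for m models.

    Hypothetical helper name, introduced only for this sketch.
    """
    preds = []
    for _ in range(m):
        # Step 1: random subset of the train set, sampled without replacement.
        idx = rng.choice(n_train, size=subset_size, replace=False)
        # Step 2: least squares fit on the subset.
        w, *_ = np.linalg.lstsq(X_train[idx], y_train[idx], rcond=None)
        # Step 3: predicted test targets of this model.
        preds.append(X_test @ w)
    preds = np.stack(preds)  # shape (m, n_test)

    # Step 3.1: MSE of each model, then their average.
    individual_mse = np.mean((preds - y_test) ** 2, axis=1)
    # Step 3.2: MSE of the averaged (ensemble) prediction.
    ensemble_mse = np.mean((preds.mean(axis=0) - y_test) ** 2)
    return individual_mse.mean(), ensemble_mse


avg_mse, ens_mse = ensemble_step(m=5)
```

Because the squared error is convex, the ensemble MSE can never exceed the average of the individual MSEs. This gives a useful sanity check on your two lists.
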
Tip

Using broadcasting (to be covered soon), you can solve the $m$ least squares problems and evaluate all $m$ models at once. That is, you do not have to loop over $0\le i<m$ in your code. This is not required for this homework, but it will be good practice and it can speed up the computation greatly.
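
One possible way to do this is sketched below, again on synthetic stand-in data (the sizes and variable names are assumptions). It solves the normal equations $X^\top X\mathbf w=X^\top\mathbf y$ for all subsets in one batched `np.linalg.solve` call. This is a substitute for calling a least squares routine once per model, and it assumes each $X^\top X$ is well conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption, not the course dataset).
n_train, n_test, d = 3550, 600, 8
m, k = 10, 3000  # number of models, subset size
w_true = rng.normal(size=d)
X_train = rng.normal(size=(n_train, d))
y_train = X_train @ w_true + rng.normal(scale=0.5, size=n_train)
X_test = rng.normal(size=(n_test, d))
y_test = X_test @ w_true + rng.normal(scale=0.5, size=n_test)

# m index sets without replacement: argsort of random keys gives a random
# permutation per row; keep the first k entries of each. Shape (m, k).
idx = np.argsort(rng.random((m, n_train)), axis=1)[:, :k]

Xs = X_train[idx]  # (m, k, d), one design matrix per model
ys = y_train[idx]  # (m, k)

# Batched normal equations: matmul and solve act on the last two axes.
Xt = Xs.transpose(0, 2, 1)        # (m, d, k)
XtX = Xt @ Xs                     # (m, d, d)
Xty = Xt @ ys[..., None]          # (m, d, 1)
W = np.linalg.solve(XtX, Xty)[..., 0]  # (m, d), row i is w_i

# Evaluate all models at once; y_test broadcasts across the m rows.
preds = W @ X_test.T              # (m, n_test)
individual_mse = ((preds - y_test) ** 2).mean(axis=1)  # (m,)
ensemble_mse = ((preds.mean(axis=0) - y_test) ** 2).mean()
```

The stacked array `Xs` holds $m$ copies of a 3000-row design matrix, so memory use grows linearly in $m$. That is the price paid for removing the inner loop.
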
Compare the two results by making a line plot of each list, then showing the canvas.
1. Setting the `label` keyword argument of the `plt.plot` calls and then calling `plt.legend()` before showing the canvas can help the readability of the diagram.
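
A minimal sketch of such a plot follows. The function name `plot_mse_curves`, its parameter names, and the axis labels are illustrative assumptions; only `plt.plot` with `label` and `plt.legend()` come from the hint above.

```python
import matplotlib.pyplot as plt


def plot_mse_curves(ms, avg_mse_list, ensemble_mse_list):
    """Draw both MSE lists against the number of models m.

    Hypothetical helper: ms is the sequence of m values (e.g. range(2, 101)),
    the other two arguments are the lists collected in the loop.
    """
    plt.figure()
    plt.plot(ms, avg_mse_list, label="average individual MSE")
    plt.plot(ms, ensemble_mse_list, label="ensemble MSE")
    plt.xlabel("number of models $m$")
    plt.ylabel("test MSE")
    plt.legend()  # uses the label= arguments given above
    return plt.gca()
```

After calling the function with your two lists, `plt.show()` displays the canvas.
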