{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Logistic Regression on Averaged Word Vectors\n",
    "\n",
    "## Setup\n",
    "\n",
    "### Install\n",
    "\n",
    "When we use a pre-trained model, we need to preprocess our data the same way it was preprocessed during training. In Section 3 of the paper introducing the pre-trained word vectors we use [1], the authors write \"We only used a publicly available `tokenizer.perl` script from the [Moses MT project](https://github.com/moses-smt)\". Fortunately, this script has a Python interface:  \n",
    "https://github.com/luismsgomes/mosestokenizer  \n",
    "You can install it by running the following command with your `mamba` environment activated:\n",
    "```bash\n",
    "pip install mosestokenizer\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Imports\n",
    "\n",
    "1. Import the usual modules `datasets`, `matplotlib.pyplot` (as `plt`), `torch` and `tqdm`.\n",
    "2. Also import the new module `mosestokenizer`.\n",
    "3. Finally, create a module containing the following functions, or add them to the module you already have:\n",
    "    1. `get_dataloader_random_reshuffle` and `line_plot_confidence_band` from Notebook 0221.\n",
    "    2. `get_binary_accuracy`, `get_binary_cross_entropy`, `get_seed`, `lsa` and `train_logistic_regression` from Notebook 0228."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Constants\n",
    "\n",
    "Create a configuration dictionary with the following keys:\n",
    "- `\"dataset_path\"`: `str`  \n",
    "    Get this from the dataset page  \n",
    "    https://huggingface.co/datasets/dair-ai/emotion\n",
    "- `\"device\"`: `torch.device | int | str`  \n",
    "    The device identifier.\n",
    "- `\"ensemble_shape\"`: `tuple[int]`  \n",
    "    Make this `(10,)`: for now, we'll train a single ensemble whose members all use the same hyperparameters.\n",
    "- `\"improvement_threshold\"`: `float`  \n",
    "    Make this `1e-4`.\n",
    "- `\"labels_dtype\"`: `torch.dtype`  \n",
    "    The datatype we use for label tensors. With multiclass classification, this was `torch.int64`. For binary classification, which is the case here, we want `torch.float32`.\n",
    "- `\"minibatch_size\"`: `int`  \n",
    "    Make this `256`.\n",
    "- `\"n_components\"`: `int`  \n",
    "    The number of feature dimensions to find with truncated SVD. Let's make this `300`, the dimension of the word vectors we'll load, so that the comparison between LSA and word vectors is fairer.\n",
    "- `\"seed\"`: `int`  \n",
    "    This is for reproducible experiments. Insert any integer.\n",
    "- `\"steps_num\"`: `int`  \n",
    "    Make this `1_000_000` and let early stopping decide when to stop.\n",
    "- `\"steps_without_improvement\"`: `int`  \n",
    "    Make this `1000`.\n",
    "- `\"valid_interval\"`: `int`  \n",
    "    Make this `10`.\n",
    "- `\"word_vectors_path\"`: `str`  \n",
    "    Make this `data/emotion-sadness-joy-fasttext.vec`. I compressed and uploaded it to the class website. If you haven't done so already, [follow this link](https://www.renyi.hu/~zsamboki/teaching/dml-spring-2025/lab_notebooks/data/emotion-sadness-joy-fasttext.zip) to download it. Please unpack it afterwards."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Set the `torch` PRNG seed to the value in the configuration dictionary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate LSA + Logistic Regression on New Dataset\n",
    "\n",
    "First, we'll run the training procedure from last time on the new dataset so that we have a result to compare our new approach to."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load, Filter and Split Dataset\n",
    "\n",
    "Note the following details on the dataset page:\n",
    "1. It has 2 subsets: `\"split\"` and `\"unsplit\"`. The latter is much larger, so we'll take that and split it ourselves. You can specify the subset you want to load with the `name` keyword argument of `datasets.load_dataset`.\n",
    "2. This is actually a multiclass dataset with 6 labels. However, `0 - sadness` and `1 - joy` each account for about 30% of all labels, so we'll filter the entries and keep only these two. You can use the `filter` method of the dataset for this. Its positional argument is a function that takes a dataset entry and returns a Boolean saying whether we should keep the entry. The method returns the filtered dataset.\n",
    "\n",
    "Perform the operations described above and print the dataset you get. It should have `262_254` entries."
   ]
  },
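  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The filtering predicate can be sketched on a toy list of entries instead of the real dataset. This is only an illustration: the function name `keep_sadness_or_joy` and the toy entries are made up, and we assume that each entry is a dictionary with a `label` key. The built-in `filter` stands in for the `filter` method of the dataset.\n",
    "\n",
    "```python\n",
    "def keep_sadness_or_joy(entry: dict) -> bool:\n",
    "    # Keep only entries labeled 0 (sadness) or 1 (joy).\n",
    "    return entry['label'] in (0, 1)\n",
    "\n",
    "\n",
    "# Toy entries that mimic the structure of the dataset entries.\n",
    "toy_entries = [\n",
    "    {'text': 'i feel so alone', 'label': 0},\n",
    "    {'text': 'i feel wonderful today', 'label': 1},\n",
    "    {'text': 'i feel furious', 'label': 3},\n",
    "]\n",
    "\n",
    "# The built-in `filter` stands in for the `filter` method of the dataset.\n",
    "kept = list(filter(keep_sadness_or_joy, toy_entries))\n",
    "print(len(kept))\n",
    "```\n",
    "\n",
    "With the real dataset, you would pass the same kind of predicate as the positional argument of its `filter` method."
   ]
  },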
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a 90%-10% train-valid split of your dataset, using `get_seed` to determine the seed for the split. Print the splits and their first entries for debugging."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Just like in Notebook 0228:\n",
    "1. get feature matrices and label vectors with `lsa`,\n",
    "2. get a random reshuffling training dataloader,\n",
    "3. train a logistic regression model on your data and\n",
    "4. plot training and validation binary accuracy and binary cross-entropy curves with confidence bands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Averaged Word Vectors + Logistic Regression\n",
    "\n",
    "1. We have a collection of `2_000_000` pre-trained word vectors.\n",
    "2. We will tokenize our dataset and get another collection of unique tokens.\n",
    "\n",
    "We want to\n",
    "1. enumerate the word vectors that are in both collections and\n",
    "2. take the average of the word vectors that occur in a document in our dataset to get its document vector."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load Word Vectors\n",
    "\n",
    "First of all, let's load the word vectors! As the full collection of word vectors is a text file of more than 7 GB, I created a smaller file with the same structure that contains only the vectors of the words that occur in our dataset. I compressed and uploaded it to the repository as `data/emotion-sadness-joy-fasttext.zip`. Please download and unpack it.\n",
    "\n",
    "See the documentation for the structure of the text file and example code to load its data:  \n",
    "https://fasttext.cc/docs/en/english-vectors.html#format\n",
    "\n",
    "The example code creates a dictionary whose\n",
    "1. keys are the words and\n",
    "2. values are lazy iterators over the word vector components.\n",
    "\n",
    "We want to get the following objects:\n",
    "1. A list of words, by convention called `id2token`.\n",
    "2. A dictionary that maps words to their indices in `id2token`, by convention called `token2id`.\n",
    "3. A tensor of shape `(len(id2token), n_components)` that lists the word vectors in the same order as `id2token`. Use the device and datatypes given in the configuration dictionary.\n",
    "\n",
    "For debugging:\n",
    "1. Print the first 10 entries of `id2token`.\n",
    "2. Iterate over these 10 entries and print the respective values of `token2id`.\n",
    "3. Print the shape, device and datatype of the word tensor."
   ]
  },
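  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the parsing step, assuming the file follows the fastText text format described above. To stay self-contained, it reads a made-up three-word file from an in-memory `io.StringIO` buffer instead of the real word vector file, and it collects the vectors in a plain list of rows rather than a tensor.\n",
    "\n",
    "```python\n",
    "import io\n",
    "\n",
    "# A toy file in the fastText text format; the words and numbers are made up.\n",
    "toy_file = io.StringIO(\n",
    "    '3 2\\n'\n",
    "    'the 0.1 0.2\\n'\n",
    "    'happy 0.3 -0.4\\n'\n",
    "    'sad -0.5 0.6\\n'\n",
    ")\n",
    "\n",
    "# The first line holds the vocabulary size and the vector dimension.\n",
    "words_num, dim = map(int, toy_file.readline().split())\n",
    "\n",
    "id2token = []\n",
    "rows = []\n",
    "for line in toy_file:\n",
    "    parts = line.rstrip().split(' ')\n",
    "    id2token.append(parts[0])\n",
    "    # Materialize the components as a list instead of a lazy iterator.\n",
    "    rows.append([float(x) for x in parts[1:]])\n",
    "\n",
    "# Map each word to its position in `id2token`.\n",
    "token2id = {token: i for i, token in enumerate(id2token)}\n",
    "\n",
    "print(id2token)\n",
    "print(token2id['happy'])\n",
    "```\n",
    "\n",
    "Passing `rows` to `torch.tensor` together with the `device` keyword argument would then give the word tensor."
   ]
  },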
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_vectors(config: dict) -> tuple[list[str], dict, torch.Tensor]:\n",
    "    \"\"\"\n",
    "    Load the word vectors from a file.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    config : dict\n",
    "        Configuration dictionary. Required keys:\n",
    "        device : int | str | torch.device\n",
    "            The device to store the word tensor on.\n",
    "        word_vectors_path : str\n",
    "            The path to the word vector file.\n",
    "            We assume the file is a text file with the following structure:\n",
    "            1. The first line of the file contains the number of words\n",
    "                in the vocabulary and the size of the vectors.\n",
    "            2. Each following line contains a word followed by\n",
    "                its vector components,\n",
    "                like in the default fastText text format.\n",
    "                Each component is space separated.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    A triple of the following:\n",
    "    1. A list of the words in the order they appear in the file.\n",
    "    2. A dictionary mapping words to their index in the list.\n",
    "    3. A tensor of the stacked word vectors.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Get Document Vectors\n",
    "\n",
    "Time to use a `mosestokenizer.MosesTokenizer`! As you can see on the webpage:\n",
    "1. If you initialize it with no arguments, you get a tokenizer for English, which is what we want.\n",
    "2. This object is callable. If you call it on a string, you get a list of tokens as strings.\n",
    "\n",
    "Now, for both the training and validation splits, we want a feature matrix whose rows are the document vectors of the split. The vector of a document is the average of the word vectors of those tokens that the tokenizer produces from the document and that are elements of the pre-trained word vector collection.\n",
    "\n",
    "To get this for either dataset:\n",
    "1. Initialize a document vector list.\n",
    "2. As you iterate over the documents in the split:\n",
    "    1. Create a new list of token IDs.\n",
    "    2. Iterating over the list of tokens you get from the tokenizer:\n",
    "        1. If the token is in the word vector dictionary:\n",
    "            1. Append the token ID to the list of token IDs for the text.\n",
    "    3. If the token ID list is empty:\n",
    "        1. Append an appropriate zero vector to the document vector list.\n",
    "    4. Otherwise:\n",
    "        1. Index into the word tensor with the token ID list to get a matrix whose rows are the word vectors of the words in the document.\n",
    "        2. Take its mean along the word dimension, which is by convention called the *sequence dimension*.\n",
    "        3. Append this mean to the document vector list.\n",
    "3. Stack the document vectors into a feature matrix."
   ]
  },
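  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The steps above can be sketched on toy data as follows. This is an illustration under simplifying assumptions, not a full solution: the vocabulary, vectors and documents are made up, and `str.split` stands in for the Moses tokenizer.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "# Toy stand-ins: a three-word vocabulary with two-dimensional vectors.\n",
    "token2id = {'happy': 0, 'sad': 1, 'day': 2}\n",
    "id2vector = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])\n",
    "\n",
    "documents = ['happy happy day', 'unknown words only']\n",
    "\n",
    "document_vectors = []\n",
    "for document in documents:\n",
    "    # `str.split` is a stand-in for the Moses tokenizer.\n",
    "    token_ids = [token2id[t] for t in document.split() if t in token2id]\n",
    "    if not token_ids:\n",
    "        # No known token: fall back to a zero vector of matching shape.\n",
    "        document_vectors.append(torch.zeros_like(id2vector[0]))\n",
    "    else:\n",
    "        # Rows of the known tokens, averaged along the sequence dimension.\n",
    "        document_vectors.append(id2vector[token_ids].mean(dim=0))\n",
    "\n",
    "# Stack the document vectors into a feature matrix.\n",
    "features = torch.stack(document_vectors)\n",
    "print(features.shape)\n",
    "```\n",
    "\n",
    "Note that a repeated token contributes to the average once per occurrence, and that `torch.zeros_like` keeps the device and datatype of the word tensor."
   ]
  },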
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_document_vectors(\n",
    "    documents: list[str],\n",
    "    id2vector: torch.Tensor,\n",
    "    token2id: dict\n",
    ") -> torch.Tensor:\n",
    "    \"\"\"\n",
    "    Given a word vector collection,\n",
    "    transform a list of documents to a feature matrix\n",
    "    where each document vector is the average of the\n",
    "    vectors of the words that appear in the document\n",
    "    and are elements of the word vector collection.\n",
    "\n",
    "    Uses `mosestokenizer.MosesTokenizer` for tokenization.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    documents : list[str]\n",
    "        A list of documents.\n",
    "    id2vector : torch.Tensor\n",
    "        A matrix whose rows are word vectors.\n",
    "    token2id : dict\n",
    "        A dictionary that maps tokens to row indices in `id2vector`.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    The feature matrix.\n",
    "    \"\"\"\n",
    "    raise NotImplementedError\n",
    "\n",
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train and Evaluate\n",
    "\n",
    "1. Get a training dataloader from your new training feature matrix.\n",
    "2. Train a logistic regression model with the new training and validation feature matrices.\n",
    "3. Plot training and validation binary accuracy and binary cross-entropy curves with confidence bands."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looks like we get about the same performance as LSA.\n",
    "\n",
    "Therefore, you may wonder if the word representations matter at all. To test this:\n",
    "1. Fill the word tensor with random values with the method `normal_`. This is an in-place operation, so you don't have to reassign variables. Set the `std` keyword argument to $300^{-1/2}$. We'll see why when we discuss initialization of neural networks.\n",
    "2. Repeat the training process as before."
   ]
  },
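  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A minimal sketch of the in-place reinitialization on a toy tensor. The tensor shape and the manual seed are arbitrary choices for illustration; in the exercise, you would call `normal_` on your own word tensor.\n",
    "\n",
    "```python\n",
    "import torch\n",
    "\n",
    "torch.manual_seed(0)\n",
    "\n",
    "# A toy word tensor: 1000 words with 300 components each.\n",
    "id2vector = torch.ones(1000, 300)\n",
    "\n",
    "# `normal_` overwrites the entries in place with Gaussian samples.\n",
    "id2vector.normal_(std=300 ** -0.5)\n",
    "\n",
    "# With this standard deviation, the mean squared norm of a row is close to 1.\n",
    "print(id2vector.square().sum(dim=1).mean())\n",
    "```\n",
    "\n",
    "Since the operation is in place, `id2vector` keeps its shape, device and datatype, and only its values change."
   ]
  },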
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "raise NotImplementedError"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, so the word vectors do matter some.\n",
    "\n",
    "We will be able to make use of internal document structure with them via more advanced models."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# References\n",
    "\n",
    "[1] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch and Armand Joulin: *Advances in Pre-Training Distributed Word Representations*, 2018. Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). https://aclanthology.org/L18-1008/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Dataset References\n",
    "\n",
    "[2] Elvis Saravia, Hsien-Chi Toby Liu, Yen-Hao Huang, Junlin Wu and Yi-Shin Chen: *CARER: Contextualized Affect Representations for Emotion Recognition*, 2018. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 3687--3697. https://aclanthology.org/D18-1404/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# License\n",
    "\n",
    "This work is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "dml",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
