5/8 Generative Pretrained Transformer

Pretraining

Last week, we introduced the transformer architecture. We discussed how its success stems from its robustness: you can train deep transformer models without much hyperparameter tuning, and they generalize well, that is, they are resistant to overfitting.

Pretraining lets us benefit from this generalization capability: before training our model on the task at hand, we imbue it with generic linguistic knowledge.

Text Generation Models

Next Token Prediction

We generate text token by token. That is, given a partial sequence such as "The meaning of life is", we ask the model what the next token should be.

Hiding ensemble dimensions, we use a trainable $d\times v$ matrix to transform a tensor $Y$ of shape $(b, s, d)$ of contextual token embeddings into a tensor $Z$ of shape $(b, s, v)$ of token logits, where $b$ is the minibatch size, $s$ is the sequence length, $d$ is the embedding dimension, and $v$ is the vocabulary size. The vector $Z[i, j]$ gives the prediction logits for the token that should follow the $j$-th token in the $i$-th sequence in the minibatch.
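As a minimal sketch of this projection, the following NumPy snippet maps random stand-in embeddings to logits. The toy sizes, the random initialisation and the variable names `W` and `next_token_probs` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumed): minibatch size, sequence length, embedding dim, vocabulary size.
b, s, d, v = 2, 5, 8, 11

Y = rng.normal(size=(b, s, d))  # stand-in for the contextual token embeddings
W = rng.normal(size=(d, v))     # the trainable d x v matrix, randomly initialised here

# Matrix multiplication broadcasts over the batch: (b, s, d) @ (d, v) -> (b, s, v).
Z = Y @ W

# A softmax over the last axis turns the logits Z[i, j] into next token probabilities.
shifted = Z - Z.max(axis=-1, keepdims=True)
next_token_probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
```

Each `next_token_probs[i, j]` is then a probability distribution over the vocabulary for the token following position `j` of sequence `i`.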

Causal Attention Mask

When the model sees "The meaning of life is", it does not know which tokens will come after it. During training, however, the whole sequence is fed in at once, so the prediction at each position must not take later positions into account.

Remember that (disregarding multiple heads) attention has the formula

\[ \mathrm{Attention}(Q, K, V)=\mathrm{softmax}(QK^T, \mathtt{axis}=-1)V, \]

that is, the positional relations are encoded in the (batched) matrix $\mathrm{softmax}(QK^T, \mathtt{axis}=-1)$, whose $(j, j')$ entry says how much the $j'$-th token embedding on the $\ell$-th level matters for the $j$-th token embedding on the $(\ell+1)$-st level. We want the entries with $j'>j$ to be zero so that a token cannot affect the tokens to its left. That is, we want the attention matrix to be lower triangular. This is usually done one step earlier, on the matrix $QK^T$, using a so-called attention mask: the masked entries, those with $j'>j$, are set to $-\infty$, which the softmax then turns into zeros.
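The masking step can be sketched in NumPy as follows. This is an illustration under assumptions rather than a reference implementation: the function name `causal_attention` is hypothetical, it handles a single head, and it follows the unscaled formula above (implementations commonly also divide $QK^T$ by the square root of the key dimension).

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head attention with a causal mask; Q, K, V have shape (b, s, d)."""
    s = Q.shape[-2]
    scores = Q @ np.swapaxes(K, -1, -2)                 # (b, s, s); entry [j, j'] relates token j to token j'
    future = np.triu(np.ones((s, s), dtype=bool), k=1)  # True exactly where j' > j
    scores = np.where(future, -np.inf, scores)          # masked entries become -inf
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)                            # exp(-inf) = 0, so masked entries vanish
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(2, 4, 3)) for _ in range(3))
out, weights = causal_attention(Q, K, V)

# Changing the last token must not change the outputs at earlier positions.
K2, V2 = K.copy(), V.copy()
K2[:, -1] += 1.0
V2[:, -1] += 1.0
out2, _ = causal_attention(Q, K2, V2)
```

The returned `weights` are lower triangular and each row sums to one, so `out[:, j]` depends only on positions up to `j`.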

Next Token Prediction Pretraining

With this, we can perform pretraining in an unsupervised manner: for a minibatch $S$ of sequences of token ids from the corpus, the loss is the cross-entropy between the next token logits $Z[:, :-1]$ that the model produces for $S$ and the actual next token ids $S[:, 1:]$. We end the sequences with the special token EOS.

It was discovered in [1] that after this pretraining, a model can be finetuned more easily for different text generation tasks, for example question answering or text summarization. This is where GPT gets its name: Generative Pretrained Transformer.
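A minimal NumPy sketch of this loss is given below. The function name `next_token_loss` is hypothetical, and the sketch assumes that all sequences in the minibatch have the same length, so no padding positions need to be excluded.

```python
import numpy as np

def next_token_loss(Z, S):
    """Mean cross-entropy between logits Z of shape (b, s, v) and token ids S of shape (b, s)."""
    logits = Z[:, :-1]   # predictions made at positions 0 .. s-2
    targets = S[:, 1:]   # the tokens that actually follow those positions
    logits = logits - logits.max(axis=-1, keepdims=True)  # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick the log-probability assigned to each true next token.
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return -picked.mean()

rng = np.random.default_rng(0)
v = 7
S = rng.integers(0, v, size=(2, 5))

# A model that knows nothing (all logits equal) has loss log(v).
uniform_loss = next_token_loss(np.zeros((2, 5, v)), S)

# A model that puts a large logit on the true next token has loss close to zero.
Z_good = np.zeros((2, 5, v))
np.put_along_axis(Z_good[:, :-1], S[:, 1:, None], 50.0, axis=-1)
good_loss = next_token_loss(Z_good, S)
```

The logits at the last position are dropped because the corpus provides no target for them.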

Variants on Generation

Depending on one's preferences, for example how unusual a text one wants to receive, there are several ways to generate text from the next token logits.

Greedy Search: At each step, we choose the token with the highest logit.
Sampling: At each step, we sample the next token at random according to the predicted probabilities. In this context, the hyperparameter temperature $T$ controls how unusual the generated text is: the logits are divided by $T$ before the softmax, which amounts to raising the probabilities to the power $1/T$ and renormalizing.
Beam Search: Another option is to keep several candidate continuations at hand and extend them in parallel. The hyperparameter beam width controls how many candidates we keep.
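The three variants above can be sketched on a toy model as follows. Everything here is an assumption made for illustration: the table `P` of next token probabilities (which depends only on the last token), the callback `next_logits`, and the helper names are hypothetical stand-ins for a real language model.

```python
import numpy as np

def log_softmax(logits):
    logits = logits - logits.max()
    return logits - np.log(np.exp(logits).sum())

def greedy_step(logits):
    """Choose the token with the highest logit."""
    return int(np.argmax(logits))

def sample_step(logits, temperature, rng):
    """Sample a token after dividing the logits by the temperature."""
    probs = np.exp(log_softmax(logits / temperature))
    return int(rng.choice(len(probs), p=probs))

def beam_search(next_logits, prefix, steps, beam_width):
    """Keep the beam_width candidates with the highest total log-probability."""
    beams = [(0.0, list(prefix))]  # (total log-probability, tokens)
    for _ in range(steps):
        candidates = []
        for score, tokens in beams:
            log_probs = log_softmax(next_logits(tokens))
            for tok in np.argsort(log_probs)[-beam_width:]:
                candidates.append((score + float(log_probs[tok]), tokens + [int(tok)]))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return beams

# Toy "model" with a vocabulary of 3 tokens: row i is the distribution after token i.
P = np.array([[0.05, 0.55, 0.40],
              [0.34, 0.33, 0.33],
              [0.90, 0.05, 0.05]])
next_logits = lambda tokens: np.log(P[tokens[-1]])

tokens = [0]
for _ in range(2):
    tokens.append(greedy_step(next_logits(tokens)))

beams = beam_search(next_logits, [0], steps=2, beam_width=2)
cold_sample = sample_step(next_logits([0]), temperature=0.01, rng=np.random.default_rng(0))
```

In this toy example the greedy choice commits to token 1 first and ends up with a less probable sequence than the best beam, which starts with the locally worse token 2. A temperature close to zero makes sampling behave almost like the greedy choice, while a temperature above one flattens the distribution.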

This is just a very small selection. For more options, see for example the transformers.GenerationConfig documentation. The transformers package was created by Hugging Face to make working with transformer models easier.

Emergent Features of Bigger Models

In Complexity Theory, the emergent features of a complex system are those that cannot be explained by its parts alone. We can observe such features as model sizes grow. Let's look at the largest models in the GPT model families by OpenAI. The architectural details of GPT-4 are not public, but some unofficial details have been reported, see https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/.

name	number of transformer blocks	embedding dimension	feedforward dimension	total parameter count
GPT [1]	12	768	3072	117M
GPT-2 [2]	48	1600	Not Available	1542M
GPT-3 [3]	96	12288	49152	175B
GPT-4 [4]	120?	Not Available	Not Available	1.8T?
GPT-4.5 [5]	Not Available	Not Available	Not Available	Not Available
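As a rough sanity check of the table, the parameter counts can be approximated from the other columns. The sketch below assumes a standard transformer block (four $d\times d$ attention projections and a two-layer feedforward network) and counts only these weight matrices; biases, layer norms and the embedding tables are ignored, and the function names are hypothetical.

```python
def block_weight_count(d, d_ff):
    attention = 4 * d * d       # query, key, value and output projections
    feedforward = 2 * d * d_ff  # two linear layers: d -> d_ff -> d
    return attention + feedforward

def transformer_weight_count(n_blocks, d, d_ff):
    return n_blocks * block_weight_count(d, d_ff)

gpt1 = transformer_weight_count(12, 768, 3072)
gpt3 = transformer_weight_count(96, 12288, 49152)

print(f"GPT:   {gpt1 / 1e6:.0f}M")  # prints "GPT:   85M"
print(f"GPT-3: {gpt3 / 1e9:.0f}B")  # prints "GPT-3: 174B"
```

For GPT-3 the estimate lands close to the 175B in the table. For GPT it falls short of 117M, and the gap is plausibly explained mostly by the embedding tables, which this sketch does not count.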

With GPT-2, it was observed that models start to achieve nontrivial results in zero- and few-shot learning. In this paradigm, the model is not finetuned in the traditional sense, but only receives a textual description of the task and possibly a few examples.

GPT-3 became genuinely good at this. Moreover, its text generation capabilities received widespread media coverage, as you may well know. Such gigantic models, pretrained on gigantic corpora, are called Foundation Models.
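To make the paradigm concrete, here is a small sketch of how such a prompt could be assembled. The `=>` format, the translation task and the function name `build_prompt` are illustrative assumptions, not a format prescribed by any model.

```python
def build_prompt(task_description, examples, query):
    """Task description, then zero or more solved examples, then the open query."""
    lines = [task_description]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

task = "Translate English to French:"

# Zero-shot: only the task description and the query.
zero_shot = build_prompt(task, [], "cheese")

# Few-shot: a few solved examples are placed before the query.
few_shot = build_prompt(
    task,
    [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")],
    "cheese",
)
```

The prompt is then given to the model as the partial sequence to continue, and the generated continuation is read off as the answer. No weights are updated.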

Economic and Environmental Costs

Training these models incurs a huge economic and environmental cost. Training GPT-3 is very roughly estimated to have cost 10 million US dollars https://hai.stanford.edu/news/ai-index-state-ai-13-charts while emitting 85,000 kg of CO2 equivalents, the same amount produced by a new car in Europe driving 700,000 km, or 435,000 miles; for comparison, the Earth--Moon distance is about 240,000 miles https://www.theregister.com/2020/11/04/gpt3_carbon_footprint_estimate/.
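The figures in this comparison can be checked with a few lines of arithmetic. The only value not taken from the text is the conversion factor of 1.609344 km per mile.

```python
KM_PER_MILE = 1.609344

emissions_kg = 85_000
car_distance_km = 700_000
earth_moon_miles = 240_000

car_distance_miles = car_distance_km / KM_PER_MILE      # about 435,000 miles
grams_per_km = emissions_kg * 1000 / car_distance_km    # implied car emissions, about 121 g/km
moon_trips = car_distance_miles / earth_moon_miles      # about 1.8 Earth--Moon distances
```

So the quoted car journey is a bit less than twice the Earth--Moon distance.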

Instruction Tuning

Given a Text Generation Foundation Model, Instruction Tuning is a finetuning task whose goal is to create a helpful digital assistant. The methods below work by making the model generate multiple answers and giving it feedback on which answer was preferred by human annotators or by an evaluation model.

Reinforcement Learning from Human Feedback

This is the first widely known method to create a helpful digital assistant [6]. It has four steps.

1. Supervised Learning. Human annotators provide example dialogues and a foundation model is finetuned on them via next token prediction.
2. Generated Answer Ranking. The model trained in Step 1 is made to generate multiple answers in dialogues. Human annotators rank the answers by preference.
3. Reward Model Training. A foundation model $r$ is trained to provide scores for the generated answers so that if answer $A$ is preferred by humans to answer $B$, then we have $r(A)>r(B)$.
4. Reinforcement Learning. The answer generator model is further finetuned with Reinforcement Learning, originally with Proximal Policy Optimization (PPO). That is, it is trained so that it generates answers that maximize the expected score given by $r$.

Steps 2-4 can be repeated at will.
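For the reward model training step, a common way to turn the requirement $r(A)>r(B)$ into a trainable objective is the pairwise ranking loss $-\log\sigma(r(A)-r(B))$, which is also the form used in [6]. The sketch below evaluates it on plain numbers; the function name is hypothetical, and in practice the scores would come from the reward model and the loss would be averaged over many ranked pairs.

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """-log sigmoid(r(A) - r(B)) for a pair where answer A is preferred to answer B."""
    margin = score_preferred - score_rejected
    # Equals log(1 + exp(-margin)), written so that exp never overflows.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

tie = pairwise_ranking_loss(0.0, 0.0)            # the model cannot tell the answers apart
agrees = pairwise_ranking_loss(2.0, 0.0)         # the model agrees with the annotators
disagrees = pairwise_ranking_loss(0.0, 2.0)      # the model ranks the answers the wrong way
```

The loss is small when the preferred answer gets the clearly higher score and grows roughly linearly when the ranking is reversed.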

Direct Preference Optimization

As it later turned out, steps 3-4 can be replaced by a single training objective applied directly to the generator model [7].
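The objective from [7] for a single preference pair can be sketched as follows. The inputs are the log-probabilities that the model being trained and a frozen reference model (typically the model from the supervised step) assign to the preferred and the rejected answer. The function name, the argument names and the default `beta=0.1` are assumptions made for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * (chosen log-ratio - rejected log-ratio)) for one pair."""
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)        # implicit reward of the preferred answer
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)  # implicit reward of the rejected answer
    margin = chosen_reward - rejected_reward
    # Equals log(1 + exp(-margin)), written so that exp never overflows.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the model equals the reference model, the loss is log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Making the preferred answer more likely than under the reference lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

Compared with the pairwise loss of a separate reward model, the scores here are replaced by scaled log-ratios between the generator and the reference model, so no reward model and no reinforcement learning loop are needed.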

References

[1] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-Training.

[2] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners.

[3] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS '20). Curran Associates Inc., Red Hook, NY, USA, Article 159, 1877–1901.

[4] OpenAI (see the Authorship, Credit Attribution, and Acknowledgements section of the report). 2023. GPT-4 Technical Report.

[5] OpenAI. 2025. OpenAI GPT-4.5 System Card.

[6] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems 35 (NeurIPS).

[7] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems 36 (NeurIPS).