Exercise Class 4

Author

Jonas Skjold Raaschou-Pedersen

Published

September 25, 2025

This week we start from this repository.

Exercise 1 - (Hugging Face) Transformers and GPT-2 (small)

Before trying to build a transformer ourselves, we will play around with an already trained model. To do this, we will use the Hugging Face transformers library, a very important library (and community) to get accustomed to if you want to dive deeper into LLMs and more.

  1. Following the installation instructions, install the CPU-only version of the library by running the command (at the top of your project directory):

    uv add 'transformers[torch]'

    Try running the following command, also from the installation instructions, to test that everything works as expected:

    python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('hugging face is the best'))"

    You should get something like:

    No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
    Using a pipeline without specifying a model name and revision in production is not recommended.
    Device set to use cpu
    [{'label': 'POSITIVE', 'score': 0.999839186668396}]

    Explain the command you just ran and the output you got.

    Note: the meaning of the -c flag in the python command can be found by either typing python --help in your shell or e.g. here.

There are no requirements for where the code for the following exercises should go. You can work however you prefer: in a .py script, in Python's interactive mode, in IPython, or in a (Jupyter) Notebook.

⚠️ Warning If you get memory issues (e.g. not enough RAM and your computer becomes laggy), one way to proceed is to use the cloud-based Colab notebooks from Google.

  1. Run the following code:

    from transformers import pipeline, set_seed
    
    set_seed(42)
    
    generator = pipeline("text-generation", model="gpt2")

    The generator variable is a TextGenerationPipeline object; it takes as input some text (i.e. a prompt) and generates more text conditioned on the prompt. The parameter model="gpt2" selects the GPT2 small model.

    See also the docs for more on the TextGenerationPipeline object (the generator variable).

    Note: running pipeline("text-generation", model="gpt2") for the first time downloads the weights (estimated parameters) of the GPT-2 model, among other things. On Mac/Linux, look for the file model.safetensors inside the folder ~/.cache/huggingface/hub/models--gpt2/; this file contains the weights of the model. The vocabulary and tokenizer files are also stored there. On Windows, look under C:\Users\username\.cache\huggingface\hub\models--gpt2 or something similar.
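
    Optionally, you can also inspect the cache from Python; a minimal sketch, assuming the scan_cache_dir helper from the huggingface_hub package (installed as a dependency of transformers):

    from huggingface_hub import scan_cache_dir

    # list the cached model repositories and their sizes on disk
    for repo in scan_cache_dir().repos:
        print(repo.repo_id, repo.size_on_disk)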

  2. Generate some text using the generator. You can e.g. use the prompt "Hello, I'm a language model," as a starting point. Set max_length=50 to avoid frying your computer (the maximum value of max_length is 1024; this is the context length of the model). Try setting the parameter num_return_sequences to some number larger than one to generate several sequences.
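
    For example, something along these lines (a minimal sketch; your output will differ):

    outputs = generator(
        "Hello, I'm a language model,",
        max_length=50,
        num_return_sequences=3,
    )
    for out in outputs:
        print(out["generated_text"])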

  3. Try running:

    print(generator.tokenizer, generator.model, sep="\n" + "-" * 45 + "\n")

    You should see the tokenizer and model underlying the generator variable. These are the two components that transform the text "Hello, I'm a language model," into the output text that was generated in 2. Behind the scenes (a minimal sketch of these steps follows the list):

    • the tokenizer tokenizes the text into tokens
    • the tokens are transformed into tensors
    • the tensors are passed through the network (forward pass)
    • the output is, for each input position, a vector of logits over the vocabulary, which can be turned into a probability distribution (and used to find the most likely next token)
    • a sampling strategy is used to generate text based on the distribution (see e.g. section 10.2 in SLP3)
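
    A minimal sketch of these steps, using the tokenizer and model attached to the generator (here simply picking the single most likely next token; the variable names are just for illustration):

    import torch

    enc = generator.tokenizer("Hello, I'm a language model,", return_tensors="pt")
    with torch.no_grad():
        out = generator.model(**enc)                   # forward pass
    probs = torch.softmax(out.logits[0, -1], dim=-1)   # distribution over the vocabulary at the last position
    next_id = probs.argmax().item()                    # most likely next token
    print(generator.tokenizer.decode([next_id]))
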
  4. Instead of using the generator object, let’s initialize the tokenizer and model ourselves. Do this by running:

    from transformers import GPT2LMHeadModel, GPT2Tokenizer
    
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    Print out the model and its config:

    print(model, model.config, sep="\n" + "-" * 45 + "\n")

    This should yield:

    GPT2LMHeadModel(
      (transformer): GPT2Model(
        (wte): Embedding(50257, 768)
        (wpe): Embedding(1024, 768)
        (drop): Dropout(p=0.1, inplace=False)
        (h): ModuleList(
          (0-11): 12 x GPT2Block(
            (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (attn): GPT2Attention(
              (c_attn): Conv1D(nf=2304, nx=768)
              (c_proj): Conv1D(nf=768, nx=768)
              (attn_dropout): Dropout(p=0.1, inplace=False)
              (resid_dropout): Dropout(p=0.1, inplace=False)
            )
            (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
            (mlp): GPT2MLP(
              (c_fc): Conv1D(nf=3072, nx=768)
              (c_proj): Conv1D(nf=768, nx=3072)
              (act): NewGELUActivation()
              (dropout): Dropout(p=0.1, inplace=False)
            )
          )
        )
        (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      )
      (lm_head): Linear(in_features=768, out_features=50257, bias=False)
    )
    ---------------------------------------------
    GPT2Config {
      "activation_function": "gelu_new",
      "architectures": [
        "GPT2LMHeadModel"
      ],
      "attn_pdrop": 0.1,
      "bos_token_id": 50256,
      "dtype": "float32",
      "embd_pdrop": 0.1,
      "eos_token_id": 50256,
      "initializer_range": 0.02,
      "layer_norm_epsilon": 1e-05,
      "model_type": "gpt2",
      "n_ctx": 1024,
      "n_embd": 768,
      "n_head": 12,
      "n_inner": null,
      "n_layer": 12,
      "n_positions": 1024,
      "reorder_and_upcast_attn": false,
      "resid_pdrop": 0.1,
      "scale_attn_by_inverse_layer_idx": false,
      "scale_attn_weights": true,
      "summary_activation": null,
      "summary_first_dropout": 0.1,
      "summary_proj_to_labels": true,
      "summary_type": "cls_index",
      "summary_use_proj": true,
      "task_specific_params": {
        "text-generation": {
          "do_sample": true,
          "max_length": 50
        }
      },
      "transformers_version": "4.56.2",
      "use_cache": true,
      "vocab_size": 50257
    }

    From the config output we see that:

    • vocab_size=50257 (the vocabulary size is \(|V| = 50257\))
    • n_head=12 (the model has \(12\) attention heads)
    • n_ctx=1024 (the model has a context length of \(1024\) i.e. it can take a maximum of \(1024\) tokens as input)
    • n_embd=768 (the model dimension equals \(768\))
    • the bos_token_id and eos_token_id (beginning and end of sequence tokens) both equal 50256

    From the model we see that:

    • the GPT2LMHeadModel consists of a (transformer) and a (lm_head) (language model head)
    • the transformer consists of
      • two embedding layers: one for the vocabulary (wte) and one for the positions in the input sequence (wpe)
      • \(12\) GPT2Blocks consisting of LayerNorm, attention (GPT2Attention), a feedforward layer (GPT2MLP) and some dropout in between.
    • the lm_head is just an nn.Linear layer mapping from a space of dimension n_embd to a space of dimension vocab_size
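
    You can check several of these values programmatically (expected outputs shown as comments):

    print(model.config.vocab_size)   # 50257
    print(model.config.n_head)       # 12
    print(model.config.n_positions)  # 1024 (the context length)
    print(model.config.n_embd)       # 768
    print(model.lm_head)             # Linear(in_features=768, out_features=50257, bias=False)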

    See also SLP3 chapter 9 (e.g. figure 9.15).

    If you are interested, you can see the original repository of GPT-2 here.

  5. Let’s tokenize an input sentence. Run the following:

    text = "Hello, I'm a language model,"
    encoded_input = tokenizer(text, return_tensors="pt")

    Use the following mapping to see what the \(8\) numbers inside the encoded_input["input_ids"] tensor mean:

    id_to_token = {i: t for t, i in tokenizer.get_vocab().items()}

    Note: Ġ in front of a token indicates that the token begins with a space.
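
    For instance, to look up a single id (decoding all \(8\) of them is left to you):

    first_id = encoded_input["input_ids"][0, 0].item()
    print(first_id, id_to_token[first_id])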

  6. We will now try to reproduce the loss of the model. Run:

    output = model(**encoded_input, labels=encoded_input["input_ids"])

    output consists of the computed loss, the logits and more. We can print it out as:

    print(output.loss, output.logits, output.logits.shape, sep="\n")
    tensor(4.0033, grad_fn=<NllLossBackward0>)
    tensor([[[ -35.2362,  -35.3266,  -38.9753,  ...,  -44.4645,  -43.9974,
               -36.4580],
             [-112.6171, -114.5832, -116.5725,  ..., -119.0128, -118.8059,
              -111.6917],
             [-151.7890, -152.3330, -156.7318,  ..., -162.0787, -155.4329,
              -154.7270],
             ...,
             [-101.2856, -102.6806, -106.1684,  ..., -111.2952, -112.3795,
              -104.9979],
             [-101.5027, -103.5055, -108.4597,  ..., -116.2317, -114.9146,
              -105.6840],
             [-103.7558, -105.5973, -106.9940,  ..., -110.1292, -110.7860,
              -104.5280]]], grad_fn=<UnsafeViewBackward0>)
    torch.Size([1, 8, 50257])

    Use output.logits and encoded_input["input_ids"] to compute the cross entropy loss. You should get the exact same value as output.loss.item(). To do this, replace the placeholders _ in the snippet below with the correct values.

    import numpy as np
    import torch.nn as nn
    
    loss_fn = nn.CrossEntropyLoss()
    
    logits = output.logits.squeeze()
    yhat = _
    y = _
    loss = loss_fn(yhat, y).item()
    
    assert np.allclose(loss, output.loss.item())

    Note: the squeeze operation transforms the logits from a tensor of shape (1, 8, 50257) into one of shape (8, 50257), see here.

    Hint: recall that the language model compares the output distribution at the current position with the true label of the next token. As there are \(8\) tokens in the example above, you will have to slice the logits and the true labels so they match up correctly. Get visual guidance from Figure 9.1 in SLP3 chapter 9.
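
    As a reminder of the interface only (not the solution): nn.CrossEntropyLoss expects logits of shape (N, C) and integer class labels of shape (N,), as in this dummy example with made-up values:

    import torch
    import torch.nn as nn

    dummy_logits = torch.randn(1, 4, 50257).squeeze()   # (1, 4, 50257) -> (4, 50257): 4 positions, |V| classes
    dummy_labels = torch.randint(0, 50257, (4,))         # one integer token id per position
    print(nn.CrossEntropyLoss()(dummy_logits, dummy_labels))  # a scalar tensor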

Exercise 2 - Sampling from GPT-2

We will now sample from the GPT-2 model i.e. generate output based on a prompt. We will implement greedy decoding and top-k sampling.

  1. Greedy decoding

    Greedy decoding amounts to choosing the most probable next token for the given input prompt. Hence you should pick the argmax of the last logit in the output logits. The procedure is as follows:

    • tokenize some input sentence
    • pass the tokens through the model
    • select the logit corresponding to the last token in the input sequence
    • choose the index of the largest logit value (i.e. most likely next token); this can be done using the argmax method of the given logit
    • append the most likely token to the input sequence
    • repeat the above steps until the number of tokens equals max_length

    Implement a function greedy_decoding that takes as input a logit and returns the most likely next token (an integer).

    import torch

    def greedy_decoding(logit: torch.Tensor) -> int:
        ...

    To make it general enough to handle greedy decoding and top-k sampling, you should also implement a function generate_text with the following signature:

    from typing import Callable

    def generate_text(
        model: GPT2LMHeadModel,
        tokens: torch.Tensor,
        max_length: int,
        sample_fn: Callable[[torch.Tensor], int] = greedy_decoding,
    ) -> torch.Tensor:
        ...

    i.e. it takes the model, some input tokens (as a tensor), a max_length parameter and some function sample_fn that takes as input a tensor (logit) and returns a token (the sampled next token).

    If everything is done correctly, when running

    tokens = generate_text(
        model,
        tokenizer("Hello world,", return_tensors="pt")["input_ids"],
        max_length=26,
        sample_fn=greedy_decoding,
    )
    print(tokenizer.decode(tokens[0]))

    you should get output similar to:

    Hello world, I'm not sure what to say.
    
    "I'm sorry, but I'm not sure what to say.
  2. Top-k sampling

    Top-k sampling selects the next token by sampling from the top \(k\) tokens (measured by logit value), with weights given by their logit values. To do this, you should use the .topk method on the last logit of the sequence to get the indices and logit values corresponding to the top \(k\) tokens. Next, you should sample the next token using a categorical distribution implemented in PyTorch here; you should pass the logits of the top \(k\) tokens to the constructor.

    You can import the distribution as:

    from torch.distributions.categorical import Categorical
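
    As a quick illustration of these two building blocks on a made-up logit vector (this is not the full topk_sampling function):

    import torch
    from torch.distributions.categorical import Categorical

    dummy_logit = torch.tensor([0.1, 2.0, -1.0, 0.5])
    values, indices = dummy_logit.topk(2)   # the 2 largest logits and their positions
    dist = Categorical(logits=values)       # categorical distribution over the top-2 entries
    print(indices[dist.sample()].item())    # sampled position in the original vector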

    Hence, implement a function topk_sampling

    def topk_sampling(logit: torch.Tensor, k: int = 10) -> int:
        ...

    that samples the next token, based on top \(k\) sampling, given an input logit.

    With the generate_text function defined as in the previous exercise, you should be able to generate text by simply changing sample_fn from greedy_decoding to topk_sampling.

    If you run:

    from functools import partial
    
    torch.manual_seed(2025)
    tokens = generate_text(
        model,
        tokenizer("Hello world,", return_tensors="pt")["input_ids"],
        max_length=65,
        sample_fn=partial(topk_sampling, k=10),
    )
    print(tokenizer.decode(tokens[0]))

    you should get output equivalent to:

    Hello world, this is not the end of the world. We've got to have something to do with it, because if we don't then things can change. The next big thing is the future.
    
    It is also the future that is the future of the world. It is a world in which people can live

    Deep stuff!

Optional: Exercise 3 - Implementing GPT-2

See Andrej Karpathy’s video Let’s reproduce GPT-2 (124M).