Exercise Class 4
This week we start from this repository.
Exercise 1 - (Hugging Face) Transformers and GPT-2 (small)
Before trying to build a transformer ourselves, we will play around with an already trained model. To do this, we will use the Hugging Face Transformers library, an important library (and community) to get accustomed to if you want to dive deeper into LLMs and more.
Following the installation instructions, install the CPU-only version of the library by running the command (at the top of your project directory):

```bash
uv add 'transformers[torch]'
```

Try running the following command, also from the installation instructions, to test that everything works as expected:

```bash
python -c "from transformers import pipeline; print(pipeline('sentiment-analysis')('hugging face is the best'))"
```

You should get something like:

```
No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu
[{'label': 'POSITIVE', 'score': 0.999839186668396}]
```

Explain the command you just ran and the output you got.
Note: the meaning of the `-c` flag in the `python` command can be found by either typing `python --help` in your shell or e.g. here.
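If it helps to unpack the one-liner, the same call can be written out over several lines (a sketch with the same behaviour as the shell command above):

```python
from transformers import pipeline

# Build a sentiment-analysis pipeline; since no model is specified, it falls back to
# the default distilbert checkpoint mentioned in the warning above.
classifier = pipeline("sentiment-analysis")

# Run the classifier on a single string and print the predicted label and score.
result = classifier("hugging face is the best")
print(result)  # [{'label': 'POSITIVE', 'score': ...}]
```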
There are no requirements for where the code for the following exercises should go. You can do them as you wish: with a `.py` script, interactively using the interactive mode, using IPython, or in a (Jupyter) Notebook.
⚠️ Warning If you get memory issues (e.g. not enough RAM and your computer becomes laggy), one way to proceed is to use the cloud-based Colab notebooks from Google.
Run the following code:

```python
from transformers import pipeline, set_seed

set_seed(42)
generator = pipeline("text-generation", model="gpt2")
```

The `generator` variable is a `TextGenerationPipeline` object; it takes as input some text (i.e. a prompt) and generates more text conditioned on the prompt. The parameter `model="gpt2"` selects the GPT-2 small model.

See also the docs for more on the `TextGenerationPipeline` object (the `generator` variable).

Note: running `pipeline("text-generation", model="gpt2")` the first time downloads the weights (estimated parameters) of the GPT-2 model and more. On Mac/Linux, look for the file `model.safetensors` inside the folder `~/.cache/huggingface/hub/models--gpt2/`; this file contains the weights of the model. The vocabulary and tokenizer are also stored there. On Windows, look under `C:\Users\username\.cache\huggingface\hub\models--gpt2` or something similar.

Generate some text using the `generator`. You can e.g. use the prompt `"Hello, I'm a language model,"` as a starting point. Set `max_length=50` to avoid frying your computer (the maximum value of `max_length` is `1024`; this is the context length of the model). Try setting the parameter `num_return_sequences` to some number larger than one to generate several sequences. A sketch of such a call is shown below.
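For example, a call along these lines should do (the prompt and parameter values here are only suggestions, and the generated texts will vary):

```python
# Generate a few continuations of the prompt; max_length caps the total number of tokens.
outputs = generator(
    "Hello, I'm a language model,",
    max_length=50,
    num_return_sequences=3,
)
for out in outputs:
    print(out["generated_text"])
    print("-" * 45)
```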
Try running:

```python
print(generator.tokenizer, generator.model, sep="\n" + "-" * 45 + "\n")
```

You should see the tokenizer and model underlying the `generator` variable. These are the two components that transform the text `"Hello, I'm a language model,"` into the output text that was generated in 2. Behind the scenes (sketched in code right after this list):

- the tokenizer tokenizes the text into tokens
- the tokens are transformed into tensors
- the tensors are passed through the network (forward pass)
- the output is a probability distribution for each input token over the vocabulary (which can be used to find the most likely next token)
- a sampling strategy is used to generate text based on the distribution (see e.g. section 10.2 in SLP3)
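These steps can be sketched by hand using the `generator`'s own tokenizer and model (a rough illustration; the real pipeline adds more logic around generation parameters and repeated sampling):

```python
import torch

text = "Hello, I'm a language model,"

# 1) + 2) tokenize the text and turn the token ids into a tensor
inputs = generator.tokenizer(text, return_tensors="pt")

# 3) forward pass through the network
with torch.no_grad():
    logits = generator.model(**inputs).logits  # shape (1, num_tokens, vocab_size)

# 4) the logits for the last position define a distribution over the next token
probs = torch.softmax(logits[0, -1], dim=-1)

# 5) a (greedy) sampling strategy: pick the most likely next token
next_token_id = probs.argmax().item()
print(generator.tokenizer.decode([next_token_id]))
```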
Instead of using the `generator` object, let's initialize the tokenizer and model ourselves. Do this by running:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
```

Print out the model and its config:

```python
print(model, model.config, sep="\n" + "-" * 45 + "\n")
```

This should yield:

```
GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)
---------------------------------------------
GPT2Config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "dtype": "float32",
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.1,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.56.2",
  "use_cache": true,
  "vocab_size": 50257
}
```

From the config output we see that:
- `vocab_size=50257` (the vocabulary size is \(|V| = 50257\))
- `n_head=12` (the model has \(12\) attention heads)
- `n_ctx=1024` (the model has a context length of \(1024\), i.e. it can take a maximum of \(1024\) tokens as input)
- `n_embd=768` (the model dimension equals \(768\))
- the `bos_token_id` and `eos_token_id` (beginning- and end-of-sequence tokens) both equal `50256`
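These values can also be read directly off the config object (assuming the `model` variable from above):

```python
# The config entries are exposed as attributes on model.config.
print(model.config.vocab_size)  # 50257
print(model.config.n_head)      # 12
print(model.config.n_ctx)       # 1024
print(model.config.n_embd)      # 768
```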
From the model we see that:

- the `GPT2LMHeadModel` consists of a `(transformer)` and a `(lm_head)` (language model head)
- the transformer consists of
  - two embedding layers: one for the vocabulary and one for the time dimension (the position of an element in the input sequence)
  - \(12\) `GPT2Block`s consisting of `LayerNorm`, attention (`GPT2Attention`), a feedforward layer (`GPT2MLP`) and some dropout in between
- the `lm_head` is just an `nn.Linear` layer from a space with dimension equal to `n_embd` to a space with dimension equal to `vocab_size`
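A quick sanity check of this structure (assuming the `model` variable from above; the comments state the values you should expect to see):

```python
print(model.transformer.wte)     # Embedding(50257, 768) -- token (vocabulary) embeddings
print(model.transformer.wpe)     # Embedding(1024, 768)  -- position embeddings
print(len(model.transformer.h))  # 12 GPT2Blocks
print(model.lm_head)             # Linear(in_features=768, out_features=50257, bias=False)
print(sum(p.numel() for p in model.parameters()))  # total parameter count (roughly 124M)
```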
See also SLP3 chapter 9 (e.g. figure 9.15).
If you are interested, you can see the original repository of GPT-2 here.
Let's tokenize an input sentence. Run the following:

```python
text = "Hello, I'm a language model,"
encoded_input = tokenizer(text, return_tensors="pt")
```

Use the following mapping to see what the \(8\) numbers inside the `encoded_input["input_ids"]` tensor mean:

```python
id_to_token = {i: t for t, i in tokenizer.get_vocab().items()}
```

Note: Ġ in front of a token indicates that the token begins with a space.
We will now try to reproduce the loss of the model. Run:

```python
output = model(**encoded_input, labels=encoded_input["input_ids"])
```

`output` consists of the computed loss, the logits and more. We can print it out as:

```python
print(output.loss, output.logits, output.logits.shape, sep="\n")
```

```
tensor(4.0033, grad_fn=<NllLossBackward0>)
tensor([[[ -35.2362,  -35.3266,  -38.9753,  ...,  -44.4645,  -43.9974,  -36.4580],
         [-112.6171, -114.5832, -116.5725,  ..., -119.0128, -118.8059, -111.6917],
         [-151.7890, -152.3330, -156.7318,  ..., -162.0787, -155.4329, -154.7270],
         ...,
         [-101.2856, -102.6806, -106.1684,  ..., -111.2952, -112.3795, -104.9979],
         [-101.5027, -103.5055, -108.4597,  ..., -116.2317, -114.9146, -105.6840],
         [-103.7558, -105.5973, -106.9940,  ..., -110.1292, -110.7860, -104.5280]]],
       grad_fn=<UnsafeViewBackward0>)
torch.Size([1, 8, 50257])
```

Use `output.logits` and `encoded_input["input_ids"]` to compute the cross-entropy loss. You should get the exact same value as `output.loss.item()`. To do this, replace the placeholders `_` in the snippet below with the correct values.

```python
import numpy as np
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = output.logits.squeeze()
yhat = _
y = _
loss = loss_fn(yhat, y).item()
assert np.allclose(loss, output.loss.item())
```

Note: the `squeeze` operation transforms the logits from a tensor of dimension `(1, 8, 50257)` into `(8, 50257)`, see here.

Hint: recall that the language model compares the output distribution for the current token with the true label of the next token. As there are \(8\) tokens in the example above, you will have to slice the logits and the true labels so they match up correctly. Get visual guidance from Figure 9.1 in SLP3 chapter 9.
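For reference, one way to fill in the placeholders, following the hint above (treat this as a sketch rather than the only possible solution):

```python
# The logits at position t predict the token at position t + 1, so drop the last logit
# and the first label before computing the loss (assumes `output` and `encoded_input`
# from above).
import numpy as np
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
logits = output.logits.squeeze()              # shape (8, 50257)
yhat = logits[:-1]                            # predictions for tokens 2..8, shape (7, 50257)
y = encoded_input["input_ids"].squeeze()[1:]  # the actual tokens 2..8, shape (7,)
loss = loss_fn(yhat, y).item()
assert np.allclose(loss, output.loss.item())
```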
Exercise 2 - Sampling from GPT-2
We will now sample from the GPT-2 model, i.e. generate output based on a prompt. We will implement greedy decoding and top-k sampling.
Greedy decoding
Greedy decoding amounts to choosing the most probable next token for the given input prompt. Hence you should pick the argmax of the last logit in the output logits. The procedure is as follows:
- tokenize some input sentence
- pass the tokens through the model
- select the logit corresponding to the last token in the input sequence
- choose the index of the largest logit value (i.e. the most likely next token); this can be done using the `argmax` method of the given logit
- append the most likely token to the input sequence
- repeat the above steps until the number of tokens equals `max_length`
Implement a function `greedy_decoding` that takes as input a logit and returns the most likely next token (an integer).

```python
def greedy_decoding(logit: torch.Tensor) -> int:
    ...
```

To make it general enough to handle greedy decoding and top-k sampling, you should also implement a function `generate_text` with the following signature:

```python
def generate_text(
    model: GPT2LMHeadModel,
    tokens: torch.Tensor,
    max_length: int,
    sample_fn: Callable[[torch.Tensor], int] = greedy_decoding,
) -> torch.Tensor:
    ...
```

i.e. it takes the model, some input tokens (as a tensor), a `max_length` parameter and some function `sample_fn` that takes as input a tensor (a logit) and returns a token (the sampled next token).
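One possible sketch of these two functions (treat it as a reference rather than the definitive solution; it assumes the `model` and `tokenizer` loaded earlier and uses the plain forward pass rather than `model.generate`):

```python
from typing import Callable

import torch
from transformers import GPT2LMHeadModel


def greedy_decoding(logit: torch.Tensor) -> int:
    # The most likely next token is the index of the largest logit value.
    return int(logit.argmax().item())


def generate_text(
    model: GPT2LMHeadModel,
    tokens: torch.Tensor,
    max_length: int,
    sample_fn: Callable[[torch.Tensor], int] = greedy_decoding,
) -> torch.Tensor:
    # tokens has shape (1, seq_len); keep appending sampled tokens until max_length.
    while tokens.shape[1] < max_length:
        with torch.no_grad():
            logits = model(tokens).logits      # shape (1, seq_len, vocab_size)
        next_token = sample_fn(logits[0, -1])  # logit vector for the last position
        tokens = torch.cat([tokens, torch.tensor([[next_token]])], dim=1)
    return tokens
```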
If everything is done correctly, when running:

```python
tokens = generate_text(
    model,
    tokenizer("Hello world,", return_tensors="pt")["input_ids"],
    max_length=26,
    sample_fn=greedy_decoding,
)
print(tokenizer.decode(tokens[0]))
```

you should get output similar to:

```
Hello world, I'm not sure what to say. "I'm sorry, but I'm not sure what to say.
```

Top-k sampling
Top-k sampling selects the next token by sampling from the top \(k\) tokens (measured by logit value), with weights given by their logit values. To do this, you should use the `.topk` method on the last logit of the sequence to get the indices and logit values corresponding to the top \(k\) tokens. Next, you should sample the next token using a categorical distribution implemented in PyTorch here; you should pass the logits of the top \(k\) tokens to the constructor.

You can import the distribution as:

```python
from torch.distributions.categorical import Categorical
```

Hence, implement a function `topk_sampling`

```python
def topk_sampling(logit: torch.Tensor, k: int = 10) -> int:
    ...
```

that samples the next token, based on top-\(k\) sampling, given an input logit.
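A possible sketch (again, one way among several; it uses the `Categorical` import above):

```python
import torch
from torch.distributions.categorical import Categorical


def topk_sampling(logit: torch.Tensor, k: int = 10) -> int:
    # Keep only the k largest logits and their token ids.
    values, indices = logit.topk(k)
    # Categorical(logits=...) applies a softmax over the k logit values internally.
    dist = Categorical(logits=values)
    # Sample an index into the top-k list and map it back to a vocabulary id.
    return int(indices[dist.sample()].item())
```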
With the `generate_text` function defined as in the previous exercise, you should be able to generate text by simply changing `sample_fn` from `greedy_decoding` to `topk_sampling`.

If you run:

```python
from functools import partial

torch.manual_seed(2025)
tokens = generate_text(
    model,
    tokenizer("Hello world,", return_tensors="pt")["input_ids"],
    max_length=65,
    sample_fn=partial(topk_sampling, k=10),
)
print(tokenizer.decode(tokens[0]))
```

you should get output equivalent to:

```
Hello world, this is not the end of the world. We've got to have something to do with it, because if we don't then things can change. The next big thing is the future. It is also the future that is the future of the world. It is a world in which people can live
```

Deep stuff!
Optional: Exercise 3 - Implementing GPT-2
See Andrej Karpathy’s video Let’s reproduce GPT-2 (124M).