You have probably used Large Language Models (LLMs) many times already, but do you actually know the math behind them? In this blog post, I’ll uncover the fundamental concepts behind a language model. I’ll initially consider it as a black box, and explain how text is transformed into suitable input tokens for the language model and how we can actually generate text from it. Then, we’ll open the black box together, starting from Recurrent Neural Network models and expanding to Transformer models, used for example by ChatGPT or Llama. I’ll also briefly explain how these models can be trained.
An auto-regressive language model, such as ChatGPT, computes the probability distribution over the next word given the preceding words. When you use ChatGPT, you enter a prefix sentence, called a prompt, and ChatGPT samples a completion conditional on this prompt. More generally, sentences can be viewed as a sequence of text tokens, e.g. [“My”, “name”, “is”], and a language model defines a probability over this sequence. The joint probability of the sequence can be factorized using the chain rule. For example, the probability P([“My”, “name”, “is”]) can be factorized into P([“My”]).P([“name”] | [“My”]).P([“is”] | [“My”, “name”]). The crux of the matter is how to effectively model and compute these conditional probabilities.
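To make the chain rule concrete, here is a tiny Python sketch where the conditional probabilities are simply made-up numbers stored in a dictionary (a real language model would compute them with a neural network):

```python
# Toy illustration of the chain-rule factorization (the probabilities are made up).
# P(["My", "name", "is"]) = P("My") * P("name" | "My") * P("is" | "My", "name")
conditional_probs = {
    ("My",): 0.02,               # P("My")
    ("My", "name"): 0.10,        # P("name" | "My")
    ("My", "name", "is"): 0.70,  # P("is" | "My", "name")
}

sequence = ["My", "name", "is"]
joint = 1.0
for i in range(len(sequence)):
    joint *= conditional_probs[tuple(sequence[: i + 1])]
print(joint)  # 0.02 * 0.10 * 0.70 = 0.0014
```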
Did you say tokens?
Tokenization is a lossless, i.e. reversible, translation of text into a sequence of smaller text chunks called tokens. The set of such tokens is called the vocabulary, and each individual token in the vocabulary is assigned a unique token id. Keep in mind that if your text can be tokenized into fewer tokens, you can fit more text in your context window (which is limited to a specific number of tokens; 8,192 tokens for gpt-4-0613). This can be an interesting criterion when comparing tokenizers.
One could decide to split sentences into words, so that each word in the sentence is a token. This is called word tokenization. In practice, there are many different words and new words come up every day, so this approach does not scale very well. A way around this is to introduce a special token to represent any word that is not in the vocabulary; this special token is usually called UNK.
Another approach is to split sentences into characters, hence called character tokenization. In practice, since the resulting sequences are much longer (sequences of characters instead of sequences of words), it is rather slow to train. Also, as we’ll see later in this post, language models try to assign meaningful vector-based representations called embeddings to each individual token, and it is much harder to learn meaningful representations of individual characters than of entire words.
Most recent models use a hybrid approach between word-level and character-level, called subword tokenization. There are different methods: for example GPT-2 uses Byte-Pair Encoding (BPE) whereas BERT uses WordPiece. Both approaches start with an initial vocabulary, e.g. the initial alphabet along with optional special tokens. Then, using a text corpus split according to the vocabulary, they apply merge rules to form a new text token from two adjacent text tokens already present in the vocabulary, and add the newly formed text token to the vocabulary. This process is repeated until the vocabulary reaches a desired size. BPE merges the most frequent pair in the text corpus. WordPiece instead prioritizes merging pairs whose individual tokens are comparatively rare, i.e. the pair [“w1”, “w2”] that maximizes P([“w1”, “w2”]) divided by P([“w1”]).P([“w2”]), computed on the text corpus; this is the merge that most increases the likelihood of the training data once added to the vocabulary. As an example, [“finan”, “cial”] might be merged faster than [“off”, “set”], since “off” is a more probable word start than “finan” and “set” is a more probable word end than “cial”.
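To illustrate the BPE merge loop, here is a minimal sketch on a toy corpus of word frequencies, with each word pre-split into characters. It is not the actual GPT-2 implementation (which works on bytes and records the merge rules for later reuse), just the core idea:

```python
from collections import Counter

# Toy corpus: word frequencies, each word pre-split into characters.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("n", "e", "w"): 6, ("n", "e", "w", "e", "r"): 3}

def most_frequent_pair(corpus):
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])  # fuse the pair into one token
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

vocab = {c for word in corpus for c in word}   # initial vocabulary: the alphabet
for _ in range(3):                             # grow the vocabulary with 3 merges
    pair = most_frequent_pair(corpus)
    vocab.add(pair[0] + pair[1])
    corpus = merge_pair(corpus, pair)
print(sorted(vocab))
```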
Given an existing vocabulary of text tokens, a sentence can be broken down into a sequence of tokens, using for example a greedy method like MaximumMatching. MaximumMatching iteratively picks the longest prefix of the remaining text that matches a subword in the model’s vocabulary.
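Here is a minimal sketch of MaximumMatching over a toy vocabulary (real tokenizers also handle word boundaries, continuation prefixes like WordPiece’s “##”, and unknown characters more carefully):

```python
def maximum_matching(text, vocab, unk="[UNK]"):
    """Greedily take the longest prefix of the remaining text found in the vocabulary."""
    tokens = []
    while text:
        for end in range(len(text), 0, -1):
            if text[:end] in vocab:
                tokens.append(text[:end])
                text = text[end:]
                break
        else:                       # no prefix matched: emit UNK and skip one character
            tokens.append(unk)
            text = text[1:]
    return tokens

vocab = {"my", "name", "is", "na", "me", "m", "y", "i", "s"}
print(maximum_matching("mynameis", vocab))  # ['my', 'name', 'is']
```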
Note: Tokenization can lead to subtle issues. For example, if “://” is a single token, then a language model will never learn to predict “//” after “:”. As another example, a prompt with a trailing whitespace like “My name is “ will be encoded differently from “My name is”, potentially leading to different behaviors.
For more advanced topics, see: SentencePiece, a more sophisticated subword tokenization algorithm.
How does the language model generate text?
As we said previously, a language model is a way to efficiently compute the probability distribution over the next word given the preceding words. Given a trained language model, one can choose to always pick the most probable next token. This is called greedy decoding. One potential problem with greedy decoding is that you can’t undo choices, and what looked like a good pick might end up off-track. Beam search mitigates this issue by keeping the K best hypotheses at each time step, and eventually choosing the best one among them.
You can also decide to sample from the probability distribution. Two common practices are “top-K”, where you sample only among the K most probable next tokens, and “top-p”, where you sample only among the smallest possible set of words whose cumulative probability exceeds the probability p. In both approaches, you redistribute the probability mass among the filtered set of words. One can also modify the randomness of the probability distribution by using a “temperature” T: the scores are divided by T before being passed to the softmax operation. A higher temperature T yields a more uniform distribution, while the limit T→0 amounts to greedy decoding.
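Here is a minimal numpy sketch of these decoding strategies, operating on a single made-up vector of next-token scores (a real implementation would loop, feeding each sampled token back into the model):

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None,
                      rng=np.random.default_rng()):
    if temperature == 0:                       # limit case: greedy decoding
        return int(np.argmax(logits))
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    if top_k is not None:                      # keep only the K most probable tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                      # smallest set with cumulative probability >= p
        order = np.argsort(probs)[::-1]
        cumulative = np.cumsum(probs[order])
        keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
    probs = probs / probs.sum()                # redistribute the probability mass
    return int(rng.choice(len(probs), p=probs))

logits = [2.0, 1.0, 0.5, -1.0]                 # made-up next-token scores
print(sample_next_token(logits, temperature=0))            # greedy -> token 0
print(sample_next_token(logits, temperature=0.7, top_k=2)) # sample among the 2 best
```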
In general, maximum probability decoding is good for tasks like machine translation and summarization, whereas more randomness can lead to more “creative” responses. To reduce repetition, one can use simple heuristics, such as not repeating n-grams, or different training objectives, such as unlikelihood objective.
For more practical details, see: HuggingFace’s How to generate text.
For more advanced topics: contrastive decoding, contrastive search.
Tell me more about what is meant by “language model”
As we discussed previously, sentences are split into a sequence of tokens, and each token is mapped to its unique ID. Then, an embedding layer converts this sequence of IDs into a sequence of vector representations called embeddings. Conceptually, the embedding of a text token is a real-valued vector representation of this text token that is supposed to encode its meaning. For example, we might expect text tokens that are close in the embedding space to be similar in meaning. An embedding layer is simply a lookup table that stores the embeddings of a dictionary and retrieves them using indices. These embeddings are randomly initialized, then learnt while training the language model.
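In code, an embedding layer really is just an indexed table; here is a minimal numpy sketch (the sizes and token ids below are made up):

```python
import numpy as np

vocab_size, d_model = 50_000, 768        # toy sizes, in the ballpark of GPT-2
rng = np.random.default_rng(0)

# The embedding layer: a lookup table, randomly initialized and learnt during training.
embedding_table = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([312, 1403, 271])   # made-up ids for ["My", " name", " is"]
embeddings = embedding_table[token_ids]  # shape (3, 768): one vector per token
print(embeddings.shape)
```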
Language has a natural ordering, which motivates the use of recurrent neural networks (RNN). Vanilla RNNs process one token at a time using an internal state h(t) that is updated as the sequence of tokens is processed. In math terms: h(t) = f(h(t−1), x(t)), usually with h(t) = tanh(Whh.h(t−1) + Whx.x(t)) and y(t) = Why.h(t), where Whh, Whx and Why are weight matrices, i.e. linear layers. RNNs are used for example in sequence-to-sequence problems like summarization or neural machine translation. In sequence-to-sequence problems, an encoder RNN first processes the input sequence and produces an initial hidden state for the decoder RNN. The decoder RNN then generates the target sentence conditioned on the input sequence encoding. At each time step, the decoder produces an output vector y(t) for the next token to be predicted, then applies a softmax layer to turn it into a probability distribution over the next token. More generally, the encoder maps an input sequence into a sequence of intermediate feature representations, and the decoder uses this learned intermediate representation to generate an output sequence.
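Here is a minimal numpy sketch of the vanilla RNN recurrence described above (random weights, toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d_hidden, d_input = 16, 8                          # toy dimensions

W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_hx = rng.normal(scale=0.1, size=(d_hidden, d_input))
W_hy = rng.normal(scale=0.1, size=(d_input, d_hidden))

def rnn_step(h_prev, x_t):
    """h(t) = tanh(Whh.h(t-1) + Whx.x(t)), y(t) = Why.h(t)."""
    h_t = np.tanh(W_hh @ h_prev + W_hx @ x_t)
    y_t = W_hy @ h_t
    return h_t, y_t

h = np.zeros(d_hidden)
for x_t in rng.normal(size=(5, d_input)):          # a sequence of 5 token embeddings
    h, y = rnn_step(h, x_t)
print(h.shape, y.shape)
```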
RNNs can process inputs of any length. The model size does not increase for longer inputs, and in theory the model can use information from many steps back. But in practice, recurrence is slow, and it has also proved difficult to access information from many steps back (and to learn long dependencies). This is partly due to the repeated multiplication by the same weight matrix, which can lead to exploding or vanishing gradients.
Long Short-Term Memory (LSTM) models add a separate memory, called the cell state c(t), to preserve information over many timesteps. While the hidden state h(t) stores all the information up to time step t, the cell state c(t) stores particular information that might be needed in the future. Using gating mechanisms, LSTMs can read, write and erase information from the cell.
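For the curious, here is one common formulation of a single LSTM step, sketched in numpy (the four gates are computed from [h(t−1); x(t)] in one matrix multiplication; exact variants differ between implementations):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """W has shape (4*d_hidden, d_hidden + d_input): forget, input, output, candidate."""
    d = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * d:1 * d])      # forget gate: what to erase from the cell
    i = sigmoid(z[1 * d:2 * d])      # input gate: what to write to the cell
    o = sigmoid(z[2 * d:3 * d])      # output gate: what to read from the cell
    g = np.tanh(z[3 * d:4 * d])      # candidate cell content
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

d_hidden, d_input = 16, 8
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * d_hidden, d_hidden + d_input))
b = np.zeros(4 * d_hidden)
h, c = lstm_step(np.zeros(d_hidden), np.zeros(d_hidden), rng.normal(size=d_input), W, b)
print(h.shape, c.shape)
```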
Another way to improve performance is to introduce more direct and linear paths, such as residual connections, attention, etc. Attention helps the model understand which words are particularly relevant. For example, in sequence-to-sequence models, attention allows the decoder to focus on different regions of the source sentence during the course of decoding. Conceptually, attention maps a query and a set of key-value pairs to an output. To do so, it computes similarity scores between the query and the keys, normalizes these scores into a probability distribution, and computes a probability-weighted average of the values. Attention helps model complex dependencies regardless of token distances in the input or output sequences. There are different types of attention, but the most widely used one is called scaled dot-product attention:
Attention(Q, K, V) = softmax(Q.Kᵀ / √dk).V
where dk is the dimension of the query and key vectors. Scaling by √dk preserves a unit variance, if you consider that the individual components of the queries and keys are independent centered variables with unit variance (so the raw dot product has variance dk). Without scaling, large values might be passed into the softmax, resulting in peaky distributions that might aggregate information from only a few tokens.
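This is short enough to write out in numpy; here is a minimal sketch with random queries, keys and values:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_kv, d_k), V: (n_kv, d_v) -> output: (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity between each query and each key
    weights = softmax(scores, axis=-1)   # normalize into a probability distribution per query
    return weights @ V                   # probability-weighted average of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (3, 8)
```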
Note: RNN can also be used in other applications, for example time series processing.
For more advanced topics, see: ResNet.
Ok, but that’s not the model behind ChatGPT?
You might have heard about the paper “Attention is all you need”. Long story short, its title captures the foundational idea of the Transformer architecture, introduced in this paper and still the best performing model architecture for language models. At a high level, ChatGPT is a decoder-only Transformer model. A decoder-only Transformer consists of an embedding layer followed by a stack of decoder blocks, each made of a causal self-attention layer followed by a feed-forward layer, with a final linear layer and softmax on top to produce the conditional probabilities. The attention layer can be viewed as a communication mechanism, where each token holds some information to be aggregated, whereas the subsequent feed-forward layer helps tokens act on the propagated information. The name “decoder-only” comes from the fact that the original Transformer model introduced in the paper mentioned above is actually an encoder-decoder model, with a stack of encoder blocks and a stack of decoder blocks. Similarly to what we said in the previous section, the encoder maps the input sequence into intermediate features that are used by the decoder to generate the output.
For an input sequence of hidden states fed to the attention layer, each hidden state is first projected into query, key and value vectors (self-attention: across tokens in the same sequence). Attention scores are obtained by taking the dot product of the queries and the keys. A mask is then applied to ensure that words can only attend to, i.e. get information from, their preceding words (causal attention: words are blocked from attending to future words). Masked scores are set to -inf, and the scores are passed to a softmax to form a probability distribution. The final output is a weighted sum of the values.
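Putting the projections and the causal mask together, here is a minimal single-head sketch in numpy (real implementations are batched and fused, but the logic is the same):

```python
import numpy as np

def causal_self_attention(X, W_q, W_k, W_v):
    """X: (n, d) hidden states. Each token attends only to itself and preceding tokens."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)    # True above the diagonal = future tokens
    scores = np.where(mask, -np.inf, scores)                 # masked scores are set to -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
n, d = 4, 8
X = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
print(causal_self_attention(X, W_q, W_k, W_v).shape)  # (4, 8)
```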
Since there are multiple “aspects” that we would want to match on (syntax, semantics, etc.), multi-head attention generalizes attention by introducing multiple channels of communication. It effectively performs multiple attention operations in parallel and concatenates their outputs.
By projecting the input hidden states to query, key, and value vectors of dimension smaller by a factor h, where h is the number of heads in the multi-head attention, the computational cost is kept similar to that of a single head attention with full dimensionality.
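Here is a minimal sketch of that head splitting in numpy (the causal mask is omitted here to keep the focus on the reshaping):

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d). Each head works on d // n_heads dimensions, outputs are concatenated."""
    n, d = X.shape
    d_head = d // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    split = lambda M: M.reshape(n, n_heads, d_head).transpose(1, 0, 2)   # (h, n, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)                # (h, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    heads = weights @ Vh                                                 # (h, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d)                      # concatenate head outputs
    return concat @ W_o                                                  # final output projection

rng = np.random.default_rng(0)
n, d, h = 4, 16, 4
X = rng.normal(size=(n, d))
W_q, W_k, W_v, W_o = (rng.normal(scale=0.1, size=(d, d)) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads=h).shape)  # (4, 16)
```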
With n being the sequence length, and d the hidden state dimension, self attention has a computational complexity of O(n^2.d) per layer, does O(1) sequential operations, and has a maximum path length of O(1). By comparison, a recurrent model has a O(n.d^2) complexity per layer, does O(n) sequential operations, and has a maximum path length of O(n).
Note: There are roughly 2x as many parameters in the feed-forward layer (2.dff.demb + dff + demb, where dff = 4.demb is the projected dimension size in the feed-forward layer) as in the multi-head attention (3.dk.nhead.demb + demb^2). In terms of FLOPs, multi-head attention scales as O(n^2.d) and the feed-forward layer scales as O(n.d^2). For models with a high hidden dimension, for example GPT-3 where the sequence length is 2,048 and the hidden state dimension is 12,288, the attention FLOPs are small compared to the feed-forward FLOPs.
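As a rough back-of-the-envelope check (ignoring the Q/K/V/output projections, which scale like the feed-forward term):

```python
# Per layer, per sequence, using the scalings above.
n, d = 2_048, 12_288                   # GPT-3-like sequence length and hidden dimension
attention_flops = n * n * d            # score matrix + weighted sum: O(n^2 . d)
feed_forward_flops = 8 * n * d * d     # two matmuls of width 4d: O(n . d^2)
print(feed_forward_flops / attention_flops)   # 8 * d / n = 48
```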
For more practical details, see: The Annotated Transformer.
I don’t see any dependency in the word ordering…
You’re correct: there is no notion of word ordering in what I previously described. But word order does matter (the position of a single “not” can change the meaning of a sentence), and that’s why Transformers also have positional embeddings, to give the model some information about the relative or absolute position of the tokens in the sequence.
There are many choices of positional encodings, some learned and others fixed. Ideally, the encoding should give the same value for a given position whatever the text length (so fractions based on the text length are ruled out). A common approach is to use Sinusoidal Positional Encoding. It takes intuition from the representation of numbers in binary format, where each bit alternates at a different frequency depending on its position. For example, consider the numbers [0, 1, 2, 3, 4, 5] and their binary representations [0000, 0001, 0010, 0011, 0100, 0101], and notice that the lowest bit alternates on every number, the second lowest bit on every two numbers, etc. In Sinusoidal Positional Encoding, each dimension of the positional encoding corresponds to a sinusoid such that, for any fixed offset k, PE(position+k) can be represented as a linear function of PE(position). This makes it easier for the model to learn to attend by relative positions. The positional embeddings have the same dimension as the token embeddings, so they can be summed together. Typically, in Sinusoidal Positional Encoding, only the first few embedding dimensions are actually used to store information about the positions. Since the token embeddings are trained from scratch, they are probably learnt such that they don’t interfere with the positional encoding.
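Here is a minimal numpy sketch of the sinusoidal encoding, following the formulation of the original paper:

```python
import numpy as np

def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(n_positions)[:, None]              # (n_positions, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)   # each dimension gets its own frequency
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(n_positions=128, d_model=64)
# token_embeddings + pe -> same shape, so the two can simply be summed
print(pe.shape)  # (128, 64)
```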
For more advanced topics, see: ALIBI, RoPE.
How do you train a language model?
Neural networks are computational graphs, where nodes represent mathematical operations and edges represent tensors. Using a differentiable training loss, gradients are backpropagated through the graph and the optimizer performs gradient descent on the model parameters. Different optimizers can be used, but Adam (adaptive moment estimation) or AdamW (Adam with decoupled weight decay) are typically used.
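To give a flavor of what the optimizer does, here is a minimal sketch of a single AdamW update for one parameter tensor (in practice you would use a framework’s optimizer, which keeps these moment buffers for every parameter):

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update: adaptive moment estimation plus decoupled weight decay."""
    m = beta1 * m + (1 - beta1) * grad          # first moment: moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment: moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v

param = np.ones(4)
m, v = np.zeros(4), np.zeros(4)
grad = np.array([0.1, -0.2, 0.3, 0.0])          # pretend this came from backpropagation
param, m, v = adamw_step(param, grad, m, v, t=1)
print(param)
```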
For language models, we typically differentiate between pre-training, where the model is trained on a vast amount of unlabeled text data to capture underlying structure and semantic knowledge, and fine-tuning, where the model is trained on a curated dataset for a specific task or domain.
Encoder-only models, such as BERT, are usually pre-trained using a Masked Language Modeling (MLM) loss. A certain percentage, typically 15%, of the words are chosen randomly and then either masked, left unchanged, or replaced with another word. The model is trained to predict those masked words using information on the left and the right (bidirectional). Decoder-only models, such as ChatGPT, are trained using a Causal Language Modeling (CLM) loss. The model is trained to predict the next word given preceding words (unidirectional).
MLM is preferred when the goal is to learn a good representation of the text, whereas CLM is preferred when the goal is to learn a system that generates fluent text. Both use a cross-entropy loss to compare the predicted probability distribution over the target word (either a masked word or the next word) with the ground truth. The models are trained to maximize the probability of the “correct” word, conditioning on the ground-truth context rather than on their own predictions; this behavior is called teacher forcing and it can lead to exposure bias at inference time.
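Here is a minimal numpy sketch of the CLM objective: the logits produced at position t are compared, with a cross-entropy loss, to the ground-truth token at position t+1 (the logits and token ids below are random placeholders for a real model’s outputs):

```python
import numpy as np

def causal_lm_loss(logits, token_ids):
    """logits: (n, vocab_size) scores at each position; token_ids: (n,) ground-truth sequence."""
    shifted = logits[:-1] - logits[:-1].max(axis=-1, keepdims=True)  # predictions for tokens 1..n-1
    targets = token_ids[1:]                                          # the "correct" next tokens
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()       # average cross-entropy

rng = np.random.default_rng(0)
vocab_size, n = 100, 6
logits = rng.normal(size=(n, vocab_size))       # made-up model outputs
token_ids = rng.integers(vocab_size, size=n)    # made-up ground-truth token ids
print(causal_lm_loss(logits, token_ids))
```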
Note: Large language model training is a broad topic. In a later blog post, I’ll cover more, for example describing typical optimization techniques like mixed precision, parameter-efficient fine-tuning, and distributed training.
I hope you enjoyed this blog post! Hopefully it was not too dense; I saved additional topics for later blog posts, such as model evaluation, more advanced details of model training, recent model architecture improvements, etc.