
Demystifying the Transformer Architecture

Transformers, Roll Out!

Jan 2, 2025

This article is part of the series Demystifying Transformers.

The Transformer architecture, introduced in the paper “Attention Is All You Need,” has revolutionized modern AI, powering models like BERT, GPT, and countless others. But behind the impressive results lies a complex yet elegant design. This blog post aims to demystify the Transformer, breaking down its core components and explaining how they work together through a simple English-to-Chinese translation example.

(This blog post assumes the reader is already familiar with basic concepts such as embeddings and attention.)

Architecture

Here’s a breakdown of the key components of the architecture:

Encoder (the left half of the architecture diagram in the paper):

  • Input Embedding: The input words (or tokens) are converted into vector representations called embeddings. These embeddings capture semantic meaning.
  • Positional Encoding: Because Transformers don’t inherently process sequential information (unlike recurrent neural networks), positional encodings are added to the embeddings to give the model a sense of word order (a sketch of the sinusoidal variant follows this list).
  • Multi-Head Attention: This is the core of the Transformer. It allows the model to attend to different parts of the input sequence simultaneously, capturing relationships between words.
  • Add & Norm: This block performs a residual connection (adding the input to the output of the attention layer) and layer normalization (normalizing the activations). These techniques help with training deeper networks.
  • Feed Forward: A fully connected feed-forward network is applied to each position independently.
  • Nx: The encoder block (Multi-Head Attention, Add & Norm, Feed Forward, Add & Norm) is repeated N times.
  • Output of the Encoder: The final output of the encoder is a set of encoded representations of the input sequence. These representations capture the contextual information of each word. This is what is passed to the decoder.
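
The positional encoding mentioned above can take several forms; the original paper uses fixed sinusoids. Here is a minimal NumPy sketch of that sinusoidal variant (the function name and toy dimensions are illustrative choices, not anything defined in this post):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]                    # (seq_len, 1)
    dims = np.arange(d_model)[None, :]                         # (1, d_model)
    # Each pair of dimensions shares one frequency: 1 / 10000^(2i / d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                           # (seq_len, d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])                # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])                # odd dimensions: cosine
    return encoding

# The encoding is simply added to the token embeddings before the first encoder block:
# x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```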

Decoder (the right half):

  • Output Embedding: Similar to the input, the previous output words (or tokens) are converted into embeddings. Note the “shifted right” label. This means that when predicting the nth word, the decoder only has access to the words from position 1 to n-1.
  • Positional Encoding: Again, positional encodings are added.
  • Masked Multi-Head Attention: This is crucial. It’s similar to the encoder’s multi-head attention, but with a mask. The mask prevents the decoder from “looking ahead” at future words in the output sequence during training. This ensures that the prediction for each position only depends on the preceding positions (see the mask sketch after this list).
  • Multi-Head Attention: This attention layer attends to the output of the encoder, allowing the decoder to use the encoded input information.
  • Add & Norm: Same residual connections and layer normalization as in the encoder.
  • Feed Forward: Same feed-forward network as in the encoder.
  • Nx: The decoder block is also repeated N times.
  • Linear and Softmax: The final output of the decoder is passed through a linear layer and then a softmax function to produce probabilities over the vocabulary. This gives the probability distribution over possible next words.
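
To make the “mask” in masked multi-head attention concrete, here is a minimal sketch of a causal mask and of how it is applied before the softmax; the helper names are mine, and real implementations fold this into the attention computation itself:

```python
import numpy as np

def causal_mask(seq_len):
    """True where attention is forbidden: position i may only see positions <= i."""
    return np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

def masked_softmax(scores, mask):
    """Push masked (future) positions to ~zero weight before normalizing."""
    scores = np.where(mask, -1e9, scores)
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# For a 4-token output sequence, row i of the weights has non-zero
# entries only in columns 0..i, i.e. no "looking ahead".
weights = masked_softmax(np.random.randn(4, 4), causal_mask(4))
```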

Why Outputs Look Like Inputs

The outputs at various stages (especially before the final linear and softmax layers in the decoder) are indeed vector representations (embeddings) just like the inputs. They are transformed and refined versions of the initial embeddings, capturing different levels of contextual information.

  • The encoder outputs are contextualized representations of the input words.
  • The decoder’s intermediate outputs are contextualized representations of the output sequence generated so far, informed by the encoded input.

It’s important to understand that these aren’t just copies of the inputs. They’ve been processed through the attention mechanisms and feed-forward networks, which means they contain much richer information than the original embeddings. The final linear and softmax layers in the decoder then map these rich internal representations to the probability distribution over the vocabulary to generate the next word.
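
A quick way to convince yourself of this is to trace shapes through a layer. The sketch below uses PyTorch’s built-in encoder layer purely as a shape check (it assumes a reasonably recent PyTorch with the batch_first flag; the sizes are arbitrary): every sub-layer maps a (batch, seq_len, d_model) tensor to another tensor of the same shape, and only the final linear projection changes the last dimension to the vocabulary size.

```python
import torch
import torch.nn as nn

d_model, vocab_size, seq_len = 512, 32000, 5        # arbitrary illustrative sizes

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
proj = nn.Linear(d_model, vocab_size)               # the "Linear" before the softmax

x = torch.randn(1, seq_len, d_model)                # embeddings + positional encodings
h = layer(x)                                        # (1, 5, 512): same shape as the input
logits = proj(h)                                    # (1, 5, 32000): now over the vocabulary
print(x.shape, h.shape, logits.shape)
```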

Example: English to Chinese Translation

Here’s an example of translating English to Chinese with a Transformer model:

English Input: “I love to eat dumplings.”

Chinese Output: “我喜欢吃饺子。”

Process

Input Embedding:

  • The English words are converted into vector representations (embeddings).
  • “I” becomes a vector: [0.2, 0.5, -0.1] (simplified example)
  • “love” becomes a vector: [0.7, -0.2, 0.3]
  • “to” becomes a vector: [0.1, 0.9, 0.4]
  • “eat” becomes a vector: [0.5, -0.3, 0.2]
  • “dumplings” becomes a vector: [0.8, 0.1, -0.2]
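
In code, this lookup is just indexing into a learned embedding table. A toy sketch using the made-up 3-dimensional vectors above (real models use hundreds or thousands of dimensions and operate on subword tokens rather than whole words):

```python
import numpy as np

# Hypothetical toy vocabulary and embedding table, mirroring the vectors above.
vocab = {"I": 0, "love": 1, "to": 2, "eat": 3, "dumplings": 4}
embedding_table = np.array([
    [0.2, 0.5, -0.1],   # "I"
    [0.7, -0.2, 0.3],   # "love"
    [0.1, 0.9, 0.4],    # "to"
    [0.5, -0.3, 0.2],   # "eat"
    [0.8, 0.1, -0.2],   # "dumplings"
])

tokens = "I love to eat dumplings".split()
embedded = embedding_table[[vocab[t] for t in tokens]]   # shape (5, 3)
```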

Positional Encoding:

  • These vectors are added to positional encoding vectors that represent their position in the sentence. This gives the model information about word order.

Multi-Head Attention (Encoder):

  • The model looks at the relationships between the English words. For example, it might learn that “love” is related to “eat” (the action being loved) and “dumplings” (the object of the eating).
  • The output of the attention mechanism is a new set of vectors that incorporate this information.
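
The transformation behind this step is scaled dot-product attention. A minimal single-head sketch, reusing the `embedded` matrix from the lookup example above (positional encodings omitted, and the projection matrices are random stand-ins for learned weights):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # weighted mix of value vectors

X = embedded                                              # (5, 3) from the lookup sketch
Wq, Wk, Wv = (np.random.randn(3, 3) for _ in range(3))    # stand-ins for learned projections
contextualized = scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv)   # still (5, 3)
```

Multi-head attention runs several such heads in parallel, each with its own projections, and concatenates their outputs (followed by one more linear projection).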

Add & Norm and Feed Forward (Encoder):

  • These layers further process the vectors, adding residual connections and applying non-linear transformations.

Encoder Output:

  • The encoder outputs five vectors, one for each English word, that now contain rich contextual information.
  • These are passed to the decoder.

Output Embedding (Decoder):

  • The decoder starts with a special “start of sequence” token, which is also converted to an embedding.

Masked Multi-Head Attention (Decoder):

  • Since this is the first Chinese word being generated, the only previous token to attend to is the start-of-sequence token; the mask guarantees that no position can ever see tokens that come after it.

Multi-Head Attention (Decoder-Encoder Attention):

  • Here, the decoder attends to the encoder outputs. It uses the contextual information from the English sentence to help generate the Chinese translation. For example, it might learn that the encoded vector for “love” is related to the concept of “喜欢” in Chinese.
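
The only structural difference from the self-attention sketched above is where Q, K, and V come from: the queries are built from the decoder’s current state, while the keys and values are built from the encoder output. Reusing the helper and the placeholder weights from the earlier sketch:

```python
decoder_state = np.random.randn(1, 3)      # e.g. the processed start-of-sequence embedding
encoder_output = contextualized            # (5, 3) encoder vectors from the sketch above

Q = decoder_state @ Wq                     # queries come from the decoder
K = encoder_output @ Wk                    # keys and values come from the encoder
V = encoder_output @ Wv
cross = scaled_dot_product_attention(Q, K, V)   # (1, 3): a decoder vector informed by the English sentence
```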

Add & Norm and Feed Forward (Decoder):

  • These layers process the information.

Linear and Softmax (Decoder):

  • This is the final step. The vector is passed through a linear layer and a softmax function. The softmax outputs a probability distribution over the Chinese vocabulary.
  • The decoder selects the Chinese word with the highest probability, which is “我”.

Next Decoder Steps:

  • Now, the decoder takes “我” as input for the next step.
  • The masked multi-head attention now attends to “我”.
  • The multi-head attention attends to the encoder outputs again.
  • This process repeats, generating “喜欢” (like), “吃” (eat), and “饺子” (dumplings), until a special “end of sequence” token is generated.
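
Putting these steps together, generation is typically a greedy (or beam-search) loop: feed in the tokens produced so far, pick the most probable next token, append it, and stop at the end-of-sequence token. A high-level sketch, where `decoder_step` is a hypothetical stand-in for the whole decoder stack (masked self-attention, cross-attention, feed-forward, linear + softmax):

```python
import numpy as np

def greedy_decode(encoder_output, decoder_step, bos_id, eos_id, max_len=50):
    """decoder_step(tokens, encoder_output) is assumed to return a probability
    distribution over the vocabulary for the next token; it is not defined here."""
    tokens = [bos_id]                                     # start-of-sequence token
    for _ in range(max_len):
        probs = decoder_step(tokens, encoder_output)      # distribution over the next token
        next_id = int(np.argmax(probs))                   # greedy: take the most probable word
        tokens.append(next_id)
        if next_id == eos_id:                             # stop at end-of-sequence
            break
    return tokens[1:]                                     # drop the start token
```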

Key Points:

  • The Transformer model uses attention mechanisms to capture relationships between words in both the source and target languages.
  • The encoder processes the input English sentence and produces a set of context-rich vectors.
  • The decoder uses these vectors to generate the output Chinese sentence, one word at a time.
  • The masked multi-head attention in the decoder prevents it from “cheating” by looking ahead at future words in the output sequence.

Decoder-only Model

While the traditional Transformer architecture for machine translation uses both an encoder and a decoder, it is possible to use a decoder-only model for this and other tasks, though with some key differences in how it’s approached.

How Decoder-Only Models Work for Translation

Decoder-only models, like GPT, are primarily designed for text generation. To adapt them for translation, you essentially frame the translation task as a text generation problem where the input is part of the context. Here’s how it generally works:

  1. Concatenate Source and Target: You combine the source language sentence (e.g., English) and the beginning of the target language sentence (e.g., a “start of sequence” token or a few initial words) into a single sequence.
  2. Special Separator Token: A special token is used to separate the source and target parts of the sequence. This helps the model distinguish between the input and the expected output.
  3. Treat as Language Modeling: The model is then trained to predict the next token in the sequence, just like in regular language modeling. However, because the source sentence is included in the context, the model learns to generate the target language translation.

Example:

For our example of translating “I love to eat dumplings.” to “我喜欢吃饺子。”, the input to the decoder-only model would look something like this:

"I love to eat dumplings. <SEP> 我"

The model would then be trained to predict the next token, which would ideally be “喜欢”. This process continues until the model generates the complete translation “我喜欢吃饺子。” followed by an “end of sequence” token.
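
A small sketch of this framing, at the word level for readability (real models operate on subword tokens, and the `<SEP>`/`<EOS>` strings are conventions assumed here, not a specific model’s vocabulary). During training the model predicts each next token from everything before it, and the loss is commonly computed only on the target-side tokens:

```python
SEP, EOS = "<SEP>", "<EOS>"

source_tokens = ["I", "love", "to", "eat", "dumplings", "."]
target_tokens = ["我", "喜欢", "吃", "饺子", "。"]

# One concatenated sequence: translation becomes ordinary next-token prediction.
sequence = source_tokens + [SEP] + target_tokens + [EOS]

for i in range(len(sequence) - 1):
    context, next_token = sequence[: i + 1], sequence[i + 1]
    print(context, "->", next_token)
```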

Key Differences and Considerations:

  • Training Data: Decoder-only models for translation are often pre-trained on massive amounts of text in multiple languages. This allows them to learn general language patterns and cross-lingual relationships. They are then fine-tuned on parallel corpora (pairs of sentences in different languages) to improve their translation performance.
  • No Explicit Encoder: The key difference is the absence of a separate encoder. The decoder itself is responsible for both understanding the source sentence and generating the target sentence.
  • Attention Mechanism: The decoder uses masked self-attention, which is crucial for preventing it from “peeking” at future target words during training.
  • Performance: While decoder-only models have shown impressive results in various language tasks, including translation, they might require more data and computational resources compared to traditional encoder-decoder models, especially for long sentences or complex language pairs.

Takeaways

The Transformer architecture, with its powerful attention mechanism, has revolutionized modern AI. By understanding its key components — input embeddings, positional encodings, multi-head attention, and the encoder-decoder structure — we can appreciate its elegance and effectiveness. This architecture continues to be the foundation for many state-of-the-art language models, driving progress in various AI tasks.
