Introduction: Why Transformers Matter
Every time you ask ChatGPT a question, ask Claude to write code, or use Google Translate to convert a sentence, you are relying on a single architectural innovation: the transformer. Introduced in 2017 by researchers at Google, the transformer architecture did not merely improve on existing approaches to natural language processing. It replaced them entirely. Within just a few years, transformers became the foundation for virtually every state-of-the-art language model, from BERT and GPT to Claude, Gemini, and Llama.
Understanding transformers is no longer optional for anyone working in AI, machine learning, or software engineering. Whether you are fine-tuning a model, building retrieval-augmented generation (RAG) pipelines, or simply trying to make sense of the rapid advances in generative AI, you need to understand how transformers work under the hood. This article provides a clear, thorough explanation of the transformer architecture, starting from the problems it solved, moving through its core mechanisms, and ending with why it scales so remarkably well.
Before Transformers: The RNN Problem
Before transformers arrived, the dominant architectures for processing sequential data like text were recurrent neural networks (RNNs) and their improved variants, Long Short-Term Memory networks (LSTMs) and Gated Recurrent Units (GRUs). These architectures processed text one token at a time, maintaining a hidden state that was updated sequentially as each new word was fed into the network.
This sequential processing created two fundamental problems. First, it was inherently slow. Because each step depended on the output of the previous step, RNNs could not take advantage of modern parallel hardware like GPUs and TPUs. Training on large datasets took an impractical amount of time. Second, RNNs struggled with long-range dependencies. Even with gating mechanisms, the hidden state acted as an information bottleneck. By the time the network reached the end of a long paragraph, critical information from the beginning had often been diluted or lost entirely. A sentence like "The cat, which had been sitting on the mat in the corner of the room near the old wooden bookshelf, suddenly jumped" requires the model to connect "cat" with "jumped" across many intervening tokens. RNNs were unreliable at this.
Researchers tried adding attention mechanisms on top of RNNs, particularly in sequence-to-sequence models for machine translation. These hybrid approaches showed promising results, but they still relied on the sequential bottleneck of recurrence. The question became: what if we could build a model based entirely on attention, without any recurrence at all?
The Key Insight: Attention
The core idea behind the transformer is self-attention, which the architecture implements as scaled dot-product attention. Instead of processing tokens one at a time in sequence, self-attention allows the model to look at all tokens in a sequence simultaneously and compute how much each token should "attend to" every other token.
Consider the sentence: "The bank by the river was steep." When processing the word "bank," the model needs context to determine whether it refers to a financial institution or a riverbank. With self-attention, the model computes a relevance score between "bank" and every other word in the sentence. It discovers that "river" and "steep" have high relevance scores, which helps it correctly interpret "bank" as a riverbank. This happens in a single computational step, not through a chain of sequential updates.
Queries, Keys, and Values
Self-attention works through three learned linear projections applied to each input token's embedding. Every token is projected into three vectors: a Query (Q), a Key (K), and a Value (V). Think of it like a search engine: the Query is what you are looking for, the Key is the label on each document, and the Value is the content of that document. The attention score between two tokens is computed by taking the dot product of the Query of one token with the Key of another, then scaling and applying a softmax to get a probability distribution. The final output for each token is a weighted sum of all Value vectors, where the weights are the attention scores.
Mathematically, the attention function is expressed as: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V, where d_k is the dimension of the key vectors. The scaling factor prevents the dot products from growing too large, which would push the softmax into regions with extremely small gradients and slow down training.

Inside the Transformer Architecture
The original transformer, as described in the landmark "Attention Is All You Need" paper by Vaswani et al., consists of two main components: an encoder and a decoder. Each is a stack of identical layers, with the original paper using six layers for each.
The Encoder
Each encoder layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a position-wise feed-forward neural network. Both sub-layers are wrapped with a residual connection and layer normalization. The residual connections are critical because they allow gradients to flow directly through the network during backpropagation, making it possible to train very deep models without gradient degradation.
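To make the wiring concrete, here is a minimal single-head sketch of one encoder layer in NumPy. It illustrates the residual-plus-layer-norm pattern rather than a faithful implementation: real encoder layers use multi-head attention, dropout, learned layer-norm gain and bias parameters, and bias terms, all of which are omitted here, and the helper names (layer_norm, encoder_layer) are chosen only for this example.

import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(x, Wq, Wk, Wv, W1, W2):
    # Sub-layer 1: single-head self-attention, then residual connection + layer norm
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    x = layer_norm(x + attn)
    # Sub-layer 2: position-wise feed-forward network (ReLU, as in the original paper),
    # again followed by a residual connection + layer norm
    ffn = np.maximum(0, x @ W1) @ W2
    return layer_norm(x + ffn)

# Toy dimensions: 4 tokens, model dimension 8, feed-forward dimension 16
rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 16
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
print(encoder_layer(x, Wq, Wk, Wv, W1, W2).shape)  # (4, 8): same shape in, same shape out

Notice that the output has the same shape as the input, which is exactly what allows these layers to be stacked six, twelve, or ninety-six deep.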
The encoder's job is to build a rich contextual representation of the input sequence. Each token's representation is refined layer by layer, incorporating information from all other tokens in the sequence. By the final encoder layer, each token's vector representation captures not just the token's own meaning but its meaning in context.
The Decoder
The decoder is similar to the encoder but adds a third sub-layer: cross-attention. In cross-attention, the decoder's queries attend to the encoder's output keys and values. This is how the decoder accesses the information encoded from the input sequence. The decoder also uses masked self-attention in its first sub-layer. The mask prevents each position from attending to subsequent positions, ensuring that predictions for position i can only depend on known outputs at positions less than i. This preserves the autoregressive property necessary for text generation.
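The masking itself is simple to illustrate. In the sketch below (random score values, not a real model), positions above the diagonal of the score matrix are set to a large negative number before the softmax, so their attention weights collapse to approximately zero:

import numpy as np

seq_len = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(seq_len, seq_len))                     # raw query-key scores
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)   # True above the diagonal
scores = np.where(future, -1e9, scores)                          # block attention to future positions

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # upper triangle is ~0: each token sees only itself and earlier tokens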
Multi-Head Attention
Rather than performing a single attention function, transformers use multi-head attention. The model runs several attention operations in parallel, each with different learned projection matrices. This allows the model to attend to information from different representation subspaces at different positions simultaneously. One attention head might learn to track syntactic relationships (subject-verb agreement), while another captures semantic relationships (synonyms and antonyms), and yet another handles positional patterns. The outputs of all heads are concatenated and linearly projected to produce the final result. The original paper used 8 attention heads, while modern large models use 32, 64, or even 128.
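The split-compute-concatenate pattern can be sketched compactly in NumPy. The version below assumes one shared projection matrix each for Q, K, and V whose output is reshaped into heads, a common implementation equivalent to the per-head projections described in the paper; the function and variable names are chosen only for this example.

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Split the model dimension into num_heads smaller heads
    def split(m):
        return m.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)  # (heads, seq, d_head)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    # Each head runs scaled dot-product attention independently
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)                # (heads, seq, seq)
    heads = softmax(scores) @ Vh                                         # (heads, seq, d_head)
    # Concatenate the heads and apply the final output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(1)
seq_len, d_model, num_heads = 4, 8, 2
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, num_heads).shape)  # (4, 8)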
Positional Encoding
Because the transformer processes all tokens in parallel rather than sequentially, it has no inherent notion of word order. To address this, the original architecture adds positional encodings to the input embeddings. The original paper used sinusoidal functions of different frequencies, allowing the model to learn to attend to relative positions. Modern transformers often use learned positional embeddings or more advanced schemes like Rotary Position Embeddings (RoPE), which encode relative position information directly into the attention computation and allow models to generalize to sequence lengths longer than those seen during training.
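The sinusoidal scheme from the original paper is straightforward to reproduce. The sketch below builds the encoding matrix that is added element-wise to the token embeddings: even-indexed dimensions use sine and odd-indexed dimensions use cosine, each at a different frequency.

import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # (1, d_model / 2)
    angles = positions / np.power(10000, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16): one encoding vector per position, added to the embeddings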
A Simple Attention Calculation in Python
The following code demonstrates a basic scaled dot-product attention computation using NumPy. This is a simplified version of what happens inside each attention head of a transformer.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.

    Args:
        Q: Query matrix of shape (seq_len, d_k)
        K: Key matrix of shape (seq_len, d_k)
        V: Value matrix of shape (seq_len, d_v)

    Returns:
        A tuple of the attention output, shape (seq_len, d_v),
        and the attention weights, shape (seq_len, seq_len).
    """
    d_k = K.shape[-1]
    # Step 1: Compute dot products between queries and keys
    scores = Q @ K.T  # shape: (seq_len, seq_len)
    # Step 2: Scale by sqrt of key dimension
    scores = scores / np.sqrt(d_k)
    # Step 3: Apply softmax to get attention weights
    exp_scores = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    attention_weights = exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)
    # Step 4: Multiply weights by values
    output = attention_weights @ V  # shape: (seq_len, d_v)
    return output, attention_weights

# Example: 4 tokens, embedding dimension of 8
np.random.seed(42)
seq_len, d_k, d_v = 4, 8, 8

# Simulated Q, K, V matrices (in practice, these are learned projections)
Q = np.random.randn(seq_len, d_k)
K = np.random.randn(seq_len, d_k)
V = np.random.randn(seq_len, d_v)

output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights (each row sums to 1):")
print(np.round(weights, 3))
print("\nOutput shape:", output.shape)

In this code, each row of the attention weights matrix shows how much a given token attends to every other token. The output for each token is a weighted combination of all value vectors, where higher attention weights mean greater influence. In a real transformer, the Q, K, and V matrices are produced by learned linear layers applied to the input embeddings, and this operation runs across multiple heads simultaneously.
From BERT to GPT: Encoder vs Decoder Models
The original transformer used both an encoder and a decoder, but subsequent research discovered that you could build powerful models using only one or the other. This insight led to two major families of transformer models that dominate modern NLP.
Encoder-Only Models: BERT and Its Descendants
BERT (Bidirectional Encoder Representations from Transformers), released by Google in 2018, uses only the encoder stack. Because there is no autoregressive constraint, BERT can attend to tokens in both directions simultaneously, which makes it exceptionally good at understanding tasks. BERT and its variants (RoBERTa, ALBERT, DeBERTa) excel at classification, named entity recognition, question answering, and semantic similarity. You can explore these models through the Hugging Face Transformers library, which provides pre-trained weights and easy-to-use APIs for hundreds of encoder models.
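As a quick illustration, assuming the transformers package and a backend such as PyTorch are installed, a pre-trained BERT checkpoint can fill in a masked token in a few lines; the model choice and prompt here are only for demonstration.

# Requires: pip install transformers (plus a backend such as PyTorch)
from transformers import pipeline

# bert-base-uncased is the original pre-trained BERT checkpoint
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))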
Decoder-Only Models: The GPT Family and Beyond
GPT (Generative Pre-trained Transformer) from OpenAI uses only the decoder stack with masked self-attention. It processes text left-to-right, predicting the next token at each step. This autoregressive approach turns out to be extraordinarily powerful for generative tasks. GPT-2 showed that scaling up decoder-only models produced surprisingly coherent text. GPT-3 demonstrated that at sufficient scale, these models could perform tasks they were never explicitly trained for, a phenomenon known as in-context learning. GPT-4, Claude, Gemini, and Llama all follow this decoder-only paradigm. The simplicity of the "predict the next token" objective, combined with massive scale, proved to be a remarkably effective recipe for building general-purpose AI systems.
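The same library exposes decoder-only models through its text-generation pipeline. The sketch below uses the small, openly available GPT-2 checkpoint purely to illustrate the autoregressive, next-token loop; the prompt and generation settings are arbitrary.

# Requires: pip install transformers (plus a backend such as PyTorch)
from transformers import pipeline

# GPT-2 is tiny by today's standards, but it uses the same decoder-only, next-token design
generator = pipeline("text-generation", model="gpt2")
result = generator("The transformer architecture works by", max_new_tokens=40)
print(result[0]["generated_text"])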
Encoder-Decoder Models: T5 and Beyond
Some models preserve the full encoder-decoder structure. Google's T5 (Text-to-Text Transfer Transformer) frames every NLP task as a text-to-text problem: classification becomes generating a label string, translation becomes generating the translated text, and summarization becomes generating a shorter version. This unified approach is elegant, though decoder-only models have generally won out in the race toward the most capable general-purpose systems. For a deeper visual understanding of how these architectures differ, The Illustrated Transformer by Jay Alammar is one of the best visual guides available.
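The text-to-text framing is visible directly in how T5 is prompted: the task is written as a prefix in the input string. A minimal sketch, again assuming the transformers package is installed:

# Requires: pip install transformers (plus a backend such as PyTorch)
from transformers import pipeline

# T5 reads the task from a prefix embedded in the input text itself
t5 = pipeline("text2text-generation", model="t5-small")
print(t5("translate English to German: The book is on the table.")[0]["generated_text"])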

Why Transformers Scale
One of the most consequential properties of the transformer architecture is how well it scales. Unlike RNNs, transformers are inherently parallelizable. Because self-attention computes relationships between all token pairs simultaneously, the entire computation can be distributed across thousands of GPU or TPU cores. This parallelism means that doubling your hardware roughly halves your training time, a property that RNNs fundamentally lack.
Scaling Laws
Research from OpenAI and DeepMind has revealed predictable scaling laws for transformer models. Performance improves as a smooth power law function of three variables: the number of parameters in the model, the amount of training data, and the compute budget used for training. This predictability is remarkable. It means researchers can estimate the performance of a larger model before spending millions of dollars training it. These scaling laws have driven the trend toward ever-larger models, from GPT-2's 1.5 billion parameters to GPT-4's rumored trillion-plus parameters.
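The functional form is simple enough to sketch. The toy example below follows the parameters-plus-data power-law shape reported in the scaling-law literature, but every constant in it is a placeholder chosen for illustration, not a published fit.

def scaling_law_loss(N, D, E=1.8, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Toy loss curve: an irreducible term plus power-law terms in parameters N and tokens D.
    All constants here are illustrative placeholders, not fitted values."""
    return E + A / N ** alpha + B / D ** beta

for N in (1e9, 1e10, 1e11):  # 1B, 10B, 100B parameters at a fixed data budget
    print(f"N={N:.0e}: predicted loss ~ {scaling_law_loss(N, D=1e12):.3f}")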
The Context Window Challenge
The primary computational limitation of standard self-attention is its quadratic complexity with respect to sequence length. If you double the input length, the memory and compute requirements quadruple. This is because every token must compute an attention score with every other token. Significant research has gone into addressing this limitation. Techniques like FlashAttention optimize the memory access patterns, while approaches like sliding window attention, sparse attention, and linear attention reduce the theoretical complexity. Modern models like Claude and GPT-4 support context windows of 100,000 tokens or more, a feat made possible by these optimizations combined with hardware advances. For an excellent visual explanation of how attention computation works at a deeper level, see 3Blue1Brown's attention explanation on YouTube.
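To see why restricting attention helps, consider the sketch below: full causal attention fills the entire lower triangle of the score matrix, on the order of seq_len squared entries, while a sliding window of width w keeps only about seq_len times w of them. The mask-building function is purely illustrative, not code from any particular library.

import numpy as np

def sliding_window_mask(seq_len, window):
    # Each token may attend only to itself and the previous (window - 1) tokens
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        mask[i, max(0, i - window + 1):i + 1] = True
    return mask

print(sliding_window_mask(seq_len=6, window=3).astype(int))
# Full causal attention would fill the whole lower triangle (~seq_len^2 / 2 scores);
# the banded mask keeps only ~seq_len * window of them.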
The Paper That Started It All
The transformer architecture was introduced in "Attention Is All You Need" by Vaswani et al. in 2017, written by researchers at Google Brain and Google Research. The paper proposed eliminating recurrence and convolutions entirely, relying solely on attention mechanisms. It achieved state-of-the-art results on English-to-German and English-to-French machine translation while being significantly faster to train than previous approaches. As of 2026, it has been cited over 150,000 times and remains one of the most influential papers in the history of artificial intelligence.
Related Reading
Continue learning with these related articles:
- Our beginner guide to neural networks
- Fine-tune a transformer model with LoRA
- RAG vs fine-tuning: when to use which
Key Takeaways
- Transformers replaced RNNs and LSTMs by eliminating sequential processing, enabling massive parallelization and solving the long-range dependency problem through self-attention.
- Self-attention (scaled dot-product attention) is the core mechanism. Each token computes relevance scores with every other token using Query, Key, and Value projections.
- Multi-head attention allows the model to capture different types of relationships simultaneously, from syntactic structure to semantic meaning.
- Encoder-only models (BERT) excel at understanding tasks. Decoder-only models (GPT, Claude) dominate generative tasks. Encoder-decoder models (T5) handle sequence-to-sequence tasks.
- Transformers scale predictably. Increasing model size, data, and compute yields consistent performance improvements following power-law scaling laws.
- The quadratic complexity of attention with respect to sequence length remains a key challenge, but innovations like FlashAttention and sparse attention are steadily pushing context windows to hundreds of thousands of tokens.
- Understanding transformers is essential for anyone working with LLMs, whether you are fine-tuning models, building applications, or evaluating AI capabilities.
The transformer architecture is one of those rare innovations that reshapes an entire field. From its origins as a machine translation model to its current status as the engine behind the most capable AI systems ever built, the transformer has proven that a single, elegant idea (letting every element in a sequence attend to every other element) can unlock capabilities that were previously thought to be decades away. As models continue to grow and new architectural refinements emerge, the core principles of attention and parallelism will remain the foundation on which the future of AI is built.



