Notes on "Attention Is All You Need" — reading the original transformer paper properly

Paper: Attention Is All You Need (Vaswani et al., 2017)
Link: https://arxiv.org/abs/1706.03762
Difficulty: Intermediate (assumes familiarity with sequence models and basic linear algebra)

One-sentence summary

The transformer replaces recurrence entirely with self-attention, enabling parallel processing of sequences and better capture of long-range dependencies.

The architectural decisions worth understanding

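Why no recurrence? RNNs process tokens sequentially: the hidden state at position N cannot be computed until the hidden state at position N-1 is done. This prevents parallelisation across positions within a sequence, and it means information linking distant tokens has to survive many recurrent steps. Self-attention instead computes the relationships between all token pairs in a single matrix product, so the path between any two positions has constant length. This is why transformers train faster on modern hardware, at the cost of compute and memory that grow quadratically with sequence length.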
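
To make the contrast concrete, here is a minimal NumPy sketch. It is not the paper's code: the weight matrices W_h and W_x are hypothetical, and the attention half skips the learned query/key projections and uses the raw embeddings directly, purely to show the shape of the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 5, 4
x = rng.standard_normal((seq_len, d))      # toy token embeddings

# Recurrent style: each hidden state depends on the previous one,
# so the loop over positions has to run in order.
W_h = rng.standard_normal((d, d)) * 0.1    # hypothetical recurrent weights
W_x = rng.standard_normal((d, d)) * 0.1    # hypothetical input weights
h = np.zeros(d)
states = []
for t in range(seq_len):
    h = np.tanh(h @ W_h + x[t] @ W_x)      # needs h from step t-1
    states.append(h)
hidden_states = np.stack(states)           # shape (seq_len, d)

# Attention style: every pairwise score comes out of one matrix product,
# with no dependency between positions.
scores = x @ x.T / np.sqrt(d)              # shape (seq_len, seq_len)
```

The loop body cannot start step t until step t-1 has finished, whereas the single matrix product is exactly the kind of operation GPUs parallelise well.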

Encoder-decoder structure. The original transformer was designed for translation. The encoder processes the source sentence; the decoder generates the target. Modern LLMs often use only the decoder stack (GPT family) or only the encoder (BERT family). Knowing this helps you read architecture papers that assume familiarity with the original design.

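Positional encodings. Self-attention has no inherent sense of order: permuting the input tokens simply permutes the outputs in the same way, so a pure attention layer gives each word in "the cat sat" exactly the same representation it gets in "sat the cat". Sinusoidal positional encodings add position-specific signals to each token embedding. The formula uses sin on the even dimensions and cos on the odd dimensions, at wavelengths that increase geometrically across dimensions. For any fixed offset k, the encoding at position pos+k is then a linear function of the encoding at position pos, which is what lets the model express relative positions algebraically.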
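
A minimal NumPy sketch of the sinusoidal encoding, following the paper's definition PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). The function name and the assumption that d_model is even are mine.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Return a (max_len, d_model) array of encodings; assumes d_model is even."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # (1, d_model / 2), the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
```

The result is added elementwise to the token embeddings, so it has to share their width d_model.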

The notation that tripped me up

The scaled dot-product attention formula is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

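The softmax is applied row by row, and the final product with V is a matrix product, not an elementwise one. The sqrt(d_k) scaling term exists because for large d_k the dot products grow in magnitude: if the components of a query and a key are independent with mean 0 and variance 1, their dot product has mean 0 and variance d_k. Large logits push softmax into saturated regions with very small gradients, and dividing by sqrt(d_k) brings the variance back to 1. The paper mentions this only briefly, but it is worth pausing on because you will see this formula everywhere.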
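
The formula translates almost line for line into NumPy. This is a sketch for a single sequence with no batch or head dimension; the function names and the toy shapes are my own choices.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Subtract the row maximum first for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n_q, n_k) scaled similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # output has shape (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((5, 8))
V = rng.standard_normal((5, 6))
output, weights = scaled_dot_product_attention(Q, K, V)
```

Each output row is a weighted average of the rows of V, with the weights given by how strongly that query matches each key.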

Open questions from this reading

Why sinusoidal specifically? The paper argues it allows generalisation to sequence lengths longer than those seen in training. How well does this actually hold?
How does the decoder's masked self-attention prevent attending to future tokens at the implementation level? Worth implementing once to see clearly.
The paper uses dropout extensively. Where exactly and why those positions?
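
On the masking question, the paper describes it as setting the softmax inputs for illegal connections to negative infinity. A minimal single-head NumPy sketch of that idea follows; the function name and the use of np.triu to build the mask are my assumptions, not the paper's code.

```python
import numpy as np

def causal_self_attention(Q, K, V):
    """Q, K: (seq_len, d_k), V: (seq_len, d_v). Position i sees only positions <= i."""
    seq_len, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal marks "future" positions.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so every row has a finite maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)             # exp(-inf) == 0 for masked entries
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, w = causal_self_attention(X, X, X)
```

Because the masked logits become exact zeros after the softmax, the first position can only attend to itself and its output is simply its own value vector.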

What I want to do next

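Implement the attention mechanism from scratch in NumPy, not PyTorch, so I cannot rely on autograd to paper over gaps in my understanding. Then compare my implementation against the torch.nn.MultiheadAttention output on a small example. Since that module also applies learned input and output projections, the comparison only makes sense if my implementation uses the same weights.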
