People have tweaked the Transformer architecture enough in 7 years that we’re apparently now calling the current recipe “Transformer++”.
The changes between Vaswani et al. (2017) and Radford et al. (2019) are well-known (see Figure 3 here): removing the encoder, clipping gradients, not using dropout, and swapping the activation function for a smoother, input-weighted one (ReLU -> GeLU). So I focus on post-GPT-2 tweaks.
I also focus on architecture changes rather than data changes (curation or generation), training changes (infrastructure and hyperparameters; e.g. the various kinds of data parallelism and model parallelism across multiple devices arose largely after 2019), post-training, or inference optimisations. I won’t get into multimodal architectures.
Inclusion criterion: at least three of the strong open-source architectures from 2024 (LLaMA, Gemma, Qwen2.5, DeepSeek-V2, Hunyuan) use the tweak.
(The term “Transformer++” was coined in this sense by Gu and Dao 2023, who also noted several of the main tweaks.)
The Transformer++
Let the “Transformer++” be a Transformer with
- A fused attention implementation (the scaled dot-product backend -> FlashAttention). Linear rather than quadratic memory in input sequence length. Practically: it can double GPU utilization and so halve training time; it also enables longer contexts and speeds up inference on long-context input.
- Rotary position embedding (sinusoidal -> learned APE -> RoPE)
- Removing attention’s redundant key heads and value heads (vanilla MHA -> MQA -> GQA)
- Regularized / preconditioned optimizer (Adam -> AdamW -> SOAP)
- Normalise before each layer (post LayerNorm -> pre LayerNorm)
- When doing layer normalization: just rescale, don’t centre (LayerNorm -> RMSNorm)
- Divine activation function for the MLP (GeLU -> … -> SwiGLU or GeGLU)
- Tied embeddings. An oldie but goodie.
- Fix logit drift (query/key normalization)
- Fixing that one softmax off-by-one (fixed in some places around 2021)
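Two of the tweaks above (RMSNorm and SwiGLU) are simple enough to sketch directly. A minimal numpy sketch, with gains, biases and initialisation omitted; the shapes are illustrative, not any model’s actual configuration:

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # Classic LayerNorm: centre and rescale along the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rmsnorm(x, eps=1e-6):
    # RMSNorm (Zhang & Sennrich 2019): rescale only, no centring.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

def swiglu_mlp(x, W, V, W2):
    # SwiGLU MLP: out = (swish(xW) * xV) W2, swish(z) = z * sigmoid(z).
    # Hence the "two tensors": a gate path (W) and a value path (V).
    a = x @ W
    gate = a * (1.0 / (1.0 + np.exp(-a)))  # SiLU / Swish-1
    return (gate * (x @ V)) @ W2
```

RMSNorm drops the mean-subtraction and the learned bias, so it is slightly cheaper per layer while performing about as well, which is why every model in the table below uses it.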
Less well-established tweaks
- Sparsification. I could mention the turn to sparse Mixtures of Experts, but this was more a cost-saving measure than an absolute-performance gain. LLaMA is still dense though.
- BPE -> Tiktoken / SentencePiece. Basically the same, but about 25% better compression.
- Quantization. FP16 to BF16 to int8.
- No bias on the QKV projection or the layernorm. (Though some architectures, e.g. Qwen2.5, put the biases back into attention, so I’m not sure about this one.)
- Sliding window attention, e.g. with a rolling buffer cache
- Cross-Layer Attention shrinks the KV cache
- WARP
- Regularizing outputs (“soft-capping logits”)
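Two of these are easy to make concrete: tanh soft-capping of logits (Gemma-2-style; the cap of 30 below is illustrative) and a causal sliding-window attention mask. A hedged numpy sketch:

```python
import numpy as np

def soft_cap(logits, cap=30.0):
    # Squash logits smoothly into (-cap, cap): cap * tanh(logits / cap).
    # Near zero it is approximately the identity; extremes are bounded.
    return cap * np.tanh(logits / cap)

def sliding_window_mask(seq_len, window):
    # Boolean mask where position i attends only to the last `window`
    # positions up to and including itself (causal + local).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

Soft-capping bounds the attention or output logits without the hard clipping that would kill gradients; the sliding window bounds the KV cache to `window` entries per layer, which is what makes a rolling buffer cache possible.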
Occurrence in top open architectures
| Component | Tweak | LLaMA 3 | Gemma 2 | Qwen2.5 | DeepSeek-V2 | Hunyuan-Large |
|---|---|---|---|---|---|---|
| Attention | Attention kernel | FlashAttention-2? | Eager attention | FlashAttention-2 | FlashAttention-2 | FlashAttention-2 |
| Attention | Sliding window attention | No? | Local-Global SWA | both | No? | No? |
| Attention | Removing KV heads | GQA | GQA | GQA | MLA | GQA |
| Attention | Cross-Layer Attention | No | No | No | No | CLA |
| Attention | prefill KV cache | Yes | ? | ? | No? | ? |
| Attention | low-rank KV cache compression | No | No | No | Yes | No |
| Attention | Biases in QKV projection | No? | ? | QKV biases | ? | No |
| Attention | QK Normalization | No? | No? | ? | No? | No |
| Block sequence | Parallel layers | No | No? | No? | No? | No? |
| Embedding | Position encoding | RoPE | RoPE | RoPE | decoupled RoPE | DynamicNTKRope (*6) |
| Embedding | Tied embeddings | “Shared” (*1) | Tied | Tied (*4) | ? | Tied (*7) |
| Optimizer | Regularized / preconditioned | AdamW | AdamW(*3) | ? | AdamW | AdamW |
| Activation normalization | post or pre layernorm | pre | both | ? | pre | ? |
| Activation normalization | Don’t center | RMSNorm(*2) | RMSNorm | RMSNorm | RMSNorm | RMSNorm |
| Output normalization | Soft-capped logits | No? | Soft-capped logits | No? | No? | No? |
| Activation function | Gated linear unit | SwiGLU | GeGLU | SwiGLU | SwiGLU | SwiGLU (*5) |
| Sparsification | Sparse? | Dense | Dense? | Dense | MoE | MoE |
| Weights quantization | BF16 training | Yes | No, FP32 | Yes | Yes | Yes |
| Weights quantization | 8-bit post-training | In one version | No | No | No | In one version |
(*1) In the 3.2 models anyway
(*2) Llama 2 uses RMSNorm anyway
(*3) They recommend AdamW for fine-tuning; unsure for pretraining
(*4) Only the smaller models
(*5) Code says “silu”
(*6) “Credits to the Reddit users /u/bloc97 and /u/emozilla”
(*7) https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Pretrain/modeling_hunyuan.py#L1419
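Two of the table’s rows, position encoding and KV-head removal, can also be sketched in a few lines of numpy. This is an illustrative reference implementation, not any listed model’s exact code (real implementations fuse this into the attention kernel and differ in pairing conventions):

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotary position embedding: rotate each pair (x_{2k}, x_{2k+1}) of
    # features by a position-dependent angle pos * base^(-2k/dim).
    seq, dim = x.shape  # dim must be even
    pos = np.arange(seq)[:, None]
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    ang = pos * freqs                    # (seq, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def gqa_expand(kv, n_query_heads):
    # GQA (Ainslie et al. 2023): kv has shape (n_kv_heads, seq, head_dim),
    # and each KV head serves a contiguous group of query heads.
    # n_kv_heads == 1 recovers MQA; n_kv_heads == n_query_heads is MHA.
    n_kv = kv.shape[0]
    assert n_query_heads % n_kv == 0
    return np.repeat(kv, n_query_heads // n_kv, axis=0)
```

Since RoPE is a pure rotation it preserves vector norms, and since GQA only shares (rather than recomputes) KV heads, the KV cache shrinks by a factor of `n_query_heads / n_kv_heads`.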
Caveats
- The above ignores the much more important changes since 2017 to data “collection” (curation and synthesis), cluster infrastructure, post-training, and scaffolding.
- The public tokenizers still use byte-pair encoding
- Some models have absurdly high embedding-parameter counts. Previously we conjectured that this is a PR tradeoff: it allows underreporting the Transformer-parameter count and so entering a lesser model class (“7B”). There is actually a good reason on top of this PR reason: TPUs have dedicated hardware (SparseCores) which accelerates embedding lookups relative to NVIDIA hardware, letting models trained on them have much larger vocabularies and hence huge embedding tables. The only fudge factor is then their use of the standard parameter-reporting convention (counting only Transformer parameters) despite having a good reason to explode the embedding.
- A lot of this doesn’t improve absolute performance that much, but it does make it a lot cheaper to run.
- And this is just the public architectures. Probably there are also some public methods which we haven’t realised are improvements yet.
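The embedding-table arithmetic behind that caveat is worth making explicit. With illustrative numbers (not any particular model’s reported config):

```python
# Illustrative numbers only: a large vocabulary of the kind SparseCores
# make cheap, at a mid-sized model width.
vocab_size = 256_000
d_model = 4096
embedding_params = vocab_size * d_model   # one embedding table
untied_params = 2 * embedding_params      # separate input and output tables
print(embedding_params, untied_params)    # 1048576000 2097152000
```

So an untied 256k-vocabulary model carries roughly 2B embedding parameters before counting a single Transformer block, which the “count only Transformer parameters” convention quietly drops.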
I thank Kushal Thaman for helpful comments.
See also
- https://arxiv.org/html/2410.16682v1
- https://openreview.net/forum?id=d8w0pmvXbZ
Bibliography
- Ainslie, Joshua et al. (2023). “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” https://arxiv.org/abs/2305.13245
- Bannier, P.A. (2023). “Rolling buffer cache” https://github.com/ggerganov/llama.cpp/discussions/3581
- Biderman, Stella et al. (2021). “Rotary Embeddings: A Relative Revolution” https://blog.eleuther.ai/rotary-embeddings/
- Brandon, William; Mishra, Mayank; Nrusimha, Aniruddha; Panda, Rameswar and Ragan-Kelley, Jonathan (2024). “Reducing Transformer Key-Value Cache Size with Cross-Layer Attention” https://arxiv.org/abs/2405.12981
- Chowdhery, Aakanksha et al. (2022). “PaLM: Scaling Language Modeling with Pathways” https://arxiv.org/abs/2204.02311
- Dao, Tri; Fu, Daniel Y.; Ermon, Stefano; Rudra, Atri and Ré, Christopher (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” https://arxiv.org/abs/2205.14135
- Dehghani, Mostafa et al. (2023). “Scaling Vision Transformers to 22 Billion Parameters” https://arxiv.org/abs/2302.05442
- Gu, Albert and Dao, Tri (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” https://arxiv.org/abs/2312.00752
- Leech, Gavin; Garfinkel, Simson; Yagudin, Misha; Briand, Alexander and Zhuralev, Aleksandr (2024). “Ten Hard Problems in Artificial Intelligence We Must Get Right” https://arxiv.org/abs/2402.04464
- Liu, Aixin et al. (2024). “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model” https://arxiv.org/abs/2405.04434
- Liu, Yuxi (2024). “Mixture of Experts” https://yuxi-liu-wired.github.io/essays/posts/mixture-of-experts/
- Loshchilov, Ilya and Hutter, Frank (2019). “Decoupled Weight Decay Regularization” https://arxiv.org/abs/1711.05101v3
- Miller, Evan (2023). “Attention Is Off By One” https://www.evanmiller.org/attention-is-off-by-one.html
- Press, Ofir (2023). “Train Short, Test Long: Attention with Linear Biases (ALiBi) Enables Input Length Extrapolation” https://github.com/ofirpress/attention_with_linear_biases/#faq
- Radford, Alec; Wu, Jeffrey; Amodei, Dario; Sutskever, Ilya et al. (2019). “Language Models are Unsupervised Multitask Learners” https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Roldán, J. Carlos (2024). “What is SwiGLU?” https://jcarlosroldan.com/post/348/what-is-swiglu
- Shazeer, Noam (2019). “Fast Transformer Decoding: One Write-Head is All You Need” https://arxiv.org/abs/1911.02150
- Stack Overflow (2024). “How to implement SwiGLU activation? Why does SwiGLU takes in two tensors?” https://stackoverflow.com/questions/79047727/how-to-implement-swiglu-activation-why-does-swiglu-takes-in-two-tensors
- Sun, Xingwu et al. (2024). “Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent” https://arxiv.org/abs/2411.02265
- Vaswani, Ashish et al. (2017). “Attention Is All You Need” https://arxiv.org/abs/1706.03762
- Vyas, Nikhil et al. (2024). “SOAP: Improving and Stabilizing Shampoo using Adam” https://arxiv.org/abs/2409.11321
- Xiong, Ruibin et al. (2020). “On Layer Normalization in the Transformer Architecture” https://arxiv.org/abs/2002.04745
- Zhang, Biao and Sennrich, Rico (2019). “Root Mean Square Layer Normalization” https://arxiv.org/abs/1910.07467
Maybe
- https://kellerjordan.github.io/posts/muon/
- https://epochai.substack.com/p/how-has-deepseek-improved-the-transformer
