People have tweaked the Transformer architecture enough in 7 years that we’re apparently now calling the current recipe “Transformer++”.
The changes between Vaswani et al. (2017) and Radford et al. (2019) are well-known (see Figure 3 here): removing the encoder, clipping gradients, not using dropout, and swapping the activation function for a smoother, input-weighted one (ReLU -> GeLU). So I focus on post-GPT-2 tweaks.
I also focus on architecture changes rather than data changes (curation or generation), training changes (infrastructure and hyperparameters; e.g. the various kinds of data parallelism and model parallelism across multiple devices arose largely after 2019), post-training, or inference optimisations. I won’t get into multimodal architectures.
Inclusion criterion: at least three of the strong open-source architectures from 2024 (LLaMA, Gemma, Qwen2.5, DeepSeek-V2, Hunyuan) use the tweak.
(The term “Transformer++” was coined in this sense by Gu and Dao 2023, who also noted several of the main tweaks.)
The Transformer++
Let the “Transformer++” be a Transformer with
- A fused attention implementation (the scaled dot-product backend -> FlashAttention). Linear rather than quadratic memory in input sequence length. Practically: it can double GPU utilization and so halve training time; it also enables longer contexts and speeds up inference on long-context input.
- Rotary position embedding (sinusoidal -> learned APE -> RoPE)
- Removing attention’s redundant key heads and value heads (vanilla MHA -> MQA -> GQA)
- Regularized / preconditioned optimizer (Adam -> AdamW -> SOAP)
- Normalise before each layer (post LayerNorm -> pre LayerNorm)
- When doing layer normalization: just rescale, don’t centre (LayerNorm -> RMSNorm)
- Divine activation function for the MLP (GeLU -> … -> SwiGLU or GeGLU)
- Tied embeddings. An oldie but goodie.
- Fix logit drift (query/key normalization)
- Fixing that one softmax off-by-one (fixed in some places around 2021)
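Two of the tweaks above (RMSNorm and SwiGLU) are simple enough to sketch directly. A minimal numpy sketch, with gains, biases and initialisation omitted; the shapes are illustrative, not any model’s actual configuration:

```python
import numpy as np

def layernorm(x, eps=1e-6):
    # Classic LayerNorm: centre and rescale along the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rmsnorm(x, eps=1e-6):
    # RMSNorm (Zhang & Sennrich 2019): rescale only, no centring.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

def swiglu_mlp(x, W, V, W2):
    # SwiGLU MLP: out = (swish(xW) * xV) W2, swish(z) = z * sigmoid(z).
    # Hence the "two tensors": a gate path (W) and a value path (V).
    a = x @ W
    gate = a * (1.0 / (1.0 + np.exp(-a)))  # SiLU / Swish-1
    return (gate * (x @ V)) @ W2
```

RMSNorm drops the mean-subtraction and the learned bias, so it is slightly cheaper per layer while performing about as well, which is why every model in the table below uses it.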
Less well-established tweaks
- Sparsification. I could mention the turn to sparse Mixtures of Experts, but this was more a cost-saving measure than an absolute-performance gain. LLaMA is still dense though.
- BPE -> Tiktoken / SentencePiece. Basically the same, but about 25% better compression.
- Quantization. FP16 to BF16 to int8.
- No bias on the QKV projection or the layernorm. (Though some architectures, e.g. Qwen2.5, put the biases back into attention, so I’m not sure about this one.)
- Sliding window attention, e.g. with a rolling buffer cache
- Cross-Layer Attention shrinks the KV cache
- WARP
- Regularizing outputs (“soft-capping logits”)
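Two of these are easy to make concrete: tanh soft-capping of logits (Gemma-2-style; the cap of 30 below is illustrative) and a causal sliding-window attention mask. A hedged numpy sketch:

```python
import numpy as np

def soft_cap(logits, cap=30.0):
    # Squash logits smoothly into (-cap, cap): cap * tanh(logits / cap).
    # Near zero it is approximately the identity; extremes are bounded.
    return cap * np.tanh(logits / cap)

def sliding_window_mask(seq_len, window):
    # Boolean mask where position i attends only to the last `window`
    # positions up to and including itself (causal + local).
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)
```

Soft-capping bounds the attention or output logits without the hard clipping that would kill gradients; the sliding window bounds the KV cache to `window` entries per layer, which is what makes a rolling buffer cache possible.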
Occurrence in top open architectures
| Component | Tweak | LLaMA 3 | Gemma 2 | Qwen2.5 | DeepSeek-V2 | Hunyuan-Large |
|---|---|---|---|---|---|---|
| Attention | Attention kernel | FlashAttention-2? | Eager attention | FlashAttention-2 | FlashAttention-2 | FlashAttention-2 |
| Attention | Sliding window attention | No? | Local-Global SWA | both | No? | No? |
| Attention | Removing KV heads | GQA | GQA | GQA | MLA | GQA |
| Attention | Cross-Layer Attention | No | No | No | No | CLA |
| Attention | prefill KV cache | Yes | ? | ? | No? | ? |
| Attention | low-rank KV cache compression | No | No | No | Yes | No |
| Attention | Biases in QKV projection | No? | ? | QKV biases | ? | No |
| Attention | QK Normalization | No? | No? | ? | No? | No |
| Block sequence | Parallel layers | No | No? | No? | No? | No? |
| Embedding | Position encoding | RoPE | RoPE | RoPE | decoupled RoPE | DynamicNTKRope (*6) |
| Embedding | Tied embeddings | “Shared” (*1) | Tied | Tied (*4) | ? | Tied (*7) |
| Optimizer | Regularized / preconditioned | AdamW | AdamW(*3) | ? | AdamW | AdamW |
| Activation normalization | post or pre layernorm | pre | both | ? | pre | ? |
| Activation normalization | Don’t center | RMSNorm(*2) | RMSNorm | RMSNorm | RMSNorm | RMSNorm |
| Output normalization | Soft-capped logits | No? | Soft-capped logits | No? | No? | No? |
| Activation function | Gated linear unit | SwiGLU | GeGLU | SwiGLU | SwiGLU | SwiGLU (*5) |
| Sparsification | Sparse? | Dense | Dense? | Dense | MoE | MoE |
| Weights quantization | BF16 training | Yes | No, FP32 | Yes | Yes | Yes |
| Weights quantization | 8-bit post-training | In one version | No | No | No | In one version |
(*1) In the 3.2 models anyway
(*2) Llama 2 uses RMSNorm anyway
(*3) They recommend AdamW for fine-tuning; unsure for pretraining
(*4) Only the smaller models
(*5) Code says “silu”
(*6) “Credits to the Reddit users /u/bloc97 and /u/emozilla”
(*7) https://huggingface.co/tencent/Tencent-Hunyuan-Large/blob/main/Hunyuan-A52B-Pretrain/modeling_hunyuan.py#L1419
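Two of the table’s rows, position encoding and KV-head removal, can also be sketched in a few lines of numpy. This is an illustrative reference implementation, not any listed model’s exact code (real implementations fuse this into the attention kernel and differ in pairing conventions):

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotary position embedding: rotate each pair (x_{2k}, x_{2k+1}) of
    # features by a position-dependent angle pos * base^(-2k/dim).
    seq, dim = x.shape  # dim must be even
    pos = np.arange(seq)[:, None]
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    ang = pos * freqs                    # (seq, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def gqa_expand(kv, n_query_heads):
    # GQA (Ainslie et al. 2023): kv has shape (n_kv_heads, seq, head_dim),
    # and each KV head serves a contiguous group of query heads.
    # n_kv_heads == 1 recovers MQA; n_kv_heads == n_query_heads is MHA.
    n_kv = kv.shape[0]
    assert n_query_heads % n_kv == 0
    return np.repeat(kv, n_query_heads // n_kv, axis=0)
```

Since RoPE is a pure rotation it preserves vector norms, and since GQA only shares (rather than recomputes) KV heads, the KV cache shrinks by a factor of `n_query_heads / n_kv_heads`.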
Caveats
- The above ignores the much more important changes since 2017 to data “collection” (curation and synthesis), cluster infrastructure, post-training, and scaffolding.
- The public tokenizers still use byte-pair encoding
- Some models have absurdly high embedding-parameter counts. Previously we conjectured that this is a PR tradeoff: it allows underreporting the Transformer-parameter count and so entering a lesser model class (“7B”). There is actually a good reason on top of this PR reason: TPUs have dedicated hardware (SparseCores) which accelerates embedding lookups relative to NVIDIA hardware, letting models trained on them have much larger vocabularies and hence huge embedding tables. The only fudge factor is then their use of the standard parameter-reporting convention (counting only Transformer parameters) despite having a good reason to explode the embedding.
- A lot of this doesn’t improve absolute performance that much, but it does make it a lot cheaper to run.
- And this is just the public architectures. Probably there are also some public methods which we haven’t realised are improvements yet.
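The embedding-table arithmetic behind that caveat is worth making explicit. With illustrative numbers (not any particular model’s reported config):

```python
# Illustrative numbers only: a large vocabulary of the kind SparseCores
# make cheap, at a mid-sized model width.
vocab_size = 256_000
d_model = 4096
embedding_params = vocab_size * d_model   # one embedding table
untied_params = 2 * embedding_params      # separate input and output tables
print(embedding_params, untied_params)    # 1048576000 2097152000
```

So an untied 256k-vocabulary model carries roughly 2B embedding parameters before counting a single Transformer block, which the “count only Transformer parameters” convention quietly drops.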
I thank Kushal Thaman for helpful comments.
See also
- https://arxiv.org/html/2410.16682v1
- https://openreview.net/forum?id=d8w0pmvXbZ
Bibliography
- Ainslie, Joshua et al. (2023). “GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints” https://arxiv.org/abs/2305.13245
- Bannier, P.A. (2023). “Rolling buffer cache” https://github.com/ggerganov/llama.cpp/discussions/3581
- Biderman, Stella et al. (2021). “Rotary Embeddings: A Relative Revolution” https://blog.eleuther.ai/rotary-embeddings/
- Brandon, William; Mishra, Mayank; Nrusimha, Aniruddha; Panda, Rameswar and Ragan-Kelley, Jonathan (2024). “Reducing Transformer Key-Value Cache Size with Cross-Layer Attention” https://arxiv.org/abs/2405.12981
- Chowdhery, Aakanksha et al. (2022). “PaLM: Scaling Language Modeling with Pathways” https://arxiv.org/abs/2204.02311
- Dao, Tri; Fu, Daniel Y.; Ermon, Stefano; Rudra, Atri and Ré, Christopher (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness” https://arxiv.org/abs/2205.14135
- Dehghani, Mostafa et al. (2023). “Scaling Vision Transformers to 22 Billion Parameters” https://arxiv.org/abs/2302.05442
- Gu, Albert and Dao, Tri (2023). “Mamba: Linear-Time Sequence Modeling with Selective State Spaces” https://arxiv.org/abs/2312.00752
- Leech, Gavin; Garfinkel, Simson; Yagudin, Misha; Briand, Alexander and Zhuralev, Aleksandr (2024). “Ten Hard Problems in Artificial Intelligence We Must Get Right” https://arxiv.org/abs/2402.04464
- Liu, Aixin et al. (2024). “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model” https://arxiv.org/abs/2405.04434
- Liu, Yuxi (2024). “Mixture of Experts” https://yuxi-liu-wired.github.io/essays/posts/mixture-of-experts/
- Loshchilov, Ilya and Hutter, Frank (2019). “Decoupled Weight Decay Regularization” https://arxiv.org/abs/1711.05101v3
- Miller, Evan (2023). “Attention Is Off By One” https://www.evanmiller.org/attention-is-off-by-one.html
- Press, Ofir (2023). “Train Short, Test Long: Attention with Linear Biases (ALiBi) Enables Input Length Extrapolation” https://github.com/ofirpress/attention_with_linear_biases/#faq
- Radford, Alec; Wu, Jeffrey; Amodei, Dario; Sutskever, Ilya et al. (2019). “Language Models are Unsupervised Multitask Learners” https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Roldán, J. Carlos (2024). “What is SwiGLU?” https://jcarlosroldan.com/post/348/what-is-swiglu
- Shazeer, Noam (2019). “Fast Transformer Decoding: One Write-Head is All You Need” https://arxiv.org/abs/1911.02150
- Stack Overflow (2024). “How to implement SwiGLU activation? Why does SwiGLU takes in two tensors?” https://stackoverflow.com/questions/79047727/how-to-implement-swiglu-activation-why-does-swiglu-takes-in-two-tensors
- Sun, Xingwu et al. (2024). “Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent” https://arxiv.org/abs/2411.02265
- Vaswani, Ashish et al. (2017). “Attention Is All You Need” https://arxiv.org/abs/1706.03762
- Vyas, Nikhil et al. (2024). “SOAP: Improving and Stabilizing Shampoo using Adam” https://arxiv.org/abs/2409.11321
- Xiong, Ruibin et al. (2020). “On Layer Normalization in the Transformer Architecture” https://arxiv.org/abs/2002.04745
- Zhang, Biao and Sennrich, Rico (2019). “Root Mean Square Layer Normalization” https://arxiv.org/abs/1910.07467
Maybe
- https://kellerjordan.github.io/posts/muon/
- https://epochai.substack.com/p/how-has-deepseek-improved-the-transformer
