Model Design and Training
Large language models are, by definition, very large. This section covers optimization methods used in the design, training, and inference of such large models. As LLMs keep growing, the GPU hardware used for training (in particular, memory) remains limited, so a variety of methods have been proposed to optimize model memory usage and computation.
Model Design and Training
We begin with the design and training of the Transformer model that LLMs are built on.
First, the Transformer architecture has seen two important structural updates: SwiGLU and RMSNorm. More recently there have been further improvements as well, such as relative position encoding, rotary position embeddings (RoPE), ALiBi embeddings, and gated FFNs.
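To make these two components concrete, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block; the dimensions and module names are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """RMSNorm: rescale by the root-mean-square of the activations; no mean-centering, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.weight * x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)

class SwiGLUFFN(nn.Module):
    """Gated feed-forward block: SiLU(x W_gate) * (x W_up), then project back to the model dim."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))

x = torch.randn(2, 16, 512)                 # (batch, seq_len, dim)
y = SwiGLUFFN(512, 1376)(RMSNorm(512)(x))   # hidden_dim roughly 8/3 * dim, LLaMA-style
print(y.shape)                              # torch.Size([2, 16, 512])
```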
Second, training models with more than 1T parameters requires multi-GPU parallelism and memory optimization. In particular, mixed-precision training (which also exploits Tensor Cores) and ZeRO (the Zero Redundancy Optimizer) reduce memory requirements.
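A minimal sketch of the mixed-precision part, using PyTorch AMP; a toy model, random data, and a CUDA GPU are assumed. ZeRO-style sharding of optimizer states and gradients would be layered on top of this, e.g. via DeepSpeed or PyTorch FSDP.

```python
import torch

# Toy model and random data, just to show the mixed-precision training pattern.
model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()   # scales the loss so fp16 gradients do not underflow

for step in range(10):
    x = torch.randn(8, 1024, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(dtype=torch.float16):   # matmuls run in fp16 on Tensor Cores
        loss = model(x).pow(2).mean()                    # dummy loss
    scaler.scale(loss).backward()   # backward pass on the scaled loss
    scaler.step(optimizer)          # unscale gradients, skip the step if they overflowed
    scaler.update()                 # adjust the loss scale for the next iteration
```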
Then, to further improve efficiency, there are various additional techniques, including the following (a small PEFT sketch follows this list):
- ZeRO-Infinity (CPU/NVMe offloading)
- 8-bit post-training quantization
- PEFT (parameter-efficient fine-tuning)
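As an illustration of the PEFT item, the sketch below applies LoRA with the Hugging Face peft library; the base model ("facebook/opt-125m") and the LoRA hyperparameters are only examples, not a recommendation.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Any small causal LM works here; "facebook/opt-125m" is just an example.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# LoRA: freeze the base weights and train small low-rank adapters instead.
lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update
    lora_alpha=16,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # which linear layers receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only a small fraction of parameters remains trainable
```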
Finally, there are many practical tricks for training Transformers; the GPT-3 paper describes a number of them.
For details on all of the above, see the Stanford CS224U slides and the PEFT slides from the Yandex course.
Model Understanding
Researchers have explored how to understand the language models obtained from training.
First, one can inspect the learned attention patterns. For example, Anthropic's March 2022 paper In-context Learning and Induction Heads argues that "induction heads" are the mechanism underlying most in-context learning.
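A rough way to look for induction-head-like behavior yourself is to feed a repeated random token sequence into GPT-2 and measure how much attention each head puts on "the token that followed the previous occurrence of the current token". The sketch below is a simplified version of that diagnostic, using Hugging Face transformers; the threshold and scoring are ad hoc, not the paper's exact metric.

```python
import torch
from transformers import GPT2LMHeadModel

# Feed a random token sequence repeated twice; an induction head at position L+i
# should attend back to position i+1, i.e. an offset of L-1 behind the query.
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

L = 50
first_half = torch.randint(1000, 10000, (1, L))
tokens = torch.cat([first_half, first_half], dim=1)   # shape (1, 2L)

with torch.no_grad():
    attentions = model(tokens, output_attentions=True).attentions  # tuple of (1, heads, 2L, 2L)

q_idx = torch.arange(L, 2 * L)   # query positions in the second half
for layer, attn in enumerate(attentions):
    pattern = attn[0]            # (heads, 2L, 2L)
    # Average attention mass placed exactly L-1 positions back: one score per head.
    scores = pattern[:, q_idx, q_idx - (L - 1)].mean(dim=-1)
    for head, s in enumerate(scores.tolist()):
        if s > 0.4:              # arbitrary threshold, just for illustration
            print(f"layer {layer} head {head}: induction score {s:.2f}")
```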
Second, researchers found a large circuit in GPT-2 for a natural-language task and quantitatively evaluated a human-understandable explanation of it. The circuit consists of 26 attention heads, including previous-token heads, induction heads, and duplicate-token heads, which is quite interesting. See "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small", 2022.
Finally, researchers have found that the FFN layers in a Transformer act as key-value memories; see "Transformer Feed-Forward Layers Are Key-Value Memories", EMNLP 2021.
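The key-value view can be written out directly: the rows of the first FFN matrix act as keys matched against the hidden state, the resulting activations are "memory coefficients", and the output is a weighted sum of the columns of the second matrix (the values). A tiny illustrative sketch follows; ReLU is used here only for simplicity, whereas real models use GELU-like activations.

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048
W_in = nn.Linear(d_model, d_ff)    # rows of W_in.weight act as "keys"
W_out = nn.Linear(d_ff, d_model)   # columns of W_out.weight act as "values"

x = torch.randn(d_model)           # hidden state of one token
coeffs = torch.relu(W_in(x))       # memory coefficients: how strongly each key matches
out = W_out(coeffs)                # FFN output

# The same computation, written explicitly as a weighted sum of value vectors:
values = W_out.weight              # shape (d_model, d_ff); column i is the i-th value vector
manual = values @ coeffs + W_out.bias
print(torch.allclose(out, manual, atol=1e-5))   # True
```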
Course Materials
- Johns Hopkins GA, Colin Raffel, Building Better Language Models, slides
- Stanford CS324 (2022): Modeling, Training, Parallelism
- Stanford CS224U: Natural Language Understanding, lecture "Fantastic Language Models and How to Build Them", Siddharth Karamcheti
- Yandex 2023 LLM slides
- University of Washington CSE 599 student slides
- Johns Hopkins UA 2024, Lec 21: Model efficiency
  - Quantization
  - Distillation
  - Efficient forms of self-attention
  - Dealing with hallucination
- Johns Hopkins UA 2023 talk, Decentralized Deep Learning: Running Large Neural Networks Together
- Berkeley Summit 2023, Hugging Face, Taming the Wild West of LLMs, Slides, Video; a performance comparison of various models
- Berkeley CS294/194-196: Responsible GenAI, MosaicML, Infrastructure Layer I: Training and Inference, Performance Optimization, Scalability (slides)
Software
Papers
Princeton University Course Materials
Systems
References:
- Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
- Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM
Sparse models
References:
- Efficient Large Scale Language Modeling with Mixtures of Experts
- Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models
- A Review of Sparse Expert Models in Deep Learning
Johns Hopkins University
Generalist Models
- A Generalist Agent
- UnifiedQA: Crossing Format Boundaries With a Single QA System
- Perceiver IO: A General Architecture for Structured Inputs & Outputs
- Benchmarking Generalization via In-Context Instructions on 1,600+ Language Tasks
- Multitask Prompted Training Enables Zero-Shot Task Generalization
- Finetuned Language Models Are Zero-Shot Learners
- BC-Z: Zero-Shot Task Generalization with Robotic Imitation Learning
Diagnosis:
- Do Prompt-Based Models Really Understand the Meaning of Their Prompts?
- How transferable are features in deep neural networks?
- Frequency Effects on Syntactic Rule Learning in Transformers
- Are Multimodal Transformers Robust to Missing Modality?
- Capturing Failures of Large Language Models via Human Cognitive Biases
Model
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
Additional Reading(s):
- GPT-NeoX-20B: An Open-Source Autoregressive Language Model, Webpage
- Unifying Language Learning Paradigms
- Do transformer modifications transfer across implementations and applications?
- Staged Training for Transformer Language Models
References
- “Building ML models like we build open-source software” by Colin Raffel - https://www.youtube.com/watch?v=0oGxT_i7nk8
- Rotary position embeddings explanation from EleutherAI - https://blog.eleuther.ai/rotary-embeddings/
- Group query attention to reduce the memory usage for inference - https://arxiv.org/abs/2305.13245v2
- Gated activations improve transformer (apparently due to divine benevolence) - https://arxiv.org/abs/2002.05202
Implementations of Deep Learning Models
New deep learning models appear all the time. Perhaps the best way to keep up with the latest trends is to follow Lucidrains' GitHub repositories. He is remarkable: he specializes in implementing popular, high-impact AI models, so you can tell what is currently in fashion simply from which models he is implementing. Truly prolific.
HF Transformers Training Arguments
- --max_steps: Total number of training steps to run.
- --learning_rate: Highest learning rate in the learning rate schedule (by default, the learning rate schedule increases and decreases linearly, with the peak being this set learning rate).
- --warmup_steps: Number of steps to do a linear "warmup" from a small learning rate to the desired learning rate.
- --save_steps: How many steps of training before saving (and evaluating on the val set).
- --per_device_train_batch_size: Batch size per GPU. The total batch size is [number of GPUs] * per_device_train_batch_size.
- --gradient_accumulation_steps: Allows for accumulating gradients across multiple sequential steps. This allows you to increase the batch size while trading off computation time. If this parameter is > 1, then the total batch size is [num GPUs] * per_device_train_batch_size * gradient_accumulation_steps.
- --lr_scheduler_type: Learning rate scheduler. The default is "linear", which does a linear increase and decrease to and from the learning_rate.
- --adafactor: Use this flag if you want to use the Adafactor optimizer instead of AdamW (default). Adafactor can save on GPU memory by only saving 2 copies of the model instead of the 3 needed for AdamW (mean, variance, gradient).
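In the Hugging Face example scripts these flags are parsed into a TrainingArguments object; a rough Python equivalent with made-up values looks like this (a sketch, not a recommended configuration):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",
    max_steps=10_000,                   # --max_steps
    learning_rate=3e-4,                 # --learning_rate (peak of the schedule)
    warmup_steps=500,                   # --warmup_steps
    save_steps=1_000,                   # --save_steps
    per_device_train_batch_size=4,      # --per_device_train_batch_size
    gradient_accumulation_steps=8,      # --gradient_accumulation_steps
    lr_scheduler_type="linear",         # --lr_scheduler_type
    # optim="adafactor",                # uncomment to use Adafactor instead of AdamW
)
# Effective batch size = num_GPUs * per_device_train_batch_size * gradient_accumulation_steps.
```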
Systems
- Microsoft Phi-2, 2.7B parameters, strong results, Dec. 2023, Webpage
- Mixtral 8x7B, Mixtral of experts, a high-quality sparse Mixture-of-Experts model, Dec. 2023, Webpage, introduction in Chinese, MoE explained, usage guide
- BloombergGPT, arXiv paper
- AllenAI Open Language Model: OLMo, including data and code, 2024, website