Scale of Models and Data

This section looks at how the scale of models and data affects the performance of large language models.

First, a model's loss follows a power law in training time (compute), dataset size, and model size. Given this relationship, one can, under a limited resource budget, choose an appropriate amount of data and an appropriate model size to obtain a compute-optimal model.
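To make the trade-off concrete, below is a minimal Python sketch of a Chinchilla-style parametric loss fit, L(N, D) = E + A/N^alpha + B/D^beta, where N is the parameter count and D is the number of training tokens (see paper 1 in the list below, Training Compute-Optimal Large Language Models). The coefficient values are the ones reported in that paper, but the grid search and the C ≈ 6·N·D compute approximation are simplifying assumptions for illustration, not the paper's exact fitting procedure.

```python
import numpy as np

# A Chinchilla-style parametric scaling law (Hoffmann et al., 2022):
#   L(N, D) = E + A / N**alpha + B / D**beta
# N = parameter count, D = training tokens. The coefficients below are
# the fitted values reported in that paper; treat them as illustrative.
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def compute_optimal(C, grid=np.logspace(6, 13, 20_000)):
    """For a FLOPs budget C (using the rough approximation C ~= 6*N*D),
    grid-search the N that minimizes predicted loss, with D = C / (6*N)."""
    D = C / (6 * grid)
    i = np.argmin(loss(grid, D))
    return grid[i], D[i]

for C in (1e21, 1e23, 1e25):
    N, D = compute_optimal(C)
    print(f"C={C:.0e} FLOPs -> N ~ {N:.2e} params, D ~ {D:.2e} tokens")
```

Under this fit, a larger FLOPs budget shifts the optimum toward both more parameters and more tokens, which is the sense in which one can trade data against model size at fixed compute.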

Second, researchers have observed so-called "emergent abilities". Emergence means that a quantitative change in a system produces a qualitative change in its behavior: an ability is said to be emergent if it is not present in smaller models but is present in larger ones.

On the topic of emergence, 治杰 recently shared with us the NeurIPS 2023 Best Paper "Are Emergent Abilities of Large Language Models a Mirage?". It argues that the model's loss actually decreases gradually, and that "emergence" only appears in evaluation results because test sets are finite or the evaluation metrics are discrete. Quite interesting.
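Here is a toy numerical sketch of that "mirage" argument, with entirely made-up numbers: per-token accuracy is assumed to improve smoothly with scale, but an exact-match metric over a multi-token answer multiplies per-token accuracies together, so the measured score stays near zero for small models and then climbs steeply. The power-law exponent, the answer length k, and the model sizes are all illustrative assumptions, not values from the paper.

```python
import numpy as np

# Hypothetical model sizes (parameter counts) spanning six orders of magnitude.
scales = np.logspace(6, 12, 7)

# Assume per-token accuracy improves smoothly with scale (a made-up power law).
per_token = 1 - scales**-0.1

# An exact-match metric on a k-token answer needs every token right, so the
# measured score is roughly per_token**k: a smooth underlying improvement
# shows up as an apparently abrupt, "emergent" jump in the discrete metric.
k = 20
exact_match = per_token**k

for n, pt, em in zip(scales, per_token, exact_match):
    print(f"N={n:.0e}: per-token acc={pt:.3f}, exact-match acc={em:.4f}")
```

In this sweep, per-token accuracy only moves from about 0.75 to about 0.94, yet exact-match accuracy goes from under 1% to over 25%: the discontinuity lives in the metric, not in the model.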

Course Materials

Papers

Princeton University course reference papers

Johns Hopkins University reference papers: Scaling

Andrej Karpathy's recommended papers

University of Washington reference papers

2: Is scale all we need for language models?

What are the emergent capabilities of large language models? Are we limited by model size vs. data size? Is bigger really always better? What about inverse scaling laws?

  1. Training Compute-Optimal Large Language Models (DeepMind, 2022)
  2. Emergent Abilities of Large Language Models (Google, 2022)
  3. In-context Learning and Induction Heads (Anthropic, 2022)
  4. Scaling Instruction-Finetuned Language Models (Google, 2022)
  5. Scaling Laws for Neural Language Models (OpenAI, 2020)
  6. Scaling Laws for Autoregressive Generative Modeling (OpenAI, 2020)
  7. Scaling Laws for Generative Mixed-Modal Language Models (Meta, 2023)
  8. Scaling Laws for Transfer (OpenAI, 2021)
  9. Beyond neural scaling laws: beating power law scaling via data pruning (Meta, 2022)
  10. Measuring Progress on Scalable Oversight for Large Language Models (Anthropic, 2022)
  11. LLM.int8() and Emergent Features (Dettmers blog, 2022)
  12. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (Dettmers, 2022)
  13. Predictability and Surprise in Large Generative Models (Anthropic, 2022)
  14. An Information-Theoretic Analysis of Compute-Optimal Neural Scaling Laws (2022)
  15. Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance (Google blog, 2022)
  16. PaLM: Scaling Language Modeling with Pathways (Google, 2022)
  17. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts (Google, 2022)
  18. Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model (Megatron-Turing NLG 530B, Microsoft & NVIDIA, 2022)
  19. Scaling Language Models: Methods, Analysis & Insights from Training Gopher (DeepMind, 2022)
  20. Galactica: A Large Language Model for Science (Meta, 2022)
  21. LaMDA: Towards Safe, Grounded, and High-Quality Dialog Models for Everything (Google blog, 2022)
  22. Inverse scaling competition (public competition, 2022)
  23. Quantifying Memorization Across Neural Language Models (Carlini et al., 2022)
  24. How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources (Fu et al., 2022)
  25. What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? (Wang et al., 2022)

