推理
什么是推理?
“推理”一词是一个总称,涵盖演绎、归纳、溯因、类比、常识推理,以及其他“理性的”或系统化的问题求解方法。推理通常是一个包含多个步骤的过程,并且一般被认为需要抽象能力:推理能力不局限于特定示例,而是可以泛化到更一般的情形。如果我会对加法进行推理,我不仅能算出 23+37,还能解决遇到的任何加法题;如果我学会了以 10 为底的加法,并理解了底数的概念,推理能力还能让我快速学会任何其他底数下的加法。
作为生成模型,大语言模型并不具有严格意义上的推理能力:它只是根据上下文逐词生成文本。
推理 Prompt
尽管如此,我们仍可以通过 Prompt 引导 LLM 给出推理过程。例如,输入下面的 Prompt:
Logical and commonsense reasoning exam.
Explain your reasoning in detail, then answer with Yes or No. Your answers should follow this 4-line format:

Premise: ...
Question: ...
Reasoning: ...
Answer: ...

Premise: the customer doesn’t have any loans.
Question: Can we logically conclude for sure that the customer doesn’t have any auto loans?
Reasoning: Let’s think logically step by step. The premise basically tells us that
然后 GPT 会接着补全,比如:the customer has no loans at all. Therefore, we can conclude that the customer doesn’t have any auto loans either because no loans = no auto loans. Answer: Yes
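下面给出把上述 Prompt 发给模型的一个最小调用示意(假设使用 OpenAI 的 Python SDK;模型名 gpt-4o-mini 和最后的答案解析方式均为示例假设,并非唯一写法):

```python
# 最小示意:把推理 Prompt 发给 LLM,取回补全,并解析最后的 Answer 行
# 假设:使用 OpenAI Python SDK(openai>=1.0),已设置 OPENAI_API_KEY 环境变量
from openai import OpenAI

client = OpenAI()

PROMPT = """Logical and commonsense reasoning exam.
Explain your reasoning in detail, then answer with Yes or No. Your answers should follow this 4-line format:

Premise: ...
Question: ...
Reasoning: ...
Answer: ...

Premise: the customer doesn't have any loans.
Question: Can we logically conclude for sure that the customer doesn't have any auto loans?
Reasoning: Let's think logically step by step. The premise basically tells us that"""

resp = client.chat.completions.create(
    model="gpt-4o-mini",   # 示例模型名,按需替换
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,         # 逻辑推理题用低温度,减少随机性
)
completion = resp.choices[0].message.content
print(completion)

# 假设模型遵循了 "Answer: Yes/No" 的格式,取最后一个 Answer 行作为最终答案
answers = [line for line in completion.splitlines() if line.startswith("Answer:")]
print(answers[-1] if answers else "未找到 Answer 行")
```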
因果
当异常发生时,进行追踪调查、分析导致异常的原因,这就是“诊断”。在教育领域,这被称为“知识追踪”;在智能运维中,这被称为“根因分析”;在医学中,这被称为“病因诊断”。它们的共同特点是:根据异常发生时的各种表现进行推理分析。
利用现有的通用大模型,以及各领域的专用大模型(比如医学领域大模型)来完成上述追踪任务,就得到了面向追踪(诊断)的大语言模型应用。
- Adèle H. Ribeiro,Causality and its Role in Reasoning, Explainability, and Generalizability,LxMLS 2023 PPT
- Andrew Lampinen,Passive learning of active causal strategies in agents and language models,DeepMind,LxMLS 2023 PPT
课程材料
- 华盛顿大学 CSE 599 课程学生的 Slides
Awesome List
- Awesome deliberative prompting: How to ask LLMs to produce reliable reasoning and make reason-responsive decisions, Github。内容包括:
  - Success Stories
  - Prompting Patterns and Strategies
  - Beyond “Let’s think step by step”
  - Multi-Agent Deliberation
  - Reflection and Meta-Cognition
  - Text Generation Techniques
  - Self-Correction
  - Reasoning Analytics
  - Limitations, Failures, Puzzles
  - Datasets
  - Tools and Frameworks
  - Other Resources
Conference
- COLM 2024, Website
论文
Ng 老师推荐论文
- 反思
  - “Self-Refine: Iterative Refinement with Self-Feedback,” Madaan et al. (2023)
  - “Reflexion: Language Agents with Verbal Reinforcement Learning,” Shinn et al. (2023)
  - “CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing,” Gou et al. (2024)
Andrej Karpathy 推荐论文
- 系统一与系统二思维(Thinking Fast and Slow)
  - 丹尼尔·卡尼曼描述了一个包含两个系统的思维框架:系统 1 快速、直观、情绪化;系统 2 更慢、更审慎、更合乎逻辑。
- Mastering the game of Go with deep neural networks and tree search,论文链接
  - 这是大名鼎鼎的 AlphaGo 对应的论文:计算机程序首次在围棋中击败人类顶级职业选手,而此前人们认为这一壮举至少还需要十年。
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,论文链接
  - 让模型生成一组中间推理步骤,可以显著提高 LLM 执行复杂推理的能力(零样本版本的示意代码见本列表之后)。
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models,论文链接
  - 用语言模型研究系统 2 式思维,引入更多的探索、战略前瞻与规划。通过这种解题方式,可以用时间换取准确性。
- System 2 Attention,论文链接
  - Meta 在论文中增加了第二个注意力步骤,帮助 LLM 决定要关注和处理什么:先重新生成只包含相关部分的输入上下文,再基于这些上下文产生最终回答。
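上面提到的 Chain-of-Thought 有一个极简的零样本版本(Kojima et al. 的 “Let’s think step by step”):无需提供示例,只在问题后追加一句触发语,模型就会先写出中间推理步骤再给结论。下面是一个示意,例题取自 CoT 论文,调用方式与模型名沿用前文的示例假设:

```python
# 零样本 Chain-of-Thought 示意:在问题后追加触发语,诱导模型先推理再作答
from openai import OpenAI

client = OpenAI()

question = (
    "Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls each. "
    "How many tennis balls does he have now?"
)

# 关键只有这一行:追加 "Let's think step by step."
cot_prompt = question + "\n\nLet's think step by step."

resp = client.chat.completions.create(
    model="gpt-4o-mini",   # 示例模型名
    messages=[{"role": "user", "content": cot_prompt}],
    temperature=0,
)
print(resp.choices[0].message.content)  # 期望先输出逐步推理,最后得到 11
```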
约翰霍普金斯大学推荐论文
Rationalization/Explanations:
- The Unreliability of Explanations in Few-Shot In-Context Learning
- Can Rationalization Improve Robustness?
- Can language models learn from explanations in context?
Compositionality:
- Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
- Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
- ReasonBERT: Pre-trained to Reason with Distant Supervision
- Reasoning Like Program Executors
- LinkBERT: Pretraining Language Models with Document Links
普林斯顿课程推荐论文
- Chain of Thought Prompting Elicits Reasoning in Large Language Models
- Large Language Models are Zero-Shot Reasoners
Refer:
- Explaining Answers with Entailment Trees
- Self-Consistency Improves Chain of Thought Reasoning in Language Models(多数投票的示意代码见本列表之后)
- Faithful Reasoning Using Large Language Models
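上面列出的 Self-Consistency 的核心做法可以用几行代码说明:对同一问题以非零温度采样多条 CoT 推理路径,从每条路径中抽取最终答案,再做多数投票。下面是一个极简示意(模型名、答案格式约定和正则抽取方式均为示例假设):

```python
# Self-Consistency 示意:采样多条推理路径,对最终答案做多数投票
import re
from collections import Counter
from openai import OpenAI

client = OpenAI()

def sample_answer(question: str) -> str | None:
    """采样一条 CoT 推理路径,并抽取 'The answer is X' 形式的最终答案。"""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # 示例模型名
        messages=[{
            "role": "user",
            "content": question + "\n\nLet's think step by step, "
                       "then finish with 'The answer is <X>.'",
        }],
        temperature=0.7,  # 非零温度,让不同路径产生差异
    )
    text = resp.choices[0].message.content
    m = re.search(r"The answer is\s*([^\s.]+)", text)
    return m.group(1) if m else None

def self_consistency(question: str, n: int = 10) -> str:
    answers = [a for a in (sample_answer(question) for _ in range(n)) if a]
    if not answers:
        raise RuntimeError("没有解析出任何答案")
    # 多数投票:出现次数最多的答案作为最终答案
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("If 3 cars each have 4 wheels, how many wheels are there in total?"))
```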
华盛顿大学课程推荐论文
3: Can language models reason? 语言模型可以推理吗?
Why is it that deep learning can play chess, fold proteins, yet cannot solve strikingly easy puzzles?
为什么深度学习可以下棋、折叠蛋白质,却不能解决极其简单的难题?
- Show Your Work: Scratchpads for Intermediate Computation with Language Models (Nye et al., 2021)
- Chain of Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022)
- Structured, flexible, and robust: benchmarking and improving large language models towards more human-like behavior in out-of-distribution reasoning tasks (Collins et al., 2022)
- Large Language Models Still Can’t Plan: A Benchmark for LLMs on Planning and Reasoning about Change (Valmeekam et al., 2022)
- Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought (Saparov and He, 2022)
- Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs (Sap et al., 2022)
- Natural Language Deduction with Incomplete Information (Sprague et al., 2022)
- Maieutic Prompting: Logically Consistent Reasoning with Recursive Explanations (Jung et al., 2022)
- Language Models of Code are Few-Shot Commonsense Learners (Madaan et al., 2022)
- Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning (Creswell et al., 2022)
- Faithful Reasoning Using Large Language Models (Creswell and Shanahan 2022)
- Binding Language Models in Symbolic Languages (2022)
- Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering (Kim et al., 2021)
- CREPE: Open-Domain Question Answering with False Presuppositions (Yu et al., 2022)
Melanie Mitchell 论文
Memorization vs reasoning (Melanie Mitchell)
- Perspectives on the State and Future of Deep Learning, 2023, arXiv:2312.09323,论文
- Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks. To appear in Proceedings of the LLM-CP Workshop, AAAI-24. 论文
- The debate over understanding in AI’s large language models,Webpage
- How do we know how smart AI systems are?,Webpage
- AI Models of Conceptual Abstraction and Analogy-Making,博士后项目说明,PDF
- Transcript of Episode 33 – Melanie Mitchell on the Elements of AI,Webpage
- Book: Artificial Intelligence: A Guide for Thinking Humans, 2019
Agent 推理相关论文
复旦大学的 LLM Agent 综述论文中提到的 LLM 推理和规划能力相关论文。
Reasoning
- [2023/09] ReConcile: Round-Table Conference Improves Reasoning via Consensus among Diverse LLMs. Justin Chih-Yao Chen (University of North Carolina at Chapel Hill) et al. arXiv. [paper] [code]
- [2023/05] Self-Polish: Enhance Reasoning in Large Language Models via Problem Refinement. Zhiheng Xi (Fudan University) et al. arXiv. [paper] [code]
- [2023/03] Large Language Models are Zero-Shot Reasoners. Takeshi Kojima (The University of Tokyo) et al. arXiv. [paper] [code]
- [2023/03] Self-Refine: Iterative Refinement with Self-Feedback. Aman Madaan (Carnegie Mellon University) et al. arXiv. [paper] [code]
- [2022/05] Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. Antonia Creswell (DeepMind) et al. arXiv. [paper]
- [2022/03] Self-Consistency Improves Chain of Thought Reasoning in Language Models. Xuezhi Wang (Google Research) et al. arXiv. [paper] [code]
- [2023/02] Multimodal Chain-of-Thought Reasoning in Language Models. Zhuosheng Zhang (Shanghai Jiao Tong University) et al. arXiv. [paper] [code]
- [2022/01] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Jason Wei (Google Research) et al. arXiv. [paper]
综述论文
- Towards Reasoning in Large Language Models: A Survey,论文
Choi 老师论文
- Faith and Fate: Limits of Transformers on Compositionality,论文,发现 Transformer 的多步组合推理能力有限
系统
- 谷歌 AI 通过图灵测试,大模型医生来了?图灵人工智能,2024-01-15,微信公众号
实现
- LLM Reasoners, A library for advanced large language model reasoning, Github, Paper
  - 包括算法:
    - Reasoning-via-Planning, MCTS (Hao et al., 2023)
    - StructChem (Ouyang et al., 2023)
    - Chain-of-Thoughts (Wei et al., 2022)
    - Least-to-most prompting (Zhou et al., 2022)
    - Tree-of-Thoughts, BFS (Yao et al., 2023),骨架示意见本节末尾
    - Tree-of-Thoughts, DFS (Yao et al., 2023)
    - Self-Eval Guided Decoding, Beam Search (Xie et al., 2023)
    - Grace Decoding (Khalifa et al., 2023)
    - Eurus (Yuan et al., 2024)
    - PromptAgent (Wang et al., 2023)
  - 可视化
  - LLM 接口
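以上面列出的 Tree-of-Thoughts(BFS)为例,其主干逻辑是逐层扩展候选“思路”,每层只按评分保留前若干条部分推理路径。下面是一个与具体库无关的骨架示意,其中 propose_thoughts 与 score_thought 是假设由 LLM 调用实现的函数,这里只给出签名:

```python
# Tree-of-Thoughts(BFS)骨架示意:广度优先地扩展思路树,按评分做 beam 剪枝
from typing import Callable

def tot_bfs(
    problem: str,
    propose_thoughts: Callable[[str, list[str]], list[str]],  # 基于当前部分解,提出若干下一步思路(假设由 LLM 实现)
    score_thought: Callable[[str, list[str]], float],         # 为部分解打分(假设由 LLM 实现)
    max_depth: int = 3,
    beam_width: int = 5,
) -> list[str]:
    """返回得分最高的一条完整推理路径(思路序列)。"""
    frontier: list[list[str]] = [[]]  # 每个元素是一条部分推理路径
    for _ in range(max_depth):
        # 扩展:对 frontier 中每条路径,生成所有"下一步思路"的延伸
        candidates = [
            path + [thought]
            for path in frontier
            for thought in propose_thoughts(problem, path)
        ]
        # 剪枝:按评分降序,只保留 beam_width 条路径进入下一层
        candidates.sort(key=lambda p: score_thought(problem, p), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]
```

这与 Yao et al. (2023) 中 BFS 变体的思路一致:用更多的采样与评估调用(时间)换取更高的解题准确率。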
评估
- Benchmarking large language models’ complex reasoning ability with chain-of-thought prompting, Github