Introduction

The capacity for reasoning — drawing logical conclusions from given premises — has long been considered a hallmark of human intelligence. Recent work demonstrates that sufficiently large language models exhibit emergent reasoning capabilities that are absent in smaller counterparts.

Background

Scaling Laws and Emergence

Kaplan et al. (2020) established that language model performance scales predictably with compute, data, and parameters. However, Wei et al. (2022) identified emergent abilities — capabilities that appear suddenly at certain scale thresholds rather than improving gradually.

Chain-of-Thought Prompting

“Let’s think step by step” — this simple instruction unlocks multi-step reasoning in models above approximately 100B parameters.

The chain-of-thought paradigm (Wei et al., 2022) decomposes complex reasoning into intermediate steps, allowing the model to use its own generated text as a “scratchpad.”

Methodology

We evaluate reasoning across three dimensions:

  1. Arithmetic reasoning — GSM8K benchmark
  2. Commonsense reasoning — StrategyQA
  3. Symbolic reasoning — Last Letter Concatenation

Key Findings

The relationship between model scale and reasoning accuracy follows a sigmoidal pattern rather than a power law. Below 10B parameters, chain-of-thought prompting provides negligible benefit. Above 100B parameters, it consistently outperforms standard prompting.

Conclusion

Reasoning in LLMs is not merely pattern matching but represents a qualitatively new capability that emerges at scale. Future work should investigate whether this emergence can be induced at smaller scales through targeted training.