End of saturation: AMO-Bench

A new olympiad-level math benchmark that reveals the reasoning limits of top large language models.

The saturation problem

Why a new benchmark is needed: existing math benchmarks such as AIME24 and AIME25 face severe performance saturation.

Recent results show that leading models can exceed 90% accuracy on these tests, leaving the benchmarks little headroom to distinguish further progress.

AMO-Bench: advanced math reasoning

To address this, researchers introduce AMO-Bench, an advanced math reasoning benchmark with 50 carefully crafted problems.

Key properties ensure strict evaluation:

  • Very high difficulty: experts cross-validate every problem to ensure it reaches or exceeds International Mathematical Olympiad (IMO) level.
  • Fully original: all 50 problems are newly created, preventing inflated scores from memorization of existing contest sets.
  • Automatic grading: each problem requires only a final answer rather than a full proof, enabling scalable, consistent automatic scoring (see the sketch below).
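
As a rough illustration of final-answer grading, the sketch below checks a model's boxed answer against a reference using exact matching and symbolic equivalence. The normalization rules and the check_answer helper are assumptions for illustration, not the authors' released grader.

    from sympy import simplify, sympify, SympifyError

    def normalize(answer: str) -> str:
        """Strip whitespace, dollar signs, and a \\boxed{...} wrapper from a final answer."""
        ans = answer.strip().replace("$", "")
        if ans.startswith("\\boxed{") and ans.endswith("}"):
            ans = ans[len("\\boxed{"):-1]
        return ans.strip()

    def check_answer(model_output: str, reference: str) -> bool:
        """Return True if the model's final answer matches the reference answer.

        Tries an exact string match first, then symbolic equivalence via SymPy;
        returns False if the expressions cannot be parsed.
        """
        pred, ref = normalize(model_output), normalize(reference)
        if pred == ref:
            return True
        try:
            return simplify(sympify(pred) - sympify(ref)) == 0
        except (SympifyError, TypeError):
            return False

    # Example: equivalent forms of the same final answer are accepted.
    print(check_answer("\\boxed{1/2}", "0.5"))  # True

Final-answer checking of this kind is what lets a 50-problem olympiad-level set be scored automatically and consistently without grading full proofs.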

Key finding: models still struggle

Under the stricter AMO-Bench setting, the apparent advantage of current large language models fades quickly.

Results show that the best-performing model, GPT-5 Thinking High, reaches only 52.4% accuracy.

Most models score below 40% on this benchmark.

These results indicate substantial room for improvement in complex mathematical reasoning.

Performance comparison

AMO-Bench vs. MATH500 (source: Figure 1)

Model                     MATH500 (saturated)   AMO-Bench (new challenge)
GPT-5 Thinking High       99.6%                 52.4%
LongCat Flash Thinking    99.2%                 43.6%
Gemini 2.5 Pro            98.8%                 38.7%

Benchmark construction process

Quality, originality, and difficulty controls

AMO-Bench uses a multi-stage pipeline to maintain a high standard:

Human experts → Data creation → Quality review → Originality review → Difficulty review
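
A minimal sketch of how such staged filtering could be organized is shown below; the Problem fields and the review predicates are hypothetical placeholders, since in AMO-Bench each stage is carried out by human experts rather than code.

    from dataclasses import dataclass
    from typing import Callable, Iterable

    @dataclass
    class Problem:
        statement: str
        answer: str

    def passes_quality(p: Problem) -> bool:
        """Quality review: statement is well posed and the final answer is verified."""
        return bool(p.statement.strip()) and bool(p.answer.strip())

    def passes_originality(p: Problem) -> bool:
        """Originality review: no near-duplicate in existing contest sets (placeholder)."""
        return True  # in practice, experts search public olympiad archives

    def passes_difficulty(p: Problem) -> bool:
        """Difficulty review: cross-validated as IMO level or harder (placeholder)."""
        return True  # in practice, multiple expert reviewers must agree

    def build_benchmark(candidates: Iterable[Problem]) -> list[Problem]:
        """Keep only problems that survive every review stage, applied in order."""
        stages: list[Callable[[Problem], bool]] = [
            passes_quality,
            passes_originality,
            passes_difficulty,
        ]
        return [p for p in candidates if all(stage(p) for stage in stages)]

Only problems that pass every stage enter the final 50-problem set, which is how the benchmark keeps its quality, originality, and difficulty guarantees.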