A new olympiad-level math benchmark that reveals the reasoning limits of top large language models.
Why a new benchmark is needed: existing math benchmarks such as AIME24 and AIME25 suffer from severe performance saturation.
Leading models now exceed 90% accuracy on these tests, leaving the benchmarks with little headroom to distinguish further progress.
To address this, researchers introduce AMO-Bench, an advanced math reasoning benchmark with 50 carefully crafted problems.
The benchmark's key properties are designed to enforce a strict evaluation.
Under the stricter AMO-Bench setting, the apparent advantage of current large language models fades quickly.
Results show that the best-performing model, GPT-5 Thinking High, reaches only 52.4% accuracy.
Most models score below 40% on this benchmark.
These results indicate substantial room for improvement in complex mathematical reasoning.
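For context on how such accuracy figures can be produced, here is a minimal sketch of automatic scoring for a final-answer benchmark like AMO-Bench. The JSONL layout, the field names "id" and "answer", and the normalization rules are assumptions for illustration only, not the authors' official grading harness.

```python
import json
import re


def normalize_answer(ans: str) -> str:
    """Canonicalize a final answer for exact-match comparison:
    strip whitespace, surrounding $...$ delimiters, and a trailing period."""
    ans = ans.strip().rstrip(".")
    ans = re.sub(r"^\$+|\$+$", "", ans).strip()
    return ans


def score_final_answers(problems_path: str, predictions: dict[str, str]) -> float:
    """Compute accuracy on a final-answer benchmark.

    `problems_path` is a JSONL file whose records carry hypothetical
    "id" and "answer" fields; `predictions` maps problem ids to the
    final answers extracted from a model's responses.
    """
    total = correct = 0
    with open(problems_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            total += 1
            predicted = predictions.get(item["id"], "")
            if normalize_answer(predicted) == normalize_answer(item["answer"]):
                correct += 1
    return correct / total if total else 0.0
```

Note that a score such as 52.4% is not a multiple of 2% (1/50), which suggests the reported numbers average accuracy over multiple sampled responses per problem; the sketch above scores a single response set and would simply be averaged across samples in that case.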
AMO-Bench uses a multi-stage pipeline to maintain a high standard: