A new olympiad-level math benchmark that reveals the reasoning limits of top large language models.
Why a new benchmark is needed: existing math benchmarks such as AIME24 and AIME25 suffer from severe performance saturation.
Leading models now exceed 90% accuracy on these tests, leaving the benchmarks with little headroom to distinguish further progress.
To address this, researchers introduce AMO-Bench, an advanced math reasoning benchmark with 50 carefully crafted problems.
The benchmark's key properties are designed to enforce a strict evaluation.
Under the stricter AMO-Bench setting, the apparent advantage of current large language models fades quickly.
Results show that the best-performing model, GPT-5 Thinking High, reaches only 52.4% accuracy.
Most models score below 40% on this benchmark.
These results indicate substantial room for improvement in complex mathematical reasoning.
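For context on how such accuracy figures can be produced, here is a minimal sketch of automatic scoring for a final-answer benchmark like AMO-Bench. The JSONL layout, the field names "id" and "answer", and the normalization rules are assumptions for illustration only, not the authors' official grading harness.

```python
import json
import re


def normalize_answer(ans: str) -> str:
    """Canonicalize a final answer for exact-match comparison:
    strip whitespace, surrounding $...$ delimiters, and a trailing period."""
    ans = ans.strip().rstrip(".")
    ans = re.sub(r"^\$+|\$+$", "", ans).strip()
    return ans


def score_final_answers(problems_path: str, predictions: dict[str, str]) -> float:
    """Compute accuracy on a final-answer benchmark.

    `problems_path` is a JSONL file whose records carry hypothetical
    "id" and "answer" fields; `predictions` maps problem ids to the
    final answers extracted from a model's responses.
    """
    total = correct = 0
    with open(problems_path, encoding="utf-8") as f:
        for line in f:
            item = json.loads(line)
            total += 1
            predicted = predictions.get(item["id"], "")
            if normalize_answer(predicted) == normalize_answer(item["answer"]):
                correct += 1
    return correct / total if total else 0.0
```

Note that a score such as 52.4% is not a multiple of 2% (1/50), which suggests the reported numbers average accuracy over multiple sampled responses per problem; the sketch above scores a single response set and would simply be averaged across samples in that case.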
AMO-Bench uses a multi-stage pipeline to maintain a high standard: