Gemini Solves Erdős Problems, Highlighting AI Math Research Challenges

2026. február 9. · MI Történik? · 3 perc olvasás

An interdisciplinary group of scientists from Google DeepMind and a bunch of universities have used an internal Google Gemini-based LLM, codenamed Aletheia, to solve some math problems. The results demonstrate that contemporary AI systems can work on the frontiers of science, but also show how evaluating and filtering the solutions they come up with may be an important, challenging task for humans. The key numbers - 700 candidates and 1 creative and interesting solution: Erdős problems are 1000+ open mathematical conjectures left behind by prolific mathematician Paul Erdős at the time of his death. At the time of writing, a few hundred of these problems have been solved. For this research, the researchers tried to see whether their AI system, Aletheia, could generate solutions to any of the 700 remaining open questions. The results: yes, but with many, many caveats. Aletheia was able to surface 200 candidate solutions which humans then needed to grade, slimming down to 63 correct response, and further expert mathematical evaluation slimmed this down to a further subset of only 13 solves that Google calls “correct meaningful responses”. “The remaining 50 of Aletheia’s correct solutions were technically valid but mathematically meaningless because the problem statements were interpreted in a way that did not capture Erdős intent, often (but not always) leading to trivial solutions,” the researchers write. “”Only 13 solutions correctly addressed the intended problem statement (either by invoking the literature, or by a novel argument).” When 13 become 2: When you dig into these 13, the results get a bit less impressive: Who did the research: Along with Google DeepMind, the following universities participated in the research: UC Berkeley, Seoul National University, Stanford University, Korea Institute for Advanced Study, University of Cambridge, Brown University, Yonsei University, Concordia University, Academia Sinica, and National Taiwan University.

5 get classed as “literature identification”: “On these problems, Aletheia found that a solution was already explicitly in the literature, despite the problem being marked “Open” on Bloom’s website at the time of model deployment”.
3 are “partial AI solution”: “On these problems, there were multiple questions and Aletheia found the first correct solution to one of the questions”.
3 are “independent rediscovery”: “On these problems, Aletheia found a correct solution, but human auditors subsequently found an independent solution already in the literature.”
This leaves 2 “autonomous novel solution” solves: “On these problems, Aletheia found the first correct solution (as far as we can tell) in a mathematically substantive way”. Of these, 1 of the solutions seems genuinely interesting: “We tentatively believe Aletheia’s solution to Erdős-1051 represents an early example of an AI system autonomously resolving a slightly non-trivial open Erdős problem of somewhat broader (mild) mathematical interest, for which there exists past literature on closely-related problems [KN16], but none fully resolve Erdős-1051,” they write. “Moreover, it does not appear obvious to us that Aletheia’s solution is directly inspired by any previous human argument”.

Miért fontos?

This paper is a nice example of “O-ring automation” - AI here has massively sped up the art of generating proofs, but it still requires laborious, skilled work by humans to filter this down to the actually correct and useful responses. This trend will likely hold for some years, where AI will not be able to autonomously do science end-to-end, partially because a big chunk of scientific advancement comes down to something you might think of as “expert intuition” which exists in the heads of a small number of living scientists and was refined by their own biological intelligence by reading the same literature as the LLMs. Extracting this kind of expert taste feels like something that is tractable but will take a while. “Large Language Models can easily generate candidate solutions, but the number of experts who can judge the correctness of a solution is relatively small, and even for experts, substantial time is required to carry out such evaluations”, the authors write. “As AI-generated mathematics grows, the community must remain vigilant of “subconscious plagiarism”, whereby AI reproduces knowledge of the literature acquired during training, without proper acknowledgment. Note that formal verification cannot help with any of these difficulties.”

Eredeti forrás megtekintése (angol) →