As AI continues to improve, mathematicians struggle to predict their own future


A new round has begun in artificial intelligence companies’ ongoing campaign to take over pure mathematics.

The team behind First Proof, an effort to benchmark the ability of large language models (LLMs) to contribute to research-level mathematics, has announced its next exam. For this second round, which the team plans to roll out over the next few months, it is requiring access and transparency from all AI companies that want to participate.

This is happening in the midst of a change in mathematics research. In recent months, the best publicly available models have begun to generate valid proofs of minor theorems of actual use to working mathematicians. For some experts, the opening round of First Proof was a pivotal moment in this ongoing story.

“We were pretty impressed with how the AI models did,” says Lauren Williams, a Harvard University mathematician and First Proof team member. “The problems we proposed are really at the forefront of what AI models – perhaps together with experts – can solve.”

First Proof grew out of the 11-person team’s own eye-opening – if sometimes frustrating – experiences with AI. No existing benchmark seemed adequate to test LLMs as mathematics assistants. In principle, an LLM can save time by proving smaller “lemmas” – intermediate propositions along a mathematician’s path to developing larger theorems of greater interest. In practice, however, such AI assistants have tended to go astray.

So for their initial, “experimental” test, the First Proof team decided on 10 lemmas from papers that members had written but not yet published, and then set a one-week deadline for AI companies (and everyone else) to try to prove those propositions using their favorite models.

Groups from both OpenAI and Google published their LLMs’ answers to all the problems. Five of the OpenAI model’s proofs appeared to be correct, and Google DeepMind’s Aletheia agent appeared to score six (although experts aren’t unanimous about the validity of all of those proofs). Comparing the performance of the two models, Williams was surprised to find that each had solved several problems that the other could not. “It’s interesting to see that their abilities are different,” she says.

“The performance was higher than I expected,” says Daniel Litt, a mathematician at the University of Toronto, who is not directly involved in the First Proof effort. In all, as many as eight of the 10 problems appear to have been at least partially solved by AI. “It’s clear that capabilities have improved very quickly,” says Litt.

A hazy but hopeful future

Litt is not afraid of AI’s growing mathematical prowess. “I don’t expect to be useless five years from now,” he says. “I actually expect to be doing the best work I’ve ever done because I’ll have these amazing tools.” In fact, the First Proof results inspired him to write an essay, widely circulated among mathematicians in recent weeks, that presents a speculative, optimistic view of the field’s AI-infused future.

For the sake of argument, Litt imagines a hypothetical library generated by superintelligent AIs and containing every possible proof in the mathematical universe. A human mathematician wandering among its countless shelves could read any of the volumes but could never add a new proof of their own.

But that does not mean that mathematicians would be paralyzed by ennui, says Litt. Far from it. “They would be incredibly excited and immediately start working,” he wrote in the essay. The mathematical universe is so vast, he says, that the joy lies in exploring it, whether by reading and digesting a proof or by writing a new one. “My job wouldn’t even change at all,” he says. “The job now is to try to understand things.”

Whether or not mathematicians agree with Litt’s decidedly utopian take on this thought experiment, the current situation falls far short of that lofty ideal, as First Proof’s first round made evident. “Together, the models solved maybe eight of the problems,” he says. “But they also produced thousands and thousands of pages of garbage.”

Current AIs, it turns out, are often wrong but convincingly so. They will cite a result from the literature but pretend it is stronger than it actually is. Or they will bury a crucial error deep in a tedious calculation, where it is easy to miss. “Students make mistakes, but their mistakes are honest ones,” says Litt. “The models are not very honest.”

This qualitative difference in the kinds of errors LLMs produce can make judging their responses very challenging. “One of the things we learned from this first round is how difficult it can be to check the correctness of the results,” says Mohammed Abouzaid, a First Proof team member and mathematician at Stanford University. “You almost want to say, ‘No human who knew what all these words mean would make this mistake!’”

For round two, the team plans to outsource the task of evaluating each entry to mathematicians hired as anonymous reviewers, funded by a mix of grant money and donations from AI companies. But with no sign of the onslaught of machine-generated mathematics abating, a deluge of subtly incorrect, LLM-written proofs may soon overwhelm the supply of human reviewers. “People need to start thinking about this,” says Litt. “Our institutions and the profession are not adapting to what’s coming down the line.”

An unexplained gap

The first round apparently revealed a sharp divide between public and proprietary efforts. That seems to challenge the notion that AI’s usurpation of human skills will at least democratize those skills, for example, by expanding who is able to contribute meaningfully to mathematics research.

In the team’s internal tests before the first round’s 10 lemmas were published, even the best publicly available models could prove only two of them. During the weeklong test period, various groups of amateur and professional mathematicians tried to do better by building “scaffolds”: collaborative networks of LLMs that talk to one another to catch errors. But all of these efforts solved only one additional problem.
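
The article does not describe how any of these scaffolds were actually built. As a rough illustration only, here is a minimal Python sketch of the general pattern: one model drafts a proof while a second model critiques it in a loop. Every name here (scaffold_prove, toy_prover and so on) is hypothetical, and a real scaffold would call actual LLM APIs and be far more elaborate.

```python
# Hypothetical sketch of an LLM "scaffold": a drafting model and a
# critiquing model exchange messages until the critic finds no flaw
# or a retry budget runs out. The "models" here are stand-in callables,
# not any real LLM API.
from typing import Callable, Optional

Model = Callable[[str], str]  # prompt in, text out

def scaffold_prove(statement: str, prover: Model, critic: Model,
                   max_rounds: int = 3) -> Optional[str]:
    """Iteratively draft and critique a proof attempt."""
    draft = prover(f"Prove: {statement}")
    for _ in range(max_rounds):
        review = critic(f"Find a flaw in this proof of '{statement}':\n{draft}")
        if review.strip().upper() == "NO FLAW FOUND":
            return draft  # the critic is satisfied; accept the draft
        # Fold the critique back into the next drafting prompt.
        draft = prover(f"Prove: {statement}\nAvoid this flaw: {review}")
    return None  # budget exhausted without an accepted draft

# Toy stand-ins so the sketch runs without any external service.
def toy_prover(prompt: str) -> str:
    return "Suppose not; then ... contradiction. QED."

def toy_critic(prompt: str) -> str:
    return "NO FLAW FOUND"

if __name__ == "__main__":
    result = scaffold_prove("every toy lemma holds", toy_prover, toy_critic)
    print(result or "no accepted proof")
```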

A few different factors may explain why Google and OpenAI were able to (at least partially) solve eight problems versus the public’s three. The companies may have used improved, unreleased versions of their LLMs or more robust internal scaffolding. Or the answers may have relied on undisclosed input from human mathematicians. (Google’s team posted an explanation of its method and said the approach involved “absolutely no human intervention,” the kind of claim that First Proof’s new requirements are meant to let the team verify in the second round.)

That is what the second round is supposed to sort out, says Williams. “This was an experiment,” she says, “to get feedback from the community to figure out how to do a more formal round.”

In addition to more robust human judging, this round will require participants to package their models so that the First Proof team can query them directly. “If there’s not a public model, we have to run it,” says Abouzaid, “because otherwise it’s not clear what we’re testing.”

It remains to be seen whether OpenAI and Google will comply – or whether the many other LLM companies and AI-for-math startups that were conspicuously absent during the first round will.

In the coming months, First Proof and other AI benchmarks may help predict the still-unclear fate of mathematics, a small niche of the scientific world that suddenly has the eyes of some of the world’s richest companies trained on it.

“One of our main motivations is to make sure we can tell young people what we expect the field to look like in a few years,” says Abouzaid. “And that requires understanding what these systems are actually capable of.”
