Researchers at the Center for AI Safety and Scale AI have published “Humanity’s Last Exam” – a test designed to measure how close today’s most powerful artificial intelligence (AI) models are to matching or exceeding human-level knowledge across multiple domains.
The test was launched in January 2025, but the researchers outlined the framework and their thinking behind the design for the first time in a new study published on January 28 in the journal Nature. It contains a corpus of 2,500 questions across more than 100 subjects, with input from more than 1,000 subject matter experts from 500 institutions in 50 countries.
At the launch, the researchers tested OpenAI’s GPT-4o and o1 models, Google’s Gemini 1.5 Pro, Anthropic’s Claude 3.5 Sonnet and DeepSeek R1. OpenAI’s o1 system took the top spot with a score of just 8.3%.
Despite this poor performance, the researchers wrote at the time that “given the rapid pace of AI development, it is likely that models could exceed 50% accuracy on HLE by the end of 2025.”
As of February 12, 2026, the highest score achieved so far is 48.4%, set by Google’s Gemini 3 Deep Think. Human experts, meanwhile, score around 90% in their respective domains.
Testing the smartest machines in the world
Humanity’s Last Exam was intentionally designed to be extremely difficult for AI models. During early development, the researchers issued a global call for submissions from subject matter experts across a range of domains.
The researchers enforced strict submission criteria requiring questions to be precise, unambiguous, solvable and non-searchable. They didn’t want models to cheat by performing a simple web search, nor for questions to already appear online, which would increase the likelihood that a given model had the answer in its training data.
Each question submitted was then fed to the AI models. The team automatically rejected all questions the models could answer correctly.
More than 70,000 questions were submitted, of which approximately 13,000 stumped the models. These were then examined by a team of subject experts, approved by the research team and presented to the scientific community for open feedback.
Ultimately, the researchers reduced the total number of submissions to 2,500 questions that typically fall within the realm of doctoral-level testing.
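The adversarial filtering step described above can be sketched roughly as follows. This is an illustrative reconstruction based only on the process the article describes; the function names, model interface and exact-match check are assumptions, not the researchers’ actual code:

```python
# Illustrative sketch of HLE-style adversarial filtering.
# The model interface (a callable returning a string answer) and the
# exact-match comparison are simplifying assumptions for demonstration.

def model_answers_correctly(question, answer, model):
    """Hypothetical check: does this model already solve the question?"""
    return model(question).strip().lower() == answer.strip().lower()

def filter_submissions(submissions, frontier_models):
    """Keep only questions that stump every frontier model tested."""
    stumpers = []
    for sub in submissions:
        if any(model_answers_correctly(sub["question"], sub["answer"], m)
               for m in frontier_models):
            continue  # auto-reject: at least one model answered correctly
        stumpers.append(sub)  # survives to expert review and open feedback
    return stumpers
```

In the actual pipeline, the surviving ~13,000 questions then went through human expert review before the final corpus of 2,500 was selected.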
An example of an exam trivia question is: “In Greek mythology, who was Jason’s maternal great-grandfather?”
Meanwhile, an example physics question asks about the relationship between various forces during motion in a scenario where a block is placed on a horizontal rail (and can slide frictionlessly) while also being attached to a rigid, massless rod of unknown length.
The breadth of questions and scope of topics covered by Humanity’s Last Exam sets it apart from similar benchmarking tools, its creators say.
Common tests, such as the Massive Multitask Language Understanding (MMLU) dataset, which was co-authored by Center for AI Safety founder Dan Hendrycks, test only a small subset of expert-level domain knowledge, primarily focusing on coding and mathematics.
Even state-of-the-art benchmarks such as François Chollet’s ARC-AGI suite struggle to overcome the memorization and searchability issues that the creators of Humanity’s Last Exam say their test addresses. Gemini’s Deep Think, for example, scored 84.6% on the ARC-AGI-2 benchmark just a week after failing to reach 50% on HLE.
The ultimate prize is general intelligence
Humanity’s Last Exam probably represents the AI world’s best attempt to date to measure the wide-ranging capabilities of modern AI models against human experts, but the study’s authors categorically state that achieving a high score on the HLE is in no way indicative of the arrival of artificial general intelligence (AGI).
“High accuracy on HLE would demonstrate expert-level performance on closed, verifiable questions and cutting-edge scientific knowledge, but it would not alone suggest autonomous research capabilities or artificial general intelligence,” the researchers said in the study.
“Doing well on HLE is a necessary but not a sufficient criterion for saying that machines have reached true intelligence,” Manuel Schottdorf, a neuroscientist in the University of Delaware’s Department of Psychological and Brain Sciences, said in a recent statement. Schottdorf is one of the many experts whose questions were accepted into HLE’s corpus.
“They must be good enough to solve these questions, but that, as a fact alone, cannot allow us to conclude that machines are truly intelligent.”