Humanity’s Last Exam

Generative Artificial Intelligences (GenAI) can now pass exams we set for humans and even do better than many humans. They can do that without being able to think in the way a human does, and certainly without being conscious. They are learning to reason, and are combining that with having hoovered up all the knowledge we have generated and recorded, whether on the web or elsewhere. In effect, they use it to predict what comes next. In an exam, what comes next after a question is the answer, so that is what they generate. But how good are they at doing that, really? As good as a good school student? As good as a university student? A PhD student? A Professor? Better than any human? Is there any question we could come up with, as examiners representing the human race, that a GenAI couldn’t answer? The SafeAI Benchmark Competition “Humanity’s Last Exam” is an attempt to find out.

Computer systems, including AI-based ones, are typically evaluated using benchmark questions that assess their intelligence and performance. They are the equivalent of big standardised exams. However, as AI models have rapidly advanced, existing benchmarks have become too easy. The “Humanity’s Last Exam” competition aimed to change this by collecting a new benchmark set of exceptionally difficult questions. The aim was to push artificial intelligence to its limits by challenging it with truly expert-level questions. To stack the deck in our favour, any AI aiming to pass needed to be an expert in every subject, not just one or two!

Experts from across the disciplines were challenged to come up with questions in their area that they thought an AI would not be able to answer. The competition was a big success. It attracted more than 1,000 researchers and other experts, who submitted questions (with the correct answers) spanning over 100 different subjects. From all these suggested questions, a final set was selected in three stages.

First came AI Evaluation: five of the best AI models of late 2024 attempted each question. Only if all five failed did the question advance to the next stage. Second came Expert Review: human experts refined and assessed the questions and answers. They had to make sure that each question had a known answer that they were sure was correct. The questions also had to be clear: they couldn’t be ambiguous in a way that allowed more than one answer to be considered correct. Finally came Final Selection: a panel of experts and organisers made the final call on which questions were actually to be used.

Out of over 70,000 questions submitted to stage 1, only 2,500 made it into the final benchmark, with the top 50 declared winners, earning the person who submitted each one a prize. In addition, the winners were invited to become co-authors of the research paper accompanying the competition.

Two computer scientists from QMUL, Søren Riis and Marc Roth, contributed multiple questions to the competition, and despite how many questions failed to make the grade, both were joint winners. Moreover, one of Marc’s questions was selected to be featured in the Nature paper about the results.

But what does a good question look like? To see, let’s look at one of Marc’s selected questions. It concerned the process of “discovering” a network, meaning visiting all the nodes of an unknown network. What does this involve? Imagine a mouse is placed in a maze and starts to explore it. The maze is a kind of network with nodes (the junctions) and edges (the paths between them). The mouse, as it explores, is discovering that network. Suppose it does so randomly. Whenever it reaches a junction, it chooses one of the outgoing directions totally at random and continues exploring in that direction. We are interested in several things: how long will it take the mouse, on average, to explore the entire maze? How often will any specific location be visited by the mouse? And how likely is it that the mouse will be at any specific location at the end of its exploration?
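To make that concrete, here is a minimal sketch (in Python, not part of the competition materials) of the naive mouse: a simple random walk on a small made-up maze, estimating the three things just mentioned. The maze layout and number of runs are purely illustrative choices.

```python
import random
from collections import Counter

# A toy maze: junctions 0..4; each junction lists the junctions you can walk to.
maze = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1, 4],
    3: [1, 4],
    4: [2, 3],
}

def explore(maze, start=0):
    """Random walk until every junction has been visited at least once."""
    visits = Counter({start: 1})
    here, steps = start, 0
    while len(visits) < len(maze):
        here = random.choice(maze[here])  # pick an outgoing path completely at random
        visits[here] += 1
        steps += 1
    return steps, visits, here

runs = 5000
steps_total, visit_total, end_total = 0, Counter(), Counter()
for _ in range(runs):
    steps, visits, end = explore(maze)
    steps_total += steps
    visit_total += visits
    end_total[end] += 1

print("average steps to see every junction:", steps_total / runs)
print("average visits per junction:", {v: round(visit_total[v] / runs, 2) for v in maze})
print("where the mouse ends up:", {v: round(end_total[v] / runs, 2) for v in maze})
```

Running it repeatedly and averaging is how the simulation estimates the quantities; the mathematical versions of Marc’s question ask for exact answers about this kind of behaviour.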

The AIs were asked about a variation of this in which the mouse uses a specific, cleverer random strategy given in the question, rather than just choosing a direction totally at random at each junction. The AIs had to predict the behaviour of a mouse following this new strategy on different types of mazes. Surprisingly perhaps, even the best AIs at the time of the competition (2024) were unable to solve the problem correctly. They all claimed that the updated strategy makes no difference to the overall behaviour compared to the original naive random strategy, in terms of the things of interest (like time taken). This is wrong, as there are actually clear differences in the behaviour resulting from the two strategies. That was something Marc himself was able to work out correctly. Humans: 1 (well, at least if you are Marc), AIs: 0.
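Marc’s actual “cleverer” strategy isn’t reproduced here, so the sketch below does not recreate his question. But to get a feel for why a seemingly small change to the rule can change the behaviour a lot, here is a purely hypothetical comparison: the naive walk versus a mouse that never immediately retraces its last step, on a circular maze.

```python
import random

def cover_time(n, non_backtracking=False):
    """Steps for a walk on a ring of n junctions to visit them all.
    non_backtracking=True: the mouse never immediately retraces its last step
    (a hypothetical 'cleverer' rule, NOT the one from the competition question)."""
    here, previous = 0, None
    seen, steps = {0}, 0
    while len(seen) < n:
        options = [(here - 1) % n, (here + 1) % n]
        if non_backtracking and previous in options and len(options) > 1:
            options.remove(previous)
        nxt = random.choice(options)
        previous, here = here, nxt
        seen.add(here)
        steps += 1
    return steps

n, runs = 30, 2000
naive = sum(cover_time(n) for _ in range(runs)) / runs
clever = sum(cover_time(n, True) for _ in range(runs)) / runs
print(f"naive walk: about {naive:.0f} steps; non-backtracking walk: about {clever:.0f} steps")
```

On a 30-junction ring the naive mouse needs roughly n(n-1)/2 = 435 steps on average to see everything, while the never-backtracking mouse marches round in one direction and needs exactly n-1 = 29. Even a tiny tweak to the strategy can change the answers dramatically, which is exactly the kind of difference the AIs failed to spot.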

The first version of the overall benchmark (so AI exam) was set and finalised in early 2025. The best two AIs (OpenAI o1 and DeepSeek R1) got about 8% of the questions right. One year later, Gemini 3 Pro achieved a staggering 38.3%! Its true performance might be even better, since the benchmark set might still contain some ambiguous questions with no clear right answer, and some questions where the given expert answers are themselves incomplete or incorrect. This is believed to be most likely for the text-only chemistry and biology questions: so more work for the chemists and biologists!

Because of the need to keep working on the questions to make sure they are definitely correct and unambiguous, the “Humanity’s Last Exam” team has now switched to improving the questions on a rolling basis over the coming years. The AIs are not going to be free from taking exams for some time to come! But it may not be long before humanity runs out of questions. In the meantime, anyone thinking that human examiners just need to come up with better questions to avoid the problem of students asking AIs to answer for them had better think again. Even the best experts in the world are struggling to find questions no AI can answer. And if the AIs can’t answer them this year, there is always next year, or the year after…

Marc Roth and Paul Curzon, Queen Mary University of London
