Humanity’s Last Exam

Generative Artificial Intelligences (GenAI) can now pass exams we set for humans, and even do better than many humans. They can do that without being able to think the way a human does, and certainly without being conscious. They are learning to reason, and they combine that with having hoovered up all the knowledge we have generated and recorded, whether on the web or elsewhere. In effect, they use it all to predict what comes next. In an exam, what comes next after a question is the answer, so that is what they generate. But how good are they at doing that, really? As good as a good school student? As good as a university student? A PhD student? A professor? Better than any human? Is there any question we could come up with, as examiners representing the human race, that a GenAI couldn’t answer? The SafeAI Benchmark Competition “Humanity’s Last Exam” is an attempt to find out.

Computer systems, including AI-based ones, are typically evaluated using benchmark questions that assess their intelligence and performance. They are the equivalent of big standardised exams. However, as AI models have rapidly advanced, existing benchmarks have become too easy. The “Humanity’s Last Exam” competition aimed to change this by collecting a new benchmark set of exceptionally difficult questions. The aim was to push artificial intelligence to its limits by challenging it with truly expert-level questions. To stack the deck in our favour, any AI aiming to pass needed to be an expert in every subject, not just one or two!

Experts from across the disciplines were challenged to come up with questions in their area that they thought an AI would not be able to answer. The competition was a big success, attracting more than 1,000 researchers and other experts. They submitted questions (with the correct answers) spanning over 100 different subjects. From all these suggested questions, a solid set was selected in three stages.

First came AI evaluation: five of the best AI models of late 2024 attempted each question. If all of them failed it, the question advanced to the next stage. Second came expert review: human experts refined and assessed the questions and answers. They had to make sure that each question had a known answer that they were sure was correct. The questions also had to be clear: they couldn’t be so ambiguous that more than one answer might be considered correct. Finally came the final selection: a panel of experts and organisers made the final call on which questions would actually be used.

Out of the more than 70,000 questions submitted to stage 1, only 2,500 made it into the final benchmark. The top 50 were declared winners, with the person submitting each winning question earning a prize. In addition, they were invited to become co-authors of the research paper accompanying the competition.

Two computer scientists from QMUL, Søren Riis and Marc Roth, contributed multiple questions to the competition, and despite the number of questions overall that failed to make the grade, both were joint winners. Moreover, one of Marc’s questions was selected to be featured in the Nature paper about the results.

But what does a good question look like? To see, let’s look at one of Marc’s selected questions. It concerned the process of “discovering” a network, meaning visiting all the nodes of an unknown network. What does this involve? Imagine a mouse is placed in a maze and starts to explore it. The maze is a kind of network with nodes (the junctions) and edges (the paths between them). The mouse, as it explores, is discovering that network. Suppose it does this randomly: whenever it reaches a junction, it chooses one of the outgoing directions totally at random and continues exploring in that direction. We are interested in several things: how long will it take the mouse, on average, to explore the entire maze? How often will any specific location be visited? And how likely is it for the mouse to be at any specific location at the end of its exploration?
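If you fancy trying the naive strategy yourself, here is a minimal sketch in Python (the tiny maze is invented for illustration; it is not the network from Marc’s question). Running lots of simulated mice estimates the average time to visit every junction:

    import random

    # A toy maze as a graph: each junction lists the junctions it connects to.
    maze = {
        0: [1, 2],
        1: [0, 3],
        2: [0, 3],
        3: [1, 2, 4],
        4: [3],
    }

    def explore(maze, start=0):
        """Walk randomly until every junction has been visited at least once."""
        visits = {junction: 0 for junction in maze}
        current = start
        visits[current] += 1
        steps = 0
        while any(count == 0 for count in visits.values()):
            current = random.choice(maze[current])  # pick an exit at random
            visits[current] += 1
            steps += 1
        return steps, visits, current  # time taken, visit counts, final spot

    # Average the exploration time over many simulated mice.
    runs = 10000
    average = sum(explore(maze)[0] for _ in range(runs)) / runs
    print("average steps to explore the whole maze:", average)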

The AIs were asked about a variation of this in which the mouse uses a specific but cleverer random strategy, given in the question, rather than just choosing a direction totally at random at each junction. The AIs had to predict the behaviour of a mouse following this new strategy on different types of mazes. Surprisingly perhaps, even the best AIs at the time of the competition (2024) were unable to solve the problem correctly. They all claimed that the updated strategy makes no difference to the overall behaviour compared to the original naive random strategy, in terms of the things of interest (like time taken). This is wrong, as there are actually clear differences in the behaviour resulting from the two strategies. That was something Marc himself was able to work out correctly. Humans: 1 (well, at least if you are Marc), AIs: 0.

The first version of the overall benchmark (so the AI exam) was set and finalised in early 2025. The best two AIs (OpenAI o1 and DeepSeek R1) got about 8% of the questions right. One year later, Gemini 3 Pro achieved a staggering 38.3%! Its true performance might be even better, since the benchmark set might still contain some ambiguous questions with no clear right answer, and some questions where the given expert answers are incomplete or incorrect. This is believed to be a possibility mainly for the text-only chemistry and biology questions: so more work for the chemists and biologists!

Because of the need to keep working on the questions to make sure they are definitely correct and unambiguous, the “Humanity’s Last Exam” team has now switched to working on the questions on a rolling basis, aiming to improve them over the coming years. The AIs are not going to be free from taking exams for some time to come! But it may not be long before humanity runs out of questions. In the meantime, anyone thinking that human examiners just need to come up with better questions to avoid the problem of students asking AIs to answer questions for them had better think again. Even the best experts in the world are struggling to find questions no AI can answer. And if the AIs can’t answer them this year, there is always next year, or the year after…

Marc Roth and Paul Curzon, Queen Mary University of London



AMPER: AI helping future you remember past you

by Jo Brodie, Queen Mary University of London

Have you ever heard a grown-up say “I’d completely forgotten about that!” and then share a story from some long-forgotten memory? While most of us can remember all sorts of things from our own life history, it sometimes takes a particular cue for us to suddenly recall something that we’d not thought about for years or even decades.

As we go through life we add more and more memories to our own personal library, but those memories aren’t neatly organised like books on a shelf. For example, can you remember what you were doing on Thursday 20th September 2018 (or can you think of a way that would help you find out)? You’re more likely to be able to remember what you were doing on the last Tuesday in December 2018 (but only because it was Christmas Day!). You might not spontaneously recall a particular toy from your childhood but if someone were to put it in your hands the memories about how you played with it might come flooding back.

Accessing old memories

In Alzheimer’s disease (a type of dementia) people find it harder to form new memories or retain more recent information, which can make daily life difficult and bewildering, and they may lose their self-confidence. Their older memories, the ones made when they were younger, are often less affected, however. The memories are still there but might need drawing out with a prompt to help bring them to the surface.

Perhaps a newspaper advert will jog your memory in years to come… Image by G.C. from Pixabay

An EPSRC-funded project at Heriot-Watt University in Scotland is developing a tablet-based ‘story facilitator’ agent (a software program designed to adapt its responses to human interaction) which uses artificial intelligence to help people with Alzheimer’s disease and their carers. The device, called ‘AMPER’*, could improve wellbeing and a sense of self in people with dementia by helping them to uncover their ‘autobiographical memories’ about their own life and experiences – and also help their carers remember them ‘before the disease’.

Our ‘reminiscence bump’

We form some of our most important memories between our teenage years and early adulthood: we start to develop our own interests in music and the subjects we like studying, we might experience first loves, perhaps go to university, start a career and maybe a family. We also all live through a particular period of time, experiencing the same world events as others of the same age, and those experiences are fitted into our ‘memory banks’ too. If someone was born in the 1950s then their ‘reminiscence bump’ will cover events from the 1970s and 1980s. Those memories are usually more available, so people affected by Alzheimer’s disease are able to access them until the more advanced stages of the disease. They are the big important things that, when we’re older, we’ll remember more easily if prompted.

In years to come you might remember fun nights out with friends.
Image by ericbarns from Pixabay

Talking and reminiscing about past life events can help people with dementia by reinforcing their self-identity, and increasing their ability to communicate – at a time when they might otherwise feel rather lost and distressed. 

“AMPER will explore the potential for AI to help access an individual’s personal memories residing in the still viable regions of the brain by creating natural, relatable stories. These will be tailored to their unique life experiences, age, social context and changing needs to encourage reminiscing.”

Dr Mei Yii Lim, who came up with the idea for AMPER (3).

Saving your preferences

AMPER comes pre-loaded with publicly available information (such as photographs, news clippings or videos) about world events that would be familiar to an older person. It’s also given information about the person’s likes and interests. It offers examples of these as suggested discussion prompts, and the person with Alzheimer’s disease can decide with their carer what they might want to explore and talk about. Here comes the clever bit – AMPER also contains an AI feature that lets it adapt to the person with dementia. If the person selects certain things to talk about instead of others, then in future the AI can suggest more things related to their preferences over less preferred ones. Each choice the person with dementia makes now reinforces what the AI will show them in future. That might include a preference for watching a video or looking at photos over reading something, and the AI can adjust to shorter attention spans if necessary.
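AMPER’s actual algorithm isn’t spelled out here, but a toy sketch of the kind of preference reinforcement just described might look like the following Python (every category name and number below is invented for illustration): each kind of prompt has a weight, choosing one nudges its weight up, and suggestions are drawn in proportion to the weights.

    import random

    # Invented prompt categories and starting weights (not AMPER's real data).
    weights = {"photos": 1.0, "videos": 1.0, "news clippings": 1.0, "music": 1.0}

    def suggest():
        """Pick a category with probability proportional to its weight."""
        categories = list(weights)
        return random.choices(categories,
                              weights=[weights[c] for c in categories])[0]

    def record_choice(category, boost=0.5):
        """Each choice the person makes reinforces that category."""
        weights[category] += boost

    # If photos keep being chosen, photos get suggested more often in future.
    for _ in range(5):
        record_choice("photos")
    print(suggest())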

Reminiscence therapy is a way of coordinated storytelling with people who have dementia, in which you exercise their early memories, which tend to be retained much longer than more recent ones, and produce an interesting interactive experience for them, often using supporting materials — so you might use photographs for instance.

Prof Ruth Aylett, the AMPER project’s lead at Heriot-Watt University (4).

When we look at a photograph, for example, the memories it brings up haven’t been organised neatly in our brain like a database. Our memories form connections with all our other memories, more like the branches of a tree. We might remember the people that we’re with in the photo, then remember other fun events we had with them, perhaps places that we visited and the sights and smells we experienced there. AMPER’s AI can mimic the way our memories branch and show new information prompts based on the person’s previous interactions.

As well as helping someone with dementia rediscover themselves and their memories, AMPER can also help carers in care homes (who didn’t know them when they were younger) learn more about the person they’re caring for.

*AMPER stands for ‘Agent-based Memory Prosthesis to Encourage Reminiscing’.


Suggested classroom activities – find some prompts!

  • What’s the first big news story you and your class remember hearing about? Do you think you will remember that in 60 years’ time?
  • What sort of information about world or local events might you gather to help prompt the memories of someone born in 1942, 1959, 1973 or 1997? (Remember that their reminiscence bump will peak in the 15 to 30 years after they were born – some of them may still be making those memories!)

See also

If you live near Blackheath in South East London, why not visit the Age Exchange Reminiscence Centre, run by an arts charity providing creative group activities for those living with dementia and their carers? It has a very nice cafe.

Related careers

The AMPER project is interdisciplinary, mixing robots and technology with psychology, healthcare and medical regulation.

We have information about four similar-ish job roles on our TechDevJobs blog that might be of interest. They were a group of job adverts for roles in the Netherlands related to the ‘Dramaturgy^ for Devices’ project, which links technology with the performing arts to adapt robots’ behaviour and improve their social interaction and communication skills.

Below is a list of four job adverts (which have now closed!) which include information about the job description, the types of people that the employers were looking for and the way in which they wanted them to apply. You can find our full list of jobs that involve computer science directly or indirectly here.

^Dramaturgy refers to the study of the theatre, plays and other artistic performances.

Dramaturgy for Devices – job descriptions

More on …

1. Agent-based Memory Prosthesis to Encourage Reminiscing (AMPER) – Gateway to Research
2. The Digital Human: Reminiscence (13 November 2023) BBC Sounds – a radio programme that talks about the AMPER project.
3. Storytelling AI set to improve wellbeing of people with dementia (14 March 2022) Heriot-Watt University news.
4. AMPER project to improve life for people with dementia (14 January 2022) The Engineer.




This blog is funded by EPSRC on research agreement EP/W033615/1.


Playing the weighting game

by Paul Curzon, Queen Mary University of London

Imagine having a reality TV show where yet again Simon Cowell is looking for talent. This time it’s talent with a difference though, not stars to entertain us but ones with the raw ability to help find webpages. Yes, this time the budding stars are all words. Word Idol is here!

The format is simple. Each week Simon’s aim is to find talented words to create a new group: a group with star quality, a group with meaning. Like any talent competition, there are thousands of entries. Every word in every webpage out there wants to take part. They all have to be judged, but what do the specialist judges look for?

OK, we’re getting carried away. Simon Cowell may not be interested but there is big money in the idea. It’s a talent show that is happening all the time. The aim is to judge the words in each new webpage as it appears so that search engines can find it if ever someone goes looking. The real star of this show isn’t Simon Cowell but a Cambridge professor, Karen Spärck Jones. She came up with the way to judge words.

Karen worked out that, to do this kind of judging, a computer needs a thesaurus: a book that lists groups of words that mean the same thing. A computer, Karen realised, could use one to understand what words mean.


The fact that there are so many ways to say the same thing in human languages makes it really hard for a computer to understand what we write. That is where a thesaurus comes in. If you ask a computer to search for web pages about whales, for example, it helps it to know that a page that talks about orcas is about whales too. Worse still, most words have more than one meaning, a fact that keeps crossword lovers in business.

Take the following example: “Leona is the new big star of the music business.”

The word ‘star’ here obviously means a celebrity, but how do you know? It could also mean a sun or a shape. The fact that it appears with the word ‘music’ helps you work out which meaning is right, even if you have no idea who or what Leona is. As Karen realised, a computer can also work out the intended meanings of words from the other words used with them. A thesaurus tells it what the critical groupings are. But what Karen wanted was a way for a computer to work the thesaurus out for itself, and now she had one.
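As a rough illustration of the trick (with invented word groups standing in for a real thesaurus), a program can score each meaning of ‘star’ by how many of its grouped words appear in the sentence:

    # Invented thesaurus groups: each meaning of "star" and words signalling it.
    meanings = {
        "celebrity": {"music", "film", "fame", "business"},
        "sun": {"galaxy", "sky", "light", "planet"},
        "shape": {"points", "draw", "pattern", "five"},
    }

    sentence = set("leona is the new big star of the music business".split())

    # Pick the meaning whose signal words overlap most with the sentence.
    best = max(meanings, key=lambda meaning: len(meanings[meaning] & sentence))
    print(best)  # prints: celebrity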

Her early approach was to write a program that takes lots and lots of documents and makes lists of the words that keep appearing close together. If ‘music’ appears with ‘star’ a lot, then that pairing captures a meaning. After building up a big collection of such lists of linked words, the program can then use it to decide which pages are talking about the same thing, and so which ones to suggest when a search is done. So Karen had found the first way to judge whether a word has the right ‘talent’ to go in a group: the more often words appear together, the higher the score or ‘weighting’ they should be given. Simple!
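A toy version of that early approach (just a sketch, not Karen’s actual program) counts, for every pair of words, how many documents they appear in together:

    from collections import Counter
    from itertools import combinations

    # A handful of made-up 'documents' standing in for web pages.
    documents = [
        "leona is the new big star of the music business",
        "the music star topped the charts",
        "astronomers spotted a new star in a distant galaxy",
    ]

    pair_counts = Counter()
    for doc in documents:
        words = sorted(set(doc.split()))  # each distinct word once per page
        for pair in combinations(words, 2):
            pair_counts[pair] += 1

    # Pairs that keep appearing together hint at a shared meaning.
    print(pair_counts[("music", "star")])  # together in 2 of the 3 pages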

The only trouble is, it doesn’t really work. That is where Karen’s big insight came in. She realised that if two words appear together in a lot of different documents then, surprisingly perhaps, putting them together in a group isn’t actually that useful for finding documents! Do a search on them and you will just be told that lots of web pages match. What you really want is to be told of the few web pages that contain the meaning you are looking for, not lots and lots that don’t.

The important word groupings are actually those found in only a small number of web pages, which suggests they carry a very focused meaning. Word groups like that help you narrow down the search. So Karen now had a better way to judge word talent: give high marks to pairs that do appear together, but in as few web pages as possible. Rather than a talent show, it is more like a giant game of the quiz show Pointless, where you win by picking the answers few other people did.

That idea was the big breakthrough and led to what is now called IDF (inverse document frequency) weighting. It is the way to judge words, and it is so good that it’s now used by pretty much every search engine out there. Playing the IDF weighting game may not make great TV, but thanks to Karen it really does make for a great web.
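In modern terms, a word’s IDF score is the logarithm of the total number of documents divided by the number of documents containing the word: words found everywhere score near zero, while rare, focused words score high. Here is a minimal sketch, with made-up example documents:

    import math

    def idf(word, documents):
        """Inverse document frequency: rare words get high scores."""
        containing = sum(1 for doc in documents if word in doc.split())
        return math.log(len(documents) / containing) if containing else 0.0

    documents = [
        "the music star topped the charts",
        "a new star appeared in a distant galaxy",
        "the weather today is sunny",
    ]

    print(idf("the", documents))     # common word: low score
    print(idf("galaxy", documents))  # rare word: higher score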



EPSRC supports this blog through research grant EP/W033615/1.