Recognising (and addressing) bias in facial recognition tech – the Gender Shades Audit #BlackHistoryMonth

The five shades used for skin tone emojis

Some people have a neurological condition called face blindness (also known as ‘prosopagnosia’) which means that they are unable to recognise people, even those they know well – this can include their own face in the mirror! They only know who someone is once that person starts to speak. They can certainly detect faces, but they might struggle to classify them in terms of gender or ethnicity. In general, though, most people have an exceptionally good ability to detect and recognise faces – so good, in fact, that we even detect faces when they’re not actually there. This is called pareidolia – perhaps you see a surprised face in the picture of USB sockets below.

A unit containing four sockets, 2 USB and 2 for a microphone and speakers.
Happy, though surprised, sockets

What if facial recognition technology isn’t as good at recognising faces as it has sometimes been claimed to be? If the technology is being used in the criminal justice system, and gets the identification wrong, this can cause serious problems for people (see Robert Williams’ story in “Facing up to the problems of recognising faces”).

In 2018 Joy Buolamwini and Timnit Gebru shared the results of research they’d done, testing three different commercial facial recognition systems. They found that these systems were much more likely to wrongly classify darker-skinned female faces than lighter- or darker-skinned male faces. In other words, the systems were far less reliable for some groups of people than for others.

“The findings raise questions about how today’s neural networks, which … (look for) patterns in huge data sets, are trained and evaluated.”

Study finds gender and skin-type bias in commercial artificial-intelligence systems
(11 February 2018) MIT News

The Gender Shades Audit

Facial recognition systems are trained to detect, classify and even recognise faces using a bank of photographs of people. Joy and Timnit examined two banks of images used to train and test facial recognition systems and found that around 80 per cent of the photos were of people with lighter-coloured skin.

If the photographs aren’t fairly balanced in terms of having a range of people of different gender and ethnicity then the resulting technologies will inherit that bias too. Effectively the systems here were being trained to recognise light-skinned people.

The Pilot Parliaments Benchmark

They decided to create their own set of images, making sure that it covered a wide range of skin tones and had an equal mix of men and women (‘gender parity’). They did this by selecting photographs of members of various parliaments around the world that are known to have a reasonably equal mix of men and women, choosing parliaments from countries with predominantly darker-skinned people (Rwanda, Senegal and South Africa) and from countries with predominantly lighter-skinned people (Iceland, Finland and Sweden).

They labelled all the photos according to gender (they did have to make some assumptions based on name and appearance if pronouns weren’t available) and used the Fitzpatrick scale (see Different shades, below) to classify skin tones. The result was a set of photographs labelled as dark male, dark female, light male, light female with a roughly equal mix across all four categories – this time, 53 per cent of the people were light-skinned (male and female).
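To get a feel for the book-keeping involved, here is a minimal Python sketch (not the researchers’ actual code – the photos and labels are invented, and the split of Fitzpatrick types 1–3 into ‘lighter’ and 4–6 into ‘darker’ is our illustrative assumption) that buckets labelled photos into the four categories and checks how balanced the set is.

```python
from collections import Counter

def skin_group(fitzpatrick_type):
    """Map a Fitzpatrick type (1-6) to a 'lighter' or 'darker' group (assumed split)."""
    return "lighter" if fitzpatrick_type <= 3 else "darker"

# Invented hand-labelled photos: (gender label, Fitzpatrick type)
photos = [
    ("female", 5), ("male", 2), ("female", 1),
    ("male", 6), ("female", 4), ("male", 3),
]

groups = Counter(f"{skin_group(tone)} {gender}" for gender, tone in photos)
total = sum(groups.values())
for group, count in sorted(groups.items()):
    print(f"{group}: {count} photo(s), {100 * count / total:.0f}% of the set")
```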

A composite image showing the range of skin tone classifications with the Fitzpatrick scale on top and the skin tone emojis below.

Different shades

The Fitzpatrick skin tone scale (top) is used by dermatologists (skin specialists) as a way of classifying how someone’s skin responds to ultraviolet light. There are six points on the scale with 1 being the lightest skin and 6 being the darkest. People whose skin tone has a lower Fitzpatrick score are more likely to burn in the sun and not tan, and are also at greater risk of melanoma (skin cancer). People with higher scores have darker skin which is less likely to burn and they have a lower risk of skin cancer. 

Below it is a variation of the Fitzpatrick scale, with five points, which is used to create the skin tone emojis that you’ll find on most messaging apps in addition to the ‘default’ yellow. 

Testing three face recognition systems

Joy and Timnit tested the three commercial face recognition systems against their new database of photographs – a fair test covering the wide range of faces a recognition system might come across – and this is where they found that the systems were much less accurate for particular groups of people. The systems were very good at classifying lighter-skinned and darker-skinned men, but were noticeably worse for women overall, and worst of all for darker-skinned women: for the poorest-performing system, roughly a third of darker-skinned women were misclassified, compared with less than one per cent of lighter-skinned men.
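At its heart the audit is simply measuring accuracy separately for each of the four groups instead of quoting one overall score. A tiny Python sketch with made-up predictions (not the real audit data) shows the idea:

```python
from collections import defaultdict

# Made-up results: (group, true gender label, the system's guess)
results = [
    ("lighter male",   "male",   "male"),
    ("lighter female", "female", "female"),
    ("darker male",    "male",   "male"),
    ("darker female",  "female", "male"),    # a misclassification
    ("darker female",  "female", "female"),
]

correct = defaultdict(int)
total = defaultdict(int)
for group, truth, guess in results:
    total[group] += 1
    correct[group] += (truth == guess)

for group in sorted(total):
    print(f"{group}: {100 * correct[group] / total[group]:.0f}% correct "
          f"({total[group]} photo(s))")
```

A single overall accuracy figure can look impressive while hiding exactly this kind of gap, which is why the audit reports the four numbers side by side.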

These tools, trained on sets of data that had a bias built into them, inherited those biases and this affected how well they worked. Joy and Timnit published the results of their research and it was picked up and discussed in the news as people began to realise the extent of the problem, and what this might mean for the ways in which facial recognition tech is used. 

“An audit of commercial facial-analysis tools found that dark-skinned faces are misclassified at a much higher rate than are faces from any other group. Four years on, the study is shaping research, regulation and commercial practices.”

The unseen Black faces of AI algorithms (19 October 2022) Nature

There is some good news though. The three companies made changes to improve their facial recognition systems, several US cities have banned the use of this technology in criminal investigations, and more are considering a ban. People around the world are becoming more aware of the limitations of this type of technology and of the harm it can cause (even unintentionally), and are calling for better regulation of these systems.

Further reading

Study finds gender and skin-type bias in commercial artificial-intelligence systems (11 February 2018) MIT News
Facial recognition software is biased towards white men, researcher finds (11 February 2018) The Verge
Go read this special Nature issue on racism in science (21 October 2022) The Verge

More technical articles

Joy Buolamwini and Timnit Gebru (2018) Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, Proceedings of Machine Learning Research 81:1-15.
The unseen Black faces of AI algorithms (19 October 2022) Nature News & Views


See more in ‘Celebrating Diversity in Computing’

We have free posters to download and some information about the different people who’ve helped make modern computing what it is today.

Screenshot showing the vibrant blue posters on the left and the muted sepia-toned posters on the right

Or click here: Celebrating diversity in computing


This blog is funded through EPSRC grant EP/W033615/1.

The red sock of doom – trying to catch mistakes before they happen

Washing machine mistake

A red sock in with your white clothes wash – guess what happened next? What can you do to prevent it from happening again? Why should a computer scientist care? It turns out that red socks have something to teach us about medical gadgets.

How can we stop red socks from ever turning our clothes pink again? We need a strategy. Here are some possibilities.

  • Don’t wear red socks.
  • Take a ‘how to wash your clothes’ course.
  • Never make mistakes.
  • Get used to pink clothes.

Let’s look at them in turn – will they work?

Don’t wear red socks: That might help but it’s not much use if you like red socks or if you need them to match your outfit. And how would it help when you wear purple, blue or green socks? Perhaps your clothes will just turn green instead.

Take a ‘how to wash your clothes’ course: Training might help: you’d certainly learn that a red sock and white clothes shouldn’t be mixed – though you probably knew that anyway. It won’t stop you making a similar mistake again.

Never make mistakes: Just never leave a red sock in your white wash. If only! Unfortunately everyone makes mistakes – that’s why we have erasers on pencils and a delete key on computers – so this idea just won’t work.

Get used to pink clothes: Maybe, but it’s not ideal. It might not be so great turning up to school in a pink shirt.

What if the problem’s more serious?

We can probably live with pink clothes, but what happens if a similar mistake is made at a hospital? Not socks, but medicines. We know everyone makes mistakes so how do we stop those mistakes from harming patients? Special machines are used in hospitals to pump medicine directly into a patient’s arm, for example, and a nurse needs to tell it how much medicine to give – if the dose is wrong the patient won’t get better, and might even get worse.

What have we learned from our red sock strategies? We can’t stop giving patients medicine, and we don’t want to get used to mistakes, so our first and fourth strategies won’t work. We can give nurses more training, but everyone makes mistakes even when trained, so the second suggestion isn’t good enough either – and it doesn’t stop someone else making the same mistake. As for the third, never making a mistake simply isn’t possible.

We need to stop thinking of mistakes as a problem that people make and instead as a problem that systems thinking can solve. That way we can find solutions that work for everyone. One possibility is to check whether changes to the device might make mistakes less likely in the first place.

Errors? Or arrows?

Most medical machines are controlled through a panel with numbered keys (a number keypad), like on mobile phones, or with up and down arrows (an arrow keypad), like you sometimes get on alarm clocks. CHI+MED researchers have been asking questions like: which design is best for entering numbers quickly, and which is best for entering numbers accurately? They’ve been running experiments where people use different keypads, are timed and have their mistakes recorded. The researchers also track where people are looking while they use the keypads. Another approach has been to create mathematical descriptions of the different keypads and then explore mathematically how bad different errors might be.

It turns out that if you can see the numbers on a keypad in front of you it’s very easy to type them in quickly, though not always correctly! You need to check the display to see whether you have actually put in the right ones. Worse, the mistakes that are made are often massive – ten times too much or more. The arrow keypads are a little slower to use, but because people are already looking at the display (to see what numbers are appearing) they help nurses be more accurate: not only are fewer mistakes made, but those that are made tend to be smaller.
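To see why the two designs fail so differently, here is a toy Python sketch (our own made-up numbers, not the CHI+MED mathematical models): on a number keypad a single slip, such as a missed decimal point, can multiply the dose by ten, while on an arrow keypad a single extra press only nudges the value by one step.

```python
# Toy comparison of how big a single slip can be on each keypad design.
# (Made-up numbers, not the CHI+MED mathematical models.)

intended_dose = 5.0   # e.g. millilitres per hour

# Number keypad: missing the decimal point and typing "50" instead of "5.0"
number_keypad_entry = 50.0

# Arrow keypad: one accidental extra press of the 'up' arrow
step_size = 0.1
arrow_keypad_entry = intended_dose + step_size

print(f"intended dose:      {intended_dose}")
print(f"number keypad slip: {number_keypad_entry} "
      f"({number_keypad_entry / intended_dose:.0f} times too much)")
print(f"arrow keypad slip:  {arrow_keypad_entry} "
      f"(only {arrow_keypad_entry / intended_dose:.2f} times the intended dose)")
```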

Smart machines help users

A medical device that actively helps users avoid mistakes helps everyone using it (and the patients it’s being used on!). Changing the interface to reduce errors isn’t the only solution though. Modern machines have ‘intelligent drug libraries’ that contain information about the medicines and what sort of doses are likely and safe. Someone might still mistakenly tell the machine to give too high a dose, but now it can catch the error and ask the nurse to double-check. That’s like having a washing machine that can spot a bright sock in a white wash and refuses to switch on until it has been removed.
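As an illustration of the idea only (a sketch, not any real pump’s software – the drug name and limits are invented), a drug-library check can be as simple as comparing the entered dose against the safe range stored for that medicine and asking for confirmation when it falls outside:

```python
# Illustrative sketch of a drug-library check.
# The drug name and its limits are invented, not taken from any real device.

DRUG_LIBRARY = {
    "examplomycin": {"min_dose": 1.0, "max_dose": 10.0},  # made-up units
}

def check_dose(drug, dose):
    limits = DRUG_LIBRARY[drug]
    if dose < limits["min_dose"] or dose > limits["max_dose"]:
        return (f"WARNING: {dose} is outside the usual range "
                f"{limits['min_dose']}-{limits['max_dose']} for {drug}. "
                "Please double-check before continuing.")
    return "Dose accepted."

print(check_dose("examplomycin", 5.0))    # Dose accepted.
print(check_dose("examplomycin", 50.0))   # WARNING ... please double-check
```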

Building machines with a better ability to catch errors (remember, we all make mistakes) and helping users to recover from them easily is much more reliable than trying to get rid of all possible errors by training people. It’s not about avoiding red socks, or errors, but about putting better systems in place to make sure that we find them before we press that big ‘Start’ button.

This story was originally published here and is an article from CS4FN, a free computer science magazine from Queen Mary University of London which is sent to subscribing UK schools. To find out more please visit our About page.

Further reading / watching
You can find a copy of this article on pages 4 and 5 of issue 17 of CS4FN, ‘Machines Making Medicine Safer’.

From 50 seconds into this Paddington 2 clip you can see a ‘real world’ example of a red sock getting into the laundry.