Barbara Liskov: Byzantine birthdays

by Paul Curzon, Queen Mary University of London

(from the archive, originally in our special issue “The women are here”)

The scroll of a cello
Image by HeungSoon from Pixabay
Image by HeungSoon from Pixabay 

You may not think of computers as argumentative, but some of them do actually bicker quite a lot, and for good reason. Many hi-tech systems we depend on rely on the right computers getting their way, and the problems involved are surprisingly tricky to get right. Barbara Liskov’s contributions to this fiendishly difficult problem of making sure the right computers win their arguments helped her scoop the world’s top computing prize in 2009.

The ACM Turing Award, which Barbara won, is the computing equivalent of a Nobel Prize. She was awarded it for more than 40 years’ work. Her early work on programming languages has been used in every significant new language that has been invented since. More recently she has moved on to problems about ‘distributed systems’: computer systems involving lots of separate computers that have to work together. That’s where the arguing comes in.

Barbara has been working on an area of distributed computing called ‘Byzantine fault tolerance’. It’s all about providing a good service despite the fact that just about anything can go wrong to the computers or the network connecting them. It’s so important because those computers could be doing anything from running an ecommerce site over the Internet to keeping an airliner in the air.

Happy birthday

Here’s a thought experiment to show you how tricky distributed computing problems can be. Alice is a cellist in an orchestra and it turns out that the birthday of the conductor, Cassie, is on the day they are playing a concert. Jo, the conductor’s partner, has asked Alice to arrange for the orchestra to play Happy Birthday mid-concert as a surprise. The rest of the orchestra were enthusiastic. The only trouble is they didn’t have the chance to agree which break to do it in. In fact, no one’s sure they’re definitely doing it at all, and now they are now all sitting in their places ready to start.

For it to work they all have to start precisely together or it will just sound horrible. If anyone starts on their own they will just look and sound silly. Can it be done, or should they all just forget the whole thing?

Alice decides to go ahead. She can tell the others it’s on by whispering to those next to her and telling them to pass the message on. As long as her message gets to enough people across the different sections it will be ok. Won’t it?

Actually no.

The problem is: how will she know enough people did get the message? It has to be passed when people aren’t playing, so some could presumably not get it in time. How will she know? If the whispers didn’t get through and she starts to play, she will be the embarrassed one.

Are you in?

That seems an easy thing to solve though – when each person gets the message they just send one back saying, “I’m in”. If she gets enough back, she knows it’s on, doesn’t she? Ahh! There the problem lies. She knows, but no one else can be sure she knows. If any are in doubt they won’t play and it will still go horribly wrong. How does she let everyone know that enough people are willing to go for it? Alice is in the same situation she was at the start! She doesn’t know it will happen for sure and neither does anyone else.

She can start whispering messages again saying that enough people have agreed but that won’t help in the end either. How does she know all the new messages get through?

Change the problem

A computer scientist might have a solution for Alice – change the problem. Following this advice, she starts by whispering to everyone that she will stand up and conduct at an appointed time. Are they in? Now all she needs to be sure of is that when she stands up, enough people have agreed to play so that she won’t look silly. The others send back their message saying ‘I’m in’, but no one else needs to know in advance whether the song is definitely going ahead. If she doesn’t stand up they won’t play. If she does, they go ahead.

Delivering a good service

General knowledge
It’s called Byzantine Fault Tolerance after some imaginary
generals in the ancient days of the Eastern Roman Empire,
whose capital was Byzantium. The original problem was about
how two generals could know to attack a city at the same time.

Byzantine fault tolerance is about designing this kind of system: one that involves lots of ‘agents’ (people or computers) that have to come to an agreement about what they know and will do. The aim is for them to work together to deliver a service. That service might be for an orchestra to play Happy Birthday, but is more likely to be something like taking airline bookings over the Internet, or even deciding on action to take to keep the airliner they are flying in the air. The separate computers have to agree as a group even when some could stop working, make mistakes due to software bugs or even behave maliciously due to a virus at any point. Can a way be engineered that allows the system as a whole to still behave correctly and deliver that service? This is the problem Barbara Liskov has been working on with Miguel Castro at MIT. Of course they weren’t thinking of birthdays and orchestras. They were interested in providing the service of looking after documents so they can be accessed anytime, anywhere. A simple computer does this with a file system. It keeps track of the names of your documents and where it has stored them in its memory. When you open a document it uses the records it has kept to go and fetch it from wherever it was put. With this kind of file system, though, if something goes wrong with your machine you could lose your documents forever.

Spread it around

A way to guard against this is to create a file system that distributes copies of each file to different computers around the Internet. When you make changes, those changes are also sent to all the other computers with copies. Then if the copy on one machine is corrupted, perhaps by a hacker or just by a disk crash, the file system as a whole can still give you the correct document. That is where the computers need to start arguing. When you ask for your document back how do all the computers with (possibly different) copies decide which is the correct, uncorrupted version? That sounds easy, but as the orchestra example shows, as soon as you create a situation where the different agents (whether human or computer) are distributed, and worse you can’t trust anyone or anything for sure, there are lots of subtle ways it can all go horribly wrong.

The way Barbara and Miguel’s solution to this problem works is similar to what Alice was doing. One computer acts as what is called the ‘primary’ (Alice played this role). It is where the request from the client (Jo) goes. The primary sends out a request to all the backup machines for the document, just like Alice’s whispered messages. All the backups reply with their copy of the document. As soon as more than some predetermined number come back with the same document, then that is the good copy.

Not so simple

Of course the detail of Barbara and Miguel’s method is a lot trickier than that. They’ve had to figure out how to cope if something goes wrong with the primary (Alice herself) to ensure that the client still gets their document. Their version also works without any synchronisation to make things happen in lockstep (suppose Alice is at the back so can’t stand up and conduct to keep things in time). There are lots of other details in Barbara and Miguel’s version too. Messages are timestamped, for example, so that the recipients can tell if a message is new or just a copy of an old one.

Practically perfect

The particularly special thing about Barbara and Miguel’s way of providing fault tolerance, though, is that it doesn’t take an excessively long time. Various people had come up with solutions before, but they were so slow no one could really use them. The new method is so fast it’s almost as if you weren’t bothering about fault tolerance at all. Better still, the fact that it doesn’t need synchronisation – no conducting – means it also works when the replicated services are on the Internet where the computers act independently and there is no way to keep them in lockstep.

Barbara’s work might never actually help with an orchestral surprise like Alice’s, However, because of it future computers will be in a better position to shrug off their own kind turning rogue due to hackers, cosmic rays or even just plain old breakdowns. Not a bad reason to have a byzantine argument.

More on …

Related Magazines …

EPSRC supports this blog through research grant EP/W033615/1. 

The First Law of Humans

Preventing robots from hurting us

by Paul Curzon, Queen Mary University of London

A chess board with pieces lined up
Image by Pexels from Pixabay 

The first rule of humans when around robots is apparently that they should not do anything too unexpected…

A 7-year old child playing chess in a chess tournament has had his finger broken by a chess-playing robot which grabbed it as the boy made his move. The child was blamed by one of the organisers, who claimed that it happened because the boy “broke the safety rules”! The organisers also apparently claimed that the robot itself was perfectly safe!

What seems to have happened is, after the robot played its move, the boy played his own move very quickly before the robot had finished. Somehow this triggered the wrong lines of code in the robot’s program: instructions that were intended for some other situation. In the actual situation of the boy’s hand being over the board at the wrong time it led the robot to grab his finger and not let go.

Spoiler Alert

The situation immediately brings to mind the classic science fiction story “Moxon’s Master” by Ambrose Bierce published way back in 1899. It is the story of a chess playing automaton (ie robot) and what happens when it is check-mated when playing a game with its creator Moxon. It flies into a dangerous rage. However, there the problems are apparently because the robot has developed emotions and so emotional reactions. In both situations however a robot intended simply to play chess is capable of harming a human as a result.

The three laws of robotics

Isaac Asimov is famous for his laws of robotics: fundamental unbreakable rules built in to the ‘brain’ of all robots in his fictional world precisely to stop this sort of situation. The rules he formulated were:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given to it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
A robot carefully touching fingers with a human
Image by Pete Linforth from Pixabay 

Clearly, had they been in place, the chess robot would not have harmed the boy, and Moxon’s automaton would not have been able to do anything too bad as a result of its temper either.

Asimov devised his rules as a result of a discussion with the Science Fiction magazine editor John W. Campbell, Jr. He then spent much of his Science Fiction career writing robot stories around how humans could end up being hurt despite the apparently clear rules. That aside their key idea was that, to ensure robots were safe around people, they would need built-in logic that could not be circumvented to stop them hurting them. They needed a fail-safe system monitoring their actions that would take over when breaches were possible. Clearly this chess-playing robot was not “perfectly safe” and not even fail-safe as if it was the boy would not have been harmed whatever he did. The robot did not have anything at all akin to a working, unbreakable First Law programmed into it.

Dangerous Machines

Asimov’s robots were intelligent and able to think for themselves in a more general sense than any that currently exist. The First Law essentially prevented them from deciding to harm a human not just do so by accident. However, perhaps the day will soon come when they can start to think for themselves, so perhaps a first law will soon be important. In any case, machines can harm humans without being able to think. That humans need to be wary around robots is obvious from the fact that there have been numerous injuries and even fatalities in factories using industrial robots in the decades since they were introduced. They are dangerous machines. Fortunately, the carnage inflicted on children is at least not quite that of the industrial accidents in the Industrial Revolution. It is still a problem though. People do have to take care and follow safety rules around them!

Rather than humans having to obey safety laws, we perhaps ought to be taking Asimov’s Laws more seriously for all robots, therefore. Why can’t those laws just be built in? It is certainly an interesting research problem to think about. The idea of a fail-safe is standard in engineering, so its not that general idea that is the problem. The problem is that, rather than intelligence being needed for robots to harm us, intelligence is needed to avoid them doing so.

Implementing the First Law

Let’s imagine building in the first law to chess playing robots and in particular the one that hurt the boy. For starters the chess playing robot would have needed to have recognised that the boy WAS a human so should not be harmed. It would also need to be able to recognise that his finger was a part of him and that gripping its finger would harm him. It would also need to know that it was gripping his finger (not a piece) at the time. It would then need a way to stop before it was too late, and do no harm in the stopping. It clearly needs to understand a lot about the world to be able to avoid hurting people in general.

Some of this is almost within our grasp. Computers can certainly do a fairly good job of recognising humans now through image recognition code. They can even recognise individuals, so actually that first fundamental part of knowing what is and isn’t a human is more or less possible now, just not perfectly yet. Recognising objects in general is perhaps harder. The chess robot presumably has code for recognising pieces already though a finger perhaps even looks like a piece at least to a robot. To generally, avoid causing harm in any situation it needs to be able to recognise what lots of objects are not just chess pieces. It also needs to differentiate them from what is part of a human not just what is a human. Object recognition like this is possible at least in well-defined situations. It is much harder to manage it in general, even if the precise objects have never been encountered before. Even harder though is probably recognising all the ways that would constitute doing harm to the human identified in front of it including with any of those objects that are around.

Staying in control

The code to do all this would also have to be in some sense at a higher level of control than that making the robot take actions as it has to overrule them ALWAYS. For the chess robot, there was presumably a bug that allowed it to grip a human’s finger as no programmer will have intended that, so it isn’t about monitoring the code itself. The fail-safe code has to be monitoring what is actually happening in the world and be in a position to take over. It also can’t just make the robot freeze as that may be enough to do the damage of a broken finger if already in the robot’s grip (and that may have been part of the problem for the boy’s finger). It also can’t just move its arm back suddenly as what if another child (a toddler perhaps) has just crawled up behind it! It has to monitor the effects of its own commands too! A simple version of such a monitor is probably straightforward though. The robot’s computer architecture just needs to be designed accordingly. One way robots are designed is for new modules to build on top of existing ones giving new more complex behaviour as a result, which possibly fits what is needed here. Having additional computers acting as monitors to take over when others go wrong is also not really that difficult (bugs in their own code aside) and a standard idea for mission-critical systems.

So it is probably all the complexity of the world with unexpected things happening in it that is the problem that makes a general version of the First Law hard at the moment… If Asimov’s laws in all their generalness are currently a little beyond us, perhaps we should just think about the problem in another more limited way (at least for now)…

Can a chess playing robot be safe?

In the chess game situation, if anything is moving in front of the robot then it perhaps should just be keeping well out of the way. To do so just needs monitoring code that can detect movement in a small fixed area. It doesn’t need to understand anything about the world apart from movement versus non-movement. That is easily in the realms of what computers can do – even some simple toy robots can detect movement. The monitoring code would still need to be able to override the rest of the code of course, bugs included.

Why also could the robot grip a finger with enough pressure to break it, anyway. Perhaps it just needed more accurate sensors in its fingers to avoid doing harm, together with a sensor that just let go if it felt too much resistance back. After all chess pieces don’t resist much!

And one last idea, if a little bit more futuristic. A big research area at the moment is soft robotics: robots that are soft and squidgy not hard and solid, precisely so they can do less harm. Perhaps if the chess robot’s robotic claw-like fingers had instead been totally soft and squishy it would not have harmed him even if it did grab his finger.

Had the robot designers tried hard enough they surely could have come up with solutions to make it safer, even if they didn’t have good enough debugging skills to prevent the actual bug that caused the problem. It needs safety to be a very high priority from the outset though: and certainly safety that isn’t just pushed onto humans to be responsible for as the organisers did.

We shouldn’t be blaming children for not obeying safety rules when they are given what is essentially a hard, industrial robot to play with. Doing so just lets the robot makers off the hook from even trying to make their robots safer, when they clearly could do more. When disasters happen don’t blame the people, improve the system. On the other hand perhaps we should be thinking far more about doing the research that will allow us one day to actually implement Asimov’s Laws in all robots and so have a safety culture in robotics built-in. Then perhaps people would not have to be quite so wary around robots and certainly not have to follow safety rules themselves. That surely is the robot’s job.

More on …

Related Magazines …

This blog is funded through EPSRC grant EP/W033615/1.