Hacking DNA

A jigsaw of DNA with pieces missing
Image by Arek Socha from Pixabay

DNA is the molecule of life. Our DNA stores the information of how to create us. Now it can be hacked.

DNA consists of two strands coiling round each other in a double helix. It’s made of four building blocks, or ‘nucleotides’, labelled A, C, G, T. Different orders of letters give the information of how to build each unique creature, you and me included. Sequences of DNA are analysed in labs by a machine called a gene sequencer. It works out the order of the letters and so tells us what’s in the DNA. When biologists talk of sequencing the human (or another animal’s or plant’s) genome, they mean using a gene sequencer to work out the specific sequences in the DNA for that species. Gene sequencers are also used by forensic scientists to work out who might have been at the scene of a crime, and to predict whether a person has genetic disorders that might lead to disease.

DNA can be used to store information other than that of life: any information in fact. This may be the future of data storage. Computers use a code made of 0s and 1s. There is no reason why you can’t encode all the same information using A, C, G, T instead. For example, a string of 1s and 0s might be encoded by having each pair of bits represented by one of the four nucleotides: 00 = A, 01 = C, 10 = G and 11 = T. The idea has been demonstrated by Harvard scientists who stored a video clip in DNA.
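To make the encoding concrete, here is a minimal Python sketch of the two-bits-per-nucleotide scheme described above. The function names are made up for illustration, and real DNA storage systems are more sophisticated than this (the article does not say which scheme the Harvard team used).

```python
# The mapping given in the article: each pair of bits becomes one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Turn ordinary computer data into a string of A, C, G, T (hypothetical helper)."""
    # Write every byte as 8 bits, then read the bits off two at a time.
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(dna: str) -> bytes:
    """Reverse the encoding: nucleotides back to the original bytes (hypothetical helper)."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


message = b"Hi"
strand = bytes_to_dna(message)
print(strand)                # CAGACGGC
print(dna_to_bytes(strand))  # b'Hi'
```

Each byte becomes four letters, so the two-character message "Hi" turns into a strand eight nucleotides long.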

It also leads to whole new cyber-security threats. A program is just data too, so can be stored in DNA sequences, for example. Researchers from the University of Washington have managed to hide a malicious program inside DNA that can attack the gene sequencer itself!

The gene sequencer does more than work out the sequence of DNA symbols. As it is a computer, it converts the sequence into a binary form that can then be processed as normal. As DNA sequences are long, the sequencer compresses them. The attack made use of a common kind of bug that malware often exploits: ‘buffer overflow’ errors. These arise when the person writing a program includes instructions to set aside a fixed amount of space to store data, but then doesn’t include code to make sure only that amount of data is stored. If more data is stored then it overflows into the memory area beyond that allocated to it. If executable code is stored there, then the effect can be to overwrite the program with new malicious instructions.
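The idea of a buffer overflow can be sketched in Python. Python itself checks the bounds of its lists, so this is only a simulation: a single bytearray stands in for the raw memory of a language like C, with a small buffer at the front and, as an assumption for illustration, four bytes standing in for program instructions straight after it. All names and sizes here are invented and have nothing to do with the real sequencer software.

```python
BUFFER_SIZE = 4  # the fixed amount of space set aside for incoming data


def make_memory() -> bytearray:
    # Simulated memory layout: [4-byte data buffer][4 bytes of "program instructions"]
    return bytearray(b"\x00" * BUFFER_SIZE + b"SAFE")


def unchecked_copy(memory: bytearray, data: bytes) -> None:
    """The buggy version: stores the data without checking it fits the buffer."""
    for i, byte in enumerate(data):
        memory[i] = byte  # nothing stops i running past BUFFER_SIZE


def checked_copy(memory: bytearray, data: bytes) -> None:
    """The fixed version: refuses data that is bigger than the buffer."""
    if len(data) > BUFFER_SIZE:
        raise ValueError("input is larger than the buffer")
    memory[:len(data)] = data


memory = make_memory()
unchecked_copy(memory, b"AAAAEVIL")  # 8 bytes squeezed into a 4-byte buffer
print(bytes(memory[BUFFER_SIZE:]))   # b'EVIL': the "instructions" were overwritten
```

The only difference between the two copy functions is the length check, which is exactly the code the article says the programmer forgot to include.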

When the gene sequencer reads that malware DNA, the hidden program is converted back into 1s and 0s. If those bits are treated as instructions and executed, the program launches its attack and takes control of the computer that runs the sequencer. In principle, an attack like this could be used to fake results for subsequent DNA tests, subvert court cases, disrupt hospital testing, steal sensitive genetic data, or corrupt DNA-based memory.

Fortunately, the risk of exactly this attack causing any problems in the real world is very low, but the team wanted to highlight the potential for DNA-based attacks generally. They pointed out how lax the development processes and controls were for much of the software used in these labs. The bigger risk right now is probably from scientists falling for spear phishing scams (where fake emails pretending to be from someone you know take you to a malware website) or just forgetting to change the default password on the sequencer.

Paul Curzon, Queen Mary University of London



This page is funded by EPSRC on research agreement EP/W033615/1.


Software for Justice

by Paul Curzon, Queen Mary University of London (originally published in 2011)

A jury is given misleading information in court by an expert witness. An innocent person goes to prison as a result. This shouldn’t happen, but unfortunately it does and more often than you might hope. It’s not because the experts or lawyers are trying to mislead but because of some tricky mathematics. Fortunately, a team of computer scientists at Queen Mary, University of London are leading the way in fixing the problem.

The Queen Mary team, led by Professor Norman Fenton, is trying to ensure that forensic evidence involving probability and statistics can be presented without making errors, even when the evidence is incredibly complex. Their solution is based on specialist software they have developed.

Many cases in courts rely on evidence like DNA and fibre matching for proof. When police investigators find traces of this kind of evidence from the crime scene they try to link it to a suspect. But there is a lot of misunderstanding about what it means to find a match. Surprisingly, a DNA match between, say, a trace of blood found at the scene and blood taken from a suspect does not mean that the trace must have come from the suspect.

Forensic experts talk about a ‘random match probability’. It is just the probability that the suspect’s DNA matches the trace if it did not actually come from him or her. Even a one-in-a-billion random match probability does not prove it was the suspect’s trace. Worse, the random match probability an expert witness might give is often either wrong or misleading. This can be because it fails to take account of potential cross-contamination, which happens when samples of evidence accidentally get mixed together, or even when officers leave traces of their own DNA from handling the evidence. It can also be wrong due to mistakes in the way the evidence was collected or tested. Other problems arise if family members aren’t explicitly ruled out, as that makes the random match probability much higher. When the forensic match is from fibre or glass, the random match probabilities are even more uncertain.

The potential to get the probabilities wrong isn’t restricted to errors in the match statistics, either. Suppose the match probability is one in ten thousand. When the experts or lawyers present this evidence they often say things like: “The probability that the trace came from anybody other than the defendant is one in ten thousand.” That statement sounds OK but it isn’t true.

The problem is called the prosecutor’s fallacy. You can’t actually conclude anything about the probability that the trace belonged to the defendant unless you know something about the number of potential suspects. Suppose this is the only evidence against the defendant and that the crime happened on an island where the defendant was one of a million adults who could have committed the crime. Then the random match probability of one in ten thousand actually means that about one hundred of those million adults match the trace. The defendant is just one of those hundred or so people, so the probability of innocence is about ninety-nine out of a hundred! That’s very different from the one in ten thousand probability implied by the statement given in court.
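The island arithmetic can be checked with a few lines of Python using Bayes’ theorem. This is only a sketch under the same simplifying assumptions as the example: every adult on the island is equally likely to be the source before the evidence is considered, and the real source is certain to match. The function name is made up for illustration.

```python
def probability_defendant_is_source(population: int, random_match_probability: float) -> float:
    """Bayes' theorem for a single match, assuming everyone starts out equally suspect."""
    prior = 1 / population                    # chance the defendant is the source, before the evidence
    match_if_source = 1.0                     # assumption: the true source always matches
    match_if_not = random_match_probability   # chance an innocent person matches anyway
    numerator = match_if_source * prior
    denominator = numerator + match_if_not * (1 - prior)
    return numerator / denominator


p_source = probability_defendant_is_source(1_000_000, 1 / 10_000)
print(f"Probability the trace came from the defendant: {p_source:.4f}")
print(f"Probability of innocence: {1 - p_source:.4f}")
```

With a million islanders and a one in ten thousand match probability, the probability of innocence comes out at roughly 0.99, nowhere near the one in ten thousand suggested in court.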

Norman Fenton’s work is based around a theorem, called Bayes’ theorem, which gives the correct way to calculate these kinds of probabilities. The theorem is over 250 years old but it is widely misunderstood and, in all but the simplest cases, is very difficult to calculate properly. Most cases include many pieces of related evidence, including evidence about the accuracy of the testing processes. To keep everything straight, experts need to build a model called a Bayesian network. It’s like a graph that maps out different possibilities and the chances that they are true. You can imagine that in almost any court case, this gets complicated awfully quickly. It is only in the last 20 years that researchers have discovered ways to perform the calculations for Bayesian networks, and written software to help them. What Norman and his team have done is develop methods specifically for modelling legal evidence as Bayesian networks in ways that are understandable by lawyers and expert witnesses.
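To give a flavour of what a Bayesian network does, here is a toy one in Python with just three linked possibilities: whether the defendant is the source of the trace, whether the DNA truly matches, and whether the lab reports a match. It is a sketch for illustration only, not the Queen Mary team’s software or method, and the error rates used are invented numbers. Real networks for court cases have many more nodes and need specialist tools.

```python
from itertools import product


def probability_source_given_report(prior_source: float, random_match_probability: float,
                                    false_positive: float, false_negative: float) -> float:
    """Toy network: Source -> True match -> Reported match, solved by listing every case."""
    p_source = {True: prior_source, False: 1 - prior_source}
    # P(DNA truly matches | source?) - assumes the real source always matches
    p_match_given_source = {True: 1.0, False: random_match_probability}
    # P(lab reports a match | DNA truly matches?) - allows for lab mistakes
    p_report_given_match = {True: 1 - false_negative, False: false_positive}

    # Add up the probability of "lab reports a match" for each value of Source,
    # summing over both possibilities for the hidden True-match node.
    joint = {True: 0.0, False: 0.0}
    for source, match in product([True, False], repeat=2):
        p_match = p_match_given_source[source] if match else 1 - p_match_given_source[source]
        joint[source] += p_source[source] * p_match * p_report_given_match[match]
    return joint[True] / (joint[True] + joint[False])


perfect_lab = probability_source_given_report(1 / 1_000_000, 1 / 10_000, 0.0, 0.0)
sloppy_lab = probability_source_given_report(1 / 1_000_000, 1 / 10_000, 0.01, 0.0)
print(perfect_lab, sloppy_lab)
```

Even a small made-up chance of the lab wrongly reporting a match makes the evidence much weaker, which is why the accuracy of the testing process has to be part of the model.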

Norman and his colleague Martin Neil have provided expert evidence (for lawyers) using these methods in several high-profile cases. Their methods help lawyers to determine the true value of any piece of evidence – individually or in combination. They also help show how to present probabilistic arguments properly.

Unfortunately, although scientists accept that Bayes’ theorem is the only viable method for reasoning about probabilistic evidence, it’s not often used in court, and is even a little controversial. Norman is leading an international group to help bring Bayes’ theorem a little more love from lawyers, judges and forensic scientists. Although changes in legal practice happen very slowly (lawyers still wear powdered wigs, after all), hopefully in the future the difficult job of judging evidence will be made easier and fairer with the help of Bayes’ theorem.

If that happens, then thanks to some 250-year-old maths combined with some very modern computer science, fewer innocent people will end up in jail. Given the innocent person in the dock could one day be you, you will probably agree that’s a good thing.


This post was originally published in 2011 on our old CS4FN website and a copy can also be found on pages 18 and 19 in the Alan Turing issue of the CS4FN magazine, issue 14. You can download a free PDF copy of the magazine, along with our entire back catalogue of magazines and booklets, at our downloads site.


Further reading in justice

Edie Schlain Windsor and same-sex marriage – Edie was a computer scientist whose marriage to another woman was deemed ineligible for certain rights provided (at that time) only in a marriage between a man and a woman. She fought for those rights and won.



