Lego computer science: compression algorithms

A giraffe image made of coloured squares
A giraffe as a pixel image.
Colour look-up table
Black 0
Blue 1
Yellow 2
Green 3
Brown 4
Image by CS4FN

Continuing a series of blogs on what to do with all that lego scattered over the floor: learn some computer science…

We saw in the last post how images are stored as pixels – the equivalent of square or round lego blocks of different colours laid out in a grid like a mosaic. By giving each colour a number and drawing out a gird of numbers we give ourself a map to recreate the picture from. Turning that grid of numbers into a list (and knowing the size of the rectangle that is the image) we can store the image as a file of numbers, and send it to someone else to recreate.

Of course, we didn’t really need that grid of numbers at all as it is the list we really need. A different (possibly quicker) way to create the list of numbers is work through the picture a brick at a time, row by row and find a brick of the same colour. Then make a long line of those bricks matching the ones in the lego image, keeping them in the same order as in the image. That long line of bricks is a different representation of the image as a list instead of as a grid. As long as we keep the bricks in order we can regenerate the image. By writing down the number of the colour of each brick we can turn the list of bricks into another representation – the list of numbers. Again the original lego image can be recreated from the numbers.

7 blue squares (1) a green square (3) 6 blue squares (1) 2 green suares (3) ...
The image as a list of bricks and numbers
Colour look-up table: Black 0: Blue 1: Yellow 2: Green 3: Brown 4
Image by CS4FN

The trouble with this is for any decent size image it is a long list of numbers – made very obvious by the very long line of lego bricks now covering your living room floor. There is an easy thing to do to make them take less space. Often you will see that there is a run of the same coloured lego bricks in the line. So when putting them out, stack adjacent bricks of the same colour together in a pile, only starting a new pile if the bricks change colour. If eventually we get to more bricks of the original colour, they start their own new pile. This allows the line of bricks to take up far less space on the floor. (We have essentially compressed our image – made it take less storage space, at least here less floor space).

A histogram of the squares
7x1 1 x3 6x1 ...

Image by CS4FN

Now when we create the list of numbers (so we can share the image, or pack all the lego away but still be able to recreate the image), we count how many bricks are in each pile. We can then write out a list to represent the numbers something like 7 blue, 1 green, … Of course we can replace the colours by numbers that represent them too using our key that gives a number to each colour (as above).

If we are using 1 to mean blue and the line of bricks starts with a pile of seven black bricks then write down a pair of numbers 7 1 to mean “a pile of seven blue bricks”. If this is followed by 1 green bricks with 3 being used for green then we next write down 1 3, to mean a pile of 1 green bricks and so on. As long as there are lots of runs of bricks (pixels) of the same colour then this will use far less numbers to store than the original:

7 1 1 3 6 1 2 3 1 1 1 2 3 1 2 3 2 2 3 1 2 3 …

We have compressed our image file and it will now be much quicker to send to a friend. The picture can still be rebuilt though as we have not lost any information at all in doing this (it is called a lossless data compression algorithm). The actual algorithm we have been following is called run-length encoding.

Of course, for some images, it may take more not less numbers if the picture changes colour nearly every brick (as in the middle of our giraffe picture). However, as long as there are large patches of similar colours then it will do better.

There are always tweaks you can do to algorithms that may improve the algorithm in some circumstances. For example in the above we jumped back to the start of the row when we got to the end. An alternative would be to snake down the image, working along the adjacent rows in opposite directions. That could improve run-length encoding for some images because patches of colour are likely the same as the row below, so this may allow us to continue some runs. Perhaps you can come up with other ways to make a better image compression algorithm

Run-length encoding is a very simple compression algorithm but it shows how the same information can be stored using a different representation in a way that takes up less space (so can be shared more quickly) – and that is what compression is all about. Other more complex compression algorithms use this algorithm as one element of the full algorithm.

Activities

Make this picture in lego (or colouring in on squared paper or in a spreadsheet if you don’t have the lego). Then convert it to a representation consisting of a line of piles of bricks and then create the compressed numbered list.

A pixelated pictures of a camel by a tree in sand dunes
An image of a camel to compress: Colour look-up table: Black 0: Blue 1: Yellow 2: Green 3: Brown 4
Image by CS4FN

Make your own lego images, encode and compress them and send the list of numbers to a friend to recreate.

Paul Curzon, Queen Mary University of London


More on …

Subscribe to be notified whenever we publish a new post to the CS4FN blog.


This blog is funded by EPSRC on research agreement EP/W033615/1.

QMUL CS4FN EPSRC logos

This post was funded by UKRI, through grant EP/K040251/2 held by Professor Ursula Martin, and forms part of a broader project on the development and impact of computing.

Hiding in Elizabethan Binary

The great Tudor and Stuart philosopher Sir Francis Bacon was a scientist, a statesman and an author. He was also a pretty decent computer scientist. He published* a new form of cipher, now called Bacon’s Cipher, invented when he was a teenager. Its core idea is the foundation for the way all messages are stored in computers today.

The Tudor and Stuart eras were a time of plot and intrigue. Perhaps the most famous is the 1605 Gunpowder plot where Guy Fawkes tried to assassinate King James I by blowing up the Houses of Parliament. Secrets mattered! In his youth Bacon had worked as a secret agent for Elizabeth I’s spy chief, Walsingham, so knew all about ciphers. Not content with using those that existed he invented his own. The one he is best remembered for was actually both a cipher and a form of steganography. While a cipher aims to make a message unreadable, steganography is the science of secret writing: disguising messages so no one but the recipient knows there is a message there at all.

A Cipher …

Bacon’s method came in two parts. The first was a substitution cipher, where different symbols are substituted for each letter of the alphabet in the message. This idea dates back to Roman times. Julius Caesar used a version, substituting each letter for a letter from a fixed number of places down the alphabet (so A becomes E, B becomes F, and so on). Bacon’s key idea was to replace each letter of the alphabet with, not a number or letter, but it’s own series of a’s and b’s (see the cipher table). The Elizabethan alphabet actually had only 24 letters so I and J have the same code as do U and V as they were interchangeable (J was the capital letter version of i and similarly for U and v).

In Bacon’s cipher everything is encoded in two symbols, so it is a binary encoding. The letters a and b are arbitrary. Today we would use 0 and 1. This is the first use of binary as a way to encode letters (in the West at least). Today all text stored in computers is represented in this way – though the codes are different – it is all Unicode is. It allocates each character in the alphabet with a binary pattern used to represent it in the computer. When the characters are to be displayed, the computer program just looks up which graphic pattern (the actual symbol as drawn) is linked to that binary pattern in the code being used. Unicode gives a binary pattern for every symbol in every human language (and some alien ones like Klingon).

Image by CS4FN

Steganography

The second part of Bacon’s cipher system was Steganography. Steganography dates back to at least the Greeks, who supposedly tattooed messages on the shaved heads of slaves, then let their hair grow back before sending them as both messenger and message. The binary encoding of Bacon’s cipher was vital to make his steganography algorithm possible. However, the message was not actually written as a’s and b’s. Bacon realised that two symbols could stand for any two things. If you could make the difference hard to spot, you could hide the messages. Bacon invented two ways of handwriting each letter of the alphabet – two fonts. An ‘a’ in the encoded message meant use one font and a ‘b’ meant use the other. The secret message could then be hidden inside an innocent one. The letters written were no longer the message, the message was in the font used. As Bacon noted, once you have the message in binary you could think of other ways to hide it. One way used was with capital and lower-case letters, though only using the first letter of words to make it less obvious.

Suppose you wanted to hide the message “no” in the innocuous message ‘hello world’. The message ‘no’ becomes ‘abbaa abbab’. So far this is just a substitution cipher. Next we hide it in, ‘hello world’. Two different kinds of fonts are those with curls on the tails of letters known as serif fonts and like this one and those without curls known as sans serif fonts and like this one. We can use a sans serif font to represent an ‘a’ in the coded message, and a serif font to represent ‘b’. We just alternate the fonts following the pattern of the a’s and b’s: ‘abbaa abbab’. The message becomes

Image by CS4FN

sans serif, serif, serif, sans serif, sans serif,
sans serif, serif, serif, sans serif, serif.

Using those fonts for our message we get the final mixed font message to send:

Bacon the polymath

Bacon is perhaps best known as one of the principal advocates for rigorous science as a way of building up knowledge. He argued that scientists needed to do more than just come up with theories of how the world worked, and also guard against just seeing the results that matched their theories. He argued knowledge should be based on careful, repeated observation. This approach is the basis of the Scientific Method and one of the foundation stones of modern science.

Bacon was also a famous writer of the time, and one of many authors who has since been suggested as the person who wrote William Shakespeare’s plays. In his case it is because they claim to have found secret messages hidden in the plays in Bacon’s code. The idea that someone else wrote Shakespeare’s plays actually started just because some upper class folk with a lack of imagination couldn’t believe a person from a humble background could turn themselves into a genius. How wrong they were!

Paul Curzon, Queen Mary University of London, Autumn 2017

*Thanks to Pete Langman, whose PhD was on Francis Bacon, for pointing out a mistake in the original version of this blog where I suggested the cipher was published in, 1605, the year of the Gun Powder plot. It was actually first published in 1623 in De augmentis which was a translation/enlargement of his 1605 Advancement of Learning.

He also pointed out that Bacon conceived the idea while working with Elizabethan spymaster, Walsingham’s cipher expert at the time of the Babington plot to assasinate Elizabeth I, Thomas Phileppes, and Mary, Queen of Scots’ jailer, Amias Paulet. Bacon also claimed the cipher was never broken!

Subscribe to be notified whenever we publish a new post to the CS4FN blog.


This blog is funded by EPSRC on research agreement EP/W033615/1.

QMUL CS4FN EPSRC logos