Silly Sequences and Data Compression

Here is a fun challenge for you: can you find the rule for and next term of this sequence?

1
11
21
1211
111221
312211

This is not just some maths exercise – it also gives a clue to a clever way of compressing data! Have you figured it out? What if I told you this is called the “look-and-say” sequence? Say each row out loud in turn…

Here’s the answer: the next term in the sequence is 13112221. Each term describes the previous one. The last row, 312211, read aloud becomes “one three, one one, two twos, two ones”, which when written down with numbers is 13112221!

The sequence doesn’t necessarily need to start with 1: try picking a different starting number (a seed) and see where you end up!

If you’re perplexed with this sequence, you’re in good company. The look-and-say sequence, while not invented by him, was analysed in-depth by the mathematician John Conway after it was introduced to him at a party: a party where maths is discussed openly – sign me up! 

Conway, is famous for his Game of Life [EXTERNAL]: a cellular automaton which uses a few simple rules to create seemingly living patterns. We won’t go into detail on this here, but interestingly though, if you look more at Conway’s Game of Life, you might start to see some similar features – compare the look-and-say sequence starting 22… and a 2×2 block on the Life grid.

In his investigations, Conway discovered some very interesting properties of the look-and-say sequence. One of these is the interesting fact that no digits other than 1, 2, and 3 will appear in the sequence (unless the seed number contains such a digit or a run of more than three of the same digit). 

Take the sequence above starting 1, 11, etc. – this will never contain a digit 4 or above. Can you figure out why?

We can understand this by going backwards – what would we need to get a digit 4 in one of the terms in the sequence? Well, we’d need to have four of the same digit in a row (e.g. 1111). But this is impossible, because the number which would generate this would be 11 (read as  “one one followed by  one one” so written in the next round as 1111 as we want). However 11 if generated is actually written in the next term as 21 (“two ones”) not 1111.

Compress it?

You’re perhaps wondering how this links into computer science? Imagine a black-and-white image stored with 0s and 1s where 0 is a white pixel and 1 is a black pixel: 0000111111001111… Storing this as a long sequence might seem a bit inefficient though, so let’s try applying a look-and-say methodology to this.

We would end up with 40 (“four zeros”), 61, 20, 41 or written in full 40612041. Using just two digits to store data is especially convenient, as since we know that the 1s and 0s will always alternate once compressed, we can even remove them and just store the count of each: 4624…, a much shorter sequence to store compared to our original.

This style of compression is called Run-Length Encoding and is especially useful when you have files with long sequences of identical data (like the black-and-white binary image in our example).

Of course, it’s not universally efficient. As we’ve seen with the ever-growing look-and-say sequences, if the data doesn’t contain many repeating sequences, the file size may even get bigger: sometimes known as negative compression.

This is yet another example of how mathematics and computer science can go hand-in-hand to solve real problems! Keep looking out for interesting patterns – you never know what you might discover!

Daniel Gill, Queen Mary University of London

More on…

Subscribe to be notified whenever we publish a new post to the CS4FN blog.