3D models in motion

by Paul Curzon, Queen Mary University of London
based on a 2016 talk by Lourdes Agapito

The cave paintings in Lascaux, France are early examples of human culture from 15,000 BC. There are images of running animals and even primitive stop motion sequences – a single animal painted over and over as it moves. Even then, humans were intrigued with the idea of capturing the world in motion! Computer scientist Lourdes Agapito is also captivated by moving images. She is investigating whether it’s possible to create algorithms that allow machines to make sense of the moving world around them just like we do. Over the last 10 years her team have shown, rather spectacularly, that the answer is yes.

People have been working on this problem for years, not least because the techniques are behind the amazing realism of CGI characters in blockbuster movies. When we see the world, somehow our brain turns all that information about colour and intensity of light hitting our eyes into a scene we make sense of – we can pick out different objects and tell which are in front and which behind, for example. In the 1950s psychophysics* researcher Gunnar Johansson showed how our brain does this. He dressed people in black with lightbulbs fastened around their bodies. He then filmed them walking, cycling, doing press-ups, climbing a ladder, all in the dark … with only the lightbulbs visible. He found that people watching the films could still tell exactly what they were seeing, despite the limited information. They could even tell apart two people dancing together, including who was in front and who behind. This showed that we can reconstruct 3D objects from even the most limited of 2D information when it involves motion. We can keep track of a knee, and see it as the same point as it moves around. It also shows that we use lots of ‘prior’ information – knowledge of how the world works – to fill in the gaps.

A motion capture system, image credit: T-tus at English Wikipedia. CC BY-SA 3.0

Shortcuts

Film-makers already create 3D versions of actors, but they use shortcuts. The first shortcut makes it easier to track specific points on an actor over time. You fix highly visible stickers (equivalent to Johansson’s light bulbs) all over the actor. These give the algorithms clear points to track. This is a bit of a pain for the actors, though. It also could never be used to make sense of random YouTube or CCTV footage, or whatever a robot is looking at.

The second shortcut is to surround the action with cameras so it’s seen from lots of angles. That makes it easier to track motion in 3D space, by linking up the points. Again this is fine for a movie set, but in other situations it’s impractical.

A third shortcut is to create a computer model of an object in advance. If you are going to be filming an elephant, then hand-create a 3D model of a generic elephant first, giving the algorithms something to match. Need to track a banana? Then create a model of a banana instead. This is fine when you have time to create models for anything you might want your computer to spot.

It is all possible for big budget film studios, if a bit inconvenient, but it’s totally impractical anywhere else.

No Shortcuts

Lourdes took on a bigger challenge than the film industry. She decided to do it without the shortcuts: to create moving 3D models from single cameras, applied to any traditional 2D footage, with no pre-placed stickers or fixed models created in advance.

When she started, a dozen or so years ago, making any progress looked incredibly difficult. Now she has largely solved the problem. Her team’s algorithms are even close to doing it all in real time, so making sense of the world as it happens, just like us. They are able to make really accurate models down to details like the subtle movements of their face as a person talks and changes expression.

There are several secrets to their success, but Johansson’s revelation that we rely on prior knowledge is key. One of the first breakthroughs was to come up with ways that individual points in the scene like the tip of a person’s nose could be tracked from one frame of video to the next. Doing this well relies on making good use of prior information about the world. For example, points on a surface are usually well-behaved in that they move together. That can be used to guess where a point might be in the next frame, given where others are.

The next challenge was to reconstruct all the pixels rather than just a few easy to identify points like the tip of a nose. This takes more processing power but can be done by lots of processors working on different parts of the problem. Key to this was to take account of the smoothness of objects. Essentially a virtual fine 3D mesh is stuck over the object – like a mask over a face – and the mesh is tracked. You can then even stick new stuff on top of the mesh so they move together – adding a moustache, or painting the face with a flag, for example, in a way that changes naturally in the video as the face moves.

Once this could all be done, if slowly, the challenge was to increase the speed and accuracy. Using the right prior information was again what mattered. For example, rather than assuming points have constant brightness, taking account of the fact that brightness changes, especially on flexible things like mouths, mattered. Other innovations were to split off the effect of colour from light and shade.

There is lots more to do, but already the moving 3D models created from YouTube videos are very realistic, and being processed almost as they happen. This opens up amazing opportunities for robots; augmented reality that mixes reality with the virtual world; games, telemedicine; security applications, and lots more. It’s all been done a little at a time, taking an impossible-seeming problem and instead of tackling it all at once, solving simpler versions. All the small improvements, combined with using the right information about how the world works, have built over the years into something really special.

*psychophysics is the “subfield of psychology devoted to the study of physical stimuli and their interaction with sensory systems.”


This article was first published on the original CS4FN website and a copy appears on pages 14 and 15 in “The women are (still) here”, the 23rd issue of the CS4FN magazine. You can download a free PDF copy by clicking on the magazine’s cover below, along with all of our free material.

Another article on 3D research is Making sense of squishiness – 3D modelling the natural world (21 November 2022).


Related Magazine …


EPSRC supports this blog through research grant EP/W033615/1.

Making sense of squishiness – 3D modelling the natural world

by Paul Curzon, Queen Mary University of London

Look out the window at the human-made world. It’s full of hard, geometric shapes – our buildings, the roads, our cars. They are made of solid things like tarmac, brick and metal that are designed to be rigid and stay that way. The natural world is nothing like that though. Things bend, stretch and squish in response to the forces around them. That provides a whole bunch of fascinating problems for computer scientists like Lourdes Agapito of Queen Mary, University of London to solve.

Computer scientists interested in creating 3-dimensional models of the world have so far mainly concentrated on modelling the hard things. Why? Because they are easier! You can see the results in computer-animated films like Toy Story, and the 3D worlds like Second Life your avatar inhabits. Even the soft things tend to be rigid.

Lourdes works in this general area creating 3D computer models, but she wants to solve the problems of creating them automatically just from the flat images in videos and is specifically interested in things that deform – the squishy things.

Look out the window and watch the world go by. As you watch a woman walk past you have no problem knowing that you are looking at the same person as you were a second ago – even if she becomes partially hidden as she walks behind the post box and turns to post a letter. The sun goes behind a cloud and the scene is suddenly darker. It starts to rain and she opens an umbrella. You can still recognise her as the same object. Your brain is pulling some amazing tricks to make this seem so mundane. Essentially it is creating a model of the world – identifying all the 3-dimensional objects that you see and tracking them over time. If we can do it, why can’t a computer?

Unlike hard surfaces, deformable ones don’t look the same from one still to the next. You don’t have to just worry about changes in lighting, them being partially hidden, and that they appear different from a different angle. The object itself will be a different shape from one still to the next. That makes it far harder to work out which bits of one image are actually the same as the ones in the next. Lourdes has taken on a seriously hard problem.

Existing vision systems that create 3D objects have made things easier for themselves by using existing models. If a computer already has a model of a cube to compare what it sees with, then spotting a cube in the image stream is much easier than working it out from scratch. That doesn’t really generalise to deformable objects though because they vary too much. Another approach, used by the film industry, is to put highly visible markers on objects so that those markers can be tracked. That doesn’t help if you just want to point a camera out the window at whatever passes by though.

Software from Lourdes’ team creates a model of the human face as it deforms. A looping gif of a man’s face making different expressions next to a cartoon version which copies him. Red dots on his features are mapped to red dots on the cartoon face

Lourdes aim is to be able to point a camera at a deformable object and have a computer vision system be able to create a 3D model simply by analysing the images. No markers, no existing models of what might be there, not even previous films to train it with, just the video itself. So far her team have created a system that can do this in some situations such as with faces as a person changes their expression. Their next goal is to be able to make their system work for a whole person as they are filmed doing arbitrary things. It’s the technical challenge that inspires Lourdes the most, though once the problems of deformable objects are solved there are applications of course. One immediately obvious area is in operating theatres. Keyhole surgery is now very common. It involves a surgeon operating remotely, seeing what they are doing by looking at flat video images from a fibre optic probe inside the body of the person being operated on. The image is flat but the inside of the person that the surgeon is trying to make cuts in is 3-dimensional. It would be far less error prone if what the surgeon was looking at was an accurate 3D model of the video feed rather than just a flat picture. Of course the inside of your body is made of exactly the kind of squishy deformable surfaces that Lourdes is interested in. Get the computer science right and technologies like this will save lives.

At the same time as tackling seriously hard if squishy computer science problems, Lourdes is also a mother of three. A major reason she can fit it all in, as she points out, is that she has a very supportive partner who shares in the childcare. Without him it would be impossible to balance all the work involved in leading a top European research team. It’s also important to get away from work sometimes. Running regularly helps Lourdes cope with the pressures and as we write she is about to run her first half marathon.

Lourdes may or may not be the person who turns her team’s solutions into the applications that in the future save lives in operating theatres, spot suspicious behaviour in CCTV footage or allow film-makers to quickly animate the actions of actors. Whoever does create the applications, we still need people like Lourdes who are just excited about solving the fundamental problems in the first place.


This article was originally published on the CS4FN website in ~2011. You can read more about Women in Computing here.


EPSRC supports this blog through research grant EP/W033615/1.