In the history of mind-blowing tech demos there is one that stands out, and that is Johnny Lee’s Wii Remote hack to create VR displays:
That video is from 2007, and yet the technology being shown is so impressive that it feels like it could have been made yesterday.
The response to Johnny’s demo was universal acclaim. He presented it at a TED talk a year later, along with many other amazing hacks, and he went on to work on Microsoft’s Kinect and Google’s Project Tango.
Sadly, however, his VR display technique was never used in any Wii games or other commercial products that we know of.
Rewatching his demo earlier this year, we wondered why his technique had never taken off. Everyone wanted to try it, but nobody ever did anything big with it.
We were also curious whether we could implement his technique in the browser using only a webcam. If that were possible, we thought, it would make the technique accessible to everyone and open up a whole new world of interactive experiences on the web.
Imagine, for example, opening your favorite brand’s website and being presented with a miniature virtual storefront. You could look at their most popular products as if you were standing on a sidewalk peering into their shop.
That was enough to get us excited, so we got to work on solving this problem.
3D eye tracking with a webcam
The first thing that Johnny Lee’s technique requires is a way to calculate where your eyes are in world space with respect to the camera. In other words, you need a way to accurately say things like “your right eye is 20 centimeters to the right of the camera, 10 centimeters above it, and 50 centimeters in front of it.”
In Johnny’s case, he used the Wii Remote’s infrared camera and the LEDs on the Wii’s Sensor Bar to accurately determine the position of his head. In our case, we wanted to do the same but with a simple webcam.
After researching this problem for a while, we came across Google’s MediaPipe Iris library, which is described as follows:
MediaPipe Iris is a ML solution for accurate iris estimation, able to track landmarks involving the iris, pupil and the eye contours using a single RGB camera, in real-time, without the need for specialized hardware. Through use of iris landmarks, the solution is also able to determine the metric distance between the subject and the camera with relative error less than 10%.
[…]
This is done by relying on the fact that the horizontal iris diameter of the human eye remains roughly constant at 11.7±0.5 mm across a wide population, along with some simple geometric arguments.
That’s precisely what we needed, but there was one problem: MediaPipe Iris isn’t exposed to JavaScript yet, and we wanted our demo to run in the browser.
We were really disappointed by this, but then we came across this blog post that explains how Google’s MediaPipe Face Mesh library can also be used to measure the distance between the camera and the eyes.
Here’s a summary of how it’s done:
- MediaPipe Face Mesh detects four points for each iris: left, right, top and bottom.
- We know the distance in pixels between the left and right points, and we also know that the world-space distance between them must be roughly equal to 11.7 mm (the standard diameter of a human iris).
- Using those values and the intrinsic parameters of our webcam (the effective focal lengths in X and Y and the position of the principal point), we can solve the simple pinhole camera model equations to determine the depth of the eyes in world space. Once we know their depth, we can also calculate their X and Y positions.
It sounds difficult, but it’s surprisingly easy to do.
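To make the math concrete, here is a minimal sketch of that calculation in TypeScript. Everything in it is our own naming, not any library’s API: we assume the left and right iris landmarks have already been extracted (with MediaPipe Face Mesh or otherwise) and converted to pixel coordinates, and that the webcam’s intrinsics are known.

```ts
// A sketch of the depth calculation described above, assuming the iris
// landmarks have already been extracted (e.g. with MediaPipe Face Mesh)
// and converted to pixel coordinates.

const IRIS_DIAMETER_MM = 11.7; // average horizontal iris diameter

interface Point2D {
  x: number; // pixel coordinates
  y: number;
}

interface Intrinsics {
  fx: number; // effective focal lengths, in pixels
  fy: number;
  cx: number; // principal point, in pixels
  cy: number;
}

// Returns the eye's position in camera space, in millimeters.
function eyePositionMm(irisLeft: Point2D, irisRight: Point2D, k: Intrinsics) {
  // Pixel distance between the left and right iris landmarks.
  const widthPx = Math.hypot(irisRight.x - irisLeft.x, irisRight.y - irisLeft.y);

  // Pinhole model: widthPx / fx ≈ IRIS_DIAMETER_MM / Z, so:
  const z = (k.fx * IRIS_DIAMETER_MM) / widthPx;

  // Back-project the iris center to get X and Y at that depth.
  const u = (irisLeft.x + irisRight.x) / 2;
  const v = (irisLeft.y + irisRight.y) / 2;
  const x = ((u - k.cx) * z) / k.fx;
  const y = ((v - k.cy) * z) / k.fy;

  return { x, y, z };
}

// Example with made-up numbers: a 640x480 webcam with fx = fy = 600 px and
// the principal point at the image center. An iris spanning 20 px comes out
// at z = 600 * 11.7 / 20 = 351 mm, i.e. about 35 cm from the camera.
const eye = eyePositionMm(
  { x: 310, y: 235 },
  { x: 330, y: 235 },
  { fx: 600, fy: 600, cx: 320, cy: 240 },
);
```

Note that the sketch uses the horizontal focal length for the depth estimate; on typical webcams fx and fy are close enough that the distinction barely matters.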
The only remaining problem was how to calculate the intrinsic parameters of our webcam. The previous blog post answered that one for us too, pointing us to this blog post, which explains it as follows:
- First you print out this c