

Dec 06 2010

Tech Update: Depth Cameras and Kinect

Lidar systems rely on time of flight, and stereoscopic sensors rely on matching features across a baseline - but these aren't the only ways to get 3D information from a scene. Hizook covers several models of "Depth Camera".
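
For reference, both ranging principles come down to very simple geometry. A minimal sketch - the numbers here are illustrative, not taken from any particular sensor's datasheet:

# Back-of-the-envelope depth formulas for the two approaches above.
# All numbers are illustrative, not from any sensor's datasheet.

C = 299_792_458.0  # speed of light, m/s

def depth_from_time_of_flight(round_trip_s):
    # Lidar / ToF: depth is half the round-trip distance of the pulse.
    return C * round_trip_s / 2.0

def depth_from_stereo_disparity(focal_px, baseline_m, disparity_px):
    # Stereo: depth falls off as 1/disparity for a fixed focal length and baseline.
    return focal_px * baseline_m / disparity_px

print(depth_from_time_of_flight(20e-9))             # ~3.0 m for a 20 ns round trip
print(depth_from_stereo_disparity(580, 0.075, 15))  # 2.9 m for a 15-pixel shift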

The Kinect is based on the PrimeSense platform, which MS licensed a year or two ago.  As is often the case with MS, a competing time-of-flight technology, ZCam (from 3DV Systems), was acquired around the same time.

Kinect Diagram from Wikimedia Commons, CC-Attribution

Despite the three lenses on the front of the thing, this is not a stereoscopic solution.  One is an infrared laser projector, the second is an infrared camera, and the third is an RGB camera.  The projector puts out a grid pattern of dots, and the infrared camera appears to measure the density / apparent size and the spatial relationships of those dots in order to extract a fairly low-precision Z dimension.  I'm not certain whether there is any high-speed processing of the dots' timing against the infrared sensor - if there is, then an active light source can effectively substitute for the second camera in a stereoscopic solution. The RGB camera maps a color value to each dot.
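
The geometry behind depth-from-dot-shift is straightforward even though the actual firmware isn't public. A rough sketch of the idea - the focal length, baseline, and reference-plane distance below are assumed values for illustration, and this is not PrimeSense's actual algorithm:

import numpy as np

def dot_shift_to_depth(reference_x, observed_x, focal_px, baseline_m, reference_depth_m):
    # Rough sketch of a structured-light depth calculation (not PrimeSense's
    # actual algorithm): each projected dot shifts horizontally in the IR image
    # relative to where it falls on a reference plane at a known distance, and
    # triangulating that shift against the projector-camera baseline gives depth.
    disparity_px = np.asarray(observed_x, dtype=float) - np.asarray(reference_x, dtype=float)
    # Standard relation: 1/Z = 1/Z_ref + d / (f * B)
    inverse_depth = 1.0 / reference_depth_m + disparity_px / (focal_px * baseline_m)
    return 1.0 / inverse_depth

# Two dots: one shifted toward the camera, one slightly away from the 2 m reference plane.
print(dot_shift_to_depth([100.0, 240.0], [103.5, 239.0],
                         focal_px=580.0, baseline_m=0.075, reference_depth_m=2.0))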

This might be quite useful for indoor mapping, and potentially even for outdoor mapping if the hardware can be manipulated a bit.  Whether it's possible to integrate this with high-resolution pictures and interpolate to get better coverage is an open question.  We do know that SIFT + image bundling falls flat in areas of low surface texture and low dynamic range, where depth cameras and lidar should, in theory, work just fine.  Here are two videos from Dieter Fox's Intel / University of Washington team working with this hardware:

 


Kinect hacks are progressing *rapidly*, with new implementations appearing every day since the community made it usable with ordinary computers.


Edit: There is a great deal of confusion about how the Kinect works.  The precise implementation isn't well understood, but the PrimeSense patents describe a "light coded" matrix of dots.  The image sensor doesn't seem to have high enough geometric or temporal resolution to resolve all of those dots precisely enough to determine depth on its own - the stereo separation between the projector and the camera must matter in their algorithm, and even then they have some problems to solve.

They could make a tradeoff there, though.  Given the constraints, one way I might rig this to achieve 60 Hz depth imaging consistent with their patent would be to group the dots into 4x4 grids of 16, then select one dot from each grid to light up for the duration of a 1/960-second subframe.  Each subframe is solved for depth from the stereo separation, the 16 subframes are composited into one frame, and the result is smoothed for intensity and written as the Z dimension to the RGB data.  With the lit dots spaced further apart, and with each subframe's selection arranged into a discrete, detectable pattern, determining their relative positions would be plausibly achievable with a low-resolution sensor.
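
To make that hypothetical scheme concrete, here is a toy sketch of the compositing step.  This is purely an illustration of the speculation above - the array layout, the NaN convention, and the 3x3 fill are my own assumptions:

import numpy as np

def composite_subframes(subframe_depths):
    # subframe_depths: 16 sparse HxW depth maps, one per 1/960 s subframe,
    # with NaN wherever that subframe's dots didn't land.  Solving each
    # subframe separately and merging them yields one 60 Hz depth frame.
    composite = subframe_depths[0].copy()
    for subframe in subframe_depths[1:]:
        missing = np.isnan(composite)
        composite[missing] = subframe[missing]
    # Crude smoothing pass to fill pixels no dot covered, using the mean of
    # valid 3x3 neighbours (a stand-in for whatever interpolation the real
    # pipeline would use before writing Z alongside the RGB data).
    filled = composite.copy()
    height, width = composite.shape
    for y in range(height):
        for x in range(width):
            if np.isnan(filled[y, x]):
                patch = composite[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
                if np.any(~np.isnan(patch)):
                    filled[y, x] = np.nanmean(patch)
    return filled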

Some discussion on the topic:

Wired

Ars Technica Forums

Mirror Image

 

Edit2:

I found the paper that gives a brief overview of their process.  It's based on a SLAM algorithm that combines the sparse RGB feature recognition our system uses with a point cloud generated from the depth layer, along with a number of techniques to simplify and align surfaces and to close loops to reduce error.
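
The paper's full pipeline isn't reproduced here, but one of its basic building blocks - computing the rigid transform that aligns matched 3D points (RGB features back-projected through the depth layer) between two frames - is a textbook least-squares fit.  A minimal sketch, not the authors' code:

import numpy as np

def rigid_transform(src_pts, dst_pts):
    # Least-squares rotation R and translation t mapping src_pts onto dst_pts,
    # where both are Nx3 arrays of matched 3D points (e.g. sparse RGB features
    # back-projected using the depth layer).  The paper's alignment adds
    # outlier rejection, ICP-style refinement, and loop closure on top of
    # steps like this.
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    cross_cov = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(cross_cov)
    reflection_fix = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ reflection_fix @ U.T
    t = dst_mean - R @ src_mean
    return R, t  # dst is approximately (R @ src.T).T + t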

Some additional links:

http://www.ros.org/wiki/kinect

http://liu-cv.blogspot.com/2009/04/open-source-slam-software.html

http://openslam.org/

Edit3:

While it's definitely not at the same usability level as Dieter Fox's work, here's the first Kinect SLAM demo with code available:

 
