Sunday, October 24, 2004

Why No Vision?

Why is it that computer vision has proven to be such a difficult problem? The strange thing is that computer graphics, which one might regard as the inverse problem, is rapidly closing in on achieving photorealistic rendering of scenes. I'm also puzzled because recognition problems are typically simpler than generation problems. It's certainly true that computer graphics has benefited from much commercial development and Moore's law, but faster computers should help recognition tasks as well.

One idea is that vision suffers from the same kind of problem as commonsense reasoning, namely, the lack of large-scale knowledge bases about the kinds of objects and materials in the world, what they look like from different angles and under different lighting, and so forth. But if this is the case, and computer graphics has advanced so far, it should not be difficult to generate such a corpus with a moderate investment -- a corpus of images, ground truths in terms of 3D and other types of surface models, and connections to more general commonsense knowledge.


Bob Mottram said...

The problem of computer vision is a frustratingly stubborn one, and like a beached whale it steadfastly refuses to budge despite the gargantuan exertions of numerous well-intentioned researchers. In my own opinion the difficulties with vision are twofold.

Firstly, a modular approach is often taken. By this I mean that the vision functions are gently coaxed off to one side and treated as a separate and independent problem. In some situations, such as industrial vision systems where many prior assumptions may be made about exactly what will be visually encountered, this may be a sufficiently satisfying approach, but in a more general situation, where you might have some machine or robot which must go forth into the world and recognise things for itself, this setup loses its viability. I think that vision is not a completely self-contained skill, but instead relies upon other cognitive contributors.

Secondly, under the classical view, vision is assumed to be a largely bottom-up process. Marr described the vision problem as basically taking a two-dimensional image on the retina and converting it into three-dimensional "representations". Indeed there does appear to be a hierarchical process involved, but I think that information within the visual system does not only perambulate in one direction.

Our attention now turns inexorably (currently my favourite word) towards the more successful computer vision systems of recent years. What is going on in these systems? Are there any general principles which can be added to the semi-coherent meanderings of an armchair AI theorist such as myself? Well, yes, I think there are. Take the single-camera SLAM system developed by Andrew Davison as an example. Here a three-dimensional model of the environment is built up as the camera moves through space. Features detected within the image are used to create a number of hypothetical positions within the three-dimensional model, and these position guesses are then used to help select new features within the image. Thus the system reliably tracks features, and is even able to re-acquire them as they move in and out of view. In summary, information is going both ways.
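To make the "information is going both ways" idea concrete, here is a minimal sketch of a predict-then-search tracking loop. It is not Davison's system: it assumes a single feature tracked along one image axis with a scalar Kalman filter, and the function names and the noise values `q` and `r` are invented for illustration. The prediction narrows down where to look (top-down), and whatever is detected inside that window corrects the estimate (bottom-up).

```python
def predict(x, p, motion, q):
    """Top-down step: where should the feature be now, and how sure are we?"""
    return x + motion, p + q


def search_window(x_pred, p_pred, r, n_sigma=3.0):
    """Turn the prediction and its uncertainty into a region of the image to search."""
    half_width = n_sigma * (p_pred + r) ** 0.5
    return x_pred - half_width, x_pred + half_width


def update(x_pred, p_pred, z, r):
    """Bottom-up step: correct the prediction with the measured position z."""
    gain = p_pred / (p_pred + r)
    return x_pred + gain * (z - x_pred), (1.0 - gain) * p_pred


def track(x, p, motion, detections, q=0.5, r=1.0):
    """One tracking cycle for a single feature (hypothetical helper)."""
    x_pred, p_pred = predict(x, p, motion, q)
    low, high = search_window(x_pred, p_pred, r)
    candidates = [z for z in detections if low <= z <= high]
    if not candidates:
        # Feature is out of view: keep the (now less certain) prediction
        # so that it can be re-acquired when it reappears.
        return x_pred, p_pred
    nearest = min(candidates, key=lambda z: abs(z - x_pred))
    return update(x_pred, p_pred, nearest, r)
```

Detections far from the predicted position are never considered, which is what makes the matching cheap and reliable, and a feature with no match simply carries its prediction forward with a larger uncertainty, widening the search window on the next cycle.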

In my own experiments with one of the robots I have tried to take a similar approach. At any point in time there is a tension between exogenous contributions from its senses and endogenous factors emerging from internal processes. The balance between these is governed by the frequency of operation of simulated "neurons" within the system. At high frequencies exogenous forces dominate and the robot's behavior becomes quite reactive. At lower frequencies endogenous forces take over and the robot's perception is largely the result of internal factors based upon previous experiences. It is literally re-membering previously stored perceptions to form the current "mental scene". One perhaps unfortunate side effect is that this also makes it possible under some circumstances for the robot to "hallucinate", or believe that it is seeing things which aren't strictly contained within the current visual images from the cameras.
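As a rough illustration only, the balance described above can be sketched as a frequency-dependent blend of what the cameras report and what memory supplies. This is not the robot's actual code: the saturating weight, the `half_point_hz` parameter, and the function names are all assumptions made for the sake of the example.

```python
def sensory_weight(freq_hz, half_point_hz=10.0):
    """Hypothetical weighting: approaches 1 at high frequency, 0 at low frequency."""
    return freq_hz / (freq_hz + half_point_hz)


def perceive(sensed, remembered, freq_hz, half_point_hz=10.0):
    """Blend current sensor values (exogenous) with recalled ones (endogenous)."""
    w = sensory_weight(freq_hz, half_point_hz)
    return [w * s + (1.0 - w) * m for s, m in zip(sensed, remembered)]
```

With this toy weighting, `perceive([1.0, 0.0], [0.0, 1.0], 0.0)` returns the remembered scene unchanged, which is the "hallucination" case: the second element is perceived even though the sensors report nothing there.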

January 8, 2005 at 4:30 PM  

