Neural model for the visual recognition of goal-directed hand movements

Authors: Fleischer, Falk; Giese, Martin A.
Published in: Bülthoff, H. H., Chatziastros, A., Mallot, H. A., Ulrich, R. (eds.): Proceedings of the 10th Tübinger Perception Conference (TWK 2007), Knirsch, Kirchentellinsfurt, p. 152
Review status: not reviewed

The visual recognition of goal-directed movements is crucial for the learning of actions, and possibly for understanding the intentions and goals of others. The discovery of mirror neurons has stimulated a vast amount of research investigating possible links between action perception and action execution [1,2,3]. However, the neural mechanisms underlying the visual recognition of goal-directed movements remain largely unclear. One class of theories suggests that action recognition is based mainly on a covert internal re-simulation of executed motor acts, potentially even in a joint coordinate system. Another set of approaches assumes that a substantial degree of action understanding can be accomplished by appropriate analysis of spatio-temporal visual features, employing mechanisms that are now widely accepted as the basis for the recognition of stationary objects.

We present a neurophysiologically inspired model for the recognition of hand movements that demonstrates the feasibility of the second approach, recognizing hand actions from real video data. The model addresses in particular how invariance against position variations of object and effector can be achieved while preserving the relative spatial information that is required for accurate recognition of the hand-object interaction. The model is based on a hierarchical feed-forward architecture for invariant object and motion recognition [4,5]. It extends previous approaches to complex stimuli such as hands, and adds the capability to process position information. The ability to recognize objects relies on a dictionary of shape-selective cells that is learned in an unsupervised manner from natural images. Feature complexity and invariance properties increase along the hierarchy through linear and nonlinear pooling operations.
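The alternation of shape-selective template matching and nonlinear pooling described above follows the general scheme of hierarchical feed-forward architectures such as [4]. The following minimal sketch (not the authors' implementation; the Gaussian tuning width, template count, and pooling sizes are arbitrary illustrative choices) shows how a template-matching layer followed by MAX pooling produces responses that are invariant to the position of a shape in the image:

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_patches(img, size):
    """All overlapping size x size patches of a 2-D image, flattened."""
    h, w = img.shape
    patches = np.array([img[i:i + size, j:j + size].ravel()
                        for i in range(h - size + 1)
                        for j in range(w - size + 1)])
    return patches, (h - size + 1, w - size + 1)

def s_layer(img, templates, size, sigma=0.5):
    """Shape-selective units: Gaussian tuning of each local patch
    to each template in the learned dictionary."""
    patches, grid = extract_patches(img, size)
    d = ((patches[:, None, :] - templates[None, :, :]) ** 2).sum(-1)
    resp = np.exp(-d / (2.0 * sigma ** 2))      # (n_patches, n_templates)
    return resp.reshape(grid + (len(templates),))

def c_layer(resp, pool):
    """Position pooling: nonlinear MAX over local spatial neighborhoods,
    yielding partial position invariance at the next level."""
    h, w, _ = resp.shape
    return np.stack([
        np.stack([resp[i:i + pool, j:j + pool].max(axis=(0, 1))
                  for j in range(0, w - pool + 1, pool)])
        for i in range(0, h - pool + 1, pool)])

# Toy stimulus: the same shape placed at two different positions.
img_a = np.zeros((12, 12)); img_a[2:5, 2:5] = 1.0
img_b = np.zeros((12, 12)); img_b[5:8, 6:9] = 1.0   # shifted copy

# "Unsupervised" dictionary: templates sampled from patches of one example.
patches, _ = extract_patches(img_a, 3)
templates = patches[rng.choice(len(patches), 5, replace=False)]

top_a = c_layer(s_layer(img_a, templates, 3), pool=5).max(axis=(0, 1))
top_b = c_layer(s_layer(img_b, templates, 3), pool=5).max(axis=(0, 1))
print(np.allclose(top_a, top_b))    # → True: top-level response unchanged by shift
```

Note that pooling discards absolute position, which is exactly why the model additionally has to preserve the relative spatial information needed to evaluate the hand-object interaction.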
It is demonstrated that the model correctly classifies different grasp types and is suitable for determining the spatial relationships between effector and object, which are crucial for judging whether an action correctly matches the object affordance. The model shows that simple, well-established, physiologically plausible neural mechanisms account for important aspects of visual action recognition without the need for a detailed 3D representation of object and action. This seems important since the robust extraction of joint angles from video is a hard and largely unresolved computational problem, for which no physiologically plausible neural models have been proposed so far.

[1] di Pellegrino, G. et al. (1992): Exp. Brain Res. 91, 176-180
[2] Gallese, V. et al. (1996): Brain 119, 593-609
[3] Rizzolatti, G. and Craighero, L. (2004): Annu. Rev. Neurosci. 27, 169-192
[4] Riesenhuber, M. and Poggio, T. (1999): Nat. Neurosci. 2, 1019-1025
[5] Giese, M.A. and Poggio, T. (2003): Nat. Rev. Neurosci. 4, 179-192