View-based neural encoding of goal-directed actions: a physiologically-inspired neural theory

Year:
2010
Type of Publication:
In Collection
Authors:
Giese, Martin A.
Caggiano, Vittorio
Thier, Peter
Month:
08
Pages:
1095
Note:
not reviewed
Abstract:

The visual recognition of goal-directed movements is crucial for action understanding. Neurons with visual selectivity for goal-directed hand actions have been found in multiple cortical regions. Such neurons are characterized by a remarkable combination of selectivity and invariance: their responses vary with subtle differences between hand shapes (e.g. those defining different grip types) and with the exact spatial relationship between effector and goal object (as required for a successful grip). At the same time, many of these neurons are largely invariant with respect to the spatial position of the stimulus and the visual perspective. This raises the question of how the visual system accomplishes this combination of spatial accuracy and invariance. Numerous theories of visual action recognition in neuroscience and robotics have postulated that the visual system reconstructs the three-dimensional structures of effector and object and then verifies their correct spatial relationship, potentially by internal simulation of the observed action in a motor frame of reference. However, novel electrophysiological data showing view-dependent responses of mirror neurons point towards an alternative explanation. We propose an alternative theory that is based on physiologically plausible mechanisms and makes predictions compatible with these electrophysiological results. It rests on the following key components: (1) a neural shape recognition hierarchy with incomplete position invariance; (2) a dynamic neural mechanism that associates shape information over time; (3) a gain-field-like mechanism that computes affordance and spatial matching between effector and goal object; (4) pooling of the output signals of a small number of view-specific, action-selective modules. We show that this model is computationally powerful enough to accomplish robust position- and view-invariant recognition on real videos. At the same time, it reproduces and correctly predicts data from single-cell recordings, e.g. the view- and temporal-order selectivity of mirror neurons in area F5.
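
To make the four components of the abstract concrete, the following is a minimal NumPy sketch of such an architecture. It is not the authors' implementation: all dimensions, weights, nonlinearities, and function names (shape_hierarchy, temporal_association, gain_field_match, recognize) are illustrative assumptions, with random projections standing in for learned shape and position templates.

```python
"""Minimal sketch of the four model components named in the abstract.
All names, sizes, and weights are illustrative assumptions."""
import numpy as np

rng = np.random.default_rng(0)
N_VIEWS, N_FEAT, N_POS = 4, 64, 16   # assumed counts of view modules, shape features, coarse positions
FRAME = 16 * 16                       # toy 16x16 input frames

# Random weights standing in for learned view-specific templates (assumption).
view_weights = rng.standard_normal((N_VIEWS, N_FEAT, FRAME)) / np.sqrt(FRAME)
pos_weights = rng.standard_normal((N_VIEWS, N_POS, FRAME)) / np.sqrt(FRAME)
readout = rng.standard_normal((N_VIEWS, N_FEAT)) / np.sqrt(N_FEAT)


def shape_hierarchy(frame, view):
    """(1) View-specific shape features with *incomplete* position invariance:
    rectified filter responses plus a coarse position code that is not
    pooled away, so spatial information survives."""
    feat = np.maximum(view_weights[view] @ frame.ravel(), 0.0)
    eff_pos = np.maximum(pos_weights[view] @ frame.ravel(), 0.0)
    return feat, eff_pos


def temporal_association(prev_state, feat, tau=0.8):
    """(2) Dynamic association of shape information over time:
    a leaky integrator linking successive frames into a sequence code."""
    return tau * prev_state + (1.0 - tau) * feat


def gain_field_match(effector_pos, object_pos, effector_shape):
    """(3) Gain-field-like matching: multiplicative interaction between the
    effector's position code and the object's position code, modulating the
    shape features so responses are strong only when grip and spatial
    relation both fit."""
    spatial_gain = (effector_pos * object_pos).sum()
    return spatial_gain * effector_shape


def recognize(frames_per_view, object_pos_per_view):
    """(4) Pool the outputs of a small number of view-specific modules."""
    outputs = []
    for v in range(N_VIEWS):
        state = np.zeros(N_FEAT)
        for frame in frames_per_view[v]:
            feat, eff_pos = shape_hierarchy(frame, v)
            matched = gain_field_match(eff_pos, object_pos_per_view[v], feat)
            state = temporal_association(state, matched)
        outputs.append(readout[v] @ state)   # scalar action evidence per view
    return max(outputs)                      # view-invariant pooled response


# Toy usage: random frames and object position codes per view module.
frames = [[rng.random((16, 16)) for _ in range(5)] for _ in range(N_VIEWS)]
obj_pos = [np.maximum(rng.standard_normal(N_POS), 0.0) for _ in range(N_VIEWS)]
print("pooled action response:", recognize(frames, obj_pos))
```

In this sketch the multiplicative (gain-field-like) term plays the role of the affordance/spatial-matching stage, and the final max over view modules stands in for the pooling that yields view invariance; the real model presumably uses learned templates and richer dynamics in place of the random projections and leaky integrator assumed here.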