Highly realistic monkey body avatars with veridical motion

Research Area

Neural and Computational Principles of Action and Social Processing

Researchers

Lucas M. Martini; Martin A. Giese

Description

For studying the neural and computational mechanisms of body and social perception, neurophysiological experimenters need highly controllable stimuli that specify exactly, for example, viewpoint, pose, or texture. Such stimuli can be realized with computer graphics and animation. However, the realism of such stimuli is essential if one wants to study how the visual system processes real bodies. In terms of stimulus generation, this defines the challenging task of creating avatars with highly realistic appearance that show realistic body motion. Traditional approaches used in humans, such as marker-based motion capture, do not transfer easily to animals, so other solutions have to be developed. Available video-based markerless approaches either are not accurate enough for full 3D animation of body models, or they require prohibitive amounts of hand-labelling of poses in individual frames. Combining recent advances in computer vision and Convolutional Neural Networks (CNNs), we develop methods that realize sufficiently accurate markerless tracking from multi-camera video with very limited amounts of hand-labelled training data. Based on this approach, we are able to generate highly realistic animations of monkeys, which are used by our collaboration partners to study body-selective neurons in electrophysiological experiments.
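At the core of such markerless tracking is the combination of 2D keypoint detections from several calibrated cameras into 3D joint positions. The sketch below is a minimal illustration of this step only, not our published implementation: standard linear triangulation (DLT) in NumPy, where the projection matrices and 2D detections are assumed to come from the camera calibration and a CNN-based keypoint detector, respectively.

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Triangulate one 3D point from its 2D detections in several
    calibrated cameras via the direct linear transform (DLT).

    proj_mats : list of (3, 4) camera projection matrices
    points_2d : list of (x, y) pixel coordinates, one per camera
    """
    rows = []
    for P, (x, y) in zip(proj_mats, points_2d):
        # Each view contributes two linear constraints on the
        # homogeneous 3D point X, e.g. x * (P[2] @ X) = P[0] @ X.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Solution: right singular vector with the smallest singular
    # value; dehomogenize to obtain (X, Y, Z).
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```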

The animal cage was scanned with a 3D scanner, and the scene was reconstructed in the game development platform Unreal Engine. This allows for exact modeling of the camera positions and for optimizing the recording and camera setup by simulation, before the actual motion recordings in the real animal facility. From synchronized multi-camera recordings, we track exact 3D positions using deep-learning approaches, exploiting only a small number of hand-labelled multi-camera frames. The generated trajectories, after optimization, are retargeted onto a commercial avatar that is embedded in the Unreal scene. The resulting dynamic scene is extremely similar to a real video and apparently not distinguishable from such real scenes by monkey observers.
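Because the cage geometry and the camera models exist in the reconstructed scene, candidate camera placements can be compared before any animal recording. As a simple stand-in for this kind of check, the hypothetical sketch below assumes pinhole cameras with known intrinsics K, rotation R, and translation t, and counts from how many views each point of interest is visible in-frame; planning in the actual Unreal scene additionally accounts for occlusion and lighting.

```python
import numpy as np

def pinhole_project(K, R, t, points_3d):
    """Project world points into one camera (intrinsics K,
    rotation R, translation t). Returns pixel coords and depths."""
    cam = (R @ points_3d.T + t[:, None]).T   # world -> camera frame
    z = cam[:, 2]
    uv = (K @ cam.T).T
    return uv[:, :2] / z[:, None], z

def coverage(cameras, points_3d, width=1920, height=1080):
    """Count, for each 3D point, how many cameras see it in front
    of the lens and inside the image; used to compare candidate
    camera setups (occlusion by scene geometry is ignored here)."""
    counts = np.zeros(len(points_3d), dtype=int)
    for K, R, t in cameras:
        uv, z = pinhole_project(K, R, t, points_3d)
        visible = ((z > 0)
                   & (uv[:, 0] >= 0) & (uv[:, 0] < width)
                   & (uv[:, 1] >= 0) & (uv[:, 1] < height))
        counts += visible
    return counts
```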

 

Commercial monkey avatar driven by markerless motion capture (left). Example of the rendered and posed macaque avatar (right).

The recording area as a game scene used to adjust lighting and camera positioning (created from 3D scans).

Specifically, we scanned the recording area with a 3D scanner and reconstructed the scenery in the game development platform Unreal Engine. In this way, we could plan the positioning of cameras and equipment before the actual recordings on site, which is especially useful because the recording environment can be very loud and crowded. We recorded videos of several subjects with a synchronized and calibrated multi-camera setup. From these data, we generated labels to train deep-learning-based tracking methods, customized for our problem, that produce action-specific motion trajectories. After optimization, these trajectories are retargeted onto a commercial avatar, a step that is usually performed with commercial tools; we instead provide an easy way to animate a virtual avatar directly from the generated labels. We can then render the avatar from arbitrary angles or change the surroundings. Using the digital copy of the recording area, we can also recreate the original videos synthetically, producing stimuli for neuroscience applications that are not only highly realistic but also fully controllable.
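The retargeting step maps the tracked 3D joint trajectories onto the avatar's skeleton. As an illustration of its geometric core only (a real pipeline, like the commercial tools mentioned above, additionally handles bone-length differences, joint limits, and twist), the sketch below computes per-bone rotations that align each rest-pose bone direction with its tracked direction; `skeleton`, `rest_pose`, and `tracked_joints` are hypothetical names.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def bone_rotation(rest_dir, tracked_dir):
    """Smallest rotation aligning a bone's rest-pose direction
    with its tracked direction (both 3-vectors)."""
    a = rest_dir / np.linalg.norm(rest_dir)
    b = tracked_dir / np.linalg.norm(tracked_dir)
    axis = np.cross(a, b)
    s, c = np.linalg.norm(axis), np.dot(a, b)
    if s < 1e-8:  # parallel: no rotation (antiparallel case omitted)
        return Rotation.identity()
    return Rotation.from_rotvec(axis / s * np.arctan2(s, c))

def retarget(skeleton, rest_pose, tracked_joints):
    """Per-bone rotations for one frame. `skeleton` maps a bone name
    to (parent_joint, child_joint) indices into the keypoint arrays."""
    return {bone: bone_rotation(rest_pose[c] - rest_pose[p],
                                tracked_joints[c] - tracked_joints[p])
            for bone, (p, c) in skeleton.items()}
```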

 

The pipeline generates highly realistic images. The original recorded image (left) and the synthetically generated image (right).


Publications

Martini, L. M., Bognár, A., Vogels, R. & Giese, M. A. (2024). MacAction: Realistic 3D macaque body animation based on multi-camera markerless motion capture. bioRxiv.
Abstract:

Social interaction is crucial for survival in primates. For the study of social vision in monkeys, highly controllable macaque face avatars have recently been developed, while body avatars with realistic motion do not yet exist. Addressing this gap, we developed a pipeline for three-dimensional motion tracking based on synchronized multi-view video recordings, achieving sufficient accuracy for life-like full-body animation. By exploiting data-driven pose estimation models, we track the complete time course of individual actions using a minimal set of hand-labeled keyframes. Our approach tracks single actions more accurately than existing pose estimation pipelines for behavioral tracking of non-human primates, requiring less data and fewer cameras. This efficiency is also confirmed for a state-of-the-art human benchmark dataset. A behavioral experiment with real macaque monkeys demonstrates that animals perceive the generated animations as similar to genuine videos, and establishes an uncanny valley effect for bodies in monkeys.

Martini, L. M., Bognár, A., Vogels, R. & Giese, M. A. (2024). Macaques show an uncanny valley in body perception. Journal of Vision, September 2024. Vision Science Society.
Abstract:

Previous work has shown that neurons from body patches in macaque superior temporal sulcus (STS) respond selectively to images of bodies. However, the visual features leading to this body selectivity remain unclear.

METHODS: We conducted experiments using 720 stimuli presenting a monkey avatar in various poses and viewpoints. Spiking activity was recorded from mid-STS (MSB) and anterior-STS (ASB) body patches, previously identified using fMRI. To identify the visual features driving the neural responses, we used a model with a deep network as frontend and a linear readout that was fitted to predict the neuron activities. Computing the gradients of the outputs backwards along the neural network, we identified the image regions that were most influential for the model neuron output. Since previous work suggests that neurons from this area also respond to some extent to images of objects, we used a similar approach to visualize object parts eliciting responses from the model neurons. Based on an object dataset, we identified the shapes that activate each model unit maximally. Computing and combining the pixel-wise gradients of model activations from object and body processing, we were able to identify common visual features driving neural activity in the model.

RESULTS: Linear models fit the data well, with mean noise-corrected correlations with neural data of 0.8 in ASB and 0.94 in MSB. Gradient analysis on the body stimuli did not reveal clear preferences for certain body parts, and the resulting maps were difficult to interpret visually. However, the joint gradients between objects and bodies traced visually similar features in both images.

CONCLUSION: Deep neural networks model STS data well, even though for all tested models, the explained variance was substantially lower in the more anterior region. Further work will test whether the features that the deep network relies on are also used by body patch neurons.
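The gradient analysis described in this abstract is closely related to vanilla gradient saliency maps. As a generic illustration only, not the exact method of the paper, such a map can be computed by backpropagating one model unit's response to the input pixels:

```python
import torch

def saliency_map(model, image, unit_index):
    """Vanilla gradient saliency: which pixels most influence one
    model unit's response. `image` is a (1, C, H, W) tensor."""
    model.eval()
    image = image.clone().requires_grad_(True)
    response = model(image)[0, unit_index]
    response.backward()
    # Max of the absolute gradient over color channels gives a
    # per-pixel influence map of shape (H, W).
    return image.grad.abs().max(dim=1)[0].squeeze(0)
```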


Information

All images and videos displayed on this webpage are protected by copyright law. These copyrights are owned by Computational Sensomotorics.

If you wish to use any of the content featured on this webpage for purposes other than personal viewing, please contact us for permission.
