Multi-Domain Norm-Referenced Encoding
Humans can innately recognize facial expressions in unnatural forms, such as on the stylized faces drawn in cartoons or when mapped onto an animal's features. Current machine learning algorithms, however, struggle with out-of-domain transfer in facial expression recognition (FER). We propose a biologically-inspired mechanism for such transfer learning based on norm-referenced encoding, where patterns are encoded as difference vectors relative to a domain-specific reference vector. By incorporating domain-specific reference frames, we demonstrate high data efficiency in transfer learning across multiple domains. Our proposed architecture also offers an explanation for how the human brain might innately recognize facial expressions on varying head shapes (humans, monkeys, and cartoon avatars) without extensive training.
Examples of the portraits that form our basic face shape (BFS) dataset, showing seven different expressions on the three basic head shapes that serve as our source and target domains. The task is to classify facial expressions on unseen target domains, given training on a source domain (outlined in blue) and only a single reference image from each target domain (outlined in red and green).
Norm-referenced encoding (NRE) is a classical principle in neuroscience for face identity representation. The key idea of NRE is that the face identity is encoded by a difference vector relative to a norm stimulus, typically the average face. The direction of this difference encodes facial identity, while its length represents the distinctiveness of the face relative to the average face. Let s be a vector that represents a face stimulus in an appropriate feature space. Then the difference vector is defined by d = s - r, where r is a norm or reference vector—classically, the average face computed by averaging the feature vectors of a large number of faces.
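The encoding above can be sketched in a few lines. This is a minimal illustration in an assumed 2D feature space; the vectors are invented for the example and are not values from the paper.

```python
import numpy as np

# Illustrative 2-D feature space (values are assumptions, not from the paper).
r = np.array([0.5, 0.5])             # norm/reference vector (the average face)
s = np.array([0.9, 0.2])             # stimulus feature vector

d = s - r                            # difference vector d = s - r
distinctiveness = np.linalg.norm(d)  # length: distinctiveness vs. the average face
direction = d / distinctiveness      # direction: encodes facial identity
```

In practice r would be computed by averaging the feature vectors of many faces; here it is fixed by hand to keep the sketch self-contained.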
A) A schematic representation of norm-based encoding. The sketch displays two identities in feature space and their respective tuning vectors n1 and n2. The stimulus s is encoded by its position relative to the reference r through the difference vector d = s - r. B) Read-out activities vi of the two identity-tuned neurons for the input stimulus s.
Multi-Domain Norm-Referenced Encoding
NRE is thus an intuitive and powerful model for encoding faces and facial expressions. Here, we extend the model to multiple reference frames, referring to it as multi-domain norm-referenced encoding (MD-NRE). We hypothesize that utilizing multiple norm references might provide a highly data-efficient transfer-learning approach for multi-domain FER. Consider a two-dimensional feature space in which we aim to classify two expressions, E1 and E2, presented on two different head shapes, whose neutral expressions are denoted N1 and N2. Panel A shows a linear classifier that separates the two expressions correctly for only one head shape. Suppose the effect of changing the basic head shape is a collinear translation of all feature vectors in the feature space. Even though the translation vector is the same for all tested expressions, the classifier fails to classify the two expressions correctly for the second head shape.
Schematic representation of the classification of two expressions (E1 and E2) in a 2D face space, presented on two different basic head shapes. N1 and N2 represent the "norm faces" (neutral expressions) relative to which the individual expressions are encoded. A) A linear classifier, whose class boundary is indicated by the dashed line, correctly classifies the expressions on the first head shape (N1) but fails on the second head shape (N2). B) A norm-referenced classifier, which transfers the tuning vectors n1 and n2 from the first reference N1 to the second reference N2, accomplishes correct classification without retraining the classifying neurons. C) An example of a norm-referenced classifier in a poorly-chosen feature space, resulting in misclassification for the second head shape.
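The failure mode in panel A and the transfer in panel B can be reproduced numerically. This is a toy sketch under the paper's assumption of a collinear head-shape translation; all coordinates, the translation vector, and the linear boundary are invented for the illustration.

```python
import numpy as np

# Illustrative 2-D face space (all values are assumptions).
N1 = np.array([0.0, 0.0])              # neutral face, head shape 1
t  = np.array([3.0, 3.0])              # translation induced by the head-shape change
N2 = N1 + t                            # neutral face, head shape 2

E1_h1 = N1 + np.array([1.0, 0.0])      # expression E1 on head shape 1
E2_h1 = N1 + np.array([0.0, 1.0])      # expression E2 on head shape 1
E1_h2 = N2 + np.array([1.0, 0.0])      # same expression offsets on head shape 2
E2_h2 = N2 + np.array([0.0, 1.0])

# A fixed linear boundary fit on head shape 1: classify E1 if x > 0.5.
lin = lambda s: 0 if s[0] > 0.5 else 1
assert lin(E1_h1) == 0 and lin(E2_h1) == 1   # correct on head shape 1
assert lin(E1_h2) == 0 and lin(E2_h2) == 0   # fails on head shape 2: E2 misclassified

# Tuning vectors learned once, relative to N1 only.
n1 = (E1_h1 - N1) / np.linalg.norm(E1_h1 - N1)
n2 = (E2_h1 - N1) / np.linalg.norm(E2_h1 - N1)

def classify(s, ref):
    """Norm-referenced read-out: project s - ref onto each tuning vector."""
    d = s - ref
    return int(np.argmax([n1 @ d, n2 @ d]))  # 0 -> E1, 1 -> E2

# Swapping in the domain-specific reference N2 transfers the classifier
# to the second head shape without retraining.
assert classify(E1_h2, N2) == 0 and classify(E2_h2, N2) == 1
```

The key point of the sketch is that only the reference vector changes between domains; the tuning vectors n1 and n2 are reused unchanged.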
To exploit multi-domain norm-referenced encoding (MD-NRE), the model must satisfy several constraints. First, it needs to construct representations that preserve the tuning vectors across domains. Second, once the reference and tuning vectors are trained, the model must select the domain-specific reference vector, encode the input relative to it, and project the resulting difference vector onto the tuning directions. This requires an architecture with two streams: one that selects and updates the reference vector, and one that computes the difference vectors and their projections onto the tuning vectors.
A straightforward approach to a FER task would be to use pre-trained facial landmark detectors. Tracking landmark displacements as facial expression features directly yields a 2D representation in which the displacement is invariant to face shape and texture, and such representations support generalization from frontal to moderately rotated views. However, the state-of-the-art facial landmark detectors we tested [44, 45] yield poor results on non-human faces (see the supplementary materials for more details on the tested landmark estimation models). Moreover, a facial landmark detector alone would not be sufficient to exploit MD-NRE. To compute the outputs of the encoding neurons, each norm-referenced neuron needs two inputs: the reference vector rN and the learned preferred tuning direction n. To transfer between expressions on different head shapes, we therefore need a second stream that determines the head type and the associated reference vector.
Model architecture comprising three main components: a CNN for generic feature extraction, a module for feature reduction and the construction of robust features, and a read-out network / classifier module. The feature reduction transforms the latent space of the VGG19 model into a sparse set of facial landmarks. The dimensionality reduction module comprises two pathways. The FR (facial recognition) pathway acts as a reference vector selector and updates the reference parameters with the face position and scale. The FER (facial expression recognition) pathway outputs the absolute position of facial expression landmarks and computes the tuning vectors of the multi-domain norm-referenced encoding (MD-NRE). The role of the FER pathway is to construct a shape- and texture-invariant facial expression landmark detector. The MD-NRE read-out layer encodes the landmark position relative to the selected reference vector and projects it onto the corresponding tuning vector.
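The interaction of the three components can be sketched end-to-end. Every function body below is a toy stand-in (an assumption): a flattening operation replaces the VGG19 extractor, a nearest-reference heuristic replaces the FR pathway, and the identity map replaces the trained FER landmark regressor.

```python
import numpy as np

def cnn_features(image):
    """Stand-in for the VGG19 generic feature extractor."""
    return np.asarray(image, dtype=float).ravel()

def fr_pathway(features, references):
    """FR pathway as a reference selector: a nearest-reference heuristic
    stands in for the face-recognition network here."""
    return min(references, key=lambda k: np.linalg.norm(features - references[k]))

def fer_pathway(features):
    """FER pathway: stand-in for the shape- and texture-invariant
    expression-landmark detector (identity map in this toy sketch)."""
    return features

def md_nre_readout(landmarks, reference, tuning):
    """Encode landmarks relative to the selected reference and project
    the difference vector onto each tuning vector."""
    d = landmarks - reference
    return {name: float(n @ d) for name, n in tuning.items()}

# Toy domains and tuning vectors (assumptions).
references = {"human": np.array([0.0, 0.0]), "cartoon": np.array([3.0, 3.0])}
tuning = {"smile": np.array([1.0, 0.0]), "frown": np.array([0.0, 1.0])}

features = cnn_features([[3.9, 3.1]])        # a "cartoon smile" stimulus
domain = fr_pathway(features, references)    # reference selection
landmarks = fer_pathway(features)
scores = md_nre_readout(landmarks, references[domain], tuning)
```

The sketch mirrors the architecture's data flow: one pathway picks the reference frame, the other supplies the landmark positions, and the MD-NRE layer combines them.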