The art (it is difficult to refer to it yet as a science) of Facial Emotion Recognition (FER) is, arguably, in a nascent state. The prevailing methodologies, such as the Facial Action Coding System (FACS), are subject to frequent criticism; and the available tools to implement these principles – assuming you support those principles – tend to be either lightweight but out-of-date, or else reliant on heavy and unwieldy computing systems, which are more likely to be proprietary or closed-source.
Because consensus about FER is lacking, and the field of AI-aided emotion recognition is only just now being defined, available libraries and frameworks fall short of the current needs of the academic research sector.
One such need is for an easily deployable open source system capable of evaluating images of faces in terms of emotion recognition. Many of the most interesting tools in this regard (usually centered around GPU-based AI training) remain proprietary. The alternatives, notably the OpenFace and OpenFace 2 behavioral analysis tool-kits, rely on older statistical analysis approaches such as Support Vector Machine (SVM) and Histograms of Oriented Gradients (HOG) – stalwart, venerable technologies that are now more often found as minor components in larger and more complex frameworks, rather than as the central spine of an evaluative architecture.
To address this shortfall, a new initiative from the University of Southern California proposes to release a novel open source framework titled LibreFace (named, apparently, as a tribute to the open source Microsoft Office alternative LibreOffice).

LibreFace bridges the gap between the (in the authors’ collective opinion) outdated approaches of the OpenFace project and the rigors of developing full-fledged data-gathering and training pipelines for more recent and burdensome frameworks.
The system gathers together some very modern but portable FOSS components, such as MediaPipe (which will be familiar to many users of Stable Diffusion), as well as leveraging several state-of-the-art FER-relevant datasets, to compose a rational system that can run either on CPU or on a GPU – and which, in tests, was found to run twice as fast on CPU as OpenFace.
In addition to this, LibreFace achieves superior performance to OpenFace in general, and is able to perform comparably with other, much heavier and more resource-intensive systems.
LibreFace was developed as an array of .NET libraries, and is intended to operate across a number of platforms as an executable; though the current build is a Windows version, the researchers plan to bring LibreFace to macOS and Linux. They also plan to release the code at this URL (though at the time of writing, the repository is empty).
In terms of applicability to image synthesis, and the creation of neural characters, better FER tools are always needed, and a system such as LibreFace could be used to evaluate the 'emotional temperature' of neural faces, both as a filtering tool, and as an aid to development in new systems intended to allow creative practitioners to alter facial expressions.
Currently, most such systems involve simply pushing concepts into the latent space in order to change individual parts of the face, until an instinctively (i.e., by human interpretation) 'correct' expression emerges. Therefore any research, and any associated tooling, that can help to develop a flexible lexicon of cohesive facial affect values, such as 'happy' or 'sad', and at least semi-automate this process, is likely to prove quite useful.
The new paper announcing the work is titled LibreFace: An Open-Source Toolkit for Deep Facial Expression Analysis, and comes from five researchers at USC’s Institute for Creative Technologies.
The system comprises four stages: first, the source facial images are pre-processed using MediaPipe to create an interpreted mesh, from which facial landmarks can be derived; the results from this stage are then fed into a masked autoencoder (MAE) originally developed by Facebook Research, and thence into a linear regression or classification layer that rates the image for Action Unit (AU) intensity (i.e., the extent to which the constituent parts of the face can be said to be producing a specific emotion or compound emotion).
With the MAE fine-tuned thus, feature-wise distillation transfers what the MAE learned into a lightweight student model based on ResNet-18; and lastly, the ResNet-18 output is used to infer the FER characteristics.
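To make the flow concrete, here is a minimal, hypothetical sketch of those four stages in Python, using MediaPipe's Face Mesh and a torchvision ResNet-18 as a stand-in for the distilled student; the image path, the AU count, and the untrained weights are illustrative assumptions, not LibreFace's actual code.

```python
import cv2
import mediapipe as mp
import torch
import torch.nn as nn
from torchvision import models, transforms

NUM_AUS = 12  # hypothetical number of Action Units being scored

# Stage 1: derive a face mesh and landmarks with MediaPipe.
face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True)
image_rgb = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2RGB)
landmarks = face_mesh.process(image_rgb)

# Stages 2-3 (conceptual): in LibreFace, the aligned crop is encoded by a
# fine-tuned MAE whose features feed a linear regression/classification layer.
# Stage 4: at inference time, a distilled ResNet-18 student stands in for the MAE.
student = models.resnet18()
student.fc = nn.Linear(student.fc.in_features, NUM_AUS)  # untrained head, for shape only

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
with torch.no_grad():
    au_intensities = student(preprocess(image_rgb).unsqueeze(0))
print(au_intensities.shape)  # torch.Size([1, 12])
```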



The model is then fine-tuned on DISFA, a spontaneous facial action intensity database, for the task of AU intensity estimation (i.e., how 'extreme' a recognized expression is). A Mean Squared Error (MSE) loss function is used for this purpose.
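As a rough illustration of that objective, the sketch below pairs a linear regression head with an MSE loss over stand-in features and 0–5 intensity targets; the feature dimension, AU count, and batch size are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

encoder_dim, num_aus = 768, 12            # illustrative dimensions
regressor = nn.Linear(encoder_dim, num_aus)
criterion = nn.MSELoss()

features = torch.randn(28, encoder_dim)   # stand-in for encoder features (batch of 28)
targets = torch.rand(28, num_aus) * 5.0   # stand-in for DISFA-style 0-5 AU intensities

loss = criterion(regressor(features), targets)
loss.backward()
```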

In a secondary strategy, ResNet-18 is also used as an encoder, and trained on AffectNet and FFHQ, together with a linear classifier, before fine-tuning on DISFA.
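A hypothetical outline of that secondary strategy follows: ResNet-18 serves purely as an encoder, first paired with a linear expression classifier for pre-training and then with a regression head for the DISFA fine-tuning stage; the class and AU counts are assumed for illustration.

```python
import torch.nn as nn
from torchvision import models

NUM_EXPRESSION_CLASSES = 8   # assumption: AffectNet-style categorical labels
NUM_AUS = 12                 # assumption: DISFA-style AU intensity targets

encoder = models.resnet18()
encoder.fc = nn.Identity()   # use ResNet-18 purely as a 512-d feature encoder

classifier = nn.Linear(512, NUM_EXPRESSION_CLASSES)  # pre-training head
# ...pre-train encoder + classifier on AffectNet/FFHQ...

regressor = nn.Linear(512, NUM_AUS)                  # fine-tuning head for DISFA
# ...fine-tune encoder + regressor on DISFA with MSE loss...
```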

Since the ViT backbone in these strategies is quite resource-intensive, the researchers sought to slim down the pipeline by passing the data through the aforementioned student-teacher model, where select results from a heavier framework are passed to a lighter and more adroit module, based on the methodology of prior research from Samsung and the University of Nottingham in the UK.

In the LibreFace implementation, the pre-trained teacher classifier is frozen (i.e., it cannot be affected by the training process, and therefore furnishes reliable and consistent values based on its prior training), and this single frozen classifier is shared between the teacher and student models, in contrast to the original approach used for this technique.
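The sketch below is one hedged reading of that arrangement: a single frozen linear head scores both the teacher's features and the (projected) student's features, while a feature-matching term pulls the student toward the teacher. The dimensions, the projection layer, and the equal loss weighting are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher_dim, student_dim, num_aus = 768, 512, 12     # illustrative dimensions

frozen_classifier = nn.Linear(teacher_dim, num_aus)
for p in frozen_classifier.parameters():
    p.requires_grad = False                          # the pre-trained head stays frozen

project = nn.Linear(student_dim, teacher_dim)        # align student features to teacher space

teacher_feats = torch.randn(28, teacher_dim)         # stand-in for frozen MAE (teacher) features
student_feats = torch.randn(28, student_dim, requires_grad=True)  # stand-in for ResNet-18 features

aligned = project(student_feats)
feature_loss = F.mse_loss(aligned, teacher_feats)                  # feature-wise matching
output_loss = F.mse_loss(frozen_classifier(aligned),
                         frozen_classifier(teacher_feats))         # same frozen head for both
(feature_loss + output_loss).backward()
```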
The system was tested using PyTorch, and the authors state that the code and model weights will be made available for reproducibility later. All experiments and training were conducted on a single NVIDIA RTX 8000 GPU, which features a formidable 48GB of VRAM. However, as stated, the system can likewise run on a CPU.
Input images were resized to 256x256px. To increase the variety of data, training routines can optionally perform data augmentation, where the source data is fed to the system a number of times with diverse transformations, such as flipping (reversing) an image randomly, changing its angle, and even turning it upside down.
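A minimal torchvision augmentation pipeline of that kind might look like the following; the specific transforms and probabilities are illustrative assumptions rather than LibreFace's published recipe.

```python
from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.RandomHorizontalFlip(p=0.5),   # random mirroring
    transforms.RandomRotation(degrees=15),    # small random changes of angle
    transforms.RandomVerticalFlip(p=0.1),     # occasionally turn the image upside down
    transforms.ToTensor(),
])
```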

For systems that are seeking to reproduce a particular individual, such augmentations are not advised, since most people do not have perfectly symmetrical faces; however, in terms of generic emotion recognition, facial symmetry is irrelevant, and augmenting the data in this way can aid generalization.
The source data was therefore augmented in these ways for LibreFace's tests.
The model was trained with the AdamW optimizer at a rather high batch size of 28. Due to this high batch size, the learning rate was set to a moderate 3e-5, with a weight decay of 1e-4. If the learning rate had been set very low, as is increasingly becoming common in generative frameworks, the system would have learned far more slowly.
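Expressed as PyTorch configuration, the reported settings would amount to something like the snippet below; the model itself is a placeholder, as only the optimizer, batch size, learning rate, and weight decay are taken from the paper.

```python
import torch
from torchvision import models

BATCH_SIZE = 28                             # reported batch size
model = models.resnet18(num_classes=12)     # placeholder for the trained model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5, weight_decay=1e-4)
```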