Crafting a Face Tracking Solution for 3D Characters

2023-09-23

Here, I'm sharing tips and tricks on how to craft a simple face tracking system that works in real time on input video, built on top of any existing landmark extractor.

[Video demo]

High-level Problem

If you have a 3D character, you might want to animate its facial expressions to make it look alive and serve its purpose. We wanted to focus on using video input of someone's face to transfer the facial expression to a 3D character.

The use cases for facial animation based on input video for 3D characters vary:

Use Cases
Part of social apps and experiences where people can communicate with each other via animated avatars.
A tool for indie game devs who want to pre-record animations for their in-game characters.
A tool for VTubers, indie movie makers, and content creators.

Back in 2021, we set out to develop a solution that takes a user's real-time RGB video (with a face in it) and translates it into character animation. What we built back then is best demonstrated by the couple of videos shared in this post.

[Video demo]

Building a Simple Solution

At that point in 2021, there weren't many freely accessible solutions for the diverse use cases highlighted above. The most advanced and popular solution, then and now, is Apple's iPhone face tracking with a depth sensor, which enables accessible, high-quality tracking.

It is a standard in the industry to use a set of 52 blendshapes based on the Facial Action Coding System (FACS). To put it simply, it is a way to represent various human emotions as combinations of facial movements that are roughly independent, for example: Surprise = Raised Eyebrows + Open Jaw. To read more about it and how it works with 3D characters, I suggest this awesome blog post.
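For intuition, a surprise expression might decompose into blendshape weights roughly like this. The names follow ARKit's 52-blendshape convention, but the weight values below are made up purely for illustration:

```python
# Illustrative only: approximate blendshape weights for a "surprise" expression.
# Names follow ARKit's 52-blendshape convention; the values are not measured data.
surprise = {
    "browInnerUp": 0.8,      # raised inner eyebrows
    "browOuterUpLeft": 0.6,  # raised outer eyebrows
    "browOuterUpRight": 0.6,
    "eyeWideLeft": 0.5,      # widened eyes
    "eyeWideRight": 0.5,
    "jawOpen": 0.7,          # open jaw
}
# Every other blendshape is implicitly 0; a full rig expects all 52
# coefficients, each a float in [0, 1].
```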

From there, I was solving a well-defined problem:

ML Problem

RGB image sequence as input → 51 float values from 0 to 1 as output
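In code terms, the target interface looks roughly like this (the function name and shapes are just for illustration):

```python
import numpy as np

def predict_facs(rgb_frame: np.ndarray) -> np.ndarray:
    """Map one (H, W, 3) RGB frame to 51 FACS weights in [0, 1].

    A stateless sketch of the target interface; the real system also looks at
    previous frames for temporal smoothing (see the inference pipeline below).
    """
    raise NotImplementedError
```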

Approach

Several recent works on face tracking use differentiable rendering to align a 3D morphable model with the image and optimize the weights by comparing the final renders (Lele Chen et al., Facebook Reality Labs; Patrik Huber et al., University of Surrey, UK). There are other works I would like to mention here since I like them for their simplicity :) (Hongwei Xu et al., Facegood; Ariel Larey et al., Huawei Research).

  • However, we simplified the problem to a plain regression over 51 FACS values (we don't track the tongue), solved in a supervised setting.
  • Also, we didn't want to build our own optimized face detection + landmark detection system, since a good open-source one already existed. Thus, we built our solution on top of Mediapipe's facial 3D landmarks. You can see an example of Mediapipe landmark detection in the image, and a minimal extraction sketch below.
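Here is a minimal sketch of the landmark extraction step using Mediapipe's Face Mesh solution; the wrapper function is ours, and the real pipeline differs in details:

```python
import cv2
import mediapipe as mp
import numpy as np

# Face Mesh in video mode; refine_landmarks=True enables the extra iris
# landmarks and refined eye/lip points mentioned in the training section.
face_mesh = mp.solutions.face_mesh.FaceMesh(
    static_image_mode=False,
    refine_landmarks=True,
    max_num_faces=1,
)

def extract_landmarks(bgr_frame):
    """Return an (N, 3) array of normalized 3D landmarks, or None if no face."""
    rgb = cv2.cvtColor(bgr_frame, cv2.COLOR_BGR2RGB)
    result = face_mesh.process(rgb)
    if not result.multi_face_landmarks:
        return None
    face = result.multi_face_landmarks[0]
    return np.array([[lm.x, lm.y, lm.z] for lm in face.landmark])
```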

Training Pipeline

The pipeline consisted of several steps (see Image 1 below):

  1. Generating synthetic data, using a morphable head model to sample diverse identities and sampling FACS values from a Gaussian distribution.
  2. Rendering synthetic characters with the given identity and facial expression. For each character, we rendered a neutral pose and an expression pose.
  3. Extracting features. We used Mediapipe's 468 3D facial landmarks + 10 landmarks for iris tracking. We noticed that the refined Mediapipe landmarks produce much higher-quality facial tracking.
  4. Using the Procrustes transform to align the extracted landmarks with reference landmarks in 3D space.
  5. Calculating the difference between the neutral-pose landmarks and the expression-pose landmarks. This difference was used as the feature vector for our MLP network.
  6. The MLP predicts 51 FACS floats, and we backpropagate a Smooth L1 loss between the predictions and the FACS values sampled in step 1 (see the sketch after this list).
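A sketch of steps 5-6, assuming the feature vector is the flattened difference between the expression and neutral landmarks; the layer sizes are illustrative, not the ones we actually shipped:

```python
import torch
import torch.nn as nn

N_LANDMARKS = 478  # 468 facial landmarks + 10 iris landmarks
N_FACS = 51

# Illustrative architecture: a small MLP with sigmoid outputs to keep the
# predicted FACS weights in [0, 1].
mlp = nn.Sequential(
    nn.Linear(N_LANDMARKS * 3, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, N_FACS), nn.Sigmoid(),
)
criterion = nn.SmoothL1Loss()
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-3)

def train_step(landmark_diffs, target_facs):
    """landmark_diffs: (B, N_LANDMARKS * 3) expression-minus-neutral features;
    target_facs: (B, N_FACS) values sampled from the Gaussian in step 1."""
    optimizer.zero_grad()
    loss = criterion(mlp(landmark_diffs), target_facs)
    loss.backward()
    optimizer.step()
    return loss.item()
```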

Image 1: Training pipeline

Inference Pipeline

For best results, the algorithm requires an idle (neutral) image of each person. However, this step can be omitted, and we can use an average face as the idle face instead. In the latter case, the results might be worse for some people.

For each frame of the input video, we perform the following steps during inference (see Image 2):

  1. Extract Mediapipe landmarks.
  2. Apply the Procrustes transform to align the landmarks in space and obtain the Euler angles.
  3. Compute the features as the difference between the neutral and current landmarks and get the MLP predictions.
  4. Apply a moving-average filter over the last 3-5 frames, for both the FACS values and the Euler angles, to reduce jittering (see the sketch below).
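A sketch of step 4, the moving-average smoothing; the commented-out per-frame calls refer to hypothetical helpers sketched elsewhere in this post:

```python
from collections import deque

import numpy as np

WINDOW = 5  # averaging the last 3-5 frames worked well for us

facs_history = deque(maxlen=WINDOW)
euler_history = deque(maxlen=WINDOW)

def smooth(history, value):
    """Append the latest prediction and return the average over the window."""
    history.append(value)
    return np.mean(history, axis=0)

# Inside the per-frame loop (helper names are hypothetical / sketched elsewhere):
#   landmarks = extract_landmarks(frame)
#   aligned, euler = procrustes_align(landmarks, reference_landmarks)
#   facs = predict_facs_from_features(aligned - neutral_aligned)
#   facs_smoothed = smooth(facs_history, facs)
#   euler_smoothed = smooth(euler_history, euler)
```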

Image 2: Inference pipeline

Procrustes Transformation

Since we didn't train the landmarks jointly with the FACS regression model, we needed a way to put the detected landmarks into a normalized space where it is easy for the model to compare landmarks and make predictions. We used the Procrustes transformation for that. Simply put, this algorithm quickly finds a transformation that rotates and scales the 3D landmarks towards our predefined reference landmarks. It works great in our case because we know the 1-to-1 correspondence between the two sets of landmarks, and the rotation and scaling do not change the meaningful relative positions between the landmarks.

As a bonus, this approach also gives us Euler angles that we can later use to rotate the neck bone of the 3D avatar, or, in more popular terms, for head pose estimation.
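Under our assumption of known point correspondences, the optimal rotation can be found with an SVD, Kabsch/Umeyama style. Here is a sketch of how such an alignment could look, with the recovered rotation converted to Euler angles for the head pose:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def procrustes_align(landmarks, reference):
    """Align (N, 3) landmarks to (N, 3) reference points with matching order.

    Returns the aligned landmarks and the head-pose Euler angles in degrees.
    """
    # Center both point sets
    lm_c = landmarks - landmarks.mean(axis=0)
    ref_c = reference - reference.mean(axis=0)
    # Match the overall scale
    lm_c *= np.linalg.norm(ref_c) / np.linalg.norm(lm_c)
    # Optimal rotation via SVD (Kabsch), guarding against reflections
    u, _, vt = np.linalg.svd(lm_c.T @ ref_c)
    d = np.sign(np.linalg.det(vt.T @ u.T))
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    aligned = lm_c @ rot.T
    euler = Rotation.from_matrix(rot).as_euler("xyz", degrees=True)
    return aligned, euler
```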

Synthetic Data


Image 3: FACS correlations from our recorded iPhone X face expression data

  • We believe that the success of the whole approach can be attributed to our creative way of synthesizing data.
  • To obtain a ground-truth distribution of FACS values, we asked our friends to record their facial expressions using an iPhone X. We exported these values into a CSV file and computed the covariance matrix you can see in Image 3. Later, we used it to sample realistic facial expressions from the Gaussian distribution described by this matrix (see the sketch after this list). Since Ready Player Me characters support ARKit blendshapes out of the box, we used them to create the synthetic renders.
  • As for identity sampling, Ready Player Me has a proprietary morphable model that allows sampling the component values uniformly to get realistic-looking, diverse faces. Even though our experiments used cartoonish renders, we noticed that Mediapipe detected landmarks on them pretty well.
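A sketch of how such sampling could look; `recorded_facs` is a stand-in name for the (num_frames, 51) array exported from the iPhone recordings:

```python
import numpy as np

def sample_expressions(recorded_facs, n_samples):
    """Draw correlated FACS vectors from a Gaussian fit to recorded data."""
    mean = recorded_facs.mean(axis=0)
    cov = np.cov(recorded_facs, rowvar=False)  # the (51, 51) covariance matrix
    samples = np.random.multivariate_normal(mean, cov, size=n_samples)
    return np.clip(samples, 0.0, 1.0)  # keep samples in the valid blendshape range
```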

Conclusions

Around March 2023, Google finally released its own model for blendshape prediction built on top of Mediapipe landmarks. It gave the community good-enough face tracking for the web and game engines; you can check this open-source project if you are interested. Despite being highly requested, the feature didn't become a game-changer in the industry. I think the reason is that people doing professional facial mocap were already using an iPhone with Live Link or other professional-grade software (FaceWare, FaceGood, Rokoko, and others). Needless to say, for Windows desktop users and VTubers, Animaze does the job pretty well.


Image 4: MetaHuman face tracking with Google's face tracking model built on top of Mediapipe. Credit: https://www.phizmocap.dev/

There are not so many use cases for webcam-based mocap other than video meetings. And for this, I enjoy the funny avatars that Google brought to Google Meet 😺 🐶 ~~

We managed to port our face tracking model to the web before this happened, but we believe Google's version is much more performant. The only advantage of our implementation is the ability to personalize the predictions by feeding in a still face image. I noticed that some of my colleagues' avatars in Google Meet are always grumpy, and the reason is their naturally low eyebrow position. There is simply not enough information in the landmarks alone to distinguish a person's facial expression from their identity features. But for the use case that Google's model covers, this is sufficient.

If we were to continue pushing the quality of the model, utilizing the ideas from Roblox's face tracking approach seems like the best option. In particular, the most important things missing from our approach are:

  • Temporal consistency (using a temporal consistency loss)
  • Rich, realistic synthetic data with much more diversity to improve quality
  • A better landmark model, similar in quality to Alex Carlier's Reshot.AI solution or the one Microsoft developed

Thank you for reading this article. I hope you enjoyed it as much as I enjoyed solving this technical puzzle :)


If you want to cite this article, then the best way to do it would be:

Cheskidova, E., & Dutt, N. S. (2023). Crafting a Face Tracking Solution for 3D Characters. Ready Player Me. Retrieved from [https://www.wonder.cat/blog/face-tracking]

This format is based on the APA citation style, which is commonly used in the sciences. If your field uses a different citation style (like MLA, Chicago, etc.), you might need to adjust the format accordingly.