EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization

¹Michigan State University   ²Qualcomm Technologies Inc.
Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026
Teaser Image.

Figure 1. EmoTaG generates expressive, synchronized 3D talking heads from only a 5-second video of a new identity. Built upon a FLAME-Gaussian model and a Gated Residual Motion Network, our method achieves better emotional expressiveness, lip synchronization, visual realism, and motion stability than state-of-the-art approaches.

Abstract

Audio-driven 3D talking head synthesis has advanced rapidly with Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS). By leveraging rich pre-trained priors, few-shot methods enable instant personalization from just a few seconds of video. However, under expressive facial motion, existing few-shot approaches often suffer from geometric instability and audio-emotion mismatch, highlighting the need for more effective emotion-aware motion modeling. In this work, we present EmoTaG, a few-shot emotion-aware 3D talking head synthesis framework built on the Pretrain-and-Adapt paradigm. Our key insight is to reformulate motion prediction in a structured FLAME parameter space rather than directly deforming 3D Gaussians, thereby introducing explicit geometric priors that improve motion stability. Building upon this, we propose a Gated Residual Motion Network (GRMN), which captures emotional prosody from audio while supplementing head pose and upper-face cues absent from audio, enabling expressive and coherent motion generation. Extensive experiments demonstrate that EmoTaG achieves state-of-the-art performance in emotional expressiveness, lip synchronization, visual realism, and motion stability.
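No reference implementation accompanies this page, so as a rough illustration of the gated residual idea described above, the sketch below combines a base branch, an emotion residual branch, and a sigmoid gate over FLAME parameters; the module layout, the fusion step, and all dimensions are our assumptions, not the authors' code.

import torch
import torch.nn as nn

# Hypothetical sizes; common FLAME configurations use 100 expression,
# 3 jaw-pose, and 3 global-rotation parameters (106 total).
AUDIO_DIM, FLAME_DIM, HIDDEN = 512, 106, 256

class GatedResidualHead(nn.Module):
    """Base/residual/gate decoding over FLAME parameters: the base branch
    carries speech-driven motion, the residual branch an emotion-dependent
    offset, and the gate blends them per parameter and per frame."""
    def __init__(self):
        super().__init__()
        self.base = nn.Linear(HIDDEN, FLAME_DIM)
        self.residual = nn.Linear(HIDDEN, FLAME_DIM)
        self.gate = nn.Sequential(nn.Linear(HIDDEN, FLAME_DIM), nn.Sigmoid())

    def forward(self, h):                      # h: (B, T, HIDDEN) fused features
        return self.base(h) + self.gate(h) * self.residual(h)

fuse = nn.Linear(AUDIO_DIM, HIDDEN)            # stand-in for the real encoder
decoder = GatedResidualHead()
audio_feats = torch.randn(2, 50, AUDIO_DIM)    # (batch, frames, audio dim)
flame_params = decoder(fuse(audio_feats))      # (2, 50, FLAME_DIM)

Predicting FLAME parameters rather than raw Gaussian offsets keeps every output frame on the FLAME manifold, which is the source of the geometric prior the abstract refers to.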

Pipeline Image.

Figure 2. For pretraining, our Gated Residual Motion Network learns a universal motion prior from a multi-identity corpus. The network comprises an Identity-Conditioned Encoder, which integrates audio, expression, and identity through AdaIN-based modulation, followed by an Expert Motion Decoder, which leverages emotion-distilled supervision to train three cooperative branches (Base, Residual, Gate). During adaptation, the Gated Residual Motion Network is efficiently adapted to a new identity from a 5-second video by tuning only the AdaIN modulation parameters. At inference, the adapted model produces expressive, high-fidelity 3D facial animation driven by new audio together with head-pose and upper-face cues.
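To make the adaptation step concrete, here is a minimal sketch of AdaIN-style modulation and of freezing everything except the modulation parameters; the layer layout, parameter naming, and dimensions are illustrative assumptions rather than the released architecture.

import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive Instance Normalization: an identity code predicts the
    per-channel scale and shift applied after feature normalization."""
    def __init__(self, feat_dim, id_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(feat_dim, affine=False)
        self.to_scale_shift = nn.Linear(id_dim, 2 * feat_dim)

    def forward(self, x, id_code):             # x: (B, C, T), id_code: (B, id_dim)
        scale, shift = self.to_scale_shift(id_code).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(-1)) + shift.unsqueeze(-1)

def trainable_adain_params(model):
    """Few-shot adaptation: freeze the pretrained prior and expose only
    the AdaIN modulation weights to the optimizer."""
    for name, p in model.named_parameters():
        p.requires_grad = "to_scale_shift" in name
    return [p for p in model.parameters() if p.requires_grad]

layer = AdaIN(feat_dim=256, id_dim=64)
y = layer(torch.randn(2, 256, 50), torch.randn(2, 64))    # (2, 256, 50)
optimizer = torch.optim.Adam(trainable_adain_params(layer), lr=1e-4)

Because only the scale/shift predictors are updated, a 5-second clip leaves very few parameters to fit, which is what makes near-instant personalization feasible.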

SEG Image.

Figure 3. DeepFace provides both a categorical emotion distribution and a scalar emotion score to guide GRMN’s residual and gate branches.
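The categorical distribution and dominant label in Figure 3 are available through DeepFace's public analyze API; the reduction to a single scalar below is our assumption about one plausible score, not necessarily the one EmoTaG distills.

from deepface import DeepFace

# Analyze one video frame; recent DeepFace versions return a list with one
# dict per detected face, containing the 7-way emotion distribution
# (percentages) and the dominant label.
result = DeepFace.analyze(img_path="frame_0001.png", actions=["emotion"])
dist = result[0]["emotion"]            # e.g. {"happy": 93.1, "sad": 0.4, ...}
label = result[0]["dominant_emotion"]  # e.g. "happy"

# Hypothetical scalar score: confidence of the dominant emotion in [0, 1].
score = dist[label] / 100.0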

Results

Several videos are included, so loading may take a moment. Thanks for your patience.

Neutral

Real3DPortrait · MimicTalk · TalkingGaussian · InsTaG · EmoTaG · Reference

Emotional

Real3DPortrait · MimicTalk · TalkingGaussian · InsTaG · EmoTaG · Reference

Emotion Intensity

Real3DPortrait · MimicTalk · InsTaG · EmoTaG · Reference

BibTeX

If you find this project useful for your research, please consider citing our paper. ✨

@inproceedings{xu2026emotag,
  title={EmoTaG: Emotion-Aware Talking Head Synthesis on Gaussian Splatting with Few-Shot Personalization},
  author={Xu, Haolan and Cheng, Keli and Wang, Lei and Bi, Ning and Liu, Xiaoming},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}