Synthesizing realistic talking head sequences is difficult for two main reasons. First, human heads have high photometric, geometric and kinematic complexity, which makes faces hard to model. Second, the human visual system is acutely sensitive to faces, so even minor modelling mistakes in appearance are immediately noticeable to viewers.
The researchers have presented a system for creating talking head models from a handful of photographs, a setting known as few-shot learning. The system can even generate a result from a single photograph (one-shot learning), though adding a few more photographs increases the fidelity of personalization.
The talking heads created by the researchers' system can handle a large variety of poses, going beyond the abilities of warping-based systems. The few-shot learning ability is obtained by extensive pre-training (meta-learning) on a large corpus of talking head videos covering many speakers with diverse appearance.
During meta-learning, the system simulates few-shot learning tasks and learns to transform face landmark positions into realistic-looking, personalized photographs.
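To make the idea concrete, below is a minimal PyTorch sketch of one simulated few-shot episode. The tiny Embedder, Generator and Discriminator are illustrative placeholders, not the paper's actual architectures (which are much larger and use different conditioning); only the overall structure, embedding a few frames of one speaker and then rendering a held-out frame from its landmarks, follows the described approach.

```python
# Toy sketch of one meta-learning episode (illustrative only).
import torch
import torch.nn as nn

class Embedder(nn.Module):
    """Maps a (frame, landmark image) pair to a person-specific embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
    def forward(self, frame, landmarks):
        x = torch.cat([frame, landmarks], dim=1)   # concatenate along channels
        return self.net(x).flatten(1)              # (B, dim)

class Generator(nn.Module):
    """Renders a frame from a landmark image, conditioned on the embedding."""
    def __init__(self, dim=64):
        super().__init__()
        self.enc = nn.Conv2d(3, dim, 3, padding=1)
        self.proj = nn.Linear(dim, dim)            # injects the identity code
        self.dec = nn.Conv2d(dim, 3, 3, padding=1)
    def forward(self, landmarks, emb):
        h = torch.relu(self.enc(landmarks))
        h = h + self.proj(emb)[:, :, None, None]   # simple additive conditioning
        return torch.tanh(self.dec(h))

class Discriminator(nn.Module):
    """Scores the realism of a (frame, landmark image) pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 1, 4, stride=2, padding=1),
        )
    def forward(self, frame, landmarks):
        return self.net(torch.cat([frame, landmarks], dim=1)).mean(dim=[1, 2, 3])

# One simulated few-shot episode: K support frames of a speaker plus one
# held-out target frame whose landmarks the generator must re-render.
K, size = 8, 64
support_frames = torch.rand(K, 3, size, size)
support_lmks   = torch.rand(K, 3, size, size)   # landmarks rasterised as images
target_frame   = torch.rand(1, 3, size, size)
target_lmks    = torch.rand(1, 3, size, size)

E, G, D = Embedder(), Generator(), Discriminator()
emb = E(support_frames, support_lmks).mean(dim=0, keepdim=True)  # average over K shots
fake = G(target_lmks, emb)
g_loss = (fake - target_frame).abs().mean() - D(fake, target_lmks).mean()
g_loss.backward()   # in practice, alternated with a discriminator update
print(fake.shape, g_loss.item())
```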
A handful of photographs of a new person sets up a new adversarial learning problem with a high-capacity generator and discriminator that are pre-trained via meta-learning. After a few training steps, this new problem converges to a state that generates realistic and personalized images.
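This few-shot stage can be pictured as a short adversarial fine-tuning loop. Continuing the toy networks defined in the sketch above, the snippet below is only an assumption-laden illustration: the optimiser, loss form, loss weights and step count are placeholders and do not reproduce the paper's actual training recipe.

```python
# Continuing the toy sketch above: few-shot fine-tuning on a new person.
new_frames = torch.rand(K, 3, size, size)   # handful of photos of a new person
new_lmks   = torch.rand(K, 3, size, size)   # their rasterised landmarks

emb = E(new_frames, new_lmks).mean(dim=0, keepdim=True).detach()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for step in range(40):                      # "a few training steps"
    idx = torch.randint(0, K, (1,))
    frame, lmk = new_frames[idx], new_lmks[idx]

    # Discriminator update: push real pairs up, generated pairs down.
    fake = G(lmk, emb).detach()
    d_loss = torch.relu(1 - D(frame, lmk)).mean() + torch.relu(1 + D(fake, lmk)).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: reconstruct the photo and fool the discriminator.
    fake = G(lmk, emb)
    g_loss = (fake - frame).abs().mean() - D(fake, lmk).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```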
In their experiments, the researchers compare talking heads created by their system with alternative neural talking head models, using both quantitative measurements and a user study. They also demonstrate several use cases of their talking head models, including video synthesis driven by landmark tracks extracted from video sequences and puppeteering (synthesizing video of one person from the face landmark tracks of a different person).
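The puppeteering use case can be illustrated by reusing the same toy sketch: compute an identity embedding from a few photos of person A, then render one frame per entry of person B's landmark track. All tensors here are random stand-ins; a real pipeline would extract and rasterise landmark tracks from a driving video.

```python
# Puppeteering, continuing the toy sketch above: person A's identity embedding
# drives frames rendered from person B's landmark track (placeholder data).
person_a_frames = torch.rand(K, 3, size, size)
person_a_lmks   = torch.rand(K, 3, size, size)
person_b_track  = torch.rand(30, 3, size, size)   # 30 rasterised landmark frames

with torch.no_grad():
    emb_a = E(person_a_frames, person_a_lmks).mean(dim=0, keepdim=True)
    video = torch.cat([G(lmk[None], emb_a) for lmk in person_b_track])

print(video.shape)   # (30, 3, size, size): person A's appearance, person B's motion
```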
The researchers use two talking head video datasets for quantitative and qualitative evaluation: VoxCeleb1 [26] (256p videos at 1 fps) and VoxCeleb2 [8] (224p videos at 25 fps), the second containing approximately 10 times more videos than the first. VoxCeleb1 is used for comparisons with baselines and for ablation studies, while VoxCeleb2 is used to show the full potential of their approach.
To conclude, the researchers have presented a framework for meta-learning of adversarial generative models that can train highly realistic virtual talking heads in the form of deep generator networks. A handful of photographs (as few as one) is enough to create a new model, while a model trained on 32 images achieves a perfect realism and personalization score in their user study (for 224p static images).
The key limitations of the method are its mimics representation and its lack of landmark adaptation: landmarks taken from a different person can lead to a noticeable personality mismatch. Creating "fake" puppeteering videos without such a mismatch would require some form of landmark adaptation.
The paper further reads, “We note, however, that many applications do not require puppeteering a different person and instead only need the ability to drive one’s own talking head. For such scenario, our approach already provides a high-realism solution.”
To learn more, check out the paper, Few-Shot Adversarial Learning of Realistic Neural Talking Head Models.