Avatar FX: Cutting-Edge Video Generation by Character.AI

Avatar FX is here.

We’re excited to unveil this groundbreaking work by the Character.AI Multimodal team! 

Avatar FX can make images come to life – and speak, sing, and emote – all with the click of a button. It advances the state of the art on several fronts: it generates photorealistic videos with audio, maintains strong temporal consistency across face, hand, and body movement, supports long-form video, animates a pre-existing image rather than starting from text, and even handles videos with multiple speakers and multiple turns. Check it out here.

We're working on bringing the Avatar FX model into the Character.AI product over the coming months for all users to enjoy.

Our CAI+ subscribers will be some of the first to gain access to these new video features when they launch. 

Join the waitlist here. 

How It Works

Achieving this level of realism and expressive nuance is no trivial task. For high-quality, diverse video generation, flow-based diffusion models have become the gold standard. Building on the DiT architecture, our Multimodal team designed a parameter-efficient training pipeline that lets the diffusion model generate realistic lip, head, and body movement driven by an audio sequence.
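To make that concrete, here is a minimal sketch (not Character.AI's published code) of one common parameter-efficient pattern for audio-driven DiT conditioning: the pretrained transformer block is frozen, and a small, newly added audio cross-attention adapter is the only part that trains. All class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Stand-in for a pretrained DiT transformer block (kept frozen)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class AudioCrossAttnAdapter(nn.Module):
    """Trainable adapter: video tokens attend to audio feature tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as identity

    def forward(self, x, audio):
        h = self.cross_attn(self.norm(x), audio, audio, need_weights=False)[0]
        return x + self.gate * h

class AudioConditionedBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.base = DiTBlock(dim)
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad = False
        self.adapter = AudioCrossAttnAdapter(dim)  # only this part trains

    def forward(self, x, audio):
        return self.adapter(self.base(x), audio)

# video latent tokens: (batch, tokens, dim); audio features: (batch, frames, dim)
block = AudioConditionedBlock(dim=256)
out = block(torch.randn(2, 128, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 128, 256])
```

The zero-initialized gate is a standard trick for this kind of adapter: at the start of training the block behaves exactly like the frozen pretrained model, so the audio conditioning is learned gradually without destabilizing it.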

To extend generation beyond a single clip, we developed a novel inference strategy that preserves visual quality, motion consistency, and expressive diversity across arbitrarily long videos.
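The blog doesn't disclose the strategy itself, but a common approach the description is consistent with is chunked generation with overlapping context: each new window of frames is conditioned on the tail of the video generated so far, so motion stays coherent across window boundaries at any length. A minimal sketch, with `sample_window` as a hypothetical stand-in for one conditioned diffusion sampling call:

```python
import torch

def sample_window(context, audio_slice, num_frames):
    """Hypothetical stand-in for one conditioned diffusion sampling call:
    returns `num_frames` new frames continuing from the `context` frames."""
    return torch.randn(num_frames, *context.shape[1:])  # placeholder output

def generate_long_video(first_frame, audio_frames, window=16, overlap=4):
    """Chunked inference: each window reuses the last `overlap` frames of
    the video so far as conditioning, carrying motion forward."""
    video = first_frame.unsqueeze(0)        # (1, C, H, W)
    new_per_window = window - overlap
    pos = 0
    while pos < len(audio_frames):
        context = video[-overlap:]          # tail of what we have so far
        audio_slice = audio_frames[pos:pos + new_per_window]
        new = sample_window(context, audio_slice, len(audio_slice))
        video = torch.cat([video, new])
        pos += new_per_window
    return video                            # (1 + len(audio_frames), C, H, W)

video = generate_long_video(torch.randn(3, 64, 64), torch.randn(100, 128))
print(video.shape)  # torch.Size([101, 3, 64, 64])
```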

Our model performs well across a wide range of styles and scenarios, from realistic humans to mythical creatures, and even inanimate objects with faces. Behind this versatility is a data pipeline our data experts built to curate diverse video styles, filter out low-quality footage, and select clips spanning varying levels of motion and aesthetics, yielding a dataset that takes the model's generative capabilities to the next level. On the audio side, the voice is generated by Character.AI's proprietary TTS voice model.
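As an illustration of the kind of curation described above (the actual scorers and thresholds are not public, so everything below is a hypothetical placeholder), a pipeline like this filters on quality, keeps a spread of motion levels, and balances styles so no single one dominates:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    style: str        # e.g. "photoreal", "2d_anime", "3d_cartoon"
    quality: float    # 0-1, e.g. from a resolution/compression check
    motion: float     # 0-1, e.g. from optical-flow magnitude
    aesthetic: float  # 0-1, e.g. from a learned aesthetic scorer

def curate(clips, min_quality=0.6, motion_range=(0.1, 0.9)):
    """Drop low-quality clips, keep a range of motion levels, then balance
    style buckets, preferring the most aesthetic clips in each bucket."""
    kept = [c for c in clips
            if c.quality >= min_quality
            and motion_range[0] <= c.motion <= motion_range[1]]
    by_style: dict[str, list[Clip]] = {}
    for c in kept:
        by_style.setdefault(c.style, []).append(c)
    if not by_style:
        return []
    cap = min(len(v) for v in by_style.values())  # cap at smallest bucket
    balanced = []
    for style_clips in by_style.values():
        style_clips.sort(key=lambda c: c.aesthetic, reverse=True)
        balanced.extend(style_clips[:cap])
    return balanced

clips = [Clip("a.mp4", "photoreal", 0.9, 0.5, 0.8),
         Clip("b.mp4", "2d_anime", 0.7, 0.4, 0.6),
         Clip("c.mp4", "photoreal", 0.3, 0.5, 0.9)]  # dropped: low quality
print([c.path for c in curate(clips)])  # ['a.mp4', 'b.mp4']
```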

To keep generation cost- and time-efficient, we leverage state-of-the-art distillation techniques to reduce the number of diffusion steps, significantly accelerating inference with practically no quality degradation.
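The blog doesn't name the technique, but one well-known family is progressive step distillation (Salimans & Ho, 2022), where a student learns to match in one step what the teacher produces in two, roughly halving the step count per distillation round. A toy sketch of that training objective, assuming a flow-style velocity field sampled with Euler steps:

```python
import torch

def teacher_two_steps(teacher, x_t, t, dt):
    """Two small Euler steps of the frozen teacher's velocity field."""
    x_mid = x_t + teacher(x_t, t) * dt
    return x_mid + teacher(x_mid, t + dt) * dt

def distill_loss(student, teacher, x_t, t, dt):
    """Student takes one big step of size 2*dt toward the teacher's target."""
    with torch.no_grad():
        target = teacher_two_steps(teacher, x_t, t, dt)
    x_student = x_t + student(x_t, t) * (2 * dt)
    return torch.mean((x_student - target) ** 2)

# Toy models: the teacher is a fixed velocity field, the student is trainable.
teacher = lambda x, t: -x
student_net = torch.nn.Linear(4, 4)
student = lambda x, t: student_net(x)

loss = distill_loss(student, teacher, torch.randn(8, 4), t=0.5, dt=0.1)
loss.backward()  # gradients flow only into the student
```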

How It’s Different

Avatar FX pushes the boundaries of the state of the art in multiple ways:

  • It can generate top-quality video of 2D animated characters, 3D cartoon characters, and non-human faces (like a favorite pet!).
  • It maintains top-notch temporal consistency across face, hand, and body movement.
  • It preserves that temporal consistency even in long-form video.
  • It can generate top-quality video from a pre-existing image, rather than relying on text-to-image generation, giving users maximum control over the video they want to create.

These qualities represent notable technical advances. Going forward, they'll turbocharge Character.AI's products, helping our creators and users tell the next generation of stories with AI.

From Lab to Launch: Scaling the Stack

As we work to bring Avatar FX to all Character.AI users, our mission is clear: make the technology affordable, intuitive, and accessible for everyone. 

Our team of experienced designers, full-stack engineers, and infrastructure specialists is doing just that. From GPU orchestration to caching, queuing, and media delivery, every part of the stack will be designed so that, once this new technology is integrated into the Character.AI platform, creating a high-quality video feels as seamless and easy as clicking “Generate.”
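As a rough illustration of that request flow (all component names below are hypothetical stand-ins, not Character.AI's actual infrastructure), a generation request might be deduplicated against a cache, queued, rendered once by a GPU worker, and served from the cache thereafter:

```python
import hashlib
import queue

job_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for a distributed queue
video_cache: dict[str, bytes] = {}              # stand-in for object storage / CDN

def render(job: dict) -> bytes:
    """Hypothetical model call: returns the encoded video for a job."""
    return b"mp4-bytes"                          # placeholder output

def request_video(image_id: str, audio_id: str) -> str:
    """Enqueue a generation job, keyed by its inputs so repeats hit the cache."""
    key = hashlib.sha256(f"{image_id}:{audio_id}".encode()).hexdigest()
    if key not in video_cache:
        job_queue.put({"key": key, "image": image_id, "audio": audio_id})
    return key

def gpu_worker() -> None:
    """Drain the queue: render each job once and publish it to the cache."""
    while not job_queue.empty():
        job = job_queue.get()
        video_cache[job["key"]] = render(job)
        job_queue.task_done()

key = request_video("avatar_042", "tts_utterance_7")
gpu_worker()
print(len(video_cache[key]))  # the cached video is now ready to deliver
```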