Avatar FX: Cutting-Edge Video Generation by Character.AI

Avatar FX is here.

We’re excited to unveil this groundbreaking work by the Character.AI Multimodal team! 

Avatar FX can make images come to life – and speak, sing, and emote – all with the click of a button. It advances the state of the art on several fronts: it generates photorealistic videos with audio, maintains strong temporal consistency across face, hand, and body movement, supports long-form video, animates a pre-existing image rather than starting from text, and even handles videos with multiple speakers and multiple turns. Check it out here.

We're working on bringing the Avatar FX model into the Character.AI product over the coming months for all users to enjoy.

Our CAI+ subscribers will be some of the first to gain access to these new video features when they launch. 

Join the waitlist here. 

How It Works

Achieving this level of realism and expressive nuance is no trivial task. For high-quality, diverse video generation, flow-based diffusion models have become the gold standard. Building on the DiT architecture, our Multimodal team designed a parameter-efficient training pipeline that lets the diffusion model generate realistic lip, head, and body movement driven by an audio sequence.
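To make that concrete, here is a minimal sketch (not Character.AI's published code) of one common parameter-efficient pattern for audio-driven DiT conditioning: the pretrained transformer block is frozen, and a small, newly added audio cross-attention adapter is the only part that trains. All class and variable names here are hypothetical.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Stand-in for a pretrained DiT transformer block (kept frozen)."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class AudioCrossAttnAdapter(nn.Module):
    """Trainable adapter: video tokens attend to audio feature tokens."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: starts as identity

    def forward(self, x, audio):
        h = self.cross_attn(self.norm(x), audio, audio, need_weights=False)[0]
        return x + self.gate * h

class AudioConditionedBlock(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.base = DiTBlock(dim)
        for p in self.base.parameters():       # freeze the pretrained weights
            p.requires_grad = False
        self.adapter = AudioCrossAttnAdapter(dim)  # only this part trains

    def forward(self, x, audio):
        return self.adapter(self.base(x), audio)

# video latent tokens: (batch, tokens, dim); audio features: (batch, frames, dim)
block = AudioConditionedBlock(dim=256)
out = block(torch.randn(2, 128, 256), torch.randn(2, 50, 256))
print(out.shape)  # torch.Size([2, 128, 256])
```

The zero-initialized gate is a standard trick for this kind of adapter: at the start of training the block behaves exactly like the frozen pretrained model, so the audio conditioning is learned gradually without destabilizing it.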

To extend generation beyond a single clip, we developed a novel inference strategy that preserves visual quality, motion consistency, and expressive diversity across arbitrarily long videos.
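The blog doesn't disclose the strategy itself, but a common approach the description is consistent with is chunked generation with overlapping context: each new window of frames is conditioned on the tail of the video generated so far, so motion stays coherent across window boundaries at any length. A minimal sketch, with `sample_window` as a hypothetical stand-in for one conditioned diffusion sampling call:

```python
import torch

def sample_window(context, audio_slice, num_frames):
    """Hypothetical stand-in for one conditioned diffusion sampling call:
    returns `num_frames` new frames continuing from the `context` frames."""
    return torch.randn(num_frames, *context.shape[1:])  # placeholder output

def generate_long_video(first_frame, audio_frames, window=16, overlap=4):
    """Chunked inference: each window reuses the last `overlap` frames of
    the video so far as conditioning, carrying motion forward."""
    video = first_frame.unsqueeze(0)        # (1, C, H, W)
    new_per_window = window - overlap
    pos = 0
    while pos < len(audio_frames):
        context = video[-overlap:]          # tail of what we have so far
        audio_slice = audio_frames[pos:pos + new_per_window]
        new = sample_window(context, audio_slice, len(audio_slice))
        video = torch.cat([video, new])
        pos += new_per_window
    return video                            # (1 + len(audio_frames), C, H, W)

video = generate_long_video(torch.randn(3, 64, 64), torch.randn(100, 128))
print(video.shape)  # torch.Size([101, 3, 64, 64])
```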

Our model performs well across a wide range of styles and scenarios, from realistic humans to mythical creatures, and even inanimate objects with faces. Behind this versatility is a data pipeline our data experts built to curate diverse video styles, filter out low-quality footage, and select clips spanning varying levels of motion and aesthetics, yielding a dataset that takes the model's generative capabilities to the next level. On the audio side, the voice is generated by Character.AI's proprietary TTS voice model.
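As an illustration of the kind of curation described above (the actual scorers and thresholds are not public, so everything below is a hypothetical placeholder), a pipeline like this filters on quality, keeps a spread of motion levels, and balances styles so no single one dominates:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    style: str        # e.g. "photoreal", "2d_anime", "3d_cartoon"
    quality: float    # 0-1, e.g. from a resolution/compression check
    motion: float     # 0-1, e.g. from optical-flow magnitude
    aesthetic: float  # 0-1, e.g. from a learned aesthetic scorer

def curate(clips, min_quality=0.6, motion_range=(0.1, 0.9)):
    """Drop low-quality clips, keep a range of motion levels, then balance
    style buckets, preferring the most aesthetic clips in each bucket."""
    kept = [c for c in clips
            if c.quality >= min_quality
            and motion_range[0] <= c.motion <= motion_range[1]]
    by_style: dict[str, list[Clip]] = {}
    for c in kept:
        by_style.setdefault(c.style, []).append(c)
    if not by_style:
        return []
    cap = min(len(v) for v in by_style.values())  # cap at smallest bucket
    balanced = []
    for style_clips in by_style.values():
        style_clips.sort(key=lambda c: c.aesthetic, reverse=True)
        balanced.extend(style_clips[:cap])
    return balanced

clips = [Clip("a.mp4", "photoreal", 0.9, 0.5, 0.8),
         Clip("b.mp4", "2d_anime", 0.7, 0.4, 0.6),
         Clip("c.mp4", "photoreal", 0.3, 0.5, 0.9)]  # dropped: low quality
print([c.path for c in curate(clips)])  # ['a.mp4', 'b.mp4']
```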

To keep generation cost- and time-efficient, we leverage state-of-the-art distillation techniques to reduce the number of diffusion steps, significantly accelerating inference with practically no quality degradation.
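The blog doesn't name the technique, but one well-known family is progressive step distillation (Salimans & Ho, 2022), where a student learns to match in one step what the teacher produces in two, roughly halving the step count per distillation round. A toy sketch of that training objective, assuming a flow-style velocity field sampled with Euler steps:

```python
import torch

def teacher_two_steps(teacher, x_t, t, dt):
    """Two small Euler steps of the frozen teacher's velocity field."""
    x_mid = x_t + teacher(x_t, t) * dt
    return x_mid + teacher(x_mid, t + dt) * dt

def distill_loss(student, teacher, x_t, t, dt):
    """Student takes one big step of size 2*dt toward the teacher's target."""
    with torch.no_grad():
        target = teacher_two_steps(teacher, x_t, t, dt)
    x_student = x_t + student(x_t, t) * (2 * dt)
    return torch.mean((x_student - target) ** 2)

# Toy models: the teacher is a fixed velocity field, the student is trainable.
teacher = lambda x, t: -x
student_net = torch.nn.Linear(4, 4)
student = lambda x, t: student_net(x)

loss = distill_loss(student, teacher, torch.randn(8, 4), t=0.5, dt=0.1)
loss.backward()  # gradients flow only into the student
```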

How It’s Different

Avatar FX pushes the boundaries of the state of the art in multiple ways:

  • It can generate top-quality video of 2D animated characters, 3D cartoon characters, and non-human faces (like a favorite pet!).
  • It maintains top-notch temporal consistency across face, hand, and body movement.
  • It preserves that temporal consistency even in long-form video.
  • It can generate top-quality video from a pre-existing image, rather than relying on text-to-image generation, giving users maximum control over the video they want to create.

These qualities represent notable technical advances. Going forward, they'll turbocharge Character.AI's products, helping our creators and users tell the next generation of stories with AI.

From Lab to Launch: Scaling the Stack

As we work to bring Avatar FX to all Character.AI users, our mission is clear: make the technology affordable, intuitive, and accessible for everyone. 

Our team of experienced designers, full-stack engineers, and infrastructure specialists is doing just that. From GPU orchestration to caching, queuing, and media delivery, every part of the stack will be designed so that, once this new technology is integrated into the Character.AI platform, creating a high-quality video feels as seamless and easy as clicking “Generate.”
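As a rough illustration of that request flow (all component names below are hypothetical stand-ins, not Character.AI's actual infrastructure), a generation request might be deduplicated against a cache, queued, rendered once by a GPU worker, and served from the cache thereafter:

```python
import hashlib
import queue

job_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for a distributed queue
video_cache: dict[str, bytes] = {}              # stand-in for object storage / CDN

def render(job: dict) -> bytes:
    """Hypothetical model call: returns the encoded video for a job."""
    return b"mp4-bytes"                          # placeholder output

def request_video(image_id: str, audio_id: str) -> str:
    """Enqueue a generation job, keyed by its inputs so repeats hit the cache."""
    key = hashlib.sha256(f"{image_id}:{audio_id}".encode()).hexdigest()
    if key not in video_cache:
        job_queue.put({"key": key, "image": image_id, "audio": audio_id})
    return key

def gpu_worker() -> None:
    """Drain the queue: render each job once and publish it to the cache."""
    while not job_queue.empty():
        job = job_queue.get()
        video_cache[job["key"]] = render(job)
        job_queue.task_done()

key = request_video("avatar_042", "tts_utterance_7")
gpu_worker()
print(len(video_cache[key]))  # the cached video is now ready to deliver
```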