Best ElevenLabs Lip-Sync Model: Aurora vs OmniHuman vs WAN (Full Transcript)

Learn how to lip-sync audio to images in ElevenLabs and compare Createify Aurora, OmniHuman 1.5, and WAN 2.6 by quality, control, cost, and limits.

[00:00:00] Speaker 1: How do you realistically lip-sync any audio with any image and what model should you use?

[00:00:05] Speaker 2: If you go to ElevenLabs, we can click on the image and video tool on the left, and down at the bottom we can switch to video mode. If we click on the AI model, we can scroll down and we've got a few different lip-sync models. So if we clicked Createify Aurora, we could then go and either select an avatar or upload our own, and then we can add speech. Here we can add any of our past generations from within ElevenLabs, we could go and create some new speech, or we can click the upload button and upload our own. So here I'm just uploading a previous generation from ElevenLabs, and then we click generate.
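For anyone who would rather script this workflow than click through the UI, here is a minimal sketch of the same steps (pick a model, supply an image and audio, generate). To be clear, the endpoint path, field names, and model identifier below are assumptions for illustration only; ElevenLabs' real API may expose this differently or not at all, so check the current docs first.

```python
# Hypothetical sketch of the UI workflow: model -> image -> speech -> generate.
# The endpoint path, form fields, and model ID are ASSUMPTIONS, not the
# documented ElevenLabs API. Only the host and the "xi-api-key" header
# match the real ElevenLabs REST API.
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"
BASE_URL = "https://api.elevenlabs.io"

def lip_sync(image_path: str, audio_path: str, model: str = "createify-aurora") -> bytes:
    """Upload an image and an audio clip, return the generated video bytes."""
    with open(image_path, "rb") as img, open(audio_path, "rb") as aud:
        resp = requests.post(
            f"{BASE_URL}/v1/video/lip-sync",        # assumed endpoint
            headers={"xi-api-key": API_KEY},
            files={"image": img, "audio": aud},     # assumed field names
            data={"model_id": model},               # assumed parameter
            timeout=600,                            # generations can take minutes
        )
    resp.raise_for_status()
    return resp.content

if __name__ == "__main__":
    video = lip_sync("avatar.png", "speech.mp3")
    with open("lip_synced.mp4", "wb") as f:
        f.write(video)
```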

[00:00:42] Speaker 3: Hey team, how are you doing today? Any news about the, you know?

[00:00:47] Speaker 2: But the question is: which model should you use? Well, here's the same generation with Createify Aurora, OmniHuman 1.5, and WAN 2.6 side by side.

[00:00:56] Speaker 3: Hey team, how are you doing today?

[00:00:58] Speaker 2: Any news about the... And so as you can see, all three of them produce very different results. If we look at OmniHuman, OmniHuman was the one that took the longest to generate. The movement is a little more present compared to Createify Aurora; however, with OmniHuman we've got a smile all the way through and we see a lot of teeth during the lip-sync. Looking at Createify Aurora, the expression is much more nuanced and matches the dialogue a little better, in my opinion. It also generated faster and it costs fewer credits. Now if we look at WAN 2.6, as you can see, both of the results are pretty crazy: there's a lot of movement in the body and also a lot of camera movement. And so WAN 2.6 gives some really good generations and the quality is very crisp, but it's a lot harder to control. To fix that, I always include "still, continuous shot" in the prompt to get something decent.
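If you find yourself re-prompting WAN 2.6 a lot, it can help to keep the stabilising phrase in a small template so every attempt starts from the same baseline. The "still, continuous shot" wording is the trick from the transcript; everything else in this snippet is an illustrative assumption, not a WAN or ElevenLabs feature.

```python
# Illustrative prompt template for taming WAN 2.6's body/camera motion.
# "still, continuous shot" is the phrase used in the video; the helper
# itself is an assumption you can tweak to your own prompting style.
STABILIZERS = "still, continuous shot"

def build_wan_prompt(action: str, stabilize: bool = True) -> str:
    """Prepend stabilising directives so motion stays controllable."""
    return f"{STABILIZERS}. {action}" if stabilize else action

print(build_wan_prompt("A man speaks directly to camera with subtle hand gestures"))
# -> "still, continuous shot. A man speaks directly to camera with subtle hand gestures"
```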

[00:01:45] Speaker 3: Hey team, how are you doing today?

[00:01:47] Speaker 2: Any news about the... And as you can see there, the lip-sync is actually very good with WAN 2.6, but it takes a few extra prompts to get the result you're looking for. Now, the nice thing about Createify Aurora is that you can have much longer generations compared to the other two models. If I switch to WAN 2.6, as you can see, we've only got the choice between 5, 10, and 15 seconds, but we can have 1080p resolution. If we switch to OmniHuman, the length of your video is dictated by the length of the audio you upload, up to a maximum of 30 seconds. The issue with Createify Aurora is that, for now, you have a maximum resolution of 720p, so you do need to go ahead and upscale it afterwards with Topaz. Another constraint is that Topaz only accepts 30-second clips at a time, so if you generated something longer than 30 seconds you would have to split the video up to be able to feed it through the Topaz upscaler in ElevenLabs.
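Because of that 30-second Topaz limit, a longer Aurora generation has to be cut into chunks before upscaling. If you'd rather do the splitting locally, ffmpeg's segment muxer handles it in one command; this sketch assumes ffmpeg is installed and on your PATH.

```python
# Split a long clip into roughly 30-second segments for upscaling, using
# ffmpeg's segment muxer with stream copy (no re-encode). Because -c copy
# can only cut on keyframes, segments may run slightly over the target.
import subprocess

def split_for_upscale(src: str, seconds: int = 30, prefix: str = "part") -> None:
    subprocess.run(
        [
            "ffmpeg", "-i", src,
            "-c", "copy",
            "-f", "segment",
            "-segment_time", str(seconds),
            "-reset_timestamps", "1",
            f"{prefix}_%03d.mp4",
        ],
        check=True,
    )

split_for_upscale("aurora_full.mp4")  # -> part_000.mp4, part_001.mp4, ...
```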

[00:02:43] Speaker 4: Hey chat, what do you think? Do I look real or do you think it sucks?

[00:02:47] Speaker 2: And as you can see, again three great results with good lip-syncing. The hand movement in Createify Aurora is a lot more contextual. In WAN 2.6 the video is again very crisp, but the lips don't quite match the audio that we uploaded. If we look at OmniHuman, we've got some nice movement and decent facial expression, but I had to prompt it multiple times to get the result I was looking for. And so for this video I preferred the output of Createify Aurora, but then I did have to go and upscale it with Topaz afterwards to get a crisper-looking video. For long-form, more lo-fi content, Createify Aurora might be the best. If you're doing shorter-form videos, you might want to try OmniHuman, or even go with WAN 2.6 for crisper outputs at the cost of a harder-to-control video. But on the flip side, WAN 2.6 does give you some incredible creative freedom, allowing you to direct all of the actions and camera angles within your lip-synced video, just like this.
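The recommendation above boils down to a few trade-offs: clip length, resolution, and how much directing control you need. Purely as a summary, here is that rule of thumb expressed as code; the thresholds come from the limits quoted in the transcript, and the returned strings are just labels, not an API.

```python
# The speaker's rule of thumb as a tiny decision function. Thresholds are
# the limits mentioned in the video (WAN: 5/10/15 s at 1080p; OmniHuman:
# audio-driven up to ~30 s; Aurora: longest clips but 720p).
def pick_model(duration_s: float, need_crisp: bool, want_camera_direction: bool) -> str:
    if want_camera_direction and duration_s <= 15:
        return "WAN 2.6"            # crispest, most directable, hardest to control
    if need_crisp and duration_s <= 30:
        return "OmniHuman 1.5"      # length follows the uploaded audio
    return "Createify Aurora"       # longest generations; upscale the 720p output

print(pick_model(duration_s=90, need_crisp=False, want_camera_direction=False))
# -> "Createify Aurora"
```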

[00:03:42] Speaker 3: Hey team, how are you doing today?

[00:03:44] Speaker 3: Any news about the...

[00:03:45] Speaker 2: And so as you can see there, we've directed the exact movement that we want, as well as lip-syncing the audio to the image. But that is how to lip-sync your audio inside of ElevenLabs. If you have any questions, let us know in the comment section down below, and if you enjoyed this video, hit that like button and don't forget to subscribe. Thanks for watching.

AI Insights
Summary
The transcript explains how to lip-sync any audio to an image using ElevenLabs’ Image/Video tool, comparing three lip-sync models—Createify Aurora, OmniHuman 1.5, and WAN 2.6—across realism, control, generation time, cost, clip length, and resolution. Createify Aurora is presented as faster, cheaper, more nuanced in expression, and capable of longer generations but limited to 720p (often requiring upscaling via Topaz, which handles 30-second clips). OmniHuman is slower, can look overly smiley/teeth-forward, supports up to ~30 seconds dictated by audio length, and may require multiple prompts. WAN 2.6 can be very crisp (up to 1080p) with strong body/camera motion and creative freedom, but is harder to control and may need prompt guidance like “still continuous shot,” with clip length options limited (e.g., 5/10/15s). The speaker recommends Createify for longer, lo-fi content; OmniHuman or WAN 2.6 for shorter, crisper outputs depending on control needs.
Title
How to Lip-Sync Audio to Images in ElevenLabs (Model Comparison)
Keywords
ElevenLabs
lip-sync
image to video
Createify Aurora
OmniHuman 1.5
WAN 2.6
video mode
avatar upload
audio upload
prompting
still continuous shot
1080p
720p
Topaz upscaling
generation time
credits cost
control vs creativity
clip length limits
Key Takeaways
  • Use ElevenLabs Image/Video tool in Video mode to lip-sync by choosing a model, selecting/uploading an avatar, and adding/uploading audio before generating.
  • Createify Aurora: fastest/cheapest with more nuanced expressions and longer generations, but currently limited to 720p—often needs Topaz upscaling.
  • OmniHuman 1.5: more movement but slower; can appear overly smiley with visible teeth; video length follows audio up to ~30 seconds; may require multiple prompts.
  • WAN 2.6: very crisp and dynamic with strong body/camera motion and creative freedom; harder to control; may need prompts like “still continuous shot”; limited to short clip durations (5/10/15s) but supports 1080p.
  • Topaz upscaling in ElevenLabs has a 30-second clip limit, so longer videos must be split before upscaling.
  • Choose model based on needs: Createify for longer, lo-fi content; OmniHuman or WAN 2.6 for shorter, higher-crispness outputs (with WAN offering most creative direction but least control).
Sentiments
Neutral: The tone is instructional and comparative, focusing on practical workflow steps and trade-offs (speed, cost, control, resolution) without strong positive or negative emotional language.