Lip Sync using Wan 2.2 (S2V)

Video Lecture

Section: Wan 2.2 S2V Lip Sync

Description

We will use the GGUF-quantised Wan 2.2 models. Place the files in the following folders:

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 clip/
│   │   └── umt5-xxl-encoder-Q8_0.gguf
│   ├── 📂 loras/
│   │   └── Wan2.2-I2V-A14B-lora-high_noise.safetensors
│   ├── 📂 audio_encoders/
│   │   └── wav2vec2_large_english_fp16.safetensors
│   ├── 📂 unet/
│   │   └── Wan2.2-S2V-14B-Q8_0.gguf
│   └── 📂 vae/
│       └── wan_2.1_vae.safetensors
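
The layout above can be checked with a short Python sketch before loading a workflow. It is only an illustration: the function name `missing_model_files` is hypothetical, the ComfyUI root folder is passed in by you, and the VAE file name follows the download URL used in this guide.

```python
from pathlib import Path

# Paths relative to the ComfyUI root, taken from the folder tree in this guide.
EXPECTED_FILES = [
    "models/clip/umt5-xxl-encoder-Q8_0.gguf",
    "models/loras/Wan2.2-I2V-A14B-lora-high_noise.safetensors",
    "models/audio_encoders/wav2vec2_large_english_fp16.safetensors",
    "models/unet/Wan2.2-S2V-14B-Q8_0.gguf",
    # File name as it appears in the download URL for the VAE.
    "models/vae/wan_2.1_vae.safetensors",
]


def missing_model_files(comfy_root, expected=EXPECTED_FILES):
    """Return the expected files that are absent or empty under comfy_root."""
    root = Path(comfy_root)
    missing = []
    for rel in expected:
        path = root / rel
        # A zero-byte file usually means a download never started properly.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing
```

Calling `missing_model_files("ComfyUI")` returns an empty list when every file is in place. It only checks that each file exists and is non-empty, not that a download completed.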

Sample Workflows

For the S2V workflow, use the WanSoundImageToVideo and WanSoundImageToVideoExtend nodes.

Download the Example Audio (woman) and save it into your ComfyUI/input/ folder.

Initial Image Input Video Workflow
S2V Initial Image S2V Start Workflow
S2V Controlnet Initial Image S2V ControlNet Workflow
S2V Controlnet Initial Image Ext S2V ControlNet Workflow
i2v-1 initial image S2V Ref Video Workflow

WGET Commands

If you are using Runpod or a similar hosted GPU service, you can open a terminal on your running pod/instance and download the files with wget.

#
#
# CD into ./ComfyUI/models/clip/ folder
wget -c https://huggingface.co/city96/umt5-xxl-encoder-gguf/resolve/main/umt5-xxl-encoder-Q8_0.gguf
#
#
# CD into ./ComfyUI/models/loras/ folder
wget https://huggingface.co/lightx2v/Wan2.2-Lightning/resolve/main/Wan2.2-I2V-A14B-4steps-lora-rank64-Seko-V1/high_noise_model.safetensors -O Wan2.2-I2V-A14B-lora-high_noise.safetensors
#
#
# CD into ./ComfyUI/models/audio_encoders/ folder # Create if you can't find this folder
wget -c https://huggingface.co/Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/split_files/audio_encoders/wav2vec2_large_english_fp16.safetensors
#
#
# CD into ./ComfyUI/models/unet/ folder
wget -c https://huggingface.co/QuantStack/Wan2.2-S2V-14B-GGUF/resolve/main/Wan2.2-S2V-14B-Q8_0.gguf
#
#
# CD into ./ComfyUI/models/vae/ folder
wget -c https://huggingface.co/Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/split_files/vae/wan_2.1_vae.safetensors

Wait for files to download fully before running your workflows.
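
The commands above can also be driven from a small Python script, so there is no need to change folders by hand. This is a sketch under assumptions, not part of the original instructions: the names `DOWNLOADS`, `wget_command` and `download_all` are hypothetical, `wget` must be on your PATH, and `-c` is applied to every file (including the LoRA, which the original command downloads without it).

```python
import subprocess
from pathlib import Path

HF = "https://huggingface.co/"

# (folder under ComfyUI/models, URL, local file name or None to keep the URL's name)
DOWNLOADS = [
    ("clip", HF + "city96/umt5-xxl-encoder-gguf/resolve/main/umt5-xxl-encoder-Q8_0.gguf", None),
    ("loras", HF + "lightx2v/Wan2.2-Lightning/resolve/main/"
     "Wan2.2-I2V-A14B-4steps-lora-rank64-Seko-V1/high_noise_model.safetensors",
     "Wan2.2-I2V-A14B-lora-high_noise.safetensors"),
    ("audio_encoders", HF + "Comfy-Org/Wan_2.2_ComfyUI_Repackaged/resolve/main/"
     "split_files/audio_encoders/wav2vec2_large_english_fp16.safetensors", None),
    ("unet", HF + "QuantStack/Wan2.2-S2V-14B-GGUF/resolve/main/Wan2.2-S2V-14B-Q8_0.gguf", None),
    ("vae", HF + "Comfy-Org/Wan_2.1_ComfyUI_repackaged/resolve/main/"
     "split_files/vae/wan_2.1_vae.safetensors", None),
]


def wget_command(models_dir, folder, url, name=None):
    """Build the wget argument list that saves url into models_dir/folder."""
    filename = name or url.rsplit("/", 1)[-1]
    dest = Path(models_dir) / folder / filename
    # -c resumes a partial download; -O sets the output path explicitly.
    return ["wget", "-c", url, "-O", str(dest)]


def download_all(models_dir="ComfyUI/models"):
    """Create each target folder if needed, then download the files one by one."""
    for folder, url, name in DOWNLOADS:
        (Path(models_dir) / folder).mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if wget exits with an error,
        # so a failed download stops the script instead of passing silently.
        subprocess.run(wget_command(models_dir, folder, url, name), check=True)
```

Run `download_all()` from the folder that contains ComfyUI, or pass the full path to your models folder. Because `subprocess.run` blocks until wget exits, the function returns only after every download has finished or one has failed.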

Troubleshooting

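Error: Input type (float) and bias type (c10::Half) should be the same

This is a dtype mismatch: the audio tensor reaches the encoder as float32, while the encoder's parameters are in half precision (fp16).

One workaround is to patch the ComfyUI audio encoder script directly so that the input is cast to the model's dtype. Open: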

ComfyUI/comfy/audio_encoders/audio_encoders.py

Find:

out, all_layers = self.model(audio.to(self.load_device))

Change to:

self.model = self.model.to(self.load_device)

# Match input dtype to model dtype
audio = audio.to(self.load_device, dtype=next(self.model.parameters()).dtype)

out, all_layers = self.model(audio)

Warning

Indent the new lines with spaces, not tabs, and keep them at the same indentation level as the line you replaced, as shown in the diff below.

-        out, all_layers = self.model(audio.to(self.load_device))
+        self.model = self.model.to(self.load_device)
+
+        # Match input dtype to model dtype
+        audio = audio.to(self.load_device, dtype=next(self.model.parameters()).dtype)
+
+        out, all_layers = self.model(audio)
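
If you prefer not to edit the file by hand, the same change can be applied with a short Python sketch. It is illustrative only: the function names `patch_source` and `patch_file` are hypothetical, and it assumes the target line appears in your copy of ComfyUI exactly as shown above. It reuses the indentation of the line it replaces, so the new lines stay aligned with the surrounding code.

```python
from pathlib import Path

# The line to replace, and its replacement, as given in this guide.
OLD_LINE = "out, all_layers = self.model(audio.to(self.load_device))"
NEW_LINES = [
    "self.model = self.model.to(self.load_device)",
    "",
    "# Match input dtype to model dtype",
    "audio = audio.to(self.load_device, dtype=next(self.model.parameters()).dtype)",
    "",
    "out, all_layers = self.model(audio)",
]


def patch_source(source):
    """Return source with OLD_LINE replaced by NEW_LINES at the same indentation."""
    patched = []
    replaced = False
    for line in source.splitlines():
        if line.strip() == OLD_LINE:
            # Copy the leading whitespace of the original line.
            indent = line[: len(line) - len(line.lstrip())]
            patched.extend(indent + new if new else "" for new in NEW_LINES)
            replaced = True
        else:
            patched.append(line)
    if not replaced:
        raise ValueError("target line not found; the file may already be patched")
    return "\n".join(patched) + "\n"


def patch_file(path):
    """Patch the file in place, keeping the original next to it with a .bak suffix."""
    path = Path(path)
    original = path.read_text(encoding="utf-8")
    patched = patch_source(original)
    path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
    path.write_text(patched, encoding="utf-8")
```

For example, `patch_file("ComfyUI/comfy/audio_encoders/audio_encoders.py")` applies the change once. A second call raises `ValueError` because the original line is no longer present.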