Stable Video Diffusion
Video Lecture
| Section | Video Links |
|---|---|
| Stable Video Diffusion | |
(Pay Per View)
Video Timings
00:00 Introduces Stable Video Diffusion (SVD) for converting still images to video
00:15 Explains downloading SVD models from Civitai
00:40 Demonstrates setting up the img2vid checkpoint loader in Comfy UI
01:30 Details workflow configuration including VAE decode and WebP output
02:00 Notes optimal SVD output is 1024x576 at 14 frames, 6 frames/sec
02:50 Shows examples; SVD is effective at separating foreground and background
04:00 Explains using "motion bucket ID" (127 often optimal) to control movement
04:45 Discusses augmentation level for removing noise, potentially affecting quality
05:30 Demonstrates cropping images to guide SVD generation focus
07:00 Highlights model naming differences between Civitai and Hugging Face downloads
Description
Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.
| Platform | Links |
|---|---|
| Civitai | img2vid | img2vid-xt | img2vid-xt-1.1 |
| HuggingFace | img2vid | img2vid-xt | img2vid-xt-1.1 |
Tip
Civitai and Hugging Face use different file names for the img2vid models. When selecting a checkpoint in ComfyUI, remember which file name each model was saved under, depending on whether you downloaded it from Civitai or Hugging Face.
The base img2vid model was trained to generate 14 frames at 1024x576.
img2vid-xt was trained to generate 25 frames at 1024x576.
img2vid-xt-1.1 is a more finely tuned version of img2vid-xt.
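The trained frame counts and resolutions above can be captured in a small lookup table. This is a minimal sketch; the helper name `svd_variant_specs` is hypothetical and not part of ComfyUI or the model files:

```python
# Trained output specs for each SVD image-to-video variant, as listed above.
# Note: the dictionary and helper name are illustrative, not an official API.
SVD_VARIANT_SPECS = {
    "img2vid":        {"num_frames": 14, "width": 1024, "height": 576},
    "img2vid-xt":     {"num_frames": 25, "width": 1024, "height": 576},
    "img2vid-xt-1.1": {"num_frames": 25, "width": 1024, "height": 576},
}

def svd_variant_specs(variant: str) -> dict:
    """Return the frame count and resolution the variant was trained on."""
    try:
        return SVD_VARIANT_SPECS[variant]
    except KeyError:
        raise ValueError(f"unknown SVD variant: {variant!r}")
```

Matching these trained settings (for example, requesting 14 frames from img2vid rather than 25) is what keeps the output stable.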
SVD_img2vid_Conditioning
The SVD_img2vid_Conditioning node controls the motion behavior during image-to-video generation.
For best quality, the width and height should be 1024x576. You can also get good results using 576x1024 for portrait output.
The Frames should be 14 when using img2vid, and 25 when using img2vid-xt or img2vid-xt-1.1.
The Motion Bucket ID defaults to 127, which normally produces good results. Valid values range from 0 to 255 and refer to a preselected set of discrete "motion buckets" the model was trained on. The value controls the intensity and complexity of motion in the generated video: lower numbers make the movement appear more static, while higher numbers make it more dramatic. Values above 127 tend to produce less stable results.
The Augmentation Level controls how much Gaussian noise is injected into the conditioning image, and can affect camera shifting, cropping, colours, contrast, and texture distortion in the result. Higher values tend to filter out noise along with fine detail, so features such as skin texture can appear smoother.
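The parameters above can be sketched as a small validation helper. This is illustrative only (the `svd_conditioning` name is hypothetical; the actual conditioning happens inside ComfyUI's SVD_img2vid_Conditioning node), but it mirrors the node fields and ranges described above:

```python
def svd_conditioning(width=1024, height=576, num_frames=14,
                     motion_bucket_id=127, augmentation_level=0.0):
    """Validate and bundle SVD_img2vid_Conditioning inputs.

    Hypothetical helper; defaults follow the values discussed above:
    1024x576 output, 14 frames (img2vid), motion bucket 127.
    """
    # Motion Bucket ID selects one of the discrete motion buckets in [0, 255].
    if not 0 <= motion_bucket_id <= 255:
        raise ValueError("motion_bucket_id must be in [0, 255]")
    # Values above the 127 default tend to produce less stable motion.
    unstable_motion = motion_bucket_id > 127
    return {
        "width": width,
        "height": height,
        "num_frames": num_frames,
        "motion_bucket_id": motion_bucket_id,
        "augmentation_level": augmentation_level,
        "unstable_motion": unstable_motion,
    }
```

Calling `svd_conditioning(num_frames=25, motion_bucket_id=180)` would flag `unstable_motion`, matching the observation that bucket IDs above 127 trade stability for more dramatic movement.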
Sample Input Images
"a car on a dusty road", SD1.5, 512x512

"coral reef", 515-inpainting, 768x768

"A traditional english village, tilt-shift photography", Flux Schnell, 1024x1024

"a person with freckels reading a newspaper", Flux Schnell, 1024x1024

"photo of a person wearing a high tech scifi armor", SD3.5, 1024x1024

"a fighter pilots view of flying at street level between city buildings in the sunset", Flux Schnell, 1024x1024

"photograph beautiful scenery nature mountains alps river rapids snow sky cumulus clouds", Flux Schnell, 1024x1024