Stable Video Diffusion
Video Lecture
Section | Video Links |
---|---|
Stable Video Diffusion | ![]() |
Video Timings
00:00 Introduces Stable Video Diffusion (SVD) for converting still images to video
00:15 Explains downloading SVD models from Civitai
00:40 Demonstrates setting up the img2vid checkpoint loader in ComfyUI
01:30 Details workflow configuration including VAE decode and WebP output
02:00 Notes optimal SVD output is 1024x576 at 14 frames, 6 frames/sec
02:50 Shows examples; SVD is effective at separating foreground and background
04:00 Explains using "motion bucket ID" (127 often optimal) to control movement
04:45 Discusses augmentation level for removing noise, potentially affecting quality
05:30 Demonstrates cropping images to guide SVD generation focus
07:00 Highlights model naming differences between Civitai and Hugging Face downloads
Description
Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.
Downloads : img2vid | img2vid-xt | img2vid-xt-1.1
The base img2vid model was trained to generate 14 frames at 1024x576.
img2vid-xt was trained to generate 25 frames at 1024x576.
img2vid-xt-1.1 is a more finely tuned version of img2vid-xt.
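As a quick reference, the variants above can be written down as a small lookup table. This is only a sketch: the names `SVD_VARIANTS` and `clip_seconds` are made up for illustration, and the default of 6 frames per second is taken from the video timings rather than from any model documentation.

```python
# Frame counts and resolution as stated in the text above.
# SVD_VARIANTS and clip_seconds are illustrative names, not part of any library.
SVD_VARIANTS = {
    "img2vid": {"frames": 14, "width": 1024, "height": 576},
    "img2vid-xt": {"frames": 25, "width": 1024, "height": 576},
    "img2vid-xt-1.1": {"frames": 25, "width": 1024, "height": 576},
}


def clip_seconds(variant: str, fps: int = 6) -> float:
    """Playback length in seconds for a variant's trained frame count."""
    return SVD_VARIANTS[variant]["frames"] / fps


if __name__ == "__main__":
    for name in SVD_VARIANTS:
        print(f"{name}: {clip_seconds(name):.2f} s at 6 fps")
```

At 6 frames per second this works out to roughly 2.3 seconds for img2vid and roughly 4.2 seconds for the two xt variants.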
SVD_img2vid_Conditioning
The SVD_img2vid_Conditioning node controls the motion behavior during image-to-video generation.
For best quality, set the width and height to 1024x576. You can also get results using 576x1024.
The Frames value should be 14 when using img2vid, and 25 when using img2vid-xt or img2vid-xt-1.1.
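The size and frame recommendations above can be turned into a small sanity check before queuing a generation. The sketch below is a hypothetical helper (`check_settings` is not a ComfyUI function); it only encodes the values given in this section and returns warnings rather than blocking anything.

```python
from typing import List

# Values taken from the recommendations in this section.
EXPECTED_FRAMES = {"img2vid": 14, "img2vid-xt": 25, "img2vid-xt-1.1": 25}
RECOMMENDED_SIZES = {(1024, 576), (576, 1024)}


def check_settings(model: str, width: int, height: int, frames: int) -> List[str]:
    """Return a list of warnings; an empty list means the settings match the text."""
    warnings = []
    if (width, height) not in RECOMMENDED_SIZES:
        warnings.append(
            f"{width}x{height} is not 1024x576 or 576x1024; quality may suffer"
        )
    expected = EXPECTED_FRAMES.get(model)
    if expected is None:
        warnings.append(f"unknown model name: {model!r}")
    elif frames != expected:
        warnings.append(f"{model} expects {expected} frames, got {frames}")
    return warnings


if __name__ == "__main__":
    print(check_settings("img2vid-xt", 1024, 576, 14))
```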
The Motion Bucket ID defaults to 127, which normally produces adequate results. You can set it to any value between 0 and 255. The value refers to a preselected set of discrete "motion buckets" that the model was trained on, and it controls the intensity and complexity of motion in the generated video. Lower numbers make the movement appear more static, whereas higher numbers make it more dramatic. Numbers higher than 127, however, tend to produce more unstable results.
The Augmentation Level is another factor that can affect the final video: camera shifting, cropping, colours, contrast, Gaussian noise injection, and texture distortion. Higher numbers tend to filter out more noise, so details such as skin texture can appear smoother.
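For readers who drive ComfyUI from a script, the settings described in this section might be collected into a node entry along the following lines. Treat this as a sketch under assumptions: the input names (`video_frames`, `motion_bucket_id`, `augmentation_level` and so on), the `[node_id, output_index]` link format, and the loader output order reflect my understanding of ComfyUI's API-style workflow JSON and should be checked against your installed version. The node ids `"1"` and `"2"` are placeholders, and the 0 to 255 range check follows the text above rather than ComfyUI itself.

```python
import json


def svd_conditioning_node(
    motion_bucket_id: int = 127,
    augmentation_level: float = 0.0,
    frames: int = 14,
    fps: int = 6,
    width: int = 1024,
    height: int = 576,
    loader_id: str = "1",  # placeholder id of the img2vid checkpoint loader node
    image_id: str = "2",   # placeholder id of the image loader node
) -> dict:
    """Build one workflow entry for the conditioning node (assumed field names)."""
    # Range taken from the description in this section.
    if not 0 <= motion_bucket_id <= 255:
        raise ValueError("motion_bucket_id should be between 0 and 255")
    return {
        "class_type": "SVD_img2vid_Conditioning",
        "inputs": {
            # Assumed loader output order: 0 = model, 1 = CLIP vision, 2 = VAE.
            "clip_vision": [loader_id, 1],
            "init_image": [image_id, 0],
            "vae": [loader_id, 2],
            "width": width,
            "height": height,
            "video_frames": frames,
            "motion_bucket_id": motion_bucket_id,
            "fps": fps,
            "augmentation_level": augmentation_level,
        },
    }


if __name__ == "__main__":
    # A calmer clip for img2vid-xt: lower motion bucket, 25 frames.
    print(json.dumps(svd_conditioning_node(motion_bucket_id=60, frames=25), indent=2))
```

Keeping the motion bucket and augmentation level as explicit arguments makes it easy to render the same image several times and compare how each value changes the result.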