
Stable Video Diffusion

Video Lecture

Video Timings

00:00 Introduces Stable Video Diffusion (SVD) for converting still images to video
00:15 Explains downloading SVD models from Civitai
00:40 Demonstrates setting up the img2vid checkpoint loader in ComfyUI
01:30 Details workflow configuration including VAE decode and WebP output
02:00 Notes optimal SVD output is 1024x576 at 14 frames, 6 frames/sec
02:50 Shows examples; SVD is effective at separating foreground and background
04:00 Explains using "motion bucket ID" (127 often optimal) to control movement
04:45 Discusses augmentation level for removing noise, potentially affecting quality
05:30 Demonstrates cropping images to guide SVD generation focus
07:00 Highlights model naming differences between Civitai and Hugging Face downloads

Description

Stable Video Diffusion (SVD) Image-to-Video is a diffusion model that takes in a still image as a conditioning frame, and generates a video from it.

Downloads: img2vid | img2vid-xt | img2vid-xt-1.1

The base img2vid model was trained to generate 14 frames at 1024x576.

img2vid-xt was trained to generate 25 frames at 1024x576.

img2vid-xt-1.1 is a more finely tuned version of img2vid-xt.

SVD_img2vid_Conditioning

The SVD_img2vid_Conditioning node controls the motion behavior during image-to-video generation.

For best quality, the width and height should be 1024x576. You can also get good results using 576x1024.
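Most of the sample inputs further down are square, so they need cropping to reach 1024x576 or 576x1024. The following is a minimal sketch of one way to work out a centred crop box for a given target aspect ratio. The function name `center_crop_box` is hypothetical and not part of ComfyUI; shifting the box off-centre is one way to steer which part of the image SVD focuses on.

```python
def center_crop_box(src_w, src_h, target_w=1024, target_h=576):
    """Return (left, top, right, bottom) for the largest centred crop
    of a src_w x src_h image that matches the target aspect ratio.

    Hypothetical helper for illustration; not a ComfyUI function.
    """
    # Compare aspect ratios by cross-multiplying to stay in integers.
    if src_w * target_h > src_h * target_w:
        # Source is wider than the target: keep full height, trim the sides.
        crop_h = src_h
        crop_w = src_h * target_w // target_h
    else:
        # Source is taller than (or equal to) the target: keep full width,
        # trim the top and bottom.
        crop_w = src_w
        crop_h = src_w * target_h // target_w

    left = (src_w - crop_w) // 2
    top = (src_h - crop_h) // 2
    return (left, top, left + crop_w, top + crop_h)


# A 1024x1024 input cropped for landscape (1024x576) output.
landscape_box = center_crop_box(1024, 1024)

# The same input cropped for portrait (576x1024) output.
portrait_box = center_crop_box(1024, 1024, target_w=576, target_h=1024)
```

The resulting box uses the (left, top, right, bottom) convention, so it can be passed to most image libraries' crop functions before resizing to the final resolution.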

Set Frames to 14 when using img2vid, and to 25 when using img2vid-xt or img2vid-xt-1.1, matching the frame counts the models were trained on.
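The resolution and frame guidance above can be collected into a small lookup. This is only a sketch: `SVD_VARIANTS` and `svd_settings` are hypothetical names, the frame counts come from the text, and the default of 6 frames per second is the figure mentioned in the video timings.

```python
# Frame counts per model variant, as stated in the text above.
# The dictionary keys are illustrative labels, not checkpoint filenames.
SVD_VARIANTS = {
    "img2vid": 14,
    "img2vid-xt": 25,
    "img2vid-xt-1.1": 25,
}


def svd_settings(variant, portrait=False, fps=6):
    """Return suggested width, height, frame count and clip length.

    Hypothetical helper for illustration; not a ComfyUI function.
    """
    if variant not in SVD_VARIANTS:
        raise ValueError(f"unknown SVD variant: {variant!r}")

    frames = SVD_VARIANTS[variant]
    # 1024x576 is the recommended size; 576x1024 is the portrait alternative.
    width, height = (576, 1024) if portrait else (1024, 576)

    return {
        "width": width,
        "height": height,
        "frames": frames,
        "fps": fps,
        # Clip length in seconds is simply frames divided by frame rate.
        "duration_s": round(frames / fps, 2),
    }


base = svd_settings("img2vid")
xt_portrait = svd_settings("img2vid-xt", portrait=True)
```

At 6 frames per second the base model gives a clip of a little over two seconds, and the xt variants a little over four.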

The Motion Bucket ID defaults to 127, which normally produces adequate results. You can set it to any value from 0 to 255. The value refers to a preselected set of discrete "motion buckets" that the model was trained on, and it controls the intensity and complexity of motion in the generated video. Lower numbers make the movement appear more static, whereas higher numbers make it more dramatic. Numbers higher than 127, however, tend to produce more unstable results.
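When sweeping over several Motion Bucket ID values, it can help to clamp them and note what to expect from each. The sketch below assumes the 0 to 255 range quoted in the text (the node in your installation may accept a different range), and both function names are hypothetical.

```python
DEFAULT_MOTION_BUCKET_ID = 127  # default value quoted in the text


def clamp_motion_bucket(value, low=0, high=255):
    """Clamp a requested Motion Bucket ID into the range from the text.

    The 0..255 bounds are an assumption taken from this page.
    """
    return max(low, min(high, int(value)))


def describe_motion_bucket(value):
    """Return (clamped_id, note) summarising the expected motion."""
    bucket = clamp_motion_bucket(value)
    if bucket < DEFAULT_MOTION_BUCKET_ID:
        note = "calmer, more static movement"
    elif bucket == DEFAULT_MOTION_BUCKET_ID:
        note = "default, normally adequate"
    else:
        note = "more dramatic movement, may be unstable"
    return bucket, note


# Example sweep: one low, the default, and one out-of-range high value.
sweep = [describe_motion_bucket(v) for v in (40, 127, 300)]
```

Keeping the seed and every other setting fixed while varying only this value makes the effect on motion easier to compare between runs.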

The Augmentation Level is another factor that can affect the final result, including camera shifting, cropping, colours, contrast, Gaussian noise injection, and texture distortion. Higher numbers tend to filter out more noise, so fine details such as skin texture can appear smoother.
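To see all of these settings in one place, the node can be written out as an entry in a ComfyUI API-format workflow. This is a sketch under assumptions: the input names (`video_frames`, `motion_bucket_id`, `augmentation_level`, and so on) and the output slot numbers reflect one reading of ComfyUI's node definitions and should be checked against a workflow exported from your own installation. The node IDs `"1"` and `"2"` and the function name are hypothetical.

```python
def build_svd_conditioning_node(
    loader_id,
    image_id,
    width=1024,
    height=576,
    video_frames=14,
    motion_bucket_id=127,
    fps=6,
    augmentation_level=0.0,
):
    """Build one API-format node entry for SVD_img2vid_Conditioning.

    loader_id: ID of the img2vid checkpoint loader node (assumed to expose
               MODEL, CLIP_VISION, VAE on output slots 0, 1, 2).
    image_id:  ID of the node supplying the still image on output slot 0.
    """
    return {
        "class_type": "SVD_img2vid_Conditioning",
        "inputs": {
            # Links are [source_node_id, output_slot] pairs.
            "clip_vision": [loader_id, 1],
            "init_image": [image_id, 0],
            "vae": [loader_id, 2],
            # Plain values configured on the node itself.
            "width": width,
            "height": height,
            "video_frames": video_frames,
            "motion_bucket_id": motion_bucket_id,
            "fps": fps,
            "augmentation_level": augmentation_level,
        },
    }


# Example: settings for img2vid-xt with a small augmentation level.
node = build_svd_conditioning_node(
    "1", "2", video_frames=25, augmentation_level=0.05
)
```

Changing one argument at a time, for example only `augmentation_level`, keeps comparisons between renders straightforward.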

Sample Input Images

"a car on a dusty road", SD1.5, 512x512

a car on a dusty road

"coral reef", 515-inpainting, 768x768

fish coral reef

"A traditional english village, tilt-shift photography", Flux Schnell, 1024x1024

English Village Tilt Focus

"a person with freckels reading a newspaper", Flux Schnell, 1024x1024

Girl with freckles reading newspaper

"photo of a person wearing a high tech scifi armor", SD3.5, 1024x1024

Cyborg

"a fighter pilots view of flying at street level between city buildings in the sunset", Flux Schnell, 1024x1024

Jet Flying Thru Street

"photograph beautiful scenery nature mountains alps river rapids snow sky cumulus clouds", Flux Schnell, 1024x1024

Rapids