Voice Clone

Custom TTS node that clones voice from a reference audio and speaks entered text.

Install Voice Clone Custom Node

Install the ComfyUI Voice Clone custom node using the manager.

Alternatively, install it from your command prompt or terminal.

  1. Navigate to your ComfyUI/custom_nodes folder.
  2. Run:
    git clone https://github.com/Sean-Bradley/ComfyUI-Voice-Clone.git
    
  3. Navigate to your ComfyUI_windows_portable folder.
  4. Run:
    python_embeded\python -m pip install -r ComfyUI/custom_nodes/ComfyUI-Voice-Clone/requirements.txt
    
  5. Restart ComfyUI.

Install Models

All required models can be downloaded from https://huggingface.co/Sean-Bradley/ComfyUI/tree/main/models/tts/chatterbox

Ensure that your folder structure and downloaded files match the layout below.

📂 ComfyUI/
├── 📂 models/
│   ├── 📂 tts/
│   │   └── 📂 chatterbox/
│   │       ├── conds.pt
│   │       ├── s3gen.safetensors
│   │       ├── t3_cfg.safetensors
│   │       ├── tokenizer.json
│   │       └── ve.safetensors
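
If the node fails to load, a quick check that all five files are in place can help rule out a download problem. The following is a minimal sketch, not part of the custom node: the function name `missing_model_files` is hypothetical, and the `"ComfyUI"` path is a placeholder for wherever your own ComfyUI folder lives. The file names are taken from the folder listing above.

```python
from pathlib import Path

# File names taken from the folder listing above.
REQUIRED_FILES = (
    "conds.pt",
    "s3gen.safetensors",
    "t3_cfg.safetensors",
    "tokenizer.json",
    "ve.safetensors",
)


def missing_model_files(comfyui_root):
    """Return the required chatterbox files that are absent under comfyui_root."""
    model_dir = Path(comfyui_root) / "models" / "tts" / "chatterbox"
    return [name for name in REQUIRED_FILES if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is a placeholder; replace it with the path to your own ComfyUI folder.
    missing = missing_model_files("ComfyUI")
    if missing:
        print("Missing model files:", ", ".join(missing))
    else:
        print("All chatterbox model files found.")
```

An empty list means every expected file was found. The check only looks at file names, so it will not detect a partially downloaded or corrupted file.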

Sample Workflows

Voice Clone

Drag this link into ComfyUI to see the workflow.

Voice Clone

Voice Replace

Drag this link into ComfyUI to see the workflow.

Voice Replace

Sample Audios

Download (Right Click, Save Audio As...) Description
Audio snippets assembled from So Much for So Little animated cartoon. Copyright © 1949 Warner Bros. Cartoons
Audio snippets assembled from Puss n' Booty animated cartoon. Copyright © 1943 Warner Bros. Cartoons
Audio snippets assembled from Scrap Happy Daffy animated cartoon. Copyright © 1943 Warner Bros. Cartoons
Audio snippets assembled from Night of the Living Dead (1968). Copyright © 1968 Image Ten, Inc
Audio snippets assembled from Night of the Living Dead (1968). Copyright © 1968 Image Ten, Inc
Audio snippets assembled from Psycho (1960). Copyright © 1960 Shamley Productions, Inc.
Audio snippets assembled from Psycho (1960). Copyright © 1960 Shamley Productions, Inc.

Settings

| Setting | Description |
| --- | --- |
| exaggeration | Controls the expressiveness / prosody of the generated voice. Higher values make the speech more emphatic and varied; lower values produce a flatter, more neutral delivery. Valid range: 0.25 - 2.0. |
| temperature | Sampling temperature for the text-to-speech decoder. Higher values increase randomness and variety in the generated audio; lower values make outputs more conservative and deterministic. Valid range: 0.15 - 2.0. |
| cfg_weight | Classifier-free guidance (CFG) weight that balances adherence to the text conditioning vs. model priors. Larger values force the model to follow the conditioning (text/prompt) more strongly, which can improve faithfulness but may increase artifacts if set too high. Valid range: 0.05 - 1.0. |
| min_p | A lower-probability cutoff used during sampling to filter extremely unlikely tokens or frames. Helps avoid very low-probability outputs that could degrade quality. Valid range: 0.0 - 1.0. |
| top_p | Nucleus (top-p) sampling cumulative probability threshold. The decoder samples from the smallest set of tokens whose cumulative probability ≥ top_p. top_p = 1.0 disables nucleus filtering (i.e., sample from the full distribution). Valid range: 0.0 - 1.0. |
| repetition_penalty | Penalizes repetition during generation. Values > 1.0 discourage repeating the same tokens/frames, reducing looping/redundancy in speech. Valid range: 1.0 - 2.0. |
| voice_embedding | Optional. If provided, the audio reference is used as the audio prompt for voice cloning. |
| disable_watermark | By default, audio output is watermarked using PerTh watermarking. Select true to disable it. |
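
When settings are filled in from a script or a saved preset rather than the node's own widgets, it can be useful to clamp them to the documented ranges first. The sketch below is only an illustration and is not how the node itself validates input: the `clamp_settings` helper and the example values are hypothetical, while the ranges are copied from the settings listed above.

```python
# Valid ranges copied from the settings listed above.
VALID_RANGES = {
    "exaggeration": (0.25, 2.0),
    "temperature": (0.15, 2.0),
    "cfg_weight": (0.05, 1.0),
    "min_p": (0.0, 1.0),
    "top_p": (0.0, 1.0),
    "repetition_penalty": (1.0, 2.0),
}


def clamp_settings(settings):
    """Return a copy of settings with each known numeric value clamped to its range."""
    clamped = dict(settings)
    for name, (low, high) in VALID_RANGES.items():
        if name in clamped:
            clamped[name] = min(max(float(clamped[name]), low), high)
    return clamped


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the clamping.
    requested = {"exaggeration": 3.0, "temperature": 0.8, "cfg_weight": 0.0}
    print(clamp_settings(requested))
```

Keys that are not in `VALID_RANGES`, such as `disable_watermark`, pass through unchanged, so the helper can be applied to a full settings dictionary.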

ComfyUI Voice Clone

resemble-ai/chatterbox (GitHub)

List of animated films in the public domain in the United States (Wikipedia)