Voice Clone
Custom TTS node that clones a voice from a reference audio clip and speaks the entered text.
Install Voice Clone Custom Node
Install the ComfyUI Voice Clone custom node using the manager.

Alternatively, install it from a command prompt or terminal.
- Navigate to your `ComfyUI/custom_nodes` folder.
- Run `git clone https://github.com/Sean-Bradley/ComfyUI-Voice-Clone.git`
- Navigate to your `ComfyUI_windows_portable` folder.
- Run `python_embeded\python -m pip install -r ComfyUI/custom_nodes/ComfyUI-Voice-Clone/requirements.txt`
- Restart ComfyUI.
Install Models
All required models can be downloaded from https://huggingface.co/ResembleAI/chatterbox/tree/main
Ensure that your folder structure and downloaded files match the layout below.
📂 ComfyUI/
├── 📂 models/
│   ├── 📂 tts/
│   │   └── 📂 chatterbox/
│   │       ├── added_tokens.json
│   │       ├── conds.pt
│   │       ├── merges.txt
│   │       ├── s3gen.safetensors
│   │       ├── s3gen_meanflow.safetensors
│   │       ├── special_tokens_map.json
│   │       ├── t3_turbo_v1.safetensors
│   │       ├── tokenizer_config.json
│   │       ├── ve.safetensors
│   │       └── vocab.json
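To confirm the layout before starting ComfyUI, a small check along these lines can help. This is only a sketch: the function name `missing_model_files` is hypothetical, the file list is copied from the folder listing above, and the `ComfyUI` root path is an assumption you should point at your own install.

```python
from pathlib import Path

# File names copied from the folder listing above.
EXPECTED_FILES = [
    "added_tokens.json",
    "conds.pt",
    "merges.txt",
    "s3gen.safetensors",
    "s3gen_meanflow.safetensors",
    "special_tokens_map.json",
    "t3_turbo_v1.safetensors",
    "tokenizer_config.json",
    "ve.safetensors",
    "vocab.json",
]


def missing_model_files(comfyui_root):
    """Return the expected file names not found under models/tts/chatterbox."""
    model_dir = Path(comfyui_root) / "models" / "tts" / "chatterbox"
    return [name for name in EXPECTED_FILES if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # "ComfyUI" is an assumed relative path; adjust it to your install location.
    missing = missing_model_files("ComfyUI")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All chatterbox model files are in place.")
```

An empty result means every file in the listing was found. The check only looks at file names, not at file sizes or contents.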
Sample Workflows
Voice Clone
Drag this link into ComfyUI to see the workflow.

Voice Replace
Drag this link into ComfyUI to see the workflow.

Sample Audios
| Download (Right Click, Save Audio As...) | Description |
|---|---|
| Audio snippets assembled from So Much for So Little animated cartoon. Copyright © 1949 Warner Bros. Cartoons | |
| Audio snippets assembled from Puss n' Booty animated cartoon. Copyright © 1943 Warner Bros. Cartoons | |
| Audio snippets assembled from Scrap Happy Daffy animated cartoon. Copyright © 1943 Warner Bros. Cartoons | |
| Audio snippets assembled from Night of the Living Dead (1968). Copyright © 1968 Image Ten, Inc | |
| Audio snippets assembled from Night of the Living Dead (1968). Copyright © 1968 Image Ten, Inc | |
| Audio snippets assembled from Psycho (1960). Copyright © 1960 Shamley Productions, Inc. | |
| Audio snippets assembled from Psycho (1960). Copyright © 1960 Shamley Productions, Inc. | |
Settings
| Setting | Description |
|---|---|
| temperature | Sampling temperature for the text-to-speech decoder. Higher values increase randomness and variety in the generated audio; lower values make outputs more conservative and deterministic. Valid range: 0.15 - 2.0. |
| top_p | Nucleus (top-p) sampling cumulative probability threshold. The decoder samples from the smallest set of tokens whose cumulative probability is ≥ top_p. Setting top_p = 1.0 disables nucleus filtering (i.e., the decoder samples from the full distribution). Valid range: 0.0 - 1.0. |
| repetition_penalty | Penalizes repetition during generation. Values > 1.0 discourage repeating the same tokens/frames, which reduces looping and redundancy in the speech. Valid range: 1.0 - 2.0. |
| voice_embedding (optional) | If provided, an audio reference is used as an audio prompt for voice cloning. |
| top_k | At each step of generation, the model predicts probabilities for many possible next tokens (text or acoustic tokens). Only the k most probable tokens are kept as candidates, and the next token is sampled from those. |
| normalize | Normalize the audio output volume. |
| disable_watermark | By default, audio output is watermarked using PerTh Watermarking. Set this to true to disable the watermark. |
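To make the interaction between temperature, top_k, top_p and repetition_penalty concrete, here is a generic, standard-library sketch of how such settings typically shape one sampling step. It is not taken from this node or from Chatterbox. The function names are hypothetical, the repetition penalty follows one common convention, and treating `top_k=0` as "no limit" is an assumption of this sketch.

```python
import math
import random


def apply_repetition_penalty(logits, previous_tokens, penalty):
    """Make already-generated tokens less likely (penalty > 1.0)."""
    adjusted = list(logits)
    for tok in set(previous_tokens):
        # Common convention: shrink positive logits, push negative ones lower.
        if adjusted[tok] > 0:
            adjusted[tok] /= penalty
        else:
            adjusted[tok] *= penalty
    return adjusted


def filtered_probabilities(logits, temperature=0.8, top_k=0, top_p=1.0):
    """Return per-token probabilities after temperature, top-k and top-p filtering."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # top_k: keep only the k most probable tokens (0 means "no limit" here).
    if top_k > 0:
        ranked = ranked[:top_k]
    mass = sum(probs[i] for i in ranked)

    # top_p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens; all others get probability 0.
    kept_total = sum(probs[i] for i in kept)
    result = [0.0] * len(probs)
    for i in kept:
        result[i] = probs[i] / kept_total
    return result


def sample_token(logits, previous_tokens, temperature, top_k, top_p,
                 repetition_penalty, rng=random):
    """Draw one token index using all four settings."""
    penalized = apply_repetition_penalty(logits, previous_tokens, repetition_penalty)
    probs = filtered_probabilities(penalized, temperature, top_k, top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In this sketch, `top_k=1` always picks the most probable token, while `top_p=1.0` with `top_k=0` leaves the temperature-scaled distribution untouched.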
Paralinguistic tags
[clear throat] [sigh] [shush] [cough] [groan] [sniff] [gasp] [chuckle] [laugh]
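A quick way to catch misspelled tags before generating audio is to compare the bracketed tags in your text against the list above. The sketch below assumes that tags are written inline in the text to be spoken and that matching is exact. Neither point is confirmed by this page, and the helper name `unknown_tags` is hypothetical.

```python
import re

# Tags copied from the list above.
KNOWN_TAGS = {
    "clear throat", "sigh", "shush", "cough", "groan",
    "sniff", "gasp", "chuckle", "laugh",
}


def unknown_tags(text):
    """Return bracketed tags in the text that are not in the list above."""
    found = re.findall(r"\[([^\[\]]+)\]", text)
    return [tag for tag in found if tag not in KNOWN_TAGS]


if __name__ == "__main__":
    print(unknown_tags("Well [sigh] that was close [giggle]"))
```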
Useful Links
resemble-ai/chatterbox (github)
List of animated films in public domain United States (wikipedia)