ComfyUI Voice-Driven Workflow Tutorial: Create Images with Voice Commands
A step-by-step guide to building voice-controlled AI image generation workflows in ComfyUI—from speech-to-text (STT) integration to voice-adjustable parameters.
Introduction
Voice-driven workflows in ComfyUI let you control AI image generation using natural speech instead of typing prompts or adjusting sliders manually. This setup is ideal for:
- Hands-free operation (e.g., while sketching or multitasking)
- Fast iterations (dictate prompt tweaks in real time)
- Accessibility (for users who prefer voice over text)
- Dynamic parameter control (adjust style, resolution, or strength with voice commands)
This tutorial covers two core workflows:
- Basic: Voice-to-text (STT) → Text-to-image (T2I) (convert speech to prompts for image generation)
- Advanced: Voice-controlled parameters (dictate style, strength, or resolution without touching the UI)
All tools used are free, open-source, and compatible with ComfyUI’s node-based system.
Prerequisites
Before starting, ensure you have:
- An updated ComfyUI installation (nightly build recommended; see the official install guide)
- GPU with 8GB+ VRAM (12GB+ for smooth performance with large models)
- Microphone (built-in or external; required for voice input)
- Stable internet (for downloading STT models and plugins)
- Basic familiarity with ComfyUI’s core nodes (Load Checkpoint, Sampler, etc.)
Required Tools & Models
| Component | Purpose | Download/Installation Link |
|---|---|---|
| ComfyUI Voice Input Plugin | Adds voice recognition nodes to ComfyUI | GitHub (install via ComfyUI Manager) |
| Whisper STT Model | Converts speech to text (open-source) | Hugging Face (or "base" for low VRAM) |
| Text-to-Image Model | Core image generation (e.g., SD 1.5/XL) | Civitai or ComfyUI’s default models |
| Optional: ControlNet | For voice-driven pose/style control | Civitai |
Part 1: Set Up Voice Input in ComfyUI
Step 1: Install the Voice Input Plugin
- Open ComfyUI and go to ComfyUI Manager (top-right → "Manager" icon).
- Search for "Voice Input" in the plugin store.
- Click "Install" and restart ComfyUI to activate the plugin.
Step 2: Download the Whisper STT Model
- Navigate to the Whisper Model Hub and choose a model based on your VRAM:
  - `whisper-tiny` (~150MB): 4GB+ VRAM (faster, less accurate)
  - `whisper-small` (~1GB): 8GB+ VRAM (balanced speed/accuracy)
  - `whisper-medium` (~3GB): 12GB+ VRAM (most accurate)
- Download the `.bin` model file (e.g., `pytorch_model.bin`).
- Create a new folder `whisper/` in `ComfyUI/models/` and move the downloaded file there.
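For scripted setups, the VRAM guidance above can be condensed into a small helper (a hypothetical convenience function, not part of the plugin):

```python
def pick_whisper_model(vram_gb: float) -> str:
    """Suggest a Whisper model size for the available GPU VRAM,
    following the rough guidance in the list above."""
    if vram_gb >= 12:
        return "whisper-medium"  # most accurate
    if vram_gb >= 8:
        return "whisper-small"   # balanced speed/accuracy
    return "whisper-tiny"        # fastest, least accurate


print(pick_whisper_model(16))  # whisper-medium
```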
Step 3: Verify Voice Input Setup
- Reopen ComfyUI and right-click the canvas → Add Node → Voice Input.
- Confirm you see these nodes (this means the plugin installed correctly):
  - `Voice Recorder` (captures microphone input)
  - `Whisper STT` (converts speech to text)
  - `Voice Parameter Parser` (optional: extracts parameters from voice commands)
Part 2: Basic Workflow – Voice-to-Text → Text-to-Image
This workflow converts your voice prompts into text, then uses that text to generate images.
Step 1: Add Core Nodes
Right-click the ComfyUI canvas and add these nodes:
- Voice Recorder (Voice Input → Voice Recorder)
- Whisper STT (Voice Input → Whisper STT)
- Load Checkpoint (Model → Load Checkpoint)
- CLIP Text Encode (Text → CLIP Text Encode)
- KSampler (Sampler → KSampler)
- Save Image (Output → Save Image)
Step 2: Configure Nodes
1. Voice Recorder Node
   - Leave `device` as "default" (uses your system's default microphone).
   - Set `sample_rate` to 16000 (Whisper's preferred rate; no need to change it).
2. Whisper STT Node
   - Click `model_name` and select your downloaded Whisper model (e.g., `whisper-small`).
   - Set `language` to "en" (or your preferred language, e.g., "es" for Spanish, "fr" for French).
3. Load Checkpoint Node
   - Select a text-to-image model (e.g., `SDXL-Base-1.0` or `Realistic Vision 6.0`).
Step 3: Connect the Workflow
Follow this node connection order (critical for data flow):
- Connect `Voice Recorder` → `audio` output to `Whisper STT` → `audio` input.
- Connect `Whisper STT` → `text` output to `CLIP Text Encode` → `text` input.
- Connect `Load Checkpoint` → `model` output to `KSampler` → `model` input.
- Connect `Load Checkpoint` → `clip` output to `CLIP Text Encode` → `clip` input.
- Connect `CLIP Text Encode` → `conditioning` output to `KSampler` → `positive` input.
- Connect `Empty Latent Image` → `latent` output to `KSampler` → `latent_image` input.
- Connect `KSampler` → `latent` output to `VAE Decode` → `samples` input.
- Connect `Load Checkpoint` → `vae` output to `VAE Decode` → `vae` input.
- Connect `VAE Decode` → `image` output to `Save Image` → `image` input.
Step 4: Run the Voice-Driven Workflow
- Click the Record button on the `Voice Recorder` node (the button turns red while recording).
- Speak your prompt clearly (e.g., "A cyberpunk city at night with neon lights, photorealistic, 8k resolution").
- Click Stop (or wait for auto-stop after 5 seconds of silence).
- The `Whisper STT` node converts your speech to text (check the node's `text` output for accuracy).
- Click Queue Prompt (top-right) to generate the image.
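The 5-second auto-stop behaves roughly like a trailing silence check. A minimal sketch, assuming raw amplitude samples in the range -1.0 to 1.0 (the plugin's actual implementation may differ):

```python
def should_auto_stop(samples, sample_rate=16000, silence_secs=5.0, threshold=0.01):
    """Return True once the trailing `silence_secs` of audio stays
    below the amplitude `threshold` (i.e., the speaker went quiet)."""
    window = int(sample_rate * silence_secs)
    if len(samples) < window:
        return False  # not enough audio recorded yet
    return all(abs(s) < threshold for s in samples[-window:])
```

In practice, a recorder would run this check on each incoming audio chunk and stop capture as soon as it returns True.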
:::tip Pro Tip
For longer prompts, pause briefly between sentences—Whisper handles natural pauses better than rapid speech. If the text is inaccurate, re-record with slower, clearer pronunciation.
:::
Part 3: Advanced Workflow – Voice-Controlled Parameters
Take your workflow further by using voice commands to adjust generation parameters (e.g., style, strength, resolution) without editing nodes.
Step 1: Add Advanced Nodes
Add these nodes to your basic workflow:
- Voice Parameter Parser (Voice Input → Voice Parameter Parser)
- Float/Integer Nodes (Utils → Float / Utils → Integer) – for parameter values
- ControlNet Loader (ControlNet → Load ControlNet) – optional (for voice-driven style control)
Step 2: Define Voice-Controllable Parameters
The Voice Parameter Parser lets you map voice commands to numerical values. For example:
- "Set strength to 0.8" → Adjusts LoRA/ControlNet strength
- "Resolution 1024x768" → Sets image width/height
- "Style realistic" → Switches to a photorealistic model
Configure the Voice Parameter Parser:
- In the `Voice Parameter Parser` node, click `add parameter`.
- For each parameter, set:
  - Parameter Name: e.g., "strength", "width", "height", "style"
  - Command Prefix: e.g., "set strength to", "resolution width", "resolution height", "style"
  - Default Value: e.g., 0.7 (strength), 1024 (width), 768 (height), "realistic" (style)
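Under the hood, prefix-based parsing like this can be approximated with regular expressions. A simplified sketch, for illustration only (the plugin's actual parsing logic is an assumption here):

```python
import re

# Each parameter: the spoken prefix it listens for, a default, and a type cast.
PARAMS = {
    "strength": ("set strength to", 0.7, float),
    "width":    ("resolution width", 1024, int),
    "height":   ("resolution height", 768, int),
    "style":    ("style", "realistic", str),
}

def parse_voice_command(text: str) -> dict:
    """Extract parameter values from a transcribed command string,
    falling back to defaults when a prefix is not spoken."""
    text = text.lower()
    values = {}
    for name, (prefix, default, cast) in PARAMS.items():
        match = re.search(re.escape(prefix) + r"\s+([\w.]+)", text)
        values[name] = cast(match.group(1)) if match else default
    return values

print(parse_voice_command("style anime, set strength to 0.9"))
```

Note how unspoken parameters fall back to their defaults, matching the Default Value behavior described above.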
Step 3: Connect Parameter Nodes
- Connect `Whisper STT` → `text` output to `Voice Parameter Parser` → `text` input.
- For strength control:
  - Connect `Voice Parameter Parser` → `strength` output to a `Float` node → `value` input.
  - Connect the `Float` node to your `Load LoRA` or `ControlNet` node → `strength` input.
- For resolution control:
  - Connect `Voice Parameter Parser` → `width`/`height` outputs to `Integer` nodes.
  - Connect these nodes to the `Empty Latent Image` node's `width`/`height` inputs.
- For style control (advanced):
  - Add a `Model Switch` node (Model → Model Switch).
  - Map voice commands like "style anime" or "style realistic" to different checkpoint models via the parser.
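Conceptually, the style control above is just a lookup from a spoken keyword to a checkpoint. A sketch with hypothetical checkpoint filenames (substitute models you actually have):

```python
# Hypothetical checkpoint filenames -- not shipped with ComfyUI.
STYLE_CHECKPOINTS = {
    "realistic": "Realistic_Vision_6.0.safetensors",
    "anime": "anime_style.safetensors",
}

def checkpoint_for_style(style: str) -> str:
    """Map a spoken style keyword to a checkpoint, defaulting to realistic."""
    return STYLE_CHECKPOINTS.get(style.lower(), STYLE_CHECKPOINTS["realistic"])

print(checkpoint_for_style("Anime"))  # anime_style.safetensors
```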
Step 4: Test Voice-Controlled Generation
- Record a voice command with parameters (e.g., "Style anime, set strength to 0.9, resolution 1024x1024, a cute cat with wings").
- The `Voice Parameter Parser` extracts the values (style: anime, strength: 0.9, width: 1024, height: 1024) and passes them to the connected nodes.
- Queue the prompt to generate an image that matches your voice-controlled parameters.
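Compact commands like "resolution 1024x1024" can be split into width and height with a single pattern. An illustrative sketch (not the plugin's actual code):

```python
import re

def parse_resolution(command: str, default=(1024, 768)):
    """Pull (width, height) out of a phrase like 'resolution 1024x768',
    returning `default` when no resolution is spoken."""
    match = re.search(r"resolution\s+(\d+)\s*x\s*(\d+)", command.lower())
    return (int(match.group(1)), int(match.group(2))) if match else default

print(parse_resolution("Style anime, resolution 1024x1024, a cute cat"))  # (1024, 1024)
```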
:::note Important
Keep parameter commands simple and consistent. Avoid ambiguous phrases (e.g., "make it bigger" → instead use "set width to 1280"). The parser works best with clear, direct commands.
:::
Part 4: Troubleshooting Common Issues
Issue 1: Voice Recorder Not Capturing Audio
- Check that your microphone is set as the default device (system settings → Sound).
- Ensure no other app is using the microphone (e.g., Zoom, Discord).
- Reinstall the Voice Input plugin (ComfyUI Manager → Uninstall → Reinstall).
Issue 2: Whisper STT Generates Inaccurate Text
- Use a larger Whisper model (e.g., `whisper-medium` instead of `whisper-tiny`).
- Speak more slowly and clearly, and avoid background noise.
- Set the correct `language` in the Whisper STT node (e.g., "en" for English, not "auto").
Issue 3: Parameters Not Updating with Voice Commands
- Verify the `Voice Parameter Parser` has the correct `Command Prefix` (e.g., "set strength to" matches your speech).
- Ensure the parser's outputs are connected to the right nodes (e.g., `strength` → LoRA/ControlNet strength).
- Test with simple commands first (e.g., "set strength to 0.8") before complex multi-parameter commands.
Issue 4: Out-of-Memory (OOM) Errors
- Use a smaller Whisper model (e.g., `whisper-small` instead of `whisper-large`).
- Reduce image resolution (e.g., 768x768 instead of 1024x1024).
- Enable model offloading (ComfyUI → Settings → Performance → "Enable Model Offloading").
Issue 5: Workflow Runs but No Image Generates
- Check that all nodes are connected correctly (follow the connection order in Part 2/3).
- Verify the `Save Image` node has a valid output path (default: `ComfyUI/output/`).
- Ensure the KSampler's `positive` input has conditioning connected (a missing connection causes an error, and an empty prompt produces unguided, low-quality results).
Part 5: Best Practices
- Train Your Speech Pattern: Stick to consistent command structures (e.g., always use "set parameter to value") for more reliable parsing.
- Minimize Background Noise: Record in a quiet room or use a noise-canceling microphone—Whisper struggles with loud environments.
- Test Prompts First: Type and test prompts before using voice to ensure they generate good results (avoids re-recording due to prompt issues).
- Save Workflows: Save successful voice workflows (File → Save) to reuse them later—no need to rebuild nodes.
- Update Tools Regularly: The Voice Input plugin and Whisper models are updated frequently—new versions improve accuracy and performance.
- Limit Parameters: Don’t overload commands with 5+ parameters—stick to 2-3 at a time for better parsing.
Conclusion
Voice-driven workflows in ComfyUI add a new level of speed and accessibility to AI image generation. Whether you’re a designer looking to iterate quickly, or a user who prefers voice over text, this setup lets you control every step of the process with natural speech.
Start with the basic voice-to-text workflow, then experiment with advanced parameter control to unlock full flexibility.
Happy voice-controlled generating! 🎤🎨