ComfyUI Voice-Driven Workflow Tutorial: Create Images with Voice Commands
A step-by-step guide to building voice-controlled AI image generation workflows in ComfyUI—from speech-to-text (STT) integration to voice-adjustable parameters.
Introduction
Voice-driven workflows in ComfyUI let you control AI image generation using natural speech instead of typing prompts or adjusting sliders manually. This setup is ideal for:
- Hands-free operation (e.g., while sketching or multitasking)
- Fast iterations (dictate prompt tweaks in real time)
- Accessibility (for users who prefer voice over text)
- Dynamic parameter control (adjust style, resolution, or strength with voice commands)
This tutorial covers two core workflows:
- Basic: Voice-to-text (STT) → Text-to-image (T2I) (convert speech to prompts for image generation)
- Advanced: Voice-controlled parameters (dictate style, strength, or resolution without touching the UI)
All tools used are free, open-source, and compatible with ComfyUI’s node-based system.
Prerequisites
Before starting, ensure you have:
- An updated ComfyUI installation (nightly build recommended; see the official install guide)
- GPU with 8GB+ VRAM (12GB+ for smooth performance with large models)
- Microphone (built-in or external; required for voice input)
- Stable internet (for downloading STT models and plugins)
- Basic familiarity with ComfyUI’s core nodes (Load Checkpoint, Sampler, etc.)
Required Tools & Models
| Component | Purpose | Download/Installation Link |
|---|---|---|
| ComfyUI Voice Input Plugin | Adds voice recognition nodes to ComfyUI | GitHub (install via ComfyUI Manager) |
| Whisper STT Model | Converts speech to text (open-source) | Hugging Face (or "base" for low VRAM) |
| Text-to-Image Model | Core image generation (e.g., SD 1.5/XL) | Civitai or ComfyUI’s default models |
| Optional: ControlNet | For voice-driven pose/style control | Civitai |
Part 1: Set Up Voice Input in ComfyUI
Step 1: Install the Voice Input Plugin
- Open ComfyUI and go to ComfyUI Manager (top-right → "Manager" icon).
- Search for "Voice Input" in the plugin store.
- Click "Install" and restart ComfyUI to activate the plugin.
Step 2: Download the Whisper STT Model
- Navigate to the Whisper Model Hub and choose a model based on your VRAM:
  - `whisper-tiny` (~150MB): 4GB+ VRAM (faster, less accurate)
  - `whisper-small` (~1GB): 8GB+ VRAM (balanced speed/accuracy)
  - `whisper-medium` (~3GB): 12GB+ VRAM (most accurate)
- Download the `.bin` model file (e.g., `pytorch_model.bin`).
- Create a new folder `whisper/` in `ComfyUI/models/` and move the downloaded file there.
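For scripted setups, the VRAM guidance above can be condensed into a small helper (a hypothetical convenience function, not part of the plugin):

```python
def pick_whisper_model(vram_gb: float) -> str:
    """Suggest a Whisper model size for the available GPU VRAM,
    following the rough guidance in the list above."""
    if vram_gb >= 12:
        return "whisper-medium"  # most accurate
    if vram_gb >= 8:
        return "whisper-small"   # balanced speed/accuracy
    return "whisper-tiny"        # fastest, least accurate


print(pick_whisper_model(16))  # whisper-medium
```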
Step 3: Verify Voice Input Setup
- Reopen ComfyUI and right-click the canvas → Add Node → Voice Input.
- Confirm you see these nodes (this means the plugin installed correctly):
  - `Voice Recorder` (captures microphone input)
  - `Whisper STT` (converts speech to text)
  - `Voice Parameter Parser` (optional: extracts parameters from voice commands)
Part 2: Basic Workflow – Voice-to-Text → Text-to-Image
This workflow converts your voice prompts into text, then uses that text to generate images.
Step 1: Add Core Nodes
Right-click the ComfyUI canvas and add these nodes:
- Voice Recorder (Voice Input → Voice Recorder)
- Whisper STT (Voice Input → Whisper STT)
- Load Checkpoint (Model → Load Checkpoint)
- CLIP Text Encode (Text → CLIP Text Encode)
- KSampler (Sampler → KSampler)
- Save Image (Output → Save Image)
Step 2: Configure Nodes
1. Voice Recorder Node
   - Leave `device` as "default" (uses your system's default microphone).
   - Set `sample_rate` to 16000 (Whisper's preferred rate; no need to change it).
2. Whisper STT Node
   - Click `model_name` and select your downloaded Whisper model (e.g., `whisper-small`).
   - Set `language` to "en" (or your preferred language, e.g., "es" for Spanish, "fr" for French).
3. Load Checkpoint Node
   - Select a text-to-image model (e.g., `SDXL-Base-1.0` or `Realistic Vision 6.0`).
Step 3: Connect the Workflow
Follow this node connection order (critical for data flow):
- Connect `Voice Recorder` → `audio` output to `Whisper STT` → `audio` input.
- Connect `Whisper STT` → `text` output to `CLIP Text Encode` → `text` input.
- Connect `Load Checkpoint` → `model` output to `KSampler` → `model` input.
- Connect `Load Checkpoint` → `clip` output to `CLIP Text Encode` → `clip` input.
- Connect `CLIP Text Encode` → `conditioning` output to `KSampler` → `positive` input.
- Connect `Empty Latent Image` → `latent` output to `KSampler` → `latent_image` input.
- Connect `KSampler` → `latent` output to `VAE Decode` → `samples` input.
- Connect `Load Checkpoint` → `vae` output to `VAE Decode` → `vae` input.
- Connect `VAE Decode` → `image` output to `Save Image` → `image` input.
Step 4: Run the Voice-Driven Workflow
- Click the Record button on the `Voice Recorder` node (the button turns red while recording).
- Speak your prompt clearly (e.g., "A cyberpunk city at night with neon lights, photorealistic, 8k resolution").
- Click Stop (or wait for auto-stop after 5 seconds of silence).
- The `Whisper STT` node converts your speech to text (check the node's `text` output for accuracy).
- Click Queue Prompt (top-right) to generate the image.
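The 5-second auto-stop behaves roughly like a trailing silence check. A minimal sketch, assuming raw amplitude samples in the range -1.0 to 1.0 (the plugin's actual implementation may differ):

```python
def should_auto_stop(samples, sample_rate=16000, silence_secs=5.0, threshold=0.01):
    """Return True once the trailing `silence_secs` of audio stays
    below the amplitude `threshold` (i.e., the speaker went quiet)."""
    window = int(sample_rate * silence_secs)
    if len(samples) < window:
        return False  # not enough audio recorded yet
    return all(abs(s) < threshold for s in samples[-window:])
```

In practice, a recorder would run this check on each incoming audio chunk and stop capture as soon as it returns True.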
:::tip Pro Tip
For longer prompts, pause briefly between sentences—Whisper handles natural pauses better than rapid speech. If the text is inaccurate, re-record with slower, clearer pronunciation.
:::
Part 3: Advanced Workflow – Voice-Controlled Parameters
Take your workflow further by using voice commands to adjust generation parameters (e.g., style, strength, resolution) without editing nodes.
Step 1: Add Advanced Nodes
Add these nodes to your basic workflow:
- Voice Parameter Parser (Voice Input → Voice Parameter Parser)
- Float/Integer Nodes (Utils → Float / Utils → Integer) – for parameter values
- ControlNet Loader (ControlNet → Load ControlNet) – optional (for voice-driven style control)
Step 2: Define Voice-Controllable Parameters
The Voice Parameter Parser lets you map voice commands to numerical values. For example:
- "Set strength to 0.8" → Adjusts LoRA/ControlNet strength
- "Resolution 1024x768" → Sets image width/height
- "Style realistic" → Switches to a photorealistic model
Configure the Voice Parameter Parser:
- In the `Voice Parameter Parser` node, click `add parameter`.
- For each parameter, set:
  - Parameter Name: e.g., "strength", "width", "height", "style"
  - Command Prefix: e.g., "set strength to", "resolution width", "resolution height", "style"
  - Default Value: e.g., 0.7 (strength), 1024 (width), 768 (height), "realistic" (style)
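Under the hood, prefix-based parsing like this can be approximated with regular expressions. A simplified sketch, for illustration only (the plugin's actual parsing logic is an assumption here):

```python
import re

# Each parameter: the spoken prefix it listens for, a default, and a type cast.
PARAMS = {
    "strength": ("set strength to", 0.7, float),
    "width":    ("resolution width", 1024, int),
    "height":   ("resolution height", 768, int),
    "style":    ("style", "realistic", str),
}

def parse_voice_command(text: str) -> dict:
    """Extract parameter values from a transcribed command string,
    falling back to defaults when a prefix is not spoken."""
    text = text.lower()
    values = {}
    for name, (prefix, default, cast) in PARAMS.items():
        match = re.search(re.escape(prefix) + r"\s+([\w.]+)", text)
        values[name] = cast(match.group(1)) if match else default
    return values

print(parse_voice_command("style anime, set strength to 0.9"))
```

Note how unspoken parameters fall back to their defaults, matching the Default Value behavior described above.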
Step 3: Connect Parameter Nodes
- Connect `Whisper STT` → `text` output to `Voice Parameter Parser` → `text` input.
- For strength control:
  - Connect `Voice Parameter Parser` → `strength` output to a `Float` node → `value` input.
  - Connect the `Float` node to your `Load LoRA` or `ControlNet` node → `strength` input.
- For resolution control:
  - Connect `Voice Parameter Parser` → `width`/`height` outputs to `Integer` nodes.
  - Connect these nodes to the `Empty Latent Image` node's `width`/`height` inputs.
- For style control (advanced):
  - Add a `Model Switch` node (Model → Model Switch).
  - Map voice commands like "style anime" or "style realistic" to different checkpoint models via the parser.
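Conceptually, the style control above is just a lookup from a spoken keyword to a checkpoint. A sketch with hypothetical checkpoint filenames (substitute models you actually have):

```python
# Hypothetical checkpoint filenames -- not shipped with ComfyUI.
STYLE_CHECKPOINTS = {
    "realistic": "Realistic_Vision_6.0.safetensors",
    "anime": "anime_style.safetensors",
}

def checkpoint_for_style(style: str) -> str:
    """Map a spoken style keyword to a checkpoint, defaulting to realistic."""
    return STYLE_CHECKPOINTS.get(style.lower(), STYLE_CHECKPOINTS["realistic"])

print(checkpoint_for_style("Anime"))  # anime_style.safetensors
```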
Step 4: Test Voice-Controlled Generation
- Record a voice command with parameters (e.g., "Style anime, set strength to 0.9, resolution 1024x1024, a cute cat with wings").
- The `Voice Parameter Parser` extracts the values (style: anime, strength: 0.9, width: 1024, height: 1024) and passes them to the connected nodes.
- Queue the prompt to generate an image that matches your voice-controlled parameters.
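Compact commands like "resolution 1024x1024" can be split into width and height with a single pattern. An illustrative sketch (not the plugin's actual code):

```python
import re

def parse_resolution(command: str, default=(1024, 768)):
    """Pull (width, height) out of a phrase like 'resolution 1024x768',
    returning `default` when no resolution is spoken."""
    match = re.search(r"resolution\s+(\d+)\s*x\s*(\d+)", command.lower())
    return (int(match.group(1)), int(match.group(2))) if match else default

print(parse_resolution("Style anime, resolution 1024x1024, a cute cat"))  # (1024, 1024)
```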
:::note Important
Keep parameter commands simple and consistent. Avoid ambiguous phrases (e.g., "make it bigger" → instead use "set width to 1280"). The parser works best with clear, direct commands.
:::
Part 4: Troubleshooting Common Issues
Issue 1: Voice Recorder Not Capturing Audio
- Check that your microphone is set as the default device (system settings → Sound).
- Ensure no other app is using the microphone (e.g., Zoom, Discord).
- Reinstall the Voice Input plugin (ComfyUI Manager → Uninstall → Reinstall).
Issue 2: Whisper STT Generates Inaccurate Text
- Use a larger Whisper model (e.g., `whisper-medium` instead of `whisper-tiny`).
- Speak more slowly and clearly, and avoid background noise.
- Set the correct `language` in the Whisper STT node (e.g., "en" for English, not "auto").
Issue 3: Parameters Not Updating with Voice Commands
- Verify the `Voice Parameter Parser` has the correct `Command Prefix` (e.g., "set strength to" matches your speech).
- Ensure the parser's outputs are connected to the right nodes (e.g., `strength` → LoRA/ControlNet strength).
- Test with simple commands first (e.g., "set strength to 0.8") before complex multi-parameter commands.
Issue 4: Out-of-Memory (OOM) Errors
- Use a smaller Whisper model (e.g., `whisper-small` instead of `whisper-large`).
- Reduce image resolution (e.g., 768x768 instead of 1024x1024).
- Enable model offloading (ComfyUI → Settings → Performance → "Enable Model Offloading").
Issue 5: Workflow Runs but No Image Generates
- Check that all nodes are connected correctly (follow the connection order in Part 2/3).
- Verify the `Save Image` node has a valid output path (default: `ComfyUI/output/`).
- Ensure the KSampler's `positive` input has conditioning connected (a missing connection causes an error, and an empty prompt produces unguided, low-quality results).
Part 5: Best Practices
- Train Your Speech Pattern: Stick to consistent command structures (e.g., always use "set parameter to value") for more reliable parsing.
- Minimize Background Noise: Record in a quiet room or use a noise-canceling microphone—Whisper struggles with loud environments.
- Test Prompts First: Type and test prompts before using voice to ensure they generate good results (avoids re-recording due to prompt issues).
- Save Workflows: Save successful voice workflows (File → Save) to reuse them later—no need to rebuild nodes.
- Update Tools Regularly: The Voice Input plugin and Whisper models are updated frequently—new versions improve accuracy and performance.
- Limit Parameters: Don’t overload commands with 5+ parameters—stick to 2-3 at a time for better parsing.
Conclusion
Voice-driven workflows in ComfyUI add a new level of speed and accessibility to AI image generation. Whether you’re a designer looking to iterate quickly, or a user who prefers voice over text, this setup lets you control every step of the process with natural speech.
Start with the basic voice-to-text workflow, then experiment with advanced parameter control to unlock full flexibility.
Happy voice-controlled generating! 🎤🎨