Voice Design Prompting Guide — How to Describe the Perfect AI Voice
Voice Design lets you create entirely new voices by writing a text description. No audio samples, no microphone, no recording — just words. The quality of the voice you get depends almost entirely on how you describe it. This guide walks you through everything you need to know to write great voice prompts.
Understand the Seven Dimensions
Every voice can be broken down into a set of core attributes. The more of these you include in your description, the more control you have over the result.
| Dimension | Example Values |
|---|---|
| Gender | Male, female, neutral |
| Age | Child (5–12), teenager (13–18), young adult (19–35), middle-aged (36–55), elderly (56+) |
| Pitch | High, medium, low, high-pitched with resonance, deep baritone |
| Pace | Fast, moderate, slow, deliberate with pauses, rapid-fire |
| Emotion | Cheerful, calm, serious, gentle, lively, soothing, intense |
| Characteristics | Magnetic, crisp, hoarse, raspy, breathy, mellow, sweet, rich, gravelly |
| Use case | Narration, news broadcast, ad voice-over, audiobook, game character, voice assistant |
You don't need to fill in every dimension for every voice. But the more specific you are, the more closely the result will match what you're imagining.
Write Your First Prompt
Start simple. Begin with two or three key traits — gender, age, and one defining quality. Generate a preview, listen, and then add more detail in the next iteration.
- Start with the basics and generate a preview.
"A young female voice, mid-twenties, warm tone."
- Add more detail based on what you hear.
"A young female voice, mid-twenties, warm and smooth tone, moderate pace, suitable for audiobook narration."
- Get specific about the qualities that matter most.
"A young female voice, mid-twenties, warm and smooth with a slight breathiness. Moderate pace, calm and reassuring delivery. Suitable for guided meditation or audiobook narration."
Tip: Each generation produces a unique voice, even with the same description. If you like the direction but not the exact result, generate again with the same prompt before changing it.
Example Prompts That Work
These prompts demonstrate the level of detail that produces strong, distinctive results.
Narrator
"A calm, middle-aged male voice with a deep, magnetic tone. Slow, steady pace with clear articulation. Suitable for documentary narration."
Street Character
"A rough, fast-talking male voice, mid-thirties, medium pitch with sharp rising inflections, raspy and brash, high energy. Suitable for character acting."
Fantasy Game Character
"Gender: female. Age: fifties. Pitch: low with an eerie resonance. Pace: slow and deliberate with dramatic pauses. Emotion: mysterious, commanding. Characteristics: smooth, powerful. Use case: fantasy game dialogue."
Child
"A bright, curious child's voice, around 8 years old, high pitch with expressive intonation. Moderate pace with occasional excited bursts, cheerful and innocent."
Anime Villain
"A low-pitched male voice with dramatic pitch swings, intimidating and mischievous, suitable for anime voice-overs. Add dramatic pauses."
Late-Night Radio Host
"A smooth, alluring young female voice, late twenties, low pitch with a breathy quality. Slow, deliberate pace. Warm and intimate. Suitable for late-night radio."
Notice the pattern: the best prompts combine who the voice belongs to (age, gender), how it sounds (pitch, pace, texture), and where it will be used (narration, game, radio).
Common Mistakes
| What People Write | Why It Doesn't Work | Write This Instead |
|---|---|---|
| "A nice voice" | Too vague — the model has nothing specific to work with | "A warm female voice, mid-thirties, gentle and calm, moderate pace" |
| "Make it sound like [celebrity]" | The model doesn't know specific people | "A deep, authoritative male voice with a commanding presence and measured pace" |
| "Very very very energetic female" | Repeating words doesn't increase intensity | "An energetic young female, fast pace, bright and enthusiastic, high pitch" |
| "Male" | Single-dimension descriptions produce generic results | "A young adult male, smooth baritone, steady pace, confident and relaxed" |
| "Angry screaming old man fast loud" | Keyword lists lack structure | "An elderly male voice, high energy, fast-paced, rough and gravelly, shouting with anger" |
The common thread: vague or unstructured descriptions produce generic voices. Specific, multi-dimensional descriptions produce distinctive ones.
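The mistakes in the table can be caught with a rough pre-check before generating. The sketch below is only a heuristic, not part of the product: the function names and the keyword lists are assumptions made for illustration, with keywords drawn mostly from the dimension table and example prompts in this guide. It counts how many dimensions a prompt seems to cover and flags words repeated back to back.

```python
import re

# Illustrative keyword hints per dimension (an assumption, not an official list).
DIMENSION_HINTS = {
    "gender": ["male", "female", "neutral"],
    "age": ["child", "teenager", "young", "middle-aged", "elderly",
            "twenties", "thirties", "forties", "fifties", "years old"],
    "pitch": ["pitch", "baritone", "deep", "high", "low"],
    "pace": ["pace", "fast", "slow", "moderate", "steady", "rapid"],
    "emotion": ["cheerful", "calm", "serious", "gentle", "lively",
                "soothing", "intense", "energetic", "confident"],
    "characteristics": ["magnetic", "crisp", "hoarse", "raspy", "breathy",
                        "mellow", "sweet", "rich", "gravelly", "warm", "smooth"],
    "use case": ["narration", "broadcast", "voice-over", "audiobook",
                 "game", "assistant"],
}


def covered_dimensions(prompt: str) -> set[str]:
    """Return the dimensions for which at least one hint word appears."""
    text = prompt.lower()
    found = set()
    for dimension, hints in DIMENSION_HINTS.items():
        # Whole-word match, so "male" does not match inside "female".
        if any(re.search(rf"\b{re.escape(hint)}\b", text) for hint in hints):
            found.add(dimension)
    return found


def lint_prompt(prompt: str, min_dimensions: int = 3) -> list[str]:
    """Return human-readable warnings; an empty list means no issue was found."""
    warnings = []
    dims = covered_dimensions(prompt)
    if len(dims) < min_dimensions:
        warnings.append(
            f"Only {len(dims)} dimension(s) detected; aim for at least {min_dimensions}."
        )
    # The same word repeated back to back, e.g. "very very very".
    if re.search(r"\b(\w+)(\s+\1\b)+", prompt.lower()):
        warnings.append("Repeated word found; repetition does not add intensity.")
    return warnings


for candidate in ["A nice voice", "Very very very energetic female"]:
    print(candidate, "->", lint_prompt(candidate))
```

Keyword matching will miss synonyms and can misread phrases such as "high energy" as a pitch cue, so treat the warnings as a nudge to reread the prompt rather than a verdict.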
The Iterative Workflow
Voice design works best as a loop, not a single shot.
- Describe: Write a prompt with 2–4 key attributes. Don't try to get it perfect on the first attempt.
- Generate and listen: Pay attention to what's close and what's off. Is the pitch right but the pace too fast? Is the emotion right but the age wrong?
- Refine: Adjust the specific dimension that needs work. Add detail where the voice fell short, and remove anything that pulled it in the wrong direction.
Tip: Keep a note of prompts that produced voices you liked. Small wording changes can shift the result significantly, so having a reference point saves time.
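One way to keep that reference point is to hold each attempt as structured data and change a single field between attempts. The following is a minimal sketch under stated assumptions: `VoiceSpec`, `changed_dimensions`, and the listening notes are hypothetical names and values invented for illustration, and nothing here calls the voice generator itself.

```python
from dataclasses import dataclass, replace, asdict


@dataclass(frozen=True)
class VoiceSpec:
    """Hypothetical container for a few voice dimensions (not a product API)."""
    gender: str
    age: str
    pitch: str
    pace: str
    emotion: str

    def to_prompt(self) -> str:
        # Render the fields as a single descriptive sentence.
        return (f"A {self.age} {self.gender} voice, {self.pitch} pitch, "
                f"{self.pace} pace, {self.emotion}.")


def changed_dimensions(old: VoiceSpec, new: VoiceSpec) -> list[str]:
    """List the fields that differ between two attempts."""
    before, after = asdict(old), asdict(new)
    return [name for name in before if before[name] != after[name]]


first = VoiceSpec("female", "mid-twenties", "medium", "fast", "warm and calm")
# Refine only the dimension that sounded off; everything else stays fixed.
second = replace(first, pace="moderate")

# A simple log of prompts paired with listening notes.
history = [
    (first.to_prompt(), "pitch is right, pace too fast"),
    (second.to_prompt(), "close to the target"),
]

print(changed_dimensions(first, second))
for prompt, note in history:
    print(f"{prompt}  [{note}]")
```

Because the spec is frozen, every refinement produces a new object, so earlier attempts stay intact in the log.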
Advanced Tips
Use Structured Format for Complex Voices
For voices with many specific requirements, a structured key-value format can be clearer than a single sentence:
"Gender: male. Age: mid-forties. Pitch: deep, low. Pace: steady, measured. Emotion: serious, intense. Characteristics: rough, hoarse, gravelly. Use case: fantasy game dialogue or action trailers."
This makes it easier to isolate and adjust individual dimensions without rewriting the whole prompt.
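If you assemble prompts programmatically, the key-value format is easy to generate from a dictionary. This is a small sketch, assuming the seven dimension names from the table at the top of this guide; `build_structured_prompt` is a hypothetical helper written for illustration, not a function provided by the product.

```python
# Dimension names in the order used by the structured examples in this guide.
DIMENSION_ORDER = [
    "Gender", "Age", "Pitch", "Pace", "Emotion", "Characteristics", "Use case",
]


def build_structured_prompt(dimensions: dict[str, str]) -> str:
    """Render the filled-in dimensions as 'Key: value.' pairs in a fixed order."""
    unknown = set(dimensions) - set(DIMENSION_ORDER)
    if unknown:
        raise ValueError(f"Unknown dimensions: {sorted(unknown)}")
    # Skip dimensions that are missing or empty, since not every voice needs all seven.
    parts = [
        f"{name}: {dimensions[name]}."
        for name in DIMENSION_ORDER
        if dimensions.get(name)
    ]
    return " ".join(parts)


warrior = {
    "Gender": "male",
    "Age": "mid-forties",
    "Pitch": "deep, low",
    "Pace": "steady, measured",
    "Emotion": "serious, intense",
    "Characteristics": "rough, hoarse, gravelly",
    "Use case": "fantasy game dialogue or action trailers",
}
prompt = build_structured_prompt(warrior)
print(prompt)

# Adjust a single dimension without touching the rest.
slower = build_structured_prompt({**warrior, "Pace": "slow, with dramatic pauses"})
print(slower)
```

Rejecting unknown keys catches typos such as "Speed" early, before a misspelled dimension silently drops out of the prompt.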
Match Your Preview Text to the Voice
The text the voice reads during preview matters. An energetic voice description paired with a flat, monotone sentence will produce underwhelming results. Give the voice something to perform:
- For an excited character, use text with exclamations and questions
- For a calm narrator, use smooth, descriptive prose
- For a commanding voice, use short, direct sentences
Save and Reuse
Once you generate a voice you like, save it to your library. You can then use it for text-to-speech generation anytime — just like a cloned voice. This means you don't need to re-describe and hope for the same result.
Start Broad, Then Narrow
If you're not sure exactly what you want, start with a broad archetype ("elderly British professor") and generate a few variations. Once you hear something close, add the specific qualities that would make it perfect ("elderly British professor, dry wit, measured pace, slight rasp").
Frequently Asked Questions
How long should my description be?
One to three sentences is the sweet spot. Include at least gender, age, and one or two distinctive qualities. Longer descriptions give more control, but extremely long prompts can sometimes produce inconsistent results, so aim for specific, not exhaustive.
Why do I get a different voice each time I use the same description?
Every generation creates a unique voice. This is by design: it lets you explore variations without changing your description. If you get a voice you like, save it immediately so you can reuse it.
What should I do if the voice doesn't match my description?
Try breaking down what's wrong. If the pitch is off, adjust only the pitch-related words. If the emotion doesn't match, swap the emotion descriptors. Changing one dimension at a time makes it easier to converge on the right result. Also try the structured key-value format, which can give the model clearer signals than a single flowing sentence.
Can I use Voice Design through the API?
Yes. The local REST API exposes a voice design endpoint that accepts the same text descriptions. See the API documentation for details on integrating voice design into your own applications.