What is AI voice design?

Voice design is creating a brand new voice from a description instead of cloning an existing one from a recording. You specify what the voice should sound like (gender, age, accent, tone) and the model generates a matching voice from scratch. No reference audio is required.

What is the difference between voice design and voice cloning?

Voice cloning copies a specific real voice from a short reference clip (3 to 10 seconds). Voice design builds a new voice from a description, with no recording at all. Use cloning when you want a particular person's voice, and design when you want a voice that does not exist yet.

Which model is best for designing a voice with a specific accent?

OmniVoice. It offers ten English accents as direct, selectable options (American, British, Australian, Chinese, Canadian, Indian, Korean, Portuguese, Russian, and Japanese), which makes accents far more reliable than describing them in free text.

Which model lets me describe a voice in my own words?

Qwen3-TTS and VoxCPM 2 both take free-form text descriptions. Qwen3-TTS works from a compact character sketch like 'a gravelly, world-weary man in his sixties.' VoxCPM 2 works best with a longer description covering who is speaking, what the voice sounds like, and how it delivers, and you can also steer it by naming the setting, like 'perfect for a movie trailer.' OmniVoice uses fixed attribute options instead, and DramaBox uses a short speaker phrase inside its prompt.

Which model has the best audio quality for designed voices?

VoxCPM 2. It outputs 48kHz studio-quality audio directly, with no external upsampler, while the other models output at lower sample rates. It also supports 30 languages from a single model, so a designed voice is not limited to English.

Can I design voices in Voice Creator Pro?

Yes. Voice Creator Pro includes OmniVoice, Qwen3-TTS, DramaBox, and VoxCPM 2, so all four voice design approaches are available in one app on Windows and Mac, with VCP Cloud in the browser.

AI Voice Design Guide: 4 Ways to Create a Voice From Scratch (2026)

Most AI voice tools assume you already have a voice to copy. But what if you need a voice that does not exist yet? A narrator with a specific accent, a game character, a brand voice that is not based on any real person? That is what voice design is for: you describe the voice you want, and the model builds it from scratch. No recording required.

Four of the best open-source models, all available in Voice Creator Pro, can do this, and each takes a different approach to the input. This guide breaks down all four so you can pick the right one for the voice in your head.

Hear All Four Approaches

Four models, four ways in. Every voice below was designed from scratch, with no reference audio, using the input style noted next to it. Press play to hear each approach, then read on to learn how to write your own.

OmniVoice: An Australian accent, picked from OmniVoice's fixed attributes. Accents are what it does best, and you choose them from a menu rather than describe them.

Qwen3-TTS: A smooth, intimate voice written as a free-form description. Copy the prompt to reproduce her.

DramaBox: One voice moving through boredom, sarcasm, excitement, and despair in a single take, all directed inside the prompt. The full prompt is in the DramaBox section below.

VoxCPM 2: A calm 48kHz documentary narrator, built from a layered description. Copy it to reproduce the voice.

Same goal, four routes to it: a menu of attributes, a written description, an inline performance prompt, and a layered one. The rest of this guide breaks down each and when to reach for it.

First: Voice Design Is Not Voice Cloning

These get mixed up constantly, so to be clear:

Voice cloning copies a specific real voice from a short reference clip (3 to 10 seconds of audio). Use it when you want a particular person's voice.
Voice design builds a new voice from a description, with no recording at all. Use it when you want a voice that does not exist yet.

This guide is about design. If you want to copy a real voice instead, see our voice cloning comparison.

The Four Approaches at a Glance

	OmniVoice	Qwen3-TTS	DramaBox	VoxCPM 2
How you design	Structured attributes (pick from fixed options)	Free-form text description	Speaker phrase inside a prompt pattern	Layered description (identity, texture, delivery)
Reference audio	Not used	Not used	Optional: clone a voice, then direct it	Not used
Control style	Predictable and repeatable	Descriptive and nuanced	Tied to performance and emotion	Cinematic and scenario-aware
Accents	Ten English accents as direct options	Described in free text (less reliable)	General, via the speaker phrase	Described in free text, plus nine Chinese dialects
Best for	Accents and consistent results	Specific, nuanced characters	Expressive, acted character voices	Studio-quality narration, 30 languages

The rest of this guide covers each one in detail.

1. OmniVoice: Structured Attributes

OmniVoice gives you a fixed set of attributes to dial in, rather than a text box. You choose from preset options and the model assembles the voice. Every attribute also has an Auto setting that lets the model decide, so you only set what you care about.

Gender: Auto, male, female
Age: Auto, child, teenager, young adult, middle-aged, elderly
Pitch: Auto, very low through very high
Style: Auto, whisper
English accent: Auto, plus ten options (American, British, Australian, Chinese, Canadian, Indian, Korean, Portuguese, Russian, Japanese)
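As a rough illustration of how a fixed-option design differs from a text box, the attribute list above can be modeled as a small validated structure. This is a sketch only: the class name, field names, and lowercase option strings are hypothetical and are not OmniVoice's real parameter names. Pitch is left out because the list above gives only its endpoints.

```python
from dataclasses import dataclass

# Option lists copied from the attribute list above. The key names and the
# lowercase spelling are assumptions made for this sketch, not OmniVoice's API.
OPTIONS = {
    "gender": ["auto", "male", "female"],
    "age": ["auto", "child", "teenager", "young adult", "middle-aged", "elderly"],
    "style": ["auto", "whisper"],
    "accent": ["auto", "american", "british", "australian", "chinese", "canadian",
               "indian", "korean", "portuguese", "russian", "japanese"],
}


@dataclass(frozen=True)
class VoiceSpec:
    """Hypothetical container for one structured voice design."""
    gender: str = "auto"
    age: str = "auto"
    style: str = "auto"
    accent: str = "auto"

    def __post_init__(self):
        # Reject anything that is not in the fixed menu for that attribute.
        for name, allowed in OPTIONS.items():
            value = getattr(self, name)
            if value not in allowed:
                raise ValueError(f"{name}={value!r} is not one of {allowed}")

    def explicit(self):
        """Return only the attributes that were set to something other than auto."""
        return {k: getattr(self, k) for k in OPTIONS if getattr(self, k) != "auto"}


narrator = VoiceSpec(gender="female", age="elderly", accent="indian")
print(narrator.explicit())
# {'gender': 'female', 'age': 'elderly', 'accent': 'indian'}
```

Because every value comes from a closed list, a typo or an unsupported accent fails immediately instead of being silently interpreted, which is the predictability this approach trades creativity for.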

The standout is accents. OmniVoice is the most reliable of the four for accent work. If you need an Indian-accented narrator or a British storyteller, this is the model to reach for.

Use OmniVoice when you want a specific accent, or when you prefer a straightforward path to voice design without writing prompts, at the cost of some creative freedom.

For a full walkthrough of every attribute and a set of ready-made recipes, see the dedicated OmniVoice Voice Design Guide.

2. Qwen3-TTS: Free-Form Description

Qwen3-TTS takes the opposite approach. Instead of fixed options, you describe the voice you want in plain English and the model interprets it. That trades some predictability for a lot more nuance.

Descriptions can be as simple or as detailed as you like:

a middle-aged female professor with a slight British accent

a gravelly, world-weary man in his sixties with a slow, deliberate delivery

a cheerful young woman with a bright, high-energy voice

Because the input is free text, you can express character that does not fit a preset, like "world-weary" or "bright, high-energy." The trade-off is that the model has to interpret your words, so results vary more between generations, and accents are less reliable than OmniVoice's fixed options.

Use Qwen3-TTS when you have a specific, nuanced character in mind and want to describe it in your own words rather than pick from a list.

For the seven controllable dimensions, example prompts, and an iterative workflow, see the Qwen3-TTS Voice Design Prompting Guide.

3. DramaBox: The Prompt Pattern

DramaBox designs the voice right inside the prompt you use to generate speech. There is no separate description field. Instead, you write the voice and its performance together using a simple repeating pattern:

A <speaker> <verb>, "<dialogue>" <pronoun> <verb>, "<dialogue>"

For example:

A man speaks calmly, "I told you this would happen." He sighs heavily, "But nobody ever listens to me."

Two rules drive the whole thing:

Quoted text is spoken literally. Everything inside the quotes comes out of the speaker's mouth, including sounds like "Hahaha" or "Mmmm-mmm."
Unquoted text is stage direction. The speaker phrase and the verbs that follow shape who is talking and how they deliver each line, but the model does not read them aloud.

So the same prompt both defines the voice (the speaker phrase) and directs the performance (the verbs and emotion), which is why design and direction happen in one place. Chaining segments with a pronoun and a fresh verb lets the voice shift tone mid-thought, which is where DramaBox shines.
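The pattern and its two rules can be sketched in a few lines of Python. The helper names below (`build_prompt`, `spoken_lines`) are hypothetical and only assemble and inspect the prompt string; they do not call DramaBox.

```python
import re


def build_prompt(speaker, pronoun, segments):
    """Join (direction, dialogue) pairs into the repeating pattern.

    The first segment is introduced by the speaker phrase and every later
    one by the pronoun, mirroring:
    A <speaker> <verb>, "<dialogue>" <pronoun> <verb>, "<dialogue>"
    """
    parts = []
    for i, (direction, dialogue) in enumerate(segments):
        if '"' in dialogue:
            # A stray quote would blur the line between speech and direction.
            raise ValueError("dialogue must not contain double quotes")
        subject = speaker if i == 0 else pronoun
        parts.append(f'{subject} {direction}, "{dialogue}"')
    return " ".join(parts)


def spoken_lines(prompt):
    """Return only the quoted text, i.e. what the rules say is spoken aloud."""
    return re.findall(r'"([^"]*)"', prompt)


prompt = build_prompt(
    "A man",
    "He",
    [
        ("speaks calmly", "I told you this would happen."),
        ("sighs heavily", "But nobody ever listens to me."),
    ],
)
print(prompt)
# A man speaks calmly, "I told you this would happen." He sighs heavily, "But nobody ever listens to me."
print(spoken_lines(prompt))
# ['I told you this would happen.', 'But nobody ever listens to me.']
```

Checking `spoken_lines` before generating is a quick way to confirm that no stage direction has slipped inside the quotes.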

You can also hand DramaBox an existing voice as reference audio. It clones that voice and then performs your prompt in it, so you keep a specific person's identity while still directing the emotion and delivery through the stage directions. That makes it a bridge between design and cloning: start from a voice you already have, then act it.

Use DramaBox when the voice and the performance are inseparable, like expressive character work, dialogue, and emotional delivery, where you are designing and directing in one go.

For the full prompt structure, vocal effects, and emotion patterns, see the DramaBox Prompting Guide.

4. VoxCPM 2: Layered Descriptions

VoxCPM 2 is the newest model in the lineup. As with Qwen3-TTS, you describe the voice in plain language, but VoxCPM 2 handles longer, more detailed descriptions. It is also the only one of the four that outputs studio-quality 48kHz audio and speaks 30 languages from a single model.

The best results come from stacking three layers:

Identity: who is speaking. Gender, age, role. "A middle-aged male broadcaster," "an elderly woman."
Texture: what the voice is made of. Pitch and quality words like low-pitched, raspy, magnetic, grainy, breathy.
Delivery: how it performs. Emotion, pace, volume, and where the voice would be used.

Put together, a full description reads like casting notes:

A quiet raspy, elderly woman of a low-pitched voice with a distinct, grainy texture and subtle breathy tremors. Delivers a slow tone at a very low volume, perfect for historical narration.

Telling the model where the voice belongs ("perfect for historical narration," "perfect for an epic movie trailer") shapes the whole performance to fit that setting. If you know the use case but cannot pin down the right adjectives, describe the scene and let the model work backwards from it.
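One way to keep the three layers separate while drafting is to compose the description from parts. The function below is a minimal sketch: its name, its parameters, and the joining phrases ("whose voice is", "perfect for") are illustrative choices, not a format VoxCPM 2 requires, and the sample delivery sentence is made up for the example.

```python
def join_words(words):
    """Join words as 'a', 'a and b', or 'a, b and c'."""
    words = list(words)
    if len(words) <= 1:
        return "".join(words)
    return ", ".join(words[:-1]) + " and " + words[-1]


def layered_description(identity, texture, delivery, setting=None):
    """Compose identity, texture, and delivery into one description string.

    identity: who is speaking, e.g. "A middle-aged male broadcaster"
    texture:  list of pitch and quality words, e.g. ["low-pitched", "magnetic"]
    delivery: one sentence on emotion, pace, and volume
    setting:  optional use case, appended as "perfect for <setting>"
    """
    identity = identity.strip().rstrip(".")
    delivery = delivery.strip().rstrip(".")
    if not identity or not delivery:
        raise ValueError("identity and delivery are required")
    first = identity
    if texture:
        first += " whose voice is " + join_words(texture)
    second = delivery
    if setting:
        second += f", perfect for {setting}"
    return f"{first}. {second}."


print(layered_description(
    "A middle-aged male broadcaster",
    ["low-pitched", "magnetic"],
    "Speaks at a measured pace with calm authority",
    setting="a documentary",
))
# A middle-aged male broadcaster whose voice is low-pitched and magnetic. Speaks at a measured pace with calm authority, perfect for a documentary.
```

Keeping the layers as separate arguments makes it easy to change one at a time, for example swapping only the setting to hear how the performance shifts.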

Two practical notes:

Match the text to the emotion. VoxCPM 2 also infers prosody from the words it is asked to speak. A furious description paired with a calm sentence will underdeliver, so give an angry voice something worth shouting about.
Expect variation, then lock in the winner. Designed voices vary between generations, so plan on two or three attempts. When a generation nails it, reuse its seed to reproduce the take, then save the result as a voice. From there you can clone it across any text with the timbre locked in.
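The audition-then-lock-in workflow from the second note can be illustrated with a stand-in generator. Everything here is an assumption made for the sketch: `fake_generate` returns a short deterministic token instead of audio, and the `generate(description, text, seed)` signature is hypothetical rather than VoxCPM 2's or Voice Creator Pro's real interface.

```python
import hashlib


def fake_generate(description, text, seed):
    """Stand-in for a real TTS call: returns a deterministic token, not audio."""
    digest = hashlib.sha256(f"{description}|{text}|{seed}".encode()).hexdigest()
    return digest[:12]


def audition(generate, description, text, seeds):
    """Generate one take per seed so the takes can be compared side by side."""
    return {seed: generate(description, text, seed) for seed in seeds}


description = "An elderly woman with a low-pitched, grainy voice."
text = "Long ago, this valley was under water."

# Step 1: a few attempts, each with its own seed.
takes = audition(fake_generate, description, text, seeds=[1, 2, 3])

# Step 2: pick the winner (by ear in a real workflow) and keep its seed.
chosen_seed = 2

# Step 3: the same description, text, and seed reproduce the same take.
retake = fake_generate(description, text, chosen_seed)
print(retake == takes[chosen_seed])  # True
```

The point of the sketch is the bookkeeping: record the seed alongside the description and text, since all three are needed to reproduce a take before saving it as a voice.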

Voice Creator Pro ships starter presets for VoxCPM 2, from broadcaster and documentary narrator to extreme emotions like fury, terror, and cold menace, so you can start from a working description and edit rather than write from scratch.

Use VoxCPM 2 when you want the highest audio fidelity, a language other than English, or a voice that is defined by its scene, like a trailer, a bedtime story, or late-night radio.

For the full three-layer breakdown, a texture vocabulary, and playable sample voices with copyable descriptions, see the VoxCPM 2 Voice Design Guide.

How to Choose

If you need...	Use
A specific accent, or more repeatable results	OmniVoice
A nuanced character described in your own words	Qwen3-TTS
An expressive voice you will also direct and emote	DramaBox
Studio-quality 48kHz audio, 30 languages, or a scene-defined voice	VoxCPM 2
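The table above can also be read as a simple lookup. The sketch below encodes it that way; the need keywords (`accent`, `scene_defined`, and so on) are hypothetical labels invented for this example, and the mapping only restates the table.

```python
# Each entry pairs a need from the table with the model it points to.
# The keyword spellings are assumptions made for this sketch.
RECOMMENDATIONS = [
    ("accent", "OmniVoice"),
    ("repeatable", "OmniVoice"),
    ("nuanced_character", "Qwen3-TTS"),
    ("acted_performance", "DramaBox"),
    ("48khz", "VoxCPM 2"),
    ("multilingual", "VoxCPM 2"),
    ("scene_defined", "VoxCPM 2"),
]


def suggest_models(needs):
    """Return the models matching the given needs, in table order, without duplicates."""
    needs = set(needs)
    known = {key for key, _ in RECOMMENDATIONS}
    unknown = needs - known
    if unknown:
        raise ValueError(f"unknown needs: {sorted(unknown)}")
    result = []
    for key, model in RECOMMENDATIONS:
        if key in needs and model not in result:
            result.append(model)
    return result


print(suggest_models(["accent", "acted_performance"]))
# ['OmniVoice', 'DramaBox']
```

When more than one model comes back, that is a hint to combine them rather than pick one.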

You are not locked into one. A common workflow is to design a clean base voice with OmniVoice, Qwen3-TTS, or VoxCPM 2, then move to DramaBox when you need expressive, acted delivery. Voice design defines who the voice is, and prompting defines how it performs.

Design Voices in Voice Creator Pro

Voice Creator Pro includes OmniVoice, Qwen3-TTS, DramaBox, and VoxCPM 2, so all four design approaches live in one interface. No setup or prompt engineering is required to get started, and VoxCPM 2 comes with ready-made starter presets.

The desktop app runs every model locally and offline, with unlimited generations and no subscription, on Windows and Mac. Or try VCP Cloud in your browser on a generous free tier, with no GPU or install.

Either way you get the same models and the same quality. Pick an approach, describe the voice you want, and generate one that did not exist a moment ago.