Voice Creator Pro

Speech to Text

Transcribe audio to text from uploads, recordings, or YouTube videos with word-level timestamps.

The STT (Speech to Text) tab transcribes audio into text. You can upload a file, record from your microphone, paste a YouTube URL, or reuse a previous generation from History.

Audio Source

The left panel provides four ways to load audio for transcription:

Upload - Select an audio file from your computer
Record - Record directly from your microphone
YouTube - Paste a YouTube URL to extract and transcribe the audio
History - Select any previous generation from Clone, Design, or TTS and click Use to load it for transcription

Language

Select the language of the audio from the Language dropdown, or leave it on Auto-detect to let the model identify the language automatically.

ASR Model

Click the model badge in the Results panel to switch between available ASR (Automatic Speech Recognition) model families.

Results

After you click Transcribe, the transcription appears in the Results panel on the right. The output includes the full text of what was spoken.

Transcriptions can be exported as:

SRT - Standard subtitle format, ready to use in video editors
JSON - Includes word-level timestamps for precise alignment and programmatic use
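As an illustration of what "programmatic use" of the JSON export might look like, the sketch below groups word-level timestamps into subtitle cues and renders them as SRT. The JSON layout used here (a "words" list whose entries have "word", "start", and "end" in seconds) is an assumption made for the example and is not taken from this page; check a real export and adjust the field names. The helper names format_srt_time and words_to_srt are hypothetical.

```python
import json


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_chars=42):
    """Group word entries into cues of at most max_chars and render SRT text."""
    cues, current = [], []
    for w in words:
        candidate = " ".join(x["word"] for x in current + [w])
        if current and len(candidate) > max_chars:
            # Adding this word would overflow the cue, so start a new one.
            cues.append(current)
            current = [w]
        else:
            current.append(w)
    if current:
        cues.append(current)

    blocks = []
    for index, cue in enumerate(cues, start=1):
        start = format_srt_time(cue[0]["start"])
        end = format_srt_time(cue[-1]["end"])
        text = " ".join(x["word"] for x in cue)
        blocks.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(blocks)


# Hypothetical export content; the real field names may differ.
sample = json.loads(
    '{"words": ['
    '{"word": "Hello", "start": 0.0, "end": 0.42}, '
    '{"word": "world", "start": 0.5, "end": 0.9}]}'
)
print(words_to_srt(sample["words"]))
```

Grouping by character count is only one possible choice; splitting on pauses between words or on punctuation may read better for some content.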

Use Cases

Generating captions

Transcribe video or podcast audio to create subtitles and captions.

Repurposing content

Convert spoken content into written form for blog posts, show notes, or social media.
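If you have already exported an SRT file, a small script can strip the cue numbers and timings to leave plain text as a starting point for written content. This is a minimal sketch that relies only on the standard SRT layout (index line, timing line, then caption text, with blank lines between cues); the function name srt_to_text is hypothetical.

```python
import re

# Matches an SRT timing line such as "00:00:01,200 --> 00:00:02,500".
TIMING = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt: str) -> str:
    """Return the caption text of an SRT document as one plain-text string."""
    parts = []
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Drop the cue index and the timing line only when they appear
        # in their expected positions, so numeric caption text is kept.
        if lines and lines[0].isdigit():
            lines = lines[1:]
        if lines and TIMING.match(lines[0]):
            lines = lines[1:]
        parts.extend(lines)
    return " ".join(parts)


example = (
    "1\n00:00:00,000 --> 00:00:01,000\nHello there.\n\n"
    "2\n00:00:01,200 --> 00:00:02,500\nWelcome to\nthe show.\n"
)
print(srt_to_text(example))
```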

Transcription for voice cloning

Transcribe a reference audio clip to get an accurate transcript, then use that transcript in the Clone tab for higher-quality voice cloning.
