Transcribe Audio to Text
A step-by-step tutorial for transcribing audio to text in Voice Creator Pro using uploads, YouTube URLs, and live recording.
In this tutorial you will use the STT (Speech to Text) tab to transcribe audio into text and export it as subtitles or timestamped JSON.
The whole process typically takes under a minute per audio source.
Prerequisites
- Voice Creator Pro installed and running on your machine
- An audio file, a YouTube URL, or a microphone (depending on which workflow you follow)
Workflow A: Transcribe an Uploaded File
Step 1: Open the STT Tab
Click Lab in the left sidebar, then select the STT tab at the top. You will see two main areas: the Audio Source panel on the left and the Results panel on the right.
Step 2: Upload Your Audio
Click Upload in the Audio Source panel and select an audio file from your computer (WAV, MP3, or another common format).
Step 3: Set the Language
Choose the correct language from the Language dropdown. If you are not sure or the audio contains multiple languages, leave it on Auto-detect.
Step 4: Pick an ASR Model
Click the model badge in the Results panel to switch between available ASR model families. The default works well for most cases, but try a different model if the transcription is not accurate enough.
Step 5: Transcribe
Click Transcribe. The full transcription text appears in the Results panel once processing finishes.
Workflow B: Transcribe from a YouTube Video
Step 1: Open the STT Tab
Click Lab in the left sidebar, then select the STT tab at the top.
Step 2: Paste the YouTube URL
Click YouTube in the Audio Source panel, paste the video URL, and let Voice Creator Pro extract the audio.
Step 3: Set Language and Model
Choose a language (or leave on Auto-detect) and select an ASR model by clicking the model badge.
Step 4: Transcribe
Click Transcribe. The result appears in the Results panel. This is a quick way to pull text from interviews, podcasts, or any public video.
Workflow C: Record and Transcribe Live
Step 1: Open the STT Tab
Click Lab in the left sidebar, then select the STT tab at the top.
Step 2: Record from Your Microphone
Click Record in the Audio Source panel and speak into your microphone. Click stop when you are done.
Step 3: Set Language and Model
Choose a language and an ASR model as described in Workflow A.
Step 4: Transcribe
Click Transcribe to see the text of what you just recorded.
Workflow D: Transcribe Audio Generated in Voice Creator Pro
Step 1: Open the STT Tab
Click Lab in the left sidebar, then select the STT tab at the top.
Step 2: Load from History
Scroll down to the History section at the bottom of the page. This shows all audio you have previously generated in the Clone, Design, or TTS tabs. Find the generation you want to transcribe and click Use to load it as the audio source.
Step 3: Transcribe
Set your language and ASR model, then click Transcribe. This is useful for verifying what was generated, creating subtitles for generated voiceovers, or repurposing generated audio into written content.
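If you use this workflow to verify generated audio, you can compare the transcription against the script you originally typed. The following is a minimal sketch using only the Python standard library, not a feature of Voice Creator Pro. The function names are hypothetical, and it assumes you have copied both texts out of the app as plain strings.

```python
import difflib
import re


def normalize(text: str) -> list[str]:
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[\w']+", text.lower())


def word_similarity(script: str, transcript: str) -> float:
    """Return a 0..1 similarity ratio between two texts, compared word by word."""
    matcher = difflib.SequenceMatcher(
        a=normalize(script), b=normalize(transcript), autojunk=False
    )
    return matcher.ratio()


def word_differences(script: str, transcript: str) -> list[tuple[str, str]]:
    """List (expected, heard) pairs for every span where the texts disagree."""
    a, b = normalize(script), normalize(transcript)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

A ratio close to 1.0 suggests the generated audio matches the script. The pairs returned by word_differences point at the words worth listening to again. Keep in mind that a mismatch can come from the ASR model as well as from the generated audio.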
Exporting Results
Once you have a transcription, you can export it in two formats:
- SRT - A standard subtitle file. Drop it into any video editor (Premiere Pro, DaVinci Resolve, CapCut, etc.) to add captions instantly.
- JSON - Includes word-level timestamps. Use this when you need precise alignment for programmatic workflows, custom subtitle styling, or audio editing tools.
Click the corresponding export button in the Results panel to download the file.
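If you want to inspect an exported SRT file outside a video editor, the format is simple enough to read with a short script. The sketch below assumes the export follows the usual SRT layout (a cue number, a line with start and end times separated by -->, then one or more text lines, with blank lines between cues). The Cue class and function names are illustrative and not part of Voice Creator Pro.

```python
import re
from dataclasses import dataclass

# Matches timestamps such as 00:01:02,500 (a dot is also accepted as separator).
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


@dataclass
class Cue:
    index: int
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:02,500 to seconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[Cue]:
    """Parse SRT text into cues; cue blocks are separated by blank lines."""
    cues = []
    normalized = content.lstrip("\ufeff").replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip anything that is not a well-formed cue
        start, end = lines[1].split("-->")
        cues.append(
            Cue(int(lines[0]), to_seconds(start), to_seconds(end), "\n".join(lines[2:]))
        )
    return cues
```

To use it, read the downloaded file as UTF-8 text and pass the contents to parse_srt. Each returned cue carries its start and end time in seconds, which makes it easy to check caption durations or find a cue by time.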
Tips
- Use History for previous generations. If you already created audio in Clone, Design, or TTS, scroll to the History section in the STT tab and click Use to load it directly. No need to export and re-upload.
- Auto-detect is good, but explicit is better. If you know the language, select it manually. This gives the model a head start and can improve accuracy on short clips.
- Try a different ASR model if results are rough. Click the model badge and switch families. Different models handle accents, background noise, and speaking speeds differently.
- SRT for video, JSON for code. Pick SRT when you just need subtitles. Pick JSON when you plan to process timestamps programmatically.
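As an illustration of processing timestamps programmatically, the sketch below groups word-level entries into caption lines, starting a new line at long pauses or when a line gets too long. The JSON layout here is an assumption: a top-level "words" list whose entries have "word", "start", and "end" fields, with times in seconds. The actual field names in the Voice Creator Pro export may differ, so check the Speech to Text reference and adjust the keys. The function name group_words and its parameters are hypothetical.

```python
import json


def group_words(words, max_chars=42, max_gap=0.6):
    """Group word entries into caption lines.

    A new line starts when adding the next word would exceed max_chars,
    or when the silence before it is longer than max_gap seconds.
    """
    lines = []
    current = []
    for entry in words:
        if current:
            text_len = len(" ".join(w["word"] for w in current)) + 1 + len(entry["word"])
            gap = entry["start"] - current[-1]["end"]
            if text_len > max_chars or gap > max_gap:
                lines.append(current)
                current = []
        current.append(entry)
    if current:
        lines.append(current)
    return [
        {
            "start": group[0]["start"],
            "end": group[-1]["end"],
            "text": " ".join(w["word"] for w in group),
        }
        for group in lines
    ]


# Hypothetical export content; the real field names may differ.
sample = json.loads("""
{"words": [
  {"word": "Hello", "start": 0.0, "end": 0.4},
  {"word": "there.", "start": 0.45, "end": 0.9},
  {"word": "Welcome", "start": 2.0, "end": 2.5},
  {"word": "back.", "start": 2.55, "end": 3.0}
]}
""")

captions = group_words(sample["words"])
for caption in captions:
    print(f'{caption["start"]:.2f}-{caption["end"]:.2f}  {caption["text"]}')
```

With the sample data, the pause between "there." and "Welcome" is longer than max_gap, so the words split into two caption lines. Tuning max_chars and max_gap is what gives you the custom subtitle styling that a fixed SRT export cannot.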
Next Steps
- Speech to Text reference - Full details on every setting, audio source, and export format
- Clone a Voice - End-to-end tutorial for voice cloning
- Voice Cloning reference - Deep dive on cloning settings and models