
Transcribe Audio to Text

A step-by-step tutorial for transcribing audio to text in Voice Creator Pro using uploads, YouTube URLs, and live recording.

In this tutorial you will use the STT (Speech to Text) tab to transcribe audio into text and export it as subtitles or timestamped JSON.

Each workflow takes only a few clicks, and a short clip is typically transcribed in under a minute per audio source.


Prerequisites

  • Voice Creator Pro installed and running on your machine
  • An audio file, a YouTube URL, or a microphone (depending on which workflow you follow)

Workflow A: Transcribe an Uploaded File

Step 1: Open the STT Tab

Click Lab in the left sidebar, then select the STT tab at the top. You will see two main areas: the Audio Source panel on the left and the Results panel on the right.

Step 2: Upload Your Audio

Click Upload in the Audio Source panel and select an audio file from your computer (WAV, MP3, or another common format).

Step 3: Set the Language

Choose the correct language from the Language dropdown. If you are not sure or the audio contains multiple languages, leave it on Auto-detect.

Step 4: Pick an ASR Model

Click the model badge in the Results panel to switch between available ASR model families. The default works well for most cases, but try a different model if accuracy is not satisfactory.

Step 5: Transcribe

Click Transcribe. The full transcription text appears in the Results panel once processing finishes.


Workflow B: Transcribe from a YouTube Video

Step 1: Open the STT Tab

Click Lab in the left sidebar, then select the STT tab at the top.

Step 2: Paste the YouTube URL

Click YouTube in the Audio Source panel, paste the video URL, and let Voice Creator Pro extract the audio.

Step 3: Set Language and Model

Choose a language (or leave on Auto-detect) and select an ASR model by clicking the model badge.

Step 4: Transcribe

Click Transcribe. The result appears in the Results panel. This is a quick way to pull text from interviews, podcasts, or any public video.


Workflow C: Record and Transcribe Live

Step 1: Open the STT Tab

Click Lab in the left sidebar, then select the STT tab at the top.

Step 2: Record from Your Microphone

Click Record in the Audio Source panel and speak into your microphone. Click Stop when you are done.

Step 3: Set Language and Model

Choose a language (or leave it on Auto-detect) and select an ASR model by clicking the model badge, as described in Workflow A.

Step 4: Transcribe

Click Transcribe to see the text of what you just recorded.


Workflow D: Transcribe Audio Generated in Voice Creator Pro

Step 1: Open the STT Tab

Click Lab in the left sidebar, then select the STT tab at the top.

Step 2: Load from History

Scroll down to the History section at the bottom of the page. This shows all audio you have previously generated in the Clone, Design, or TTS tabs. Find the generation you want to transcribe and click Use to load it as the audio source.

Step 3: Transcribe

Set your language and ASR model, then click Transcribe. This is useful for verifying what was generated, creating subtitles for generated voiceovers, or repurposing generated audio into written content.


Exporting Results

Once you have a transcription, you can export it in two formats:

  • SRT - A standard subtitle file. Drop it into any video editor (Premiere Pro, DaVinci Resolve, CapCut, etc.) to add captions instantly.
  • JSON - Includes word-level timestamps. Use this when you need precise alignment for programmatic workflows, custom subtitle styling, or audio editing tools.

Click the corresponding export button in the Results panel to download the file.
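
If you want to inspect an exported SRT file outside a video editor, the sketch below shows one way to read it in Python. It relies only on the standard SRT layout (a cue number, a "start --> end" timing line, then the caption text, with cues separated by blank lines). The names Cue and parse_srt and the sample captions are illustrative and not part of Voice Creator Pro.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start: float  # seconds from the beginning of the audio
    end: float    # seconds from the beginning of the audio
    text: str


# SRT timestamps look like 00:01:02,345 (hours:minutes:seconds,milliseconds).
_TIME = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def _to_seconds(stamp: str) -> float:
    match = _TIME.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(content: str) -> list[Cue]:
    """Split SRT text into cues; blocks without a timing line are skipped."""
    # Drop a possible byte-order mark and normalize Windows line endings.
    content = content.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    for block in re.split(r"\n\s*\n", content):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        cues.append(
            Cue(int(lines[0]), _to_seconds(start), _to_seconds(end), "\n".join(lines[2:]))
        )
    return cues


# Made-up sample in standard SRT layout, standing in for an exported file.
SAMPLE = """1
00:00:00,000 --> 00:00:02,500
Hello and welcome.

2
00:00:02,500 --> 00:00:05,000
This is a test.
"""

for cue in parse_srt(SAMPLE):
    print(f"{cue.start:7.2f}s to {cue.end:7.2f}s  {cue.text}")
```

To use it on a real export, read the downloaded file with open(path, encoding="utf-8") and pass its contents to parse_srt.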


Tips

  • Use History for previous generations. If you already created audio in Clone, Design, or TTS, scroll to the History section at the bottom of the STT tab and click Use to load it directly. No need to export and re-upload.
  • Auto-detect is good, but explicit is better. If you know the language, select it manually. This gives the model a head start and can improve accuracy on short clips.
  • Try a different ASR model if results are rough. Click the model badge and switch families. Different models handle accents, background noise, and speaking speeds differently.
  • SRT for video, JSON for code. Pick SRT when you just need subtitles. Pick JSON when you plan to process timestamps programmatically.
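
As an illustration of processing timestamps programmatically, the sketch below groups word-level timestamps into caption-sized segments, starting a new segment after a long pause or when a line grows too long. The JSON layout is an assumption: this tutorial does not show the export schema, so the top-level "words" list and the "word", "start", and "end" fields are hypothetical and should be adjusted to match your actual file. The function name group_words and its thresholds are also illustrative.

```python
import json

# ASSUMED schema for illustration only: check your exported file and
# rename the keys below to match what Voice Creator Pro actually writes.
SAMPLE_JSON = """
{"words": [
  {"word": "Hello",   "start": 0.0, "end": 0.4},
  {"word": "there.",  "start": 0.5, "end": 0.9},
  {"word": "Welcome", "start": 2.0, "end": 2.5},
  {"word": "back.",   "start": 2.6, "end": 3.0}
]}
"""


def group_words(words, max_gap=0.8, max_chars=42):
    """Group timed words into segments.

    A new segment starts when the pause before a word exceeds max_gap
    seconds, or when adding the word would exceed max_chars characters.
    """
    segments = []
    current = []
    for word in words:
        if current:
            gap = word["start"] - current[-1]["end"]
            length = len(" ".join(w["word"] for w in current)) + 1 + len(word["word"])
            if gap > max_gap or length > max_chars:
                segments.append(current)
                current = []
        current.append(word)
    if current:
        segments.append(current)
    return [
        {
            "start": seg[0]["start"],
            "end": seg[-1]["end"],
            "text": " ".join(w["word"] for w in seg),
        }
        for seg in segments
    ]


segments = group_words(json.loads(SAMPLE_JSON)["words"])
for seg in segments:
    print(seg["start"], seg["end"], seg["text"])
```

The pause and length thresholds are arbitrary starting points; shorter values produce more, smaller caption lines.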

Next Steps