Speech to Text

Transcribe audio to text from uploads, recordings, or YouTube videos with word-level timestamps.

The STT (Speech to Text) tab transcribes audio into text. You can upload files, record from your microphone, or paste a YouTube URL.

Audio Source

The left panel provides four ways to load audio for transcription:

  • Upload - Select an audio file from your computer
  • Record - Record directly from your microphone
  • YouTube - Paste a YouTube URL to extract and transcribe the audio
  • History - Select any previous generation from Clone, Design, or TTS and click Use to load it for transcription

Language

Select the language of the audio from the Language dropdown, or leave it on Auto-detect to let the model identify the language automatically.

ASR Model

Click the model badge in the Results panel to switch between available ASR (Automatic Speech Recognition) model families.

Results

After clicking Transcribe, the transcription appears in the Results panel on the right. The output includes the full text of what was spoken.

Transcriptions can be exported as:

  • SRT - Standard subtitle format, ready to use in video editors
  • JSON - Includes word-level timestamps for precise alignment and programmatic use
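
As an illustration of programmatic use, the Python sketch below groups word-level timestamps into subtitle cues. The JSON layout it reads (a `words` list whose entries carry `word`, `start`, and `end`, with times in seconds) is an assumption made for this example; this page does not document the actual export schema, so inspect an exported file and adjust the key names as needed. The helper names `format_srt_time` and `words_to_srt` are likewise hypothetical.

```python
import json


def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group consecutive words into numbered SRT cues of at most max_words each."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        text = " ".join(w["word"].strip() for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates cues in SRT.
    return "\n".join(cues)


# Assumed export shape, for illustration only.
sample = json.loads("""
{
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 1.0, "end": 1.4}
  ]
}
""")

print(words_to_srt(sample["words"], max_words=2))
```

Grouping by a fixed word count keeps the sketch short; splitting on pauses between words or on punctuation would usually give more natural captions.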

Use Cases

Generating captions

Transcribe video or podcast audio to create subtitles and captions.

Repurposing content

Convert spoken content into written form for blog posts, show notes, or social media.
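
For written reuse, an SRT export can be flattened into plain prose by dropping the cue numbers and timestamp lines. The following is a minimal sketch that relies only on the standard SRT layout (cue number, timestamp line, one or more text lines, blank line); the function name `srt_to_text` and the sample text are illustrative, not part of the app.

```python
import re

# Matches an SRT timestamp line such as "00:00:02,100 --> 00:00:04,000".
TIMESTAMP_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def srt_to_text(srt: str) -> str:
    """Return the spoken text of an SRT document as a single line of prose."""
    texts = []
    # Cues are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", srt.strip()):
        lines = block.splitlines()
        # Drop the numeric cue index, if present.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        # Drop the timestamp line, if present.
        if lines and TIMESTAMP_LINE.match(lines[0].strip()):
            lines = lines[1:]
        texts.extend(ln.strip() for ln in lines if ln.strip())
    return " ".join(texts)


sample_srt = """1
00:00:00,000 --> 00:00:02,000
Welcome to the show.

2
00:00:02,100 --> 00:00:04,000
Today we talk
about audio.
"""

print(srt_to_text(sample_srt))
```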

Transcription for voice cloning

Transcribe a reference audio clip to get an accurate transcript, then use that transcript in the Clone tab for higher-quality voice cloning.