Speech to Text
Transcribe audio to text from uploads, recordings, or YouTube videos with word-level timestamps.
The STT (Speech to Text) tab transcribes audio into text. You can upload a file, record from your microphone, paste a YouTube URL, or reuse a previous generation from your history.
Audio Source
The left panel provides four ways to load audio for transcription:
- Upload - Select an audio file from your computer
- Record - Record directly from your microphone
- YouTube - Paste a YouTube URL to extract and transcribe the audio
- History - Select any previous generation from Clone, Design, or TTS and click Use to load it for transcription
Language
Select the language of the audio from the Language dropdown, or leave it on Auto-detect to let the model identify the language.
ASR Model
Click the model badge in the Results panel to switch between available ASR (Automatic Speech Recognition) model families.
Results
After clicking Transcribe, the transcription appears in the Results panel on the right. The output includes the full text of what was spoken.
Transcriptions can be exported as:
- SRT - Standard subtitle format, ready to use in video editors
- JSON - Includes word-level timestamps for precise alignment and programmatic use
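For programmatic use, the JSON export can be loaded with any JSON parser. The sketch below is illustrative only: the export's actual schema is not documented here, so the field names (`text`, `words`, `word`, `start`, `end`), the use of seconds for timestamps, and the helper `words_between` are all assumptions. Check a real export and adjust the keys before relying on it.

```python
import json

# Hypothetical export contents. The real field names and units may differ.
SAMPLE_EXPORT = """
{
  "text": "Hello world again",
  "words": [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 2.0, "end": 2.5}
  ]
}
"""


def words_between(transcript, start, end):
    """Return the words whose time span lies entirely inside [start, end] seconds."""
    return [
        w["word"]
        for w in transcript["words"]
        if w["start"] >= start and w["end"] <= end
    ]


transcript = json.loads(SAMPLE_EXPORT)
first_second = words_between(transcript, 0.0, 1.0)
print(first_second)
```

A lookup like this is one way to find what was said during a specific part of the audio, for example when aligning text to a video edit.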
Use Cases
Generating captions
Transcribe video or podcast audio to create subtitles and captions.
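The built-in SRT export covers most captioning needs. If you want control over how many words appear in each caption, one option is to build cues yourself from word-level timestamps. The following is a minimal sketch, not the app's own export logic: the word dictionaries (`word`, `start`, `end`, in seconds) are an assumed shape, and `srt_timestamp` and `words_to_srt` are hypothetical helper names. Only the SRT cue layout itself (index, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) follows the standard format.

```python
def srt_timestamp(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(words, max_words=7):
    """Group words into cues of at most max_words and render them as SRT text."""
    cues = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        start = srt_timestamp(chunk[0]["start"])  # cue starts with its first word
        end = srt_timestamp(chunk[-1]["end"])     # and ends with its last word
        text = " ".join(w["word"] for w in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    # A blank line separates consecutive cues.
    return "\n".join(cues)


# Assumed word list; in practice this would come from a JSON export.
words = [
    {"word": "Hello", "start": 0.0, "end": 0.4},
    {"word": "world", "start": 0.5, "end": 0.9},
    {"word": "again", "start": 2.0, "end": 2.5},
]
srt_text = words_to_srt(words, max_words=2)
print(srt_text)
```

Grouping by a fixed word count is the simplest choice. Splitting on pauses between words or on punctuation usually reads better on screen, at the cost of a little more logic.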
Repurposing content
Convert spoken content into written form for blog posts, show notes, or social media.
Transcription for voice cloning
Transcribe a reference audio clip to get an accurate transcript, then use that transcript in the Clone tab for higher-quality voice cloning.