
Try Pocket TTS Online for Free

Lightweight voice cloning and text-to-speech that runs on almost any device. No GPU needed, no signup, completely private.

100% Private
No Install Required
Free and Unlimited

Why Pocket TTS

Why Use Pocket TTS

Pocket TTS is an open-source, MIT-licensed text-to-speech model by Kyutai Labs. With roughly 100 million parameters and a Continuous Audio Language Model (CALM) architecture, it was designed from the ground up to run on CPU faster than real time.

It supports both built-in preset voices and voice cloning from short audio clips, all in a package that weighs just 148 MB as an INT8-quantized ONNX model.
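"Faster than real time" is usually quantified as the real-time factor (RTF): the wall-clock time spent synthesizing divided by the duration of the audio produced. An RTF below 1.0 means speech is generated faster than it plays back. The Python sketch below computes it. `synthesize` is a hypothetical stand-in for whatever TTS call you are timing (it is not a Pocket TTS API), and the sample rate is passed in rather than assumed to be a property of the model.

```python
import time
from typing import Callable, Sequence


def real_time_factor(elapsed_s: float, num_samples: int, sample_rate: int) -> float:
    """Return synthesis time divided by the duration of the generated audio.

    A result below 1.0 means the audio was produced faster than real time.
    """
    if num_samples <= 0 or sample_rate <= 0:
        raise ValueError("num_samples and sample_rate must be positive")
    audio_duration_s = num_samples / sample_rate
    return elapsed_s / audio_duration_s


def measure_rtf(
    synthesize: Callable[[str], Sequence[float]],  # hypothetical TTS callable
    text: str,
    sample_rate: int,
) -> float:
    """Time one call to `synthesize` and return its real-time factor."""
    start = time.perf_counter()
    audio = synthesize(text)
    elapsed_s = time.perf_counter() - start
    return real_time_factor(elapsed_s, len(audio), sample_rate)


if __name__ == "__main__":
    # 2 seconds of audio at 24 kHz produced in 0.5 seconds -> RTF 0.25
    print(real_time_factor(0.5, 48_000, 24_000))
```

A dummy callable such as `lambda text: [0.0] * 24_000` is enough to exercise `measure_rtf`. Swap in a real synthesis call to benchmark your own device.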

CPU-Optimized

Designed to run faster than real time on CPU. No GPU required, making it accessible on virtually any device.

Voice Cloning

Clone a voice from a short audio clip. Upload a reference sample and generate speech in that voice within seconds.

Built-in Voices

Use preset voices for quick text-to-speech without providing a reference audio sample.

Lightweight

Only 148 MB as an INT8-quantized ONNX model. Loads fast and runs efficiently in your browser.

Get Started

How It Works

For text-to-speech, pick a built-in voice, type your text, and generate. For voice cloning, upload a short audio clip instead of picking a voice.

1. Open the free tool

Go to the Pocket TTS tool in your browser. Pick a built-in voice for quick TTS, or upload a short audio clip to clone a voice.

2. Type your text

Enter the text you want spoken. Pocket TTS supports English text input.

3. Generate and download

Click generate, and your audio is ready to play or download in seconds. Adjust the LSD steps if you want to fine-tune the tradeoff between quality and speed.

Use Cases

Who Is Pocket TTS For?

Developers

Build voice features for edge and embedded applications. Pocket TTS is small enough to deploy on resource-constrained devices and fast enough to run in real time on CPU.

Content Creators

Generate voiceovers for videos, podcasts, or social media content. Use built-in voices for quick narration or clone your own voice from a short sample.

Privacy-Conscious Users

Everything runs locally in your browser. No audio is uploaded to any server, no account is required, and no data leaves your device.

Hobbyists and Makers

Experiment with text-to-speech and voice cloning in a lightweight, MIT-licensed model. Perfect for personal projects, learning, and prototyping.

Getting the Most Out of Pocket TTS

Tips for Best Results

Use clean reference audio for cloning

Background noise, music, and room echo degrade the quality of the cloned voice. Use a clip with clear, isolated speech and minimal interference.
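Noise and echo are best judged by ear, but a few basic problems in a reference clip, such as clipping, a very low recording level, or an unexpectedly long duration, can be checked before you upload it. The Python sketch below uses only the standard library and assumes a 16-bit PCM WAV file. Its thresholds are hypothetical rules of thumb, not requirements published for Pocket TTS.

```python
import array
import math
import sys
import wave


def inspect_reference_clip(source) -> dict:
    """Return basic level statistics for a 16-bit PCM WAV clip.

    `source` can be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV")
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        num_frames = wav.getnframes()
        raw = wav.readframes(num_frames)

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    if not samples:
        raise ValueError("clip contains no audio")

    full_scale = 32768.0
    peak = max(abs(s) for s in samples) / full_scale
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / full_scale
    clipped = sum(1 for s in samples if s >= 32767 or s <= -32768)

    return {
        "duration_s": num_frames / sample_rate,
        "channels": channels,
        "sample_rate": sample_rate,
        "peak": peak,
        "rms_dbfs": 20 * math.log10(rms) if rms > 0 else float("-inf"),
        "clipped_fraction": clipped / len(samples),
    }


def reference_warnings(stats: dict) -> list:
    """Flag likely problems; every threshold here is a hypothetical rule of thumb."""
    problems = []
    if stats["clipped_fraction"] > 0.001:
        problems.append("clipping detected: lower the recording level")
    if stats["rms_dbfs"] < -40:
        problems.append("very quiet clip: speech may be buried in noise")
    if stats["duration_s"] > 30:
        problems.append("long clip: a short excerpt of clean speech may suffice")
    return problems
```

For example, `reference_warnings(inspect_reference_clip("my_voice.wav"))` (a hypothetical file name) returns an empty list for a clip that passes these checks. None of this detects background music or room echo, so still listen to the clip before uploading.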

Adjust LSD steps for your needs

More LSD steps produce higher-quality audio but take longer to generate. Fewer steps are faster but may reduce fidelity. Start around 5 and adjust from there.
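If you run Pocket TTS from your own script rather than with the browser slider, one way to apply "start around 5 and adjust" is to time each step count against a latency budget and keep the highest count that fits. In the Python sketch below, `synthesize(text, steps)` is a hypothetical callable standing in for your synthesis call (it is not the Pocket TTS API), and `max_rtf` is your budget expressed as synthesis time divided by audio duration.

```python
import time
from typing import Callable, Iterable, Optional, Sequence


def pick_lsd_steps(
    synthesize: Callable[[str, int], Sequence[float]],  # hypothetical: (text, steps) -> samples
    text: str,
    sample_rate: int,
    max_rtf: float,
    candidates: Iterable[int] = range(1, 11),  # mirrors the 1 to 10 slider range
) -> Optional[int]:
    """Return the highest step count whose measured RTF is within max_rtf, or None."""
    best = None
    for steps in sorted(candidates):
        start = time.perf_counter()
        audio = synthesize(text, steps)
        elapsed_s = time.perf_counter() - start
        rtf = elapsed_s / (len(audio) / sample_rate)
        if rtf <= max_rtf:
            best = steps  # keep the largest passing value seen so far
    return best
```

Timing a single run per step count is noisy, so averaging several runs per candidate gives a steadier comparison. Let your ears make the final call, since the lowest acceptable setting depends on the voice and the text.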

Built-in voices are great for quick TTS

If you just need fast text-to-speech without voice cloning, the built-in preset voices deliver consistent results with no setup required.

Voice Creator Pro

Need more power? Try Voice Creator Pro.

Voice Creator Pro offers GPU acceleration, voice cloning in 600+ languages, voice design from text descriptions, and a local REST API. It is a one-time purchase of $49.99 and includes a full commercial license.

GPU-Accelerated Processing

Faster generation with NVIDIA, Apple Silicon, AMD, and Intel GPU support.

Voice Cloning in 600+ Languages

Combines multiple open-source models for TTS and voice cloning across 600+ languages.

Voice Design from Text

Describe a voice in plain text and the AI creates it. No audio samples needed.

Local REST API

Automate voice generation in your own apps and workflows.

Commercial License

Full rights to use generated audio in commercial projects. One-time $49.99 purchase.

Multiple Models in One App

Access Pocket TTS, Chatterbox, Kokoro, and more from a single desktop application.

FAQ

Common Questions

What is Pocket TTS?

Pocket TTS is a lightweight text-to-speech model created by Kyutai Labs. It uses a Continuous Audio Language Model (CALM) architecture with roughly 100 million parameters, making it small enough to run on CPU faster than real time. It supports both built-in preset voices and voice cloning from short audio samples.

Who developed Pocket TTS?

Pocket TTS was developed by Kyutai Labs, a research organization focused on real-time AI communication. They released it under the MIT license, one of the most permissive open-source licenses, so anyone can use, inspect, and build on it.

Can Pocket TTS clone voices?

Yes. Pocket TTS can clone a voice from a short audio clip. Upload a reference audio sample and the model will generate speech in that voice. It also offers built-in preset voices if you prefer to skip cloning and generate speech right away.

What does CALM stand for?

CALM stands for Continuous Audio Language Model. It is the architecture Pocket TTS uses to generate speech. Unlike discrete-token approaches, CALM works with continuous audio representations, allowing the model to stay extremely compact while still producing natural-sounding speech.

Is Pocket TTS free to use?

Yes. Pocket TTS is open-source software released under the MIT license. On this site, it runs directly in your browser with no account, no signup, and no usage limits. Everything stays on your device.

What are LSD steps?

LSD steps set the balance between audio quality and generation speed. More steps produce higher-quality output but take longer to generate; fewer steps are faster but may sacrifice some audio fidelity. You can adjust this with a slider from 1 to 10 to find the right balance for your needs.

Do I need a GPU to run Pocket TTS?

No. Pocket TTS was specifically designed to run on CPU faster than real time. The INT8-quantized ONNX model is only about 148 MB, so it loads quickly and runs efficiently on virtually any modern device, including laptops, tablets, and phones.

Which browsers work best?

Chrome, Edge, and other Chromium-based browsers work best. Firefox and Safari have limited support for some of the web features the tool uses for acceleration.