Kyutai TTS is an open-source text-to-speech (TTS) model designed for real-time applications. Unlike traditional TTS systems, it accepts text as a stream and generates audio at the same time, which keeps latency very low. This capability rests on a technique called delayed streams modeling, which lets the model align the text and audio streams dynamically without requiring the full text upfront. As a result, Kyutai TTS is well suited to latency-sensitive use cases, such as speaking the output of a large language model (LLM) in real time.
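The control flow behind "audio out while text is still coming in" can be illustrated with a toy sketch. This is not the Kyutai model, its API, or the delayed streams modeling algorithm itself: `synthesize_word`, `stream_tts`, `fake_llm`, and the `delay` parameter are hypothetical names, and the "audio" is just placeholder strings. The only point is that audio for earlier words is emitted while later words have not yet arrived.

```python
from collections import deque
from typing import Iterable, Iterator


def synthesize_word(word: str) -> str:
    """Hypothetical stand-in for audio generation: returns a fake audio chunk."""
    return f"<audio:{word}>"


def stream_tts(words: Iterable[str], delay: int = 2) -> Iterator[str]:
    """Yield audio chunks while text is still arriving.

    Assumption: the audio stream lags the text stream by `delay` words, so the
    generator never needs the full text upfront.
    """
    pending = deque()
    for word in words:
        pending.append(word)
        # Once more than `delay` words are buffered, emit audio for the oldest.
        if len(pending) > delay:
            yield synthesize_word(pending.popleft())
    # The text stream has ended: flush the words still in the buffer.
    while pending:
        yield synthesize_word(pending.popleft())


def fake_llm() -> Iterator[str]:
    """Hypothetical LLM that produces its answer one word at a time."""
    yield from "Hello there , how are you ?".split()


chunks = list(stream_tts(fake_llm(), delay=2))
```

In this sketch the first chunk becomes available as soon as the third word arrives, rather than after the whole sentence; a smaller `delay` trades look-ahead context for earlier output.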
The model, kyutai/tts-1.6b-en_fr, has 1.6 billion parameters and supports English and French, with plans to expand to additional languages. It delivers state-of-the-art results, with lower word error rates (WER) and higher speaker similarity than competing models. Kyutai TTS also supports voice cloning, reproducing voice characteristics, including tone and mannerisms, from just a 10-second audio sample. To discourage misuse, the voice embedding model is not directly released; a repository of pre-approved voices is provided instead.
Kyutai TTS is optimized for long-form audio generation, overcoming limitations of other transformer-based TTS models that struggle with durations beyond 30 seconds. It also outputs precise word-level timestamps, enabling features like real-time subtitles and interruption handling during interactive sessions. The model is production-ready, with a robust Rust server supporting streaming over WebSockets. On an L40S GPU, it can handle up to 16 concurrent connections at over 2x real-time speed.
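As a sketch of how word-level timestamps could drive subtitles and interruption handling, the snippet below groups timed words into short cues and reports which words had finished playing at a given moment. The input format (a list of `(word, start_seconds, end_seconds)` tuples) and the helper names `words_to_cues` and `spoken_before` are assumptions for illustration; the actual output format of the Kyutai server may differ, and the timestamps are made up.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]   # (text, start_s, end_s): assumed format
Cue = Tuple[float, float, str]    # (start_s, end_s, caption text)


def words_to_cues(words: List[Word], max_words: int = 4) -> List[Cue]:
    """Group consecutive timed words into subtitle cues of at most `max_words`."""
    cues: List[Cue] = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = group[0][1]           # start time of the first word in the cue
        end = group[-1][2]            # end time of the last word in the cue
        text = " ".join(w[0] for w in group)
        cues.append((start, end, text))
    return cues


def spoken_before(words: List[Word], t: float) -> List[str]:
    """Return the words whose audio had fully played by time `t` (seconds)."""
    return [w[0] for w in words if w[2] <= t]


# Made-up timestamps, for illustration only.
timed_words = [
    ("Kyutai", 0.00, 0.40), ("TTS", 0.40, 0.80), ("streams", 0.80, 1.20),
    ("audio", 1.20, 1.60), ("in", 1.60, 1.70), ("real", 1.70, 1.95),
    ("time", 1.95, 2.30),
]
cues = words_to_cues(timed_words, max_words=4)
heard = spoken_before(timed_words, 1.0)
```

If a user interrupts one second into playback, `spoken_before` indicates that only the first two words were fully heard, which is the kind of information an interactive session needs to decide what to repeat or drop.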
When deployed on Unmute.sh, the model batches up to 32 requests simultaneously with a typical latency of 350 ms. Kyutai TTS is the first to offer true streaming in both text and audio, making it a good fit for scenarios where text is generated gradually or under resource constraints. Together, these capabilities set a new benchmark for real-time, high-quality text-to-speech systems.
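Serving a fixed number of simultaneous requests can be pictured as a pool of batch slots, one per active stream. The sketch below is a generic illustration under that assumption; the `BatchSlots` class and its methods are hypothetical and do not describe how the Rust server or the Unmute.sh deployment actually schedules requests.

```python
class BatchSlots:
    """Fixed pool of batch slots; each active stream occupies one slot."""

    def __init__(self, capacity: int = 32) -> None:
        self.capacity = capacity      # 32 matches the batch size quoted above
        self.active = set()           # ids of streams currently in the batch

    def admit(self, request_id: str) -> bool:
        """Give the request a slot if one is free; otherwise refuse it."""
        if len(self.active) >= self.capacity:
            return False
        self.active.add(request_id)
        return True

    def release(self, request_id: str) -> None:
        """Free the slot when a stream finishes or the client disconnects."""
        self.active.discard(request_id)


slots = BatchSlots(capacity=32)
admitted = [slots.admit(f"req-{i}") for i in range(40)]  # 40 clients arrive
slots.release("req-0")                                   # one stream ends
late = slots.admit("req-40")                             # a new client fits
```

With 40 arrivals and 32 slots, the last eight requests are refused until a slot is released; a real server would more likely queue them than reject them outright.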