ElevenLabs Launches Scribe: A New Speech-to-Text Model Challenging Industry Leaders

ElevenLabs, the AI startup known for its audio-generation technology, has launched Scribe, its first standalone speech-to-text model. The company recently secured $180 million in Series C funding at a $3.3 billion valuation. With Scribe, ElevenLabs expands beyond audio generation into the competitive speech recognition market.

ElevenLabs has established itself as a key player in the text-to-speech domain, providing a comprehensive library of voices to various companies. With the introduction of Scribe, the company aims to compete directly with established players like Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models. Scribe’s launch marks a strategic shift, leveraging ElevenLabs’ expertise in audio processing to tackle the challenges of accurate and efficient speech-to-text conversion.

Scribe supports more than 99 languages. More than 25 of them fall into the top accuracy tier, defined by a word error rate below 5%. This group includes widely spoken languages such as English (with a claimed 97% accuracy rate), French, German, Hindi, Indonesian, Japanese, and Spanish. The remaining languages are grouped into high, good, and moderate accuracy tiers according to their word error rates. Benchmark tests on the FLEURS and Common Voice datasets indicate that Scribe outperforms Google Gemini 2.0 Flash and Whisper Large V3 across various languages.
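For readers unfamiliar with the metric, word error rate (WER) is conventionally the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of words in the reference. The sketch below is a generic illustration of that definition, not ElevenLabs' evaluation code; the function name and the lowercase, whitespace-based tokenization are assumptions made for the example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference words.

    Tokenization here is a simplifying assumption: lowercase, split on whitespace.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only one previous row.
    # prev[j] holds the distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the quick brown fox jumps over the lazy dog"
    hypothesis = "the quick brown fox jumped over the lazy dog"
    wer = word_error_rate(reference, hypothesis)
    # One substituted word out of nine reference words.
    print(f"WER: {wer:.3f}")
    print("below 5% threshold:", wer < 0.05)
```

If accuracy is read as 1 minus WER, which is a common convention but not one the article spells out, the claimed 97% accuracy for English would correspond to a WER of about 3%, inside the below-5% tier.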

The foundation for Scribe was laid with the development of the speech-to-text component within ElevenLabs’ AI conversational agent platform, released last year. However, this is the first time the technology is available as a standalone product. CEO Mati Staniszewski previously emphasized the company’s commitment to enhancing speech detection models. He highlighted the need for improved accuracy in many languages and underscored ElevenLabs’ capability to achieve this through dedicated in-house data annotation and feedback mechanisms.

Beyond language support, Scribe offers speaker diarization to identify who is speaking, word-level timestamps for precise subtitling, and auto-tagging of sound events such as laughter. ElevenLabs also lets users transcribe video content directly within its studio to add subtitles or captions. Scribe currently handles only pre-recorded audio, but ElevenLabs plans to introduce a real-time, low-latency version soon, which would extend it to use cases such as meeting transcription and voice note-taking.
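To show why word-level timestamps matter for subtitling, here is a minimal sketch that groups timestamped words into SubRip (SRT) cues. The `Word` structure, its field names, and the fixed words-per-cue rule are hypothetical and chosen only for illustration; they do not reflect the actual format of Scribe's output.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word-level transcript entry (times in seconds)."""
    text: str
    start: float
    end: float


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group words into cues of at most max_words and render them as SRT."""
    cues = []
    for index in range(0, len(words), max_words):
        chunk = words[index:index + max_words]
        start = format_timestamp(chunk[0].start)  # first word opens the cue
        end = format_timestamp(chunk[-1].end)     # last word closes the cue
        text = " ".join(word.text for word in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [
        Word("Hello", 0.0, 0.4),
        Word("world", 0.5, 0.9),
    ]
    print(words_to_srt(sample))
```

A production subtitler would more likely break cues on pauses, punctuation, or speaker changes rather than on a fixed word count; the fixed count simply keeps the sketch short.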
