Articles Tagged "Speech Recognition"

MAI-Transcribe-1.5

Microsoft's second-generation speech-to-text model with 2.4% WER, 43-language support, keyword biasing, and 5x faster long-audio processing than models of comparable accuracy.

Best AI Models for Voice and Speech - June 2026

ElevenLabs Scribe v2 leads ASR at 2.2% WER after a price cut to $3.67 per 1,000 minutes, Microsoft MAI-Transcribe-1.5 debuted at #3, and Gemini 3.1 Flash TTS now tops the naturalness leaderboard.

Gemini 3.5 Live Translate Rolls Out With 70+ Languages

Google's new streaming audio model translates speech in real time across 70+ languages - available now in Google Translate and via the Gemini Live API.

Best AI Language Learning Tools in 2026

Six AI language learning tools tested and compared by price, language coverage, speaking practice quality, and who each one actually suits.

Best AI Audio Editing Tools in 2026

A hands-on comparison of the top AI audio editing tools in 2026, covering noise removal, stem separation, mastering, and podcast production.

Best AI Voice Agents in 2026 - 5 Platforms Tested

We tested five AI voice agent platforms - ElevenLabs, Vapi, Retell AI, Bland AI, and Play.ai - comparing real latency, per-minute pricing, and which use cases each actually serves.

Qwen3.5-Omni

Alibaba's Qwen3.5-Omni takes text, images, audio, and video as input and streams both text and speech output in a single end-to-end model with a 256K context window.

Audio Understanding Benchmarks Leaderboard 2026

Rankings of the best audio language models on MMAU, MMAU-Pro, and other benchmarks covering speech reasoning, music understanding, and environmental sound identification.

Best AI Transcription Tools 2026: APIs and Services Ranked

A data-driven comparison of the top AI transcription APIs and services for 2026, covering word error rate (WER), pricing per hour, speaker diarization, and output formats.

llama.cpp Lands Three Audio Models in 48 Hours

Three separate PRs merged into llama.cpp between April 11 and 13 add MERaLiON-2, Gemma 4's Conformer encoder, and Qwen3-Omni/ASR - making local voice AI inference practical on consumer hardware for the first time.

Qwen3.5-Omni Does 10-Hour Audio and 4M Video Frames

Alibaba's Qwen3.5-Omni handles audio, video, images, and text in a single model pass - and generates speech in real time. The Plus variant hits SOTA on 215 benchmarks and edges out Gemini 3.1 Pro on audio tasks.

Cohere's Open-Source Transcribe Tops ASR Leaderboard

Cohere releases its first audio model - a 2B-parameter open-source ASR system beating Whisper Large v3 by 27% on the Hugging Face Open ASR Leaderboard.