Articles Tagged "Multimodal"

MAI-Transcribe-1.5

Microsoft's second-generation speech-to-text model with 2.4% WER, 43-language support, keyword biasing, and 5x faster long-audio processing than models of comparable accuracy.

GPT-5.1

GPT-5.1 is OpenAI's November 2025 coding and agentic flagship with 400K context, configurable reasoning effort, and 76.3% on SWE-bench Verified.

Alibaba's Qwen-Robot Suite Targets Physical AI Work

Alibaba launches three open-weight models for robot manipulation, navigation, and world prediction, built on a shared Qwen3.5 backbone.

Alibaba's generalist VLA model for robotic manipulation, built on Qwen3.5-4B with a DiT action decoder, trained on 38,100+ hours of open-source data, and ranked first on the RoboChallenge generalist track.

Qwen3.7-Plus

Alibaba's first multimodal agent model, combining GUI grounding (ScreenSpot Pro 79.0), 1M-token context, and text-plus-vision input at $0.40/M tokens.

Kimi K2.7-Code - Moonshot's Open-Weight Coding Leap

Moonshot AI ships Kimi K2.7-Code with 30% fewer reasoning tokens and a 21.8% gain on its own coding benchmarks, but the model still trails Claude Opus 4.8 on most tests in the same table.

Microsoft MAI Models: Seven-Model Suite Reviewed

A hands-on review of all seven MAI models - from the April transcription and image launch to Build 2026's MAI-Thinking-1, MAI-Code-1-Flash, and the multimodal upgrades.

Gemini 3.5 Live Translate Rolls Out With 70+ Languages

Google's new streaming audio model translates speech in real time across 70+ languages - available now in Google Translate and via the Gemini Live API.

DiffusionGemma 26B

DiffusionGemma 26B is Google DeepMind's open-weight discrete diffusion language model that generates 256 tokens in parallel, reaching 1,100+ tokens/sec on H100 - roughly 4x faster than autoregressive models of the same size.

Ministral 3 14B

Mistral AI's largest Ministral 3 model - 14B parameters, 256K context, Apache 2.0 license, multimodal, built for local deployment and agentic workflows.

MiniMax M3

MiniMax M3 is an open-weight frontier model with a 1M-token context window, native multimodal input, and strong agentic coding at $0.60/M input tokens.

NVIDIA Cosmos 3

NVIDIA Cosmos 3 is an open physical AI omnimodel with a Mixture-of-Transformers architecture that natively handles text, images, video, sound, and robot actions in a single 16B or 64B model.
