
Google Gemma 4 Ships Four Open Models Under Apache 2.0
Google releases Gemma 4 with a 26B MoE, 31B Dense, and two edge variants under Apache 2.0 - claiming the highest intelligence-per-parameter of any open model.

A technical comparison of how Claude, GPT-4o, Gemini, Grok, Pixtral, Qwen, and DeepSeek handle image inputs - resizing pipelines, token math, and undocumented gotchas.

Multimodal AI can see, hear, and read at once - here's how it works and why it matters for everyday users.

Claude Opus 4.6 leads DocVQA at 96.1% while Qwen2.5-VL-72B tops open-source document parsing, making the best PDF analysis model a question of budget and deployment.

Moonshot AI's Kimi K2.5 delivers best-in-class open-weight math and a genuinely novel multi-agent architecture, but a brutal hallucination rate and slow inference limit its real-world reliability.

Ai2's MolmoWeb is a fully open-source web agent that navigates browsers by screenshot alone, beating GPT-4o-based agents at the 8B scale with weights, training data, and code all released under Apache 2.0.

Mistral Small 4 packs reasoning, vision, and agentic coding into a 119B MoE under Apache 2.0 - a serious small-model contender at a price that's hard to ignore.

Cohere Command A Vision is a 112B multimodal model that leads on document and OCR benchmarks, beating GPT-4.1 across seven visual understanding tasks.

Three new papers expose cracks in how AI models think, how benchmarks evaluate multimodal reasoning, and why LLM judges reliably mislead.

Google's Gemini 3.1 Flash-Lite delivers frontier-class benchmarks at a fraction of the cost of Pro - but a sluggish first-token response and preview-only status mean it's not for every workload.

Luma Agents coordinates text, image, video, and audio from a single brief using the Uni-1 unified model - a genuine architectural leap, with some real rough edges still showing.

Google's first natively multimodal embedding model maps text, images, video, audio, and PDFs into a single vector space - now in public preview via Gemini API and Vertex AI.