Articles Tagged "Multimodal"

Google Gemma 4 Ships Four Open Models Under Apache 2.0

Google releases Gemma 4 with a 26B MoE, 31B Dense, and two edge variants under Apache 2.0 - claiming the highest intelligence-per-parameter of any open model.

AI Vision Input Limits - What Every Provider Hides

A technical comparison of how Claude, GPT-4o, Gemini, Grok, Pixtral, Qwen, and DeepSeek handle image inputs - resizing pipelines, token math, and undocumented gotchas.

Multimodal AI Explained - A Beginner's Guide

Multimodal AI can see, hear, and read at once - here's how it works and why it matters for everyday users.

Best AI for Document Understanding - March 2026

Claude Opus 4.6 leads DocVQA at 96.1% while Qwen2.5-VL-72B tops open-source document parsing, making the best PDF analysis model a question of budget and deployment.

Kimi K2.5 Review: Open Weights, Agent Swarms, Caveats

Moonshot AI's Kimi K2.5 delivers best-in-class open-weight math and a genuinely novel multi-agent architecture, but a brutal hallucination rate and slow inference limit its real-world reliability.

Ai2 Drops MolmoWeb - Open-Source Web Agent Beats GPT-4o

Ai2's MolmoWeb is a fully open-source web agent that navigates browsers by screenshot alone, beating GPT-4o-based agents at the 8B scale with weights, training data, and code all released under Apache 2.0.

Mistral Small 4 Review: One Model, Three Jobs

Mistral Small 4 packs reasoning, vision, and agentic coding into a 119B MoE under Apache 2.0 - a serious small-model contender at a price that's hard to ignore.

Cohere Command A Vision

Cohere Command A Vision is a 112B multimodal model that leads on document and OCR benchmarks, beating GPT-4.1 across seven visual understanding tasks.

Balanced Thinking, Broken Judges, Opaque Reasoning

Three new papers expose cracks in how AI models think, how benchmarks evaluate multimodal reasoning, and why LLM judges reliably mislead.

Gemini 3.1 Flash-Lite Review: Fast, Cheap, and Capable

Google's Gemini 3.1 Flash-Lite delivers frontier-class benchmarks at a fraction of the cost of Pro - but a sluggish first-token response and preview-only status mean it's not for every workload.

Luma Agents Review: Creative AI That Actually Ships

Luma Agents coordinates text, image, video, and audio from a single brief using the Uni-1 unified model - a genuine architectural leap, with some real rough edges still showing.

Google Launches Gemini Embedding 2 for Multimodal AI

Google's first natively multimodal embedding model maps text, images, video, audio, and PDFs into a single vector space - now in public preview via Gemini API and Vertex AI.
