
Rapid-MLX Is 2.6x Faster Than Ollama on Apple Silicon
New open-source inference engine for Apple Silicon benchmarks up to 2.6x faster than Ollama, supports 66 model aliases, and drops in as an OpenAI-compatible server on any Mac.
They summarize our coverage. We write it.
Newsletters like this one rebroadcast our headlines - often without the full review, the source reading, or the analysis underneath. Our weekly briefing sends the work they paraphrase, straight from the desk, before they get to it.
Free, weekly, no spam. One email every Tuesday. Unsubscribe anytime.

New open-source inference engine for Apple Silicon benchmarks up to 2.6x faster than Ollama, supports 66 model aliases, and drops in as an OpenAI-compatible server on any Mac.

Gemini 2.5 Flash Lite still leads the Vectara hallucination leaderboard at 3.3%, while two new entries - Gemini 3.5 Flash and Mistral Large 3 at $0.50/M - shift the value picture considerably since March.

Meta's Llama 3.3 70B Instruct matches Llama 3.1 405B on instruction following and math while running at 4-5x lower cost, with the lowest hallucination rate of any open-weight model on the Vectara summarization leaderboard.

Three new papers show that AI agents fail not by doing the wrong thing, but by doing things when they should have stopped.

Three new papers expose how reasoning traces can be extracted from supposedly hidden model internals, where chain-of-thought hits an architectural ceiling, and how RL teaches models to know when to quit.

Verified June 2026: real cost per million tokens for self-hosting Llama 4 Scout, Maverick, Qwen3-235B, and DeepSeek V3.2 - GPU requirements, cost formulas, and when cheap APIs actually win.

Three papers: smarter CoT trimming cuts reasoning length by 50%, a plug-in context manager rescues frozen agents on long tasks, and a 960K-item clinical benchmark exposes LLM gaps in hospitals.

Three new papers expose how reasoning models silently cave under pressure, how latent-space guardrails cut safety latency 12.9x, and why human curation can hurt alignment in multi-model training loops.

Three new papers decompose alignment faking into measurable drivers, show safety-aligned agents collude when it pays, and find standard guardrails miss the worst safety failures.

Alibaba's agent-first flagship model with a 1M-token context window, topping Terminal-Bench 2.0 and SWE-Bench Pro at roughly one-sixth the cost of Claude Opus 4.7.

Cursor's Composer 2.5 scores within one point of Claude Opus 4.7 on SWE-Bench Multilingual at $0.50 per million tokens - a tenth of Anthropic's price - but the training disclosures deserve scrutiny.

Q2 2026 AI API pricing review: DeepSeek V4 hits the API, GPT-5.5 launches at $5/1M, and overall token costs are down 60-80% year-over-year - but a hidden tokenizer change at Anthropic quietly raised effective prices.