Articles Tagged "Benchmarks"

ZAYA1-8B: Open Reasoning Model Rivals Claude on AMD GPUs

Zyphra's ZAYA1-8B matches Claude 4.5 Sonnet on HMMT 2025 math benchmarks at just 760M active parameters, trained entirely on AMD Instinct MI300X GPUs under Apache 2.0.

ZAYA1-8B

Zyphra's ZAYA1-8B is an 8.4B-parameter MoE reasoning model with only 760M active parameters that matches DeepSeek-R1-0528 on math and coding benchmarks while running at a fraction of the compute cost.

Agent Overload, Blind Attention, Unsafe Traces

Three new papers show that adding more agent components backfires, that reasoning models hide unsafe thinking, and that vision-language models waste most of their attention.

GPT-Realtime-2

OpenAI's second-generation real-time audio model with GPT-5-class reasoning, 128K context, five reasoning levels, and parallel tool calling - now generally available in the Realtime API.

OpenAI's Realtime API Goes GA with Three New Models

OpenAI's Realtime API exits beta with GPT-Realtime-2, Translate, and Whisper - three specialized voice models splitting reasoning, translation, and transcription into distinct endpoints.

MiniMax M2.7 Review: The Model That Trains Itself

MiniMax M2.7 is the first open-weight frontier model to automate 30-50% of its own training pipeline - but a controversial license change and sluggish speed complicate the story.

DeepMind's AlphaEvolve Recovered 0.7% of Google's Compute

Google DeepMind's May 2026 AlphaEvolve impact report shows the system running in production across infrastructure, quantum computing, genomics, and commercial partnerships spanning logistics to fintech.

Runtime Safety, Alignment Gaps, and Elastic Context

Three new papers deliver a runtime safety firewall for agent tools, challenge how we measure AI alignment, and introduce elastic context management for long-horizon search agents.

GPT-5.5 Instant

OpenAI's new default ChatGPT model cuts hallucinations by 52.5% and adds Gmail-backed personalization while maintaining the low latency of its predecessor.

Best Coding Models on OpenRouter - Opus 4.7 Rivals

Claude Opus 4.7 scores 87.6% on SWE-bench Verified but costs $5/$25 per million tokens. These four models match or nearly match its coding performance at a fraction of the price on OpenRouter.

Best AI Models for Language Translation - May 2026

Gemini 3.1 Pro leads verified 2026 benchmarks at $2 per million tokens while GPT-5.5 and Claude Opus 4.7 postdate available translation evaluations - rankings, scores, and pricing for 10 models.

Agent Memory in 2026: Circuits, Tiers, Evolution

Three new papers reveal how agent memory silently breaks, how a tiered architecture recovers it, and how models can self-improve without human labels.
