
Agentic AI Benchmarks Leaderboard - GAIA, WebArena, BFCL, and Tau2-Bench
Rankings of the best AI models and agent frameworks on agentic benchmarks measuring real-world task completion, web navigation, function calling, and multi-turn tool use.

A hands-on review of OpenAI's Codex desktop app for macOS - a multi-agent orchestration hub that manages parallel coding tasks, automations, and worktrees, but stumbles on platform exclusivity and usage limits.

Moonshot AI's Kimi K2.5 is a 1T-parameter MoE model that activates 32B parameters per token, featuring native multimodal vision via MoonViT-3D, Agent Swarm coordination of up to 100 sub-agents via PARL, and top-tier math and coding benchmark scores, all under a modified MIT license.

MiniMax M2.5 is a 230B-parameter MoE model (10B active) that scores 80.2% on SWE-Bench Verified while costing 1/10th to 1/20th as much as frontier competitors like Claude Opus 4.6 and GPT-5.2.

OpenAI's most capable agentic coding model combines frontier code generation with GPT-5-class reasoning, a 400K-token context window, and a 77.3% Terminal-Bench 2.0 score.

Four days after launch, Gemini 3.1 Pro's benchmark-topping performance is overshadowed by 90-hour lockouts for paying subscribers, quotas that drain while idle, and tool-calling bugs that break LangChain, n8n, and RooCode. Developers are switching to Claude.