
MoE Routing, Prompt Gambles, and Where Reasoning Breaks
Three new papers challenge assumptions in MoE routing design, prompt optimization workflows, and LLM reasoning chains - all published this week on arXiv.

Three new papers challenge assumptions in MoE routing design, prompt optimization workflows, and LLM reasoning chains - all published this week on arXiv.

Rankings across WebArena, WebVoyager, BrowseComp, Mind2Web, WorkArena, and WebChoreArena - every verified score for browser-driving AI agents as of April 2026.

Rankings of the top AI models on factuality and hallucination benchmarks: TruthfulQA, SimpleQA, FACTS Grounding, Vectara HHEM, HaluEval, HalluLens, and AA-Omniscience as of April 2026.

Z.ai's GLM-5.1 is a 754B open-weight model that claims the top spot on SWE-Bench Pro without a single NVIDIA chip - here's how it holds up in practice.

Rankings of AI video generation models across VBench, VBench-2.0, and the Artificial Analysis Video Arena Elo system, covering text-to-video and image-to-video performance.

Rankings of top LLMs on function calling and tool use benchmarks including BFCL v3, tau-bench, ToolBench, and FinTrace as of April 2026.

A data-driven comparison of 12 vector databases for RAG and AI workloads, with verified pricing, benchmark numbers, and honest trade-off analysis.

A benchmark-driven comparison of the top open-source LLM inference servers - vLLM, SGLang, TGI, llama.cpp, TensorRT-LLM, LMDeploy, and more.

Rankings of AI models on the key visual reasoning benchmarks - MMMU, MathVista, ChartQA, DocVQA, OCRBench, AI2D, CharXiv, and more - focused on image and document understanding.

Rankings of the top embedding and RAG systems across BEIR, MTEB retrieval, MIRACL, MS MARCO, KILT, HotpotQA, and RAGTruth hallucination benchmarks as of April 2026.

A data-driven comparison of the top AI transcription APIs and services for 2026, covering WER accuracy, pricing per hour, speaker diarization, and output formats.

OpenAI launched GPT-Rosalind on April 16, a frontier reasoning model for drug discovery that outranked human experts on RNA prediction and competes directly with Google DeepMind's AlphaFold.