James Kowalski

AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure. His engineering background means he doesn't just read the spec sheet: he runs the benchmarks, profiles the latency, and checks whether the marketing claims hold up under real workloads.

He studied Computer Science at the University of Illinois at Urbana-Champaign, where he first got hooked on natural language processing during a senior research project on sentiment analysis. He later completed a certificate in data journalism from Northwestern's Medill School.

At Awesome Agents, James owns the leaderboards and tool comparison coverage. He maintains the site's benchmark tracking methodology and is the person who actually runs the numbers before publishing any ranking. He is also an open-source advocate and contributes to several projects in the LLM inference space.

Based in Chicago, IL.

Articles by James Kowalski
Claude Haiku 4.5

Anthropic's fastest and most cost-efficient model, scoring 73.3% on SWE-bench Verified and bringing extended thinking and computer use to the Haiku family for the first time, at $1/$5 per million tokens.

GPT-4.1

OpenAI's coding-optimized API model with a 1M token context window, 54.6% SWE-bench Verified score, and $2/$8 per million token pricing.

Claude 3.7 Sonnet

Anthropic's first hybrid reasoning model with togglable extended thinking, a 200K context window, and state-of-the-art SWE-bench performance at $3/$15 per million tokens.

SU-01

SU-01 is a 30B-A3B MoE reasoning model from Shanghai AI Lab that achieves gold-medal performance on IMO 2025, USAMO 2026, and IPhO 2024/2025 using a three-stage training recipe and test-time scaling.

HiDream-O1-Image

HiDream-O1-Image is an 8B open-source text-to-image model with a pixel-space diffusion architecture that outperforms 32B FLUX.2 [dev] across five major benchmarks.

SubQ

SubQ is the first LLM built on a fully subquadratic attention architecture, achieving a 12M-token research context and 52x faster inference than FlashAttention at 1M tokens.

GPT-OSS 20B

OpenAI's open-weight 21B MoE reasoning model with 131K context, Apache 2.0 license, and o3-mini-level benchmark performance running in 16 GB of memory.

OpenAI o3-pro

OpenAI's maximum-compute reasoning model targets the hardest problems where o3 falls short, at $20/$80 per million tokens.

OpenAI o3

OpenAI's most advanced reasoning model, built for math, science, coding, and visual tasks, with 200K context and adaptive chain-of-thought at $2/$8 per million tokens.