Structured Output JSON Schema Leaderboard 2026
Rankings of LLMs and constrained decoding frameworks on JSON schema adherence benchmarks including JSONSchemaBench and BFCL v3, covering native APIs and open-source constraint engines.

Every agent pipeline eventually bottlenecks on structured output. It doesn't matter how good a model's reasoning is if it can't reliably return a JSON object that matches the schema your downstream code expects. A single missing required field, an extra property the schema forbids, or an incorrectly typed value will break the pipeline as surely as a wrong tool choice.
This leaderboard covers both sides of the structured output problem: native JSON schema enforcement built into model APIs (OpenAI Structured Outputs, Anthropic tool use, Google Gemini responseSchema, and others), and open-source constrained decoding libraries that work at the inference layer (Outlines, Guidance, LM Format Enforcer, XGrammar, SGLang, jsonformer). The two approaches solve the same problem at different levels of the stack, and the right choice depends on whether you control the inference runtime.
Scores come from JSONSchemaBench - a Microsoft Research and EPFL paper from January 2025 testing six constrained decoding frameworks across nearly 10,000 real-world JSON schemas - and from BFCL v3, which covers structured function call formatting across frontier models. Native API providers are assessed separately where official documentation reports compliance behavior.
If you're reading this alongside the function calling benchmarks leaderboard, note the distinction: function calling evaluates whether a model picks the right tool and populates the right arguments. Structured output benchmarks evaluate whether the raw JSON or object the model emits is valid against a schema, regardless of the task it's solving. The instruction following leaderboard covers a third dimension - whether a model follows format constraints given in natural language instructions. All three matter for production agents.
TL;DR
- On JSONSchemaBench (10K real-world schemas), Guidance leads on coverage (96% empirical coverage on GlaiveAI schemas) while OpenAI and Gemini native APIs achieve 100% compliance on schemas they support but cover fewer schema types
- Constrained decoding libraries handle complex nested schemas better than native APIs but with measurable latency costs - Outlines compiles grammars in 3-8 seconds vs. Guidance's near-zero compile time
- On BFCL v3 structured function call formatting, GLM-4.5 (76.7%) and Qwen3 32B (75.7%) lead frontier models; Claude Opus 4 scores 25.3% due to conversational output wrapping that fails AST parsing
- For production pipelines: if you control the inference runtime, Guidance delivers the strongest coverage-speed tradeoff; if you're calling a hosted API, OpenAI strict: true mode offers the most reliable guarantee within its supported schema subset
The Benchmarks Explained
JSONSchemaBench
JSONSchemaBench, published by researchers from EPFL, Microsoft Research, and the JSON Schema team in January 2025, is the most systematic evaluation of constrained decoding that exists. The benchmark contains 9,558 real-world JSON schemas organized across 10 datasets with varying complexity levels - from simple flat objects to deeply nested schemas with $ref resolution, anyOf/oneOf combiners, and complex constraint types like pattern, minimum, uniqueItems, and const.
The evaluation measures three things:
Empirical coverage - what fraction of schemas in the dataset does a framework successfully process? A framework that crashes on anyOf schemas will score low here even if it handles simple schemas perfectly.
Compliance rate - for the schemas a framework does process, what fraction of generated outputs actually validate against the schema? A framework can have high empirical coverage but low compliance if it attempts all schemas but fails many.
Efficiency - grammar compilation time (the overhead before generation begins) and time per output token (the generation slowdown introduced by constrained decoding).
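The first two metrics decompose cleanly from per-schema trial records. A minimal sketch of that decomposition (the record fields here are illustrative, not the benchmark's actual data format):

```python
# Toy per-schema records: did the framework process the schema at all,
# and if so, did each generated output validate? Fields are illustrative.
results = [
    {"processed": True,  "outputs_valid": [True, True, True]},
    {"processed": True,  "outputs_valid": [True, False]},
    {"processed": False, "outputs_valid": []},  # framework crashed / refused
]

# Empirical coverage: fraction of schemas the framework processed.
coverage = sum(r["processed"] for r in results) / len(results)

# Compliance: among processed schemas, fraction of outputs that validated.
valid = sum(sum(r["outputs_valid"]) for r in results if r["processed"])
total = sum(len(r["outputs_valid"]) for r in results if r["processed"])
compliance = valid / total

print(f"coverage={coverage:.2f} compliance={compliance:.2f}")
```

Note how the third record drags coverage down without touching compliance - exactly the pattern behind the native APIs' 100% compliance at lower coverage.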
The researchers tested six frameworks: Guidance, Outlines, Llamacpp (via llama.cpp's GBNF grammar backend), XGrammar, OpenAI (gpt-4o with response_format: {type: "json_schema"}), and Gemini (gemini-1.5-pro with responseSchema). All constrained decoding frameworks ran against the same base model (Llama-3.1-8B-Instruct) to isolate framework behavior from model capability.
BFCL v3 (Berkeley Function Calling Leaderboard)
BFCL v3 from UC Berkeley's Sky Computing Lab is the standard tool-use benchmark. It's relevant here because function calls require valid JSON payloads matching an argument schema. BFCL uses Abstract Syntax Tree comparison to check structural correctness - catching subtle issues like mistyped field names and incorrectly nested arguments. Results here reflect how frontier models generate structured API call payloads when using their native tool-use APIs.
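The AST-comparison idea can be illustrated with Python's own ast module: parse the expected and produced call strings and compare function name and keyword arguments rather than raw text. This is a simplified sketch of the concept, not BFCL's actual checker:

```python
import ast

def call_signature(src: str):
    """Parse a single function-call expression into (name, {kwarg: value})."""
    node = ast.parse(src, mode="eval").body
    assert isinstance(node, ast.Call)
    name = ast.unparse(node.func)
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

# Whitespace and argument order don't matter; a mistyped field name does.
expected = call_signature('get_weather(city="Paris", unit="celsius")')
same     = call_signature('get_weather(unit="celsius",  city="Paris")')
wrong    = call_signature('get_weather(town="Paris", unit="celsius")')

print(expected == same)   # True - structurally identical
print(expected == wrong)  # False - "town" is not "city"
```

A parser like this also explains the Claude result below: output wrapped in conversational prose never reaches `ast.parse` as a clean call expression, so it fails regardless of whether the embedded JSON is valid.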
JSONSchemaBench Results
Coverage and Compliance by Dataset Complexity
Data from JSONSchemaBench (arXiv:2501.10868), Table 4. All constrained decoding frameworks tested against Llama-3.1-8B-Instruct. OpenAI and Gemini tested against their respective hosted models.
GlaiveAI Dataset (moderate complexity)
| Rank | Framework / API | Empirical Coverage | Compliance Rate |
|---|---|---|---|
| 1 | Guidance | 96% | 98% |
| 2 | Llamacpp (GBNF) | 95% | 97% |
| 3 | Outlines | 95% | 96% |
| 4 | XGrammar | 93% | 93% |
| 5 | OpenAI (gpt-4o) | 89% | 100% |
| 6 | Gemini (1.5 Pro) | 86% | 100% |
The native API results reveal an important tradeoff: OpenAI and Gemini achieve perfect compliance on the schemas they process, but skip more schemas than the local frameworks do. When OpenAI's structured output mode encounters a schema type it doesn't support (like certain $ref patterns or anyOf with complex branches), it falls back to unguided generation rather than attempting to comply - so the 100% compliance number comes at the cost of skipping the 11% of schemas that fall outside the supported subset.
GitHub Easy Dataset (simpler, developer-produced schemas)
| Rank | Framework / API | Empirical Coverage | Compliance Rate |
|---|---|---|---|
| 1 | Guidance | 86% | 96% |
| 2 | XGrammar | 79% | 87% |
| 3 | Llamacpp (GBNF) | 75% | 88% |
| 4 | Outlines | 59% | 83% |
| 5 | OpenAI (gpt-4o) | 29% | 97% |
| 6 | Gemini (1.5 Pro) | Not reported | Not reported |
OpenAI's coverage drops to 29% on the GitHub Easy dataset - not because the schemas are harder, but because they include more variety in structural patterns (schemas from real GitHub repos) that OpenAI's strict mode doesn't cover. The 97% compliance on what it does handle remains strong.
GitHub Hard Dataset (complex, highly nested real-world schemas)
| Rank | Framework / API | Empirical Coverage | Compliance Rate |
|---|---|---|---|
| 1 | Guidance | 41% | 69% |
| 2 | Llamacpp (GBNF) | 39% | 63% |
| 3 | XGrammar | 28% | 41% |
| 4 | Outlines | 3% | 6% |
| 5 | OpenAI (gpt-4o) | Not reported | Not reported |
| 6 | Gemini (1.5 Pro) | Not reported | Not reported |
The hard dataset exposes a harsh truth: none of the tested frameworks handles deeply complex schemas reliably. Guidance leads at 41% coverage and 69% compliance, but a 31% non-compliance rate on the schemas it does attempt rules out unattended production use. Outlines' 3% coverage on this dataset is the sharpest drop - its grammar compilation approach struggles significantly with advanced JSON Schema constructs like multi-level $ref resolution and complex combiners.
Grammar Compilation Efficiency
Constrained decoding imposes two types of overhead: grammar compilation time (paid once per schema, before generation starts) and per-token generation overhead (paid on every generated token). Both matter in production. The per-schema compilation time determines whether you can afford to compile dynamically per-request. The per-token overhead determines throughput.
Data from JSONSchemaBench Tables 2-3.
Grammar Compilation Time
| Framework | Compilation Time (GlaiveAI) | Compilation Time (GitHub) |
|---|---|---|
| Guidance | 0.00-0.01 seconds | 0.00-0.01 seconds |
| Llamacpp | 0.05-0.06 seconds | 0.05-0.06 seconds |
| XGrammar | 0.12-0.30 seconds | 0.12-0.30 seconds |
| Outlines | 3.48-8.05 seconds | Variable |
Outlines' compile time of 3-8 seconds per schema is its most significant production limitation. For applications that compile grammars once and reuse them across many requests (batch processing, static schema services), this isn't critical. For request-time schema compilation - where you're generating a schema per user request and enforcing it immediately - Outlines' compile time will dominate your latency budget. Guidance's near-zero compile time changes the calculus entirely for this use case.
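The amortization argument is simple arithmetic: a one-off compile cost divides across the requests that reuse the grammar. A quick sketch using figures in the range the paper reports (token count and exact latencies are illustrative assumptions):

```python
# Per-request latency = amortized compile time + generation time.
def per_request_latency(compile_s, requests_per_schema, tokens, tpot_ms):
    return compile_s / requests_per_schema + tokens * tpot_ms / 1000.0

tokens = 200  # assumed typical structured-output length

# Dynamic schemas: every request compiles its own grammar.
outlines_dynamic = per_request_latency(5.0,  1, tokens, 38.0)
guidance_dynamic = per_request_latency(0.01, 1, tokens, 8.0)

# Static schema reused across 10,000 requests: compile cost vanishes.
outlines_static = per_request_latency(5.0, 10_000, tokens, 38.0)

print(f"dynamic: Outlines {outlines_dynamic:.2f}s vs Guidance {guidance_dynamic:.2f}s")
print(f"amortized Outlines: {outlines_static:.2f}s")
```

With per-request compilation, the 5-second compile dwarfs the 7.6 seconds of generation for Guidance-class TPOT budgets; amortized across thousands of requests, only the per-token overhead remains.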
Time Per Output Token (Generation Overhead)
| Framework | TPOT (ms) | vs. Unconstrained |
|---|---|---|
| Guidance | 6.37-9.47 | Minimal overhead |
| Llamacpp | 27.22-29.98 | ~4x slower |
| Outlines | 30.33-46.57 | ~5x slower |
| XGrammar (HF backend) | 65.20-66.78 | ~10x slower |
Guidance's per-token speed is notably faster than other libraries. The paper attributes this to Guidance's token-level rather than character-level FSM approach and to its coalescence optimization, which defers constraint application to avoid unnecessary re-computation during generation.
JSON Schema Feature Support
The JSONSchemaBench authors also ran frameworks against the official JSON Schema Test Suite to measure feature coverage. The test suite contains 440 individual constraint categories.
| Framework | Categories with 100% Coverage | Categories with > 50% Coverage |
|---|---|---|
| Guidance | 13 | 21 |
| Llamacpp | 1 | 5 |
| XGrammar | 1 | 3 |
| Outlines | 0 | 2 |
Guidance covers more of the JSON Schema specification than any other tested framework. Most libraries implement a working subset of JSON Schema that handles common cases but doesn't cover the full spec. For applications that generate schemas programmatically or accept user-provided schemas, this matters - a user could provide a valid JSON Schema that the framework simply can't process.
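Before handing a user-provided schema to a framework, it's worth scanning it for constructs your framework may not support. A hypothetical pre-flight check - both the helper and the feature list are illustrative; consult your framework's documentation for its actual supported subset:

```python
import json

# Keywords that commonly trip constrained decoding frameworks (illustrative).
RISKY_KEYS = {"$ref", "anyOf", "oneOf", "pattern", "unevaluatedProperties"}

def risky_features(schema) -> set:
    """Recursively collect schema keywords from RISKY_KEYS."""
    found = set()
    if isinstance(schema, dict):
        found |= RISKY_KEYS & schema.keys()
        for value in schema.values():
            found |= risky_features(value)
    elif isinstance(schema, list):
        for value in schema:
            found |= risky_features(value)
    return found

schema = json.loads("""{
  "type": "object",
  "properties": {
    "id":   {"type": "string", "pattern": "^[a-z]+$"},
    "data": {"anyOf": [{"type": "string"}, {"$ref": "#/$defs/item"}]}
  }
}""")
print(sorted(risky_features(schema)))  # ['$ref', 'anyOf', 'pattern']
```

A scan like this lets you route simple schemas to a fast path and flag risky ones for validation-plus-retry handling instead of failing silently.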
BFCL v3 Rankings - Structured Function Call Format
Data from llm-stats.com, April 2026. These scores measure whether models produce structurally valid JSON function call payloads matching tool schemas - a closely related but distinct task from free-form JSON generation.
| Rank | Model | Provider | BFCL v3 Score |
|---|---|---|---|
| 1 | GLM 4.5 Thinking | Z AI | 76.7% |
| 2 | Qwen3 32B | Alibaba | 75.7% |
| 3 | Qwen3 235B A22B | Alibaba | 74.9% |
| 4 | GLM-4.7-Flash | Z AI | 74.6% |
| 5 | GLM 4.5 Air | Z AI | 69.1% |
| 6 | Nova Pro 1.0 | Amazon | 67.9% |
| 7 | Kimi K2.5 | Moonshot AI | 64.5% |
| 8 | INTELLECT-3 | Prime Intellect | 63.5% |
| 9 | Llama 4 Scout | Meta | 55.7% |
| 10 | Gemini 3 Flash Preview Thinking | Google | 53.5% |
| 11 | MiniMax M1 | MiniMax | 47.8% |
| 12 | Claude Opus 4 | Anthropic | 25.3% |
The Claude result at 25.3% looks alarming but reflects a specific evaluation mismatch rather than genuine inability to produce valid JSON. BFCL's AST parser expects tool calls in a rigid format; Claude wraps its tool invocations in conversational context that the parser rejects even when the underlying JSON structure is correct. When Claude is tested on multi-turn task completion - which tolerates formatting variation - it leads the field. See the function calling benchmarks leaderboard for a full treatment of this split.
For structured output tasks where you need exact schema compliance, not just tool-call formatting, the BFCL results suggest using an explicit enforcement layer (constrained decoding or native strict mode) rather than relying on model instruction following alone.
Native API Approaches Compared
The major hosted model providers all offer some form of structured output enforcement. They differ significantly in scope and reliability.
OpenAI Structured Outputs (strict: true)
OpenAI's structured outputs mode, introduced in August 2024 and available on gpt-4o and later models, is the most widely adopted native approach. With strict: true, OpenAI guarantees that the model output matches the provided JSON Schema - it won't return invalid JSON and it won't omit required fields. The guarantee is enforced at the API level using constrained decoding on OpenAI's serving infrastructure.
The tradeoff is that strict: true only supports a specific subset of JSON Schema. Unsupported features include: anyOf branches with incompatible types, certain $ref usage patterns, unevaluatedProperties, and some integer constraint patterns. When you submit a schema that uses unsupported features, OpenAI falls back to non-strict mode without always surfacing a clear warning. The JSONSchemaBench results - 89% empirical coverage on moderate schemas, 29% on developer-produced schemas - reflect this subset limitation.
For applications where you control the schema and can design it to stay within OpenAI's supported subset, the guarantee is strong. For applications that accept or generate schemas dynamically, the coverage gap matters.
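The request shape for strict mode looks like the sketch below, built as a plain dict rather than a live call. Two requirements from OpenAI's documentation are worth flagging: strict schemas must set additionalProperties: false and list every property as required (the schema contents here are illustrative):

```python
import json

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age":  {"type": "integer"},
    },
    # Both are mandatory for strict mode: no extra keys, all keys required.
    "required": ["name", "age"],
    "additionalProperties": False,
}

payload = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Extract: Ada, 36."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {"name": "person", "strict": True, "schema": schema},
    },
}

# Sanity-check the strict-mode requirements before sending.
assert schema["additionalProperties"] is False
assert set(schema["required"]) == set(schema["properties"])
print(json.dumps(payload["response_format"], indent=2))
```

A schema that violates either requirement is exactly the kind of input that can silently fall back to non-strict generation, so checking client-side before the call is cheap insurance.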
Anthropic Tool Use JSON Schema
Anthropic's tool use API requires tool parameter schemas to be provided as JSON Schema. Claude's output for tool calls is structurally JSON - it will produce a JSON object for each tool invocation - but the enforcement is behavioral rather than architectural. Anthropic doesn't apply constrained decoding at the API level; the model is trained to follow tool schemas reliably.
In practice, Claude scores extremely well on multi-turn tool use benchmarks (0.862 on tau-bench retail, leading the field) but produces conversational output that fails strict AST parsing in single-call evaluations. For production use, the key distinction is that Anthropic doesn't provide a hard guarantee of schema validity the way OpenAI strict: true does. Most well-formed requests will produce valid tool calls, but schema violations are possible on unusual or complex schemas.
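Because the enforcement is behavioral, production code typically validates the tool_use input block client-side before executing the tool. A minimal sketch with a deliberately simple hypothetical validator and a mocked response block (use a full JSON Schema validator library in production):

```python
def validate_input(tool_input: dict, schema: dict) -> list:
    """Check required fields and primitive types; return a list of errors."""
    type_map = {"string": str, "integer": int,
                "number": (int, float), "boolean": bool}
    errors = []
    for field in schema.get("required", []):
        if field not in tool_input:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in tool_input:
            expected = type_map.get(spec.get("type"))
            if expected and not isinstance(tool_input[field], expected):
                errors.append(f"wrong type for {field}")
    return errors

schema = {"type": "object",
          "properties": {"city": {"type": "string"},
                         "days": {"type": "integer"}},
          "required": ["city"]}

# Mocked tool_use content block, shaped like a Messages API response.
tool_use = {"type": "tool_use", "name": "get_forecast",
            "input": {"city": "Paris", "days": "three"}}  # wrong type

print(validate_input(tool_use["input"], schema))  # ['wrong type for days']
```

On validation failure, the usual pattern is to return the error list to the model as a tool result and let it retry the call.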
Google Gemini responseSchema
Google Gemini's structured output via responseSchema in the generation config is the Gemini equivalent of OpenAI's strict mode. Available on Gemini 1.5 Pro and later, it accepts a JSON Schema and enforces the output structure. JSONSchemaBench results show 86% empirical coverage and 100% compliance on supported schemas - similar coverage to OpenAI but with the same subset limitation.
Gemini's implementation handles a slightly different subset of JSON Schema than OpenAI's. For teams evaluating both providers, it's worth testing your specific schemas against both - the unsupported schema patterns don't overlap exactly.
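The Gemini request shape, again sketched as a plain config dict rather than a live call. The field names follow Google's documented generation config (response_mime_type plus response_schema); verify against current docs, since the API surface has changed across SDK versions:

```python
# Both fields are needed: the MIME type switches on JSON mode,
# and response_schema constrains the structure. Schema is illustrative.
generation_config = {
    "response_mime_type": "application/json",
    "response_schema": {
        "type": "object",
        "properties": {
            "title":  {"type": "string"},
            "rating": {"type": "integer"},
        },
        "required": ["title", "rating"],
    },
}

# Would be passed as: model.generate_content(prompt, generation_config=...)
print(sorted(generation_config))
```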
Mistral Response Format
Mistral supports a response_format parameter in its chat API that accepts {"type": "json_object"} for JSON mode. This enforces valid JSON but does not validate against a schema. For schema-level enforcement with Mistral models, you need to implement validation client-side or use an external constrained decoding layer. No published benchmark scores are available for Mistral-specific schema adherence.
Together AI and Fireworks JSON Modes
Together AI's JSON mode and Fireworks' structured output both support response_format with schema-based enforcement. Fireworks specifically uses a grammar-based approach (similar to Outlines/GBNF) that supports a wider schema subset than pure API-level enforcement. Neither has published systematic benchmarks against JSONSchemaBench, so schema validity rates are not reported here.
Open-Source Constrained Decoding Libraries
For teams running their own inference, constrained decoding libraries give you framework-level enforcement that works independently of which model you use.
Outlines
Outlines, from dottxt-ai (formerly the outlines-dev organization), is the most popular open-source library for structured generation. It uses a finite state machine (FSM) compiled from a JSON Schema to mask invalid tokens at each generation step - the model can only produce tokens that would lead to a valid completion of the current partial output.
JSONSchemaBench results: 95% coverage, 96% compliance on GlaiveAI schemas. Drops to 3% coverage on GitHub Hard schemas. Grammar compilation takes 3-8 seconds. Per-token generation adds ~5x overhead vs. unconstrained generation using the HuggingFace backend (though Outlines also supports vLLM, which offers substantially better throughput).
Best for: applications where schemas are known at startup, batched generation, or vLLM-backed deployments where the compilation cost is amortized.
Guidance
Guidance, from Microsoft Research, takes a different approach. Rather than compiling a full FSM at the start, it interleaves generation with constraint checking at a finer granularity. Its token-level (rather than character-level) FSM design and coalescence optimization reduce per-token overhead substantially.
JSONSchemaBench results: 96% coverage, 98% compliance on GlaiveAI schemas. 86% coverage on GitHub Easy. 41% coverage and 69% compliance on GitHub Hard - the strongest result in that category. Grammar compilation is near-zero. Per-token generation is 6-9 ms vs. 30+ ms for other libraries.
Guidance is also the only tested framework with meaningful JSON Schema Test Suite coverage across multiple feature categories. For production systems with complex or user-provided schemas, it offers the most complete spec coverage currently available.
Best for: dynamic schema compilation, complex schemas with $ref and combiners, applications where per-token latency matters.
XGrammar
XGrammar, from the MLC AI team (the group behind TVM and MLC-LLM), optimizes for serving throughput at scale. Its context-free grammar approach is designed for GPU batched inference scenarios where you need to enforce different schemas for different items in a batch simultaneously.
JSONSchemaBench results: 93% coverage, 93% compliance on GlaiveAI schemas. 79% coverage on GitHub Easy. 28% coverage on GitHub Hard. Compile time 0.12-0.30 seconds. Per-token generation using the HuggingFace backend is 65-67 ms - slower than Outlines on the same backend - but the library is optimized for dedicated GPU serving runtimes rather than the HuggingFace inference pipeline used in the benchmark.
Best for: GPU serving infrastructure where you need batched constrained decoding across many concurrent requests.
LM Format Enforcer
LM Format Enforcer is a lightweight library that integrates with HuggingFace Transformers, vLLM, and llama.cpp backends. It supports JSON Schema enforcement by filtering the logits at each generation step. It wasn't included in the JSONSchemaBench evaluation, so no systematic coverage data is available. Its main advantage is broad framework compatibility and simple integration: it works as a logit processor that can be dropped into most existing inference pipelines without restructuring the serving setup.
Best for: quick integration into existing HuggingFace or vLLM pipelines.
jsonformer
jsonformer takes a simpler approach than FSM-based libraries: it generates each field of a JSON object separately, using the schema to determine the type and constraints of each field and then calling the model only for the value portion. This avoids generating structural JSON tokens (braces, commas, colons) under model control entirely.
No published JSONSchemaBench scores. The approach works well for simple flat schemas and becomes increasingly limited with complex nested structures, optional fields, and anyOf branching. Not recommended for schemas that go beyond basic typed flat objects.
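The field-by-field idea can be sketched without any model at all: structure comes from walking the schema, and a (stubbed) model is consulted only for leaf values. The stub and schema below are illustrative:

```python
import json

def stub_model(prompt: str, value_type: str):
    """Stand-in for an LLM call that returns only a leaf value."""
    return {"string": "example", "integer": 0, "boolean": True}[value_type]

def fill(schema: dict, prompt: str):
    """Emit structure from the schema; ask the model only for leaf values."""
    if schema["type"] == "object":
        return {key: fill(sub, f"{prompt} -> {key}")
                for key, sub in schema["properties"].items()}
    if schema["type"] == "array":
        return [fill(schema["items"], prompt)]  # fixed length for the sketch
    return stub_model(prompt, schema["type"])

schema = {"type": "object", "properties": {
    "name":   {"type": "string"},
    "tags":   {"type": "array", "items": {"type": "string"}},
    "active": {"type": "boolean"}}}

out = fill(schema, "Describe the user")
print(json.dumps(out))  # structurally valid by construction
```

Braces, commas, and keys never pass through the model, which is why the approach can't go wrong structurally - and also why it can't express anyOf branching or optional-field decisions.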
SGLang
SGLang's structured outputs support is built into its serving framework rather than layered on top. SGLang uses a compiled EBNF grammar approach and is designed for high-throughput serving. It supports JSON Schema, regex, and custom grammars. No independent JSONSchemaBench evaluation exists, but SGLang's XGrammar backend integration means its coverage characteristics approximate XGrammar's results when using that backend.
Best for: high-throughput inference servers where you want constrained decoding integrated at the serving layer rather than in client code.
Constrained decoding libraries enforce JSON Schema compliance by masking invalid tokens at each generation step - ensuring structural validity at the cost of some generation overhead.
Quality Impact of Constrained Decoding
One concern about constrained decoding is that forcing the model down a constrained token path might degrade output quality - the model can't choose the best token, only the best valid token. JSONSchemaBench tested this directly on three reasoning tasks, comparing framework outputs against unconstrained base model outputs.
Data from JSONSchemaBench Table 8 (Llama-3.1-8B-Instruct, quality assessment):
| Framework | Last Letter (%) | GSM8K (%) | Shuffle Objects (%) |
|---|---|---|---|
| Guidance | 54.0 | 83.8 | 55.9 |
| Outlines | 53.3 | 81.6 | 53.0 |
| Llamacpp | 52.0 | 82.4 | 52.6 |
| XGrammar | 51.2 | 83.7 | 52.7 |
| Unconstrained | 50.7 | 80.1 | 52.6 |
The finding is reassuring: constrained decoding doesn't hurt quality - and in several cases, it marginally improves it. The GSM8K (math reasoning) result is notable: accuracy rises from 80.1% unconstrained to 83.8% with Guidance when the output must be formatted as structured JSON. The forced structure appears to help the model organize its answer rather than degrade its reasoning.
This result is consistent with prior work on chain-of-thought formatting: structured output constraints can act as a light scaffold that improves answer quality, especially for numerical tasks.
Key Takeaways
Coverage vs. guarantee: the core tradeoff
The most important finding from JSONSchemaBench is that no approach does everything well. Constrained decoding libraries (especially Guidance) cover more schema types and handle complex structures better than native APIs. But native APIs like OpenAI strict: true offer a hard guarantee - when they support a schema, the output is always valid. Libraries offer probabilistic compliance even at 96-98% rates.
For production systems: if your schema is stable, simple, and within the native API's supported subset, use the native API. If you need complex schema support or schema flexibility, use a constrained decoding library.
Outlines' compilation overhead is a real constraint
The 3-8 second grammar compilation time for Outlines rules out per-request dynamic schema compilation in latency-sensitive applications. If you're building a system where schemas vary per request (user-defined output formats, dynamic API integrations), Guidance's near-zero compile time is a meaningful operational advantage.
BFCL scores alone don't predict JSON validity
The BFCL leaderboard measures structured API call formatting. Claude Opus 4 at 25.3% and Guidance at 96-98% JSON Schema compliance are measuring fundamentally different things. A model's BFCL score tells you about its tool-call formatting behavior; it doesn't predict how it performs on arbitrary JSON Schema enforcement tasks. Use the JSONSchemaBench coverage and compliance numbers for the latter.
Hard schemas are still hard for everyone
On GitHub Hard schemas - the deeply nested, complex real-world schemas - even the best framework (Guidance) only achieves 41% coverage and 69% compliance. This is a research frontier problem, not a solved engineering one. For applications that need to enforce arbitrary complex JSON Schemas today, the practical answer is a combination of constrained decoding and post-generation validation with retry logic.
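The validate-and-retry pattern is straightforward to sketch. The generate function below is a stub standing in for any model call, and the validator checks only required fields; in production you'd use a full JSON Schema validator and feed the error list back into the retry prompt:

```python
import json

def generate(prompt: str, attempt: int) -> str:
    """Stub model: fails the first attempt, succeeds on retry."""
    if attempt == 0:
        return '{"name": "Ada"}'  # missing a required field
    return '{"name": "Ada", "age": 36}'

def validate(text: str, required: list):
    """Parse and check required fields; return (object, errors)."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    missing = [f for f in required if f not in obj]
    return (obj, []) if not missing else (None, [f"missing: {m}" for m in missing])

def generate_validated(prompt: str, required: list, max_retries: int = 3):
    for attempt in range(max_retries):
        obj, errors = validate(generate(prompt, attempt), required)
        if not errors:
            return obj
        prompt += f"\nFix these errors and retry: {errors}"
    raise RuntimeError("schema compliance not achieved")

print(generate_validated("Extract the person.", ["name", "age"]))
```

Combined with constrained decoding as the first line of defense, a loop like this turns the residual failure rate on hard schemas into bounded extra latency rather than pipeline breakage.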
For broader agent infrastructure context, see our roundup of best AI agent frameworks, which covers how these structured output approaches integrate into full agent orchestration stacks.
Methodology Notes and Caveats
Schema complexity scaling: JSONSchemaBench's three dataset tiers - GlaiveAI (moderate), GitHub Easy, GitHub Hard - show non-linear degradation for all frameworks. Results on GlaiveAI schemas do not predict results on complex schemas. If your application uses schemas with advanced constructs (nested $ref, anyOf/oneOf combiners, pattern validation), test against the GitHub Hard tier performance profile, not the headline GlaiveAI numbers.
Refusal rates and fallback behavior: Native API providers handle unsupported schema types differently. OpenAI silently falls back to non-strict mode; Gemini's behavior on unsupported schemas is less consistently documented. Monitor your API responses for the refusal field in the response object to detect cases where the provider is generating without constraint enforcement.
Speed penalty of constrained decoding: The per-token generation overhead ranges from minimal (Guidance) to ~10x (XGrammar on HuggingFace backend). Benchmark measurements used the HuggingFace Transformers backend for local libraries, which isn't representative of production vLLM or TensorRT-LLM deployments. Outlines on vLLM, for example, runs substantially faster than Outlines on HuggingFace Transformers. Reproduce benchmarks in your target serving environment before drawing production conclusions.
Base model matters for constrained decoding quality: JSONSchemaBench's framework comparison used Llama-3.1-8B-Instruct as the base model. A larger or more capable model will produce higher quality outputs under the same constraints. The coverage and compliance numbers reflect framework capability, not a prediction of what you'd achieve with GPT-4o or Claude running through Guidance.
Native API feature sets evolve: OpenAI, Anthropic, and Google update their structured output implementations and expand supported schema subsets over time. The coverage numbers from JSONSchemaBench (early 2025) may not reflect the current feature sets of hosted APIs. Check current provider documentation for the latest supported schema patterns.
FAQ
Which approach is best for strict JSON Schema enforcement in production?
For simple, stable schemas within the OpenAI-supported subset: OpenAI strict: true. For complex schemas or schema flexibility: Guidance with post-generation validation. Both approaches have different failure modes - test your specific schema against each before committing.
Does constrained decoding hurt the quality of generated content?
Based on JSONSchemaBench's quality measurements: no. Constrained generation matched or slightly exceeded unconstrained generation quality on all three tested tasks. The forced structure appears to help rather than hurt reasoning quality in some cases.
Why does Claude score so low on BFCL structured call formatting?
Claude's BFCL score of 25.3% reflects how it wraps tool calls in conversational context that BFCL's AST parser rejects. It's not evidence of poor JSON generation capability - Claude leads multi-turn tool use benchmarks. Use an explicit schema enforcement layer if you need hard guarantees rather than relying on BFCL scores to predict schema compliance behavior.
Can I use constrained decoding with hosted APIs?
Some hosted inference providers support it. Fireworks has a grammar-based structured output mode. Together AI has JSON mode. For truly arbitrary schema enforcement, you typically need to control the inference runtime (self-hosted models via vLLM, llama.cpp, or similar) to apply a constrained decoding library at the logit level.
How do I handle schemas that fall in the "unsupported" range for native APIs?
Two options: redesign your schema to stay within the provider's supported subset (usually means avoiding advanced $ref, complex anyOf, and certain string pattern constraints), or run a constrained decoding library on a self-hosted model. If the schema is user-provided and you can't control it, add client-side validation with retry logic regardless of which enforcement approach you use.
What is XGrammar's advantage over Outlines or Guidance?
XGrammar is optimized for batched GPU inference at scale - applying different schema constraints to different items in a batch simultaneously. Its HuggingFace Transformers benchmark times look slow, but it's not designed for that backend. In a dedicated GPU serving deployment (e.g., SGLang's XGrammar backend or a custom TensorRT setup), it achieves better throughput than libraries designed for single-request inference.
Sources:
- JSONSchemaBench: A Rigorous Benchmark of Structured Outputs for Language Models (arXiv:2501.10868)
- JSONSchemaBench GitHub repository
- Berkeley Function Calling Leaderboard (BFCL)
- BFCL benchmark rankings - llm-stats.com
- Outlines - Structured Text Generation (GitHub)
- dottxt-ai blog: Coalescence in structured generation
- Guidance - Microsoft Research (GitHub)
- LM Format Enforcer (GitHub)
- jsonformer (GitHub)
- SGLang Structured Outputs documentation
- OpenAI Structured Outputs guide
- Anthropic Tool Use documentation
- Google Gemini Structured Output documentation
- Mistral AI documentation
- Together AI JSON Mode documentation
- Fireworks AI Structured Output (Grammar-Based)
✓ Last verified April 19, 2026
