OpenAI Jalapeño - Custom LLM Inference ASIC

OpenAI's first custom AI chip, co-designed with Broadcom on TSMC 3nm, targeting 50% lower inference cost than GPU alternatives.

OpenAI and Broadcom unveiled Jalapeño on June 24, 2026 - OpenAI's first custom silicon, co-designed from scratch around the inference math of large language models. It went from initial design to manufacturing tape-out in nine months, a timeline OpenAI claims is the fastest ever publicly reported for a high-performance custom ASIC.

TL;DR

  • First custom AI chip from OpenAI, co-designed with Broadcom and manufactured at TSMC on the 3nm node
  • Reticle-sized ASIC (~840mm²) with systolic array architecture and eight HBM stacks, optimized purely for LLM decode throughput
  • Targets roughly 50% lower inference cost per token vs GPU baselines - unverified, self-reported pre-production claim
  • Inference only - training stays on NVIDIA; small deployments target end of 2026, full production ramp in 2027-2028

Overview

Jalapeño is not a general-purpose AI accelerator adapted from existing designs. OpenAI built it for one job: running production inference on its own models as cheaply as possible. The chip pairs a systolic array compute architecture - similar in philosophy to Google's early TPUs - with eight HBM memory stacks in a reticle-limited package. The design focuses on memory bandwidth over raw TFLOPS, addressing the decode phase bottleneck where data movement limits throughput more than compute capacity.

The development was led by Richard Ho, OpenAI's Head of Hardware, who contributed to Google's original TPU program before joining OpenAI. Broadcom handled silicon engineering and networking; TSMC manufactured on its 3nm node. Celestica will handle board and rack integration for deployment. The Ethernet-based interconnect uses Broadcom's own networking stack rather than NVIDIA's NVLink, keeping the supply chain completely outside NVIDIA's ecosystem.

What Jalapeño isn't: a product for sale. OpenAI has no announced plans to offer cloud access to the chip or license it to third parties. Jalapeño exists to cut OpenAI's own infrastructure costs. Whether that changes over a multi-generation roadmap remains unstated.

[Image: OpenAI CEO Sam Altman and Broadcom CEO Hock Tan hold the first Jalapeño wafer during the chip's June 24 announcement.]
Broadcom CEO Hock Tan delivered a 300mm wafer holding roughly 50-60 Jalapeño ASICs to OpenAI CEO Sam Altman and President Greg Brockman on June 24, 2026. Source: the-decoder.com

Key Specifications

OpenAI and Broadcom disclosed architecture details but withheld most performance metrics. The table below uses confirmed figures where available and flags everything else.

| Specification | Details |
| --- | --- |
| Manufacturer | OpenAI (design) / Broadcom (silicon) / TSMC (fab) |
| Product Family | Jalapeño Gen 1 |
| Chip Type | ASIC |
| Process Node | TSMC 3nm |
| Die Size | ~840mm² (at EUV reticle limit, ~858mm²) |
| Memory | 8x HBM3 or HBM4 stacks |
| Memory Capacity | Not disclosed |
| Memory Bandwidth | Not disclosed |
| FP8 Performance | Not disclosed |
| FP16 Performance | Not disclosed |
| TDP | Not disclosed |
| Interconnect | Broadcom Ethernet |
| Systems Integration | Celestica |
| Target Workload | LLM inference only |
| Release Date | Q4 2026 (prototype); 2027-2028 (production) |

The die itself measures around 25.46mm × 33mm, filling the reticle almost completely. The package layout places one large compute chiplet at center, surrounded by six HBM memory stacks, with a separate I/O chiplet flanked by two structural dummy dies for mechanical balance - a layout analyzed from the 300mm production wafer shown at announcement.
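The die and wafer figures quoted here can be sanity-checked with a few lines of arithmetic. The sketch below assumes a standard 26mm × 33mm reticle field (consistent with the ~858mm² limit in the table) and uses a common first-order dies-per-wafer approximation that ignores scribe lines, edge exclusion, and yield. It is a rough estimate under those assumptions, not a figure from OpenAI, Broadcom, or TSMC.

```python
import math

# Figures quoted in the article
DIE_W_MM, DIE_H_MM = 25.46, 33.0
WAFER_DIAMETER_MM = 300.0

# Assumption: standard 26 mm x 33 mm reticle field (~858 mm^2)
RETICLE_W_MM, RETICLE_H_MM = 26.0, 33.0


def die_area_mm2(width_mm: float, height_mm: float) -> float:
    """Area of a rectangular die in mm^2."""
    return width_mm * height_mm


def gross_dies_per_wafer(wafer_diameter_mm: float, die_area: float) -> int:
    """First-order estimate: wafer area / die area, minus an edge-loss term.

    Ignores scribe lines, edge exclusion, and yield (assumption).
    """
    wafer_area = math.pi * (wafer_diameter_mm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2 * die_area)
    return int(wafer_area / die_area - edge_loss)


area = die_area_mm2(DIE_W_MM, DIE_H_MM)
reticle_fill = area / die_area_mm2(RETICLE_W_MM, RETICLE_H_MM)
dies = gross_dies_per_wafer(WAFER_DIAMETER_MM, area)

print(f"die area: {area:.2f} mm^2")
print(f"reticle fill: {reticle_fill:.1%}")
print(f"gross dies per 300 mm wafer: {dies}")
```

Under these assumptions the estimate lands at roughly 60 gross dies, in the same range as the 50-60 ASICs per wafer cited for the announcement wafer.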

Performance Benchmarks

No independent benchmarks exist yet. Jalapeño was in engineering sample phase at announcement, with OpenAI promising a full technical report in the months ahead. The only performance figure in circulation is the self-reported estimate of roughly 50% lower inference cost per token compared to current GPU-based alternatives - a claim Bloomberg reported from sources familiar with the chip, not official OpenAI documentation.

| Metric | Jalapeño | NVIDIA H100 SXM | NVIDIA B200 |
| --- | --- | --- | --- |
| Inference Cost/Token | ~50% lower (claimed) | Baseline | ~30% lower vs H100 |
| Memory Capacity | Not disclosed | 80GB HBM3 | 192GB HBM3e |
| Memory Bandwidth | Not disclosed | 3.35 TB/s | 8.0 TB/s |
| FP8 TFLOPS (with sparsity) | Not disclosed | 3,958 | 9,000 |
| TDP | Not disclosed | 700W | 1,000W |
| Process Node | TSMC 3nm | TSMC 4N | TSMC 4NP |

The comparison is deliberately limited: without real numbers from OpenAI, any fuller benchmark table would be fabrication. The NVIDIA H100 and B200 hardware figures are NVIDIA's published specifications. The Jalapeño column shows only what has been claimed or reported so far.
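To put the cost claims in the table on a common footing, the sketch below normalizes everything to the H100. It assumes the "~50% lower" Jalapeño claim is measured against an H100-class baseline, which the reporting does not actually specify, so the result is illustrative only.

```python
# Relative inference cost per token, normalized to the H100 baseline (= 1.0).
H100_COST = 1.0
B200_COST = H100_COST * (1 - 0.30)      # "~30% lower vs H100" from the table
JALAPENO_COST = H100_COST * (1 - 0.50)  # claimed ~50% lower; H100 baseline is an assumption


def relative_saving(new_cost: float, reference_cost: float) -> float:
    """Fractional cost reduction of new_cost relative to reference_cost."""
    return 1 - new_cost / reference_cost


saving_vs_b200 = relative_saving(JALAPENO_COST, B200_COST)
print(f"Jalapeno vs B200: {saving_vs_b200:.0%} lower cost per token")
```

If the baseline is in fact a newer GPU, the gap to the B200 would be larger than this estimate; if the claim does not hold in production, it would be smaller.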

One meaningful comparison is with the Cerebras WSE-3, another inference-focused design that avoids DRAM completely by fitting 44GB of SRAM on a single wafer-scale chip. OpenAI's Codex-class models currently run on Cerebras hardware for low-latency inference; Jalapeño is intended as the long-term replacement for that dependency.

Key Capabilities

Systolic Array for Decode Throughput

The systolic array architecture passes data cell to cell in a fixed pipeline, which suits the regular matrix multiplications that dominate the decode phase of autoregressive inference. Unlike a GPU's SIMT model - which is flexible but carries significant control overhead - a systolic array can sustain near-theoretical compute utilization on dense matrix workloads. Google's TPU line has proven this at scale. Jalapeño applies the same principle but is built specifically around OpenAI's model architectures rather than a generic training and inference target.

The trade-off is inflexibility. A systolic array optimized for inference is poorly suited for training's irregular, variable compute patterns. OpenAI made an explicit architectural choice: inference efficiency above all else, with training staying on NVIDIA hardware for the foreseeable future.
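The cell-to-cell pipeline described in this section can be illustrated with a toy cycle-level simulation. The sketch below models a generic output-stationary systolic array: rows of A enter from the left, columns of B enter from the top, each skewed by one cycle, and every cell multiplies the pair passing through it and adds to a local accumulator. Jalapeño's actual dataflow, array dimensions, and number formats have not been disclosed, so this is a textbook illustration, not a model of the chip.

```python
def systolic_matmul(A, B):
    """Simulate an M x N output-stationary systolic array computing A @ B.

    A is M x K, B is K x N (lists of lists). Returns (result, cycles).
    """
    M, K, N = len(A), len(A[0]), len(B[0])
    acc = [[0] * N for _ in range(M)]    # one accumulator per cell
    a_reg = [[0] * N for _ in range(M)]  # A values travelling left -> right
    b_reg = [[0] * N for _ in range(M)]  # B values travelling top -> bottom
    cycles = M + N + K - 2               # time for the last operands to reach the far corner

    for t in range(cycles):
        new_a = [[0] * N for _ in range(M)]
        new_b = [[0] * N for _ in range(M)]
        for i in range(M):
            for j in range(N):
                if j == 0:
                    k = t - i            # row i is delayed by i cycles (skew)
                    new_a[i][j] = A[i][k] if 0 <= k < K else 0
                else:
                    new_a[i][j] = a_reg[i][j - 1]  # take from left neighbour
                if i == 0:
                    k = t - j            # column j is delayed by j cycles (skew)
                    new_b[i][j] = B[k][j] if 0 <= k < K else 0
                else:
                    new_b[i][j] = b_reg[i - 1][j]  # take from upper neighbour
        a_reg, b_reg = new_a, new_b
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]  # multiply-accumulate in place
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

Each cell only ever talks to its immediate neighbours and there is no instruction fetch or scheduling in the loop, which is the source of both the efficiency and the inflexibility discussed above.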

Memory Bandwidth as the Real Constraint

The eight HBM stacks around the compute chiplet reflect a specific thesis about what limits inference speed. During the decode phase - producing each new output token - the bottleneck is not how fast the chip can multiply matrices, but how fast it can load model weights from memory into compute. A chip with more HBM stacks moving data faster can decode faster, regardless of raw TFLOPS. OpenAI's design explicitly targets this by maximizing bandwidth at the cost of compute density.
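The bandwidth argument can be made concrete with a back-of-the-envelope bound: if every generated token requires streaming all model weights from HBM once, the decode rate for a single sequence cannot exceed memory bandwidth divided by weight size. The sketch below applies that bound to the H100 and B200 bandwidth figures from the comparison table. The 70B-parameter FP8 model is a hypothetical chosen for illustration, and the bound ignores batching and KV-cache traffic.

```python
def max_decode_tokens_per_s(bandwidth_bytes_per_s: float,
                            n_params: float,
                            bytes_per_param: float) -> float:
    """Upper bound on single-sequence decode rate when each token requires
    reading all weights from HBM once (batch size 1, KV cache ignored)."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


# Hypothetical model: 70B parameters stored in FP8 (1 byte each).
N_PARAMS = 70e9
BYTES_PER_PARAM = 1.0

# HBM bandwidth figures from the comparison table (bytes per second).
BANDWIDTH = {"H100 SXM": 3.35e12, "B200": 8.0e12}

rates = {name: max_decode_tokens_per_s(bw, N_PARAMS, BYTES_PER_PARAM)
         for name, bw in BANDWIDTH.items()}

for name, rate in rates.items():
    print(f"{name}: <= {rate:.0f} tokens/s per sequence")
```

The bound scales linearly with bandwidth and does not depend on TFLOPS at all. Batching many requests amortizes each weight load across more tokens, which is why production throughput is higher than this single-sequence ceiling.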

AI-Assisted Chip Design

OpenAI used its own language models during the nine-month development cycle to optimize circuit placements and timing paths. The self-referential loop - model designs chip, chip runs model, cheaper inference funds a better model - is not incidental to Jalapeño. It was the design methodology. Whether AI-assisted EDA meaningfully compressed the timeline, or whether the compressed timeline reflects OpenAI's willingness to accept more risk at tape-out, is an open question the technical report should address.

Full-Stack Independence from NVIDIA

The Broadcom Ethernet interconnect isn't just a spec detail - it's a strategic statement. NVIDIA's NVLink provides high-bandwidth scale-up networking between GPUs but locks buyers into NVIDIA's ecosystem. Jalapeño with Broadcom's networking stack scales across racks without requiring any NVIDIA component. For a company spending north of $10 billion annually on GPU infrastructure, reducing supplier leverage has direct bottom-line value.

Pricing and Availability

Jalapeño isn't available to buy or rent. OpenAI designed it exclusively for internal use and has no announced plans to commercialize access. The chip's commercial impact will show up in OpenAI's inference pricing - if the 50% cost reduction claim holds in production, it enables lower API prices or higher margins on the same output.

The rollout timeline has three phases: small prototype deployments in late 2026, mass production in 2027, and full operational scale in the first half of 2028. Those 2028 data centers will be built in partnership with Microsoft and other infrastructure partners under the Stargate program. The 10-gigawatt program OpenAI and Broadcom announced spans both 3nm and future 2nm chips, suggesting Jalapeño is the first in a planned annual or biennial cadence.

For teams assessing AI accelerators today, Jalapeño isn't an option - prototype volumes are going to OpenAI's own infrastructure, not third parties.

Strengths and Weaknesses

Strengths

  • Purpose-built inference architecture avoids the overhead of GPU general-purpose design, targeting higher hardware utilization
  • Eight HBM stacks directly address the memory bandwidth bottleneck that limits LLM decode throughput
  • Reticle-limited 3nm die maximizes on-chip compute density within current EUV limits
  • AI-assisted nine-month design cycle demonstrates a new approach to custom silicon development
  • Full Broadcom Ethernet stack removes NVIDIA NVLink dependency for scale-up networking
  • Multi-generation roadmap signals long-term commitment to custom silicon

Weaknesses

  • Zero independent benchmark data; all performance claims are self-reported and unverified
  • No commercial availability - teams can't assess or buy access, making competitive comparison academic until production deploys
  • Inference-only scope means OpenAI's training workloads remain on NVIDIA hardware, limiting leverage in negotiations
  • Full production deployment is 18+ months out, during which NVIDIA Blackwell and next-generation AMD chips will continue to improve
  • Architecture optimized for OpenAI's specific model shapes may not generalize well even if commercialized later

✓ Last verified July 1, 2026

About the author: James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.