Meta MTIA 400: GenAI Inference at Scale

Meta's second-gen ASIC delivers 6 PFLOPS FP8 and 288 GB HBM for GenAI and recommendation inference inside Meta's data centers.

Overview

The MTIA 400 is Meta's second-generation custom ASIC for AI inference - a chip that exists entirely inside Meta's own data centers and can't be purchased, rented, or independently tested by anyone outside the company. Announced in March 2026 alongside the full four-chip MTIA roadmap, it doubles the compute chiplet count of its predecessor, the MTIA 300, and expands the target workload from ranking and recommendation (R&R) to include generative AI inference. At 6 PFLOPS FP8 and 288 GB of HBM per chip, deployed in 72-chip rack-scale domains, the numbers look competitive on paper. The problem is that every one of those numbers comes from Meta.

There's no MLPerf submission, no independent benchmark, and no way to reproduce the claimed performance. What Meta calls "competitive with leading commercial products" probably means the H100/MI300X generation - not NVIDIA's Blackwell lineup, which was already shipping when MTIA 400 was announced. The MX4 number - 12 PFLOPS - sounds impressive but uses a Microscaling 4-bit format that isn't directly comparable to standard FP8 or FP16 PFLOPS figures from competing vendors. These distinctions matter when assessing whether MTIA 400 actually closes the gap with commercial AI accelerators.

Strategically, the chip is significant. Meta is the only source of these workloads, so Meta is the only judge of whether the chip succeeds. Given the company runs hundreds of millions of daily inference requests across Facebook, Instagram, WhatsApp, and an expanding suite of generative AI products, the scale of deployment will be enormous even if the chip never appears on a third-party benchmark.

TL;DR

  • Meta's second-gen MTIA ASIC: 6 PFLOPS FP8, 288 GB HBM, 9.2 TB/s bandwidth, 1,200W TDP
  • Two compute chiplets (doubled vs MTIA 300's one), plus dedicated network and HBM chiplets
  • 12 PFLOPS MX4 figure uses Microscaling 4-bit format - not directly comparable to vendor FP8 numbers
  • 72-chip rack forms a single scale-up domain via switched backplane; 1.2 Tbps scale-up, 100 Gbps scale-out
  • Expands beyond R&R into GenAI inference - image and video generation at Meta's scale
  • Internal use only - no external sales, no cloud access, no independent benchmarks
  • "Competitive with leading commercial products" likely means H100 generation, not Blackwell

[Image: Meta MTIA 400 rack-scale system with 72 chips connected via switched backplane] The MTIA 400 rack configuration: 72 chips per rack, connected via a switched backplane as a single scale-up domain. Source: xpu.pub / Meta

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Meta Platforms |
| Co-developer | Broadcom |
| Fabrication | TSMC (process node not disclosed) |
| Product Family | MTIA (Meta Training and Inference Accelerator) |
| Chip Type | ASIC |
| Architecture | Two compute chiplets + network chiplets + HBM stacks (RISC-V vector) |
| FP8 / MX8 Performance | 6 PFLOPS (6,000 TFLOPS) |
| MX4 Performance | 12 PFLOPS (block-level shared exponent 4-bit format) |
| BF16 Performance | 3 PFLOPS |
| Memory | 288 GB HBM |
| Memory Bandwidth | 9.2 TB/s |
| TDP | 1,200W |
| Scale-up Networking | 1.2 Tbps |
| Scale-out Networking | 100 Gbps |
| Rack Config | 72 chips per rack, single scale-up domain (switched backplane) |
| Target Workloads | GenAI inference, Ranking & Recommendation |
| Status | Lab testing complete, deployment on track |
| Availability | Internal use only - not sold commercially |

The MTIA Roadmap

The MTIA 400 doesn't make much sense outside of Meta's larger silicon strategy. The company published a four-generation roadmap at the same time, with a new chip arriving roughly every six months through 2027. The 400 is the second step.

| Chip | Status | FP8 Performance | MX4 Performance | HBM Bandwidth | HBM Capacity | TDP |
| --- | --- | --- | --- | --- | --- | --- |
| MTIA 300 | In production | 1.2 PFLOPS | Not supported | 6.1 TB/s | 216 GB | 800W |
| MTIA 400 | Lab complete, deploying 2026 | 6 PFLOPS | 12 PFLOPS | 9.2 TB/s | 288 GB | 1,200W |
| MTIA 450 | Mass deploy early 2027 | ~21 PFLOPS (est.) | ~42 PFLOPS (est.) | ~18.4 TB/s (est.) | ~288 GB | ~1,400W (est.) |
| MTIA 500 | Mass deploy late 2027 | ~30 PFLOPS (est.) | ~60 PFLOPS (est.) | ~27.6 TB/s (est.) | 384-512 GB (est.) | ~1,700W (est.) |

The 300-to-500 headline is a 25x FP8 compute increase over four chips in two years. That's an aggressive cadence by any measure. MTIA 450 and 500 estimates in the table above are inferred from Meta's stated roadmap targets, not official disclosed specs. Take them with appropriate skepticism.

The precision format shift across generations is worth understanding before you read any cross-generation FLOPS comparison. MTIA 300 uses standard FP8. MTIA 400 introduces MX8 and MX4 (Microscaling formats, developed under the Open Compute Project MX specification). MX formats use block-level shared exponents - a group of values shares one exponent, so each value can be stored in fewer bits without giving up the dynamic range that per-value exponent bits would otherwise provide. The jump from 1.2 PFLOPS FP8 on MTIA 300 to 6 PFLOPS on MTIA 400 reflects both genuine hardware scaling and a format change that inflates the theoretical peak.
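
The storage accounting also shows why the 4-bit MX figure is double the 8-bit one. A minimal sketch, assuming the OCP MX layout of 32 elements sharing one 8-bit scale (the spec's published block size, not a Meta-disclosed detail):

```python
# Effective storage cost of a block-scaled 4-bit format.
# Assumes the OCP MX layout: 32 elements per block, one 8-bit shared scale.
elements_per_block = 32
element_bits = 4          # MX4 element width
shared_scale_bits = 8     # one shared exponent per block

effective_bits = (elements_per_block * element_bits + shared_scale_bits) / elements_per_block
print(f"{effective_bits:.2f} effective bits per value")   # 4.25, vs 8 for FP8/MX8
```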

[Image: MTIA chip roadmap from MTIA 300 through MTIA 500 showing compute scaling] Meta's four-chip MTIA roadmap spans MTIA 300 through 500, with deployment every six months through 2027. Source: xpu.pub / Meta

Performance Benchmarks

Every performance number for MTIA 400 comes from Meta. There are no independent benchmarks, no MLPerf submissions, no third-party testing. This is not unusual for internal silicon programs - AWS Trainium2 similarly withholds raw FLOPS, TDP, and process node details - but it does mean you're reading marketing material dressed up as technical specs.

Meta claims MTIA 400 is "competitive with leading commercial products." In early 2026, the realistic reference point for that phrase is the H100/MI300X generation - chips that were already being displaced by NVIDIA B200 and AMD MI350X by the time the MTIA 400 announcement dropped. For context:

| Metric | Meta MTIA 400 | NVIDIA H100 SXM5 | AMD MI300X |
| --- | --- | --- | --- |
| Peak FP8 TFLOPS | 6,000 | 3,958 | 2,610 |
| HBM Capacity | 288 GB | 80 GB | 192 GB |
| HBM Bandwidth | 9.2 TB/s | 3.35 TB/s | 5.3 TB/s |
| TDP | 1,200W | 700W | 750W |
| External Availability | None | Yes | Yes |
| Independent Benchmarks | None | Many | Many |

Raw FLOPS favor MTIA 400 on this comparison, but the numbers need context. First, 6 PFLOPS at 1,200W versus 3.9 PFLOPS at 700W means MTIA 400 delivers roughly 5 TFLOPS per watt versus H100's 5.6 TFLOPS per watt - not a power efficiency win. Second, the H100 and MI300X are mature products with years of compiler, driver, and software stack optimization. MTIA 400 runs on Meta's internal software stack tuned for Meta's specific model architectures. Real-world throughput on production GenAI workloads may or may not reflect the headline numbers. Meta hasn't said.
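
The perf-per-watt arithmetic is easy to check against the table's own figures. A quick sketch using peak numbers only - peak FLOPS per TDP watt is an upper bound, not measured efficiency:

```python
# Rough peak-FLOPS-per-watt comparison using the vendor-claimed figures above.
# Peak numbers ignore utilization, so treat this as an upper bound, not a measurement.
chips = {
    "MTIA 400":  {"peak_fp8_tflops": 6000, "tdp_w": 1200},
    "H100 SXM5": {"peak_fp8_tflops": 3958, "tdp_w": 700},
    "MI300X":    {"peak_fp8_tflops": 2610, "tdp_w": 750},
}

for name, c in chips.items():
    print(f"{name}: {c['peak_fp8_tflops'] / c['tdp_w']:.1f} TFLOPS/W (peak FP8)")
# MTIA 400 ~5.0, H100 SXM5 ~5.65, MI300X ~3.5 TFLOPS/W
```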

The MX4 figure - 12 PFLOPS - looks strong against anything in the H100/MI300X generation. But MX4 is a 4-bit format that most frameworks and published model weights don't yet use, so comparing it to FP8 TFLOPS from other vendors inflates the apparent lead.

Key Capabilities

Dual-Chiplet Compute Architecture. The MTIA 400 doubles the compute chiplets of the MTIA 300 - two compute dies versus one, plus dedicated network chiplets and HBM stacks in a multi-chip module. The network chiplets handle the PCIe host interface and networking functions independently of the compute dies, keeping collective communication traffic off the main compute path. This separation is architecturally clean: on GPU systems, NVLink and PCIe traffic competes with compute kernels for scheduling and memory resources. MTIA 400 avoids that by design. The RISC-V vector architecture from MTIA 300 carries through, with each Processing Element pairing RISC-V cores with a Dot Product Engine for dense matrix operations and a Special Function Unit for non-linear activations.
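
Meta hasn't published the MTIA 400 programming model, so treat the following as a purely conceptual sketch of the Processing Element split described above - dense matmul on one unit, non-linearity on another. The function names are illustrative, not Meta's API:

```python
import numpy as np

# Conceptual sketch only: a Dot Product Engine handles the dense matrix work,
# a Special Function Unit handles non-linear activations. Not Meta's actual API.

def dot_product_engine(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Dense GEMM: the part that benefits from MX8/MX4 arithmetic throughput."""
    return x @ w

def special_function_unit(x: np.ndarray) -> np.ndarray:
    """Non-linear activation (tanh-approximated GELU), handled by a separate unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

# One transformer-style feed-forward step, expressed as alternating DPE/SFU calls.
x = np.random.randn(4, 512).astype(np.float32)
w1 = np.random.randn(512, 2048).astype(np.float32)
w2 = np.random.randn(2048, 512).astype(np.float32)
y = dot_product_engine(special_function_unit(dot_product_engine(x, w1)), w2)
```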

MX8 and MX4 Precision Support. MTIA 400 is the first MTIA chip to support Microscaling formats. MX8 and MX4 use block-level shared exponents - rather than each value carrying its own exponent, a block of values shares one exponent element encoded in higher precision. This compresses the bit budget per value while reducing the quantization error that comes from naively cutting exponent bits. The MX specification was co-developed by Microsoft, NVIDIA, AMD, Intel, Qualcomm, and Meta, and ratified through Open Compute Project, so it's an actual open standard rather than a proprietary Meta format. The practical upside is higher arithmetic throughput per watt at equivalent model quality for inference workloads - if Meta's software stack can actually exploit the format efficiently across its GenAI model mix. That's a meaningful if.
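
Here's a minimal sketch of the block-scaling idea. It is not the exact OCP MX encoding (the spec uses 32-element blocks, power-of-two shared scales, and FP4/FP8 element types); this version quantizes to a signed 4-bit integer per element just to make the shared-scale mechanics visible:

```python
import numpy as np

# Simplified illustration of block-scaled ("microscaling") quantization.
# Not the real OCP MX encoding - a signed 4-bit integer element is used here
# so the shared-scale idea stays easy to follow.

BLOCK = 32  # elements sharing one scale

def mx4_quantize(x: np.ndarray):
    x = x.reshape(-1, BLOCK)
    # One shared power-of-two scale per block, derived from the block's max magnitude.
    exp = np.floor(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (exp - 2)                  # leave headroom for the 4-bit range
    q = np.clip(np.round(x / scale), -8, 7)   # 4-bit signed integer elements
    return q.astype(np.int8), scale

def mx4_dequantize(q, scale):
    return (q * scale).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = mx4_quantize(w)
err = np.abs(w - mx4_dequantize(q, s)).mean()
print(f"mean abs quantization error: {err:.4f}")
```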

72-Chip Scale-Up Domain. A single rack of 72 MTIA 400 chips forms one scale-up domain through a switched backplane, with 1.2 Tbps of scale-up bandwidth across the domain. For reference, Google TPU v7 Ironwood targets pods of 9,216 chips. The MTIA 400 rack is a much smaller collective - appropriate for inference, where jobs are usually smaller than large training runs, but limited compared to what hyperscaler TPU programs can address. Scale-out networking runs at 100 Gbps, which handles communication between racks without the complexity of the scale-up backplane.
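
One way to make the 72-chip domain concrete is aggregate memory. A rough sketch from the published per-chip figures - it ignores KV cache, activations, replication, and the small per-block scale overhead of MX formats:

```python
# Rough aggregate-memory view of one MTIA 400 rack, from the published specs.
chips_per_rack = 72
hbm_per_chip_gb = 288

rack_hbm_tb = chips_per_rack * hbm_per_chip_gb / 1024
print(f"Aggregate HBM per rack: {rack_hbm_tb:.2f} TB")   # 20.25 TB

# Very rough weight footprints (parameters only, no KV cache or activations,
# ignoring shared-scale overhead for the MX4 case).
for params_b, name in [(70, "70B model"), (405, "405B model")]:
    for bytes_per_param, fmt in [(1.0, "FP8/MX8"), (0.5, "MX4")]:
        gb = params_b * bytes_per_param
        print(f"{name} at {fmt}: ~{gb:.0f} GB of weights")
```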

Expanded GenAI Workload Support. MTIA 300 focused on ranking and recommendation. MTIA 400 adds GenAI inference - image generation, video generation, and text generation across Meta's product surface. This is a significant expansion in workload complexity. R&R inference involves sparse embedding lookups and shallow models; GenAI inference involves large transformer models, long context handling, and KV cache management. Whether the MTIA 400 architecture handles both efficiently or simply handles both adequately is something Meta hasn't quantified externally.
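
To put the KV cache point in concrete terms, the standard sizing formula for a decoder-only transformer is 2 x layers x KV heads x head dim x sequence length x bytes per element. A sketch with hypothetical model dimensions - not an MTIA-specific or Meta-disclosed figure:

```python
# Generic KV cache sizing for a decoder-only transformer:
# 2 (K and V) * layers * kv_heads * head_dim * seq_len * bytes_per_element.
# The model shape below is hypothetical, just to show the order of magnitude.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=1):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Example: a 70B-class model with grouped-query attention, 8K context, 8-bit cache.
size = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192, bytes_per_elem=1)
print(f"~{size / 1e9:.1f} GB of KV cache per request")   # ~1.3 GB
```

R&R inference carries none of this per-request state, which is part of why serving both workload classes well on one architecture is a harder target than the spec sheet suggests.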

Pricing and Availability

MTIA 400 isn't for sale. It won't be. No pricing exists because no transaction is possible. The chip is being rolled out in Meta's own data centers, serving Meta's own products, using Meta's own software stack. The Broadcom co-development partnership handles physical design and manufacturing coordination with TSMC; Broadcom doesn't get a commercial product out of this arrangement either.

If you need a custom inference ASIC, your options are AWS Trainium2 through EC2 instances, Google TPU v7 Ironwood through Google Cloud, or commercial hardware from NVIDIA, AMD, and others. MTIA 400 isn't and won't be on that list.

The strategic logic for Meta is straightforward. Meta's inference volume across Facebook, Instagram, WhatsApp, and AI products is large enough that the custom-silicon economics work even with full R&D amortization. The chip doesn't have to be better than an H100 in every dimension - it just has to be better than an H100 for Meta's specific workloads at Meta's specific scale. Whether it hits that mark is something only Meta can verify.

Strengths and Weaknesses

Strengths

  • HBM capacity (288 GB) and bandwidth (9.2 TB/s) exceed the H100 SXM5 substantially
  • Dual-chiplet compute architecture doubles compute density versus MTIA 300
  • MX4/MX8 support aligns with an open industry standard (Open Compute Project MX spec)
  • 72-chip scale-up domain via switched backplane avoids PCIe bandwidth bottlenecks
  • Part of a coherent four-generation roadmap with defined delivery targets
  • Real deployment at Meta's scale, not a paper chip or lab demonstration

Weaknesses

  • No external availability - zero relevance to anyone outside Meta
  • All performance claims are self-reported; no independent verification is possible
  • 6 PFLOPS at 1,200W isn't notably power efficient compared to H100's 3.9 PFLOPS at 700W
  • MX4 PFLOPS number isn't comparable to FP8 PFLOPS from NVIDIA or AMD
  • "Competitive with leading commercial products" is a vague claim probably indexed to H100 generation
  • Process node undisclosed, preventing efficiency analysis
  • Software stack is proprietary and non-transferable - no community, no open tooling

Sources

All MTIA 400 performance specifications - FP8 PFLOPS, MX4 PFLOPS, HBM capacity, HBM bandwidth, TDP, networking speeds, and scale-up domain size - are sourced from Meta's March 2026 technical announcement. No independent validation of these figures has been published.

✓ Last verified May 15, 2026

About the author

James Kowalski, AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.