Meta MTIA 450: 18.4 TB/s Inference Accelerator

Meta's MTIA 450 doubles HBM bandwidth to 18.4 TB/s and adds FlashAttention hardware acceleration for GenAI inference in 2027.

Overview

The MTIA 450 is Meta's third-generation ASIC in its custom AI accelerator lineup, co-developed with Broadcom and fabricated at TSMC on an undisclosed process node. It was announced in March 2026 alongside a four-chip roadmap and is scheduled for mass deployment in early 2027 - after MTIA 400 deploys in 2026. Like every chip in the MTIA family, it's internal-only silicon. You can't buy it, rent it, or access it through any cloud API. That limits its relevance to the broader industry, but doesn't make it less interesting to understand.

The headline spec is 18.4 TB/s of HBM bandwidth - double the MTIA 400's 9.2 TB/s. That's the number that matters for GenAI inference, more than the MX4 FLOPS figure that Meta leads with in its press materials. Large language models, and especially mixture-of-experts models, are constrained by how fast you can move data from memory to compute. Compute headroom is rarely the bottleneck in inference. Bandwidth is.
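
To see why, it helps to run the numbers. The sketch below (Python, using an illustrative 70B-parameter dense model at 8-bit weights - an assumption for illustration, not a Meta workload) compares how long a single decode step spends streaming weights from HBM versus actually doing the math:

```python
# Back-of-envelope check of why decode-phase LLM inference is
# bandwidth-bound rather than compute-bound on this class of hardware.
# The model size is an illustrative assumption, not a Meta figure.

HBM_BW = 18.4e12       # bytes/s, MTIA 450 spec
FP8_FLOPS = 7e15       # FLOP/s,  MTIA 450 spec

params = 70e9                    # hypothetical dense 70B model
bytes_per_token = params * 1     # 8-bit weights, each read once per token
flops_per_token = 2 * params     # ~2 FLOPs per parameter (multiply + add)

t_mem = bytes_per_token / HBM_BW         # time spent streaming weights
t_compute = flops_per_token / FP8_FLOPS  # time spent on the matmuls

print(f"memory:  {t_mem * 1e3:.2f} ms/token")      # ~3.80 ms
print(f"compute: {t_compute * 1e3:.3f} ms/token")  # ~0.020 ms
print(f"~{t_mem / t_compute:.0f}x memory-bound at batch size 1")
```

Even with generous assumptions, the compute units sit idle for the overwhelming majority of each decode step. Doubling bandwidth attacks the term that actually dominates.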

TL;DR

  • RISC-V based ASIC co-developed with Broadcom; TSMC fabrication, process node undisclosed
  • 18.4 TB/s HBM bandwidth - doubled from MTIA 400's 9.2 TB/s - is the key inference improvement
  • 7 PFLOPS FP8, ~21 PFLOPS MX4; 288 GB HBM; 1,400W TDP
  • Hardware accelerators for FlashAttention and Softmax built into silicon, targeting long-context inference
  • MX4 custom low-precision format provides roughly 6x the FLOPS of FP16/BF16 for attention operations
  • Scheduled for mass deployment early 2027; announced March 2026
  • Not available externally - internal Meta infrastructure only

The MTIA 450 also introduces hardware-level acceleration for two specific operations that have historically been software bottlenecks in transformer inference: FlashAttention and Softmax. This is worth paying attention to because it reflects where the real compute pressure lives at inference time, especially for long-context workloads. Custom silicon that offloads those operations to dedicated functional units can run them faster and more efficiently than GPU kernels - but only for the workloads Meta actually runs.

Key Specifications

Specification         | Details
----------------------|-----------------------------------------------
Manufacturer          | Meta Platforms
Co-developer          | Broadcom
Fabrication           | TSMC (process node not disclosed)
Product Family        | MTIA (Meta Training and Inference Accelerator)
Chip Type             | ASIC
Architecture          | RISC-V based with chiplet design + HBM stacks
Memory                | 288 GB HBM
Memory Bandwidth      | 18.4 TB/s
FP8 Performance       | 7 PFLOPS (7,000 TFLOPS)
MX4 Performance       | ~21 PFLOPS
TDP                   | 1,400W
Special Accelerators  | Hardware FlashAttention + Softmax units
Primary Workload      | GenAI inference (MoE models)
Announced             | March 12, 2026
Mass Deployment       | Early 2027
Availability          | Internal use only - not sold commercially

MTIA Generation Comparison

The 450 sits between the MTIA 400 (rolling out in 2026) and the MTIA 500 (mass deployment in 2027). Understanding what changed across generations clarifies what the 450 actually prioritizes.

Specification            | MTIA 300                           | MTIA 400          | MTIA 450
-------------------------|------------------------------------|-------------------|------------------------
Status                   | In production                      | Launching 2026    | Mass deploy early 2027
FP8 Performance          | 1.2 PFLOPS                         | 6 PFLOPS          | 7 PFLOPS
MX4/MX8 Performance      | -                                  | ~12 PFLOPS (MX8)  | ~21 PFLOPS (MX4)
HBM Bandwidth            | 6.1 TB/s                           | 9.2 TB/s          | 18.4 TB/s
HBM Capacity             | 216 GB                             | 288 GB            | 288 GB
TDP                      | 800W                               | 1,200W            | 1,400W
FlashAttention HW Accel  | No                                 | No                | Yes
Softmax HW Accel         | No                                 | No                | Yes
Primary Workload         | Ranking & recommendation inference | GenAI inference   | GenAI inference

A few things stand out in that table. The FP8 FLOPS gain from 400 to 450 is modest - 6 to 7 PFLOPS, about 17%. The MX4 gain is larger in absolute terms (12 to 21 PFLOPS), but comparing MX8 to MX4 across generations isn't apples-to-apples since the precision formats are different. The bandwidth gain is the real story: 9.2 to 18.4 TB/s is a 100% increase in the spec that most constrains inference throughput. TDP climbs from 1,200W to 1,400W - a 17% increase in power draw to support a doubling of bandwidth. That's a reasonable trade for inference-focused workloads.
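
Another way to read the table is bandwidth per unit of compute. A quick calculation from the specs above shows the 400 generation scaling compute far faster than memory, and the 450 swinging the balance back:

```python
# Bytes of HBM bandwidth available per peak FP8 FLOP, per generation.
# A higher ratio means memory-bound workloads keep the compute fed.
# Figures taken from the comparison table above.

chips = {
    "MTIA 300": (6.1e12, 1.2e15),   # (bandwidth B/s, FP8 FLOP/s)
    "MTIA 400": (9.2e12, 6.0e15),
    "MTIA 450": (18.4e12, 7.0e15),
}
for name, (bw, flops) in chips.items():
    print(f"{name}: {bw / flops:.4f} bytes per FLOP")
# MTIA 300: 0.0051, MTIA 400: 0.0015, MTIA 450: 0.0026
```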

Performance Benchmarks

There are no independent benchmarks for the MTIA 450. None. Every number in this article comes from Meta's own announcements. The chip isn't available externally, there are no MLPerf submissions, and no third-party lab has tested it. That makes the quoted figures - 7 PFLOPS FP8, 21 PFLOPS MX4, 18.4 TB/s bandwidth - best understood as theoretical peak specs derived from Meta's own modeling, not measured throughput on real workloads.

Meta's claim of "much higher than leading commercial products" for MX4 FLOPS is hard to evaluate today. It depends entirely on what commercial products are on the market in early 2027. NVIDIA's Blackwell successor and AMD's next-generation GPUs will both be available by then. Whether the MTIA 450 actually surpasses them on MX4 compute depends on specs that haven't been announced yet.

The FP8 number is easier to evaluate in context. The NVIDIA H100 SXM5 delivers 3.9 PFLOPS FP8; the B200 hits 9 PFLOPS FP8 at 1,000W. At 7 PFLOPS FP8 and 1,400W, the MTIA 450's raw FP8 throughput trails current Blackwell hardware on a per-chip basis. The comparison isn't entirely fair, since MTIA is optimized for inference batch patterns that GPUs handle differently, but the compute numbers don't support the "much higher" framing without significant qualification.
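
Taking the quoted figures at face value - and noting that vendors don't always report FP8 throughput under the same sparsity assumptions - a rough per-watt comparison looks like this:

```python
# Peak FP8 throughput per watt using the figures quoted above. Vendor
# FP8 numbers may use different sparsity assumptions, so treat this as
# a rough sanity check, not a rigorous efficiency comparison.

chips = {"MTIA 450": (7e15, 1400), "NVIDIA B200": (9e15, 1000)}
for name, (flops, watts) in chips.items():
    print(f"{name}: {flops / watts / 1e12:.1f} TFLOPS/W")
# MTIA 450: 5.0 TFLOPS/W, NVIDIA B200: 9.0 TFLOPS/W
```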

Key Capabilities

Doubled HBM Bandwidth for Inference. The move from 9.2 to 18.4 TB/s is the most consequential change in the 450. Modern large language model inference, especially with long context windows and large batch sizes, is constrained by how quickly the chip can read KV cache entries and model weights from memory. More FLOPS doesn't help when the chip spends most of its time waiting for data. The MTIA 450 addresses this directly by doubling the memory bus capacity rather than just adding more compute. For the mixture-of-experts models Meta is focused on - where only a subset of parameters activate per token but the full parameter set must be available in memory - high bandwidth relative to capacity is exactly the right design choice.
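
A back-of-envelope example shows the scale involved. Using a hypothetical 80-layer model with grouped-query attention (8 KV heads, 128-dim heads - illustrative numbers, not a Meta model), the KV cache for a single 128K-token sequence is already tens of gigabytes, and every byte of it must be streamed each decode step:

```python
# Sketch: bytes the chip must stream per decode step just to read a
# long-context KV cache. Model shape is hypothetical (GQA, 8 KV heads),
# chosen for illustration - not a Meta model.

n_layers, n_kv_heads, head_dim = 80, 8, 128
bytes_per_elem = 1                 # FP8-quantized KV cache
ctx_len = 128 * 1024               # 128K-token context

# K and V vectors for every token, at every layer
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
cache_bytes = ctx_len * kv_bytes_per_token

t_read = cache_bytes / 18.4e12     # one full cache read at MTIA 450 bandwidth
print(f"KV cache: {cache_bytes / 1e9:.1f} GB per sequence")  # ~21.5 GB
print(f"read time: {t_read * 1e3:.2f} ms per decode step")   # ~1.17 ms
```

Multiply that by the batch size and the bandwidth budget disappears quickly, which is why doubling it buys real throughput rather than a paper improvement.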

Hardware FlashAttention Acceleration. FlashAttention is an algorithm that restructures the standard attention computation to reduce the number of times activations must be written to and read from memory. On GPUs, it's implemented as a carefully tuned CUDA kernel that avoids materializing the full attention matrix. On MTIA 450, Meta has moved FlashAttention into dedicated hardware functional units. This means the attention kernel runs without competing with other operations for memory bandwidth and at the clock efficiency of custom silicon rather than the general-purpose CUDA stack. Meta reports that this yields "6x the MX4 FLOPS of FP16/BF16" for attention operations specifically - a number that reflects both the hardware acceleration and the MX4 precision format's higher theoretical throughput.
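
For readers unfamiliar with the algorithm, here's a minimal NumPy sketch of the online-softmax tiling idea - single-head, no masking, and purely illustrative of what the dedicated units compute, not how Meta's silicon implements it:

```python
import numpy as np

def flash_attention(q, k, v, tile=128):
    """Tiled attention with an online softmax: K/V are consumed in
    blocks, so the full (seq x seq) score matrix is never materialized.
    Numerically equivalent to softmax(q @ k.T / sqrt(d)) @ v."""
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = np.full(q.shape[0], -np.inf)   # running row-wise max
    l = np.zeros(q.shape[0])           # running softmax denominator
    acc = np.zeros_like(q)             # running weighted sum of V rows

    for start in range(0, k.shape[0], tile):
        kb, vb = k[start:start + tile], v[start:start + tile]
        s = (q @ kb.T) * scale                  # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1))
        p = np.exp(s - m_new[:, None])          # tile's softmax numerators
        fix = np.exp(m - m_new)                 # rescale earlier partials
        l = l * fix + p.sum(axis=-1)
        acc = acc * fix[:, None] + p @ vb
        m = m_new
    return acc / l[:, None]

# Sanity check against the naive formulation
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((512, 64)) for _ in range(3))
s = (q @ k.T) / np.sqrt(64)
w = np.exp(s - s.max(axis=-1, keepdims=True))
ref = (w / w.sum(axis=-1, keepdims=True)) @ v
assert np.allclose(flash_attention(q, k, v), ref)
```

The per-tile rescaling in the loop is exactly the kind of fixed, regular dataflow that maps well onto a dedicated functional unit.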

The FlashAttention hardware matters most for long-context inference, where the attention computation scales quadratically with sequence length. A 128K-token context window produces an attention matrix 256x larger than an 8K-token window's. Without hardware acceleration, attention becomes the dominant cost. With it, the bottleneck shifts elsewhere. Whether the new bottleneck is still memory bandwidth or shifts to compute depends on the specific model architecture and batch configuration - something that will only be answerable once the chip is in production.

Softmax Hardware Acceleration. Softmax appears in the attention normalization step and again when the model's final output logits are converted to token probabilities. It's not computationally expensive in the abstract, but its reductions create serial dependencies that don't parallelize well, and it frequently sits on the critical path of transformer forward passes. Implementing it in hardware rather than software reduces latency for these operations, which matters for serving workloads where tail latency is a constraint. This is a smaller improvement than the FlashAttention acceleration but consistent with Meta's stated goal of methodically eliminating attention-related bottlenecks.
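
A reference implementation makes those serial dependencies concrete - the max and sum reductions below each depend on every element in the row, which is what puts softmax on the critical path:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max keeps exp() from overflowing. The max and
    # the sum are full-row reductions: these are the steps that serialize
    # poorly on general compute units and that a dedicated hardware unit
    # can pipeline.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)
```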

MX4 Low-Precision Format. MX4 (Microscaling 4-bit) uses a block-level shared exponent to represent values in 4-bit precision while maintaining more dynamic range than naive INT4 quantization. The practical effect is that you can run models at effectively 4-bit weight precision without the accuracy degradation that makes standard INT4 unsuitable for frontier model inference. This matters for large model deployment because 4-bit weights occupy half the memory of 8-bit weights, which means you can fit larger models in the same 288 GB HBM pool or run larger batch sizes. Meta has developed custom low-precision data types beyond the standard MX4 spec to further optimize for its specific inference workloads, though the details of those custom formats aren't public.
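
A simplified sketch of the microscaling idea follows: one shared power-of-two scale per 32-value block, with signed 4-bit integers standing in for the spec's FP4 (E2M1) element type to keep the code short. This is an illustration of block scaling in general, not Meta's undisclosed custom variants:

```python
import numpy as np

def mx4_quantize(x, block=32):
    """Toy microscaling quantizer: every block of 32 values shares one
    power-of-two scale; elements are stored in 4 bits. Real MX4 uses an
    FP4 (E2M1) element type - signed INT4 keeps this sketch short."""
    x = x.reshape(-1, block)
    max_abs = np.abs(x).max(axis=1, keepdims=True)
    # Shared exponent chosen so each block's largest value fits in +/-7
    exp = np.ceil(np.log2(max_abs / 7 + 1e-38))
    scale = 2.0 ** exp
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def mx4_dequantize(q, scale):
    return q * scale

x = np.random.default_rng(1).standard_normal(4096).astype(np.float32)
q, s = mx4_quantize(x)
err = np.abs(mx4_dequantize(q, s).ravel() - x).mean()
print(f"mean abs error: {err:.4f}")  # small relative to unit-variance input
```

The shared scale is what preserves dynamic range: an outlier only distorts its own 32-value block rather than forcing a single global scale on the whole tensor, which is where naive INT4 falls apart.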

Pricing and Availability

The MTIA 450 isn't for sale. It won't appear in a cloud catalog. Meta's silicon program is a captive infrastructure play - the chip exists to reduce Meta's dependence on NVIDIA GPU supply and to optimize inference cost for workloads at Meta's specific scale. That scale justifies the R&D investment: when you're running inference across billions of daily active users, the economics of custom silicon are favorable in ways they aren't for any organization running a smaller footprint.

For anyone outside Meta assessing AI accelerator options for 2027, the MTIA 450 is interesting context but not a viable alternative. AWS Trainium2 is available on EC2 instances. Google TPU v8i is available on Google Cloud. The MTIA line is internal, and there's no indication that will change.

Mass deployment is scheduled for early 2027. The MTIA 400 rolls out first in 2026, which means the 400 gets first-mover advantage on Meta's production GenAI workloads. The 450 follows with the doubled bandwidth and hardware attention acceleration once it's ready.

Strengths and Weaknesses

Strengths:

  • 18.4 TB/s HBM bandwidth is exceptionally high for inference-class silicon; inference is bandwidth-bound, and this addresses that directly
  • Hardware FlashAttention and Softmax acceleration targets the actual bottlenecks in transformer inference rather than padding raw FLOPS
  • MX4 format enables larger effective batch sizes and model capacity within the 288 GB HBM envelope
  • 288 GB HBM capacity is competitive with other inference accelerators at this tier
  • Announced with a clear roadmap; MTIA 400 launching first provides operational learning before the 450 rolls out

Weaknesses:

  • Not available to anyone outside Meta - the core limitation that makes all other specs irrelevant for most readers
  • No independent benchmarks; all numbers are Meta self-reported theoretical peaks
  • 7 PFLOPS FP8 is below current-generation NVIDIA Blackwell GPUs on a per-chip basis
  • Process node undisclosed, preventing meaningful power efficiency comparisons
  • 1,400W TDP is sizable; power and cooling infrastructure requirements are significant
  • "Much higher than leading commercial products" for MX4 FLOPS is a 2027 claim against a 2026 commercial baseline - the comparison may not hold by the time the chip launches

Sources

  • Meta Engineering Blog: MTIA four-generation roadmap announcement (March 2026) - engineering.fb.com
  • Meta About Blog: MTIA chip roadmap details and specifications - about.fb.com
  • Meta AI Blog: MTIA program background - ai.meta.com

Specifications confirmed from Meta's March 12, 2026 roadmap announcement. No independent verification of performance figures is possible since the chip isn't yet in production and is internal-only.

Last verified May 15, 2026

James Kowalski
About the author: AI Benchmarks & Tools Analyst

James is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.