Meta MTIA 450: 18.4 TB/s Inference Accelerator
Meta's MTIA 450 doubles HBM bandwidth to 18.4 TB/s and adds FlashAttention hardware acceleration for GenAI inference in 2027.

Overview
The MTIA 450 is Meta's third-generation ASIC in its custom AI accelerator lineup, co-developed with Broadcom and fabricated at TSMC on an undisclosed process node. It was announced in March 2026 alongside a four-chip roadmap and is scheduled for mass deployment in early 2027 - after MTIA 400 deploys in 2026. Like every chip in the MTIA family, it's internal-only silicon. You can't buy it, rent it, or access it through any cloud API. That limits its relevance to the broader industry, but doesn't make it less interesting to understand.
The headline spec is 18.4 TB/s of HBM bandwidth - double the MTIA 400's 9.2 TB/s. That's the number that matters for GenAI inference, more than the MX4 FLOPS figure that Meta leads with in its press materials. Large language models, and especially mixture-of-experts models, are constrained by how fast you can move data from memory to compute. Compute headroom is rarely the bottleneck in inference. Bandwidth is.
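To see that imbalance in numbers, compare the chip's quoted peak FLOPS against its quoted bandwidth. The sketch below uses only the figures from this article; "machine balance" and the ~2 FLOP/byte decode intensity are standard roofline-model quantities, not Meta-published numbers:

```python
# Back-of-envelope roofline check of why inference is bandwidth-bound,
# using the quoted MTIA 450 peak specs.

PEAK_FP8_FLOPS = 7e15      # 7 PFLOPS FP8 (Meta-quoted peak)
HBM_BANDWIDTH = 18.4e12    # 18.4 TB/s (Meta-quoted peak), in bytes/s

# Machine balance: FLOPs the chip can execute per byte it can fetch.
machine_balance = PEAK_FP8_FLOPS / HBM_BANDWIDTH   # ~380 FLOP/byte

# Batch-1 decode is dominated by matrix-vector products: each FP8
# weight byte fetched supports roughly 2 FLOPs (a multiply and an add).
decode_intensity = 2.0  # FLOP/byte for a GEMV at 1 byte per weight

print(f"machine balance:  ~{machine_balance:.0f} FLOP/byte")
print(f"decode intensity: ~{decode_intensity:.0f} FLOP/byte")
print(f"memory-bound gap: ~{machine_balance / decode_intensity:.0f}x")
```

The ~190x gap between what the compute units could do and what the memory system can feed them is why doubling bandwidth moves real inference throughput far more than adding FLOPS would.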
TL;DR
- RISC-V based ASIC co-developed with Broadcom; TSMC fabrication, process node undisclosed
- 18.4 TB/s HBM bandwidth - doubled from MTIA 400's 9.2 TB/s - is the key inference improvement
- 7 PFLOPS FP8, ~21 PFLOPS MX4; 288 GB HBM; 1,400W TDP
- Hardware accelerators for FlashAttention and Softmax built into silicon, targeting long-context inference
- MX4 custom low-precision format provides roughly 6x the FLOPS of FP16/BF16 for attention operations
- Scheduled for mass deployment early 2027; announced March 2026
- Not available externally - internal Meta infrastructure only
The MTIA 450 also introduces hardware-level acceleration for two specific operations that have historically been software bottlenecks in transformer inference: FlashAttention and Softmax. This is worth paying attention to because it reflects where the real compute pressure lives at inference time, especially for long-context workloads. Custom silicon that offloads those operations in dedicated functional units can do so faster and more efficiently than GPU kernels - but only for the workloads Meta actually runs.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | Meta Platforms |
| Co-developer | Broadcom |
| Fabrication | TSMC (process node not disclosed) |
| Product Family | MTIA (Meta Training and Inference Accelerator) |
| Chip Type | ASIC |
| Architecture | RISC-V based with chiplet design + HBM stacks |
| Memory | 288 GB HBM |
| Memory Bandwidth | 18.4 TB/s |
| FP8 Performance | 7 PFLOPS (7,000 TFLOPS) |
| MX4 Performance | ~21 PFLOPS |
| TDP | 1,400W |
| Special Accelerators | Hardware FlashAttention + Softmax units |
| Primary Workload | GenAI inference (MoE models) |
| Announced | March 12, 2026 |
| Mass Deployment | Early 2027 |
| Availability | Internal use only - not sold commercially |
MTIA Generation Comparison
The 450 sits between the MTIA 400 (rolling out 2026) and the MTIA 500 (mass deploy 2027). Understanding what changed across generations clarifies what the 450 actually prioritizes.
| Specification | MTIA 300 | MTIA 400 | MTIA 450 |
|---|---|---|---|
| Status | In production | Launching 2026 | Mass deploy early 2027 |
| FP8 Performance | 1.2 PFLOPS | 6 PFLOPS | 7 PFLOPS |
| MX4/MX8 Performance | - | ~12 PFLOPS (MX8) | ~21 PFLOPS (MX4) |
| HBM Bandwidth | 6.1 TB/s | 9.2 TB/s | 18.4 TB/s |
| HBM Capacity | 216 GB | 288 GB | 288 GB |
| TDP | 800W | 1,200W | 1,400W |
| FlashAttention HW Accel | No | No | Yes |
| Softmax HW Accel | No | No | Yes |
| Primary Workload | Ranking & recommendation inference | GenAI inference | GenAI inference |
A few things stand out in that table. The FP8 FLOPS gain from 400 to 450 is modest - 6 to 7 PFLOPS, about 17%. The MX4 gain is larger in absolute terms (12 to 21 PFLOPS), but comparing MX8 to MX4 across generations isn't apples-to-apples since the precision formats are different. The bandwidth gain is the real story: 9.2 to 18.4 TB/s is a 100% increase in the spec that most constrains inference throughput. TDP climbs from 1,200W to 1,400W - a 17% increase in power draw to support a doubling of bandwidth. That's a reasonable trade for inference-focused workloads.
Performance Benchmarks
There are no independent benchmarks for the MTIA 450. None. Every number in this article comes from Meta's own announcements. The chip isn't available externally, there are no MLPerf submissions, and no third-party lab has tested it. That makes the quoted figures - 7 PFLOPS FP8, 21 PFLOPS MX4, 18.4 TB/s bandwidth - best understood as theoretical peak specs derived from Meta's own modeling, not measured throughput on real workloads.
Meta's claim of "much higher than leading commercial products" for MX4 FLOPS is hard to evaluate today. It depends completely on what commercial products are available in early 2027. NVIDIA's Blackwell successor and AMD's next-generation GPUs will both be in the market by then. Whether MTIA 450 actually passes them on MX4 compute depends on specs that haven't been announced yet.
The FP8 number is easier to evaluate in context. The NVIDIA H100 SXM5 delivers 3.9 PFLOPS FP8; the B200 hits 9 PFLOPS FP8 at 1,000W. At 7 PFLOPS FP8 and 1,400W, MTIA 450's raw FP8 throughput is behind current Blackwell hardware on a per-chip basis. That comparison isn't completely fair since MTIA is optimized for inference batch patterns that GPUs handle differently, but the compute numbers don't support the "much higher" framing without significant qualification.
Key Capabilities
Doubled HBM Bandwidth for Inference. The move from 9.2 to 18.4 TB/s is the most consequential change in the 450. Modern large language model inference, especially with long context windows and large batch sizes, is constrained by how quickly the chip can read KV cache entries and model weights from memory. More FLOPS doesn't help when the chip spends most of its time waiting for data. The MTIA 450 addresses this directly by doubling the memory bus capacity rather than just adding more compute. For the mixture-of-experts models Meta is focused on - where only a subset of parameters activate per token but the full parameter set must be available in memory - high bandwidth relative to capacity is exactly the right design choice.
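As a rough illustration of this design point, the bandwidth ceiling on decode throughput can be sketched from the specs alone. The model sizes below are hypothetical examples, not any actual Meta model:

```python
# Bandwidth-only ceiling on decode throughput: each generated token must
# stream the active weights (and any KV cache) through HBM at least once.

HBM_BANDWIDTH = 18.4e12  # bytes/s (MTIA 450 quoted peak)

def max_tokens_per_sec(active_params, bytes_per_param=0.5, kv_bytes=0.0):
    """Upper bound on tokens/s for one decode stream.

    active_params:   parameters touched per token (all of them for a
                     dense model; only the routed experts for MoE).
    bytes_per_param: 0.5 for 4-bit (MX4-style) weights, 1.0 for FP8.
    kv_bytes:        KV-cache bytes read per token (grows with context).
    """
    bytes_per_token = active_params * bytes_per_param + kv_bytes
    return HBM_BANDWIDTH / bytes_per_token

# Hypothetical MoE: 40B active params per token, 4-bit weights.
print(f"{max_tokens_per_sec(40e9, 0.5):.0f} tok/s ceiling (no KV cache)")
```

Note the interaction with precision: halving bytes per parameter doubles the ceiling, which is why the MX4 format and the bandwidth doubling compound rather than merely add.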
Hardware FlashAttention Acceleration. FlashAttention is an algorithm that restructures the standard attention computation to reduce the number of times activations must be written to and read from memory. On GPUs, it's implemented as a carefully tuned CUDA kernel that avoids materializing the full attention matrix. On MTIA 450, Meta has moved FlashAttention into dedicated hardware functional units. This means the attention kernel runs without competing with other operations for memory bandwidth and at the clock efficiency of custom silicon rather than the general-purpose CUDA stack. Meta reports that this yields "6x the MX4 FLOPS of FP16/BF16" for attention operations specifically - a number that reflects both the hardware acceleration and the MX4 precision format's higher theoretical throughput.
The FlashAttention hardware matters most for long-context inference, where the attention computation scales quadratically with sequence length. A 128K-token context window produces an attention matrix 256x larger than an 8K-token window. Without hardware acceleration, attention becomes the dominant cost. With it, the bottleneck shifts elsewhere. Whether elsewhere is still bandwidth-limited or shifts to compute depends on the specific model architecture and batch configuration - something that will only be answerable once the chip is in production.
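The memory-saving idea behind FlashAttention can be sketched in a few lines: process keys and values in blocks, keeping a running max and normalizer so the full attention score vector never materializes at once. This is a single-query NumPy sketch of the algorithm, not a model of Meta's hardware units:

```python
import numpy as np

def flash_attention(q, K, V, block=128):
    """Single-query attention via online softmax over K/V blocks.
    Only one block of scores exists in memory at a time."""
    d = q.shape[0]
    scale = 1.0 / np.sqrt(d)
    m = -np.inf          # running max of scores seen so far
    s = 0.0              # running softmax denominator
    acc = np.zeros(d)    # running weighted sum of values
    for start in range(0, K.shape[0], block):
        k_blk, v_blk = K[start:start+block], V[start:start+block]
        scores = (k_blk @ q) * scale
        m_new = max(m, scores.max())
        correction = np.exp(m - m_new)   # rescale earlier partial sums
        p = np.exp(scores - m_new)
        s = s * correction + p.sum()
        acc = acc * correction + p @ v_blk
        m = m_new
    return acc / s

# Matches naive attention (which materializes all scores) on random data:
rng = np.random.default_rng(0)
q = rng.normal(size=64)
K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
scores = (K @ q) / np.sqrt(64)
w = np.exp(scores - scores.max())
naive = (w / w.sum()) @ V
assert np.allclose(flash_attention(q, K, V), naive)
```

The hardware version presumably wires this rescale-and-accumulate loop into dedicated functional units; the algorithmic structure is what lets attention run without round-tripping an N-by-N matrix through memory.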
Softmax Hardware Acceleration. Softmax appears in the attention normalization step and over the logits produced by the final output projection of transformer models. It's not computationally expensive in the abstract, but it requires full-row reductions - a max for numerical stability, then a sum for normalization - that limit parallelism and frequently sit on the critical path of transformer forward passes. Building it in hardware rather than software reduces latency for these operations, which matters for serving workloads where tail latency is a constraint. This is a smaller improvement than the FlashAttention acceleration but consistent with Meta's stated goal of eliminating attention-related bottlenecks methodically.
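A minimal numerically stable softmax makes that reduction structure visible; every output element depends on every input element in the row, which is exactly what resists parallelization:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array.
    Two full-row reductions (max, then sum) gate every element's result."""
    m = x.max()              # reduction 1: row max, for stability
    e = np.exp(x - m)
    return e / e.sum()       # reduction 2: the normalizer

p = softmax(np.array([1.0, 2.0, 3.0]))
assert np.isclose(p.sum(), 1.0)
# The max-subtraction keeps large logits from overflowing exp():
assert np.isclose(softmax(np.array([1000.0, 1000.0]))[0], 0.5)
```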
MX4 Low-Precision Format. MX4 (Microscaling 4-bit) uses a block-level shared exponent to represent values in 4-bit precision while maintaining more dynamic range than naive INT4 quantization. The practical effect is that you can run models at effectively 4-bit weight precision without the accuracy degradation that makes standard INT4 unsuitable for frontier model inference. This matters for large model deployment because 4-bit weights occupy half the memory of 8-bit weights, which means you can fit larger models in the same 288 GB HBM pool or run larger batch sizes. Meta has developed custom low-precision data types beyond the standard MX4 spec to further optimize for its specific inference workloads, though the details of those custom formats aren't public.
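The block-scaling idea can be made concrete with a toy version: 4-bit signed integer codes sharing one power-of-two scale per 32-value block. This is a sketch of the concept only - the real OCP Microscaling MXFP4 spec uses an FP4 element type rather than INT4, and Meta's custom variants aren't public:

```python
import numpy as np

def block_quantize_4bit(x, block=32):
    """Toy block-scaled 4-bit quantization (Microscaling-style sketch).
    Each block of `block` values shares one power-of-two scale; values
    are stored as 4-bit signed codes in [-8, 7]."""
    x = x.reshape(-1, block)
    # Shared exponent per block, sized so the block max fits in [-8, 7].
    exps = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) / 7 + 1e-30))
    scales = 2.0 ** exps
    codes = np.clip(np.round(x / scales), -8, 7).astype(np.int8)
    return codes, scales

def block_dequantize(codes, scales):
    return codes * scales

x = np.random.default_rng(0).normal(size=(4, 32))
codes, scales = block_quantize_4bit(x)
err = np.abs(block_dequantize(codes, scales) - x).max()
# Rounding error is bounded by half a scale step per element.
assert err <= scales.max() / 2 + 1e-12
```

The shared scale is why this beats naive INT4: outliers in one block only cost precision locally, instead of forcing a single coarse scale across the whole tensor.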
Pricing and Availability
The MTIA 450 isn't for sale. It won't appear in a cloud catalog. Meta's silicon program is a captive infrastructure play - the chip exists to reduce Meta's dependence on NVIDIA GPU supply and to optimize inference cost for workloads at Meta's specific scale. That scale justifies the R&D investment: when you're running inference across billions of daily active users, the economics of custom silicon are favorable in ways they aren't for any organization running a smaller footprint.
For anyone outside Meta assessing AI accelerator options for 2027, the MTIA 450 is interesting context but not a viable alternative. AWS Trainium2 is available on EC2 instances. Google TPU v8i is available on Google Cloud. The MTIA line is internal, and there's no indication that will change.
Mass deployment is scheduled for early 2027. The MTIA 400 rolls out first in 2026, which means the 400 gets first-mover advantage on Meta's production GenAI workloads. The 450 follows with the doubled bandwidth and hardware attention acceleration once it's ready.
Strengths and Weaknesses
Strengths:
- 18.4 TB/s HBM bandwidth is exceptionally high for inference-class silicon; inference is bandwidth-bound and this addresses that directly
- Hardware FlashAttention and Softmax acceleration targets the actual bottlenecks in transformer inference rather than padding raw FLOPS
- MX4 format enables larger effective batch sizes and model capacity within the 288 GB HBM envelope
- 288 GB HBM capacity is competitive with other inference accelerators at this tier
- Announced with a clear roadmap; MTIA 400 launching first provides operational learning before the 450 rolls out
Weaknesses:
- Not available to anyone outside Meta - the core limitation that makes all other specs irrelevant for most readers
- No independent benchmarks; all numbers are Meta self-reported theoretical peaks
- 7 PFLOPS FP8 is below current-generation NVIDIA Blackwell GPUs on a per-chip basis
- Process node undisclosed, preventing meaningful power efficiency comparisons
- 1,400W TDP is sizable; power and cooling infrastructure requirements are significant
- "Much higher than leading commercial products" for MX4 FLOPS is a 2027 claim against a 2026 commercial baseline - the comparison may not hold by the time the chip launches
Related Coverage
- Meta MTIA 300 - The first mass-deployed generation, currently in production at Meta
- Meta Unveils Four MTIA Chip Generations - The full roadmap announcement from March 2026
- Google TPU v8i - Google's inference-specialized chip with hardware attention optimizations, available on Google Cloud
- AWS Trainium2 - Another hyperscaler-internal chip with a different commercial model than Meta's
Sources
- Meta Engineering Blog: MTIA four-generation roadmap announcement (March 2026) - engineering.fb.com
- Meta About Blog: MTIA chip roadmap details and specifications - about.fb.com
- Meta AI Blog: MTIA program background - ai.meta.com
Specifications confirmed from Meta's March 12, 2026 roadmap announcement. No independent verification of performance figures is possible since the chip isn't yet in production and is internal-only.
Last verified May 15, 2026
