Groq LPU - Deterministic Inference at Scale
Groq's Language Processing Unit (LPU) is a purpose-built inference ASIC that trades HBM for 230MB of on-chip SRAM, delivering deterministic latency and record-breaking tokens-per-second for LLM serving.

TL;DR
- Purpose-built inference ASIC with 230MB on-chip SRAM - no HBM, no off-chip memory bottleneck
- Deterministic latency by design - every inference runs in the same number of clock cycles regardless of batch size
- 80 TB/s on-chip SRAM bandwidth eliminates the memory wall that throttles GPU-based inference
- Approximately 750 TFLOPS FP8 on a 14nm process node at ~300W TDP
- Cloud-only availability through GroqCloud API - no chips sold to third parties
Overview
Groq's Language Processing Unit is the most architecturally distinct AI accelerator on the market today. While every other chip in the AI hardware space - from Intel Gaudi 3 to AWS Trainium2 to NVIDIA's entire lineup - uses some variant of the standard HBM-backed accelerator design, Groq went in a completely different direction. The LPU has no external memory at all. Every byte of data lives in 230MB of on-chip SRAM, connected by an internal fabric running at roughly 80 TB/s. The result is a chip that cannot match GPUs on training workloads or large-batch throughput but absolutely dominates single-stream inference latency.
The key architectural insight is determinism. On a GPU, inference latency varies with batch size, memory pressure, scheduling decisions, and thermal throttling. The same prompt can take 50ms on one run and 200ms on the next, depending on what else is happening on the chip. On the LPU, none of that variability exists. The compiler maps the entire computation graph to hardware at compile time. Every tensor operation is scheduled statically. Every data movement is predetermined. The chip executes the same number of clock cycles for every inference regardless of load. This makes latency not just fast but predictable - a property that matters enormously for real-time applications where tail latency kills user experience.
Groq fabricates on GlobalFoundries' 14nm process, which is two to three full nodes behind the 5nm silicon used by Cerebras WSE-3 and Intel Gaudi 3. The 14nm choice is deliberate - it keeps per-chip costs low and avoids the TSMC capacity crunch that has plagued GPU supply chains since 2022. Groq compensates for the older node by scaling horizontally across racks of LPU cards connected by their proprietary interconnect. The company does not sell hardware directly. Instead, it offers inference as a service through GroqCloud, where users pay per-token API pricing comparable to or cheaper than GPU-based inference providers.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | Groq |
| Product Family | LPU (Language Processing Unit) |
| Chip Type | ASIC (Tensor Streaming Processor) |
| Process Node | 14nm (GlobalFoundries) |
| Transistors | ~26.8 billion |
| Die Size | ~729 mm² |
| On-Chip Memory | 230MB SRAM |
| On-Chip Bandwidth | 80 TB/s |
| External Memory | None (no HBM, no DRAM) |
| FP8 Performance | ~750 TFLOPS |
| INT8 Performance | ~750 TOPS |
| FP16 Performance | ~188 TFLOPS |
| FP32 Performance | ~94 TFLOPS |
| TDP | ~300W |
| Chip-to-Chip Interconnect | Proprietary (GroqRack fabric) |
| Form Factor | PCIe card (GroqCard) |
| System | GroqRack (multiple cards per rack) |
| Target Workload | Inference only |
| Availability | Cloud-only (GroqCloud API) |
| Release | Q1 2024 (public GroqCloud access) |
Performance Benchmarks
| Metric | Groq LPU | NVIDIA H100 (SXM) | NVIDIA A100 (80GB) | AMD MI300X |
|---|---|---|---|---|
| Llama 3 70B - tok/s (single stream) | ~500 | ~80-120 | ~40-60 | ~100-140 |
| Llama 3 8B - tok/s (single stream) | ~1,200 | ~200-300 | ~100-150 | ~250-350 |
| Mixtral 8x7B - tok/s (single stream) | ~480 | ~90-130 | ~50-70 | ~110-150 |
| Time-to-first-token (Llama 3 8B) | ~10ms | ~50-100ms | ~80-150ms | ~40-80ms |
| Latency variance (p99/p50 ratio) | ~1.0x | 2-5x | 2-5x | 2-4x |
| Max batch throughput (tok/s/chip) | Low | High | High | High |
| FP8 TFLOPS | ~750 | 3,958 | N/A (no FP8) | 2,615 |
| Power (per chip) | ~300W | 700W | 400W | 750W |
The benchmarks tell a clear story: Groq wins single-stream latency by 3-5x over the best GPU solutions, but GPUs win on raw FLOPS and batch throughput by an even larger margin. The LPU is not a general-purpose accelerator. It is a latency-optimized inference engine built for a specific class of workloads.
If your workload is interactive chat with real-time latency requirements - voice assistants, coding copilots, game NPCs, customer service bots - the LPU is in a class of its own. If your workload is batch processing thousands of requests simultaneously where latency does not matter, a GPU cluster will deliver significantly better cost efficiency by packing more concurrent requests onto each chip.
The latency variance metric is arguably the most important number in the table. A p99/p50 ratio of ~1.0x means Groq's worst-case latency is essentially identical to its average-case latency. On GPUs, the worst-case can be 2-5x the average due to memory contention, scheduling jitter, and thermal management. For production systems where you need to guarantee sub-100ms response times at the 99th percentile, this difference is the gap between meeting your SLA and blowing it.
To put the single-stream numbers in perspective: at 500 tokens per second on Llama 3 70B, Groq generates roughly 375 words per second. That is faster than any human can read. For conversational AI, the response appears effectively instantaneous - the model generates its entire output before the user has finished reading the first sentence.
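The determinism claim can be made concrete with a toy simulation. The distributions below are invented for illustration, not measured - a constant-latency chip versus one whose latency carries a heavy jitter tail:

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, round(p / 100 * len(ordered)))
    return ordered[idx]

random.seed(0)

# Deterministic chip: every request takes the same number of cycles,
# so every latency sample is identical (illustrative 50 ms).
lpu_latencies = [50.0 for _ in range(10_000)]

# GPU-style chip: base latency plus jitter from batching, cache
# effects, and scheduling (illustrative lognormal tail).
gpu_latencies = [50.0 + random.lognormvariate(2.5, 1.0) for _ in range(10_000)]

for name, samples in [("deterministic", lpu_latencies), ("jittery", gpu_latencies)]:
    p50 = percentile(samples, 50)
    p99 = percentile(samples, 99)
    print(f"{name}: p50={p50:.1f} ms  p99={p99:.1f} ms  ratio={p99 / p50:.2f}x")
```

The deterministic series reports a p99/p50 ratio of exactly 1.0x regardless of how many samples you draw; the jittery series drifts well above it, which is what an SLA written against the 99th percentile actually pays for.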
A Note on Methodology
These benchmarks reflect single-stream inference performance - one request at a time, measuring raw speed per request. This is the scenario where the LPU excels.
GPU benchmarks in production deployments typically use continuous batching with dozens to hundreds of concurrent requests, which increases total throughput at the cost of per-request latency. The two paradigms are not directly comparable because they optimize for different objectives. Groq optimizes for the latency of each individual request. GPUs optimize for the total throughput of all requests combined.
When evaluating the LPU against GPU alternatives, the right question is not "which has more TFLOPS" but "which delivers the best user experience at an acceptable cost for my specific workload." For latency-critical interactive workloads, the LPU wins on both axes. For throughput-critical batch workloads, GPUs win on both axes.
Key Capabilities
Tensor Streaming Processor (TSP) Architecture. The LPU is built on Groq's TSP architecture, which is fundamentally different from the SIMT (Single Instruction, Multiple Threads) model used by GPUs. In a TSP, data flows through the chip in a predetermined pattern set at compile time. The compiler maps every operation to specific hardware units at specific clock cycles before execution begins.
There is no dynamic scheduling. No thread divergence. No cache hierarchy to manage. No branch prediction. The entire computational graph is unrolled into a static schedule that the hardware executes deterministically.
This eliminates an entire class of performance variability that plagues GPU inference. On an NVIDIA GPU, a matrix multiplication might take slightly different amounts of time depending on cache hit rates, warp scheduling decisions, memory bank conflicts, and dozens of other microarchitectural factors. On the LPU, it takes exactly the same number of cycles every time.
The trade-off is flexibility - the LPU cannot dynamically adapt to workload changes the way a GPU can. Every model must be compiled specifically for the hardware, and the compilation process determines the fixed execution schedule. You cannot simply load a new model at runtime - it must be compiled first.
The TSP architecture was originally designed by Groq's founder Jonathan Ross, who previously led the development of Google's TPU v1. The architectural lineage is visible - both the TPU and the LPU prioritize deterministic execution over dynamic flexibility - but the LPU takes the concept further by eliminating all external memory access.
No-Batch Inference. Most GPU inference engines rely on batching to amortize the cost of memory transfers across multiple requests. Loading model weights from HBM to compute units is expensive, so serving frameworks like vLLM and TensorRT-LLM batch multiple requests together to share that cost. The problem is that batching adds latency - each request waits for the batch to fill before processing begins, and the slowest request in the batch determines the completion time for all of them.
The LPU does not need batching because there are no external memory transfers to amortize. The model weights live in SRAM, directly adjacent to the compute units, with 80 TB/s of bandwidth connecting them. Each request runs independently at full speed. This means latency does not degrade under load the way it does on GPU clusters where increasing batch size trades latency for throughput.
For applications like real-time voice assistants - where 200ms of latency means an awkward pause in conversation - and interactive coding tools - where developers expect sub-second completions - this is a fundamental advantage. The user experience difference between a 50ms response and a 300ms response is the difference between a tool that feels instant and one that feels sluggish.
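A simple queueing sketch makes the batching trade-off concrete. This is a toy model with invented numbers - real GPU servers use continuous batching, which shrinks the queueing term but does not eliminate it:

```python
# Toy latency model contrasting batched GPU-style serving with
# unbatched LPU-style serving. All numbers are illustrative.

def batched_latency_ms(batch_size, arrival_rate_rps, exec_ms_per_batch):
    """Average per-request latency with static batching:
    requests wait for the batch to fill, then run together."""
    fill_ms = (batch_size / arrival_rate_rps) * 1000
    # The average request arrives halfway through the fill window.
    return fill_ms / 2 + exec_ms_per_batch

def unbatched_latency_ms(exec_ms_per_request):
    """Per-request latency with no batching: execution time only."""
    return exec_ms_per_request

# At 100 requests/s, a batch of 32 takes 320 ms to fill, so the
# average request waits ~160 ms before execution even starts.
gpu = batched_latency_ms(batch_size=32, arrival_rate_rps=100, exec_ms_per_batch=120)
lpu = unbatched_latency_ms(exec_ms_per_request=60)
print(f"batched: {gpu:.0f} ms, unbatched: {lpu:.0f} ms")
```

Note the structural point rather than the specific numbers: in the batched model, latency depends on arrival rate and batch size, so it degrades under load; in the unbatched model it is a constant.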
Horizontal Scaling via GroqRack. A single LPU has only 230MB of SRAM, which limits the maximum model size a single chip can hold. A typical 7B parameter model in FP16 requires about 14GB - far beyond what a single LPU can store. To run any production LLM, Groq distributes the model across multiple LPU cards within a GroqRack system using tensor parallelism.
The model's layers are partitioned across chips, and each chip holds a slice of the parameters in its local SRAM. The proprietary chip-to-chip interconnect maintains deterministic behavior across the multi-chip configuration. Because data movement between chips is also statically scheduled at compile time, the inter-chip communication does not introduce the kind of non-deterministic delays that PCIe and NVLink introduce in GPU clusters.
Groq has not published detailed interconnect bandwidth specifications, but the end-to-end latency numbers for 70B-parameter models - distributed across potentially hundreds of LPU chips - suggest the inter-chip communication overhead is remarkably low.
For Llama 3 70B at FP8 precision, the model requires roughly 70GB of parameter storage. Distributed across a GroqRack, this means roughly 300 LPU chips working in concert, each holding a ~230MB slice of the model. The fact that Groq achieves ~500 tokens per second on this configuration - despite the massive fan-out across chips - is a testament to the quality of the interconnect and the compiler's ability to overlap computation with communication.
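The chip-count arithmetic is easy to check. This back-of-envelope sketch counts only weight storage - it ignores KV cache, activations, and per-chip runtime overhead, all of which push real deployments higher:

```python
import math

# Minimum chips needed to hold a dense model's weights entirely in
# on-chip SRAM. Illustrative lower bound: weights only.
SRAM_PER_CHIP_GB = 0.230   # 230 MB usable SRAM per LPU

def chips_needed(params_billions, bytes_per_param):
    weights_gb = params_billions * bytes_per_param  # 1e9 params x N bytes = N GB per billion
    return math.ceil(weights_gb / SRAM_PER_CHIP_GB)

print(chips_needed(70, 1))   # Llama 3 70B at FP8 (1 byte/param) -> 305
print(chips_needed(8, 1))    # Llama 3 8B at FP8 -> 35
```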
Architecture Deep Dive
The Memory Wall Problem
To understand why the LPU exists, you need to understand the memory wall. In LLM inference, the dominant cost is not computation - it is reading model weights from memory. For every token generated, the model must read every weight in every layer. A 70B parameter model in FP8 requires reading 70GB of data per token. On an H100 with 3,350 GB/s of HBM bandwidth, that takes approximately 21ms - setting a hard floor on per-token generation time regardless of how many TFLOPS the GPU has.
This is why GPU inference is "memory bandwidth bound" rather than "compute bound." The FLOPS are there - the H100 has 3,958 TFLOPS of FP8 compute. But the memory system cannot feed data to the compute units fast enough to use those FLOPS.
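The memory-wall arithmetic is worth doing explicitly. Using the figures above, plus the standard estimate of ~2 FLOPs per parameter per generated token for dense decoding:

```python
# Why LLM decoding is bandwidth-bound, using the numbers from the text.
# Per generated token, a dense model reads every weight once.

PARAMS = 70e9              # Llama 3 70B
BYTES_PER_PARAM = 1        # FP8
H100_HBM_GBS = 3350        # HBM3 bandwidth, GB/s
H100_FP8_TFLOPS = 3958

bytes_per_token = PARAMS * BYTES_PER_PARAM               # 70 GB read per token
mem_time_ms = bytes_per_token / (H100_HBM_GBS * 1e9) * 1000

flops_per_token = 2 * PARAMS                             # ~2 FLOPs per parameter
compute_time_ms = flops_per_token / (H100_FP8_TFLOPS * 1e12) * 1000

print(f"memory floor:  {mem_time_ms:.1f} ms/token")      # ~20.9 ms
print(f"compute floor: {compute_time_ms:.3f} ms/token")  # ~0.035 ms
print(f"bandwidth-limited ceiling: {1000 / mem_time_ms:.0f} tokens/s")
```

The compute floor is roughly 600x smaller than the memory floor: at batch size 1, over 99% of the H100's FP8 throughput sits idle waiting on HBM, which is exactly the gap the LPU's SRAM-only design attacks.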
How the LPU Breaks the Wall
The LPU's memory hierarchy is the inverse of a GPU. On an NVIDIA H100, the compute units (SMs) have a few MB of L2 cache and shared memory, backed by 80GB of HBM3 running at 3,350 GB/s. The vast majority of data lives in HBM, and the chip spends a significant fraction of its time waiting for data to arrive from memory.
On the LPU, all 230MB of memory is SRAM running at 80 TB/s - roughly 24x the bandwidth of H100's HBM3. There is no HBM. There is no cache hierarchy. Every byte of data is equally accessible at the same latency and bandwidth from any compute unit. This flat memory architecture eliminates cache misses, DRAM refresh cycles, and the complex memory controller logic that consumes significant die area and power on GPUs.
The 230MB capacity constraint is real and unavoidable. SRAM is roughly 10-20x more expensive per bit than HBM in terms of die area. Groq chose to maximize bandwidth per bit rather than capacity, and compensates for the limited capacity through model sharding across multiple chips.
This is the right trade-off for inference, where you need to read each weight exactly once per token. It is the wrong trade-off for training, where you need to store gradients, optimizer states, and activations that can exceed the model size by 10-20x.
Power Efficiency
The ~300W TDP is notably lower than the 700W of an H100 or 900W of an Intel Gaudi 3. On a per-chip basis, the LPU is more power-efficient for inference workloads. The lower power is partly due to the 14nm process (lower clock speeds than 5nm designs) and partly due to the simpler microarchitecture (no dynamic scheduling hardware, no cache controllers, no branch predictors).
However, because serving a large model requires many more LPU chips than GPUs, the total system power for a GroqRack serving a 70B model may be comparable to or higher than a GPU cluster serving the same model. A GroqRack with 300+ chips at 300W each would draw roughly 90kW, while an 8-GPU H100 node at approximately 10kW can serve the same model (albeit at higher per-request latency). Groq has not published system-level power consumption figures, which makes it difficult to calculate true power efficiency at the system level.
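The system-level comparison sketched above, as a quick calculation. This counts chip TDPs only and ignores host CPUs, networking, and cooling, so treat it as a rough lower bound on both sides:

```python
# Back-of-envelope system power: chip TDPs only, no host/cooling overhead.
lpu_chips, lpu_watts = 300, 300
gpu_node_watts = 10_000          # ~8x H100 node, per the text's estimate

groq_rack_kw = lpu_chips * lpu_watts / 1000
print(f"GroqRack (~{lpu_chips} chips): {groq_rack_kw:.0f} kW")   # 90 kW
print(f"8x H100 node: {gpu_node_watts / 1000:.0f} kW")
print(f"ratio: {groq_rack_kw * 1000 / gpu_node_watts:.0f}x")     # 9x
```

The per-chip efficiency advantage inverts at the system level: to compare fairly you would need tokens-per-second-per-watt for the whole rack, which is exactly the figure Groq has not published.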
Why 14nm?
The choice of GlobalFoundries 14nm is often cited as a weakness, but it is a deliberate strategic decision with several advantages:
Cost: 14nm wafers from GlobalFoundries cost a fraction of TSMC 5nm wafers. This keeps per-chip costs low, which matters when you need hundreds of chips per rack.
Supply: GlobalFoundries 14nm has excess capacity, while TSMC 5nm is capacity-constrained with years-long waitlists. Groq can scale production without competing with NVIDIA, AMD, and Apple for TSMC allocation.
Maturity: 14nm is a mature, high-yield process. Manufacturing defect rates are low, which improves the economics of building large multi-chip systems.
The bottleneck is bandwidth, not compute: For inference workloads, the limiting factor is memory bandwidth, not raw FLOPS. The LPU's 80 TB/s SRAM bandwidth solves the bandwidth problem regardless of process node. Moving to 5nm would add more FLOPS, but the chip is already bandwidth-limited on most workloads, so the additional FLOPS would go unused.
The counter-argument is that a 5nm version of the LPU could fit more SRAM per die (more bandwidth, more model capacity per chip) at lower power, which would reduce the number of chips needed per model and improve system-level efficiency. Groq is rumored to be developing a next-generation LPU on a more advanced process node, but nothing has been announced.
Pricing and Availability
Groq does not sell LPU hardware. All access is through the GroqCloud API, which offers per-token pricing competitive with GPU-based inference providers.
| Provider / Chip | Llama 3 70B Input (per M tokens) | Llama 3 70B Output (per M tokens) |
|---|---|---|
| GroqCloud (LPU) | $0.59 | $0.79 |
| Together AI (GPU) | $0.88 | $0.88 |
| Fireworks AI (GPU) | $0.90 | $0.90 |
| AWS Bedrock (Trainium/GPU) | $1.30 | $1.70 |
GroqCloud pricing is competitive on cost and dominant on latency. The combination of lower per-token pricing and 3-5x lower latency makes it attractive for latency-sensitive workloads.
For a production chatbot processing 100 million tokens per day on Llama 3 70B, the annual cost difference between GroqCloud and Together AI works out to roughly $3,000-$11,000 depending on the input/output mix - a modest saving in absolute terms, but one that comes alongside the 3-5x latency advantage rather than at its expense.
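As a check on the pricing table above, here is the annual difference at 100 million tokens per day. The input/output split is an assumption, so the calculation sweeps it from all-input to all-output:

```python
# Annual savings of GroqCloud vs Together AI at the table's list prices,
# for 100M tokens/day. The input/output split is assumed, so sweep it.
GROQ = {"in": 0.59, "out": 0.79}      # $ per million tokens
TOGETHER = {"in": 0.88, "out": 0.88}

def annual_cost(prices, daily_tokens_m, input_frac):
    per_m = prices["in"] * input_frac + prices["out"] * (1 - input_frac)
    return per_m * daily_tokens_m * 365

for frac in (1.0, 0.7, 0.0):          # all-input, 70/30, all-output
    saving = annual_cost(TOGETHER, 100, frac) - annual_cost(GROQ, 100, frac)
    print(f"input fraction {frac:.0%}: saves ${saving:,.0f}/year")
```

The saving is dominated by the input-price gap ($0.29/M versus $0.09/M on output), so prompt-heavy workloads like RAG benefit the most.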
However, for high-throughput batch workloads where latency does not matter - document processing, embedding generation, offline analysis - GPU-based providers can often deliver better total cost of ownership by packing more concurrent requests onto each chip. The key question for any deployment is whether you are optimizing for latency or throughput. If latency, Groq wins. If throughput, GPUs win.
Funding and Capacity
Groq has raised more than $1 billion in total funding, including a $640 million Series D in August 2024 that valued the company at $2.8 billion. The company is expanding its data center footprint aggressively, with plans to deploy over 100,000 LPU chips across multiple regions by the end of 2026.
Availability has been expanding steadily since the public launch of GroqCloud in early 2024, but capacity remains more constrained than hyperscaler GPU instances. Rate limits on the free tier are strict, and enterprise customers can request dedicated capacity and volume pricing through GroqCloud sales.
Model Availability
GroqCloud does not support arbitrary model uploads. The available model library includes:
- Llama 3.x (8B, 70B, and 405B) - Meta's open-weight models
- Mixtral (8x7B, 8x22B) - Mistral's MoE models
- Gemma (2B, 7B) - Google's open models
- Whisper (large-v3) - OpenAI's speech recognition model
Each model must be compiled for the TSP architecture before it can be served, which limits the catalog to models Groq has explicitly optimized. Expanding model support requires Groq engineering effort, not just uploading weights.
This is a meaningful limitation compared to GPU-based providers where any GGUF or safetensors model can be loaded and served without vendor involvement.
Who Should Use the Groq LPU
The LPU makes sense for a narrow but important set of use cases:
Strong fit:
- Real-time conversational AI where latency directly impacts user experience
- Voice-based AI assistants that need sub-100ms response times to feel natural
- Interactive coding assistants where developers expect instant completions
- Gaming and entertainment applications with AI-driven NPCs
- Any deployment where p99 latency guarantees are contractually required
Weak fit:
- Batch processing and offline inference where throughput matters more than latency
- Model training of any kind - the LPU cannot train models
- Workloads requiring custom or fine-tuned models not in GroqCloud's catalog
- Organizations that require on-premises deployment or multi-cloud portability
- Applications that need very long context windows (large KV caches consume limited SRAM)
Strengths
- Deterministic latency eliminates tail-latency variance that plagues GPU inference - p99 equals p50
- 3-5x faster single-stream inference than the best GPU solutions on supported models
- 80 TB/s on-chip bandwidth removes the memory wall bottleneck entirely
- No batching required - each request runs at full speed regardless of concurrent load
- Competitive per-token API pricing despite the specialized hardware
- Time-to-first-token under 10ms for smaller models enables true real-time applications
- ~300W per chip is significantly less than competing accelerators
Weaknesses
- 230MB SRAM severely limits per-chip model capacity - large models require hundreds of chips
- No training capability whatsoever - inference only, cannot replace GPU/TPU training clusters
- Cloud-only access through GroqCloud - no option to purchase hardware for on-premises deployment
- 14nm process node is two to three generations behind competitors on transistor density
- Batch throughput significantly lower than GPUs - not cost-effective for offline or bulk processing
- Limited model catalog - only models Groq has specifically compiled and optimized are available
- Proprietary architecture with no open-source toolchain - complete dependency on Groq's software stack
Related Coverage
- Cerebras WSE-3 - Another non-GPU architecture taking a radically different approach to AI compute with wafer-scale integration
- Intel Gaudi 3 - Intel's ASIC alternative for training and inference at lower cost than NVIDIA
- AWS Trainium2 - Amazon's cloud-native training chip with a similar cloud-only availability model
