Llama 3.3 70B Instruct
Meta's Llama 3.3 70B Instruct matches Llama 3.1 405B on instruction following and math while running at 4-5x lower cost, with the lowest hallucination rate of any open-weight model on the Vectara summarization leaderboard.

Llama 3.3 70B Instruct is the model that made the Llama 3.1 405B largely irrelevant for most production workloads. Released December 6, 2024, it matches or beats the 405B on instruction following and math while using 5-6x less VRAM and costing around 5x less per token to serve. For a 70B model, that efficiency story is truly striking.
TL;DR
- Matches Llama 3.1 405B on IFEval (92.1) and MATH (77.0) at 5x lower serving cost
- 4.1% hallucination rate on the Vectara HHEM leaderboard - the lowest of any open-weight model
- $0.59/M input tokens on Groq at 314 tokens/second output speed
- 128K context window, Llama 3.3 Community License, weights on HuggingFace
Overview
The efficiency story is what gets attention, but the hallucination result is what matters most for production deployments. On the Vectara hallucination leaderboard, Llama 3.3 70B scores 4.1% - below GPT-4.1 at 5.6% and well below GPT-4o at 9.6%. For any team building RAG systems or summarization pipelines that can't send data to commercial APIs, that number is the deciding factor.
Meta trained this on around 15 trillion tokens from publicly available sources, with a knowledge cutoff of December 2023. Post-training used supervised fine-tuning and RLHF alignment with over 25 million synthetically created examples. The architecture is a standard dense transformer with Grouped-Query Attention - the same basic design as Llama 3.1, with training improvements that account for the benchmark gains.
The comparison to Llama 4 Maverick is worth making explicit. Maverick has higher benchmark ceilings on complex reasoning and vision tasks. But Maverick needs 800GB+ of VRAM in BF16, while Llama 3.3 70B runs in 75GB at FP16 or 43GB at Q4. If your workload is text-only, instruction-following, or summarization - and you don't need multimodal inputs - Llama 3.3 70B is the more practical choice on open-source hardware today.
Key Specifications
| Specification | Details |
|---|---|
| Provider | Meta |
| Model Family | Llama 3 |
| Parameters | 70B (dense transformer) |
| Context Window | 128K tokens |
| Training Data | ~15 trillion tokens |
| Knowledge Cutoff | December 2023 |
| Architecture | Dense transformer with Grouped-Query Attention |
| Input/Output | Text in, text and code out |
| Supported Languages | English, German, French, Italian, Portuguese, Hindi, Spanish, Thai |
| Training Compute | 7.0M GPU hours (H100-80GB) |
| Input Price | $0.59/M tokens (Groq) |
| Output Price | $0.79/M tokens (Groq) |
| Release Date | December 6, 2024 |
| License | Llama 3.3 Community License |
Benchmark Performance
Meta's own benchmark data, which compares Llama 3.3 70B against the previous 70B and 405B instruct models, shows the gap clearly:
| Benchmark | Llama 3.3 70B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|
| MMLU (CoT) | 86.0 | 86.0 | 88.6 |
| MMLU-Pro (CoT) | 68.9 | 66.4 | 73.3 |
| IFEval | 92.1 | 87.5 | 88.6 |
| GPQA Diamond | 50.5 | 48.0 | 49.0 |
| HumanEval | 88.4 | 80.5 | 89.0 |
| MBPP EvalPlus | 87.6 | 86.0 | 88.6 |
| MATH (CoT) | 77.0 | 68.0 | 73.8 |
| BFCL v2 (Tool Use) | 77.3 | 77.5 | 81.1 |
| MGSM (Multilingual) | 91.1 | 86.9 | 91.6 |
The IFEval and MATH results are the ones that matter most. On instruction following, the 70B model at 92.1 beats the 405B at 88.6 - that's a genuine regression in the larger model that the 3.3 training run corrected and passed. On MATH, 77.0 surpasses the 405B's 73.8 by a meaningful margin. The 405B holds an edge on MMLU-Pro (73.3 vs 68.9) and BFCL v2 (81.1 vs 77.3), but those differences are small enough that the cost and speed advantages overwhelm them for most workloads.
Against closed-source competition, GPT-4o Mini gets beaten on MMLU-Pro (68.9 vs 63.1) while Llama 3.3 remains open-weight and self-hostable. Full GPT-4o and Claude models still lead on advanced reasoning benchmarks, but the gap is no longer as wide as it was with Llama 3.1 70B.
The Vectara hallucination score deserves separate attention: 4.1% means that on summarization tasks, the model introduces fabricated or unsupported claims in about 4 out of every 100 summaries. That's lower than every commercial frontier model that was tested at the time, and the lowest open-weight score on that leaderboard.
Key Capabilities
Instruction following is this model's clearest strength. The IFEval score of 92.1 means it reliably executes constrained prompts - system prompts with formatting rules, chain-of-thought requirements, response length limits, or multi-step behavioral constraints. Any application that encodes complex rules in the system prompt benefits disproportionately from a high IFEval score.
For text summarization and RAG workloads, Llama 3.3 70B is the standout open-weight option. The low Vectara hallucination rate makes it defensible for organizations processing sensitive documents internally - legal, medical, financial - where sending data to external APIs isn't acceptable. The 128K context window accommodates moderately long documents; practical accuracy degrades past roughly 100K tokens, which is still sufficient for most document-level tasks.
Function calling support uses the same schema as Llama 3.2, which is more flexible than the 3.1 format. The BFCL v2 score of 77.3 is competitive, though it trails the 405B slightly. Multilingual performance is strong across the eight supported languages; MGSM at 91.1 is within 0.5 points of the 405B's 91.6.
One limitation to note: the model is text-only. There are no vision capabilities here. If your pipeline needs image understanding, Llama 4 Maverick or Llama 4 Scout are the open-weight alternatives with native multimodal support.
Pricing and Availability
Weights are on HuggingFace under the Llama 3.3 Community License. Commercial use is permitted up to 700M monthly active users, above which Meta requires a separate agreement. The license isn't formally OSI-approved open source, but it's functionally open for nearly all commercial use cases.
For self-hosting, VRAM requirements by quantization level:
| Quantization | VRAM Required |
|---|---|
| FP16 (full precision) | ~141 GB |
| INT8 (8-bit) | ~75 GB |
| Q4 (4-bit) | ~43 GB |
Two RTX 4090s pooled (48 GB) covers Q4 with minimal headroom. A single H100 (80 GB) handles INT8 comfortably. For production serving with multiple concurrent users, vLLM is the right framework; for solo local use, Ollama is the simplest path in.
Across API providers, pricing varies significantly:
| Provider | Input Price | Output Price | Output Speed |
|---|---|---|---|
| Groq | $0.59/M | $0.79/M | 314 tokens/s |
| DeepInfra | $0.35/M | $0.40/M | ~120 tokens/s |
| Together AI | $0.88/M | $0.88/M | ~80 tokens/s |
| Fireworks AI | $0.90/M | $0.90/M | ~85 tokens/s |
Groq leads on speed by a wide margin at 314 tokens/second, making it the right choice for latency-sensitive applications like chatbots. DeepInfra is the cheapest option if throughput matters more than latency.
Strengths
- IFEval 92.1 and MATH 77.0 both beat Llama 3.1 405B, at 5x lower serving cost
- 4.1% hallucination rate on Vectara HHEM - lowest of any open-weight model tested
- Competitive function calling (BFCL v2 77.3) and multilingual performance (MGSM 91.1)
- Available on HuggingFace under a broadly permissive commercial license
- Wide provider support with Groq delivering 314 tokens/s output speed
- Self-hostable on dual RTX 4090 at Q4 quantization
Weaknesses
- Text-only: no image, audio, or video input support
- Knowledge cutoff of December 2023 is now over two years stale
- Practical context length degrades past ~100K tokens despite the 128K advertised window
- MMLU-Pro 68.9 still trails the 405B (73.3) and frontier closed-source models
- Q4 quantization can degrade complex reasoning tasks compared to FP16
- W8A8 per-channel quantization shows known accuracy degradation compared to other model families
Related Coverage
- Llama 4 Maverick - Meta's current multimodal flagship
- Hallucination Benchmarks Leaderboard
- Best AI Models for Text Summarization
- Open Source LLM Leaderboard
- Best Open-Source LLMs You Can Self-Host in 2026
FAQ
Is Llama 3.3 70B better than Llama 3.1 405B?
On instruction following (IFEval 92.1 vs 88.6) and math (MATH 77.0 vs 73.8) it is. The 405B leads on MMLU-Pro and BFCL v2. For most deployments the 70B is the better choice: 5x cheaper and 5x less VRAM.
Can I use Llama 3.3 70B commercially?
Yes. The Llama 3.3 Community License permits commercial use. Businesses with over 700M monthly active users must contact Meta for a separate license agreement.
What's the minimum GPU setup for self-hosting?
At Q4 quantization, Llama 3.3 70B needs about 43 GB of VRAM. Two NVIDIA RTX 4090 GPUs (48 GB pooled) is the minimum practical consumer setup. A single H100 or A100 80GB handles INT8 with headroom.
Why does Llama 3.3 70B score well on hallucination?
On the Vectara HHEM benchmark, which tests how often a model introduces unsupported claims when summarizing a document, the model scores 4.1% - lower than GPT-4o (9.6%) and GPT-4.1 (5.6%). Smaller, more tightly aligned models can beat larger general-purpose models on faithfulness tasks.
What inference frameworks support Llama 3.3 70B?
vLLM, Ollama, llama.cpp, SGLang, and the HuggingFace Transformers pipeline all support the model. vLLM is recommended for production multi-user serving. Ollama is the easiest local setup.
Sources:
- Llama-3.3-70B-Instruct Model Card (HuggingFace)
- Llama 3.3 70B Benchmarks and Pricing (Artificial Analysis)
- Llama 3.3 70B vs GPT-4o (Vellum)
- Vectara Hallucination Leaderboard (GitHub)
- Llama 3.3 70B VRAM Requirements (Novita AI)
- Llama 3.3 Pricing (OpenRouter)
- Llama 3.3 70B on Fireworks AI
- Llama 3.3 70B vs Llama 3.1 405B (Novita AI)
✓ Last verified June 3, 2026
