Name: Llama 3.3 70B Instruct
Author: Meta

Llama 3.3 70B Instruct is the model that made the Llama 3.1 405B largely irrelevant for most production workloads. Released December 6, 2024, it matches or beats the 405B on instruction following and math while using 5-6x less VRAM and costing around 5x less per token to serve. For a 70B model, that efficiency story is truly striking.

TL;DR

Matches Llama 3.1 405B on IFEval (92.1) and MATH (77.0) at 5x lower serving cost
4.1% hallucination rate on the Vectara HHEM leaderboard - the lowest of any open-weight model
$0.59/M input tokens on Groq at 314 tokens/second output speed
128K context window, Llama 3.3 Community License, weights on HuggingFace

Overview

The efficiency story is what gets attention, but the hallucination result is what matters most for production deployments. On the Vectara hallucination leaderboard, Llama 3.3 70B scores 4.1% - below GPT-4.1 at 5.6% and well below GPT-4o at 9.6%. For any team building RAG systems or summarization pipelines that can't send data to commercial APIs, that number is the deciding factor.

Meta trained this on around 15 trillion tokens from publicly available sources, with a knowledge cutoff of December 2023. Post-training used supervised fine-tuning and RLHF alignment with over 25 million synthetically created examples. The architecture is a standard dense transformer with Grouped-Query Attention - the same basic design as Llama 3.1, with training improvements that account for the benchmark gains.

The comparison to Llama 4 Maverick is worth making explicit. Maverick has higher benchmark ceilings on complex reasoning and vision tasks. But Maverick needs 800GB+ of VRAM in BF16, while Llama 3.3 70B runs in 75GB at FP16 or 43GB at Q4. If your workload is text-only, instruction-following, or summarization - and you don't need multimodal inputs - Llama 3.3 70B is the more practical choice on open-source hardware today.

Key Specifications

Specification	Details
Provider	Meta
Model Family	Llama 3
Parameters	70B (dense transformer)
Context Window	128K tokens
Training Data	~15 trillion tokens
Knowledge Cutoff	December 2023
Architecture	Dense transformer with Grouped-Query Attention
Input/Output	Text in, text and code out
Supported Languages	English, German, French, Italian, Portuguese, Hindi, Spanish, Thai
Training Compute	7.0M GPU hours (H100-80GB)
Input Price	$0.59/M tokens (Groq)
Output Price	$0.79/M tokens (Groq)
Release Date	December 6, 2024
License	Llama 3.3 Community License

Benchmark Performance

Meta's own benchmark data, which compares Llama 3.3 70B against the previous 70B and 405B instruct models, shows the gap clearly:

Benchmark	Llama 3.3 70B	Llama 3.1 70B	Llama 3.1 405B
MMLU (CoT)	86.0	86.0	88.6
MMLU-Pro (CoT)	68.9	66.4	73.3
IFEval	92.1	87.5	88.6
GPQA Diamond	50.5	48.0	49.0
HumanEval	88.4	80.5	89.0
MBPP EvalPlus	87.6	86.0	88.6
MATH (CoT)	77.0	68.0	73.8
BFCL v2 (Tool Use)	77.3	77.5	81.1
MGSM (Multilingual)	91.1	86.9	91.6

The IFEval and MATH results are the ones that matter most. On instruction following, the 70B model at 92.1 beats the 405B at 88.6 - that's a genuine regression in the larger model that the 3.3 training run corrected and passed. On MATH, 77.0 surpasses the 405B's 73.8 by a meaningful margin. The 405B holds an edge on MMLU-Pro (73.3 vs 68.9) and BFCL v2 (81.1 vs 77.3), but those differences are small enough that the cost and speed advantages overwhelm them for most workloads.

Against closed-source competition, GPT-4o Mini gets beaten on MMLU-Pro (68.9 vs 63.1) while Llama 3.3 remains open-weight and self-hostable. Full GPT-4o and Claude models still lead on advanced reasoning benchmarks, but the gap is no longer as wide as it was with Llama 3.1 70B.

The Vectara hallucination score deserves separate attention: 4.1% means that on summarization tasks, the model introduces fabricated or unsupported claims in about 4 out of every 100 summaries. That's lower than every commercial frontier model that was tested at the time, and the lowest open-weight score on that leaderboard.

Key Capabilities

Instruction following is this model's clearest strength. The IFEval score of 92.1 means it reliably executes constrained prompts - system prompts with formatting rules, chain-of-thought requirements, response length limits, or multi-step behavioral constraints. Any application that encodes complex rules in the system prompt benefits disproportionately from a high IFEval score.

For text summarization and RAG workloads, Llama 3.3 70B is the standout open-weight option. The low Vectara hallucination rate makes it defensible for organizations processing sensitive documents internally - legal, medical, financial - where sending data to external APIs isn't acceptable. The 128K context window accommodates moderately long documents; practical accuracy degrades past roughly 100K tokens, which is still sufficient for most document-level tasks.

Function calling support uses the same schema as Llama 3.2, which is more flexible than the 3.1 format. The BFCL v2 score of 77.3 is competitive, though it trails the 405B slightly. Multilingual performance is strong across the eight supported languages; MGSM at 91.1 is within 0.5 points of the 405B's 91.6.

One limitation to note: the model is text-only. There are no vision capabilities here. If your pipeline needs image understanding, Llama 4 Maverick or Llama 4 Scout are the open-weight alternatives with native multimodal support.

Pricing and Availability

Weights are on HuggingFace under the Llama 3.3 Community License. Commercial use is permitted up to 700M monthly active users, above which Meta requires a separate agreement. The license isn't formally OSI-approved open source, but it's functionally open for nearly all commercial use cases.

For self-hosting, VRAM requirements by quantization level:

Quantization	VRAM Required
FP16 (full precision)	~141 GB
INT8 (8-bit)	~75 GB
Q4 (4-bit)	~43 GB

Two RTX 4090s pooled (48 GB) covers Q4 with minimal headroom. A single H100 (80 GB) handles INT8 comfortably. For production serving with multiple concurrent users, vLLM is the right framework; for solo local use, Ollama is the simplest path in.

Across API providers, pricing varies significantly:

Provider	Input Price	Output Price	Output Speed
Groq	$0.59/M	$0.79/M	314 tokens/s
DeepInfra	$0.35/M	$0.40/M	~120 tokens/s
Together AI	$0.88/M	$0.88/M	~80 tokens/s
Fireworks AI	$0.90/M	$0.90/M	~85 tokens/s

Groq leads on speed by a wide margin at 314 tokens/second, making it the right choice for latency-sensitive applications like chatbots. DeepInfra is the cheapest option if throughput matters more than latency.

Strengths

IFEval 92.1 and MATH 77.0 both beat Llama 3.1 405B, at 5x lower serving cost
4.1% hallucination rate on Vectara HHEM - lowest of any open-weight model tested
Competitive function calling (BFCL v2 77.3) and multilingual performance (MGSM 91.1)
Available on HuggingFace under a broadly permissive commercial license
Wide provider support with Groq delivering 314 tokens/s output speed
Self-hostable on dual RTX 4090 at Q4 quantization

Weaknesses

Text-only: no image, audio, or video input support
Knowledge cutoff of December 2023 is now over two years stale
Practical context length degrades past ~100K tokens despite the 128K advertised window
MMLU-Pro 68.9 still trails the 405B (73.3) and frontier closed-source models
Q4 quantization can degrade complex reasoning tasks compared to FP16
W8A8 per-channel quantization shows known accuracy degradation compared to other model families

Llama 4 Maverick - Meta's current multimodal flagship
Hallucination Benchmarks Leaderboard
Best AI Models for Text Summarization
Open Source LLM Leaderboard
Best Open-Source LLMs You Can Self-Host in 2026

FAQ

Is Llama 3.3 70B better than Llama 3.1 405B?

On instruction following (IFEval 92.1 vs 88.6) and math (MATH 77.0 vs 73.8) it is. The 405B leads on MMLU-Pro and BFCL v2. For most deployments the 70B is the better choice: 5x cheaper and 5x less VRAM.

Can I use Llama 3.3 70B commercially?

Yes. The Llama 3.3 Community License permits commercial use. Businesses with over 700M monthly active users must contact Meta for a separate license agreement.

What's the minimum GPU setup for self-hosting?

At Q4 quantization, Llama 3.3 70B needs about 43 GB of VRAM. Two NVIDIA RTX 4090 GPUs (48 GB pooled) is the minimum practical consumer setup. A single H100 or A100 80GB handles INT8 with headroom.

Why does Llama 3.3 70B score well on hallucination?

On the Vectara HHEM benchmark, which tests how often a model introduces unsupported claims when summarizing a document, the model scores 4.1% - lower than GPT-4o (9.6%) and GPT-4.1 (5.6%). Smaller, more tightly aligned models can beat larger general-purpose models on faithfulness tasks.

What inference frameworks support Llama 3.3 70B?

vLLM, Ollama, llama.cpp, SGLang, and the HuggingFace Transformers pipeline all support the model. vLLM is recommended for production multi-user serving. Ollama is the easiest local setup.

Sources: