GPT-4 to Self-Hosted Llama 4 Migration Guide

Switch from OpenAI's GPT-4 API to self-hosted Llama 4 with near-zero code changes, but plan carefully for hardware, EU licensing, and real context window limits.

From: GPT-4 (OpenAI API) | To: Self-Hosted Llama 4 | Difficulty: Medium

TL;DR

  • API compatibility is very high - most code changes are just swapping a URL and model name
  • EU-based operators can't self-host Llama 4 due to its multimodal license terms
  • Coding tasks are a weak spot: Llama 4 Maverick scored only 16% on Aider Polyglot vs GPT-4o's ~40%
  • Difficulty: Medium; budget 2-4 days for a full production setup with vLLM

Overview

If you're paying OpenAI's API bills, the math makes Llama 4 look attractive. GPT-4o costs $2.50 per million input tokens and $10.00 per million output tokens. A hosted Llama 4 Scout on Groq runs at roughly $0.11 per million tokens - a drop that adds up fast on large workloads. Self-hosting pushes the variable cost close to zero, with only hardware or cloud GPU rental as the ongoing expense.

The good news is that the switch is mostly mechanical. Meta designed Llama 4 to be served through OpenAI-compatible endpoints, so vLLM, Ollama, and llama.cpp all expose /v1/chat/completions with the same request and response format you're already using. For basic chat and streaming, the code change can truly be just a line or two.

The bad news has a few dimensions. The advertised 10M-token context window for Scout is largely theoretical - practical quality degrades past ~256K tokens. Coding performance lags noticeably behind GPT-4o. And if your team is based in the EU, Meta's Community License Agreement explicitly bars EU-domiciled operators from installing or fine-tuning Llama 4's multimodal models - which is every Llama 4 model, since they're all natively multimodal. Read Section 6 (Gotchas) carefully before committing.

With realistic expectations in hand, though, this is one of the cleaner API migrations in the open-source LLM space. You don't need to rewrite your prompts from scratch, and the same SDK you use for OpenAI works against a local Llama 4 server.

Feature Parity Table

| Feature | GPT-4o (OpenAI) | Self-hosted Llama 4 | Notes |
|---|---|---|---|
| Chat completions | POST /v1/chat/completions | POST /v1/chat/completions | Direct equivalent |
| Streaming | stream: true via SSE | stream: true via SSE | Direct equivalent |
| Tool/function calling | tools[], tool_choice | tools[], tool_choice | Full in vLLM; partial in Ollama (no tool_choice) |
| Parallel tool calls | Yes | Yes (Llama 4 only, not Llama 3) | Direct equivalent in vLLM |
| Image input | image_url in content array | image_url in content array | Direct equivalent; vLLM supports up to 10 images |
| JSON mode | response_format: {type: "json_object"} | Same | Direct equivalent |
| Logprobs | Yes | Yes (vLLM); No (Ollama) | Partial equivalent |
| Embeddings | POST /v1/embeddings | POST /v1/embeddings | Direct equivalent |
| Structured outputs (schema) | Yes | Limited | Partial equivalent |
| Assistants API | Yes (threads, runs, file search) | No equivalent | No equivalent |
| Batch API | Yes (50% discount) | No equivalent | No equivalent |
| Context window | 128K tokens | Scout: up to 256K practical; Maverick: up to 1M | Llama 4 advantage (with caveats) |
| Multimodal (text + image) | Yes | Yes | Direct equivalent |
| Fine-tuning | Via API | Via Hugging Face or local scripts | Partial equivalent |
| Usage/billing dashboard | Yes | Self-managed | No equivalent |

API Mapping

The base URL and authentication are the only mandatory changes for basic usage.

Before (OpenAI):

Base URL: https://api.openai.com/v1
Authorization: Bearer sk-your-openai-key
Model: gpt-4o

After (self-hosted via vLLM on localhost):

Base URL: http://localhost:8000/v1
Authorization: Bearer EMPTY  (any string works)
Model: meta-llama/Llama-4-Scout-17B-16E-Instruct

After (via Ollama on localhost):

Base URL: http://localhost:11434/v1
Authorization: Bearer ollama
Model: llama4:scout

vLLM is the recommended production server for self-hosted Llama 4 - it exposes OpenAI-compatible endpoints and shipped day-0 support for Llama 4's MoE architecture and chunked attention. Source: github.com/vllm-project

Request format - basic chat

No changes needed:

# Works identically against OpenAI or self-hosted Llama 4
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # only change
    api_key="EMPTY",                      # only change
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain what a vector database does."},
    ],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)

Request format - streaming

stream = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "Write a haiku about migrations."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Request format - tool calling (vLLM)

Tool definitions use the same schema as OpenAI. When serving with vLLM, add --tool-call-parser llama4_pythonic --enable-auto-tool-choice to your vllm serve command first.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)
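When the model elects to call the tool, the assistant message carries tool_calls instead of text content. A minimal dispatch sketch - the get_weather implementation and handle_response helper here are illustrative stand-ins, not part of any SDK:

```python
import json

def get_weather(location, unit="celsius"):
    # Stand-in implementation; replace with a real weather lookup.
    return {"location": location, "temp": 18, "unit": unit}

def handle_response(message):
    """Dispatch any tool calls in an assistant message and return
    follow-up messages in the OpenAI "tool" role format."""
    tool_messages = []
    for call in message.tool_calls or []:
        args = json.loads(call.function.arguments)
        if call.function.name == "get_weather":
            result = get_weather(**args)
            tool_messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return tool_messages
```

The returned tool messages are appended to the conversation and sent back in a second chat.completions.create call, exactly as with OpenAI.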

Request format - image input

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this diagram show?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
            ],
        }
    ],
)

Code Examples

Start a vLLM server

Scout on a single H100 with INT4 (AWQ) quantization (recommended starting point):

pip install -U vllm

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --quantization awq \
  --max-model-len 131072 \
  --tool-call-parser llama4_pythonic \
  --enable-auto-tool-choice \
  --port 8000

Scout on 8x H100 with full 1M context (production):

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 8 \
  --max-model-len 1000000 \
  --override-generation-config='{"attn_temperature_tuning": true}' \
  --kv-cache-dtype fp8 \
  --tool-call-parser llama4_pythonic \
  --enable-auto-tool-choice

Maverick FP8 on 8x H100 (~430K context):

VLLM_DISABLE_COMPILE_CACHE=1 vllm serve meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 430000

Start Ollama (development)

# Pull and run Scout (67GB download)
ollama run llama4:scout

# Or Maverick (245GB download)
ollama run llama4:maverick

# Then point your OpenAI client at Ollama
export OPENAI_BASE_URL=http://localhost:11434/v1
export OPENAI_API_KEY=ollama

Environment variable swap pattern

For teams using environment variables for configuration, this is the full change set:

# Before
export OPENAI_API_KEY="sk-..."
export OPENAI_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o"

# After (vLLM local)
export OPENAI_API_KEY="EMPTY"
export OPENAI_BASE_URL="http://localhost:8000/v1"
export MODEL_NAME="meta-llama/Llama-4-Scout-17B-16E-Instruct"

If your code reads these variables, no application code changes are needed.
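A sketch of what that looks like on the application side, assuming those three variables are your only LLM configuration (the LLMConfig helper is hypothetical, not from any SDK):

```python
import os
from dataclasses import dataclass

@dataclass
class LLMConfig:
    base_url: str
    api_key: str
    model: str

def load_config() -> LLMConfig:
    # Reads the same variables for both backends; swapping OpenAI for
    # self-hosted Llama 4 is then purely an environment change.
    return LLMConfig(
        base_url=os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        api_key=os.environ.get("OPENAI_API_KEY", ""),
        model=os.environ.get("MODEL_NAME", "gpt-4o"),
    )
```

Pass cfg.base_url and cfg.api_key straight into the OpenAI client constructor shown earlier.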

Pricing Impact

At 100M tokens/month, switching from GPT-4o to a hosted Llama 4 API saves roughly $430 per month at the blended rates below. Self-hosting that same volume can save more, but only if you already have engineers to manage it.

API pricing comparison (per 1M tokens, as of March 2026):

| Provider | Model | Input | Output | Blended (3:1 ratio) |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | $4.38 |
| OpenAI | GPT-4o mini | $0.15 | $0.60 | $0.26 |
| Groq | Llama 4 Scout | ~$0.11 | ~$0.11 | ~$0.11 |
| Together AI | Llama 4 Maverick FP8 | ~$0.20 | ~$0.20 | ~$0.20 |
| Self-hosted (1x H100, Scout Q4) | Llama 4 Scout | - | - | ~$0.30-$0.49* |

*Self-hosted cost is amortized GPU rental at moderate use. At low utilization, effective token cost rises sharply.

When self-hosting wins (and when it doesn't):

Under 10M tokens/month, the OpenAI API is almost certainly cheaper once you count engineering time. Between 10M and 500M tokens/month, a hosted Llama 4 API like Groq is the sweet spot - 40x cheaper than GPT-4o with zero infrastructure work. Self-hosting only becomes clearly ROI-positive above 500M tokens/month, assuming you have existing DevOps capacity. Most teams underestimate this threshold.

For a concrete example: processing 10M tokens/month costs roughly $43.80 on GPT-4o, $1.10 on Groq's Llama 4, and $90-$150 on a self-hosted H100 (at typical 30-40% GPU use).
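The arithmetic behind those figures is simple enough to script; a throwaway sketch using the blended rates from the comparison table (estimates, not quoted prices):

```python
def monthly_cost(tokens_per_month: float, rate_per_million: float) -> float:
    """Cost in USD given a blended per-1M-token rate."""
    return tokens_per_month / 1_000_000 * rate_per_million

RATES = {  # blended $/1M tokens, from the comparison table above
    "gpt-4o": 4.38,
    "groq-llama4-scout": 0.11,
}

for name, rate in RATES.items():
    print(f"{name}: ${monthly_cost(10_000_000, rate):.2f}/month")
# gpt-4o: $43.80/month
# groq-llama4-scout: $1.10/month
```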

See our guide on running open-source LLMs locally for a full hardware cost breakdown.

[Image: server racks in a data center. Source: commons.wikimedia.org (Victorgrigas, CC BY-SA 3.0)] Running Llama 4 Maverick in full BF16 precision needs roughly 800GB of VRAM - that's about 10 H100 80GB GPUs. Quantized versions bring this into reach for smaller clusters.

Known Gotchas

  1. EU licensing blocks self-hosting. Llama 4's Community License Agreement explicitly excludes EU-domiciled individuals and companies with principal place of business in the EU from installing or fine-tuning Llama 4 multimodal models. Since all Llama 4 models are multimodal, this covers every variant. End-users of products built on Llama 4 aren't restricted - only operators running the model. If your company is EU-based, consult legal before proceeding. Using a non-EU cloud provider doesn't clearly resolve the issue.

  2. The 10M context window is theoretical. Scout's advertised 10M-token context was tested with limited evaluation coverage. The model was pre-trained on sequences up to 256K tokens. In practice, quality degrades noticeably past ~120K-300K tokens. Use 256K as your working limit for production prompts requiring high accuracy.

  3. Coding tasks lag significantly. On the Aider Polyglot benchmark (real-world multi-file edits), Maverick scores around 16% - well below GPT-4o and Claude Sonnet. If code generation is your primary use case, test thoroughly before migrating. Llama 4 works fine for code explanation and review; it struggles with complex, multi-step code generation.

  4. vLLM compilation cache bug. You must set VLLM_DISABLE_COMPILE_CACHE=1 when serving Llama 4, or vLLM hits a compilation error due to Llama 4's chunked attention pattern. This is a known issue with vLLM and specific to Llama 4's architecture.

  5. Long-context temperature scaling needed. When using Scout with context over ~100K tokens via vLLM, pass --override-generation-config='{"attn_temperature_tuning": true}' or output quality degrades. This flag activates NoPE layer temperature scaling designed for Llama 4's iRoPE architecture.

  6. Ollama doesn't support tool_choice. If your GPT-4 code uses tool_choice="required" or specifies a named function with tool_choice={"type": "function", "function": {"name": "..."}}, Ollama ignores it silently. Use vLLM for any production tool-calling workloads.

  7. Ollama doesn't support streaming with tool calls. stream=True combined with tools will error in Ollama. Set stream=False for tool-calling requests when using Ollama.

  8. llama.cpp tool_call argument format bug. The llama-server binary returns tool_calls arguments as a JSON object instead of a JSON string, which breaks strict OpenAI client parsing (GitHub issue #20198). Workaround: check isinstance(args, dict) and skip json.loads() if true.

  9. Verbose outputs by default. Llama 4 models tend to over-explain and repeat themselves compared to GPT-4o's more clipped default style. Add "Be concise. Do not repeat yourself." to your system prompts to tighten output density.

  10. tiktoken doesn't count Llama 4 tokens accurately. Llama 4 uses its own tokenizer with a larger vocabulary including many multimodal special tokens. If you're using tiktoken for token budget estimates, switch to transformers' AutoTokenizer from the Llama 4 model repo for accurate counts.

  11. Benchmark controversy around Maverick. Meta submitted a non-public "experimental chat version" of Maverick to the LMArena leaderboard that scored second overall. The public release performs substantially lower - when LMArena applies style controls to isolate content quality, Maverick drops from 2nd to 5th place. Factor this into your evaluation.

  12. No Llama 4 small model. Unlike Llama 3.x which had 8B and 70B options, Llama 4 only ships the large MoE variants. If you need a lightweight model for latency-sensitive or resource-constrained tasks, Llama 4 Scout is the smallest option at 17B active parameters - still requiring 24GB VRAM minimum for a quantized run.
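For gotcha 8, the workaround can be a small defensive parser that accepts both the spec-compliant string form and llama.cpp's object form:

```python
import json

def parse_tool_arguments(raw):
    """Return tool-call arguments as a dict, tolerating servers that
    send them as an already-parsed JSON object (llama.cpp's
    llama-server) instead of the JSON string the OpenAI spec uses."""
    if isinstance(raw, dict):
        return raw
    return json.loads(raw)
```

Calling this instead of a bare json.loads() keeps the same client code working against OpenAI, vLLM, and llama.cpp.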

FAQ

Can I use the same Python OpenAI SDK?

Yes. Point the base_url to your local vLLM or Ollama endpoint and set any api_key string. The SDK works without modification for chat completions, streaming, and tool calling.

Will my system prompts work without changes?

Mostly. GPT-4o's RLHF tuning produces more concise, naturally hedged defaults. Llama 4 follows instructions more literally and can be more verbose. Expect to tune system prompts for response length and tone, but no structural rewrites.
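One low-effort approach is to append a brevity suffix to existing system prompts rather than rewriting them; tighten_system_prompt below is a hypothetical helper, not an SDK function:

```python
CONCISENESS_SUFFIX = " Be concise. Do not repeat yourself."

def tighten_system_prompt(messages):
    """Append a brevity instruction to the system message, or prepend
    one if the conversation has no system message yet."""
    out = [dict(m) for m in messages]  # shallow copies; caller's list untouched
    for m in out:
        if m["role"] == "system":
            m["content"] = m["content"].rstrip() + CONCISENESS_SUFFIX
            return out
    return [{"role": "system", "content": CONCISENESS_SUFFIX.strip()}] + out
```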

Is there an OpenAI-to-Llama proxy or compatibility layer?

No official one. vLLM and Ollama both expose native OpenAI-compatible endpoints, which is the standard approach. Third-party proxies like LiteLLM can also route OpenAI-format requests to local Llama servers.

Does Llama 4 support the legacy functions parameter?

No. The functions parameter was deprecated by OpenAI and isn't implemented in Llama 4 servers. Convert functions to the tools array format (wrapping each function in {"type": "function", "function": {...}}) before migrating.
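The wrapping is mechanical enough to script; a sketch:

```python
def functions_to_tools(functions):
    """Wrap legacy `functions` entries in the current `tools` format."""
    return [{"type": "function", "function": f} for f in functions]
```

Remember to also rename any function_call argument to tool_choice in the same pass.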

How long does it take to download and run Scout locally?

The Ollama Q4 version of Scout is ~67GB. On a 1Gbps connection that's roughly 9 minutes. First generation on an H100 takes about 30-60 seconds for model load, then runs at 15-20 tokens/second. Maverick's Q4 is 245GB - allow 30+ minutes on fast connections.

Can I use the Assistants API features (threads, file search, code interpreter)?

No. The Assistants API is an OpenAI-specific layer with no equivalent in the Llama 4 self-hosted ecosystem. Teams using threads, runs, or file search need a full rewrite - consider a framework like LlamaIndex for retrieval or LangChain for orchestration as replacements.


Last verified: March 20, 2026

About the author

Priya (AI Education & Guides Writer) is an AI educator and technical writer whose mission is making artificial intelligence approachable for everyone - not just engineers.