Rapid-MLX Is 2.6x Faster Than Ollama on Apple Silicon

Running a 9B model on a MacBook Pro has rarely produced numbers worth writing about. Rapid-MLX v0.6.74, released Wednesday on GitHub, changes that. The open-source inference engine clocks 108 tokens per second on Qwen3.5-9B against Ollama's 41 on the same hardware - a 2.6x gap that persists across model sizes and comes from two concrete architectural choices the project documents in detail.

TL;DR

Apache 2.0 license, available via Homebrew or pip, no account required
Benchmarks 2.6x faster than Ollama on Qwen3.5-9B; up to 2.9x faster time-to-first-token on 24B models
Supports 66 model aliases across 13 families including DeepSeek V4, Qwen, Gemma, and Llama
Drop-in OpenAI-compatible server that works with Claude Code, Cursor, and Aider without configuration

What Makes It Faster

The speed advantage comes from two decisions: a prompt caching layer that most local inference servers skip, and confirmed speculative decoding for a subset of supported models.

Prompt Cache Architecture

Most local inference engines cold-start on every request. Rapid-MLX maintains a KV cache for transformer models and records full RNN state snapshots for hybrid architectures like Qwen3.5. On Devstral-Small-24B and Mistral Small 24B, cached time-to-first-token drops from 0.38 seconds cold to 0.13 seconds warm - a 2.9x improvement in the latency that matters most for interactive coding assistants.

For DeltaNet-based hybrid models the difference is larger. Qwen3-Coder-Next at 6-bit quantization falls from 0.66 seconds cold to 0.16 seconds warm, a 4.3x improvement. Multi-turn conversations effectively get free restarts.

Speculative Decoding

The project ships DFlash block-diffusion speculative decoding, currently confirmed on Qwen3.5 and Qwen3.6 27B variants. Measured throughput gain sits at 1.31-1.49x on top of base speed. It's opt-in and validated per model family rather than applied globally - a meaningful distinction, since a server that silently misapplies speculative decoding produces incorrect output without obvious error signals.

Installation and Basic Use

Homebrew is the simplest path:

brew install raullenchai/rapid-mlx/rapid-mlx

Or via pip:

pip install rapid-mlx

# With vision endpoint support
pip install 'rapid-mlx[vision]'

Running a model as an OpenAI-compatible HTTP server on port 8000:

rapid-mlx serve qwen3.5-4b

Sharing a model over a public HTTPS tunnel for remote access:

rapid-mlx share qwen3.6-27b-8bit

That last command is practical for developers running inference on a Mac Studio while connecting from another machine or editor. The server exposes /v1/chat/completions in standard OpenAI format, so Cursor, Claude Code, and Aider pick it up without changes to their configuration.

Rapid-MLX GitHub repository showing benchmark results and 2.7k stars The Rapid-MLX GitHub page as of June 4, 2026, showing the benchmark table and 2.7k stars. Source: github.com

Benchmark Numbers

The figures below come from the project's own README, tested on Apple Silicon. They represent the best case for Rapid-MLX, not independent third-party benchmarks.

Model	Rapid-MLX (tok/s)	Ollama (tok/s)	mlx-lm (tok/s)
Qwen3.5-4B	160	-	155
Qwen3.5-9B	108	41	-
Qwen3.5-35B (8-bit)	83	-	75
Qwen3.5-122B	44	-	43
GPT-OSS 20B	127	-	79

The Ollama comparison appears only for Qwen3.5-9B in the current README. The 2.6x headline applies to that specific model and configuration. Other comparisons are against mlx-lm, Apple's own MLX inference library, where gains range from 1.0x to 1.6x depending on the model.

On cached time-to-first-token, Hermes-3-Llama 8B drops from 0.18 seconds to 0.10 seconds. Kimi-Linear-48B reaches 0.08 seconds cached. These numbers matter for the coding assistant use case where latency on the first token determines whether the tool feels responsive or sluggish.

Hardware Requirements

The project maps models to RAM tiers explicitly:

Mac RAM	Recommended Model	Speed
16 GB	Qwen3.5-4B (4-bit)	160 tok/s
24 GB	Qwen3.5-9B (4-bit)	108 tok/s
32 GB	Nemotron-Nano 30B (4-bit)	141 tok/s
48 GB	Qwen3.5-35B (8-bit)	83 tok/s
96+ GB	Qwen3.5-122B (mxfp4)	57 tok/s
128 GB	DeepSeek V4 Flash (2-bit)	56 tok/s

Apple M3 chip as seen in a teardown review Apple's M3 chip - the unified memory architecture that makes running large models locally on Mac viable. Source: commons.wikimedia.org

The DeepSeek V4 Flash row is worth pausing on. Running a 158B-parameter model locally at 56 tokens per second on a Mac Studio M3 Ultra - even at 2-bit quantization - would have been an unusual claim a year ago. The Apple M5 Pro Max arrived in late 2025 carrying enough memory bandwidth to push this further.

What It Plugs Into

Tool calling works across 17 different parser formats with automatic recovery for malformed output from quantized models - a specific pain point that most servers leave to the application layer. For Qwen 3.5 and GLM models, chain-of-thought reasoning output lands in a separate reasoning_content field rather than mixed into the main response, so frameworks can log or suppress it without string parsing.

Compatible integrations by category:

IDEs and editors: Cursor, Continue.dev, Claude Code, Claw Code, OpenCode
Agent frameworks: PydanticAI, LangChain, smolagents, Aider, Anthropic SDK
Chat UIs: Open WebUI, LibreChat, Big-AGI

The project's Model-Harness Index (MHI) scores tool calling accuracy on combined benchmarks. Qwopus 27B reaches 92/100 and Qwen3.5-27B reaches 82/100 under that scoring. Both are open-weight models available in tools for running LLMs locally.

Where It Falls Short

Rapid-MLX is at v0.6.74 with 2.7k GitHub stars and 606 commits. That's substantial for a project of this scope, but Ollama has years of ecosystem momentum and is the default recommendation in most local inference guides. A few practical limits apply.

Platform lock-in. Rapid-MLX runs only on Apple Silicon. The MLX framework it builds on is Apple-proprietary. AMD, Intel, Windows, and Linux users need Ollama, llama.cpp, or vLLM.

Self-reported benchmarks. The throughput numbers come from the project's own testing runs. The Ollama comparison for Qwen3.5-9B is real, but Ollama's lower number partly reflects different default quantization settings. Independent benchmarks across a wider hardware range don't exist yet.

Speculative decoding scope. DFlash speculative decoding is validated only on certain Qwen3.5/3.6 27B variants. Enabling it on unvalidated models produces undefined behavior. The validated list is narrow.

The Apple MLX framework Rapid-MLX builds on has also seen notable contributor turnover since its creator Awni Hannun left Apple earlier this year. That's not a Rapid-MLX problem directly, but it's context for anyone considering how much production infrastructure to put on top of it.

The release is at github.com/raullenchai/Rapid-MLX under Apache 2.0. The Homebrew install is one command, model downloads happen automatically on first use, and any OpenAI-compatible client connects without changes.

Sources: