Liquid AI Drops LFM2-24B - A 24 Billion Parameter Model That Runs on Your Laptop
MIT spinoff Liquid AI releases LFM2-24B-A2B, a hybrid mixture-of-experts model that activates only 2.3B parameters per token, fits in 32GB RAM, and hits 112 tokens per second on a consumer CPU.

TL;DR
- Liquid AI releases LFM2-24B-A2B, a 24B parameter model that activates only 2.3B params per token
- Fits in 32GB RAM at Q4 quantization - runs on consumer laptops and desktops
- Hits 112 tokens/sec on AMD Ryzen AI Max+ 395 CPU, 26,800 tok/s aggregate on a single H100
- Hybrid architecture: 30 gated convolution layers + 10 grouped-query attention layers
- Available on Hugging Face with GGUF quantizations, llama.cpp, vLLM, and LM Studio support
The Pitch - 24 Billion Parameters, 2.3 Billion Active
Liquid AI, the MIT spinoff valued at over $2 billion, released LFM2-24B-A2B on Tuesday - a hybrid mixture-of-experts model that packs 24 billion total parameters but activates only 2.3 billion per token. The result is a model with the knowledge capacity of a large model but the inference cost of a small one.
The math works out to roughly 9.6% of total parameters firing on any given forward pass. The rest sit dormant, waiting for queries that match their expertise. This is the core premise of mixture-of-experts architectures, but Liquid AI's implementation is anything but standard.
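The routing step is simple to sketch. Here is a minimal, illustrative top-k gate using the top-4-of-64 numbers the article reports; all names are hypothetical and this is not Liquid AI's actual implementation:

```python
# Minimal sketch of top-k expert routing (top 4 of 64, per the spec table).
# Illustrative only -- not Liquid AI's implementation.
import numpy as np

NUM_EXPERTS, TOP_K = 64, 4

def route(hidden, gate_weights):
    """Score all experts, keep the TOP_K best, softmax-normalize their mix."""
    logits = hidden @ gate_weights                 # one score per expert
    top = np.argsort(logits)[-TOP_K:]              # indices of the 4 winners
    scores = np.exp(logits[top] - logits[top].max())
    return top, scores / scores.sum()              # experts, mixing weights

rng = np.random.default_rng(0)
experts, weights = route(rng.standard_normal(128),
                         rng.standard_normal((128, NUM_EXPERTS)))
print(len(experts), round(weights.sum(), 6))       # → 4 1.0
```

Only the four selected experts' feed-forward weights are touched on that token, which is how 24B total parameters shrink to 2.3B active.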
Not Your Typical Transformer
LFM2-24B-A2B's architecture is a hybrid that would make a Transformer purist uncomfortable. Of its 40 layers, only 10 use grouped-query attention (GQA), a memory-efficient variant of the attention mechanism behind GPT, Claude, and most frontier models. The other 30 layers use gated short-convolution blocks with a kernel size of 3.
| Specification | Details |
|---|---|
| Total Parameters | 24 billion |
| Active Parameters/Token | 2.3 billion |
| Architecture | Hybrid Conv + GQA MoE |
| Experts per MoE Block | 64 (top-4 routing) |
| Context Window | 32,768 tokens |
| Training Data | 17 trillion tokens (ongoing) |
| RAM Required (Q4_K_M) | ~32 GB |
| Languages | English, Arabic, Chinese, French, German, Japanese, Korean, Spanish, Portuguese |
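The memory row above survives a back-of-envelope check. Assuming an effective rate of roughly 4.85 bits per weight for Q4_K_M (a typical figure for llama.cpp's mixed-precision quants, not a published spec):

```python
# Back-of-envelope check on the "fits in 32 GB" claim.
# 4.85 bits/weight is an assumed effective rate for Q4_K_M, not a spec.
TOTAL_PARAMS = 24e9
BITS_PER_WEIGHT_Q4_K_M = 4.85

weights_gb = TOTAL_PARAMS * BITS_PER_WEIGHT_Q4_K_M / 8 / 1e9
print(f"{weights_gb:.1f} GB")   # ≈ 14.5 GB of quantized weights
# Well under 32 GB -- the rest is headroom for the KV cache (only 10 of
# 40 layers need one), activations, the runtime, and the OS.
```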
This design choice is not arbitrary. Liquid AI's hardware-in-the-loop architecture search validated that convolution-dominant stacks outperform alternatives under fixed on-device performance budgets. The team explicitly tested adding linear attention, state-space, and additional convolution operators - none improved aggregate quality given the same compute budget.
The practical payoff: depthwise convolutions have O(1) per-step decode cost, while attention's KV cache grows with context length. When 75% of your layers are convolutions, you get dramatically faster sequential decoding on CPUs.
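The constant-cost claim is easy to see in code. A causal depthwise convolution with kernel size 3 only ever needs the last three inputs per channel, so its decode state never grows, no matter how long the sequence gets. A toy sketch (illustrative, not LFM2's actual block):

```python
# Why conv layers decode cheaply: a kernel-size-3 causal depthwise conv
# keeps a fixed 3-slot buffer per channel, while an attention layer's
# KV cache grows with every token. Toy sketch, not LFM2's actual block.
import numpy as np

KERNEL = 3

def conv_decode_step(state, x_t, kernel):
    """One decode step: slide the buffer, mix the last 3 inputs."""
    state = np.roll(state, -1, axis=0)
    state[-1] = x_t                        # keep only the last KERNEL inputs
    return state, (kernel[:, None] * state).sum(axis=0)

dim = 8
kernel = np.ones(KERNEL) / KERNEL          # toy weights: a moving average
state = np.zeros((KERNEL, dim))            # O(1) memory at any context length
for t in range(1000):                      # decode 1,000 tokens...
    state, y = conv_decode_step(state, np.full(dim, float(t)), kernel)
print(state.shape)                         # → (3, 8): state never grew
```

An attention layer decoding the same 1,000 tokens would be carrying a 1,000-entry KV cache by this point, and reading all of it on every step.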
Speed That Matters
The headline numbers are real. On an AMD Ryzen AI Max+ 395 - a high-end mobile APU, not a server chip - the model decodes at 112 tokens per second using Q4_K_M quantization through llama.cpp.
| Platform | Decode Speed | Notes |
|---|---|---|
| AMD Ryzen AI Max+ 395 (CPU) | 112 tok/s | Q4_K_M via llama.cpp |
| NVIDIA H100 SXM5 (single stream) | 293 tok/s | vLLM |
| NVIDIA H100 SXM5 (1,024 concurrent) | 26,800 tok/s aggregate | vLLM continuous batching |
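The two H100 rows describe a latency-for-throughput trade, which the arithmetic makes concrete:

```python
# At 1,024 concurrent streams each individual stream slows to ~26 tok/s,
# but the GPU as a whole serves ~91x more tokens than single-stream mode.
aggregate, streams, single = 26_800, 1_024, 293

per_stream = aggregate / streams
speedup = aggregate / single
print(round(per_stream, 1), round(speedup, 1))   # → 26.2 91.5
```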
For context, the smaller LFM2-8B-A1B already runs at 48.6 tokens per second on a Samsung Galaxy S25. The 24B model is targeting laptop-class hardware, not phones, but the trend is clear: Liquid AI is building for the edge, not just the data center.
On the server side, the model surpasses both Qwen3-30B-A3B and gpt-oss-20b in GPU throughput under continuous batching on a single H100 - despite having fewer active parameters than either competitor.
How It Stacks Up
This is an early checkpoint release - training is still ongoing at 17 trillion tokens. Liquid AI says quality improves log-linearly as models scale from 350M to 24B total parameters. The final benchmarks will change, but the smaller models in the family give us anchor points.
For reference, the LFM2-8B-A1B (the previous flagship with 1.5B active params) already beats Llama 3.2-3B-Instruct across MMLU, MMLU-Pro, and GSM8K while matching or approaching Gemma-3-4b-it.
The 24B model is positioned against:
- Qwen3-30B-A3B (30.5B total, 3.3B active) - Liquid AI's primary benchmark target
- gpt-oss-20b (21B total, 3.6B active) - another MoE competitor
With only 2.3B active parameters versus 3.3-3.6B for the competition, the efficiency advantage is structural, not just incremental. Whether it can match on quality remains to be seen once training completes and the instruction-tuned LFM2.5-24B variant ships.
What Is Missing
The 32K context window is a notable limitation. Qwen3-30B-A3B offers 128K tokens. For summarization and short agentic tasks 32K is fine, but long-document analysis will hit that ceiling.
There is no reasoning or "thinking" mode yet. Liquid AI has confirmed that LFM2.5-24B-A2B - the post-trained version with reinforcement learning - is coming, but no date has been given.
The MIT Spinoff With $297 Million and a Worm Brain
Liquid AI was founded in 2023 by four MIT researchers: Ramin Hasani (CEO), Mathias Lechner (CTO), Alexander Amini (Chief Scientific Officer), and Daniela Rus, the director of MIT CSAIL and one of the most prominent robotics researchers in the world.
The company's intellectual origins are fascinatingly biological. The founding research studied C. elegans - a 1mm roundworm with exactly 302 neurons - and how its neural architecture processes information through graded analog signals rather than digital spikes. That work produced Liquid Time-Constant Networks in 2020, neural networks whose parameters dynamically change based on input.
The production LFM2 architecture has evolved past the original liquid neural network research into something more pragmatic - gated convolutions plus attention - but the efficiency-first philosophy remains.
Liquid AI has raised $297 million total, including a $250 million Series A led by AMD Ventures in December 2024 that valued the company at over $2 billion. That AMD relationship explains why the primary benchmark hardware is AMD's Ryzen AI Max+ rather than an Intel or Apple chip.
Getting Started
Weights are on Hugging Face with 10 GGUF quantization variants ready for llama.cpp. The model works with Transformers, vLLM, SGLang, MLX (Apple Silicon), and LM Studio out of the box. Fine-tuning is supported through Unsloth and TRL with LoRA, DPO, and GRPO.
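For a sense of what local inference looks like, here is a minimal llama-cpp-python sketch. The GGUF filename is a placeholder - grab the actual quant from the Hugging Face repo - and the weights must be downloaded before this will run:

```python
# Hypothetical local-inference sketch using llama-cpp-python
# (pip install llama-cpp-python). Filename below is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="lfm2-24b-a2b-Q4_K_M.gguf",  # assumed filename -- check the repo
    n_ctx=32768,                            # the full 32K context window
    n_threads=8,                            # tune to your CPU core count
)
out = llm.create_chat_completion(
    messages=[{"role": "user",
               "content": "Summarize mixture-of-experts in one sentence."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```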
One licensing caveat: LFM2 uses Liquid AI's custom LFM Open License v1.0, based on Apache 2.0 but with a revenue threshold. Companies under $10 million annual revenue can use it freely. Above that, you need a commercial license from Liquid AI. This makes it "open weight" rather than truly open source in the OSI sense - a distinction that matters if you are building a product around it.
What To Watch
The immediate question is what the final benchmarks look like once training completes past 17 trillion tokens. The LFM2 family's scaling curves suggest substantial headroom.
The bigger story is whether convolution-dominant architectures can genuinely compete with pure Transformers at scale. If LFM2.5-24B-A2B ships with competitive reasoning capabilities, Liquid AI will have proven that the Transformer is not the only viable path to capable AI - and that consumer hardware can run serious models without an API call in sight. For anyone interested in running open models locally, the 32GB RAM requirement puts this within reach of a well-equipped laptop or any modern desktop.