Hailo-10H - Edge AI With On-Device LLMs
Complete specs, benchmarks, and analysis of the Hailo-10H - a 2.5W edge AI accelerator with 40 TOPS INT4, on-module LPDDR4, and the ability to run LLMs and VLMs on a Raspberry Pi at nearly 10 tokens per second.

TL;DR
- 40 TOPS INT4 / 20 TOPS INT8 edge AI accelerator in an M.2 form factor at just 2.5W typical power consumption
- First edge accelerator under 5W that can run LLMs - Qwen2 1.5B at 9.45 tokens per second, Llama 3.2 1B, DeepSeek R1 Distill
- On-module 4/8GB LPDDR4/4X solves the critical limitation of the Hailo-8 - models no longer limited by on-chip SRAM
- Available as the $130 Raspberry Pi AI HAT+ 2 or as standalone M.2 modules for embedded integration
- AEC-Q100 Grade 2 automotive certified - targets smart cameras, vehicles, robotics, and consumer devices
Overview
The Hailo-10H occupies a unique niche in AI hardware. While datacenter accelerators like the NVIDIA H100 and AWS Trainium3 push performance ceilings with kilowatts of power and terabytes of HBM, the Hailo-10H brings generative AI to devices that run on batteries. At 2.5W typical power draw, it runs a 1.5-billion-parameter language model at nearly 10 tokens per second - fast enough for responsive on-device chatbots, voice assistants, and vision-language tasks without any cloud connection.
This is the second-generation AI processor from Hailo, the Israeli chip company that built its reputation on efficient edge vision processing. The original Hailo-8 (26 TOPS INT8) was a capable vision accelerator, but it had a hard limitation: no external memory interface. All model weights had to fit in the chip's on-chip SRAM, which capped the model size at relatively small vision networks. The Hailo-10H fixes this with a direct DDR interface to on-module LPDDR4/4X memory (4GB or 8GB configurations), enabling models with billions of parameters - LLMs, vision-language models, and stable diffusion - that the Hailo-8 could never run.
The architecture uses Hailo's proprietary dataflow design, where the chip's compute, memory, and control blocks are physically distributed across the die and allocated to specific layers of the neural network at compile time. Instead of fetching instructions and scheduling threads (like a GPU), the Hailo-10H creates a custom data pipeline for each model, with data flowing through dedicated hardware blocks. This dataflow approach is what enables 40 TOPS at 2.5W - a power efficiency of 16 TOPS per watt that no general-purpose processor can match.
The Hailo-10H launched commercially in July 2025 and quickly found its most visible platform: the Raspberry Pi AI HAT+ 2 at $130, which pairs the Hailo-10H (8GB) with the Raspberry Pi 5. This combination puts on-device LLM inference, real-time object detection, and vision-language processing on a $200 total platform - opening generative AI to hobbyists, educators, and embedded developers who have no access to cloud GPUs.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | Hailo |
| Architecture | Second-generation neural core (dataflow) |
| AI Performance (INT4) | 40 TOPS |
| AI Performance (INT8) | 20 TOPS |
| Typical Power | 2.5W |
| On-Module Memory | 4 GB or 8 GB LPDDR4/4X |
| Host Interface | PCIe Gen 3.0 x4 |
| Form Factors | M.2 2242 Key M, M.2 2280 Key M, Chip On Board |
| Host Architecture | x86 and ARM (aarch64) |
| OS Support | Linux, Windows, Android |
| Industrial Temp Range | -40C to 85C |
| Automotive Temp Range | -40C to 105C |
| Automotive Qualification | AEC-Q100 Grade 2 |
| Quantization | W4A8 (4-bit weights, 8-bit activations) |
| KV-Cache | INT8, up to 2048 tokens |
| Process Node | Not disclosed |
Dataflow Architecture
Hailo's architecture is fundamentally different from GPUs and even from most other AI accelerators. The key concepts:
Distributed Building Blocks. The silicon is organized as a collection of compute, memory, and control blocks distributed across the die. These aren't fixed-function units for specific operations (like NVIDIA's Tensor Cores); they're generic building blocks that get assigned to specific neural network layers during compilation.
Compile-Time Mapping. The Hailo Dataflow Compiler decomposes each neural network into a resource graph and maps it onto the physical hardware. Compute blocks for each layer are placed as close as possible to their corresponding memory blocks on the die, minimizing data travel distance and power consumption. This physical proximity is what enables the extreme power efficiency.
Model-Specific Pipelines. Each compiled model creates a unique data pipeline through the chip. Data streams through the allocated blocks without global scheduling, instruction fetch, or thread management overhead. This is why the chip achieves near-peak utilization at minimal power - there's no wasted energy on control overhead.
The trade-off is flexibility. Unlike a GPU that can run any program, the Hailo-10H requires each model to be compiled through the Hailo Dataflow Compiler. New models and architectures need explicit compiler support, and the compilation process itself takes significant time and expertise.
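The compile-time mapping described above can be sketched as a toy allocator. This is an illustration of the idea only - not Hailo's actual compiler - and every name and number in it is hypothetical: the point is that all block placement is resolved before inference, so a model that doesn't fit fails at compile time, not at runtime.

```python
# Illustrative sketch (NOT Hailo's toolchain): a compile-time allocator
# that assigns each network layer a fixed share of generic compute and
# memory blocks, turning inference into a static pipeline with no
# runtime scheduling. All names and budgets here are hypothetical.
from dataclasses import dataclass

@dataclass
class BlockBudget:
    compute: int  # generic compute blocks available on the die
    memory: int   # generic memory blocks available on the die

def map_layers(layers, budget):
    """Greedily allocate die blocks to layers at compile time.

    layers: list of (name, compute_need, memory_need) tuples.
    Returns a static layer -> block-count plan, or raises if the model
    does not fit - mirroring how a dataflow compiler must resolve all
    placement before the chip ever sees data.
    """
    plan = {}
    for name, c_need, m_need in layers:
        if c_need > budget.compute or m_need > budget.memory:
            raise ValueError(f"model does not fit: {name}")
        budget.compute -= c_need
        budget.memory -= m_need
        plan[name] = {"compute_blocks": c_need, "memory_blocks": m_need}
    return plan

plan = map_layers(
    [("conv1", 4, 2), ("conv2", 8, 4), ("fc", 2, 6)],
    BlockBudget(compute=32, memory=16),
)
```

Once the plan exists, no per-layer decisions remain at runtime - which is the source of both the efficiency and the inflexibility described below.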
Performance Benchmarks
LLM Performance
| Model | Hailo-10H (8GB) | Notes |
|---|---|---|
| Qwen2-1.5B-Instruct | 9.45 tok/s | Time-to-first-token: 289ms (96 input tokens) |
| DeepSeek R1 Distill Qwen 1.5B | ~6.5 tok/s | Measured on Raspberry Pi AI HAT+ 2 |
| Llama 3.2 1B | ~6-7 tok/s | Estimated from community benchmarks |
| Llama2-7B | ~10 tok/s | Demonstrated by Hailo (likely INT4 quantization) |
The LLM performance tells an interesting story. At 9.45 tokens per second on Qwen2-1.5B, the Hailo-10H delivers a usable, if not fast, interactive experience for on-device chatbots. The 289ms time-to-first-token is responsive enough for conversational AI. At 2.1W average power during inference, this works out to approximately 4.5 tokens per second per watt - a metric that no cloud GPU can approach.
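The efficiency figure follows directly from the two measured numbers:

```python
# Sanity-checking the per-watt efficiency claim from the measured
# figures: 9.45 tokens/s at 2.1 W average power during inference.
tokens_per_second = 9.45
avg_power_w = 2.1
tok_per_s_per_watt = tokens_per_second / avg_power_w
print(round(tok_per_s_per_watt, 2))  # -> 4.5
```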
Quantization Accuracy
| Benchmark | Quantized (W4A8) | Full Precision (FP16) |
|---|---|---|
| HellaSwag | 66.06 | 64.3 |
| C4 (perplexity) | 14.38 | 15.1 |
| WikiText2 (perplexity) | 10.08 | 10.5 |
The W4A8 quantization (4-bit weights, 8-bit activations), reached through Hailo's QuaROT + GPTQ fusion pipeline, actually improves HellaSwag accuracy and shows no meaningful perplexity degradation. This demonstrates that INT4 quantization on the Hailo-10H is not a significant accuracy trade-off for small language models.
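As a rough illustration of what W4A8 means in practice, here is a minimal symmetric quantization sketch in pure Python. It shows the generic 4-bit-weight / 8-bit-activation scheme only - it does not reproduce Hailo's QuaROT + GPTQ pipeline, and the sample values are made up:

```python
# Minimal W4A8 sketch: symmetric per-tensor quantization, 4-bit
# weights and 8-bit activations, with an integer dot product that is
# rescaled back to float at the end. Generic scheme, not Hailo's
# actual quantizer; sample values are illustrative.

def quantize(values, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for INT4, 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

weights = [0.12, -0.53, 0.34, -0.07]
q4, w_scale = quantize(weights, bits=4)            # weights  -> INT4
acts = [1.6, -0.2, 0.9, 0.4]
q8, a_scale = quantize(acts, bits=8)               # activations -> INT8

# Integer multiply-accumulate, then one float rescale at the end:
int_dot = sum(w * a for w, a in zip(q4, q8))
approx = int_dot * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, acts))  # FP reference
```

Even at 4-bit weights, the rescaled integer dot product lands close to the floating-point reference, which is the basic reason small-model accuracy can survive W4A8.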
Reality Check: CPU Comparison
Independent benchmarks from CNX Software testing the Raspberry Pi AI HAT+ 2 revealed a counterintuitive result: for pure LLM token generation, the Raspberry Pi 5's CPU (BCM2712 at full frequency) was sometimes faster than the Hailo-10H.
| Model | Hailo-10H (tok/s) | CPU-only BCM2712 (tok/s) |
|---|---|---|
| DeepSeek R1 1.5B | ~6.5 | ~9-10.6 |
| Qwen2 1.5B | ~6.7 | Higher than Hailo |
This does not invalidate the Hailo-10H's value proposition, but it reframes it. The accelerator's advantage is not raw token speed for small models - it is offloading. When the Hailo-10H runs the model, the CPU and system RAM are free for other tasks. Total system power with Hailo running is 7.2-7.6W versus 10.2-10.6W for CPU-only inference. For always-on applications (security cameras with VLM analysis, voice assistants, continuous monitoring), the offloading and power savings are more important than peak throughput.
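The power figures above translate into a concrete always-on saving (using midpoints of the reported ranges):

```python
# Total system power for the two inference paths measured on the
# Raspberry Pi AI HAT+ 2, taking midpoints of the reported ranges.
hailo_system_w = (7.2 + 7.6) / 2     # Hailo-10H offload: 7.4 W
cpu_system_w = (10.2 + 10.6) / 2     # CPU-only inference: 10.4 W
savings_w = cpu_system_w - hailo_system_w       # ~3.0 W
wh_per_day = savings_w * 24                     # ~72 Wh/day for always-on use
```

For a battery- or thermally-constrained deployment running continuously, roughly 3W of system-level savings per unit compounds quickly, independent of peak token throughput.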
Competitor Comparison
| Feature | Hailo-10H | Google Coral Edge TPU | NVIDIA Jetson Orin Nano Super |
|---|---|---|---|
| INT8 TOPS | 20 | 4 | 67 |
| INT4 TOPS | 40 | N/A | N/A |
| Power | 2.5W | ~2W | 7-25W |
| Efficiency (TOPS/W) | 16 (INT4) | 2 | 2.7 |
| Memory | 4/8GB LPDDR4 | None (host memory) | 8GB LPDDR5 |
| LLM Capable | Yes (up to ~3B) | No | Yes (up to 7B+) |
| GenAI (VLM, SD) | Yes | No | Yes |
| Form Factor | M.2 card | M.2/USB | Full SoM |
| Price | ~$130 (HAT+) | ~$25-60 | $249 (dev kit) |
The Hailo-10H is the only discrete add-in accelerator under 5W that can run LLMs and generative AI. The Coral Edge TPU and Intel Movidius are limited to classical vision inference. The Jetson Orin Nano Super is far more powerful but consumes 3-10x more power and costs 2x more. The Hailo-10H's niche is clear: generative AI at edge power budgets.
Key Capabilities
On-Module LPDDR4/4X. The most important hardware change from the Hailo-8. By putting 4/8GB of LPDDR4 directly on the M.2 module, the Hailo-10H can hold model weights that far exceed the chip's on-chip SRAM. A 1.5B parameter model at INT4 quantization requires about 750MB-1.2GB of memory, well within the 4GB configuration. The 8GB variant supports larger models (up to ~3B parameters) and provides headroom for KV-cache and activation storage.
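A back-of-the-envelope footprint check for the claim above - the layer and head dimensions here are illustrative (roughly Qwen2-1.5B-like), not published Hailo figures:

```python
# Rough memory-footprint estimate for a 1.5B-parameter model under
# W4A8: 4-bit weights plus an INT8 KV-cache for the 2048-token
# context. Layer/head dimensions below are illustrative assumptions,
# not published Hailo figures.
params = 1.5e9
weight_bytes = params * 4 / 8                      # 4 bits per weight -> 750 MB
GiB = 1024 ** 3

# INT8 KV-cache: K and V tensors per layer, per token, 1 byte each.
layers, kv_heads, head_dim, tokens = 28, 2, 128, 2048
kv_bytes = 2 * layers * kv_heads * head_dim * tokens * 1

total_gib = (weight_bytes + kv_bytes) / GiB        # well under 1 GiB
```

Even with the KV-cache included, a 1.5B model sits comfortably inside the 4GB configuration, consistent with the 750MB-1.2GB figure quoted above (the upper end presumably covering quantization scales and activation buffers).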
Generative AI Model Support. Supported models include:
- LLMs: Qwen2/2.5 1.5B, Llama 3.2 1B, DeepSeek R1 Distill Qwen 1.5B, Llama2-7B
- Vision-Language Models: Qwen2-VL-2B-Instruct
- Image Generation: Stable Diffusion 2.1 (under 5 seconds per 512x512 image)
- Vision: YOLOv11, YOLOv8, pose estimation, depth estimation, hundreds of pre-compiled models
- ASR: Automatic speech recognition models
The Hailo-10H offloads the entire LLM pipeline - tokenization, model inference, decoding, and KV-cache management - to the accelerator, freeing the host CPU completely.
Automotive Certification. AEC-Q100 Grade 2 certification means the Hailo-10H is qualified for automotive use at temperatures from -40C to 105C. Target applications include cockpit displays, driver monitoring systems, and in-vehicle infotainment with on-device AI. Hailo expects automotive-grade production to start in 2026.
Software Ecosystem. The Hailo AI Software Suite includes:
- Dataflow Compiler: Compiles models from TensorFlow, PyTorch, and ONNX to Hailo Executable Format (HEF)
- HailoRT: Runtime library with C/C++ and Python APIs
- Model Zoo: Hundreds of pre-compiled vision models plus a growing GenAI model collection
- hailo-ollama: Integration with the Ollama framework for LLM serving
- Camera Integration: Direct integration with Raspberry Pi camera stack
Pricing and Availability
| Product | Price | Config | Notes |
|---|---|---|---|
| Raspberry Pi AI HAT+ 2 | $130 | Hailo-10H, 8GB | Complete HAT board for Raspberry Pi 5 |
| ASUS UGen300 | TBD | Hailo-10H, USB | Announced CES 2026 |
| Standalone M.2 module | Contact Hailo | 4/8GB options | For OEM/embedded integration |
The Hailo-10H chip became commercially available in July 2025. The Raspberry Pi AI HAT+ 2 launched in January 2026 at $130 and is the most accessible way to get started. Raspberry Pi guarantees production of the AI HAT+ 2 through January 2036.
For embedded developers and OEMs, standalone M.2 modules are available through Hailo's sales channels and authorized distributors. HP and Fujitsu have already integrated the Hailo-10H into retail and enterprise products.
Total Platform Cost
| Platform | Total Cost | Components |
|---|---|---|
| Raspberry Pi 5 + AI HAT+ 2 | ~$195 | $65 (Pi 5 8GB) + $130 (HAT+) |
| x86 Mini-PC + Hailo-10H M.2 | ~$300-500 | Depends on mini-PC + module pricing |
| ASUS UGen300 + USB host | TBD | USB accelerator, price pending |
At $195 for a complete Raspberry Pi + Hailo-10H platform, this is the cheapest way to run on-device LLM inference in a standalone system. The nearest alternative is a used NVIDIA Jetson Nano at ~$150, which can't run generative AI models at usable speeds.
Strengths
- 40 TOPS INT4 at 2.5W delivers 16 TOPS/W - the best power efficiency in any discrete AI accelerator
- On-module LPDDR4 enables LLM inference that was impossible on the previous-generation Hailo-8
- $130 Raspberry Pi AI HAT+ 2 makes on-device generative AI accessible to hobbyists and educators
- AEC-Q100 Grade 2 automotive certification opens vehicle and industrial deployment paths
- M.2 form factor drops into any standard M.2 slot - no custom hardware design required
- W4A8 quantization maintains near-FP16 accuracy for small language models
- 2048-token KV-cache context window is sufficient for basic conversational AI
- Dataflow architecture offloads inference completely, freeing host CPU and RAM for other tasks
- 10,000+ monthly active developers and growing open-source model zoo
Weaknesses
- LLM throughput (~6-10 tok/s) can be slower than CPU-only inference on the Raspberry Pi 5 for small models
- Limited to models under ~3B parameters with the 8GB configuration - no 7B+ models in practice
- 2048-token context window is far shorter than cloud-based models (128K-1M tokens)
- PCIe Gen 3.0 x4 interface limits host-to-device bandwidth compared to modern accelerators
- LPDDR4 memory bandwidth constrains decode throughput for larger models
- Dataflow compiler requires model-specific compilation - not all models have optimized support
- No standard FP8/FP16 compute - INT4/INT8 only, limiting precision for some workloads
- Standalone M.2 module pricing and availability through standard distributors is limited
- VLM support is early - hailo-ollama currently supports LLMs only, VLM integration pending
- Process node undisclosed - unclear how much headroom exists for future power/performance improvement
Related Coverage
- NVIDIA RTX 5090 - Desktop AI Flagship - High-power desktop AI for comparison at the opposite end of the power range
- Qualcomm AI200 - Mobile AI processor competing for edge AI workloads
- Apple M4 Max - Apple Silicon for AI - Integrated AI acceleration in a consumer platform
- Groq LPU - Deterministic Inference at Scale - Another specialized inference chip, but targeting datacenter-scale deployment
Sources
- Hailo-10H AI Accelerator - Official Product Page
- Hailo-10H M.2 AI Acceleration Module - Hailo
- Bringing Generative AI to the Edge: LLM on Hailo-10H - Hailo Blog
- Raspberry Pi AI HAT+ 2 Review - CNX Software
- Hailo Announces Commercial Availability of Hailo-10H - Embedded Computing Design
- Hailo's Latest Accelerator Promises On-Device Gen AI in a Sub-5W Envelope - Hackster.io
- Hailo Announces General Availability of Hailo-10H - Hailo Newsroom
- Raspberry Pi AI HAT+ 2 Product Page
- Hailo-10 M.2 Module Brings Generative AI to the Edge - CNX Software
