Hailo-10H - Edge AI With On-Device LLMs
Complete specs, benchmarks, and analysis of the Hailo-10H - a 2.5W edge AI accelerator with 40 TOPS INT4, on-module LPDDR4, and the ability to run LLMs and VLMs on a Raspberry Pi at nearly 10 tokens per second.

TL;DR
- 40 TOPS INT4 / 20 TOPS INT8 edge AI accelerator in an M.2 form factor at just 2.5W typical power consumption
- First edge accelerator under 5W that can run LLMs - Qwen2 1.5B at 9.45 tokens per second, Llama 3.2 1B, DeepSeek R1 Distill
- On-module 4/8GB LPDDR4/4X solves the critical limitation of the Hailo-8 - models no longer limited by on-chip SRAM
- Available as the $130 Raspberry Pi AI HAT+ 2 or as standalone M.2 modules for embedded integration
- AEC-Q100 Grade 2 automotive certified - targets smart cameras, vehicles, robotics, and consumer devices
Overview
The Hailo-10H occupies a unique niche in AI hardware. While datacenter accelerators like the NVIDIA H100 and AWS Trainium3 push performance ceilings with kilowatts of power and terabytes of HBM, the Hailo-10H brings generative AI to devices that run on batteries. At 2.5W typical power draw, it runs a 1.5-billion-parameter language model at nearly 10 tokens per second - fast enough for responsive on-device chatbots, voice assistants, and vision-language tasks without any cloud connection.
This is the second-generation AI processor from Hailo, the Israeli chip company that built its reputation on efficient edge vision processing. The original Hailo-8 (26 TOPS INT8) was a capable vision accelerator, but it had a hard limitation: no external memory interface. All model weights had to fit in the chip's on-chip SRAM, which capped the model size at relatively small vision networks. The Hailo-10H fixes this with a direct DDR interface to on-module LPDDR4/4X memory (4GB or 8GB configurations), enabling models with billions of parameters - LLMs, vision-language models, and stable diffusion - that the Hailo-8 could never run.
The architecture uses Hailo's proprietary dataflow design, where the chip's compute, memory, and control blocks are physically distributed across the die and allocated to specific layers of the neural network at compile time. Instead of fetching instructions and scheduling threads (like a GPU), the Hailo-10H creates a custom data pipeline for each model, with data flowing through dedicated hardware blocks. This dataflow approach is what enables 40 TOPS at 2.5W - a power efficiency of 16 TOPS per watt that no general-purpose processor can match.
The Hailo-10H launched commercially in July 2025 and quickly found its most visible platform: the Raspberry Pi AI HAT+ 2 at $130, which pairs the Hailo-10H (8GB) with the Raspberry Pi 5. This combination puts on-device LLM inference, real-time object detection, and vision-language processing on a $200 total platform - opening generative AI to hobbyists, educators, and embedded developers who have no access to cloud GPUs.
Key Specifications
| Specification | Details |
|---|---|
| Manufacturer | Hailo |
| Architecture | Second-generation neural core (dataflow) |
| AI Performance (INT4) | 40 TOPS |
| AI Performance (INT8) | 20 TOPS |
| Typical Power | 2.5W |
| On-Module Memory | 4 GB or 8 GB LPDDR4/4X |
| Host Interface | PCIe Gen 3.0 x4 |
| Form Factors | M.2 2242 Key M, M.2 2280 Key M, Chip On Board |
| Host Architecture | x86 and ARM (aarch64) |
| OS Support | Linux, Windows, Android |
| Industrial Temp Range | -40C to 85C |
| Automotive Temp Range | -40C to 105C |
| Automotive Qualification | AEC-Q100 Grade 2 |
| Quantization | W4A8 (4-bit weights, 8-bit activations) |
| KV-Cache | INT8, up to 2048 tokens |
| Process Node | Not disclosed |
Dataflow Architecture
Hailo's architecture is fundamentally different from GPUs and even from most other AI accelerators. The key concepts:
Distributed Building Blocks. The silicon is organized as a collection of compute, memory, and control blocks distributed across the die. These aren't fixed-function units for specific operations (like NVIDIA's Tensor Cores); they're generic building blocks that get assigned to specific neural network layers during compilation.
Compile-Time Mapping. The Hailo Dataflow Compiler decomposes each neural network into a resource graph and maps it onto the physical hardware. Compute blocks for each layer are placed as close as possible to their corresponding memory blocks on the die, minimizing data travel distance and power consumption. This physical proximity is what enables the extreme power efficiency.
Model-Specific Pipelines. Each compiled model creates a unique data pipeline through the chip. Data streams through the allocated blocks without global scheduling, instruction fetch, or thread management overhead. This is why the chip achieves near-peak utilization at minimal power - there's no wasted energy on control overhead.
The trade-off is flexibility. Unlike a GPU that can run any program, the Hailo-10H requires each model to be compiled through the Hailo Dataflow Compiler. New models and architectures need explicit compiler support, and the compilation process itself takes significant time and expertise.
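The compile-time mapping described above can be sketched as a toy allocator. This is an illustration of the idea only - not Hailo's actual compiler - and every name and number in it is hypothetical: the point is that all block placement is resolved before inference, so a model that doesn't fit fails at compile time, not at runtime.

```python
# Illustrative sketch (NOT Hailo's toolchain): a compile-time allocator
# that assigns each network layer a fixed share of generic compute and
# memory blocks, turning inference into a static pipeline with no
# runtime scheduling. All names and budgets here are hypothetical.
from dataclasses import dataclass

@dataclass
class BlockBudget:
    compute: int  # generic compute blocks available on the die
    memory: int   # generic memory blocks available on the die

def map_layers(layers, budget):
    """Greedily allocate die blocks to layers at compile time.

    layers: list of (name, compute_need, memory_need) tuples.
    Returns a static layer -> block-count plan, or raises if the model
    does not fit - mirroring how a dataflow compiler must resolve all
    placement before the chip ever sees data.
    """
    plan = {}
    for name, c_need, m_need in layers:
        if c_need > budget.compute or m_need > budget.memory:
            raise ValueError(f"model does not fit: {name}")
        budget.compute -= c_need
        budget.memory -= m_need
        plan[name] = {"compute_blocks": c_need, "memory_blocks": m_need}
    return plan

plan = map_layers(
    [("conv1", 4, 2), ("conv2", 8, 4), ("fc", 2, 6)],
    BlockBudget(compute=32, memory=16),
)
```

Once the plan exists, no per-layer decisions remain at runtime - which is the source of both the efficiency and the inflexibility described below.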
Performance Benchmarks
LLM Performance
| Model | Hailo-10H (8GB) | Notes |
|---|---|---|
| Qwen2-1.5B-Instruct | 9.45 tok/s | Time-to-first-token: 289ms (96 input tokens) |
| DeepSeek R1 Distill Qwen 1.5B | ~6.5 tok/s | Measured on Raspberry Pi AI HAT+ 2 |
| Llama 3.2 1B | ~6-7 tok/s | Estimated from community benchmarks |
| Llama2-7B | ~10 tok/s | Demonstrated by Hailo (likely INT4 quantization) |
The LLM performance tells an interesting story. At 9.45 tokens per second on Qwen2-1.5B, the Hailo-10H delivers a usable, if not fast, interactive experience for on-device chatbots. The 289ms time-to-first-token is responsive enough for conversational AI. At 2.1W average power during inference, this works out to approximately 4.5 tokens per second per watt - a metric that no cloud GPU can approach.
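The efficiency figure follows directly from the two measured numbers:

```python
# Sanity-checking the per-watt efficiency claim from the measured
# figures: 9.45 tokens/s at 2.1 W average power during inference.
tokens_per_second = 9.45
avg_power_w = 2.1
tok_per_s_per_watt = tokens_per_second / avg_power_w
print(round(tok_per_s_per_watt, 2))  # -> 4.5
```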
Quantization Accuracy
| Benchmark | Quantized (W4A8) | Full Precision (FP16) |
|---|---|---|
| HellaSwag | 66.06 | 64.3 |
| C4 (perplexity) | 14.38 | 15.1 |
| WikiText2 (perplexity) | 10.08 | 10.5 |
The W4A8 quantization (4-bit weights, 8-bit activations), reached through Hailo's QuaROT + GPTQ fusion pipeline, actually improves HellaSwag accuracy and shows no meaningful perplexity degradation. This demonstrates that INT4 quantization on the Hailo-10H is not a significant accuracy trade-off for small language models.
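As a rough illustration of what W4A8 means in practice, here is a minimal symmetric quantization sketch in pure Python. It shows the generic 4-bit-weight / 8-bit-activation scheme only - it does not reproduce Hailo's QuaROT + GPTQ pipeline, and the sample values are made up:

```python
# Minimal W4A8 sketch: symmetric per-tensor quantization, 4-bit
# weights and 8-bit activations, with an integer dot product that is
# rescaled back to float at the end. Generic scheme, not Hailo's
# actual quantizer; sample values are illustrative.

def quantize(values, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1                     # 7 for INT4, 127 for INT8
    scale = max(abs(v) for v in values) / qmax
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return q, scale

weights = [0.12, -0.53, 0.34, -0.07]
q4, w_scale = quantize(weights, bits=4)            # weights  -> INT4
acts = [1.6, -0.2, 0.9, 0.4]
q8, a_scale = quantize(acts, bits=8)               # activations -> INT8

# Integer multiply-accumulate, then one float rescale at the end:
int_dot = sum(w * a for w, a in zip(q4, q8))
approx = int_dot * w_scale * a_scale
exact = sum(w * a for w, a in zip(weights, acts))  # FP reference
```

Even at 4-bit weights, the rescaled integer dot product lands close to the floating-point reference, which is the basic reason small-model accuracy can survive W4A8.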
Reality Check: CPU Comparison
Independent benchmarks from CNX Software testing the Raspberry Pi AI HAT+ 2 revealed a counterintuitive result: for pure LLM token generation, the Raspberry Pi 5's CPU (BCM2712 at full frequency) was sometimes faster than the Hailo-10H.
| Model | Hailo-10H (tok/s) | CPU-only BCM2712 (tok/s) |
|---|---|---|
| DeepSeek R1 1.5B | ~6.5 | ~9-10.6 |
| Qwen2 1.5B | ~6.7 | Higher than Hailo |
This does not invalidate the Hailo-10H's value proposition, but it reframes it. The accelerator's advantage is not raw token speed for small models - it is offloading. When the Hailo-10H runs the model, the CPU and system RAM are free for other tasks. Total system power with Hailo running is 7.2-7.6W versus 10.2-10.6W for CPU-only inference. For always-on applications (security cameras with VLM analysis, voice assistants, continuous monitoring), the offloading and power savings are more important than peak throughput.
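The power figures above translate into a concrete always-on saving (using midpoints of the reported ranges):

```python
# Total system power for the two inference paths measured on the
# Raspberry Pi AI HAT+ 2, taking midpoints of the reported ranges.
hailo_system_w = (7.2 + 7.6) / 2     # Hailo-10H offload: 7.4 W
cpu_system_w = (10.2 + 10.6) / 2     # CPU-only inference: 10.4 W
savings_w = cpu_system_w - hailo_system_w       # ~3.0 W
wh_per_day = savings_w * 24                     # ~72 Wh/day for always-on use
```

For a battery- or thermally-constrained deployment running continuously, roughly 3W of system-level savings per unit compounds quickly, independent of peak token throughput.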
Competitor Comparison
| Feature | Hailo-10H | Google Coral Edge TPU | NVIDIA Jetson Orin Nano Super |
|---|---|---|---|
| INT8 TOPS | 20 | 4 | 67 |
| INT4 TOPS | 40 | N/A | N/A |
| Power | 2.5W | ~2W | 7-25W |
| Efficiency (TOPS/W) | 16 (INT4) | 2 | 2.7 |
| Memory | 4/8GB LPDDR4 | None (host memory) | 8GB LPDDR5 |
| LLM Capable | Yes (up to ~3B) | No | Yes (up to 7B+) |
| GenAI (VLM, SD) | Yes | No | Yes |
| Form Factor | M.2 card | M.2/USB | Full SoM |
| Price | ~$130 (HAT+) | ~$25-60 | $249 (dev kit) |
The Hailo-10H is the only discrete add-in accelerator under 5W that can run LLMs and generative AI. The Coral Edge TPU and Intel Movidius are limited to classical vision inference. The Jetson Orin Nano Super is far more powerful but consumes 3-10x more power and costs 2x more. The Hailo-10H's niche is clear: generative AI at edge power budgets.
Key Capabilities
On-Module LPDDR4/4X. The most important hardware change from the Hailo-8. By putting 4/8GB of LPDDR4 directly on the M.2 module, the Hailo-10H can hold model weights that far exceed the chip's on-chip SRAM. A 1.5B parameter model at INT4 quantization requires about 750MB-1.2GB of memory, well within the 4GB configuration. The 8GB variant supports larger models (up to ~3B parameters) and provides headroom for KV-cache and activation storage.
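A back-of-the-envelope footprint check for the claim above - the layer and head dimensions here are illustrative (roughly Qwen2-1.5B-like), not published Hailo figures:

```python
# Rough memory-footprint estimate for a 1.5B-parameter model under
# W4A8: 4-bit weights plus an INT8 KV-cache for the 2048-token
# context. Layer/head dimensions below are illustrative assumptions,
# not published Hailo figures.
params = 1.5e9
weight_bytes = params * 4 / 8                      # 4 bits per weight -> 750 MB
GiB = 1024 ** 3

# INT8 KV-cache: K and V tensors per layer, per token, 1 byte each.
layers, kv_heads, head_dim, tokens = 28, 2, 128, 2048
kv_bytes = 2 * layers * kv_heads * head_dim * tokens * 1

total_gib = (weight_bytes + kv_bytes) / GiB        # well under 1 GiB
```

Even with the KV-cache included, a 1.5B model sits comfortably inside the 4GB configuration, consistent with the 750MB-1.2GB figure quoted above (the upper end presumably covering quantization scales and activation buffers).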
Generative AI Model Support. Supported models include:
- LLMs: Qwen2/2.5 1.5B, Llama 3.2 1B, DeepSeek R1 Distill Qwen 1.5B, Llama2-7B
- Vision-Language Models: Qwen2-VL-2B-Instruct
- Image Generation: Stable Diffusion 2.1 (under 5 seconds per 512x512 image)
- Vision: YOLOv11, YOLOv8, pose estimation, depth estimation, hundreds of pre-compiled models
- ASR: Automatic speech recognition models
The Hailo-10H offloads the entire LLM pipeline - tokenization, model inference, decoding, and KV-cache management - to the accelerator, freeing the host CPU completely.
Automotive Certification. AEC-Q100 Grade 2 certification means the Hailo-10H is qualified for automotive use at temperatures from -40C to 105C. Target applications include cockpit displays, driver monitoring systems, and in-vehicle infotainment with on-device AI. Hailo expects automotive-grade production to start in 2026.
Software Ecosystem. The Hailo AI Software Suite includes:
- Dataflow Compiler: Compiles models from TensorFlow, PyTorch, and ONNX to Hailo Executable Format (HEF)
- HailoRT: Runtime library with C/C++ and Python APIs
- Model Zoo: Hundreds of pre-compiled vision models plus a growing GenAI model collection
- hailo-ollama: Integration with the Ollama framework for LLM serving
- Camera Integration: Direct integration with Raspberry Pi camera stack
Pricing and Availability
| Product | Price | Config | Notes |
|---|---|---|---|
| Raspberry Pi AI HAT+ 2 | $130 | Hailo-10H, 8GB | Complete HAT board for Raspberry Pi 5 |
| ASUS UGen300 | TBD | Hailo-10H, USB | Announced CES 2026 |
| Standalone M.2 module | Contact Hailo | 4/8GB options | For OEM/embedded integration |
The Hailo-10H chip became commercially available in July 2025. The Raspberry Pi AI HAT+ 2 launched in January 2026 at $130 and is the most accessible way to get started. Raspberry Pi guarantees production of the AI HAT+ 2 through January 2036.
For embedded developers and OEMs, standalone M.2 modules are available through Hailo's sales channels and authorized distributors. HP and Fujitsu have already integrated the Hailo-10H into retail and enterprise products.
Total Platform Cost
| Platform | Total Cost | Components |
|---|---|---|
| Raspberry Pi 5 + AI HAT+ 2 | ~$195 | $65 (Pi 5 8GB) + $130 (HAT+) |
| x86 Mini-PC + Hailo-10H M.2 | ~$300-500 | Depends on mini-PC + module pricing |
| ASUS UGen300 + USB host | TBD | USB accelerator, price pending |
At $195 for a complete Raspberry Pi + Hailo-10H platform, this is the cheapest way to run on-device LLM inference in a standalone system. The nearest alternative is a used NVIDIA Jetson Nano at ~$150, which can't run generative AI models at usable speeds.
Strengths
- 40 TOPS INT4 at 2.5W delivers 16 TOPS/W - the best power efficiency in any discrete AI accelerator
- On-module LPDDR4 enables LLM inference that was impossible on the previous-generation Hailo-8
- $130 Raspberry Pi AI HAT+ 2 makes on-device generative AI accessible to hobbyists and educators
- AEC-Q100 Grade 2 automotive certification opens vehicle and industrial deployment paths
- M.2 form factor drops into any standard M.2 slot - no custom hardware design required
- W4A8 quantization maintains near-FP16 accuracy for small language models
- 2048-token KV-cache context window is sufficient for basic conversational AI
- Dataflow architecture offloads inference completely, freeing host CPU and RAM for other tasks
- 10,000+ monthly active developers and growing open-source model zoo
Weaknesses
- LLM throughput (~6-10 tok/s) can be slower than CPU-only inference on the Raspberry Pi 5 for small models
- Limited to models under ~3B parameters with the 8GB configuration - no 7B+ models in practice
- 2048-token context window is far shorter than cloud-based models (128K-1M tokens)
- PCIe Gen 3.0 x4 interface limits host-to-device bandwidth compared to modern accelerators
- LPDDR4 memory bandwidth constrains decode throughput for larger models
- Dataflow compiler requires model-specific compilation - not all models have optimized support
- No standard FP8/FP16 compute - INT4/INT8 only, limiting precision for some workloads
- Standalone M.2 module pricing and availability through standard distributors is limited
- VLM support is early - hailo-ollama currently supports LLMs only, VLM integration pending
- Process node undisclosed - unclear how much headroom exists for future power/performance improvement
Related Coverage
- NVIDIA RTX 5090 - Desktop AI Flagship - High-power desktop AI for comparison at the opposite end of the power range
- Qualcomm AI200 - Mobile AI processor competing for edge AI workloads
- Apple M4 Max - Apple Silicon for AI - Integrated AI acceleration in a consumer platform
- Groq LPU - Deterministic Inference at Scale - Another specialized inference chip, but targeting datacenter-scale deployment
Sources
- Hailo-10H AI Accelerator - Official Product Page
- Hailo-10H M.2 AI Acceleration Module - Hailo
- Bringing Generative AI to the Edge: LLM on Hailo-10H - Hailo Blog
- Raspberry Pi AI HAT+ 2 Review - CNX Software
- Hailo Announces Commercial Availability of Hailo-10H - Embedded Computing Design
- Hailo's Latest Accelerator Promises On-Device Gen AI in a Sub-5W Envelope - Hackster.io
- Hailo Announces General Availability of Hailo-10H - Hailo Newsroom
- Raspberry Pi AI HAT+ 2 Product Page
- Hailo-10 M.2 Module Brings Generative AI to the Edge - CNX Software
