
Hailo-10H - Edge AI With On-Device LLMs

Complete specs, benchmarks, and analysis of the Hailo-10H - a 2.5W edge AI accelerator with 40 TOPS INT4, on-module LPDDR4, and the ability to run LLMs and VLMs on a Raspberry Pi at 10 tokens per second.


TL;DR

  • 40 TOPS INT4 / 20 TOPS INT8 edge AI accelerator in an M.2 form factor at just 2.5W typical power consumption
  • First edge accelerator under 5W that can run LLMs - Qwen2 1.5B at 9.45 tokens per second, Llama 3.2 1B, DeepSeek R1 Distill
  • On-module 4/8GB LPDDR4/4X solves the critical limitation of the Hailo-8 - models no longer limited by on-chip SRAM
  • Available as the $130 Raspberry Pi AI HAT+ 2 or as standalone M.2 modules for embedded integration
  • AEC-Q100 Grade 2 automotive certified - targets smart cameras, vehicles, robotics, and consumer devices

Overview

The Hailo-10H occupies a unique niche in AI hardware. While datacenter accelerators like the NVIDIA H100 and AWS Trainium3 push performance ceilings with kilowatts of power and terabytes of HBM, the Hailo-10H brings generative AI to devices that run on batteries. At 2.5W typical power draw, it runs a 1.5-billion-parameter language model at nearly 10 tokens per second - fast enough for responsive on-device chatbots, voice assistants, and vision-language tasks without any cloud connection.

This is the second-generation AI processor from Hailo, the Israeli chip company that built its reputation on efficient edge vision processing. The original Hailo-8 (26 TOPS INT8) was a capable vision accelerator, but it had a hard limitation: no external memory interface. All model weights had to fit in the chip's on-chip SRAM, which capped the model size at relatively small vision networks. The Hailo-10H fixes this with a direct DDR interface to on-module LPDDR4/4X memory (4GB or 8GB configurations), enabling models with billions of parameters - LLMs, vision-language models, and stable diffusion - that the Hailo-8 could never run.

The architecture uses Hailo's proprietary dataflow design, where the chip's compute, memory, and control blocks are physically distributed across the die and allocated to specific layers of the neural network at compile time. Instead of fetching instructions and scheduling threads (like a GPU), the Hailo-10H creates a custom data pipeline for each model, with data flowing through dedicated hardware blocks. This dataflow approach is what enables 40 TOPS at 2.5W - a power efficiency of 16 TOPS per watt that no general-purpose processor can match.

The Hailo-10H launched commercially in July 2025 and quickly found its most visible platform: the Raspberry Pi AI HAT+ 2 at $130, which pairs the Hailo-10H (8GB) with the Raspberry Pi 5. This combination puts on-device LLM inference, real-time object detection, and vision-language processing on a $200 total platform - opening generative AI to hobbyists, educators, and embedded developers who have no access to cloud GPUs.

Key Specifications

| Specification | Details |
| --- | --- |
| Manufacturer | Hailo |
| Architecture | Second-generation neural core (dataflow) |
| AI Performance (INT4) | 40 TOPS |
| AI Performance (INT8) | 20 TOPS |
| Typical Power | 2.5W |
| On-Module Memory | 4 GB or 8 GB LPDDR4/4X |
| Host Interface | PCIe Gen 3.0 x4 |
| Form Factors | M.2 2242 Key M, M.2 2280 Key M, Chip On Board |
| Host Architecture | x86 and ARM (aarch64) |
| OS Support | Linux, Windows, Android |
| Industrial Temp Range | -40°C to 85°C |
| Automotive Temp Range | -40°C to 105°C |
| Automotive Qualification | AEC-Q100 Grade 2 |
| Quantization | W4A8 (4-bit weights, 8-bit activations) |
| KV-Cache | INT8, up to 2048 tokens |
| Process Node | Not disclosed |

Dataflow Architecture

Hailo's architecture is fundamentally different from GPUs and even from most other AI accelerators. The key concepts:

Distributed Building Blocks. The silicon is organized as a collection of compute, memory, and control blocks distributed across the die. These aren't fixed-function units for specific operations (like NVIDIA's Tensor Cores); they're generic building blocks that get assigned to specific neural network layers during compilation.

Compile-Time Mapping. The Hailo Dataflow Compiler decomposes each neural network into a resource graph and maps it onto the physical hardware. Compute blocks for each layer are placed as close as possible to their corresponding memory blocks on the die, minimizing data travel distance and power consumption. This physical proximity is what enables the extreme power efficiency.

Model-Specific Pipelines. Each compiled model creates a unique data pipeline through the chip. Data streams through the allocated blocks without global scheduling, instruction fetch, or thread management overhead. This is why the chip achieves near-peak utilization at minimal power - there's no wasted energy on control overhead.

The trade-off is flexibility. Unlike a GPU that can run any program, the Hailo-10H requires each model to be compiled through the Hailo Dataflow Compiler. New models and architectures need explicit compiler support, and the compilation process itself takes significant time and expertise.
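As a rough mental model of compile-time mapping, consider a toy allocator that pins each layer to its own block, so consecutive layers land on adjacent blocks. The layer names, block count, and greedy strategy below are invented for illustration; the real Dataflow Compiler's placement algorithm is proprietary and far more sophisticated.

```python
# Toy model of compile-time layer-to-block mapping (illustrative only).
# Each layer gets a dedicated block; consecutive layers land on adjacent
# blocks, so activations flow block-to-block with no runtime scheduler.

def map_layers_to_blocks(layers, n_blocks):
    """Greedy in-order placement of layers onto physical blocks."""
    if len(layers) > n_blocks:
        raise ValueError("model does not fit on the die")
    return {layer: block for block, layer in enumerate(layers)}

layers = ["conv1", "conv2", "attention", "mlp", "head"]  # hypothetical model
placement = map_layers_to_blocks(layers, n_blocks=8)
print(placement)  # each layer pinned to one block, in pipeline order
```

A real placement pass also weighs distances between compute and memory blocks; the point here is only that allocation happens once, at compile time, rather than per inference.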

Performance Benchmarks

LLM Performance

| Model | Hailo-10H (8GB) | Notes |
| --- | --- | --- |
| Qwen2-1.5B-Instruct | 9.45 tok/s | Time-to-first-token: 289ms (96 input tokens) |
| DeepSeek R1 Distill Qwen 1.5B | ~6.5 tok/s | Measured on Raspberry Pi AI HAT+ 2 |
| Llama 3.2 1B | ~6-7 tok/s | Estimated from community benchmarks |
| Llama2-7B | ~10 tok/s | Demonstrated by Hailo (likely INT4 quantization) |

The LLM performance tells an interesting story. At 9.45 tokens per second on Qwen2-1.5B, the Hailo-10H delivers a usable - if not fast - interactive experience for on-device chatbots. The 289ms time-to-first-token is responsive enough for conversational AI. At 2.1W average power during inference, this works out to approximately 4.5 tokens per second per watt - a metric that no cloud GPU can approach.
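As a sanity check on the efficiency figure, dividing the measured decode rate by the average power reproduces it (numbers taken from the benchmark above):

```python
# Tokens-per-second-per-watt for LLM decode on the Hailo-10H,
# using the Qwen2-1.5B figures quoted above.
tokens_per_second = 9.45   # measured decode rate
avg_power_watts = 2.1      # average power during inference

tok_per_watt = tokens_per_second / avg_power_watts
print(f"{tok_per_watt:.1f} tokens/s per watt")  # 4.5 tokens/s per watt
```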

Quantization Accuracy

| Benchmark | Quantized (W4A8) | Full Precision (FP16) |
| --- | --- | --- |
| HellaSwag | 66.06 | 64.3 |
| C4 (perplexity) | 14.38 | 15.1 |
| WikiText2 (perplexity) | 10.08 | 10.5 |

The W4A8 quantization (4-bit weights, 8-bit activations), achieved through Hailo's QuaROT + GPTQ fusion pipeline, actually improves HellaSwag accuracy and shows minimal perplexity degradation. This demonstrates that INT4 quantization on the Hailo-10H is not a significant accuracy trade-off for small language models.
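To make the "W4" half of W4A8 concrete, here is a minimal symmetric INT4 weight quantizer in plain Python. This is a deliberately naive sketch - the actual QuaROT + GPTQ pipeline adds rotations and second-order error compensation on top of the basic round-to-grid idea - and the weight values are made up for the example.

```python
# Symmetric per-tensor INT4 weight quantization: a toy version of the
# "W4" in W4A8. Weights map to integers in [-8, 7] plus one scale.

def quantize_int4(weights):
    """Return quantized integers in [-8, 7] and the scale factor."""
    scale = max(abs(w) for w in weights) / 7.0
    return [max(-8, min(7, round(w / scale))) for w in weights], scale

def dequantize(q, scale):
    return [v * scale for v in q]

w = [0.31, -0.70, 0.12, 0.44]          # hypothetical weight values
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
# Reconstruction error is bounded by half a quantization step.
assert all(abs(a - b) <= s / 2 + 1e-9 for a, b in zip(w, w_hat))
```

Each weight costs 4 bits instead of 16, which is exactly why a 1.5B-parameter model fits comfortably in the module's LPDDR4.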

Reality Check: CPU Comparison

Independent benchmarks from CNX Software testing the Raspberry Pi AI HAT+ 2 revealed a counterintuitive result: for pure LLM token generation, the Raspberry Pi 5's CPU (BCM2712 at full frequency) was sometimes faster than the Hailo-10H.

| Model | Hailo-10H (tok/s) | CPU-only BCM2712 (tok/s) |
| --- | --- | --- |
| DeepSeek R1 1.5B | ~6.5 | ~9-10.6 |
| Qwen2 1.5B | ~6.7 | Higher than Hailo |

This does not invalidate the Hailo-10H's value proposition, but it reframes it. The accelerator's advantage is not raw token speed for small models - it is offloading. When the Hailo-10H runs the model, the CPU and system RAM are free for other tasks. Total system power with Hailo running is 7.2-7.6W versus 10.2-10.6W for CPU-only inference. For always-on applications (security cameras with VLM analysis, voice assistants, continuous monitoring), the offloading and power savings are more important than peak throughput.

Competitor Comparison

| Feature | Hailo-10H | Google Coral Edge TPU | NVIDIA Jetson Orin Nano Super |
| --- | --- | --- | --- |
| INT8 TOPS | 20 | 4 | 67 |
| INT4 TOPS | 40 | N/A | N/A |
| Power | 2.5W | ~2W | 7-25W |
| Efficiency (TOPS/W) | 16 (INT4) | 2 | 2.7 |
| Memory | 4/8GB LPDDR4 | None (host memory) | 8GB LPDDR5 |
| LLM Capable | Yes (up to ~3B) | No | Yes (up to 7B+) |
| GenAI (VLM, SD) | Yes | No | Yes |
| Form Factor | M.2 card | M.2/USB | Full SoM |
| Price | ~$130 (HAT+) | ~$25-60 | $249 (dev kit) |

The Hailo-10H is the only discrete add-in accelerator under 5W that can run LLMs and generative AI. The Coral Edge TPU and Intel Movidius are limited to classical vision inference. The Jetson Orin Nano Super is far more powerful but consumes 3-10x more power and costs 2x more. The Hailo-10H's niche is clear: generative AI at edge power budgets.

Key Capabilities

On-Module LPDDR4/4X. The most important hardware change from the Hailo-8. By putting 4/8GB of LPDDR4 directly on the M.2 module, the Hailo-10H can hold model weights that far exceed the chip's on-chip SRAM. A 1.5B parameter model at INT4 quantization requires about 750MB-1.2GB of memory, well within the 4GB configuration. The 8GB variant supports larger models (up to ~3B parameters) and provides headroom for KV-cache and activation storage.
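The sizing claims above can be checked with straightforward arithmetic. The KV-cache dimensions below (28 layers, 2 KV heads of dimension 128) are assumptions patterned on a typical 1.5B-class model config, not published Hailo figures:

```python
# Rough memory footprint of a 1.5B-parameter model under W4A8,
# matching the sizing discussed above.

def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to store the model weights at a given bit width."""
    return n_params * bits_per_weight // 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, max_tokens, bytes_per_elem=1):
    """K and V per layer, INT8 (1 byte per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * max_tokens * bytes_per_elem

weights = weight_bytes(1_500_000_000, bits_per_weight=4)   # INT4 weights
# Hypothetical config resembling a 1.5B model: 28 layers, 2 KV heads, dim 128.
kv = kv_cache_bytes(n_layers=28, n_kv_heads=2, head_dim=128, max_tokens=2048)

print(f"weights: {weights / 2**30:.2f} GiB")   # ~0.70 GiB
print(f"kv-cache: {kv / 2**20:.1f} MiB")       # ~28 MiB
```

Even with activation buffers and runtime overhead on top, the total sits well inside the 4GB configuration, which is why the 8GB variant has headroom for ~3B-parameter models.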

Generative AI Model Support. Supported models include:

  • LLMs: Qwen2/2.5 1.5B, Llama 3.2 1B, DeepSeek R1 Distill Qwen 1.5B, Llama2-7B
  • Vision-Language Models: Qwen2-VL-2B-Instruct
  • Image Generation: Stable Diffusion 2.1 (under 5 seconds per 512x512 image)
  • Vision: YOLOv11, YOLOv8, pose estimation, depth estimation, hundreds of pre-compiled models
  • ASR: Automatic speech recognition models

The Hailo-10H offloads the entire LLM pipeline - tokenization, model inference, decoding, and KV-cache management - to the accelerator, freeing the host CPU completely.

Automotive Certification. AEC-Q100 Grade 2 certification means the Hailo-10H is qualified for automotive use at temperatures from -40°C to 105°C. Target applications include cockpit displays, driver monitoring systems, and in-vehicle infotainment with on-device AI. Hailo expects automotive-grade production to start in 2026.

Software Ecosystem. The Hailo AI Software Suite includes:

  • Dataflow Compiler: Compiles models from TensorFlow, PyTorch, and ONNX to Hailo Executable Format (HEF)
  • HailoRT: Runtime library with C/C++ and Python APIs
  • Model Zoo: Hundreds of pre-compiled vision models plus a growing GenAI model collection
  • hailo-ollama: Integration with the Ollama framework for LLM serving
  • Camera Integration: Direct integration with Raspberry Pi camera stack

Pricing and Availability

| Product | Price | Config | Notes |
| --- | --- | --- | --- |
| Raspberry Pi AI HAT+ 2 | $130 | Hailo-10H, 8GB | Complete HAT board for Raspberry Pi 5 |
| ASUS UGen300 | TBD | Hailo-10H, USB | Announced CES 2026 |
| Standalone M.2 module | Contact Hailo | 4/8GB options | For OEM/embedded integration |

The Hailo-10H chip became commercially available in July 2025. The Raspberry Pi AI HAT+ 2 launched in January 2026 at $130 and is the most accessible way to get started. Raspberry Pi guarantees production of the AI HAT+ 2 through January 2036.

For embedded developers and OEMs, standalone M.2 modules are available through Hailo's sales channels and authorized distributors. HP and Fujitsu have already integrated the Hailo-10H into retail and enterprise products.

Total Platform Cost

| Platform | Total Cost | Components |
| --- | --- | --- |
| Raspberry Pi 5 + AI HAT+ 2 | ~$195 | $65 (Pi 5 8GB) + $130 (HAT+) |
| x86 Mini-PC + Hailo-10H M.2 | ~$300-500 | Depends on mini-PC + module pricing |
| ASUS UGen300 + USB host | TBD | USB accelerator, price pending |

At $195 for a complete Raspberry Pi + Hailo-10H platform, this is the cheapest way to run on-device LLM inference in a standalone system. The nearest alternative is a used NVIDIA Jetson Nano at ~$150, which can't run generative AI models at usable speeds.

Strengths

  • 40 TOPS INT4 at 2.5W delivers 16 TOPS/W - the best power efficiency in any discrete AI accelerator
  • On-module LPDDR4 enables LLM inference that was impossible on the previous-generation Hailo-8
  • $130 Raspberry Pi AI HAT+ 2 makes on-device generative AI accessible to hobbyists and educators
  • AEC-Q100 Grade 2 automotive certification opens vehicle and industrial deployment paths
  • M.2 form factor drops into any standard M.2 slot - no custom hardware design required
  • W4A8 quantization maintains near-FP16 accuracy for small language models
  • 2048-token KV-cache context window is sufficient for basic conversational AI
  • Dataflow architecture offloads inference completely, freeing host CPU and RAM for other tasks
  • 10,000+ monthly active developers and growing open-source model zoo

Weaknesses

  • LLM throughput (~6-10 tok/s) can be slower than CPU-only inference on the Raspberry Pi 5 for small models
  • Limited to models under ~3B parameters with the 8GB configuration - no 7B+ models in practice
  • 2048-token context window is far shorter than cloud-based models (128K-1M tokens)
  • PCIe Gen 3.0 x4 interface limits host-to-device bandwidth compared to modern accelerators
  • LPDDR4 memory bandwidth constrains decode throughput for larger models
  • Dataflow compiler requires model-specific compilation - not all models have optimized support
  • No standard FP8/FP16 compute - INT4/INT8 only, limiting precision for some workloads
  • Standalone M.2 module pricing and availability through standard distributors is limited
  • VLM support is early - hailo-ollama currently supports LLMs only, VLM integration pending
  • Process node undisclosed - unclear how much headroom exists for future power/performance improvement

About the author

James, AI Benchmarks & Tools Analyst, is a software engineer turned tech writer who spent six years building backend systems at a fintech startup in Chicago before pivoting to full-time analysis of AI tools and infrastructure.