Someone Reverse-Engineered Apple's Neural Engine and Trained a Model on It

A developer cracked Apple's undocumented ANE private APIs, measured its real throughput at 19 TFLOPS FP16 (not the marketed 38 TOPS), and trained a 109M-parameter transformer on hardware Apple designed exclusively for inference.

Apple's Neural Engine was never designed for training. There's no public API. No documentation. No supported path to run anything except inference through CoreML. Apple's own MLX framework - their open-source ML library - can't use the ANE because the only access path goes through proprietary CoreML internals.

Developer Manjeet Singh (maderix) just cracked it open anyway. His ANE project on GitHub reverse-engineers the M4 Neural Engine's private APIs, measures its true performance characteristics, and trains a 109M-parameter Llama2-architecture transformer on hardware Apple never intended to expose for training.

TL;DR

  • Developer maderix reverse-engineered Apple's Neural Engine private APIs and trained a 109M-parameter Llama2 transformer on it - the first known ANE training
  • Apple markets the M4 ANE at 38 TOPS, but the real FP16 throughput is 19 TFLOPS - INT8 doesn't provide a compute speedup, only memory bandwidth savings
  • M4 ANE efficiency: 6.6 TFLOPS/W at 2.8W peak power - nearly 9x more efficient per watt than an A100
  • The ANE bakes weights at compile time, forcing a full recompile every training batch - plus a ~119 compile limit per process due to resource leaks
  • Discovered 67 private Objective-C classes, including _ANEInMemoryModelDescriptor for in-memory MIL compilation without CoreML
  • Still early research - CPU bottlenecks dominate training time, and the approach uses private APIs that could break with any macOS update

What was achieved

Singh trained a Stories110M model (dim=768, hidden=2048, 12 attention heads, 12 layers, 32K vocabulary) on the TinyStories dataset. Both forward and backward passes run on ANE hardware, with a full training loop including Adam optimizer, gradient accumulation, and checkpoint/resume.

The 12-layer model runs at 107 ms per training step using 72 ANE kernels per compile. A single-layer benchmark runs at 9.3 ms/step with 11.2% ANE utilization, sustaining 1.78 TFLOPS.

What runs on the ANE: forward attention (RMSNorm + QKV + SDPA + output projection), forward FFN (RMSNorm + SwiGLU), and all backward dx passes. What still runs on CPU: RMSNorm backward, residual connections, loss computation, weight gradient accumulation via Accelerate's cblas_sgemm, and Adam optimizer updates.

The build command tells you everything about the project's dependencies - there are no third-party ones:

xcrun clang -O2 -framework Foundation -framework IOSurface \
  -framework CoreML -framework Accelerate -ldl -lobjc \
  -o train_large training/train_large.m

Pure Objective-C, system frameworks only, plus dlopen to load the private AppleNeuralEngine.framework at runtime.

How the ANE was cracked open

The reverse engineering combined several techniques. Using dyld_info for Objective-C class enumeration, Singh discovered 67 private classes in the AppleNeuralEngine.framework. Runtime method swizzling intercepted CoreML calls to map the execution chain. FlatBuffer binary analysis decoded the E5 compiled format. And scaling experiments varying tensor dimensions inferred the ANE's internal topology.

The key private APIs:

API                            Purpose
_ANEClient                     Direct hardware access gateway (the sharedConnection singleton)
_ANEInMemoryModelDescriptor    In-memory MIL compilation - no disk-based mlmodelc files needed
_ANEInMemoryModel              Compile, load, assess, unload cycle
_ANERequest                    Execution request with I/O surface bindings
_ANEIOSurfaceObject            Shared memory I/O wrapper
_ANEPerformanceStats           Hardware counters (partially explored)
_ANEChainingRequest            Model chaining with loopback support

The breakthrough was cracking _ANEInMemoryModelDescriptor - the path to compile MIL programs directly in memory without going through CoreML's disk-based workflow. Three details that took effort to figure out: the MIL text parameter expects NSData* UTF-8 bytes (not NSString*), weights accept NSDictionary* mapping names to data blobs, and temporary directory access is mandatory even for "in-memory" operations.

Rather than hacking binary formats, the project constructs MIL (Model Intermediate Language) program text at runtime. MIL is a typed SSA representation Apple uses internally. Linear layers are expressed as 1x1 convolutions because the ANE is fundamentally a convolution engine, and tensors use a [1, channels, 1, spatial] NCHW layout.
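
The linear-as-1x1-convolution trick is easy to check numerically. A minimal pure-Python sketch (illustrative only - not the project's MIL or Objective-C code): a 1x1 convolution over a [channels, spatial] tensor applies the same channel-mixing matmul at every spatial position, with no spatial mixing, which is exactly a linear layer per position.

```python
# A linear layer y = W @ x applied per position is identical to a
# 1x1 convolution over a [channels, spatial] tensor: each spatial
# slot s gets the same channel-mixing matmul.

def linear(W, x):
    """y[o] = sum_i W[o][i] * x[i] for one spatial position."""
    return [sum(W[o][i] * x[i] for i in range(len(x))) for o in range(len(W))]

def conv1x1(W, x_cs):
    """x_cs is [C_in][S]; a 1x1 conv mixes channels independently per s."""
    C_out, S = len(W), len(x_cs[0])
    return [[sum(W[o][i] * x_cs[i][s] for i in range(len(x_cs)))
             for s in range(S)]
            for o in range(C_out)]

W = [[1.0, 2.0, -1.0], [0.5, 0.0, 3.0]]      # 2 output, 3 input channels
x_cs = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]  # 3 channels, 2 spatial slots

conv_out = conv1x1(W, x_cs)
# Column s of the conv output equals the linear layer applied to column s.
for s in range(2):
    col = [x_cs[i][s] for i in range(3)]
    assert [conv_out[o][s] for o in range(2)] == linear(W, col)
```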

The "38 TOPS" problem

Apple markets the M4 ANE at 38 TOPS. The reverse engineering reveals what that number actually means:

The 38 TOPS figure is INT8, computed as 19 TFLOPS FP16 x 2 following industry convention. But the ANE dequantizes INT8 weights to FP16 before computing. There is no 2x INT8 speedup on the compute side. INT8 saves only memory bandwidth (smaller weights to load from DRAM), not compute cycles. The true peak is 19 TFLOPS FP16.
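
The arithmetic behind the marketing figure, as the article describes it, is simply:

```python
# The marketed number follows the industry convention of quoting
# INT8 ops as 2x the FP16 FLOP rate - even though the M4 ANE
# dequantizes INT8 weights to FP16 before computing, so there is
# no compute-side INT8 speedup on this hardware.
fp16_tflops = 19.0
marketed_int8_tops = fp16_tflops * 2    # the "38 TOPS" headline figure

# What INT8 actually buys here: half the bytes per weight fetched
# from DRAM, i.e. a bandwidth saving, not extra compute.
fp16_bytes, int8_bytes = 2, 1
bandwidth_saving = 1 - int8_bytes / fp16_bytes    # 50% fewer weight bytes
```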

Measured throughput on the M4:

Matrix Size             Throughput     Notes
512x512                 0.22 TFLOPS    Dispatch-limited
1024x1024               3.1 TFLOPS
2048x2048               5.7 TFLOPS     Peak single-op (24 MB fits in SRAM)
4096x4096               4.0 TFLOPS     30% drop (96 MB spills to DRAM)
32+ layer deep graph    ~19 TFLOPS     94% utilization, approaching theoretical peak

Single operations use only about 30% of ANE capacity. Chaining 16-64 operations in one compiled program is where the ANE hits its stride - at 32+ layers, it reaches 94% utilization and approaches the 19 TFLOPS ceiling.

The 2048-to-4096 performance drop revealed the on-chip SRAM capacity: approximately 32 MB. At 2048x2048, the 24 MB working set (three matrices) fits entirely in SRAM. At 4096x4096, the 96 MB working set forces DRAM spills, cutting throughput by 30%.
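
The SRAM inference follows from simple working-set arithmetic. A quick Python check, assuming three square FP16 buffers per matmul (A, B, and C) as described above:

```python
# Working set for one matmul C = A @ B with square FP16 matrices:
# three N x N buffers at 2 bytes per element.
def working_set_mb(n, bytes_per_elem=2, n_buffers=3):
    return n_buffers * n * n * bytes_per_elem / 2**20

assert working_set_mb(2048) == 24.0    # fits under the ~32 MB SRAM
assert working_set_mb(4096) == 96.0    # 3x over budget: spills to DRAM
```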

Another finding: CoreML adds 2-4x overhead for small operations compared to direct _ANEClient API access. The XPC + IOKit dispatch overhead alone is about 0.095ms per operation.
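
That fixed cost also explains why op chaining pays off: per-dispatch overhead is amortized across the whole compiled graph. A toy Python model of the effect, using only the 0.095 ms dispatch figure and the 19 TFLOPS peak (real small-op throughput is lower still, since this ignores all other overheads):

```python
# Effective throughput = useful FLOPs / (compute time + fixed dispatch cost).
DISPATCH_MS = 0.095      # measured XPC + IOKit cost per operation (from the article)
PEAK_TFLOPS = 19.0

def effective_tflops(flops):
    compute_ms = flops / (PEAK_TFLOPS * 1e12) * 1e3
    return flops / ((compute_ms + DISPATCH_MS) / 1e3) / 1e12

# One 512x512 matmul is ~0.27 GFLOP: the dispatch cost dwarfs the compute.
small = effective_tflops(2 * 512**3)
# Chaining 64 such ops into one compiled program pays the cost once.
chained = effective_tflops(64 * 2 * 512**3)
assert chained > small   # amortization recovers most of the peak
```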

The power efficiency story

This is where the numbers get interesting for local AI:

Metric        M4 ANE                      NVIDIA A100
FP16 Peak     19 TFLOPS                   312 TFLOPS
TDP           2.8W                        400W
Efficiency    6.6 TFLOPS/W                ~0.78 TFLOPS/W
Idle Power    0 mW (hard power gating)    N/A

The M4 ANE is nearly 9x more efficient per watt than an A100 (6.6 versus ~0.78 TFLOPS/W). It won't outrun a datacenter GPU in raw throughput - 19 TFLOPS versus 312 TFLOPS isn't close. But at 2.8 watts peak and zero watts idle (complete hardware shutdown, not clock gating), the power profile is transformative for always-on local inference on battery-powered devices.
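
A back-of-envelope check of the per-watt comparison, using the peak figures (19 TFLOPS at 2.8 W versus 312 TFLOPS at 400 W):

```python
# Efficiency = peak throughput / power draw, from the table's figures.
ane_eff = 19.0 / 2.8        # ~6.8 TFLOPS/W from peak (6.6 measured)
a100_eff = 312.0 / 400.0    # ~0.78 TFLOPS/W
ratio = ane_eff / a100_eff  # ~8.7x: nearly an order of magnitude per watt
assert 8 < ratio < 9
```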

As Singh noted: "Instead of raw speed, the biggest win is power efficiency (4-5x against GPU for same compute peak)."

Why training is hard

Apple designed the ANE for inference, and one fundamental constraint makes training on it painful: weights are irrevocably baked at compile time.

Singh confirmed this through systematic testing on both M4 and M5. Overwriting weight files on disk and reloading the model produces identical output. The weightsBuffer IOSurface parameter does not override compiled weights. There's no hot-swap path.

This means the model must be recompiled every batch when weights change during training. Combined with a ~119 compile limit per process (the ANE compiler leaks resources and eventually refuses new compilations), the training loop requires:

  • Gradient accumulation across 10 steps between recompiles
  • exec() process restart with checkpoint resume to reset the compile count
  • Async background compilation overlapped with forward/backward passes
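
Taken together, these workarounds amount to a compile-budget scheduler. A hypothetical Python sketch of the bookkeeping (the real loop is Objective-C; the only constants taken from the article are the ~119-compile ceiling and 10-step gradient accumulation):

```python
# Hypothetical sketch: accumulate gradients for ACCUM_STEPS steps,
# recompile when weights change, and restart the process (exec() +
# checkpoint resume in the real loop) before hitting the
# ~119-compile-per-process ceiling caused by compiler resource leaks.
COMPILE_LIMIT = 119
ACCUM_STEPS = 10

def plan(total_steps):
    compiles_this_process, restarts = 1, 0    # first compile at startup
    for step in range(1, total_steps + 1):
        if step % ACCUM_STEPS == 0:           # weights updated: must recompile
            if compiles_this_process + 1 > COMPILE_LIMIT:
                restarts += 1                 # exec() restart, resume checkpoint
                compiles_this_process = 0
            compiles_this_process += 1
    return restarts

# 10,000 steps -> ~1,000 recompiles -> a handful of process restarts.
assert plan(10_000) == 8
```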

The per-step breakdown shows where time actually goes:

Component                          Time (ms)
ANE eval (72 kernels)              9.6
Classifier (cblas_sgemm on CPU)    9.1
Cross-entropy + residuals (CPU)    14.4
IO (fp16 conversion)               4.1
Total                              107

The ANE itself takes only 9.6ms out of a 107ms step. CPU-side operations - loss computation, weight gradients, the classifier matmul - dominate the training time by a factor of 10x. Making training practical would require moving more of these operations to the ANE or GPU.

Context and prior work

Singh isn't the first to poke at the ANE. Matthijs Hollemans' hollance/neural-engine remains the best community documentation of ANE behavior. The eiln/ane project reverse-engineered a Linux driver for ANE as part of the Asahi Linux effort. Security researcher Wish Wu presented ANE internals at BlackHat Asia 2021, documenting register functions across user space, kernel space, and firmware. George Hotz attempted bare-metal ANE access for tinygrad but abandoned the effort due to OS certificate-signing restrictions.

What Singh did that's new: direct _ANEClient API access on M4, cracking the in-memory MIL compilation path, measuring true peak throughput bypassing CoreML overhead, and - the headline - training a neural network on hardware designed exclusively for inference.

The project was built collaboratively with Claude (Anthropic's AI), which Singh acknowledges in his Substack write-up. Some Hacker News commenters noted "LLMisms" in the writing style and raised verification questions - fair pushback for any AI-assisted research.

The M5 and Apple's direction

Singh has already tested on M5 hardware. The M5 ANE appears to be the same H16 family as M4 - same weight-baking limitation, same QoS behavior. But Apple's strategic direction is shifting: the M5 adds Neural Accelerators directly in each GPU core, programmable via Metal 4 Tensor APIs. Ronald Mannak noted on X that "Apple is focused on improving the GPUs for AI inference. The Apple Neural Engine wasn't even mentioned once" at a recent keynote.

The irony is that MLX - Apple's own open-source ML framework - cannot use the ANE because the only access path goes through proprietary CoreML. One Hacker News commenter put it plainly: "MLX is made by Apple and yet, they can't support ANE given its closed-source API!"

What this means

This is early research, not a production tool. The private API dependency means any macOS update could break everything. The compile-per-batch requirement makes training impractical for anything beyond proof-of-concept. And the CPU bottlenecks mean the ANE sits mostly idle even during its own training loop.

But the door is now open. The M4 ANE delivers 6.6 TFLOPS/W - hardware that's sitting mostly unused in every Mac, iPad, and iPhone shipped in the last two years. CoreML restricts it to inference-only through a high-overhead API that adds 2-4x latency for small operations. The true peak of 19 TFLOPS FP16 at 2.8 watts is remarkable for local AI workloads.

Singh's legal position cites Sega v. Accolade (1992) and DMCA section 1201(f) - reverse engineering for interoperability as fair use. No Apple proprietary code or binaries are included in the repository. Apple hasn't responded.

The repository is at github.com/maderix/ANE, MIT licensed. The three-part technical write-up is on Singh's Substack.

About the author

Sophie, AI Infrastructure & Open Source Reporter, is a journalist and former systems engineer who covers AI infrastructure, open-source models, and the developer tooling ecosystem.