Metal GPU Programming - A Practical Guide for macOS Developers
A hands-on guide to Metal compute programming on Apple Silicon. Covers architecture, unified memory, real code examples in MSL and Swift, and how Metal compares to CUDA.

You own a Mac with Apple Silicon. You've heard that the GPU is capable of more than rendering your desktop. Maybe you've run a local LLM with llama.cpp or watched Stable Diffusion generate images and noticed the GPU activity spike. But you've never written a GPU kernel yourself, and the gap between "my Mac has a GPU" and "I understand how to program it" feels wide.
This guide closes that gap. If you've read our companion CUDA programming guide, you already know GPU concepts - this guide shows how they map to Apple's world. If you haven't, don't worry. We'll cover everything from scratch with real, compilable code.
By the end, you'll understand Metal's compute model, its unified memory architecture, and how to write GPU kernels in Metal Shading Language (MSL) with Swift host code. We'll build four progressively more complex examples and cover how Metal compares to CUDA for developers considering both platforms.
Why Metal Exists
Metal is Apple's low-level GPU programming API, introduced in 2014 to replace OpenGL ES and OpenCL on Apple platforms. It handles both graphics rendering and general-purpose compute - the same API that renders your SwiftUI animations also powers on-device ML inference.
On Apple Silicon (M1 and later), Metal runs on a GPU that shares physical memory with the CPU, Neural Engine, and media engines. This Unified Memory Architecture (UMA) is Metal's defining feature: the CPU and GPU see the same memory, so there's no copying data between host and device. If you've written CUDA code, you know the cudaMemcpy dance - allocate on the host, allocate on the device, copy host-to-device, run the kernel, copy device-to-host. Metal skips all of that.
The Metal Compute Model
Metal's thread hierarchy mirrors CUDA's, with different names:
| Concept | Metal | CUDA Equivalent |
|---|---|---|
| Smallest execution unit | SIMD-group (32 threads) | Warp (32 threads) |
| Cooperative thread group | Threadgroup (max 1024 threads) | Block (max 1024 threads) |
| All dispatched threads | Grid | Grid |
| Fast on-chip shared memory | threadgroup memory | __shared__ memory |
| Per-thread private memory | thread memory | Registers |
Three things are different in practice:
Thread indexing is declarative. In CUDA, you compute your global thread ID manually: blockIdx.x * blockDim.x + threadIdx.x. In Metal, you declare what you need as a kernel parameter with an attribute:
```metal
uint gid  [[thread_position_in_grid]]        // Global thread ID - Metal gives it to you
uint lid  [[thread_position_in_threadgroup]] // Local ID within threadgroup
uint tgid [[threadgroup_position_in_grid]]   // Threadgroup ID
```
Address spaces are explicit. Every pointer in Metal Shading Language must be qualified with an address space. This is mandatory - the compiler rejects unqualified pointers:
- `device` - GPU-accessible buffers (read/write)
- `constant` - Read-only data, optimized for broadcast across threads
- `threadgroup` - Shared within a threadgroup (fast on-chip SRAM)
- `thread` - Per-thread private memory
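As an illustrative sketch (the kernel name and parameters here are made up for demonstration, not part of the examples later in this guide), a kernel signature touching all four address spaces looks like:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void address_space_demo(
    device float* out          [[buffer(0)]],      // read/write GPU buffer
    constant float& scale      [[buffer(1)]],      // read-only, broadcast to all threads
    threadgroup float* scratch [[threadgroup(0)]], // on-chip, shared per threadgroup
    uint gid [[thread_position_in_grid]]
) {
    thread float v = out[gid]; // `thread` is the (default) per-thread address space
    out[gid] = v * scale;
    (void)scratch;             // unused here; included only to show the qualifier
}
```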
Dispatch is verbose. CUDA launches a kernel in one line: myKernel<<<grid, block>>>(args). Metal requires creating a command queue, command buffer, compute command encoder, setting each buffer, dispatching, ending encoding, committing, and waiting. It's roughly 15-20 lines of Swift versus 2 lines of CUDA. The verbosity gives you explicit control over resource binding and submission, but it's a lot of boilerplate for simple operations.
Memory Architecture
Understanding memory is the key to GPU performance. This is true for CUDA and equally true for Metal - but the memory model is fundamentally different.
Unified memory - the key advantage
In a discrete GPU system (NVIDIA), the CPU has its own DRAM and the GPU has its own VRAM. Data must cross the PCIe bus between them. This copy is the single biggest source of overhead in simple CUDA programs.
Apple Silicon eliminates this entirely. The CPU, GPU, and Neural Engine all share the same physical memory pool. When you create a Metal buffer with .storageModeShared, the CPU writes data and the GPU reads it from the same address. No copy. No PCIe bus. No cudaMemcpy.
```swift
// Create a buffer visible to both CPU and GPU - zero copy
let buffer = device.makeBuffer(bytes: data,
                               length: dataSize,
                               options: .storageModeShared)
```
The tradeoff: Apple Silicon's memory bandwidth is lower than the dedicated HBM on datacenter GPUs. An M4 Max delivers up to 546 GB/s; an H100 delivers 3,352 GB/s. But for many workloads - especially those that are compute-bound rather than bandwidth-bound - unified memory's zero-copy advantage matters more than raw bandwidth.
Storage modes
Metal offers three storage modes for buffers:
| Mode | CPU Access | GPU Access | Best For |
|---|---|---|---|
| `.storageModeShared` | Read/write | Read/write | Data that changes frequently, small-to-medium buffers |
| `.storageModePrivate` | None | Read/write | GPU-only intermediates, maximum GPU bandwidth |
| `.storageModeManaged` | Read/write (with sync) | Read/write | Macs with discrete GPUs (maintains separate CPU/GPU copies) |
For compute workloads, .storageModeShared is the default choice. Use .storageModePrivate for intermediate buffers that the CPU never touches - the GPU may optimize access patterns for private buffers.
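Because the CPU can't write a `.storageModePrivate` buffer directly, you populate it on the GPU timeline with a blit from a shared staging buffer. A hedged sketch (variable names are illustrative; `device` and `commandQueue` are created as in the examples below):

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!

var hostData = [Float](repeating: 1.0, count: 1024)
let byteCount = hostData.count * MemoryLayout<Float>.stride

// Shared staging buffer the CPU can write...
let staging = device.makeBuffer(bytes: &hostData, length: byteCount,
                                options: .storageModeShared)!
// ...and a private destination only the GPU can touch.
let privateBuf = device.makeBuffer(length: byteCount,
                                   options: .storageModePrivate)!

let cb = commandQueue.makeCommandBuffer()!
let blit = cb.makeBlitCommandEncoder()!
blit.copy(from: staging, sourceOffset: 0,
          to: privateBuf, destinationOffset: 0, size: byteCount)
blit.endEncoding()
cb.commit() // privateBuf is now GPU-resident; the CPU never reads it
```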
Memory hierarchy
| Level | Metal | CUDA Equivalent | Latency |
|---|---|---|---|
| Registers | Per-thread | Registers | 0 cycles |
| Threadgroup memory | threadgroup address space | __shared__ memory | ~20-30 cycles |
| System-Level Cache (SLC) | ~48 MB on Max chips | L2 cache | ~100 cycles |
| Unified memory (LPDDR5/5X) | Up to 512 GB on Ultra | Global (HBM/GDDR) | ~200-400 cycles |
The gap between threadgroup memory and unified memory is ~10-20x in latency. Every tiling and data reuse optimization in Metal exists to keep data in threadgroup memory and avoid hitting main memory.
Example 1: Vector Addition (Hello World)
The simplest possible Metal compute program. It adds two arrays element-by-element. You need two files: the kernel in MSL and the host code in Swift.
Kernel: compute.metal
```metal
#include <metal_stdlib>
using namespace metal;

kernel void vector_add(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float* c [[buffer(2)]],
    uint id [[thread_position_in_grid]]
) {
    c[id] = a[id] + b[id];
}
```
Host: main.swift
```swift
import Metal
import Foundation

let N = 1_000_000
let dataSize = N * MemoryLayout<Float>.stride

// 1. Get the GPU and create a command queue
guard let device = MTLCreateSystemDefaultDevice(),
      let commandQueue = device.makeCommandQueue() else {
    fatalError("Metal is not supported on this device")
}

// 2. Compile the kernel into a pipeline
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "vector_add")!
let pipeline = try! device.makeComputePipelineState(function: function)

// 3. Create input data
var a = (0..<N).map { Float($0) }
var b = (0..<N).map { Float($0) * 2.0 }

// 4. Create Metal buffers - unified memory, no copy needed
let bufferA = device.makeBuffer(bytes: &a, length: dataSize, options: .storageModeShared)!
let bufferB = device.makeBuffer(bytes: &b, length: dataSize, options: .storageModeShared)!
let bufferC = device.makeBuffer(length: dataSize, options: .storageModeShared)!

// 5. Encode the compute command
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(bufferA, offset: 0, index: 0)
encoder.setBuffer(bufferB, offset: 0, index: 1)
encoder.setBuffer(bufferC, offset: 0, index: 2)

// 6. Dispatch threads
let gridSize = MTLSize(width: N, height: 1, depth: 1)
let threadgroupSize = MTLSize(
    width: min(pipeline.maxTotalThreadsPerThreadgroup, N),
    height: 1, depth: 1
)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()

// 7. Submit and wait
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// 8. Read results directly - no device-to-host copy
let result = bufferC.contents().bindMemory(to: Float.self, capacity: N)
print("c[0] = \(result[0])")           // 0.0
print("c[1] = \(result[1])")           // 3.0
print("c[999999] = \(result[999999])") // 2999997.0
```
To build this in Xcode: create a new macOS Command Line Tool project, add compute.metal to the target, and replace main.swift with the host code above.
What this teaches:
- `kernel void` marks a GPU function (equivalent to CUDA's `__global__`)
- `[[buffer(0)]]` binds a parameter to a specific buffer index - the host code's `setBuffer(..., index: 0)` matches this
- `[[thread_position_in_grid]]` gives each thread a unique global ID - no manual index math like CUDA's `blockIdx.x * blockDim.x + threadIdx.x`
- `device const float*` and `device float*` are address-space-qualified pointers - MSL requires this
- `dispatchThreads` handles grid sizing automatically: Metal dispatches exactly `N` threads, handling a partial final threadgroup in hardware
- The result is read directly from `bufferC.contents()` - no `cudaMemcpy` back to host. This is unified memory in action.

Compare this to the CUDA vector addition: the kernel logic is nearly identical, but Metal eliminates all eight memory-management calls (`cudaMalloc` ×3, `cudaMemcpy` ×2, `cudaFree` ×3) in exchange for the more verbose command encoding.
Example 2: Matrix Multiplication (Naive vs Tiled)
Matrix multiplication is where GPUs truly shine, and where threadgroup memory becomes essential. We'll build both a naive version and an optimized tiled version to show the difference.
Naive kernel
```metal
#include <metal_stdlib>
using namespace metal;

kernel void matmul_naive(
    device const float* A [[buffer(0)]],
    device const float* B [[buffer(1)]],
    device float* C [[buffer(2)]],
    constant uint& N [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]]
) {
    uint row = gid.y;
    uint col = gid.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (uint k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}
```
The naive version works but wastes memory bandwidth. Each thread loads an entire row of A and column of B from unified memory. For a 1024x1024 multiply, each element of A and B gets loaded 1024 times across different threads.
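The reuse factor is easy to quantify. A plain-Swift back-of-the-envelope count of global-memory loads (no Metal required; the numbers assume the 16x16 tile used in the tiled version below):

```swift
// Count global-memory loads for C = A * B with n x n float matrices.
let n = 1024
let tile = 16

// Naive: each of the n*n output elements reads a full row of A and a
// full column of B, i.e. 2*n loads per element.
let naiveLoads = 2 * n * n * n

// Tiled: every value loaded into a 16x16 tile is reused by 16 threads,
// cutting global loads by the tile width.
let tiledLoads = naiveLoads / tile

print("naive: \(naiveLoads) loads")              // 2147483648
print("tiled: \(tiledLoads) loads")              // 134217728
print("reduction: \(naiveLoads / tiledLoads)x")  // 16x
```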
Tiled version with threadgroup memory
```metal
#include <metal_stdlib>
using namespace metal;

#define TILE_SIZE 16

kernel void matmul_tiled(
    device const float* A [[buffer(0)]],
    device const float* B [[buffer(1)]],
    device float* C [[buffer(2)]],
    constant uint& N [[buffer(3)]],
    threadgroup float* tileA [[threadgroup(0)]],
    threadgroup float* tileB [[threadgroup(1)]],
    uint2 tgid [[threadgroup_position_in_grid]],
    uint2 lid [[thread_position_in_threadgroup]]
) {
    uint row = tgid.y * TILE_SIZE + lid.y;
    uint col = tgid.x * TILE_SIZE + lid.x;
    float sum = 0.0f;

    uint numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;
    for (uint t = 0; t < numTiles; ++t) {
        // Cooperatively load tile of A
        uint aCol = t * TILE_SIZE + lid.x;
        if (row < N && aCol < N)
            tileA[lid.y * TILE_SIZE + lid.x] = A[row * N + aCol];
        else
            tileA[lid.y * TILE_SIZE + lid.x] = 0.0f;

        // Cooperatively load tile of B
        uint bRow = t * TILE_SIZE + lid.y;
        if (bRow < N && col < N)
            tileB[lid.y * TILE_SIZE + lid.x] = B[bRow * N + col];
        else
            tileB[lid.y * TILE_SIZE + lid.x] = 0.0f;

        // Wait for all threads to finish loading
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // Compute partial dot product from this tile
        for (uint k = 0; k < TILE_SIZE; ++k) {
            sum += tileA[lid.y * TILE_SIZE + k] * tileB[k * TILE_SIZE + lid.x];
        }

        // Wait before loading next tile
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}
```
Host code for tiled matmul
```swift
import Metal

let N: UInt32 = 1024
let matSize = Int(N * N) * MemoryLayout<Float>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "matmul_tiled")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Create matrices with random data
var matA = (0..<Int(N * N)).map { _ in Float.random(in: 0...1) }
var matB = (0..<Int(N * N)).map { _ in Float.random(in: 0...1) }
var n = N

let bufA = device.makeBuffer(bytes: &matA, length: matSize, options: .storageModeShared)!
let bufB = device.makeBuffer(bytes: &matB, length: matSize, options: .storageModeShared)!
let bufC = device.makeBuffer(length: matSize, options: .storageModeShared)!
let bufN = device.makeBuffer(bytes: &n, length: MemoryLayout<UInt32>.stride,
                             options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(bufA, offset: 0, index: 0)
encoder.setBuffer(bufB, offset: 0, index: 1)
encoder.setBuffer(bufC, offset: 0, index: 2)
encoder.setBuffer(bufN, offset: 0, index: 3)

// Allocate threadgroup memory for tiles
let tileSize = 16
let tileBytes = tileSize * tileSize * MemoryLayout<Float>.stride
encoder.setThreadgroupMemoryLength(tileBytes, index: 0) // tileA
encoder.setThreadgroupMemoryLength(tileBytes, index: 1) // tileB

let threadgroupSize = MTLSize(width: tileSize, height: tileSize, depth: 1)
let threadgroupCount = MTLSize(
    width: (Int(N) + tileSize - 1) / tileSize,
    height: (Int(N) + tileSize - 1) / tileSize,
    depth: 1
)
encoder.dispatchThreadgroups(threadgroupCount, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Read result directly
let result = bufC.contents().bindMemory(to: Float.self, capacity: Int(N * N))
print("C[0,0] = \(result[0])")
```
What the tiled version teaches:
- `threadgroup float* tileA [[threadgroup(0)]]` declares shared memory allocated from the host via `setThreadgroupMemoryLength`. In CUDA, you'd declare `__shared__ float tileA[16][16]` directly in the kernel.
- `threadgroup_barrier(mem_flags::mem_threadgroup)` is Metal's equivalent of CUDA's `__syncthreads()`. All threads in the threadgroup must reach the barrier before any can proceed.
- The algorithm loads 16x16 tiles of A and B into threadgroup memory, computes partial results, then moves to the next tile. Each element gets loaded from unified memory once per tile, then reused 16 times from fast on-chip memory.
- `dispatchThreadgroups` (note: not `dispatchThreads`) dispatches a specific number of threadgroups. This gives you explicit control over the grid layout, which you need when your kernel depends on threadgroup-level cooperation.
- On a 1024x1024 multiply, the tiled version is usually 3-5x faster than naive - the same improvement you'd see in CUDA, because the optimization is about memory reuse, not the specific API.
Example 3: Parallel Reduction (Sum with SIMD-group Operations)
Reduction - computing a single value from an array - shows SIMD-group operations, Metal's equivalent of CUDA's warp shuffle instructions.
```metal
#include <metal_stdlib>
using namespace metal;

kernel void parallel_reduce_sum(
    device const float* input [[buffer(0)]],
    device atomic_float* output [[buffer(1)]],
    constant uint& count [[buffer(2)]],
    threadgroup float* partials [[threadgroup(0)]],
    uint gid [[thread_position_in_grid]],
    uint lid [[thread_position_in_threadgroup]],
    uint simd_lane [[thread_index_in_simdgroup]],
    uint simd_id [[simdgroup_index_in_threadgroup]],
    uint simd_size [[threads_per_simdgroup]],
    uint tg_size [[threads_per_threadgroup]]
) {
    // Step 1: Each thread loads its element
    float val = (gid < count) ? input[gid] : 0.0f;

    // Step 2: Reduce within each SIMD-group using shuffle
    // No shared memory needed - values pass directly between lanes
    for (uint offset = simd_size / 2; offset > 0; offset /= 2) {
        val += simd_shuffle_down(val, offset);
    }

    // Step 3: Lane 0 of each SIMD-group writes to threadgroup memory
    uint num_simdgroups = (tg_size + simd_size - 1) / simd_size;
    if (simd_lane == 0) {
        partials[simd_id] = val;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Step 4: First SIMD-group reduces the partial sums
    if (simd_id == 0) {
        val = (lid < num_simdgroups) ? partials[lid] : 0.0f;
        for (uint offset = simd_size / 2; offset > 0; offset /= 2) {
            val += simd_shuffle_down(val, offset);
        }
    }

    // Step 5: Thread 0 atomically adds this threadgroup's total
    if (lid == 0) {
        atomic_fetch_add_explicit(output, val, memory_order_relaxed);
    }
}
```
Host code for reduction
```swift
import Metal

let N: UInt32 = 1 << 24 // ~16 million elements
let dataSize = Int(N) * MemoryLayout<Float>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "parallel_reduce_sum")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Fill with 1.0 so expected sum = N
var input = [Float](repeating: 1.0, count: Int(N))
var outputValue: Float = 0.0
var count = N

let inputBuffer = device.makeBuffer(bytes: &input, length: dataSize,
                                    options: .storageModeShared)!
let outputBuffer = device.makeBuffer(bytes: &outputValue,
                                     length: MemoryLayout<Float>.stride,
                                     options: .storageModeShared)!
let countBuffer = device.makeBuffer(bytes: &count,
                                    length: MemoryLayout<UInt32>.stride,
                                    options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(inputBuffer, offset: 0, index: 0)
encoder.setBuffer(outputBuffer, offset: 0, index: 1)
encoder.setBuffer(countBuffer, offset: 0, index: 2)

// Allocate threadgroup memory for partial sums (one per SIMD-group)
let threadgroupSize = min(pipeline.maxTotalThreadsPerThreadgroup, Int(N))
let numSimdgroups = (threadgroupSize + 31) / 32
encoder.setThreadgroupMemoryLength(numSimdgroups * MemoryLayout<Float>.stride, index: 0)

let gridSize = MTLSize(width: Int(N), height: 1, depth: 1)
let tgSize = MTLSize(width: threadgroupSize, height: 1, depth: 1)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: tgSize)
encoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

let result = outputBuffer.contents().bindMemory(to: Float.self, capacity: 1)
print("Sum: \(result[0]) (expected: \(N))")
```
What this teaches:
- `simd_shuffle_down(val, offset)` passes a value from lane `current + offset` to the current lane - directly between SIMD lanes, with no memory access at all. This is the fastest possible data exchange between threads. The CUDA equivalent is `__shfl_down_sync(0xffffffff, val, offset)`.
- Metal provides the SIMD-group index and lane index as kernel parameters via `[[thread_index_in_simdgroup]]` and `[[simdgroup_index_in_threadgroup]]`. In CUDA, you compute these manually: `threadIdx.x % 32` and `threadIdx.x / 32`.
- `atomic_fetch_add_explicit` lets multiple threadgroups contribute safely. This is Metal's equivalent of CUDA's `atomicAdd`.
- The reduction proceeds in two phases: within each SIMD-group (shuffles, no memory), then across SIMD-groups (threadgroup memory). This two-level approach minimizes memory traffic.
- Metal also provides `simd_sum(val)` as a built-in single-call reduction across a SIMD-group, which is more concise but less educational.
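For reference, here is a sketch of the same kernel with `simd_sum` replacing both shuffle loops. It uses only MSL standard-library built-ins, though the shortened variant is untested here:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void reduce_sum_simd(
    device const float* input [[buffer(0)]],
    device atomic_float* output [[buffer(1)]],
    constant uint& count [[buffer(2)]],
    threadgroup float* partials [[threadgroup(0)]],
    uint gid [[thread_position_in_grid]],
    uint lid [[thread_position_in_threadgroup]],
    uint simd_lane [[thread_index_in_simdgroup]],
    uint simd_id [[simdgroup_index_in_threadgroup]],
    uint simd_size [[threads_per_simdgroup]],
    uint tg_size [[threads_per_threadgroup]]
) {
    float val = (gid < count) ? input[gid] : 0.0f;
    val = simd_sum(val); // replaces the first shuffle loop

    uint num_simdgroups = (tg_size + simd_size - 1) / simd_size;
    if (simd_lane == 0) partials[simd_id] = val;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    if (simd_id == 0) {
        val = (lid < num_simdgroups) ? partials[lid] : 0.0f;
        val = simd_sum(val); // replaces the second shuffle loop
    }
    if (lid == 0) atomic_fetch_add_explicit(output, val, memory_order_relaxed);
}
```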
Example 4: Image Brightness (2D Grid, Practical Kernel)
A practical example: adjusting the brightness of an RGBA image using a 2D dispatch grid.
Kernel
```metal
#include <metal_stdlib>
using namespace metal;

kernel void adjust_brightness(
    device const uchar4* input [[buffer(0)]],
    device uchar4* output [[buffer(1)]],
    constant float& factor [[buffer(2)]],
    constant uint2& dims [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]]
) {
    if (gid.x >= dims.x || gid.y >= dims.y) return;

    uint idx = gid.y * dims.x + gid.x;
    float4 pixel = float4(input[idx]) / 255.0f;

    // Adjust brightness, clamp to valid range
    pixel.rgb = clamp(pixel.rgb * factor, 0.0f, 1.0f);
    output[idx] = uchar4(pixel * 255.0f);
}
```
Host code
```swift
import Metal
import simd

let width = 3840
let height = 2160 // 4K image
let pixelCount = width * height
let pixelSize = pixelCount * MemoryLayout<SIMD4<UInt8>>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "adjust_brightness")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Simulate image data (normally loaded from a file)
var pixels = [SIMD4<UInt8>](repeating: SIMD4<UInt8>(128, 128, 128, 255),
                            count: pixelCount)
var factor: Float = 1.5
var dims = SIMD2<UInt32>(UInt32(width), UInt32(height))

let inputBuffer = device.makeBuffer(bytes: &pixels, length: pixelSize,
                                    options: .storageModeShared)!
let outputBuffer = device.makeBuffer(length: pixelSize, options: .storageModeShared)!
let factorBuffer = device.makeBuffer(bytes: &factor,
                                     length: MemoryLayout<Float>.stride,
                                     options: .storageModeShared)!
let dimsBuffer = device.makeBuffer(bytes: &dims,
                                   length: MemoryLayout<SIMD2<UInt32>>.stride,
                                   options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(inputBuffer, offset: 0, index: 0)
encoder.setBuffer(outputBuffer, offset: 0, index: 1)
encoder.setBuffer(factorBuffer, offset: 0, index: 2)
encoder.setBuffer(dimsBuffer, offset: 0, index: 3)

// 2D dispatch - 8x8 threadgroups are common for image processing
let gridSize = MTLSize(width: width, height: height, depth: 1)
let threadgroupSize = MTLSize(width: 8, height: 8, depth: 1)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

let result = outputBuffer.contents().bindMemory(to: SIMD4<UInt8>.self, capacity: pixelCount)
print("Output pixel [0]: R=\(result[0].x) G=\(result[0].y) B=\(result[0].z) A=\(result[0].w)")
```
What this teaches:
- `uint2 gid [[thread_position_in_grid]]` gives you a 2D thread position - natural for image processing. The CUDA equivalent uses `blockIdx.x * blockDim.x + threadIdx.x` and `blockIdx.y * blockDim.y + threadIdx.y`.
- `uchar4` and `float4` are Metal's built-in vector types. The `.rgb` swizzle selects the first three components - a feature inherited from graphics shading languages that CUDA doesn't have.
- `clamp()` is a built-in MSL function for branchless range clamping - no warp/SIMD-group divergence.
- 8x8 threadgroups (64 threads) are common for 2D image kernels. They map well to the 2D spatial locality of image data.
- The entire input/output stays in unified memory. In CUDA, processing a 4K RGBA image (3840 x 2160 x 4 bytes ≈ 33 MB) requires ~66 MB of PCIe transfers (~33 MB up, ~33 MB down). In Metal, zero.
Metal Shading Language Reference
MSL is based on C++14 with GPU-specific extensions and restrictions. Here's what you need to know for compute kernels:
Supported types
| Type | Description | Notes |
|---|---|---|
| `float` / `half` | 32-bit / 16-bit floating point | `half` has native hardware support, 2x the throughput of `float` |
| `int` / `uint` | 32-bit signed/unsigned integer | |
| `short` / `ushort` | 16-bit signed/unsigned integer | |
| `char` / `uchar` | 8-bit signed/unsigned integer | |
| `float2`/`3`/`4` | Vector types | Component access via `.x` `.y` `.z` `.w` or swizzles like `.rgb` |
| `float2x2` to `float4x4` | Matrix types | Column-major storage |
| `atomic_int` / `atomic_float` | Atomic types | For thread-safe read-modify-write operations |
| `bool` | Boolean | |
No double. This is Metal's biggest data type limitation compared to CUDA. There's no hardware FP64 on Apple GPUs. If you need double precision for scientific computing, Metal is not the right tool.
Key kernel attributes
```metal
kernel void my_kernel(
    // Buffer bindings
    device float* data [[buffer(0)]],
    constant Params& params [[buffer(1)]],
    threadgroup float* shared [[threadgroup(0)]],
    // Thread identification
    uint gid [[thread_position_in_grid]],
    uint2 gid2d [[thread_position_in_grid]], // For 2D grids
    uint lid [[thread_position_in_threadgroup]],
    uint tgid [[threadgroup_position_in_grid]],
    uint tg_size [[threads_per_threadgroup]],
    // SIMD-group identification
    uint simd_lane [[thread_index_in_simdgroup]],
    uint simd_id [[simdgroup_index_in_threadgroup]],
    uint simd_size [[threads_per_simdgroup]]
);
```
What MSL doesn't support
- No lambda expressions
- No dynamic memory allocation (`new`/`delete`)
- No recursion
- No virtual functions
- No exception handling
- No function pointers
- No `goto`
These restrictions exist because GPU hardware doesn't support stack-based control flow the way CPUs do. Every thread must execute the same instruction sequence (with masking for divergent branches).
Metal vs CUDA: When to Use Which
If you've read the CUDA guide, here's the honest comparison:
| | Metal | CUDA |
|---|---|---|
| Memory model | Unified (zero-copy) | Discrete (explicit transfers) |
| Launch syntax | ~20 lines Swift boilerplate | <<<grid, block>>> - 2 lines |
| FP64 support | No | Yes |
| Max memory | 512 GB (M3 Ultra) | 80 GB (H100), 24 GB (RTX 4090) |
| Memory bandwidth | 546 GB/s (M4 Max) | 3,352 GB/s (H100) |
| FP32 TFLOPS | ~18 (M4 Max) | ~82 (RTX 4090), ~67 (H100) |
| Power draw | 40-80W | 300-700W |
| Multi-GPU | No | Yes (NVLink, NCCL) |
| Platform | macOS / iOS only | Linux, Windows, cloud |
| Library ecosystem | MPS, Accelerate | cuBLAS, cuDNN, cuFFT, TensorRT, NCCL, Thrust |
| Profiling tools | Xcode GPU debugger | Nsight Systems, Nsight Compute |
Use Metal when
- You're building a Mac or iOS app that needs GPU compute
- You want zero-copy memory sharing between CPU and GPU
- You need to load a large model (70B+ parameters) that won't fit in 24 GB of discrete VRAM but fits in 128-512 GB of unified memory
- Power efficiency matters (laptops, edge devices, always-on workloads)
- You're doing media processing (images, video, audio) in an Apple app
- You're using MLX, Core ML, or the PyTorch MPS backend
Use CUDA when
- You need maximum raw throughput for training or batch inference
- You need FP64 for scientific computing
- You need multi-GPU scaling across a cluster
- You depend on the library ecosystem (cuBLAS, cuDNN, TensorRT)
- You're launching to cloud GPU instances
- You're targeting production ML infrastructure
For a detailed look at CUDA programming with equivalent examples, see our CUDA programming guide.
The Metal Ecosystem
Metal Performance Shaders (MPS)
MPS is Apple's library of pre-optimized GPU kernels - hand-tuned by Apple engineers for each GPU generation. It covers:
- Image processing: Convolution, Gaussian blur, Sobel edge detection, histogram, median filter, Lanczos resampling
- Linear algebra: Matrix multiplication (
MPSMatrixMultiplication), decomposition, sparse operations - Neural networks: Convolution, fully connected, pooling, normalization, softmax, LSTM
- Ray tracing: Acceleration structure building, ray intersection
For standard operations, MPS kernels normally outperform hand-written Metal compute shaders. Use them before writing custom kernels.
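As a sketch of what that looks like, here is the 1024x1024 multiply from Example 2 expressed with `MPSMatrixMultiplication`. Buffer contents are elided (zero-filled); treat this as an outline and check the exact initializer arguments against the MPS documentation:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!

let n = 1024
let rowBytes = n * MemoryLayout<Float>.stride
let bufA = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!
let bufB = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!
let bufC = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!

// Describe the buffers as row-major float32 matrices
let desc = MPSMatrixDescriptor(rows: n, columns: n,
                               rowBytes: rowBytes, dataType: .float32)
let A = MPSMatrix(buffer: bufA, descriptor: desc)
let B = MPSMatrix(buffer: bufB, descriptor: desc)
let C = MPSMatrix(buffer: bufC, descriptor: desc)

// C = 1.0 * A * B + 0.0 * C, with kernels Apple tunes per GPU generation
let matmul = MPSMatrixMultiplication(device: device,
                                     transposeLeft: false, transposeRight: false,
                                     resultRows: n, resultColumns: n,
                                     interiorColumns: n, alpha: 1.0, beta: 0.0)

let cb = commandQueue.makeCommandBuffer()!
matmul.encode(commandBuffer: cb, leftMatrix: A, rightMatrix: B, resultMatrix: C)
cb.commit()
cb.waitUntilCompleted()
```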
PyTorch MPS backend
PyTorch supports Apple GPUs via the MPS backend (since PyTorch 1.12):
```python
import torch

device = torch.device("mps")
model = MyModel().to(device)
x = torch.randn(32, 3, 224, 224).to(device)
output = model(x)  # Runs on the Apple GPU
```
The MPS backend uses MPSGraph under the hood. Most common operations are supported, but coverage isn't complete. Set PYTORCH_ENABLE_MPS_FALLBACK=1 to fall back to CPU for unsupported ops.
MLX
MLX is Apple's open-source ML framework, designed specifically for Apple Silicon. It provides a NumPy-like API with lazy evaluation, automatic differentiation, and native Metal acceleration:
```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = a + b  # Runs on the GPU via Metal
```
MLX is the fastest path for LLM inference on Mac: ~50 tokens/s on quantized Llama 3B with M3 Max, and the MLX community maintains a growing library of optimized models.
Metal 4
Metal 4, announced at WWDC 2025, is the largest API overhaul since Metal's introduction:
- Native tensor types: `MTLTensor` is now a first-class resource alongside buffers and textures - a multi-dimensional container designed for ML workloads
- ML command encoders: `MTL4MachineLearningCommandEncoder` encodes inference directly onto the GPU timeline alongside graphics and compute work
- Shader ML: embed small ML models directly inside vertex, fragment, or compute shaders - inference happens without round-tripping through memory
- Unified command encoding: a single encoder type handles compute dispatch, memory copies, and acceleration structure operations
- Frame interpolation: MetalFX can create intermediate frames (comparable to NVIDIA DLSS 3 frame generation)
Metal 4 requires M1 or later on Mac, A14 or later on iOS/iPadOS.
Debugging and Profiling
Xcode Metal Debugger
The Metal debugger in Xcode provides GPU frame capture for inspecting all compute dispatches and resource states. Key features:
- Shader debugger: Step through compute shaders line-by-line with real GPU data. View execution across thousands of threads simultaneously.
- Dependency viewer: Graphical view of resource dependencies and synchronization between passes.
- Shader profiling: Inline per-line cost statistics in your shader source. Edit and reload to compare performance.
- Performance heat maps: Visualize expensive threads. Select a SIMD-group to see its full execution history.
Metal System Trace (Instruments)
Instruments' Metal System Trace shows CPU and GPU timelines side-by-side:
- Command buffer submission and completion timing
- Encoder-level performance breakdown
- CPU-GPU synchronization stalls (the most common performance problem in Metal compute)
GPU Performance Counters
- Occupancy: How well the GPU's execution units are utilized
- Bandwidth: Memory bandwidth use
- Limiter counters: Which hardware unit is the bottleneck
The general profiling workflow: use Instruments to find which kernel is slow, then use the Xcode shader debugger to find which lines in that kernel are expensive.
Common Pitfalls
Forgetting threadgroup_barrier. If threads in a threadgroup cooperate through shared memory (loading tiles, reduction), you need barriers between the write phase and the read phase. Without barriers, threads read partially-written data. This produces incorrect results that are difficult to debug because the timing varies.
Dispatching more threads than data. Always check bounds in your kernel: if (gid >= count) return;. dispatchThreadgroups rounds the grid up to whole threadgroups, creating threads that index past your buffer; dispatchThreads dispatches an exact count on Apple Silicon, but the bounds check keeps the kernel safe under either dispatch style.
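The overshoot is easy to see with plain arithmetic (no Metal needed):

```swift
// Threads launched when a grid is rounded up to whole threadgroups
// (the dispatchThreadgroups case).
let count = 1_000_000  // elements to process
let tgWidth = 256      // threads per threadgroup

let groups = (count + tgWidth - 1) / tgWidth  // ceiling division
let launched = groups * tgWidth

print(groups)            // 3907
print(launched)          // 1000192
print(launched - count)  // 192 threads that would index past the buffer
```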
Using .storageModeShared for everything. For intermediate buffers that only the GPU touches, use .storageModePrivate. The GPU can optimize access patterns for private buffers, and you avoid polluting the CPU's cache.
Ignoring command buffer errors. Check commandBuffer.error after waitUntilCompleted(). GPU errors are asynchronous - a kernel might silently produce wrong results from out-of-bounds access:
```swift
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
if let error = commandBuffer.error {
    print("GPU error: \(error)")
}
```
Blocking the main thread. waitUntilCompleted() blocks the calling thread. In an app with a UI, dispatch GPU work on a background queue and use addCompletedHandler instead:
```swift
commandBuffer.addCompletedHandler { cb in
    // Read results here, on a background thread
}
commandBuffer.commit()
```
Over-dispatching small workloads. Metal has non-trivial dispatch overhead (~10-50 microseconds for command buffer creation, encoding, and submission). If your data is small (fewer than ~10,000 elements), the dispatch overhead may exceed the compute time. For small workloads, batch multiple operations into a single command buffer or use the CPU.
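A sketch of that batching, where `pipelineA`, `pipelineB`, `buf`, and `n` are hypothetical stand-ins for two precompiled compute pipelines and their shared data:

```swift
// One command buffer, one submission, several small dispatches.
let cb = commandQueue.makeCommandBuffer()!
let enc = cb.makeComputeCommandEncoder()!
let grid = MTLSize(width: n, height: 1, depth: 1)
let tg = MTLSize(width: 256, height: 1, depth: 1)

enc.setComputePipelineState(pipelineA)
enc.setBuffer(buf, offset: 0, index: 0)
enc.dispatchThreads(grid, threadsPerThreadgroup: tg)

enc.setComputePipelineState(pipelineB) // same encoder, second kernel
enc.dispatchThreads(grid, threadsPerThreadgroup: tg)

enc.endEncoding()
cb.commit() // the dispatch overhead is paid once, not per kernel
```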
Where to Go From Here
The examples in this guide cover the core Metal compute patterns: element-wise operations, tiled matrix math, SIMD-group reductions, and 2D image processing. From here:
- Read the official Metal Shading Language Specification - it's the final MSL reference
- Watch Discover Metal 4 from WWDC 2025 for the latest API features
- Try MLX if you want ML on Apple Silicon without writing custom kernels
- Profile before optimizing - use Instruments' Metal System Trace to find actual bottlenecks before writing threadgroup memory code
- Read the companion CUDA programming guide if you work across both platforms - the optimization concepts (tiling, coalescing, reduction hierarchies) transfer directly
- For hands-on Metal compute examples, Metal by Example provides excellent walkthroughs
The mental model for Metal compute is the same as CUDA: think about data locality, minimize memory traffic, and use the thread hierarchy to cooperate on shared data. The API is different. The hardware is different. The optimization principles are the same.
Sources:
- Metal Overview - Apple Developer
- Metal Shading Language Specification v4 - Apple Developer
- Metal Performance Shaders - Apple Developer Documentation
- Discover Metal 4 - WWDC25 - Apple Developer
- Introduction to Compute Programming in Metal - Metal by Example
- MLX - Apple ML Framework
- Accelerated PyTorch Training on Mac - Apple Developer
- Metal Developer Tools - Apple Developer
