Metal GPU Programming - A Practical Guide for macOS Developers
A hands-on guide to Metal compute programming on Apple Silicon. Covers architecture, unified memory, real code examples in MSL and Swift, and how Metal compares to CUDA.

You own a Mac with Apple Silicon. You've heard that the GPU is capable of more than rendering your desktop. Maybe you've run a local LLM with llama.cpp or watched Stable Diffusion generate images and noticed the GPU activity spike. But you've never written a GPU kernel yourself, and the gap between "my Mac has a GPU" and "I understand how to program it" feels wide.
This guide closes that gap. If you've read our companion CUDA programming guide, you already know GPU concepts - this guide shows how they map to Apple's world. If you haven't, don't worry. We'll cover everything from scratch with real, compilable code.
By the end, you'll understand Metal's compute model, its unified memory architecture, and how to write GPU kernels in Metal Shading Language (MSL) with Swift host code. We'll build four progressively more complex examples and cover how Metal compares to CUDA for developers considering both platforms.
Why Metal Exists
Metal is Apple's low-level GPU programming API, introduced in 2014 to replace OpenGL ES and OpenCL on Apple platforms. It handles both graphics rendering and general-purpose compute - the same API that renders your SwiftUI animations also powers on-device ML inference.
On Apple Silicon (M1 and later), Metal runs on a GPU that shares physical memory with the CPU, Neural Engine, and media engines. This Unified Memory Architecture (UMA) is Metal's defining feature: the CPU and GPU see the same memory, so there's no copying data between host and device. If you've written CUDA code, you know the cudaMemcpy dance - allocate on the host, allocate on the device, copy host-to-device, run the kernel, copy device-to-host. Metal skips all of that.
The Metal Compute Model
Metal's thread hierarchy mirrors CUDA's, with different names:
| Concept | Metal | CUDA Equivalent |
|---|---|---|
| Smallest execution unit | SIMD-group (32 threads) | Warp (32 threads) |
| Cooperative thread group | Threadgroup (max 1024 threads) | Block (max 1024 threads) |
| All dispatched threads | Grid | Grid |
| Fast on-chip shared memory | threadgroup memory | __shared__ memory |
| Per-thread private memory | thread memory | Registers |
Three things are different in practice:
Thread indexing is declarative. In CUDA, you compute your global thread ID manually: blockIdx.x * blockDim.x + threadIdx.x. In Metal, you declare what you need as a kernel parameter with an attribute:
```metal
uint gid  [[thread_position_in_grid]]        // Global thread ID - Metal gives it to you
uint lid  [[thread_position_in_threadgroup]] // Local ID within threadgroup
uint tgid [[threadgroup_position_in_grid]]   // Threadgroup ID
```
Address spaces are explicit. Every pointer in Metal Shading Language must be qualified with an address space. This is mandatory - the compiler rejects unqualified pointers:
- `device` - GPU-accessible buffers (read/write)
- `constant` - Read-only data, optimized for broadcast across threads
- `threadgroup` - Shared within a threadgroup (fast on-chip SRAM)
- `thread` - Per-thread private memory
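As an illustrative sketch (the kernel name and parameters here are made up for demonstration, not part of the examples later in this guide), a kernel signature touching all four address spaces looks like:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void address_space_demo(
    device float* out          [[buffer(0)]],      // read/write GPU buffer
    constant float& scale      [[buffer(1)]],      // read-only, broadcast to all threads
    threadgroup float* scratch [[threadgroup(0)]], // on-chip, shared per threadgroup
    uint gid [[thread_position_in_grid]]
) {
    thread float v = out[gid]; // `thread` is the (default) per-thread address space
    out[gid] = v * scale;
    (void)scratch;             // unused here; included only to show the qualifier
}
```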
Dispatch is verbose. CUDA launches a kernel in one line: myKernel<<<grid, block>>>(args). Metal requires creating a command queue, command buffer, compute command encoder, setting each buffer, dispatching, ending encoding, committing, and waiting. It's roughly 15-20 lines of Swift versus 2 lines of CUDA. The verbosity gives you explicit control over resource binding and submission, but it's a lot of boilerplate for simple operations.
Memory Architecture
Understanding memory is the key to GPU performance. This is true for CUDA and equally true for Metal - but the memory model is fundamentally different.
Unified memory - the key advantage
In a discrete GPU system (NVIDIA), the CPU has its own DRAM and the GPU has its own VRAM. Data must cross the PCIe bus between them. This copy is the single biggest source of overhead in simple CUDA programs.
Apple Silicon eliminates this entirely. The CPU, GPU, and Neural Engine all share the same physical memory pool. When you create a Metal buffer with .storageModeShared, the CPU writes data and the GPU reads it from the same address. No copy. No PCIe bus. No cudaMemcpy.
```swift
// Create a buffer visible to both CPU and GPU - zero copy
let buffer = device.makeBuffer(bytes: data,
                               length: dataSize,
                               options: .storageModeShared)
```
The tradeoff: Apple Silicon's memory bandwidth is lower than the dedicated HBM on datacenter GPUs. An M4 Max delivers up to 546 GB/s; an H100 delivers 3,352 GB/s. But for many workloads - especially those that are compute-bound rather than bandwidth-bound - unified memory's zero-copy advantage matters more than raw bandwidth.
Storage modes
Metal offers three storage modes for buffers:
| Mode | CPU Access | GPU Access | Best For |
|---|---|---|---|
| `.storageModeShared` | Read/write | Read/write | Data that changes frequently, small-to-medium buffers |
| `.storageModePrivate` | None | Read/write | GPU-only intermediates, maximum GPU bandwidth |
| `.storageModeManaged` | Read/write (with sync) | Read/write | Macs with discrete GPUs (maintains separate CPU/GPU copies) |
For compute workloads, .storageModeShared is the default choice. Use .storageModePrivate for intermediate buffers that the CPU never touches - the GPU may optimize access patterns for private buffers.
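Because the CPU can't write a `.storageModePrivate` buffer directly, you populate it on the GPU timeline with a blit from a shared staging buffer. A hedged sketch (variable names are illustrative; `device` and `commandQueue` are created as in the examples below):

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!

var hostData = [Float](repeating: 1.0, count: 1024)
let byteCount = hostData.count * MemoryLayout<Float>.stride

// Shared staging buffer the CPU can write...
let staging = device.makeBuffer(bytes: &hostData, length: byteCount,
                                options: .storageModeShared)!
// ...and a private destination only the GPU can touch.
let privateBuf = device.makeBuffer(length: byteCount,
                                   options: .storageModePrivate)!

let cb = commandQueue.makeCommandBuffer()!
let blit = cb.makeBlitCommandEncoder()!
blit.copy(from: staging, sourceOffset: 0,
          to: privateBuf, destinationOffset: 0, size: byteCount)
blit.endEncoding()
cb.commit() // privateBuf is now GPU-resident; the CPU never reads it
```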
Memory hierarchy
| Level | Metal | CUDA Equivalent | Latency |
|---|---|---|---|
| Registers | Per-thread | Registers | 0 cycles |
| Threadgroup memory | threadgroup address space | __shared__ memory | ~20-30 cycles |
| System-Level Cache (SLC) | ~48 MB on Max chips | L2 cache | ~100 cycles |
| Unified memory (LPDDR5/5X) | Up to 512 GB on Ultra | Global (HBM/GDDR) | ~200-400 cycles |
The gap between threadgroup memory and unified memory is ~10-20x in latency. Every tiling and data reuse optimization in Metal exists to keep data in threadgroup memory and avoid hitting main memory.
Example 1: Vector Addition (Hello World)
The simplest possible Metal compute program. It adds two arrays element-by-element. You need two files: the kernel in MSL and the host code in Swift.
Kernel: compute.metal
```metal
#include <metal_stdlib>
using namespace metal;

kernel void vector_add(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float* c [[buffer(2)]],
    uint id [[thread_position_in_grid]]
) {
    c[id] = a[id] + b[id];
}
```
Host: main.swift
```swift
import Metal
import Foundation

let N = 1_000_000
let dataSize = N * MemoryLayout<Float>.stride

// 1. Get the GPU and create a command queue
guard let device = MTLCreateSystemDefaultDevice(),
      let commandQueue = device.makeCommandQueue() else {
    fatalError("Metal is not supported on this device")
}

// 2. Compile the kernel into a pipeline
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "vector_add")!
let pipeline = try! device.makeComputePipelineState(function: function)

// 3. Create input data
var a = (0..<N).map { Float($0) }
var b = (0..<N).map { Float($0) * 2.0 }

// 4. Create Metal buffers - unified memory, no copy needed
let bufferA = device.makeBuffer(bytes: &a, length: dataSize, options: .storageModeShared)!
let bufferB = device.makeBuffer(bytes: &b, length: dataSize, options: .storageModeShared)!
let bufferC = device.makeBuffer(length: dataSize, options: .storageModeShared)!

// 5. Encode the compute command
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(bufferA, offset: 0, index: 0)
encoder.setBuffer(bufferB, offset: 0, index: 1)
encoder.setBuffer(bufferC, offset: 0, index: 2)

// 6. Dispatch threads
let gridSize = MTLSize(width: N, height: 1, depth: 1)
let threadgroupSize = MTLSize(
    width: min(pipeline.maxTotalThreadsPerThreadgroup, N),
    height: 1, depth: 1
)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()

// 7. Submit and wait
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// 8. Read results directly - no device-to-host copy
let result = bufferC.contents().bindMemory(to: Float.self, capacity: N)
print("c[0] = \(result[0])")           // 0.0
print("c[1] = \(result[1])")           // 3.0
print("c[999999] = \(result[999999])") // 2999997.0
```
To build this in Xcode: create a new macOS Command Line Tool project, add compute.metal to the target, and replace main.swift with the host code above.
What this teaches:
- `kernel void` marks a GPU function (equivalent to CUDA's `__global__`)
- `[[buffer(0)]]` binds a parameter to a specific buffer index - the host code's `setBuffer(..., index: 0)` matches this
- `[[thread_position_in_grid]]` gives each thread a unique global ID - no manual index math like CUDA's `blockIdx.x * blockDim.x + threadIdx.x`
- `device const float*` and `device float*` are address-space-qualified pointers - MSL requires this
- `dispatchThreads` handles grid sizing automatically: Metal dispatches exactly `N` threads, handling a partial final threadgroup in hardware
- The result is read directly from `bufferC.contents()` - no `cudaMemcpy` back to host. This is unified memory in action.

Compare this to the CUDA vector addition: the kernel logic is nearly identical, but Metal eliminates all eight memory-management calls (`cudaMalloc` ×3, `cudaMemcpy` ×2, `cudaFree` ×3) in exchange for the more verbose command encoding.
Example 2: Matrix Multiplication (Naive vs Tiled)
Matrix multiplication is where GPUs truly shine, and where threadgroup memory becomes essential. We'll build both a naive version and an optimized tiled version to show the difference.
Naive kernel
```metal
#include <metal_stdlib>
using namespace metal;

kernel void matmul_naive(
    device const float* A [[buffer(0)]],
    device const float* B [[buffer(1)]],
    device float* C [[buffer(2)]],
    constant uint& N [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]]
) {
    uint row = gid.y;
    uint col = gid.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (uint k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}
```
The naive version works but wastes memory bandwidth. Each thread loads an entire row of A and column of B from unified memory. For a 1024x1024 multiply, each element of A and B gets loaded 1024 times across different threads.
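The reuse factor is easy to quantify. A plain-Swift back-of-the-envelope count of global-memory loads (no Metal required; the numbers assume the 16x16 tile used in the tiled version below):

```swift
// Count global-memory loads for C = A * B with n x n float matrices.
let n = 1024
let tile = 16

// Naive: each of the n*n output elements reads a full row of A and a
// full column of B, i.e. 2*n loads per element.
let naiveLoads = 2 * n * n * n

// Tiled: every value loaded into a 16x16 tile is reused by 16 threads,
// cutting global loads by the tile width.
let tiledLoads = naiveLoads / tile

print("naive: \(naiveLoads) loads")              // 2147483648
print("tiled: \(tiledLoads) loads")              // 134217728
print("reduction: \(naiveLoads / tiledLoads)x")  // 16x
```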
Tiled version with threadgroup memory
```metal
#include <metal_stdlib>
using namespace metal;

#define TILE_SIZE 16

kernel void matmul_tiled(
    device const float* A [[buffer(0)]],
    device const float* B [[buffer(1)]],
    device float* C [[buffer(2)]],
    constant uint& N [[buffer(3)]],
    threadgroup float* tileA [[threadgroup(0)]],
    threadgroup float* tileB [[threadgroup(1)]],
    uint2 tgid [[threadgroup_position_in_grid]],
    uint2 lid [[thread_position_in_threadgroup]]
) {
    uint row = tgid.y * TILE_SIZE + lid.y;
    uint col = tgid.x * TILE_SIZE + lid.x;
    float sum = 0.0f;

    uint numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;
    for (uint t = 0; t < numTiles; ++t) {
        // Cooperatively load tile of A
        uint aCol = t * TILE_SIZE + lid.x;
        if (row < N && aCol < N)
            tileA[lid.y * TILE_SIZE + lid.x] = A[row * N + aCol];
        else
            tileA[lid.y * TILE_SIZE + lid.x] = 0.0f;

        // Cooperatively load tile of B
        uint bRow = t * TILE_SIZE + lid.y;
        if (bRow < N && col < N)
            tileB[lid.y * TILE_SIZE + lid.x] = B[bRow * N + col];
        else
            tileB[lid.y * TILE_SIZE + lid.x] = 0.0f;

        // Wait for all threads to finish loading
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // Compute partial dot product from this tile
        for (uint k = 0; k < TILE_SIZE; ++k) {
            sum += tileA[lid.y * TILE_SIZE + k] * tileB[k * TILE_SIZE + lid.x];
        }

        // Wait before loading next tile
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}
```
Host code for tiled matmul
```swift
import Metal

let N: UInt32 = 1024
let matSize = Int(N * N) * MemoryLayout<Float>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "matmul_tiled")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Create matrices with random data
var matA = (0..<Int(N * N)).map { _ in Float.random(in: 0...1) }
var matB = (0..<Int(N * N)).map { _ in Float.random(in: 0...1) }
var n = N

let bufA = device.makeBuffer(bytes: &matA, length: matSize, options: .storageModeShared)!
let bufB = device.makeBuffer(bytes: &matB, length: matSize, options: .storageModeShared)!
let bufC = device.makeBuffer(length: matSize, options: .storageModeShared)!
let bufN = device.makeBuffer(bytes: &n, length: MemoryLayout<UInt32>.stride,
                             options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(bufA, offset: 0, index: 0)
encoder.setBuffer(bufB, offset: 0, index: 1)
encoder.setBuffer(bufC, offset: 0, index: 2)
encoder.setBuffer(bufN, offset: 0, index: 3)

// Allocate threadgroup memory for tiles
let tileSize = 16
let tileBytes = tileSize * tileSize * MemoryLayout<Float>.stride
encoder.setThreadgroupMemoryLength(tileBytes, index: 0) // tileA
encoder.setThreadgroupMemoryLength(tileBytes, index: 1) // tileB

let threadgroupSize = MTLSize(width: tileSize, height: tileSize, depth: 1)
let threadgroupCount = MTLSize(
    width: (Int(N) + tileSize - 1) / tileSize,
    height: (Int(N) + tileSize - 1) / tileSize,
    depth: 1
)
encoder.dispatchThreadgroups(threadgroupCount, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Read result directly
let result = bufC.contents().bindMemory(to: Float.self, capacity: Int(N * N))
print("C[0,0] = \(result[0])")
```
What the tiled version teaches:
- `threadgroup float* tileA [[threadgroup(0)]]` declares shared memory allocated from the host via `setThreadgroupMemoryLength`. In CUDA, you'd declare `__shared__ float tileA[16][16]` directly in the kernel.
- `threadgroup_barrier(mem_flags::mem_threadgroup)` is Metal's equivalent of CUDA's `__syncthreads()`. All threads in the threadgroup must reach the barrier before any can proceed.
- The algorithm loads 16x16 tiles of A and B into threadgroup memory, computes partial results, then moves to the next tile. Each element gets loaded from unified memory once per tile, then reused 16 times from fast on-chip memory.
- `dispatchThreadgroups` (note: not `dispatchThreads`) dispatches a specific number of threadgroups. This gives you explicit control over the grid layout, which you need when your kernel depends on threadgroup-level cooperation.
- On a 1024x1024 multiply, the tiled version is usually 3-5x faster than naive - the same improvement you'd see in CUDA, because the optimization is about memory reuse, not the specific API.
Example 3: Parallel Reduction (Sum with SIMD-group Operations)
Reduction - computing a single value from an array - shows SIMD-group operations, Metal's equivalent of CUDA's warp shuffle instructions.
```metal
#include <metal_stdlib>
using namespace metal;

kernel void parallel_reduce_sum(
    device const float* input [[buffer(0)]],
    device atomic_float* output [[buffer(1)]],
    constant uint& count [[buffer(2)]],
    threadgroup float* partials [[threadgroup(0)]],
    uint gid [[thread_position_in_grid]],
    uint lid [[thread_position_in_threadgroup]],
    uint simd_lane [[thread_index_in_simdgroup]],
    uint simd_id [[simdgroup_index_in_threadgroup]],
    uint simd_size [[threads_per_simdgroup]],
    uint tg_size [[threads_per_threadgroup]]
) {
    // Step 1: Each thread loads its element
    float val = (gid < count) ? input[gid] : 0.0f;

    // Step 2: Reduce within each SIMD-group using shuffle
    // No shared memory needed - values pass directly between lanes
    for (uint offset = simd_size / 2; offset > 0; offset /= 2) {
        val += simd_shuffle_down(val, offset);
    }

    // Step 3: Lane 0 of each SIMD-group writes to threadgroup memory
    uint num_simdgroups = (tg_size + simd_size - 1) / simd_size;
    if (simd_lane == 0) {
        partials[simd_id] = val;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Step 4: First SIMD-group reduces the partial sums
    if (simd_id == 0) {
        val = (lid < num_simdgroups) ? partials[lid] : 0.0f;
        for (uint offset = simd_size / 2; offset > 0; offset /= 2) {
            val += simd_shuffle_down(val, offset);
        }
    }

    // Step 5: Thread 0 atomically adds this threadgroup's total
    if (lid == 0) {
        atomic_fetch_add_explicit(output, val, memory_order_relaxed);
    }
}
```
Host code for reduction
```swift
import Metal

let N: UInt32 = 1 << 24 // ~16 million elements
let dataSize = Int(N) * MemoryLayout<Float>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "parallel_reduce_sum")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Fill with 1.0 so expected sum = N
var input = [Float](repeating: 1.0, count: Int(N))
var outputValue: Float = 0.0
var count = N

let inputBuffer = device.makeBuffer(bytes: &input, length: dataSize,
                                    options: .storageModeShared)!
let outputBuffer = device.makeBuffer(bytes: &outputValue,
                                     length: MemoryLayout<Float>.stride,
                                     options: .storageModeShared)!
let countBuffer = device.makeBuffer(bytes: &count,
                                    length: MemoryLayout<UInt32>.stride,
                                    options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(inputBuffer, offset: 0, index: 0)
encoder.setBuffer(outputBuffer, offset: 0, index: 1)
encoder.setBuffer(countBuffer, offset: 0, index: 2)

// Allocate threadgroup memory for partial sums (one per SIMD-group)
let threadgroupSize = min(pipeline.maxTotalThreadsPerThreadgroup, Int(N))
let numSimdgroups = (threadgroupSize + 31) / 32
encoder.setThreadgroupMemoryLength(numSimdgroups * MemoryLayout<Float>.stride, index: 0)

let gridSize = MTLSize(width: Int(N), height: 1, depth: 1)
let tgSize = MTLSize(width: threadgroupSize, height: 1, depth: 1)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: tgSize)
encoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

let result = outputBuffer.contents().bindMemory(to: Float.self, capacity: 1)
print("Sum: \(result[0]) (expected: \(N))")
```
What this teaches:
- `simd_shuffle_down(val, offset)` passes a value from lane `current + offset` to the current lane - directly between SIMD lanes, with no memory access at all. This is the fastest possible data exchange between threads. The CUDA equivalent is `__shfl_down_sync(0xffffffff, val, offset)`.
- Metal provides the SIMD-group index and lane index as kernel parameters via `[[thread_index_in_simdgroup]]` and `[[simdgroup_index_in_threadgroup]]`. In CUDA, you compute these manually: `threadIdx.x % 32` and `threadIdx.x / 32`.
- `atomic_fetch_add_explicit` lets multiple threadgroups contribute safely. This is Metal's equivalent of CUDA's `atomicAdd`.
- The reduction proceeds in two phases: within each SIMD-group (shuffles, no memory), then across SIMD-groups (threadgroup memory). This two-level approach minimizes memory traffic.
- Metal also provides `simd_sum(val)` as a built-in single-call reduction across a SIMD-group, which is more concise but less educational.
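For reference, here is a sketch of the same kernel with `simd_sum` replacing both shuffle loops. It uses only MSL standard-library built-ins, though the shortened variant is untested here:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void reduce_sum_simd(
    device const float* input [[buffer(0)]],
    device atomic_float* output [[buffer(1)]],
    constant uint& count [[buffer(2)]],
    threadgroup float* partials [[threadgroup(0)]],
    uint gid [[thread_position_in_grid]],
    uint lid [[thread_position_in_threadgroup]],
    uint simd_lane [[thread_index_in_simdgroup]],
    uint simd_id [[simdgroup_index_in_threadgroup]],
    uint simd_size [[threads_per_simdgroup]],
    uint tg_size [[threads_per_threadgroup]]
) {
    float val = (gid < count) ? input[gid] : 0.0f;
    val = simd_sum(val); // replaces the first shuffle loop

    uint num_simdgroups = (tg_size + simd_size - 1) / simd_size;
    if (simd_lane == 0) partials[simd_id] = val;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    if (simd_id == 0) {
        val = (lid < num_simdgroups) ? partials[lid] : 0.0f;
        val = simd_sum(val); // replaces the second shuffle loop
    }
    if (lid == 0) atomic_fetch_add_explicit(output, val, memory_order_relaxed);
}
```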
Example 4: Image Brightness (2D Grid, Practical Kernel)
A practical example: adjusting the brightness of an RGBA image using a 2D dispatch grid.
Kernel
```metal
#include <metal_stdlib>
using namespace metal;

kernel void adjust_brightness(
    device const uchar4* input [[buffer(0)]],
    device uchar4* output [[buffer(1)]],
    constant float& factor [[buffer(2)]],
    constant uint2& dims [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]]
) {
    if (gid.x >= dims.x || gid.y >= dims.y) return;

    uint idx = gid.y * dims.x + gid.x;
    float4 pixel = float4(input[idx]) / 255.0f;

    // Adjust brightness, clamp to valid range
    pixel.rgb = clamp(pixel.rgb * factor, 0.0f, 1.0f);
    output[idx] = uchar4(pixel * 255.0f);
}
```
Host code
```swift
import Metal
import simd

let width = 3840
let height = 2160 // 4K image
let pixelCount = width * height
let pixelSize = pixelCount * MemoryLayout<SIMD4<UInt8>>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "adjust_brightness")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Simulate image data (normally loaded from a file)
var pixels = [SIMD4<UInt8>](repeating: SIMD4<UInt8>(128, 128, 128, 255),
                            count: pixelCount)
var factor: Float = 1.5
var dims = SIMD2<UInt32>(UInt32(width), UInt32(height))

let inputBuffer = device.makeBuffer(bytes: &pixels, length: pixelSize,
                                    options: .storageModeShared)!
let outputBuffer = device.makeBuffer(length: pixelSize, options: .storageModeShared)!
let factorBuffer = device.makeBuffer(bytes: &factor,
                                     length: MemoryLayout<Float>.stride,
                                     options: .storageModeShared)!
let dimsBuffer = device.makeBuffer(bytes: &dims,
                                   length: MemoryLayout<SIMD2<UInt32>>.stride,
                                   options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(inputBuffer, offset: 0, index: 0)
encoder.setBuffer(outputBuffer, offset: 0, index: 1)
encoder.setBuffer(factorBuffer, offset: 0, index: 2)
encoder.setBuffer(dimsBuffer, offset: 0, index: 3)

// 2D dispatch - 8x8 threadgroups are common for image processing
let gridSize = MTLSize(width: width, height: height, depth: 1)
let threadgroupSize = MTLSize(width: 8, height: 8, depth: 1)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()

commandBuffer.commit()
commandBuffer.waitUntilCompleted()

let result = outputBuffer.contents().bindMemory(to: SIMD4<UInt8>.self, capacity: pixelCount)
print("Output pixel [0]: R=\(result[0].x) G=\(result[0].y) B=\(result[0].z) A=\(result[0].w)")
```
What this teaches:
- `uint2 gid [[thread_position_in_grid]]` gives you a 2D thread position - natural for image processing. The CUDA equivalent uses `blockIdx.x * blockDim.x + threadIdx.x` and `blockIdx.y * blockDim.y + threadIdx.y`.
- `uchar4` and `float4` are Metal's built-in vector types. The `.rgb` swizzle selects the first three components - a feature inherited from graphics shading languages that CUDA doesn't have.
- `clamp()` is a built-in MSL function for branchless range clamping - no warp/SIMD-group divergence.
- 8x8 threadgroups (64 threads) are common for 2D image kernels. They map well to the 2D spatial locality of image data.
- The entire input/output stays in unified memory. In CUDA, processing a 4K RGBA image (3840 x 2160 x 4 bytes ≈ 33 MB) requires ~66 MB of PCIe transfers (~33 MB up, ~33 MB down). In Metal, zero.
Metal Shading Language Reference
MSL is based on C++14 with GPU-specific extensions and restrictions. Here's what you need to know for compute kernels:
Supported types
| Type | Description | Notes |
|---|---|---|
| `float` / `half` | 32-bit / 16-bit floating point | `half` has native hardware support, 2x the throughput of `float` |
| `int` / `uint` | 32-bit signed/unsigned integer | |
| `short` / `ushort` | 16-bit signed/unsigned integer | |
| `char` / `uchar` | 8-bit signed/unsigned integer | |
| `float2`/`3`/`4` | Vector types | Component access via `.x` `.y` `.z` `.w` or swizzles like `.rgb` |
| `float2x2` to `float4x4` | Matrix types | Column-major storage |
| `atomic_int` / `atomic_float` | Atomic types | For thread-safe read-modify-write operations |
| `bool` | Boolean | |
No double. This is Metal's biggest data type limitation compared to CUDA. There's no hardware FP64 on Apple GPUs. If you need double precision for scientific computing, Metal is not the right tool.
Key kernel attributes
```metal
kernel void my_kernel(
    // Buffer bindings
    device float* data [[buffer(0)]],
    constant Params& params [[buffer(1)]],
    threadgroup float* shared [[threadgroup(0)]],
    // Thread identification
    uint gid [[thread_position_in_grid]],
    uint2 gid2d [[thread_position_in_grid]], // For 2D grids
    uint lid [[thread_position_in_threadgroup]],
    uint tgid [[threadgroup_position_in_grid]],
    uint tg_size [[threads_per_threadgroup]],
    // SIMD-group identification
    uint simd_lane [[thread_index_in_simdgroup]],
    uint simd_id [[simdgroup_index_in_threadgroup]],
    uint simd_size [[threads_per_simdgroup]]
);
```
What MSL doesn't support
- No lambda expressions
- No dynamic memory allocation (`new`/`delete`)
- No recursion
- No virtual functions
- No exception handling
- No function pointers
- No `goto`
These restrictions exist because GPU hardware doesn't support stack-based control flow the way CPUs do. Every thread must execute the same instruction sequence (with masking for divergent branches).
Metal vs CUDA: When to Use Which
If you've read the CUDA guide, here's the honest comparison:
| | Metal | CUDA |
|---|---|---|
| Memory model | Unified (zero-copy) | Discrete (explicit transfers) |
| Launch syntax | ~20 lines Swift boilerplate | <<<grid, block>>> - 2 lines |
| FP64 support | No | Yes |
| Max memory | 512 GB (M3 Ultra) | 80 GB (H100), 24 GB (RTX 4090) |
| Memory bandwidth | 546 GB/s (M4 Max) | 3,352 GB/s (H100) |
| FP32 TFLOPS | ~18 (M4 Max) | ~82 (RTX 4090), ~67 (H100) |
| Power draw | 40-80W | 300-700W |
| Multi-GPU | No | Yes (NVLink, NCCL) |
| Platform | macOS / iOS only | Linux, Windows, cloud |
| Library ecosystem | MPS, Accelerate | cuBLAS, cuDNN, cuFFT, TensorRT, NCCL, Thrust |
| Profiling tools | Xcode GPU debugger | Nsight Systems, Nsight Compute |
Use Metal when
- You're building a Mac or iOS app that needs GPU compute
- You want zero-copy memory sharing between CPU and GPU
- You need to load a large model (70B+ parameters) that won't fit in 24 GB of discrete VRAM but fits in 128-512 GB of unified memory
- Power efficiency matters (laptops, edge devices, always-on workloads)
- You're doing media processing (images, video, audio) in an Apple app
- You're using MLX, Core ML, or the PyTorch MPS backend
Use CUDA when
- You need maximum raw throughput for training or batch inference
- You need FP64 for scientific computing
- You need multi-GPU scaling across a cluster
- You depend on the library ecosystem (cuBLAS, cuDNN, TensorRT)
- You're launching to cloud GPU instances
- You're targeting production ML infrastructure
For a detailed look at CUDA programming with equivalent examples, see our CUDA programming guide.
The Metal Ecosystem
Metal Performance Shaders (MPS)
MPS is Apple's library of pre-optimized GPU kernels - hand-tuned by Apple engineers for each GPU generation. It covers:
- Image processing: Convolution, Gaussian blur, Sobel edge detection, histogram, median filter, Lanczos resampling
- Linear algebra: Matrix multiplication (
MPSMatrixMultiplication), decomposition, sparse operations - Neural networks: Convolution, fully connected, pooling, normalization, softmax, LSTM
- Ray tracing: Acceleration structure building, ray intersection
For standard operations, MPS kernels normally outperform hand-written Metal compute shaders. Use them before writing custom kernels.
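As a sketch of what that looks like, here is the 1024x1024 multiply from Example 2 expressed with `MPSMatrixMultiplication`. Buffer contents are elided (zero-filled); treat this as an outline and check the exact initializer arguments against the MPS documentation:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!

let n = 1024
let rowBytes = n * MemoryLayout<Float>.stride
let bufA = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!
let bufB = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!
let bufC = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!

// Describe the buffers as row-major float32 matrices
let desc = MPSMatrixDescriptor(rows: n, columns: n,
                               rowBytes: rowBytes, dataType: .float32)
let A = MPSMatrix(buffer: bufA, descriptor: desc)
let B = MPSMatrix(buffer: bufB, descriptor: desc)
let C = MPSMatrix(buffer: bufC, descriptor: desc)

// C = 1.0 * A * B + 0.0 * C, with kernels Apple tunes per GPU generation
let matmul = MPSMatrixMultiplication(device: device,
                                     transposeLeft: false, transposeRight: false,
                                     resultRows: n, resultColumns: n,
                                     interiorColumns: n, alpha: 1.0, beta: 0.0)

let cb = commandQueue.makeCommandBuffer()!
matmul.encode(commandBuffer: cb, leftMatrix: A, rightMatrix: B, resultMatrix: C)
cb.commit()
cb.waitUntilCompleted()
```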
PyTorch MPS backend
PyTorch supports Apple GPUs via the MPS backend (since PyTorch 1.12):
```python
import torch

device = torch.device("mps")
model = MyModel().to(device)
x = torch.randn(32, 3, 224, 224).to(device)
output = model(x)  # Runs on the Apple GPU
```
The MPS backend uses MPSGraph under the hood. Most common operations are supported, but coverage isn't complete. Set PYTORCH_ENABLE_MPS_FALLBACK=1 to fall back to CPU for unsupported ops.
MLX
MLX is Apple's open-source ML framework, designed specifically for Apple Silicon. It provides a NumPy-like API with lazy evaluation, automatic differentiation, and native Metal acceleration:
```python
import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = a + b  # Runs on the GPU via Metal
```
MLX is the fastest path for LLM inference on Mac: ~50 tokens/s on quantized Llama 3B with M3 Max, and the MLX community maintains a growing library of optimized models.
Metal 4
Metal 4, announced at WWDC 2025, is the largest API overhaul since Metal's introduction:
- Native tensor types: `MTLTensor` is now a first-class resource alongside buffers and textures - a multi-dimensional container designed for ML workloads
- ML command encoders: `MTL4MachineLearningCommandEncoder` encodes inference directly onto the GPU timeline alongside graphics and compute work
- Shader ML: embed small ML models directly inside vertex, fragment, or compute shaders - inference happens without round-tripping through memory
- Unified command encoding: a single encoder type handles compute dispatch, memory copies, and acceleration structure operations
- Frame interpolation: MetalFX can create intermediate frames (comparable to NVIDIA DLSS 3 frame generation)
Metal 4 requires M1 or later on Mac, A14 or later on iOS/iPadOS.
Debugging and Profiling
Xcode Metal Debugger
The Metal debugger in Xcode provides GPU frame capture for inspecting all compute dispatches and resource states. Key features:
- Shader debugger: Step through compute shaders line-by-line with real GPU data. View execution across thousands of threads simultaneously.
- Dependency viewer: Graphical view of resource dependencies and synchronization between passes.
- Shader profiling: Inline per-line cost statistics in your shader source. Edit and reload to compare performance.
- Performance heat maps: Visualize expensive threads. Select a SIMD-group to see its full execution history.
Metal System Trace (Instruments)
Instruments' Metal System Trace shows CPU and GPU timelines side-by-side:
- Command buffer submission and completion timing
- Encoder-level performance breakdown
- CPU-GPU synchronization stalls (the most common performance problem in Metal compute)
GPU Performance Counters
- Occupancy: How well the GPU's execution units are utilized
- Bandwidth: Memory bandwidth use
- Limiter counters: Which hardware unit is the bottleneck
The general profiling workflow: use Instruments to find which kernel is slow, then use the Xcode shader debugger to find which lines in that kernel are expensive.
Common Pitfalls
Forgetting threadgroup_barrier. If threads in a threadgroup cooperate through shared memory (loading tiles, reduction), you need barriers between the write phase and the read phase. Without barriers, threads read partially-written data. This produces incorrect results that are difficult to debug because the timing varies.
Dispatching more threads than data. Always check bounds in your kernel: if (gid >= count) return;. dispatchThreadgroups rounds the grid up to whole threadgroups, creating threads that index past your buffer; dispatchThreads dispatches an exact count on Apple Silicon, but the bounds check keeps the kernel safe under either dispatch style.
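The overshoot is easy to see with plain arithmetic (no Metal needed):

```swift
// Threads launched when a grid is rounded up to whole threadgroups
// (the dispatchThreadgroups case).
let count = 1_000_000  // elements to process
let tgWidth = 256      // threads per threadgroup

let groups = (count + tgWidth - 1) / tgWidth  // ceiling division
let launched = groups * tgWidth

print(groups)            // 3907
print(launched)          // 1000192
print(launched - count)  // 192 threads that would index past the buffer
```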
Using .storageModeShared for everything. For intermediate buffers that only the GPU touches, use .storageModePrivate. The GPU can optimize access patterns for private buffers, and you avoid polluting the CPU's cache.
Ignoring command buffer errors. Check commandBuffer.error after waitUntilCompleted(). GPU errors are asynchronous - a kernel might silently produce wrong results from out-of-bounds access:
```swift
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
if let error = commandBuffer.error {
    print("GPU error: \(error)")
}
```
Blocking the main thread. waitUntilCompleted() blocks the calling thread. In an app with a UI, dispatch GPU work on a background queue and use addCompletedHandler instead:
```swift
commandBuffer.addCompletedHandler { cb in
    // Read results here, on a background thread
}
commandBuffer.commit()
```
Over-dispatching small workloads. Metal has non-trivial dispatch overhead (~10-50 microseconds for command buffer creation, encoding, and submission). If your data is small (fewer than ~10,000 elements), the dispatch overhead may exceed the compute time. For small workloads, batch multiple operations into a single command buffer or use the CPU.
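A sketch of that batching, where `pipelineA`, `pipelineB`, `buf`, and `n` are hypothetical stand-ins for two precompiled compute pipelines and their shared data:

```swift
// One command buffer, one submission, several small dispatches.
let cb = commandQueue.makeCommandBuffer()!
let enc = cb.makeComputeCommandEncoder()!
let grid = MTLSize(width: n, height: 1, depth: 1)
let tg = MTLSize(width: 256, height: 1, depth: 1)

enc.setComputePipelineState(pipelineA)
enc.setBuffer(buf, offset: 0, index: 0)
enc.dispatchThreads(grid, threadsPerThreadgroup: tg)

enc.setComputePipelineState(pipelineB) // same encoder, second kernel
enc.dispatchThreads(grid, threadsPerThreadgroup: tg)

enc.endEncoding()
cb.commit() // the dispatch overhead is paid once, not per kernel
```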
Where to Go From Here
The examples in this guide cover the core Metal compute patterns: element-wise operations, tiled matrix math, SIMD-group reductions, and 2D image processing. From here:
- Read the official Metal Shading Language Specification - it's the final MSL reference
- Watch Discover Metal 4 from WWDC 2025 for the latest API features
- Try MLX if you want ML on Apple Silicon without writing custom kernels
- Profile before optimizing - use Instruments' Metal System Trace to find actual bottlenecks before writing threadgroup memory code
- Read the companion CUDA programming guide if you work across both platforms - the optimization concepts (tiling, coalescing, reduction hierarchies) transfer directly
- For hands-on Metal compute examples, Metal by Example provides excellent walkthroughs
The mental model for Metal compute is the same as CUDA: think about data locality, minimize memory traffic, and use the thread hierarchy to cooperate on shared data. The API is different. The hardware is different. The optimization principles are the same.
Sources:
- Metal Overview - Apple Developer
- Metal Shading Language Specification v4 - Apple Developer
- Metal Performance Shaders - Apple Developer Documentation
- Discover Metal 4 - WWDC25 - Apple Developer
- Introduction to Compute Programming in Metal - Metal by Example
- MLX - Apple ML Framework
- Accelerated PyTorch Training on Mac - Apple Developer
- Metal Developer Tools - Apple Developer
