
Metal GPU Programming - A Practical Guide for macOS Developers

A hands-on guide to Metal compute programming on Apple Silicon. Covers architecture, unified memory, real code examples in MSL and Swift, and how Metal compares to CUDA.


You own a Mac with Apple Silicon. You've heard that the GPU is capable of more than rendering your desktop. Maybe you've run a local LLM with llama.cpp or watched Stable Diffusion generate images and noticed the GPU activity spike. But you've never written a GPU kernel yourself, and the gap between "my Mac has a GPU" and "I understand how to program it" feels wide.

This guide closes that gap. If you've read our companion CUDA programming guide, you already know GPU concepts - this guide shows how they map to Apple's world. If you haven't, don't worry. We'll cover everything from scratch with real, compilable code.

By the end, you'll understand Metal's compute model, its unified memory architecture, and how to write GPU kernels in Metal Shading Language (MSL) with Swift host code. We'll build four examples of steadily increasing complexity and cover how Metal compares to CUDA for developers considering both platforms.

Why Metal Exists

Metal is Apple's low-level GPU programming API, introduced in 2014 to replace OpenGL ES and OpenCL on Apple platforms. It handles both graphics rendering and general-purpose compute - the same API that renders your SwiftUI animations also powers on-device ML inference.

On Apple Silicon (M1 and later), Metal runs on a GPU that shares physical memory with the CPU, Neural Engine, and media engines. This Unified Memory Architecture (UMA) is Metal's defining feature: the CPU and GPU see the same memory, so there's no copying data between host and device. If you've written CUDA code, you know the cudaMemcpy dance - allocate on the host, allocate on the device, copy host-to-device, run the kernel, copy device-to-host. Metal skips all of that.

The Metal Compute Model

Metal's thread hierarchy mirrors CUDA's, with different names:

| Concept | Metal | CUDA equivalent |
|---|---|---|
| Smallest execution unit | SIMD-group (32 threads) | Warp (32 threads) |
| Cooperative thread group | Threadgroup (max 1024 threads) | Block (max 1024 threads) |
| All dispatched threads | Grid | Grid |
| Fast on-chip shared memory | threadgroup memory | __shared__ memory |
| Per-thread private memory | thread memory | Registers |

Three things are different in practice:

Thread indexing is declarative. In CUDA, you compute your global thread ID manually: blockIdx.x * blockDim.x + threadIdx.x. In Metal, you declare what you need as a kernel parameter with an attribute:

uint gid [[thread_position_in_grid]]        // Global thread ID - Metal gives it to you
uint lid [[thread_position_in_threadgroup]]  // Local ID within threadgroup
uint tgid [[threadgroup_position_in_grid]]   // Threadgroup ID

Address spaces are explicit. Every pointer in Metal Shading Language must be qualified with an address space. This is mandatory - the compiler rejects unqualified pointers:

  • device - GPU-accessible buffers (read/write)
  • constant - Read-only data, optimized for broadcast across threads
  • threadgroup - Shared within a threadgroup (fast on-chip SRAM)
  • thread - Per-thread private memory

Dispatch is verbose. CUDA launches a kernel in one line: myKernel<<<grid, block>>>(args). Metal requires creating a command queue, command buffer, compute command encoder, setting each buffer, dispatching, ending encoding, committing, and waiting. It's roughly 15-20 lines of Swift versus 2 lines of CUDA. The verbosity gives you explicit control over resource binding and submission, but it's a lot of boilerplate for simple operations.

Memory Architecture

Understanding memory is the key to GPU performance. This is true for CUDA and equally true for Metal - but the memory model is fundamentally different.

Unified memory - the key advantage

In a discrete GPU system (NVIDIA), the CPU has its own DRAM and the GPU has its own VRAM. Data must cross the PCIe bus between them. This copy is the single biggest source of overhead in simple CUDA programs.

Apple Silicon eliminates this entirely. The CPU, GPU, and Neural Engine all share the same physical memory pool. When you create a Metal buffer with .storageModeShared, the CPU writes data and the GPU reads it from the same address. No copy. No PCIe bus. No cudaMemcpy.

// Create a buffer visible to both CPU and GPU - zero copy
let buffer = device.makeBuffer(bytes: data,
                                length: dataSize,
                                options: .storageModeShared)

The tradeoff: Apple Silicon's memory bandwidth is lower than the dedicated HBM on datacenter GPUs. An M4 Max delivers up to 546 GB/s; an H100 delivers 3,352 GB/s. But for many workloads - especially those that are compute-bound rather than bandwidth-bound - unified memory's zero-copy advantage matters more than raw bandwidth.

Storage modes

Metal offers three storage modes for buffers:

| Mode | CPU access | GPU access | Best for |
|---|---|---|---|
| .storageModeShared | Read/write | Read/write | Data that changes frequently; small-to-medium buffers |
| .storageModePrivate | None | Read/write | GPU-only intermediates; maximum GPU bandwidth |
| .storageModeManaged | Read/write (with sync) | Read/write | Large textures on macOS (separate CPU/GPU copies) |

For compute workloads, .storageModeShared is the default choice. Use .storageModePrivate for intermediate buffers that the CPU never touches - the GPU may optimize access patterns for private buffers.
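A minimal sketch of the private-buffer pattern: since the CPU cannot touch a .storageModePrivate buffer, you stage the data in a shared buffer and copy it across once with a blit encoder. Buffer names and sizes here are illustrative:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

// Staging buffer the CPU can write; private buffer only the GPU sees
var seed = [Float](repeating: 0, count: 1_024)
let byteCount = seed.count * MemoryLayout<Float>.stride
let staging = device.makeBuffer(bytes: &seed, length: byteCount,
                                options: .storageModeShared)!
let scratch = device.makeBuffer(length: byteCount,
                                options: .storageModePrivate)!

// One-time blit moves the data into GPU-optimal private storage
let cb = queue.makeCommandBuffer()!
let blit = cb.makeBlitCommandEncoder()!
blit.copy(from: staging, sourceOffset: 0,
          to: scratch, destinationOffset: 0, size: byteCount)
blit.endEncoding()
cb.commit()
cb.waitUntilCompleted()
```

After the blit, `scratch` can be bound to compute encoders like any other buffer; only the CPU-side `contents()` access is off-limits.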

Memory hierarchy

| Level | Metal | CUDA equivalent | Latency |
|---|---|---|---|
| Registers | Per-thread | Registers | 0 cycles |
| Threadgroup memory | threadgroup address space | __shared__ memory | ~20-30 cycles |
| System-Level Cache (SLC) | ~48 MB on Max chips | L2 cache | ~100 cycles |
| Unified memory (LPDDR5/5X) | Up to 512 GB on Ultra | Global (HBM/GDDR) | ~200-400 cycles |

The gap between threadgroup memory and unified memory is ~10-20x in latency. Every tiling and data reuse optimization in Metal exists to keep data in threadgroup memory and avoid hitting main memory.

Example 1: Vector Addition (Hello World)

The simplest possible Metal compute program. It adds two arrays element-by-element. You need two files: the kernel in MSL and the host code in Swift.

Kernel: compute.metal

#include <metal_stdlib>
using namespace metal;

kernel void vector_add(
    device const float* a [[buffer(0)]],
    device const float* b [[buffer(1)]],
    device float*       c [[buffer(2)]],
    uint id [[thread_position_in_grid]]
) {
    c[id] = a[id] + b[id];
}

Host: main.swift

import Metal
import Foundation

let N = 1_000_000
let dataSize = N * MemoryLayout<Float>.stride

// 1. Get the GPU and create a command queue
guard let device = MTLCreateSystemDefaultDevice(),
      let commandQueue = device.makeCommandQueue() else {
    fatalError("Metal is not supported on this device")
}

// 2. Compile the kernel into a pipeline
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "vector_add")!
let pipeline = try! device.makeComputePipelineState(function: function)

// 3. Create input data
var a = (0..<N).map { Float($0) }
var b = (0..<N).map { Float($0) * 2.0 }

// 4. Create Metal buffers - unified memory, no copy needed
let bufferA = device.makeBuffer(bytes: &a, length: dataSize, options: .storageModeShared)!
let bufferB = device.makeBuffer(bytes: &b, length: dataSize, options: .storageModeShared)!
let bufferC = device.makeBuffer(length: dataSize, options: .storageModeShared)!

// 5. Encode the compute command
let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(bufferA, offset: 0, index: 0)
encoder.setBuffer(bufferB, offset: 0, index: 1)
encoder.setBuffer(bufferC, offset: 0, index: 2)

// 6. Dispatch threads
let gridSize = MTLSize(width: N, height: 1, depth: 1)
let threadgroupSize = MTLSize(
    width: min(pipeline.maxTotalThreadsPerThreadgroup, N),
    height: 1, depth: 1
)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()

// 7. Submit and wait
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// 8. Read results directly - no device-to-host copy
let result = bufferC.contents().bindMemory(to: Float.self, capacity: N)
print("c[0] = \(result[0])")             // 0.0
print("c[1] = \(result[1])")             // 3.0
print("c[999999] = \(result[999999])")   // 2999997.0

To build this in Xcode: create a new macOS Command Line Tool project, add compute.metal to the target, and replace main.swift with the host code above.

What this teaches:

  • kernel void marks a GPU function (equivalent to CUDA's __global__)
  • [[buffer(0)]] binds a parameter to a specific buffer index - the host code's setBuffer(..., index: 0) matches this
  • [[thread_position_in_grid]] gives each thread a unique global ID. No manual index math like CUDA's blockIdx.x * blockDim.x + threadIdx.x
  • device const float* and device float* are address-space-qualified pointers - MSL requires this
  • dispatchThreads handles grid sizing automatically. Metal will dispatch exactly N threads and pad the last threadgroup if needed
  • The result is read directly from bufferC.contents() - no cudaMemcpy back to host. This is unified memory in action.

Compare this to the CUDA vector addition: the kernel logic is nearly identical, but Metal eliminates all six memory management calls (cudaMalloc x3, cudaMemcpy x2, cudaFree x3) in exchange for the more verbose command encoding.

Example 2: Matrix Multiplication (Naive vs Tiled)

Matrix multiplication is where GPUs truly shine, and where threadgroup memory becomes essential. We'll build both a naive version and an optimized tiled version to show the difference.

Naive kernel

#include <metal_stdlib>
using namespace metal;

kernel void matmul_naive(
    device const float* A [[buffer(0)]],
    device const float* B [[buffer(1)]],
    device float*       C [[buffer(2)]],
    constant uint&      N [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]]
) {
    uint row = gid.y;
    uint col = gid.x;
    if (row >= N || col >= N) return;

    float sum = 0.0f;
    for (uint k = 0; k < N; ++k) {
        sum += A[row * N + k] * B[k * N + col];
    }
    C[row * N + col] = sum;
}

The naive version works but wastes memory bandwidth. Each thread loads an entire row of A and column of B from unified memory. For a 1024x1024 multiply, each element of A and B gets loaded 1024 times across different threads.

Tiled version with threadgroup memory

#include <metal_stdlib>
using namespace metal;

#define TILE_SIZE 16

kernel void matmul_tiled(
    device const float* A     [[buffer(0)]],
    device const float* B     [[buffer(1)]],
    device float*       C     [[buffer(2)]],
    constant uint&      N     [[buffer(3)]],
    threadgroup float*  tileA [[threadgroup(0)]],
    threadgroup float*  tileB [[threadgroup(1)]],
    uint2 tgid [[threadgroup_position_in_grid]],
    uint2 lid  [[thread_position_in_threadgroup]]
) {
    uint row = tgid.y * TILE_SIZE + lid.y;
    uint col = tgid.x * TILE_SIZE + lid.x;
    float sum = 0.0f;

    uint numTiles = (N + TILE_SIZE - 1) / TILE_SIZE;

    for (uint t = 0; t < numTiles; ++t) {
        // Cooperatively load tile of A
        uint aCol = t * TILE_SIZE + lid.x;
        if (row < N && aCol < N)
            tileA[lid.y * TILE_SIZE + lid.x] = A[row * N + aCol];
        else
            tileA[lid.y * TILE_SIZE + lid.x] = 0.0f;

        // Cooperatively load tile of B
        uint bRow = t * TILE_SIZE + lid.y;
        if (bRow < N && col < N)
            tileB[lid.y * TILE_SIZE + lid.x] = B[bRow * N + col];
        else
            tileB[lid.y * TILE_SIZE + lid.x] = 0.0f;

        // Wait for all threads to finish loading
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // Compute partial dot product from this tile
        for (uint k = 0; k < TILE_SIZE; ++k) {
            sum += tileA[lid.y * TILE_SIZE + k] * tileB[k * TILE_SIZE + lid.x];
        }

        // Wait before loading next tile
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    if (row < N && col < N) {
        C[row * N + col] = sum;
    }
}

Host code for tiled matmul

import Metal

let N: UInt32 = 1024
let matSize = Int(N * N) * MemoryLayout<Float>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "matmul_tiled")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Create matrices with random data
var matA = (0..<Int(N * N)).map { _ in Float.random(in: 0...1) }
var matB = (0..<Int(N * N)).map { _ in Float.random(in: 0...1) }
var n = N

let bufA = device.makeBuffer(bytes: &matA, length: matSize, options: .storageModeShared)!
let bufB = device.makeBuffer(bytes: &matB, length: matSize, options: .storageModeShared)!
let bufC = device.makeBuffer(length: matSize, options: .storageModeShared)!
let bufN = device.makeBuffer(bytes: &n, length: MemoryLayout<UInt32>.stride,
                              options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(bufA, offset: 0, index: 0)
encoder.setBuffer(bufB, offset: 0, index: 1)
encoder.setBuffer(bufC, offset: 0, index: 2)
encoder.setBuffer(bufN, offset: 0, index: 3)

// Allocate threadgroup memory for tiles
let tileSize = 16
let tileBytes = tileSize * tileSize * MemoryLayout<Float>.stride
encoder.setThreadgroupMemoryLength(tileBytes, index: 0) // tileA
encoder.setThreadgroupMemoryLength(tileBytes, index: 1) // tileB

let threadgroupSize = MTLSize(width: tileSize, height: tileSize, depth: 1)
let threadgroupCount = MTLSize(
    width: (Int(N) + tileSize - 1) / tileSize,
    height: (Int(N) + tileSize - 1) / tileSize,
    depth: 1
)
encoder.dispatchThreadgroups(threadgroupCount, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Read result directly
let result = bufC.contents().bindMemory(to: Float.self, capacity: Int(N * N))
print("C[0,0] = \(result[0])")

What the tiled version teaches:

  • threadgroup float* tileA [[threadgroup(0)]] declares shared memory allocated from the host via setThreadgroupMemoryLength. In CUDA, you'd declare __shared__ float tileA[16][16] directly in the kernel.
  • threadgroup_barrier(mem_flags::mem_threadgroup) is Metal's equivalent of CUDA's __syncthreads(). All threads in the threadgroup must reach the barrier before any can proceed.
  • The algorithm loads 16x16 tiles of A and B into threadgroup memory, computes partial results, then moves to the next tile. Each element gets loaded from unified memory once per tile, then reused 16 times from fast on-chip memory.
  • dispatchThreadgroups (note: not dispatchThreads) dispatches a specific number of threadgroups. This gives you explicit control over the grid layout, which you need when your kernel depends on threadgroup-level cooperation.
  • On a 1024x1024 multiply, the tiled version is usually 3-5x faster than naive - the same improvement you'd see in CUDA, because the optimization is about memory reuse, not the specific API.

Example 3: Parallel Reduction (Sum with SIMD-group Operations)

Reduction - computing a single value from an array - shows SIMD-group operations, Metal's equivalent of CUDA's warp shuffle instructions.

#include <metal_stdlib>
using namespace metal;

kernel void parallel_reduce_sum(
    device const float*  input   [[buffer(0)]],
    device atomic_float* output  [[buffer(1)]],
    constant uint&       count   [[buffer(2)]],
    threadgroup float*   partials [[threadgroup(0)]],
    uint gid       [[thread_position_in_grid]],
    uint lid       [[thread_position_in_threadgroup]],
    uint simd_lane [[thread_index_in_simdgroup]],
    uint simd_id   [[simdgroup_index_in_threadgroup]],
    uint simd_size [[threads_per_simdgroup]],
    uint tg_size   [[threads_per_threadgroup]]
) {
    // Step 1: Each thread loads its element
    float val = (gid < count) ? input[gid] : 0.0f;

    // Step 2: Reduce within each SIMD-group using shuffle
    // No shared memory needed - values pass directly between lanes
    for (uint offset = simd_size / 2; offset > 0; offset /= 2) {
        val += simd_shuffle_down(val, offset);
    }

    // Step 3: Lane 0 of each SIMD-group writes to threadgroup memory
    uint num_simdgroups = (tg_size + simd_size - 1) / simd_size;
    if (simd_lane == 0) {
        partials[simd_id] = val;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Step 4: First SIMD-group reduces the partial sums
    if (simd_id == 0) {
        val = (lid < num_simdgroups) ? partials[lid] : 0.0f;
        for (uint offset = simd_size / 2; offset > 0; offset /= 2) {
            val += simd_shuffle_down(val, offset);
        }
    }

    // Step 5: Thread 0 atomically adds this threadgroup's total
    if (lid == 0) {
        atomic_fetch_add_explicit(output, val, memory_order_relaxed);
    }
}

Host code for reduction

import Metal

let N: UInt32 = 1 << 24  // ~16 million elements
let dataSize = Int(N) * MemoryLayout<Float>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "parallel_reduce_sum")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Fill with 1.0 so expected sum = N
var input = [Float](repeating: 1.0, count: Int(N))
var outputValue: Float = 0.0
var count = N

let inputBuffer = device.makeBuffer(bytes: &input, length: dataSize,
                                     options: .storageModeShared)!
let outputBuffer = device.makeBuffer(bytes: &outputValue,
                                      length: MemoryLayout<Float>.stride,
                                      options: .storageModeShared)!
let countBuffer = device.makeBuffer(bytes: &count,
                                     length: MemoryLayout<UInt32>.stride,
                                     options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(inputBuffer, offset: 0, index: 0)
encoder.setBuffer(outputBuffer, offset: 0, index: 1)
encoder.setBuffer(countBuffer, offset: 0, index: 2)

// Allocate threadgroup memory for partial sums (one per SIMD-group)
let threadgroupSize = min(pipeline.maxTotalThreadsPerThreadgroup, Int(N))
let numSimdgroups = (threadgroupSize + 31) / 32
encoder.setThreadgroupMemoryLength(numSimdgroups * MemoryLayout<Float>.stride, index: 0)

let gridSize = MTLSize(width: Int(N), height: 1, depth: 1)
let tgSize = MTLSize(width: threadgroupSize, height: 1, depth: 1)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: tgSize)
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

let result = outputBuffer.contents().bindMemory(to: Float.self, capacity: 1)
print("Sum: \(result[0]) (expected: \(N))")

What this teaches:

  • simd_shuffle_down(val, offset) passes a value from lane current + offset to the current lane - directly between SIMD lanes, no memory access at all. This is the fastest possible data exchange between threads. The CUDA equivalent is __shfl_down_sync(0xffffffff, val, offset).
  • Metal provides the SIMD-group index and lane index as kernel parameters via [[thread_index_in_simdgroup]] and [[simdgroup_index_in_threadgroup]]. In CUDA, you compute these manually: threadIdx.x % 32 and threadIdx.x / 32.
  • atomic_fetch_add_explicit lets multiple threadgroups contribute safely. This is Metal's equivalent of CUDA's atomicAdd.
  • The reduction proceeds in two phases: within each SIMD-group (shuffles, no memory), then across SIMD-groups (threadgroup memory). This two-level approach minimizes memory traffic.
  • Metal also provides simd_sum(val) as a built-in single-call reduction across a SIMD-group, which is more concise but less educational.
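As a sketch of that last point, here is how the reduction above might look with simd_sum collapsing the shuffle loops into single calls. The kernel name is hypothetical; the bindings mirror the hand-rolled version:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void reduce_sum_builtin(
    device const float*  input    [[buffer(0)]],
    device atomic_float* output   [[buffer(1)]],
    constant uint&       count    [[buffer(2)]],
    threadgroup float*   partials [[threadgroup(0)]],
    uint gid       [[thread_position_in_grid]],
    uint lid       [[thread_position_in_threadgroup]],
    uint simd_lane [[thread_index_in_simdgroup]],
    uint simd_id   [[simdgroup_index_in_threadgroup]],
    uint simd_size [[threads_per_simdgroup]],
    uint tg_size   [[threads_per_threadgroup]]
) {
    float val = (gid < count) ? input[gid] : 0.0f;

    // One call replaces the manual shuffle loop (Step 2)
    val = simd_sum(val);

    uint num_simdgroups = (tg_size + simd_size - 1) / simd_size;
    if (simd_lane == 0) partials[simd_id] = val;
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // First SIMD-group reduces the partial sums (Step 4), again in one call
    if (simd_id == 0) {
        val = (lid < num_simdgroups) ? partials[lid] : 0.0f;
        val = simd_sum(val);
        if (lid == 0) atomic_fetch_add_explicit(output, val, memory_order_relaxed);
    }
}
```

The same host code drives either version; only the kernel name in makeFunction changes.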

Example 4: Image Brightness (2D Grid, Practical Kernel)

A practical example: adjusting the brightness of an RGBA image using a 2D dispatch grid.

Kernel

#include <metal_stdlib>
using namespace metal;

kernel void adjust_brightness(
    device const uchar4* input  [[buffer(0)]],
    device uchar4*       output [[buffer(1)]],
    constant float&      factor [[buffer(2)]],
    constant uint2&      dims   [[buffer(3)]],
    uint2 gid [[thread_position_in_grid]]
) {
    if (gid.x >= dims.x || gid.y >= dims.y) return;

    uint idx = gid.y * dims.x + gid.x;
    float4 pixel = float4(input[idx]) / 255.0f;

    // Adjust brightness, clamp to valid range
    pixel.rgb = clamp(pixel.rgb * factor, 0.0f, 1.0f);

    output[idx] = uchar4(pixel * 255.0f);
}

Host code

import Metal
import simd

let width = 3840
let height = 2160  // 4K image
let pixelCount = width * height
let pixelSize = pixelCount * MemoryLayout<SIMD4<UInt8>>.stride

let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
let function = library.makeFunction(name: "adjust_brightness")!
let pipeline = try! device.makeComputePipelineState(function: function)

// Simulate image data (normally loaded from a file)
var pixels = [SIMD4<UInt8>](repeating: SIMD4<UInt8>(128, 128, 128, 255),
                             count: pixelCount)
var factor: Float = 1.5
var dims = SIMD2<UInt32>(UInt32(width), UInt32(height))

let inputBuffer = device.makeBuffer(bytes: &pixels, length: pixelSize,
                                     options: .storageModeShared)!
let outputBuffer = device.makeBuffer(length: pixelSize, options: .storageModeShared)!
let factorBuffer = device.makeBuffer(bytes: &factor,
                                      length: MemoryLayout<Float>.stride,
                                      options: .storageModeShared)!
let dimsBuffer = device.makeBuffer(bytes: &dims,
                                    length: MemoryLayout<SIMD2<UInt32>>.stride,
                                    options: .storageModeShared)!

let commandBuffer = commandQueue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(inputBuffer, offset: 0, index: 0)
encoder.setBuffer(outputBuffer, offset: 0, index: 1)
encoder.setBuffer(factorBuffer, offset: 0, index: 2)
encoder.setBuffer(dimsBuffer, offset: 0, index: 3)

// 2D dispatch - 8x8 threadgroups are common for image processing
let gridSize = MTLSize(width: width, height: height, depth: 1)
let threadgroupSize = MTLSize(width: 8, height: 8, depth: 1)
encoder.dispatchThreads(gridSize, threadsPerThreadgroup: threadgroupSize)
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

let result = outputBuffer.contents().bindMemory(to: SIMD4<UInt8>.self, capacity: pixelCount)
print("Output pixel [0]: R=\(result[0].x) G=\(result[0].y) B=\(result[0].z) A=\(result[0].w)")

What this teaches:

  • uint2 gid [[thread_position_in_grid]] gives you a 2D thread position - natural for image processing. The CUDA equivalent uses blockIdx.x * blockDim.x + threadIdx.x and blockIdx.y * blockDim.y + threadIdx.y.
  • uchar4 and float4 are Metal's built-in vector types. The .rgb swizzle selects the first three components - this is a feature inherited from graphics shading languages that CUDA doesn't have.
  • clamp() is a built-in MSL function for branchless range clamping. No warp/SIMD-group divergence.
  • 8x8 threadgroups (64 threads) are common for 2D image kernels. They map well to the 2D spatial locality of image data.
  • The entire input/output stays in unified memory. In CUDA, processing a 4K image requires ~48 MB of PCIe transfers (24 MB up, 24 MB down). In Metal, zero.

Metal Shading Language Reference

MSL is based on C++14 with GPU-specific extensions and restrictions. Here's what you need to know for compute kernels:

Supported types

| Type | Description | Notes |
|---|---|---|
| float / half | 32-bit / 16-bit floating point | half has native hardware support, 2x the throughput of float |
| int / uint | 32-bit signed/unsigned integer | |
| short / ushort | 16-bit signed/unsigned integer | |
| char / uchar | 8-bit signed/unsigned integer | |
| float2/3/4 | Vector types | Component access via .x .y .z .w or swizzles like .rgb |
| float2x2 to float4x4 | Matrix types | Column-major storage |
| atomic_int / atomic_float | Atomic types | For thread-safe read-modify-write operations |
| bool | Boolean | |

No double. This is Metal's biggest data type limitation compared to CUDA. There's no hardware FP64 on Apple GPUs. If you need double precision for scientific computing, Metal is not the right tool.

Key kernel attributes

kernel void my_kernel(
    // Buffer bindings
    device float*       data    [[buffer(0)]],
    constant Params&    params  [[buffer(1)]],
    threadgroup float*  shared  [[threadgroup(0)]],

    // Thread identification
    uint  gid       [[thread_position_in_grid]],       // 1D grids
    uint2 gid2d     [[thread_position_in_grid]],       // 2D grids (use one form or the other, not both)
    uint  lid       [[thread_position_in_threadgroup]],
    uint  tgid      [[threadgroup_position_in_grid]],
    uint  tg_size   [[threads_per_threadgroup]],

    // SIMD-group identification
    uint  simd_lane [[thread_index_in_simdgroup]],
    uint  simd_id   [[simdgroup_index_in_threadgroup]],
    uint  simd_size [[threads_per_simdgroup]]
);

What MSL doesn't support

  • No lambda expressions
  • No dynamic memory allocation (new/delete)
  • No recursion
  • No virtual functions
  • No exception handling
  • No function pointers
  • No goto

These restrictions exist because GPU hardware doesn't support stack-based control flow the way CPUs do. Every thread must execute the same instruction sequence (with masking for divergent branches).

Metal vs CUDA: When to Use Which

If you've read the CUDA guide, here's the honest comparison:

| | Metal | CUDA |
|---|---|---|
| Memory model | Unified (zero-copy) | Discrete (explicit transfers) |
| Launch syntax | ~20 lines of Swift boilerplate | <<<grid, block>>> - 2 lines |
| FP64 support | No | Yes |
| Max memory | 512 GB (M3 Ultra) | 80 GB (H100), 24 GB (RTX 4090) |
| Memory bandwidth | 546 GB/s (M4 Max) | 3,352 GB/s (H100) |
| FP32 TFLOPS | ~18 (M4 Max) | ~82 (RTX 4090), ~67 (H100) |
| Power draw | 40-80 W | 300-700 W |
| Multi-GPU | No | Yes (NVLink, NCCL) |
| Platform | macOS / iOS only | Linux, Windows, cloud |
| Library ecosystem | MPS, Accelerate | cuBLAS, cuDNN, cuFFT, TensorRT, NCCL, Thrust |
| Profiling tools | Xcode GPU debugger | Nsight Systems, Nsight Compute |

Use Metal when

  • You're building a Mac or iOS app that needs GPU compute
  • You want zero-copy memory sharing between CPU and GPU
  • You need to load a large model (70B+ parameters) that won't fit in 24 GB of discrete VRAM but fits in 128-512 GB of unified memory
  • Power efficiency matters (laptops, edge devices, always-on workloads)
  • You're doing media processing (images, video, audio) in an Apple app
  • You're using MLX, Core ML, or the PyTorch MPS backend

Use CUDA when

  • You need maximum raw throughput for training or batch inference
  • You need FP64 for scientific computing
  • You need multi-GPU scaling across a cluster
  • You depend on the library ecosystem (cuBLAS, cuDNN, TensorRT)
  • You're launching to cloud GPU instances
  • You're targeting production ML infrastructure

For a detailed look at CUDA programming with equivalent examples, see our CUDA programming guide.

The Metal Ecosystem

Metal Performance Shaders (MPS)

MPS is Apple's library of pre-optimized GPU kernels - hand-tuned by Apple engineers for each GPU generation. It covers:

  • Image processing: Convolution, Gaussian blur, Sobel edge detection, histogram, median filter, Lanczos resampling
  • Linear algebra: Matrix multiplication (MPSMatrixMultiplication), decomposition, sparse operations
  • Neural networks: Convolution, fully connected, pooling, normalization, softmax, LSTM
  • Ray tracing: Acceleration structure building, ray intersection

For standard operations, MPS kernels normally outperform hand-written Metal compute shaders. Use them before writing custom kernels.
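As an example, the tiled matmul from Example 2 can be replaced by MPSMatrixMultiplication in a few lines of host code. A sketch, assuming square float32 matrices; buffer contents are omitted:

```swift
import Metal
import MetalPerformanceShaders

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let n = 1_024
let rowBytes = n * MemoryLayout<Float>.stride
let desc = MPSMatrixDescriptor(rows: n, columns: n,
                               rowBytes: rowBytes, dataType: .float32)

// Plain Metal buffers wrapped as MPS matrices (fill A and B before dispatch)
let bufA = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!
let bufB = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!
let bufC = device.makeBuffer(length: n * rowBytes, options: .storageModeShared)!
let A = MPSMatrix(buffer: bufA, descriptor: desc)
let B = MPSMatrix(buffer: bufB, descriptor: desc)
let C = MPSMatrix(buffer: bufC, descriptor: desc)

// C = 1.0 * A * B + 0.0 * C, using Apple's hand-tuned kernel
let matmul = MPSMatrixMultiplication(device: device,
                                     transposeLeft: false, transposeRight: false,
                                     resultRows: n, resultColumns: n,
                                     interiorColumns: n, alpha: 1.0, beta: 0.0)

let cb = queue.makeCommandBuffer()!
matmul.encode(commandBuffer: cb, leftMatrix: A, rightMatrix: B, resultMatrix: C)
cb.commit()
cb.waitUntilCompleted()
```

Note there is no compute encoder or pipeline state: MPS kernels encode themselves onto the command buffer.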

PyTorch MPS backend

PyTorch supports Apple GPUs via the MPS backend (since PyTorch 1.12):

import torch

device = torch.device("mps")
model = MyModel().to(device)
x = torch.randn(32, 3, 224, 224).to(device)
output = model(x)  # Runs on Apple GPU

The MPS backend uses MPSGraph under the hood. Most common operations are supported, but coverage isn't complete. Set PYTORCH_ENABLE_MPS_FALLBACK=1 to fall back to CPU for unsupported ops.
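A minimal setup sketch: the fallback flag is read when PyTorch initializes, so it is safest to set it before importing torch. Shapes here are arbitrary:

```python
import os

# PyTorch reads this flag at startup, so set it before importing torch
os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")

import torch

# Prefer the Apple GPU when the MPS backend is available, else fall back to CPU
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(4, 4, device=device)
y = (x @ x).sum()
print(device.type, float(y))
```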

MLX

MLX is Apple's open-source ML framework, designed specifically for Apple Silicon. It provides a NumPy-like API with lazy evaluation, automatic differentiation, and native Metal acceleration:

import mlx.core as mx

a = mx.array([1.0, 2.0, 3.0])
b = mx.array([4.0, 5.0, 6.0])
c = a + b  # Runs on GPU via Metal

MLX is the fastest path for LLM inference on Mac: ~50 tokens/s on quantized Llama 3B with M3 Max, and the MLX community maintains a growing library of optimized models.

Metal 4

Metal 4, announced at WWDC 2025, is the largest API overhaul since Metal's introduction:

  • Native tensor types: MTLTensor is a first-class resource alongside buffers and textures - a multi-dimensional container designed for ML workloads
  • ML command encoders: MTL4MachineLearningCommandEncoder encodes inference directly onto the GPU timeline with graphics and compute work
  • Shader ML: Embed small ML models directly inside vertex, fragment, or compute shaders - inference happens without round-tripping through memory
  • Unified command encoding: A single encoder type handles compute dispatch, memory copies, and acceleration structure operations
  • Frame interpolation: MetalFX can create intermediate frames (comparable to NVIDIA DLSS 3 frame generation)

Metal 4 requires M1 or later on Mac, A14 or later on iOS/iPadOS.

Debugging and Profiling

Xcode Metal Debugger

The Metal debugger in Xcode provides GPU frame capture for inspecting all compute dispatches and resource states. Key features:

  • Shader debugger: Step through compute shaders line-by-line with real GPU data. View execution across thousands of threads simultaneously.
  • Dependency viewer: Graphical view of resource dependencies and synchronization between passes.
  • Shader profiling: Inline per-line cost statistics in your shader source. Edit and reload to compare performance.
  • Performance heat maps: Visualize expensive threads. Select a SIMD-group to see its full execution history.

Metal System Trace (Instruments)

Instruments' Metal System Trace shows CPU and GPU timelines side-by-side:

  • Command buffer submission and completion timing
  • Encoder-level performance breakdown
  • CPU-GPU synchronization stalls (the most common performance problem in Metal compute)

GPU Performance Counters

  • Occupancy: How well the GPU's execution units are utilized
  • Bandwidth: Memory bandwidth use
  • Limiter counters: Which hardware unit is the bottleneck

The general profiling workflow: use Instruments to find which kernel is slow, then use the Xcode shader debugger to find which lines in that kernel are expensive.

Common Pitfalls

Forgetting threadgroup_barrier. If threads in a threadgroup cooperate through shared memory (loading tiles, reduction), you need barriers between the write phase and the read phase. Without barriers, threads read partially-written data. This produces incorrect results that are difficult to debug because the timing varies.

Dispatching more threads than data. Always check bounds in your kernel: if (gid >= count) return;. Metal's dispatchThreads may round up to fill the last threadgroup, creating threads that index past your buffer.

Using .storageModeShared for everything. For intermediate buffers that only the GPU touches, use .storageModePrivate. The GPU can optimize access patterns for private buffers, and you avoid polluting the CPU's cache.

Ignoring command buffer errors. Check commandBuffer.error after waitUntilCompleted(). GPU errors are asynchronous - a kernel might silently produce wrong results from out-of-bounds access:

commandBuffer.commit()
commandBuffer.waitUntilCompleted()
if let error = commandBuffer.error {
    print("GPU error: \(error)")
}

Blocking the main thread. waitUntilCompleted() blocks the calling thread. In an app with a UI, dispatch GPU work on a background queue and use addCompletedHandler instead:

commandBuffer.addCompletedHandler { cb in
    // Read results here, on a background thread
}
commandBuffer.commit()

Over-dispatching small workloads. Metal has non-trivial dispatch overhead (~10-50 microseconds for command buffer creation, encoding, and submission). If your data is small (fewer than ~10,000 elements), the dispatch overhead may exceed the compute time. For small workloads, batch multiple operations into a single command buffer or use the CPU.
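One way to batch, sketched with the vector_add pipeline from Example 1 - three small dispatches share a single command buffer and encoder, so the submission overhead is paid once. Buffer contents are omitted and sizes are illustrative:

```swift
import Metal

let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!
let library = device.makeDefaultLibrary()!
// Reuses the vector_add kernel from Example 1
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "vector_add")!)

let n = 4_096
let size = n * MemoryLayout<Float>.stride
// Three independent (a, b, c) buffer triples to process in one submission
let chunks: [(MTLBuffer, MTLBuffer, MTLBuffer)] = (0..<3).map { _ in
    (device.makeBuffer(length: size, options: .storageModeShared)!,
     device.makeBuffer(length: size, options: .storageModeShared)!,
     device.makeBuffer(length: size, options: .storageModeShared)!)
}

// One command buffer and one encoder carry all three dispatches
let cb = queue.makeCommandBuffer()!
let enc = cb.makeComputeCommandEncoder()!
enc.setComputePipelineState(pipeline)
for (a, b, c) in chunks {
    enc.setBuffer(a, offset: 0, index: 0)
    enc.setBuffer(b, offset: 0, index: 1)
    enc.setBuffer(c, offset: 0, index: 2)
    enc.dispatchThreads(MTLSize(width: n, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: 256, height: 1, depth: 1))
}
enc.endEncoding()
cb.commit()          // one submission instead of three
cb.waitUntilCompleted()
```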

Where to Go From Here

The examples in this guide cover the core Metal compute patterns: element-wise operations, tiled matrix math, SIMD-group reductions, and 2D image processing. From here:

  • Read the official Metal Shading Language Specification - it's the final MSL reference
  • Watch Discover Metal 4 from WWDC 2025 for the latest API features
  • Try MLX if you want ML on Apple Silicon without writing custom kernels
  • Profile before optimizing - use Instruments' Metal System Trace to find actual bottlenecks before writing threadgroup memory code
  • Read the companion CUDA programming guide if you work across both platforms - the optimization concepts (tiling, coalescing, reduction hierarchies) transfer directly
  • For hands-on Metal compute examples, Metal by Example provides excellent walkthroughs

The mental model for Metal compute is the same as CUDA: think about data locality, minimize memory traffic, and use the thread hierarchy to cooperate on shared data. The API is different. The hardware is different. The optimization principles are the same.

About the author: Priya is an AI educator and technical writer whose mission is making artificial intelligence approachable for everyone - not just engineers.