Meta TRIBE v2 Predicts Brain Activity From Any Media

Meta's TRIBE v2 foundation model predicts fMRI brain activity from video, audio, and text, trained on data from more than 720 volunteers and achieving 2-3x gains over prior methods.

Meta released TRIBE v2 today, a foundation model that predicts human fMRI brain activity from video, audio, and text stimuli. The model builds on TRIBE v1, which won first place at the Algonauts 2025 brain encoding competition, but scales from four training subjects to over 720 healthy volunteers and roughly 1,000 hours of fMRI data.

TL;DR

  • TRIBE v2 predicts high-resolution fMRI responses across ~70,000 brain voxels - up from ~1,000 cortical parcels in the original
  • Trained on 720+ volunteers and 1,000+ hours of fMRI recordings spanning video, audio, images, and text
  • Hits 2-3x improvement over previous methods on zero-shot prediction for unseen individuals
  • Released under CC-BY-NC-4.0 with model weights, code, and an interactive demo

What TRIBE v2 Actually Does

The name stands for TRImodal Brain Encoder. Given a piece of media - a movie clip, a podcast segment, a paragraph of text - TRIBE v2 predicts how a human brain would respond, vertex by vertex, across the cortical surface. The output maps onto the fsaverage5 cortical mesh, covering roughly 20,000 vertices at high spatial resolution.

The practical pitch is efficiency. Running an fMRI experiment costs thousands of dollars per session, requires institutional review board approval, and produces noisy data from small samples. A reliable computational model of brain responses lets researchers test hypotheses in silico before committing to human trials. Meta's paper, authored by Stephane d'Ascoli, Jeremy Rapin, Yohann Benchetrit, and colleagues at Meta FAIR, frames TRIBE v2 as a tool for making cognitive neuroscience cheaper and faster.

Image: Functional MRI captures blood-oxygen-level-dependent signals across the cortex - the ground truth that TRIBE v2 learns to predict. (Source: unsplash.com)

The Architecture - Three Models in a Trenchcoat

TRIBE v2's pipeline has three stages: feature extraction, multimodal integration, and cortical mapping.

Feature Extraction

Each modality gets its own pretrained encoder. Video frames pass through V-JEPA2 (facebook/vjepa2-vitg-fpc64-256), Meta's self-supervised video model built on a Vision Transformer with over 1 billion parameters. Audio goes through Wav2Vec-BERT 2.0, Meta's speech and audio encoder from the Seamless project. Text is processed by LLaMA 3.2-3B, the 3-billion-parameter language model from Meta's Llama family.

Integration and Mapping

The extracted features feed into a Transformer-based fusion module called FmriEncoder. This component handles the temporal dynamics of brain responses - the hemodynamic delay between stimulus and measurable fMRI signal, plus the complex interactions between modalities. The fused representations are then projected onto the cortical surface mesh.
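For context on the hemodynamic delay the fusion module must learn: classical linear encoding models handle it explicitly by convolving stimulus features with a canonical hemodynamic response function (HRF), which peaks roughly five seconds after the stimulus. Here is a minimal sketch of that classical baseline - not Meta's implementation, and the parameter values are the conventional double-gamma defaults:

```python
import numpy as np
from math import gamma

def hrf(t, peak_shape=6.0, under_shape=16.0, under_ratio=1 / 6):
    """Canonical double-gamma hemodynamic response function (unit scale, seconds)."""
    g = lambda t, a: t ** (a - 1) * np.exp(-t) / gamma(a)
    return g(t, peak_shape) - under_ratio * g(t, under_shape)

def predict_bold(features, tr=1.5, kernel_secs=30.0):
    """Model the delayed fMRI (BOLD) response to a 1-D stimulus feature
    time series sampled every `tr` seconds, by convolving with the HRF."""
    kernel = hrf(np.arange(0.0, kernel_secs, tr))
    return np.convolve(features, kernel)[: len(features)]
```

An impulse stimulus produces a response that is zero at onset and peaks a few seconds later - the lag that TRIBE v2's transformer has to absorb implicitly rather than through a fixed kernel.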

Two training innovations from the original TRIBE carry over. Modality dropout randomly suppresses one input channel during training, forcing the model to remain robust when a modality is absent. Parcel-specific ensembling weights predictions for each brain region by validation performance, so the model doesn't apply a single uniform strategy across the entire cortex.
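Neither trick is exotic, and both are easy to sketch. The function names, array shapes, and NumPy framing below are illustrative assumptions, not Meta's code:

```python
import numpy as np

def modality_dropout(features, p=0.2, rng=None):
    """Zero out entire modality streams at random during training so the
    fusion module learns to predict even when an input channel is missing.
    features: dict of modality name -> array of shape (batch, time, dim)."""
    rng = rng or np.random.default_rng()
    out = {}
    for name, x in features.items():
        # Drop the whole modality for each sample with probability p.
        keep = (rng.random((x.shape[0], 1, 1)) >= p).astype(x.dtype)
        out[name] = x * keep
    return out

def parcel_ensemble(preds, val_scores):
    """Combine ensemble predictions with per-region weights.
    preds: (models, time, parcels); val_scores: (models, parcels) validation
    scores. Each parcel leans on the models that predicted it best."""
    w = np.clip(val_scores, 0.0, None)
    w = w / w.sum(axis=0, keepdims=True)  # normalize weights over models
    return np.einsum("mp,mtp->tp", w, preds)
```

The per-parcel weighting is the key design choice: a model that excels at auditory cortex but struggles with visual areas still contributes where it is strong.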

Image: The cortical surface contains thousands of functionally distinct regions that respond differently to visual, auditory, and linguistic stimuli. (Source: unsplash.com)

The Numbers - From Four Subjects to 720

The original TRIBE trained on low-resolution fMRI recordings from just four individuals watching about 80 hours of movies as part of the Courtois NeuroMod project. TRIBE v2 draws on a unified dataset of over 1,000 hours of fMRI across 720 subjects. These volunteers were exposed to varied stimuli: movies, podcasts, images, and written text in multiple languages.

| Version | Subjects | fMRI Hours | Brain Parcels/Voxels | Modalities |
|---|---|---|---|---|
| TRIBE v1 | 4 | ~80 | ~1,000 parcels | Video, audio, text |
| TRIBE v2 | 720+ | 1,000+ | ~70,000 voxels | Video, audio, text, images |

The resolution jump matters. Going from 1,000 standardized cortical parcels to roughly 70,000 voxels lets researchers examine fine-grained functional organization rather than broad regional averages.

Meta reports TRIBE v2 reaches "several-fold improvements in accuracy" over traditional linear encoding models. The specific claim from their announcement: roughly a 2-3x improvement over previous methods for both movies and audiobooks, with the largest gains in zero-shot generalization to individuals the model never trained on.

The Algonauts Win

TRIBE v1 placed first at the Algonauts 2025 brain encoding competition, beating more than 260 teams. The challenge required participants to predict moment-by-moment fMRI responses while subjects watched movies. TRIBE's performance held up on out-of-distribution content the model never saw during training, including animation, nature documentaries, and silent black-and-white clips.

That competition result is what separates this from a typical research paper. Algonauts is one of the few standardized benchmarks for computational neuroscience models, and winning it by a significant margin validated the multimodal fusion approach.

Image: MRI facilities require significant investment - TRIBE v2 aims to reduce how many hours of scan time researchers need for early-stage hypothesis testing. (Source: pexels.com)

What It Does Not Tell You

The CC-BY-NC-4.0 license restricts commercial use. This means TRIBE v2 can't be directly integrated into clinical diagnostic tools, brain-computer interface products, or commercial neurotechnology platforms without a separate agreement with Meta. For an academic research tool, that's standard. For anything that touches patients or consumers, it's a hard stop.

The zero-shot claims deserve scrutiny. Predicting brain responses for unseen individuals is impressive, but the model predicts activity for an "average" brain on a standardized mesh. Individual brains vary substantially in both anatomy and functional organization. The gap between predicting average cortical responses and predicting what a specific patient's brain will do remains wide.

There's also the question of what "several-fold improvement" means in absolute terms. Brain encoding models still explain a modest fraction of the variance in fMRI data. A 2-3x improvement over a method that explains 5% of variance gets you to 10-15% - useful for research, but not a solved problem.
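To make that arithmetic concrete (the 5% baseline is a hypothetical figure for illustration, not a number from the paper):

```python
baseline = 0.05  # hypothetical: baseline model explains 5% of variance
for factor in (2, 3):
    print(f"{factor}x improvement -> {baseline * factor:.0%} of variance explained")
```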

Meta's framing emphasizes neuroscience research applications, not brain reading or engagement optimization. The paper describes TRIBE v2 as recovering "a variety of results established by decades of empirical research" when tested against classic visual and neuro-linguistic experiments. That's a validation claim, not a capability claim. The model reproduces known neuroscience findings computationally, which is different from discovering new ones.

Privacy and Dual Use

The training data comes from volunteers who consented to neuroimaging research. The model itself doesn't contain individual brain data - it predicts average responses. But as brain encoding models improve, questions about neural data privacy and the potential for reconstructing mental content from brain signals will intensify. Meta's non-commercial license provides some guardrails, though it doesn't prevent the underlying techniques from being replicated with fewer restrictions.


TRIBE v2 is a strong research tool with clear limitations that Meta is being fairly transparent about. The 720-subject dataset, the Algonauts validation, and the open release of weights and code make it the most accessible brain encoding model available to the neuroscience community. Whether it becomes useful beyond academic benchmarks depends on what independent researchers find when they actually run it on their own data - the code and demo are live today.

About the author

Elena, Senior AI Editor & Investigative Journalist, is a technology journalist with over eight years of experience covering artificial intelligence, machine learning, and the startup ecosystem.