Google Launches Gemini Embedding 2 for Multimodal AI
Google's first natively multimodal embedding model maps text, images, video, audio, and PDFs into a single vector space - now in public preview via Gemini API and Vertex AI.

Google released Gemini Embedding 2, its first natively multimodal embedding model. Available now in public preview as gemini-embedding-2-preview via the Gemini API and Vertex AI, the model maps text, images, video, audio, and PDF documents into a single unified vector space - enabling cross-modal search, classification, and clustering without separate pipelines per modality.
TL;DR
- First multimodal embedding model from Google - handles text, images (up to 6), video (128s), audio (80s), and PDFs (6 pages) in a single request
- Maps all modalities into one vector space: search across media types with a single query
- Matryoshka vectors: choose from 128 to 3,072 dimensions to trade off quality vs. storage/compute
- Supports 100+ languages; top MTEB Multilingual leaderboard ranking
- Pricing starts at $0.20/M text tokens; batch API at 50% off; embedding spaces are incompatible with the text-only gemini-embedding-001
What It Does
Google's previous embedding models worked with text only. If you wanted to search across images, audio, and documents, you needed separate models, separate indexes, and glue code to merge results. Gemini Embedding 2 eliminates that by processing all five modalities natively and projecting them into a shared vector space.
A single API call can take interleaved inputs - a paragraph of text alongside three images and an audio clip - and return one embedding that captures the cross-modal relationships. This is the difference between bolting modalities together after the fact and understanding them together from the start.
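As a rough sketch, an interleaved request might be assembled like this, assuming the preview model keeps the `parts`/`inline_data` request shape of the existing Gemini `embedContent` endpoint (the field names, MIME types, and payload layout here are assumptions based on the current API, not a confirmed spec):

```python
import base64
import json

MODEL = "gemini-embedding-2-preview"  # preview model name from the announcement

def build_embed_request(text, image_bytes=None, audio_bytes=None):
    """Assemble an interleaved multimodal embed request body.

    Mirrors the existing Gemini API embedContent shape: text parts plus
    base64 `inline_data` parts. The multimodal extension is an assumption,
    not a documented contract.
    """
    parts = [{"text": text}]
    if image_bytes is not None:
        parts.append({
            "inline_data": {
                "mime_type": "image/png",
                "data": base64.b64encode(image_bytes).decode("ascii"),
            }
        })
    if audio_bytes is not None:
        parts.append({
            "inline_data": {
                "mime_type": "audio/mp3",
                "data": base64.b64encode(audio_bytes).decode("ascii"),
            }
        })
    return {"model": MODEL, "content": {"parts": parts}}

# Stand-in bytes in place of real media files.
request = build_embed_request(
    "Quarterly review notes",
    image_bytes=b"\x89PNG...",
    audio_bytes=b"ID3...",
)
print(json.dumps(request)[:80])
```

The response would carry one embedding for the whole interleaved content, rather than one vector per part.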
Technical Specs
| Parameter | Limit |
|---|---|
| Text input | 8,192 tokens |
| Images per request | 6 (PNG, JPEG) |
| Video per request | 128 seconds (MP4, MOV) |
| Audio per request | 80 seconds (MP3, WAV) |
| PDF per request | 6 pages |
| Output dimensions | 128 - 3,072 (default 3,072) |
| Languages | 100+ |
The model uses Matryoshka Representation Learning, which means you can truncate embeddings to smaller dimensions (128, 256, 512, 768, 1,536, or 2,048) without retraining. Smaller vectors trade some quality for significantly cheaper storage and faster similarity search - useful when you're indexing millions of items and the full 3,072-dimensional vector is overkill.
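Truncation is a client-side operation: slice the leading coordinates and re-normalize so cosine similarity stays a plain dot product. A minimal sketch with a stand-in vector (a real one would come from the API):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Truncate a Matryoshka embedding to `dim` dimensions and re-normalize.

    MRL trains the model so the leading coordinates carry the most
    information; after slicing, re-normalizing restores unit length so
    cosine similarity is again just a dot product.
    """
    truncated = np.asarray(vec, dtype=np.float64)[:dim]
    return truncated / np.linalg.norm(truncated)

# Stand-in for a full 3,072-dim embedding returned by the API.
rng = np.random.default_rng(0)
full = rng.normal(size=3072)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)
print(small.shape)  # (256,)
```

At 256 dimensions you store one twelfth of the floats per item, which is the usual trade when indexing millions of entries.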
Supported Tasks
The model handles eight explicit task types: semantic similarity, classification, clustering, retrieval (both document and query sides), code retrieval, question answering, and fact verification. Setting the task type at embedding time optimizes the vector for that specific use case.
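In code, the task type is a request parameter. The enum names below follow the existing Gemini embeddings API (e.g. `RETRIEVAL_QUERY`); whether the preview model keeps exactly these identifiers is an assumption:

```python
# Task type identifiers as used by the existing Gemini embeddings API;
# it is assumed here that gemini-embedding-2-preview keeps the same set.
TASK_TYPES = {
    "SEMANTIC_SIMILARITY", "CLASSIFICATION", "CLUSTERING",
    "RETRIEVAL_DOCUMENT", "RETRIEVAL_QUERY", "CODE_RETRIEVAL_QUERY",
    "QUESTION_ANSWERING", "FACT_VERIFICATION",
}

def embed_config(task_type, output_dimensionality=3072):
    """Build the config portion of an embed request for a given task.

    Retrieval is asymmetric: index documents with RETRIEVAL_DOCUMENT and
    embed user queries with RETRIEVAL_QUERY so each side is optimized to
    match the other.
    """
    if task_type not in TASK_TYPES:
        raise ValueError(f"unknown task type: {task_type}")
    return {"task_type": task_type, "output_dimensionality": output_dimensionality}

doc_cfg = embed_config("RETRIEVAL_DOCUMENT")
query_cfg = embed_config("RETRIEVAL_QUERY", output_dimensionality=768)
print(doc_cfg, query_cfg)
```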
Pricing
Text embeddings are priced at $0.20 per million tokens; the batch API halves that to $0.10 per million for workloads that don't need real-time responses. Image, audio, and video pricing follows the standard Gemini API media token rates.
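The text-side cost math is simple enough to sketch directly from the quoted rates (media token pricing is not modeled here):

```python
def text_embedding_cost(tokens, batch=False):
    """Cost in USD for text embeddings at $0.20 per million tokens,
    halved via the batch API (rates as quoted in the announcement)."""
    rate_per_million = 0.20 * (0.5 if batch else 1.0)
    return tokens / 1_000_000 * rate_per_million

# Embedding a 50M-token corpus:
print(text_embedding_cost(50_000_000))              # → 10.0 (real-time)
print(text_embedding_cost(50_000_000, batch=True))  # → 5.0 (batch)
```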
Breaking Change
The embedding spaces between gemini-embedding-001 (text-only, GA) and gemini-embedding-2-preview are incompatible. If you're upgrading, you need to re-embed your entire dataset. There's no migration path that preserves existing vectors - the model architectures produce fundamentally different representations.
This is expected for a generational jump, but worth flagging for anyone running production search systems on embedding-001.
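A migration therefore means rebuilding the index from the raw documents and cutting search traffic over once the new index is complete. A sketch of that loop, with a placeholder in place of the real API call (names like `reembed_corpus` and `embed_v2` are illustrative, not from any SDK):

```python
from typing import Callable, Dict, List

def reembed_corpus(
    docs: Dict[str, str],
    embed_v2: Callable[[str], List[float]],
    batch_size: int = 100,
) -> Dict[str, List[float]]:
    """Rebuild an index for the new model from the raw documents.

    Old gemini-embedding-001 vectors cannot be converted, so every document
    is re-run through the new model -- ideally via the batch API for the
    50% discount -- into a fresh index alongside the old one.
    """
    new_index = {}
    ids = list(docs)
    for start in range(0, len(ids), batch_size):
        for doc_id in ids[start:start + batch_size]:
            # Placeholder for the real gemini-embedding-2-preview call.
            new_index[doc_id] = embed_v2(docs[doc_id])
    return new_index

# Usage with a stand-in embedder:
fake_embed = lambda text: [float(len(text)), 0.0]
index = reembed_corpus({"a": "hello", "b": "world!"}, fake_embed)
print(index)
```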
Why It Matters
The practical value is in what you can build without stitching together multiple models. A few examples that become straightforward:
- Audio knowledge bases without transcription - embed meeting recordings directly and search them with text queries
- Multimodal RAG - retrieve relevant images, documents, and video clips alongside text passages for grounded generation
- Cross-modal recommendations - "find videos similar to this article" or "find articles that match this image"
The audio embedding without intermediate transcription is particularly notable. Transcription adds latency, loses paralinguistic information (tone, emphasis, speaker identity), and fails on non-speech audio. Direct audio embedding sidesteps all of that.
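Because everything shares one space, cross-modal retrieval reduces to ordinary nearest-neighbor search over a single index. A minimal sketch with random stand-in vectors (real ones would come from the API, with one row per article, image, or audio clip):

```python
import numpy as np

def top_k(query_vec, item_matrix, k=3):
    """Cosine top-k over a unified index of mixed-modality embeddings.

    Rows of `item_matrix` are unit-norm embeddings; because all modalities
    share one vector space, a single text query ranks them all together.
    """
    scores = item_matrix @ query_vec  # cosine similarity for unit vectors
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Stand-in unit vectors in place of real API embeddings.
rng = np.random.default_rng(1)
items = rng.normal(size=(5, 768))
items /= np.linalg.norm(items, axis=1, keepdims=True)

# A query vector close to item 2, as if it described the same content.
query = items[2] + 0.01 * rng.normal(size=768)
query /= np.linalg.norm(query)

idx, scores = top_k(query, items)
print(idx[0])  # item 2 ranks first
```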
Multimodal embeddings have been available from smaller players (Nomic, Jina, CLIP derivatives), but Google bringing it to the Gemini API with a single model covering five modalities and 100+ languages raises the baseline for what developers can expect from a commodity embedding endpoint. The 8,192 token context window and flexible Matryoshka dimensions are competitive, and the pricing is aggressive enough that cost won't be the bottleneck.
Sources: Gemini Embedding 2: our first natively multimodal embedding model - Google Blog | Embeddings documentation - Gemini API | Release notes - Gemini API
