Introducing vdr-2b-multi-v1: A Multilingual Embedding Model for Visual Document Retrieval
- Nikhil Upadhyay
- Jan 15
- 3 min read
Updated: Jun 2
LlamaIndex has unveiled vdr-2b-multi-v1, a cutting-edge multilingual embedding model tailored for visual document retrieval across diverse languages and domains, available on Hugging Face. This model encodes document page screenshots into dense single-vector representations, enabling efficient search and querying of visually rich multilingual documents without relying on OCR or data-extraction pipelines.

Key Features:
Multilingual Support: Trained on a comprehensive dataset encompassing Italian, Spanish, English, French, and German, vdr-2b-multi-v1 facilitates seamless cross-lingual retrieval. This capability allows users to search documents in one language using queries in another, significantly enhancing accessibility in multilingual environments.
Matryoshka Representation Learning: Employing this innovative technique, the model can reduce vector sizes by up to three times while preserving 98% of embedding quality. This optimization accelerates retrieval processes and reduces storage requirements without compromising performance.
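To make the idea concrete, here is a minimal sketch of how Matryoshka-style truncation works: keep only the leading components of an embedding and re-normalize. The dimensions (1536 full, 512 truncated) and the random vector are illustrative assumptions, not values from the model card.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize (Matryoshka-style)."""
    truncated = vec[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Illustrative example: shrink a 1536-d embedding to 512 dims (3x smaller)
full = np.random.default_rng(0).normal(size=1536)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 512)
print(small.shape)  # (512,)
```

Because the truncated vector is re-normalized, it can be compared with other truncated vectors via the same cosine similarity used for full-size embeddings.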
Efficient Resource Utilization: Designed with efficiency in mind, vdr-2b-multi-v1 utilizes bf16 tensors and requires approximately 4.4 GB of VRAM when loaded. It supports inference with 768 image patches and a batch size of 16, making it operable even on cost-effective NVIDIA T4 GPUs.
| Feature | Details |
| --- | --- |
| Multilingual Capability | Supports multiple languages including English, French, German, Spanish, and Italian. |
| Matryoshka Representation | Efficient vector size reduction while maintaining embedding quality (up to 98%). |
| GPU Resource Efficiency | Operable on NVIDIA T4 GPUs with 4.4 GB VRAM using bf16 tensors and moderate batch sizes. |
| Cross-Lingual Retrieval | Enables document search in one language using queries in another. |
| Dataset | Trained on 500,000 multilingual query-image pairs generated from publicly available internet PDFs. |
Training Dataset:
The model's robustness stems from a meticulously curated multilingual query-image dataset comprising 500,000 high-quality samples. This dataset was constructed by collecting and generating multilingual query-image pairs from publicly available internet PDFs. Synthetic queries were generated using advanced vision-language models, ensuring diversity and relevance across various topics and domains.
Applications:
vdr-2b-multi-v1 is particularly beneficial in scenarios requiring efficient retrieval of information from visually rich documents across multiple languages. Its ability to perform cross-lingual searches without the need for OCR or extensive data extraction pipelines makes it a valuable tool for researchers, multilingual content managers, and organizations operating in diverse linguistic environments.
How to Use vdr-2b-multi-v1
Installing Required Libraries
```bash
pip install -U llama-index-embeddings-huggingface
```
Code Example to Encode Documents and Queries
Here is a Python example to generate image and query embeddings:
```python
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Initialize the model
model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cuda",  # Use "cpu", "cuda" (NVIDIA GPUs), or "mps" (Apple Silicon)
    trust_remote_code=True,
)

# Encode an image into an embedding
image_path = "path_to_your_image.png"
image_embedding = model.get_image_embedding(image_path)

# Encode a text query
query_text = "Retrieve relevant documents about climate change"
query_embedding = model.get_query_embedding(query_text)

print("Image Embedding:", image_embedding)
print("Query Embedding:", query_embedding)
```
Optimal Resource Requirements
| Resource | Details |
| --- | --- |
| GPU | NVIDIA T4 or higher recommended |
| VRAM Required | Approx. 4.4 GB |
| Batch Size | Up to 16 for 768 image patches |
This model is designed for efficient resource usage, ensuring cost-effectiveness even for smaller-scale implementations.
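Once page screenshots are embedded, retrieval reduces to a nearest-neighbor search over the stored vectors. Below is a minimal NumPy sketch of that step using randomly generated stand-in embeddings (the corpus size and dimension are hypothetical); a production deployment would typically use a vector index instead of a brute-force dot product.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical corpus: 1,000 page embeddings of dimension 512, L2-normalized
corpus = rng.normal(size=(1000, 512))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# Hypothetical query embedding, also L2-normalized
query = rng.normal(size=512)
query /= np.linalg.norm(query)

# For normalized vectors, cosine similarity is just a dot product
scores = corpus @ query

# Indices of the top-5 most similar pages, best first
top_k = np.argsort(scores)[::-1][:5]
print(top_k, scores[top_k])
```

In practice the `corpus` rows would be outputs of `model.get_image_embedding` and `query` the output of `model.get_query_embedding` from the example above.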
Dataset Details
vdr-2b-multi-v1 is trained on a unique dataset that includes:
500,000 Query-Image Pairs: Generated using advanced vision-language models.
Multilingual Coverage: Includes content in English, Spanish, French, Italian, and German.
Synthetic Query Generation: Uses diverse synthetic queries to enhance retrieval accuracy.
Performance Metrics
| Metric | Value |
| --- | --- |
| Embedding Quality | 98% (via Matryoshka) |
| Languages Supported | 5 |
| Training Data Volume | 500,000 samples |
Advanced Features
Matryoshka Representation Learning
Reduces vector sizes by up to 3x.
Preserves nearly all embedding quality, enabling faster and resource-efficient searches.
Cross-Lingual Search
Search for documents in Spanish using an English query or vice versa. This capability simplifies multilingual information retrieval.
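Because queries and pages share a single embedding space, cross-lingual matching is just a similarity comparison between vectors. The sketch below uses tiny hand-made 3-d vectors as stand-ins for real embeddings, purely to illustrate the scoring step.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings: an English query vs. two Spanish document pages
query_en = np.array([0.9, 0.1, 0.3])
doc_es_relevant = np.array([0.8, 0.2, 0.4])    # points in a similar direction
doc_es_unrelated = np.array([-0.5, 0.9, -0.1])  # points elsewhere

scores = {
    "relevant": cosine_similarity(query_en, doc_es_relevant),
    "unrelated": cosine_similarity(query_en, doc_es_unrelated),
}
best = max(scores, key=scores.get)
print(best)  # relevant
```

The document whose embedding direction best matches the query wins, regardless of which language either was written in.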
Conclusion:
vdr-2b-multi-v1 represents a significant advancement in the field of visual document retrieval, offering robust multilingual support, efficient resource utilization, and the ability to perform cross-lingual searches without the need for traditional OCR methods. Its development underscores the importance of accessible and efficient information retrieval in our increasingly interconnected and multilingual world.