
Introducing vdr-2b-multi-v1: A Multilingual Embedding Model for Visual Document Retrieval

Updated: Jun 2

LlamaIndex has released vdr-2b-multi-v1 on the Hugging Face Hub, a cutting-edge multilingual embedding model tailored for visual document retrieval across diverse languages and domains. The model encodes document page screenshots into dense single-vector representations, enabling efficient search over visually rich multilingual documents without relying on OCR or data extraction pipelines.



Key Features:

  • Multilingual Support: Trained on a comprehensive dataset encompassing Italian, Spanish, English, French, and German, vdr-2b-multi-v1 facilitates seamless cross-lingual retrieval. This capability allows users to search documents in one language using queries in another, significantly enhancing accessibility in multilingual environments.

  • Matryoshka Representation Learning: Employing this innovative technique, the model can reduce vector sizes by up to three times while preserving 98% of embedding quality. This optimization accelerates retrieval processes and reduces storage requirements without compromising performance.

  • Efficient Resource Utilization: Designed with efficiency in mind, vdr-2b-multi-v1 utilizes bf16 tensors and requires approximately 4.4 GB of VRAM when loaded. It supports inference with 768 image patches and a batch size of 16, making it operable even on cost-effective NVIDIA T4 GPUs.

Feature Overview:

  • Multilingual Capability: Supports multiple languages including English, French, German, Spanish, and Italian.

  • Matryoshka Representation: Efficient vector size reduction while retaining up to 98% of embedding quality.

  • GPU Resource Efficiency: Operable on NVIDIA T4 GPUs with 4.4 GB VRAM using bf16 tensors and moderate batch sizes.

  • Cross-Lingual Retrieval: Enables document search in one language using queries in another.

  • Dataset: Trained on 500,000 multilingual query-image pairs generated from publicly available internet PDFs.


Training Dataset:

The model's robustness stems from a meticulously curated multilingual query-image dataset comprising 500,000 high-quality samples. This dataset was constructed by collecting and generating multilingual query-image pairs from publicly available internet PDFs. Synthetic queries were generated using advanced vision-language models, ensuring diversity and relevance across various topics and domains.



Applications:

vdr-2b-multi-v1 is particularly beneficial in scenarios requiring efficient retrieval of information from visually rich documents across multiple languages. Its ability to perform cross-lingual searches without the need for OCR or extensive data extraction pipelines makes it a valuable tool for researchers, multilingual content managers, and organizations operating in diverse linguistic environments.


How to Use vdr-2b-multi-v1

Installing Required Libraries

bash

pip install -U llama-index-embeddings-huggingface

Code Example to Encode Documents and Queries

Here is a Python example to generate image and query embeddings:

python

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Initialize the model
model = HuggingFaceEmbedding(
    model_name="llamaindex/vdr-2b-multi-v1",
    device="cuda",  # Use "cpu", "cuda" (NVIDIA GPUs), or "mps" (Mac M1/M2)
    trust_remote_code=True,
)

# Encode an image into an embedding
image_path = "path_to_your_image.png"
image_embedding = model.get_image_embedding(image_path)

# Encode a text query
query_text = "Retrieve relevant documents about climate change"
query_embedding = model.get_query_embedding(query_text)

print("Image Embedding:", image_embedding)
print("Query Embedding:", query_embedding)

Optimal Resource Requirements

  • GPU: NVIDIA T4 or higher recommended

  • VRAM Required: Approx. 4.4 GB

  • Batch Size: Up to 16 for 768 image patches

This model is designed for efficient resource usage, ensuring cost-effectiveness even for smaller-scale implementations.
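
As a quick sanity check on the VRAM figure, PyTorch can report how much GPU memory is allocated once the model has been loaded with device="cuda" as in the example above. This is a rough illustration rather than a precise benchmark.

python

import torch

# Report GPU memory currently allocated by PyTorch tensors (model weights
# included); expect a figure in the region of the ~4.4 GB quoted above.
if torch.cuda.is_available():
    allocated_gb = torch.cuda.memory_allocated() / 1024 ** 3
    print(f"Allocated VRAM: {allocated_gb:.1f} GB")
else:
    print("No CUDA device available")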


Dataset Details

vdr-2b-multi-v1 is trained on a unique dataset that includes:

  1. 500,000 Query-Image Pairs: Generated using advanced vision-language models.

  2. Multilingual Coverage: Includes content in English, Spanish, French, Italian, and German.

  3. Synthetic Query Generation: Uses diverse synthetic queries to enhance retrieval accuracy.


Performance Metrics

  • Embedding Quality: 98% of full-vector quality retained with Matryoshka reduction

  • Languages Supported: 5 (English, French, German, Spanish, Italian)

  • Training Data Volume: 500,000 samples

Advanced Features

Matryoshka Representation Learning

  • Reduces vector sizes by up to 3x.

  • Preserves nearly all embedding quality, enabling faster and resource-efficient searches.
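
Because Matryoshka-trained embeddings pack the most informative components into the leading dimensions, shrinking a vector amounts to truncating it and re-normalizing. The sketch below illustrates the idea on the embeddings produced in the usage example; keeping roughly one third of the dimensions is only an assumption chosen to mirror the 3x figure, not a value mandated by the model.

python

import numpy as np

def truncate_embedding(embedding, dim):
    """Keep the first `dim` Matryoshka dimensions and re-normalize so that
    cosine/dot-product similarity remains meaningful."""
    vec = np.asarray(embedding, dtype=np.float32)[:dim]
    return vec / np.linalg.norm(vec)

# Shrink the query and page vectors from the usage example by roughly 3x.
target_dim = len(query_embedding) // 3
small_query = truncate_embedding(query_embedding, target_dim)
small_image = truncate_embedding(image_embedding, target_dim)

print(f"Similarity at {target_dim} dims:", float(np.dot(small_query, small_image)))

Storing only the truncated vectors is what yields the smaller index and faster search mentioned above.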

Cross-Lingual Search

Search for documents in Spanish using an English query or vice versa. This capability simplifies multilingual information retrieval.
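
In practice this is the same embed-and-score loop as before: the page screenshots are embedded once, and queries in any supported language are scored against them. Below is a minimal sketch reusing the model from the usage example; the Spanish page file names and the English query are illustrative placeholders.

python

import numpy as np

def normalize(vec):
    v = np.asarray(vec, dtype=np.float32)
    return v / np.linalg.norm(v)

# Illustrative placeholders: screenshots of Spanish-language report pages.
spanish_pages = ["informe_pagina_1.png", "informe_pagina_2.png", "informe_pagina_3.png"]
page_vectors = [normalize(model.get_image_embedding(path)) for path in spanish_pages]

# English query scored against the Spanish pages.
query_vec = normalize(model.get_query_embedding("quarterly revenue by region"))
scores = [float(np.dot(query_vec, pv)) for pv in page_vectors]

# Rank pages by similarity, best match first.
for path, score in sorted(zip(spanish_pages, scores), key=lambda item: -item[1]):
    print(f"{score:.3f}  {path}")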


Conclusion:

vdr-2b-multi-v1 represents a significant advancement in the field of visual document retrieval, offering robust multilingual support, efficient resource utilization, and the ability to perform cross-lingual searches without the need for traditional OCR methods. Its development underscores the importance of accessible and efficient information retrieval in our increasingly interconnected and multilingual world.

 
 
 
