MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval

This repository contains the MoLoRAG model, a logic-aware retrieval framework for multi-modal, multi-page document understanding, as presented in the paper MoLoRAG: Bootstrapping Document Understanding via Multi-modal Logic-aware Retrieval.

MoLoRAG introduces a novel approach to Document Question Answering (DocQA) by constructing a page graph to capture contextual and logical relationships between pages. A lightweight VLM performs graph traversal to retrieve relevant pages, combining both semantic and logical relevance for more accurate retrieval. The top-K retrieved pages are then fed into arbitrary Large Vision-Language Models (LVLMs) for question answering. The framework offers both a training-free solution for easy deployment and a fine-tuned version for enhanced logical relevance checking.
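The retrieval step above can be sketched as follows. This is an illustrative toy, not the official MoLoRAG implementation: the page graph, embeddings, and the `score_fn` stub (standing in for the lightweight VLM's logical-relevance check) are all assumptions made for the example.

```python
# Illustrative sketch of logic-aware page retrieval (NOT the official MoLoRAG code).
# Pages are nodes in a graph; retrieval seeds on semantic similarity, then
# traverses graph edges so logically connected pages are scored as well.
# `score_fn` is a stub standing in for the lightweight VLM's relevance check.

def retrieve_pages(query_emb, page_embs, graph, score_fn, k=3, seeds=2):
    """Return the top-k page ids, combining semantic and logical relevance."""
    # 1) Seed with the pages most semantically similar to the query (dot product).
    sim = {p: sum(a * b for a, b in zip(query_emb, e)) for p, e in page_embs.items()}
    frontier = sorted(sim, key=sim.get, reverse=True)[:seeds]
    # 2) Traverse the page graph from the seeds, scoring every visited page.
    visited, scored = set(frontier), {}
    while frontier:
        page = frontier.pop()
        scored[page] = score_fn(page) + sim[page]  # logical + semantic relevance
        for nbr in graph[page]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append(nbr)
    # 3) Keep the top-k pages to feed into a downstream LVLM for answering.
    return sorted(scored, key=scored.get, reverse=True)[:k]
```

In this sketch, a page that is semantically distant from the query can still be retrieved if the graph links it to a relevant page and the VLM-style scorer judges it logically relevant, which is the intuition behind combining the two signals.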

For more details, please refer to the official GitHub repository.

Model size: 4B params · Tensor type: BF16 · Format: Safetensors

Model ID: xxwu/MoLoRAG-QwenVL-3B