arXiv:2602.22897

OmniGAIA: Towards Native Omni-Modal AI Agents

Published on Feb 26 · Submitted by Xiaoxi Li on Feb 27
#3 Paper of the day

Abstract

AI-generated summary: The OmniGAIA benchmark evaluates multi-modal agents on complex reasoning tasks across video, audio, and image modalities, while the OmniAtlas agent improves tool-use capabilities through hindsight-guided tree exploration and OmniDPO fine-tuning.

Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and refined with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
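
For readers unfamiliar with the preference-optimization component: the abstract describes OmniDPO as a fine-grained error-correction stage applied to synthesized trajectories. As a rough point of reference, here is a minimal sketch of the standard DPO preference loss over (chosen, rejected) trajectory pairs in PyTorch; all names are placeholders, and the paper's actual OmniDPO objective (e.g., any step-level weighting it adds) may differ.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over (chosen, rejected) trajectory pairs.

    Each argument is a 1-D tensor of summed token log-probs for a batch
    of trajectories; `beta` scales the implicit KL penalty against the
    frozen reference model. Generic DPO, not the paper's exact OmniDPO.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    return -F.logsigmoid(logits).mean()
```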

Community

💡 Overview

OmniGAIA is a comprehensive benchmark designed to evaluate the capabilities of omni-modal general AI assistants. Unlike existing benchmarks that focus on a single modality, OmniGAIA requires agents to jointly reason over video, audio, and image inputs while leveraging external tools such as web search and code execution.
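
To make the task format concrete, here is a hypothetical sketch of what an OmniGAIA-style query could look like; every field name and value below is invented for illustration and is not drawn from the released dataset.

```python
# Hypothetical OmniGAIA-style task instance (illustrative only; field
# names and content are invented, not taken from the released benchmark).
example_task = {
    "query": (
        "The speaker in the attached clip names a landmark while a map "
        "is shown on screen. In which year was that landmark completed?"
    ),
    "inputs": {"video": "clip_0142.mp4"},  # audio track embedded in the video
    "expected_reasoning": [
        "transcribe the audio to recover the landmark's name",
        "cross-check the name against the map frame (visual grounding)",
        "call web search to retrieve the completion year",
    ],
    "tools": ["transcribe_audio", "view_frame", "web_search", "code_execute"],
}
```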

We also introduce OmniAtlas, an agentic reasoning system that extends a base LLM with active perception tools, enabling the model to request and examine additional media segments during multi-step reasoning.
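
A minimal sketch of what such an active-perception, tool-integrated reasoning loop could look like is shown below; the `llm.generate` interface and the tool names in the comments are assumptions for illustration, not OmniAtlas's actual API.

```python
def run_agent(llm, task, tools, max_turns=8):
    """Generic tool-integrated reasoning loop (hypothetical interface).

    The model alternates between free-form reasoning and structured tool
    calls; each tool result is appended to the context before the next
    turn, so the agent can actively request extra media segments.
    """
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        step = llm.generate(context)           # reasoning + optional tool call
        context.append({"role": "assistant", "content": step.text})
        if step.tool_call is None:             # no tool call => final answer
            return step.text
        handler = tools[step.tool_call.name]   # e.g. "view_frame", "transcribe_audio"
        context.append({"role": "tool", "content": handler(**step.tool_call.args)})
    return None  # turn budget exhausted without a final answer
```

The key design point is that perception is a tool call like any other: the model decides when to look or listen again, rather than receiving all frames and audio up front.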

🎬 Demo

1. Agentic Reasoning on "Image + Audio" Scenario

2. Agentic Reasoning on "Video w/ Audio" Scenario

