Enhancing Spatial Understanding in Image Generation via Reward Modeling
Abstract
A new reward model called SpatialScore is introduced to improve spatial relationship understanding in text-to-image generation through reinforcement learning with a large-scale dataset of preference pairs.
Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also imposed higher demands on prompt complexity, particularly in encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we train SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation, achieving performance that even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
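The abstract describes training a reward model (SpatialScore) on preference pairs. The paper's architecture and loss are not given here, so as a hedged illustration, the sketch below shows the standard Bradley-Terry preference objective commonly used for reward models trained on chosen/rejected pairs: the model is pushed to score the preferred image above the rejected one. The linear reward over toy feature vectors is a placeholder assumption, standing in for whatever vision backbone the paper actually uses.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, feats):
    # Toy linear reward over feature vectors; a placeholder for a
    # learned vision backbone scoring spatial-relationship accuracy.
    return sum(wi * fi for wi, fi in zip(w, feats))

def preference_loss(w, pairs):
    # Bradley-Terry objective: -log sigmoid(r(chosen) - r(rejected)),
    # averaged over all preference pairs.
    total = 0.0
    for chosen, rejected in pairs:
        total += -math.log(sigmoid(reward(w, chosen) - reward(w, rejected)))
    return total / len(pairs)

def train(pairs, dim, lr=0.5, steps=200):
    # Plain gradient descent on the preference loss.
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            p = sigmoid(reward(w, chosen) - reward(w, rejected))
            for i in range(dim):
                # d/dw_i of -log sigmoid(w . (chosen - rejected))
                grad[i] += -(1.0 - p) * (chosen[i] - rejected[i])
        w = [wi - lr * g / len(pairs) for wi, g in zip(w, grad)]
    return w

random.seed(0)
# Synthetic preference pairs: "chosen" items carry a consistently
# higher value in feature 0, mimicking a spatially correct image.
pairs = [([random.random() + 1.0, random.random()],
          [random.random(), random.random()]) for _ in range(50)]
w = train(pairs, dim=2)
```

After training, the learned weight on the discriminative feature is positive and the preference loss drops well below its untrained value of log 2, i.e. the model reliably ranks chosen images above rejected ones. A reward model trained this way can then supply the scalar signal for the online RL stage the abstract mentions.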
Community
Accepted at CVPR 2026.