Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes
Abstract
Talk2Move presents a reinforcement learning-based diffusion framework that enables precise, semantically faithful spatial transformations of objects in scenes using natural language instructions.
We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations-such as translating, rotating, or resizing objects-due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization (2025)
- LoVoRA: Text-guided and Mask-free Video Object Removal and Addition with Learnable Object-aware Localization (2025)
- CogniEdit: Dense Gradient Flow Optimization for Fine-Grained Image Editing (2025)
- PaCo-RL: Advancing Reinforcement Learning for Consistent Image Generation with Pairwise Reward Modeling (2025)
- ConsistCompose: Unified Multimodal Layout Control for Image Composition (2025)
- Loom: Diffusion-Transformer for Interleaved Generation (2025)
- What Happens Next? Next Scene Prediction with a Unified Video Model (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper