
Although unified multimodal generative models such as Qwen-Edit have substantially improved editing quality, their underlying reasoning remains underexplored, especially for reasoning-centric editing. In contrast, our method delivers accurate edits with deep reasoning, achieving strong consistency and high perceptual quality across diverse reasoning-driven editing scenarios.

Abstract

Instruction-driven image editing with unified multimodal generative models has advanced rapidly, yet their underlying visual reasoning remains limited, leading to suboptimal performance on reasoning-centric edits. Reinforcement learning (RL) has been investigated for improving image editing quality, but it faces three key challenges: (1) reasoning exploration confined to denoising stochasticity, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. In this work, we propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis and expands reasoning exploration beyond denoising. To this end, we introduce Chain-of-Thought (CoT)-based reasoning sampling with planning and reflection stages prior to generation during online sampling, compelling the model to explore multiple semantic hypotheses and validate their plausibility before committing to a visual outcome. To avoid the failures of weighted aggregation, we propose an unbiased chain preference grouping strategy across multiple reward dimensions. Moreover, we replace interval-based VLM scores with a binary checklist, yielding more precise, lower-variance, and interpretable rewards for complex reasoning. Experiments show that our method significantly outperforms prior work on reasoning-centric image editing, producing instruction-faithful, visually coherent, and semantically grounded edits.

Motivation


Prior RL methods for visual generation focus on exploration within the stochastic space of generation, improving synthesis quality but offering limited reasoning capability. To address this, we decouple and separately optimize the understanding and generation modules, preserving high-fidelity synthesis while enabling exploration of optimal trajectories in the reasoning space. In addition, we introduce CoT-based sampling and optimization to further expand stochastic exploration over reasoning pathways.

Overall Framework of ThinkRL-Edit


During sampling, we perform Chain-of-Thought reasoning with explicit planning and reflection to enlarge stochasticity in the reasoning space. For rewards, a fine-grained, sample-specific checklist guides the VLM to produce accurate and stable reasoning scores. In grouping, we construct an unbiased preference chain across candidates to select training samples and compute advantages $A$. Finally, policy updates apply a unified editing reward while decoupling updates to the reasoning, understanding, and generation modules, enhancing reasoning capability without sacrificing synthesis quality.
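To make the reward and grouping steps more concrete, below is a minimal Python sketch of how a binary checklist reward and a preference ordering over multiple reward dimensions could be turned into group-relative advantages. All names (Candidate, checklist_reward, preference_chain, group_advantages) and the Pareto-dominance ranking are illustrative assumptions for this page, not the released ThinkRL-Edit implementation.

# Minimal sketch with assumed names; not the official ThinkRL-Edit code.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """One online-sampled edit with its per-dimension rewards."""
    image_id: int
    rewards: Dict[str, float]  # e.g. {"instruction": 0.8, "consistency": 0.9, "quality": 0.7}


def checklist_reward(answers: List[bool]) -> float:
    """Binary checklist: the VLM answers yes/no per item; the reward is the
    fraction of satisfied items, which is lower-variance than interval scores."""
    return sum(answers) / max(len(answers), 1)


def preference_chain(candidates: List[Candidate], dims: List[str]) -> List[Candidate]:
    """Order candidates by Pareto dominance across reward dimensions instead of
    collapsing them with a weighted sum (an assumed reading of the unbiased
    chain preference grouping described above)."""
    def dominates(a: Candidate, b: Candidate) -> bool:
        no_worse = all(a.rewards[d] >= b.rewards[d] for d in dims)
        better = any(a.rewards[d] > b.rewards[d] for d in dims)
        return no_worse and better

    return sorted(
        candidates,
        key=lambda c: sum(dominates(c, o) for o in candidates if o is not c),
        reverse=True,
    )


def group_advantages(ranked: List[Candidate]) -> List[float]:
    """Group-relative advantages A: normalize rank scores by the group mean and
    standard deviation (a common group-relative normalization)."""
    scores = [float(len(ranked) - i) for i in range(len(ranked))]
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5 or 1.0
    return [(s - mean) / std for s in scores]

In this sketch, ranking by dominance count avoids fixing weights across reward dimensions, which mirrors the motivation for avoiding biased weighted aggregation; the exact chain construction used in ThinkRL-Edit may differ.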

Qualitative Comparison with Baselines


We compare against baselines across diverse reasoning-centric editing tasks. Our method achieves precise instruction following with strong consistency and high quality, significantly surpassing previous methods. Blue text denotes the instruction, and green text indicates the desired editing outcome.

Citation

If you find this project useful in your research, please consider citing:

@article{li2026thinkrl,
  title={ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing},
  author={Li, Hengjia and Jiang, Liming and Yan, Qing and Song, Yizhi and Kang, Hao and Liu, Zichuan and Lu, Xin and Wu, Boxi and Cai, Deng},
  journal={arXiv preprint arXiv:2601.03467},
  year={2026}
}