TL;DR: a visually-aware Chain-of-Thought (CoT) for unified understanding-and-generation models, targeting high-fidelity multi-modal generation.
Problem: Existing Chain-of-Thought (CoT) approaches prioritize textual consistency but neglect visual context consistency: maintaining identity, attributes, and style from reference images during multi-modal generation.
Our Solution: Visual-Aware CoT (VACoT) integrates visual consistency directly into the reasoning process, enabling models to explicitly preserve visual elements while generating content.
We propose VACoT (Visual-Aware CoT), integrating visual context consistency into unified model reasoning through two key innovations:
1. Generates structured visual checklists to systematically identify the visual elements whose consistency must be preserved (a minimal sketch of such a checklist follows after this list).
2. Performs self-reflection and iterative refinement, using the visual checklists to evaluate and progressively improve generation quality.
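To make the checklist idea concrete, here is a minimal Python sketch of what a structured visual checklist could look like; the `ChecklistItem` / `VisualChecklist` classes and their fields are illustrative assumptions, not the released implementation.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical structure for a visual checklist: each item names one visual
# element from the reference image(s) that the generated image must preserve.
@dataclass
class ChecklistItem:
    element: str        # e.g., "person identity"
    attribute: str      # e.g., "same hairstyle and eye color"
    source_image: int   # index of the reference image the element comes from

@dataclass
class VisualChecklist:
    items: List[ChecklistItem] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the checklist as text that can be appended to the CoT."""
        lines = [f"- keep {it.element}: {it.attribute} (ref #{it.source_image})"
                 for it in self.items]
        return "Visual consistency checklist:\n" + "\n".join(lines)

# Example: a checklist for a two-reference generation request
checklist = VisualChecklist(items=[
    ChecklistItem("person identity", "same face, hairstyle, and skin tone", 0),
    ChecklistItem("jacket", "same red color and fabric texture", 0),
    ChecklistItem("background style", "watercolor style of the second reference", 1),
])
print(checklist.to_prompt())
```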
Key Innovation: VACoT explicitly incorporates visual awareness into both planning and correction phases, ensuring high-fidelity visual consistency with reference images.
We construct specialized datasets to train visual-aware reasoning capabilities.
A two-stage approach infuses visual awareness into the unified model.
VACoT significantly outperforms existing CoT approaches in visual context preservation.
VACoT ensures stable identity preservation and visual coherence in multi-reference generation, unlike text-focused baselines.
Images are typically refined successfully within 3 iterations; each iteration identifies remaining issues (e.g., a missing identity or an incorrect object) and guides the model to self-correct, as sketched below.
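A hedged sketch of this self-correction loop, assuming a callable generator and a checklist-based evaluator; the function names and signatures below are illustrative, not the paper's released API, but the 3-iteration cap and checklist-driven corrections follow the description above.

```python
from typing import Any, Callable, List, Tuple

Image = Any  # placeholder type for a generated image

MAX_ITERATIONS = 3  # images are typically refined successfully within 3 iterations

def refine_with_checklist(
    prompt: str,
    references: List[Image],
    generate: Callable[[str, List[Image]], Image],
    evaluate: Callable[[Image, List[Image]], Tuple[bool, List[str]]],
) -> Image:
    """Iteratively generate, self-reflect against the visual checklist,
    and regenerate until every checklist item is satisfied or the
    iteration budget is exhausted. `generate` and `evaluate` stand in
    for the unified model's generation and checklist-evaluation steps."""
    image = generate(prompt, references)  # initial generation
    for _ in range(MAX_ITERATIONS):
        passed, failed_items = evaluate(image, references)
        if passed:
            break  # all checklist items (identity, attributes, style) preserved
        # Self-reflection: turn failed items (e.g., missing identity,
        # incorrect object) into an explicit correction instruction.
        correction = "Fix the following issues: " + "; ".join(failed_items)
        image = generate(prompt + "\n" + correction, references)
    return image
```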
VACoT also exhibits zero-shot capability in style transfer.
If you find this work useful for your research, please consider citing our paper:
@article{ye2025visual,
  title={Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models},
  author={Ye, Zixuan and Liu, Quande and Wei, Cong and Zhang, Yuanxing and Wang, Xintao and Wan, Pengfei and Gai, Kun and Luo, Wenhan},
  journal={arXiv preprint arXiv:25XX.XXXXX},
  year={2025}
}