Visual-Aware CoT: Achieving High-Fidelity Visual Consistency
in Unified Models

Zixuan Ye¹†, Quande Liu²‡, Cong Wei², Yuanxing Zhang², Xintao Wang², Pengfei Wan², Kun Gai², Wenhan Luo¹‡
¹Hong Kong University of Science and Technology, ²Kuaishou Technology †Work done during an internship at KwaiVGI, Kuaishou Technology. ‡Corresponding Author

TL;DR: a visual-aware multi-modal Chain-of-Thought (CoT) for unified understanding-and-generation models, enabling high-fidelity visual consistency in multi-modal generation.

Motivation

CoT approaches for generation

Comparison of Text CoT, Text-align Multi-modal CoT, and Visual-Aware Multi-modal CoT workflows.

Problem: Existing Chain-of-Thought (CoT) approaches prioritize text consistency but neglect visual context consistency, i.e., maintaining the identity, attributes, and style of reference images during multi-modal generation.

Our Solution: Visual-Aware CoT (VACoT) integrates visual consistency directly into the reasoning process, enabling models to explicitly preserve visual elements while generating content.

Method

We propose VACoT (Visual-Aware CoT), integrating visual context consistency into unified model reasoning through two key innovations:

🎯 Adaptive Visual Planning

Generates structured visual checklists that systematically identify the visual elements requiring consistency preservation; a sketch follows below.
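A minimal sketch of how the planning step could be realized in code. The model interface (generate_text) and the checklist fields are illustrative assumptions for exposition, not the paper's released API:

from dataclasses import dataclass

@dataclass
class ChecklistItem:
    element: str    # e.g. "person identity", "jacket color"
    source: str     # which reference image the element comes from
    criterion: str  # what the generated image must satisfy

def plan_visual_checklist(model, references, prompt):
    # Ask the unified model (hypothetical generate_text call) to enumerate
    # the consistency-critical elements before any image is generated.
    plan_prompt = (
        "List the visual elements in the reference images (identity, "
        "attributes, style) that must stay consistent when generating: "
        + prompt + ". Answer one per line as 'element | source | criterion'."
    )
    raw = model.generate_text(images=references, prompt=plan_prompt)
    items = []
    for line in raw.strip().splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3:
            items.append(ChecklistItem(*parts))
    return items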

🔄 Iterative Visual Correction

Performs self-reflection and iterative refinement, using the visual checklist to evaluate each generation and progressively improve its quality; see the sketch below.
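A companion sketch of the correction loop, reusing ChecklistItem from the planning sketch above. The generate_image and verify calls are hypothetical stand-ins; the cap of three iterations mirrors the behavior reported in the Results section:

def generate_with_correction(model, references, prompt, checklist, max_iters=3):
    # Draft, then self-reflect: re-check every checklist item and regenerate
    # conditioned on the feedback until all criteria pass (or the cap is hit).
    image = model.generate_image(images=references, prompt=prompt)
    for _ in range(max_iters):
        failed = [item for item in checklist
                  if not model.verify(image=image, references=references,
                                      criterion=item.criterion)]
        if not failed:
            break  # every consistency criterion is satisfied
        feedback = "; ".join(
            "fix " + it.element + ": " + it.criterion for it in failed)
        image = model.generate_image(
            images=references, prompt=prompt + ". Corrections: " + feedback)
    return image

Under this view, the checklist produced at planning time doubles as the evaluation rubric at correction time, which is what ties the two innovations together.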

Key Innovation: VACoT explicitly incorporates visual awareness into both planning and correction phases, ensuring high-fidelity visual consistency with reference images.

Dataset Construction

We construct specialized datasets to train visual-aware reasoning capabilities.

Dataset construction pipeline showing generation of planning and correction datasets
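The released schema is not reproduced here; purely as an illustration, a planning sample and a correction sample might take shapes like the following (every field name below is an assumption, not the dataset's actual format):

planning_sample = {
    "references": ["ref_0.png", "ref_1.png"],
    "prompt": "the woman from image 1 wearing the jacket from image 2",
    "checklist": [  # target output of the planning stage
        {"element": "woman identity", "source": "ref_0.png",
         "criterion": "same face as the reference"},
        {"element": "jacket", "source": "ref_1.png",
         "criterion": "same color and texture"},
    ],
}

correction_sample = {
    "references": ["ref_0.png"],
    "draft_image": "draft.png",            # a deliberately flawed generation
    "violated_items": ["woman identity"],  # what self-reflection should find
    "feedback": "face identity drifted; regenerate to match ref_0.png",
    "corrected_image": "final.png",        # supervision for the refinement
}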

Training Strategy

A two-stage approach, supervised fine-tuning (SFT) followed by flow-GRPO, infuses visual awareness into the unified model; a high-level sketch follows the figure below.

Two-stage training pipeline: SFT followed by flow-GRPO
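A high-level sketch of the two-stage recipe: SFT on the planning and correction data, then GRPO-style reinforcement learning on the flow-matching generator. Every interface here (sft_loss, rollout, grpo_loss, the reward function) is an assumption made for illustration, not the released training code:

import itertools

def stage1_sft(model, planning_data, correction_data, optimizer):
    # Stage 1: supervised fine-tuning on checklist-planning and
    # correction traces to instill visual-aware reasoning.
    for batch in itertools.chain(planning_data, correction_data):
        loss = model.sft_loss(batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

def stage2_flow_grpo(model, prompts, reward_fn, optimizer, group_size=8):
    # Stage 2: GRPO-style RL on the flow-matching generator. Sample a
    # group of rollouts per prompt, normalize rewards within the group,
    # and push probability mass toward above-average samples.
    for prompt in prompts:
        rollouts = [model.rollout(prompt) for _ in range(group_size)]
        rewards = [reward_fn(r) for r in rollouts]  # e.g. checklist pass rate
        mean = sum(rewards) / group_size
        std = (sum((r - mean) ** 2 for r in rewards) / group_size) ** 0.5
        advantages = [(r - mean) / (std + 1e-6) for r in rewards]
        loss = model.grpo_loss(rollouts, advantages)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()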

Results

VACoT significantly outperforms existing CoT approaches in visual context preservation.

Qualitative Comparison

Qualitative comparison showing VACoT's superior identity consistency over UiG, BAGEL, and UniCoT.

VACoT ensures stable identity preservation and visual coherence in multi-reference generation, unlike text-focused baselines.

Iterative Correction Example

Example showing an initial generated image being corrected over several iterations using VACoT.

Images are typically refined successfully within three iterations: each pass identifies remaining issues (e.g., a missing identity or an incorrect object) and guides the model to self-correct.

Style Transfer

Zero-shot example of combining identity preservation and style transfer.

VACoT generalizes zero-shot to style transfer while maintaining identity consistency.

Citation

If you find this work useful for your research, please consider citing our paper:

@article{ye2025visual,
  title={Visual-Aware CoT: Achieving High-Fidelity Visual Consistency in Unified Models},
  author={Ye, Zixuan and Liu, Quande and Wei, Cong and Zhang, Yuanxing and Wang, Xintao and Wan, Pengfei and Gai, Kun and Luo, Wenhan},
  journal={arXiv preprint arXiv:25XX.XXXXX},
  year={2025}
}