UNIC: Unified In-Context Video Editing

UNIC: One Unified Model, Diverse Video Editing Tasks, Creative Task Composition.

1The Hong Kong University of Science and Technology

2Kuaishou Technology

*Equal contribution. Work done during an internship at KwaiVGI, Kuaishou Technology.

Corresponding author

Project Overview

UNIC (UNified In-Context Video Editing) is a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner.

UNIC: Framework & Designs

Motivation

(a) DDIM Inversion-based Methods (e.g., Video-P2P, FLATTEN):

  • Sub-optimal editing performance.
  • An additional inversion stage, which doubles the inference steps and overall cost.

(b) Adapter-based Methods:

  • Requiring modifications to the model architecture.
  • Introducing parameter redundancy through the addition of adapter modules.

These methods are also generally task-specific, requiring a separate module to be trained for each distinct condition signal. This severely hinders task extendability and the unification of various editing capabilities.
[Figure: Challenges of existing video editing conditioning approaches and UNIC's motivation]

Unified In-Context Framework

UNIC unifies video editing by processing all inputs—noisy video latents, reference video tokens, and varied multi-modal condition tokens—as a combined sequence. This allows the native attention mechanisms of a Diffusion Transformer (DiT) to learn complex editing tasks "in-context," offering flexibility and simplicity.

  • Unified model for diverse tasks.
  • Three input token types: noisy video latents, reference video tokens, and multi-modal condition tokens (see the sketch below).
  • No task-specific adapter modules.
[Figure 3: UNIC overall pipeline]
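To make the in-context formulation concrete, here is a minimal sketch of the sequence construction, assuming toy token shapes and a simple concatenation order; the function name and dimensions are illustrative and not taken from the UNIC codebase.

```python
import torch

# Toy shapes; UNIC's actual tokenization and dimensions may differ.
B, D = 1, 64
noisy_latents    = torch.randn(B, 4096, D)  # noisy target-video latent tokens
reference_tokens = torch.randn(B, 4096, D)  # tokens of the source/reference video
condition_tokens = torch.randn(B, 77, D)    # e.g. ID image / camera / style condition

def build_in_context_sequence(*streams):
    """Concatenate whichever token streams a task provides into one sequence,
    so the DiT's native self-attention can relate them in-context."""
    return torch.cat([s for s in streams if s is not None], dim=1)

tokens = build_in_context_sequence(reference_tokens, condition_tokens, noisy_latents)
print(tokens.shape)  # torch.Size([1, 8269, 64])
```

Because every task reduces to choosing which streams to place in this sequence, no task-specific branches or adapters are needed in the backbone.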

Task-aware RoPE

Dynamically assigns unique Rotary Positional Embedding (RoPE) frame indices based on task type and video length. This ensures coherent temporal understanding and correct alignment for diverse conditions.
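As a rough illustration of how frame indices might be assigned, the sketch below uses one plausible rule: frame-aligned tasks (e.g., stylization, propagation) let the reference share the target's temporal indices, while other tasks place the reference on a disjoint index range. The task names and the rule itself are assumptions for illustration, not the paper's exact scheme.

```python
def assign_rope_frame_indices(task: str, num_ref_frames: int, num_target_frames: int):
    """Illustrative task-aware frame-index assignment (not the paper's exact rule).

    Returns RoPE frame indices for the reference-video tokens and the noisy
    target-video tokens, depending on the task type and the video lengths.
    """
    target_idx = list(range(num_target_frames))
    if task in {"stylization", "propagation"}:
        # Frame-aligned tasks (assumed): reference frames share the target timeline.
        ref_idx = list(range(num_ref_frames))
    else:
        # Other tasks (assumed): reference frames occupy a disjoint index range
        # placed after the target frames.
        ref_idx = [num_target_frames + i for i in range(num_ref_frames)]
    return ref_idx, target_idx

print(assign_rope_frame_indices("stylization", 16, 16))
print(assign_rope_frame_indices("id_swap", 16, 16))
```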

Condition Bias

Adds a task-specific learnable embedding to condition tokens. This helps the model differentiate target tasks when modalities overlap, resolving ambiguity effectively.
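A minimal sketch of such a bias, assuming condition tokens of shape (batch, tokens, dim) and an integer task id; the module and argument names are placeholders.

```python
import torch
import torch.nn as nn

class ConditionBias(nn.Module):
    """Task-specific learnable bias added to condition tokens (illustrative sketch)."""

    def __init__(self, num_tasks: int, dim: int):
        super().__init__()
        self.bias = nn.Embedding(num_tasks, dim)  # one learnable vector per task

    def forward(self, condition_tokens: torch.Tensor, task_id: int) -> torch.Tensor:
        # Broadcast the task embedding over all condition tokens: (B, N, D) + (1, 1, D).
        return condition_tokens + self.bias.weight[task_id].view(1, 1, -1)

bias = ConditionBias(num_tasks=6, dim=64)
out = bias(torch.randn(1, 77, 64), task_id=2)
print(out.shape)  # torch.Size([1, 77, 64])
```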

[Figure: Task-aware RoPE and Condition Bias]

Capabilities

ID Insertion

ID Swap

ID Deletion

Re-Camera Control

Stylization

First-frame Propagation


Emergent Task Composition

UNIC also exhibits emergent task composition abilities.

Re-Camera + Stylization

Applying style while changing camera view.

ID + Stylization

Inserting/Swapping an ID and applying an artistic style simultaneously.
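Since all conditions live in the same token sequence, composing tasks amounts, at least conceptually, to concatenating several condition groups into one context. The sketch below assumes toy shapes and that each group would carry its own task-aware RoPE indices and condition bias; it is an illustration, not the project's inference code.

```python
import torch

B, D = 1, 64
reference_tokens = torch.randn(B, 4096, D)  # source video tokens
camera_tokens    = torch.randn(B, 24, D)    # placeholder re-camera trajectory condition
style_tokens     = torch.randn(B, 77, D)    # placeholder style condition
noisy_latents    = torch.randn(B, 4096, D)  # noisy target video tokens

# Both condition groups sit side by side in the same in-context sequence.
composed = torch.cat([reference_tokens, camera_tokens, style_tokens, noisy_latents], dim=1)
print(composed.shape)  # torch.Size([1, 8293, 64])
```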

Visual Comparisons with Other Methods

Showcasing UNIC's visual quality against other methods. Select a comparison task below. For detailed metrics, see Quantitative Results.

Comparison: ID Insertion

Comparison: ID Swap

Comparison: ID Deletion

Comparison: Re-Camera Control

Comparison: Propagation

Comparison: Stylization

Quantitative Performance Insights

Quantitative Benchmarks

UNIC consistently demonstrates competitive or superior performance across all benchmarked tasks. The following tables mirror Table 1 from our paper: each reports task-specific alignment metrics, CLIP-score, and video quality metrics (Smoothness, Dynamic, Aesthetic).

ID Insert

Method             | CLIP-I↑ | DINO-I↑ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
VACE [20]          | 0.522   | 0.110   | 0.100       | 0.933       | 44.568   | 5.407
Ours               | 0.598   | 0.245   | 0.216       | 0.961       | 11.07    | 5.627

ID Swap

Method             | CLIP-I↑ | DINO-I↑ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
VACE [20]          | 0.712   | 0.423   | 0.230       | 0.964       | 29.306   | 6.015
AnyV2V(Prop) [13]  | 0.605   | 0.229   | 0.218       | 0.917       | 7.596    | 4.842
Ours(Prop)         | 0.693   | 0.414   | 0.236       | 0.980       | 5.153    | 5.801
Ours               | 0.725   | 0.429   | 0.242       | 0.971       | 7.500    | 6.056

ID Delete

Method             | PSNR↑  | RefVideo-CLIP↑ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
AnyV2V(Prop) [13]  | 19.504 | 0.869          | 0.205       | 0.964       | 4.980    | 5.325
VACE [20]          | 20.947 | 0.883          | 0.211       | 0.966       | 15.441   | 5.332
VideoPainter [32]  | 22.987 | 0.920          | 0.212       | 0.957       | 13.759   | 5.403
Ours(Prop)         | 20.378 | 0.906          | 0.209       | 0.968       | 9.017    | 5.408
Ours               | 19.171 | 0.900          | 0.217       | 0.970       | 10.934   | 5.493

Propagation

Method             | RefVideo-CLIP↑ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
AnyV2V [13]        | 0.812          | 0.229       | 0.935       | 13.462   | 5.136
VACE(I2V) [20]     | 0.574          | 0.234       | 0.932       | 36.783   | 5.425
Ours               | 0.840          | 0.226       | 0.966       | 12.762   | 5.565

Stylization

Method             | CSD-Score↑ | ArtFID↓ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
AnyV2V(Prop) [13]  | 0.207      | 43.299  | 0.195       | 0.937       | 9.227    | 4.640
StyleMaster [11]   | 0.306      | 38.213  | 0.188       | 0.952       | 9.758    | 5.121
Ours(Prop)         | 0.197      | 36.198  | 0.215       | 0.932       | 11.569   | 5.045
Ours               | 0.259      | 37.619  | 0.171       | 0.945       | 9.370    | 5.276

Re-Camera Control

Method               | RotErr↓ | TransErr↓ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
ReCamMaster-Wan [17] | 1.454   | 5.695     | 0.219       | 0.917       | 31.751   | 4.738
Ours                 | 1.275   | 5.667     | 0.220       | 0.933       | 24.21    | 4.826

Note: ↑ indicates higher is better, ↓ indicates lower is better. Please refer to the full Table 1 in the paper for all metrics, detailed comparisons, and best-result highlighting.

Training Strategy Visualization

[Figure 2: Training strategies and per-task data volume]

Here we illustrate the two training strategies and the data volume used for each task.
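As a generic, hypothetical illustration of how tasks with very different data volumes can be balanced during multi-task training, the sketch below samples a task in proportion to a tempered data volume; it is not the paper's actual curriculum, and the toy counts in the usage line are placeholders only.

```python
import random
from typing import Dict

def sample_task(data_volume: Dict[str, int], temperature: float = 2.0) -> str:
    """Sample a training task with probability proportional to volume ** (1 / temperature).

    A temperature > 1 flattens the distribution so low-volume tasks are not starved.
    """
    tasks = list(data_volume)
    weights = [data_volume[t] ** (1.0 / temperature) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

# Toy counts for illustration only; the real per-task data volumes are shown in Figure 2.
print(sample_task({"id_swap": 10_000, "stylization": 2_000, "re_camera": 500}))
```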