UNIC: Unified In-Context Video Editing

UNIC: One Unified Model, Diverse Video Editing Tasks, Creative Task Composition.

1The Hong Kong University of Science and Technology

2Kuaishou Technology

*Equal contribution. Work done during an internship at KwaiVGI, Kuaishou Technology.

Corresponding author

Project Overview

UNIC (UNified In-Context Video Editing) is a simple yet effective framework that unifies diverse video editing tasks within a single model in an in-context manner.

UNIC: Framework & Designs

Motivation

(a) DDIM Inversion-based Methods (e.g., Video-P2P, FLATTEN):

  • Sub-optimal editing performance.
  • An additional inversion stage, which doubles the inference steps and overall cost.

(b) Adapter-based Methods:

  • Requiring modifications to the model architecture.
  • Introducing parameter redundancy through the addition of adapter modules.

These methods are also generally task-specific, requiring a separate module to be trained for each distinct condition signal. This severely hinders task extendability and the unification of various editing capabilities.
[Figure: Challenges of existing video editing conditioning approaches and UNIC's motivation]

Unified In-Context Framework

UNIC unifies video editing by processing all inputs—noisy video latents, reference video tokens, and varied multi-modal condition tokens—as a combined sequence. This allows the native attention mechanisms of a Diffusion Transformer (DiT) to learn complex editing tasks "in-context," offering flexibility and simplicity.

  • Unified model for diverse tasks.
  • Three input token types: noisy video latents, reference video tokens, and multi-modal condition tokens (see the sketch below).
  • No task-specific adapter modules.
[Figure 3: UNIC overall pipeline]
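To make the in-context formulation concrete, here is a minimal sketch of the sequence construction, assuming toy token shapes and a simple concatenation order; the function name and dimensions are illustrative and not taken from the UNIC codebase.

```python
import torch

# Toy shapes; UNIC's actual tokenization and dimensions may differ.
B, D = 1, 64
noisy_latents    = torch.randn(B, 4096, D)  # noisy target-video latent tokens
reference_tokens = torch.randn(B, 4096, D)  # tokens of the source/reference video
condition_tokens = torch.randn(B, 77, D)    # e.g. ID image / camera / style condition

def build_in_context_sequence(*streams):
    """Concatenate whichever token streams a task provides into one sequence,
    so the DiT's native self-attention can relate them in-context."""
    return torch.cat([s for s in streams if s is not None], dim=1)

tokens = build_in_context_sequence(reference_tokens, condition_tokens, noisy_latents)
print(tokens.shape)  # torch.Size([1, 8269, 64])
```

Because every task reduces to choosing which streams to place in this sequence, no task-specific branches or adapters are needed in the backbone.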

Task-aware RoPE

Dynamically assigns unique Rotary Positional Embedding (RoPE) frame indices based on task type and video length. This ensures coherent temporal understanding and correct alignment for diverse conditions.
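As a rough illustration of how frame indices might be assigned, the sketch below uses one plausible rule: frame-aligned tasks (e.g., stylization, propagation) let the reference share the target's temporal indices, while other tasks place the reference on a disjoint index range. The task names and the rule itself are assumptions for illustration, not the paper's exact scheme.

```python
def assign_rope_frame_indices(task: str, num_ref_frames: int, num_target_frames: int):
    """Illustrative task-aware frame-index assignment (not the paper's exact rule).

    Returns RoPE frame indices for the reference-video tokens and the noisy
    target-video tokens, depending on the task type and the video lengths.
    """
    target_idx = list(range(num_target_frames))
    if task in {"stylization", "propagation"}:
        # Frame-aligned tasks (assumed): reference frames share the target timeline.
        ref_idx = list(range(num_ref_frames))
    else:
        # Other tasks (assumed): reference frames occupy a disjoint index range
        # placed after the target frames.
        ref_idx = [num_target_frames + i for i in range(num_ref_frames)]
    return ref_idx, target_idx

print(assign_rope_frame_indices("stylization", 16, 16))
print(assign_rope_frame_indices("id_swap", 16, 16))
```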

Condition Bias

Adds a task-specific learnable embedding to condition tokens. This helps the model differentiate target tasks when modalities overlap, resolving ambiguity effectively.
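A minimal sketch of such a bias, assuming condition tokens of shape (batch, tokens, dim) and an integer task id; the module and argument names are placeholders.

```python
import torch
import torch.nn as nn

class ConditionBias(nn.Module):
    """Task-specific learnable bias added to condition tokens (illustrative sketch)."""

    def __init__(self, num_tasks: int, dim: int):
        super().__init__()
        self.bias = nn.Embedding(num_tasks, dim)  # one learnable vector per task

    def forward(self, condition_tokens: torch.Tensor, task_id: int) -> torch.Tensor:
        # Broadcast the task embedding over all condition tokens: (B, N, D) + (1, 1, D).
        return condition_tokens + self.bias.weight[task_id].view(1, 1, -1)

bias = ConditionBias(num_tasks=6, dim=64)
out = bias(torch.randn(1, 77, 64), task_id=2)
print(out.shape)  # torch.Size([1, 77, 64])
```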

[Figure: Task-aware RoPE and Condition Bias]

Capabilities

ID Insertion

ID Swap

ID Deletion

Re-Camera Control

Stylization

First-frame Propagation


Emergent Task Composition

UNIC also exhibits emergent task composition abilities.

Re-Camera + Stylization

Applying style while changing camera view.

ID + Stylization

Inserting/Swapping an ID and applying an artistic style simultaneously.
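Since all conditions live in the same token sequence, composing tasks amounts, at least conceptually, to concatenating several condition groups into one context. The sketch below assumes toy shapes and that each group would carry its own task-aware RoPE indices and condition bias; it is an illustration, not the project's inference code.

```python
import torch

B, D = 1, 64
reference_tokens = torch.randn(B, 4096, D)  # source video tokens
camera_tokens    = torch.randn(B, 24, D)    # placeholder re-camera trajectory condition
style_tokens     = torch.randn(B, 77, D)    # placeholder style condition
noisy_latents    = torch.randn(B, 4096, D)  # noisy target video tokens

# Both condition groups sit side by side in the same in-context sequence.
composed = torch.cat([reference_tokens, camera_tokens, style_tokens, noisy_latents], dim=1)
print(composed.shape)  # torch.Size([1, 8293, 64])
```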

Visual Comparisons with Other Methods

Showcasing UNIC's visual quality against other methods. Select a comparison task below. For detailed metrics, see Quantitative Results.

Comparison: ID Insertion

Comparison: ID Swap

Comparison: ID Deletion

Comparison: Re-Camera Control

Comparison: Propagation

Comparison: Stylization

Quantitative Performance Insights

Quantitative Benchmarks

UNIC consistently demonstrates competitive or superior performance across all benchmarked tasks. The following tables mirror Table 1 from our paper: each reports task-specific alignment metrics, CLIP-score, and video quality metrics (Smoothness, Dynamic, Aesthetic).

ID Insert

Method             | CLIP-I↑ | DINO-I↑ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
VACE [20]          | 0.522   | 0.110   | 0.100       | 0.933       | 44.568   | 5.407
Ours               | 0.598   | 0.245   | 0.216       | 0.961       | 11.07    | 5.627

ID Swap

Method             | CLIP-I↑ | DINO-I↑ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
VACE [20]          | 0.712   | 0.423   | 0.230       | 0.964       | 29.306   | 6.015
AnyV2V(Prop) [13]  | 0.605   | 0.229   | 0.218       | 0.917       | 7.596    | 4.842
Ours(Prop)         | 0.693   | 0.414   | 0.236       | 0.980       | 5.153    | 5.801
Ours               | 0.725   | 0.429   | 0.242       | 0.971       | 7.500    | 6.056

ID Delete

Method             | PSNR↑  | RefVideo-CLIP↑ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
AnyV2V(Prop) [13]  | 19.504 | 0.869          | 0.205       | 0.964       | 4.980    | 5.325
VACE [20]          | 20.947 | 0.883          | 0.211       | 0.966       | 15.441   | 5.332
VideoPainter [32]  | 22.987 | 0.920          | 0.212       | 0.957       | 13.759   | 5.403
Ours(Prop)         | 20.378 | 0.906          | 0.209       | 0.968       | 9.017    | 5.408
Ours               | 19.171 | 0.900          | 0.217       | 0.970       | 10.934   | 5.493

Propagation

Method             | RefVideo-CLIP↑ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
AnyV2V [13]        | 0.812          | 0.229       | 0.935       | 13.462   | 5.136
VACE(I2V) [20]     | 0.574          | 0.234       | 0.932       | 36.783   | 5.425
Ours               | 0.840          | 0.226       | 0.966       | 12.762   | 5.565

Stylization

Method             | CSD-Score↑ | ArtFID↓ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
AnyV2V(Prop) [13]  | 0.207      | 43.299  | 0.195       | 0.937       | 9.227    | 4.640
StyleMaster [11]   | 0.306      | 38.213  | 0.188       | 0.952       | 9.758    | 5.121
Ours(Prop)         | 0.197      | 36.198  | 0.215       | 0.932       | 11.569   | 5.045
Ours               | 0.259      | 37.619  | 0.171       | 0.945       | 9.370    | 5.276

Re-Camera Control

Method               | RotErr↓ | TransErr↓ | CLIP-score↑ | Smoothness↑ | Dynamic↑ | Aesthetic↑
ReCamMaster-Wan [17] | 1.454   | 5.695     | 0.219       | 0.917       | 31.751   | 4.738
Ours                 | 1.275   | 5.667     | 0.220       | 0.933       | 24.21    | 4.826

Note: ↑ indicates higher is better, ↓ indicates lower is better. Please refer to the full Table 1 in the paper for all metrics, detailed comparisons, and best-result highlighting.

Training Strategy Visualization

[Figure 2: Training strategies and per-task data volume]

Here we illustrate the two training strategies and the data volume used for each task.
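As a generic, hypothetical illustration of how tasks with very different data volumes can be balanced during multi-task training, the sketch below samples a task in proportion to a tempered data volume; it is not the paper's actual curriculum, and the toy counts in the usage line are placeholders only.

```python
import random
from typing import Dict

def sample_task(data_volume: Dict[str, int], temperature: float = 2.0) -> str:
    """Sample a training task with probability proportional to volume ** (1 / temperature).

    A temperature > 1 flattens the distribution so low-volume tasks are not starved.
    """
    tasks = list(data_volume)
    weights = [data_volume[t] ** (1.0 / temperature) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

# Toy counts for illustration only; the real per-task data volumes are shown in Figure 2.
print(sample_task({"id_swap": 10_000, "stylization": 2_000, "re_camera": 500}))
```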