Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics

Tianshuo Xu1, Zhifei Chen1, Leyi Wu1, Hao Lu1, Ying-cong Chen1,2*

1HKUST (GZ)    2HKUST    *Corresponding Author


Given a single image and user-drawn motion trajectories, Motion Forcing generates physically consistent videos through a hierarchical Point → Shape → Appearance pipeline — decoupling 3D geometry reasoning from texture synthesis to handle complex dynamics such as collisions, cut-ins, and multi-object interactions.

Pipeline Illustration

(a) Preparation of Motion Representations

Input Image → Depth & Seg Models → Static Depth + Ego Motion → Depth Warping Video
Input Image → Depth & Seg Models → Point Image + Object Motion → Point Video

Static depth = depth × ~seg (dynamic objects removed); point image = minimum inscribed circles of the object masks.
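The two representations above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's released code: the function names and the brute-force inscribed-circle search are our assumptions (a distance transform, e.g. `cv2.distanceTransform`, would be the efficient equivalent).

```python
import numpy as np

def static_depth(depth, seg):
    """Static depth = depth x ~seg: zero out pixels of dynamic objects."""
    return depth * ~seg.astype(bool)

def inscribed_circle(mask):
    """Largest inscribed circle of a binary object mask.

    Brute force for illustration: the center is the inside pixel
    farthest from any outside pixel; the radius is that distance.
    """
    inside = np.argwhere(mask)
    outside = np.argwhere(~mask)
    # Distance from every inside pixel to its nearest outside pixel.
    d = np.sqrt(((inside[:, None, :] - outside[None, :, :]) ** 2).sum(-1)).min(1)
    i = d.argmax()
    return tuple(inside[i]), float(d[i])
```

For a 5×5 square mask centered in a 7×7 grid, this recovers the square's center and a radius of 3 pixels; the circles for all object masks are then rasterized into the point image.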

(b) Two-Stage Generation

Input Image + Depth Warping Video + Point Video → Stage 1: Motion Forcing (Point → Shape) → Depth Video
Depth Video → Stage 2: Motion Forcing (Shape → Appearance) → RGB Video
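The two stages compose as a simple sequential pipeline. A hedged skeleton of the inference flow, with placeholder model callables (the real conditioning interface, including whether Stage 2 also sees the input image, is our assumption):

```python
def generate(image, depth_warping_video, point_video,
             stage1_model, stage2_model):
    """Two-stage Point -> Shape -> Appearance inference sketch.

    `stage1_model` and `stage2_model` are placeholders for the two
    Motion Forcing stages; their signatures are illustrative only.
    """
    # Stage 1 (Point -> Shape): reason about 3D geometry, predicting a
    # depth video from sparse point trajectories and ego-motion warping.
    depth_video = stage1_model(image, depth_warping_video, point_video)
    # Stage 2 (Shape -> Appearance): synthesize RGB frames conditioned
    # on the predicted geometry (image conditioning assumed here).
    rgb_video = stage2_model(image, depth_video)
    return rgb_video
```

Decoupling the stages this way lets geometry errors be corrected before texture synthesis, which is the core claim of the framework.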

Comparisons — Driving

Case 1

Condition
Ours
Wan 2.6
Seed Dance 2.0
MOFA-Video

Case 2

Condition
Ours
Wan 2.6
Seed Dance 2.0
MOFA-Video

More Driving Scenes

Scene 1

Dangerous Cut-in
Double Cut-in
Left Cut-in & Brake
Right Cut-in

Scene 2 & 3

Front Car Braking
Dangerous Right Cut-out
Reverse Car Left Cut-in
Right Cut-in

Ego-Motion Control

The same scene with different ego-vehicle trajectories (speed up / slow down / turn left / turn right).

Speed Up
Slow Down
Turn Left
Turn Right

Comparisons — Physics (Physion)

Case 1

Condition
Ours
Wan 2.6
Seed Dance 2.0
MOFA-Video

Case 2

Condition
Ours
Wan 2.6
Seed Dance 2.0
MOFA-Video

More Physics Actions

Action 2 — Condition
Action 2 — Generated
Action 4 — Condition
Action 4 — Generated

Embodied AI (Jaco Play)

Case 1 · Action 1
Case 1 · Action 2
Case 2 · Action 1
Case 2 · Action 2
Case 3 · Action 1
Case 3 · Action 2
Case 4 · Action 1
Case 4 · Action 2

Failure Cases

When the control signals deviate significantly from realistic scenarios, the model can still produce physically implausible results.

Case 1 — Condition
Case 1 — Result
Case 2 — Condition
Case 2 — Result

Acknowledgements

We thank the authors of CogVideoX, Video-Depth-Anything, VGGT, and Ultralytics YOLO for their outstanding open-source contributions.

Citation

@misc{xu2026motion,
      title={Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics}, 
      author={Tianshuo Xu and Zhifei Chen and Leyi Wu and Hao Lu and Ying-cong Chen},
      year={2026},
      eprint={2603.10408},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2603.10408}, 
}