Pipeline Illustration
[Figure panels: Input Image, Models, Static Depth, Ego Motion, Point Image, Object Motion]
Static depth = depth × ~seg (dynamic objects removed); Point image = minimum inscribed circles of object masks.
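The two definitions in the caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `static_depth` and `inscribed_circle` are hypothetical, the depth map and segmentation mask are assumed to be same-sized nested lists, and the circle routine takes one possible reading of the caption (the circle centered at the mask pixel farthest from any non-mask pixel or the image border), found by brute force.

```python
import math

def static_depth(depth, seg):
    """Compute depth x ~seg: zero the depth wherever seg marks a dynamic object."""
    return [[0.0 if s else d for d, s in zip(drow, srow)]
            for drow, srow in zip(depth, seg)]

def inscribed_circle(mask):
    """Return (row, col, radius) of a circle inscribed in a binary object mask.

    Assumption: the center is the mask pixel farthest from any background
    pixel, where a one-pixel ring just outside the image counts as background.
    Brute force, so only suitable for small masks. Returns None for an empty mask.
    """
    h, w = len(mask), len(mask[0])
    outside = [(r, c)
               for r in range(-1, h + 1)
               for c in range(-1, w + 1)
               if not (0 <= r < h and 0 <= c < w) or not mask[r][c]]
    best = None
    for r in range(h):
        for c in range(w):
            if mask[r][c]:
                # Distance from this mask pixel to the nearest background pixel.
                d = min(math.hypot(r - pr, c - pc) for pr, pc in outside)
                if best is None or d > best[2]:
                    best = (r, c, d)
    return best

# Toy inputs: a 2x2 depth map with one dynamic pixel, and a 5x5 mask
# containing a 3x3 object in the middle.
depth = [[1.0, 2.0], [3.0, 4.0]]
seg = [[0, 1], [0, 0]]
masked = static_depth(depth, seg)

mask = [[1 if 1 <= r <= 3 and 1 <= c <= 3 else 0 for c in range(5)]
        for r in range(5)]
circle = inscribed_circle(mask)
```

In practice a distance transform would replace the brute-force search on full-resolution masks; the nested loops here are only meant to make the definition explicit.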
Comparisons — Driving
Case 1
Condition
Case 2
Condition
More Driving Scenes
Scene 1




Scene 2 & 3




Ego-Motion Control
The same scene with different ego-vehicle trajectories (speed up / slow down / turn left / turn right).
Comparisons — Physics (Physion)
Case 1
Condition
Case 2
Condition
More Physics Actions


Embodied AI (Jaco Play)








Failure Cases
When the control signals deviate significantly from realistic scenarios, the model can still produce incorrect results.


Acknowledgements
We thank the authors of CogVideoX, Video-Depth-Anything, VGGT, and Ultralytics YOLO for their outstanding open-source contributions.
Citation
@misc{xu2026motion,
title={Motion Forcing: A Decoupled Framework for Robust Video Generation in Motion Dynamics},
author={Tianshuo Xu and Zhifei Chen and Leyi Wu and Hao Lu and Ying-cong Chen},
year={2026},
eprint={2603.10408},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.10408},
}