Static point cloud from input image, rendered as background RGB/depth.
User-specified camera path for controllable viewpoint exploration.
Per-object trajectories encoding position and orientation over time.
Specify camera motion while keeping objects stationary.
Control object trajectories with fixed camera viewpoint.
Simultaneously specify camera and multi-object trajectories.
VerseCrafter is a controllable video world model for real-world scenes. It provides explicit 4D geometric control by allowing users to specify a target camera trajectory and multi-object 3D Gaussian trajectories, so the generated video follows both viewpoint changes and object motions with strong spatiotemporal consistency. Given a single reference view, we lift the scene into geometry-aware controls and render them into per-frame maps that serve as conditioning signals for diffusion-based video synthesis. VerseCrafter is trained on VerseControl4D, a large-scale in-the-wild dataset with automatically derived camera trajectories and multi-object 3D Gaussian trajectories, enabling robust control across diverse dynamic and static scenes.
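To make the conditioning concrete, below is a minimal NumPy sketch that z-buffers a static point cloud into a per-frame depth map, the simplest of the control maps described above. The function name and signature are our assumptions for illustration; the actual renderer also produces RGB and composites the posed object Gaussians.

```python
import numpy as np

def render_depth_map(points_xyz, K, w2c, H, W):
    """Z-buffer a static point cloud into one frame's depth control map.

    points_xyz: (N, 3) world-space points lifted from the reference view
    K:          (3, 3) pinhole intrinsics
    w2c:        (4, 4) world-to-camera extrinsics for this frame
    """
    ones = np.ones((len(points_xyz), 1))
    cam = (w2c @ np.concatenate([points_xyz, ones], axis=1).T).T[:, :3]
    cam = cam[cam[:, 2] > 1e-6]                  # keep points in front of camera
    uv = (K @ cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)        # perspective divide -> pixels
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    z = cam[:, 2]
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth = np.full((H, W), np.inf)
    np.minimum.at(depth, (v[ok], u[ok]), z[ok])  # nearest point wins per pixel
    return depth
```

Running this once per frame along the user-specified camera trajectory yields the per-frame map sequence that conditions the diffusion model.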
We adopt Wan2.1 as a frozen latent video diffusion prior. Keeping the backbone unchanged preserves its strong realism and generalization, while our geometry-aware controls learn to steer generation without degrading visual quality.
A lightweight GeoAdapter encodes the rendered 4D control maps and injects them into selected diffusion blocks as zero-initialized residual modulations. This design enables precise camera and multi-object motion control while maintaining sharp, geometrically coherent videos.
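For readers who want the injection pattern in code, here is a minimal PyTorch sketch of a zero-initialized residual adapter for one frozen diffusion block. The class name, encoder depth, and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GeoAdapterBlock(nn.Module):
    """Zero-initialized residual injection for one frozen diffusion block.
    Class name, encoder depth, and shapes are illustrative assumptions."""

    def __init__(self, ctrl_dim: int, hidden_dim: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(ctrl_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Zero-init the output projection so the adapter is an identity at
        # step 0: the frozen backbone's outputs are initially untouched.
        self.proj_out = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, hidden: torch.Tensor, ctrl_tokens: torch.Tensor) -> torch.Tensor:
        # hidden:      (B, N, hidden_dim) tokens from a frozen diffusion block
        # ctrl_tokens: (B, N, ctrl_dim) tokens from the rendered control maps
        return hidden + self.proj_out(self.encode(ctrl_tokens))

# At initialization the residual is exactly zero, so the frozen backbone's
# generation quality is preserved before any control training:
block = GeoAdapterBlock(ctrl_dim=64, hidden_dim=1024)
h, c = torch.randn(2, 256, 1024), torch.randn(2, 256, 64)
assert torch.allclose(block(h, c), h)
```

The zero-initialized projection is what lets control strength grow gradually during training instead of perturbing the pretrained prior from the first step.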
This video shows how users can intuitively design a custom camera trajectory and 3D Gaussian object trajectories within Blender. The resulting trajectories are exported as control maps and used by VerseCrafter for geometry-consistent, controllable video generation.
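As a rough idea of what such an export can look like, here is a minimal Blender Python (bpy) sketch that samples the active camera's world matrix at every frame and writes it to JSON. The JSON layout and file name are assumptions; VerseCrafter's actual control-map export format is not shown here.

```python
import json
import bpy

scene = bpy.context.scene
cam = scene.camera  # the active camera animated by the user

frames = []
for f in range(scene.frame_start, scene.frame_end + 1):
    scene.frame_set(f)  # evaluate the animation at this frame
    frames.append({
        "frame": f,
        # 4x4 camera-to-world matrix, row-major
        "cam_to_world": [list(row) for row in cam.matrix_world],
    })

with open("camera_trajectory.json", "w") as fp:
    json.dump(frames, fp, indent=2)
```

Object trajectories can be sampled the same way from each object's `matrix_world` and rendered into control maps downstream.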
Training a video world model with precise 4D geometric control requires large-scale data with accurate annotations. We introduce VerseControl4D, a dataset constructed from Sekai-Real-HQ and SpatialVID-HQ with complete geometric supervision.
VerseControl4D contains 35,000 training clips and 1,000 validation/test clips. Overall, 26% of the clips come from Sekai-Real-HQ and 74% from SpatialVID-HQ, reflecting their complementary scene coverage.
To support both camera-only world exploration and coordinated camera-object control, VerseControl4D includes dynamic scenes (clips with salient foreground object motion together with camera motion) and static scenes (clips with negligible object motion and only camera movement). About 20% of the training clips are static scenes, and the validation set additionally includes 250 static-scene clips for dedicated camera-only evaluation.
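For a back-of-envelope sense of the composition, the snippet below converts the quoted percentages into approximate clip counts, assuming the 26%/74% source split is over all 36,000 clips and the ~20% static fraction is over the training split.

```python
# Rough composition counts implied by the quoted percentages (assumptions
# noted above; exact per-split proportions are not stated).
train, val_test = 35_000, 1_000
total = train + val_test
print(f"Sekai-Real-HQ:  ~{0.26 * total:,.0f} clips")   # ~9,360
print(f"SpatialVID-HQ:  ~{0.74 * total:,.0f} clips")   # ~26,640
print(f"Static (train): ~{0.20 * train:,.0f} clips")   # ~7,000
```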
Rendered control signals (left) are faithfully translated into high-quality videos (right). Swipe through the carousel to inspect twenty diverse scenes with aligned rendered controls and generated outputs.
We capture the same dynamic scene from two player viewpoints and independently generate videos from each view. VerseCrafter produces consistent multi-view world dynamics with aligned camera and object motions.
If you find VerseCrafter useful in your research, please cite us:
@article{zheng2026versecrafter,
  title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
  author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
  journal={arXiv preprint arXiv:2601.05138},
  year={2026}
}