VerseCrafter

Dynamic Realistic Video World Model with 4D Geometric Control
¹Fudan University  ²Shanghai Innovation Institute  ³HKU  ⁴ARC Lab, Tencent PCG
†Corresponding authors.
✨ A controllable video world model with explicit 4D geometric control over camera and multi-object motion.

📋 TL;DR

VerseCrafter is a controllable video world model: given a single reference image, a target camera trajectory, and multi-object 3D Gaussian trajectories, it generates realistic videos that follow both the specified viewpoint changes and object motions.

🎯 Flexible 4D Geometric Control

4D Representation

πŸ— Background Geometry

Static point cloud from input image, rendered as background RGB/depth.

🎬 Camera Trajectory

User-specified camera path for controllable viewpoint exploration.

πŸƒ 3D Gaussian Trajectories

Per-object trajectories encoding position and orientation over time.
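Taken together, these three components form the 4D control signal. As a minimal sketch of how such a representation might be organized (the class and field names below are illustrative, not VerseCrafter's actual interface):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class GeometricControl4D:
    """Illustrative container for the three 4D control components."""
    # Background geometry: a static point cloud (positions + colors).
    background_points: np.ndarray  # (N, 3) float32, world coordinates
    background_colors: np.ndarray  # (N, 3) uint8

    # Camera trajectory: per-frame world-to-camera extrinsics + shared intrinsics.
    extrinsics: np.ndarray         # (T, 4, 4) float32
    intrinsics: np.ndarray         # (3, 3) float32

    # 3D Gaussian trajectories: per-object position and orientation per frame.
    object_positions: np.ndarray   # (M, T, 3) float32
    object_rotations: np.ndarray   # (M, T, 4) float32, unit quaternions

    @property
    def num_frames(self) -> int:
        return self.extrinsics.shape[0]
```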

Control Modes

Camera-Only

Specify camera motion while keeping objects stationary.

Object-Only

Control object trajectories from a fixed camera viewpoint.

Joint Control

Simultaneously specify camera and multi-object trajectories.
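Under the representation sketched above, the three modes differ only in which trajectories actually vary over time. A hedged sketch (again with hypothetical names) that freezes whichever component a mode leaves uncontrolled:

```python
import numpy as np

def apply_control_mode(ctrl: GeometricControl4D, mode: str) -> GeometricControl4D:
    """Reduce a full 4D control to one of the three modes (illustrative only)."""
    T = ctrl.num_frames
    if mode == "camera_only":
        # Freeze objects: repeat each object's frame-0 pose across all frames.
        ctrl.object_positions = np.repeat(ctrl.object_positions[:, :1], T, axis=1)
        ctrl.object_rotations = np.repeat(ctrl.object_rotations[:, :1], T, axis=1)
    elif mode == "object_only":
        # Freeze the camera: repeat the frame-0 extrinsic across all frames.
        ctrl.extrinsics = np.repeat(ctrl.extrinsics[:1], T, axis=0)
    elif mode != "joint":
        raise ValueError(f"unknown control mode: {mode!r}")
    return ctrl
```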

Camera Trajectory Control

Trajectory Animation
Generated Video

3D Gaussian Trajectory Control

Trajectory Animation
Generated Video

Camera + 3D Gaussian Control

Trajectory Animation
Generated Video


πŸ—οΈ Framework

VerseCrafter Framework
Framework of VerseCrafter. We render 4D geometric controls (camera and 3D Gaussian trajectories) as multi-channel maps and inject them into a frozen Wan2.1 backbone via GeoAdapter for geometry-consistent generation.

VerseCrafter is a controllable video world model for real-world scenes. It provides explicit 4D geometric control by allowing users to specify a target camera trajectory and multi-object 3D Gaussian trajectories, so the generated video follows both viewpoint changes and object motions with strong spatiotemporal consistency. Given a single reference view, we lift the scene into geometry-aware controls and render them into per-frame maps that serve as conditioning signals for diffusion-based video synthesis. VerseCrafter is trained on VerseControl4D, a large-scale in-the-wild dataset with automatically derived camera trajectories and multi-object 3D Gaussian trajectories, enabling robust control across diverse dynamic and static scenes.

Frozen Wan2.1 Backbone

We adopt Wan2.1 as a frozen latent video diffusion prior. Keeping the backbone unchanged preserves its strong realism and generalization, while the geometry-aware control branch learns to steer generation without degrading visual quality.
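In PyTorch terms, this is the standard freeze-the-prior pattern; the sketch below assumes generic backbone and adapter modules and is not VerseCrafter's training code:

```python
import torch

def setup_training(backbone: torch.nn.Module, adapter: torch.nn.Module):
    """Freeze the diffusion prior; optimize only the adapter's parameters."""
    for p in backbone.parameters():
        p.requires_grad_(False)   # no gradients flow into the prior's weights
    backbone.eval()               # keep dropout/normalization in inference mode

    # Only the (small) adapter is handed to the optimizer.
    return torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```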

GeoAdapter for 4D Control

A lightweight GeoAdapter encodes the rendered 4D control maps and injects them into selected diffusion blocks as zero-initialized residual modulations. This design enables precise camera and multi-object motion control while maintaining sharp, geometrically coherent videos.
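Zero-initialized residual injection is the same trick popularized by ControlNet-style adapters: at initialization the adapter contributes exactly zero, so the frozen backbone's output is untouched, and training gradually turns the control on. A minimal sketch for one diffusion block (the real GeoAdapter architecture is not specified on this page):

```python
import torch
import torch.nn as nn

class ZeroInitResidual(nn.Module):
    """Sketch of a zero-initialized residual modulation for one diffusion block."""

    def __init__(self, control_dim: int, hidden_dim: int):
        super().__init__()
        self.encode = nn.Linear(control_dim, hidden_dim)
        self.proj = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.proj.weight)  # zero-init: the block's output is
        nn.init.zeros_(self.proj.bias)    # unchanged at step 0, preserving the prior

    def forward(self, hidden: torch.Tensor, control: torch.Tensor) -> torch.Tensor:
        # Backbone features plus an (initially zero) control-dependent residual.
        return hidden + self.proj(torch.relu(self.encode(control)))
```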

βš–οΈ Comparison with SOTA Methods

🖥️ Interactive 4D Control Interface

This video shows how users can intuitively design custom camera trajectories and 3D Gaussian object trajectories within Blender. The resulting trajectories are exported as control maps and used by VerseCrafter for geometry-consistent, controllable video generation.
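As one concrete possibility, per-frame camera poses can be read out of a Blender animation with its Python API. This is a generic bpy sketch (the 81-frame default simply mirrors the dataset's clip length), not the project's released exporter:

```python
import bpy
import numpy as np

def export_camera_trajectory(name: str = "Camera",
                             start: int = 1, end: int = 81) -> np.ndarray:
    """Collect the camera's 4x4 world matrix at every frame of the animation.

    Returns (T, 4, 4) camera-to-world matrices; invert each one to obtain the
    world-to-camera extrinsics used when rendering control maps.
    """
    cam = bpy.data.objects[name]
    scene = bpy.context.scene
    poses = []
    for frame in range(start, end + 1):
        scene.frame_set(frame)                    # step the animation
        poses.append(np.array(cam.matrix_world))  # snapshot the camera pose
    return np.stack(poses)
```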

πŸ—‚οΈ VerseControl4D Dataset

Training a video world model with precise 4D geometric control requires large-scale data with accurate annotations. We introduce VerseControl4D, a dataset constructed from Sekai-Real-HQ and SpatialVID-HQ with complete geometric supervision.

Dataset Pipeline
Dataset construction pipeline. Starting from Sekai-Real-HQ and SpatialVID-HQ, we extract 81-frame clips and apply quality filtering. For each retained clip, Qwen2.5-VL-72B, Grounded-SAM2, and MegaSAM provide captions, object masks, depth, and camera poses, which are lifted into background/object point clouds, fitted with 3D Gaussian trajectories, and rendered as background/trajectory maps plus a merged mask that constitute our 4D Geometric Control.
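Restating the caption as pseudocode may make the data flow easier to follow; every helper below is a hypothetical placeholder, since the pipeline's actual interfaces are not given on this page:

```python
def annotate_clip(clip):
    """Hypothetical outline of the VerseControl4D annotation pipeline."""
    # 1. Off-the-shelf models supply the raw supervision.
    caption = run_qwen25_vl(clip)       # Qwen2.5-VL-72B captioning (placeholder)
    masks = run_grounded_sam2(clip)     # per-object masks (placeholder)
    depth, poses = run_megasam(clip)    # depth maps + camera poses (placeholder)

    # 2. Lift to 3D: a static background cloud and per-object point clouds.
    background = lift_background(clip, depth, poses, masks)
    objects = [lift_object(clip, depth, poses, m) for m in masks]

    # 3. Fit each object with a 3D Gaussian trajectory (pose per frame).
    gaussians = [fit_gaussian_trajectory(obj) for obj in objects]

    # 4. Render the 4D Geometric Control: background RGB/depth maps,
    #    trajectory RGB/depth maps, and the merged object mask.
    controls = render_control_maps(background, gaussians, poses)
    return caption, controls
```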

VerseControl4D contains 35,000 training clips and 1,000 validation/test clips. Overall, 26% of the clips come from Sekai-Real-HQ and 74% from SpatialVID-HQ, reflecting their complementary scene coverage.

To support both camera-only world exploration and coordinated camera-object control, VerseControl4D includes dynamic scenes (clips with salient foreground object motion together with camera motion) and static scenes (clips with negligible object motion and only camera movement). About 20% of the training clips (roughly 7,000) are static scenes, and the validation set additionally includes 250 static-scene clips for dedicated camera-only evaluation.


Dataset Examples

Each example shows the camera trajectory, input image, and ground-truth video alongside the rendered controls: background RGB, background depth, 3D Gaussian trajectory RGB, 3D Gaussian trajectory depth, and the merged mask.

✨ Additional Results

👥 Multi-player View Consistency

Player A View → Generated Video
Player B View → Generated Video

Each player view shows the trajectory animation, rendered controls, and the generated video.

We capture the same dynamic scene from two player viewpoints and independently generate videos from each view. VerseCrafter produces consistent multi-view world dynamics with aligned camera and object motions.

📚 Citation

If you find VerseCrafter useful in your research, please cite us:

@article{zheng2026versecrafter,
    title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
    author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
    journal={arXiv preprint arXiv:2601.05138},
    year={2026}
}