Static point cloud from input image, rendered as background RGB/depth.
User-specified camera path for controllable viewpoint exploration.
Per-object trajectories encoding position and orientation over time.
Specify camera motion while keeping objects stationary.
Control object trajectories with fixed camera viewpoint.
Simultaneously specify camera and multi-object trajectories.
VerseCrafter is a controllable video world model for real-world scenes. It provides explicit 4D geometric control by allowing users to specify a target camera trajectory and multi-object 3D Gaussian trajectories, so the generated video follows both viewpoint changes and object motions with strong spatiotemporal consistency. Given a single reference view, we lift the scene into geometry-aware controls and render them into per-frame maps that serve as conditioning signals for diffusion-based video synthesis. VerseCrafter is trained on VerseControl4D, a large-scale in-the-wild dataset with automatically derived camera trajectories and multi-object 3D Gaussian trajectories, enabling robust control across diverse dynamic and static scenes.
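To make the conditioning concrete, below is a minimal NumPy sketch that z-buffers a static point cloud into a per-frame depth map, the simplest of the control maps described above. The function name and signature are our assumptions for illustration; the actual renderer also produces RGB and composites the posed object Gaussians.

```python
import numpy as np

def render_depth_map(points_xyz, K, w2c, H, W):
    """Z-buffer a static point cloud into one frame's depth control map.

    points_xyz: (N, 3) world-space points lifted from the reference view
    K:          (3, 3) pinhole intrinsics
    w2c:        (4, 4) world-to-camera extrinsics for this frame
    """
    ones = np.ones((len(points_xyz), 1))
    cam = (w2c @ np.concatenate([points_xyz, ones], axis=1).T).T[:, :3]
    cam = cam[cam[:, 2] > 1e-6]                  # keep points in front of camera
    uv = (K @ cam.T).T
    u = (uv[:, 0] / uv[:, 2]).astype(int)        # perspective divide -> pixels
    v = (uv[:, 1] / uv[:, 2]).astype(int)
    z = cam[:, 2]
    ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    depth = np.full((H, W), np.inf)
    np.minimum.at(depth, (v[ok], u[ok]), z[ok])  # nearest point wins per pixel
    return depth
```

Running this once per frame along the user-specified camera trajectory yields the per-frame map sequence that conditions the diffusion model.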
We adopt Wan2.1 as a frozen latent video diffusion prior. Keeping the backbone unchanged preserves its strong realism and generalization, while our geometry-aware controls learn to steer generation without degrading visual quality.
A lightweight GeoAdapter encodes the rendered 4D control maps and injects them into selected diffusion blocks as zero-initialized residual modulations. This design enables precise camera and multi-object motion control while maintaining sharp, geometrically coherent videos.
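For readers who want the injection pattern in code, here is a minimal PyTorch sketch of a zero-initialized residual adapter for one frozen diffusion block. The class name, encoder depth, and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class GeoAdapterBlock(nn.Module):
    """Zero-initialized residual injection for one frozen diffusion block.
    Class name, encoder depth, and shapes are illustrative assumptions."""

    def __init__(self, ctrl_dim: int, hidden_dim: int):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Linear(ctrl_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        # Zero-init the output projection so the adapter is an identity at
        # step 0: the frozen backbone's outputs are initially untouched.
        self.proj_out = nn.Linear(hidden_dim, hidden_dim)
        nn.init.zeros_(self.proj_out.weight)
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, hidden: torch.Tensor, ctrl_tokens: torch.Tensor) -> torch.Tensor:
        # hidden:      (B, N, hidden_dim) tokens from a frozen diffusion block
        # ctrl_tokens: (B, N, ctrl_dim) tokens from the rendered control maps
        return hidden + self.proj_out(self.encode(ctrl_tokens))

# At initialization the residual is exactly zero, so the frozen backbone's
# generation quality is preserved before any control training:
block = GeoAdapterBlock(ctrl_dim=64, hidden_dim=1024)
h, c = torch.randn(2, 256, 1024), torch.randn(2, 256, 64)
assert torch.allclose(block(h, c), h)
```

The zero-initialized projection is what lets control strength grow gradually during training instead of perturbing the pretrained prior from the first step.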
This video shows how users can intuitively design a custom camera trajectory and 3D Gaussian object trajectories within Blender. The resulting trajectories are exported as control maps and used by VerseCrafter for geometry-consistent, controllable video generation.
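As a rough idea of what such an export can look like, here is a minimal Blender Python (bpy) sketch that samples the active camera's world matrix at every frame and writes it to JSON. The JSON layout and file name are assumptions; VerseCrafter's actual control-map export format is not shown here.

```python
import json
import bpy

scene = bpy.context.scene
cam = scene.camera  # the active camera animated by the user

frames = []
for f in range(scene.frame_start, scene.frame_end + 1):
    scene.frame_set(f)  # evaluate the animation at this frame
    frames.append({
        "frame": f,
        # 4x4 camera-to-world matrix, row-major
        "cam_to_world": [list(row) for row in cam.matrix_world],
    })

with open("camera_trajectory.json", "w") as fp:
    json.dump(frames, fp, indent=2)
```

Object trajectories can be sampled the same way from each object's `matrix_world` and rendered into control maps downstream.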
Training a video world model with precise 4D geometric control requires large-scale data with accurate annotations. We introduce VerseControl4D, a dataset constructed from Sekai-Real-HQ and SpatialVID-HQ with complete geometric supervision.
VerseControl4D contains 35,000 training clips and 1,000 validation/test clips. Overall, 26% of the clips come from Sekai-Real-HQ and 74% from SpatialVID-HQ, reflecting their complementary scene coverage.
To support both camera-only world exploration and coordinated camera-object control, VerseControl4D includes dynamic scenes (clips with salient foreground object motion together with camera motion) and static scenes (clips with negligible object motion and only camera movement). About 20% of the training clips are static scenes, and the validation set additionally includes 250 static-scene clips for dedicated camera-only evaluation.
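For a back-of-envelope sense of the composition, the snippet below converts the quoted percentages into approximate clip counts, assuming the 26%/74% source split is over all 36,000 clips and the ~20% static fraction is over the training split.

```python
# Rough composition counts implied by the quoted percentages (assumptions
# noted above; exact per-split proportions are not stated).
train, val_test = 35_000, 1_000
total = train + val_test
print(f"Sekai-Real-HQ:  ~{0.26 * total:,.0f} clips")   # ~9,360
print(f"SpatialVID-HQ:  ~{0.74 * total:,.0f} clips")   # ~26,640
print(f"Static (train): ~{0.20 * train:,.0f} clips")   # ~7,000
```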
Rendered control signals (left) are faithfully translated into high-quality videos (right). Swipe through the carousel to inspect twenty diverse scenes with aligned rendered controls and generated outputs.
We capture the same dynamic scene from two player viewpoints and independently generate videos from each view. VerseCrafter produces consistent multi-view world dynamics with aligned camera and object motions.
If you find VerseCrafter useful in your research, please cite us:
@article{zheng2026versecrafter,
  title={VerseCrafter: Dynamic Realistic Video World Model with 4D Geometric Control},
  author={Zheng, Sixiao and Yin, Minghao and Hu, Wenbo and Li, Xiaoyu and Shan, Ying and Fu, Yanwei},
  journal={arXiv preprint arXiv:2601.05138},
  year={2026}
}