Overview.

Introducing UrbanVerse, a system that converts real-world urban scenes from city-tour videos into physics-aware, interactive simulation environments, enabling scalable robot learning in urban spaces with real-world generalization.

Real-to-sim Scene Generation Results of UrbanVerse.

Using the extracted scene layout as a blueprint and assets retrieved from UrbanVerse-100K, UrbanVerse generates simulation environments faithfully grounded in the real-world layout.

Beijing
Los Angeles
Cape Town
Tangier
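The layout-as-blueprint idea can be illustrated with a minimal sketch. It is not the UrbanVerse implementation: the `LayoutObject` and `Asset` records, the tiny in-memory database, and the retrieval rule (same category, closest bounding-box size) are all assumptions made for illustration, since this page does not specify how assets are matched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutObject:
    """One object in an extracted scene layout (hypothetical schema)."""
    category: str
    position: tuple[float, float, float]  # metres, world frame
    yaw_deg: float
    size: tuple[float, float, float]      # bounding-box extents in metres


@dataclass(frozen=True)
class Asset:
    """One entry of a toy asset database (hypothetical schema)."""
    asset_id: str
    category: str
    size: tuple[float, float, float]


def size_distance(a, b):
    """Euclidean distance between two bounding-box size triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def retrieve_asset(obj, database):
    """Assumed retrieval rule: same category, closest bounding-box size."""
    candidates = [a for a in database if a.category == obj.category]
    if not candidates:
        return None
    return min(candidates, key=lambda a: size_distance(a.size, obj.size))


def instantiate_scene(layout, database):
    """Place one retrieved asset at each layout pose; skip unmatched objects."""
    scene = []
    for obj in layout:
        asset = retrieve_asset(obj, database)
        if asset is not None:
            scene.append({
                "asset_id": asset.asset_id,
                "position": obj.position,
                "yaw_deg": obj.yaw_deg,
            })
    return scene


# Toy data, invented for this example only.
database = [
    Asset("bench_small", "bench", (1.2, 0.5, 0.8)),
    Asset("bench_long", "bench", (2.0, 0.6, 0.9)),
    Asset("tree_oak", "tree", (3.0, 3.0, 6.0)),
]
layout = [
    LayoutObject("bench", (4.0, 1.0, 0.0), 90.0, (1.9, 0.6, 0.9)),
    LayoutObject("tree", (0.0, 2.5, 0.0), 0.0, (2.5, 2.5, 5.0)),
    LayoutObject("hydrant", (1.0, 1.0, 0.0), 0.0, (0.3, 0.3, 0.7)),
]
scene = instantiate_scene(layout, database)
```

In this sketch the poses come only from the layout and the geometry comes only from the database, which is what keeps the generated scene grounded in the real-world arrangement.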

Generated Digital Cousin Scenes: Beijing, China.

Further, for the same city-tour video and its layout, UrbanVerse generates multiple diverse digital cousin scenes by instantiating the layout with different retrieved assets.

Input Video

Digital Cousin Scene 01

Digital Cousin Scene 02

Digital Cousin Scene 03

Digital Cousin Scene 04

Digital Cousin Scene 05
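One way to picture how a single layout yields several digital cousins is the sketch below. It is an assumption-laden illustration rather than the UrbanVerse-Gen code: the dictionary-based layout, the `candidates` mapping from category to asset identifiers, and uniform random sampling are all hypothetical choices.

```python
import random


def generate_cousins(layout, candidates, num_cousins, seed=0):
    """Build several scenes that share poses but differ in chosen assets.

    layout:     list of dicts with "category", "position", "yaw_deg"
    candidates: dict mapping a category to a list of asset ids (hypothetical)
    """
    rng = random.Random(seed)  # seeded so the same cousins can be regenerated
    cousins = []
    for _ in range(num_cousins):
        scene = []
        for obj in layout:
            options = candidates.get(obj["category"], [])
            if not options:
                continue  # no asset available for this category
            scene.append({
                "asset_id": rng.choice(options),
                "position": obj["position"],
                "yaw_deg": obj["yaw_deg"],
            })
        cousins.append(scene)
    return cousins


# Toy data, invented for this example only.
layout = [
    {"category": "bench", "position": (4.0, 1.0, 0.0), "yaw_deg": 90.0},
    {"category": "tree", "position": (0.0, 2.5, 0.0), "yaw_deg": 0.0},
]
candidates = {
    "bench": ["bench_a", "bench_b", "bench_c"],
    "tree": ["tree_a", "tree_b"],
}
cousins = generate_cousins(layout, candidates, num_cousins=5, seed=42)
```

Every cousin keeps the positions and orientations of the shared layout; only the asset identifiers vary. With uniform sampling two cousins can occasionally coincide, so a real pipeline would likely need some way to enforce diversity.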

Generated Digital Cousin Scenes: Tangier, Morocco.

Input Video

Digital Cousin Scene 01

Digital Cousin Scene 02

Digital Cousin Scene 03

Digital Cousin Scene 04

Digital Cousin Scene 05

Digital Cousin Scene 06

Digital Cousin Scene 07

Digital Cousin Scene 08

Digital Cousin Scene 09

UrbanVerse-100K Asset Database

Object Assets Walkthrough.

Example of Per-object Annotation.

Examples of Road PBRs.

Examples of Sidewalk PBRs.

Examples of Sky HDRIs.

Interactive Statistics of Object Category Distributions in UrbanVerse-100K.

CraftBench Test Scenes Gallery.

Scene 01

Scene 02

Scene 03

Scene 04

Scene 05

Scene 06

Scene 07

Scene 08

Scene 09

Scene 10

Diverse Real-world City Tour Video Collection for Urban Simulation Layout Grounding.

Real-world Scene Layout Distillation.

Given uncalibrated RGB city-tour videos, we use the UrbanVerse-Gen pipeline to extract real-world semantic scene layouts.

Scene 01
Scene 02
Scene 03

Real-World Urban Navigation Results of PPO-UrbanVerse on Diverse Street Environments.

Side-by-Side Comparison: COCO Wheeled Robot (3x Speed).

Scene 01
Scene 02
Scene 03
Scene 04
Scene 05
Scene 06
Scene 07

Side-by-Side Comparison: Go2 Quadruped Robot (3x Speed).

Scene 01
Scene 02
Scene 03
Scene 04
Scene 05
Scene 06
Scene 07

Mapless Long-horizon Urban Navigation Deployment.

Citation.

BibTeX
@misc{liu2025urbanversescalingurbansimulation,
  title={UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos},
  author={Mingxuan Liu and Honglin He and Elisa Ricci and Wayne Wu and Bolei Zhou},
  year={2025},
  eprint={2510.15018},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.15018},
}