Introducing UrbanVerse, a system that converts real-world urban scenes from city-tour videos into physics-aware, interactive simulation environments, enabling scalable robot learning in urban spaces with real-world generalization.
Using the extracted scene layout as a blueprint and assets retrieved from UrbanVerse-100K, UrbanVerse generates simulation environments faithfully grounded in the real-world layout.
Further, for the same city-tour video and its layout, UrbanVerse generates multiple diverse digital cousin scenes by instantiating the layout with different retrieved assets.
[Video gallery: Input Video and Digital Cousin Scenes 01 to 09]
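The idea of instantiating one layout with different assets can be illustrated with a minimal Python sketch. This is not the UrbanVerse implementation: the `LayoutObject` and `PlacedAsset` types, the `make_cousins` function, and the toy asset table are all hypothetical names introduced for illustration, and asset retrieval is replaced here by a seeded random pick among assets of the same semantic category.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutObject:
    """One object slot in an extracted scene layout (hypothetical schema)."""
    category: str      # semantic class, e.g. "bench"
    position: tuple    # (x, y, z) placement, units assumed to be metres
    yaw_deg: float     # heading about the vertical axis


@dataclass(frozen=True)
class PlacedAsset:
    """A concrete asset placed at a layout slot."""
    asset_id: str
    position: tuple
    yaw_deg: float


def make_cousin(layout, asset_db, rng):
    """Build one scene: keep every pose, pick an asset matching each category."""
    return [
        PlacedAsset(rng.choice(asset_db[obj.category]), obj.position, obj.yaw_deg)
        for obj in layout
    ]


def make_cousins(layout, asset_db, n, seed=0):
    """Build n scenes that share the layout but may differ in chosen assets."""
    rng = random.Random(seed)
    return [make_cousin(layout, asset_db, rng) for _ in range(n)]


# Toy stand-in for an asset database, keyed by category (assumption).
asset_db = {
    "bench": ["bench_a", "bench_b", "bench_c"],
    "tree": ["tree_a", "tree_b"],
}
layout = [
    LayoutObject("bench", (1.0, 0.0, 0.0), 90.0),
    LayoutObject("tree", (3.5, 2.0, 0.0), 0.0),
]
cousins = make_cousins(layout, asset_db, n=3)
for i, scene in enumerate(cousins, start=1):
    print(i, [placed.asset_id for placed in scene])
```

Every generated scene keeps the object poses of the shared layout, so the spatial structure stays tied to the source video while the appearance of individual objects varies.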
Given uncalibrated RGB city-tour videos, we use the UrbanVerse-Gen pipeline to extract real-world semantic scene layouts.
@misc{liu2025urbanversescalingurbansimulation,
title={UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos},
author={Mingxuan Liu and Honglin He and Elisa Ricci and Wayne Wu and Bolei Zhou},
year={2025},
eprint={2510.15018},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.15018},
}