DriveDreamer: Towards Real-world-driven
World Models for Autonomous Driving

GigaAI, Tsinghua University

DriveDreamer excels at controllable driving video generation, aligning seamlessly with text prompts and structured traffic constraints. It can also interact with the driving scene, predicting different future driving videos based on input driving actions. Furthermore, DriveDreamer extends its utility to anticipating future driving actions.

Abstract

World models, especially in autonomous driving, are trending and drawing extensive attention due to their capacity for comprehending driving environments. An established world model holds immense potential for generating high-quality driving videos and driving policies for safe maneuvering. However, a critical limitation of relevant research lies in its predominant focus on gaming environments or simulated settings, which lack the representation of real-world driving scenarios. Therefore, we introduce DriveDreamer, a pioneering world model entirely derived from real-world driving scenarios. Since modeling the world in intricate driving scenes entails an overwhelming search space, we propose harnessing a powerful diffusion model to construct a comprehensive representation of the complex environment. Furthermore, we introduce a two-stage training pipeline: in the initial stage, DriveDreamer acquires a deep understanding of structured traffic constraints, while the subsequent stage equips it with the ability to anticipate future states. The proposed DriveDreamer is the first world model established from real-world driving scenarios. We instantiate DriveDreamer on the challenging nuScenes benchmark, and extensive experiments verify that DriveDreamer delivers precise, controllable video generation that faithfully captures the structural constraints of real-world traffic scenarios. Additionally, DriveDreamer enables the generation of realistic and reasonable driving policies, opening avenues for interaction and practical applications.
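
For intuition, here is a minimal PyTorch sketch of such a two-stage schedule. Every name in it (StructureEncoder, Denoiser, the single-timestep loss) is an illustrative assumption rather than the paper's released implementation: stage one trains on single frames so the denoiser absorbs the structured traffic constraints, and stage two trains on video clips so it learns to anticipate future states.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StructureEncoder(nn.Module):
    """Encodes structured conditions (HDMap, 3D boxes) into feature maps."""
    def __init__(self, in_ch=3, dim=64):
        super().__init__()
        self.net = nn.Conv2d(in_ch, dim, kernel_size=3, padding=1)

    def forward(self, cond):
        return self.net(cond)

class Denoiser(nn.Module):
    """Stand-in denoiser: predicts the noise added to a latent frame,
    conditioned on the encoded traffic structure."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Conv2d(dim * 2, dim, kernel_size=3, padding=1)

    def forward(self, noisy_latent, cond_feat):
        return self.net(torch.cat([noisy_latent, cond_feat], dim=1))

encoder, denoiser = StructureEncoder(), Denoiser()
params = list(encoder.parameters()) + list(denoiser.parameters())
opt = torch.optim.AdamW(params, lr=1e-4)

def denoising_loss(latent, cond):
    # Simplified single-timestep objective; a real diffusion pipeline would
    # sample a timestep and apply the corresponding noise schedule.
    noise = torch.randn_like(latent)
    pred = denoiser(latent + noise, encoder(cond))
    return F.mse_loss(pred, noise)

# Stage 1: single frames -- the model learns structured traffic constraints.
frame_latent, frame_cond = torch.randn(2, 64, 32, 32), torch.randn(2, 3, 32, 32)
loss = denoising_loss(frame_latent, frame_cond)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: video clips (the time axis is folded into the batch here) --
# the model learns to anticipate future states across frames.
clip_latent, clip_cond = torch.randn(2, 4, 64, 32, 32), torch.randn(2, 4, 3, 32, 32)
loss = denoising_loss(clip_latent.flatten(0, 1), clip_cond.flatten(0, 1))
opt.zero_grad(); loss.backward(); opt.step()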

Method

The DriveDreamer framework begins with an initial reference frame and its corresponding road structural information (i.e., HDMap and 3D boxes). Within this context, DriveDreamer leverages the proposed ActionFormer to predict forthcoming road structural features in the latent space. These predicted features serve as conditions for Auto-DM, which generates future driving videos. Simultaneously, text prompts allow for dynamic adjustments to the driving scenario style (e.g., weather and time of day). Moreover, DriveDreamer combines historical action information with the multi-scale latent features extracted from Auto-DM to generate reasonable future driving actions. In essence, DriveDreamer offers a comprehensive framework that seamlessly integrates multi-modal inputs to generate future driving videos and driving policies, thereby advancing the capabilities of autonomous-driving systems.
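
To make the data flow concrete, the hedged PyTorch sketch below wires these pieces together. The class names mirror the terminology above, but every signature and dimension is a simplifying assumption (e.g., a single latent scale stands in for Auto-DM's multi-scale features), not the released code.

import torch
import torch.nn as nn

class ActionFormer(nn.Module):
    """Rolls structural features forward in latent space (illustrative)."""
    def __init__(self, dim=256, horizon=8):
        super().__init__()
        self.horizon = horizon
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, struct_feat):                  # (B, 1, dim) reference-frame features
        feats = [struct_feat]
        for _ in range(self.horizon - 1):
            feats.append(self.backbone(feats[-1]))   # predict the next-step feature
        return torch.cat(feats, dim=1)               # (B, horizon, dim)

class AutoDM(nn.Module):
    """Stand-in for the video diffusion model, conditioned on structure + text."""
    def __init__(self, dim=256):
        super().__init__()
        self.denoise = nn.Linear(dim * 2, dim)

    def forward(self, struct_feats, text_emb):       # (B, T, dim), (B, dim)
        B, T, D = struct_feats.shape
        cond = torch.cat([struct_feats, text_emb[:, None].expand(B, T, D)], dim=-1)
        return self.denoise(cond)                    # one simplified denoising step

class ActionHead(nn.Module):
    """Fuses historical actions with Auto-DM latents to predict future actions."""
    def __init__(self, dim=256, act_dim=2):
        super().__init__()
        self.fuse = nn.GRU(dim + act_dim, dim, batch_first=True)
        self.out = nn.Linear(dim, act_dim)

    def forward(self, latents, past_actions):        # (B, T, dim), (B, T, act_dim)
        h, _ = self.fuse(torch.cat([latents, past_actions], dim=-1))
        return self.out(h)                           # e.g., (steer, speed) per step

# Toy end-to-end pass with random tensors standing in for real encodings.
B, dim = 1, 256
struct_feat = torch.randn(B, 1, dim)   # encoded HDMap + 3D boxes of the reference frame
text_emb = torch.randn(B, dim)         # e.g., a text embedding of "rainy night"
past_act = torch.zeros(B, 8, 2)        # placeholder action history

future_feats = ActionFormer(dim)(struct_feat)        # (1, 8, 256)
latents = AutoDM(dim)(future_feats, text_emb)        # (1, 8, 256)
future_actions = ActionHead(dim)(latents, past_act)  # (1, 8, 2)

The wiring is the point: structural features are rolled forward first, the diffusion model consumes them together with the text embedding, and the action head closes the loop by reading the generated latents back alongside the action history.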

Results

1. Diverse Driving Video Generation.


2. Driving Video Generation with Traffic Conditions and Different Text Prompts (Sunny, Rainy, Night).


3. Future Driving Video Generation with Action Interaction.


4. Future Driving Action Generation.

BibTeX

If you use our work in your research, please cite:

@article{wang2023drive,
  title={DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving},
  author={Wang, Xiaofeng and Zhu, Zheng and Huang, Guan and Chen, Xinze and Zhu, Jiagang and Lu, Jiwen},
  journal={arXiv preprint arXiv:2309.09777},
  year={2023}
}