Mobile robots operating in unknown urban environments encounter a wide range of complex terrains to which they must adapt their planned trajectory for safe and efficient navigation. Most existing approaches utilize supervised learning to classify terrains from either an exteroceptive or a proprioceptive sensor modality. However, this requires a tremendous amount of manual labeling effort for each newly encountered terrain as well as for variations of terrains caused by changing environmental conditions. In this work, we propose a novel terrain classification framework leveraging an unsupervised proprioceptive classifier that learns from vehicle-terrain interaction sounds to self-supervise an exteroceptive classifier for pixel-wise semantic segmentation of images. To this end, we first learn a discriminative embedding space for vehicle-terrain interaction sounds from triplets of audio clips formed using visual features of the corresponding terrain patches and cluster the resulting embeddings. We subsequently use these clusters to label the visual terrain patches by projecting the traversed tracks of the robot into the camera images. Finally, we use the sparsely labeled images to train our semantic segmentation network in a weakly supervised manner. We present extensive quantitative and qualitative results that demonstrate that our proprioceptive terrain classifier exceeds the state-of-the-art among unsupervised methods and our self-supervised exteroceptive semantic segmentation model achieves a comparable performance to supervised learning with manually labeled data.
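The embedding step described above is learned from triplets of audio clips. As a rough illustration only, the standard triplet margin loss can be sketched as follows. The exact loss formulation, distance metric, and margin used in this work are not given on this page, so the Euclidean distance, the `margin=1.0` default, and the toy 2-D embeddings below are all assumptions.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an assumed hyperparameter).
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy 2-D embeddings (hypothetical values, not taken from the dataset).
anchor = [0.0, 0.0]
positive = [0.0, 1.0]   # distance 1 from the anchor
negative = [3.0, 4.0]   # distance 5 from the anchor

loss_easy = triplet_loss(anchor, positive, negative)  # triplet already satisfied
loss_hard = triplet_loss(anchor, negative, positive)  # roles swapped: violated triplet
```

In this sketch a triplet whose positive is already much closer than its negative contributes no loss, while a violated triplet is penalized in proportion to how badly the ordering is wrong.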

Overview of the System



The Freiburg Terrains dataset consists of three parts: 3.7 hours of audio recordings from a microphone pointed at the robot's wheels, 24K RGB images from a camera mounted on top of the robot, and the SLAM poses for each data collection run.
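Combining the poses with the images allows the traversed track to be projected into a camera frame, as outlined in the abstract. The following is a minimal sketch of that idea using a plain pinhole camera model. It assumes the track points have already been transformed from the SLAM frame into the camera frame; the intrinsics (`fx`, `fy`, `cx`, `cy`), the image size, and the sample points are hypothetical and are not the calibration of the actual robot.

```python
def project_points(points_cam, fx, fy, cx, cy, width, height):
    """Project 3-D points given in the camera frame to pixel coordinates.

    Assumed convention: x points right, y points down, z points forward.
    Points behind the camera or outside the image are dropped.
    """
    pixels = []
    for x, y, z in points_cam:
        if z <= 0.0:
            continue  # behind the camera, not visible
        u = fx * x / z + cx
        v = fy * y / z + cy
        if 0 <= u < width and 0 <= v < height:
            pixels.append((u, v))
    return pixels


# Hypothetical track points on the ground, 0.5 m below the camera.
track = [
    (0.0, 0.5, 2.0),    # straight ahead, 2 m away
    (0.5, 0.5, 4.0),    # slightly to the right, 4 m away
    (0.0, 0.5, -1.0),   # behind the camera
    (10.0, 0.5, 1.0),   # far outside the field of view
]

pixels = project_points(
    track, fx=500.0, fy=500.0, cx=320.0, cy=240.0, width=640, height=480
)
```

The resulting pixel locations could then be assigned the cluster label of the audio clip recorded while the robot drove over that part of the track, which yields the sparse labels used for training.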


Please cite our work if you use the Freiburg Terrains dataset or report results based on it.

author = {Z{\"u}rn, Jannik and Burgard, Wolfram and Valada, Abhinav},
title = {Self-Supervised Visual Terrain Classification from Unsupervised Acoustic Feature Learning},
journal = {IEEE Transactions on Robotics},
year = {2021}

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here.

RGB Images


Camera Poses


Audio Recordings



Demo of Audio Terrain Classification

A demo of acoustics-based terrain classification from the work of Valada et al. (2018) can be found here.


  • Jannik Zürn, Wolfram Burgard, Abhinav Valada
    Self-Supervised Visual Terrain Classification from Unsupervised Acoustic Feature Learning
    IEEE Transactions on Robotics (T-RO), vol. 37, no. 2, pp. 466-481, 2021.

  • Abhinav Valada, Wolfram Burgard
    Deep spatiotemporal models for robust proprioceptive terrain classification
    The International Journal of Robotics Research (IJRR), 2017.

  • Abhinav Valada, Rohit Mohan, Wolfram Burgard
    Deep Feature Learning for Acoustics-based Terrain Classification
    Proceedings of the International Symposium on Robotics Research (ISRR), 2018.
