Figure: Detailed evaluation of two driving models with similar offline prediction quality but very different driving behavior. Top left: ground-truth steering signal (blue) and predictions of two models (red and green) over time. Top right: a zoomed-in fragment of the steering time series, showing a large mistake made by Model 1 (red). Bottom: several trajectories driven by the models in Town 1. The same scenarios are indicated with the same color in both plots. Note how the driving performance of the models is dramatically different: Model 1 crashes in every trial, while Model 2 drives successfully.

On Offline Evaluation of Vision-based Driving Models

Abstract

Autonomous driving models should ideally be evaluated by deploying them on a fleet of physical vehicles in the real world. Unfortunately, this approach is not practical for the vast majority of researchers. An attractive alternative is to evaluate models offline, on a pre-collected validation dataset with ground truth annotation. In this paper, we investigate the relation between various online and offline metrics for evaluation of autonomous driving models. We find that offline prediction error is not necessarily correlated with driving quality, and two models with identical prediction error can differ dramatically in their driving performance. We show that the correlation of offline evaluation with driving quality can be significantly improved by selecting an appropriate validation dataset and suitable offline metrics.
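The kind of analysis described in the abstract can be illustrated with a small sketch: compute an offline prediction metric (e.g., mean absolute steering error on a validation set) for several models, and check how well it ranks the models relative to an online driving metric (e.g., success rate in simulation). This is not the paper's evaluation pipeline; the function names and all numbers below are hypothetical, purely to show the idea.

```python
# Illustrative sketch: does an offline metric rank models the same way
# as an online driving metric? All data here is made up for illustration.
import numpy as np
from scipy.stats import spearmanr

def offline_mae(pred_steering, gt_steering):
    """Mean absolute steering error on a held-out validation dataset."""
    return float(np.mean(np.abs(np.asarray(pred_steering) - np.asarray(gt_steering))))

# Hypothetical per-model results: offline steering error on a validation set
# and online success rate measured by actually driving each model in simulation.
offline_errors = np.array([0.042, 0.043, 0.061, 0.095])   # lower is better
online_success = np.array([0.10, 0.85, 0.60, 0.20])       # fraction of successful trials

# Rank correlation between the two evaluations; a weak (or wrong-sign)
# correlation suggests the offline metric is a poor proxy for driving quality.
rho, p_value = spearmanr(offline_errors, online_success)
print(f"Spearman correlation: {rho:.2f} (p = {p_value:.3f})")
```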

Publication
In European Conference on Computer Vision