Supported Datasets
==================
Each row below lists the modalities the dataset processor currently emits
into the unified format. ``✓`` means the modality is wired end-to-end
through preprocessing and the on-disk ``TransformedFrameData`` schema;
``—`` means the dataset doesn't ship that modality (and it's surfaced as
the corresponding default value via
:class:`~standard_e2e.dataset_utils.modality_defaults.ModalityDefaults`).

.. list-table::
   :header-rows: 1
   :widths: 22 10 10 10 10 12 14

   * - Dataset
     - Cameras
     - LiDAR
     - HD map (BEV)
     - 3D detections
     - Driving command
     - Preference trajectory
   * - `Waymo Open E2E `__
     - ✓ (8 ring cameras)
     - —
     - —
     - —
     - ✓
     - ✓
   * - `Waymo Open Perception `__
     - ✓ (5 cameras)
     - ✓ (top + side, ego frame)
     - ✓
     - ✓
     - —
     - —
   * - `Argoverse 2 Sensor `__
     - ✓ (7 ring cameras)
     - ✓ (merged sweep, ego frame)
     - ✓
     - ✓
     - —
     - —
   * - `Argoverse 2 Lidar `__
     - —
     - ✓ (merged sweep, ego frame)
     - ✓
     - —
     - —
     - —
   * - `NAVSIM `__ (OpenScene-v1.1)
     - ✓ (8 cameras: front/left×3/right×3/rear)
     - ✓ (merged sweep, ego frame)
     - ✓ (via nuPlan ``map.gpkg`` → unified taxonomy; lane boundaries
       carry no paint info, since nuPlan doesn't store paint type)
     - ✓
     - ✓ (4-class one-hot → :class:`~standard_e2e.enums.Intent`)
     - —
   * - `WayveScenes101 `__
     - ✓ (5 fisheye: forward + side arc)
     - ✓ (COLMAP SfM, ego frame) [#wayve_lidar]_
     - —
     - —
     - —
     - —
   * - `comma2k19 `__
     - ✓ (1 forward: comma EON, 1164×874 pinhole) [#comma2k19]_
     - —
     - —
     - —
     - —
     - —
   * - `TruckDrive `__ (heavy-truck)
     - ✓ (11-camera 8 MP rig) [#truckdrive]_
     - ✓ (Aeva FMCW joint cloud, ego frame)
     - —
     - ✓ (ego frame)
     - —
     - —
   * - `View-of-Delft `__ (urban radar)
     - ✓ (1 front: 1936×1216 pinhole) [#vod]_
     - ✓ (Velodyne HDL-64, ego frame)
     - —
     - ✓ (ego frame)
     - —
     - —
   * - `nuScenes `__
     - ✓ (6 surround cameras) [#nuscenes]_
     - ✓ (LIDAR_TOP, ego frame)
     - ✓ (map-expansion → unified taxonomy)
     - ✓ (ego frame)
     - —
     - —

All datasets also emit the ego **past/future trajectory** (from each
dataset's poses, via the segment-context aggregator) regardless of the
columns above.

.. note::

   **comma2k19 is high-volume**: 20 Hz × ~2000 one-minute segments ≈ 2.4 M
   frames (~2 TB at the native 1164×874 resolution). Two converter knobs bound
   the output size and processing time: ``--frame_stride N`` keeps **every
   N-th frame** (``1`` = full 20 Hz; e.g. ``--frame_stride 4`` ≈ 5 Hz), and
   the ``cameras_identity_adapter``'s ``max_size`` param **downscales** each
   frame so its longest side is at most that many px (intrinsics scaled to
   match).

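
The effect of the two knobs can be sanity-checked with a little arithmetic. The sketch below is illustrative only: ``strided_frame_count`` and ``downscale_with_intrinsics`` are hypothetical helpers, not functions of the converter. It assumes that striding starts at frame 0, that ``max_size`` never upscales, and that pinhole intrinsics scale linearly with the image size.

```python
def strided_frame_count(total_frames: int, frame_stride: int) -> int:
    """Number of frames kept when every `frame_stride`-th frame is retained.

    Assumption: the first kept frame is frame 0.
    """
    return len(range(0, total_frames, frame_stride))


def downscale_with_intrinsics(width, height, fx, fy, cx, cy, max_size):
    """Shrink an image so its longest side is at most `max_size` pixels.

    Returns the new (width, height) and the pinhole intrinsics scaled by the
    same factor. Assumption: images smaller than `max_size` are left as-is.
    """
    scale = min(1.0, max_size / max(width, height))
    new_width = round(width * scale)
    new_height = round(height * scale)
    return new_width, new_height, fx * scale, fy * scale, cx * scale, cy * scale


# Rough dataset size from the note: 20 Hz x 60 s x ~2000 segments.
total_frames = 20 * 60 * 2000
kept_frames = strided_frame_count(total_frames, frame_stride=4)

# Halving the native 1164x874 frame; the focal length here is a placeholder.
resized = downscale_with_intrinsics(1164, 874, 900.0, 900.0, 582.0, 437.0, 582)
```

Under these assumptions, a stride of 4 cuts the roughly 2.4 M frames to about 600 k, and ``max_size=582`` halves each frame to 582×437 with the focal lengths and principal point halved as well.
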

.. [#wayve_lidar] WayveScenes101 ships **no sensor lidar**. Its ``lidar_pc``
   is populated from the per-scene **COLMAP SfM** point cloud: filtered
   (reprojection error ≤ 6, track length ≥ 2), converted OpenCV→FLU, then
   transformed into each frame's ego (FLU, x-forward/y-left/z-up) frame with
   the *world→ego* pose and range-clipped (50 m) so it flows through the
   standard lidar adapters. It is photogrammetric (sparse, up-to-scale), not
   a sensor measurement. The ego, cameras and lidar share one FLU frame, so
   lifting a frame's cloud by ``aux_data["pose_matrix"]`` maps its points
   back onto the source SfM cloud exactly.

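
The steps in this footnote (filter, axis conversion, world→ego transform, range clip) can be sketched with NumPy. This is a minimal illustration, not the project's code: ``sfm_to_ego`` and its argument names are hypothetical, the *world→ego* pose is assumed to be a 4×4 matrix expressed in the FLU world frame, and the range clip is assumed to be a Euclidean distance from the ego origin.

```python
import numpy as np

# OpenCV axes (x-right, y-down, z-forward) -> FLU (x-forward, y-left, z-up):
# x_flu = z_cv, y_flu = -x_cv, z_flu = -y_cv.
OPENCV_TO_FLU = np.array(
    [[0.0, 0.0, 1.0],
     [-1.0, 0.0, 0.0],
     [0.0, -1.0, 0.0]]
)


def sfm_to_ego(points_world, reproj_error, track_length, world_to_ego,
               max_error=6.0, min_track=2, max_range=50.0):
    """Turn an SfM cloud into a per-frame, ego-frame point cloud.

    points_world: (N, 3) SfM points in OpenCV axes.
    reproj_error, track_length: (N,) per-point quality measures.
    world_to_ego: (4, 4) pose of the frame (assumed FLU world -> FLU ego).
    """
    # 1. Keep only well-supported points.
    keep = (reproj_error <= max_error) & (track_length >= min_track)
    # 2. Re-express the surviving points in FLU axes.
    points_flu = points_world[keep] @ OPENCV_TO_FLU.T
    # 3. Apply the homogeneous world -> ego transform.
    homogeneous = np.hstack([points_flu, np.ones((len(points_flu), 1))])
    points_ego = (homogeneous @ world_to_ego.T)[:, :3]
    # 4. Clip by distance from the ego origin (assumed Euclidean).
    in_range = np.linalg.norm(points_ego, axis=1) <= max_range
    return points_ego[in_range]
```

Applying the inverse of ``world_to_ego`` to the returned points recovers their FLU world coordinates, which is the round trip the footnote describes.
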

.. [#comma2k19] comma2k19 ships a **single forward-facing** 20 Hz camera
   (comma EON, 1164×874, treated as a pinhole; identity extrinsics, since the
   dataset pose *is* the camera pose) plus a fused GNSS/IMU ego pose and CAN
   telemetry, with no lidar, HD map, 3D boxes, or driving command. The ego
   pose is derived from the ECEF ``global_pose`` into a per-segment local FLU
   frame (x-forward/y-left/z-up), so ``global_position`` X/Y/Z/heading are
   segment-relative; ``global_position`` additionally carries the ego
   **speed** (:attr:`~standard_e2e.enums.TrajectoryComponent.SPEED`) from the
   ECEF velocity. Segments must be extracted from the distributed
   ``Chunk_*.zip`` archives first (as with WayveScenes101); each ``video.hevc``
   is then decoded forward-only, since HEVC random seek is unreliable. Native
   rate is 20 Hz; use ``--frame_stride`` to subsample.

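
One way to obtain segment-relative poses and a speed channel from ECEF data is sketched below. It is an illustration under stated assumptions, not the converter's actual code: the helper names are hypothetical, the input poses are assumed to be 4×4 ego→ECEF matrices with FLU body axes already built from ``global_pose``, and the local frame is assumed to be anchored at the first frame of the segment.

```python
import numpy as np


def to_segment_local(ego_to_ecef):
    """Re-express (N, 4, 4) ego->ECEF poses relative to the first pose.

    Assumption: the segment-local frame is the ego frame of frame 0.
    """
    ecef_to_local = np.linalg.inv(ego_to_ecef[0])
    # Matrix product broadcast over the leading axis: (4, 4) @ (N, 4, 4).
    return ecef_to_local @ ego_to_ecef


def heading_from_poses(local_poses):
    """Yaw of the ego x-axis in the local frame, in radians."""
    return np.arctan2(local_poses[:, 1, 0], local_poses[:, 0, 0])


def speed_from_velocity(ecef_velocity):
    """Scalar speed from (N, 3) ECEF velocity vectors.

    The norm of a velocity vector does not depend on the frame it is
    expressed in, so no rotation is needed here.
    """
    return np.linalg.norm(ecef_velocity, axis=-1)
```

Because speed is a vector norm, it can be taken directly from the ECEF velocity, while position and heading only become meaningful once the poses are made relative to the segment.
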

.. [#truckdrive] TruckDrive (Torc Robotics / Princeton, CVPR 2026) is a
   long-range highway **heavy-truck** dataset. Its 8 MP surround rig has
   **11 cameras**, more than the eight canonical
   :class:`~standard_e2e.enums.CameraDirection` slots, so each camera is
   mapped to the canonical member matching its facing wherever one fits, with
   dedicated members added only for the extra views the eight can't name: the
   forward telephoto pair (``FRONT_LEFT_NARROW`` / ``FRONT_RIGHT_NARROW``) and
   the rear-facing side pair (``SIDE_LEFT_BACK`` / ``SIDE_RIGHT_BACK``).
   ``lidar_pc`` is the seven-sensor **Aeva FMCW** joint cloud (xyz kept,
   transformed into the ego ``velodyne`` frame); ``detections_3d`` are the
   tracked 3D boxes in the ego frame, with the ego vehicle's own cab/trailer
   and ``DontCare`` groups excluded per the paper taxonomy. The ego pose
   (PPK + LiDAR-SLAM) drives the past/future trajectory. Short-range Ouster
   lidar, 4D radar, accumulated GT depth and lane lines have no StandardE2E
   target yet and are not ingested. The dataset ships as per-scene,
   per-modality zips and **must be extracted first**
   (``scripts/extract_truckdrive.sh``, or
   ``scripts/prepare_dataset_truckdrive.sh`` to extract and preprocess in one
   step); frames are matched across sensors by their synchronization key.


.. [#vod] View-of-Delft (TU Delft, IEEE RA-L 2022) is a compact **urban**
   dataset whose distinctive sensor is a **3+1D radar**. StandardE2E ingests its
   single **front camera** (:class:`~standard_e2e.enums.CameraDirection.FRONT`,
   1936×1216 pinhole; intrinsics from the calib ``P2``, extrinsics from
   ``inv(Tr_velo_to_cam)``), the 64-layer **Velodyne** ``lidar_pc`` (xyz in the
   ego ``velodyne`` frame, per-point reflectance dropped) and KITTI
   ``detections_3d`` mapped from camera coordinates into the ego frame. Each of
   VoD's 13 classes is folded into the coarse
   :class:`~standard_e2e.enums.DetectionType`: the two-wheeler family
   (``bicycle`` / ``Cyclist`` / ``rider`` / ``moped_scooter`` / ``motor``) →
   ``BICYCLE``; ``Car`` / ``truck`` / ``vehicle_other`` → ``VEHICLE``; static or
   ambiguous boxes → ``UNKNOWN``; ``DontCare`` is dropped. Box yaw is VoD's
   KITTI rotation about the LiDAR **-Z** axis (camera-x zero-reference), so the
   ego heading is ``-(rotation + pi/2)``. One frame = one keyframe; one
   segment = one recording scene (``delft_*``), grouped via the official scene
   table so the per-segment past/future ego trajectory (from the per-frame
   ``mapToCamera`` pose) never spans two recordings. The **3+1D radar**
   (``radar`` / ``radar_3frames`` / ``radar_5frames``) has no StandardE2E
   modality yet and is not ingested. The detection release ships no per-frame
   timestamps, so they are **synthesised** at the 10 Hz LiDAR-lead rate. The
   **test** split has sensor data but no labels. The dataset ships as zips and
   **must be extracted first** (``scripts/extract_vod.sh``, which extracts the
   lidar tree only with an optional track-id overlay, or
   ``scripts/prepare_dataset_vod.sh`` to extract and preprocess in one step).

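
The calibration and yaw conventions in this footnote can be written out as a short sketch. The helper names are hypothetical and the code is not taken from the project: it assumes ``P2`` and ``Tr_velo_to_cam`` arrive as 3×4 arrays in the usual KITTI calibration layout, and the wrap of the heading into ``[-pi, pi)`` is an added convenience, not something the footnote states.

```python
import math

import numpy as np


def camera_calibration(P2, Tr_velo_to_cam):
    """Split KITTI-style calibration into intrinsics and camera->ego extrinsics.

    P2: (3, 4) projection matrix of the front camera.
    Tr_velo_to_cam: (3, 4) rigid transform from the LiDAR (ego) frame
    to the camera frame.
    """
    intrinsics = P2[:3, :3]
    velo_to_cam = np.eye(4)
    velo_to_cam[:3, :4] = Tr_velo_to_cam
    # Extrinsics are the inverse: camera frame -> LiDAR (ego) frame.
    cam_to_ego = np.linalg.inv(velo_to_cam)
    return intrinsics, cam_to_ego


def ego_heading(rotation):
    """Convert a VoD/KITTI box rotation into an ego-frame heading.

    Implements heading = -(rotation + pi/2), then wraps into [-pi, pi)
    (the wrap is an assumption of this sketch).
    """
    heading = -(rotation + math.pi / 2)
    return (heading + math.pi) % (2 * math.pi) - math.pi
```

For example, a box with ``rotation = -pi/2`` ends up with an ego heading of zero, that is, pointing along the ego x-axis.
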

.. [#nuscenes] nuScenes (Motional, CVPR 2020) is the de-facto surround-view
   E2E / BEV benchmark: 1000 ~20 s scenes, a 6-camera surround rig (1600×900),
   a 32-beam ``LIDAR_TOP`` and densely annotated 3D boxes at 2 Hz keyframes
   (one frame = one keyframe sample; one segment = one scene). The six ``CAM_*``
   channels map onto the canonical
   :class:`~standard_e2e.enums.CameraDirection` members; ``lidar_pc`` is the
   ``LIDAR_TOP`` cloud (xyz, in the ego frame); ``detections_3d`` are the
   ``sample_annotation`` boxes transformed from the global frame into the ego
   frame, each ``category_name`` folded into the coarse
   :class:`~standard_e2e.enums.DetectionType`. The vector **map-expansion**
   (lane centers from the arcline paths, lane/road dividers, crossings,
   walkways, stop lines, drivable area, intersections) is translated to the
   unified :class:`~standard_e2e.enums.MapElementType` in the ego frame and
   rasterised by ``HDMapBEVAdapter``, but only when the separate
   ``nuScenes-map-expansion-v1.3`` pack is unzipped into ``/maps/``;
   otherwise the HD map is skipped. ``--split`` is an official nuScenes label
   that also selects the metadata version (``mini_train`` / ``mini_val`` →
   ``v1.0-mini``, ``train`` / ``val`` → ``v1.0-trainval``, ``test`` →
   ``v1.0-test``); the test split ships no annotations. nuScenes is read
   **directly from the JSON tables**: the ``nuscenes-devkit`` is not a runtime
   dependency (it pins ``numpy<2``), so the split scene-lists and the
   lane-arcline discretization are vendored from it (Apache-2.0). The 5 radars
   have no StandardE2E target yet. A partially-downloaded trainval converts
   cleanly: scenes whose sensor blob is not yet on disk are skipped. The release
   ships as ``.tgz`` archives and must be extracted first
   (``scripts/extract_nuscenes.sh``, or ``scripts/prepare_dataset_nuscenes.sh``
   to extract and preprocess in one step).

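
The split-to-version rule and the global→ego box transform from this footnote can be illustrated as follows. The mapping values come straight from the footnote. The function names are hypothetical, and the ego pose is assumed to be available as a 4×4 ego→global matrix (in the raw tables it is stored as a translation and a quaternion, which this sketch does not parse).

```python
import numpy as np

# Official split label -> metadata version, as listed in the footnote.
SPLIT_TO_VERSION = {
    "mini_train": "v1.0-mini",
    "mini_val": "v1.0-mini",
    "train": "v1.0-trainval",
    "val": "v1.0-trainval",
    "test": "v1.0-test",
}


def version_for_split(split):
    """Return the metadata version implied by a split label."""
    try:
        return SPLIT_TO_VERSION[split]
    except KeyError:
        raise ValueError(f"unknown nuScenes split: {split!r}") from None


def centers_global_to_ego(centers_global, ego_to_global):
    """Transform (N, 3) box centers from the global frame into the ego frame.

    ego_to_global: (4, 4) ego pose at the keyframe (assumed input format).
    """
    global_to_ego = np.linalg.inv(ego_to_global)
    ones = np.ones((len(centers_global), 1))
    homogeneous = np.hstack([centers_global, ones])
    return (homogeneous @ global_to_ego.T)[:, :3]
```

A full box transform would also rotate each box's orientation by the inverse ego rotation; only the centers are shown here to keep the sketch short.
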
How datasets are added
----------------------
See `Adding New Datasets Guide
`_
for the full processor → adapter → aggregator pipeline a new dataset has
to plug into.