Supported Datasets ================== Each row below lists the modalities the dataset processor currently emits into the unified format. ``✓`` means the modality is wired end-to-end through preprocessing and the on-disk ``TransformedFrameData`` schema; ``—`` means the dataset doesn't ship that modality (and it's surfaced as the corresponding default value via :class:`~standard_e2e.dataset_utils.modality_defaults.ModalityDefaults`). .. list-table:: :header-rows: 1 :widths: 22 10 10 10 10 12 14 * - Dataset - Cameras - LiDAR - HD map (BEV) - 3D detections - Driving command - Preference trajectory * - `Waymo Open E2E `__ - ✓ (8 ring cameras) - — - — - — - ✓ - ✓ * - `Waymo Open Perception `__ - ✓ (5 cameras) - ✓ (top + side, ego frame) - ✓ - ✓ - — - — * - `Argoverse 2 Sensor `__ - ✓ (7 ring cameras) - ✓ (merged sweep, ego frame) - ✓ - ✓ - — - — * - `Argoverse 2 Lidar `__ - — - ✓ (merged sweep, ego frame) - ✓ - — - — - — * - `NAVSIM `__ (OpenScene-v1.1) - ✓ (8 cameras: front/left×3/right×3/rear) - ✓ (merged sweep, ego frame) - ✓ (via nuPlan ``map.gpkg`` → unified taxonomy; lane boundaries carry no paint info, since nuPlan doesn't store paint type) - ✓ - ✓ (4-class one-hot → :class:`~standard_e2e.enums.Intent`) - — * - `WayveScenes101 `__ - ✓ (5 fisheye: forward + side arc) - ✓ (COLMAP SfM, ego frame) [#wayve_lidar]_ - — - — - — - — * - `comma2k19 `__ - ✓ (1 forward: comma EON, 1164×874 pinhole) [#comma2k19]_ - — - — - — - — - — * - `TruckDrive `__ (heavy-truck) - ✓ (11-camera 8 MP rig) [#truckdrive]_ - ✓ (Aeva FMCW joint cloud, ego frame) - — - ✓ (ego frame) - — - — * - `View-of-Delft `__ (urban radar) - ✓ (1 front: 1936×1216 pinhole) [#vod]_ - ✓ (Velodyne HDL-64, ego frame) - — - ✓ (ego frame) - — - — * - `nuScenes `__ - ✓ (6 surround cameras) [#nuscenes]_ - ✓ (LIDAR_TOP, ego frame) - ✓ (map-expansion → unified taxonomy) - ✓ (ego frame) - — - — All datasets also emit the ego **past/future trajectory** (from each dataset's poses, via the segment-context aggregator) regardless of the columns above. .. note:: **comma2k19 is high-volume** — 20 Hz × ~2000 one-minute segments ≈ 2.4 M frames (~2 TB at the native 1164×874 resolution). Two converter knobs bound the output size and processing time: ``--frame_stride N`` keeps **every N-th frame** (``1`` = full 20 Hz; e.g. ``--frame_stride 4`` ≈ 5 Hz), and the ``cameras_identity_adapter``'s ``max_size`` param **downscales** each frame so its longest side is at most that many px (intrinsics scaled to match). .. [#wayve_lidar] WayveScenes101 ships **no sensor lidar**. Its ``lidar_pc`` is populated from the per-scene **COLMAP SfM** point cloud: filtered (reprojection error ≤ 6, track length ≥ 2), converted OpenCV→FLU, then transformed into each frame's ego (FLU, x-forward/y-left/z-up) frame with the *world→ego* pose and range-clipped (50 m) so it flows through the standard lidar adapters. It is photogrammetric (sparse, up-to-scale), not a sensor measurement. The ego, cameras and lidar share one FLU frame, so a frame's cloud lifted by ``aux_data["pose_matrix"]`` reproduces the source SfM cloud exactly. .. [#comma2k19] comma2k19 ships a **single forward-facing** 20 Hz camera (comma EON, 1164×874, treated as a pinhole; identity extrinsics, since the dataset pose *is* the camera pose) plus a fused GNSS/IMU ego pose and CAN telemetry — no lidar, HD map, 3D boxes, or driving command. The ego pose is derived from the ECEF ``global_pose`` into a per-segment local FLU frame (x-forward/y-left/z-up), so ``global_position`` X/Y/Z/heading are segment-relative; ``global_position`` additionally carries the ego **speed** (:attr:`~standard_e2e.enums.TrajectoryComponent.SPEED`) from the ECEF velocity. Segments must be extracted from the distributed ``Chunk_*.zip`` archives first (as with WayveScenes101); each ``video.hevc`` is then decoded forward-only, since HEVC random seek is unreliable. Native rate is 20 Hz — use ``--frame_stride`` to subsample. .. [#truckdrive] TruckDrive (Torc Robotics / Princeton, CVPR 2026) is a long-range highway **heavy-truck** dataset. Its 8 MP surround rig has **11 cameras** — more than the eight canonical :class:`~standard_e2e.enums.CameraDirection` slots — so each camera is mapped to the canonical member matching its facing wherever one fits, with dedicated members added only for the extra views the eight can't name: the forward telephoto pair (``FRONT_LEFT_NARROW`` / ``FRONT_RIGHT_NARROW``) and the rear-facing side pair (``SIDE_LEFT_BACK`` / ``SIDE_RIGHT_BACK``). ``lidar_pc`` is the seven-sensor **Aeva FMCW** joint cloud (xyz kept, transformed into the ego ``velodyne`` frame); ``detections_3d`` are the tracked 3D boxes in the ego frame, with the ego vehicle's own cab/trailer and ``DontCare`` groups excluded per the paper taxonomy. The ego pose (PPK + LiDAR-SLAM) drives the past/future trajectory. Short-range Ouster lidar, 4D radar, accumulated GT depth and lane lines have no StandardE2E target yet and are not ingested. The dataset ships as per-scene, per-modality zips and **must be extracted first** (``scripts/extract_truckdrive.sh``, or ``scripts/prepare_dataset_truckdrive.sh`` to extract and preprocess in one step); frames are matched across sensors by their synchronization key. .. [#vod] View-of-Delft (TU Delft, IEEE RA-L 2022) is a compact **urban** dataset whose distinctive sensor is a **3+1D radar**. StandardE2E ingests its single **front camera** (:class:`~standard_e2e.enums.CameraDirection.FRONT`, 1936×1216 pinhole; intrinsics from the calib ``P2``, extrinsics from ``inv(Tr_velo_to_cam)``), the 64-layer **Velodyne** ``lidar_pc`` (xyz in the ego ``velodyne`` frame, per-point reflectance dropped) and KITTI ``detections_3d`` mapped from camera coordinates into the ego frame -- each of VoD's 13 classes folded into the coarse :class:`~standard_e2e.enums.DetectionType` (the two-wheeler family ``bicycle`` / ``Cyclist`` / ``rider`` / ``moped_scooter`` / ``motor`` → ``BICYCLE``; ``Car`` / ``truck`` / ``vehicle_other`` → ``VEHICLE``; static or ambiguous boxes → ``UNKNOWN``; ``DontCare`` dropped). Box yaw is VoD's KITTI rotation about the LiDAR **-Z** axis (camera-x zero-reference), so the ego heading is ``-(rotation + pi/2)``. One frame = one keyframe; one segment = one recording scene (``delft_*``), grouped via the official scene table so the per-segment past/future ego trajectory (from the per-frame ``mapToCamera`` pose) never spans two recordings. The **3+1D radar** (``radar`` / ``radar_3frames`` / ``radar_5frames``) has no StandardE2E modality yet and is not ingested; the detection release ships no per-frame timestamps, so they are **synthesised** at the 10 Hz LiDAR-lead rate; the **test** split has sensor data but no labels. The dataset ships as zips and **must be extracted first** (``scripts/extract_vod.sh`` -- lidar tree only, with an optional track-id overlay -- or ``scripts/prepare_dataset_vod.sh`` to extract and preprocess in one step). .. [#nuscenes] nuScenes (Motional, CVPR 2020) is the de-facto surround-view E2E / BEV benchmark: 1000 ~20 s scenes, a 6-camera surround rig (1600×900), a 32-beam ``LIDAR_TOP`` and densely annotated 3D boxes at 2 Hz keyframes (one frame = one keyframe sample; one segment = one scene). The six ``CAM_*`` channels map onto the canonical :class:`~standard_e2e.enums.CameraDirection` members; ``lidar_pc`` is the ``LIDAR_TOP`` cloud (xyz, in the ego frame); ``detections_3d`` are the ``sample_annotation`` boxes transformed from the global frame into the ego frame, each ``category_name`` folded into the coarse :class:`~standard_e2e.enums.DetectionType`. The vector **map-expansion** (lane centers from the arcline paths, lane/road dividers, crossings, walkways, stop lines, drivable area, intersections) is translated to the unified :class:`~standard_e2e.enums.MapElementType` in the ego frame and rasterised by ``HDMapBEVAdapter`` -- only when the separate ``nuScenes-map-expansion-v1.3`` pack is unzipped into ``/maps/`` (else the HD map is skipped). ``--split`` is an official nuScenes label that also selects the metadata version (``mini_train`` / ``mini_val`` → ``v1.0-mini``, ``train`` / ``val`` → ``v1.0-trainval``, ``test`` → ``v1.0-test``); the test split ships no annotations. nuScenes is read **directly from the JSON tables** -- the ``nuscenes-devkit`` is not a runtime dependency (it pins ``numpy<2``), so the split scene-lists and the lane-arcline discretization are vendored from it (Apache-2.0). The 5 radars have no StandardE2E target yet. A partially-downloaded trainval converts cleanly: scenes whose sensor blob is not yet on disk are skipped. The release ships as ``.tgz`` archives and must be extracted first (``scripts/extract_nuscenes.sh``, or ``scripts/prepare_dataset_nuscenes.sh`` to extract and preprocess in one step). How datasets are added ---------------------- See `Adding New Datasets Guide `_ for the full processor → adapter → aggregator pipeline a new dataset has to plug into.