From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving

1CARIAD SE, 2Technical University of Munich, 3Karlsruhe Institute of Technology, *equal contribution
ITSC 2025
CVAE Model

Generative Problem Formulation: The generative joint prediction model is realized as a Conditional Variational Autoencoder. It comprises Encoders for the actors, the map, and the ground truth; a Posterior that samples the latent scene variable during training; a Prior that samples it during inference; and a Decoder that produces the future predictions from the encodings and the drawn latent scene sample. Black arrows mark paths that are always active, red arrows are used only during training, and green arrows only during inference.
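To make this data flow concrete, the PyTorch-style sketch below shows how such a CVAE switches between the Posterior (training) and the Prior (inference) when drawing the latent scene sample. All module names, feature dimensions, and tensor shapes are illustrative assumptions and are not taken from the paper or the released SIMPL code.

import torch
import torch.nn as nn

class SceneCVAE(nn.Module):
    """Minimal CVAE sketch: encoders -> posterior/prior -> decoder."""

    def __init__(self, d_model=128, d_latent=32, horizon=60):
        super().__init__()
        self.actor_enc = nn.LSTM(4, d_model, batch_first=True)            # past actor states (x, y, vx, vy)
        self.map_enc = nn.Sequential(nn.Linear(2, d_model), nn.ReLU())    # map polyline points
        self.gt_enc = nn.LSTM(2, d_model, batch_first=True)               # ground-truth futures (training only)
        self.prior = nn.Linear(d_model, 2 * d_latent)                     # latent scene distribution at inference
        self.posterior = nn.Linear(2 * d_model, 2 * d_latent)             # latent scene distribution during training
        self.decoder = nn.Linear(d_model + d_latent, horizon * 2)         # (x, y) per future time step
        self.horizon = horizon

    def forward(self, past, map_pts, future=None):
        # Context encoding (black-arrow path, always executed).
        _, (h_actor, _) = self.actor_enc(past)                            # past: (A, T_past, 4)
        ctx = h_actor.squeeze(0) + self.map_enc(map_pts).mean(dim=1)      # map_pts: (A, P, 2) -> (A, d_model)
        scene_ctx = ctx.mean(dim=0, keepdim=True)                         # pooled scene feature, (1, d_model)

        if future is not None:                                            # training (red path): Posterior
            _, (h_gt, _) = self.gt_enc(future)                            # future: (A, T_fut, 2)
            stats = self.posterior(torch.cat([scene_ctx, h_gt.squeeze(0).mean(0, keepdim=True)], dim=-1))
        else:                                                             # inference (green path): Prior
            stats = self.prior(scene_ctx)

        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)           # one latent sample = one joint scene mode
        traj = self.decoder(torch.cat([ctx, z.expand(ctx.size(0), -1)], dim=-1))
        return traj.view(-1, self.horizon, 2), mu, logvar                 # (A, horizon, 2)

Drawing several latent samples from the Prior and decoding each of them yields the set of joint scene modes at inference time.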

Abstract

Accurate motion prediction of surrounding traffic participants is crucial for the safe and efficient operation of automated vehicles in dynamic environments. Marginal prediction models commonly forecast each agent's future trajectories independently, often leading to sub-optimal planning decisions for an automated vehicle. In contrast, joint prediction models explicitly account for the interactions between agents, yielding socially and physically consistent predictions on a scene level. However, existing approaches differ not only in their problem formulation but also in the model architectures and implementation details used, making it difficult to compare them. In this work, we systematically investigate different approaches to joint motion prediction, including post-processing of the marginal predictions, explicitly training the model for joint predictions, and framing the problem as a generative task. We evaluate each approach in terms of prediction accuracy, multi-modality, and inference efficiency, offering a comprehensive analysis of the strengths and limitations of each approach.

Research Question

While marginal predictions (a) tackle the problem of predicting the future behavior of traffic participants by forecasting each agent's future trajectories independently, joint predictions (b) explicitly capture the dependencies between agents, enabling the generation of socially and physically plausible multi-agent futures at the scene level.

Existing approaches to joint motion prediction differ not only in their problem formulation, such as recombining marginal predictions into joint predictions, using scene-level losses during training, or adopting generative formulations, but also in their model architectures, dataset pre-processing steps, and post-processing techniques. As a result, it is difficult to determine whether gains in leaderboard performance result from different problem formulations, architectural changes, or other implementation details such as training strategies. We address this ambiguity by providing a detailed evaluation of commonly used approaches to joint motion prediction.

Our contributions are as follows:

  • We systematically explore joint prediction strategies building upon the SIMPL model [1] as a marginal baseline.
  • We outline several possible modifications to adapt SIMPL for joint prediction, including the use of more expressive trajectory decoders, as well as framing joint motion prediction as a generative task using a CVAE approach.
  • We conduct extensive experiments on the Argoverse 2 Motion Forecasting Dataset [2], evaluating not only prediction accuracy but also the multi-modality of the predicted modes and the models' inference times.
  • We provide a comprehensive analysis of the strengths and limitations of each approach, offering insights into their performance trade-offs.

Evaluated Approaches

We explore three strategies to joint motion prediction: (i) recombining marginal modes into joint predictions, (ii) applying a scene-level loss during training, and (iii) formulating motion prediction as a generative problem solved via a CVAE.
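As a sketch of strategy (i), the snippet below recombines per-agent marginal modes into joint scene modes by scoring every combination with the product of its marginal mode probabilities and keeping the most likely ones. The exact recombination heuristic used in the paper may differ; function and argument names are illustrative.

import itertools
import numpy as np

def recombine_marginal_modes(trajs, probs, k_joint=6):
    # trajs: (A, K, T, 2) marginal trajectories, K modes per agent
    # probs: (A, K) marginal mode probabilities per agent
    num_agents, num_modes = probs.shape
    # All K**A combinations -- exponential in the number of agents, so
    # practical systems prune candidates before or during this step.
    combos = list(itertools.product(range(num_modes), repeat=num_agents))
    scores = np.array([probs[np.arange(num_agents), list(c)].prod() for c in combos])
    best = np.argsort(-scores)[:k_joint]
    joint_trajs = np.stack([trajs[np.arange(num_agents), list(combos[i])] for i in best])  # (k_joint, A, T, 2)
    joint_probs = scores[best] / scores[best].sum()
    return joint_trajs, joint_probs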

Specifically, we evaluate these approaches by comparing the following models:

  • Marginal Recombination: The marginal baseline model evaluated jointly by recombining predicted marginal modes.
  • Joint Loss: The baseline model trained directly with a scene-level loss to encourage joint consistency (a sketch of such a loss follows after this list).
  • Multi-MLP: An extension of the baseline with a Multi-MLP decoder, using separate heads for each mode, trained with the scene-level loss.
  • Anchor Point Transformer: An extension of the baseline with a transformer-based decoder for improving inter-mode accuracy, also trained with the scene-level loss.
  • Conditional Variational Autoencoder: The baseline model adapted to the CVAE framework, enabling sampling-based generation of predictions.
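For the models trained with the scene-level loss, the following sketch illustrates the core idea of a joint winner-takes-all regression term: the displacement error is aggregated over all agents of a scene before a single best joint mode is selected, instead of picking the best mode per agent. The weighting, classification terms, and masking details of the actual training objective may differ.

import torch

def scene_level_wta_loss(pred, gt, mask):
    # pred: (K, A, T, 2) joint predictions, K scene modes for A agents
    # gt:   (A, T, 2) ground-truth futures
    # mask: (A, T) float validity mask for padded agents/time steps
    err = torch.norm(pred - gt.unsqueeze(0), dim=-1)                           # (K, A, T)
    err = (err * mask.unsqueeze(0)).sum(dim=(1, 2)) / mask.sum().clamp(min=1)  # mean error per joint mode
    best = torch.argmin(err)                                                   # one winner for the whole scene
    return err[best]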

Quantitative Results

Prediction performance on the Argoverse 2 Motion Forecasting Competition test set. Recombining the marginal modes (1) already achieves strong performance across all metrics. Employing a scene-level loss during training (2) degrades performance, which is mitigated by employing more expressive decoders (3, 4). All generative approaches (5, 6) improve the prediction performance on three of four metrics, with only the actorCR increasing slightly.

Inference times of the proposed models, measured on an NVIDIA Quadro RTX 6000 24GB GPU. The model with marginal recombination (1) exhibits nearly double the inference time of the same model architecture trained with a scene-level loss (2), due to the additional post-processing required to recombine the marginal modes. The remaining deterministic models (3, 4) show similar inference times, increasing with model size. The generative models (5, 6) show slightly higher inference times because the decoder processes the full scene for each predicted mode.
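For reference, GPU inference latency can be measured as sketched below, with explicit CUDA synchronization so that asynchronous kernel launches are not mistaken for completed work. The model and batch are placeholders; the numbers reported above stem from the authors' own setup on the Quadro RTX 6000.

import time
import torch

@torch.no_grad()
def measure_inference_time_ms(model, batch, warmup=10, iters=100):
    # Assumes model and batch already reside on the GPU.
    model.eval()
    for _ in range(warmup):                      # exclude one-off costs such as cuDNN autotuning
        model(batch)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(batch)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3   # average milliseconds per forward pass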

Qualitative Results

The quantitative prediction metrics above evaluate only the mode most aligned with the ground truth. To address this, we analyze the diversity and multi-modality of the predicted modes by investigating example situations.

Example Situation 1

The deterministic Anchor Point Transformer model produces many unrealistic modes and shows no real multi-modality in its predictions. The generative CVAE-based model produces more plausible and genuinely multi-modal predictions. However, when a large weight is placed on the KL divergence regularization, denoted as CVAE Large Beta, the predicted modes exhibit only limited diversity.
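The Large Beta and Small Beta variants refer to the weight β on the KL term of the usual CVAE training objective, written below in its standard form (x: scene context, y: future trajectories, z: latent scene variable; the notation is ours, not the paper's):

\mathcal{L}(x, y) = \mathbb{E}_{q_\phi(z \mid x, y)}\big[-\log p_\theta(y \mid x, z)\big] + \beta \, D_{\mathrm{KL}}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big)

A larger β pulls the Posterior closer to the Prior, which regularizes the latent space but, as observed above, can reduce the diversity of the sampled modes.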

Example Situation 2

Again, only the generative CVAE-based model with a small weight on the KL divergence regularization shows true multi-modality in its predicted modes.

Example Situation 3

Here, all models accurately predict the ground truth movement in at least one predicted mode. However, only the CVAE Small Beta model predicts a mode in which the red vehicle continues straight.

Example Situation 4

In this example, no model is able to correctly predict the ground truth movement. We attribute this to the atypical trajectory of the ground truth vehicle.

Acknowledgment

This work is a result of the joint research project STADT:up (19A22006E). The project is supported by the German Federal Ministry for Economic Affairs and Climate Action (BMWK), based on a decision of the German Bundestag. The author is solely responsible for the content of this publication.

Marco Caccamo was supported by an Alexander von Humboldt Professorship endowed by the German Federal Ministry of Education and Research.

BibTeX


@article{konstantinidis2025marginaltojoint,
  title={From Marginal to Joint Predictions: Evaluating Scene-Consistent Trajectory Prediction Approaches for Automated Driving},
  author={Konstantinidis, Fabian and Dallari Guerreiro, Ariel and Trumpp, Raphael and Sackmann, Moritz and Hofmann, Ulrich and Caccamo, Marco and Stiller, Christoph},
  journal={arXiv preprint arXiv:2507.05254},
  year={2025}
}