Vehicle Ego-Trajectory Segmentation Using Guidance Cues

Mihalea, Andrei; Florea, Adina Magda

doi:10.3390/app14177776

Open AccessArticle

Vehicle Ego-Trajectory Segmentation Using Guidance Cues

by

Andrei Mihalea

and

Adina Magda Florea

^*

Faculty of Automatic Control and Computer Science, National University of Science and Technology POLITEHNICA Bucharest, 60042 Bucharest, Romania

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2024, 14(17), 7776; https://doi.org/10.3390/app14177776

Submission received: 26 July 2024 / Revised: 27 August 2024 / Accepted: 29 August 2024 / Published: 3 September 2024

(This article belongs to the Special Issue Intelligent Transportation System Technologies and Applications)

Download

Browse Figures

Versions Notes

Abstract

:

Computer vision has significantly influenced recent advancements in autonomous driving by providing cutting-edge solutions for various challenges, including object detection, semantic segmentation, and comprehensive scene understanding. One specific challenge is ego-vehicle trajectory segmentation, which involves learning the vehicle’s path and describing it with a segmentation map. This can play an important role in both autonomous driving and advanced driver assistance systems, as it enhances the accuracy of perceiving and forecasting the vehicle’s movements across different driving scenarios. In this work, we propose a deep learning approach for ego-trajectory segmentation that leverages a state-of-the-art segmentation network augmented with guidance cues provided through various merging mechanisms. These mechanisms are designed to direct the vehicle’s path as intended, utilizing training data obtained with a self-supervised approach. Our results demonstrate the feasibility of using self-supervised labels for ego-trajectory segmentation and embedding directional intentions within the network’s decisions through image and guidance input concatenation, feature concatenation, or cross-attention between pixel features and various types of guidance cues. We also analyze the effectiveness of our approach in constraining the segmentation outputs and prove that our proposed improvements bring major boosts in the segmentation metrics, increasing IoU by more than 12% and 5% compared with our two baseline models. This work paves the way for further exploration into ego-trajectory segmentation methods aimed at better predicting the behavior of autonomous vehicles.

Keywords:

ego-motion; semantic segmentation; ego-trajectory segmentation; ADAS

1. Introduction

Deep learning has brought significant advancements in the field of autonomous driving and Advanced Driving Assistance Systems (ADASs) in recent years, following the dominance that the paradigm has established over the computer vision domain. Therefore, modern computer vision techniques using deep learning have had a great impact on autonomous driving systems and a large range of its subproblems, including but not restricted to end-to-end steering systems [1,2]; perception modules for environment understanding, such as pedestrian detection [3,4]; obstacle avoidance [5]; and localization and mapping [6].

The problem tackled in this work is ego-trajectory segmentation, which means predicting the trajectory of a vehicle as a segmentation mask. Ego-trajectory segmentation can, in the context of ADAS, provide valuable information about the environment of the vehicle, especially for perception and control modules, which can be enabled to understand whether the chosen trajectory can be followed or not, based on the ego-trajectory segmentation output.

This task can be derived from a more popular research topic related to self-driving vehicles and ADASs, which is the lane segmentation problem, where methods such as [7,8] have shown promising results. Another related problem can be defined as drivable area segmentation and detection [9,10], where the main goal is to generate segmentation masks for regions of the image that represent areas in the environment where the vehicle can move through. However, even though solutions to both of these problems represent important stepping stones in the improvement of ADASs, there are differences between lane segmentation and ego-trajectory segmentation, and depending on the situation, one can prove to be more helpful than the other. From our perspective, lane segmentation and ego-trajectory segmentation are both crucial parts in self-driving systems but have different roles, and we consider ego-trajectory segmentation to be equally important, since it can be integrated with different systems, such as path planners, route optimization modules, and collision avoidance mechanisms. Despite not being as popular of a research topic as lane segmentation, there are still valuable developments in ego-trajectory segmentation, especially with methods that use ground-truth data generated by using information from sensors other than RGB cameras, such as IMU, LIDAR [11], and GPS [12].

Even though the performance of state-of-the-art models for ego-trajectory segmentation has reached great results, one issue that is not widely tackled in the current methods is the inability of the model to select one trajectory based on a given intention, making the deployment of such methods in real-life scenarios unreliable due to the stochasticity of the decisions that might arise in situations where multiple possible trajectories could be predicted by an ego-trajectory prediction model. Usually, in such situations, the model will only predict one possibility, making a decision which is influenced by the distribution of the training data.

Unlike previous work performed in the ego-trajectory prediction field, we offer a new perspective to the task and propose a solution that learns to associate guidance cues to given vehicle trajectories during training, such that at inference time, an input guidance cue can force the network to provide a given trajectory based on the intention passed along by said cue. To achieve this goal, we present a detailed description of how to develop guidance mechanisms, starting from simpler ideas and moving to more advanced concepts. Therefore, the summary of the contributions presented in this work is given below:

Ego-trajectory segmentation from self-supervised labels. We show that ego-trajectory segmentation can be learned in a supervised manner from labels obtained with self-supervised methods, and we compare two baseline models.
Soft segmentation. Another topic of discussion in this work is the analysis of a soft segmentation head with different thresholds applied to the results, which is equivalent to selecting the most probable path out of more possible trajectories and represents our first idea of controlling the predicted ego-trajectory.
Segmentation guidance. We provide an analysis of the improvements brought to the ego-trajectory segmentation task when guidance data are provided to the network via two different mechanisms: data merging and cross-attention.
Improving guidance via class splitting. The experiments analyze whether splitting the ego-trajectory segmentation maps into different classes based on trajectory rotation helps improve the segmentation metrics.

2. Related Work

This section begins with a state-of-the-art description of the problem of self-supervised ego-motion estimation, which is the first part of our pipeline, enabling us to generate ego-trajectory labels in a self-supervised approach. After that, we describe the state-of-the-art developments in semantic segmentation as a general method and then delve into the specific task of ego-trajectory segmentation for a moving vehicle. While semantic segmentation is a highly researched topic in the deep learning academic community, in the autonomous driving field, this topic is not usually involved directly in predicting the ego-trajectory, but it is intensely utilized in scene understanding problems [13], like the segmentation of pedestrians [14,15,16], other vehicles, driving lanes [17,18,19], and other objects in the environment.

2.1. Self-Supervised Ego-Motion Estimation

As a first step of our label generation pipeline, the self-supervised ego-motion estimation problem plays a crucial role, since the quality of the segmentation training data depends on this process, which, if the method is robust enough, can be applied to almost any driving video in the wild.

Zhou et al. [20] presented a framework for the simultaneous learning of depth end ego-motion from unlabeled driving video sequences through a view synthesis process that acts as a supervision signal and ties the two neural networks together during training time but allows for their independent usage for inference.

Further improvements in this area bring depth and pose consistency across the predictions throughout the driving sequence [21] by imposing a geometric constraint represented as an auxiliary loss. In a similar work, Godard et al. [22] achieved depth consistency and made several other design improvements, such as auto-masking, which ignores pixels in regions where the motion assumptions are not met, and a robust minimum reconstruction loss for handling occlusions. Additionally, in [23], it is shown that camera parameters can also be learned by a separate neural network alongside the depth and ego-motion networks, which eliminates the need for this additional prior knowledge. Besides the 2D photometric consistency loss, Mahjourian, Wicke and Angelova [24] proposed a 3D-based loss, with the purpose of enforcing the consistency of estimated point clouds. One of the issues with self-supervised depth and ego-motion is the real scale recovery of the network predictions, which might become an impediment to using them in real-life scenarios. To address this problem, Wagstaff and Kelly [25] proposed a method that introduces an additional loss between the ground-truth camera height and the estimated camera height, which is obtained after training another neural network for ground plane estimation. The scaling factor between the predicted and ground-truth heights is multiplied with both depth and pose estimates to generate additional metric scaled targets for the two tasks. Moreover, Watson et al. [26] presented a method that can leverage sequence information through a cost volume, while Gu et al. [27] showed an optimization method for iteratively improving the depth and ego-motion predictions by using gated recurrent units.

2.2. Semantic Segmentation

Semantic segmentation is a cutting-edge technique in computer vision that involves partitioning an image into multiple segments and assigning to each pixel of a segment the correct label corresponding to the object or region it represents. The state-of-the-art methods in semantic segmentation leverage deep learning architectures, particularly convolutional neural networks, to achieve remarkable accuracy and efficiency.

Overall, the state of the art in semantic segmentation continues to evolve rapidly, driven by innovations in deep learning architectures, optimization techniques, and the availability of large-scale annotated datasets. These advancements hold great promise for a wide range of applications, including autonomous driving, medical image analysis [28,29,30], and scene understanding in robotics [31,32] and augmented reality [33,34].

From an architectural perspective, the FCN (fully convolutional network) for semantic segmentation [35] is claimed to be one of the first end-to-end neural networks trained for pixel-wise predictions. This architecture makes use of an encoder for learning features and a deconvolutional layer that upsamples the features from the deep layers back to the original image shape. Due to the information loss that happens when going deeper within the network, data from the shallower layers are also fused to the output, so the spatial information is preserved. Further developments in the semantic segmentation field brought models like DeepLab [36] and Mask R-CNN [37]. The former one features atrous spatial pyramid pooling and fully connected conditional random fields, while the latter is an extension of Faster R-CNN [38], which can now be used for both object detection and semantic segmentation.

However, state-of-the-art segmentation networks followed the trend which brought the transformer architecture into the world of computer vision and adopted such backbone foundation models. Many of the recent literature studies presented state-of-the-art results for computer vision problems only by adding such a backbone to already established specialized computer vision architectures, allowing them to use the same backbone for a large variety of problems and surpassing the already existing specialized models for each task. Such backbones include the vision transformer (ViT) [39], Swin Transformer [40], BEiT [41], and InternImage [42], which also uses a special type of operation called deformable convolution [43], enabling the convolution kernel to sample offset pixel locations.

2.3. Self-Supervised Ego-Trajectory Labeling

In our previous work [44], we proposed a method for generating vehicle ego-trajectory segmentation labels, starting from a state-of-the-art self-supervised method for joint depth and ego-motion estimation, like the work presented in [20,21]. Such methods provide a framework for learning both depth and ego-motion by using the self-supervised training paradigm by warping one or more source frames into a target frame by using the network prediction and then applying a photometric loss to minimize the differences between the original image and the ones obtained through the warping process. This view synthesis process acts as a supervision signal throughout the entire training process.

Furthermore, our past work provides details on how to use the previously obtained ego-motion to generate the vehicle ego-trajectory segmentation labels by leveraging the camera intrinsic and extrinsic parameters in order to project the predicted ego-motion into the real-world coordinates, obtaining the real pose of the vehicle, which is represented by the contact points between the wheels and the ground and will be projected back from a selected number of future time steps, from the real-world coordinates into the pixel coordinates of each frame. Therefore, each training frame will contain the ground contact points from future time steps across the vehicle’s followed path during the driving sequence.

Considering the frame at time step t across the driving sequence and the future positions of the vehicle from k future frames, we denote these position as

P_{t + 1}, P_{t + 2}, \dots, P_{t + k}

. Each of these positions are computed by using the ego-motion predicted by the pose network. After these real-world coordinate points are obtained, we compute their projection into the frame by taking advantage of the intrinsic and extrinsic parameters of the camera and obtain the frame coordinates of the k future positions.

2.4. Ego-Trajectory Segmentation

As previously stated, ego-trajectory segmentation is a sub-class of semantic segmentation problems which require learning to predict the trajectory followed by a vehicle, usually represented as a continuous shape which denotes the future locations of the vehicle, projected into the current frame. In [11], a method for drivable path segmentation is proposed by generating path labels by using an IMU sensor, correcting them by removing the path labels where obstacles are met by using knowledge from a LIDAR sensor and then training a semantic segmentation network on these labels. Similarly, an aggressive deep driving method [45] uses the data provided by an IMU sensor to generate a cost map of the vehicle trajectory and combines a bird’s-eye view of the cost map with model predictive control in order to input commands into a vehicle. An approach similar to the one described in [11] is presented in [46], which also uses IMU data to incorporate vehicle path labels into a semantic segmentation task with various other classes besides the trajectory class. In [12], the authors describe a method for generating ego-trajectory segmentation labels on the KITTI raw dataset [47] using data from GPS readings and then propose an end-to-end sequence-based deep network for trajectory prediction that relies on a feature extraction backbone consisting of atrous convolution layers and spatial pyramid pooling while leveraging a matching module that assigns scores between the embeddings obtained between different pairs of consecutive frames.

3. Methods

In this section, we start by presenting the two proposed baseline models; then, we thoroughly describe the method used for improving vehicle ego-trajectory segmentation through the addition of guidance cues, which can be provided in three different variations.

The overall framework can be visualized in Figure 1. It describes both the process of obtaining the self-supervised ego-trajectory labels and the final goal of this paper, which is the segmentation task for the vehicle’s ego-trajectory. Therefore, a summary of the entire pipeline can be described by the following steps:

Joint learning of depth and ego-motion neural networks in self-supervised manner on RGB image data from driving videos.
Applying ego-motion network on unseen data to obtain pose labels.
Obtaining ego-trajectory segmentation labels starting from generated pose labels from previous step.
Training semantic segmentation neural networks on newly obtained ego-trajectory labels.

3.1. Baseline Models

The ego-trajectory segmentation task is performed as a supervised training process, where a neural network learns to predict vehicle trajectory segmentation from labels obtained in the previously described self-supervised manner. In our previous work, we chose a DeepLabV3-plus [48] segmentation architecture and experimented with different dataset sampling weights in order to balance the driving scenarios that are fed into the network during training.

However, more modern segmentation methods can be deployed in order to achieve better results in a large variety of tasks; therefore, starting from this hypothesis, we want to check if the ego-trajectory segmentation task can benefit from a modern, state-of-the-art segmentation network based on a vision foundation model backbone and a segmentation head.

The model we decided to utilize for this task is called InternImage [42] and is based on a vision foundation backbone with the core operation of deformable convolution [43], followed by a modern segmentation model, namely, Mask2Former [49]. Like many recent developments, InternImage follows the trend of proposing a backbone that can be further attached to a large variety of decoding heads for different computer vision problems; therefore, it is not only a segmentation model but also a vision foundation model that has the ability to provide valuable features for a large range of tasks. InternImage constitutes the backbone of the architecture; therefore, it needs a head for the downstream tasks. In this case, the Mask2Former [49] is employed for the segmentation task, as it provides reliable performance on different benchmark datasets. This segmentation network’s performance heavily relies on a cross-attention mechanism applied within the predicted mask regions.

3.2. Soft Segmentation Head

Unlike the classical semantic segmentation task, where each pixel of the image is assigned to one of the dataset classes, the proposed soft segmentation head implements a relaxed version of the original segmentation problem, which assigns a score for each pixel in the original image and indicates the probability of that pixel belonging to a given class. This approach is our first step in trying to generate multiple possible trajectory segmentation masks and select the one with the highest probability, by applying thresholds to the output soft mask values.

In our case, the segmentation problem is used to differentiate between two different classes: vehicle ego-trajectory and the rest of the scene. The Mask2Former segmentation head adopts a different approach from the classic deep learning-based segmentation methods, closer to the instance segmentation problems, where binary masks are predicted for each instance of an object. In our case, as a main difference, the binary mask will be used to predict pixels that belong to a class instead of an instance of the class. Therefore, for our problem, in the hard segmentation case, the network will predict two binary masks: one for the ego-trajectory and one for the rest of the scene. The soft segmentation scenario is similar when using the InternImage model with the Mask2Former head, which again means that two different masks will be predicted, but in this case, the two masks will not be binary; instead, they will have continuous values which represent the score assigned to each pixel for the current class.

In order to obtain the soft segmentation model outputs, we will take the soft mask described above but only for the ego-trajectory class and ignore the soft mask predicted for the “rest of the scene” class.

3.3. Segmentation Guidance

The main goal pursued in this work is implementing a guidance mechanism for the ego-trajectory segmentation mask predicted by the network, using additional input data. Three methods were selected for representing the additional guidance data that are given to the model, as follows:

Continuous trajectory rotation angle. This information is obtained from the self-supervised ego-motion module, which can leverage the relative pose between consecutive frames in order to compute the absolute rotation between the first and last positions of the vehicle along the trajectory.
Trajectory rotation angle category. Similar to the trajectory rotation angle described above, this method relies on discretizing the value into different bins and then fusing the bin value to the backbone features, in accordance to the guidance data merging strategy.
Text description of the rotation angle. The discretized rotation angle categories are assigned a text description which goes through a text embedding layer before being merged with the image data.

The trajectory rotation angles that describe the categories are chosen by visualizing a number of samples from the dataset and then manually deciding the boundaries for each category. The trajectory categories and text description of the scenario can be seen below:

Tight left (category 0) corresponds to a trajectory rotation angle in the range ${(- \infty, - 60)}^{°}$ .
Slight left (category 1) corresponds to a trajectory rotation angle in the range ${[- 60, - 18)}^{°}$ .
Forward (category 2) corresponds to a trajectory rotation angle in the range $[- 18, 18^{°})$ .
Slight right (category 3) corresponds to a trajectory rotation angle in the range ${[18, 60)}^{°}$ .
Tight right (category 4) corresponds to a trajectory rotation angle in the range ${[60, \infty)}^{°}$ .

For each of the guidance data representations, we experiment with two main options for providing this additional information to the network: the first one relies on input merging, while the second one employs a cross-attention mechanism between the decoder input features and the embeddings obtained for the guidance input data by passing them to small neural networks or text embedding layers. The two methods of feeding the guidance cue data into the network are described below:

Input merging (concatenation). This method relies on concatenating additional input data to the features extracted by the backbone. Note that the additional inputs are concatenated after backbone feature extraction is performed, because their concatenation to the input images would cause a mismatch between the number of channels from the weights of the pre-trained backbone and the input number of channels. The backbone provides output features at different scales; therefore, the additional input is replicated to match the feature width and height for each scale before being concatenated to these features.

Cross-attention. A cross-attention mechanism is deployed for aggregating the guidance cues into the semantic segmentation network. The guidance cues remain the same as the ones described above for the input merging case, but the mechanism of introducing the additional data into the model is different, involving a cross-attention module that combines the features provided through the InternImage pixel decoder, considered to be keys and values, and the guidance features which represent the queries. Unlike the case of input merging, for the cross-attention scenario, the guidance cues are also passed through a small neural network, which yields the query features.

Figure 2 shows the differences between the original Mask2Former segmentation head [49] and our modified versions, which include the two guidance merging mechanisms explained in this section. The architecture representation is adapted from [49].

3.4. Trajectory Class Splitting

The methods described above, including the baseline ego-trajectory segmentation networks, the soft segmentation head tweak, and the guidance cue data introduced into the network, are all formulated as two-class semantic segmentation problems, where the first class is represented by the vehicle ego-trajectory, while the other class is associated with the rest of the scene. However, the Mask2Former [49] segmentation head that is deployed for learning the trajectory segmentation masks also consists of a classification module, which relies on mask classification, meaning that per-pixel segmentation outputs are obtained by predicting N binary masks and N category labels which correspond to each of the masks, where N is the number of classes. Therefore, the final improvement we made in our ego-trajectory segmentation framework consists in splitting the trajectory labels into more classes, based on the trajectory rotation angle, which is obtained from the cumulative relative pose of the vehicle along a given trajectory, as described in Section 2.3. Therefore, instead of predicting a class for the rest of the scene and one for the ego-trajectory, the network will learn to predict a class for the rest of the scene and multiple classes for the ego-trajectory, based on the trajectory rotation angle. We selected 5 different trajectory classes, corresponding to the categories defined in Section 3.3. For this improvement, the input merging and cross-attention mechanisms for guidance cue integration into the network remain the same as previously described.

3.5. Training Details

Except for guidance-specific parameters, all experimental configurations were identical, in order to be able to only compare the effects of the guidance mechanisms on the performance of our segmentation task. The training setup was as follows: the batch size had a value of four, and we used two GPUs for training, where each of them received two of the batch samples. The neural network parameter optimizer was AdamW [50], with an initial learning rate of

2 \times 10^{- 5}

,

β

values of

(0.9, 0.999)

, and a weight decay of

0.05

. A total of 220,000 total iterations were selected for our experiments.

A data balancing technique was applied to facilitate the equal exposure of more extreme and rare data samples to the neural network as the scenarios which represent a larger part of the dataset. For this purpose, the dataset was split based on the trajectory rotation angle of each input sample, which was computed based on the relative pose of the adjacent frames that composed the trajectory. In this manner, each frame was assigned a trajectory rotation angle class, which was obtained after discretizing the continuous values of the trajectory rotation angle. Based on these classes, the balanced data sampler was applied, using a downsampling strategy, which means lowering the example frequency of the majoritary classes from the dataset. Note that the class splitting for data balancing was very similar to the class splitting that we presented in Section 3.4 but had a different purpose, data balancing, in this case, from empowering the guidance cue integration into the network.

As for the objective functions, three different losses were employed for the hard segmentation problem, and one more loss was added for the training of the soft segmentation network. The three main losses were dice loss [51], with a weighting of

5.0

; cross entropy loss for classification, with a weighting of

2.0

; and cross entropy loss for binary mask predictions, which had a weight of

5.0

. A class weighting was also applied for the classification loss, which complements the class balancing mechanism. This weighting associates a value of

0.1

for the background class (or rest of the scene) and

1.0

for the other classes. The Jaccard metric loss [52] was added to the soft segmentation module, with the same weighting value as the dice and classification losses.

4. Evaluation and Results

In this section, we describe the dataset and evaluation metrics chosen for benchmarking the implemented methods and then present and discuss the main results of our work, showing both qualitative and quantitative outputs and comparing the proposed guidance improvements to the baseline methods.

4.1. Training and Evaluation Dataset

Both training and evaluation were performed on the KITTI odometry dataset [53]. We selected sequences

0, 4, 5, 6, 7, 17, 18, 19, 20

, and 21 for the training part and sequences

8, 9

, and 10 for validation. Note that there are missing sequences from the list, because this work represents only a stage of our complete pipeline, namely, the ego-trajectory segmentation. The other part, consisting of the joint training of depth and ego-motion is to be performed on the rest of the dataset sequences.

4.2. Evaluation Metrics

Evaluating the performance of a segmentation model requires robust metrics. Two commonly used metrics are accuracy and mean intersection over union (IoU).

Accuracy measures the proportion of correctly classified pixels to the total number of pixels in an image. While it provides a straightforward assessment, it may not be the most informative metric for imbalanced classes or scenarios where certain categories are rare. The formula for accuracy can be seen in Equation (1), where N is the total number of pixels,

p_{i}

is the predicted label for pixel i,

g_{i}

is the ground-truth label for pixel i, and

I (\cdot)

is an indicator function which returns 1 if the condition is true and 0 otherwise.

a c c = \frac{\sum_{i = 1}^{N} I (p_{i} = g_{i})}{N}

(1)

Mean intersection over union (IoU), on the other hand, evaluates the overlap between predicted and ground-truth regions. It quantifies how well the model’s segmentation aligns with the actual objects in the image. The IoU metric is particularly valuable in scenarios where precise localization is critical. The formula for mean intersection over union is shown in Equation (2).

I o U = \frac{\sum_{i = 1}^{N} I (p_{i} = g_{i})}{I (p_{i} = 1) + I (g_{i} = 1) - I (p_{i} = g_{i})}

(2)

The soft Jaccard metric can be utilized for evaluating the soft ego-trajectory segmentation results of the proposed method. Unlike IoU, which works for discrete values, the generalized Jaccard metric works for prediction and target values which belong to the

[0, 1]

interval, making it adequate for evaluating the soft segmentation head outputs. Equation (3) describes this metric, as it is presented in [52]. Here, g is the ground-truth target soft mask, while p is the predicted one. The L1 norm is denoted by

| | \cdot {| |}_{1}

.

J M = \frac{{| | g | |}_{1} + {| | p | |}_{1} - | | g - {p | |}_{1}}{{| | g | |}_{1} + {| | p | |}_{1} + | | g - {p | |}_{1}}

(3)

4.3. DeepLabV3+ vs. InternImage

The first improvement in our ego-trajectory segmentation framework lies in the change from the DeepLabV3+ to the InternImage segmentation networks.

Table 1 shows the results we previously obtained with the DeepLabV3+ network and the new, improved results when the InternImage segmentation network was utilized for our task.

This table shows the large improvement margin obtained by only changing the segmentation network for the ego-trajectory segmentation task, which emphasizes one example of how much the state-of-the-art solutions have improved in a relatively small amount of time in many computer vision problems.

A visual comparison between the results obtained with the DeepLabV3+ model and the ones obtained with the InternImage model can be seen in Figure 3. Both results are compared with the ground-truth ego-trajectory, obtained from the ego-motion ground-truth data of the KITTI odometry dataset [53], even though they were both trained on the self-supervised labels generated by our proposed method. As seen in these scenarios, the InternImage segmentation model appears to provide better results for situations where uncertainty about the followed path is higher, sticking better to the ground-truth path when compared with the DeepLabV3+ model, which, in many cases, fails to generate a clear output in such situations. These improvements can be explained by the distinctive features utilized in the InternImage model, such as dynamic sparse kernels, which provide advantages similar to those of multi-head attention, especially the long-term dependence of different regions in the image, but in a more efficient way, which can be favorable for scaling large models. Additionally, the masked attention applied to high-resolution features through the Mask2Former head also contributes to the overall performance of the model.

4.4. Soft Segmentation Head

As stated before, the soft segmentation head allows us to choose different thresholds for the pixel values and only keep those that exceed the threshold value. In this manner, we can select more or fewer pixels out of the network predictions, depending on the situation, especially in cases where the network predicts pixels with higher values for the trajectory in a given direction and smaller values in another direction. Depending on the thresholds, such cases can be treated differently: a higher threshold will suppress the trajectory with lower pixel values and keep only the ones with larger values, while a lower threshold can lead to keeping both trajectories. An example that emphasizes this situation can be seen in Figure 4.

The effect of different threshold values applied to the soft segmentation head outputs can be seen in Table 2. A threshold value of

0.45

leads to the best IoU value, while the lower the selected threshold value is, the higher the accuracy becomes, which can be associated with an increase in the true positives, since more pixels are categorized as non-road, which is the majority class.

4.5. Segmentation Guidance

We have seen that soft segmentation can be a tool for selecting one out of more possible predicted trajectories. However, this does not offer enough control on which trajectory to pick and does not involve an intention given to the network in selecting said trajectory.

Therefore, this subsection highlights the main results obtained when employing the different guidance inputs and merging mechanisms to the ego-trajectory segmentation training process, which has the purpose of bringing additional control to the path generation process, control provided by bringing guidance cues into the network. For our task, the segmentation guidance was evaluated by using the same metrics as before. The evaluation was similar to the training process, which means that during evaluation, the guidance cue was generated from the ground-truth dataset, using the same method described for obtaining the trajectory angle. Note that we use the ground-truth pose only to constrain the model to follow the ground-truth trajectory, because this increases the segmentation metrics, which proves whether or not the guidance mechanisms make improvements to the overall performance. Furthermore, even if during evaluation, the ground truth is used, this is only caused by the limitation of the data, which only have a ground-truth path; therefore, evaluating the IoU between the predictions and ground truths can only be made possible if the guidance follows the values obtained from the ground-truth pose of the dataset. However, at inference time, in scenarios where we would not want to maximize a benchmark metric, like IoU in this case, the model can take various values for the guidance cues, which is exactly what we would want in real-life scenarios, where there is no ground truth and the guidance data represent the intent that is passed to the network in order to follow a given direction.

In Table 3, the main results of ego-trajectory segmentation can be seen. It presents the performance of the network in the selected evaluation metrics, with each combination of input guidance and guidance merging strategies, for both hard and soft segmentation. Guidance appears to slightly improve the metrics, especially when the rotation angle is involved. However, all the guidance data and merging mechanisms lead to an improvement compared with the baseline.

As seen in the table, guidance seems to lead to slight improvements in the hard segmentation problem, but in the other case, in the soft segmentation counterpart, the benefits are not that notable. This could have been caused by a lack of scenarios in the training data, where different paths are followed for the same geo-location point, i.e., an intersection where the vehicle goes in a direction in the first scenario and in another direction in the second scenario. This could enable the model to learn how to associate different guidance cues for the same input frame when different output targets are presented. However, the model seems to be somehow conditioned by the guidance input. As seen in Figure 5, we observe how different guidance cues affect the output of the hard segmentation model on different samples from the dataset. These samples have been picked such that at least one of the guidance methods yields an increase in mIoU greater than

0.3

when compared with the baseline hard segmentation model. For such cases, all the other guidance cue results have been added to the figure. This figure only shows the first merging mechanism, which consists in concatenating the backbone features with the guidance data in the form of the trajectory rotation angle, trajectory rotation angle category, and text embedding which describe the steering scenario.

However, there are also cases where the original model without any guidance performs better than the ones where guidance data are provided. Figure 6 shows such examples for the hard ego-trajectory segmentation InternImage model. Like in the examples above, where guidance led to some improvements, a similar methodology was used, i.e., taking the sample frames where the IoU between the ground truth and the prediction of the original model without guidance is higher than the IoU of all the guided models by a margin of at least

0.2

. We can observe that most of these cases are caused by the guided model, which predicts more possible trajectories, which leads to a decrease in the IoU score.

Another important discussion can start from the addition of the guidance cues in the inference of the model, which can be seen as a signal that should not be available during test time. This is partially true, since the trajectory rotation angles are taken from the dataset and are not computed during inference; therefore, they can be considered additional input data that cannot be accessed during live deployment. However, even if the other two guidance cues, namely, trajectory rotation angle category and the text description of the trajectory rotation angle, are derived from the same problematic trajectory rotation angle, when considering a live deployment scenario and different possible decisions for the vehicle to follow, for example, in intersections, it is important to have a cue that constraints the model to follow one of the possible paths. Therefore, even if the actual trajectory rotation angle cannot be one of these cues, since it would be almost equivalent to a steering command, a semantic or textual cue could still be a viable possibility. This means that even if the text cue is taken from the dataset when inference is performed, this is only performed because this is the easiest way to check that the model can follow the given cues without any manual work, which would involve generating additional trajectory labels. In a real-life scenario, the trajectory rotation angle might not be available, but text guidance can be given in different ways, such as guidance from the passenger or a pre-planned path that is converted into textual cues.

4.6. Trajectory Class Splitting

In this subsection, we highlight the main benefits that trajectory class splitting brings over the previous methods, where only two classes are involved in the segmentation problem. The results for the ego-trajectory segmentation with hard labels for each trajectory class after the split can be seen in Table 4. Note that when reporting the IoU and accuracy metrics for the model trained on more classes, during inference, the prediction is converted back into a single path class, regardless of whether it belongs to any of the five classes obtained after the split, and is compared against a ground truth that also comprises only one class (the background is discarded, because it increases the metrics artificially). This allows us to perform an easier comparison between the models trained with a single path class and multiple path classes for different trajectory angles. As shown in Table 4, even though the non-guided model performs worse after adding multiple trajectory classes, the performance improves drastically after adding guidance to the network, surpassing the best model trained on two classes by a margin of over

3 %

in IoU. Despite the clear benefit of adding multiple classes during training for the hard segmentation model, the soft segmentation head does not show such great improvements when compared with the models trained with only one trajectory class.

4.7. Trajectory Guidance Impact

Furthermore, we want to analyze the impact of guidance cues on the same frame by iterating through possible values of the guidance range for each type of feature, like rotation angle, rotation angle category, and text description of the scenario. Doing so enables us to visualize if the guidance features change the output of the segmentation network and if the change they lead to is correct. Figure 7 shows some frame samples and their corresponding ego-trajectory segmentation outputs for different values of the guidance cues when the ego-trajectory segmentation network is trained with the five trajectory classes described in Section 3.4 and one class for the rest of the scene. As observed in Figure 7, the guidance features merged into the network have an impact on the trajectory output, conditioning it to follow a path that is closer to the desired intent given as input through the guidance cue. Alongside the results from Table 4, the qualitative outputs from the figure prove the beneficial influence of trajectory class splitting when combined with the guidance cue mechanisms when constraining the segmentation masks to follow the ground-truth mask values, which is equivalent to guiding the network to provide segmentation outputs that follow the real trajectory of the vehicle as it is described in the dataset. As in the previous figures, the green trajectory is generated from the ground-truth pose, while the red trajectories are the ones predicted by our neural network, with yellow being the intersection between ground truth and predictions.

As observed in Figure 7, for the scenarios where the guidance conditions the network to predict a forward trajectory (the c column of the figure), the predictions are still limited to the bottom side of the frame, representing only a short portion of the trajectory. However, this is a positive result where we observe that the network learned that there is no possible trajectory to follow, since there is no road leading forward. Another example, shown in Figure 8, shows a case where the network also predicts the forward trajectory, in an intersection where all the directions are possible to follow. For this figure, only one of the guidance merging methods is seen (text embedding concatenation with the backbone features), because all of them yield very similar results.

5. Discussion and Conclusions

In this work, we show that ego-trajectory segmentation can be learned from labels obtained in a self-supervised manner and provide an additional mechanism for controlling the direction of the ego-trajectory segmentation mask. One of the first improvements that brought major benefits to the ego-trajectory performance was simply changing the segmentation network from DeepLabV3+ to InternImage. This change came with an increase of more than

3 %

in the mIoU metric for the ego-trajectory segmentation label, even when the baseline backbone model of the InternImage network was deployed. It is likely that further improvements can be achieved by leveraging the even bigger variants of the InternImage DCNv3-based backbone, but the computational resources and training time are drastically increased for such an architecture.

Besides the improvements brought by the state-of-the-art segmentation network, we explored the replacement of the classic segmentation problem with one that makes use of soft labels. Even though this does not match the best results highlighted in this work, we still think it has potential for further development, especially because it enables the application of various thresholds which can filter the final output trajectory based on necessity.

Furthermore, we proved in this work that segmentation guidance cues given to the segmentation network via the mechanisms described in the previous sections provide powerful constraints for the deep network, which help it learn to make associations between an intent, represented by the guidance cue, and the trajectory that corresponds to the intent. This result is also evidenced by the increase in IoU when applying the guidance cue mechanisms, compared with the baseline result. Additionally, our further improvement represented by the trajectory class splitting method brought even greater gains in the segmentation metrics, increasing IoU by around

3 %

when compared with the best model trained with only two classes, showing that it furthers constrains the network to follow the guidance cue intent provided as input alongside the RGB image.

As further improvements reserved for future work, we propose data augmentation for generating new possible ego-trajectory masks for a given training input frame and methods for smoothing the predicted ego-trajectories. Also, to make the entire pipeline more robust and transferable to different datasets and scenarios, it is crucial to train all the components on larger datasets, starting from the depth and ego-motion network, which is the backbone of label generation in our framework, to the segmentation network. Unlike other methods, this should be easier to achieve, since the data generation method is entirely self-supervised, meaning that new data can be generated from videos in the wild or from large-scale datasets, as long as the camera parameters are known.

Augmenting data for ego-trajectory segmentation can provide a broad range of new scenarios during the training time of the segmentation network; therefore, it can bring benefits when it comes to previously unseen scenarios during inference, leading to a better generalization ability. There are several directions that can be pursued for this purpose, as future work:

Road segmentation-based augmentation. The first such approach can leverage either a road segmentation model which can be trained for this specific task or a very large pre-trained model, like Segment Anything [54], which can be prompted with text inputs and whose zero-shot performance is very impressive. Once the segmented road is obtained, a set of possible paths can be generated by using the Ackermann steering geometry, and those who have a high overlap with the segmentation mask of the road will be randomly selected as targets. Using this method might be beneficial for exposing the model during training to different turning scenarios assigned to the same input frame.
MixUp augmentation. This approach suggests the usage of MixUp augmentation [55] for generating a new input image based on a mixture of more images from the dataset. This can provide new paths and labels obtained from different images, merging them into a single one, which provides a wide variety of possible scenarios with their assigned path labels. Another alternative is keeping the original image and mixing it with only the road path from the second image where the path segmentation mask is true.

However, when generating new ego-trajectory paths through data augmentation, we have to take into consideration the obstacles that might arise across the vehicle’s path, such as pedestrians and other vehicles, and make sure that only the paths that bypass the obstacles through viable steering scenarios are generated by such methods. Therefore, for future work, it is important to develop a mechanism that is also aware and able to detect obstacles across the possible vehicle trajectories.

The other future research direction implies the exploration of methods for smoothing and improving the current ego-trajectory prediction based on predictions from the past frames. A first approach to do this consists of taking the past ego-trajectory predictions and project them into the current frame by using the relative pose between the current frame and the past ones by leveraging the self-supervised pose estimation network.

Author Contributions

Conceptualization: A.M. and A.M.F.; formal analysis, methodology, software, and validation: A.M.; investigation and resources: A.M. and A.M.F.; data curation and writing—original draft preparation: A.M.; writing—review and editing: A.M. and A.M.F.; supervision, project administration, and funding acquisition: A.M.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101120657, project ENFIELD (European Lighthouse to Manifest Trustworthy and Green AI).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bojarski, M.; Del Testa, D.; Dworakowski, D.; Firner, B.; Flepp, B.; Goyal, P.; Jackel, L.D.; Monfort, M.; Muller, U.; Zhang, J.; et al. End to end learning for self-driving cars. arXiv 2016, arXiv:1604.07316. [Google Scholar]
Li, Z.; Yu, Z.; Lan, S.; Li, J.; Kautz, J.; Lu, T.; Alvarez, J.M. Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 14864–14873. [Google Scholar]
Iftikhar, S.; Zhang, Z.; Asim, M.; Muthanna, A.; Koucheryavy, A.; Abd El-Latif, A.A. Deep Learning-Based Pedestrian Detection in Autonomous Vehicles: Substantial Issues and Challenges. Electronics 2022, 11, 3551. [Google Scholar] [CrossRef]
Dasgupta, K.; Das, A.; Das, S.; Bhattacharya, U.; Yogamani, S.K. Spatio-Contextual Deep Network Based Multimodal Pedestrian Detection for Autonomous Driving. arXiv 2021, arXiv:2105.12713. [Google Scholar] [CrossRef]
Dairi, A.; Harrou, F.; Senouci, M.; Sun, Y. Unsupervised obstacle detection in driving environments using deep-learning-based stereovision. Robot. Auton. Syst. 2018, 100, 287–301. [Google Scholar] [CrossRef]
Su, P.; Luo, S.; Huang, X. Real-time dynamic SLAM algorithm based on deep learning. IEEE Access 2022, 10, 87754–87766. [Google Scholar] [CrossRef]
Lo, S.; Hang, H.; Chan, S.; Lin, J. Multi-Class Lane Semantic Segmentation using Efficient Convolutional Networks. arXiv 2019, arXiv:1907.09438. [Google Scholar]
Honda, H.; Uchida, Y. CLRerNet: Improving Confidence of Lane Detection with LaneIoU. arXiv 2023, arXiv:2305.08366. [Google Scholar]
Han, C.; Zhao, Q.; Zhang, S.; Chen, Y.; Zhang, Z.; Yuan, J. YOLOPv2: Better, Faster, Stronger for Panoptic Driving Perception. arXiv 2022, arXiv:2208.11434. [Google Scholar]
Che, Q.H.; Le, H.T.; Ngo, M.D.; Tran, H.L.; Phan, D.D. Intelligent Attendance System: Combining Fusion Setting with Robust Similarity Measure for Face Recognition. In Proceedings of the 2023 International Conference on Multimedia Analysis and Pattern Recognition (MAPR), Quy Nhon, Vietnam, 5–6 October 2023; IEEE: New York, NY, USA, 2023. [Google Scholar] [CrossRef]
Barnes, D.; Maddern, W.; Posner, I. Find Your Own Way: Weakly-Supervised Segmentation of Path Proposals for Urban Autonomy. arXiv 2017, arXiv:1610.01238. [Google Scholar]
Sun, Y.; Zuo, W.; Liu, M. See the Future: A Semantic Segmentation Network Predicting Ego-Vehicle Trajectory With a Single Monocular Camera. IEEE Robot. Autom. Lett. 2020, 5, 3066–3073. [Google Scholar] [CrossRef]
Cakir, S.; Gauß, M.; Häppeler, K.; Ounajjar, Y.; Heinle, F.; Marchthaler, R. Semantic Segmentation for Autonomous Driving: Model Evaluation, Dataset Generation, Perspective Comparison, and Real-Time Capability. arXiv 2022, arXiv:2207.12939. [Google Scholar]
Ullah, M.; Mohammed, A.; Alaya Cheikh, F. PedNet: A Spatio-Temporal Deep Convolutional Neural Network for Pedestrian Segmentation. J. Imaging 2018, 4, 107. [Google Scholar] [CrossRef]
Guo, Z.; Liao, W.; Xiao, Y.; Veelaert, P.; Philips, W. Weak segmentation supervised deep neural networks for pedestrian detection. Pattern Recog. 2021, 119, 108063. [Google Scholar] [CrossRef]
Chu, H.; Ma, H.; Li, X. Pedestrian instance segmentation with prior structure of semantic parts. Pattern Recog. Lett. 2021, 149, 9–16. [Google Scholar] [CrossRef]
Li, J.; Jiang, F.; Yang, J.; Kong, B.; Gogate, M.; Dashtipour, K.; Hussain, A. Lane-DeepLab: Lane semantic segmentation in automatic driving scenarios for high-definition maps. Neurocomputing 2021, 465, 15–25. [Google Scholar] [CrossRef]
Meyer, A.; Salscheider, N.O.; Orzechowski, P.F.; Stiller, C. Deep Semantic Lane Segmentation for Mapless Driving. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 869–875. [Google Scholar] [CrossRef]
Rasib, M.; Butt, M.A.; Riaz, F.; Sulaiman, A.; Akram, M. Pixel Level Segmentation Based Drivable Road Region Detection and Steering Angle Estimation Method for Autonomous Driving on Unstructured Roads. IEEE Access 2021, 9, 167855–167867. [Google Scholar] [CrossRef]
Zhou, T.; Brown, M.; Snavely, N.; Lowe, D.G. Unsupervised Learning of Depth and Ego-Motion from Video. arXiv 2017, arXiv:1704.07813. [Google Scholar]
Bian, J.; Li, Z.; Wang, N.; Zhan, H.; Shen, C.; Cheng, M.; Reid, I.D. Unsupervised Scale-consistent Depth and Ego-motion Learning from Monocular Video. arXiv 2019, arXiv:1908.10553. [Google Scholar]
Godard, C.; Mac Aodha, O.; Firman, M.; Brostow, G.J. Digging into Self-Supervised Monocular Depth Prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
Gordon, A.; Li, H.; Jonschkowski, R.; Angelova, A. Depth from Videos in the Wild: Unsupervised Monocular Depth Learning from Unknown Cameras. arXiv 2019, arXiv:1904.04998. [Google Scholar]
Mahjourian, R.; Wicke, M.; Angelova, A. Unsupervised Learning of Depth and Ego-Motion from Monocular Video Using 3D Geometric Constraints. arXiv 2018, arXiv:1802.05522. [Google Scholar]
Wagstaff, B.; Kelly, J. Self-Supervised Scale Recovery for Monocular Depth and Egomotion Estimation. arXiv 2020, arXiv:2009.03787. [Google Scholar]
Watson, J.; Aodha, O.M.; Prisacariu, V.; Brostow, G.J.; Firman, M. The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth. arXiv 2021, arXiv:2104.14540. [Google Scholar]
Gu, X.; Yuan, W.; Dai, Z.; Zhu, S.; Tang, C.; Tan, P. DRO: Deep Recurrent Optimizer for Structure-from-Motion. arXiv 2021, arXiv:2103.13201. [Google Scholar]
Kayalibay, B.; Jensen, G.; van der Smagt, P. CNN-based segmentation of medical imaging data. arXiv 2017, arXiv:1701.03056. [Google Scholar]
Chen, L.; Bentley, P.; Mori, K.; Misawa, K.; Fujiwara, M.; Rueckert, D. DRINet for medical image segmentation. IEEE Trans. Med. Imaging 2018, 37, 2453–2462. [Google Scholar] [CrossRef]
Guo, Z.; Li, X.; Huang, H.; Guo, N.; Li, Q. Deep learning-based image segmentation on multimodal medical imaging. IEEE Trans. Radiat. Plasma Med. Sci. 2019, 3, 162–169. [Google Scholar] [CrossRef]
Milioto, A.; Stachniss, C. Bonnet: An open-source training and deployment framework for semantic segmentation in robotics using cnns. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; IEEE: New York, NY, USA, 2019; pp. 7094–7100. [Google Scholar]
Hurtado, J.V.; Valada, A. Semantic scene segmentation for robotics. In Deep Learning for Robot Perception and Cognition; Elsevier: Amsterdam, The Netherlands, 2022; pp. 279–311. [Google Scholar]
Ko, T.Y.; Lee, S.H. Novel method of semantic segmentation applicable to augmented reality. Sensors 2020, 20, 1737. [Google Scholar] [CrossRef]
Tanzi, L.; Piazzolla, P.; Porpiglia, F.; Vezzetti, E. Real-time deep learning semantic segmentation during intra-operative surgery for 3D augmented reality assistance. Int. J. Comput. Assist. Radiol. Surg. 2021, 16, 1435–1445. [Google Scholar] [CrossRef] [PubMed]
Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2014, arXiv:1411.4038. [Google Scholar]
Chen, L.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv 2016, arXiv:1606.00915. [Google Scholar] [CrossRef]
He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2018, arXiv:1703.06870. [Google Scholar]
Ren, S.; He, K.; Girshick, R.B.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2015, arXiv:1506.01497. [Google Scholar] [CrossRef]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
Bao, H.; Dong, L.; Piao, S.; Wei, F. BEiT: BERT Pre-Training of Image Transformers. arXiv 2022, arXiv:2106.08254. [Google Scholar]
Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. arXiv 2023, arXiv:2211.05778. [Google Scholar]
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. arXiv 2017, arXiv:1703.06211. [Google Scholar]
Mihalea, A.; Samoilescu, R.F.; Florea, A.M. Self-Supervised Steering and Path Labeling for Autonomous Driving. Sensors 2023, 23, 8473. [Google Scholar] [CrossRef]
Drews, P.; Williams, G.; Goldfain, B.; Theodorou, E.A.; Rehg, J.M. Aggressive Deep Driving: Combining Convolutional Neural Networks and Model Predictive Control. In Proceedings of Machine Learning Research, Proceedings of the 1st Annual Conference on Robot Learning, Mountain View, CA, USA, 13–15 November 2017; Levine, S., Vanhoucke, V., Goldberg, K., Eds.; PMLR: Cambridge, MA, USA, 2017; Volume 78, pp. 133–142. [Google Scholar]
Zhou, W.; Worrall, S.; Zyner, A.; Nebot, E. Automated Process for Incorporating Drivable Path into Real-Time Semantic Segmentation; IEEE Press: New York, NY, USA, 2018; pp. 1–6. [Google Scholar] [CrossRef]
Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets Robotics: The KITTI Dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
Chen, L.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. arXiv 2018, arXiv:1802.02611. [Google Scholar]
Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention Mask Transformer for Universal Image Segmentation. arXiv 2021, arXiv:2112.01527. [Google Scholar]
Loshchilov, I.; Hutter, F. Fixing Weight Decay Regularization in Adam. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Sudre, C.H.; Li, W.; Vercauteren, T.; Ourselin, S.; Cardoso, M.J. Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations. arXiv 2017, arXiv:1707.03237. [Google Scholar]
Wang, Z.; Ning, X.; Blaschko, M.B. Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels. arXiv 2023, arXiv:2302.05666. [Google Scholar]
Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RL, USA, 16–21 June 2012. [Google Scholar]
Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond Empirical Risk Minimization. arXiv 2018, arXiv:1710.09412. [Google Scholar]

Figure 1. The framework used for learning the vehicle’s ego-trajectory segmentation through supervised training on labels obtained through a self-supervised approach. The blue shape represents an input dataset, orange trapezoids denote neural networks, the turquoise rectangles are intermediary neural network outputs and the dark red ones are the final processed outputs of the networks.

Figure 2. Comparison between the Mask2Former segmentation head without any modification and our two custom versions of guidance merging. (a) No guidance. (b) Concatenation guidance. (c) Cross-attention guidance.

Figure 3. Comparison between the ego-trajectory segmentation results obtained with DeepLabV3 vs. the ground truth and the results of InternImage vs. the same ground truth. Network results are denoted with red, while the ground-truth trajectory is represented with green. The intersection between the two is yellow. (a) DeepLabV3+ vs. ground truth (b) InternImage vs. ground truth.

Figure 4. Soft segmentation results as a heatmap and after applying two different thresholds. (a) Soft segmentation output. (b) Threshold: 0.2. (c) Threshold: 0.35.

Figure 5. Ego-trajectory segmentation output samples for the hard segmentation model with and without guidance data. (a) No guidance. (b) Angle guidance. (c) Category guidance. (d) Text guidance.

Figure 6. Ego-trajectory segmentation output samples for the hard segmentation model with and without guidance data. (a) No guidance. (b) Angle guidance. (c) Category guidance. (d) Text guidance.

Figure 7. Ego-trajectory segmentation outputs for the same frame, emphasizing the effect of the guidance cues on the predicted trajectory. Each row shows outputs for a type of guidance merging mechanism, while each column represents a separate value of the additional input guidance.

Figure 8. Segmentation results in an intersection scenario, where different guidance values of the same method are applied. (a) Text: tight left. (b) Text: slight left. (c) Text: forward. (d) Text: slight right. (e) Text: tight right.

Table 1. Evaluation results when comparing the DeepLabV3+ and InternImage segmentation models. The change from DeepLabV3+ to InternImage provides a considerable increase in segmentation performance on our task.

	Model	DeepLabV3+		InternImage
	Labels	Acc (%)	IoU (%)	Acc (%)	IoU (%)
SS Labels		74.86	65.10	83.74	72.05
GT Labels		70.20	60.76	76.30	64.09

Table 2. Soft segmentation model results after applying different thresholds.

	InternImage Soft
Threshold	Acc (%)	IoU (%)
0.10	94.87	55.09
0.15	93.78	59.18
0.20	92.62	62.39
0.25	91.30	65.08
0.30	89.75	67.48
0.35	87.97	69.52
0.40	85.92	70.95
0.45	83.53	71.93
0.50	80.87	71.45
0.55	77.88	70.27
0.60	74.56	68.48
0.65	70.89	66.09
0.70	66.86	63.12
0.75	62.43	59.58
0.80	57.51	55.40
0.85	51.90	50.42
0.90	45.15	44.21

Table 3. Evaluation results with different guidance mechanisms.

	Hard Segmentation		Soft Segmentation	Soft Segmentation @ 0.45
Guidance	Acc (%)	IoU (%)	JML (%)	Acc (%)	IoU (%)
No guidance	83.74	72.05	68.55	83.11	71.84
Rotation angle concatenation	85.29	74.44	70.85	85.16	72.99
Category concatenation	85.15	73.54	68.85	84.47	72.83
Text embedding concatenation	85.52	73.43	70.01	83.11	72.91
Rotation angle cross-attention	84.70	74.10	68.49	83.90	72.81
Category cross-attention	84.98	73.94	68.18	83.33	72.75
Text embedding cross-attention	84.75	73.13	70.09	84.80	73.00

Table 4. Evaluation results comparing the best-performing model before class splitting with various models trained with the 5 different trajectory classes and each of the guidance mechanisms.

	Hard Segmentation		Soft Segmentation	Soft Segmentation @ 0.45
Guidance	Acc (%)	IoU (%)	JML (%)	Acc (%)	IoU (%)
Best model with one class	85.29	74.44	70.85	85.16	72.99
No guidance	83.73	70.79	64.79	78.05	69.02
Rotation angle concatenation	88.57	77.85	70.72	85.12	75.48
Category concatenation	87.55	76.91	69.87	83.77	74.27
Text embedding concatenation	87.85	77.21	70.29	85.42	74.81
Rotation angle cross-attention	86.84	77.32	67.86	81.46	72.32
Category cross-attention	86.50	76.13	68.07	80.19	72.33
Text embedding cross-attention	85.56	75.84	68.61	84.06	73.07

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Mihalea, A.; Florea, A.M. Vehicle Ego-Trajectory Segmentation Using Guidance Cues. Appl. Sci. 2024, 14, 7776. https://doi.org/10.3390/app14177776

AMA Style

Mihalea A, Florea AM. Vehicle Ego-Trajectory Segmentation Using Guidance Cues. Applied Sciences. 2024; 14(17):7776. https://doi.org/10.3390/app14177776

Chicago/Turabian Style

Mihalea, Andrei, and Adina Magda Florea. 2024. "Vehicle Ego-Trajectory Segmentation Using Guidance Cues" Applied Sciences 14, no. 17: 7776. https://doi.org/10.3390/app14177776

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Vehicle Ego-Trajectory Segmentation Using Guidance Cues

Abstract

1. Introduction

2. Related Work

2.1. Self-Supervised Ego-Motion Estimation

2.2. Semantic Segmentation

2.3. Self-Supervised Ego-Trajectory Labeling

2.4. Ego-Trajectory Segmentation

3. Methods

3.1. Baseline Models

3.2. Soft Segmentation Head

3.3. Segmentation Guidance

3.4. Trajectory Class Splitting

3.5. Training Details

4. Evaluation and Results

4.1. Training and Evaluation Dataset

4.2. Evaluation Metrics

4.3. DeepLabV3+ vs. InternImage

4.4. Soft Segmentation Head

4.5. Segmentation Guidance

4.6. Trajectory Class Splitting

4.7. Trajectory Guidance Impact

5. Discussion and Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI