Article

Enhancing Autonomous Visual Perception in Challenging Environments: Bilateral Models with Vision Transformer and Multilayer Perceptron for Traversable Area Detection

by Claudio Urrea * and Maximiliano Vélez
Electrical Engineering Department, Faculty of Engineering, University of Santiago of Chile, Av. Libertador Bernardo O’Higgins 3363, Estación Central, Santiago 9170022, Chile
*
Author to whom correspondence should be addressed.
Technologies 2024, 12(10), 201; https://doi.org/10.3390/technologies12100201
Submission received: 24 September 2024 / Revised: 12 October 2024 / Accepted: 16 October 2024 / Published: 17 October 2024

Abstract

The development of autonomous vehicles has grown significantly recently due to the promise of improving safety and productivity in cities and industries. The scene perception module has benefited from the latest advances in computer vision and deep learning techniques, allowing the creation of more accurate and efficient models. This study develops and evaluates semantic segmentation models based on a bilateral architecture to enhance the detection of traversable areas for autonomous vehicles on unstructured routes, particularly in datasets where the distinction between the traversable area and the surrounding ground is minimal. The proposed hybrid models combine Convolutional Neural Networks (CNNs), Vision Transformer (ViT), and Multilayer Perceptron (MLP) techniques, achieving a balance between precision and computational efficiency. The results demonstrate that these models outperform the base architectures in prediction accuracy, capturing distant details more effectively while maintaining real-time operational capabilities.

1. Introduction

As artificial intelligence has advanced, autonomous vehicles have emerged as a compelling research topic for both industry and academia [1], with various modules being developed for their implementation [2]. One of these modules enables environmental perception so that the vehicle can create an accurate representation of the route and estimate its position in the road ecosystem [3]. Furthermore, a good representation of the environment allows the selection of optimal parameters to save energy during vehicle control, plan routes to reduce traffic congestion, and increase safety during driving [4].
During navigation, the vehicle must detect, through embedded perceptual sensors [5], the traversable area and the types of obstacles to avoid. The most commonly implemented perceptual sensors for the operation of autonomous ground vehicles are the RGB camera, LiDAR, and RADAR [6]. RGB cameras are the most popular, sometimes used as the sole sensor to reconstruct the vehicle’s environment [7], because they capture texture, can reach high resolutions, consume little power, and are more affordable than RADAR or LiDAR [8].
Semantic segmentation consists of assigning a label to each pixel of an input image [9]. Semantic segmentation models based on deep learning have achieved good performance, allowing the description of complex scenes and providing information with which the vehicle can predict the steering angle [10], determine lane boundaries [11], detect obstacles [12], and infer the area enabled for vehicle traffic [13]. Studies have been conducted on structured [14,15] and unstructured [16,17] routes. The former are paved routes marked with signage, usually located in urban areas. The latter are unpaved, unmarked routes, which can be found in rural areas and industrial sites [18,19]. Scenes with unstructured routes are challenging for deep-learning-based semantic segmentation models because the unmarked route differs little from the non-traversable area, making it difficult to separate the two classes [20].
Despite significant advances in semantic segmentation for precise detection of the traversable area on unstructured routes, capturing details or high-frequency characteristics in images remains a challenge, especially in environments with low contrast between the route and the rest of the scene [21,22]. Robust models based on Convolutional Neural Networks (CNNs) have been developed for this purpose, but the most common solution has been to add more parameters to the base architecture [23] or to use knowledge distillation techniques [24], which makes the model difficult to deploy on devices with limited storage capacity and also complicates the training process.
Recent publications show that hybrid architectures based on Convolutional Neural Networks, Vision Transformer (ViT), or Multilayer Perceptron (MLP) capture global characteristics with fewer parameters than pure models, maintaining a high inference capacity [25,26].
Experiments with hybrid models of binary semantic segmentation in low-contrast images show good performance [27,28]. These use CNNs to extract local features and integrate ViT or MLP as an attention module to model global relationships between features extracted by CNNs layers.
Based on the good performance of hybrid architectures on low-contrast images and their suitability for lightweight systems, this study implements this type of architecture for the detection of vehicular routes in environments where roads are not clearly visible. ViT- and MLP-based skeletons are developed and evaluated within a bilateral model called BiSeNet, combining and evaluating the three techniques (Convolutional Neural Networks, ViT, and MLP) to generate a robust, lightweight model that operates in real time.
The contributions of the present research are:
  • A new architecture based on the bilateral model is developed for the semantic segmentation of unstructured routes to detect the traversable area for a vehicle, achieving greater precision in capturing the details of these scenes compared to the architectures used as backbones. The proposed architectures, which combine CNNs, MLP, and ViT techniques, demonstrate a better ability to capture high frequencies in the images of the databases used.
  • A lighter version of the BiSeNet architecture is developed, making it implementable on equipment with limited storage capacity, while maintaining real-time inference. When using the BiSeNet architecture with ViT and MLP as the backbone, it is approximately three times lighter than the BiSeNet version with Xception as the backbone.
  • The bilateral BiSeNet architecture is implemented and its performance evaluated with ViT- and MLP-based skeletons on the public CaT and Rellis 3D databases, which contain scenes with low contrast between the area traversable by a standard vehicle and the rest of the ground. The databases were adapted for the training of binary semantic segmentation architectures.
This article consists of seven sections. Section 2 discusses state-of-the-art studies, presenting the main databases created on unstructured routes and models that have been used to study semantic segmentation in route images. The methodology used for the study of the models is described in Section 3, detailing the databases used and the architecture of the studied models. The results obtained and comparative studies are presented in Section 4 and their discussion is carried out in Section 5. Section 6 presents the conclusions of this study and, finally, Section 7 outlines potential future work.

2. State-of-the-Art

2.1. Public Databases of Unstructured Routes

There are many databases that have been published for the study of scene understanding with applications in autonomous vehicles. The vast majority of these databases present 2D annotations for RGB camera images, such as mask annotations for training semantic segmentation models of scenes and object bounding boxes for training object detection models. Most of the existing databases are created on urban roads (on-road), the most used in this category are KITTI [29], CamVid [30], and CityScapes [31].
Databases for the study of vehicle navigation in off-road scenes are scarcer. Generally, these types of databases are created in environments with various types of obstacles, scarce signage, and variable lighting conditions. Some examples of such databases are CaT [32] and AutoMine [33], which focus on capturing industrial environments, such as forestry and mining, respectively. On the other hand, IDD [34], set in an Indian city, is also considered an off-road scene database due to traffic conditions and poor road infrastructure.

2.2. Convolutional Neural Network-Based Models

The first methods for semantic segmentation were developed using CNNs and have led to rapid development in the field of image semantic segmentation. FCN [35] is a pioneer in the encoder–decoder scheme that optimized the segmentation process. Then, UNET [36] improved this scheme by connecting features between the encoder and decoder at different levels, allowing features to remain after the downsampling process. Based on the UNET architecture, BiSeNet [37] was designed, reducing the feature fusion links and implementing a convolutional attention module. This allows reducing the number of parameters and operations, improving the inference time compared to UNET [38].
The first version of BiSeNet consisted of two paths: a contextual one and a spatial one; whereas in its best version, the contextual path is composed of Xception [39] as a backbone and convolutional attention modules. Xception is an architecture that uses depthwise separable convolution [40] in most of its layers. This technique consists of applying depthwise convolution that acts on each channel individually, and then applying pointwise convolution that combines the information from the channels. This approach allows for reducing the number of parameters and computational demands compared to traditional convolutions.
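To illustrate the parameter savings described above, the following is a minimal PyTorch sketch of a depthwise separable convolution (a generic example, not the exact Xception implementation); the channel sizes and layer names are chosen only for the comparison.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise convolution (one filter per channel) followed by a
    1x1 pointwise convolution that mixes information across channels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

# Parameter comparison against a standard convolution with the same shape.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
separable = DepthwiseSeparableConv(64, 128)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(separable))  # 73856 vs 8960 parameters
```

For a 3 × 3 kernel with 64 input and 128 output channels, the separable version needs roughly one-eighth of the parameters of the standard convolution, which is the kind of saving BiSeNet exploits in its Xception backbone.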

2.3. Vision Transformer-Based Models

The self-attention mechanism of visual transformers allows correlating distant sections and features of the image. The first fully developed architecture using this methodology for semantic segmentation has been ViT [41], which presents good performance, especially in capturing features in global contexts, but has a high computational cost of O(N²), affecting the inference speed. According to [42,43], memory access is the most critical factor affecting the speed of models. Operators, such as frequent reshaping, element-wise addition, and tensor normalization, consume time in accessing memory units. Although there are methods to address this inefficiency with sparse attention [44,45] and low-rank approximation [46,47], they often sacrifice prediction accuracy for inference time.
Recently, methods have been developed that operate in real-time without sacrificing prediction accuracy by reducing memory access time. Swin Transformer [48] divides the image into small windows that are processed locally by the attention modules, reducing cache memory access. The shifted window scheme provides efficiency by limiting the calculation of self-attention to non-overlapping windows while allowing cross-window connection. This architecture scales flexibly to models of various sizes with linear computational complexity. LinFormer [49] introduces a linear transformation that reduces the dimensionality of the input sequences before calculating attention, reducing the amount of memory needed to store computed values. The main idea is to add two linear projections to the keys and values before applying the attention mechanism. EfficientViT [50] replaces the MHSA module with a Cascade Group Attention (CGA) module, in which the image features are partitioned channel-wise across different attention heads. The architecture was designed in a modular way, using fewer memory-bound self-attention layers and memory-efficient FFN layers for communication between channels. After each FFN module, depthwise convolution is used to increase token interaction. FastViT [51] uses structural reparameterization to reduce memory access cost, reorganizing the operations that normally occur within a residual block (skip connections) into simpler ones. In addition, overparameterization [52,53] is implemented, which, although it increases the number of parameters beyond what is necessary, facilitates model training and makes the model more flexible at inference.
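To make the memory argument concrete, the sketch below restricts self-attention to non-overlapping windows, the partitioning idea used by Swin Transformer. It is a simplified illustration (plain windowed attention built on PyTorch's MultiheadAttention, without shifted windows or relative position bias), not the implementation of any of the cited models.

```python
import torch
import torch.nn as nn

def window_partition(x, window):
    """Split a (B, H, W, C) feature map into non-overlapping windows
    of shape (B * num_windows, window * window, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

class WindowAttention(nn.Module):
    """Self-attention restricted to local windows: cost grows linearly
    with the number of windows instead of quadratically with all tokens."""
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, H, W, C); H and W divisible by window
        B, H, W, C = x.shape
        w = window_partition(x, self.window)        # (B*nW, window*window, C)
        out, _ = self.attn(w, w, w)                 # attention within each window
        nH, nW = H // self.window, W // self.window
        out = out.view(B, nH, nW, self.window, self.window, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

feat = torch.randn(2, 64, 64, 96)                   # 4096 tokens per image
print(WindowAttention(96)(feat).shape)              # torch.Size([2, 64, 64, 96])
```

With 8 × 8 windows, each attention matrix covers 64 tokens instead of the full 4096, which is where the memory and speed gains of window-based designs come from.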

2.4. Multilayer Perceptron-Based Models

With the idea of simplifying deep-learning models, models based only on MLPs [54] and residual connections [55] have recently been created. Compared to CNNs and transformers, these architectures involve less inductive bias and, therefore, are more adaptable to changes during inference [56]. MLP-based models are composed of three parts: patch embedding, channel-mixing, and patch-mixing. The first stage subdivides the image and then passes it to the other stages that relate features and patches [57].
MLP-Mixer [54] was the first model proposed for image classification using only perceptron layers. This model uses multiple perceptron layers to replace the convolution operation of CNNs and the self-attention mechanism of transformers. However, this architecture is inefficient in the way it relates tokens, mixing different tokens without discriminating that some are more related to each other than others. To optimize the mixing process, variations of this method have been designed. ASMLP [58] reduces the number of relationships in the mixing process through bounded horizontal and vertical shifts. Its main block is composed of a skip connection, an MLP, and axial shift operations; in the latter, vertical and horizontal shifts are used to extract features from the images, and a channel projection transforms the features using a linear layer. WaveMLP [59] dynamically weights tokens by representing their features as a wave function in which the amplitude is the original feature and the phase contains semantic content, allowing an efficient representation of tokens and facilitating the algorithm's access to this information. The model is built by stacking blocks of phase-aware token-mixing, channel-mixing MLPs, and normalization layers in four stages. The gMLP architecture [60] designs gating operations that act as information filters, discriminating between features that should be processed and those that should be ignored. This is achieved through a linear gating operation, which applies element-wise multiplication between a linear function with trainable parameters and the projected channels. S2MLP [61] introduces a spatial-shift module that replaces the channel-mixing and token-mixing modules of the MLP-Mixer model with a spatial shift that relates image features locally. This module contains no trainable parameters, making the architecture simple and scalable. Before and after applying the spatial shift, fully connected layer blocks are applied that weight the relationships between image channels.
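The parameter-free spatial shift mentioned for S2MLP can be illustrated with a short sketch. The block below is a simplified version (channel MLP, shift of four channel groups, channel MLP with a residual connection) and does not reproduce the exact S2MLP layer layout.

```python
import torch
import torch.nn as nn

def spatial_shift(x):
    """Parameter-free spatial shift: four channel groups are shifted one
    pixel left/right/up/down so that a following MLP can mix neighboring
    positions without convolutions (S2MLP-style, simplified)."""
    B, H, W, C = x.shape
    g = C // 4
    out = x.clone()
    out[:, :, 1:, :g]      = x[:, :, :-1, :g]        # shift right along width
    out[:, :, :-1, g:2*g]  = x[:, :, 1:, g:2*g]      # shift left along width
    out[:, 1:, :, 2*g:3*g] = x[:, :-1, :, 2*g:3*g]   # shift down along height
    out[:, :-1, :, 3*g:]   = x[:, 1:, :, 3*g:]       # shift up along height
    return out

class S2Block(nn.Module):
    """Channel MLP -> spatial shift -> channel MLP, with a residual connection."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, dim * expansion)
        self.fc2 = nn.Linear(dim * expansion, dim)

    def forward(self, x):                        # x: (B, H, W, C)
        y = self.fc1(self.norm(x))
        y = spatial_shift(torch.relu(y))         # expanded channels divisible by 4
        return x + self.fc2(y)

tokens = torch.randn(2, 32, 32, 64)
print(S2Block(64)(tokens).shape)                 # torch.Size([2, 32, 32, 64])
```

Because the shift itself has no trainable parameters, all learning capacity sits in the two linear layers, which is what makes this family of mixers lightweight and easy to scale.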

2.5. Lightweight Hybrid Models That Operate in Real-Time

Considering that CNNs are fast and efficient at capturing local features and that ViT and MLPs are better at capturing global relationships, lightweight models that combine these techniques have been created, improving the ability to model complex dependencies, capturing fine details, and operating in real-time. The following presents strategies that improve these two aspects using deep learning techniques that often mix CNNs, ViT, or MLP.
Segformer [62] is composed of an MiT-based backbone that applies overlapped patch merging and positional-encoding-free, then applies an efficient attention block. MLP blocks are used to decode the information captured by the transformer architecture. MaxViT [63] reduces the quadratic complexity of typical attention modules to a linear complexity, without losing global feature capture. These modules are interleaved with depthwise CNN modules and squeeze-excitation [64] that optimize local feature capture. MobileViT [65] interleaves MobileNetV2 blocks with a transformer that uses CNNs to optimize the capture of local and global features. The MobileNetV2 blocks apply depthwise separable convolutions to reduce computational complexity, while the transformer inserts global self-attention to capture long-range dependencies.

2.6. Models for Semantic Segmentation of Unstructured Routes

In the design of architectures that operate on unstructured routes, it is important to consider the capture of different textures and their global context in the scene. The following presents strategies that improve these two aspects using deep-learning techniques that often mix CNNs, ViT, or MLP.
In [23], the CMSNet architecture is designed to test different configurations. ResNet or VGG can be used as the backbone. To increase feature capture, various pyramid feature groupings, such as ASPP, GPP, and SPP, were selected. Experiments were carried out on a self-created database called Kamino, achieving a performance of 87% mIoU and an inference speed of 29 FPS. The OFFSEG [66] architecture uses the MobileNetV2 model pretrained on the ImageNet database as a classifier, followed by a color-based segmentation methodology, separating the scene into four classes. The evaluation was carried out on the Rellis3D and RUGD databases, achieving an average accuracy of 97.3%. Another study, based on the encoder–decoder scheme, proposes and evaluates pairings of models that perform well in one of the two roles. The experiments were carried out on a database generated by the authors, where the best performance was obtained by the ResNet and DeepLabV3 pair as encoder and decoder, respectively, with an mIoU of 85.2% and an inference speed of 37 FPS, outperforming the models used separately.
The study [67] designs an architecture to capture the traversable area in mining environments, generating its own database. The segmentation network is composed of convolutional layers, transformers, and linear classifiers. Its lightweight version presents a performance of 92.2% mIoU, which exceeds models such as Swin-T and PSPNet.

3. Methodology

The present study followed the methodology presented in Figure 1. It began with a review of the state-of-the-art in public databases and architectures for real-time semantic image segmentation. Then, the databases and the type of base architecture were selected to conduct the study. Finally, the models were trained to evaluate their performance, considering the application where they are implemented.
In this scheme, it is observed that this research begins by taking two routes. On one hand, a review of public databases based on unstructured routes is carried out, and on the other hand, a review of state-of-the-art architectures is conducted. Both routes are linked for the training, validation, and testing of new architectures in challenging unstructured route databases. The arrows graphically indicate the logical sequence in which the present investigation was carried out.
The research stages are detailed below.

3.1. Database Selection and Processing

To evaluate the designed models, public databases created in forests and fields were analyzed. From this group, databases presenting scenes with low contrast between the route and the rest of the ground were selected, with good lighting and weather conditions and without moving obstacles. The annotations had to include masks delimiting the traversable area from the rest of the scene.
None of the analyzed databases present binary annotations; therefore, the original annotations were modified considering the differentiation of the traversable route from the rest of the scene. This process was carried out by grouping the classes of the original annotations into two classes and then binarizing the image.

3.2. Selection and Design of Bilateral Models

Bilateral models for semantic segmentation operate in real-time without sacrificing inference accuracy. These models are composed of two paths, a contextual path and a spatial path [37]. The first path is responsible for increasing the receptive field of the neural network. The second preserves the details of the image that are lost through the operations in the contextual path [68].
Due to the function it performs, the contextual path requires more computational resources and execution time than the spatial path. Recent advances in ViT- and MLP-based models offer an alternative to improve the capture of global features in the contextual path, reducing the number of parameters of the base architecture without sacrificing real-time operation. After a thorough analysis of architectures, two ViT models and two MLP models were selected for implementation in the bilateral model, considering the number of parameters, implementation simplicity, and flexibility during inference. Additionally, architectures were designed where a ViT backbone and an MLP mixing module, acting as the attention module, were connected in series in the contextual path.
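A minimal sketch of this two-path layout is given below, assuming a placeholder contextual backbone and generic fusion layers; it illustrates the bilateral structure only and is not the BiSeNet configuration evaluated in this study.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(in_ch, out_ch, stride=2):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride, 1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class BilateralSegmenter(nn.Module):
    """Two-path layout: a shallow spatial path keeps fine detail at 1/8
    resolution, while a contextual path (any backbone plus attention module)
    provides a large receptive field; a fusion module merges both."""
    def __init__(self, backbone, backbone_ch, num_classes=2):
        super().__init__()
        self.spatial = nn.Sequential(conv_bn_relu(3, 64), conv_bn_relu(64, 64),
                                     conv_bn_relu(64, 128))       # 1/8 resolution
        self.context = backbone                                   # e.g. ViT/MLP hybrid
        self.reduce = nn.Conv2d(backbone_ch, 128, 1)
        self.fuse = nn.Sequential(nn.Conv2d(256, 128, 3, 1, 1),
                                  nn.BatchNorm2d(128), nn.ReLU(inplace=True),
                                  nn.Conv2d(128, num_classes, 1))

    def forward(self, x):
        detail = self.spatial(x)                                  # (B, 128, H/8, W/8)
        context = self.reduce(self.context(x))                    # backbone features
        context = F.interpolate(context, size=detail.shape[2:],
                                mode='bilinear', align_corners=False)
        logits = self.fuse(torch.cat([detail, context], dim=1))
        return F.interpolate(logits, size=x.shape[2:],
                             mode='bilinear', align_corners=False)

# Stand-in contextual backbone: any module mapping (B, 3, H, W) to (B, C, H/16, W/16).
toy_backbone = nn.Sequential(conv_bn_relu(3, 64), conv_bn_relu(64, 128),
                             conv_bn_relu(128, 256), conv_bn_relu(256, 256))
model = BilateralSegmenter(toy_backbone, backbone_ch=256)
print(model(torch.randn(1, 3, 256, 256)).shape)   # torch.Size([1, 2, 256, 256])
```

Swapping `toy_backbone` for a ViT- or MLP-based feature extractor is the kind of substitution explored in the contextual path of the proposed models.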

3.3. Training of Hybrid Bilateral Models

The databases were divided considering 80% of the images for training, 10% for validation, and 10% for testing the model.
To avoid overfitting, the training and validation data were augmented 10 times by applying 20% variations in contrast, brightness, saturation, and hue. The images were also rotated considering 10° as the angular limit, and the RGB channel values of the image were varied by a maximum of one unit with a 50% probability of occurrence.
During training, the Mean Intersection Over Union (mIoU) performance index was used as the cost function, and the Adam optimization algorithm was used to optimize the parameters. The batch size was 32 images, considering 1000 epochs to train the models and a learning rate of 1 × 10−3. The iterations were performed on an NVidia A100 GPU with 80 GB of memory.
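The description above can be translated into a short training sketch. Since the exact loss formulation is not detailed, the example assumes a differentiable soft-IoU loss as a stand-in for the mIoU cost function, together with the reported Adam optimizer, learning rate of 1 × 10−3, and batch size of 32; the one-layer placeholder network is hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_iou_loss(logits, target, eps=1e-6):
    """Differentiable IoU-style loss for binary segmentation, averaged over
    the two classes; `target` holds integer class indices (0 or 1)."""
    probs = torch.softmax(logits, dim=1)                          # (B, 2, H, W)
    onehot = F.one_hot(target, num_classes=2).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    union = (probs + onehot - probs * onehot).sum(dim=(0, 2, 3))
    return 1.0 - ((inter + eps) / (union + eps)).mean()

# Optimization setup matching the hyperparameters reported above.
model = torch.nn.Conv2d(3, 2, 1)                  # placeholder segmentation network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(32, 3, 256, 256)             # one batch of 32 images
masks = torch.randint(0, 2, (32, 256, 256))       # binary ground-truth masks
loss = soft_iou_loss(model(images), masks)
loss.backward()
optimizer.step()
```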

3.4. Evaluation of Hybrid Bilateral Models

To measure the accuracy of the segmentation algorithm, the performance indices Mean Intersection Over Union (mIoU) and Dice Similarity Coefficient (DSC) are expressed in terms of the number of pixels correctly classified as class $i$ ($TP_i$), the pixels incorrectly classified as class $i$ ($FP_i$), and the pixels of class $i$ that were not classified as such ($FN_i$). $M$ is the total number of classes to be inferred:

$$\mathrm{mIoU} = \frac{1}{M}\sum_{i=1}^{M}\frac{TP_i}{TP_i + FP_i + FN_i}$$

$$\mathrm{DSC} = \frac{1}{M}\sum_{i=1}^{M}\frac{2\,TP_i}{2\,TP_i + FP_i + FN_i}$$

The studied models are expected to predict two classes, the traversable route and the rest of the scene; therefore, $M$ is equal to two.
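For reference, a small sketch of these two metrics computed directly from per-class pixel counts is given below (NumPy, integer class masks); it follows the formulas above rather than any particular library implementation.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=2):
    """Compute mIoU and DSC from per-class TP, FP, FN pixel counts,
    following the two formulas above. `pred` and `gt` are integer masks."""
    ious, dices = [], []
    for i in range(num_classes):
        tp = np.sum((pred == i) & (gt == i))
        fp = np.sum((pred == i) & (gt != i))
        fn = np.sum((pred != i) & (gt == i))
        ious.append(tp / (tp + fp + fn + 1e-9))
        dices.append(2 * tp / (2 * tp + fp + fn + 1e-9))
    return np.mean(ious), np.mean(dices)

pred = np.random.randint(0, 2, (256, 256))
gt = np.random.randint(0, 2, (256, 256))
miou, dsc = segmentation_metrics(pred, gt)
print(f"mIoU={miou:.3f}  DSC={dsc:.3f}")
```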
The images used to test the designed models were augmented 10 times by applying variations of 20% in contrast, brightness, saturation, and hue. The images were also rotated considering 10° as the angular limit, and the RGB channel values of the image were varied by a maximum of one unit with a 50% probability of occurrence. This was performed to increase the test sample and thus obtain representative statistical values that allow comparing the performance of the models in the selected databases.
Other important parameters for evaluating model performance are inference speed, model parameter count, and trained model computational size [69,70].
The total test images were used to calculate the inference speed, and then the total time was divided by the number of images. The models are evaluated on a computer with two NVidia T4 GPUs with 16 GB of memory each.
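The timing protocol described above (total inference time over the test set divided by the number of images) can be sketched as follows; the model and test tensor are placeholders, and the sketch times a single GPU rather than the dual-T4 setup used in the study.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, images, device="cuda"):
    """Average inference speed: total wall-clock time over the test set
    divided by the number of images, as described above."""
    model.eval().to(device)
    # Warm-up pass so lazy CUDA initialization does not bias the timing.
    _ = model(images[0].unsqueeze(0).to(device))
    torch.cuda.synchronize()
    start = time.perf_counter()
    for img in images:
        _ = model(img.unsqueeze(0).to(device))
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return len(images) / elapsed

# Example with a placeholder model and random test images.
model = torch.nn.Conv2d(3, 2, 1)
test_images = torch.randn(100, 3, 256, 256)
# print(f"{measure_fps(model, test_images):.1f} FPS")   # requires a CUDA device
```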

4. Results

4.1. Unstructured Route Databases

Table 1 presents public databases for the study of vehicle navigation on unstructured routes that were analyzed to test the models. RUGD [71] contains RGB images with multiple types of annotations and in various types of terrain in rural areas. Rellis 3D [72] is a multimodal database whose data were captured at the RELLIS Campus of Texas A&M University and which presents the largest number of classes of the reviewed databases. ORFD [73] was built on various unstructured terrains with different weather and lighting conditions. CaT [32] was made in a forest near Mississippi State University and presents annotations to evaluate the traversability of off-road and on-road vehicles.
CaT [32] and Rellis3D [72] databases were selected to train and evaluate the models. Most of their images present a low contrast between the area enabled for vehicular traffic and the rest of the ground. Figure 2 shows examples of the analyzed images.
CaT considers annotations for the traversability of three types of vehicles. For this study, areas that were traversable by smaller vehicles were considered as the area to be segmented, and areas traversable only by 4 × 4 vehicles were excluded.
Rellis 3D provides annotations with 20 classes, so the masks were processed to contain only two labels: the route enabled for transit and the rest of the scene. The classes annotated as grass, dirt, concrete, mud, and asphalt were grouped as the area enabled for vehicular traffic.
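A sketch of this grouping step is shown below. The class names follow the list above, but the numeric IDs and the `id_to_name` mapping are hypothetical; the real IDs must be taken from the Rellis 3D ontology file.

```python
import numpy as np

# Hypothetical class-name set following the traversable classes listed above.
TRAVERSABLE_CLASSES = {"grass", "dirt", "concrete", "mud", "asphalt"}

def binarize_mask(mask, id_to_name):
    """Collapse a multi-class annotation into a binary traversable/other mask."""
    traversable_ids = {i for i, name in id_to_name.items()
                       if name in TRAVERSABLE_CLASSES}
    return np.isin(mask, list(traversable_ids)).astype(np.uint8)

# Toy example with made-up IDs.
id_to_name = {0: "void", 1: "grass", 3: "tree", 4: "mud", 17: "asphalt"}
mask = np.array([[0, 1, 3], [4, 17, 0]])
print(binarize_mask(mask, id_to_name))   # [[0 1 0] [1 1 0]]
```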
Figure 3 shows the original masks of the CaT and Rellis 3D annotations, as well as the binarized masks prepared for training the designed models.

4.2. Bilateral Architecture Schemes

In Figure 4, the three bilateral schemes studied can be observed. In the BiSeNet architecture, Figure 4a, the contextual path is composed of a backbone followed by an attention module, both composed of CNNs. Figure 4b presents a hybrid model where the contextual path was modified by a ViT or MLP backbone. Based on the studies [27,74], the contextual path of the bilateral architecture presented in Figure 4c is composed of a ViT skeleton and the mixing module of the studied MLP models.
The state-of-the-art ViT-based models were analyzed, concluding that EfficientViT [50] and FastViT [51] offer a better balance between computational efficiency and performance, highlighting their optimization for memory access and inference accuracy. The architecture of both models is presented in Figure 5.
EfficientViT applies spatial mixing between FFN layers to speed up communication between different features, and CGA, which is composed of multiple heads that distribute the total features to process them. FastViT implements RepMixer, a convolutional module that mixes tokens by reparameterizing the residual connections, to alleviate computational memory access. Additionally, dense convolutions located in the stem and patch embedding modules are replaced by a factorized version that uses train-time overparametrization.
Regarding the MLP-based models, the gMLP [60] and S2MLP [61] architectures were selected, which presented a better balance between scalability and prediction efficiency. The gMLP model uses a Spatial Gating Unit in the token-mixing section that makes feature selection more flexible, allowing adaptation to different data types and tasks. By functioning as a feature filter, the Spatial Gating Unit accelerates prediction thanks to computational savings and improves accuracy, since its mechanism builds nonlinear relationships between the image features. S2MLP implements a mixing module that does not require parameter training, reducing computational time during training. The spatial shift technique, applied in the Spatial Shift module, captures local dependencies in images, which improves the representation capacity without the need for convolutional techniques. Figure 6 presents the mixing modules of the aforementioned architectures.
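The gating idea can be illustrated with the simplified Spatial Gating Unit below; it follows the split-normalize-project-multiply pattern described for gMLP but is not the exact published block (for instance, the surrounding channel projections are omitted).

```python
import torch
import torch.nn as nn

class SpatialGatingUnit(nn.Module):
    """gMLP-style gating (simplified): half of the channels pass through a
    learned linear mixing over token positions and are then multiplied
    element-wise with the other half, acting as a feature filter."""
    def __init__(self, dim, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(dim // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)   # mixes token positions

    def forward(self, x):                                 # x: (B, N, C)
        u, v = x.chunk(2, dim=-1)                         # split channels in half
        v = self.norm(v).transpose(1, 2)                  # (B, C/2, N)
        v = self.spatial_proj(v).transpose(1, 2)          # (B, N, C/2)
        return u * v                                      # element-wise gate

tokens = torch.randn(2, 196, 256)                         # 196 tokens, 256 channels
print(SpatialGatingUnit(256, 196)(tokens).shape)          # torch.Size([2, 196, 128])
```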
The first version of BiSeNet [37] was selected as the base model because, although the other versions have better performance, they require more memory space and consider a knowledge distillation process that requires greater computational capacity during training [75], and the operating model is less flexible to changes during inference [76].
Table 2 presents the output tensors of the modules of the bilateral architectures when a 256 × 256 RGB image (three channels) is input. The image passes consecutively through the contextual and spatial paths, which reduce its resolution and increase the number of channels. In the feature fusion module, the resolution is re-established with two channels, one for each detected class.

4.3. Training of Bilateral Hybrid Models

Considering the parameters for training presented in the methodology, the curves shown in Figure 7 were obtained for the bilateral models with the best performance.
In terms of training speed, the bilateral models exhibit higher rates than the models used as backbones. While BiSeNet-EfficientViT-S2MLP achieves a training rate of 28.71 it/s, EfficientViT has a training rate of 7.76 it/s. Similarly, BiSeNet-FastViT-S2MLP and its backbone FastViT exhibit training rates of 28.71 it/s and 2.76 it/s, respectively. BiSeNet-Xception achieves a training rate of 22.92 it/s.

4.4. Evaluation of Bilateral Hybrid Models

Table 3 presents the performance metrics and key parameters of the base and designed architectures, considering the two studied datasets. It is observed that BiSeNet-FastViT-S2MLP outperforms the base architectures in terms of prediction accuracy. The gMLP variant also captures details, but fewer than the S2MLP implementation. This table can be used as an ablation study of the designed hybrid models, in which a ViT model was used as the backbone and an MLP model as the attention module.
Figure 8 presents boxplots to analyze the distribution of the mIoU index results, considering both datasets. To facilitate the analysis, Table 4 shows their main parameters.
Figure 9 and Figure 10 show segmentation examples of paths using the designed architectures and the ViT models, considering the CaT and Rellis 3D datasets, respectively. The red circles with dashed lines highlight areas with high-frequency intensity variations, which are challenging for the cited models to capture. The example images were selected to be representative of each segment traversed by the vehicle.
The graph in Figure 11 compares the inference speed and traversable-area prediction accuracy of state-of-the-art hybrid architectures with the best proposed design, BiSeNet-FastViT-S2MLP, on the CaT dataset. The state-of-the-art hybrid models were selected so that the computational weight of the trained model is around 30 MB. These models are available in the PyTorch timm library.

5. Discussion

In the model design, the goal is to enhance the function of the contextual path by testing two backbones based on ViTs and MLPs. The ViT models focus on reducing memory usage through various modules, while the MLP models focus on the complex classification of semantic information.
The hybrid models have a similar number of parameters, approximately one-third of Xception’s. However, the training speeds of both models are practically identical, despite the backbones differing significantly in their training rates. This suggests that the bilateral base architecture influences training speed more than the choice of backbone. When these architectures are integrated into BiSeNet, the data flow is split between the contextual and spatial paths, alleviating part of the computational burden previously managed entirely by the backbone. Additionally, the use of skip connections improves model convergence without causing gradient problems or extending backpropagation operations.
The training curves of the best-performing bilateral models show better convergence of the BiSeNet-EfficientViT-S2MLP model, comparable to BiSeNet-Xception. However, BiSeNet-FastViT-S2MLP shows greater convergence in its validation curve, indicating that the model finds the correct inference parameters more efficiently.
Both selected datasets come from forest environments, where images have low contrast between the path and the surrounding ground. Particularly, the scenes in the CaT dataset pose more challenges for the models due to the abundance of vegetation, causing the entire scene to have a similar intensity, complicating the path distinction. This challenge is reflected in both the quantitative and qualitative results. The boxplots reveal that the inference results for the CaT dataset are more dispersed, presenting multiple outliers, while the results for Rellis 3D are close to the respective boxplot’s minimum and maximum. Despite this, the medians of the bilateral models, considering both datasets, are similar in magnitude.
In terms of performance, the BiSeNet-FastViT-S2MLP model outperforms the base architectures in prediction accuracy. While the gMLP variant shows some capability to capture details, these are fewer compared to the S2MLP implementation.
When comparing the designed model with the base ViT models, a reduction in inference speed and an increase in model weight are observed. This occurs due to the addition of modules to the base transformer architecture, namely the spatial path and the MLP-mixer module. Despite this, the model still operates in real time (over 30 FPS). Although these parameters are affected, accuracy is improved, which is significant for defining contours, and is crucial for an autonomous vehicle to predict steering angles accurately or detect obstacles at a greater distance.
The results sequentially demonstrate the evolution of the traversable path detection as a spatial path and MLP-Mixer modules are added. Both FastViT and EfficientViT show improvements in shape detection when these modules are included. The qualitative results obtained with BiSeNet-FastViT-S2MLP are similar to those obtained with BiSeNet-Xception. Still, it is important to note that the designed architecture has fewer trainable parameters and lower computational weight, which is an advantage in applications on portable devices with limited computational storage.
When comparing BiSeNet-FastViT-S2MLP with other state-of-the-art hybrid models, the proposed model achieves the highest prediction accuracy while maintaining a high inference speed. The analyzed models commonly integrate CNNs-ViT or CNNs-MLP combinations, suggesting that the combined use of the three deep learning techniques (CNNs, ViT, and MLP), as in this study, could further enhance a model's inference capabilities.
The quantitative and qualitative results obtained for two datasets with similar characteristics demonstrate that the BiSeNet-FastViT-S2MLP architecture not only surpasses the architectures used as a base but also maintains an inference speed compatible with real-time applications. This achievement is particularly significant, considering the complexity inherent in the analyzed datasets, suggesting that the proposed approach could have a substantial impact on improving the visual perception of autonomous vehicles in challenging environments.

6. Conclusions

This study presents a significant contribution to the field of visual perception for autonomous vehicles, demonstrating that it is possible to improve accuracy in detecting traversable areas on unstructured roads without compromising computational efficiency. The development of bilateral hybrid architectures combining the three most well-known techniques in computer vision and deep learning is presented. Among the designed models, BiSeNet-FastViT-S2MLP stands out, showing better high-frequency detail capture. Additionally, when comparing the performance of BiSeNet-FastViT-S2MLP with other hybrid models on the CaT dataset, the model demonstrates better inference accuracy while maintaining real-time operation.

7. Future Works

The models were only tested on isolated images, which does not reflect practical applications where image sequences are handled. This limitation arises from the lack of unstructured road datasets with semantic segmentation annotations for videos, preventing more representative tests. There are also no datasets that account for failures in camera capture due to lens fogging, glare from an oncoming light beam, or vehicle vibrations.
Hybrid models open new opportunities for developing more robust and adaptable perception systems in various environments. In adverse weather conditions, perception models that extract features from images may generate erroneous predictions [77]. To mitigate this issue, future work could focus on training models on a newly created database containing images with such conditions, allowing the receptive field of hybrid models to improve performance in these scenarios. Another viable solution would be the development of a multimodal platform that combines additional perceptual sensors, such as LIDAR or radar, along with RGB cameras. This would provide complementary data to enhance model predictions and generate more accurate inferences under complex conditions.
This work could serve as a foundation for developing multitask models that can be implemented on real platforms, such as Advanced Driver Assistance Systems (ADAS). These systems warn drivers about permitted traffic zones and detect obstacles on the road. Additionally, ADAS could be complemented with Autonomous Visualization Agents (AVA) systems [78], which would enhance the driver’s visual analysis by generating dynamic visualizations. AVA, using natural language instructions instead of advanced computer vision tools, would facilitate visual interpretation and decision-making during driving.

Author Contributions

Conceptualization, C.U. and M.V.; methodology, C.U. and M.V.; software, C.U. and M.V.; validation, C.U. and M.V.; formal analysis, C.U. and M.V.; investigation, C.U. and M.V.; resources, C.U. and M.V.; data curation, C.U. and M.V.; writing—original draft preparation, C.U. and M.V.; writing—review and editing, C.U. and M.V.; visualization, C.U. and M.V.; supervision, C.U.; project administration, C.U.; funding acquisition, C.U. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

This work has been supported by the Vicerrectoría de Investigación, Innovación y Creación of the Universidad de Santiago de Chile and ANID FONDEQUIP Mediano EQM230160 grant, Chile.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations, listed in alphabetical order, are used in this manuscript:
ADAS: Advanced Driver Assistance Systems
ASPP: Atrous Spatial Pyramid Pooling
AVA: Autonomous Visualization Agents
BiSeNet: Bilateral Segmentation Network
CaT: CAVS Traversability
CAVS: Center for Advanced Vehicular Systems
CGA: Cascade Group Attention
CMSNet: Cascaded Segmented Matting Network
CNNs: Convolutional Neural Networks
DeiT: Data-Efficient Image Transformer
DSC: Dice Similarity Coefficient
FCN: Fully Convolutional Network
FN: False Negative
FP: False Positive
FPS: Frames Per Second
gMLP: Gated Multilayer Perceptron
GPP: Global Pyramid Pooling
GPU: Graphics Processing Unit
IDD: Indian Driving Dataset
LiDAR: Light Detection and Ranging
MHSA: Multi-Head Self-Attention
mIoU: Mean Intersection Over Union
MLP: Multilayer Perceptron
ORFD: Off-Road Free Space Detection
PSPNet: Pyramid Scene Parsing Network
RADAR: Radio Detection and Ranging
ResNet: Residual Network
S2MLP: Spatial-Shift Multilayer Perceptron
SPP: Spatial Pyramid Pooling
TN: True Negative
TP: True Positive
VGG: Visual Geometry Group
ViT: Vision Transformer

References

  1. Badue, C.; Guidolini, R.; Carneiro, R.V.; Azevedo, P.; Cardoso, V.B.; Forechi, A.; Jesus, L.; Berriel, R.; Paixão, T.M.; Mutz, F.; et al. Self-driving cars: A survey. Expert Syst. Appl. 2021, 165, 113816. [Google Scholar] [CrossRef]
  2. Parekh, D.; Poddar, N.; Rajpurkar, A.; Chahal, M.; Kumar, N.; Joshi, G.P.; Cho, W. A review on autonomous vehicles: Progress, methods and challenges. Electronics 2022, 11, 2162. [Google Scholar] [CrossRef]
  3. Cheng, J.; Zhang, L.; Chen, Q.; Hu, X.; Cai, J. A review of visual SLAM methods for autonomous driving vehicles. Eng. Appl. Artif. Intell. 2022, 114, 104992. [Google Scholar] [CrossRef]
  4. Muhammad, K.; Hussain, T.; Ullah, H.; Ser, J.D.; Rezaei, M.; Kumar, N.; Hijji, M.; Bellavista, P.; de Albuquerque, V.H.C. Vision-based semantic segmentation in scene understanding for autonomous driving: Recent achievements, challenges, and outlooks. IEEE Trans. Intell. Transp. Syst. 2022, 23, 22694–22715. [Google Scholar] [CrossRef]
  5. Zhang, Y.; Carballo, A.; Yang, H.; Takeda, K. Perception and sensing for autonomous vehicles under adverse weather conditions: A survey. ISPRS J. Photogramm. Remote Sens. 2023, 196, 146–177. [Google Scholar] [CrossRef]
  6. Marti, E.; De Miguel, M.A.; Garcia, F.; Perez, J. A review of sensor technologies for perception in automated driving. IEEE Intell. Transp. Syst. Mag. 2019, 11, 94–108. [Google Scholar] [CrossRef]
  7. Wang, K.; Zhao, G.; Lu, J. A deep analysis of visual SLAM methods for highly automated and autonomous vehicles in complex urban environment. IEEE Trans. Intell. Transp. Syst. 2024, 25, 10524–10541. [Google Scholar] [CrossRef]
  8. Chen, Q.; Xie, Y.; Guo, S.; Bai, J.; Shu, Q. Sensing system of environmental perception technologies for driverless vehicle: A review of state of the art and challenges. Sens. Actuators A Phys. 2021, 319, 112566. [Google Scholar] [CrossRef]
  9. Hao, S.; Zhou, Y.; Guo, Y. A brief survey on semantic segmentation with deep learning. Neurocomputing 2020, 406, 302–321. [Google Scholar] [CrossRef]
  10. Saleem, H.; Riaz, F.; Mostarda, L.; Niazi, M.A.; Rafiq, A.; Saeed, S. Steering angle prediction techniques for autonomous ground vehicles: A review. IEEE Access 2021, 9, 78567–78585. [Google Scholar] [CrossRef]
  11. Zakaria, N.J.; Shapiai, M.I.; Ghani, R.A.; Yassin, M.N.M.; Ibrahim, M.Z.; Wahid, N. Lane detection in autonomous vehicles: A systematic review. IEEE Access 2023, 11, 3729–3765. [Google Scholar] [CrossRef]
  12. Badrloo, S.; Varshosaz, M.; Pirasteh, S.; Li, J. Image-based obstacle detection methods for the safe navigation of unmanned vehicles: A review. Remote Sens. 2022, 14, 3824. [Google Scholar] [CrossRef]
  13. Bruno, D.R.; Berri, R.A.; Barbosa, F.M.; Osório, F.S. CARINA Project: Visual perception systems applied for autonomous vehicles and advanced driver assistance systems (ADAS). IEEE Access 2023, 11, 69720–69749. [Google Scholar] [CrossRef]
  14. Lee, D.-H.; Liu, J.-L. End-to-end deep learning of lane detection and path prediction for real-time autonomous driving. Signal Image Video Process. 2023, 17, 199–205. [Google Scholar] [CrossRef]
  15. Rateke, T.; von Wangenheim, A. Road surface detection and differentiation considering surface damages. Auton. Robot. 2021, 45, 299–312. [Google Scholar] [CrossRef]
  16. Gao, B.; Zhao, X.; Zhao, H. An active and contrastive learning framework for fine-grained off-road semantic segmentation. IEEE Trans. Intell. Transp. Syst. 2023, 24, 564–579. [Google Scholar] [CrossRef]
  17. Sharma, S.; Ball, J.E.; Tang, B.; Carruth, D.W.; Doude, M.; Islam, M.A. Semantic segmentation with transfer learning for off-road autonomous driving. Sensors 2019, 19, 2577. [Google Scholar] [CrossRef]
  18. Yang, Y.; Zhou, W.; Jiskani, I.M.; Wang, Z. Extracting unstructured roads for smart open-pit mines based on computer vision: Implications for intelligent mining. Expert Syst. Appl. 2024, 249, 123628. [Google Scholar] [CrossRef]
  19. Abdelsalam, A.; Happonen, A.; Karha, K.; Kapitonov, A.; Porras, J. Toward autonomous vehicles and machinery in mill yards of the forest industry: Technologies and proposals for autonomous vehicle operations. IEEE Access 2022, 10, 88234–88250. [Google Scholar] [CrossRef]
  20. Rasib, M.; Butt, M.A.; Riaz, F.; Sulaiman, A.; Akram, M. Pixel level segmentation based drivable road region detection and steering angle estimation method for autonomous driving on unstructured roads. IEEE Access 2021, 9, 167855–167867. [Google Scholar] [CrossRef]
  21. Firkat, E.; Zhang, J.; Wu, D.; Yang, M.; Zhu, J.; Hamdulla, A. ARDformer: Agroforestry road detection for autonomous driving using hierarchical transformer. Sensors 2022, 22, 4696. [Google Scholar] [CrossRef]
  22. Bai, C.; Zhang, L.; Gao, L.; Peng, L.; Li, P.; Yang, L. Real-time segmentation algorithm of unstructured road scenes based on improved BiSeNet. J. Real-Time Image Process. 2024, 21, 91. [Google Scholar] [CrossRef]
  23. Ferreira, N.A.; Ruiz, M.; Reis, M.; Cajahyba, T.; Oliveira, D.; Barreto, A.C.; Simas Filho, E.F.; de Oliveira, W.L.A.; Schnitman, L.; Monteiro, R.L.S. Low-latency perception in off-road dynamical low visibility environments. Expert Syst. Appl. 2022, 201, 117010. [Google Scholar] [CrossRef]
  24. Gou, J.; Yu, B.; Maybank, S.; Tao, D. Knowledge distillation: A survey. Int. J. Comput. Vis. 2021, 129, 1789–1819. [Google Scholar] [CrossRef]
  25. Han, Z.; Jian, M.; Wang, G.-G. ConvUNeXt: An efficient convolution neural network for medical image segmentation. Knowl.-Based Syst. 2022, 253, 109512. [Google Scholar] [CrossRef]
  26. Saikia, F.N.; Iwahori, Y.; Suzuki, T.; Bhuyan, M.K.; Wang, A.; Kijsirikul, B. MLP-UNet: Glomerulus segmentation. IEEE Access 2023, 11, 53034–53047. [Google Scholar] [CrossRef]
  27. Yuan, H.; Peng, J. LCSeg-Net: A low-contrast images semantic segmentation model with structural and frequency spectrum information. Pattern Recognit. 2024, 151, 110428. [Google Scholar] [CrossRef]
  28. Gulzar, Y.; Khan, S.A. Skin lesion segmentation based on vision transformers and convolutional neural networks—A comparative study. Appl. Sci. 2022, 12, 5990. [Google Scholar] [CrossRef]
  29. Geiger, A.; Lenz, P.; Stiller, C.; Urtasun, R. Vision meets robotics: The KITTI dataset. Int. J. Robot. Res. 2013, 32, 1231–1237. [Google Scholar] [CrossRef]
  30. Sellat, Q.; Bisoy, S.; Priyadarshini, R.; Vidyarthi, A.; Kautish, S.; Barik, R.K. Intelligent Semantic Segmentation for Self-Driving Vehicles Using Deep Learning. Comput. Intell. Neurosci. 2022, 2022, 6390260. [Google Scholar] [CrossRef]
  31. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3213–3223. [Google Scholar]
  32. Sharma, S.; Dabbiru, L.; Hannis, T.; Mason, G.; Carruth, D.W.; Doude, M.; Goodin, C.; Hudson, C.; Ozier, S.; Ball, J.E.; et al. CaT: CAVS traversability dataset for off-road autonomous driving. IEEE Access 2022, 10, 24759–24768. [Google Scholar] [CrossRef]
  33. Li, Y.; Li, Z.; Teng, S.; Zhang, Y.; Zhou, Y.; Zhu, Y.; Cao, D.; Tian, B.; Ai, Y.; Zhe, X.; et al. AutoMine: An unmanned mine dataset. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 21276–21285. [Google Scholar]
  34. Varma, G.; Subramanian, A.; Namboodiri, A.; Chandraker, M.; Jawahar, C.V. IDD: A dataset for exploring problems of autonomous navigation in unconstrained environments. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1743–1751. [Google Scholar]
  35. Sun, W.; Wang, R. Fully convolutional networks for semantic segmentation of very high resolution remotely sensed images combined with DSM. IEEE Geosci. Remote Sens. Lett. 2018, 15, 474–478. [Google Scholar] [CrossRef]
  36. Kotaridis, I.; Lazaridou, M. Semantic Segmentation Using a UNET Architecture on SENTINEL-2 Data. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2022, XLIII-B3-2022, 119–126. [Google Scholar] [CrossRef]
  37. Yu, C.; Wang, J.; Peng, C.; Gao, C.; Yu, G.; Sang, N. BiSeNet: Bilateral Segmentation Network for Real-Time Semantic Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 334–349. [Google Scholar]
  38. Krithika alias AnbuDevi, M.; Suganthi, K. Review of Semantic Segmentation of Medical Images Using Modified Architectures of UNET. Diagnostics 2022, 12, 3064. [Google Scholar] [CrossRef]
  39. Sharma, S.; Kumar, S. The Xception Model: A Potential Feature Extractor in Breast Cancer Histology Images Classification. ICT Express 2022, 8, 101–108. [Google Scholar] [CrossRef]
  40. Qin, Y.Y.; Cao, J.T.; Ji, X.F. Fire Detection Method Based on Depthwise Separable Convolution and YOLOv3. Int. J. Autom. Comput. 2021, 18, 300–310. [Google Scholar] [CrossRef]
  41. Chen, Y.; Gu, X.; Liu, Z.; Liang, J. A Fast Inference Vision Transformer for Automatic Pavement Image Classification and Its Visual Interpretation Method. Remote Sens. 2022, 14, 1877. [Google Scholar] [CrossRef]
  42. Wang, Y.; Han, Y.; Wang, C.; Song, S.; Tian, Q.; Huang, G. Computation-efficient deep learning for computer vision: A survey. Cybern. Intell. 2024, 1–47. [Google Scholar] [CrossRef]
  43. Tabani, H.; Balasubramaniam, A.; Marzban, S.; Arani, E.; Zonooz, B. Improving the Efficiency of Transformers for Resource-Constrained Devices. In Proceedings of the 2021 24th Euromicro Conference on Digital System Design (DSD), Palermo, Spain, 1–3 September 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 449–456. [Google Scholar]
  44. Kitaev, N.; Kaiser, L.; Levskaya, A. Reformer: The Efficient Transformer. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 5398–5406. [Google Scholar]
  45. Pan, Z.; Cai, J.; Zhuang, B. Fast Vision Transformers with HiLo Attention. In Proceedings of the 36th International Conference on Neural Information Processing System, New Orleans, LA, USA, 28 November–9 December 2022; pp. 14541–14554. [Google Scholar]
  46. Dib, E.; Le Pendu, M.; Jiang, X.; Guillemot, C. Local Low Rank Approximation with a Parametric Disparity Model for Light Field Compression. IEEE Trans. Image Process. 2020, 29, 9641–9653. [Google Scholar] [CrossRef]
  47. Lee, S.; Kim, H.; Jeong, B.; Yoon, J. A Training Method for Low Rank Convolutional Neural Networks Based on Alternating Tensor Compose-Decompose Method. Appl. Sci. 2021, 11, 643. [Google Scholar] [CrossRef]
  48. Yi, S.; Liu, X.; Li, J.; Chen, L. UAVformer: A Composite Transformer Network for Urban Scene Segmentation of UAV Images. Pattern Recognit. 2023, 133, 109019. [Google Scholar] [CrossRef]
  49. Song, E.; Zhang, B.; Liu, H. Combining external-latent attention for medical image segmentation. Neural Netw. 2024, 10, 468–477. [Google Scholar] [CrossRef]
  50. Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 17256–17267. [Google Scholar]
  51. Anasosalu Vasu, P.K.; Gabriel, J.; Zhu, J.; Tuzel, O.; Ranjan, A. Fastvit: A Fast Hybrid Vision Transformer Using Structural Reparameterization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 5785–5795. [Google Scholar]
  52. Li, S.; Wang, J.; Song, Y.; Wang, S.; Wang, Y. A Lightweight Model for Malicious Code Classification Based on Structural Reparameterisation and Large Convolutional Kernels. Int. J. Comput. Intell. Syst. 2024, 17, 30. [Google Scholar] [CrossRef]
  53. Lechner, M.; Amini, A.; Rus, D.; Henzinger, T.A. Revisiting the Adversarial Robustness-Accuracy Tradeoff in Robot Learning. IEEE Robot. Autom. Lett. 2023, 8, 1595–1602. [Google Scholar] [CrossRef]
  54. Tolstikhin, I.O.; Houlsby, N.; Dosovitskiy, A. Mlp-Mixer: An All-Mlp Architecture for Vision. Adv. Neural Inf. Process. Syst. 2021, 34, 24261–24272. [Google Scholar]
  55. Touvron, H.; Bojanowski, P.; Caron, M.; Cord, M.; El-Nouby, A.; Grave, E.; Izacard, G.; Joulin, A.; Synnaeve, G.; Verbeek, J.; et al. ResMLP: Feedforward Networks for Image Classification with Data-Efficient Training. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 5314–5321. [Google Scholar] [CrossRef]
  56. Guo, M.-H.; Liu, Z.-N.; Mu, T.-J.; Liang, D.; Martin, R.R.; Hu, S.-M. Can Attention Enable MLPs to Catch up with CNNs? Comput. Vis. Media 2021, 7, 283–288. [Google Scholar] [CrossRef]
  57. Liu, R.; Li, Y.; Tao, L.; Liang, D.; Zheng, H.-T. Are We Ready for a New Paradigm Shift? A Survey on Visual Deep MLP. Patterns 2022, 3, 100520. [Google Scholar] [CrossRef]
  58. Lai, H.-P.; Tran, T.-T.; Pham, V.-T. Axial Attention MLP-Mixer: A New Architecture for Image Segmentation. In Proceedings of the 2022 IEEE Ninth International Conference on Communications and Electronics (ICCE), Nha Trang, Vietnam, 27–29 July 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 381–386. [Google Scholar]
  59. Tang, Y.; Han, K.; Guo, J.; Xu, C.; Li, Y.; Xu, C.; Wang, Y. An Image Patch Is a Wave: Phase-Aware Vision MLP. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 10925–10934. [Google Scholar]
  60. Song, R.; Sun, L.; Gao, Y.; Peng, C.; Wu, X.; Lv, S.; Wei, J.; Jiang, M. Global-Local Feature Cross-Fusion Network for Ultrasonic Guided Wave-Based Damage Localization in Composite Structures. Sens. Actuators A Phys. 2023, 362, 114659. [Google Scholar] [CrossRef]
  61. Yu, T.; Li, X.; Li, P. S2-Mlp: Spatial-Shift Mlp Architecture for Vision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 297–306. [Google Scholar]
  62. Xie, E.; Wang, W.; Luo, P. SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  63. Ong, K.L.; Lee, C.P.; Lim, H.S.; Lim, K.M.; Mukaida, T. SCQT-MaxViT: Speech Emotion Recognition with Constant-Q Transform and Multi-Axis Vision Transformer. IEEE Access 2023, 11, 63081–63091. [Google Scholar] [CrossRef]
  64. Hassanin, M.; Anwar, S.; Radwan, I.; Khan, F.S.; Mian, A. Visual Attention Methods in Deep Learning: An In-Depth Survey. Inf. Fusion 2024, 108, 102417. [Google Scholar] [CrossRef]
  65. Mehta, S.; Rastegari, M. MobileViT: Light-Weight, General-Purpose, and Mobile-Friendly Vision Transformer. In Proceedings of the ICLR 2022, Virtual Event, 25–29 April 2022; pp. 1–26. [Google Scholar]
  66. Viswanath, K.; Singh, K.; Jiang, P.; Sujit, P.B.; Saripalli, S. OFFSEG: A Semantic Segmentation Framework for Off-Road Driving. In Proceedings of the 2021 IEEE 17th International Conference on Automation Science and Engineering (CASE), Lyon, France, 23–27 August 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 354–359. [Google Scholar]
  67. Zheng, C.; Liu, L.; Meng, Y.; Wang, M.; Jiang, X. Passable Area Segmentation for Open-Pit Mine Road from Vehicle Perspective. Eng. Appl. Artif. Intell. 2024, 129, 107610. [Google Scholar] [CrossRef]
  68. Gao, G.; Xu, G.; Li, J.; Yu, Y.; Lu, H.; Yang, J. FBSNet: A Fast Bilateral Symmetrical Network for Real-Time Semantic Segmentation. IEEE Trans. Multimed. 2023, 25, 3273–3283. [Google Scholar] [CrossRef]
  69. Li, J.; Cheng, L.; Xia, T.; Ni, H.; Li, J. Multi-Scale Fusion U-Net for the Segmentation of Breast Lesions. IEEE Access 2021, 9, 137125–137139. [Google Scholar] [CrossRef]
  70. Yang, M.Y.; Kumaar, S.; Lyu, Y.; Nex, F. Real-Time Semantic Segmentation with Context Aggregation Network. ISPRS J. Photogramm. Remote Sens. 2021, 178, 124–134. [Google Scholar] [CrossRef]
  71. Wigness, M.; Eum, S.; Rogers, J.G.; Han, D.; Kwon, H. A RUGD Dataset for Autonomous Navigation and Visual Perception in Unstructured Outdoor Environments. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Macau, China, 3–8 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 5000–5007. [Google Scholar]
  72. Jiang, P.; Osteen, P.; Wigness, M.; Saripalli, S. RELLIS-3D Dataset: Data, Benchmarks and Analysis. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1110–1116. [Google Scholar]
  73. Min, C.; Jiang, W.; Zhao, D.; Xu, J.; Xiao, L.; Nie, Y.; Dai, B. ORFD: A Dataset and Benchmark for Off-Road Freespace Detection. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2532–2538. [Google Scholar]
  74. Cheng, Z.; Wang, L. Dynamic Hierarchical Multi-Scale Fusion Network with Axial MLP for Medical Image Segmentation. Sci. Rep. 2023, 13, 6342. [Google Scholar] [CrossRef]
  75. Tsai, T.-H.; Tseng, Y.-W. BiSeNet V3: Bilateral Segmentation Network with Coordinate Attention for Real-Time Semantic Segmentation. Neurocomputing 2023, 532, 33–42. [Google Scholar] [CrossRef]
  76. Kim, S.; Ham, G.; Cho, Y.; Kim, D. Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning. IEEE Trans. Knowl. Data Eng. 2023, 1–13. [Google Scholar] [CrossRef]
  77. Wiseman, Y. Real-Time Monitoring of Traffic Congestions. In Proceedings of the 2017 IEEE International Conference on Electro Information Technology (EIT), Lincoln, NE, USA, 14–17 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 501–505. [Google Scholar]
  78. Liu, S.; Miao, H.; Li, Z.; Olson, M.; Pascucci, V.; Bremer, P. AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making. Comput. Graph. Forum 2024, 43, e15093. [Google Scholar] [CrossRef]
Figure 1. Schematic of the methodological sequence applied in the present study.
Figure 2. Example of images from the studied databases that have a low contrast between the traversable and non-traversable area: (a) Image from the CaT database; (b) Image from the Rellis 3D database.
Figure 3. Example of original masks from the selected databases and binarized masks for training the designed models: (a) Original CaT mask; (b) Binarized CaT mask; (c) Original Rellis 3D mask; (d) Binarized Rellis 3D mask.
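As an illustration of the binarization step shown in Figure 3, the sketch below converts a multi-class annotation into a binary traversable/non-traversable mask. It is a minimal sketch only: the class indices in TRAVERSABLE_IDS are hypothetical placeholders, since the actual traversable classes are defined by the CaT and Rellis 3D label maps.

```python
# Minimal sketch of mask binarization, assuming the original annotation is a
# single-channel label map. TRAVERSABLE_IDS holds hypothetical class indices;
# the real values depend on the CaT / Rellis 3D label definitions.
import numpy as np

TRAVERSABLE_IDS = {1, 3}  # placeholder indices, not taken from the paper

def binarize_mask(label_map: np.ndarray) -> np.ndarray:
    """Return a uint8 mask: 1 = traversable area, 0 = everything else."""
    binary = np.isin(label_map, list(TRAVERSABLE_IDS))
    return binary.astype(np.uint8)

# Example on a 4x4 toy label map
toy = np.array([[0, 1, 1, 2],
                [0, 1, 3, 2],
                [0, 0, 3, 3],
                [4, 0, 0, 0]])
print(binarize_mask(toy))
```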
Figure 4. Diagram of the bilateral architectures studied: (a) Original BiSeNet architecture; (b) Architecture with ViT or MLP-based skeleton; (c) Architecture with ViT skeleton and MLP-based mixing module.
Figure 5. Schematic of transformer-based architectures. The sections highlighted by green segments repeat three times in the architectures: (a) EfficientViT; (b) FastViT.
Figure 6. Schematic of the mixing blocks of the MLP-based architectures. The sections highlighted by green segments repeat three times in the architectures and correspond to the mixer modules implemented in the studied architectures: (a) gMLP-Mixer; (b) S2MLP.
Figure 7. Loss versus epoch curves obtained during training and validation of the bilateral models BiSeNet-FastViT-S2MLP, BiSeNet-EfficientViT-S2MLP, and BiSeNet-Xception: (a) Training curves; (b) Validation curves.
Figure 8. Boxplots of mIOU for the bilateral models BiSeNet-EfficientViT-S2MLP, BiSeNet-FastViT-S2MLP, and BiSeNet-Xception, considering the CaT and Rellis 3D datasets. The orange line in each box marks the median of the corresponding data series. The gray circles indicate outliers.
Figure 9. Visual results for base models and designed models, considering the test images from the CaT dataset. The traversable area is shown in white. The red circles with dashed lines highlight Ground Truth details better captured by the proposed bilateral architectures compared to transformer-based models.
Figure 10. Visual results for base models and designed models, considering the test images from the Rellis 3D dataset. The traversable area is shown in black. The red circles with dashed lines highlight Ground Truth details better captured by the proposed bilateral architectures compared to transformer-based models.
Figure 11. Comparison of BiSeNet-FastViT-S2MLP with lightweight, real-time hybrid state-of-the-art models on the CaT dataset. The trained weights of the models do not exceed 30 MB.
Table 1. Public databases created on unstructured routes with annotations for the study of models dedicated to the semantic segmentation of images.
Dataset | Images | Resolution | Classes | Scene
RUGD [71] | 2700 | 688 × 550 | 6 | Countryside
Rellis 3D [72] | 6235 | 1920 × 1200 | 20 | Countryside
ORFD [73] | 12,198 | 1280 × 720 | 3 | Countryside and Forest
CaT [32] | 1812 | 1024 × 644 | 4 | Forest
Table 2. Output tensors of the modules that make up the studied architectures. The tensor format is: [channel, width, height].
Architecture Module | Output Tensor
Input | [3, 256, 256]
Spatial path | [256, 32, 32]
Contextual path | [768, 32, 32]
Features fusion | [2, 256, 256]
Output | [1, 256, 256]
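As a reading aid for Table 2, the following sketch reproduces the reported tensor shapes with placeholder modules. It is not the authors' implementation: the convolutions and interpolation below only stand in for the spatial path, contextual path, and fusion stage of a generic bilateral layout, so that the shape progression can be checked in PyTorch.

```python
# Shape-only sketch of the bilateral layout summarized in Table 2.
# The modules are placeholders (plain convolutions / interpolation); they only
# reproduce the reported tensor sizes, not the actual path architectures.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralShapeSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Spatial path: 3 -> 256 channels at 1/8 resolution (256 -> 32).
        self.spatial = nn.Conv2d(3, 256, kernel_size=8, stride=8)
        # Contextual path: 3 -> 768 channels at 1/8 resolution.
        self.context = nn.Conv2d(3, 768, kernel_size=8, stride=8)
        # Feature fusion: concatenated 1024 channels -> 2 classes.
        self.fusion = nn.Conv2d(256 + 768, 2, kernel_size=1)
        # Output head: 2 -> 1 channel (traversable-area map).
        self.head = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, x):
        sp = self.spatial(x)                        # [B, 256, 32, 32]
        cx = self.context(x)                        # [B, 768, 32, 32]
        fused = self.fusion(torch.cat([sp, cx], dim=1))
        fused = F.interpolate(fused, size=x.shape[-2:], mode="bilinear",
                              align_corners=False)  # [B, 2, 256, 256]
        return self.head(fused)                     # [B, 1, 256, 256]

out = BilateralShapeSketch()(torch.randn(1, 3, 256, 256))
print(out.shape)  # torch.Size([1, 1, 256, 256])
```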
Table 3. Performance indices of the investigated models considering the CaT and Rellis 3D datasets.
Model | Mean mIoU CaT (%) | Mean mIoU Rellis 3D (%) | Mean DSC CaT (%) | Mean DSC Rellis 3D (%) | Speed (FPS) | Parameters (M) | Trained Model Weight (MB)
BiSeNet-Xception | 92.75 | 96.99 | 96.03 | 98.43 | 98.37 | 22.42 | 90.07
BiSeNet-gMLP | 82.31 | 95.95 | 89.36 | 97.82 | 71.2 | 63.35 | 324.87
BiSeNet-S2MLP | 83.86 | 96.43 | 90.69 | 98.10 | 73.98 | 68.37 | 324.87
FastViT | 93.01 | 96.28 | 96.16 | 98.06 | 94.53 | 3.16 | 13.32
BiSeNet-FastViT | 93.17 | 96.57 | 96.22 | 98.17 | 77.06 | 3.64 | 16.24
BiSeNet-FastViT-gMLP | 93.15 | 96.77 | 96.22 | 98.31 | 78.05 | 5.09 | 22.04
BiSeNet-FastViT-S2MLP | 93.51 | 96.97 | 96.42 | 98.46 | 74.17 | 7.18 | 30.44
EfficientViT | 91.81 | 96.17 | 95.36 | 98.02 | 127.04 | 3.41 | 13.76
BiSeNet-EfficientViT | 91.57 | 96.41 | 95.2 | 98.10 | 96.46 | 3.89 | 63.93
BiSeNet-EfficientViT-gMLP | 92.34 | 96.66 | 95.66 | 98.26 | 92.2 | 5.34 | 22.87
BiSeNet-EfficientViT-S2MLP | 92.28 | 96.67 | 95.68 | 98.26 | 79.24 | 7.44 | 31.27
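The mIoU and DSC columns in Table 3 correspond to the standard intersection-over-union and Dice coefficient for segmentation masks. The snippet below is a generic sketch of those definitions for binary masks, with mIoU taken as the mean of per-image IoU values; it is not the evaluation script used in the study, and the random masks are placeholders.

```python
# Generic binary IoU and Dice (DSC) computed from boolean masks; a sketch of
# the standard definitions, not the paper's evaluation code.
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * inter / total if total else 1.0

# Mean metrics over a (placeholder) test set of predictions and ground truths.
preds = [np.random.rand(256, 256) > 0.5 for _ in range(4)]
gts   = [np.random.rand(256, 256) > 0.5 for _ in range(4)]
print(np.mean([iou(p, g) for p, g in zip(preds, gts)]),
      np.mean([dice(p, g) for p, g in zip(preds, gts)]))
```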
Table 4. Percentile, median, maximum, and minimum values used to create the boxplots in Figure 8.
Architecture | BiSeNet-EfficientViT-S2MLP | BiSeNet-FastViT-S2MLP | BiSeNet-Xception
Dataset | CaT | Rellis 3D | CaT | Rellis 3D | CaT | Rellis 3D
Q1 (25%) | 0.875 | 0.951 | 0.886 | 0.953 | 0.884 | 0.957
Median | 0.940 | 0.968 | 0.952 | 0.970 | 0.943 | 0.972
Q3 (75%) | 0.973 | 0.950 | 0.984 | 0.981 | 0.974 | 0.98
Minimum | 0.728 | 0.906 | 0.739 | 0.911 | 0.749 | 0.918
Maximum | 0.999 | 0.994 | 0.999 | 0.994 | 0.999 | 0.994
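The quartile, median, and extreme values in Table 4 can be obtained from a series of per-image mIoU scores with standard percentile functions. The following is a minimal sketch with hypothetical scores, only to illustrate how such boxplot statistics are derived.

```python
# Sketch of deriving the boxplot statistics reported in Table 4 from per-image
# mIoU values; the scores below are random placeholders, not results.
import numpy as np

scores = np.random.uniform(0.7, 1.0, size=200)  # hypothetical per-image mIoU

stats = {
    "Q1 (25%)": np.percentile(scores, 25),
    "Median":   np.median(scores),
    "Q3 (75%)": np.percentile(scores, 75),
    "Minimum":  scores.min(),
    "Maximum":  scores.max(),
}
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```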
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
