Article

Development of an Ensembled Meta-Deep Learning Model for Semantic Road-Scene Segmentation in an Unstructured Environment

by Sangavi Sivanandham and Dharani Bai Gunaseelan *

School of Electronics Engineering, Vellore Institute of Technology (VIT), Vellore 632014, India

* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(23), 12214; https://doi.org/10.3390/app122312214
Submission received: 15 October 2022 / Revised: 19 November 2022 / Accepted: 21 November 2022 / Published: 29 November 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Road scene segmentation is an integral part of the Intelligent Transport System (ITS) for precise interpretation of the environment and safer vehicle navigation. Traditional segmentation methods have struggled to meet the requirements of unstructured, complex image segmentation. Deep Neural Networks (DNNs) therefore play a significant role in effectively segmenting images with multiple classes in an unstructured environment. In this work, the semantic segmentation models U-net, LinkNet, FPN, and PSPNet are updated to use the classification networks VGG19, Resnet50, Efficientnetb7, MobilenetV2, and InceptionV3 as pre-trained backbone architectures, and the performance of each updated model is compared on the unstructured Indian Driving-Lite (IDD-Lite) dataset. To improve segmentation performance, a stacking ensemble approach is proposed that combines the predictions of a semantic segmentation model across its different backbone architectures using a simple grid search method. Four ensemble models are thus formed and analyzed on the IDD-Lite dataset. Two metrics, Intersection over Union (IoU, or Jaccard index) and the Dice coefficient (F1 score), are used to assess the segmentation performance of each ensemble model. The results show that an ensemble of U-net with different backbone architectures is more efficient than the other ensemble models, achieving IoU and F1 scores of 73.12% and 76.67%, respectively.

1. Introduction

Semantic segmentation plays a crucial role in scene interpretation and in inferring relationships between different objects for autonomous vehicles. Beyond identifying the label of each pixel in an image, semantic segmentation pinpoints the target as well as its boundary information. In recent years, there has been active research in semantic pixel-wise labeling, and encoder-decoder-based neural networks have gained considerable attention for semantic pixel-wise segmentation, predicting the class label of each input image pixel [1]. An encoder usually consists of a pre-trained classification network such as VGG [2], ResNet [3], or MobileNet [4]. It uses convolution layers with filter banks to down-sample the image and generates high-dimensional feature maps for the input image. However, down-sampling discards spatial information that is essential for semantic segmentation. To overcome this limitation, decoders are designed to up-sample the obtained lower-resolution feature maps back to the original size. These restored features provide pixel-level labels with refined object boundaries for segmentation. The generalized framework of the encoder-decoder-based segmentation model is depicted in Figure 1.
Despite the variety of neural networks used for semantic segmentation, locating multiple objects in a single image remains a significant challenge. To obtain high-dimensional features of the input image for localization and to recover an output image matching the size of the input, encoder-decoder-based segmentation models are adopted in this work to perform pixel-level semantic segmentation on Indian roads for scene understanding [5].
U-net is one of the earliest encoder-decoder-based segmentation networks. Although it was initially developed for medical images, U-net is currently applied in various domains, such as self-driving vehicles and remote sensing [6]. U-net, proposed in 2015, comprises two paths: contraction and expansion [7]. The novelty of this network lies in the expansive path, where the feature map is up-sampled after each stage, followed by concatenation and convolution. It also uses skip connections to improve the segmentation result, forming a U-shaped network whose output keeps the original input image size. In 2017, a variant of U-net named LinkNet was proposed by Chaurasia and Culurciello [8], which replaces the convolution block in the encoder with a residual block and uses an "addition" operation instead of "stacking". It consists of four encoder and four decoder blocks; batch normalization is used after each convolution, followed by a ReLU activation function. Its novelty lies in bypassing the output after each down-sampling step from each encoder block to its corresponding decoder block. In the same year, Zhao et al. proposed PSPNet (Pyramid Scene Parsing Network) [9]. Its encoder contains a CNN backbone along with a dilated convolution layer and pyramid pooling modules; the initial feature map obtained from the backbone carries additional context thanks to the dilated convolution layer. The decoder follows a structure similar to U-net, with convolution followed by a bi-linear up-sampling layer. Later, FPN (Feature Pyramid Network), another pyramid-based design, was applied to object detection [10]. It works on any image size and calculates a feature map at each level, irrespective of the backbone network. In the decoder stage, the encoder features are up-sampled and concatenated so that the output matches the input image size. In our work, a pre-trained classification network is used to modify the encoder blocks of all four segmentation models. The upcoming section presents a detailed overview of the different segmentation models using modified backbone networks.

2. Related Work

The state of the art in semantic pixel-wise segmentation research can be broadly classified into two types: traditional and Deep-Learning (DL)-based methods. The development of CNN-based segmentation models began around 2012 with the Fully Convolutional Network (FCN) [11]. The FCN can process images at their original size with fewer parameters and lower computational time. The DL techniques most commonly used in satellite and aerial imagery analysis were autoencoders and CNNs. Inspired by the performance of existing autoencoders [12], and to overcome the limitations of coarse segmentation results, encoder-decoder-based segmentation models such as SegNet [13] and U-shaped networks were introduced. After the success of CNNs such as VGG16, MobileNet, and ResNet as object classifiers, researchers were motivated to use them for dense prediction tasks such as segmentation. A new approach was therefore introduced in which a pre-trained CNN serves as the encoder of the base segmentation model [1].
In [14], a SegNet-based segmentation model was developed using the VGG16 network as the encoder backbone for feature extraction. The authors analyzed the tradeoff between memory space and segmentation accuracy on the CamVid dataset for road-scene understanding. As a result, SegNet has a reduced number of parameters and is efficient in both memory and processing time. In 2016, dilated convolution was proposed to enhance the resolution of the obtained feature map [15]. In this method, pixels are skipped instead of applying a pooling operation, preserving the image's resolution during processing and increasing accuracy when integrated with existing segmentation models.
Another line of work is the DeepLab family (DeepLabV1, DeepLabV2, and DeepLabV3), which represents the state of the art in combining CNN feature extraction, dilated convolutional layers, and Conditional Random Fields (CRFs). These models employ atrous convolution to produce a larger output feature map without increasing the number of parameters or losing resolution, which benefits segmentation performance. A network called ENet (Efficient Neural Network), similar to SegNet, was proposed in 2016 by Paszke et al. for low-latency operation [16]; dilated convolutional and stacked residual layers are used to maintain accuracy. In 2020, a new architecture called Eff-UNet (Efficient UNet) [1] was introduced, combining EfficientNet as the encoder backbone with the U-net decoder. Its performance was evaluated on the unstructured IDD-Lite dataset, where it outperformed other architectures and won first prize in the segmentation competition. Similarly, the U-net model's accuracy was improved by replacing its original encoder with the pre-trained classification network VGG11 [17].
Based on this survey of different segmentation models, we built the networks required for this work by pairing different encoders with the predefined decoders via the transfer-learning technique.

3. Materials and Methods

3.1. Datasets

To analyze the performance of the proposed segmentation models, we conducted experiments with IDD-Lite, which focuses on unstructured Indian roads. The segmentation challenge on IDD-Lite was held at the 7th National Conference on Computer Vision, Pattern Recognition, Image Processing, and Graphics (NCVPRIPG) [18] in December 2019. The dataset was collected in 2018 from 128 driving sequences on the outskirts of Hyderabad and Bangalore using a car-mounted front-facing camera.
The unstructured environment poses a challenge due to the wide variety of pedestrians and vehicles, which are less likely to obey traffic rules. Vague road boundaries with muddy terrain can also be considered possible drivable areas. The database also includes signs, advertisements, and information boards that cause difficulties in mapping and localization. Figure 2 illustrates images from the IDD and Cityscapes [19] datasets to highlight the differences between structured and unstructured environments. IDD-Lite is a subsampled variant of IDD that follows the distribution of the original dataset but with a reduced number of labels [20].
Table 1 reports the pixel fraction for each class considered in IDD-Lite. The dataset is under 50 MB in size, with an image resolution of 320 × 227. It retains the challenges and uncertainties of the original IDD while reducing its 34 classes to 7 hand-picked classes.

3.2. Methodology

3.2.1. General Outline of the Proposed Method

In this work, we evaluated four encoder-decoder-based basic Semantic Segmentation Models (SSMs), namely U-net [21], LinkNet [8], PSPNet [9], and FPN [22], for segmenting multi-class images. Variants of each base model were formed by replacing the encoder blocks with five distinct ImageNet pre-trained backbone networks: VGG19 [2], Resnet50 [3], Efficientnetb7 [23], MobilenetV2 [4], and InceptionV3 [24]. These modified networks were trained on the IDD-Lite segmentation training dataset, and each trained network generates a multi-class segmentation mask for every test image. The segmentation accuracy of each SSM base model can be improved by weighting the masks obtained from its different backbones and ensembling them via a meta-model. The framework for obtaining the final multi-class segmentation mask using the stacked ensembling algorithm comprises two levels, as depicted in Figure 3. First, the base model is trained with five backbones on the same IDD-Lite dataset. The predictions of these five models on the same training data are then used to create the training set for the meta-model in the next stage. As a result, the final predictions of the stacked ensembling framework are more accurate than those of the individual models.
The development of the model architecture with different encoders and the method used to ensemble the predictions of these models to create the final meta-model are described in the following subsections.

3.2.2. Constructing Base Segmentation Models with Pre-Trained Encoders

The two-stage model proposed in this work involves constructing and training the basic semantic segmentation models with different pre-trained encoders, followed by a stacked ensemble method to optimize the final predictions.
An encoder is a CNN that acts as the backbone architecture of the segmentation model, down-sampling the input image to acquire the final high-level feature map. A pre-trained encoder lets the network converge faster and more efficiently, with better performance than a non-pre-trained model. In this work, the performance of the four basic SSMs is improved by using, as encoders, five popular CNN architectures that achieve high top-1 accuracy on ImageNet classification. Therefore, 20 different networks are developed by combining the basic models with the backbone architectures, as in Table 2. Each model is denoted "Mij", where "i" and "j" correspond to the basic SSM and the encoder variant, respectively. Since there are seven classes in our dataset, the output of each model is denoted "Mijk", where "k" is the class label. After each of these models is trained on the IDD-Lite training dataset, a segmentation mask ("Pij") is generated as the output for each image. Once each base model has been evaluated with its encoder variants, the next step is the stacking-based ensembling approach, which combines their predictions to determine the final multi-class segmentation mask via the meta-model. A minimal code sketch of this model construction is shown below.
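
The following sketch illustrates how the 20 base-model/encoder combinations of Table 2 can be built with the Keras-based Segmentation Models library [26], which the experiments in Section 5.1 rely on. The dictionary layout, function name, and argument choices here are our own illustrative assumptions, not the authors' published code.

```python
import segmentation_models as sm

N_CLASSES = 7  # IDD-Lite class labels (Table 1)

# The four basic SSMs and the five ImageNet pre-trained backbones of Table 2.
ARCHITECTURES = {
    "U-Net": sm.Unet,
    "LinkNet": sm.Linknet,
    "PSPNet": sm.PSPNet,
    "FPN": sm.FPN,
}
BACKBONES = ["resnet50", "vgg19", "inceptionv3", "efficientnetb7", "mobilenetv2"]

def build_models():
    """Return the 20 models M_ij keyed by (architecture, backbone)."""
    models = {}
    for arch_name, arch in ARCHITECTURES.items():
        for backbone in BACKBONES:
            models[(arch_name, backbone)] = arch(
                backbone,
                classes=N_CLASSES,
                activation="softmax",        # per-pixel multi-class mask
                encoder_weights="imagenet",  # pre-trained encoder
            )
    return models
```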

3.2.3. Stacking-Based Ensembling Methods

In learning-based methods, ensembling improves the overall performance of the final prediction by combining the decisions of multiple models. Among the different ensembling techniques, the stacking algorithm is utilized in this work. Each base model performs differently with each encoder and produces a different segmentation mask "Pij". To aggregate the advantages of each model's prediction, the outputs are ensembled: the predictions of each model are assigned fixed weights based on their performance on the training dataset. The fundamental problem in this ensembling algorithm is calculating the optimal model weights, i.e., the combination of weights that yields better performance than any other.
A simple grid search algorithm determines the optimal weights of each model to overcome this challenge. Grid search is a parameter-optimization technique that selects the best combination of parameters from a given range; here, the parameters are the weights "wj" assigned to each model's prediction. The algorithm starts by forming a search space over the candidate weight values. For each model, initial weight values are assigned between zero and a specified range, and the error is calculated for each combination of values. After each iteration, the weight values are updated and the error is recalculated. Finally, the optimal weights "Wopt" are selected for the five encoders based on the minimum error value. "Wopt" for each basic model is represented as
$W_{opt} = \left[\, w_1 \;\; w_2 \;\; w_3 \;\; w_4 \;\; w_5 \,\right]$ (1)
The weight values range from 0 to 1 and are constrained as in Equation (2):
$w_j \geq 0 \quad \text{and} \quad \sum_{j=1}^{5} w_j = 1$ (2)
These weights are multiplied by the corresponding model output predictions "Pij" to form the final ensembled meta-model "Ei", as in Equation (3):
$E_i = \sum_{j=1}^{5} P_{ij} \times w_j$ (3)
The resulting ensembled model "Ei" generates a segmentation mask for the test image with better prediction accuracy than any of the models before ensembling. A sketch of the grid search and the weighted combination follows.
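
As a concrete illustration, the sketch below implements Equations (1)-(3) with a brute-force grid search. The arrays `preds` (the five masks "Pij" of one base model on a validation set) and `y_true` (one-hot ground truth), the grid resolution, and the use of 1 − mean IoU as the error measure are all our assumptions; the paper does not specify the exact error function.

```python
import itertools
import numpy as np

def mean_iou(y_true, y_pred, n_classes=7, eps=1e-7):
    """Mean IoU over classes for one-hot/softmax arrays of shape (N, H, W, C)."""
    t = np.argmax(y_true, axis=-1)
    p = np.argmax(y_pred, axis=-1)
    ious = [np.logical_and(t == k, p == k).sum() /
            (np.logical_or(t == k, p == k).sum() + eps)
            for k in range(n_classes)]
    return float(np.mean(ious))

def grid_search_weights(preds, y_true, steps=5):
    """Return W_opt = [w1 ... w5] minimising the error 1 - mean IoU."""
    grid = np.linspace(0.0, 1.0, steps)
    best_w, best_err = None, np.inf
    for w in itertools.product(grid, repeat=len(preds)):
        if sum(w) == 0:
            continue
        w = np.asarray(w) / sum(w)  # enforce w_j >= 0, sum(w_j) = 1 (Equation (2))
        ensembled = sum(wj * p for wj, p in zip(w, preds))  # E_i (Equation (3))
        err = 1.0 - mean_iou(y_true, ensembled)
        if err < best_err:
            best_w, best_err = w, err
    return best_w
```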

4. Evaluation Metrics

The evaluation metrics most frequently used to assess segmentation models are the Mean Intersection over Union (mIoU) and the F1 score, given in Equations (5) and (6), respectively. Intersection over Union (IoU), also known as the Jaccard index, is a very effective and straightforward metric for segmentation: the IoU value divides the area of overlap by the area of union between the predicted segmentation and the ground truth [25]. The IoU value for each of the "j" class labels in this work is obtained from Equation (4); the definitions of TP, FP, and FN are given in Table 3.
$J(A, B) \;\; \text{or} \;\; IoU_j = \dfrac{TP_j}{TP_j + FP_j + FN_j}$ (4)
$\text{Mean IoU} = \dfrac{1}{C} \sum_{j=1}^{C} IoU_j$ (5)

where C is the number of classes (C = 7 in this work).
The F1 score is defined as the harmonic mean of the precision and recall metrics. A model with a higher F1 score classifies more observations into their correct classes, as described in Equation (6):
$F1\;Score = \dfrac{2 \times Precision \times Recall}{Precision + Recall}$ (6)
where,
$Precision = \dfrac{TP}{TP + FP}$

$Recall = \dfrac{TP}{TP + FN}$
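
For concreteness, the following NumPy sketch computes the per-class IoU, precision, recall, and F1 of Equations (4)-(6) from integer label maps. The function name and the small epsilon guarding against division by zero are illustrative choices, not the authors' code.

```python
import numpy as np

def per_class_scores(y_true, y_pred, n_classes=7, eps=1e-7):
    """Return (IoU_j, F1_j) per class j from integer label maps (counts per Table 3)."""
    ious, f1s = [], []
    for j in range(n_classes):
        tp = np.sum((y_pred == j) & (y_true == j))  # true positives
        fp = np.sum((y_pred == j) & (y_true != j))  # false positives
        fn = np.sum((y_pred != j) & (y_true == j))  # false negatives
        iou = tp / (tp + fp + fn + eps)             # Equation (4)
        precision = tp / (tp + fp + eps)
        recall = tp / (tp + fn + eps)
        f1 = 2 * precision * recall / (precision + recall + eps)  # Equation (6)
        ious.append(iou)
        f1s.append(f1)
    return np.array(ious), np.array(f1s)

# The mean IoU of Equation (5) is the per-class IoU averaged:
# mean_iou = per_class_scores(y_true, y_pred)[0].mean()
```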

5. Experimental Evaluation

5.1. Implementation Details

In this work, the experiments were carried out on an NVIDIA Quadro P600 (GP107 GPU with 384 NVIDIA CUDA cores) and an Intel Xeon Silver 4114 CPU. The segmentation models are trained on images resized from the original resolution to 256 × 256 for U-net, LinkNet, and FPN, and to 384 × 384 for PSPNet. The training parameters considered were the optimization algorithm, number of epochs, batch size, weight initialization, and learning rate. The experiments were performed with the Keras-based Segmentation Models library [26]. Each model was trained for 20 epochs with the Learning Rate (LR) set to 0.001; the batch size was set to four, and the Adam optimization algorithm was used. Twenty epochs were chosen as the default number of training iterations for all models so that their overall performance could be compared easily; the experiments also showed that increasing the number of training iterations leads to overfitting on the provided dataset. A minimal training sketch follows.
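
The sketch below mirrors these settings (Adam, LR 0.001, batch size 4, 20 epochs) using the Segmentation Models library [26] and the categorical focal Dice loss tracked in Figure 5. The data tensors and the `set_framework` call are assumptions about the surrounding setup, not details given in the paper.

```python
import segmentation_models as sm
import tensorflow as tf

sm.set_framework("tf.keras")  # use the tf.keras backend of the library

# One of the 20 combinations: U-net with an Efficientnetb7 encoder at 256x256.
model = sm.Unet("efficientnetb7", input_shape=(256, 256, 3),
                classes=7, activation="softmax", encoder_weights="imagenet")

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=sm.losses.categorical_focal_dice_loss,  # focal + Dice, as in Figure 5
    metrics=[sm.metrics.iou_score, sm.metrics.f1_score],
)

# x_train, y_train, x_val, y_val are assumed to be pre-resized IDD-Lite images
# of shape (N, 256, 256, 3) and one-hot masks of shape (N, 256, 256, 7).
# model.fit(x_train, y_train, batch_size=4, epochs=20,
#           validation_data=(x_val, y_val))
```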

5.2. Evaluation Results

The different model combinations described in Section 3.2.2 are evaluated on the IDD-Lite dataset. The segmentation performance of each model, with its various backbone combinations, was analyzed using the mean IoU value and F1 score, as tabulated in Table 4.
The results in Table 4 clearly show that U-net performs best among the segmentation models. When U-net is chosen for segmentation of unstructured Indian roads, Efficientnetb7 or Inceptionv3 are the best-performing backbones, with the highest mean IoU values of 66.86% and 68.10%, respectively.
VGG19 obtains the second-highest mean IoU value, 65.40%, when used with PSPNet. VGG19 can extract wider roads but performs less effectively in complex environments, and it is demanding in terms of memory and time. Finally, MobilenetV2 needs the fewest trainable parameters but achieves lower segmentation performance than the other encoders. The highest F1 score, also called the Dice Similarity Coefficient (DSC), is 0.7667.
The IoU scores and the corresponding losses for training and validation are analyzed using the evaluation curves: Figure 4 and Figure 5 show the IoU score and the loss, respectively, for the training and validation phases. They indicate that pre-trained encoders converge more quickly than non-pre-trained encoder models.
Thus, after analyzing the performance of the different CNNs in each segmentation model, we can conclude that each encoder classifies different class labels with reasonable accuracy. To form the final meta-learning model, stacking-based ensembling is performed to consolidate the advantage of each encoder on different class labels. The final segmentation mask generated by this meta-model outperforms the non-ensembled models. The performance of the four resulting meta-learning models is evaluated by their mean IoU values, tabulated in Table 5.
The higher the mean IoU value, the better the prediction accuracy of the segmentation mask. It is therefore evident that the weight-ensembled U-net, with the highest mean IoU value, has the highest segmentation accuracy. Thus, the proposed ensembled U-net architecture achieves satisfactory results in an unstructured environment with high-traffic areas. Since mIoU returns a single value across all seven classes, the IoU is also validated per class for each ensembled model on the test dataset; these results are likewise presented in Table 5.
The table shows that the IoU value for some classes, such as non-drivable areas and roadside objects, is low due to the uncertain appearance of both classes in the IDD-Lite dataset: since the roads are covered with mud, it is not easy to distinguish between drivable and non-drivable areas. The experimental segmentation results for the generated weighted models are shown in Figure 6, with some input images from the IDD-Lite dataset and the corresponding ground truth. The final mask outputs of the ensembled U-net, LinkNet, FPN, and PSPNet models are shown in Figure 6c, Figure 6d, Figure 6e, and Figure 6f, respectively. Although the proposed ensembled method significantly improves the segmentation of each pixel in an image, it still faces challenges with occluded objects and with objects of the same class that occur together. Nevertheless, it is significant that the formulated meta-learning model works reliably on unstructured roadways under diverse environmental conditions.

6. Conclusions

In this work, a weight-ensembling method for semantic segmentation models was proposed to enhance pixel-level segmentation performance. The work focuses on improving performance by using weights to ensemble the output predictions obtained from different base models. A grid search algorithm learns the optimal weight vector used in the linear combination of output predictions, evaluating the model and generating the optimal weights via cross-validation and the least-squares error method. The optimal weight vectors are applied to the output predictions of the base models, and these weighted predictions are ensembled to form the four meta-learning models. The final results from the weight-ensembled U-net show a significant improvement in predicting pixel-level information for scene perception, with a mean IoU of 73.12%. This method outperforms existing baseline models in terms of detection accuracy and memory requirements. Although the primary goal of this work was deployment on mobile devices, we found that it can also be applied to high-end GPUs.
However, this approach faces challenges in selecting appropriate backbone networks for feature extraction and in the strategies used to combine the base learners' predictions. Finding criteria for choosing the most appropriate backbones for the base learner, and replacing the basic grid search algorithm with Deep-Learning techniques for faster convergence in training, will be the subjects of future work. Future research may also apply this approach to the full IDD dataset, which contains more classes per image. We also anticipate similar model performance on high-resolution images with a larger dataset.

Author Contributions

Conceptualization, S.S.; methodology, S.S.; formal analysis, S.S.; investigation, D.B.G.; supervision, D.B.G.; writing—original draft, S.S.; writing—review and editing, S.S. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Vellore Institute of Technology, Vellore 632014, India.

Institutional Review Board Statement

Not Applicable.

Informed Consent Statement

Not Applicable.

Data Availability Statement

The IDD-Lite dataset that supports the findings of this study is openly available at https://idd.insaan.iiit.ac.in/dataset/details/ (accessed on 14 October 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Baheti, B.; Innani, S.; Gajre, S.; Talbar, S. Eff-UNet: A Novel Architecture for Semantic Segmentation in Unstructured Environment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020; pp. 1473–1481.
2. Mateen, M.; Wen, J.; Nasrullah; Song, S.; Huang, Z. Fundus Image Classification Using VGG-19 Architecture with PCA and SVD. Symmetry 2019, 11, 1.
3. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
4. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018; pp. 4510–4520.
5. Xing, Y.; Zhong, L.; Zhong, X. DARSegNet: A Real-Time Semantic Segmentation Method Based on Dual Attention Fusion Module and Encoder-Decoder Network. Math. Probl. Eng. 2022, 2022, 6195148.
6. Hu, J.; Li, L.; Lin, Y.; Wu, F.; Zhao, J. A Comparison and Strategy of Semantic Segmentation on Remote Sensing Images. Adv. Intell. Syst. Comput. 2020, 1074, 21–29.
7. U-Net Architecture for Image Segmentation. Available online: https://blog.paperspace.com/unet-architecture-image-segmentation/ (accessed on 4 November 2022).
8. Chaurasia, A.; Culurciello, E. LinkNet: Exploiting Encoder Representations for Efficient Semantic Segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 2017; pp. 1–4.
9. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid Scene Parsing Network. In Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6230–6239.
10. Parmar, V.; Bhatia, N.; Negi, S.; Suri, M. Exploration of Optimized Semantic Segmentation Architectures for Edge-Deployment on Drones. arXiv 2020, arXiv:2007.02839.
11. Zhuang, J.; Yang, J.; Gu, L.; Dvornek, N. ShelfNet for Fast Semantic Segmentation. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 847–856.
12. Ngiam, J.; Khosla, A.; Kim, M.; Nam, J.; Lee, H.; Ng, A.Y. Multimodal Deep Learning. In Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, 28 June–2 July 2011; pp. 689–696.
13. Kumar, S. Summary of SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. Towards Data Science. Available online: https://towardsdatascience.com/summary-of-segnet-a-deep-convolutional-encoder-decoder-architecture-for-image-segmentation-75b2805d86f5 (accessed on 29 December 2021).
14. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495.
15. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2016, arXiv:1511.07122.
16. Paszke, A.; Chaurasia, A.; Kim, S.; Culurciello, E. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv 2016, arXiv:1606.02147.
17. Iglovikov, V.; Shvets, A. TernausNet: U-Net with VGG11 Encoder Pre-Trained on ImageNet for Image Segmentation. arXiv 2018, arXiv:1801.05746.
18. IDD Challenge—NCVPRIPG 2019. Available online: https://cvit.iiit.ac.in/ncvpripg19/idd-challenge/ (accessed on 29 December 2021).
19. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 3213–3223.
20. Mishra, A.; Kumar, S.; Kalluri, T.; Varma, G. Semantic Segmentation Datasets for Resource Constrained Training; Springer: Berlin/Heidelberg, Germany, 2020.
21. Siddique, N.; Paheding, S.; Elkin, C.P.; Devabhaktuni, V. U-Net and Its Variants for Medical Image Segmentation: A Review of Theory and Applications. IEEE Access 2021, 9, 82031–82057.
22. Li, X.; Lai, T.; Wang, S.; Chen, Q.; Yang, C.; Chen, R. Weighted Feature Pyramid Networks for Object Detection. In Proceedings of the 2019 IEEE Intl Conf on Parallel and Distributed Processing with Applications, Big Data and Cloud Computing, Sustainable Computing and Communications, Social Computing and Networking (ISPA/BDCloud/SustainCom/SocialCom), Xiamen, China, 16–18 December 2019; pp. 1500–1504.
23. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 10691–10700.
24. Szegedy, C.; Ioffe, S.; Vanhoucke, V.; Alemi, A.A. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), San Francisco, CA, USA, 4–9 February 2017; pp. 4278–4284.
25. Cho, Y.-J. Weighted Intersection over Union (WIoU): A New Evaluation Metric for Image Segmentation. arXiv 2021, arXiv:2107.09858.
26. Yakubovskiy, P. Segmentation Models Documentation. Available online: https://segmentation-models.readthedocs.io/_/downloads/en/v0.2.0/pdf/ (accessed on 17 April 2020).
Figure 1. The general design of encoder-decoder-based Semantic Segmentation Models (SSM).
Figure 2. Road-scene images from structured (Cityscapes) and unstructured (IDD) environments in the first and second rows, respectively.
Figure 3. Overview of the framework for stacking-based ensemble approach of Segmentation Models.
Figure 4. Jaccard Index (IoU Score) value for training and validation datasets for four Segmentation Models with different backbones. (a) U-Net; (b) Linknet; (c) PSPNet; and (d) FPN.
Figure 5. The categorical focal dice loss function for training and validation datasets for four Segmentation Models with different backbones. (a) U-Net; (b) LinkNet; (c) PSPNet; and (d) FPN.
Figure 6. Results of our four weight-ensembled Segmentation Models on the IDD-Lite dataset. (a) Different input images representing various conditions in an unstructured environment; (b) ground truth for the input data; (c) predicted segmentation results of the weight-ensembled U-net, with different colours indicating different classes; (d) weight-ensembled LinkNet prediction; (e) weight-ensembled PSPNet prediction; and (f) weight-ensembled FPN prediction.
Table 1. IDD-Lite dataset class labels with class names and the proportion of pixels in each class.
| Class Name | Class Label | Proportion of Pixels |
|---|---|---|
| Drivable area | Class 1 | 0.32 |
| Non-drivable area | Class 2 | 0.02 |
| Living things | Class 3 | 0.01 |
| Vehicles (two-wheeler, auto-rickshaw, large vehicle) | Class 4 | 0.10 |
| Roadside objects (barrier) | Class 5 | 0.12 |
| Far objects (construction) | Class 6 | 0.28 |
| Sky | Class 7 | 0.15 |
Table 2. Representation of each basic segmentation model with different encoders.
| | Resnet50 (1) | VGG19 (2) | Inceptionv3 (3) | Efficientnetb7 (4) | Mobilenetv2 (5) |
|---|---|---|---|---|---|
| U-Net (1) | M11 | M12 | M13 | M14 | M15 |
| LinkNet (2) | M21 | M22 | M23 | M24 | M25 |
| PSPNet (3) | M31 | M32 | M33 | M34 | M35 |
| FPN (4) | M41 | M42 | M43 | M44 | M45 |
Table 3. Confusion matrix for the predicted and actual label in performance measurement.
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Ground Truth Positive | True Positive (TP) | False Negative (FN) |
| Ground Truth Negative | False Positive (FP) | True Negative (TN) |
Table 4. Performance evaluation of different backbone architectures for SSM using mean IoU value and F1 score.
Evaluation scores (IoU / F1 score) of models with different backbone architectures:

| Model \ Backbone | VGG19 | Resnet50 | Efficientnetb7 | InceptionV3 | MobilenetV2 |
|---|---|---|---|---|---|
| U-Net | 0.5332 / 0.6532 | 0.6535 / 0.7667 | 0.6686 / 0.7002 | 0.6810 / 0.7053 | 0.6414 / 0.7623 |
| LinkNet | 0.5256 / 0.6400 | 0.5362 / 0.6388 | 0.6617 / 0.6728 | 0.6031 / 0.6457 | 0.5294 / 0.6234 |
| FPN | 0.6085 / 0.6500 | 0.5424 / 0.6978 | 0.5823 / 0.6964 | 0.5784 / 0.6874 | 0.5280 / 0.6547 |
| PSPNet | 0.6540 / 0.7088 | 0.5232 / 0.7284 | 0.5701 / 0.7003 | 0.5255 / 0.7665 | 0.4909 / 0.6832 |
Table 5. IoU Value and F1 score obtained by the ensembled Segmentation Models.
Evaluation scores of the ensembled methods (all models are the weight-ensembled variants):

| Class / Model | U-Net IoU | U-Net F1 | LinkNet IoU | LinkNet F1 | FPN IoU | FPN F1 | PSPNet IoU | PSPNet F1 |
|---|---|---|---|---|---|---|---|---|
| Class 1 | 0.9215 | 0.75 | 0.8015 | 0.652 | 0.78 | 0.69 | 0.72 | 0.725 |
| Class 2 | 0.6112 | | 0.5702 | | 0.55 | | 0.45 | |
| Class 3 | 0.6004 | | 0.462 | | 0.451 | | 0.45 | |
| Class 4 | 0.9018 | | 0.852 | | 0.752 | | 0.88 | |
| Class 5 | 0.7082 | | 0.68 | | 0.66 | | 0.69 | |
| Class 6 | 0.8815 | | 0.856 | | 0.725 | | 0.80 | |
| Class 7 | 0.7172 | | 0.6812 | | 0.66 | | 0.69 | |
| Mean score | 0.7312 | | 0.7052 | | 0.6552 | | 0.6875 | |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
