Article

Enhancing Weather Scene Identification Using Vision Transformer

by
Christine Dewi
1,2,*,
Muhammad Asad Arshed
3,*,
Henoch Juli Christanto
4,
Hafiz Abdul Rehman
3,
Amgad Muneer
5 and
Shahzad Mumtaz
6
1
Department of Information Technology, Satya Wacana Christian University, Salatiga 50711, Indonesia
2
School of Information Technology, Deakin University, Campus, 221 Burwood Hwy, Burwood, VIC 3125, Australia
3
School of Systems and Technology, University of Management and Technology, Lahore 54770, Pakistan
4
Department of Information System, Atma Jaya Catholic University of Indonesia, Jakarta 12930, Indonesia
5
Department of Computer Science and Information Sciences, Universiti Teknologi PETRONAS, Seri Iskandar 32160, Malaysia
6
School of Natural and Computing Sciences, University of Aberdeen, Aberdeen AB24 3UE, UK
*
Authors to whom correspondence should be addressed.
World Electr. Veh. J. 2024, 15(8), 373; https://doi.org/10.3390/wevj15080373
Submission received: 24 July 2024 / Revised: 9 August 2024 / Accepted: 15 August 2024 / Published: 16 August 2024

Abstract

The accuracy of weather scene recognition is critical in a world where weather affects every aspect of our daily lives, particularly in areas such as intelligent transportation networks, autonomous vehicles, and outdoor vision systems. Manual identification techniques are outdated, unreliable, and time-consuming, while real-time local weather scene recognition demands high accuracy. This work leverages computer vision to address these issues. Specifically, we employ a Vision Transformer (ViT) model to distinguish between 11 different weather scenarios. The fine-tuned model achieves an accuracy of 93.54%, surpassing widely used baselines such as MobileNetV2 and VGG16. These findings extend computer vision techniques into new domains and pave the way for reliable weather scene recognition systems with extensive real-world applications across various industries.

1. Introduction

Weather identification is the process of recognizing and forecasting weather patterns using advanced technologies such as computer vision and machine learning. This process is important to our daily lives because it enables informed decisions about outdoor activities, clothing choices, and travel. Accurate weather recognition also improves our capacity to anticipate and respond to emergencies by providing early warning of extreme weather events such as hurricanes, tornadoes, and floods. Furthermore, it is essential in guiding decisions for industries such as energy, transportation, and agriculture, all of which are greatly affected by atmospheric conditions.
Individuals rely on weather information over a given time period because it affects their everyday activities and habits. People often make decisions and schedule their activities based on the prevailing weather, such as going for a bike ride, booking a trip, or organizing a vacation. Weather must also be taken into account when planning company operations, transit systems, sporting events, and sightseeing trips, and it is critical to consider the weather at the location where these activities are held.
Weather is specific to a given location and is often measured via human observation or sensors. However, the high cost of camera sensors may strain local economies. As technology advances, artificial intelligence (AI) is expected to play an increasingly important role in embedded systems, enabling more accurate weather analysis while lowering hardware costs. AI is becoming more prominent in people’s lives, making numerous tasks easier, and many prominent organizations are incorporating AI into their technology and continuing to invest in its development. Deep learning, a branch of AI, uses architectures with hidden layers to automatically extract information from images, making it an effective tool for weather forecasting.
Over the last couple of decades, the research community and industry have actively invested in autonomous vehicle technology, aiming to revolutionize the transportation sector by adopting techniques that make vehicles affordable, safe, efficient, and convenient. Among the many challenges faced by autonomous vehicles is the need to accurately perceive the environment in terms of weather conditions (such as rain, fog, and snow), which can affect the associated sensors and compromise safety. An effective ability to predict weather conditions from visually captured camera data is therefore paramount to making autonomous vehicles safer.
Considerable research effort has been dedicated to the realm of weather image categorization employing deep learning architectures. Within these investigations, a plethora of approaches and methodologies have been utilized to attain elevated levels of classification precision. As an illustration, within a specific investigation, Elhoseiny et al. [1] leveraged attributes extracted from the fully connected layers within the AlexNet framework to categorize weather images into two distinct classes. Through the utilization of the SoftMax function to classify the derived attributes from the ultimate layer, they achieved a commendable accuracy of 91.1%.
Guerra et al. [2] conducted a thorough investigation into the classification of weather images, covering three different types of meteorological data. A hybrid approach combining augmentation techniques and superpixel technology was utilized to consistently improve pixel distribution throughout all images. The Convolutional Neural Network (CNN) model was trained over several iterations before the dataset was classified using the Support Vector Machine (SVM) method. Notably, the ResNet-50 model achieved a total accuracy of 80.7%, emerging as the top performer.
The significance of weather recognition is pronounced in various practical scenarios, particularly within systems aiding self-driving technology. This significance is evidenced by its ability to enhance road safety through measures like adjusting vehicle speed and modulating different lighting conditions based on real-time weather information. As a result, a subset of research endeavors has been dedicated to weather recognition via in-car cameras.
An approach denoted as template matching was proposed in studies [3,4] for the identification of raindrops on windshields, as they serve as robust indicators of rainy conditions. In [3], a framework was devised to establish three distinct global features that discriminate between overcast, sunny, and wet weather.
Roser and Moosmann [5] introduced a technique wherein the entire image was divided into thirteen equal sections with diverse histogram data extracted from each region individually to facilitate rain detection. Notably, beyond just rain, investigations have also extended to examining fog and haze. The application of Koschmieder’s Law [6] was demonstrated in [7] to compute visibility in foggy scenarios.
In studies [8,9], a process involving power spectra computation followed by the utilization of Gabor filters was employed to extract features for fog recognition. Bronte et al. [10] proposed an edge-oriented technique employing Sobel filters to identify edges and consequently assess the presence of fog. Moreover, Gallen et al. [11] introduced a method that employed the backscattered radiance pattern of light to specifically detect fog during nighttime conditions.
Furthermore, a cluster of investigations in the domain of weather recognition focuses on common outdoor photographs [12] for estimating prevailing weather conditions. This is achieved by employing illumination calculations on multiple images captured at a specific location. In the pursuit of identifying weather conditions, a myriad of global features was explored in [13], encompassing power spectral slope, edge gradient energy, inflection point particulars, contrast, and saturation.
To enhance weather category classification, Li et al. [14] devised an approach that amalgamated Support Vector Machines (SVMs) and decision mechanisms with an array of global characteristics. In a departure from previous methodologies, Lu et al. [15] proposed a solution to the two-class weather classification predicament, employing a diverse range of local factors such as the sky, shadows, and reflections.
Efforts to tackle the challenge of multiclass weather classification were undertaken by Zhang et al. in studies [16,17], where a blend of global and local features was harnessed. Addressing the identical two-class weather recognition problem, these researchers combined hand-crafted features with features extracted from Convolutional Neural Networks (CNNs), yielding notably improved outcomes.
Presently, computers possess the capability to analyze satellite imagery, enabling them to ascertain prevailing weather conditions and formulate forecasts. While this information is readily accessible through the internet, it is imperative to acknowledge the substantial variability in weather across diverse geographical locations. Within industries like transportation, the real-time classification of weather conditions holds particular significance. This is exemplified in applications like self-driving vehicles, where weather images aid in decisions such as activating windshield wipers during rain. Nonetheless, the task of classifying weather images presents challenges due to inherent similarities between distinct weather phenomena, such as mist and snow, or cloudiness and rainfall.
Image classification, a technology enabling computers to discern weather patterns from real-time images, holds immense potential. It serves as a foundational tool for the development of Advanced Driver Assistance Systems (ADASs) and autonomous machines. To categorize weather into four classes (cloudy, wet, snowy, or clear), the study [18] employed the AlexNet and GoogleNet architectures. GoogleNet demonstrated an accuracy of 92.0%, while AlexNet achieved 91.1%. However, pertinent details about the distribution of training and test data as well as dataset acquisition methods were omitted.
Meanwhile, Xia et al. [19] embarked on the task of classifying weather images into four categories: foggy, rainy, snowy, and sunny, and they achieved an accuracy of 86.47% with AlexNet. Notably, the computation duration for each design was not provided.
In a different study [20], CNN and transfer learning were harnessed for the classification of weather images, focusing on binary label classification of “With Rain” (WR) and “No Rain” (NR) using the VGG16 architecture; they achieved an accuracy of 85.28%.
Transfer learning, a pivotal technique in machine learning, emerges as a solution to the inherent challenge of limited training data [21]. Built on the assumption of unbiased and evenly distributed training and testing data, this approach facilitates the transfer of knowledge from a source domain to a target domain. Within the realm of computer vision, transfer learning expedites learning processes and enhances performance. Typically, pre-trained models are trained on diverse image datasets and subsequently retrained on specific source datasets. Notably, past research has not ventured into the realm of multiclass weather image identification employing diverse CNN architectures via transfer learning.
The study by Chu et al. [22] undertakes the classification of weather images across six distinct categories: cloudy, foggy, rainy, sunny, snowy, and sunrise. To achieve this and to expedite model development with enhanced performance, transfer learning is employed, with the ImageNet dataset serving as its foundation. To enable rigorous comparison with future research, the dataset is curated from publicly available sources such as Kaggle and the Camera as Weather Sensor (CWS) dataset [22]. Performance in this study is assessed using accuracy, precision, recall, and the F1 score.
Highlighting the relevance of weather recognition across diverse sectors, including autonomous driving and agriculture, Młodzianowski [23] emphasizes an image-centric weather detection system as a viable solution. This system harnesses transfer learning to accurately classify weather conditions, even when faced with a limited dataset. In the pursuit of this objective, the study introduces three weather recognition models grounded in the architectures of ResNet50, MobileNetV2, and InceptionV3, subsequently delving into a comparative analysis of their efficiencies.
From the literature, we conclude that most existing studies are based on traditional CNN architectures and consider only a limited number of classes; there is therefore a need for an effective model that can identify a larger number of weather scenes. The contributions of this study are listed below:
  • Evaluation of Enhanced Vision-Transformer (ViT) Model: The study introduces and assesses a fine-tuned Vision-Transformer (ViT) model for weather scene recognition. The evaluation showcases the model’s superiority when compared against two conventional pre-trained CNN-based models (VGG-16 and MobileNetV2). This comparison sheds light on the enhanced capabilities of the ViT model in accurately discerning weather patterns.
  • Global Feature Extraction: In this study, we introduce a patch-wise self-attention module and a global feature extraction technique, both of which constitute significant contributions to our research.
  • CNN-Based Pre-Trained Models: In this study, we have additionally conducted fine-tuning on pre-trained CNN-based models, such as VGG16 and MobileNetV2, for comparison with the proposed ViT model.
  • Advancement to Multiclass Weather Scene Classification: Going beyond binary classification, this study advances the field by concentrating on multiclass weather scene classification. This approach acknowledges the intricate and diverse nature of real-world weather scenarios.
  • In-Depth Exploration of Transfer Learning: The study dedicates attention to the efficacy of transfer learning-based pre-trained models with a particular focus on the fine-tuned ViT model. Through thorough investigation and comparison, the research provides valuable insights into the potential advantages of employing transfer learning for weather recognition tasks.

2. Materials and Methods

In this section, a dataset description, preprocessing and methodology are presented in detail. Preprocessing ensures that the dataset is optimally prepared for the following stages by skillfully handling these processes.

2.1. Dataset Description

The selected dataset comprises 11 distinct classes, each representing various weather conditions [24]. Each class contains a specific number of images. For a visual depiction of the dataset, see Figure 1.

2.2. Vision Transformer (ViT) Architecture

ViT was introduced in 2020 as a deep neural network (DNN) architecture for image recognition [25]. The transformer architecture was originally designed for natural language processing; ViT introduced the innovative idea that an image can be treated as a sequence of patches, i.e., tokens. ViT uses the transformative capabilities built into the transformer design to process these token sequences. It is worth noting that the transformer architecture, which serves as the foundation for ViT, has showcased its adaptability and effectiveness across a diverse array of tasks, including image restoration and object detection [26], underscoring its broad applicability and performance capabilities [27]. Thanks to the combined effects of tokenization and embedding, ViT can extract a full perspective from the input image that includes both local and global features.
ViT introduces predefined positional embeddings. These additional vectors serve as repositories for encoding the precise positions of tokens within the sequence before they are processed by the transformer layers. This strategic integration allows the model to determine the relative placements of tokens and extract important spatial information from the input image.
The ViT architecture is based on the Multi-head Self-Attention (MSA) mechanism. MSA empowers the model to focus simultaneously on various regions within the image. The mechanism comprises distinct “heads”, each of which computes attention independently and can concentrate on a different image segment. The resulting representations are joined together to form the complete representation of the image. This simultaneous attention to multiple parts grants ViT the capability to capture intricate relationships among input elements. However, it is important to note that this enhancement increases the model’s complexity and computational requirements due to the larger number of attention heads and the additional processing required to aggregate their outputs. The mathematical expression for MSA can be formulated as follows:
$\mathrm{MSA} = \mathrm{Concatenation}(H_1, H_2, \ldots, H_n)$ (1)
The self-attention mechanism plays a fundamental role in transformer architectures, facilitating the modeling of interactions and associations across sequences in predictive tasks. In contrast to Convolutional Neural Networks, the self-attention layer consolidates insights and characteristics from the complete input sequence. Because it combines global information with local information, it produces a more faithful representation of the input. The attention mechanism operates by computing the scalar product between the query and the key vectors, normalizing the resulting attention scores with the SoftMax function, and then weighting the value vectors to produce an improved output. A comprehensive investigation by Cordonnier et al. [28] delved into the interplay between convolutional layers and the self-attention mechanism. Their research revealed that self-attention is an exceptionally versatile mechanism capable of capturing both local and global characteristics. This highlights the adaptability and flexibility inherent in self-attention, setting it apart from conventional convolutional techniques.
For a visual representation of the ViT network at an abstract level, please refer to Figure 2, which illustrates the major components of an effective ViT model.
Patch Embedding: In the ViT framework, the input image is partitioned into fixed-size, non-overlapping patches. Each of these patches undergoes a linear transformation, facilitated by a learned linear transformation matrix, converting the 2D spatial information within the image into a sequential arrangement of embeddings [29].
$E_{patch} = X \cdot W_{patch}$ (2)
In Equation (2), Epatch, X, and Wpatch represent the patch embeddings, image patches, and learned linear transformation matrix, respectively.
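To make Equation (2) concrete, the following minimal NumPy sketch (illustrative only; the variable names and random values are ours, not the study’s implementation) turns a 224 × 224 × 3 image into a sequence of patch embeddings: with 16 × 16 patches there are (224/16)² = 196 patches, each flattened to a 16 × 16 × 3 = 768-dimensional vector and projected by a learned matrix Wpatch.

```python
import numpy as np

# Sketch of Equation (2): E_patch = X . W_patch (random stand-ins for learned weights).
image = np.random.rand(224, 224, 3)                     # a preprocessed weather image
patch, embed_dim = 16, 768
grid = 224 // patch                                     # 14 patches per side, 196 in total

# Split the image into non-overlapping 16x16x3 patches and flatten each one.
patches = image.reshape(grid, patch, grid, patch, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(grid * grid, patch * patch * 3)

W_patch = np.random.randn(patch * patch * 3, embed_dim) * 0.02   # learned projection
E_patch = patches @ W_patch                             # patch embeddings, shape (196, 768)
print(E_patch.shape)
```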
Positional Embedding: Given that the transformer architecture lacks an inherent understanding of the spatial arrangement of these patches, positional information must be injected. This is achieved through positional embeddings, which are added to the patch embeddings. These positional embeddings furnish crucial details about the spatial position of each patch within the original image [30].
$E_{pos} = \mathrm{Positional\_Embeddings}(i, j)$ (3)
In Equation (3), Epos represents positional embeddings and i, j represent the spatial coordinates of the patch within the image.
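Continuing the sketch above, Equation (3) can be illustrated by adding one learnable position vector per patch; in ViT these vectors are trained with the model, whereas the values below are random placeholders.

```python
import numpy as np

# Sketch of Equation (3): one positional embedding per patch position, added element-wise.
num_patches, embed_dim = 196, 768
E_patch = np.random.randn(num_patches, embed_dim)       # patch embeddings from Equation (2)
E_pos = np.random.randn(num_patches, embed_dim) * 0.02  # learnable positional embeddings
tokens = E_patch + E_pos                                # position-aware tokens fed to the encoder
print(tokens.shape)                                     # (196, 768)
```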
Transformer Encoder: The patch embeddings, augmented with the positional embeddings Epos, traverse a sequence of transformer encoder (TE) layers. Each layer combines two components: a self-attention mechanism and a feedforward neural network [31]. The self-attention mechanism empowers each patch to attend to all other patches, effectively capturing intricate relationships across the image globally [32]. The feedforward network then conducts additional processing to refine these attended representations. As a result of this encoding process, a contextualized embedding is generated for each patch, adeptly encapsulating a rich blend of both local and global information inherently embedded within the image.
$A = \mathrm{SoftMax}\left(\frac{QK^T}{\sqrt{d_k}}\right)$ (4)
In Equation (4), A, Q, K, and $d_k$ represent the attention scores, query matrix, key matrix, and dimension of the key vectors, respectively; the resulting scores are then applied to the value matrix V to produce the attended output.
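A compact sketch of Equations (1) and (4) follows; the two-head setup, dimensions, and random weights are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(tokens, Wq, Wk, Wv):
    # Equation (4): A = SoftMax(Q K^T / sqrt(d_k)); the scores then weight the values V.
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

tokens = np.random.randn(196, 768)                      # position-aware patch tokens
head_dim, num_heads, outputs = 64, 2, []
for _ in range(num_heads):                              # each head has its own projections
    Wq, Wk, Wv = (np.random.randn(768, head_dim) * 0.02 for _ in range(3))
    outputs.append(attention_head(tokens, Wq, Wk, Wv))

msa = np.concatenate(outputs, axis=-1)                  # Equation (1): concatenate head outputs
print(msa.shape)                                        # (196, num_heads * head_dim)
```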
Classification Head: The final contextualized embeddings produced by the transformer encoder serve as the foundation for downstream tasks, including image classification. In the context of classification tasks, various strategies can be employed to process these contextualized embeddings. A commonly used approach is to take either a dedicated classification token or the average of the patch embeddings; fully connected layers are then applied to generate class predictions from this representation.
$P_{Class} = \mathrm{SoftMax}(W_{Class} \cdot \mathrm{AveragePooling}(E_{Contextualized}))$ (5)
In Equation (5), PClass, WClass, and AveragePooling (EContextualized) represent the class predictions, weights of the classification layer, and average pooling of contextualized embeddings, respectively.
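Equation (5) can likewise be sketched in a few lines: the contextualized patch embeddings are average-pooled and passed through a linear layer followed by SoftMax over the 11 weather classes (the weights here are random placeholders for illustration).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sketch of Equation (5): average pooling + classification layer + SoftMax.
num_classes, embed_dim = 11, 768
E_contextualized = np.random.randn(196, embed_dim)      # encoder output, one row per patch

pooled = E_contextualized.mean(axis=0)                  # AveragePooling over patch tokens
W_class = np.random.randn(num_classes, embed_dim) * 0.02
P_class = softmax(W_class @ pooled)                     # probabilities over the 11 classes
print(P_class.argmax(), round(P_class.sum(), 4))        # predicted class index; sums to 1.0
```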

2.3. Hyperparameters for ViT Pre-Trained Model

In this research, the initial images underwent preprocessing and were resized to dimensions of 224 × 224 pixels. These resized images were subsequently partitioned into non-overlapping patches, each measuring 16 × 16 pixels, yielding 14 × 14 = 196 patches per image.
For this research, the model employed was pre-trained on a substantial dataset known as ImageNet-21k [33,34]. This extensive dataset encompasses approximately 14 million images categorized into 21,841 distinct classes and is tailored explicitly for large-scale image classification tasks. The architecture of the model comprises 12 transformer layers, each housing 768 hidden components. The model’s substantial capacity is reflected in its 85.8 million trainable parameters, which greatly contribute to its learning capabilities. The specific parameter values and configurations used in the ViT model are listed in Table 1.
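As a hedged illustration of this configuration, the sketch below loads an ImageNet-21k pre-trained ViT-Base model (16 × 16 patches, 12 layers, hidden size 768, matching Table 1) and attaches an 11-class head. The Hugging Face transformers library, the google/vit-base-patch16-224-in21k checkpoint, and the Adam optimizer are our assumptions; the paper states only that Keras was used with the hyperparameters in Tables 1 and 3.

```python
# Sketch only: assumes the Hugging Face `transformers` TensorFlow/Keras API and the
# `google/vit-base-patch16-224-in21k` checkpoint, neither of which is named in the paper.
import tensorflow as tf
from transformers import TFViTForImageClassification

model = TFViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",   # ImageNet-21k weights; 12 layers, hidden size 768
    num_labels=11,                         # new head for the 11 weather scene classes
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),   # learning rate from Table 3
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(train_ds, validation_data=test_ds, epochs=30)     # epochs/batch size per Table 3
```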

2.4. Experimental Setup

In this study, the Keras deep learning framework and the Python programming language were used for the experiments. The experiments were carried out with the free version of Google Colab (https://colab.research.google.com/ (accessed on 1 April 2024)). The experiment configuration details are available in Table 2.

3. Results and Discussion

In this section, we provide an in-depth exploration of the evaluation metrics employed, delve into the specifics of our experimental procedures, and present the outcomes derived from the methodology we have proposed.

3.1. Performance Evaluation Metrics

Evaluating the efficacy of machine learning and deep learning models hinges on key performance indicators. These metrics play a pivotal role in machine learning, deep learning, and statistical investigation [35]. This study focuses on four evaluation criteria to gauge the performance of the proposed model; a short computational sketch follows the list below.
  • Accuracy: The accuracy metric gauges the overall correctness of the model’s predictions by determining the ratio of correctly classified instances to the total number of samples. Nevertheless, in situations characterized by imbalanced datasets or cases where different types of errors carry varying degrees of importance, depending solely on accuracy may prove inadequate for a comprehensive assessment.
    $Acc = \frac{TP + TN}{TP + FP + TN + FN}$ (6)
    In Equations (6)–(9), TP, FP, TN, FN, Acc, P, and R represent the true positive, false positive, true negative, false negative, accuracy, precision, and recall, respectively.
  • Precision: Precision measures a model’s ability to correctly identify positive samples among the actual positive instances. This metric quantifies the proportion of true positives concerning the total of true positives and false positives.
    $P = \frac{TP}{TP + FP}$ (7)
  • Recall: We evaluate the model’s ability to accurately detect positive samples within the actual positive pool using the recall metric, which is also referred to as sensitivity or the true positive rate. This measurement is obtained by determining the ratio of true positives to the sum of true positives and false negatives. In essence, recall offers insights into the comprehensiveness of positive predictions.
    $R = \frac{TP}{TP + FN}$ (8)
  • F1-Score: The harmonic mean of precision and recall is known as the F1-score. The F1-score falls within the range of 0 to 1 with its optimal performance achieved at a score of 1.
    $F1 = \frac{2 \times P \times R}{P + R}$ (9)
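As referenced above, a short illustrative sketch of computing Equations (6)–(9) with scikit-learn follows; the library choice and the weighted averaging are our assumptions, and the dummy label lists simply stand in for the 511 test predictions.

```python
# Illustrative sketch of Equations (6)-(9); replace the dummy lists with the real
# ground-truth and predicted class indices for the 511 test images.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 1, 2, 2, 10, 6, 6, 3]        # dummy ground-truth class ids
y_pred = [0, 1, 2, 1, 10, 6, 5, 3]        # dummy model predictions

acc = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"Acc {acc:.4f}  P {precision:.4f}  R {recall:.4f}  F1 {f1:.4f}")
```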

3.2. Experimental Results

The fine-tuned ViT model was employed to identify meteorological scenes. ViT’s key benefit over conventional CNNs is its ability to be directly supervised and trained on large datasets, eliminating the need for pre-training on auxiliary tasks. Furthermore, ViT has attained cutting-edge performance across a wide range of image recognition applications, all while maintaining a far lower parameter count than traditional CNN designs. In light of these compelling advantages, we rigorously fine-tuned the ViT model for the specific purpose of identifying weather scenes, producing outstanding results.
Table 3 presents a detailed overview of the hyperparameters employed during the fine-tuning of the ViT model. Further, we used a Google Colab Tensor Processing Unit (TPU) for the experiments [36].
We have partitioned the dataset into training and testing subsets using a split ratio of 0.2. The resulting dataset following this partition is displayed in Table 4.
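A minimal sketch of such an 80/20 split with scikit-learn is shown below; the stratification and random seed are our assumptions, and the dummy lists stand in for the actual image paths and labels, so the per-class counts in Table 4 will differ.

```python
# Sketch of the 0.2 train/test split behind Table 4 (stratification is an assumption).
from sklearn.model_selection import train_test_split

image_paths = [f"img_{i}.jpg" for i in range(2552)]     # stand-ins for the dataset images
labels = [i % 11 for i in range(2552)]                  # stand-ins for the 11 class labels

train_x, test_x, train_y, test_y = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42
)
print(len(train_x), len(test_x))                        # 2041 and 511, as in Table 4
```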
We evaluated the proposed model’s performance using class-specific precision, recall, and F1-scores, as shown in Table 5. The “support” column in the table indicates the sample count for each class. For example, the “snow” class contains 41 testing samples, while the “rain” class contains 50. Notably, the cumulative sum of the “support” column equals 511, indicating that our model was evaluated on 511 samples.
In situations with uneven class distributions or significant discrepancies in the cost of misclassifying distinct classes, a confusion matrix is invaluable for assessing the performance of a classification model [37]. This matrix underpins essential measures such as accuracy, precision, recall, and F1-score. Figure 3 depicts the confusion matrix for the ViT model, allowing for an informative assessment of its performance.
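A confusion matrix such as the one in Figure 3 can be produced with a few lines of scikit-learn, as sketched below with dummy predictions; the class order simply follows the dataset’s 11 classes.

```python
# Sketch of building a class-wise confusion matrix like Figure 3 (dummy labels only).
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

class_names = ["dew", "fog-smog", "frost", "glaze", "hail", "lightning",
               "rain", "rainbow", "rime", "sandstorm", "snow"]
y_true = [0, 1, 2, 10, 10, 6]                           # dummy ground truth
y_pred = [0, 1, 2, 10, 6, 6]                            # dummy predictions

cm = confusion_matrix(y_true, y_pred, labels=range(len(class_names)))
ConfusionMatrixDisplay(cm, display_labels=class_names).plot(xticks_rotation=45)
plt.tight_layout()
plt.show()
```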
To assess the resilience of our proposed model, we conducted a comparative analysis between the ViT model and two leading pre-trained CNN-based models: VGG16 [38] and MobileNetV2 [39]. The effectiveness of the ViT model can be attributed to its advanced global feature extraction technique, as illustrated in Figure 4. We have achieved an accuracy of 0.9061 with MobileNetV2 and 0.8991 with the VGG16 pre-trained model. The proposed ViT model outperformed VGG16 and MobileNetV2 in terms of all evaluation metrics, i.e., accuracy, precision, recall, and F1-score.
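For context, the CNN baselines can be built with standard Keras transfer learning, as in the sketch below for MobileNetV2 (VGG16 is analogous). The frozen base, pooling layer, dropout rate, and optimizer are our assumptions, not settings reported in the paper.

```python
# Sketch of a Keras transfer-learning baseline with MobileNetV2 for the 11 weather classes.
import tensorflow as tf

base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3)
)
base.trainable = False                                  # keep ImageNet features fixed (assumption)

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(11, activation="softmax"),    # 11-class weather head
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_ds, validation_data=test_ds, epochs=30)
```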
The method we have introduced demonstrates superior performance compared to state-of-the-art approaches, as evidenced by the comparison presented in Table 6. In Table 6, Xia et al. [19] achieved 96.03% accuracy, but their dataset covered only four classes. Most of the studies in Table 6 consider four or fewer classes; the study of Li et al. [41] considers 10 classes, but all of them are cloud types, and it achieved an accuracy of 80%. In our study, we considered 11 classes and achieved an accuracy of 93.54%.

3.3. Robustness and Limitation of the Proposed ViT

The proposed ViT for weather scene identification brings forth notable benefits and some inherent limitations. The model excels in capturing global contextual information, which is a critical aspect of understanding weather patterns. Its scalability allows researchers to adapt model sizes to the complexity of the task, from basic weather classifications to fine-grained analyses. Using pre-trained weights enables knowledge transfer from huge datasets, which improves performance. ViT’s self-attention processes allow for selective concentration on relevant visual regions, which aids in the recognition of subtle weather patterns. However, challenges persist, such as the need for significant labeled data, the computational resources required for training, reduced interpretability compared to simpler models, and sensitivity to image resolution. When evaluating the use of ViT models for weather scene recognition, researchers must carefully weigh these considerations, aiming to maximize their benefits while addressing their limitations.

3.4. Theoretical and Practical Implications

This study makes theoretical contributions by broadening the scope of ViT applications in computer vision and the complex realm of weather scene detection. This study advances our understanding of complicated scene recognition and ViT’s excellent generalization capabilities across a wide range of environmental conditions. On a practical level, the findings have far-reaching impacts. They empower weather forecasting agencies with ViT-based tools for more accurate predictions, enhancing decision making in agriculture, transportation, and disaster management. ViT-equipped autonomous vehicles and drones ensure enhanced safety, promoting advancements in self-driving technology and surveillance systems. Additionally, ViT models support environmental monitoring, climate research, renewable energy optimization, and smart city development while also enhancing consumer-oriented weather applications. In summary, this research bridges theory and practice, fostering innovation and informed decision making across a spectrum of industries in response to dynamic weather scenarios.

4. Conclusions

This paper presents the utilization of patch-based technology, specifically the ViT model, for single-image scene identification through deep learning and computer vision. The proposed ViT model can recognize 11 distinct weather-related classes. The main aim of this research is to showcase the practical use of deep learning and computer vision in enhancing scene awareness, particularly in urban environments. These insights have broad applications, including enabling autonomy in urban settings and beyond.
The rigorous evaluation of our three models, MobileNetV2, VGG-16, and the fine-tuned ViT, in the context of weather scene recognition clearly identified the ViT model as the leader. While VGG-16 achieved an accuracy of 89.91% and MobileNetV2 reached 90.61%, the fine-tuned ViT model achieved an accuracy of 93.54%. These results establish the ViT model’s superiority over its CNN-based counterparts, positioning it as the preferred choice for weather recognition tasks requiring heightened accuracy.
Taking our research further into real-time weather scene recognition scenarios, particularly in dynamic weather conditions, holds immense potential for practical applications. Such endeavors could unearth invaluable insights, further enhancing the ViT model’s real-world effectiveness and ensuring its relevance in the ever-evolving field of weather scene identification.

Author Contributions

Conceptualization, C.D., M.A.A., H.J.C., H.A.R., A.M. and S.M.; methodology, C.D., M.A.A., H.J.C., H.A.R., A.M. and S.M.; validation, C.D., M.A.A., H.J.C., H.A.R., A.M. and S.M.; investigation, C.D., M.A.A., H.J.C., H.A.R., A.M. and S.M.; data curation, C.D., M.A.A., H.J.C., H.A.R., A.M. and S.M.; writing—original draft preparation, C.D., M.A.A., H.J.C., H.A.R., A.M. and S.M.; writing—review and editing, C.D., M.A.A., H.J.C., H.A.R., A.M. and S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

“Weather Detection Image Dataset.” Accessed: 1 April 2024. [Online]. Available: https://www.kaggle.com/datasets/tamimresearch/weather-detection-image-dataset/data.

Acknowledgments

This research is sponsored by DIREKTORAT RISET DAN PENGABDIAN MASYARAKAT Satya Wacana Christian University, Indonesia.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Elhoseiny, M.; Huang, S.; Elgammal, A. Weather classification with deep convolutional neural networks. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 3349–3353. [Google Scholar]
  2. Guerra, J.C.V.; Khanam, Z.; Ehsan, S.; Stolkin, R.; McDonald-Maier, K. Weather Classification: A new multi-class dataset, data augmentation approach and comprehensive evaluations of Convolutional Neural Networks. In Proceedings of the 2018 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), Edinburgh, UK, 6–9 August 2018; pp. 305–310. [Google Scholar]
  3. Yan, X.; Luo, Y.; Zheng, X. Weather recognition based on images captured by vision system in vehicle. In Advances in Neural Networks–ISNN 2009: 6th International Symposium on Neural Networks, ISNN 2009, Wuhan, China, 26–29 May 2009; Proceedings, Part III 6; Springer: Berlin/Heidelberg, Germany, 2009; pp. 390–398. [Google Scholar]
  4. Kurihata, H.; Takahashi, T.; Mekada, Y.; Ide, I.; Murase, H.; Tamatsu, Y.; Miyahara, T. Raindrop detection from in-vehicle video camera images for rainfall judgment. In Proceedings of the First International Conference on Innovative Computing, Information and Control-Volume I (ICICIC’06), Beijing, China, 30 August–1 September 2006; pp. 544–547. [Google Scholar]
  5. Roser, M.; Moosmann, F. Classification of weather situations on single color images. In Proceedings of the 2008 IEEE Intelligent Vehicles Symposium, Eindhoven, The Netherlands, 4–6 June 2008; pp. 798–803. [Google Scholar]
  6. Middleton, W.E.K. Vision through the Atmosphere; Springer: Berlin/Heidelberg, Germany, 1957. [Google Scholar]
  7. Hautiere, N.; Tarel, J.-P.; Lavenant, J.; Aubert, D. Automatic fog detection and estimation of visibility distance through use of an onboard camera. Mach. Vis. Appl. 2006, 17, 8–20. [Google Scholar] [CrossRef]
  8. Pavlic, M.; Rigoll, G.; Ilic, S. Classification of images in fog and fog-free scenes for use in vehicles. In Proceedings of the 2013 IEEE Intelligent Vehicles Symposium (IV), Gold Coast, QLD, Australia, 23–26 June 2013; pp. 481–486. [Google Scholar]
  9. Pavlić, M.; Belzner, H.; Rigoll, G.; Ilić, S. Image based fog detection in vehicles. In Proceedings of the 2012 IEEE Intelligent Vehicles Symposium, Madrid, Spain, 3–7 June 2012; pp. 1132–1137. [Google Scholar]
  10. Bronte, S.; Bergasa, L.M.; Alcantarilla, P.F. Fog detection system based on computer vision techniques. In Proceedings of the 2009 12th International IEEE Conference on Intelligent Transportation Systems, St. Louis, MO, USA, 4–7 October 2009; pp. 1–6. [Google Scholar]
  11. Gallen, R.; Cord, A.; Hautière, N.; Aubert, D. Towards night fog detection through use of in-vehicle multipurpose cameras. In Proceedings of the 2011 IEEE Intelligent Vehicles Symposium (IV), Baden-Baden, Germany, 5–9 June 2011; pp. 399–404. [Google Scholar]
  12. Shen, L.; Tan, P. Photometric stereo and weather estimation using internet images. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 1850–1857. [Google Scholar]
  13. Song, H.; Chen, Y.; Gao, Y. Weather condition recognition based on feature extraction and K-NN. In Foundations and Practical Applications of Cognitive Systems and Information Processing: Proceedings of the First International Conference on Cognitive Systems and Information Processing, Beijing, China, 15–17 December 2012 (CSIP2012); Springer: Berlin/Heidelberg, Germany, 2014; pp. 199–210. [Google Scholar]
  14. Li, Q.; Kong, Y.; Xia, S.-M. A method of weather recognition based on outdoor images. In Proceedings of the 2014 International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, 5–8 January 2014; pp. 510–516. [Google Scholar]
  15. Lu, C.; Lin, D.; Jia, J.; Tang, C.-K. Two-class weather classification. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 3718–3725. [Google Scholar]
  16. Zhang, Z.; Ma, H. Multi-class weather classification on single images. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 4396–4400. [Google Scholar]
  17. Zhang, Z.; Ma, H.; Fu, H.; Zhang, C. Scene-free multi-class weather classification on single images. Neurocomputing 2016, 207, 365–373. [Google Scholar] [CrossRef]
  18. Kang, L.-W.; Chou, K.-L.; Fu, R.-H. Deep learning-based weather image recognition. In Proceedings of the 2018 International Symposium on Computer, Consumer and Control (IS3C), Taichung, Taiwan, 6–8 December 2018; pp. 384–387. [Google Scholar]
  19. Xia, J.; Xuan, D.; Tan, L.; Xing, L. ResNet15: Weather recognition on traffic road with deep convolutional neural network. Adv. Meteorol. 2020, 2020, 1–11. [Google Scholar] [CrossRef]
  20. Notarangelo, N.M.; Hirano, K.; Albano, R.; Sole, A. Transfer learning with convolutional neural networks for rainfall detection in single images. Water 2021, 13, 588. [Google Scholar] [CrossRef]
  21. Tan, C.; Sun, F.; Kong, T.; Zhang, W.; Yang, C.; Liu, C. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, 4–7 October 2018; Proceedings, Part III 27; Springer: Cham, Switzerland, 2018; pp. 270–279. [Google Scholar]
  22. Chu, W.-T.; Zheng, X.-Y.; Ding, D.-S. Camera as weather sensor: Estimating weather information from single images. J. Vis. Commun. Image Represent. 2017, 46, 233–249. [Google Scholar] [CrossRef]
  23. Młodzianowski, P. Weather Classification with Transfer Learning-InceptionV3, MobileNetV2 and ResNet50. In Digital Interaction and Machine Intelligence: Proceedings of MIDI’2021–9th Machine Intelligence and Digital Interaction Conference, Warsaw, Poland, 9–10 December 2021; Springer: Cham, Switzerland, 2022; pp. 3–11. [Google Scholar]
  24. Weather Detection Image Dataset. Available online: https://www.kaggle.com/datasets/tamimresearch/weather-detection-image-dataset/data (accessed on 16 April 2024).
  25. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Houlsby, N. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  26. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision; Glasgow, UK, 23–28 August 2020, Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2020; Volume 12346 LNCS, pp. 213–229. [Google Scholar]
  27. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  28. Cordonnier, J.B.; Loukas, A.; Jaggi, M. On the relationship between self-attention and convolutional layers. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  29. Trockman, A.; Kolter, J.Z. Patches Are All You Need? arXiv 2022, arXiv:2201.09792. [Google Scholar]
  30. Islam, K. Recent advances in vision transformer: A survey and outlook of recent work. arXiv 2023, arXiv:2203.01536. [Google Scholar]
  31. Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. arXiv 2023, arXiv:2201.09450. [Google Scholar] [CrossRef]
  32. Liu, C.; Hirota, K.; Dai, Y. Patch attention convolutional vision transformer for facial expression recognition with occlusion. Inf. Sci. 2023, 619, 781–794. [Google Scholar] [CrossRef]
  33. Ridnik, T.; Ben-Baruch, E.; Noy, A.; Zelnik-Manor, L. ImageNet-21K Pretraining for the Masses. arXiv 2021, arXiv:2104.10972v4. [Google Scholar]
  34. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105. [Google Scholar] [CrossRef]
  35. Evaluation Metrics Machine Learning. Available online: https://www.analyticsvidhya.com/blog/2019/08/11-important-model-evaluation-error-metrics/ (accessed on 8 December 2020).
  36. colab.google. Available online: https://colab.google/ (accessed on 9 September 2023).
  37. Understanding Confusion Matrix | by Sarang Narkhede | Towards Data Science. Available online: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62 (accessed on 16 June 2021).
  38. Understanding VGG16: Concepts, Architecture, and Performance. Available online: https://datagen.tech/guides/computer-vision/vgg16/ (accessed on 13 February 2023).
  39. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
  40. Minhas, S.; Khanam, Z.; Ehsan, S.; McDonald-Maier, K.; Hernández-Sabaté, A. Weather Classification by Utilizing Synthetic Data. Sensors 2022, 22, 3193. [Google Scholar] [CrossRef] [PubMed]
  41. Li, Z.; Kong, H.; Wong, C.S. Neural Network-Based Identification of Cloud Types from Ground-Based Images of Cloud Layers. Appl. Sci. 2023, 13, 4470. [Google Scholar] [CrossRef]
  42. Ogunrinde, I.; Bernadin, S. Deep Camera–Radar Fusion with an Attention Framework for Autonomous Vehicle Vision in Foggy Weather Conditions. Sensors 2023, 23, 6255. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Weather classification dataset: (a) dew, (b) fog–smog, (c) frost, (d) glaze, (e) hail, (f) lightning, (g) rain, (h) rainbow, (i) rime, (j) sandstorm, (k) snow.
Figure 2. ViT abstract level architecture diagram [25].
Figure 3. Confusion matrix of ViT model.
Figure 4. Comparison of ViT, VGG-16 and MobileNetV2 models.
Table 1. Hyperparameters configurations of ViT model.
Hyperparameter | Value
Encoder Dimensionality | 768
Transformer Encoder Hidden Layers | 12
Activation Function of Hidden Layers | GELU
Dropout Rate of Hidden Layers | 0.1
Image Channels | 3
Image Size | 224 × 224
Patches of the Images | 16 × 16
Table 2. Hardware/software configurations.
Hardware/Software Information | Configuration
Operating System (OS) | Windows 10
Colab | Google Free Version
Memory | 12.7 GB
CUDA Version | 12.2
GPU Memory | 12–16 GB
Table 3. Hyperparameters for Vision Transformer model.
Layer Type | Parameters
Architecture | Vision Transformer
Data Balancing | Yes
Augmentation | Yes
Learning Rate | 2 × 10^−5
Epochs | 30
Batch Size | 32
Table 4. Class-wise dataset samples for training and testing.
Dataset | Dew | Fog–Smog | Frost | Glaze | Hail | Lightning | Rain | Rainbow | Rime | Sandstorm | Snow | Total
Train | 195 | 176 | 180 | 173 | 182 | 195 | 182 | 198 | 181 | 188 | 191 | 2041
Test | 37 | 56 | 52 | 59 | 50 | 37 | 50 | 34 | 51 | 44 | 41 | 511
Table 5. Performance of ViT model class-wise.
Class | Precision | Recall | F1 | Support
Dew | 0.9737 | 1.0000 | 0.9867 | 37
Fog–smog | 0.9636 | 0.9464 | 0.9550 | 56
Frost | 0.8679 | 0.8846 | 0.8762 | 52
Glaze | 0.8644 | 0.8644 | 0.8644 | 59
Hail | 1.0000 | 0.9800 | 0.9899 | 50
Lightning | 1.0000 | 1.0000 | 1.0000 | 37
Rain | 0.8909 | 0.9800 | 0.9333 | 50
Rainbow | 1.0000 | 1.0000 | 1.0000 | 34
Rime | 0.9184 | 0.8824 | 0.9000 | 51
Sandstorm | 0.9556 | 0.9773 | 0.9663 | 44
Snow | 0.9189 | 0.8293 | 0.8718 | 41
Accuracy |  |  | 0.9354 | 511
Macro Avg | 0.9412 | 0.9404 | 0.9403 | 511
Weighted Avg | 0.9359 | 0.9354 | 0.9352 | 511
Table 6. Comparison of our study with state-of-the-art research.
Authors | Class Counts | Method | Evaluation Metric | Results
(Lu et al., 2014) [15] | 2 | Collaborative Learning | Accuracy | Test 84.2%
(Zhang et al., 2015) [16] | 4 | Learning-Based Approach | Accuracy | Test 59.44%
(Zhang et al., 2016) [17] | 4 | Multiple Category-Specific Dictionary Learning and Multiple Kernel Learning | Accuracy | Test 71.39%
(Xia et al., 2020) [19] | 4 | ResNet15 | Accuracy | Test 96.03%
(Notarangelo et al., 2021) [20] | 2 | CNN (Pre-Learned Filters) | Accuracy | Test 85.28%
(Minhas et al., 2022) [40] | 4 | CNN + Synthetic Data | Accuracy | Training 74%
(Li et al., 2023) [41] | 10 (Only Cloud Types) | Neural Network | Accuracy | 80%
(Ogunrinde et al., 2023) [42] | 7 | YoloV5 | Accuracy | 84.9%
Proposed | 11 | Vision Transformers (RGB Images) | Accuracy / Precision / Recall / F1 | 93.54% / 93.59% / 93.54% / 93.52%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
