1. Introduction
Leather is a widely used material in industries such as automotive, fashion, and furniture due to its unique texture, durability, and aesthetic appeal. With the global leather market valued at USD 440.64 billion, and projected to grow to USD 468.49 billion in 2023 and USD 738.61 billion by 2030 [1], the necessity for optimization and efficiency in all aspects of its production is undeniable. Within the leather manufacturing industry, one of the most important processes involved in transforming raw materials into consumer goods is the inspection process, or quality control. By ensuring a high level of consistency in the grading of raw and processed materials, manufacturers not only greatly reduce labour costs but also eliminate unnecessary material losses caused by the misclassification of exported commodities. Traditional methods for leather surface defect detection rely on qualified technicians to visually inspect every inch of the processed materials. This procedure can be labour-intensive, time-consuming, and highly subjective, which increases the risk of human error and inconsistent defect identification. To overcome these challenges, there is a growing interest in leveraging artificial intelligence (AI) and computer vision techniques for automated and accurate leather surface defect detection.
In recent years, vision transformers [2] have emerged as a powerful approach in computer vision, showing superior performance in image classification [3], object detection [4], and segmentation [5]. Vision transformers are deep neural networks that process images in a patch-based manner, where the image is divided into non-overlapping patches and then linearly embedded into a sequence of vectors. These sequences of vectors are then processed by transformer layers, originally proposed for natural language processing tasks, to capture both local and global contextual information. The self-attention mechanism [6] in transformers allows for capturing long-range dependencies, making them highly effective for analysing complex patterns and structures in images.
When implementing such a system in a real-world manufacturing environment, certain practical considerations and potential challenges arise. To process images in real time, a trade-off between model accuracy and speed must be considered. As manufacturers may undergo changes in equipment, surface material types, and/or processing chemicals, the ability to quickly re-train or fine-tune a model on a new dataset can be greatly advantageous. Although low-resolution images reduce the computational demands of the model, allowing it to train faster and be applied in a real-time setting, the presence of fewer pixels representing the defective area makes defects difficult to accurately localise. Defects may be subtle and could be missed or blurred out in a low-resolution image. As the transformer splits the image into patches, if the resolution is too low, each patch might not contain enough detail for the model to make accurate predictions. Furthermore, the manufacturing environment may include variations in lighting, viewing angle, and other factors that change the appearance of the leather without constituting a defect. The inspection system would therefore require a controlled environment, or a high level of robustness and adaptability from the model. Finally, owing to the reduced availability and feasibility of resources such as high-definition image-capturing systems and processing power, many current machine vision-based AI systems are not an option for manufacturers in developing countries, where the majority of raw leather products are produced.
For this reason, this research focuses on anomaly detection and localisation using lower-resolution leather surface images for defect classification. Training an image classifier on a small, low-resolution dataset offers several benefits. Firstly, it reduces the hardware and resource requirements for the training process. Since low-resolution images have smaller data sizes, less computational power and storage capacity are needed to process and store the data. This translates to cost savings and improved accessibility, especially for individuals or organizations with limited resources. Secondly, training on a small dataset allows for increased efficiency. With fewer images to process, the training time is significantly reduced, enabling faster iterations and experimentation with different model architectures and hyper-parameters. This accelerated development cycle facilitates the refinement of the classifier and speeds up the overall research or development process. Consequently, training an image classifier on a small dataset of low-resolution images not only reduces hardware and resource requirements but also enhances efficiency, making it an advantageous approach in certain scenarios.
The contributions of this paper include the modification and implementation of two transformer-based model configurations tailored for low-resolution leather surface defect detection, as well as an in-depth analysis of the model’s performance compared to existing pre-trained DL methods. This study would be beneficial for the adoption of vision transformers (ViT) in the leather industry, enabling a faster, cheaper, and more robust leather surface defect detection system for improved product quality control.
The rest of the paper is organized as follows. Section 2 provides an overview of the related work in the field. Section 3 describes the proposed method for addressing the research problem. In Section 4, we present the experiments conducted to validate the effectiveness of the proposed approach. Finally, Section 5 concludes the paper by summarizing the key findings and highlighting the contributions made in this study.
3. Proposed Method
Vision transformers (ViT) use self-attention mechanisms to process images in a hierarchical manner, capturing both local and global contextual information. The ViT model offers advantages such as scalability, interpretability, and flexibility, making it well-suited for anomaly detection tasks in industries such as manufacturing, healthcare, and security. By training a ViT on large sets of non-defective images, it can learn representations of normal patterns, which can then be used to identify deviations from these patterns as anomalies. These traits allow the ViT to effectively model complex patterns and relationships within images, making it suitable for detecting surface defects in various computer vision-based image domains. Leveraging the model's ability to capture fine-grained details also renders it favourable for identifying subtle anomalies in low-resolution image sets that may be unaccounted for by traditional DL methods.
This section proposes a ViT-based leather surface defect classification method, with the anomalies detected and localised via the ViT architecture.
3.1. Dataset Preparation
The Leather Defect Detection and Classification dataset [29] was originally created as a training and validation set for a deep learning neural network-based approach for automated localisation and classification of leather defects using a machine vision system [28]. Although the original dataset was captured at a much higher resolution of 4608 × 3288 pixels, an open-source version was published at a greatly reduced resolution of 227 × 227 pixels. This dataset with reduced image quality is selected in this study because we focus on achieving defect classification utilising smaller datasets of lower-resolution training images in order to reduce both image-capturing hardware requirements and computational complexity.
In leather manufacturing, colours and finishes are typically introduced through a process known as finishing. The combination of processes such as dyeing, finishing, patination, embossing, fatliquoring, and sealing can result in a wide variety of colours and finishes. Furthermore, while other defects can occur in leather manufacturing, the presence of folding marks, grain off, growth, loose grain, and pinholes in an assortment of colours and finishes provides a comprehensive representation of the most common and impactful defects that arise in the process.
While the colours of sample images can potentially impact classification outcomes, the diversity and distribution of colours across the dataset help alleviate this issue. Furthermore, the vision transformer benefits from processing colour images as opposed to grayscale images due to richer visual cues, such as hue, saturation, and contrast, which can enhance the model’s ability to distinguish and classify objects accurately. By incorporating colour information, the vision transformer can potentially capture more fine-grained details and patterns, leading to improved performance.
The dataset comprises a total of 3600 low-resolution sample images, split into ‘training’ and ‘validation’ sets, each of which are equally sub-divided into six categories of leather surface images, viz., ‘folding marks’, ‘grain off’, ‘growth’, ‘loose grain’, ‘pinhole’, and ‘non defective’.
Figure 4 illustrates one randomly selected sample image from each category of the dataset.
The categories in the training set contain 480 unannotated images each, while those in the validation set contain 120 images each, covering a variety of colours and finishes. As the dataset provides a sufficient number (3600) of well-distributed samples across various types, colours, and textures of finished leather surfaces, a good degree of diversity and representativeness is captured.
Unfortunately, the dataset contains no clearly annotated defective ground truths such as those found in the popular anomaly detection benchmarking MVTec AD dataset [46]. For this reason, the leather category taken from the MVTec AD dataset is resized to a matching resolution of 224 × 224 pixels in order to confirm the models' ability to identify and localise defects.
3.2. Pre-Processing
From the original leather defect dataset, the validation images for each category are combined with the training images of the same category, before test images are completely removed for objective testing later on. The entire dataset is first converted to a NumPy array, reshaped, and pre-processed for sequential data manipulation. The data are then normalized and linked to a 1-hot encoded label list for storage as NumPy files. The validation and training sets are combined at this stage because the data and label lists are shuffled and split into training and validation sets after the NumPy files are re-loaded in each instance, thereby allowing for ratiometric adjustments and varied levels of data augmentation without reprocessing the entire dataset.
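A minimal Python/NumPy sketch of this pre-processing pipeline is given below; the load_image helper and the file names are hypothetical placeholders, and the exact steps of the original implementation may differ.

```python
import numpy as np
from tensorflow.keras.utils import to_categorical

def preprocess(image_paths, labels, num_classes=6):
    data = []
    for path in image_paths:
        img = load_image(path)             # hypothetical loader -> (224, 224, 3) uint8
        data.append(img[np.newaxis, ...])  # add an axis -> (1, 224, 224, 3)
    data = np.asarray(data, dtype=np.float32) / 255.0  # normalize pixels to [0, 1]
    data = np.squeeze(data, axis=1)        # (N, 1, 224, 224, 3) -> (N, 224, 224, 3)
    labels = to_categorical(labels, num_classes)       # 1-hot encode the label list
    np.save('leather_data.npy', data)      # store both arrays as NumPy files
    np.save('leather_labels.npy', labels)
    return data, labels
```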
3.3. Proposed ViT Architectures
The ViT architecture shown in Figure 5 consists of six principal components to be applied to the modified leather defect dataset. Each of these steps is illustrated as follows, using a randomly selected defective sample image as an example.
3.3.1. Patch Extraction
The input image is divided into non-overlapping patches of size P × P, which can be represented as a tensor X of shape (B, H/P, W/P, C), where B is the batch size; H and W are the height and width of the image, respectively; and C (= 3) is the number of RGB colour channels.
Figure 6 depicts a non-defective sample image split into 14 × 14 patches of 16 × 16 pixels each.
3.3.2. Patch Embedding
Each patch is linearly projected into a flat vector by multiplying with a learnable weight matrix E of shape (C, D), where C is the number of channels and D is the embedding dimension, resulting in a tensor Xe of shape (B, H/P, W/P, D).
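The two steps above can be sketched with TensorFlow/Keras as follows; the patch size and embedding dimension are illustrative assumptions, not necessarily the exact values used in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers

def extract_and_embed_patches(images, patch_size=16, embed_dim=64):
    """Split (B, H, W, C) images into P x P patches and linearly embed each one."""
    batch_size = tf.shape(images)[0]
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding='VALID',
    )
    patch_dims = patches.shape[-1]        # P * P * C = 16 * 16 * 3 = 768
    patches = tf.reshape(patches, [batch_size, -1, patch_dims])  # (B, 196, 768)
    # learnable linear projection E into the D-dimensional embedding space
    return layers.Dense(embed_dim)(patches)                      # (B, 196, D)
```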
3.3.3. Positional Encoding
Positional information is added to the patch embeddings to capture spatial relationships. This is typically achieved using sine and cosine functions of different frequencies, denoted as PE, resulting in a tensor of shape (B, H/P, W/P, D). To make patches positionally aware, learnable 'position embedding' vectors are added to the patch embedding vectors. As position embedding vectors learn distance within the image, neighbouring vectors have high similarity. Depicted in Figure 7 is the cosine similarity between all-to-all positional embeddings, represented as a 14 × 14 grid of similarity maps. Selecting a block (i, j) from the grid, the colour of each pixel (k, l) in that block indicates how similar the positional embedding of the patch (i, j) is to the positional embedding of the patch (k, l).
From the zoomed-in block, it is clear that each positional embedding has a strong similarity to itself as well as those close to it, indicating that the model embeds regions that are physically adjacent with similar vectors. Interestingly, there is more than one bright region, suggesting that various regions use similar positional embedding vectors.
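The similarity grid in Figure 7 can be reproduced with a short NumPy computation; the 14 × 14 grid size corresponds to a 224 × 224 image split into 16 × 16 patches.

```python
import numpy as np

def position_embedding_similarity(pos_embed, grid=14):
    """All-to-all cosine similarity between learned positional embeddings.

    pos_embed: (num_patches, D) matrix, e.g. (196, D) for a 14 x 14 patch grid.
    Returns a (grid, grid, grid, grid) array: entry [i, j] is the 14 x 14
    similarity map of patch (i, j) against every other patch."""
    unit = pos_embed / np.linalg.norm(pos_embed, axis=1, keepdims=True)
    sim = unit @ unit.T                   # (196, 196) cosine similarities
    return sim.reshape(grid, grid, grid, grid)
```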
3.3.4. Multi-Head Self-Attention
The patch embeddings with positional encoding are fed into a multi-head self-attention mechanism, where the tensor is linearly projected into query (Q), key (K), and value (V) tensors using learnable weight matrices Wq, Wk, and Wv of shape (D, D), respectively. The self-attention mechanism computes the attention scores, A, by taking the dot product of Q and K, scaled by 1/√dk (the inverse square root of the key dimension), followed by Softmax activation, and then multiplied by V. The output of self-attention is then concatenated and linearly projected using another weight matrix Wo of shape (D, D). This process can be visualized as depicted in Figure 8, where the colour scale illustrates the attention weights' magnitude: blue shades denote low scores, while green shades indicate high scores.
As the attention mechanisms play a role in capturing the relationships and dependencies between different patches, the attention matrices represent the attention weights assigned to each patch or token in a specific head. Each matrix has dimensions equal to the number of patches, while each element within represents the attention weight or relevance assigned to the corresponding row (source patch) and column (target patch) pair.
Figure 9 depicts the input image and the 100th row of the attention matrices in heads 0 to 7. These blocks indicate how the 100th patch attends to other patches in the same layer, thereby providing insight into which patches are considered most relevant to the 100th patch or token in terms of capturing visual features and/or relationships. Each row is reshaped into a 14-by-14 matrix, while the scale of light to dark patches reflects the distribution of attention weights across the patches.
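A single attention head of this mechanism can be sketched in NumPy as follows; row 100 of the returned weight matrix corresponds to the attention row visualised in Figure 9.

```python
import numpy as np

def self_attention_head(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over patch embeddings X of shape (N, D)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv     # learnable (D, D) projections
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # scaled attention scores (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise Softmax
    return weights @ V, weights          # (N, D) output, (N, N) attention map
```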
3.3.5. Layer Normalization and Residual Connections
After the multi-head self-attention operation, layer normalization is applied to the output to stabilize the features. This is followed by the application of residual connections with skip connections, where the output of the self-attention is combined with its initial input. The combined result is passed through a feed-forward neural network with ReLU activation, enabling non-linearity and further transformation of the features.
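Combining the attention, normalization, and feed-forward stages, one encoder layer can be sketched with Keras as follows. The ordering shown (normalization applied after each residual addition) follows the description above, and the dimensions are illustrative assumptions.

```python
from tensorflow.keras import layers

def transformer_encoder_block(x, num_heads=8, embed_dim=64, mlp_dim=128):
    """One encoder layer: self-attention + residual, then a ReLU feed-forward network."""
    attn_out = layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=embed_dim // num_heads)(x, x)
    x = layers.LayerNormalization()(x + attn_out)    # residual + layer normalization
    ffn = layers.Dense(mlp_dim, activation='relu')(x)
    ffn = layers.Dense(embed_dim)(ffn)
    return layers.LayerNormalization()(x + ffn)      # second residual connection
```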
3.3.6. MLP (Classification) Head
The multilayer perceptron (MLP) [48,49] consists of multiple layers, where each layer transforms its input through an activation function, as represented in Equation (1). The MLP typically consists of fully connected layers, where each unit in a layer is connected to all the units in the previous layer and has a unique set of weights:

y = σ(Wx + b)    (1)

where x is the input vector (the output of the preceding layer), σ is the activation function, W is the layer weights, and b is the bias vector.
The final representation from the last self-attention block is flattened and passed through a fully connected layer with Softmax activation to obtain class probabilities for image classification. Once constructed and trained, the ViT can be utilised to make predictive classifications on unseen low-resolution leather input images.
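A minimal Keras sketch of this classification head is shown below; the hidden layer width is an assumed value for illustration.

```python
from tensorflow.keras import layers

def classification_head(x, num_classes=6):
    """Flatten the final representations and map them to class probabilities."""
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation='relu')(x)  # hidden MLP layer: y = sigma(Wx + b)
    return layers.Dense(num_classes, activation='softmax')(x)
```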
3.4. Transfer Learning from DL Networks
To compare the training and classification results of the ViT models with the methods obtained from transfer learning from existing pre-trained models, the same dataset is utilised to fine-tune (train) three popular deep learning network architectures. While many larger DL architectures are available, the largest manufacturers of raw leather materials are located in developing countries. For this reason, consideration must be taken for these developing areas and the financial, hardware, and time constraints involved in the real-world manufacturing environment. To mitigate the requirement for more expensive and powerful hardware while simultaneously reducing model training times, low-end versions of more influential architectures were chosen. The three selected architectures are the ResNet-50 [41], Inception-V3 [43], and EfficientNet-B0 [45] networks, pre-trained on the much larger ImageNet dataset [40].
There are two principal methods when applying transfer learning to pre-built architectures. The first is to partially train the model, in which multiple (user-selected) top layers of the model are trained on the new dataset, with all layers below set to frozen. The second method is 'top training', whereby all layers of the model are frozen except the last layer, which is modified only to accommodate the required output tensors.
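Both methods can be sketched with Keras, using ResNet-50 as an example; the number of unfrozen layers in the partial-training variant is an assumption for illustration.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

base = ResNet50(weights='imagenet', include_top=False,
                input_shape=(224, 224, 3))
for layer in base.layers:          # 'top training': freeze every pre-trained layer
    layer.trainable = False
# Partial training would instead unfreeze, e.g., the top 20 layers:
# for layer in base.layers[-20:]:
#     layer.trainable = True
x = layers.Flatten()(base.output)
outputs = layers.Dense(6, activation='softmax')(x)  # six leather categories
model = models.Model(base.input, outputs)
```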
3.5. ViT-Based Anomaly Detection
Each transformer encoder layer consists of a multi-head self-attention mechanism followed by a feed-forward neural network. The self-attention mechanism allows the model to capture global dependencies between different patches or tokens of the input image. After each patch is passed through the stack of transformer encoder-decoder layers, normal and defective image features are refined and represented as feature vectors in high-dimensional space. The final output of the ViT model is a sequence of feature vectors, where each vector corresponds to a specific position or patch in the input image. Using the features extracted from intermediate layers of the trained vision transformer models, normal and defective regions are predicted in the test dataset to produce prediction weights.
In this paper, two ViT-based anomaly detection methods are considered.
The first ViT-based anomaly detection method is to calculate the dot product similarity between the extracted feature vectors and the feature vectors of the input image, as well as the predicted class of the input image. By reshaping the calculated similarity scores to the original image dimensions, it is possible to generate a heatmap that highlights anomalous data regions with pixel-level localisation of defects. The sensitivity of the classification outputs can be adjusted by tuning the anomaly threshold value to meet specified industry requirements.
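A simplified NumPy sketch of this similarity-based localisation follows; using a reference matrix of normal patch features, and converting similarity to an anomaly value by subtraction, are assumptions made for illustration.

```python
import numpy as np

def similarity_heatmap(patch_features, normal_features, image_size=224, patch=16):
    """Pixel-level anomaly heatmap from per-patch dot-product similarity.

    patch_features:  (196, D) feature vectors of the input image's patches
    normal_features: (196, D) reference features from non-defective images"""
    sim = np.sum(patch_features * normal_features, axis=-1)  # dot product per patch
    anomaly = sim.max() - sim              # low similarity -> high anomaly score
    grid = image_size // patch             # 14 patches per side
    heatmap = anomaly.reshape(grid, grid)
    # upscale to pixel level by repeating each patch score over its 16 x 16 area
    return np.kron(heatmap, np.ones((patch, patch)))
```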
The second ViT-based anomaly detection method is the modified AnoViT [50], whereby the distribution of the normal data is learned from the differences between the ViT-based encoder–decoder image reconstruction and the original input image. Pixel-level ℓ2-distances between the two are calculated, as well as the average pooling across the channels, to produce a score map. As an unsupervised learning method, the model is trained on normal data only and is therefore only able to learn the distribution of non-defective images. This means that abnormal or defective images will cause an increase in reconstruction error, as anomalous regions will be reconstructed incorrectly. By integrating the ViT-based encoder into an encoder–decoder architecture, the AnoViT model [50] leverages core vision transformer characteristics to achieve an improved method of anomaly detection and localisation. An architectural diagram of the AnoViT model is shown in Figure 10, depicting the patches embedded after linear projection, the positionally encoded patches, and the output patch embeddings.
By applying a multi-head self-attention (MSA), the relationship between patches is learned in order to utilise global information. Additionally, by processing patch-level images, the approach creates image embeddings with rich information for each location.
Typical ViT models consider image patches of a specific size as tokens, as in NLP tasks. When performing classification, the 'CLS' token is added at the beginning of the resulting sequence:

z₀ = [x_CLS; x_p^1; x_p^2; …; x_p^N]    (2)

where x_p^1, …, x_p^N are image patches. Where the architecture has multiple layers, the state of the CLS token at the output layer is used for classification. Unlike these models, the AnoViT excludes the patch embedding's CLS token and the additional MLP head, then creates a feature map by rearranging the remaining embeddings to match their positions in the patch image, which is then used as the feature map.
From the reconstruction error, the activation map is generated for each input image depicting anomalous regions of interest, such as surface defects.
Having been trained on normal data, the model can learn the distribution of normal data only. Owing to this, an atypical input image causes the reconstruction error to increase, as the model has difficulty reproducing anomalous regions. To localise areas containing anomalies in a new input image, the image x̂ reconstructed at the output of the ViT-based encoder–decoder model is used to calculate the ℓ2-distance between x̂ and the original input x on a per-pixel basis. The score map S is calculated by taking the average pooling across the channels (Equation (3)):

S(i, j) = (1/C) Σ_c (x_c(i, j) − x̂_c(i, j))²    (3)

The anomaly score is calculated by extracting the highest value from the score map (Equation (4)):

s = max_(i, j) S(i, j)    (4)

Anomaly localisation is then achieved by utilising the score map to evaluate the abnormality of each pixel in the image.
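A direct NumPy translation of Equations (3) and (4), assuming channel-last image arrays, is:

```python
import numpy as np

def anovit_scores(x, x_hat):
    """Per-pixel squared l2 reconstruction error, channel-averaged score map S
    (Equation (3)), and the image-level anomaly score (Equation (4))."""
    err = (x - x_hat) ** 2           # (H, W, C) squared per-pixel differences
    score_map = err.mean(axis=-1)    # average pooling across channels -> (H, W)
    anomaly_score = score_map.max()  # highest value in the score map
    return score_map, anomaly_score
```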
3.6. Evaluation
While adhering to the described vision transformer architectural structure, two variations of the ViT model are tested on the same modified low-resolution leather defect dataset for a comparative evaluation of performance metrics and classification results.
High attention values can be interpreted as the model finding those regions more important for the task at hand; therefore, by visualizing the attention as a heatmap over the image, attention mechanisms provide some model interpretability. However, while this allows a rough sense of what the model is "looking at", the attention mechanism is only one component of the model's decision-making process and does not fully explain it. Despite the ViT delivering high performance in computer vision tasks, it presents interpretability challenges due to its complex architecture. Unlike convolutional neural networks (CNNs), which process local, contiguous information, the ViT handles images as sequences of patches, applying self-attention mechanisms and feed-forward neural networks. This non-local processing makes understanding feature importance and tracing decision-making pathways difficult. Existing interpretability methods such as attention rollouts, feature attribution, and saliency maps, while promising, need further development to effectively untangle the ViT model's decision-making processes.
3.6.1. Evaluation Metrics
Evaluation of the models’ training and validation performances is conducted using typical metrics, such as accuracy, loss, precision, recall, F1-score, receiver operating characteristic (ROC) curves, and the area under this curve (AUC).
3.6.2. Fine-Tuning
Fine-tuning is achieved via experimentation (trial and error) with varying hyper-parameter values throughout the training stages to optimize model performance.
3.6.3. Validation
The models’ performance is validated by averaging the classification results of multiple novel input images from each leather category, as well as the visual analysis and confirmation of activation maps for localised anomalous regions. Classification results are then compared to those derived from the application of transfer learning of popular deep learning classifiers (ResNet-50, Inception-V3, and EfficientNet-B0).
4. Experiments
In order to evaluate the influence of the attention mechanism, tokenization, and hyper-parameters within the vision transformer model, two slightly varied implementations of the model are modified and trained on the same low-resolution leather defect dataset. Metrics and results are then compared between the two ViT variations and three state-of-the-art deep learning models (ResNet-50, Inception-V3, and EfficientNet-B0).
4.1. Setup
4.1.1. Dataset
Prior to pre-processing, ten defective images from each of the six categories (60 images) were completely removed from the reduced-resolution Leather Defect Detection and Classification dataset [29] for objective classification after training. The remaining 3540 images (227 × 227 pixels) across all six categories are resized to 224 × 224 pixels and converted to NumPy arrays, with a new axis added so that each array has a shape of (1, height, width, channels). Images are then pre-processed and sequentially added to a 'data list'. Once all images are processed, the entire data list is converted to a NumPy array, cast to the appropriate data type, and the pixel values are normalized to the range 0–1. The resulting NumPy array has a shape of (3540, 1, 224, 224, 3). Finally, the first two axes are swapped and the singleton axis is removed, so that the final 'data array' shape is (3540, 224, 224, 3). A new NumPy 'label array' is then created to match the size of the data array and one-hot encoded, before both arrays are saved as NumPy files for consistent testing with multiple models.
After loading the NumPy files, the data and labels are shuffled and split into training and validation sets with the ratio 0.8 to 0.2 (2832 training to 708 validation). Before training, the dataset is augmented using techniques including rotation, flipping, and scaling to increase the diversity of the data and improve model generalization.
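A sketch of this loading, shuffling, splitting, and augmentation stage is given below; the specific augmentation parameter values are assumed for illustration.

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

data = np.load('leather_data.npy')      # (3540, 224, 224, 3), pixel values in [0, 1]
labels = np.load('leather_labels.npy')  # (3540, 6) one-hot encoded labels
idx = np.random.permutation(len(data))  # shuffle before splitting
data, labels = data[idx], labels[idx]
split = int(0.8 * len(data))            # 2832 training / 708 validation
x_train, x_val = data[:split], data[split:]
y_train, y_val = labels[:split], labels[split:]
augmenter = ImageDataGenerator(rotation_range=20, horizontal_flip=True,
                               vertical_flip=True, zoom_range=0.1)
train_flow = augmenter.flow(x_train, y_train, batch_size=32)
```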
4.1.2. Model Variants
The first configuration of the ViT model utilises the conventional vision transformer structure, as described in Section 3.3, with patch extraction and standard multi-head attention. It includes an MLP head layer [48] with a Softmax activation function for multi-class image classification.
The second ViT model differs in the following three ways. It employs a shifted patch tokenization layer for patch extraction, with a custom multi-head attention layer which uses a linearly scaled dot product attention mechanism with a diagonal attention mask. This attention mechanism is designed to have a lower computational complexity than the original multi-head attention. The model also includes the MLP head layer; however, instead of a Softmax activation, it returns the logits directly.
The anomaly detection method is a ViT-based encoder–decoder architecture modified from the AnoViT [50], which is an unsupervised method that utilises extracted patch embeddings to reflect the global context of the image from the attention-based ViT encoder. With the embeddings of each image patch, reconstruction error at the image and pixel levels is calculated to both detect and localise image anomalies by deriving the reconstructed image's activation map at the output.
Comparative supervised learning methods include three transfer learning architectures: ResNet-50, Inception-V3, and EfficientNet-B0. Modifying the original ResNet-50 architecture requires adjusting the input shape to [224, 224, 3] to match the dataset dimensions, and extracting the last pooling layer before the 1000-class dense layer so that its output can be flattened and a new dense layer appended. Although the new layer still uses the Softmax activation function, it is modified from 1000 output nodes (classes) to accommodate only six.
As with the ResNet-50 model, modifying the Inception-V3 model involves freezing all layers except the last, before creating a new input layer with dimensions matching those of the reshaped input data. The extracted output from the max pooling layer (the last layer before the output layer) is flattened and linked to a newly constructed dense output layer with a Softmax activation function, where the original 1000 output nodes are redefined to six nodes or classes.
The EfficientNet-B0 architecture is altered by extracting the final pooling layer before the dense layer containing 1000 classes. From here, however, a slightly different approach from the previous two models is taken. Instead of flattening the output, global average pooling is applied to the data. This technique leaves the other dimensions untouched while applying average pooling to the spatial dimensions until each spatial dimension is reduced to one. Unlike flattening, the individual values are not retained, due to being averaged; however, global spatial information is captured by summarizing the feature maps across the entire spatial extent of the image. From here, a dense layer is added with an output of 1024 and the rectified linear unit (ReLU) activation function. The final layer is another dense layer, this time with an output shape of (None, 6) to match the number of categories in the dataset and a 'softmax' activation function for multi-class classification. To significantly decrease the training time of the EfficientNet-B0 model, the "efficientnet_b0_weights.h5" weights file is utilised in the 'ModelCheckpoint' callback. The modified model is depicted in Figure 3.
Before training the modified models, the ‘Early Stopping’ callback is included, which monitors the validation loss metric and ends the training session when no significant improvement to validation loss is detected over a predefined ‘patience’ or consecutive number of epochs. After training each model, the 10 unseen test images taken from each category (60 images total) are input for classifier prediction.
For the sake of consistency throughout the training stages, all DL and ViT models were compiled and fitted with the 'categorical cross-entropy' loss function, the 'adam' optimizer, and, for probability distributions in multi-class classification, the 'softmax' activation function. Other optimizers and loss functions were also tested with each model; however, it was found that this combination yielded optimal results overall.
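In Keras terms, this shared compilation and training setup can be sketched as follows; the epoch limit is an assumed upper bound, and model and train_flow refer to the earlier sketches.

```python
import tensorflow as tf

model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy', tf.keras.metrics.Precision(),
                       tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])
early_stop = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=10,
                                              restore_best_weights=True)
history = model.fit(train_flow, validation_data=(x_val, y_val),
                    epochs=200, callbacks=[early_stop])
```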
4.1.3. Fine-Tuning
Though more specific adjustments are stated in the individual model implementations, some general fine-tuning techniques can be applied across all models. Proper selection of the learning rate assists in efficient model convergence and improved accuracy, while an optimal weight decay can improve generalization ability and prevent overfitting. A higher upper limit of epochs can be selected, as well as an increased 'patience' parameter for the 'early stopping' callback method. Batch normalization is applied for improved training efficiency and consistency. Increasing the batch size, within hardware limitations (CPU and GPU memory availability), can positively influence training speed, generalization, and learning dynamics. Data augmentation and dropout are used for model robustness and prevention of overfitting. Further training optimization is achieved with adjustments to the number of parallel processes ('num_workers') and automatic mixed precision training ('amp') parameters. Finally, experimentation with ViT-specific hyper-parameters such as the patch size, projection dimension, number of heads, and number of transformer layers can influence the model's ability to capture spatial information and patterns in the input images, as illustrated in the sketch below.
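A hypothetical starting configuration for such experimentation might be the following; none of these values are reported results of this study.

```python
# Assumed starting values for hyper-parameter tuning (illustrative only).
vit_config = {
    'patch_size': 16,          # size of square image patches
    'projection_dim': 64,      # patch embedding dimension D
    'num_heads': 8,            # attention heads per encoder layer
    'transformer_layers': 8,   # number of stacked encoder layers
    'learning_rate': 1e-3,
    'weight_decay': 1e-4,
    'batch_size': 32,
    'dropout': 0.1,
}
```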
4.2. Performance Metrics
When comparing the two ViT models trained on the low-resolution leather dataset, it is evident that both models displayed similar training and validation accuracy ROC curves. However, by optimizing training efficiency with the 'early stopping' callback method and a patience of 10, there is a notable difference in their accuracy plots (shown in Figure 11). The second ViT model (ViT-02) achieves a significantly higher validation accuracy of 93.89% in just 68 epochs, compared to the first (ViT-01), which reaches an average validation accuracy of 89.86% after stopping at 75 epochs. This indicates that ViT-02 performed better in terms of validation accuracy, suggesting improved classification or prediction capability compared to ViT-01 within a shorter training duration.
In Figure 12, the validation loss comparison between the two ViT network configurations indicates a similar result; however, ViT-02 reaches a lower average validation loss of 0.1846 after stopping at 68 epochs, compared to ViT-01's average validation loss of 0.1970 at 75 epochs. This indicates that ViT-02 achieved improved accuracy or generalization compared to ViT-01 within a shorter training duration. Also noted in each of these figures is that ViT-02 has a much steeper gradient and therefore reaches its optimal range more efficiently than ViT-01.
Figure 13 depicts the heatmap confusion matrices for the categorical classification performance during training of each vision transformer. It is evident that although trained on a small number of low-resolution images, both models perform similarly, with a relatively high number of true positives (TP) in each category of the leather dataset.
It can also be seen in Figure 14 that, when comparing the average precision, recall, and F1-scores for each leather category, there is relatively little difference between the two models' training performances; however, ViT-02 matches or outperforms the original ViT model in all leather categories.
Table 1 displays the average training metrics of each tested architecture trained on the Leather Defect Detection and Classification dataset. From this comparative analysis, it is evident that the ResNet-50 model exhibited subpar performance and also took the longest to train, further emphasizing its inefficiency. Conversely, the EfficientNet-B0 and Inception-V3 models demonstrated superior results in terms of average validation accuracy and precision when compared to ViT-01. Between Inception-V3 and EfficientNet-B0, the former shows a slightly higher precision but a slightly lower recall than the latter. Inception-V3 not only has a higher AUC, indicating that it performs slightly better across various classification thresholds, but it also takes significantly less time to train than EfficientNet-B0, making it the more efficient model of the two. Of the five models in Table 1, ViT-02 demonstrated superior overall performance and generalization ability, reached in almost half the training time of ViT-01, making it a more robust and effective model for the categorical classification of leather surface defects. Additionally, ViT-02 has the shortest training time (176.27 s), suggesting that it is the most computationally efficient model.
4.3. Results and Discussion
4.3.1. Defect Classification Accuracy
By completely removing 10 images from each category (60 images) before data pre-processing and training of the models, an objective classification test on unseen input images yielded the results shown in Table 2. Owing to the low resolution and small number of leather surface images in the training set, relatively low categorical (six-class) classification accuracies are to be expected. ViT-02 outperforms all other methods.
In terms of each vision transformer predicting whether an input image is defective or non-defective (binary classification), ViT-01 is outperformed by the Inception-V3 and EfficientNet-B0 models, while ViT-02 outperforms all models. However, when predicting one of the six specific leather categories, the results show that ViT-02 outperforms the EfficientNet-B0 model by two correct predictions. The average approximate inference time per image is calculated by dividing the time taken for all 60 unseen images by 60. From this, we can see that while ViT-01 is slightly faster than the ResNet-50 and Inception-V3 models, the EfficientNet-B0 and ViT-02 are significantly faster than these three. Although the ViT-02 model is slower than the EfficientNet-B0 model by approximately 16.24 milliseconds per image, it is redeemed by being more accurate in both binary and multi-class classification.
4.3.2. Dot Product (Similarity) Anomaly Detection and Localisation
By training each of the two ViT models on 600 normal samples and 600 combined defective samples (120 random images from each of the five defect categories), defects can be not only detected but also localised. By extracting activations from intermediate layers, normal and defective features can be represented as vectors, where each image corresponds to a feature vector in a high-dimensional space. A new input image can be classified as normal or defective by calculating the similarity (dot product or cosine similarity) between these vectors and the vector of the new input. Similarly, the dot product can be calculated between each input patch feature vector and the corresponding normal image patch feature vector; a significantly low similarity suggests the patch is anomalous. Having been trained to recognize specific defects, the model can produce a heatmap that localises abnormal regions at the pixel level. A high-level anomaly heatmap can be generated where a patch's colour intensity is inversely proportional to its similarity score with normal images. By visualizing the activation maps of the network's layers, high activations typically correspond to the regions where the model has detected features that correspond to surface defects, as shown in Figure 15.
Using a threshold value, "peaks" or hot spots are detected, and red bounding boxes are drawn around these areas to further aid in defect classification analysis. Adjusting the anomaly threshold value can fine-tune the sensitivity of the classification outputs to meet industry requirements. Other useful heatmap overlay parameter adjustments include reducing overlay saturation with a pixel-value offset, as well as setting heatmap values in dark regions to 0, since the edges of dark pixels may otherwise be interpreted as highly anomalous.
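A minimal OpenCV sketch of this thresholding and bounding-box step is given below; the threshold value is an assumed default to be tuned per application.

```python
import cv2
import numpy as np

def draw_defect_boxes(image, heatmap, threshold=0.6):
    """Draw red bounding boxes around heatmap 'hot spots' above a tunable threshold."""
    norm = (heatmap - heatmap.min()) / (np.ptp(heatmap) + 1e-8)  # rescale to [0, 1]
    mask = (norm > threshold).astype(np.uint8)                   # binarise the peaks
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxed = image.copy()
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        cv2.rectangle(boxed, (x, y), (x + w, y + h), (0, 0, 255), 2)  # red in BGR
    return boxed
```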
4.3.3. AnoViT Anomaly Detection
To train the vision transformer-based encoder–decoder model (AnoViT) on the low-resolution Leather Defect Detection and Classification dataset, several modifications are required. As an unsupervised learning method, only normal images are required for training. After the training and validation datasets are split in the ratio of 0.8 to 0.2, the remaining 472 non-defective leather sample images are utilised for training the model. Several configuration parameters must be adjusted for the reduced image size (from the original script's input dimensions of 1024 × 1024 pixels), the increased number of input images, and the unavailability of ground truth images for validation.
Figure 16 depicts the original input image (top) and its corresponding class activation map (bottom), generated using the reconstruction error based on the learned distribution of normal images.
Figure 17 shows that the visibly prominent defects found in the benchmark MVTec AD leather dataset [46] are identified relatively clearly. However, because the 'normal' data used to train the model possess no defects and therefore no shadows, some reconstruction error is found at any abnormality, including the slight contours and shadows caused by defects such as cuts and pokes. To better match the required ratiometric output dimensions as well as the patch size divisions, images are processed and exported at decreased dimensions of 512 × 512 pixels (half the original size).
Aside from the primary defect identified in box 2 of both images, it is worth noting a line at the top (shown in box 1) of many reconstructed output images, caused by an error in the reconstruction process; the line itself is not indicative of any defect in the input image. This reconstructive error is found in both datasets (seen in Figure 16a) but is more prominent in the images from the MVTec AD dataset.
To enhance performance on low-resolution images, techniques such as histogram equalization and contrast enhancement can be employed, improving the quality of low-resolution images before they are fed into the model. Model performance is also refined by better learning generalized features through increased data augmentation techniques such as rotation, flipping, zooming, translation, and brightness adjustment. To further improve the sensitivity and performance of the AnoViT anomaly detection method, it was found that fine-tuning the parameter constants, within their provided ranges, produced improved results. Sensitivity to certain surface or defect types is increased by manually experimenting with various combinations of hyper-parameter values. Variations in static array values (mean and std) used for image denormalization, optimizer parameters (learning rate, beta1, and beta2), and ViT-specific characteristics (patch and batch size, number of epochs, weight decay, and validation ratio) resulted in observable improvements for certain defect types. Although experimentation with adjusted hyper-parameter values did produce better localisation results, improvements were not necessarily achieved across all categories (indistinguishable change in pinhole defects). Some simple post-processing techniques, such as adjusting image brightness and contrast, also enhanced defect visibility in all output results. Refined localisation results can be seen in defective categories with more prominent surface defects, as seen in the folding mark and loose grain sample images in Figure 18.
4.4. Comparison with State-of-the-Art Transformer-Based Methods
This section provides a comparative analysis of two vision transformer (ViT) models, focusing on their performance in detecting and localising anomalies as well as classifying defects found in images of leather surfaces.
When applying the AnoViT method and reducing the resolution of the MVTec AD leather dataset by a factor of 4.57, i.e., from 1024 × 1024 pixels to 224 × 224 pixels, not only is the reconstruction error increased due to noise, but the false reconstruction error (not indicating a defective region) becomes more prominent. The difference in defect localisation between these resolutions is illustrated in Figure 19.
Owing to more visibly prominent defects found in the MVTec AD dataset, it can be noted that the falsely identified anomalous regions are more pronounced in lower-resolution images (224 × 224 pixels) when compared to higher-resolution images (1024 × 1024 pixels). The error intensity can, however, be reduced in low-resolution images with further fine-tuning of hyper-parameters for the datasets with dissimilar lighting and surface types.
A comparison of the ViT-01 model's validation accuracy of 89.86%, reached after a training time of 341.01 s, with ViT-02's validation accuracy of 93.89%, reached after a training time of 176.27 s, conveys a significant improvement in accuracy (4.03 percentage points) achieved in nearly half the training time (Table 1). The fact that the ViT-02 model is fully trained in 2 min 56.27 s on a standard laptop (Intel Core i7 with a 6 GB GPU) suggests that the classifier model can be quickly re-trained on a new or modified dataset and deployed in a real-time environment where efficient decision making is essential. Once trained, the model achieved inference on all 60 unseen test images in 4.76 s (79.3 milliseconds per image), making it a viable real-time classification model.
When comparing the anomaly localisation results produced by the AnoViT and the Dot Product methods on low-resolution images, it can be seen that the latter method did perform acceptably over the majority of the defect types found in the six leather categories; however, the former method localised multiple defects in a single image with more clarity. While the AnoViT method requires a moderate increase in computational resources, the results produced suggest it is a more robust and effective model for the detection and localisation of leather surface defects in low-resolution images.