Article

A Method for Enhancing the Accuracy of Pet Breeds Identification Model in Complex Environments

School of Electronic and Information Engineering, Guangxi Normal University, Guilin 541004, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2024, 14(16), 6914; https://doi.org/10.3390/app14166914
Submission received: 2 June 2024 / Revised: 27 July 2024 / Accepted: 2 August 2024 / Published: 7 August 2024

Abstract

Most existing studies on pet breeds classification focus on images with simple backgrounds, leading to unsatisfactory performance of the resulting models in practical applications. This paper investigates training pet breeds classification models on complex images and constructs a dataset for identifying the breeds of pet cats and dogs. We use this dataset to fine-tune three SOTA models: ResNet34, DenseNet121, and Swin Transformer. In terms of top-1 accuracy, the performance of DenseNet121 improves from 89.10% to 89.19%, while that of the Swin Transformer increases by 1.26%, the largest gain. The results show that training with our dataset enhances the models' classification capabilities in complex environments. Additionally, we propose a lightweight pet breeds identification model, PBI-EdgeNeXt (Pet Breeds Identification EdgeNeXt). We use the PolyLoss function and the Sophia optimizer for model training. Furthermore, we compare our model with five commonly used lightweight models and find that the proposed model achieves the highest top-1 accuracy of 87.12%. These results demonstrate that the model achieves high accuracy, reaching the SOTA level.

1. Introduction

Pet breeds are a topic of great interest in people's daily lives, especially for cats and dogs, the most common household pets. With rapid advances in technology, people can obtain highly accurate breed identification results using genetic testing [1]. However, such testing requires collecting genetic samples such as pet hair and blood and sending them to online or offline pet genetic testing institutions to obtain the final results. This process is often both time-consuming and labor-intensive, making it a poor fit for the rapid rise of the "lazy economy". Nowadays, people increasingly expect to fulfill their needs anytime and anywhere, whether for food delivery, online meetings, or education. Therefore, it is of practical significance to offer a mobile pet breeds identification system that is not constrained by time and location.
Over the past few years, deep learning, particularly convolutional neural networks, has been extensively applied across numerous fields [2,3,4,5]. Deep learning models can automatically extract effective features from supervised data end to end. In image classification, extracting image features via convolutional layers has become a widely accepted fundamental approach [6]. The vast majority of existing research on image classification [7,8,9,10,11,12,13] focuses on classifying data that are visually recognizable. Even for a dataset as classic as ImageNet [7], the categories are defined by species or item type (such as cats, pencils, trees, and cars) rather than by breed.
In addition, the subjects in many of these works are situated in ideal environments. Consequently, such algorithms exhibit low classification accuracy in real-world scenarios, making it challenging to meet user expectations. This is due to the variable and dynamic scenes found in practical applications, where the size of the subject in the image varies, the background is complex, and distractors are often present. Images in which a pet occupies a larger area tend to favor large-kernel convolutions, while those in which the pet occupies less space favor small-kernel convolutions, posing a significant challenge to the model. In practical applications, the reduced signal-to-noise ratio of the visual features that could distinguish the target from the background [14] means the model often struggles to extract useful, task-related features, leading to consistently low recognition accuracy. This issue is pervasive across the entire field of image processing, not solely confined to our pet breeds classification task. We attribute it to two causes. First, regarding practical application data, it is difficult for models to extract effective features from complex images because of the low signal-to-noise ratio of the feature maps. Second, the feature extraction ability of the model is insufficient; this stems not only from performance limitations in the model design, but also from training data that are so simple that their image features have a high signal-to-noise ratio. The model becomes accustomed to high signal-to-noise feature maps during training and consequently struggles to adapt to practical applications, where the acquired images have a low signal-to-noise ratio.
In addition, in practical applications, due to time and human resource constraints, the dataset used for fine-tuning may not necessarily achieve an optimal distribution. Therefore, it is crucial for the model to be able to adapt to datasets with suboptimal distributions and effectively utilize them for transfer learning.
In response to these issues, we conducted a selection process within the Oxford-IIIT Pet dataset [15], selecting images with more intricate backgrounds, where pets occupy varying sizes of regions. In this way, we have developed a dataset for breed classification of cats and dogs, comprising 5550 images uniformly distributed across 37 breed categories, named the CPI (Complex Pet Images) dataset (Figure 1). We utilize this dataset to fine-tune (Figure 2) three state-of-the-art image classification models, pre-trained on the ImageNet dataset [7]: Swin Transformer [16], DenseNet121 [17], and ResNet34 [18], testing their recognition performance in complex environments.
Additionally, we present a lightweight cat and dog breeds identification model PBI-EdgeNeXt designed for edge devices (see Figure 3), which can accurately identify the breed of a cat or dog from images captured in various scenarios. The main contributions of this paper are given as follows:
(1) We construct a complex image dataset derived from the Oxford IIIT Pet dataset. This dataset is utilized for fine-tuning the model to improve its classification accuracy in non-ideal environmental images.
(2) We apply knowledge transfer to the PBI-EdgeNeXt model using the CPI dataset and subsequently develop a lightweight pet breeds identification model for edge devices.
(3) Through rigorous experimentation and analysis, it is demonstrated that our CPI dataset significantly enhances the accuracy of pet breeds identification in complex environments. Furthermore, our PBI-EdgeNeXt model outperforms the other five compared models.
The organization of this article is as follows: Section 2 critically analyzes existing research on pet classification, discussing its strengths and contributions while highlighting its shortcomings. Section 3 introduces the CPI dataset, outlines the general method for fine-tuning models with it, and presents a mobile-oriented pet cat and dog breeds identification model, PBI-EdgeNeXt. Section 4 details our experimental work, and Section 5 summarizes the research conducted in this work.

2. Related Works

To counteract the influence of redundant information on model training, Zhiqiang Yuan et al. [19] devised the denoised representation matrix and the enhanced adjacency matrix (DREA) to help the model to focus on salient instances during local feature modeling, enabling it to efficiently gather local information. Chen et al. [20] designed a compact high-level spectral information tokenizer (CHLSIT), using nonlinear combination of spectral bands to represent the high-level conceptual information of changes in spectral interest. Redundancy can be removed by extracting high-level spectral conceptual features. To capture the dependencies between hierarchical features and reduce redundant information, Wang et al. [21] proposed a convolutional long short-term memory (ConvLSTM)-based hierarchical feature fusion module (HFFM).
In practical applications, in addition to the low signal-to-noise ratio of the feature map caused by complex backgrounds, images are often accompanied by interfering objects. When an image contains objects that are similar to the target in shape, color, and other appearance characteristics, distinguishing between the target and the interferences becomes crucial. For the processing of complex images, the Capsule Network (CapsNet) [22] often demonstrates suboptimal recognition performance. This is because the Capsule Network consistently endeavors to attend to all details within the image, thereby striving to comprehend every object in it, including the background and distractors [23]. Thus, Abra Ayidzoe et al. [23] improved the capsule network by using Gabor filters and a custom preprocessing block to learn the structural and semantic information in the image. This method enhances the extraction of important features, improves the activation maps, and strengthens the ability of the model to extract task-related features.
However, these methods depend on specific data types, which limits flexibility for complex image processing involving various target types. Moreover, eliminating redundant information does not decrease the computational demand on downstream feature extraction modules, nor does it alleviate the strain on hardware resources. It is noteworthy that the aforementioned efforts represent design modifications to the model, which merely raise the ceiling of its feature extraction capabilities. It must be emphasized that regardless of changes made to hardware or algorithms, even when the neural network’s feature extraction capacity is enhanced to its theoretical maximum, if the model does not learn from high-quality complex image datasets, its recognition capabilities may not suffice when confronted with unfamiliar complex image data.

3. Materials and Methods

3.1. CPI Dataset

As previously stated, the primary objective of this study is to enhance the recognition accuracy of the model on complex images using transfer learning. To achieve this, we developed a 37-category dataset for pet cat and dog breeds classification, referred to as the CPI dataset. The categories and their corresponding number of images are shown in Table 1, where Name stands for the name of the category and N stands for the number of images. The CPI dataset comprises 5550 real-world images, all sourced from the Oxford-IIIT Pet dataset [15].
Oxford-IIIT Pet is a public dataset with 37 categories and a total of 7349 pet images. It includes 12 different breeds of pet cats and 25 different breeds of pet dogs, with each category containing approximately 200 images. The majority of the image data in the Oxford-IIIT Pet dataset are annotated images posted by enthusiasts on social networking sites dedicated to sharing photos of cats and dogs, including indoor and outdoor shots, as well as images with simpler or solid-colored backgrounds [15].
Utilizing a model trained solely on the Oxford-IIIT Pet dataset for identifying breeds in real-world cat and dog images is prone to errors. As previously highlighted, this issue arises from the model’s inadequate generalization capacity and the low signal-to-noise ratio of the training set feature maps. Consequently, we refined the Oxford-IIIT Pet dataset by discarding all overly simplistic images within each category, retaining only the more intricate images. Images in the Oxford-IIIT Pet dataset were removed if any of the following conditions were met:
(1) The image background is a solid color.
(2) The background has few features due to dim light (such as pictures taken in a dark room).
(3) The background features are too simple because the scene contains few distinguishing characteristics (e.g., grass, blue sky, a sofa, or a bedspread).
In this manner, we obtained 5247 images. To turn the selected images into a transfer learning dataset suitable for practical applications and to balance the categories, we applied rotations, brightness adjustments, noise filtering, and horizontal and vertical mirroring to categories with fewer images; the processed data served as augmentation. As a result, we produced the CPI dataset (see Figure 1), a pet cat and dog breed classification dataset evenly distributed across 37 categories.
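For illustration, a comparable augmentation pipeline can be assembled with torchvision; the specific parameter values below (rotation angle, brightness range, blur kernel) are assumptions made for this sketch rather than the exact settings used to build the CPI dataset.

```python
from torchvision import transforms

# Illustrative augmentation for under-represented categories; parameter
# values are placeholders, not the exact CPI construction settings.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=15),      # rotation
    transforms.ColorJitter(brightness=0.3),     # brightness adjustment
    transforms.GaussianBlur(kernel_size=3),     # stands in for the noise-filtering step
    transforms.RandomHorizontalFlip(p=0.5),     # horizontal mirroring
    transforms.RandomVerticalFlip(p=0.5),       # vertical mirroring
])
```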
Figure 1. Examples of dataset images. This figure presents some images of the proposed CPI dataset.
The CPI dataset is tailored for identifying pet cat and dog breeds in challenging environments. The dataset includes 5550 images of 37 pet categories, comprising 12 cat breeds and 25 dog breeds. As shown in Figure 1, these images are captured from diverse perspectives, settings, and lighting conditions, presenting a complex background. The backgrounds of these images include plants and various common indoor and outdoor objects. In some images, there may be people present, which can make the pets not the most salient objects (i.e., not the most noticeable). These factors are designed to reduce the signal-to-noise ratio of the image features observed during model training, increase the difficulty of identifying pets, and push the feature extraction capabilities of the model to the theoretical limits of its architectural design.

3.2. Fine-Tuning Pre-Trained Networks for Pet Breeds Identification

Transfer learning is a machine learning technique that enables models to leverage knowledge acquired through training on one task not only for that task but also for various other tasks. In practical applications, due to limitations in manpower, material resources, and time costs, only small-scale datasets are typically available. Transfer learning techniques are particularly beneficial in such scenarios, allowing models to transfer knowledge acquired from large datasets such as ImageNet to pet breeds identification tasks.
Classic neural network architectures typically consist of a backbone, neck, and head. As the name suggests, the backbone extracts features from data to create feature maps. The role of the head is to make a final decision based on the features extracted by the network, such as classification, detection, segmentation, etc. It is also referred to as the decision layer (or decision block). The neck is positioned between the backbone and head, often used to enhance the utilization of the features extracted by the backbone (from a design perspective, the neck is not essential). Therefore, in transfer learning techniques, fine-tuning the pre-trained network involves first removing the decision layer, redesigning it for the current task, and then retraining it. We transfer the knowledge obtained from large datasets such as ImageNet to pet breeds identification tasks. Furthermore, to enhance recognition accuracy in practical applications (especially in complex scenarios), we do not fine-tune the model directly on the Oxford-IIIT Pet dataset, but on the CPI dataset (see Figure 2).
Figure 2. Overview of the proposed methodology. This figure illustrates the process of fine-tuning the model using the CPI dataset.
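A minimal sketch of this fine-tuning procedure, using ResNet34 as an example backbone; the optimizer hyperparameters here are placeholders rather than the exact training configuration reported later.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained backbone and replace its decision layer
# (the final fully connected head) with a new 37-class classifier.
model = models.resnet34(weights=models.ResNet34_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 37)   # 37 pet breeds in the CPI dataset

# All parameters are fine-tuned with backpropagation.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```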

3.3. PBI-EdgeNeXt

Inspired by the EdgeNeXt network, we design a pet breed identification network for real-life, complex images, with the goal of offering people a way to identify pet breeds anytime and anywhere. We improve EdgeNeXt to provide a lightweight pet breed classification model. The model consists of four parts, Stage 1 to Stage 4 (see Figure 3), and its framework and improvements are described as follows.

3.3.1. Image Frequency Division and Octave Convolution

An Octave Convolution (OctConv) module [24] is introduced into our PBI-EdgeNeXt. This module aims to reduce the redundant information in the feature maps processed by the convolutional blocks and the SDTA (split depth-wise transpose attention) encoders, thereby improving their processing efficiency.
Reducing redundant information in feature maps is crucial for enhancing neural network performance. Neural network modules, such as convolution and attention blocks, often produce feature maps with significant spatial redundancy. Each pixel independently stores its own feature description, neglecting the shared information that could be stored and processed between adjacent pixels.
Figure 3. The proposed model. The figure displays the framework of PBI-EdgeNeXt.
Although EdgeNeXt has achieved commendable results in image classification [25], we introduce OctConv to reduce redundant information in feature maps, enhance the computational efficiency of each encoder, and improve classification performance in complex backgrounds with significant interference (see Figure 3). By decomposing the feature map into high-frequency edge information and smooth low-frequency color information, OctConv allows the model to differentiate more accurately between an image's background, interferences, and targets. Additionally, as shown in Figure 4, OctConv stores the smoothly varying low-frequency components in a low-resolution tensor one octave (a factor of two) smaller, as illustrated in Figure 4c, minimizing the information shared between neighboring pixels and thereby reducing redundancy. In this way, it avoids processing redundant spatial information and optimizes the use of hardware resources.
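To make the frequency split concrete, the following is a simplified, self-contained sketch of an octave convolution layer following [24]; it is illustrative only and omits the entry and exit variants that create and merge the two branches in the full module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv(nn.Module):
    """Simplified octave convolution: channels are split into a full-resolution
    high-frequency branch and a half-resolution low-frequency branch, with
    information exchanged between the two branches."""
    def __init__(self, in_ch: int, out_ch: int, kernel_size: int = 3, alpha: float = 0.5):
        super().__init__()
        in_lo, out_lo = int(alpha * in_ch), int(alpha * out_ch)
        in_hi, out_hi = in_ch - in_lo, out_ch - out_lo
        pad = kernel_size // 2
        self.conv_hh = nn.Conv2d(in_hi, out_hi, kernel_size, padding=pad)  # high -> high
        self.conv_hl = nn.Conv2d(in_hi, out_lo, kernel_size, padding=pad)  # high -> low
        self.conv_lh = nn.Conv2d(in_lo, out_hi, kernel_size, padding=pad)  # low  -> high
        self.conv_ll = nn.Conv2d(in_lo, out_lo, kernel_size, padding=pad)  # low  -> low

    def forward(self, x_hi: torch.Tensor, x_lo: torch.Tensor):
        hh = self.conv_hh(x_hi)
        hl = self.conv_hl(F.avg_pool2d(x_hi, 2))                # downsample before mixing into the low branch
        ll = self.conv_ll(x_lo)
        lh = F.interpolate(self.conv_lh(x_lo), scale_factor=2)  # upsample before mixing into the high branch
        return hh + lh, ll + hl

# Toy usage: a 64-channel feature map split 50/50 between the two branches.
x_hi = torch.randn(1, 32, 56, 56)   # high-frequency part, full resolution
x_lo = torch.randn(1, 32, 28, 28)   # low-frequency part, one octave lower
y_hi, y_lo = OctConv(64, 64)(x_hi, x_lo)
```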

3.3.2. Positional Encoding

Convolutional encoders, leveraging their distinctive receptive-field convolution operations, provide the model with a wealth of local feature information. Ref. [26] examines the degree to which CNNs implicitly encode absolute position information in their learned representations and demonstrates that deep CNNs may implicitly learn to encode this information. In complex visual tasks requiring scene comprehension and multi-object interaction analysis, such positional information is essential [27]. However, this crucial positional information is typically lacking during the initial stages of convolutional neural network training. If the CNN fails to learn how to encode position information, it may know what it is looking at but not where it is positioned in the image [26]. This occurs more frequently during the initial stages of network training, and it not only impedes the model's convergence but also potentially impacts its final performance.
Taking into account the reasons mentioned above, for the image classification task with a complex background, we introduce the positional encoding (PE) module in stages 3 and 4 of EdgeNeXt (see Figure 3). This addition provides extra spatial information for the model, enhancing its ability to comprehend the relative and absolute positional relationships of each part in the image [27].
Our positional encoding employs sine and cosine functions at varying frequencies:
$$PE_{(pos_i,\ 2i)} = \sin\left(pos_i / 10000^{2i/d_{model}}\right) \quad (1)$$
$$PE_{(pos_i,\ 2i+1)} = \cos\left(pos_i / 10000^{2i/d_{model}}\right) \quad (2)$$
where pos_i stands for the location and i for the dimension. By enhancing the model's capability to express positions within its feature maps, positional encoding accelerates convergence during training and improves performance after deployment on tasks requiring precise positional information.
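A minimal sketch of this sinusoidal encoding; flattening the feature map into a token sequence and adding the table to it is one common way to inject the positions, and the channel width used here is an illustrative assumption.

```python
import torch

def sinusoidal_positional_encoding(num_positions: int, d_model: int) -> torch.Tensor:
    """Builds the sin/cos table of Formulas (1) and (2): even dimensions
    use sine, odd dimensions use cosine."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)   # (N, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                  # (d_model/2,)
    angle = pos / torch.pow(10000.0, i / d_model)                         # (N, d_model/2)
    pe = torch.zeros(num_positions, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# Example: a 14x14 feature map with 192 channels flattened into 196 tokens.
tokens = torch.randn(1, 14 * 14, 192)
tokens = tokens + sinusoidal_positional_encoding(14 * 14, 192).unsqueeze(0)
```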

3.3.3. Loss Function

When training deep neural networks, cross-entropy loss [28] and focal loss [29] are the most common choices. Most research works deal with class-balanced datasets, for which the traditional loss functions already achieve ideal results. This paper, however, aims to improve recognition accuracy in practical applications, where the fine-tuning dataset collected for a real task is inevitably imbalanced across classes. EdgeNeXt [25] uses cross-entropy loss, which may not cope well with the problems caused by imbalanced datasets [30]. Focal loss is commonly used for imbalanced data tasks, but it requires tuning several hyperparameters and carrying out experimental tests [29], whereas in practical applications engineers often prefer a simpler and more effective solution. Therefore, this paper introduces the PolyLoss [31] function to train the PBI-EdgeNeXt model; its formula is shown in Formula (4):
$$Loss_{focal}(p, y) = -\alpha (1 - p_y)^{\gamma} \log(p_y) \quad (3)$$
$$Loss_{Poly}(p, y) = -\log(p_y) + \sum_{j=1}^{N} \epsilon_j (1 - p_y)^{j} \quad (4)$$
where p_y is the model's predicted probability for the correct category y, α is the weight used to balance positive and negative samples, γ is the focusing parameter that adjusts the weight of easy-to-classify samples, and ϵ_j are the polynomial coefficients.
As shown in Formula (3), there are many hyperparameters that need to be adjusted when using the focal loss function. PolyLoss provides a flexible framework that can adjust polynomial bases based on different tasks and data to improve performance. For imbalanced datasets, the loss function uses an adaptive penalty mechanism to effectively increase the model’s focus on the minority class. Specifically, we use the Poly-1 loss function, which is expressed as follows:
$$Loss_{Poly\text{-}1}(p, y) = -\log(p_y) + \epsilon_1 (1 - p_y) \quad (5)$$
We adopt the Poly-1 loss function because it can be adapted to new tasks and datasets by adjusting only one hyperparameter: the coefficient of the first-order polynomial term, ϵ_1.
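A sketch of the Poly-1 loss of Formula (5), written on top of PyTorch's cross-entropy (a common implementation pattern; the mean reduction is a choice of this sketch).

```python
import torch
import torch.nn.functional as F

def poly1_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                        epsilon: float = 1.0) -> torch.Tensor:
    """Poly-1 loss: cross-entropy plus an epsilon-weighted first-order term (1 - p_y)."""
    ce = F.cross_entropy(logits, targets, reduction="none")       # -log(p_y) per sample
    p_y = F.softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return (ce + epsilon * (1.0 - p_y)).mean()

# Example on a batch of 37-class logits, with epsilon = 1.0 as in our experiments.
logits = torch.randn(16, 37)
targets = torch.randint(0, 37, (16,))
loss = poly1_cross_entropy(logits, targets, epsilon=1.0)
```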

3.3.4. Optimizer

To achieve efficient and high-precision image classification, an effective optimizer is crucial. SGD and the Adam family [32] are the most common optimizers and are widely used in deep learning training. However, they often suffer from training oscillations and slow convergence in complex image classification tasks with class imbalance or many interfering objects. SGD optimizes parameters using only gradient information, omitting second-order information, which leads to slower convergence and a higher tendency to get stuck in local optima for non-convex problems. Adam and its variants optimize model parameters using the gradient together with first- and second-order moment estimates; however, they incur additional computational cost when dealing with non-convex functions, which can slow convergence. While Adam converges faster than SGD, its reliance on the second-order moment to control the learning rate often results in step sizes that are too small during the later stages of training. This may lead to missing the global optimum and even failing to converge [33]. Keskar et al. [33] conducted extensive experiments and analyses, revealing that when Adam is employed to train models on the CIFAR-10 dataset [8], although it converges faster than SGD, the final test performance is inferior to that of models trained using SGD.
In this paper, we employ Sophia as the optimizer to train the PBI-EdgeNeXt model. The design of Sophia is better suited to the heterogeneous curvature across different parameter dimensions, making it appropriate for a wide range of tasks [34], especially when dealing with imbalanced datasets. The key to Sophia's adaptation to these differences is that it uses a lightweight estimate of the diagonal Hessian as a pre-conditioner. By limiting the update amplitude through element-wise clipping, it controls the worst-case update size and mitigates the negative impact of non-convexity and of rapid changes in the Hessian along the trajectory. This approach bounds the size of parameter updates in the most adverse scenarios, facilitating efficient parameter optimization [34]. Consequently, it ensures a more balanced convergence rate and more effective optimization across the various parameter dimensions.
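The following is a heavily simplified, illustrative sketch of the clipped, Hessian-preconditioned update described above; it is not the authors' implementation of Sophia [34] and omits the periodic diagonal-Hessian estimation, bias correction, and weight decay of the full algorithm.

```python
import torch

@torch.no_grad()
def sophia_like_step(param: torch.Tensor, grad: torch.Tensor,
                     m: torch.Tensor, h: torch.Tensor,
                     lr: float = 1e-4, beta1: float = 0.99,
                     rho: float = 0.04, eps: float = 1e-12) -> None:
    """One simplified update in the spirit of Sophia: the gradient EMA is
    divided by a diagonal Hessian estimate `h` (assumed to be maintained
    elsewhere) and clipped element-wise so the worst-case step stays bounded."""
    m.mul_(beta1).add_(grad, alpha=1.0 - beta1)                   # momentum (gradient EMA)
    step = (m / torch.clamp(rho * h, min=eps)).clamp_(-1.0, 1.0)  # precondition, then clip
    param.add_(step, alpha=-lr)
```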

4. Experiment and Analysis

4.1. Experimental Setup

The experimental setup is as follows: the GPU is an NVIDIA GeForce RTX 3050 Ti Laptop GPU, the CPU is an Intel Core i7-12700H, the operating system is Windows 11, and the proposed PBI-EdgeNeXt framework is implemented in PyTorch. The initial learning rate is 0.0001, the batch size is 16, the maximum number of epochs is 100, β1 and β2 are set to 0.99 and 0.999, the first-order coefficient ϵ_1 of the Poly-1 loss function is 1.0, and the hyperparameter rho of the Sophia optimizer is 0.04.
To ensure the effectiveness and reliability of the training process, we divide the dataset into a training set, validation set, and testing set. The training set comprises 3552 images, the validation set comprises 888 images, and the test set comprises 1110 images. Throughout the training process, we adjust the model parameters and optimize its performance using the training set, assess the model progress using the validation set, and evaluate the model’s final performance and ability to deal with unknown data using the test set after training. This approach allows us to more scientifically assess the performance of the pet breed classification model.

4.2. Evaluation Indicators

In this experiment, we select the mean precision (mP), mean recall (mR), mean F1 score (mF1), and accuracy (acc) as the evaluation metrics for the model.
The mP represents the average value of the model’s precision rate for each type of sample. The formula for calculating precision is as follows:
$$Precision = \frac{TP}{TP + FP}$$
The mR represents the average value of the model’s recall rate for each type of sample. The formula for calculating recall is as follows:
$$Recall = \frac{TP}{TP + FN}$$
The formula for calculating F 1 score is as follows:
$$F_1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
The evaluation of the model accuracy uses the Top-1 accuracy of the model classification results, which is calculated as follows:
$$Acc = \frac{TP}{Total}$$
Since the dataset has 37 categories, we calculate the precision, recall, and F1 score for each category and then average them to obtain the mean precision (mP), mean recall (mR), and mean F1 score (mF1).
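For reference, these macro-averaged metrics can be computed as in the sketch below (using scikit-learn; the toy label arrays are placeholders).

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# y_true / y_pred hold ground-truth and predicted class indices (0..36); toy values here.
y_true = np.array([0, 5, 12, 5, 36])
y_pred = np.array([0, 5, 12, 7, 36])

# "macro" averaging computes each metric per category and then averages them,
# matching the mP / mR / mF1 definitions above.
mP, mR, mF1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
acc = accuracy_score(y_true, y_pred)   # Top-1 accuracy
```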

4.3. Comparison of Two Datasets

To confirm the enhanced performance of the model following fine-tuning on the CPI dataset, we conduct comparative experiments across two datasets. In our experiment, we employ three distinct SOTA deep learning image classification models: Swin Transformer [16], DenseNet [17], and ResNet [18]. Specifically, we use ResNet34, DenseNet121, and the tiny version of Swin Transformer. The selection of these well-known architectures is derived from the comparative research in [11,35,36], with each model representing a distinct architecture family (residual convolution, dense connection, and transformer). These three models are pre-trained on the ImageNet dataset and fine-tuned on two pet breeds classification datasets as follows:
(1) Fine-tune the network using the images from the Oxford-IIIT Pet dataset.
(2) Fine-tune the model using the CPI dataset, which is the method proposed in this paper (see Figure 2).
The categories in the CPI dataset are divided in an 8:2 ratio, with 80% allocated for training and 20% for testing. To ensure the fairness of this comparative experiment, we consistently use the test set drawn from the CPI dataset for model testing. Therefore, we remove the pictures selected for the test set from the Oxford-IIIT Pet dataset beforehand, ensuring that only data unseen by the model appear in the test set. In this comparative experiment, all images in the Oxford-IIIT Pet and CPI datasets are resized to 224 × 224, ensuring consistency in image size between the fine-tuning and pre-training stages for each model.
In the two aforementioned scenarios, all parameters of the pretrained networks are fine-tuned using backpropagation and the SGD optimizer. We use the Top-1 accuracy as the primary evaluation metric in the test, with the Top-5 accuracy, mP, mR, and mF1 score as auxiliary evaluation metrics. Comparing the CPI dataset with the original Oxford-IIIT Pet dataset helps us analyze the impact of complex images on classification accuracy in practical applications.
The training results are shown in Table 2. Models fine-tuned on the CPI dataset clearly exhibit superior performance compared to those fine-tuned on the Oxford-IIIT Pet dataset. For Top-1 accuracy, ResNet34 improves by 0.36 percentage points, DenseNet121 by 0.09 percentage points, and Swin Transformer by 1.26 percentage points. Having conducted these comparative experiments on two datasets with three different architectures, the results demonstrate that our CPI dataset is more suitable for fine-tuning pet breed classification models.

4.4. Ablation Experiments

To validate the loss function employed to train PBI-EdgeNeXt and evaluate the effects of the various modules on the vanilla EdgeNeXt model, we conduct ablation experiments on the CPI dataset using the Sophia optimizer. The experimental setup and results are detailed in Table 3.
On the CPI dataset, we train the original EdgeNeXt model for 100 epochs with the cross-entropy loss function and with the Poly-1 loss function, respectively. The test results indicate that the model trained with the Poly-1 loss function improves over the one trained with the cross-entropy loss function: the Top-1 accuracy is 0.05 percentage points higher, and the Top-5 accuracy is improved by 1.10 percentage points.
We incorporate position encoding modules into stages 3 and 4 of the vanilla EdgeNeXt architecture and train the model using a cross-entropy loss function. Test results indicate that both Top-1 and Top-5 accuracies are enhanced relative to the baseline EdgeNeXt model, gaining 0.18 and 0.54 percentage points, respectively. When the model is trained using Poly-1 loss, there is a 0.63 percentage point improvement in the Top-1 accuracy and a 1.08 percentage point improvement in the Top-5 accuracy compared to training with cross-entropy loss.
We then assess the influence of the Octave convolutional block on the network model. We integrate it into the original EdgeNeXt and conduct 100 training epochs each, using the cross-entropy loss function and PolyLoss on the CPI dataset, respectively. The results show that the inclusion of this module leads to slight enhancements in Top-1 accuracy, recall, and precision. Compared to the original EdgeNeXt model, the Top-1 accuracy increases by 0.63 percentage points and the Top-5 accuracy by 1.35 percentage points for the model trained with cross-entropy loss. Furthermore, the Top-1 accuracy rises an additional 1.17 percentage points for the PolyLoss-trained model compared to the cross-entropy-trained model.
Finally, we test our full solution by adding both the PE modules and the Octave convolutional block to the model (i.e., using our PBI-EdgeNeXt model) while training with PolyLoss. Among the eight experimental configurations, our solution achieves the highest recall for pet breeds identification, coupled with high precision. Additionally, the Top-1 accuracy of our model is the highest across the eight configurations, showing an improvement of 1.89 percentage points over the vanilla EdgeNeXt model and 0.45 percentage points over the same architecture trained with the cross-entropy loss function. This demonstrates the effectiveness of our solution in enhancing the EdgeNeXt algorithm.

4.5. Comparison of Models

To validate the performance of PBI-EdgeNeXt, we compare it with five lightweight SOTA network models. These models are trained and tested on the CPI dataset, and their performance is detailed in Table 4.
It is obvious that our PBI-EdgeNeXt outperforms the other models as demonstrated by its top performance across the five indicators we selected. Our model not only achieves high average recall rates across various types but also ensures high precision, securing the highest F1 score among the six models. On the CPI dataset, it achieves a Top-1 accuracy of 87.12% and a Top-5 accuracy of 97.21%.
Figure 5 displays the validation Top-1 accuracy curves for six model training processes. The graph reveals that our PBI-EdgeNeXt model exhibits the best performance. This superiority is evident not only in its highest validation accuracy but also in its rapid convergence rate.

4.6. Saliency-Map-Based Evaluation

To visualize the processing of image features by PBI-EdgeNeXt, we employ the saliency map visualization technique to assess the significance of the information for the task from each pixel. Figure 6 displays some of the visualization outcomes. The data used in the saliency-map-based evaluation are sourced from our CPI dataset, characterized by intricate image backgrounds and prevalent interference, leading to a lower signal-to-noise ratio in its feature maps. The saliency map visualizations reveal that our PBI-EdgeNeXt can extract task-relevant features from complex images while filtering out irrelevant background and disturbances.
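The specific saliency technique is not detailed here, so the sketch below shows the common vanilla-gradient variant (the magnitude of the gradient of the predicted class score with respect to each input pixel), which is one standard way to produce such maps; treating it as the exact method used is an assumption.

```python
import torch

def saliency_map(model: torch.nn.Module, image: torch.Tensor, target_class: int) -> torch.Tensor:
    """Vanilla-gradient saliency for a normalized image tensor of shape (1, 3, H, W):
    returns an (H, W) map of per-pixel importance for the target class."""
    model.eval()
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]               # score of the class of interest
    score.backward()                                     # d(score) / d(pixel)
    return image.grad.abs().max(dim=1)[0].squeeze(0)     # max over colour channels
```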
Additionally, we utilize six models to extract features from some representative complex images that we selected, and display the three most representative results using the saliency map visualization technique (see Figure 7). It is apparent from Figure 7 that some models do not allocate adequate attention to the pet’s face, while others concentrate solely on the face, neglecting other parts of the pet. All of these issues lead to the models’ comprehension of image features being overly simplistic or incomplete, resulting in failures.

4.7. Discussion

In this research, we introduce a method that utilizes transfer learning to enhance the accuracy of pet breeds classification in complex backgrounds. To develop this method, we create a specialized dataset for pet breeds identification in complex backgrounds, referred to as the CPI dataset. This dataset is derived from the Oxford-IIIT Pet dataset, with images chosen from those with complex backgrounds, interference, or redundant information. We employ three models with different architectures pretrained on ImageNet, fine-tune them on the CPI dataset and on the Oxford-IIIT Pet dataset, and then evaluate them on the same test set of complex images. The experimental results indicate that the Top-1 accuracy of the models fine-tuned on the CPI dataset is higher than that of those fine-tuned on the Oxford-IIIT Pet dataset, as shown in Table 2. This suggests that our CPI dataset is better suited for models that operate in real-world environments. A more detailed analysis is given in the "Experiment and Analysis" section, so we do not expand on it here.
Meanwhile, we propose a new lightweight model termed PBI-EdgeNeXt, designed to address the challenges of pet breeds identification in complex backgrounds. This model combines EdgeNeXt with octave convolution, enabling it to effectively capture task-specific features while suppressing irrelevant and redundant information. As described in the “Ablation Experiments” subsection above, the validity of our improvements to EdgeNeXt is confirmed through detailed ablation studies, as shown in Table 3. Our PBI-EdgeNeXt model demonstrates outstanding performance on the CPI dataset, outperforming the other five prevalent lightweight SOTA models (as depicted in Table 4). This indicates that our model design not only achieves theoretical breakthroughs but also has significant advantages in practical applications, providing a new technical path and practical value for solving the problem of pet breed recognition in complex backgrounds.
Finally, we employ the saliency map visualization to display the feature extraction of selected images from the CPI dataset using PBI-EdgeNeXt (see Figure 6). This visualization demonstrates its capacity to suppress redundant information and capture task-relevant features within the images. This visualization not only helps us understand the working mechanism of the model in complex scenarios but also intuitively shows its potential and effectiveness in practical applications. We also select some representative complex images from the test set, use six models to extract the features of these images, and display them using saliency map visualization techniques. We select three of the most representative results as shown in Figure 7. This visualization reveals the success of the proposed PBI-EdgeNeXt model and the reasons for the failure of some comparative models.
Overall, our research not only introduces new datasets and model designs for the field of pet breed identification but also provides in-depth theoretical exploration and empirical analysis for image classification problems in complex backgrounds. These achievements are not only of great significance in academic research but also provide powerful technical support and guidance for the industrial sector in practical applications, thereby promoting progress and development in related fields.
However, our pet breed identification system has certain limitations. The biggest limitation is that the proposed PBI-EdgeNeXt, as a classification model, can only output one definite classification result per image. Compared to object detection and instance segmentation models, our model cannot classify pet breeds for multiple instances in one image. One possible solution is to first detect pets in the image with an object detector, crop the detected regions of interest from the image, and then use our model to classify the breed within each region of interest. However, this solution can only yield accurate results if the object detector detects pets accurately and without cutting away informative parts of the animal. Secondly, the dataset used for training the model does not contain cases where pets are heavily occluded. Therefore, when a pet in the input image loses too much information due to occlusion, the model may not be able to classify it accurately from the unoccluded parts. In the future, we will conduct further research on this work and continue to look for improvements to the proposed solution in order to enhance the final classification accuracy of the model and address the aforementioned issues as much as possible.

5. Conclusions

Existing computer vision technology demonstrates low accuracy in classifying pet breeds in practical scenarios. We identify the primary cause as an excess of overly simple data within the datasets used for model training and testing, and delineate the consequent drawbacks. To address this challenge, we have constructed a complex image dataset termed CPI, tailored for breed classification of pet cats and dogs in real-world scenarios (especially complex environments). We have also introduced a lightweight breed identification system for pet cats and dogs, PBI-EdgeNeXt. Building on this foundation, we use the CPI dataset to apply knowledge transfer to three different SOTA models (ResNet34, DenseNet121, Swin Transformer) and compare the PBI-EdgeNeXt model with several lightweight SOTA models commonly used on edge devices. The results indicate that knowledge transfer based on the CPI dataset can significantly enhance the capability of the identification model to categorize complex images in real-world scenarios, achieving higher accuracy in breed identification of pet cats and dogs in such scenes than fine-tuning on the Oxford-IIIT Pet dataset. The PBI-EdgeNeXt model demonstrates the highest classification accuracy among the compared models, reaching the SOTA level. Furthermore, we showcase the feature saliency maps produced by PBI-EdgeNeXt for selected complex images, which show that, after fine-tuning on the CPI dataset, our model focuses only on the regions of interest (pet cats or dogs), while other redundant features and interferences are suppressed. Therefore, we conclude that the PBI-EdgeNeXt architecture is highly suitable for the classification of pet breeds in real-world scenarios, particularly in complex environments.

Author Contributions

Z.L.: methodology, research design; H.X.: resources; Y.L.: verification, manuscript writing; Y.Q.: resources; C.W.: verification, software. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article. The data supporting the findings of this study are available from the first author, [Lin], upon reasonable request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Raymond, P.W.; Brandon, D.V.; Claire, M.W. Forensic DNA phenotyping: Canis familiaris breed classification and skeletal phenotype prediction using functionally significant skeletal SNPs and indels. Anim. Genet. 2022, 53, 247–263. [Google Scholar] [CrossRef] [PubMed]
  2. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2012, 25, 84–90. [Google Scholar] [CrossRef]
  3. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 640–651. [Google Scholar]
  4. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; proceedings, part III 18. Springer International Publishing: New York, NY, USA, 2015; pp. 234–241. [Google Scholar]
  5. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  6. Yi, L.; Li, G.; Jiang, M. An end-to-end steel strip surface defects recognition system based on convolutional neural networks. Steel Res. Int. 2017, 88, 1600068. [Google Scholar] [CrossRef]
  7. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009. [Google Scholar]
  8. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images; Technical Report; University of Toronto: Toronto, ON, Canada, 2009. [Google Scholar]
  9. Fe-Fei, L. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; IEEE: Piscataway, NJ, USA, 2003. [Google Scholar]
  10. Griffin, G.; Holub, A.; Perona, P. Caltech-256 Object Category Dataset; California Institute of Technology: Pasadena, CA, USA, 2007. [Google Scholar]
  11. Khan, S.S.; Doohan, N.V.; Gupta, M.; Jaffari, S.; Chourasia, A.; Joshi, K.; Panchal, B. Hybrid Deep Learning Approach for Enhanced Animal Breed Classification and Prediction. Trait. Signal 2023, 40, 2087–2099. [Google Scholar] [CrossRef]
  12. Mostajer Kheirkhah, F.; Asghari, H. Plant leaf classification using GIST texture features. IET Comput. Vis. 2019, 13, 369–375. [Google Scholar] [CrossRef]
  13. Mabrouk, A.B.; Najjar, A.; Zagrouba, E. Image flower recognition based on a new method for color feature extraction. In Proceedings of the 2014 International Conference on Computer Vision Theory and Applications (VISAPP), Lisbon, Portugal, 5–8 January 2014; IEEE: Piscataway, NJ, USA, 2014; Volume 2. [Google Scholar]
  14. Rowe, Z.W.; Scott-Samuel, N.E.; Cuthill, I.C. How background complexity impairs target detection. Anim. Behav. 2024, 210, 99–111. [Google Scholar] [CrossRef]
  15. Parkhi, O.M.; Vedaldi, A.; Zisserman, A.; Jawahar, C.V. Cats and dogs. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; IEEE: Piscataway, NJ, USA, 2012. [Google Scholar]
  16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Conference, 11–17 October 2021. [Google Scholar]
  17. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  18. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  19. Yuan, Z.; Zhang, W.; Tian, C.; Rong, X.; Zhang, Z.; Wang, H.; Fu, K.; Sun, X. Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  20. Chen, Y.; Zhang, Z.; Dong, L.; Xiong, S.; Lu, X. A Joint Saliency Temporal–Spatial–Spectral Information Network for Hyperspectral Image Change Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  21. Wang, D.; Bai, Y.; Wu, C.; Li, Y.; Shang, C.; Shen, Q. Convolutional LSTM-Based Hierarchical Feature Fusion for Multispectral Pan-Sharpening. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–16. [Google Scholar] [CrossRef]
  22. Sabour, S.; Frosst, N.; Hinton, G.E. Dynamic routing between capsules. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  23. Abra Ayidzoe, M.; Yu, Y.; Mensah, P.K.; Cai, J.; Adu, K.; Tang, Y. Gabor capsule network with preprocessing blocks for the recognition of complex images. Mach. Vis. Appl. 2021, 32, 91. [Google Scholar] [CrossRef]
  24. Chen, Y.; Fan, H.; Xu, B.; Yan, Z.; Kalantidis, Y.; Rohrbach, M.; Yan, S.; Feng, J. Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  25. Maaz, M.; Shaker, A.; Cholakkal, H.; Khan, S.; Zamir, S.W.; Anwer, R.M.; Shahbaz Khan, F. Edgenext: Efficiently amalgamated cnn-transformer architecture for mobile vision applications. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022. [Google Scholar]
  26. Islam, M.A.; Jia, S.; Bruce, N.D. How much position information do convolutional neural networks encode? arXiv 2020, arXiv:2001.08248. [Google Scholar]
  27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30. [Google Scholar]
  28. De Boer, P.T.; Kroese, D.P.; Mannor, S.; Rubinstein, R.Y. A tutorial on the cross-entropy method. Ann. Oper. Res. 2005, 134, 19–67. [Google Scholar] [CrossRef]
  29. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  30. Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-Balanced Loss Based on Effective Number of Samples. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019), Long Beach, CA, USA, 15–20 June 2019; pp. 9260–9269. [Google Scholar]
  31. Leng, Z.; Tan, M.; Liu, C.; Cubuk, E.D.; Shi, X.; Cheng, S.; Anguelov, D. PolyLoss: A Polynomial Expansion Perspective of Classification Loss Functions. arXiv 2022, arXiv:2204.12511. [Google Scholar]
  32. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  33. Keskar, N.S.; Socher, R. Improving generalization performance by switching from adam to sgd. arXiv 2017, arXiv:1712.07628. [Google Scholar]
  34. Liu, H.; Li, Z.; Hall, D.; Liang, P.; Ma, T. Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training. arXiv 2023, arXiv:2305.14342. [Google Scholar]
  35. Aarizou, A.; Merah, M. Transfer Learning for Plant Disease Detection on Complex Images. In Proceedings of the 2022 7th International Conference on Image and Signal Processing and Their Applications (ISPA), Mostaganem, Algeria, 8–9 May 2022; IEEE: Piscataway, NJ, USA, 2022. [Google Scholar]
  36. Cahyo, D.D.N.; Sunyoto, A.; Ariatmanto, D. Transfer Learning and Fine-tuning Effect Analysis on Classification of Cat Breeds using a Convolutional Neural Network. In Proceedings of the 2023 6th International Conference on Information and Communications Technology (ICOIACT), Yogyakarta, Indonesia, 10–11 November 2023; IEEE: Piscataway, NJ, USA, 2023. [Google Scholar]
Figure 4. Octave convolution: this figure illustrates the computation of octave convolution. Subfigure (a) displays the main calculations, as well as the updating and exchange of information. Subfigure (b) shows the division of the input feature map into high-frequency and low-frequency components. Subfigure (c) demonstrates the reduction in resolution of the low-frequency components that contain redundant information. Subfigure (d) acts as a legend.
Figure 5. Accuracy curve of the six models. The figure displays the results of the Top-1 classification accuracy evaluation using the validation set during their respective training processes.
Figure 6. Saliency maps. This figure demonstrates the capacity of PBI-EdgeNeXt to suppress redundant information and capture task-relevant features within the images.
Figure 7. Saliency maps comparison of six models. This figure shows how much attention the models attach to the information in these images, and visually reveals the reasons for the failure of some comparative models.
Table 1. CPI dataset composition. The table lists the pet breeds and for each, the number of images in the CPI dataset.
Name | N | Name | N | Name | N
Abyssinian | 150 | english setter | 150 | Ragdoll | 150
american bulldog | 150 | german shorthaired | 150 | Russian Blue | 150
american pit bull terrier | 150 | great pyrenees | 150 | saint bernard | 150
basset hound | 150 | havanese | 150 | samoyed | 150
beagle | 150 | japanese chin | 150 | scottish terrier | 150
Bengal | 150 | keeshond | 150 | shiba inu | 150
Birman | 150 | leonberger | 150 | Siamese | 150
Bombay | 150 | Maine Coon | 150 | Sphynx | 150
boxer | 150 | miniature pinscher | 150 | staffordshire bull terrier | 150
British Shorthair | 150 | newfoundland | 150 | wheaten terrier | 150
chihuahua | 150 | Persian | 150 | yorkshire terrier | 150
Egyptian Mau | 150 | pomeranian | 150 | |
english cocker spaniel | 150 | pug | 150 | |
Table 2. Comparison of two datasets. It illustrates that models exhibit superior performance after fine-tuning on the CPI dataset compared to those fine-tuned on the vanilla Oxford-IIIT Pet dataset.
Metric | ResNet (Oxford-IIIT Pet) | DenseNet (Oxford-IIIT Pet) | Swin Transformer (Oxford-IIIT Pet) | ResNet (CPI) | DenseNet (CPI) | Swin Transformer (CPI)
Top-1 Acc | 79.10 | 89.10 | 85.32 | 79.46 | 89.19 | 86.58
Top-5 Acc | 95.41 | 98.02 | 96.22 | 93.87 | 97.93 | 96.13
mP | 79.90 | 89.38 | 85.56 | 79.81 | 89.37 | 86.92
mR | 79.10 | 89.10 | 85.32 | 79.46 | 89.19 | 86.58
mF1 | 79.05 | 89.11 | 85.27 | 79.32 | 89.12 | 86.54
Table 3. The table presents the results of the ablation experiments on the PBI-EdgeNeXt model. It demonstrates the effect of the loss function and two modules on enhancing the model’s performance.
PolyLoss | PE | OctConv | Top-1 (%) | Top-5 (%) | mP (%) | mR (%) | mF1 (%)
– | – | – | 85.50 | 95.50 | 85.71 | 85.50 | 85.43
✓ | – | – | 85.95 | 96.40 | 86.20 | 85.95 | 85.87
– | ✓ | – | 85.68 | 96.04 | 85.99 | 85.68 | 85.65
– | – | ✓ | 86.13 | 96.85 | 86.32 | 86.13 | 86.07
– | ✓ | ✓ | 86.94 | 95.77 | 87.05 | 86.94 | 86.88
✓ | ✓ | – | 86.31 | 97.12 | 86.70 | 86.31 | 86.30
✓ | – | ✓ | 87.30 | 96.67 | 87.52 | 87.30 | 87.20
✓ | ✓ | ✓ | 87.39 | 96.49 | 87.58 | 87.39 | 87.35
Table 4. Performance comparison of models. This table presents the results of the comparative experiments between the PBI-EdgeNeXt model and the five selected models.
Model | Top-1 Acc (%) | Top-5 Acc (%) | mP (%) | mR (%) | mF1 (%)
EfficientNet-B0 | 85.50 | 97.21 | 86.00 | 85.50 | 85.47
EfficientNetV2-B0 | 78.74 | 94.68 | 79.51 | 78.74 | 78.71
MobileNetV2 | 82.88 | 96.76 | 83.51 | 82.88 | 82.70
ShuffleNetV1 | 80.54 | 95.86 | 81.23 | 80.54 | 80.45
ShuffleNetV2 | 81.08 | 95.23 | 81.59 | 81.08 | 81.01
Ours | 87.12 | 97.21 | 87.32 | 87.12 | 87.07
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
