Article

A New Siamese Network Loss for Cattle Facial Recognition in a Few-Shot Learning Scenario

João Porto, Gabriel Higa, Vanessa Weber, Fabrício Weber, Newton Loebens, Pietro Claure, Leonardo de Almeida, Karla Porto and Hemerson Pistori
1 Inovisão Department, Universidade Católica Dom Bosco, Campo Grande 79117-900, Brazil
2 Kerow Soluções de Precisão, Campo Grande 79117-900, Brazil
3 Faculty of Engineering, Universidade Estadual de Mato Grosso do Sul, Campo Grande 79115-898, Brazil
4 Faculty of Applied Mathematics, Universidade Federal do Pampa, Itaqui 97650-000, Brazil
5 Faculty of Engineering, Universidade Federal de Mato Grosso do Sul, Campo Grande 79070-900, Brazil
* Author to whom correspondence should be addressed.
AgriEngineering 2024, 6(3), 2941-2954; https://doi.org/10.3390/agriengineering6030169
Submission received: 21 June 2024 / Revised: 4 August 2024 / Accepted: 13 August 2024 / Published: 20 August 2024
(This article belongs to the Section Livestock Farming Technology)

Abstract

This study explores the use of a Siamese neural network architecture to enhance classification performance in few-shot learning scenarios, with a focus on bovine facial recognition. Traditional methodologies often require large datasets, which can significantly stress animals during data collection. In contrast, the proposed method aims to reduce the number of images needed, thereby minimizing animal stress. Systematic experiments conducted on datasets representing both full and few-shot learning scenarios revealed that the Siamese network consistently outperforms traditional models, such as ResNet101. It achieved notable improvements, with mean values increasing by over 6.5% and standard deviations decreasing by at least 0.010 compared to the ResNet101 baseline. These results highlight the Siamese network’s robustness and consistency, even in resource-constrained environments, and suggest that it offers a promising solution for enhancing model performance with fewer data and reduced animal stress, despite its slower training speed.

1. Introduction

Agriculture is a sector of significant relevance to the Brazilian economy, with livestock farming accounting for approximately 8% of the sector's 25% contribution to the GDP. According to the report released by the Brazilian Association of Meat Exporting Industries [1], Brazil has the second-largest cattle herd in the world, with approximately 202 million head, representing 12.18% of the global bovine population. Additionally, Brazil is the largest exporter of beef worldwide and has been integrating advanced technologies to meet the global demand for food [2,3].
Centered on the global objectives of food production, disease control, traceability, and the updating of health and sanitation records in support of livestock farming, animal identification holds significant socioeconomic importance. Compliance with the guidelines of the WOAH (World Organisation for Animal Health) regarding animal tracking, especially for herds intended for direct food production, coupled with increasing consumer demands for food safety, underscores the importance of traceability systems [4].
Computer vision applications involving machine learning have become increasingly common not only in controlled, laboratory environments, but also as strongly integrated elements in dynamic, real-time decision-making processes. Specifically, agriculture and livestock farming have benefited from technological advancements in fields such as few-shot learning (FSL) [5] since it makes the usage of deep learning techniques with small datasets feasible. For instance, Yang et al. [6] proposed a methodology based on FSL for plant disease classification, and Qiao et al. [7] utilized an FSL-based strategy for bovine segmentation in videos.
One of the possible applications of deep learning in a feedlot system is the individual identification of animals, which allows for traceability. In turn, this is useful for tasks such as sanitary control [8] and insurance fraud prevention [9]. Animal biometric identification with deep learning is usually conducted on muzzle prints and coat patterns, which can be considered more reliable and less invasive than traditional techniques [8]. Other characteristics that can be used for identification include iris and retinal vein patterns [10].
Traceability provided by the association of machine learning and artificial intelligence is also a crucial component in the confinement system, as it enables the individual identification of animals. The information collected from this traceability focuses primarily on sanitary control and disease trajectories. Animal biometrics is generally carried out through the identification of cattle nose patterns, which are comparable to fingerprints, with the added advantage of being a non-invasive method reliant solely on machine learning modeling [11].
The replacement of traditional identification methods with traceability using cattle nose pattern recognition, as opposed to direct skin marking or ear tagging, is emerging as a factor of quality and animal welfare. This non-invasive technique not only enhances the accuracy of identification, but also significantly reduces stress and discomfort for the animals, promoting a higher standard of well-being [12]. Considered a frontier milestone in computer vision, biometrics applied to livestock and animal monitoring in general draws on pattern recognition and cognitive science as essential elements for recording, correctly identifying, and verifying animals, with cattle as a primary focus. This integration of advanced technologies not only ensures precise identification, but also enhances the overall management and welfare of livestock, as noted by Uladzislau and Feng [13].
Deep learning applications for cattle biometric identification have been studied in several recent works, and a comprehensive review on the topic was conducted by Hossain et al. [8]. For instance, Li et al. [11] evaluated over 50 convolutional neural networks and reported a maximum accuracy of 98.7% in cattle muzzle identification. Some of these works employed few-shot learning techniques. One of them is the work by Qiao et al. [7], who proposed a model for one-shot segmentation of cattle, which is important for the acquisition of data on individuals. Another example is that by Shojaeipour et al. [12], who evaluated a ResNet50 in a 5-shot cattle muzzle identification task and reported an accuracy of 99.11%. With those experiments in mind, we propose to consider not only the muzzle but the entire face of the animal, which exhibits significant distinguishing features, using a Siamese network adaptation of the classifiers analyzed in those papers.
In the assessment of each experiment, a crucial step involves the computation of a loss value, a fundamental process within the realm of deep learning, executed via diverse functions tailored to perform distinct calculations [14]. Remarkably, these functions yield vastly disparate values for identical inputs, emphasizing the paramount significance of selecting an appropriate function, particularly within metric learning frameworks [15]. While there are a number of acknowledged loss functions for metric learning, such as the contrastive and triplet losses [16], the strategy of adapting such losses or composing new ones is adopted in works on diverse problems with the objective of enhancing performance [17].
For instance, Lu et al. [18] used a modified contrastive loss function to train a Siamese neural network for signature verification, reporting an increase of 5.9% in accuracy. This strategy is commonly used for human face recognition. SphereFace [19], CosFace [20], and ArcFace are but three examples of custom loss functions designed for this purpose. ArcFace has been used for cattle face identification in the work of Xu et al. [21], where an accuracy of 91.3% was reported.
Building upon these advancements, we devised a novel loss function, emphasizing the distinctions between feature vectors of different classes, to facilitate more efficient convergence of the error to zero. In this paper, we investigate the use of a Siamese neural network for cattle identification both in a traditional setting and in a 10-shot learning context, using the proposed loss function for both, achieving a statistically significant improvement of over 9% in F-score.
The principal contributions of this paper are threefold: (i) the adaptation and evaluation of a Siamese neural network designed to determine whether two facial images belong to the same cow and to classify cow face images within a predefined set of individuals; (ii) the development of a novel loss function tailored to optimize the network in few-shot learning scenarios; and (iii) an exploration of potential real-world applications of these techniques, extending beyond cattle to other domains.

2. Materials and Methods

2.1. Dataset

The dataset for Nelore cattle faces was collected in Campo Grande, Mato Grosso do Sul, Brazil, at the facilities of Embrapa Beef Cattle—CNPGC. The collection process involved capturing images of 47 Nelore breed cattle, comprising 20 males and 27 females. This was achieved using a GoPro5 camera mounted on a tripod positioned alongside the containment chute. The camera was strategically placed to ensure consistent and high-quality image capture of the cattle’s faces.
The resulting dataset comprises 2210 video frames, each meticulously separated and labeled according to the individual animal’s electronic cattle tag. To prepare the images for analysis, preprocessing was conducted to ensure only the faces of the cattle were visible, as shown in Figure 1. A simple blob detection method [22] was then applied to remove any residual black areas from the initial cropping process. This thorough preprocessing and segmentation ensured the dataset was well-suited for further analysis and modeling, indirectly addressing issues such as the class imbalance across different animals presented in Figure 2.
To ensure accurate analysis, it is essential to remove the background from cattle face images due to significant noise that can introduce unwanted artifacts into the dataset. In some cases, other animals in the background posed a significant problem during the initial iterations of the experiment. This preprocessing step is critical for isolating the primary facial features and the muzzle, which must be clearly visible for effective identification. Variations in the angles of the face can negatively impact the training process, as some facial features, such as horns, might be obscured. This underscores the need for consistent image alignment. Ensuring that every frame captures the animal looking at the camera at the correct angle with its eyes open is challenging. To address these variables, the structure used for capturing the images, depicted in Figure 1a, was employed to minimize such variations and provide greater uniformity to the dataset construction.
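To make the cleanup step concrete, the sketch below illustrates how residual black areas left by the initial cropping can be removed with a simple connected-component (blob) labeling pass, in the spirit of the method cited above [22]. This is a minimal OpenCV-based illustration; the threshold value, minimum blob area, and function names are assumptions rather than the exact pipeline used to build the dataset.

```python
import cv2
import numpy as np

def remove_residual_black_areas(image_bgr, min_blob_area=500):
    """Keep only sufficiently large non-black blobs of a cropped cattle-face image.

    Illustrative sketch of the blob-detection cleanup described in the text;
    the threshold and area values are assumed, not taken from the original pipeline.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Treat near-black pixels as residue from the initial cropping.
    _, mask = cv2.threshold(gray, 10, 255, cv2.THRESH_BINARY)
    num_labels, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    cleaned = np.zeros_like(mask)
    for label in range(1, num_labels):  # label 0 is the background
        if stats[label, cv2.CC_STAT_AREA] >= min_blob_area:
            cleaned[labels == label] = 255
    # Zero out everything outside the retained blobs (the visible face region).
    return cv2.bitwise_and(image_bgr, image_bgr, mask=cleaned)
```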

2.2. Proposed Approach

Cao et al. [23] proposed a two-phase training procedure for a Siamese neural network, which is the basis for the adaptation used in this work. In the first phase, a contrastive loss is applied to optimize the distance between the representations of two input images, which serves primarily for similarity recognition training, as presented in the original idea of the Siamese network by Koch et al. [24].
In the second phase, a classification head is added and the network is optimized for classification. Within this framework, our proposed approach separates the network into blocks according to their function: (i) the block of networks with shared weights, denoted as the backbone; (ii) the recognition layer, whose output is the Euclidean distance between representations, referred to as the recognition head; and (iii) the classification head. The architecture is graphically presented in Figure 3.
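A schematic PyTorch sketch of this three-block organization is given below, assuming the 224 × 224 inputs and 512-feature embeddings described in Section 2.3. The exact layers of the heads are not specified in the text, so the classification head shown here (a single linear layer over the 47 animals) is an illustrative assumption.

```python
import torch
import torch.nn as nn
from torchvision import models

class SiameseCattleNet(nn.Module):
    """Sketch of the three blocks described in the text: shared backbone,
    recognition head (Euclidean distance), and classification head.
    Layer choices beyond the 512-feature embedding are illustrative assumptions."""

    def __init__(self, num_classes=47, embedding_dim=512):
        super().__init__()
        backbone = models.resnet101(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)  # 512-feature output
        self.backbone = backbone                                   # (i) shared-weight block
        self.classifier = nn.Linear(embedding_dim, num_classes)    # (iii) classification head

    def embed(self, x):
        return self.backbone(x)

    def recognition_head(self, emb_a, emb_b):
        # (ii) Euclidean distance between the two representations.
        return torch.norm(emb_a - emb_b, p=2, dim=1)

    def forward(self, x_a, x_b=None):
        emb_a = self.embed(x_a)
        if x_b is None:
            return self.classifier(emb_a)            # classification phase
        emb_b = self.embed(x_b)
        return self.recognition_head(emb_a, emb_b)   # recognition phase
```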
Since there are two distinct phases within each epoch, each phase has its associated loss function, optimizer, and hyperparameters. For the classification process, the loss can be calculated using a standard loss function, such as cross-entropy [25], comparing the output of the classification head with a one-hot encoded array using the expected label as the index for this encoding. For the recognition training phase, however, a new loss function was created by modifying the standard contrastive loss function.
The standard contrastive loss function can be defined as follows. Let F_l be the feature extractor up to layer l; let x_i ∈ X be an input image and x_j ∈ X an anchor image from the training set X; let d_ij = ‖F_l(x_i) − F_l(x_j)‖ be the Euclidean distance between the feature vectors extracted at layer l for the two images; let y ∈ Y be the label associated with the pair (x_i, x_j), taking the value 1 when the two images are from the same class and 0 when they are from different classes; and let m ∈ ℝ be the margin, that is, the minimum distance that the representation vectors of images from different classes must keep between themselves, with a default value of 1.25, as proposed by Cao et al. [23] in the original implementation. Then, the standard contrastive loss function is defined as:
$$
\mathcal{L}_{\mathrm{contrastive}}(d_{ij}, y) =
\begin{cases}
\frac{1}{2}\, d_{ij}^{2} & \text{if } y = 1 \\
\frac{1}{2}\, \max(0,\, m - d_{ij})^{2} & \text{if } y = 0 \text{ and } d_{ij} < m \\
0 & \text{otherwise}
\end{cases}
\tag{1}
$$
The custom contrastive loss function proposed in this work is an adaptation of Equation (1), defined as:
$$
\mathcal{L}_{\mathrm{proposed}}(d_{ij}, y) =
\begin{cases}
\ln\!\left[\left(1 + \frac{1}{d_{ij}}\right)^{d_{ij}}\right] & \text{if } y = 1 \\
1 - \ln\!\left[\left(1 + \frac{1}{d_{ij}}\right)^{d_{ij}}\right] & \text{if } y = 0 \text{ and } d_{ij} < m \\
0 & \text{otherwise}
\end{cases}
\tag{2}
$$
These modifications were made in order to improve the convergence of the error and to bound the loss values between 0 and 1 while keeping a contrastive behavior: the loss value for images of different classes falls quickly as the distance between representations increases, and the loss for images of the same class increases as the distance between the extracted feature vectors increases. Note that a small value of d_ij indicates high similarity between patches, which is desirable for images of the same class. On the other hand, a large value of d_ij indicates dissimilarity, essential for separating examples from different classes. Therefore, the loss function is designed to minimize the value of d_ij for patches having the same class and maximize d_ij for patches belonging to distinct classes, enabling the model to generalize from a limited number of examples and learn effective discriminative representations. Figure 4 illustrates the behavior of the proposed loss function with a margin of 1.25. For positive labels, the loss approaches 1 as the distance increases, while for negative labels, the loss decreases towards 0 as the distance between different classes becomes larger, with the margin determining the threshold at which the distance between different classes is considered sufficiently large, resulting in a loss of 0.
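As a reference for the reader, the two loss functions can be written compactly in PyTorch as below. This is a sketch based on Equations (1) and (2) as reconstructed above, using d·ln(1 + 1/d) for the bounded term; it is not the authors' released code, and the small epsilon is an assumption added for numerical stability.

```python
import torch

def original_contrastive_loss(d, y, margin=1.25):
    """Equation (1): standard contrastive loss on the pairwise distance d,
    with label y equal to 1 for same-animal pairs and 0 otherwise."""
    same = 0.5 * d.pow(2)
    diff = 0.5 * torch.clamp(margin - d, min=0.0).pow(2)
    return torch.where(y == 1, same, diff).mean()

def proposed_contrastive_loss(d, y, margin=1.25, eps=1e-8):
    """Equation (2), as reconstructed here: the term d * ln(1 + 1/d) is bounded in
    [0, 1], tending to 0 as d -> 0 and to 1 as d -> infinity. A sketch only."""
    bounded = d * torch.log1p(1.0 / (d + eps))   # ln[(1 + 1/d)^d]
    same = bounded                               # same class: grows with distance
    diff = torch.where(d < margin, 1.0 - bounded, torch.zeros_like(d))  # different class
    return torch.where(y == 1, same, diff).mean()
```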

2.3. Experimental Setup

Two experimental procedures were conducted: the first aimed to compare the new loss function with the original one within a Siamese architecture, and the second involved comparing the performance of this loss function with that of the backbone outside the Siamese structure, across scenarios with a full set of images and scenarios with limited samples. The experimental procedure employed a 10-fold cross-validation strategy. The ResNet101 [26] was selected as the base network for the backbone due to its versatility in classification tasks and its capacity to capture residual information through its internal layers. This network has demonstrated exceptional performance in the works of Li et al. [27], Xu et al. [28], and Wang et al. [29], establishing it as a reliable benchmark for comparative analysis in this experiment.
The original image dataset, comprising 2210 samples, was organized in two distinct configurations to assess the few-shot learning capabilities of the framework for the proposed task. Initially, the entire dataset was utilized, referred to as the “full dataset”. Subsequently, a few-shot scenario was created by randomly selecting ten samples per animal, termed the “few-shot dataset”. For both configurations, 10% of the images were designated for testing, while the remaining 90% were allocated for training and validation. The training and validation set was further divided, with 20% of the data used for validation within each fold.
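The following sketch illustrates how the few-shot configuration can be derived from the full dataset, drawing ten images per animal and reserving 10% for testing and 20% of the remainder for validation, as described above. The data layout (a list of image path and label pairs) and the per-class splitting are assumptions for illustration; the actual experiments additionally use 10-fold cross-validation.

```python
import random
from collections import defaultdict

def build_few_shot_split(samples, shots_per_class=10, test_frac=0.10, val_frac=0.20, seed=0):
    """Sketch of the dataset configurations described in the text: draw `shots_per_class`
    images per animal for the few-shot dataset, then hold out test and validation subsets.
    `samples` is an assumed list of (image_path, animal_id) pairs."""
    rng = random.Random(seed)
    by_animal = defaultdict(list)
    for path, animal_id in samples:
        by_animal[animal_id].append(path)

    few_shot, train, val, test = [], [], [], []
    for animal_id, paths in by_animal.items():
        rng.shuffle(paths)
        chosen = paths[:shots_per_class]            # "few-shot dataset": 10 images per animal
        few_shot.extend((p, animal_id) for p in chosen)
        n_test = max(1, int(len(chosen) * test_frac))
        test.extend((p, animal_id) for p in chosen[:n_test])
        rest = chosen[n_test:]
        n_val = max(1, int(len(rest) * val_frac))
        val.extend((p, animal_id) for p in rest[:n_val])
        train.extend((p, animal_id) for p in rest[n_val:])
    return few_shot, train, val, test
```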
In the first experiment, the effect of the new loss function, as defined in Equation (2), on network performance was evaluated using the few-shot dataset. This new loss function was compared to the original loss function specified in Equation (1). At the conclusion of each fold in this experiment, precision, recall, and F1-score metrics were computed to assess performance. Although these metrics are conventionally applied to binary classification problems, we addressed this limitation by employing a Macro-Averaged approach. This approach involves calculating each metric for individual classes separately, using a one-versus-all strategy.
In heavily imbalanced classification problems, such as those illustrated by the complete dataset (Figure 2), accuracy can be deceptive and may not accurately represent the model’s performance on the minority class. Instead, precision, recall, and F1-score are more informative metrics. Precision measures the proportion of true positive predictions among all predicted positives, recall assesses the model’s ability to identify actual positives, and the F1-score combines these metrics to provide a balanced view of performance. These metrics are crucial in imbalanced scenarios, as they offer a clearer picture of how well the model performs on the minority class, which is often the class of primary interest in practical applications.
Equations (3)–(5) detail the calculations for the statistical measurements, where N represents the total number of classes:
$$
\text{Macro Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FP_i}
\tag{3}
$$

$$
\text{Macro Recall} = \frac{1}{N}\sum_{i=1}^{N}\frac{TP_i}{TP_i + FN_i}
\tag{4}
$$

$$
\text{Macro F1-score} = \frac{2 \times \text{Macro Precision} \times \text{Macro Recall}}{\text{Macro Precision} + \text{Macro Recall}}
\tag{5}
$$
In these equations, TP_i denotes the number of true positives, i.e., correctly predicted instances of class i; FP_i represents the false positives, i.e., instances incorrectly predicted as class i; and FN_i signifies the false negatives, i.e., instances of class i that were incorrectly predicted as another class.
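A direct implementation of Equations (3)–(5) is sketched below, computing each class's one-versus-all precision and recall and then combining the macro averages into the F1-score. The array-based interface is assumed for illustration.

```python
import numpy as np

def macro_metrics(y_true, y_pred, num_classes):
    """Equations (3)-(5): per-class one-versus-all precision and recall averaged over
    classes, then F1 as the harmonic mean of the two macro values. A sketch only."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    precisions, recalls = [], []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precisions.append(tp / (tp + fp) if tp + fp > 0 else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn > 0 else 0.0)
    macro_p, macro_r = float(np.mean(precisions)), float(np.mean(recalls))
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r) if macro_p + macro_r > 0 else 0.0
    return macro_p, macro_r, macro_f1
```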
After completing the initial comparison of loss functions, a subsequent experiment was undertaken to contrast the performance metrics derived from a standard ResNet101 model with those from a Siamese architecture. Specifically, we evaluated these metrics using the loss function that exhibited superior performance in the initial experiment, namely our proposed loss function. In this subsequent experiment, both datasets were employed to assess the impact of the Siamese framework on network performance for the facial recognition problem, with the same test measurements computed as in the first experiment.
Following the standard of the backbone used in this work [23], all images were resized to 224 × 224 pixels and the ResNet was configured to output a 512-feature vector, serving as the input for both heads of the architecture. The optimizer for both experiments was set to AdamW with two different learning rates: 0.001 for the classification task and a value 1/20 of this (0.00005) for the recognition task. To evaluate the loss in the classification process of both experiments, the cross-entropy function was chosen. A schematic sketch of this two-phase training procedure is shown below.
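The per-epoch procedure (a recognition pass over image pairs with the proposed loss, followed by a classification pass with cross-entropy), together with the two AdamW optimizers at the learning rates described above, can be sketched as follows. The loader formats, the validation-loss early stopping, and the batch size of 16 with a patience of 10 epochs (detailed in the next paragraph) are assumed implementation details rather than the authors' exact code; proposed_contrastive_loss refers to the sketch given after Equation (2).

```python
import torch
from torch.optim import AdamW
from torch.nn import CrossEntropyLoss

def train(model, pair_loader, cls_loader, val_loader, max_epochs=1000, patience=10):
    """Sketch of the two-phase epoch: recognition phase on pairs, then classification phase.
    Assumes `model` behaves like the SiameseCattleNet sketch above."""
    rec_opt = AdamW(model.parameters(), lr=5e-5)   # recognition phase (1/20 of 0.001)
    cls_opt = AdamW(model.parameters(), lr=1e-3)   # classification phase
    ce = CrossEntropyLoss()
    best_val, epochs_without_improvement = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        # Phase 1: recognition head, proposed contrastive loss on image pairs.
        for x_a, x_b, same_label in pair_loader:
            rec_opt.zero_grad()
            d = model(x_a, x_b)
            loss = proposed_contrastive_loss(d, same_label)
            loss.backward()
            rec_opt.step()
        # Phase 2: classification head, cross-entropy against the animal identity.
        for x, animal_id in cls_loader:
            cls_opt.zero_grad()
            loss = ce(model(x), animal_id)
            loss.backward()
            cls_opt.step()
        # Early stopping monitored on the validation classification loss.
        model.eval()
        with torch.no_grad():
            val_loss = sum(ce(model(x), y).item() for x, y in val_loader)
        if val_loss < best_val:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
```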
The training was performed in batches of 16 images. Neither data normalization nor augmentation techniques were used. A maximum of 1000 epochs was stipulated, with an early stopping patience of 10 epochs, monitoring the loss values on the validation set to reduce overfitting. To analyze the gathered metrics, three statistical methods were selected. The first was the analysis of mean values and standard deviations, organized into two tables: one comparing the two contrastive loss functions and another comparing the plain ResNet with the Siamese adaptation in both dataset scenarios.
The second statistical analysis method was the boxplot diagram. Each experiment was accompanied by its own boxplot diagram, providing a visual depiction of the distribution of values across each fold for every metric. These diagrams offer insights into the uniformity of the data, showcasing the interquartile ranges (IQRs) and the medians, thus enabling a clear understanding of the data’s spread and central tendency. Lastly, the third method involved analysis of variance (ANOVA) complemented by Tukey HSD post hoc tests, both conducted at a significance threshold of 5%.
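The fold-wise metrics can then be compared as sketched below, using a one-way ANOVA followed by Tukey's HSD at the 5% level. The dictionary-of-scores input format and the library choices (SciPy and statsmodels) are assumptions made for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def compare_configurations(per_fold_scores):
    """Sketch of the third analysis: one-way ANOVA followed by a Tukey HSD post hoc test.
    `per_fold_scores` maps a configuration name to its 10 fold-wise values of one metric."""
    groups = list(per_fold_scores.values())
    _, p_anova = f_oneway(*groups)                      # overall difference between groups
    df = pd.DataFrame(
        [(name, score) for name, scores in per_fold_scores.items() for score in scores],
        columns=["config", "score"],
    )
    tukey = pairwise_tukeyhsd(endog=df["score"], groups=df["config"], alpha=0.05)
    return p_anova, tukey.summary()                     # pairwise comparisons at 5%
```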

3. Results

During the analysis of mean values and standard deviations for each metric in the first experiment, as demonstrated in Table 1, a notable trend emerges. Comparing the default contrastive loss with the proposed alternative presented in this paper, it becomes evident that the latter consistently yields higher results across all metrics with a numerical increase of 0.096, 0.075, and 0.091 for precision, recall, and F-score, respectively, accompanied by lower deviations. This pattern suggests a promising avenue for enhancing performance through the adoption of the new loss function.
Building upon the trends delineated by the table of mean values, Figure 5 illustrates the performance enhancement visually. The boxplots for the new loss function not only exhibit higher median values, but also demonstrate narrower interquartile ranges, corroborating the findings of reduced deviations observed earlier. Additionally, the application of an ANOVA test for the different loss functions revealed statistically significant disparities, with values of p = 4.24 × 10⁻³, p = 6.48 × 10⁻³, and p = 3.68 × 10⁻³ for precision, recall, and F-score, respectively.
Based on the outcomes of the second experiment, Table 2 delineates the average results across the 10 executed folds alongside their corresponding standard deviations for both datasets. Notably, the Siamese execution showcases average values surpassing those of the original ResNet across all three metrics in both scenarios, with margins of 0.006 for precision and 0.010 for both recall and F-score in the full-dataset scenario. Furthermore, the Siamese architecture exhibits smaller standard deviations, indicating a more consistent distribution of results around the mean value. The same table shows similar results for the few-shot dataset, which simulates a 10-shot learning context, with margins of approximately 0.11 for precision, 0.10 for F-score, and 0.08 for recall. One can see that the Siamese architecture consistently showed an increase in average values and a reduction in standard deviation when compared to the plain ResNet101 network, similar to the execution with the complete dataset.
In Figure 6, the boxplots compare the performance of Siamese ResNet 101 and standard ResNet 101 across all four executions, using datasets labeled “10pcls” to indicate the 10-shot version employed for the few-shot evaluation. Consistent with the trends delineated in Table 2, it becomes apparent that the adoption of a Siamese architecture confers enhancements, particularly evident when operating with smaller datasets. Whereas the baseline network experiences pronounced susceptibility to the limited training data, the Siamese methodology preserves elevated performance levels with minimal variance, notwithstanding a reduction in absolute metrics, as evidenced by the trends in median values and interval magnitudes.
The ANOVA and Tukey HSD results were similar for all three metrics. There was no evidence of a statistically significant difference between the ResNet101 trained with the full dataset and the Siamese networks, both with the few-shot dataset (p = 0.096, p = 0.963, and p = 0.650 for precision, recall, and F-score, respectively) and with the full dataset (p = 0.994, p = 0.966, and p = 0.971, respectively). Also, there is no evidence that the Siamese networks differed when the dataset was downscaled (p = 0.056, p = 0.780, and p = 0.385, respectively). On the other hand, the dataset downscaling did seem to make a difference for the plain ResNet101 (p = 1 × 10⁻⁷, p = 3.19 × 10⁻⁴, and p = 9.7 × 10⁻⁶, respectively). Finally, the ANOVA and Tukey HSD results were significant for the ResNet101 trained with the few-shot dataset, both regarding the Siamese network trained with the same few-shot dataset (p = 0.0002, p = 0.001, and p = 0.0003, respectively) and regarding the one trained with the full dataset (p = 1 × 10⁻⁷, p = 7.84 × 10⁻⁵, and p = 2.5 × 10⁻⁶, respectively).

4. Discussion

The ANOVA and Tukey HSD results, as well as the boxplots, suggest that the Siamese framework did, in fact, improve the capacity of the neural network in the proposed few-shot learning scenario. An interesting point in this test is that the execution with the full dataset and the one with the reduced dataset using the Siamese approach did not show any statistical difference. The same occurs when comparing the base backbone trained with the complete dataset against the two Siamese executions; in all these cases, the p-values were above the significance level of 0.05. From these data, it can be inferred that the variation in the amount of training data significantly affects the backbone, while the Siamese model does not show the same variation. Furthermore, given the ANOVA results presented, the Siamese network trained in a 10-shot scenario can be considered equal in performance to the plain ResNet101 trained with the full dataset, which indicates the feasibility of using it in scenarios where data are scarce.
The results we achieved in the task of identifying individual cattle align with those found in the literature. While this paper does not focus solely on muzzle recognition like the works of Li et al. [11] and Shojaeipour et al. [12], which achieved accuracy results of around 99%, it follows the line of study presented by Xu et al. [21], which is based on whole-face images and reports an accuracy of 91.3%; our whole-face approach achieves results of a similar magnitude.
The presence of the recognition branch in the process is of great importance, as it helps reduce confusion between classes by altering the distances between the feature vectors extracted in the embedding space. This positive point has been found to be highly relevant for problems with few samples in the works of Chen et al. [30] and Chen et al. [31], where the idea of separation is reinforced as a differentiating factor when analyzing samples not seen during model training and validation. This allows the formation of pseudo-regions in this space, speeding up the classification process based on the distance between the known data and the new data inserted into the model.
In the same context, Wang et al. [32] demonstrate in their work how the definition of themes through this contrastive separation aids in the performance of few-shot models for the classification scenario, both with few samples and in the presence of many in various datasets, portraying a scenario similar to the one presented in Section 3, with the neural network adapted for few-shot learning maintaining a similar behavior in the presence of datasets with varying numbers of samples.
An important point that favors learning from a few samples using this twin-network approach is the distribution of data during training, as discussed by Zhang et al. [33]. Even in the presence of little data, it is still possible to obtain a robust training set through the formation of positive and negative pairs, which expands the number of samples from which the network can learn similarity and distance relations between attribute vectors.
Still on the topic of improving learning with reduced datasets, a significant facilitator present in the architecture is the presence of easily modifiable backbones, which allows for the use of transfer learning to initialize their learned weights, reducing the time needed for the network to acquire minimum knowledge and start understanding the most basic parts of the objects of interest. Similarly, Yi et al. [34] demonstrated this capability in their work by using Faster R-CNN [35] to solve their surface roughness detection problem.
Finally, it can be observed that the dual learning structure proposed initially by Cao et al. [23] and adapted for the bovine problem in this article is of great importance for the final result. In their paper, Yu et al. [36] adopted a similar methodology in the Augmented Feature Adaptation layer of their model, learning to distance the classes from each other according to the features extracted by their Augmented Feature Generation layer, using similar ways of measuring the loss.
Although the results obtained for the three metrics indicate promising future possibilities for using the loss function, especially in few-shot scenarios, its advantages over other loss functions remain debatable. Analyzing the works of Ni et al. [37] and Wang et al. [38], which employ the triplet logic introduced by Schroff et al. [39], it becomes evident that one common way to enhance the performance of the original contrastive loss function is to increase the number of samples processed simultaneously. This approach, which also relies on a minimum separation margin similar to the one used in the function presented in this paper, can improve the statistical metrics. However, it significantly raises the operational cost in terms of memory, as the network needs to perform many more comparisons in each iteration to determine and recalculate inter-class and intra-class distances.
This increase in the number of samples used per iteration is directly proportional to the heightened difficulty of loss convergence, as indicated by Wu et al. [40]. With this in mind, the proposed loss function aims to achieve the improvements sought by these loss functions without increasing the sample size. This approach maintains the original complexity while promoting the convergence of loss values, as illustrated in Figure 4 in Section 2. In summary, this study introduces new possibilities for enhancing the contrastive loss function without directly increasing memory usage during training, while preserving the overall architecture structure with image pair analyses.
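For contrast with the pairwise formulation kept in this work, the snippet below shows the standard triplet formulation discussed above [16,39], which consumes three embeddings per comparison instead of two; the batch size, margin value, and random embeddings are illustrative only.

```python
import torch
import torch.nn as nn

# Pairwise setting (this paper): each training example is one (anchor, other) pair,
# i.e., two forward passes. Triplet-based losses [39] need (anchor, positive, negative),
# i.e., three forward passes per example, and mining strategies typically compare
# many more candidates per iteration, raising the memory cost.
triplet_loss = nn.TripletMarginLoss(margin=1.25, p=2)  # margin reused here for illustration

anchor = torch.randn(16, 512)    # batch of 16 embeddings, matching the experimental setup
positive = torch.randn(16, 512)  # same-class embeddings
negative = torch.randn(16, 512)  # different-class embeddings
loss = triplet_loss(anchor, positive, negative)
```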

5. Conclusions

The results obtained in this study allow us to argue that Siamese networks can become a viable option for improving the performance and increasing the robustness of traditional classification models like ResNet101 in the face of data scarcity, since their performance in 10-shot learning was observed to be on par with that of a ResNet101 trained on the entire dataset. On the other hand, the Siamese approach presents higher computational requirements in terms of space and time. Therefore, future research can focus on optimization, with the aim of reducing these requirements while keeping or improving the achieved performance.
As future work, we raise the hypothesis of incorporating age as a variable in the recognition process. Over time, the facial structure of each animal becomes increasingly distinguishable, with its snout exhibiting morphological features similar to a human fingerprint. Therefore, it can be inferred that future studies utilizing a dataset with age variability among the individuals may yield positive results in the learned weights during training.
Regarding the applicability of the architecture in real-world scenarios, it has proven effective in performing the expected function, thereby reducing the need for manual classification at the experimental sites. Based on the described results, it is plausible to consider developing a product for large-scale use, not only for cattle, but also for other areas of animal husbandry where individuals have distinctive facial features that allow for differentiation using this method. This is an important task due to the large number of animals on each farm and the need for precise, non-intrusive tracking. Such a system would enhance the quality of the classification process and provide valuable information for management.

Author Contributions

Conceptualization, J.P., N.L. and H.P.; methodology, J.P. and K.P.; software, J.P. and G.H.; validation, V.W. and F.W.; formal analysis, N.L. and H.P.; investigation, J.P. and F.W.; resources, H.P. and V.W.; data curation, F.W. and P.C.; writing—original draft preparation, J.P., G.H. and K.P.; writing—review and editing, J.P., G.H. and N.L.; visualization, L.d.A.; supervision, H.P. and V.W.; project administration, H.P.; funding acquisition, H.P., F.W. and V.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work has received financial support from the Dom Bosco Catholic University, the Foundation for the Support and Development of Education, Science and Technology from the State of Mato Grosso do Sul, FUNDECT, and Federal University of Pampa, UNIPAMPA. Some of the authors have been awarded with scholarships from the Brazilian National Council of Technological and Scientific Development, CNPq, and the Coordination for the Improvement of Higher Education Personnel, CAPES.

Data Availability Statement

The data used in this paper will be made available on request due to the agreement made between the authors and Kerow group.

Acknowledgments

We would like to thank NVIDIA for providing the Titan X GPUs used in the experiments and Kerow Soluções de Precisão for providing the datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Associação Brasileira das Indústrias Exportadoras de Carne (ABIEC). Perfil da Pecuária no Brasil. 2023. Available online: https://www.abiec.com.br/ (accessed on 21 January 2024).
  2. Milanez, A.Y.; Mancuso, R.V.; Maia, G.B.d.S.; Guimarães, D.D.; Alves, C.E.A.; Madeira, R.F. Conectividade Rural: Situação Atual e Alternativas para Superação da Principal Barreira à Agricultura 4.0 No Brasil. 2020. Available online: https://web.bndes.gov.br/bib/jspui/handle/1408/20180 (accessed on 21 January 2024).
  3. Neto, A.; Nicola, S.; Moreira, J.; Fonte, B. Livestock Application: Naïve Bayes for Diseases Forecast in a Bovine Production Application: Use of Low Code. In Proceedings of the International Conference on Innovations in Bio-Inspired Computing and Applications, Cham, Switzerland, 16–18 December 2021; Springer: Cham, Switzerland, 2021; pp. 183–192. [Google Scholar]
  4. Cihan, P.; Saygili, A.; Ozmen, N.E.; Akyuzlu, M. Identification and Recognition of Animals from Biometric Markers Using Computer Vision Approaches: A Review. Kafkas Univ. Vet. Fak. Derg. 2023, 29, 581. [Google Scholar] [CrossRef]
  5. De Andrade Porto, J.V.; Dorsa, A.C.; de Moraes Weber, V.A.; de Andrade Porto, K.R.; Pistori, H. Usage of few-shot learning and meta-learning in agriculture: A literature review. Smart Agric. Technol. 2023, 5, 100307. [Google Scholar] [CrossRef]
  6. Yang, J.; Yang, Y.; Li, Y.; Xiao, S.; Ercisli, S. Image information contribution evaluation for plant diseases classification via inter-class similarity. Sustainability 2022, 14, 10938. [Google Scholar] [CrossRef]
  7. Qiao, Y.; Xue, T.; Kong, H.; Clark, C.; Lomax, S.; Rafique, K.; Sukkarieh, S. One-Shot Learning with Pseudo-Labeling for Cattle Video Segmentation in Smart Livestock Farming. Animals 2022, 12, 558. [Google Scholar] [CrossRef] [PubMed]
  8. Hossain, M.E.; Kabir, M.A.; Zheng, L.; Swain, D.L.; McGrath, S.; Medway, J. A systematic review of machine learning techniques for cattle identification: Datasets, methods and future directions. Artif. Intell. Agric. 2022, 6, 138–155. [Google Scholar] [CrossRef]
  9. Ahmad, M.; Abbas, S.; Fatima, A.; Ghazal, T.M.; Alharbi, M.; Khan, M.A.; Elmitwally, N.S. AI-Driven livestock identification and insurance management system. Egypt. Inform. J. 2023, 24, 100390. [Google Scholar] [CrossRef]
  10. Awad, A.I. From classical methods to animal biometrics: A review on cattle identification and tracking. Comput. Electron. Agric. 2016, 123, 423–435. [Google Scholar] [CrossRef]
  11. Li, G.; Erickson, G.E.; Xiong, Y. Individual Beef Cattle Identification Using Muzzle Images and Deep Learning Techniques. Animals 2022, 12, 1453. [Google Scholar] [CrossRef]
  12. Shojaeipour, A.; Falzon, G.; Kwan, P.; Hadavi, N.; Cowley, F.C.; Paul, D. Automated muzzle detection and biometric identification via few-shot deep transfer learning of mixed breed cattle. Agronomy 2021, 11, 2365. [Google Scholar] [CrossRef]
  13. Uladzislau, S.; Feng, X. Modified omni-scale net architecture for cattle identification on their muzzle point image pattern characteristics. In Proceedings of the International Conference on Computer, Artificial Intelligence, and Control Engineering (CAICE 2023), Hangzhou, China, 17–19 February 2023; SPIE: Bellingham, WA, USA, 2023; Volume 12645, pp. 489–494. [Google Scholar]
  14. Xu, C.; Zhao, Y.; Lin, J.; Pang, Y.; Liu, Z.; Shen, J.; Liao, M.; Li, P.; Qin, Y. Bifurcation investigation and control scheme of fractional neural networks owning multiple delays. Comput. Appl. Math. 2024, 43, 1–33. [Google Scholar] [CrossRef]
  15. Kaya, M.; Bilge, H.S. Deep Metric Learning: A Survey. Symmetry 2019, 11, 1066. [Google Scholar] [CrossRef]
  16. Terven, J.; Cordova-Esparza, D.M.; Ramirez-Pedraza, A.; Chavez-Urbiola, E.A. Loss Functions and Metrics in Deep Learning. arXiv 2023, arXiv:cs.LG/2307.02694. [Google Scholar]
  17. Xu, C.; Lin, J.; Zhao, Y.; Cui, Q.; Ou, W.; Pang, Y.; Liu, Z.; Liao, M.; Li, P. New results on bifurcation for fractional-order octonion-valued neural networks involving delays. Netw. Comput. Neural Syst. 2024, 1–53. [Google Scholar] [CrossRef]
  18. Lu, T.Y.; Wu, M.E.; Chen, E.H.; Ueng, Y.L. Reference Selection for Offline Hybrid Siamese Signature Verification Systems. Comput. Mater. Contin. 2022, 73, 935–952. [Google Scholar] [CrossRef]
  19. Liu, W.; Wen, Y.; Yu, Z.; Li, M.; Raj, B.; Song, L. SphereFace: Deep Hypersphere Embedding for Face Recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Los Alamitos, CA, USA, 21–26 July 2017; pp. 6738–6746. [Google Scholar] [CrossRef]
  20. Wang, H.; Wang, Y.; Zhou, Z.; Ji, X.; Gong, D.; Zhou, J.; Li, Z.; Liu, W. CosFace: Large Margin Cosine Loss for Deep Face Recognition. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 5265–5274. [Google Scholar] [CrossRef]
  21. Xu, B.; Wang, W.; Guo, L.; Chen, G.; Li, Y.; Cao, Z.; Wu, S. CattleFaceNet: A cattle face identification approach based on RetinaFace and ArcFace loss. Comput. Electron. Agric. 2022, 193, 106675. [Google Scholar] [CrossRef]
  22. Šťava, O.; Beneš, B. Connected component labeling in CUDA. In GPU Computing Gems Emerald Edition; Elsevier: Amsterdam, The Netherlands, 2011; pp. 569–581. [Google Scholar]
  23. Cao, Z.; Li, X.; Jianfeng, J.; Zhao, L. 3D convolutional siamese network for few-shot hyperspectral classification. J. Appl. Remote Sens. 2020, 14, 048504. [Google Scholar] [CrossRef]
  24. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the ICML Deep Learning Workshop, Lille, France, 7–9 July 2015; Volume 2. [Google Scholar]
  25. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. Adv. Neural Inf. Process. Syst. 2018, 31. [Google Scholar] [CrossRef]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  27. Li, Z.; Zhang, H.; Chen, Y.; Wang, Y.; Zhang, J.; Hu, L.; Shu, L.; Yang, L. Dairy Cow Individual Identification System Based on Deep Learning. In Proceedings of the International Conference on Cognitive Systems and Signal Processing, Fuzhou, China, 17–18 December 2022; Springer: Singapore, 2022; pp. 209–221. [Google Scholar]
  28. Xu, B.; Wang, W.; Guo, L.; Chen, G.; Wang, Y.; Zhang, W.; Li, Y. Evaluation of deep learning for automatic multi-view face detection in cattle. Agriculture 2021, 11, 1062. [Google Scholar] [CrossRef]
  29. Wang, J.; Zhang, X.; Gao, G.; Lv, Y.; Li, Q.; Li, Z.; Wang, C.; Chen, G. Open Pose Mask R-CNN Network for Individual Cattle Recognition. IEEE Access 2023, 1349. [Google Scholar] [CrossRef]
  30. Chen, L.; Wu, J.; Xie, Y.; Chen, E.; Zhang, X. Discriminative feature constraints via supervised contrastive learning for few-shot forest tree species classification using airborne hyperspectral images. Remote Sens. Environ. 2023, 295, 113710. [Google Scholar] [CrossRef]
  31. Chen, P.; Wang, J.; Lin, H.; Zhao, D.; Yang, Z. Few-shot biomedical named entity recognition via knowledge-guided instance generation and prompt contrastive learning. Bioinformatics 2023, 39, btad496. [Google Scholar] [CrossRef] [PubMed]
  32. Wang, J.; Lu, W.; Wang, Y.; Shi, K.; Jiang, X.; Zhao, H. TEG: Image theme recognition using text-embedding-guided few-shot adaptation. J. Electron. Imaging 2024, 33, 013028. [Google Scholar] [CrossRef]
  33. Zhang, T.; Chen, J.; Liu, S.; Liu, Z. Domain discrepancy-guided contrastive feature learning for few-shot industrial fault diagnosis under variable working conditions. IEEE Trans. Ind. Inform. 2023, 19, 10277–10287. [Google Scholar] [CrossRef]
  34. Yi, H.; Lv, X.; Shu, A.; Wang, H.; Shi, K. Few-shot detection of surface roughness of workpieces processed by different machining techniques. Meas. Sci. Technol. 2024, 35, 045016. [Google Scholar] [CrossRef]
  35. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 39, 1137–1149. [Google Scholar] [CrossRef]
  36. Yu, X.; Gu, X.; Sun, J. Contrasting augmented features for domain adaptation with limited target domain data. Pattern Recognit. 2024, 148, 110145. [Google Scholar] [CrossRef]
  37. Ni, J.; Liu, J.; Zhang, C.; Ye, D.; Ma, Z. Fine-grained patient similarity measuring using deep metric learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; pp. 1189–1198. [Google Scholar]
  38. Wang, X.; Han, X.; Huang, W.; Dong, D.; Scott, M.R. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5022–5030. [Google Scholar]
  39. Schroff, F.; Kalenichenko, D.; Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 815–823. [Google Scholar]
  40. Wu, C.Y.; Manmatha, R.; Smola, A.J.; Krahenbuhl, P. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2840–2848. [Google Scholar]
Figure 1. Sample images used in the methodology: Image (a) shows how each animal is captured by the camera in the experimental area, while image (b) displays the result of the preprocessing method, which removes the background and highlights only the significant features of the animal’s face.
Figure 2. Bar graph showing the distribution of images across each class in the complete dataset used for the experiment, highlighting the inherent imbalance in the original dataset.
Figure 3. General structure of the adapted network, using ResNet101 as the Siamese backbone, with shared weights and two heads for both recognition and classification tasks.
Figure 4. Graph comparing both of the contrastive loss functions with a margin of 1.25, with the full line representing the proposed loss and the segmented line being the original loss.
Figure 5. Boxplot diagrams comparing original and proposed loss functions. The Y-axis represents the metric values, while the X-axis denotes the architecture.
Figure 6. Boxplot diagrams comparing the performance of Siamese ResNet 101 and standard ResNet 101 across all four executions. The Y-axis represents the metric values, while the X-axis denotes the architecture.
Table 1. Mean values and standard deviations are provided for each metric, comparing the default contrastive loss against our proposed contrastive loss method employing the Siamese ResNet101 architecture on the few-shot dataset. Bold formatting highlights the highest values for each metric.

Contrastive Loss | Precision     | Recall        | F-Score
original         | 0.774 ± 0.077 | 0.830 ± 0.066 | 0.791 ± 0.074
ours             | 0.870 ± 0.052 | 0.905 ± 0.040 | 0.882 ± 0.044
Table 2. Average results and standard deviation for each metric, comparing the plain ResNet101 with the Siamese ResNet101 in both dataset scenarios.

Metric    | Architecture      | Full Dataset  | Few-Shot Dataset
Precision | resnet101         | 0.926 ± 0.042 | 0.759 ± 0.072
Precision | siamese_resnet101 | 0.932 ± 0.035 | 0.870 ± 0.052
Recall    | resnet101         | 0.914 ± 0.048 | 0.823 ± 0.057
Recall    | siamese_resnet101 | 0.924 ± 0.028 | 0.905 ± 0.040
F-score   | resnet101         | 0.909 ± 0.052 | 0.779 ± 0.067
F-score   | siamese_resnet101 | 0.919 ± 0.034 | 0.882 ± 0.044
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

