Improving Early Fault Detection in Machine Learning Systems Using Data Diversity-Driven Metamorphic Relation Prioritization

Srinivasan, Madhusudan; Kanewala, Upulee

doi:10.3390/electronics13173380

Open AccessArticle

Improving Early Fault Detection in Machine Learning Systems Using Data Diversity-Driven Metamorphic Relation Prioritization

by

Madhusudan Srinivasan

^1,*

and

Upulee Kanewala

²

¹

Computer Science Department, East Carolina University, Greenville, NC 27858, USA

²

School of Computing, University of North Florida, Jacksonville, FL 32224, USA

^*

Author to whom correspondence should be addressed.

Electronics 2024, 13(17), 3380; https://doi.org/10.3390/electronics13173380

Submission received: 4 August 2024 / Revised: 20 August 2024 / Accepted: 23 August 2024 / Published: 26 August 2024

(This article belongs to the Section Artificial Intelligence)

Download

Browse Figures

Versions Notes

Abstract

Metamorphic testing is a valuable approach to verifying machine learning programs where traditional oracles are unavailable or difficult to apply. This paper proposes a technique to prioritize metamorphic relations (MRs) in metamorphic testing for machine learning and deep learning systems, aiming to enhance early fault detection. We introduce five metrics based on diversity in source and follow-up test cases to prioritize MRs. The effectiveness of our proposed prioritization methods is evaluated on three machine learning and one deep learning algorithm implementation. We compare our approach against random-based, fault-based, and neuron activation coverage-based MR ordering. The results show that our data diversity-based prioritization performs comparably to fault-based prioritization, reducing fault detection time by up to 62% compared to random MR execution. Our proposed metrics outperformed neuron activation coverage-based prioritization, providing 5–550% higher fault detection effectiveness. Overall, our approach to prioritizing metamorphic relations leads to increased fault detection effectiveness and reduced average fault detection time. This improvement in efficiency can result in significant time and cost savings when applying metamorphic testing to machine learning and deep learning systems.

Keywords:

metamorphic testing; metamorphic relations; test prioritization; metamorphic relation prioritization; testing and validation; software testing

1. Introduction

Machine learning (ML) is increasingly being deployed in large-scale software and safety-critical systems due to recent advancements in deep learning and reinforcement learning. The software applications powered by ML are applied and utilized in our daily lives: from finance and energy to health and transportation. AI-related accidents are already making headlines, from inaccurate facial recognition systems causing false arrests to unexpected racial and gender discrimination by machine learning software [1,2]. Also, a recent incident with an Uber autonomous car resulted in the death of a pedestrian [3]. Thus, ensuring the reliability and conducting rigorous testing of machine learning systems is very vital to prevent any future disasters. One of the critical components of software testing is the test oracle. The mechanism for determining the correctness of the test case output is known as the test oracle [4]. A program can be considered a non-testable program if there does not exist a test oracle or it is too difficult to determine whether the test output is correct [5]. Hence, those programs suffer from a problem known as the test oracle problem. ML-based systems exhibit non-deterministic behavior and reason in a probabilistic manner. Also, the output is learned and predicted by a machine learning model as results do not have an expected output to compare against the actual output. As a result, the correctness of the output produced by machine learning-based applications cannot be easily determined and suffers from the test oracle problem [6].

Metamorphic testing (MT) is a property-based software testing technique that is used to alleviate the test oracle problem. A central component of MT is a set of metamorphic relations (MRs), which are necessary properties of the target function or algorithm in relation to multiple inputs and their expected outputs [7,8]. An example of MT is to test the implementation of the function

sin (x)

by asserting the validity of the mathematical relationship

sin (x) = sin (π - x)

instead of manually verifying the output for an arbitrary floating point input x. This process involves arbitrarily altering x and checking if the outputs of x and its variant

(π - x)

are consistent. If the outputs are inconsistent, it indicates the presence of a fault in the implementation of

sin (x)

.

In recent years, especially, MT has attracted a lot of attention and detected many real-life faults in various domains. For example, Murphy et al. [9] identified six metamorphic relations, and real bugs were identified in three machine learning tools. In a related project, Xie et al. [10,11] used metamorphic testing for the detection of faults in K-nearest neighbors and Naive Bayes classifiers. The results revealed that both algorithms violated some of the metamorphic relations. In addition, some real faults were detected in the open-source machine learning tool Weka [12]. Zhang et al. [13] applied MT to test DNN-based autonomous driving systems. MT was able to find inconsistent behaviors under different road scenes for real-world autonomous driving models.

Previous work by Srinivasan et al. [14] has shown that several MRs detect the same type of fault and differ only in fault detection effectiveness. Also, each MR in a machine learning or a deep learning application can have multiple source and follow-up test cases; as a result, the execution time of the MRs increases drastically. The source and follow-up test cases in an MR represented as a training dataset could take several days to train and build a machine learning or deep learning model for a large training dataset. In addition, some metamorphic relations are likely to reveal faults or errors in the system based on the specific characteristics of the system and the data. Prioritizing relations that are more likely to uncover faults can help increase the overall effectiveness of the testing process. Hence, the prioritization of MRs is important for conducting effective metamorphic tests on these programs.

Techniques proposed by Huang et al. [15], which focus on test adequacy criteria for selecting MRs, do not consider the unique challenges of ML systems, such as their dependence on diverse data inputs. Similarly, while Cao et al. [16] propose dissimilarity metrics for test cases, their work does not extend to the prioritization of MRs, specifically for ML and deep learning. Approaches such as DeepGini [17] and MetaTrimmer [18] prioritize test cases rather than MRs. Nakajima et al.’s [19] method incorporates dataset diversity but fails to address the diversity between source and follow-up test cases for MR prioritization. These gaps indicate the need for a data-driven MR prioritization approach, specifically tailored to improve fault detection in ML programs, which our work seeks to address.

A previous work by Srinivasan et al. [20] introduced approaches to prioritize MRs based on code coverage and fault-based approach. The fault-based prioritization approach involves generating a large number of mutants for the application under test and executing the mutants with test inputs to generate mutant kill/alive information for each MR. Then, it prioritize the MRs based on the mutant kill/alive information. This fault-based prioritization process can take up to several weeks and months to complete, and as a result, it becomes costly to prioritize MRs for ML applications. The code coverage approach uses statement or branch coverage information to prioritize MRs. Furthermore, the code coverage-based approach does not work effectively for machine learning-based applications, since the data used for training determine the program logic. The machine learning program is usually just a sequence of library function invocations, and as a result, 100% code coverage can be easily achieved when at least one test input is processed [21]. To this end, in this work, we propose metrics for MR prioritization for machine learning programs. The MR prioritization method proposed in this work qualitatively identifies the diversity between the source and follow-up test cases in an MR.

We make the following contributions in this work:

Propose a novel data diversity-based automated method, where we propose five metrics to prioritize metamorphic relations based on diversity between the source and follow-up test cases in an MR.
Evaluate the effectiveness of our proposed MR prioritization methods on two machine learning and one deep learning algorithm implementations. Also, we perform a comparison against three baseline approaches. (1) Random baseline: represents the current practice of executing the MRs in a random order, (2) fault-based ordering: represents the MR ordering based on the fault detection effectiveness of the MRs as proposed by Srinivasan et al. [20], and (3) neuron activation coverage-based ordering: represents the ordering of the MRs based on the number of unique neuron activations for all test inputs in a deep neural network. The neuron is activated if its output is greater than a threshold value (e.g., 0).
Our results indicate that MR prioritization reduces the number of MRs needed to execute on SUT and reduces the time taken to detect a fault by up to 163% when compared to the random ordering of MRs and up to 14% when compared to neuron coverage-based ordering.

2. Background

2.1. Metamorphic Testing

Metamorphic testing is a software testing technique that uses the idea of metamorphic relations to detect faults in software systems. The key concept behind metamorphic testing is to use a set of input–output relationships to generate new test cases that should produce outputs that satisfy certain properties. The metamorphic testing technique uses metamorphic relations to generate test cases and to verify the correctness of the software. A metamorphic relation is a mathematical or logical relationship that must hold between the inputs and outputs of a software system. If the output does not match the relation, it suggests a possible fault in the program.

Metamorphic testing works by applying a series of transformations to the inputs of a software system and then verifying that the outputs of the system satisfy the metamorphic relations. In metamorphic testing, a source test case (ST) is the initial test case that is used to generate follow-up test cases (FT). The source test case is a test case that represents an input to the software program being tested. The follow-up test cases are generated by applying a metamorphic transformation to the source test case.

For example, consider a program that calculates the sum of integers in a list. The program takes as input a list of integers and returns the sum of all integers in the list. Let L be a list of integers: L = [a1, a2, …, an]. The program takes L as input and returns the sum of the integers in L. We can define a source test case for this program by specifying a particular input list L0 and the expected output sum(L0). We can also define a metamorphic relation for this program that relates the input list L to a new transformed list L’, where the sum of the integers in L and L’ is the same. This relation states that if we reverse the order of the integers in the list L, we should obtain a new list L’ with the same sum as L. We can use this relation to generate follow-up test cases for the program. We can apply the metamorphic relation by reversing the order of the integers in the list to obtain a new list L’. We then expect the sum of L and L’ to be the same. The violation of the metamorphic relation indicates the presence of a fault in the program.

In this work, given an ML or DL system M and a dataset D, let

O_{s} = M (D)

denote the output of the system. Assume that a transformation T is applied to D and generates

D^{T}

. Let

O_{f} = M (D^{T})

denote the new output of the system. We refer to the original dataset D and the result

O_{s}

as the source test case and the source output, respectively. Similarly, the transformed

D^{T}

and the result

O_{f}

are referred to as the follow-up test case and the follow-up output, respectively. We consider the entire dataset D and

D^{T}

as a test case, and it is used in the ML or DL model training.

2.2. Machine Learning

Machine learning is a subfield of artificial intelligence that involves developing algorithms and models that can learn patterns and make predictions based on test data. In machine learning, the available data are divided into two separate sets: a training dataset and a test dataset. The training dataset is used to train the machine learning model. It typically consists of a large portion of the available data and is used to teach the model the patterns and relationships in the data. The test dataset is used to evaluate the performance of the trained model on unseen data.

There are three main types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves training a model on labeled data, where the input data are accompanied by a corresponding output or target variable. Unsupervised learning involves training a model on unlabeled data, where the goal is to find patterns or structure in the data without a specific target variable. In our work, we applied our proposed metrics to supervised classifiers.

In contrast to traditional machine learning (ML) models like kNN, linear regression, and Naive Bayes, testing deep learning (DL) models, including Convolutional Neural Networks (CNNs), requires additional considerations due to the two-stage nature of the development pipeline. These stages are (a) training the CNN model with a training dataset on a DL platform, and (b) performing predictions on new input data using the trained CNN model to produce outputs.

Stage (a): The training process involves adjusting the weights and biases of the network based on the training dataset. In this stage, the relationship between the metamorphic relations (MRs) and the source-level mutation operators is key. The MRs in Appendix A.3, which affect the training dataset (MRs 1, 2, 7, 8), are applied at this stage. Here, the source-level mutation operators listed in Table 1 alter the training dataset in various ways to introduce potential faults. These modifications allow us to test how resilient the CNN model is to variations in the training data and ensure that the learned model is robust.
Stage (b): After the training phase, the trained CNN model is deployed to make predictions on new input data. This phase can be considered the “execution” of the model. At this stage, some MRs in Appendix A.3 (MRs 3, 4, 5, 6, 9, 10) apply transformations to the input data which are related to the prediction behavior of the model. Furthermore, the model-level mutation operators in Table 1, such as Layer Addition, Layer Removal, and Weight Shuffling, are applied directly to the structure of the trained CNN model. These mutations introduce synthetic faults at the model level, allowing us to test the robustness of the CNN’s architecture during the prediction phase.
In deep learning networks, neurons are fundamental building blocks that mimic the behavior of biological neurons in the human brain. A neuron in a neural network receives input from other neurons or from the input data. Each input is multiplied by the corresponding weight, which represents the strength of the connection. The neuron then calculates the weighted sum of the inputs with an additional bias term to determine the output of the neuron. An activation function is applied to the weighted sum. The activation function takes the weighted sum as input and produces the activation or output of the neuron. The output is between 0 and 1, which indicates the probability of the neuron being activated. In our work, we compare our proposed data diversity-based metrics with the neuron activation coverage-based approach. The neuron activation coverage indicates the number of neurons activated in the network in proportion to the total neuron in the network. By measuring neuron activation-based coverage, testers can assess whether the test inputs are sufficiently exploring the network’s various states and validating its behavior. It helps identify areas of the neural network that may not be involved during the testing and, thus, helps to focus on areas of the network that require further testing or improvement.

2.3. Mutation Testing

Mutation testing is a software testing technique that is used to evaluate the quality of test cases. It involves creating small changes or mutants in the code under test, and then running the test cases against these mutated versions of the code. Mutants are considered synthetic faults introduced into the code to test the quality of the test suite. The goal of mutation testing is to identify weaknesses in the test suite by measuring its ability to detect mutants.

Generally, combining mutation testing and metamorphic testing can help identify the fault detection capability of individual metamorphic relations. In our work, we generated mutants or synthetic faults in the training dataset and the deep learning model. Then, we used the mutants to evaluate the fault detection effectiveness of the prioritized MR ordering generated using our proposed approach.

2.4. Dataset Diversity

In this section, we discuss our approach to dataset diversity in the context of testing machine learning systems. We provide a distinction between “data diversity” and “dataset diversity”, which are key to our method but are not synonymous.

Data diversity, as originally introduced by Ammann and Knight [22] in the context of software fault tolerance, refers to a technique where variations in input data are leveraged to improve a system’s fault tolerance. By using different variations of inputs, systems can be tested for faults that might not otherwise be detected through uniform input data. In our work, we extend this concept to metamorphic testing by prioritizing metamorphic relations (MRs) based on the diversity between source and follow-up test cases. The aim is to enhance fault detection by ensuring that MRs with higher a diversity in test cases are executed earlier, reducing the time to uncover faults.

Dataset diversity, on the other hand, is a concept more specific to the field of machine learning [19]. It pertains to the variations within the training and testing datasets, including the distribution of attributes, labels, and instances. In machine learning systems, dataset diversity ensures that the model is exposed to a wide variety of data during testing, increasing the likelihood of revealing faults in the learning algorithm or model behavior.

Both data diversity and dataset diversity are integral to our method. While data diversity focuses on input variations to detect system faults, dataset diversity is centered on the variability within the dataset to uncover issues in machine learning models. By incorporating both aspects, our proposed approach aims to improve the effectiveness of early fault detection in machine learning systems.

3. Proposed Approach

In this section, we discuss our proposed method for the prioritization of MRs. Figure 1 shows the steps to prioritize MRs. In this approach, we prioritize the MRs as follows:

Let the set of source test cases in an MR used for testing the SUT be the prioritizing source test cases $(T_{s p})$ .
Let the set of follow-up test cases in an MR used to test the SUT be prioritizing follow-up test cases $(T_{f p})$ .
For $(T_{s p})$ and $(T_{f p})$ , apply any one of the proposed metrics: rule-based, anomaly detection, clustering-based, or data distribution, and obtain the MR diversity value of each MR based on the formulas provided in Section 3.1. In this approach, the prioritized MR order can be generated by applying a single metric. For example, if a rule-based metric is applied to prioritize MRs, then Formula (4) in Section 3.1.1 is applied to generate the MR diversity value for each MR.
MR diversity value calculated for each MR using the previous step 3 is normalized using the following steps:
(a)
Let the set of MR diversity values of MRs be $D M R_{v}$ .
(b)
Identify the minimum and maximum values in $D M R_{v}$ . Let the minimum value be $M i n_{v}$ and the maximum value be $M a x_{v}$ .
(c)
For each MR diversity value $v \in D M R_{v}$ , generate the normalized MR diversity value using the formula below:

$N_{M R} = \frac{v - M i n_{v}}{M a x_{v} - M i n_{v}}$

(1)

where $N_{M R}$ represents the normalized MR diversity value.
Prioritize the normalized MR diversity values and select the top n MRs from the prioritized ordering to execute based on the resources available for testing.

Our proposed metrics are explained in detail below.

Figure 1. Steps for MR prioritization of ML programs.

3.1. Dataset Diversity Approach

In this work, we propose five metrics to prioritize metamorphic relations based on diversity between the source and follow-up test cases in an MR. Our intuition behind the metrics is that the greater the diversity between the source and follow-up test cases, the greater the fault detection capability of the MR.

The proposed metrics are explained in detail below.

3.1.1. Rule Based Classifier

Rule-based classifiers are just another type of classifier that make the class decision by using various if-then rules. We apply a rule-based classifier (CN2) [23] to determine the diversity between the source and follow-up test cases in a metamorphic relation. CN2 is an instance of a sequential covering algorithm family based on rule-based classification used for learning a set of if-then rules from data. It uses a heuristic called the covering algorithm to identify the set of examples that can be covered by a rule and then removes those examples from the dataset before finding the next rule. The features of the CN2 algorithm are the individual independent variables or predictors that the algorithm uses to make predictions or classifications. They can be numerical (e.g., height, weight), categorical (e.g., color, type), or even ordinal. The output of the CN2 model is a set of logical rules. In addition, the CN2 algorithm is not natively equipped to handle unstructured data such as images. In the real world, the CN2 algorithm has been applied in credit scoring to identify rules that classify customers as good or bad credit risks based on their financial history. In addition, CN2 is applied to identify rules that classify patients into different diagnostic categories based on their symptoms, medical history, and other relevant factors. In this work, the diversity of classification rules between the source and follow-up test cases in a metamorphic relation can be used as an effective indicator of potential faults in a machine learning model. Specifically, we hypothesize that a greater variance in the rule set generated by the CN2 algorithm when applied to the source and follow-up test cases suggests a higher likelihood of MR violation. Also, the rule-based metric can be applied only to categorical and numerical data. We posit that metamorphic relations are designed to preserve certain properties of the model’s behavior when transformations are applied to the input data. If the classification rules generated by CN2 differ significantly between the source and follow-up test cases, it indicates that the model may not be consistently preserving these properties, potentially due to faults in the model. We apply the following steps.

Let the set of source test cases for an MR used for testing SUT be the prioritized source test cases $(T_{s p})$ .
Let the set of follow-up test cases for an MR be prioritized follow-up test cases $(T_{f p})$ .
Apply the sequential coverage algorithm (CN2) on $T_{s p}$ and $T_{f p}$ and obtain the rules. The CN2 induction algorithm is a learning algorithm for the induction of simple and comprehensible rules. The rules generated for an MR are shown in Figure 2. The numbers mentioned in Figure 2 represent a rule in the form of IF Feature A >= Value X AND Feature B >= Value Y THEN Label = Z, where Feature A and Feature B correspond to specific features from the dataset, and Value X and Value Y are the numerical thresholds that the classifier uses to make decisions. The label Z represents the predicted outcome based on these conditions. These thresholds are determined during the training phase of the CN2 rule-based classifier, where the classifier learns to identify patterns in the data by segmenting them based on these specific feature values. The rule is used to make predictions, with variations in the follow-up test cases helping to test the model’s robustness and fault detection capabilities.
Remove the same set of rules covered in both $T_{s p}$ and $T_{f p}$ .
Calculate the total rules identified in $T_{s p}$ and $T_{f p}$ . Let $T o t a l_{s r}$ and $T o t a l_{f r}$ represent the total rules in $T_{s p}$ and $T_{f p}$ , respectively.
We calculate the diversity of an MR using the formula below:

$\begin{matrix} R_{M R} = ∣ (T o t a l_{s r}) - (T o t a l_{f r}) ∣ \end{matrix}$

(2)

where $R_{M R}$ represents the diversity of the MR.
Apply steps 3 to 6 to identify the data diversity value for each MR used to test the SUT.
Prioritize MRs based on the data diversity value identified for each MR.

3.1.2. Anomaly Detection

Anomaly detection (or outlier detection) is the identification of rare items, events, or observations that raise suspicions by differing significantly from the majority of the data. In anomaly detection, an outlier is a data point or observation that significantly deviates from the majority of the data points in a given dataset. Also, outliers in data cause incorrect predictions, reduction in accuracy, and bias in the machine learning model. We apply anomaly detection to identify outliers in the source and follow-up test cases in a metamorphic relation. We hypothesize that the greater the diversity in outliers detected between the source and follow-up, the greater the possibility of detecting a fault by the MR. The anomaly detection metric can be applied to numerical data, categorical data, text data, and image data. We apply the following steps.

Let the set of source test cases for an MR used for testing SUT be the prioritized source test cases $(T_{s p})$ .
Let the set of follow-up test cases for the SUT be prioritized follow-up test cases $(T_{f p})$ .
Apply the z-score to $T_{s p}$ and $T_{f p}$ and identify outlier instances. A z-score, also known as a standard score, is a statistical measure that quantifies the number of standard deviations a data point is from the mean of the dataset. The z-score calculation is performed feature-wise to standardize the values and to identify how many standard deviations a data point is from the feature’s mean. The threshold value of 2.5 was used to identify anomalies. Using a higher threshold of 2.5, we reduced the likelihood of false positives and focused on the most extreme instances [24].
For each feature in $(T_{s p})$ and $(T_{f p})$ ,
(a)
Calculate the mean ( $μ$ ) and standard deviation ( $σ$ ) across all instances within that feature.
(b)
For each instance, determine the z-score of the feature using the formula: $z = \frac{(X - μ)}{σ}$ , where X represents the feature’s value in the instance, $μ$ is the mean of that feature across all instances, and $σ$ is the standard deviation of that feature across all instances.
After computing the z-score for each feature value in each instance, a threshold (e.g., 2.5) is applied to identify outliers. If the absolute value of the z-score for a feature value exceeds the threshold, that feature value is considered an outlier for that specific instance.
Once outliers are identified for each feature in both $T_{s p}$ and $T_{f p}$ , identical outliers (i.e., outliers that appear in both $T_{s p}$ and $T_{f p}$ for the same feature and instance) are removed to ensure that only unique discrepancies are considered in the diversity calculation.
Calculate the total outliers in $T_{s p}$ and $T_{f p}$ , respectively. Let $T o t a l_{s o}$ and $T o t a l_{f o}$ be the count of unique outliers detected in $T_{s p}$ and $T_{f p}$ , respectively.
We calculate the diversity of MR using the formula below:

$\begin{matrix} O_{M R} = ∣ (T o t a l_{s o}) - (T o t a l_{f o}) ∣ \end{matrix}$

(3)

where $O_{M R}$ represents the diversity of the MR.
Apply steps 3 to 9 to identify the data diversity value for each MR used to test the SUT.
Prioritize MRs based on the data diversity value identified for each MR.

3.1.3. Clustering-Based

In this metric, we partition the input space, i.e., all possible inputs, into different regions and determine how the data are distributed in the space. This would help prioritize MRs that cover a wide range of input space and can help maximize the coverage of the testing process, thus ensuring that the system is thoroughly tested across all possible inputs. To achieve this, we use clustering methods that simply try to group similar patterns into clusters whose members are more similar to each other based on some distance measure than to members of other clusters. We hypothesize that the greater the diversity in clusters detected between the source and follow-up test cases, the greater the possibility of detecting the fault by the MR. If these clusters are diverse between the source and follow-up cases, it suggests that the follow-up tests are not merely re-validations of the same scenarios but are instead exploring new dimensions of the test space. The clustering-based metric can be applied to numerical data, categorical data, text, and image data. We apply the following steps.

Let the set of source test cases used for testing SUT be the prioritized source test cases $(T_{s p})$ .
Let the set of follow-up test cases for the SUT be prioritized follow-up test cases $(T_{f p})$ .
Apply the K mean algorithm to $T_{s p}$ and $T_{f p}$ and identify the clusters with K (number of clusters). Let the clusters of $T_{s p}$ and $T_{f p}$ be $S_{c l}$ and $F_{c l}$ .
Find the distance between the clusters by calculating the euclidean distances between the clusters in $S_{c l}$ . Let the total distance between the clusters be $T s_{c d s}$ .
Determine the size of each cluster in $S_{c l}$ . Let the total size of the clusters in $S_{c l}$ be $T s_{s c}$ .
Find the distance within the cluster by calculating the average of the distances from the observations to the centroid of each cluster in $S_{c l}$ . Let the average distance within the cluster in $S_{c l}$ be $A s_{c d}$ .
Apply steps 4 to 6 to $F_{c l}$ , respectively. Let $T f_{c d s}$ be the total distance between the clusters, $T f_{s c}$ be the total size of the clusters, and $A f_{c d}$ be the average distance within the clusters.

$\begin{matrix} R_{M R} = ∣ (T o t a l_{s r}) - (T o t a l_{f r}) ∣ \end{matrix}$

(4)
We calculate the diversity of MR using the formula below:

$\begin{matrix} \begin{matrix} C_{M R} = ∣ (T s_{c d s} + T s_{s c} + A s_{c d}) - (T f_{c d s} + T f_{s c} + A f_{c d}) ∣ \end{matrix} \end{matrix}$

(5)

where $C_{M R}$ is the diversity of the MR.
Apply steps 3 to 8 to identify the data diversity value for each MR used to test the SUT.
Prioritize MRs based on the data diversity value identified for each MR.

3.1.4. Data Distribution

Data used in training machine learning models often form very similar patterns. Data distribution refers to the way that data are spread out or organized in a dataset. It describes the pattern of values that occur and how frequently they occur in a given dataset. By analyzing the distribution of the data, we can gain insights into the range of values, the frequencies of values, and any biases or anomalies in the data that may impact the effectiveness of the metamorphic relations. In this metric, we used the shape and spread of the data distribution of the dataset to find the diversity of metamorphic relations. The spread of distribution involves calculating range, variance, and standard deviation of the source and follow-up test cases. Similarly, the shape of the distribution involves calculating the skewness and kurtosis of the source and the follow-up test case of the MR. By comparing the data distribution of the source test cases (ST) and follow-up test cases (FT), we aim to quantify the difference in the data characteristics between the two sets. A larger difference in the data distribution suggests that the MR is being applied to a more diverse range of input data, potentially covering a wider range of behaviors of the SUT and increasing the fault detection capability. The hypothesis is that a greater diversity in the data distribution between ST and FT indicates a greater potential for fault detection by the MR. This is based on the assumption that applying the MR to a diverse set of input data increases the chances of exposing faults or inconsistencies in the SUT. We apply the following steps, and the steps are shown in Figure 3.

Let the set of source test cases used for testing SUT be the prioritized source test cases $(T_{s p})$ .
Let the set of follow-up test cases for the SUT be prioritized follow-up test cases $(T_{f p})$ .
For each feature in $T_{s p}$ ,
(a)
Calculate the skewness and kurtosis for the feature.
(b)
Calculate the range, variance, and standard deviation for the feature.
Let $S T_{s k}$ be the sum of the skewness and kurtosis for $T_{s p}$ across all features.
Let $S T_{r v s}$ be the sum of variance, range, and standard deviation for $T_{s p}$ across all features.
For each feature in $T_{f p}$ ,
(a)
Calculate the skewness and kurtosis for the feature.
(b)
Calculate the range, variance, and standard deviation for the feature.
Sum up the skewness and kurtosis values across all features to obtain $F T_{s k}$ .
Sum up the range, variance, and standard deviation values across all features to obtain $F T_{r v s}$ .
We calculate the diversity of MR using the formula below:

$\begin{matrix} D D_{M R} \begin{matrix} = ∣ (S T_{s k} + S T_{r v s}) - (F T_{s k} + F T_{r v s}) ∣ \end{matrix} \end{matrix}$

(6)

where $D D_{M R}$ is the diversity of MR.
Apply steps 3 to 8 to identify the data diversity value for each MR used to test the SUT.
Prioritize MRs based on the data diversity value identified for each MR.

3.1.5. Feature-Based

In this metric, we use the texture features of the images to find the diversity between the source and the follow-up test cases in an MR. In many real-world applications, the input data to a system can be images or other types of visual data. In such cases, the texture features of the images can provide insight into the underlying characteristics of the data, such as the presence of patterns, edges, or other structural features. By analyzing the textural features of the source and follow-up test cases, it is possible to identify differences in the spatial arrangement of pixel intensities between the two images, which can then help prioritize metamorphic relations.

To identify the texture, we apply the Gray-Level Co-occurrence Matrix (GLCM) [25]. It is a matrix that represents the co-occurrence of gray-level values between two adjacent pixels in an image. The GLCM is calculated by analyzing the spatial relationship between pixel pairs at a specified distance and direction in an image. The GLCM matrix is used to extract texture features from an image, such as contrast, energy, homogeneity, and correlation. Energy is a measure of the overall uniformity or smoothness of the texture in the image. Homogeneity is a measure of the closeness of the distribution of gray levels of adjacent pixels in the image. These features are used to distinguish between different types of textures in an image. We hypothesize that the greater the diversity in the texture identified between the source and the follow-up test cases in an MR, the greater the fault detection capability of the MR.

Let the set of source test cases used for testing SUT be the prioritized source test cases $(T_{s p})$ .
Let the set of follow-up test cases for the SUT be prioritized follow-up test cases $(T_{f p})$ .
Apply the Gray-level Co-occurrence Matrix (GLCM) method for $T_{s p}$ and extract statistical properties such as contrast, correlation, energy, and homogeneity from the GLCM. Let $S T_{G L C M}$ be the sum of contrast, correlation, energy, and homogeneity for $T_{s p}$ .
Apply step 3 to $T_{f p}$ . Let $F T_{G L C M}$ be the sum of contrast, correlation, energy, and homogeneity for $T_{f p}$ .
We calculate the diversity of MR using the formula below:

$\begin{matrix} T D_{M R} \begin{matrix} = ∣ (S T_{G L C M}) - (F T_{G L C M}) ∣ \end{matrix} \end{matrix}$

(7)

where $T D_{M R}$ is the diversity of MR.
Apply steps 3 to 5 to identify the data diversity value for each MR used to test the SUT.
Prioritize MRs based on the data diversity value identified for each MR.

4. Evaluation Setup

In this section, we provide the subject programs, metamorphic relations, and mutant generation procedure for evaluating the proposed metrics.

4.1. Subject Programs

We applied the validation procedure mentioned above to the following three machine learning programs and one deep learning program to evaluate our proposed MR prioritization methods.

IBk (http://weka.sourceforge.net/doc.dev/weka/classifiers/lazy/IBk.html (accessed on 3 August 2024)): K-nearest neighbors (KNN) classifier in the Weka machine learning library [26]. The input to IBk is a training dataset and a test dataset to be represented in .arff format. The output is the classification predictions made on the instances in the test dataset.
Linear Regression (https://weka.sourceforge.io/doc.dev/weka/classifiers/functions/LinearRegression (accessed on 3 August 2024)): Linear regression is a linear approach that models the relationship between input variables and a single output variable. The subject program is the linear regression implementation in the Weka library. The latest stable version of Weka is 3.8.5, which consists of multiple linear regressions, named LinearRegression() class. The input to the linear regression program is a training dataset and a test dataset represented in .arff format. The output is the prediction made for the instances in the test dataset.
Naive Bayes (https://weka.sourceforge.io/doc.dev/weka/classifiers/bayes/NaiveBayes.html (accessed on 3 August 2024)): A Naive Bayes classifier is a probabilistic machine learning model that is used for the classification task. The subject program is the Naive Bayes algorithm implementation, named NaiveBayes() class in the Weka library. The input to the Naive Bayes program is a training dataset and a test dataset represented in .arff format. The output is a classification of the instances in the test dataset.
Convolutional Neural Network (https://javadoc.io/doc/org.deeplearning4j/deeplearning4j-nn/latest/index.html (accessed on 3 August 2024)): A convolutional neural network (CNN) is a class of artificial neural networks and is most commonly applied to analyze visual imagery. The subject program is the CNN algorithm implementation in the Deeplearning4j library. The input to the CNN is the MNIST handwritten digit dataset, and the output is to classify the image of a handwritten digit into one of 10 classes representing integral values from 0 to 9. We also provided a fashion dataset as input to the CNN algorithm. It is a dataset comprising 60,000 small square 28 × 28 pixel grayscale images of items of 10 types of clothing, such as shoes, t-shirts, dresses, and more.

4.2. Metamorphic Relations

For conducting MT on IBk and Naive Bayes, we used 11 MRs developed by Xie et al. [10]. These MRs were developed on the basis of the user expectations of supervised classifiers. These MRs modify the training and test data so that the predictions do not change between the source and follow-up test cases. Similarly, to test the linear regression system, we used 10 MRs developed by Luu et al. [27]. The property of linear regression is related to the addition of data points, the rescaling of inputs, the shifting of variables, the reordering of data, and the rotation of independent variables. These MRs were grouped into 6 categories and derived from the properties of the targeted algorithm. For testing the convolutional neural network, we used 11 MRs developed by Ding et al. [28]. The MRs were developed on three levels: system level, dataset level, and data item level. MRs at the dataset level are created in relation to the classification accuracy of reorganized training datasets. The MRs at the data item level are created on the relation of the classification accuracy of reproduced individual images. The MRs for the subject program are provided in Appendix A.

4.3. Source and Follow-Up Test Cases

As we described earlier, MT involves the process of generating source and follow-up test cases based on the MRs used for testing. The MRs used for testing the SUT contain one source and one follow-up test case.

IBk uses a training dataset to train the k-nearest neighbor classifier, and a test dataset is used to evaluate the performance of the trained classifier. We used the training dataset from the machine learning repository (https://archive.ics.uci.edu/ml/datasets/ecoli (accessed on 3 August 2024)) as a source test case. The training data contain 336 instances and 8 attributes. To generate follow-up test cases using these source test cases, we applied the input transformations described in the MRs.

Linear regression uses a training dataset to train the model, and a test dataset is used to evaluate the performance of the trained model. We obtained the training data from the machine learning repository (https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD (accessed on 3 August 2024)) as a source test case. The training data contain 515,345 instances and 90 attributes. To generate the follow-up test case using these source test cases, we applied the input transformations described in the MRs. Similarly, for Naive Bayes, we obtained training data from the machine learning repository https://archive.ics.uci.edu/ml/datasets/Adult as the source test case. The training data contain 48,842 instances and 14 attributes. To generate the follow-up test case using these source test cases, we applied the input transformations described in the MRs.

For testing a convolutional neural network, we obtained training data (http://yann.lecun.com/exdb/mnist/ (accessed on 3 August 2024)) as a source test case. The training dataset contains 60,000 handwritten digits and 10,000 images for testing. To generate follow-up test cases using these source test cases, we applied the input transformations described in the MRs. Similarly, we used the fashion dataset (https://www.kaggle.com/datasets/zalando-research/fashionmnist (accessed on 3 August 2024)), which contains 60,000 images for training and 10,000 images for testing.

4.4. Mutant Generation

For each subject program, we generated a mutant set for the construction and evaluation of the prioritized MR order. In order to generate mutants, we applied the source-level operators in the training dataset for each of the subject-program and model-level mutation operators designed for conducting mutation testing on machine learning and deep learning applications as proposed by Shen et al. [29] and Jahangirova et al. [30]. Table 1 shows all the mutation operators at the source and model levels used in our evaluation. Source-level mutation operators modify the training dataset, the model structure, or hyper-parameters before training the model. These operators introduce faults into the machine learning and deep learning model during the training process. In addition, model-level mutation operators modify an already trained model by changing its weights, architecture, or hyperparameters. These operators are used to evaluate the robustness of the trained model by introducing faults and observing the model’s response to them.

Table 2 shows the number of mutants generated for the SUT. We applied all the source level operators such as data repetition, label error, data missing, data shuffle, noise perturbation, layer addition, and layer removal on the training dataset for the CNN. Also, we used model-level operators such as layer addition, layer deactivation, and weight shuffling for the CNN. For IBk, Naive Bayes, and linear regression, all the source-level operators such as data repetition, label error, data missing, data shuffle, and noise perturbation were applied on the training dataset. For the source-level operators, we applied each operator five times to the training dataset randomly to 11 different percentages (10, 15, 20, 25, 30, 35, 40, 45, 50, 55, and 60) resulting in 11 configurations of the same mutation operator as presented in the previous work by Jahangirova et al. and Liu et al. [30,31]. Similarly, for model-level mutants, we applied layer addition and layer removal to each layer in the network. For weight shuffling, we applied the mutation operator to 0.1% of the weights in the network as proposed in the previous work by Jahangirova et al. [30].

It is typical for mutation tools to generate some equivalent mutants; mutants that are syntactically different but functionally equivalent to the original program. Therefore, these equivalent mutants always produce the same output as the original program and cannot be detected using testing. Due to the complexity of the subject programs and the large number of mutants generated, it was practically hard to find equivalent mutants manually. Mutants that produced the same predictions as the original program within a tolerance threshold were considered equivalent mutants and discarded. The remaining non-equivalent mutants were used for further analysis. Table 1 shows the number of mutants used for the evaluation after this filtering.

5. Evaluation Methodology

In this work, we planned to evaluate the utility of the developed prioritization approaches on the following aspects: (1) Fault detection effectiveness of MR sets, (2) Effective number of MRs required for testing, (3) Time taken to detect a fault, and (4) Average percentage of fault detected (APFD). We evaluated the effectiveness of the proposed approaches with the following: (1) Random baseline: This represents the current practice of executing source and follow-up test cases of the MRs in random order. (2) Fault-based ordering: This represents the MR order based on the fault detection effectiveness of the MRs. The approach selects the MR that has detected the highest number of faults and places it in the prioritized MR ordering. The process continues until all possible faults are revealed [20]. (3) Neuron activation coverage-based ordering: This approach picks the MR that has the highest proportion of neurons activated across the layers of the neural network and places it in the prioritized MR order. In deep learning models, a neuron is a mathematical function that takes in one or more inputs, processes them using weights and biases, and produces an output. During training, the input data activate different neurons in the model, and the output of the model is based on the activation of these neurons. NAC-based prioritization of metamorphic relations involves analyzing the activation of individual neurons in the model when presented with different transformed inputs and prioritizing the metamorphic relations that are most likely to activate the greatest number of neurons.

We conducted experiments to find answers to the following research questions:

Research Question 1 (RQ1): Are the proposed MR prioritization approach based on data diversity more effective than the random baseline?
Research Question 2 (RQ2): Are the proposed MR prioritization approach more effective than the fault-based based prioritization?
Research Question 3 (RQ3): How does the proposed MR prioritization approach perform when compared to neuron activation coverage-based ordering?

5.1. Evaluation Procedure

In order to answer the above research questions, we carried out the following validation procedure similar to previous work [20]:

We generated a set of mutants that represent the faults for the SUT as described in Section 4.4. Then, we divided this mutant set into two subsets: prioritization set of faults $F_{p}$ , and the validation set of faults $F_{v}$ . The prioritization set of faults $F_{p}$ are used to generate a prioritized MR order for fault-based ordering. $F_{v}$ is used to evaluate the prioritized MR order. For each fault $f \in F_{p}$ and $f \in F_{v}$ , we logged whether each $M R_{i}$ revealed f when executed with $(T_{s p})$ and $(T_{f p})$ .
We used the method described in Section 3 to obtain data diversity-based MR ordering. Then, we applied the obtained MR ordering to $F_{v}$ and logged the mutant killing information.
Creating the random baseline: We generated 100 random MR orderings and applied each of these orderings to $F_{v}$ . The mutant killing information for each of these random orderings was logged. The average mutant killing rates of these 100 random orderings were computed to obtain the fault detection effectiveness of the random baseline.
Creating the neuron activation coverage-based ordering: We used neuron activation coverage (NAC) to generate prioritized MR ordering. The NAC aims to explore and understand the behavior of the deep neural network. The activation of diverse sets of neurons can provide a better understanding of how the network responds to different inputs and whether it exhibits the expected behaviors. A low coverage of neurons in the network can leave different behaviors of the deep neural network unexplored and may struggle to generalize to different input variations [32]. In fact, previous work by Pei et al. [32] showed that the neuron activation coverage is a good metric to measure the comprehensiveness of the DNN testing and helped to generate diverse inputs. The prioritization step is as follows.
(a)
Let $T_{s p}$ for a MR be the input to a CNN. Calculate the neuron activation data $a_{i, j}$ for each neuron j in each layer i for the source test case $T_{s p}$ .
(b)
Similarly, let $T_{f p}$ for an MR be the input to a CNN. Calculate the neuron activation data $a_{i, j}$ for each neuron j in each layer i for the follow-up test case $T_{f p}$ .
(c)
Let S be a set containing the neuron activation data $a_{i, j}$ for the source test case and F be a set containing the neuron activation data $a_{i, j}$ for the follow-up test case.
(d)
Compute the union of the two sets of activated neurons to obtain the total number of neurons activated by both test cases, $U = S \cup F$ .
(e)
Calculate the coverage score for the metamorphic relation as the proportion of neurons in the network that were activated by the source or follow-up test case using the formula below.

$c (MR) = \frac{| U |}{total number of neurons in the network}$

(8)

where $C_{M R}$ is the neuron activation coverage of MR.
(f)
Apply step 4a to 4e for all MRs.
(g)
Compare the neuron activation coverage ( $C_{M R}$ ) value of each MR and arrange them in descending order according to their coverage scores. Start with the MR having the highest coverage score and continue in descending order. Repeat the comparison and rearrangement process until all MRs are ranked.
(h)
Apply the prioritized MR ordering generated to $F_{v}$ and log the mutant killing information of the prioritized MR ordering.
Creating the fault-based ordering: fault-based MR prioritization utilizes fault detection information of MRs to create the MR order. Create the prioritized MR ordering as follows [20]:
(a)
Let the source test case used for testing the SUT be the prioritizing source test case $(T_{s p})$ .
(b)
Let the follow-up test case used to test the SUT be prioritizing follow-up test case $(T_{f p})$ .
(c)
Let the set of faults detected in the SUT be prioritizing set of faults $F_{p}$ . For each fault $f \in F_{p}$ , log whether each $M R_{i}$ revealed f when executed with $(T_{s p})$ and $(T_{f p})$ .
(d)
Use the following greedy approach to create the prioritized ordering of the MRs:
Select the MR that revealed the highest number of faults in $F_{p}$ and place it in the prioritized MR ordering. If there are multiple MRs with the same highest number, select one MR from them randomly.
Remove each $f \in F_{p}$ detected as faulty by that MR from $F_{p}$ .
Repeat steps 5(d)i and 5(d)ii until all possible faults are revealed.
(e)
Apply the prioritized MR ordering generated to $F_{v}$ and log the mutant killing information of the prioritized MR ordering.

5.2. Evaluation Measures

We used the following measures to evaluate the effectiveness of the MR orderings generated by our proposed metrics:

To measure the fault detection effectiveness of a given set of MRs, we use the percentage of mutants killed by those MRs. We calculate the relative improvement in fault detection effectiveness using the formula below:

$\begin{matrix} Relative Improvement (%) = \\ (\frac{Effectiveness of the Approach - Effectiveness of the Baseline}{Effectiveness of the Baseline}) \times 100 \end{matrix}$

where
- Effectiveness of the Approach is the fault detection effectiveness measure using our proposed metrics.
- Effectiveness of the Baseline is the fault detection effectiveness measure for the baseline method such as random, fault-based, or neuron coverage-based.
To calculate effective MR set size, we used the following approach: MT fault detection effectiveness typically increases as the number of MRs used for testing increases. However, after a certain number of MRs, the rate of increase in fault detection slows due to factors such as redundancy of MRs. Therefore, when there is no significant increase in fault detection between two MR sets of consecutive sizes of size m and $m + 1$ , where the MR set of size $m + 1$ is created by adding one MR to the MR set set of size m, the effective MR set size can be determined. That is, if the difference in the effectiveness of fault detection of the MR set of size m and the MR set of size $m + 1$ is less than some threshold value, m would be the effective MR set size that should be used for testing. The threshold is a predefined value used to determine whether the difference in fault detection effectiveness between two sets of MRs is significant enough to justify the inclusion of additional MRs in the test set. Determining this threshold value should be performed considering the critical nature of the SUT. In this work, we used two threshold values of 5% and 2.5%, as used in previous related work, for determining the oracle dataset size [33]. For example, in Figure 4c, the relative improvement % between the MR sets 1 and 2 was 3.25%, which is less than the threshold 5%. So, users can now select the first two MRs for execution.
We used the following approach to find the average time taken to detect a fault: for each killable mutant m in $F_{v}$ , we calculated the time taken to kill the mutant ( $t_{m}$ ) by computing the time taken to execute the source and follow-up test cases of the MRs in a given prioritized order until m is killed (here, it is assumed that the source and follow-up test cases for each MR are executed sequentially). Then, the average time taken to detect a fault is computed using the following formula:

$\frac{\sum t_{m}}{# killable mutants in F_{v}}$

(9)
We calculate the average percentage of fault detected (APFD) developed by Elbaum et al. [34,35,36] that measures the average rate of fault detection per percentage of test suite execution. The APFD is calculated by taking the weighted average of the number of faults detected during the execution of the MRs. APFD can be calculated using the following formula:

$APFD = \frac{1 - M R F_{1} + M R F_{2} + \dots + M R F_{m}}{n m} + \frac{1}{2 n}$

(10)

where $M R F_{i}$ represents the fault $F_{i}$ detected by the metamorphic relations under evaluation, m represents the number of faults present in the SUT, and n represents the total number of MRs used.

6. Result and Analysis

In this section, we discuss our experimental results and provide answers to the three research questions that we list in Section 5. For each subject program, we performed the validation procedure described in Section 5.1 using the setup described in Table 3. In this setup, we used the generated mutant set for evaluating the prioritized MR ordering as

F_{v},

and the source test cases as

T_{s p}

. For example, in Table 3, for IBk, Ecoli refers to the dataset we describe in Section 4.3. Similarly, for CNN, MNIST and Fashion datasets refer to the 60,000 images used for training and testing. YearPrediction refers to the training dataset used for the linear regression program, respectively. Figure 4 shows the average fault detection effectiveness for the evaluation runs described above vs. the MR set size used for testing each subject program. We also plot the percentage of faults detected by the random baseline for comparison. The results for each research question are provided below.

6.1. RQ1: Comparison of MR Prioritization Approaches versus Random Baseline

6.1.1. Fault Detection Effectiveness

We formulated the following statistical hypothesis to answer RQ1 in the context of fault detection effectiveness:

Hypothesis 1.

For a given MR set of size m, the fault detection effectiveness of the MR set produced by the data diversity-based approach is higher than that of the random baseline.

The null hypothesis

H_{0 x}

for each of the above-defined hypotheses

H_{X}

is that the data diversity-based approaches perform equal to or worse than the random baseline.

In Table 4a–e, we list the relative improvement in average fault detection effectiveness for the data diversity-based method over the currently used random approach of prioritizing MRs for linear regression, Naive Bayes, IBk, and CNN, respectively. To evaluate the above hypotheses, we use the paired permutation test, as it does not make any assumptions about the underlying distribution of the data [37]. We apply the paired permutation test to each MR set size for each of the subject programs with

α

= 0.05, and the relative improvements that are significant are marked with a *.

As shown in Table 4a, for IBk, the rule-based and data distribution metric shows improvements in average fault detection effectiveness over the random approach for all MR set sizes except for the last MR set size. The last MR set size does not show any improvements, as the fault detection effectiveness reaches a saturation point by the last metamorphic relation set size. In particular, among the different MR set sizes, the increase in the percentage of fault detection varies from 0% to 129%. Therefore, we reject the null hypotheses

H_{01}

for IBk for rule-based, anomaly, clustering-based, and data distribution-based metrics.

Similarly for linear regression, Table 4b indicates that all metrics show significant consistent improvements in fault detection over the random baseline. The fault detection varies from 0% to 114% for the metrics. Therefore, we can reject the null hypothesis in general for linear regression. Table 4c shows the relative improvement in the average percentage of fault detection for Naive Bayes. All metrics show improvements for MR set sizes

m = 1

to

m = 5

. Therefore, we can reject the null hypothesis in general for CNN only for MR set sizes

m = 1

to

m = 5

.

Table 4d,e show the relative improvements in the average percentage of fault detection for CNN. All metrics show improvements for all MR set sizes except MR set size

m = 9

and

m = 10

. Furthermore, the relative improvement between the MR prioritization methods varies between 0% and 457% for the MNIST dataset and between 0% and 108% for the Fashion dataset. Therefore, we can reject the null hypothesis in general for CNN. So, overall, the data diversity-based approach outperformed the random approach for all our subject programs.

6.1.2. Effective Number of MRs Used for Testing

In Figure 4, the effective number of MRs using fault detection thresholds for various subject programs is illustrated as mentioned in Section 5.2. The vertical lines indicate the size of the effective MR set for different approaches: brown for clustering-based, green for data distribution-based, violet for fault-based, and orange for the random baseline approach. The associated thresholds are annotated near each vertical line for clarity.

In the linear regression context, the clustering, data distribution, anomaly, and rule-based metrics yield an effective MR set size of 2 at the 2.5% fault detection threshold. In contrast, the random approach results in an MR set size of 5 as shown in an orange vertical line for the same threshold of 2.5%. This represents an 85% reduction in the size of the MR set when using the data diversity-based metrics. The significant reduction indicates that the data diversity metrics can achieve similar, or even superior, fault detection performance with a smaller set of MRs when compared to the random approach, thereby improving the testing efficiency.

For the IBk classifier, clustering and data distribution-based metrics produce an effective MR set size of 3 and 2, respectively, at the 2.5% threshold. Meanwhile, the random approach results in a much larger MR set size of 8. Additionally, the anomaly and rule-based approaches yield an MR set size of 3 for the 2.5% threshold. For the Naive Bayes classifier, both clustering-based and random-based metrics generate an MR set size of 1 for the 2.5% threshold, while the random approach generates an MR set size of 3. These observations underscore the effectiveness of clustering-based methods in reducing the testing effort while maintaining a high standard of fault detection.

Similarly, for CNNs trained on the MNIST dataset, the clustering-based, data distribution-based, anomaly-based, and feature-based metrics all produce an effective MR set size of 2 at the 5.0% fault detection threshold, while the random approach results in a substantially larger MR set size of 10. For CNNs trained on the Fashion–MNIST dataset, both feature-based and clustering-based metrics produce an effective MR set size of 3 at the 5.0% threshold, compared to an MR set size of 6 for the random approach.

These results demonstrate the efficiency of clustering and data-driven approaches in reducing the size of the MR set, thereby minimizing computational resources while maintaining a high rate of fault detection. This efficiency accelerates the validation process, which is crucial in iterative development cycles, particularly in machine learning projects where resource conservation and rapid validation are critical for success.

6.1.3. Average Percentage of Faults Detected

Table 5 shows the APFD for the SUT. We can observe that for all our subject programs, our proposed metrics provide a higher APFD value between 0.35 and 0.98 when compared to random-based prioritization. This leads to detecting the most important faults earlier in the testing process compared to the random approach.

6.1.4. The Average Time Taken to Detect a Fault

Table 6 shows the average time taken to detect a fault in linear regression, Naive Bayes, IBk, and CNN, respectively. The first column of the table represents the average time taken to detect a fault using our proposed metrics. As shown in these results, the use of our proposed approach resulted in significant reductions in the average time taken to detect a fault by 48% compared to the random baseline for IBk. Similarly, for linear regression, Naive Bayes, and CNN using the MNIST dataset, our proposed metrics provided 62%, 48%, and 43% reduction in time taken to detect a fault compared to random approach. Similarly, for CNN using the Fashion dataset, our proposed metrics provided a 54% reduction in the time taken to detect a fault compared to the random approach. Table 7 illustrates the time taken to prioritize MRs using different approaches for various subject systems. Notably, the rule-based approach results in significantly longer processing times, as seen with linear regression, requiring up to 72,000 s. In contrast, anomaly-based, clustering-based, and data distribution-based approaches demonstrate considerable efficiency across all subjects, often completing tasks in a fraction of the time required by rule-based and fault-based methods. For instance, in the Naive Bayes subject, the anomaly-based approach took only 72 s, while the fault-based method took 15,000 s, indicating a substantial reduction in time. Table 8 shows the time taken to generate the metrics.

Specifically, for the CNN models using the MNIST and Fashion datasets, the clustering-based, data distribution-based, and feature-based approaches required notably more time compared to simpler subjects, but they were still far more efficient than the fault-based approach, which took an extreme 242,000 s for MNIST and 900,000 s for the Fashion dataset. These observations confirm that our proposed approaches not only facilitate a faster prioritization of MRs but also offer a significant reduction in the time and cost associated with the MR prioritization process.

When applying the diversity metrics and prioritization techniques, it is important to note that while faster methods like anomaly and clustering-based approaches drastically reduce the time for MR prioritization, these methods still incur overhead depending on the complexity of the system and the dataset size. In CNN models, for example, the prioritization process, even when using more efficient methods, requires more time than simpler systems like Naive Bayes. This is likely due to the higher computational demands of CNNs in calculating feature diversity, distribution analysis, and anomaly detection across large and complex datasets. Thus, the overhead can vary considerably based on the method chosen and the complexity of the subject system. The proposed methods do reduce time and cost in the prioritization process, but this efficiency comes with its own computational expense, especially in models with larger feature sets or more intricate architectures like deep learning systems.

6.2. RQ2: Comparison of the Proposed MR Prioritization Approach against the Fault-Based Approach

In this research question, we evaluate whether the data diversity-based approach outperforms the fault-based approach. To answer the research question, we formulate the following hypotheses.

Hypothesis 2.

For a given MR set of size m, the fault detection effectiveness of the MR set produced using a data diversity-based approach is higher than the fault detection effectiveness of the MR set produced by the fault-based prioritization.

The null hypothesis

H_{0 x}

for each of the above-defined hypotheses

H_{x}

is that the MR sets generated by the proposed method perform equal to or worse than the MR sets generated by fault-based prioritization in terms of fault detection effectiveness.

6.2.1. Fault Detection Effectiveness

We provide the answer to RQ2 in the context of the effectiveness of fault detection. In this research question, we evaluate whether the data diversity-based approach outperforms the fault-based approach. Table 9a shows the relative improvement in the fault detection percentage between the MR sets generated by the data diversity-based approach over the fault-based approach for IBk. We observed that the data distribution metric performs equal to the fault-based approach for all the MR set sizes. Also, rule-based and clustering-based perform equally to the fault-based approach except for MR set size

m = 1

. Therefore, we cannot reject the null hypothesis

H_{02}

for the data distribution metric. Table 9b shows the relative improvement in the fault detection percentage between the MR sets generated by the data diversity-based approach over the fault-based approach for linear regression. The results indicate that the anomaly-based and data distribution-based metric outperform the other two metrics. Specifically for an MR set size

m = 1

, both the anomaly-based and data distribution-based metrics show a relative improvement in fault detection effectiveness of 29%, while the clustering-based and rule-based show either no improvement or a decrease in effectiveness. For MR set sizes

m = 2

to

m = 4

, the anomaly-based, data distribution-based, and clustering-based metrics show an improvement in fault detection effectiveness, while the rule-based approaches show no improvements. For MR set sizes

m = 5

to

m = 10

, all approaches show no improvements in effectiveness compared to the fault-based approach. The rule-based metric performs equally as the fault-based approach except for MR set size

m = 1

. Therefore, we cannot reject the null hypotheses

H_{02}

for the data distribution-based and anomaly-based metrics.

Table 9c shows the relative improvement in the fault detection percentage between the MR sets generated by the data diversity-based approach over the fault-based approach for Naive Bayes. The results indicate that our proposed metrics perform equally to fault-based for all MR set sizes. Therefore, we cannot reject the null hypotheses

H_{02}

.

Table 9d shows the relative improvement in the fault detection percentage between the MR sets generated by the data diversity-based approach over the fault-based approach for CNN using the MNIST dataset. We observe that the clustering-based, data distribution-based, and feature-based approaches performed equal to fault detection for all the MR set sizes, and the anomaly-based approach performed equally to fault detection for only MR set sizes

m = 4

to

m = 10

. Table 9e shows the relative improvement in the fault detection percentage between the MR sets generated by the data diversity-based approach over the fault-based approach for CNN using the fashion dataset. We observe that the feature-based metric provided greater fault detection effectiveness when compared to the fault-based for MR set sizes

m = 5

to

m = 9

. However, the fault-based approach provided marginally greater or equal fault detection effectiveness when compared to anomaly, clustering, and data distribution-based metrics for all the MR set sizes. Therefore, we can reject the null hypotheses

H_{02}

only for the feature-based metric.

6.2.2. Effective Number of MRs Used for Testing

In Figure 4, for IBK, we observe that the data distribution-based and anomaly-based metrics provide an effective MR set size of 2 for the 2.5% threshold, and the fault-based approach also provides an effective MR set size of 2 for the 2.5% threshold. For Naive Bayes, the rule-based and fault-based metrics provide an effective MR set size of 1 for the 2.5% threshold. Similarly, for CNN using MNIST and fashion datasets, all our metrics provide an effective MR set size of 2 for the 5.0% threshold, and the fault-based approach also provides an effective MR set size of 2 for the 5.0% threshold. In linear regression, all our proposed metrics provide an effective MR set size of 2 for the 2.5% threshold, and the fault-based approach also provides an effective MR set size of 2 for the 2.5% threshold. The results indicate that the effective number of MRs required for testing SUT using our proposed metrics did not show improvements over the fault-based approach.

6.2.3. The Average Time Taken to Detect a Fault

Table 6 shows the average time taken to detect a fault in linear regression, IBk, Naive Bayes, and CNN, respectively. As shown in these results, using our proposed metric resulted in reductions in the average time taken to detect a fault by 13.04% when compared to the fault-based approach for linear regression. This reduction in average time to detect a fault using our metrics is due to the higher fault detection effectiveness of the data distribution-, anomaly-, and clustering-based metrics when compared to the fault-based approach for certain MR set sizes. For IBk, Naive Bayes, and CNN using the MNIST and Fashion datasets, the data diversity-based approach did not reduce the time taken to detect a fault since the fault detection effectiveness of our proposed metrics was less than the fault-based approach for CNN and IBK for certain MR set sizes.

6.2.4. Average Percentage of Faults Detected

Table 5 shows the APFD for the SUT. We can observe that for all our subject programs, APFD for our proposed metrics vary from 0.21 to 0.98 when compared to the fault-based for all the subject programs. We found that the fault-based approach provides a higher APFD when compared to the anomaly-based, data distribution-based, and clustering-based approaches for IBk. Similarly, for Naive Bayes, only the clustering-based metric provided equal APFD when compared to the fault-based. This occurred due to the clustering-based metric’s capability to identify pivotal groupings or patterns in the data that align well with Naive Bayes’ probabilistic decision boundaries, thus effectively uncovering faults. For linear regression, the fault-based approach provides higher APFD when compared to data distribution and rule-based approach. Finally, for CNN using the MNIST dataset, our proposed metric provides APFD equal to the fault-based approach since the MNIST dataset consists of handwritten digit images, which have relatively simple and well-defined features. For CNN using the Fashion dataset, the fault-based approach provides a higher APFD when compared to our proposed metrics since the Fashion dataset provides inherent complexity and variability compared to the MNIST dataset.

6.3. RQ3: Comparison of the Proposed Metric with the Approach Based on Neuron Activation Cover

In this research question, we evaluate whether the data diversity-based ranking approach outperforms the neuron activation coverage-based approach. To answer the research question, we formulate the following hypothesis.

Hypothesis 3.

For a given MR set of size m, the fault detection effectiveness of the MR set produced using data diversity-based approach is higher than the fault detection effectiveness of the MR set produced by the neuron activation coverage-based approach.

The null hypothesis

H_{0 x}

for each of the above-defined hypotheses

H_{x}

is that the MR sets generated by the data diversity-based method perform equally to or worse than the MR sets generated by the neuron activation coverage-based approach in terms of fault detection effectiveness.

6.3.1. Fault Detection Effectiveness

We provide the answer to RQ3 in the context of fault detection effectiveness. In this research question, we evaluate whether the data diversity-based approach outperforms the neuron activation coverage-based approach. Table 10 and Table 11 show the relative improvements in the percentage of fault detection between MR sets generated by the data diversity-based approach over the neuron activation coverage-based approach for CNN using the MNIST and the Fashion datasets. We observed that our proposed metrics outperformed the neuron activation coverage-based approach for all the MR set sizes. This lack of performance by the neuron activation coverage-based approach is due to the lack of variation in neuron activation in the network between the source and follow-up test cases of the MR for CNN. For CNN using the MNIST dataset, with the exception of MR6, all MRs showed the same neuron activation coverage, which resulted in MR6 being prioritized over other MRs, which were selected and ordered randomly. This randomness in the prioritization of MRs resulted in reduced fault detection effectiveness of the neuron activation coverage-based approach. However, we observed variations in neuron activation in the network between the source and follow-up test cases of MRs for CNN using the Fashion dataset. This higher complexity and diversity in the Fashion dataset led to more varied activation patterns across different neurons in the network. Therefore, we can reject the null hypothesis

H_{02}

for all our metrics.

6.3.2. Average Percentage of Faults Detected

Table 5 shows the APFD for the subject programs. We can observe that for CNN using the MNIST and Fashion datasets, our proposed metrics provided a higher APFD between 0.35 and 0.88 when compared to the neuron activation coverage-based approach. The superiority of the proposed metrics over the neuron activation coverage in these cases suggests that simply achieving high activation coverage does not necessarily equate to effective fault detection.

7. Discussion

Our research introduced a novel method for prioritizing metamorphic relations (MRs) in software testing, particularly for machine learning and deep learning systems, utilizing metrics based on data diversity. The proposed approach not only improved fault detection effectiveness by 5% to 550% compared to neuron activation coverage-based prioritization but also significantly reduced the time taken to detect faults by up to 62% compared to a random execution of MRs. Also, our proposed metric performed closer to fault-based approach.

The results suggest that data diversity is a critical factor in the effectiveness of MRs, aligning with the theory that diversity in test cases can expose more faults. This finding supports previous studies that highlight the importance of test case diversity, but extends it by quantifying the impact in the context of metamorphic testing. A major strength of our study is the introduction of quantifiable and practical metrics for prioritizing MRs, which can be readily applied in real-world settings. However, the study is not without limitations. The effectiveness of the proposed metrics can vary depending on the specific characteristics of the data and the ML algorithms used. Additionally, the generalizability of the results to other forms of white-box software testing remains to be explored.

The practical implications of our findings are profound, particularly for industries and sectors where rapid deployment of machine learning models is critical. Our methods can significantly reduce the time and resources spent on testing without compromising quality, thereby accelerating the pace of software development. Theoretically, this research advances our understanding of MR prioritization in the context of machine learning, offering a new perspective on how data diversity impacts the effectiveness of tests.

When selecting a metric for a subject program in practice, the choice of metric should be guided by the type of data used in the subject program. Rule-based and clustering-based metrics are suitable for structured or numerical data, while data distribution-based, clustering-based, anomaly-based, and feature-based metrics are appropriate for image or unstructured data. Domain expertise and understanding of the subject program play a crucial role in determining which metric might be most effective in detecting faults. Additionally, all our subjects show some variation in the performance of different metrics across the subject programs; it is important to note that the differences are relatively small in most cases. For example, in the CNN—Fashion dataset, the anomaly-based metric achieves the highest performance at 0.93, but the data distribution and clustering-based metrics are not far behind at 0.88 and 0.87, respectively. Similarly, for the linear regression subject program, the clustering-based and fault-based metrics both perform well at 0.97.

Prior work by Srinivasan et al. [20] proposed a fault-based and code coverage-based approach to prioritize MRs in the context of regression testing. The fault-based approach applies a greedy algorithm on the faults detected by the MRs on the previous version of the SUT to prioritize the MRs for testing the next version of the SUT. The code coverage-based approach uses statements and branches covered by MRs to prioritize MRs. The proposed approach was applied to three diverse applications, and the result indicates that the code coverage-based approach did not work effectively for machine learning-based programs. Huang et al. [15] proposed metamorphic relations prioritization, and the selection based on test adequacy criteria aims to improve the efficiency and effectiveness of metamorphic testing by selecting and prioritizing metamorphic relations (MRs) based on their potential to achieve test adequacy. The approach likely involves defining a set of test adequacy criteria, such as statement coverage or branch coverage, and then calculating a metric for each MR based on its contribution to meeting these criteria. MRs are prioritized and selected based on these metrics, with the goal of maximizing test adequacy while minimizing test execution time. Experiments would compare the fault detection effectiveness of MR sets selected using this approach to those selected randomly or using other criteria. Sun et al. [38] provided path-directed source test case generation, and prioritization in metamorphic testing focuses on improving the efficiency and effectiveness of metamorphic testing by leveraging path information. This approach generates source test cases by considering the program’s control flow, aiming to increase code coverage. The generated test cases are then prioritized based on path distance, with the goal of maximizing the exploration of different program paths. By targeting specific program paths, this method aims to enhance fault detection capabilities. Experiments compared the fault detection effectiveness of path-directed test case generation with random or other test case generation techniques. However, the techniques proposed does not involve prioritizing metamorphic relations for machine learning programs based on data diversity.

Cao et al. [16] proposed six metrics to measure the dissimilarity between the source and follow-up test cases. The authors then evaluated the effectiveness of their approach by applying it to four different metamorphic relations and comparing the results to those obtained using traditional dissimilarity metrics. The results showed that their approach is effective in identifying cases where the traditional metrics fail and that it can improve the efficiency of metamorphic testing by reducing the number of redundant test cases that need to be executed. The work does not propose metrics to prioritize metamorphic relations for machine learning and deep learning programs.

Mayer and Guderlei [1] conducted an empirical study to assess the usefulness of MRs. Based on the result of the experiment, the authors devised several criteria to assess the quality of the MRs on the basis of potential usefulness. These criteria were defined based on the different execution paths undertaken by the source and follow-up test cases. Although this work provides some directions for selecting useful MRs, it does not directly addresses MR prioritization. In particular, we focus on developing automated methods for MR prioritization.

Hui et al. [39] provided qualitative guidelines for MR prioritization. The guidelines could be used to select effective MR and improve MT performance. The author also provided three measurable metrics for MR selection: degree of MR, algebraic complexity of MR, and distance between test inputs of MR. The empirical result showed inconsistency between the correlation value and the metrics, indicating that other factors could influence the effectiveness of MR. Nakajima et al. [19] focused on the use of dataset diversity, that is, the use of multiple datasets that are diverse in terms of their origin, characteristics, and distribution, to improve the effectiveness of metamorphic testing for ML software. The proposed method takes into account the dataset dependency of training results and provides a new way to generate follow-up test input. The approach was evaluated in a case of software testing of neural network programs to classify handwritten numbers. The results showed that the proposed method was able to find defects in the machine learning software that were not found by traditional metamorphic testing methods. In contrast, our work prioritizes metamorphic relations based on the diversity between the source and follow-up test cases in an MR. Sun et al. [38] proposed a technique to generate good source test cases in metamorphic testing based on constraint solvers and symbolic execution techniques. In addition, a prioritization of source test cases was conducted on the basis of path distances among test cases. This work does not focus on the prioritization of metamorphic relations. In addition, the work does not perform a prioritization of metamorphic relations for machine learning and deep learning programs.

Feng et al. [17] proposed a new method to identify and prioritize test cases that are likely to be most effective in improving the robustness of a deep neural network. The authors introduced a metric called DeepGini, which measures the importance of each test case in terms of its ability to activate a large number of neurons in the network. They demonstrated that using DeepGini to select a subset of test cases for training leads to improved robustness and accuracy of the network, as well as faster training times. The experimental result indicates that DeepGini outperforms existing coverage-based techniques in prioritizing tests with respect to both effectiveness and efficiency. However, the proposed technique does not focus on prioritizing metamorphic relations.

Torres et al. [18] introduced MetaTrimmer, which generates random test data, identifies violated metamorphic relations, and derives constraints to improve their precision. Experiments on real-world programs demonstrated that MetaTrimmer effectively selects informative metamorphic relations, leading to increased fault detection rates compared to traditional random testing. By automating the selection and refinement process, MetaTrimmer significantly enhances the practicality and effectiveness of metamorphic testing. Liu et al. [40] proposed similarity-based Metamorphic Relations Selection Strategy for Numerical Computation Programs, which focuses on improving metamorphic testing for numerical computation programs. Unlike random or data-driven approaches, this method leverages the concept of similarity between input relations of different metamorphic relations. By calculating similarities, it selects a set of metamorphic relations that are diverse and cover a wide range of program behaviors. This strategy aims to enhance test effectiveness by ensuring that the selected MRs are complementary rather than redundant. However, the proposed work focuses on MR selection based on distance metric for numerical programs in comparison to our proposed work on MR prioritization for ML programs.

Xie et al. [41] proposed MUT Model: a metric for characterizing metamorphic relations diversity introduces a quantitative approach to measure the diversity of a set of metamorphic relations (MRs). The MUT model defines a mathematical framework to calculate diversity based on factors such as the types of operations involved in the MRs, the complexity of the input transformations, and the expected output relationships. By quantifying MR diversity, the MUT model aims to provide a foundation for selecting diverse sets of MRs, which can improve the effectiveness of metamorphic testing by increasing the likelihood of uncovering different types of faults.

8. Threats to Validity

External Validity: Our proposed metric is applied to supervised ML classifiers. Metrics such as rule-based are not suitable for unsupervised classifiers. We plan to design more metrics to cover unsupervised classifiers and non-numerical data. In this work, all experimental subjects are implementation of machine learning and the deep learning algorithm used in the Weka ML library. Although they are popular algorithms and widely used, our findings may not be generalizable to commercial machine learning projects such as autonomous vehicles. To mitigate this threat, we plan to explore the effectiveness of our proposed approach in AI systems in the health care domain and Apollo autonomous driving platform projects in the future.

Internal Validity: We used a dataset having an average of 500,000 instances and 90 attributes on our subject programs. However, the size of the dataset could be small for generating the proposed metrics and prioritizing the MRs. We used less than 300 mutants to validate our prioritized MR ordering for the CNN program. However, it is possible that the number of mutants used is low. However, despite the low number of mutants, those are generated using a set of mutation operators that are specifically designed for machine learning and deep learning and have a higher likelihood of detecting fault.

We used 10–11 MRs to test our subject programs. However, six to nine MRs are generally used for testing [42]. In addition, the time and resources consumed in MT do not only depend on the number of MRs. Although we used MRs with single-source/follow-up test cases in this experiment, depending on the SUT, there can be MRs with multiple source and follow-up test cases. Thus, as the number of MRs increases, the number of test cases can increase exponentially. Additionally, prioritizing MRs can help you identify redundant or less important MRs that may not need to be tested. This can further reduce the testing effort and allow you to focus on the most important MRs.

9. Conclusions

Metamorphic relations in metamorphic testing have varying fault detection effectiveness and also have multiple source and follow-up test cases. As a result, the execution and training of the machine learning model with large source and follow-up test cases can exponentially increase the execution time and cost of metamorphic testing for ML applications. To overcome this problem, we developed five metrics based on diversity between the source and follow-up test case for the prioritization of MRs.

We evaluated our proposed metrics on the implementation of three open-source machine learning and deep learning algorithms. The experimental results show that using MR prioritization in ML and DL programs can increase the fault detection effectiveness of a given MR set to 114% for linear regression, 129% for IBk, and up to 457% for CNN compared to the random ordering of MRs. In particular, using our proposed MR prioritization approaches reduced the average time taken to detect a fault compared to the fault-based prioritization of MRs by up to 25% for our subject programs. Our proposed method also provided a higher fault detection effectiveness when compared to neuron activation coverage-based prioritization. Our finding also indicates that the number of MRs needed to execute and test is reduced by 85%, which helps to uncover faults earlier in the development cycle and helps software testing practitioners. Future studies should explore the applicability of our prioritization metrics across different domains and types of software systems to evaluate their universality and limitations.

Author Contributions

M.S. made substantial contributions to the conception and design of the work, wrote the main manuscript text, and performed the analysis and interpretation of data. U.K. provided extensive feedback that was crucial in critically revising the draft and approved the version to be published. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original contributions presented in the study are included in the article, further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. MRs Used for Testing the Subject Programs

In this Section, we discuss the MRs used in each of the subject programs.

Appendix A.1. MRs for IBk and Naive Bayes Program

The MRs used for testing IBK and Naive Bayes program are discussed below, and the MRs were obtained from Xie et al. [10]

MR1 (Consistency with affine transformation): The result should be same if we apply some affine transformation function, f(x) = kx + b, (k ≠ 0), to every value x in some subset of features in the training and testing data. The MR contains one source test case (training and testing set). The MR contains one source and follow-up test case.
MR2 (Permutation of the attribute): If we permute the m attributes of all the samples and the test data, the result should remain unchanged. The MR contains one source and follow-up test case.
MR3 (Addition of uninformative attributes): If we add some new feature that is equally associated with all classes, the predictions of the test data should not be changed. The MR contains one source and follow-up test case.
MR4 (Consistency with re-prediction): Suppose we predict some test case t as class $l_{i}$ . If we append t to our training data and re-create the model, t should still be classified as class $l_{i}$ . The MR contains one source and follow-up test case.
MR5 (Additional training sample): For the source input, suppose we obtain the result $c_{t}$ = $l_{i}$ for the test case $t_{s}$ . In the follow-up input, we duplicate all samples in S and L, which have label $l_{i}$ . The output of the follow-up test case should still be $l_{i}$ . The MR contains one source and follow-up test case.
MR6 (Addition of classes by re-labeling sample): For some test cases not of class $l_{i}$ , we switch the class label from x to $x^{*}$ . Then, every test case predicted as class $l_{i}$ should still be predicted as class $l_{i}$ with the re-labeled samples. The MR contains one source and follow-up test case.
MR7 (Permutation of class labels): If we permute the order of the class labels with some random permutation p( $l_{i}$ ), where $l_{i}$ is a class label, all test cases that were predicted as $l_{i}$ should now be predicted as p( $l_{i}$ ). The MR contains one source and follow-up test case.
MR8 (Addition of informative attribute): If we add some new feature that is strongly associated with one class, $l_{i}$ , then for every prediction that was class $l_{i}$ , the prediction with this new attribute should also be class $l_{i}$ . The MR contains one source and follow-up test case.
MR9 (Addition of classes by duplicating samples): Suppose we duplicate every class except for n, and give them all a new class. Then, every test case predicted as class $l_{i}$ should still be predicted as class $l_{i}$ with the duplicated samples. The MR contains one source and follow-up test case.
MR10 (Removal of classes): If we remove some class $l_{i}$ , the remaining predictions should remain unchanged. The MR contains one source and follow-up test case.
MR11 (Removal of samples): If we remove samples that have not been predicted as class $l_{i}$ , then all cases which were predicted as $l_{i}$ should remain unchanged. The MR contains one source and follow-up test case.

Appendix A.2. MRs for Linear Regression

The MRs used for testing the linear regression program are discussed below, and the MRs were obtained from Luu et al. [27]

MR1 (Inserting a predicted point): The regression line will remain the same after being updated by adding a point selected from the line into the original dataset. Then, we expect the follow-up output to be same as the source output.
MR2 (Inserting the Centroid): The linear regression line will remain the same after being updated by adding the centroid into the original dataset. Therefore, adding the centroid of data into the source input to form a follow-up input will not change the follow-up estimator for the regression form.
MR3 (Reflecting the dependent variable): Reflecting the points over a certain x-axis will reflect the regression line over the same axis. The follow-up estimator becomes a reflection of the source estimator when the sign of the dependent variable is reversed in the follow-up input set.
MR4 (Reflecting an independent variable while keeping the others unchanged): Reflecting the points over the y-axis will reflect the regression line over the same axis. The x-coordinate of an independent variable becomes reflected while the x-coordinate of other independent variables and the y-coordinates remain the same.
MR5 (Scaling the dependent variable): The source data points are scaled in the y-axis by a given factor to generate follow-up data points. Then, we expect the follow-up output to be scaled proportionally by the same factor.
MR6 (Scaling an independent variable while keeping the others unchanged): The source data points are scaled in a particular x axis by a given factor to generate follow-up data points. We expect the slope of the regression line to be scaled reciprocally with respect to that axis.
MR6 (Swapping Samples): Swapping any two data points does not alter the regression hyperplane. Given the source input $I_{s}$ and source output $O_{s}$ , suppose that we swap two data points to define the follow-up input $I_{f}$ and follow-up output $O_{f}$ . Then, the follow-up output $O_{f}$ would be the same as the source input $O_{s}$ .
MR7 (Swapping two independent variables while keeping the others unchanged): Swapping two independent variable points while keeping the others unchanged will only swap the two relevant components of the follow-up estimator.
MR8 (Rotating two independent variables while keeping the others unchanged): The points are rotated in the plane perpendicular to the y-axis; the regression hyperplane will be rotated. Basically, if we rotate the axes of any two independent variables by an angle, the corresponding components of the estimator would also be rotated by the same angle.
MR9 (Shifting the dependent variable): When the points are shifted by a given distance along the y axis, the regression line will be shifted along this axis by the same distance. Therefore, if a constant is added to the values of the dependent variable, the intercept of the new estimator would be increased by the same value.
MR10 (Shifting an independent variable while keeping the others unchanged): When the points are shifted by a given distance along a certain x-axis, the regression line will be shifted in parallel along this axis accordingly. So, if a constant is added into values of an independent variable, the intercept component of the follow-up estimator would be decreased by an amount equal to the product of the constant and the value of the corresponding component of the source estimator.

Appendix A.3. MRs for CNN Program

The MRs used for testing the CNN are discussed below, and the MRs were obtained from Ding et al. [28] and Dwarakanath et al. [43].

MR1: Adding 10% of new images into each category of the training dataset should not affect the classification accuracy. By adding new images to each category of the training dataset, the model should still accurately classify images even if it has not been specifically trained on those exact images.
MR2: Duplicating 10% of images of each category in the training dataset should not affect the classification accuracy. The model should achieve similar accuracy before and after duplicating 10% of images in each category of the training dataset.
MR3: Adding 10% of images into each category of the validation dataset should not affect the classification accuracy. This metamorphic relation tests the robustness of a model to changes in the validation dataset. This MR determines if the model can generalize well to new examples that were not included in the original validation set.
MR4: Adding 10% of images into each category of the test dataset should not affect the classification accuracy. This metamorphic relation tests whether the addition of new images to the test dataset has an impact on the classification accuracy of a trained machine learning model. Specifically, it checks if the model’s accuracy remains the same when 10% more images are added to each category of the test dataset.
MR5: Removing one category of the data from the dataset should not affect the classification accuracy of the remaining categories. This metamorphic relation aims to verify if the machine learning model is able to generalize and classify the remaining categories accurately after one of the categories has been removed from the dataset.
MR6: Adding one category of the diffraction images through duplicating one existing category of data in the dataset should not affect the classification accuracy.
MR7: Permutation of input channels (i.e., RGB channels) for training and test data. In other words, if we apply a permutation to the input channels of the training and test data, the output of the model should be the same regardless of the permutation.
MR8: Permutation of the convolution operation order for the training and test data. The output of the model should be the same regardless of the permutation. This metamorphic relation tests the invariance of a convolutional neural network to the permutation of the convolution operation order. So, if the order of convolutions in a CNN is changed, the output of the model should remain the same.
MR9: Normalizing the test data. In other words, if we normalize the test data using the same statistics as the training data, the output of the model should be the same regardless of whether the test data were normalized or not
MR10: Scaling the test data by a constant. If we scale the test data by a constant factor, the output of the model should be the same regardless of the scaling factor.

References

Mayer, J.; Guderlei, R. An empirical study on the selection of good metamorphic relations. In Proceedings of the 30th Annual International Computer Software and Applications Conference (COMPSAC’06), Chicaco, IL, USA, 17–21 September 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 1, pp. 475–484. [Google Scholar]
Ziegler, C. A Google self-driving car caused a crash for the first time. Verge 2016, 198. [Google Scholar]
Ohnsman, A. Lidar Maker Velodyne ‘Baffled’ By Self-Driving Uber’s Failure to Avoid Pedestrian. Forbes 2018. Available online: https://www.forbes.com/sites/alanohnsman/2018/03/23/lidar-maker-velodyne-baffled-by-self-driving-ubers-failure-to-avoid-pedestrian/ (accessed on 3 August 2024).
Barr, E.T.; Harman, M.; McMinn, P.; Shahbaz, M.; Yoo, S. The oracle problem in software testing: A survey. IEEE Trans. Softw. Eng. 2015, 41, 507–525. [Google Scholar] [CrossRef]
Weyuker, E.J. On testing non-testable programs. Comput. J. 1982, 25, 465–470. [Google Scholar] [CrossRef]
Zhang, J.M.; Harman, M.; Ma, L.; Liu, Y. Machine learning testing: Survey, landscapes and horizons. IEEE Trans. Softw. Eng. 2020, 48, 1–36. [Google Scholar] [CrossRef]
Chen, T.Y.; Kuo, F.C.; Liu, H.; Poon, P.L.; Towey, D.; Tse, T.; Zhou, Z.Q. Metamorphic testing: A review of challenges and opportunities. Acm Comput. Surv. 2018, 51, 4. [Google Scholar] [CrossRef]
Segura, S.; Zhou, Z.Q. Metamorphic testing 20 years later: A hands-on introduction. In Proceedings of the 40th International Conference on Software Engineering: Companion Proceeedings, Gothenburg, Sweden, 27 May–3 June 2018; ACM: New York, NY, USA, 2018; pp. 538–539. [Google Scholar]
Murphy, C.; Kaiser, G.E.; Hu, L. Properties of Machine Learning Applications for Use in Metamorphic Testing; Department of Computer Science Columbia University: New York, NY, USA, 2008. [Google Scholar]
Xie, X.; Ho, J.W.; Murphy, C.; Kaiser, G.; Xu, B.; Chen, T.Y. Testing and validating machine learning classifiers by metamorphic testing. J. Syst. Softw. 2011, 84, 544–558. [Google Scholar] [CrossRef] [PubMed]
Xie, X.; Ho, J.; Murphy, C.; Kaiser, G.; Xu, B.; Chen, T.Y. Application of metamorphic testing to supervised classifiers. In Proceedings of the 2009 Ninth International Conference on Quality Software, Jeju, Republic of Korea, 24–25 August 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 135–144. [Google Scholar]
Gewehr, J.E.; Szugat, M.; Zimmer, R. BioWeka—Extending the Weka framework for bioinformatics. Bioinformatics 2007, 23, 651–653. [Google Scholar] [CrossRef] [PubMed][Green Version]
Zhang, M.; Zhang, Y.; Zhang, L.; Liu, C.; Khurshid, S. Deeproad: Gan-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 2018 33rd IEEE/ACM International Conference on Automated Software Engineering (ASE), Montpellier, France, 3–7 September 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 132–142. [Google Scholar]
Srinivasan, M.; Shahri, M.P.; Kahanda, I.; Kanewala, U. Quality Assurance of Bioinformatics Software: A Case Study of Testing a Biomedical Text Processing Tool Using Metamorphic Testing. In Proceedings of the 2018 IEEE/ACM 3rd International Workshop on Metamorphic Testing (MET), Gothenburg, Sweden, 27 May 2018; pp. 26–33. [Google Scholar]
Huang, D.; Luo, Y.; Li, M. Metamorphic Relations Prioritization and Selection Based on Test Adequacy Criteria. In Proceedings of the 2022 4th International Academic Exchange Conference on Science and Technology Innovation (IAECST), Guangzhou, China, 9–11 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 503–508. [Google Scholar]
Cao, Y.; Zhou, Z.Q.; Chen, T.Y. On the correlation between the effectiveness of metamorphic relations and dissimilarities of test case executions. In Proceedings of the 2013 13th International Conference on Quality Software, Najing, China, 29–30 July 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 153–162. [Google Scholar]
Feng, Y.; Shi, Q.; Gao, X.; Wan, J.; Fang, C.; Chen, Z. Deepgini: Prioritizing massive tests to enhance the robustness of deep neural networks. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, Virtual, 18–22 July 2020; pp. 177–188. [Google Scholar]
Duque-Torres, A. Selecting and Constraining Metamorphic Relations. In Proceedings of the 2024 IEEE/ACM 46th International Conference on Software Engineering: Companion Proceedings, Lisbon, Portugal, 14–20April 2024; pp. 212–216. [Google Scholar]
Nakajima, S. Dataset diversity for metamorphic testing of machine learning software. In Proceedings of the Structured Object-Oriented Formal Language and Method: 8th International Workshop, SOFL+ MSVL 2018, Gold Coast, QLD, Australia, 16 November 2018; Revised Selected Papers 8. Springer: Berlin/Heidelberg, Germany, 2019; pp. 21–38. [Google Scholar]
Srinivasan, M.; Kanewala, U. Metamorphic relation prioritization for effective regression testing. Softw. Test. Verif. Reliab. 2022, 32, e1807. [Google Scholar] [CrossRef]
Riccio, V.; Jahangirova, G.; Stocco, A.; Humbatova, N.; Weiss, M.; Tonella, P. Testing machine learning based systems: A systematic mapping. Empir. Softw. Eng. 2020, 25, 5193–5254. [Google Scholar] [CrossRef]
Ammann, P.; Offutt, J. Introduction to Software Testing; Cambridge University Press: Cambridge, UK, 2016. [Google Scholar]
Clark, P.; Niblett, T. The CN2 induction algorithm. Mach. Learn. 1989, 3, 261–283. [Google Scholar] [CrossRef]
Iglewicz, B.; Hoaglin, D.C. Volume 16: How to Detect and Handle Outliers; Quality Press: Seattle, WA, USA, 1993. [Google Scholar]
Haralick, R.M.; Shanmugam, K.; Dinstein, I.H. Textural features for image classification. IEEE Trans. Syst. Man Cybern. 1973, 6, 610–621. [Google Scholar] [CrossRef]
Aha, D.W.; Kibler, D.; Albert, M.K. Instance-based learning algorithms. Mach. Learn. 1991, 6, 37–66. [Google Scholar] [CrossRef]
Luu, Q.H.; Lau, M.F.; Ng, S.P.; Chen, T.Y. Testing multiple linear regression systems with metamorphic testing. J. Syst. Softw. 2021, 182, 111062. [Google Scholar] [CrossRef]
Ding, J.; Kang, X.; Hu, X.H. Validating a deep learning framework by metamorphic testing. In Proceedings of the 2017 IEEE/ACM 2nd International Workshop on Metamorphic Testing (MET), Buenos Aires, Argentina, 22–22 May 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 28–34. [Google Scholar]
Shen, W.; Wan, J.; Chen, Z. Munn: Mutation analysis of neural networks. In Proceedings of the 2018 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C), Lisbon, Portugal, 6–20 July 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 108–115. [Google Scholar]
Jahangirova, G.; Tonella, P. An empirical evaluation of mutation operators for deep learning systems. In Proceedings of the 2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST), Porto, Portugal, 24–28 October 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 74–84. [Google Scholar]
Ma, L.; Zhang, F.; Sun, J.; Xue, M.; Li, B.; Juefei-Xu, F.; Xie, C.; Li, L.; Liu, Y.; Zhao, J.; et al. Deepmutation: Mutation testing of deep learning systems. In Proceedings of the 2018 IEEE 29th International Symposium on Software Reliability Engineering (ISSRE), Memphis, TN, USA, 15–18 October 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 100–111. [Google Scholar]
Pei, K.; Cao, Y.; Yang, J.; Jana, S. Deepxplore: Automated whitebox testing of deep learning systems. In Proceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, 28 October 2017; pp. 1–18. [Google Scholar]
Gay, G.; Staats, M.; Whalen, M.; Heimdahl, M.P. Automated oracle data selection support. IEEE Trans. Softw. Eng. 2015, 41, 1119–1137. [Google Scholar] [CrossRef]
Elbaum, S.; Malishevsky, A.G.; Rothermel, G. Test case prioritization: A family of empirical studies. IEEE Trans. Softw. Eng. 2002, 28, 159–182. [Google Scholar] [CrossRef]
Malishevsky, A.G.; Ruthruff, J.R.; Rothermel, G.; Elbaum, S. Cost-cognizant test case prioritization. In Citeseer; Technical Report; University of Nebraska: Lincoln, NE, USA, 2006. [Google Scholar]
Elbaum, S.; Rothermel, G.; Kanduri, S.; Malishevsky, A.G. Selecting a cost-effective test case prioritization technique. Softw. Qual. J. 2004, 12, 185–210. [Google Scholar] [CrossRef]
Good, P. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
Sun, C.A.; Liu, B.; Fu, A.; Liu, Y.; Liu, H. Path-directed source test case generation and prioritization in metamorphic testing. J. Syst. Softw. 2022, 183, 111091. [Google Scholar] [CrossRef]
Hui, Z.W.; Huang, S.; Li, H.; Liu, J.H.; Rao, L.P. Measurable metrics for qualitative guidelines of metamorphic relation. In Proceedings of the 2015 IEEE 39th Annual Computer Software and Applications Conference, Taichung, Taiwan, 1–5 July 2015; IEEE: Piscataway, NJ, USA, 2015; Volume 3, pp. 417–422. [Google Scholar]
Liu, S.; Yan, S.; Yang, X. A Similarity-based Metamorphic Relations Selection Strategy for Numerical Computation Programs. In Proceedings of the 2022 4th International Conference on Frontiers Technology of Information and Computer (ICFTIC), Qingdao, China, 2–4 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 290–294. [Google Scholar]
Xie, X.; Li, Z.; Chen, J.; Zhang, Y.; Wang, X.; Kwaku Kudjo, P. MUT Model: A metric for characterizing metamorphic relations diversity. Softw. Qual. J. 2024, 1–43. [Google Scholar] [CrossRef]
Segura, S.; Fraser, G.; Sanchez, A.B.; Ruiz-Cortés, A. A survey on metamorphic testing. IEEE Trans. Softw. Eng. 2016, 42, 805–824. [Google Scholar] [CrossRef]
Dwarakanath, A.; Ahuja, M.; Sikand, S.; Rao, R.M.; Bose, R.J.C.; Dubash, N.; Podder, S. Identifying implementation bugs in machine learning based image classifiers using metamorphic testing. In Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, Amsterdam, The Netherlands, 16–21 July 2018; pp. 118–128. [Google Scholar]

Figure 2. Rules generated using CN2 algorithm for source and follow-up test cases of an MR.

Figure 3. An example of data distribution metric. Step1 and step2 involve generating source and follow-up test cases. In step3, 1.5 and 0.43 are the kurtosis and skew values calculated for the example source and follow-up test cases. In step4, 119,247 and 134,854 are the mean and variance calculated for the ST and FT, respectively. In step5, 119,428 and 134,854 are the sum of the value calculated in step3 and step4 for ST and FT, respectively. In step6, the diversity of MR is calculated.

Figure 4. Fault detection effectiveness for IBk, Naive Bayes, linear regression, and convolutional neural network.

Table 1. Mutation operators.

Mutation Operator	Description	Type
Data Repetition	Duplicate a portion of training data	Source Level
Label Error	Change the label for data	Source Level
Data Missing	Remove part of training data	Source Level
Data Shuffle	Shuffle training data	Source Level
Noise Perturbation	Add noise to training data	Source Level
Layer Addition	Add a layer to DNNs structure	Model Level
Layer Removal	Delete a layer of DNN	Model Level
Layer Deactivation	Deactivate the effects of a layer	Model Level
Weight Shuffling	Shuffle the weights of neuron’s connections to the previous layer	Model Level

Table 2. Mutants generated for the SUTs.

Subject	# of Mutants
Linear Regression	220
IBk	220
Naive Bayes	220
CNN- MNIST Dataset	292
CNN- Fashion Dataset	292

Table 3. Validation setup.

Subject	$T_{sp}$	$F_{v}$
IBk	Ecoli	Source Level Mutants
Linear Regression	YearPrediction	Source Level Mutants
Naive Bayes	Adult	Source Level Mutants
CNN	MNIST	Source and Model level Mutants
CNN	Fashion	Source and Model level Mutants

Table 4. Relative improvement in fault detection effectiveness of the data diversity-based approach compared to a random baseline for the SUT.

(a) IBk
MR Set-Size	Rule-Based	Data Distribution	Anomaly-Based	Clustering-Based
1	129% *	183% *	48% *	−37%
2	98% *	98% *	4% *	98% *
3	68% *	68% *	27% *	67% *
4	52% *	52% *	52% *	52% *
5	37% *	37% *	37% *	37% *
6	26% *	26% *	26% *	26% *
7	17% *	17% *	17% *	17% *
8	8% *	8% *	8% *	8% *
9	6% *	6% *	6% *	6% *
10	4%	4%	4%	4%
11	0%	0%	0%	0%
(b) Linear Regression
MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Rule Based
1	114% *	65% *	114% *	20%
2	38% *	40%	40% *	−0.14%
3	17% *	18% *	18% *	3% *
4	1% *	2% *	1% *	0.41% *
5	0.20%	0.71%	0.71% *	0.71% *
6	0.20%	0.51%	0.51% *	0.51%
7	0.20% *	0%	0%	0%
8	0	0%	0%	0%
9	0	0%	0%	0%
10	0	0%	0%	0%
(c) Naive Bayes
MR Set-Size	Rule-Based	Anomaly-Based	Clustering-Based	Data Distribution
1	38.02% *	40.84% *	40.84% *	40.84 *
2	8.69% *	8.69% *	8.69% *	8.69 *
3	2.04% *	2.04% *	2.04% *	2.04% *
4	1.01% *	1.01% *	1.01% *	1.01*
5	1.01% *	1.01% *	1.01% *	1.01*
6	0%	0%	0%	0%
7	0%	0%	0%	0%
8	0%	0%	0%	0%
9	0%	0%	0%	0%
10	0%	0%	0%	0%
11	0%	0%	0%	0%
(d) Convolutional Neural Network using MNIST Dataset
MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Feature Based
1	185% *	457% *	457% *	457% *
2	42% *	178%	178% *	178% *
3	5%	105% *	105%	105% *
4	62% *	63% *	63% *	63% *
5	34% *	34% *	34% *	34% *
6	25%	26% *	26% *	26% *
7	14% *	15% *	15% *	15% *
8	5%	5% *	5% *	5% *
9	0%	0%	0%	0%
10	0%	0%	0%	0%
(e) Convolutional Neural Network using Fashion Dataset
MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Feature Based
1	108% *	-33%	16% *	27% *
2	42% *	30% *	16% *	6% *
3	15% *	8%	15%	−11%
4	7% *	7% *	9% *	−5%
5	4% *	6% *	5% *	5% *
6	3% *	4% *	4% *	5% *
7	2%	3%	3% *	4% *
8	1%	1%	1%	2%
9	0%	0%	1%	1%
10	0%	0%	0%	0%

Table 5. Average percentage of fault detected for subject programs.

Metrics	IBk	Linear Regression	Naive Bayes	CNN—MNIST Dataset	CNN—Fashion Dataset
Data Distribution	0.85	0.89	0.90	0.39	0.88
Rule-based	0.98	0.92	0.93	-	-
Clustering-based	0.91	0.97	0.95	0.39	0.87
Anomaly-based	0.72	0.95	0.92	0.36	0.93
Feature-based	-	-	-	0.35	0.85
Fault-based	0.98	0.97	0.95	0.39	0.95
Neuron Activation Coverage-based	-	-	-	0.29	0.84
Random-based	0.72	0.78	0.67	0.21	0.83

Table 6. Comparison of time taken to detect a fault using the data diversity-based, random baseline, fault-based, and neuron activation coverage-based approaches to SUT.

	IBk	Linear Regression	Naive Bayes	CNN—MNIST Dataset	CNN—Fashion Dataset
Data diversity-based	3.30s	133.29 s	2.90 s	5885.67 s	6923.21 s
Random-based	6.39 s	350.54 s	5.34 s	9032.5 s	9892.57 s
Fault-based	2.51 s	150.69 s	2.42 s	6275.08 s	6279.43 s
Neuron activation coverage-based	-	-	-	7032.50 s	7278.70 s
Average % time reduction of data diversity-based vs. random-based	48.30%	62.96%	48.36%	43.92%	54.13%
Average % time reduction of data diversity-based vs. fault-based	−25.64%	13.04%	−16.55%	−6.61%	−9.30%
Average % time reduction of data diversity-based vs. neuron activation coverage-based	-	-	-	14.98%	5.13%

Table 7. Time taken to prioritize MRs using our proposed approaches.

Subject	Ruled Based	Anomaly-Based	Clustering-Based	Data Distribution-Based	Feature-Based	Fault-Based
IBk	726 s	220 s	220 s	220 s	-	48,400 s
Linear Regression	72,000 s	2400 s	1200 s	1200 s	-	132,000 s
Naive Bayes	88 s	72 s	198 s	62 s	-	15,000 s
CNN—MNIST Dataset	-	792 s	2640 s	2640 s	2640 s	242,000 s
CNN—Fashion Dataset	-	600 s	3600 s	1800 s	3600 s	900,000 s

Table 8. Time taken to generate the metrics for SUT.

SUT	Rule-Based	Anomaly-Based	Clustering-Based	Data Distribution	Feature-Based
IBk	30 s	10 s	10 s	10 s	-
Linear Regression	3600 s	120 s	60 s	10 s	-
Naive Bayes	8 s	6 s	18 s	5 s	-
CNN—MNIST Dataset	-	360 s	120 s	60 s	80 s
CNN—Fashion Dataset	-	60 s	360 s	180 s	360 s

Table 9. Relative improvement in fault detection effectiveness of the data diversity-based approach compared to the fault-based baseline for the SUT.

(a) IBk
MR Set-Size	Rule-Based	Data Distribution	Anomaly-Based	Clustering-Based
1	−19%	0%	−47%	−77% *
2	0%	0%	−47% *	0%
3	0%	0%	−23% *	0%
4	0%	0%	0%	0%
5	0%	0%	0%	0%
6	0%	0%	0%	0%
7	0%	0%	0%	0%
8	0%	0%	0%	0%
9	0%	0%	0%	0%
10	0%	0%	0%	0%
11	0%	0%	0%	0%
(b) Linear Regression
MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Rule-Based
1	29% *	0%	29% *	−12.37% *
2	0%	1% *	1%	0%
3	0.2%	1% *	1%	0%
4	0.4% *	1% *	1%	0%
5	−0.5% *	0%	0%	0%
6	−0.3%	0%	0%	0%
7	−0.2%	0%	0%	0%
8	0	0%	0%	0%
9	0	0%	0%	0%
10	0	0%	0%	0%
(c) Naive Bayes
MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Rule-Based
1	0%	0%	0%	−2.04% *
2	0%	0%	0%	−2.0% *
3	0%	0%	0%	0%
4	0%	0%	0%	0%
5	0%	0%	0%	0%
6	0%	0%	0%	0%
7	0%	0%	0%	0%
8	0%	0%	0%	0%
9	0%	0%	0%	0%
10	0%	0%	0%	0%
11	0%	0%	0%	0%
(d) Convolutional Neural Network Using MNIST Dataset
MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Feature-Based
1	−48% *	0%	0%	0%
2	−48% *	0%	0%	0%
3	−48%	0%	0%	0%
4	0%	0%	0%	0%
5	0%	0%	0%	0%
6	0%	0%	0%	0%
7	0%	0%	0%	0%
8	0%	0%	0%	0%
9	0%	0%	0%	0%
10	0%	0%	0%	0%
(e) Convolutional Neural Network using Fashion Dataset
MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Feature-Based
1	0%	−68% *	−44% *	28% *
2	0%	−7% *	−29% *	6% *
3	−5% *	−10% *	−5% *	−11% *
4	−5%	−5%	−3% *	−5%
5	−4% *	−2% *	−3% *	5% *
6	−3% *	−1%	−2% *	5% *
7	−1%	−1%	−1%	4% *
8	−1%	0%	−1%	2% *
9	0%	0%	0%	1% *
10	0%	0%	0%	0%

Table 10. Relative improvement in fault detection effectiveness of data diversity-based ordering over the neuron activation coverage-based approach for convolutional neural network using MNIST dataset.

MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Feature-Based
1	100% *	100% *	100% *	100% *
2	233% *	550%	550% *	550% *
3	100% *	290% *	290% *	290% *
4	95% *	95% *	95% *	95% *
5	77% *	77% *	77% *	77% *
6	77% *	77% *	77% *	77% *
7	39% *	39% *	39%	39% *
8	14% *	14% *	14% *	14% *
9	5% *	5% *	5% *	5% *
10	0%	0%	0%	0%

Table 11. Relative improvement in fault detection effectiveness of data diversity-based ordering over the neuron activation coverage-based approach for convolutional neural network using Fashion dataset.

MR Set-Size	Anomaly-Based	Clustering-Based	Data Distribution	Feature-Based
1	87% *	−40%	5% *	15% *
2	37% *	26%	−3%	3% *
3	7% *	1%	7% *	−17%
4	5%	5% *	8% *	−6%
5	5% *	7% *	6% *	6% *
6	4%	5% *	5%	6% *
7	3% *	4% *	4% *	5% *
8	1%	1%	1% *	0%
9	0%	1%	0%	0%
10	0	0%	0%	0%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Srinivasan, M.; Kanewala, U. Improving Early Fault Detection in Machine Learning Systems Using Data Diversity-Driven Metamorphic Relation Prioritization. Electronics 2024, 13, 3380. https://doi.org/10.3390/electronics13173380

AMA Style

Srinivasan M, Kanewala U. Improving Early Fault Detection in Machine Learning Systems Using Data Diversity-Driven Metamorphic Relation Prioritization. Electronics. 2024; 13(17):3380. https://doi.org/10.3390/electronics13173380

Chicago/Turabian Style

Srinivasan, Madhusudan, and Upulee Kanewala. 2024. "Improving Early Fault Detection in Machine Learning Systems Using Data Diversity-Driven Metamorphic Relation Prioritization" Electronics 13, no. 17: 3380. https://doi.org/10.3390/electronics13173380

APA Style

Srinivasan, M., & Kanewala, U. (2024). Improving Early Fault Detection in Machine Learning Systems Using Data Diversity-Driven Metamorphic Relation Prioritization. Electronics, 13(17), 3380. https://doi.org/10.3390/electronics13173380

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Improving Early Fault Detection in Machine Learning Systems Using Data Diversity-Driven Metamorphic Relation Prioritization

Abstract

1. Introduction

2. Background

2.1. Metamorphic Testing

2.2. Machine Learning

2.3. Mutation Testing

2.4. Dataset Diversity

3. Proposed Approach

3.1. Dataset Diversity Approach

3.1.1. Rule Based Classifier

3.1.2. Anomaly Detection

3.1.3. Clustering-Based

3.1.4. Data Distribution

3.1.5. Feature-Based

4. Evaluation Setup

4.1. Subject Programs

4.2. Metamorphic Relations

4.3. Source and Follow-Up Test Cases

4.4. Mutant Generation

5. Evaluation Methodology

5.1. Evaluation Procedure

5.2. Evaluation Measures

6. Result and Analysis

6.1. RQ1: Comparison of MR Prioritization Approaches versus Random Baseline

6.1.1. Fault Detection Effectiveness

6.1.2. Effective Number of MRs Used for Testing

6.1.3. Average Percentage of Faults Detected

6.1.4. The Average Time Taken to Detect a Fault

6.2. RQ2: Comparison of the Proposed MR Prioritization Approach against the Fault-Based Approach

6.2.1. Fault Detection Effectiveness

6.2.2. Effective Number of MRs Used for Testing

6.2.3. The Average Time Taken to Detect a Fault

6.2.4. Average Percentage of Faults Detected

6.3. RQ3: Comparison of the Proposed Metric with the Approach Based on Neuron Activation Cover

6.3.1. Fault Detection Effectiveness

6.3.2. Average Percentage of Faults Detected

7. Discussion

8. Threats to Validity

9. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A. MRs Used for Testing the Subject Programs

Appendix A.1. MRs for IBk and Naive Bayes Program

Appendix A.2. MRs for Linear Regression

Appendix A.3. MRs for CNN Program

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI