1. Introduction
Real-world training data often contain noise (or errors), which can be categorized into two main types: label error and feature error [1, 2, 3, 4, 5]. Label error arises when the class label of an instance is incorrect, while feature error arises when the features of an instance are corrupted. Noise is introduced for various reasons. For example, sensor-based applications (such as WSN and IoT) may produce noise due to the intrinsic instability of sensors [6, 7]. In addition, big data further contribute to the emergence of noise [8]. When training data are noisy, the performance of learning based on them is degraded. These two types of error have been studied individually in many works. This work focuses on label error.
Label errors are mainly caused by the subjective nature of the labeling task and by a lack of information for determining the true label. Domain experts usually provide labels based on their heuristics and domain knowledge. Crucially, mislabeling cannot be entirely avoided even with thorough inspection by domain experts. It commonly occurs when multiple domain experts fail to reach a consensus during the annotation process. Mislabeling is very common in domains requiring rapid development, such as bioinformatics. For example, in a study on breast tumor [9], there existed nine subjective mislabelings among forty-nine features in the training data. Furthermore, mislabeling is also caused by insufficient information available to the expert [10, 11]. An example is the unavailability of certain test results: physicians cannot confidently reach a crisp diagnosis in the presence of partial information.
The existence of mislabeled data usually degrades the performance of learning [12, 13, 14, 15, 16, 17]. In general, the goal of a learning algorithm is to search for the best hypothesis in its hypothesis space. In supervised learning, the best hypothesis is usually determined by the correlations between the features and the labels of the training data. Therefore, the search for the best hypothesis is influenced by mislabeled data, which can result in selecting a non-optimal hypothesis. A non-optimal hypothesis can bring a set of negative effects, including reduced classification accuracy and increased classifier construction time and complexity.
The approaches dealing with mislabeled data are categorized into two main groups: robust algorithm design [18, 19, 20, 21, 22] and noise filtering [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]. The first group focuses on developing algorithms that cope with noisy data during model training. The second group focuses on identifying and filtering mislabeled data before training. Evidence exists that it is usually difficult to develop a robust algorithm that is insensitive to noisy data. Furthermore, mislabeled data have been shown to have a severe impact on such approaches, even when the design is claimed to be robust. In comparison, filter based approaches have a significant performance advantage over robust algorithms. The core contribution of this work is in the area of filter based approaches.
Several filter based approaches are used to deal with mislabeled data, among which the ensemble learning based filter (EnFilter) is widely used owing to its promising performance [23, 24, 30, 33]. EnFilter differs from other filters in that it employs multiple classifiers and identifies noise based on their votes.
According to the adopted voting mechanism, EnFilter consists of two types: single-voting based (SVFilter) and multiple-voting based (MVFilter). The SVFilter detects noise based on only one round of voting by multiple classifiers and therefore has a potential instability problem.
To solve this instability problem, an MVFilter was proposed in our previous work [40]. In essence, an MVFilter consists of a set of SVFilters (assume this number is t). For a training instance, if at least m ($m \le t$) SVFilters treat it as noisy, then the MVFilter regards it as noisy. Because the SVFilters differ from one another, fusing their results allows the MVFilter to improve noise detection stability and accuracy compared to a single SVFilter. In the design of the MVFilter, one of the key issues is how to define the value of m (called the decision point), which actually defines the noise detection rule. An optimal decision point (ODP) could maximize the performance of an MVFilter. In [40], the decision points were empirically explored with different representative values. However, a systematic approach to determining the ODP is lacking.
To this end, a novel approach is proposed in this work to compute the ODP for an MVFilter. Instead of only considering the number of errors, our approach takes cost information into account because many applications have unequal costs for various errors. When a cost matrix containing various cost values is given, the ODP selected by our approach is able to identify the noises that minimize the expected cost.
The core idea of our approach is as follows: first, estimate the mislabeled data distribution in the noisy training dataset; second, estimate the expected cost of each possible decision point; and finally, determine the optimal decision point by minimizing the expected cost.
We tested our approach based on a set of MVFilters. The experimental results show that our approach can significantly improve the performance of existing MVFilters. Our approach consistently works well for different datasets and different cost matrices. In addition, our approach is effective and straightforward. Only a few predefined parameters and prior knowledge are required.
In the next section, we briefly review ensemble learning based noise filters. Section 3 analyzes the performance of the MVFilter when costs are considered. Our novel approach is presented in Section 4. The experimental evaluations are presented in Section 5. Section 6 concludes this work and presents future work.
2. Related Work
This work presents an approach to improve multiple-voting based ensemble filters for mislabeled data recognition. As necessary background, conventional ensemble learning based filters (EnFilter) are introduced first. Then, the multiple-voting based filter (MVFilter) is presented.
EnFilter employs an ensemble classifier to detect mislabeled instances by constructing a set of base level classifiers and then using their classifications to identify mislabeled instances. The general approach is to tag an instance as mislabeled if x of the m base level classifiers cannot classify it correctly. The majority filter (MF) and consensus filter (CF) are the representative EnFilter algorithms [27, 28]. MF tags an instance as mislabeled if more than half of the m base level classifiers classify it incorrectly. CF tags an instance as mislabeled, and eliminates it from the training data, only if all base level classifiers fail to classify it as the class given by its training label.
The reason for employing ensemble classifiers in EnFilter is that an ensemble classifier performs better than each of its base level classifiers on a dataset if two conditions hold: (1) the probability of a correct classification by each individual classifier is greater than 0.5, and (2) the errors in the predictions of the base level classifiers are independent.
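The effect of these two conditions can be checked numerically. The following minimal sketch computes the probability that a strict majority of independent base classifiers is correct; the function name `majority_vote_accuracy` is hypothetical, and independence of the classifiers is assumed exactly as in condition (2).

```python
from math import comb


def majority_vote_accuracy(p: float, n_classifiers: int) -> float:
    """Probability that more than half of n independent classifiers,
    each correct with probability p, classify an instance correctly."""
    k_min = n_classifiers // 2 + 1  # smallest strict majority
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )
```

For three classifiers that are each correct with probability 0.7, the majority is correct with probability 0.784. If each is correct with probability only 0.4, the majority accuracy drops to 0.352, which is below that of a single classifier, so condition (1) matters.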
Algorithm 1 lists the majority filter (MF) algorithm as a representative EnFilter algorithm. In Step 1, the training set E is partitioned into n disjoint subsets $E_i$. In Step 2, the set A of detected noisy examples is initialized to the empty set. The main loop in Steps 3–16 processes each subset $E_i$ iteratively. Step 4 forms the set $E_t$, which contains all examples from E except those in $E_i$. In Step 6, the examples in $E_t$ are used by each inductive learning algorithm $A_j$ to induce a hypothesis (a classifier) $H_j$. In Step 14, the examples from $E_i$ that are misclassified by the majority of the hypotheses are added to A as potentially noisy examples. CF is more conservative than MF because of its stricter condition for noise identification, which ultimately results in fewer instances being eliminated from the training set. Accordingly, CF differs from MF in Step 14: it considers an example in $E_i$ as noisy only if it is classified incorrectly by all of the hypotheses. As a consequence, CF has the risk of retaining bad data.
As Algorithm 1 shows, the core of EnFilter is a voting mechanism for recognizing noise. A training instance x that falls in subset $E_i$ after E is partitioned is voted on by the multiple classifiers trained on the data in $E \setminus E_i$. Suppose y(x) is the function that determines whether x is mislabeled; then $y(x) = \mathrm{vote}(\mathrm{classifiers}(E \setminus E_i), x)$. Like MF and CF, the conventional EnFilter decides y(x) based on only one round of voting and is therefore a single-voting based filter (SVFilter).
As pointed out in [40], the SVFilter suffers from an instability problem. For an instance x, if the SVFilter runs twice, the first random partitioning might assign x to subset $E_i$, while the second might assign x to subset $E_j$. Therefore, we have $y(x) = \mathrm{vote}(\mathrm{classifiers}(E \setminus E_i), x)$ in the first run and $y(x) = \mathrm{vote}(\mathrm{classifiers}(E \setminus E_j), x)$ in the second. Note that as $E \setminus E_i$ and $E \setminus E_j$ differ, the voting results of the two SVFilters might be different. Therefore, instead of one-time voting, multiple-voting based filters (MVFilter) have been proposed to address this instability problem.
An MVFilter consists of t SVFilters. Each SVFilter generates its own decision about the suspected mislabeled data, i.e., an index set $A_i$. Finally, all the different decisions $A_1, \ldots, A_t$ are combined by the MVFilter to output the final decision about which data are mislabeled. Therefore, the decision function of the MVFilter can be described as $y(x) = \mathrm{vote}_{MV}(\mathrm{vote}_{SV}(E \setminus E_1), \mathrm{vote}_{SV}(E \setminus E_2), \ldots, \mathrm{vote}_{SV}(E \setminus E_t))$. In this function, $\mathrm{vote}_{SV}$ is the voting policy used by each SVFilter; $\mathrm{vote}_{MV}$ is the voting policy used by the MVFilter; and $E_i$ is the subset containing x obtained from the i-th SVFilter. Usually, the $\mathrm{vote}_{SV}$ policy is either majority voting or consensus voting. For the $\mathrm{vote}_{MV}$ policy, we have developed three policies: majority voting, consensus voting, and one-time veto. One-time veto means that if at least one SVFilter tags an instance as mislabeled, then the MVFilter tags it as mislabeled. In the MVFilter, different $\mathrm{vote}_{SV}$ and $\mathrm{vote}_{MV}$ policies can be combined to make various algorithms. As an example of an MVFilter, the MFMF algorithm [40] is presented in Algorithm 2, which uses majority voting for both $\mathrm{vote}_{SV}$ and $\mathrm{vote}_{MV}$.
Algorithm 1 Majority filtering algorithm.
Algorithm: majority filtering (MF)
Input: E (training set)
Parameter: n (number of subsets), y (number of learning algorithms), $A_1, A_2, \ldots, A_y$ (y kinds of learning algorithms)
Output: A (detected noisy subset of E)
(1) form n disjoint almost equally sized subsets $E_i$ of E, where $\bigcup_i E_i = E$
(2) $A \leftarrow \emptyset$
(3) for i = 1, …, n do
(4)   form $E_t \leftarrow E \setminus E_i$
(5)   for j = 1, …, y do
(6)     induce $H_j$ based on examples in $E_t$ and $A_j$
(7)   end for
(8)   for every $e \in E_i$ do
(9)     ErrorCounter $\leftarrow 0$
(10)    for j = 1, …, y do
(11)      if $H_j$ incorrectly classifies e
(12)      then ErrorCounter $\leftarrow$ ErrorCounter + 1
(13)    end for
(14)    if ErrorCounter $> y/2$, then $A \leftarrow A \cup \{e\}$
(15)  end for
(16) end for
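The single-voting filter described by Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `ensemble_filter`, its `rule` argument, and the two toy learners `nearest_centroid` and `majority_class` are hypothetical stand-ins for the learning algorithms, and each learner is assumed to be a function that takes training data and returns a prediction function.

```python
import random
from collections import Counter


def nearest_centroid(X, y):
    """Hypothetical learner: predict the class whose feature mean is closest."""
    groups = {}
    for x, label in zip(X, y):
        groups.setdefault(label, []).append(x)
    centroids = {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in groups.items()
    }

    def predict(x):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], x)),
        )

    return predict


def majority_class(X, y):
    """Hypothetical learner: always predict the most frequent training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def ensemble_filter(X, y, learners, n_subsets=3, rule="majority", rng=None):
    """Return the indices tagged as mislabeled (the set A in Algorithm 1)."""
    if rule not in ("majority", "consensus"):
        raise ValueError("rule must be 'majority' or 'consensus'")
    rng = rng or random.Random(0)
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[i::n_subsets] for i in range(n_subsets)]   # Step 1
    noisy = set()                                            # Step 2
    for fold in folds:                                       # Step 3
        held_out = set(fold)
        train = [k for k in idx if k not in held_out]        # Step 4
        Xt, yt = [X[k] for k in train], [y[k] for k in train]
        hyps = [learn(Xt, yt) for learn in learners]         # Steps 5-7
        for e in fold:                                       # Step 8
            errors = sum(h(X[e]) != y[e] for h in hyps)      # Steps 9-13
            if rule == "majority":
                flagged = errors > len(hyps) / 2             # Step 14 (MF)
            else:
                flagged = errors == len(hyps)                # Step 14 (CF)
            if flagged:
                noisy.add(e)
    return noisy
```

Passing `rule="consensus"` gives the CF variant, which can only flag a subset of what the MF variant flags for the same partition and learners.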
Algorithm 2 MFMF algorithm.
MajorityFiltering_MajorityFiltering (MFMF)
Input: E (training set)
Parameter: n (number of subsets), y (number of learning algorithms), t (number of times of subsets partitioning), $A_1, A_2, \ldots, A_y$ (y kinds of learning algorithms)
Output: A (detected noisy subset of E)
(1) for p = 1, …, t do
(2)   form n disjoint almost equally sized subsets $E_i$ of E, where $\bigcup_i E_i = E$
(3)   $A^{(p)} \leftarrow \emptyset$
(4)   for i = 1, …, n do
(5)     form $E_t \leftarrow E \setminus E_i$
(6)     for j = 1, …, y do
(7)       induce $H_j$ based on examples in $E_t$ and $A_j$
(8)     end for
(9)     for every $e \in E_i$ do
(10)      ErrorCounter $\leftarrow 0$
(11)      for j = 1, …, y do
(12)        if $H_j$ incorrectly classifies e
(13)        then ErrorCounter $\leftarrow$ ErrorCounter + 1
(14)      end for
(15)      if ErrorCounter $> y/2$, then $A^{(p)} \leftarrow A^{(p)} \cup \{e\}$
(16)    end for
(17)  end for
(18) end for
(19) $A \leftarrow \emptyset$
(20) for every $e \in E$ do
(21)   ErrorCounter $\leftarrow 0$
(22)   for j = 1, …, t do
(23)     if $e \in A^{(j)}$
(24)     then ErrorCounter $\leftarrow$ ErrorCounter + 1
(25)   end for
(26)   if ErrorCounter $> t/2$, then $A \leftarrow A \cup \{e\}$
(27) end for
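The fusion stage of Algorithm 2 (Steps 19 to 27) generalizes naturally to an arbitrary decision point m. The sketch below is an illustration under assumptions: `multiple_voting_filter` is a hypothetical name, and `sv_filter` is assumed to be any callable that performs one randomly partitioned single-voting run and returns the set of indices it tags as mislabeled.

```python
import random
from collections import Counter
from typing import Callable, Set


def multiple_voting_filter(
    sv_filter: Callable[[random.Random], Set[int]],
    t: int,
    m: int,
    seed: int = 0,
) -> Set[int]:
    """Run a single-voting filter t times and keep the indices
    that are flagged in at least m of the runs."""
    if not 1 <= m <= t:
        raise ValueError("decision point m must satisfy 1 <= m <= t")
    votes = Counter()
    for p in range(t):
        rng = random.Random(seed + p)  # a fresh random partition per run
        votes.update(sv_filter(rng))   # one vote per flagged index
    return {i for i, v in votes.items() if v >= m}
```

With this formulation, one-time veto corresponds to m = 1, consensus voting to m = t, and the strict majority rule of Algorithm 2 to m = ⌊t/2⌋ + 1.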
3. Analysis of Decision Point, Error Probability, and Cost for MVFilter
The multiple-voting based filter (MVFilter) consists of several single-voting based filters (SVFilter). The MVFilter treats data as mislabeled if at least m out of t SVFilters identify these data as mislabeled. Obviously, for different m values, the recognized noises by an MVFilter will be different. The selection of the m value plays an important role in an MVFilter. Because the m value decides the noise identifying results, it is called the “decision point” in this work. Our goal is to find a way to decide the “optimal decision point” to maximize the performance of MVFilter.
When a filter works on a noisy training dataset, it is usually hard to recognize all the noise perfectly. A filter can make two types of error. The first type (E1) occurs when a correctly labeled example is declared mislabeled and is subsequently discarded. The second type (E2) occurs when a mislabeled example is declared correctly labeled and is therefore retained. For a well designed filter, it is desirable to avoid both E1 and E2 errors. However, conceptually, E1 and E2 are conflicting: to reduce E1 errors, the filter should adopt a stricter noise detection policy, which tends to increase E2 errors. In an MVFilter, the selection of the decision point influences the probability of making an E1 or E2 error.
3.1. Relationship between the Decision Point and Error Probability in MVFilter
An MVFilter fuses the noise detection results of multiple SVFilters, while an SVFilter fuses the classification results of multiple classifiers. Therefore, the errors made by each classifier are the basis to infer the errors made by an MVFilter.
Let $P(E1_i)$ and $P(E2_i)$ be the probabilities that classifier i makes an E1 and an E2 error, respectively. To simplify the analysis, it is assumed that all the classifiers in an SVFilter have the same probability of making an error. Therefore, we assume that $P(E1_i) = P(E1)$ and $P(E2_i) = P(E2)$ for every classifier i. The most commonly used SVFilters include the majority filter (MF) and consensus filter (CF). The analysis here is based on MF, while a similar analysis can be conducted for CF.
MF makes an E1 (or E2) error when more than half of these classifiers make an E1 (or E2) error. If the number of classifiers in MF is y, then we have:
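Under the equal-probability assumption above, and additionally assuming that the y classifiers err independently (condition (2) in Section 2), the error probabilities of a single MF can be written as binomial tails. The symbols $P(E1_{MF})$ and $P(E2_{MF})$ are notation assumed here for the probabilities that an MF makes an E1 and an E2 error:

```latex
\begin{align}
P(E1_{MF}) &= \sum_{j=\lfloor y/2 \rfloor + 1}^{y} \binom{y}{j}\, P(E1)^{j}\, \bigl(1 - P(E1)\bigr)^{y-j} \\
P(E2_{MF}) &= \sum_{j=\lfloor y/2 \rfloor + 1}^{y} \binom{y}{j}\, P(E2)^{j}\, \bigl(1 - P(E2)\bigr)^{y-j}
\end{align}
```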
Suppose an MVFilter consists of t majority filters (MMF). Let $P(E1_{MF_k})$ and $P(E2_{MF_k})$ denote the probabilities that the k-th MF makes an E1 and an E2 error, respectively. To simplify the analysis, it is assumed that $P(E1_{MF_k}) = P(E1_{MF})$ and $P(E2_{MF_k}) = P(E2_{MF})$ for every k. The decision rule of an MVFilter is “if at least m of the t SVFilters consider an instance mislabeled, then the instance is identified as mislabeled”. This m value, called the decision point, influences the probabilities of making an error for an MVFilter. Let MMF represent an MVFilter consisting of multiple majority filters; then the following relationships can be found:
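One consistent way to write these relationships, assuming additionally that the t MFs err independently, is sketched below. Here $P(E1_{MF})$ and $P(E2_{MF})$ denote the (assumed common) error probabilities of a single MF, and $P(E1_{MMF})$ and $P(E2_{MMF})$ are notation assumed for the error probabilities of the MMF. A noise-free instance is discarded when at least m MFs wrongly flag it, while a mislabeled instance is retained when fewer than m MFs flag it, i.e., when at least t − m + 1 MFs miss it:

```latex
\begin{align}
P(E1_{MMF}) &= \sum_{j=m}^{t} \binom{t}{j}\, P(E1_{MF})^{j}\, \bigl(1 - P(E1_{MF})\bigr)^{t-j} \\
P(E2_{MMF}) &= \sum_{j=t-m+1}^{t} \binom{t}{j}\, P(E2_{MF})^{j}\, \bigl(1 - P(E2_{MF})\bigr)^{t-j}
\end{align}
```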
The decision point value m can be any number between one and t. Among all possible values, the representative decision points include m = 1, m = t/2, and m = t. When m = 1, an instance is identified as mislabeled if at least one SVFilter considers it mislabeled. When m = t, an instance is identified as mislabeled only if all t SVFilters consider it mislabeled. Conceptually, the noise detection rule is too loose if the decision point is one, while the rule is too strict if the decision point is t. In this sense, m = t/2 is usually moderate. For these three representative decision points, we have the following relationships:
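Assuming that the t MFs err independently, with $P(E1_{MF})$ and $P(E2_{MF})$ the error probabilities of a single MF and $P(E1_{MMF})$ and $P(E2_{MMF})$ those of the MMF (notation assumed here), the three cases can be sketched as follows. The ceiling $\lceil t/2 \rceil$ is used for the moderate decision point so that it is an integer when t is odd:

```latex
\begin{align}
m = 1: \quad & P(E1_{MMF}) = 1 - \bigl(1 - P(E1_{MF})\bigr)^{t}, &
               P(E2_{MMF}) &= P(E2_{MF})^{t} \\
m = \lceil t/2 \rceil: \quad & P(E1_{MMF}) = \sum_{j=\lceil t/2 \rceil}^{t} \binom{t}{j} P(E1_{MF})^{j} \bigl(1 - P(E1_{MF})\bigr)^{t-j}, &
               P(E2_{MMF}) &= \sum_{j=\lfloor t/2 \rfloor + 1}^{t} \binom{t}{j} P(E2_{MF})^{j} \bigl(1 - P(E2_{MF})\bigr)^{t-j} \\
m = t: \quad & P(E1_{MMF}) = P(E1_{MF})^{t}, &
               P(E2_{MMF}) &= 1 - \bigl(1 - P(E2_{MF})\bigr)^{t}
\end{align}
```

Both tails are monotone in m: $P(E1_{MMF})$ cannot increase and $P(E2_{MMF})$ cannot decrease as m grows from 1 to t.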
For the above relationships, normally we have:
As $P(E1_{MMF})$ and $P(E2_{MMF})$ are conflicting, the optimal decision point should make a trade-off between these two probabilities. Therefore, if the probability of making errors is the only concern of the MVFilter, the optimal decision point (ODP) is m = t/2.
3.2. Relationship between the Decision Point and Error Cost
In Section 3.1, the relationships between the decision point of an MVFilter and its probabilities of making errors were analyzed. In this section, the costs of misrecognitions are considered, and we further analyze the relationships between the decision point and the expected cost.
Misrecognition (error) costs allow us to specify the relative importance of different kinds of errors. In fact, many applications have unequal misrecognition costs. In our previous work [41], while studying the behavior of supervised feature selection algorithms, we noticed that different algorithms trade off differently between retaining a smaller and a larger number of noise-free data. As a consequence of this trade-off, different costs should be assigned to different errors. When fewer noise-free data are available, the cost of a type 1 error is higher than that of a type 2 error.
The various misrecognition costs are defined by a cost matrix. The cost matrix reflects the domain specific costs representing the cost sensitive model in the critical medical domain. Therefore, associative costs for a different type of error are finalized by the domain expert keeping the clinical context and consequences in mind.
As shown in Table 1, cost matrix C usually has the following structure, in which rows correspond to predicted results and columns correspond to actual results, i.e., row/column = predicted/actual.
For correctly recognized mislabeled (or noise-free) data, the cost is zero; hence, the diagonal entries of the above matrix are normally assumed to be zero. With this assumption, the expected cost of an MVFilter is:
As Section 3.1 shows, $P(E1_{MMF})$ and $P(E2_{MMF})$ are correlated with the decision point value. Therefore, the expected cost is determined by both the decision point value and the cost matrix. If the cost matrix is fixed, then the expected cost is influenced only by the decision point value. Therefore, the cost concerned optimal decision point should be:
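In this equation, if the cost of an E1 error is much larger than the cost of an E2 error, $P(E1_{MMF})$ will be the dominant factor in determining the ODP value. The ODP is then the decision point that minimizes $P(E1_{MMF})$. From the analysis in Section 3.1, we know that it is highly probable that ODP = t. On the other hand, if the cost of an E2 error is much larger than the cost of an E1 error, $P(E2_{MMF})$ will be the dominant factor in determining the ODP value. In this case, it is likely that ODP = 1.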
It should be noted that the ODP can be determined from the above analysis only in some extreme cases (for example, when one error cost is far larger than the other). For the other cases, directly calculating the ODP is extremely difficult. In addition, the above equation of the ODP is obtained by making several assumptions. Therefore, calculating the ODP value through mathematical inference is not very useful, since the calculated ODP is influenced by these assumptions.
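To make the dependence of the ODP on the cost matrix concrete, the analysis of this section can be evaluated numerically. The sketch below is purely illustrative and rests on several assumptions that are not taken from the paper: the MFs are treated as independent, the expected cost is assumed to have the form (1 − noise ratio) · P(E1) · cost_e1 + noise ratio · P(E2) · cost_e2, and all function and parameter names are hypothetical.

```python
from math import comb


def binom_tail(n: int, k_min: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k_min, n + 1))


def mmf_error_probs(t: int, m: int, p_e1_mf: float, p_e2_mf: float):
    """E1/E2 probabilities of an MVFilter with decision point m
    (independent majority filters assumed)."""
    p_e1 = binom_tail(t, m, p_e1_mf)          # at least m MFs flag a clean instance
    p_e2 = binom_tail(t, t - m + 1, p_e2_mf)  # fewer than m MFs flag a noisy instance
    return p_e1, p_e2


def analytic_odp(t, p_e1_mf, p_e2_mf, cost_e1, cost_e2, noise_ratio):
    """Decision point minimising the assumed expected-cost form."""

    def expected_cost(m):
        p_e1, p_e2 = mmf_error_probs(t, m, p_e1_mf, p_e2_mf)
        # Assumed form: errors weighted by how often each kind of instance occurs.
        return (1 - noise_ratio) * p_e1 * cost_e1 + noise_ratio * p_e2 * cost_e2

    return min(range(1, t + 1), key=expected_cost)
```

With these assumptions, making E1 errors far more expensive than E2 errors pushes the minimizing decision point towards t, and the opposite cost relationship pushes it towards 1. The result is only as reliable as the independence and equal-probability assumptions, which is why the ODP is estimated from data in the next section.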
4. Novel Approach to Determine the Optimal Decision Point
In this section, we present our approach that can select the optimal decision point for an MVFilter by considering both cost information and the dataset itself.
Given a noisy training dataset, we define the ODP as the value that minimizes the expected cost of an MVFilter. As pointed out in Section 3, mathematically inferring the ODP is difficult. Therefore, instead of inferring it directly, we estimate the ODP implicitly.
For a noisy training dataset E, if we already knew which data in E are mislabeled, deciding the ODP would be trivial: we would just explore all the possible decision points, and the ODP would be the point that minimizes the overall cost of misrecognitions.
Of course, the mislabeled data distribution in E is unknown since our mission is to identify mislabeled data from E. However, if there exists another noisy dataset E’ similar to E and with a known mislabeled data distribution, then we could implicitly estimate ODP from E’ instead of E since their ODPs should be similar.
This is the key idea of our approach. Given a noisy dataset E to handle, we generate another dataset E’. The newly generated E’ must satisfy two requirements: (1) E’ and E come from the same or a similar data distribution, and (2) the mislabeled data distributions in E and E’ are similar. If such an E’ can be generated, we can easily obtain the ODP based on E’, since the mislabeled data distribution in E’ is known.
In many real applications, in addition to the noisy dataset E, another validation dataset $E_v$ is usually available. $E_v$ contains only noise-free data and comes from the same data distribution as E. As there are no mislabeled data in $E_v$, artificial erroneous labels are injected into $E_v$. Here, we assume that prior knowledge of the noise ratio in E is available, which determines the number of erroneous labels injected into $E_v$. Through these procedures, $E_v$ is converted to E’. The optimal decision point obtained from E’ can be used to estimate the actual optimal decision point for E.
As the actual mislabeled data distribution in E is not available, we inject erroneous labels into $E_v$ at random based on the prior noise ratio. Although the mislabeled data in E are also stochastic, the mislabeled data distributions in E and E’ can differ greatly for any single random draw. In that case, the ODP value obtained from E’ is not actually optimal for E. To solve this problem, the ODP is estimated several times; the parameter numIter controls the number of iterations. E’ changes in each iteration, since random erroneous labels are injected into $E_v$. In each iteration, all the possible decision points (from one to t) are explored, and the cost of misrecognition of each is recorded. The average cost of each decision point is obtained by taking the mean of the misrecognition costs recorded for that decision point over all iterations. Finally, the decision point with the least average cost is selected as the optimal decision point. The details of our algorithm are shown in Algorithm 3.
Algorithm 3 Optimal decision point estimation for MVFilter.
Algorithm: Searching optimal decision point for MVFilter
Input: E (training set), $E_v$ (noise-free dataset)
Parameter: numIter (number of iterations to search ODP), noiseRatio (the noise ratio in E), MVFilter (the multiple-voting based filter algorithm), t (number of single-voting filters in MVFilter), C (cost matrix)
Output: ODP (optimal decision point)
(1) costMatrix ← ∅
(2) for i = 1, …, numIter do
(3)   randIndex ← RandomPermutation($|E_v|$)
(4)   noiseIndex ← randIndex(1 : noiseRatio · $|E_v|$)
(5)   E’ ← $E_v$
(6)   replace the labels of E’(noiseIndex) with erroneous labels
(7)   for m = 1, …, t do
(8)     noiseIndexDetected ← MVFilter(E’, m)
(9)     index ← InterSection(noiseIndex, noiseIndexDetected)
(10)    indexE1 ← noiseIndexDetected \ index
(11)    indexE2 ← noiseIndex \ index
(12)    cost ← total cost, under C, of the E1 errors in indexE1 and the E2 errors in indexE2
(13)    costVector(m) ← cost
(14)  end for
(15)  costMatrix ← [costMatrix; costVector]
(16) end for
(17) ODP ← the m that minimizes the column mean of costMatrix
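The search procedure of Algorithm 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: labels are assumed to be binary (0/1) so that an erroneous label is obtained by flipping, the cost of an iteration is assumed to be the number of E1 errors times `cost_e1` plus the number of E2 errors times `cost_e2`, and `mv_filter` is a hypothetical callable that wraps the MVFilter together with the features of the noise-free dataset.

```python
import random
from typing import Callable, List, Sequence, Set


def estimate_odp(
    clean_labels: Sequence[int],
    mv_filter: Callable[[List[int], int], Set[int]],
    t: int,
    noise_ratio: float,
    cost_e1: float,
    cost_e2: float,
    num_iter: int = 10,
    seed: int = 0,
) -> int:
    """Estimate the optimal decision point from noise-free labels.

    mv_filter(noisy_labels, m) must return the indices tagged as
    mislabeled when the filter runs with decision point m.
    """
    rng = random.Random(seed)
    n = len(clean_labels)
    n_noisy = round(noise_ratio * n)
    total_cost = [0.0] * (t + 1)  # entry m accumulates the cost of decision point m
    for _ in range(num_iter):
        noise_index = set(rng.sample(range(n), n_noisy))
        # Inject artificial label errors (binary labels assumed).
        noisy_labels = [
            1 - lab if i in noise_index else lab
            for i, lab in enumerate(clean_labels)
        ]
        for m in range(1, t + 1):
            detected = mv_filter(noisy_labels, m)
            n_e1 = len(detected - noise_index)  # clean data wrongly discarded
            n_e2 = len(noise_index - detected)  # mislabeled data retained
            total_cost[m] += n_e1 * cost_e1 + n_e2 * cost_e2
    # Minimising the summed cost is equivalent to minimising the mean cost.
    return min(range(1, t + 1), key=lambda m: total_cost[m])
```

Ties are resolved in favour of the smallest decision point, which is a design choice of this sketch and not something the algorithm prescribes.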
In Algorithm 3, it is assumed that another noise-free dataset $E_v$ exists, which has the same distribution as E. Usually, some labels in a training dataset are certainly correct, and such partially noise-free data are also used as a validation dataset in many applications. However, for the few applications in which $E_v$ is unavailable, this algorithm cannot be used directly. To solve this problem, we can use an MVFilter to mine the noise-free data from E. In this case, a loose noise detection policy is preferred: to generate $E_v$, the main concern is to make fewer E2 errors, so a small decision point value (for example, one) should be used by the MVFilter. In this way, $E_v$ can be collected from E, and Algorithm 3 can then be used.
The parameter noiseRatio in Algorithm 3 should also be noted. This parameter represents the noise ratio in E (the percentage of mislabeled data in E) and is used to decide the number of erroneous labels to generate in E’. Here, we assume that it is prior knowledge. For many applications, the rough noise ratio in a noisy training set is known through years of experience. If this value is totally unknown, it can be estimated from E by using an MVFilter. To estimate this parameter more accurately, the MVFilter should select a decision point that considers E1 and E2 errors simultaneously. The value t/2 is a reasonable decision point, since it usually gives a good trade-off between E1 and E2 errors.
5. Experimental Work
In this section, a set of experiments is conducted to verify the effectiveness of our proposed approach. To test its performance, several representative single-voting and multiple-voting based filters are used. SVFilters include the majority filter (MF) and consensus filter (CF) [27, 28]. Suppose the number of SVFilters in an MVFilter is t. The MF based MVFilters include three variants from [40], whose decision points are 1, t/2, and t, respectively; the CF based MVFilters include the three corresponding variants from [40]. When the decision point is determined by our approach, the resulting filters are referred to below as the ODP-based MF and the ODP-based CF. When filtering noise, the costs incurred by the ODP-based MF and CF are compared to those of the other methods. If our approach is effective, the ODP-based MF and CF should incur less cost than the other methods.
Six bioinformatics datasets from the UCI repository were used in this experiment. Information on these datasets is tabulated in Table 2, where pos/neg presents the percentage of the number of positive examples against that of negative examples.
An SVFilter (referring to Algorithm 1) is configured as follows: the number of subsets is 3 (n = 3); three learning algorithms are used (y = 3) including naive Bayes, decision tree, and 3-nearest neighbor. The configurations of an MVFilter (referring to Algorithm 2) are basically identical to the SVFilter configurations. One additional parameter in MVFilter is the number of SVFilters, which equals nine in the experiments (t = 9). Our proposed algorithm (referring to Algorithm 3) is based on MVFilter. Its additional parameter is the number of iterations to search for ODP. Here, it equals ten (numIter = 10).
The experiments were performed on each benchmark dataset by dividing it into a training set and a test set. The filter algorithms were applied to each training set to remove the mislabeled data. The test data were used only by our algorithm, where they play the role of the noise-free dataset in Algorithm 3. It is important to clarify that domain experts were involved in establishing the noise-free benchmark datasets, whose labels were finalized after reaching a consensus.
Using cost as the evaluation measure, the performance of each filter algorithm was evaluated on each dataset D using the following steps:
(1) The performance of each filter was evaluated using three trials derived from the threefold cross-validation of D. For each trial, 2/3 of D (denoted Tr) was used as the training set. We purposely changed some correct labels in Tr according to a predefined mislabeled ratio to generate the mislabeled data. Three different mislabeled ratios were used: 10%, 20%, and 30%. For example, for a 10% mislabeled ratio, 10% of the samples from Tr were randomly selected and their correct labels were changed.
(2) The average cost of each algorithm was calculated by taking the mean cost of errors of each filter over the three trials.
(3) To avoid the influence of the partitioning of D on the generated mislabeled data, the previous two steps were repeated ten times, yielding ten cost values.
(4) Finally, the reported cost value was obtained as the mean of these ten values.
5.1. Experimental Investigation
Next, the experimental results of each dataset are presented.
Table 3 shows the comparisons of each filter in terms of cost on the Heart dataset. This table consists of three parts corresponding to three noise ratios (10%, 20%, and 30%). Under each noise ratio, the experiments were based on nine different cost matrices. Here, the costs of correct recognitions were assumed to be zero, so only the two error costs were needed to define a cost matrix. For example, in the second row of Table 3, 1:1 means that the two error costs are equal, while 1:20 means that one error cost is 20 times the other. The last column in Table 3, Ave., represents the average cost of each filter over all nine cost matrices.
Table 3 shows that for all three noise ratios, the ODP-based CF (the CF based MVFilter whose decision point is selected by our approach) had the lowest average cost among all the CF based filters. Likewise, the ODP-based MF was the best among all the MF based filters. Moreover, under all the noise ratios and cost matrices, the ODP-based CF and MF outperformed the other filters in most cases. This was in contrast to the other filters, whose performance depended heavily on the cost matrix: a filter with a fixed decision point could show outstanding performance under one cost setting, but its performance decreased dramatically when the costs changed. When the correlation between the cost and the noise ratio was considered, we found that the cost of all the filters increased as the noise ratio grew. However, compared with the other filters, the cost increases of the ODP-based CF and MF were slow. In detail, when the noise ratio grew from 10% to 30%, the cost increase was 44 for the ODP-based CF and 12 for the ODP-based MF, whereas the cost increases of the other filters were much larger (for example, 97 for one of the other CF based filters and 102 for one of the other MF based filters). Comparing the ODP-based CF and MF further, we found that on this dataset, the ODP-based CF had a smaller average cost value. However, as the noise ratio increased, the performance difference between them became small.
Table 4 shows the cost comparisons of each filter on the Wdbc dataset. The experimental conclusions from Table 4 are similar to those from Table 3. In most cases (under different noise ratios and cost matrices), the ODP-based CF and MF filters were the winners. In addition, their advantages were more obvious when the noise ratio and cost value increased. When the noise ratio was 10%, the ODP-based CF outperformed the ODP-based MF. However, they showed similar performance as the noise ratio grew.
Table 5 presents the experimental results on the Wpbc dataset. Similar to the conclusions from Table 3 and Table 4, our approach could effectively improve the performance of the MF and CF based filters. In addition, the ODP-based CF and MF filters consistently worked well in different cases. Except for these two filters, the performance of the filters usually declined dramatically when the noise ratio increased. Moreover, the other filters showed obvious performance changes when the relationship between the two error costs changed: an MF based filter with a fixed decision point could work well under one cost relationship but perform poorly under the opposite one.
Table 6, Table 7 and Table 8 show the experimental results on the Spect, Spect1, and Promoter datasets. Similar to the above analysis, these three tables clearly indicate the superiority of the ODP-based CF and MF filters.
Several important conclusions can be drawn by summarizing the above evaluation results:
(1) Selecting the optimal decision point by our approach could effectively improve the performance of an MVFilter.
(2) The ODP-based CF and MF filters adapted to various noise ratios. In particular, even in a high noise ratio environment, their cost increases were not great.
(3) Under different cost matrices, the ODP-based CF and MF filters consistently outperformed the other filters. Their advantages were more obvious when the difference between the two error costs was large.
(4) Given a noisy training dataset, our proposed approach proved to be effective under different noise ratios and cost matrices if two conditions hold: (a) the noise ratio of this dataset is known; (b) there exists another noise-free dataset that is drawn from the same distribution as this noisy dataset.
5.2. Extended Experimental Investigation
As pointed out above, our approach was verified to work well when the noise ratio and an additional noise-free dataset were available. To further confirm the usability of our approach, we evaluated it in an environment where these two kinds of information were not available, that is, where the noisy training dataset E was the only available information.
The noise ratio was estimated by the CF based MVFilter whose decision point equals t/2. As an MVFilter, it consists of t consensus filters; if at least t/2 CFs identify an instance as mislabeled, then the MVFilter regards it as mislabeled. For a noisy training dataset E, if n instances are identified as mislabeled by this filter, then the estimated noise ratio is $n/|E|$. The parameter configurations of this filter were consistent with those used before (see the beginning of Section 5).
The noise-free dataset was obtained by applying to E the CF based MVFilter whose decision point equals one. This filter consists of t consensus filters; if at least one CF identifies an instance as mislabeled, then the filter regards it as mislabeled. Conceptually, this noise detection rule is loose and aims to remove all the potentially mislabeled data. Suppose the noise recognized by this filter is A. Then, the noise-free dataset is the subset of E that excludes A, i.e., $E \setminus A$. The configurations of this filter were in accordance with the above experiments.
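Both bootstrap steps can be derived from the same per-instance vote counts. The following minimal sketch assumes that `votes[i]` already holds the number of the t consensus filters that tagged instance i as mislabeled; the function name `bootstrap_inputs` is hypothetical, and the ceiling is used so that t/2 is an integer when t is odd.

```python
from math import ceil
from typing import List, Sequence, Tuple


def bootstrap_inputs(votes: Sequence[int], t: int) -> Tuple[float, List[int]]:
    """Derive (estimated noise ratio, indices of presumed noise-free data).

    votes[i] is the number of the t single-voting filters that tagged
    instance i as mislabeled.
    """
    moderate = ceil(t / 2)  # decision point t/2 for the noise ratio
    n_flagged = sum(v >= moderate for v in votes)
    noise_ratio = n_flagged / len(votes)
    # Decision point 1: any single vote removes the instance from the clean set.
    clean_index = [i for i, v in enumerate(votes) if v < 1]
    return noise_ratio, clean_index
```

The estimated noise ratio and the retained clean subset can then be passed to the ODP search in place of the prior noise ratio and the validation dataset.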
Table 9, Table 10, Table 11, Table 12, Table 13 and Table 14 show the experimental results on the benchmark datasets. Under all the datasets and all the noise ratios, the ODP-based CF and MF filters still showed outstanding performance and outperformed the other filters in most cases. Compared to the experimental results in Table 3, Table 4, Table 5, Table 6, Table 7 and Table 8, the performance of the ODP-based CF and MF filters degraded to a certain extent in a few cases. However, in general, the performance change was moderate. This indicates that even without a known noise ratio and an additional noise-free dataset, our proposed approach still worked well. One of the reasons was that in our approach, the mislabeled data distribution was estimated multiple times. Although the mislabeled distribution estimated in each iteration might be distorted, their fusion approached the real distribution. Consequently, the estimated ODP was also close to the real optimal decision point.
The two independent experimental evaluations in Section 5.1 and Section 5.2 showed that our proposed approach was effective and able to improve the performance of an MVFilter by selecting the optimal decision point. In particular, under high noise ratios and high cost values, our approach showed significant improvements compared to the other filters.
6. Conclusions and Future Works
In mislabeled data detection, the multiple-voting based filter (MVFilter) is generally superior to the conventional single-voting based filter (SVFilter). However, one important unsolved issue in the MVFilter is how to choose the optimal decision point (ODP) to maximize its noise detection performance.
In this paper, a novel approach was proposed to solve this issue. This approach implicitly computed the ODP by estimating the mislabeled data distribution in the noisy training dataset. Our approach took a noisy dataset and a cost matrix as input, then output an ODP, which aimed to minimize the expected cost of errors. Note that minimizing cost was one important contribution of this work, because most existing works were not aware of the importance of cost. They just implicitly assumed that all errors were equally costly, but in most real applications, this is far from the case.
A set of experimental evaluations was conducted, which demonstrated the effectiveness of our approach. With the aid of our approach, an MVFilter could effectively reduce the cost. In particular, in difficult noise detection environments (when the noise ratio was high or the cost was large), the advantages of our approach were more obvious. Furthermore, the proposed methodology could also be extended to multi-class problems. One possible strategy is the naive one of dividing the multi-class problem into several two-class problems and then applying the proposed approach to each two-class problem.
Although the clean dataset (i.e., validation cases) and cost matrix are available in most cases, prior information about the noise ratio is not easily available; the current solution therefore needs to be improved to alleviate this requirement. In future work, we will focus on developing more elegant approaches to further improve the proposed approach.
7. Availability of Data and Material