Applied Sciences
  • Article
  • Open Access

16 April 2023

Efficient False Positive Control Algorithms in Big Data Mining

1 School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
2 Northeastern University, Shenyang 110819, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Big Data Engineering and Application

Abstract

Determining whether a pattern is significantly associated with a specific class label is a typical hypothesis testing problem in statistical analysis. In big data mining scenarios, it leads to highly challenging multiple hypothesis testing problems, because the millions or billions of hypothesis tests performed in large-scale exploratory data analysis can produce a large number of false positive results. The permutation-testing-based FWER control method (PFWER) is theoretically effective in dealing with multiple hypothesis testing. In practice, however, this theoretical approach faces a serious computational efficiency problem: computing an appropriate FWER false positive control threshold with PFWER takes an extremely long time and is nearly impossible to complete in a reasonable amount of time on medium- or large-scale data. Although some methods for speeding up the FWER threshold calculation have been proposed, most of them are stand-alone, and there is still considerable room for efficiency improvement. To address this problem, this paper proposes a distributed PFWER false positive threshold calculation method for large-scale data, whose computational efficiency is significantly higher than that of existing approaches. The FP-growth algorithm is first used for pattern mining, and the mining process reduces the computation of invalid patterns through pruning operations and an index optimization that merges patterns with identical index (transaction) sets. On this basis, a distributed computing technique is introduced: the constructed FP tree is decomposed into a set of subtrees, each corresponding to a subtask, and all subtrees (subtasks) are distributed to different computing nodes. Each node independently calculates a local significance threshold for its assigned subtasks, and all local results are finally aggregated to compute the FWER false positive control threshold, which is completely consistent with the theoretical result. Experimental findings on 11 real-world datasets demonstrate that the proposed distributed algorithm significantly improves the computational efficiency of PFWER while preserving its theoretical accuracy.

1. Introduction

In statistical analysis, we often need to test whether a pattern is significantly associated with a given class label, which is the classical hypothesis testing problem [1]. Due to growing data sizes, we frequently need to perform this task on large datasets, for example, detecting whether a genetic pattern in massive bioinformatics data is significantly associated with a certain disease [2], or whether a user behavior pattern in massive market-basket data is significantly associated with the sale of a certain item [3]. This raises a challenging multiple hypothesis testing issue, because millions or billions of hypothesis tests in large-scale exploratory data analysis can produce many false positives, leading to a substantial waste of resources [4].
The FWER control method based on permutation testing (PFWER) has been theoretically shown to be effective in mitigating multiple hypothesis testing problems [5,6]. Compared with traditional FWER control methods (e.g., the Bonferroni correction [7], the SRB algorithm [8], the Simes algorithm [9], Hochberg's procedure [10], etc.), it has received much attention for its ability to control the overall probability of false positives at a lower level without assuming independent and identical distributions. The PFWER control method works by randomly permuting the class labels in the original data a certain number of times and recalculating the significance threshold (i.e., p-value) that satisfies the FWER constraint [11]. Because the initial association between class labels and the dataset is randomly perturbed, the p-values corrected by the PFWER control technique can better control the false positives of the overall results in more realistic scenarios, i.e., without requiring the assumption of independent and identical distributions between variables.
Although the PFWER control method can theoretically produce more reasonable FWER thresholds, it is highly computationally intensive. Each class label permutation requires calculating the corresponding p-value for all patterns embedded in the data (typically on the order of the original data size) and selecting the smallest p-value among them, and this process is typically repeated 1000 to 10,000 times [11,12]. The FastWY algorithm [13] exploits the inherent properties of discrete test statistics and successfully reduces the computational burden of the Westfall–Young permutation-based procedure. The Westfall–Young Light algorithm [5] is based on an incremental search strategy in which the enumerated frequent patterns are computed only once, and its p-value pre-computation reduces the running time of the p-value computation task by several orders of magnitude. These PFWER control methods, however, are all single-machine algorithms, and there is still room for significant efficiency improvements.
To address the aforementioned problem, a distributed FWER false positive threshold calculation method for large-scale data is proposed in this article, whose computational efficiency is greatly improved compared to current methods. The FP-growth algorithm is used first for pattern mining, and the mining process reduces the computation of invalid patterns by merging patterns with identical index (transaction) sets via pruning operations and index optimization. On this basis, the concept of distributed computing is introduced: the constructed FP tree is decomposed into a set of subtrees, each of which corresponds to a subtask, and all subtrees (subtasks) are distributed to different computing nodes, each of which independently computes a local significance threshold based on its assigned subtasks. Finally, the local results of all nodes are aggregated, and the resulting FWER false positive control threshold is completely consistent with the theoretical result.
The main contributions of this paper are as follows.
(1)
A distributed PFWER false positive control algorithm is proposed. Based on the proof that the threshold calculation task is decomposable, the PFWER false-positive control threshold calculation problem on large data is extended to a distributed solvable problem through task decomposition and the merging of local results. Theoretical analysis and experimental findings indicate that the algorithm outperforms similar algorithms in terms of execution efficiency.
(2)
An FP tree with an index structure and a pruning strategy is proposed. The pruning strategy can reduce the number of condition trees constructed, and the index structure can reduce the computation of redundant patterns in FP tree construction. The experimental findings show that the two strategies can significantly reduce the number of traversals of the dataset and the pattern computation overhead, which greatly improves computational efficiency.
The paper is structured as follows: Section 2 is an introduction to the relevant concepts and techniques. Section 3 introduces the distributed PFWER false positive control algorithm. Section 4 tests the correctness and computational efficiency of the distributed PFWER false positive control algorithm through experiments and provides a theoretical analysis of the experimental results. Section 5 concludes the paper and discusses the focus of future work.

3. PFWER-Based Distributed False Positive Control Algorithm

The FWER control method is suited to multiple hypothesis testing problems that require strict control of false positive errors. In this paper, a transactional dataset with binary labels is selected as the computational vehicle for the distributed false positive control algorithm. Since there is a certain degree of dependence among the hypotheses in a transactional dataset, and therefore among the computed p-values, this section uses the Westfall–Young Light algorithm [5], which is based on the Westfall and Young [30,43] permutation procedure. This algorithm can control the FWER at the α level, but its implementation involves a large number of resampling and permutation operations, and the computation is very slow. Therefore, the main objective of this section is to use a distributed strategy to improve the speed of the false positive control computation on large-scale data without any loss of accuracy.

3.1. Problem Definition

Definition 1.
Let l_0, l_1 be two class labels, and let the transaction dataset be D = {(T_1, l_1), (T_2, l_2), ..., (T_n, l_n)}, where each transaction T_i is a set of items, i.e., T_i = {t_1, t_2, ..., t_k}. Each transaction T_i in the dataset carries a binary class label l_i ∈ {l_0, l_1}.
Definition 2.
Let a pattern S be a set of items, i.e., S = {t_1, t_2, ..., t_i}, where each item t_i ∈ {1, ..., m}. Let σ(S) denote the number of transactions in D that contain pattern S, σ_1(S) the number of transactions labeled l_1 that contain S, and σ_0(S) the number of transactions labeled l_0 that contain S. Based on the above two definitions, a 2 × 2 contingency table can be constructed, as shown in Table 3.
Table 3. A 2 × 2 contingency table.
Definition 3.
The null hypothesis H_0 is that pattern S is not significantly associated with the label l_i. Let δ be the corrected significance level; the null hypothesis is rejected, and pattern S is considered to be significantly associated with the label l_i, if and only if its p-value ≤ δ.
Definition 4.
A false positive (Type I error) is the finding of an association that does not actually exist [5].
Section 2.1.4 showed that the p-value calculation method used in this paper is Fisher's exact test, which treats the margins n, n_1, and σ(S) of the 2 × 2 contingency table as fixed. Under the null hypothesis that pattern S and the label l_i are independent of each other, σ_1(S) therefore follows the hypergeometric distribution shown in Equation (7).
$$ p_F\big(\sigma_1(S)=a \mid \sigma(S), n_1, n\big) = \frac{\binom{n_1}{a}\binom{n-n_1}{\sigma(S)-a}}{\binom{n}{\sigma(S)}} \qquad (7) $$
Let b be the observed value of σ_1(S) in the contingency table of S. The p-value obtained using Fisher's exact test is shown in Equation (8): it is the cumulative sum of the probabilities of all values σ_1(S) = a whose probability is no greater than that of the observed value b.
$$ p_S^F(b) = \sum_{a:\; p_F(a \mid \sigma(S), n_1, n)\,\le\, p_F(b \mid \sigma(S), n_1, n)} p_F\big(a \mid \sigma(S), n_1, n\big) \qquad (8) $$
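To make Equations (7) and (8) concrete, the following is a minimal, self-contained sketch (in Java, the implementation language reported later in the paper) of Fisher's exact test computed from the margins n, n_1, σ(S) and the observed count b. It is an illustration under our own naming, not the authors' code; log-factorials are used only to keep the binomial coefficients numerically stable.

```java
// Minimal sketch of Fisher's exact test from the 2x2 table margins
// (Equations (7) and (8)). Class and method names are illustrative.
public class FisherExact {

    // log(k!) via a running sum; adequate for a short sketch.
    private static double logFactorial(int k) {
        double s = 0.0;
        for (int i = 2; i <= k; i++) s += Math.log(i);
        return s;
    }

    private static double logBinomial(int n, int k) {
        return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
    }

    // Equation (7): hypergeometric probability P(sigma1(S) = a | sigma(S), n1, n).
    static double hypergeomProb(int a, int sigmaS, int n1, int n) {
        double logP = logBinomial(n1, a)
                + logBinomial(n - n1, sigmaS - a)
                - logBinomial(n, sigmaS);
        return Math.exp(logP);
    }

    // Equation (8): the p-value is the sum of the probabilities of all attainable
    // outcomes a whose probability does not exceed that of the observed value b.
    static double fisherPValue(int b, int sigmaS, int n1, int n) {
        int aMin = Math.max(0, sigmaS - (n - n1));   // lower bound of sigma1(S)
        int aMax = Math.min(n1, sigmaS);             // upper bound of sigma1(S)
        double pObs = hypergeomProb(b, sigmaS, n1, n);
        double p = 0.0;
        for (int a = aMin; a <= aMax; a++) {
            double pa = hypergeomProb(a, sigmaS, n1, n);
            if (pa <= pObs + 1e-12) p += pa;         // small tolerance for ties
        }
        return Math.min(1.0, p);
    }

    public static void main(String[] args) {
        // Toy example: n = 20 transactions, n1 = 8 labelled l1, sigma(S) = 6, observed b = 5.
        System.out.println(fisherPValue(5, 6, 8, 20));
    }
}
```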

3.2. Overall Framework of the Algorithm

The general framework of the distributed PFWER false positive control algorithm proposed in this paper is shown in Figure 1.
Figure 1. Overall framework of distributed PFWER false positive control.
Since the null hypothesis H_0 proposed in this paper is that pattern S is not significantly associated with label l_i, more than one pattern S can be mined in the transactional dataset D. There are dependencies between different patterns S, and hence between the p-values computed from the labels l_i, so the PFWER false positive control is performed using the permutation method proposed by Westfall and Young [30,43]. The permutation-based method is very computationally intensive, so the Spark framework is used for parallel computing to improve the overall computation rate. The algorithm proposed in this section can be broadly divided into the following three stages.
  • Label permutation operation. According to the permutation method proposed by Westfall and Young [30,43], in order to calculate the truncated p-value (corrected significance level δ) more accurately, the labels l_i must be permuted (generally j_r = 10^3–10^4 permutations) so as to break the association between pattern S and label l_i.
  • Finding the hypothesis to be tested in multiple hypothesis testing. Since the null hypothesis is composed of two key elements, pattern S and label l i , the main task of the second stage of the algorithm is to find all patterns S and their corresponding labels l i in the transactional dataset D.
  • False positive correction calculation. After finding the hypotheses to be tested and permuting the labels, the p-value of each hypothesis is calculated using Fisher's exact test. The false positive correction is then performed according to the Westfall and Young [30,43] permutation method, finally controlling the FWER at the α level.

3.3. Index-Tree Algorithm

The hypothesis determination process mines a large number of redundant patterns, which slows down the computation. To solve this problem, we propose an Index-Tree algorithm, which uses a pruning strategy to reduce the construction of conditional trees and, thus, the computation of patterns. It also adopts an index optimization strategy to reduce the computational overhead caused by multiple traversals of the dataset, further reducing the computation of redundant patterns and speeding up the overall false positive control.

3.3.1. Pattern Mining

The main purpose of pattern mining in this paper is to find all hypotheses. A hypothesis is composed of two key elements, a pattern S and a label l_i, so in the hypothesis determination phase all patterns S must first be mined with a pattern mining method. The hypotheses are then determined by traversing the dataset to find the labels of the transactions that contain each pattern.
As shown in Figure 2, this paper uses the FP-Growth algorithm for pattern mining. However, since this paper aims to control the false positives in multiple hypothesis testing, that is, to compute p-values for all patterns S and then apply the PFWER control method, the minimum support count in the FP-Growth algorithm must be set to 1. This makes plain FP-Growth pattern mining very inefficient. Since pattern mining is only one step of the overall computation, followed by the PFWER false positive control calculation, it is necessary to improve the FP-Growth algorithm without changing the effect of the PFWER false positive control, in order to reduce memory overhead and improve computational efficiency. To this end, a pruning operation and an index optimization operation are adopted to reduce redundant patterns and improve efficiency.
Figure 2. Pattern mining purpose.

3.3.2. Pruning Operation

This section focuses on controlling false positive errors in multiple hypothesis testing using the PFWER control method. The FWER (family-wise error rate) is the probability of making at least one false positive error, and keeping this probability small means requiring FWER(δ) ≤ α. Reducing the significance level applied to p_S^F(b) from the original α to δ guarantees FWER(δ) ≤ α, so the problem becomes that of computing the significance threshold δ = max{δ | FWER(δ) ≤ α}. The Westfall–Young Light algorithm [5] performs j_r = 10^3–10^4 permutations of the labels l_i to remove the association between labels and patterns, and determines whether a false positive error has occurred in a permutation by checking whether p_min ≤ δ holds, where p_min = min_S p_S^F(b). The family-wise error rate is then estimated as shown in Equation (9).
$$ \mathrm{FWER}(\delta) = \frac{1}{j_r}\sum_{i=1}^{j_r} \mathbf{1}\big[\, p_{\min}^{(i)} \le \delta \,\big] \qquad (9) $$
where 1[p_min^(i) ≤ δ] is 1 if p_min^(i) ≤ δ holds and 0 otherwise. The final δ to be found is the α quantile of {p_min^(i)}_{i=1}^{j_r}.
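As a hedged illustration (not the paper's code) of how Equation (9) and the α-quantile rule behave once the j_r permutation-wise minimum p-values are available, consider the following sketch; the array contents and the unusually large α are chosen only because j_r is tiny in this toy example.

```java
import java.util.Arrays;

// Sketch of Equation (9): estimate FWER(delta) from the jr minimum p-values and
// pick delta as the empirical alpha quantile of those minima. Illustrative only.
public class FwerEstimate {

    // FWER(delta) = (1/jr) * #{ i : pMin[i] <= delta }
    static double fwer(double[] pMin, double delta) {
        int hits = 0;
        for (double p : pMin) if (p <= delta) hits++;
        return (double) hits / pMin.length;
    }

    // delta = max{ delta | FWER(delta) <= alpha }: the largest threshold such that
    // at most an alpha fraction of the permutation-wise minima fall at or below it.
    static double correctedThreshold(double[] pMin, double alpha) {
        double[] sorted = pMin.clone();
        Arrays.sort(sorted);
        int k = (int) Math.floor(alpha * sorted.length); // number of minima allowed below delta
        return (k == 0) ? 0.0 : sorted[k - 1];           // empirical alpha quantile
    }

    public static void main(String[] args) {
        // Hypothetical minimum p-values from jr = 10 permutations (the paper uses 10^3-10^4).
        double[] pMin = {0.002, 0.04, 0.013, 0.0005, 0.08, 0.021, 0.0009, 0.05, 0.03, 0.01};
        double delta = correctedThreshold(pMin, 0.2);
        System.out.println("delta = " + delta + ", FWER(delta) = " + fwer(pMin, delta));
    }
}
```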
Theorem 1.
If S_1 ⊆ S_2 and σ(S_1) = σ(S_2), then σ_1(S_1) = σ_1(S_2) and σ_0(S_1) = σ_0(S_2), and for each permuted labeling, σ_1(S_1) = σ_1(S_2) and σ_0(S_1) = σ_0(S_2).
Theorem 2.
If S_1 ⊆ S_2 and σ(S_1) = σ(S_2), then p_{S_1}^F(b) = p_{S_2}^F(b).
Proof. 
Since σ(S_1) = σ(S_2) and the margins n, n_1, and σ(S) of the 2 × 2 contingency table are fixed, the three values n, n_1, and σ(S) are equal for S_1 and S_2. Equations (10) and (11) then follow from Equation (7), and Equation (12) follows directly from Equations (10) and (11). Substituting Equation (12) into the Fisher exact test formula gives p_{S_1}^F(b) = p_{S_2}^F(b).
$$ p_F\big(\sigma_1(S_1)=a \mid \sigma(S_1), n_1, n\big) = \frac{\binom{n_1}{a}\binom{n-n_1}{\sigma(S_1)-a}}{\binom{n}{\sigma(S_1)}} \qquad (10) $$
$$ p_F\big(\sigma_1(S_2)=a \mid \sigma(S_2), n_1, n\big) = \frac{\binom{n_1}{a}\binom{n-n_1}{\sigma(S_2)-a}}{\binom{n}{\sigma(S_2)}} \qquad (11) $$
$$ p_F\big(\sigma_1(S_1)=a \mid \sigma(S_1), n_1, n\big) = p_F\big(\sigma_1(S_2)=a \mid \sigma(S_2), n_1, n\big) \qquad (12) $$
   □
Theorem 3.
If S_1 ⊆ S_2 and σ(S_1) = σ(S_2), then only the p-value of pattern S_1 needs to be computed.
Proof. 
According to Equation (9), the final estimate of FWER(δ) depends only on p_min^(i) after each permutation, where p_min = min_S p_S^F(b). By Theorem 2, if S_1 ⊆ S_2 and σ(S_1) = σ(S_2), then p_{S_1}^F(b) = p_{S_2}^F(b). If p_{S_1}^F(b) = p_{S_2}^F(b) is the minimum p-value in a given permutation, then p_min takes the same value whether it picks p_{S_1}^F(b) or p_{S_2}^F(b); therefore, it is sufficient to compute only the p-value p_{S_1}^F(b) of pattern S_1, without computing the p-value of pattern S_2. If p_{S_1}^F(b) = p_{S_2}^F(b) is not the minimum p-value in that permutation, then, since the two values are equal, comparing p_min with p_{S_1}^F(b) gives the same result as comparing it with p_{S_2}^F(b), so again it is sufficient to perform the calculation only once.    □
Theorem 4.
In an FP-Tree, if σ(I_1) = σ(I_2) and I_1.next = I_2 in the item header table, and for all I_1.link.next and I_2.link.next we have σ(I_1.link.next) = σ(I_2.link.next), with I_1.link.next.child = I_2.link.next and I_2.link.next.parent = I_1.link.next in the FP-Tree, then for S_1 = S ∪ {I_1} and S_2 = S ∪ {I_1, I_2} we have S_1 ⊆ S_2 and σ(S_1) = σ(S_2).
Proof. 
Consider the FP-Tree constructed from the dataset {{I_2, I_5}: 1, {I_1, I_3}: 2, {I_1, I_2, I_3}: 1, {I_1, I_2, I_3, I_5}: 1, {I_1, I_2, I_3, I_4}: 2, {I_2}: 4, {I_1, I_3, I_4}: 2}, shown in Figure 3. Here σ(I_1) = σ(I_3), the entry of I_3 in the item header table immediately follows that of I_1, and the support counts σ(I.link.next) of the nodes on the I_1 and I_3 node links coincide. In the FP-Tree, the parent of every I_3 node is an I_1 node and the child of every I_1 node is an I_3 node, so clearly σ({I_1}) = σ({I_1, I_3}). Let S_1 = S ∪ {I_1} and S_2 = S ∪ {I_1, I_3}; then S_1 ⊆ S_2, and since every transaction containing S ∪ {I_1} also contains I_3, σ(S_1) = σ(S ∪ {I_1}) = σ(S ∪ {I_1, I_3}) = σ(S_2).    □
Figure 3. FP-Tree constructed from the example dataset.
The nodes I_1 and I_3 that satisfy the condition of Theorem 4 in the FP-Tree can be merged into a single node I_1; that is, patterns S_1 and S_2 can be merged into one pattern. According to Theorems 1–3, only the p-value of pattern S_1 then needs to be calculated, which reduces the amount of in-memory computation and speeds up the single-machine algorithm.

3.3.3. Index Optimization

From the 2 × 2 contingency table, we know that after mining a pattern S, we need to find its support count σ_1(S), i.e., the number of transactions with S ⊆ T_i and l_i = l_1, and this would require traversing the whole dataset once. Since the Westfall–Young Light algorithm [5] starts from a minimum support count of 1, the number of patterns to be mined is very large, and traversing the dataset once for each mined pattern to obtain σ_1(S) would be too expensive. When performing pattern mining, an index recording the positions of the transactions T_i containing each pattern can therefore be added, so that counting the transactions with l_i = l_1 takes only linear time. The transaction dataset D with the index added is shown in Table 4.
Table 4. Transaction dataset with index.
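The benefit of the index can be illustrated with a small sketch (hypothetical data and names, not the paper's code): once the TID index set of a pattern S is known, σ(S) is simply the size of that set, and σ_1(S) is obtained in a single linear pass over it against the label array.

```java
import java.util.Set;

// Sketch: with the index, sigma(S) is the size of the pattern's TID set and
// sigma1(S) is one linear pass over that set against the labels. Illustrative only.
public class IndexedSupport {

    // labels[t] == 1 means transaction t carries class label l1.
    static int sigma1(Set<Integer> tidSet, int[] labels) {
        int count = 0;
        for (int tid : tidSet) {
            if (labels[tid] == 1) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Set<Integer> tidOfS = Set.of(2, 3, 5, 8);        // hypothetical index set of a pattern S
        int[] labels = {0, 1, 1, 0, 1, 1, 0, 0, 1, 0};   // hypothetical binary labels per TID
        System.out.println("sigma(S) = " + tidOfS.size()
                + ", sigma1(S) = " + sigma1(tidOfS, labels));
    }
}
```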
The FP-Tree with index structure is constructed from the above dataset, as shown in Figure 4. The conditional pattern bases are then built from the indexed FP-Tree in order of increasing support count, that is, starting from I_5: <I_2, I_1: 8>, <I_2: 0>; I_4: <I_2, I_1: 3, 5>, <I_1: 9, 12>; I_1: <I_2: 2, 3, 5, 8>. Next, the indexed conditional FP-Tree is constructed from the indexed conditional pattern base, and the patterns with index structure, S_I = (t_i, {TID_i}), are found.
Figure 4. FP-Tree with index structure.
The null hypothesis H_0 proposed in this paper is that pattern S is not significantly associated with the label l_i, and the parameters to be tested can be written as θ = {S, l_i | S ⊆ T_j, j = 1, ..., n, i = 0, 1}. According to Table 3 and Equation (9), for the selected dataset D and the null hypothesis H_0, the key quantities for false positive control are n, n_1, σ(S), and the value of σ_1(S) obtained after label permutation. Here, n and n_1 are fixed once the sample dataset is selected, while σ(S) and σ_1(S) are the support counts of the transactions with S ⊆ T_i and with S ⊆ T_i, l_i = l_1, respectively. It is easy to see from the structure of dataset D that once the set of transactions T_i containing pattern S is known, the corresponding set of labels l_i can be found: the support count σ(S) is the size of this transaction set, and σ_1(S) follows directly from the correspondence between transactions and labels. Therefore, when performing the PFWER false positive control calculation, it is not necessary to know what the specific pattern S is, but only which sets of transactions T_i support some pattern S. Finding these transaction sets is what matters for the subsequent computation, and it is clearly advantageous to use a vertical data format for this purpose.
The transactional dataset of Table 4 is converted into a vertical data format representation, as shown in Table 5. Data mining then finds the patterns to be computed by intersecting the index sets of the items in each candidate item set; for example, the index set of the pattern {I_1, I_2} is TID({I_1, I_2}) = TID(I_1) ∩ TID(I_2) = {2, 3, 5, 8}.
Table 5. Vertical data format transaction dataset.
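A short sketch of the vertical-format step described above follows (illustrative TID sets, not the paper's code): the index set of a candidate pattern is the intersection of the index sets of its items, and by Theorem 5 below only one representative per distinct index set needs a p-value.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of vertical-format mining (Table 5): the TID set of a pattern is the
// intersection of the TID sets of its items. Illustrative data and names only.
public class VerticalIntersect {

    static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new HashSet<>(a); // copy so the inputs stay untouched
        result.retainAll(b);
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> tidI1 = Set.of(2, 3, 5, 8, 9, 12);  // hypothetical TID(I1)
        Set<Integer> tidI2 = Set.of(0, 2, 3, 5, 6, 8);   // hypothetical TID(I2)
        // TID({I1, I2}) = TID(I1) ∩ TID(I2) = {2, 3, 5, 8}
        System.out.println(intersect(tidI1, tidI2));
        // Patterns sharing the same TID set get identical p-values, so only one
        // representative per distinct TID set needs to be evaluated.
    }
}
```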
Theorem 5.
If TID(S_1) = TID(S_2), then p_{S_1}^F(b) = p_{S_2}^F(b).
Proof. 
If TID(S_1) = TID(S_2), then the sets of transactions containing patterns S_1 and S_2 are identical, so σ(S_1) = σ(S_2). Moreover, since labels correspond one-to-one with transactions, even though j_r permutations are performed, pattern S_1 belongs to the same transaction set as pattern S_2, so for each permutation σ_1(S_1) = σ_1(S_2). For the same dataset, the total number of transactions n and the number n_1 of transactions with l_i = l_1 are fixed; substituting σ(S), σ_1(S), n, and n_1 into Equations (7) and (8) yields p_{S_1}^F(b) = p_{S_2}^F(b).    □
Substituting p_{S_1}^F(b) = p_{S_2}^F(b) into Equation (9) (the FWER false positive control formula) shows that p_{S_1}^F(b) and p_{S_2}^F(b) have the same effect on Equation (9). Different patterns with the same index set therefore contribute identically, and it is sufficient to perform the p-value calculation only once.
Based on the above analysis, mining the sets of transactions containing pattern S is more useful for the subsequent computation than mining all patterns in the dataset and then computing on them. Inspired by the vertical data format, the index tree is pruned again according to Theorem 5 to avoid computing the invalid patterns generated in the mining process.
According to the item header table in Figure 4, the conditional pattern base of I_4 is <I_2, I_1: 3, 5>, <I_1: 9, 12>, and the conditional tree constructed from it is shown in Figure 5. When the FP-Growth algorithm mines the conditional tree of I_4, it combines all nodes on this single path and then combines each combination with I_4 to form the pattern output. From the conditional tree of I_4 we would thus obtain S_1 = ({I_1, I_4}, {3, 5, 9, 12}), S_2 = ({I_2, I_4}, {3, 5}), and S_3 = ({I_1, I_2, I_4}, {3, 5}); however, patterns S_2 and S_3 are exactly equivalent for the PFWER false positive control calculation, so there is no need to repeat the calculation, and it is only necessary to know the index set of each node to substitute into the FWER control formula. When a single-path conditional tree contains many nodes, this avoids a large amount of additional computational overhead.
Figure 5. Condition tree of I 4 .
The purpose of Algorithm 1 is to mine the index sets containing each pattern and to prepare for the subsequent PFWER false positive control computation. The first line of the algorithm constructs the set of frequent 1-items and calculates their support counts. The second line constructs the index tree. The third line calls Algorithm 2 to perform the pruning operation on the index tree. If the conditional tree contains only a single path, then the index sets of the nodes on this path are output; otherwise, the conditional tree for the pattern β ∪ {a_i} is constructed in lines ten to thirteen, and if the conditional tree is not empty, the algorithm is called recursively for mining, until finally all index sets containing the patterns are obtained.
The first line of Algorithm 2 iterates through the entries of the item header table, and lines two to five determine whether two adjacent entries with the same support count should be merged. If the corresponding nodes in the FP tree satisfy the pruning condition of Section 3.3.2, they are merged in line six, the item header table is updated, and the pruned index tree is returned.
Algorithm 1 Index Tree
Require: D = {(T_i, l_i)}
Ensure: In = {TID_i}
 1: create item_1, σ ← size(index)
 2: IFP_Tree ← createTree(item_1, D)
 3: tree ← IPFP_Tree(IFP_Tree)
 4: IFP_Growth(tree, β)
 5: if path ∈ tree then
 6:     for node ∈ path do
 7:         TID(β ∪ node)
 8:     end for
 9: else
10:     for each a_i ∈ (a_i, TID_b) do
11:         b ← β ∪ a_i, TID_b ← TID_a, σ ← size(TID_b)
12:         create(D_b)
13:         create(tree_b)
14:         tree_ib ← IPFPTree(tree_b)
15:         if tree_ib ≠ ∅ then
16:             IFPGrowth(tree_ib, b)
17:         end if
18:     end for
19: end if
Algorithm 2 IPFP Tree
Require: items
Ensure: IPFP tree
 1: for i ∈ items do
 2:     if σ(Head(i)) = σ(Head(i−1)) then
 3:         for node_i ∈ link_i, node_{i−1} ∈ link_{i−1} do
 4:             if σ(node_i) = σ(node_{i−1}) then
 5:                 if node_i.child = node_{i−1} and node_{i−1}.parent = node_i then
 6:                     remove(node_{i−1})
 7:                     update(Head)
 8:                 end if
 9:             end if
10:         end for
11:     end if
12:     i ← i + 1
13: end for

3.4. Distributed PFWER Control Algorithm

3.4.1. Label Replacement

The first stage of the distributed PFWER false positive control algorithm is the label permutation stage, whose purpose is to break the relationship between labels and patterns. The labels therefore need to be shuffled, which generally requires j_r = 10^3–10^4 permutations. This process can be run in parallel on the cluster, as shown in Figure 6.
Figure 6. Parallel label replacement.
First, the label data are read with the sc.textFile() method and stored in labelRDD; then a random permutation operation is performed on the labels in parallel. The shuffled label sets are then merged across the cluster, finally yielding a permuted set of labels.
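The following Spark (Java) sketch shows one way this step could look; the input path, variable names, and the random-sort-key shuffling trick are assumptions made for illustration and are not taken from the paper's implementation.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Sketch of one parallel label permutation (Figure 6): attach a random sort key
// to every label, sort by that key, and drop the key. Illustrative names only.
public class LabelPermutation {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("LabelPermutation")
                .setMaster("local[*]");              // local master only for this sketch
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> labelRDD = sc.textFile("hdfs:///data/labels.txt"); // hypothetical path

            JavaRDD<String> permuted = labelRDD
                    .mapToPair(label -> new Tuple2<>(Math.random(), label))
                    .sortByKey()
                    .values();

            // In the full algorithm this is repeated jr = 10^3-10^4 times, and each
            // permuted label set feeds one round of minimum p-value computation.
            System.out.println(permuted.take(10));
        }
    }
}
```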

3.4.2. Hypothesis Determination

The second part of the distributed PFWER false positive control algorithm is to find the parameters to be tested in the multiple hypothesis test. Since the null hypothesis is composed of two key elements, pattern S and label l i , it is known from the theorem in Section 3.3 that the parameters to be determined in the actual computation are the set of all indexes mapped to pattern S and their corresponding labels l i in the transaction dataset D, so the main task in this stage is to find the above two parameters.
Based on the PFWER false positive control characteristics combined with the Index-Tree algorithm of Section 3.3, a distributed computational method for hypothesis determination is obtained, in which the index sets mapped to patterns and their labels are computed in parallel. The method consists of three phases: a dataset-partitioning phase, a frequent 1-item set and FP tree construction phase, and a group-mining phase for the pattern-mapped index sets and their labels. Figure 7 shows the computational framework of distributed hypothesis determination, in which the dataset is divided into n partitions in the partitioning phase and subsequent computations are performed in parallel. The main objective of the frequent 1-item set and FP tree construction phase is to construct frequent 1-item sets with index structures and to build FP trees with index structures from them and the transactional dataset. Figure 8 illustrates the process of constructing the frequent 1-item sets, as follows.
Figure 7. Find hypothetical computing frameworks in parallel.
Figure 8. Constructing frequent 1-item sets.
  • First, the items in the dataset are split using the flatMap operator to construct <key = item, value = index> key-value pairs in parallel, and the map operator is used to construct <key = item, value = 1> key-value pairs.
  • Secondly, the <key = item, value = 1> key-value pairs are accumulated using the reduceByKey operator. In the result, the key is the item name and the value is the number of occurrences of the item in the dataset.
  • Next, the <key = item, value = index> key-value pairs are processed with the groupByKey operator to obtain new <key = item, value = index set> key-value pairs, where the value is the index set of the transactions containing the key item.
  • Finally, the join operator combines <key = item, value = index set> and <key = item, value = count> into new <key = item, value = count + index set> key-value pairs, which are output in descending order of count to form the item header table for subsequent calculations (see the sketch after this list).
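A hedged Spark (Java) sketch of the four steps above is given below; the input line format (a tab-separated transaction id followed by space-separated items) and all names are assumptions made for illustration, not the paper's code.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Sketch of the frequent 1-item construction in Figure 8. Each input line is
// assumed to be "tid<TAB>item1 item2 ..."; format and names are illustrative.
public class FrequentOneItems {

    static JavaPairRDD<String, Tuple2<Integer, Iterable<Long>>> build(JavaSparkContext sc, String path) {
        JavaRDD<String> lines = sc.textFile(path);

        // <key = item, value = index>: explode each transaction into (item, tid) pairs.
        JavaPairRDD<String, Long> itemIndex = lines.flatMapToPair(line -> {
            String[] parts = line.split("\t");
            long tid = Long.parseLong(parts[0]);
            List<Tuple2<String, Long>> out = new ArrayList<>();
            for (String item : parts[1].split(" ")) {
                out.add(new Tuple2<>(item, tid));
            }
            return out.iterator();
        });

        // <key = item, value = 1> accumulated with reduceByKey gives the support count.
        JavaPairRDD<String, Integer> itemCount = itemIndex
                .mapToPair(t -> new Tuple2<>(t._1(), 1))
                .reduceByKey(Integer::sum);

        // groupByKey collects the index (TID) set of each item.
        JavaPairRDD<String, Iterable<Long>> itemIndexes = itemIndex.groupByKey();

        // join yields <key = item, value = (count, index set)>, i.e. the entries of the
        // item header table; sorting by count descending precedes FP tree construction.
        return itemCount.join(itemIndexes);
    }
}
```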
The FP tree with index structure is constructed by traversing the transaction dataset guided by the frequent 1-item sets with index structure. Next, the frequent 1-item sets are divided into h groups, with group numbers denoted by h_id, and each group holds a complete FP tree with index structure. A conditional pattern base and a conditional pattern tree are constructed for each h_id group, and the index sets containing the patterns are then mined using the Index-Tree algorithm. Since the labels correspond to the transaction data, the corresponding label set can be determined while the index sets are computed, so the two parameters related to the null hypothesis in the hypothesis test are both determined.

3.4.3. False Positive Control

This section uses the false positive control method proposed by Westfall and Young [30,43] to control the FWER at the α level. Its main idea is that a new resampled transactional dataset in which patterns and labels are unrelated can be generated simply by randomly permuting the class labels. One can then determine whether a false positive error has occurred by computing the minimum p-value after each permutation, p_min = min_S p_S^F, and checking whether p_min ≤ δ holds. The remainder of this paper refers to this method as the WY permutation algorithm.
The disadvantage of the WY permutation algorithm is that, in addition to requiring a large number of permutations, each one is computationally expensive. Terada et al. [13] observed that in Fisher's exact test, when the margins n, n_1, and σ(S) of the 2 × 2 contingency table are fixed, it follows from Equations (7) and (8) that the p-value is ultimately a function of σ_1(S) alone. Since the entries of the 2 × 2 contingency table are discrete and can only take finitely many values, σ_1(S) is bounded, i.e., σ_1(S) ∈ [σ_1(S)_min, σ_1(S)_max], where σ_1(S)_max = min{n_1, σ(S)} and σ_1(S)_min = max{0, σ(S) − (n − n_1)}. From these bounds it can further be deduced that there exists a minimum attainable p-value φ(σ_S) strictly greater than 0, as follows.
$$ \varphi(\sigma_S) = \min\big\{\, p_S^F(a) \mid \sigma_1(S)_{\min} \le a \le \sigma_1(S)_{\max} \,\big\} \qquad (13) $$
According to Equation (8), the p-value of Fisher's exact test is a cumulative sum of terms computed with Equation (7), all of which are greater than 0, and the minimum attainable p-value φ(σ_S) is reached when σ_1(S) = σ_1(S)_min or σ_1(S) = σ_1(S)_max. The set κ(δ) of testable patterns can then be defined as all patterns S with φ(σ_S) ≤ δ, so that patterns not in κ(δ) can never be statistically significant at level δ. On this basis, a monotonically decreasing lower bound φ̂(σ) on the minimum attainable p-value can be introduced, as shown in Equation (14).
$$ \hat{\varphi}(\sigma) = \begin{cases} \varphi(\sigma_S), & 0 \le \sigma_S \le n_1 \\ 1\big/\binom{n}{n_1}, & n_1 \le \sigma_S \le n \end{cases} \qquad (14) $$
The monotonically decreasing lower bound φ̂(σ) on the minimum attainable p-value gives κ̂(δ) = {S | φ̂(σ) ≤ δ}, which satisfies κ(δ) ⊆ κ̂(δ); by monotonicity this can be rewritten as κ̂(δ) = {S | σ(S) ≥ σ_δ}. In other words, only the patterns S satisfying this condition are relevant to the PFWER false positive control calculation. Based on the above, the pseudo-code of the distributed PFWER false positive control algorithm is given in Algorithms 3 and 4.
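Before the pseudo-code, the following sketch illustrates the bound-based pruning of Equations (13) and (14), under our reconstruction of Equation (14) above; it reuses the FisherExact helpers from the sketch in Section 3.1, and all names are illustrative.

```java
// Sketch of Equations (13)-(14): the minimum attainable p-value phi(sigmaS) and its
// monotone lower bound phiHat used for pruning. Reuses FisherExact from the earlier
// sketch; names are illustrative and the piecewise form of (14) is our reconstruction.
public class MinAttainablePValue {

    // Equation (13): phi(sigmaS) = min over attainable a of p_S^F(a); as argued in the
    // text, the minimum is reached at one of the two extreme values of a.
    static double phi(int sigmaS, int n1, int n) {
        int aMin = Math.max(0, sigmaS - (n - n1));
        int aMax = Math.min(n1, sigmaS);
        return Math.min(FisherExact.fisherPValue(aMin, sigmaS, n1, n),
                        FisherExact.fisherPValue(aMax, sigmaS, n1, n));
    }

    // Equation (14): equal to phi(sigma) while sigma <= n1, constant 1 / C(n, n1) afterwards.
    static double phiHat(int sigma, int n1, int n) {
        if (sigma <= n1) return phi(sigma, n1, n);
        double logBinom = 0.0;                       // log C(n, n1)
        for (int i = 1; i <= n1; i++) {
            logBinom += Math.log(n - n1 + i) - Math.log(i);
        }
        return Math.exp(-logBinom);
    }

    public static void main(String[] args) {
        int n = 100, n1 = 40;
        // A pattern with phiHat(sigma(S)) > delta can never be significant at level delta,
        // which is exactly the pruning condition behind kappaHat(delta).
        System.out.println(phiHat(3, n1, n) + "  " + phiHat(50, n1, n));
    }
}
```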
Algorithm 3 DS-FWER(D)
Require: D
Ensure: δ
 1: label ← DistributedLabelPermutation(D)
 2: p_min^(i) ← 1
 3: σ ← 1, δ ← φ̂(σ)
 4: itemIndex ← flatMap(D), itemOne ← map(D)
 5: itemCount ← reduceByKey(itemOne), itemIndexs ← groupByKey(itemIndex)
 6: item ← itemCount.join(itemIndexs)
 7: tree ← createF1Tree(item), F1_tree ← IPFPTree(tree)
 8: itemGroup ← group(item)
 9: index ← IndexTree(itemGroup, F1_tree)
10: WY(index, label)
11: return the α quantile of {p_min^(i)}_{i=1}^{j_r}
Algorithm 4 WY Algorithm
Require: index, label
Ensure: σ
 1: p_S^F(σ_1(S))
 2: for i = 1, …, j_r do
 3:     compute σ_1(S)
 4:     p_min^(i) ← min(p_min^(i), p_S^F(σ_1(S)))
 5: end for
 6: FWER(δ) ← (1/j_r) Σ_{i=1}^{j_r} 1[p_min^(i) ≤ δ]
 7: while FWER(δ) > α do
 8:     σ ← σ + 1, δ ← φ̂(σ)
 9:     FWER(δ) ← (1/j_r) Σ_{i=1}^{j_r} 1[p_min^(i) ≤ δ]
10: end while
11: for indexList ∈ index do
12:     compute σ(S)
13:     if σ(S) ≥ σ then
14:         WY(index, label)
15:     end if
16: end for
The first line of Algorithm 3 uses distributed label permutation to obtain the permuted label sets with indexed positions; the second line initializes all minimum p-values in the j_r permutation rounds to 1; the third line initializes the minimum support of the pattern and, from it, the corrected significance threshold δ for subsequent calculations. Lines four to seven construct the frequent 1-item sets with index structures and the FP trees in parallel, and line eight groups the frequent 1-item sets and distributes the grouped data to the nodes of the cluster. The Index-Tree algorithm is rewritten so that its inputs are the FP tree and the frequent 1-item sets, and each node mines its index sets from the FP tree and its assigned group of frequent 1-items. Finally, the index sets and label sets are substituted into the WY permutation algorithm to obtain the set of j_r minimum p-values {p_min^(i)}_{i=1}^{j_r}; setting the significance threshold to the α quantile of {p_min^(i)}_{i=1}^{j_r} then controls the FWER at the α level.
Algorithm 4 is the WY permutation algorithm. The first line computes, with Fisher's exact test, all p-values p_S^F(σ_1(S)) within the bounds. Lines two to five compute, for each of the j_r permutations, the σ_1(S) value of each index set and the minimum p-value p_min^(i). Line six computes the current FWER(δ) from {p_min^(i)}_{i=1}^{j_r}. Lines seven to ten perform a loop in which, whenever FWER(δ) > α, the minimum support is increased by 1 and the significance threshold is updated accordingly, until FWER(δ) ≤ α. For all mined index sets with σ(S) ≥ σ, the WY permutation step is executed to find the final corrected significance threshold. Finally, the corrected significance thresholds found on each node are compared, and the smallest threshold among all nodes is the final result.

3.5. Proof of Correctness

The correctness of the algorithm rests on two points: the correctness of the data partitioning, and the correctness of the final result obtained by executing the WY permutation algorithm in parallel.
According to Section 3.3, we can find the index sets of all patterns S and de-duplicate these index sets before performing the PFWER false positive control computation, which reduces the amount of data to be processed while preserving the correctness of the result. In the distributed false positive control algorithm, the frequent 1-item sets with index structure are grouped, and each node uses the index FP tree and its assigned group of frequent 1-item sets for index set mining. The Index-Tree algorithm determines the conditional pattern base for each item in the header table from the FP tree and then constructs a conditional tree from that base for subsequent pattern mining. Therefore, as long as the initial index FP tree is consistent across all groups of the item header table, the index sets obtained by the distributed computation are the same as those obtained in the stand-alone case.
Theorem 6.
The minimum value of the significance threshold among all nodes is the overall significance threshold, and the overall significance threshold is the same as the result of the significance threshold computed by a single machine.
Proof. 
Whenever FWER(δ) > α, the WY permutation algorithm performs σ ← σ + 1 and δ ← φ̂(σ). Let In_1 and In_2 be two different index sets on different nodes with supports σ(In_1) and σ(In_2), where σ(In_1) < σ(In_2). According to Equation (9) and δ = max{δ | FWER(δ) ≤ α}, we obtain δ_{In_2} < δ_{In_1}, which verifies that δ decreases monotonically with σ. Therefore, index sets whose support is smaller than the current support count can be ignored without affecting the final result, so the final significance threshold is the minimum of the significance thresholds obtained on all nodes and is the same as the result of the stand-alone calculation. □

4. Experiments and Performance Analysis

This section validates the algorithm through experiments in four areas. Section 4.3.1 determines the parameters used in the distributed PFWER false positive control. Section 4.3.2 tests the pruning efficiency of the algorithm and verifies the effect of the pruning operation. Section 4.3.3 verifies the accuracy of the distributed PFWER false positive control calculation. Section 4.3.4 tests the operational efficiency of the distributed PFWER false positive control algorithm by comparing its runtime with that of the stand-alone PFWER false positive control algorithm on different datasets. These four experimental directions verify, on the one hand, whether the distributed and stand-alone false positive control algorithms give the same control results and, on the other hand, how much the distributed algorithm improves the computation rate. The experiments use different datasets to demonstrate the robustness and general applicability of the algorithms.

4.1. Experimental Environment Configuration

The algorithm in this paper is written in Java language and uses the Spark framework for distributed computation. The experimental code writing environment is shown in Table 6.
Table 6. Coding environment description.
The algorithm proposed in this paper is a distributed false positive control algorithm, so the main experimental part of the algorithm is completed on the cluster. The test cluster environment of the experiment is shown in Table 7.
Table 7. Experimental environment description.

4.2. Experimental Dataset

The information on the datasets used in the experiments is shown in Table 8. We performed our experiments on 11 datasets, available at FIMI'04 (http://fimi.ua.ac.be, 7 June 2022), UCI (https://archive.ics.uci.edu/ml/index.php, 7 June 2022), and kdd2018 (https://github.com/VandinLab/TopKWY, 10 June 2022). Datasets marked (L) in the description carry binary classification labels, and datasets marked (U) are unlabeled. For datasets whose transactions are not divided into two classes, a single item with a frequency close to 0.5 is removed from the transactions to artificially divide the dataset into two groups; n/n_1 denotes the ratio of the total number of transactions to the number of transactions labeled l_1, rounded to two decimal places.
Table 8. Experimental dataset.

4.3. Distributed PFWER False Positive Control Experiment

4.3.1. Determination of The Number of Permutations

  • Experimental description: This section determines the parameter used in the distributed PFWER false positive control, i.e., the number of label permutations j_r. Label permutation is essential to the accuracy of the distributed PFWER false positive control results; its purpose is to ensure that there is no relationship between labels and patterns, so that the null hypothesis proposed in this paper holds and inter-pattern dependencies do not influence the computed results. The experiment tests the effect of different numbers of permutations in the label permutation stage on the false positive control results. In all comparison experiments in this paper, the FP-Growth algorithm is used to perform the pattern mining operation.
  • Experimental analysis: The distributed PFWER false positive control uses a permutation-based approach for the control calculation. When setting the number of permutations j_r, the trade-off is that a larger j_r gives a more accurate estimate of the final corrected significance threshold, but the running time increases with j_r. The figures below show the computation for different datasets with different values of j_r.
The horizontal coordinate of Figure 9 is the number of permutations j_r, and the vertical coordinate is the final support count. Figure 10 shows the running time for different datasets with different permutation counts; the horizontal coordinate is j_r, and the vertical coordinate is the running time in seconds. Since label permutation is a random process, some individual permutations will shuffle the label order less thoroughly than others. However, the overall experimental results show that the support count stabilizes at j_r = 10^3–10^4; increasing the number of permutations beyond this has little effect on the calculation but greatly increases the running time of the algorithm, so the parameter chosen in this paper is j_r = 10^3 or j_r = 10^4.
Figure 9. The number of replacement experiments.
Figure 10. Run time changes.

4.3.2. Pruning Efficiency Analysis

(1)
Experimental description
The PFWER false positive control algorithm needs to find all the hypotheses to be tested in the dataset; these hypotheses are composed of the patterns mined in the transaction set and their corresponding permuted labels, so pattern mining techniques are required. During the computation, it was found that when Fisher's exact test is used to calculate the p-values and the WY permutation procedure is used for false positive control, pruning operations can reduce the amount of PFWER computation and speed it up without affecting the results.
The purpose of the experimental tests in this section is to verify the effect of pruning operations on the algorithm. From the above experimental description, it can be seen that the execution of the pruning operation reduces the number of patterns to be calculated for PFWER false positive control and does not affect the false positive control effect. Therefore, the experiments in this section will verify the efficiency of the pruning operation in terms of both the number of patterns that need to be computed before and after the pruning operation and the change in the significance threshold.
(2)
Experimental analysis
The purple bars in Figure 11 show the number of patterns mined before the pruning operation, and the green bars show the number mined after it. The experimental results show that using the pruning operation in the PFWER false positive control calculation effectively reduces the number of patterns computed, and therefore the number of p-values that need to be calculated with Fisher's exact test, which effectively improves the efficiency of the PFWER false positive control.
Figure 11. The number of modes before and after pruning operations in different datasets.
Table 9 shows the running time on different datasets before and after the pruning operation; the data show that, for most of the datasets, pruning improves the running efficiency.
Table 9. Time comparison before and after pruning.
Figure 12 represents the changes in the support counts of different datasets before and after the pruning operation. From the experimental results in Figure 12, we can see that the results calculated by the PFWER false positive control algorithm before and after performing the pruning operation are basically the same, thus verifying the correctness of the pruning operation.
Figure 12. Impact of pruning operation on support count.
Figure 13 compares the significance thresholds of the PFWER false positive control with and without the pruning operation on different datasets, with the vertical coordinate on a base-10 logarithmic scale. Since the PFWER false positive control performs j_r random label permutations, which affect the final significance threshold, some deviation between the thresholds with and without pruning on individual datasets is acceptable.
Figure 13. Significant threshold before and after pruning operation.

4.3.3. Accuracy Test

(1)
Experiment Description
The experiments in this section focus on verifying the accuracy of the computation of the distributed PFWER false positive control algorithm. The distributed PFWER false positive algorithm processes the data in the transaction dataset and then performs the PFWER false positive control calculation in parallel on each node of the cluster. The most important point in this process is to ensure that the computational results in the distributed case are consistent with those of the stand-alone computation, i.e., that the corrected significance thresholds obtained in the end are the same.
(2)
Experimental Analysis
Figure 14 gives a comparison of the minimum support calculated by the distributed PFWER false positive control with that of the stand-alone PFWER false positive control, from which it can be seen that the final minimum support obtained for different datasets performing PFWER false positive control is basically the same in the distributed and stand-alone cases, demonstrating the accuracy of the distributed algorithm calculation.
Figure 14. PFWER support for different datasets.
Figure 15 shows the final corrected significance threshold of the distributed PFWER false positive control versus that obtained from the PFWER false positive control in the stand-alone case, with the vertical coordinate on a base-10 logarithmic scale. The experimental results show that the corrected significance thresholds obtained on a single machine for the different datasets are in general agreement with those calculated by the distributed PFWER false positive control algorithm proposed in this paper.
Figure 15. Modified significance thresholds for different datasets of PFWER.

4.3.4. Operational Efficiency Test

(1)
Experimental Description
The main purpose of using distributed techniques for the PFWER false positive control calculation in this paper is to improve computational efficiency. The distributed PFWER false positive control algorithm reduces the experiment's running time without affecting its final results, since the number of patterns is reduced during hypothesis determination. In this section, the runtime of the distributed PFWER false positive control algorithm is compared with that of the stand-alone PFWER false positive algorithm and the existing FastWY [13] and WYlight [5] algorithms on different datasets, to test the efficiency of the distributed PFWER false positive control algorithm.
(2)
Experimental Analysis
The running times in Figure 16 are reported in seconds (s). The experiments compare the run times of the distributed PFWER false positive control algorithm, the stand-alone PFWER false positive algorithm, the FastWY algorithm [13], and the WYlight algorithm [5] on different datasets. The results show that the distributed PFWER false positive control algorithm effectively improves computation speed while avoiding the memory limitations of stand-alone computation, and can efficiently perform false positive control computations on large-scale data.
Figure 16. Runtime comparison of distributed PFWER control algorithms with existing algorithms.

4.4. Summary

The distributed PFWER false positive control algorithm has been analyzed and tested experimentally. The experimental data show that it produces the same control results as the stand-alone case while being more efficient than running on a single machine. The algorithm can effectively address the problem of excessive computation in false positive control for multiple hypothesis testing on big data.

5. Conclusions

The PFWER control algorithm can obtain a single hypothesis-test significance threshold subject to an arbitrarily specified overall false positive level constraint without assuming an independent identical distribution. Since the PFWER control algorithm is highly time-consuming, this paper proposes a distributed solution to the PFWER control algorithm, which significantly improves the execution efficiency of the PFWER control algorithm without any loss in theoretical accuracy. Specifically, we abstract the PFWER control problem as a frequent pattern mining problem, and by adapting the FP growth algorithm and introducing distributed computing techniques, the constructed FP tree is decomposed into a set of subtrees, each corresponding to a subtask. All subtrees (subtasks) are distributed to different computing nodes, and each node independently computes the local significance threshold according to the assigned subtasks. The local computation outcomes from every node are aggregated, and the FWER false positive control thresholds are calculated to be exactly in line with the theoretical outcomes. To the best of our knowledge, this is the first paper to present a distributed PFWER control algorithm. Experimental results on real datasets show that the proposed algorithm is more computationally efficient than the comparison algorithm.
In the future, we may also consider using unconditional exact tests, i.e., Barnard's exact test, to calculate p-values in false positive control methods for multiple hypothesis testing. Unconditional tests, however, are generally more expensive than conditional tests (such as Fisher's exact test), because they take into account all scenarios of the observed pattern frequencies in the actual dataset and require an unknown nuisance parameter for the subsequent calculations. Another possible direction is to extend the distributed algorithm of this paper to transactional datasets with multi-category labels, and to explore efficient distributed control of false positives in multiple hypothesis testing on other types of datasets.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, X.L., Y.S. and C.C.; validation, X.L., Y.S. and C.C.; formal analysis, X.L., Y.S. and C.C.; data curation, X.L., Y.S. and C.C.; writing—original draft preparation, X.L.; writing—review and editing, Y.Z., X.L., T.X., F.W., Y.S. and C.C.; visualization, Y.Z. and X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62032013 and 61772124).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Erdogmus, H. Bayesian Hypothesis Testing Illustrated: An Introduction for Software Engineering Researchers. ACM Comput. Surv. 2023, 55, 119:1–119:28. [Google Scholar] [CrossRef]
  2. Munoz, A.; Martos, G.; Gonzalez, J. Level Sets Semimetrics for Probability Measures with Applications in Hypothesis Testing. Methodol. Comput. Appl. Probab. 2023, 25, 21. [Google Scholar] [CrossRef]
  3. Li, Y.; Zhang, C.; Shelby, L.; Huan, T.C. Customers’ self-image congruity and brand preference: A moderated mediation model of self-brand connection and self-motivation. J. Prod. Brand Manag. 2022, 31, 798–807. [Google Scholar] [CrossRef]
  4. Jensen, R.I.T.; Iosifidis, A. Qualifying and raising anti-money laundering alarms with deep learning. Expert Syst. Appl. 2023, 214, 119037. [Google Scholar] [CrossRef]
  5. Llinares-López, F.; Sugiyama, M.; Papaxanthos, L.; Borgwardt, K.M. Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; Cao, L., Zhang, C., Joachims, T., Webb, G.I., Margineantu, D.D., Williams, G., Eds.; ACM: New York, NY, USA, 2015; pp. 725–734. [Google Scholar] [CrossRef]
  6. Dey, M.; Bhandari, S.K. FWER goes to zero for correlated normal. Stat. Probab. Lett. 2023, 193, 109700. [Google Scholar] [CrossRef]
  7. Audic, S.; Claverie, J.M. The significance of digital gene expression profiles. Genome Res. 1997, 7, 986–995. [Google Scholar]
  8. Holm, S. A simple sequentially rejective multiple test procedure. Scand. J. Stat. 1979, 6, 65–70. [Google Scholar]
  9. Simes, R.J. An improved Bonferroni procedure for multiple tests of significance. Biometrika 1986, 73, 751–754. [Google Scholar] [CrossRef]
  10. Hochberg, Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988, 75, 800–802. [Google Scholar] [CrossRef]
  11. Pellegrina, L.; Vandin, F. Efficient Mining of the Most Significant Patterns with Permutation Testing. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD 2018), London, UK, 19–23 August 2018; Guo, Y., Farooq, F., Eds.; ACM: New York, NY, USA, 2018; pp. 2070–2079. [Google Scholar] [CrossRef]
  12. Hang, D.; Zeleznik, O.A.; Lu, J.; Joshi, A.D.; Wu, K.; Hu, Z.; Shen, H.; Clish, C.B.; Liang, L.; Eliassen, A.H.; et al. Plasma metabolomic profiles for colorectal cancer precursors in women. Eur. J. Epidemiol. 2022, 37, 413–422. [Google Scholar] [CrossRef]
  13. Terada, A.; Tsuda, K.; Sese, J. Fast Westfall-Young permutation procedure for combinatorial regulation discovery. In Proceedings of the IEEE International Conference on Bioinformatics & Biomedicine, Belfast, UK, 2–5 November 2014. [Google Scholar]
  14. Harvey, C.R.; Liu, Y. False (and Missed) Discoveries in Financial Economics. J. Financ. 2020, 75, 2503–2553. [Google Scholar] [CrossRef]
  15. Kelter, R. Power analysis and type I and type II error rates of Bayesian nonparametric two-sample tests for location-shifts based on the Bayes factor under Cauchy priors. Comput. Stat. Data Anal. 2022, 165, 107326. [Google Scholar] [CrossRef]
  16. Andrade, C. Multiple Testing and Protection Against a Type 1 (False Positive) Error Using the Bonferroni and Hochberg Corrections. Indian J. Psychol. Med. 2019, 41, 99–100. [Google Scholar] [CrossRef] [PubMed]
  17. Blostein, S.D.; Huang, T.S. Detecting small, moving objects in image sequences using sequential hypothesis testing. IEEE Trans. Signal Process. 1991, 39, 1611–1629. [Google Scholar] [CrossRef]
  18. Babu, P.; Stoica, P. Multiple Hypothesis Testing-Based Cepstrum Thresholding for Nonparametric Spectral Estimation. IEEE Signal Process. Lett. 2022, 29, 2367–2371. [Google Scholar] [CrossRef]
  19. Benjamini, Y.; Hochberg, Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B Methodological 1995, 57, 289–300. [Google Scholar] [CrossRef]
  20. Benjamini, Y.; Hochberg, Y. On the Adaptive Control of the False Discovery Rate in Multiple Testing With Independent Statistics. J. Educ. Behav. Stat. 2000, 25, 60–83. [Google Scholar] [CrossRef]
  21. Benjamini, Y.; Krieger, A.M.; Yekutieli, D. Adaptive linear step-up procedures that control the false discovery rate. Biometrika 2006, 93, 491–507. [Google Scholar]
  22. D’Alberto, R.; Raggi, M. From collection to integration: Non-parametric Statistical Matching between primary and secondary farm data. Stat. J. IAOS 2021, 37, 579–589. [Google Scholar] [CrossRef]
  23. Pawlak, M.; Lv, J. Nonparametric Testing for Hammerstein Systems. IEEE Trans. Autom. Control. 2022, 67, 4568–4584. [Google Scholar] [CrossRef]
  24. Carlson, J.M.; Heckerman, D.; Shani, G. Estimating False Discovery Rates for Contingency Tables. Technical Report MSR-TR-2009-53, 2009, 1–24. Available online: https://www.microsoft.com/en-us/research/publication/estimating-false-discovery-rates-for-contingency-tables/ (accessed on 13 February 2023).
  25. Bestgen, Y. Using Fisher’s Exact Test to Evaluate Association Measures for N-grams. arXiv 2021, arXiv:2104.14209. [Google Scholar]
  26. Pellegrina, L.; Riondato, M.; Vandin, F. SPuManTE: Significant Pattern Mining with Unconditional Testing. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019. [Google Scholar]
  27. Terada, A.; Sese, J. Bonferroni correction hides significant motif combinations. In Proceedings of the 13th IEEE International Conference on BioInformatics and BioEngineering (BIBE 2013), Chania, Greece, 10–13 November 2013; pp. 1–4. [Google Scholar] [CrossRef]
  28. Sultanov, A.; Protsyk, M.; Kuzyshyn, M.; Omelkina, D.; Shevchuk, V.; Farenyuk, O. A statistics-based performance testing methodology: A case study for the I/O bound tasks. In Proceedings of the 17th IEEE International Conference on Computer Sciences and Information Technologies (CSIT 2022), Lviv, Ukraine, 10–12 November 2022; pp. 486–489. [Google Scholar] [CrossRef]
  29. Paschali, M.; Zhao, Q.; Adeli, E.; Pohl, K.M. Bridging the Gap Between Deep Learning and Hypothesis-Driven Analysis via Permutation Testing; Springer: Cham, Switzerland, 2022. [Google Scholar]
  30. Westfall, P.H.; Young, S.S. Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment; John Wiley & Sons: Hoboken, NJ, USA, 1993. [Google Scholar]
  31. Schwender, H. Dudoit, S., van der Laan, M.J.: Multiple Testing Procedures with Applications to Genomics. Stat. Pap. 2009, 50, 681–682. [Google Scholar] [CrossRef]
  32. Webb, G.I. Discovering Significant Patterns. Mach. Learn. 2007, 68, 1–33. [Google Scholar] [CrossRef]
  33. Liu, G.; Zhang, H.; Wong, L. Controlling False Positives in Association Rule Mining. In Proceedings of the VLDB Endowment, Seattle, WA, USA, 29 August–3 September 2011. [Google Scholar]
  34. Yan, D.; Qu, W.; Guo, G.; Wang, X. PrefixFPM: A Parallel Framework for General-Purpose Frequent Pattern Mining. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020. [Google Scholar]
  35. Messner, W. Hypothesis Testing and Machine Learning: Interpreting Variable Effects in Deep Artificial Neural Networks using Cohen’s f2. arXiv 2023, arXiv:2302.01407. [Google Scholar]
  36. Yu, J.; Wen, Y.; Yang, L.; Zhao, Z.; Guo, Y.; Guo, X. Monitoring on triboelectric nanogenerator and deep learning method. Nano Energy 2022, 92, 106698. [Google Scholar] [CrossRef]
  37. Han, J.; Kamber, M.; Pei, J. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, USA, 2011; pp. 248–253. [Google Scholar]
  38. Han, J.; Jian, P.; Yin, Y.; Mao, R. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 2004, 8, 53–87. [Google Scholar] [CrossRef]
  39. White, T. Hadoop—The Definitive Guide: Storage and Analysis at Internet Scale, 2nd ed.; O’Reilly Media: Sebastopol, CA, USA, 2011. [Google Scholar]
  40. Ji, K.; Kwon, Y. New Spam Filtering Method with Hadoop Tuning-Based MapReduce Naïve Bayes. Comput. Syst. Sci. Eng. 2023, 45, 201–214. [Google Scholar] [CrossRef]
  41. Zaharia, M.; Chowdhury, M.; Das, T.; Dave, A.; Stoica, I. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, San Jose, CA, USA, 25–27 April 2012. [Google Scholar]
  42. Chambers, B.; Zaharia, M. Spark: The Definitive Guide: Big Data Processing Made Simple; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2018. [Google Scholar]
  43. Dalleiger, S.; Vreeken, J. Discovering Significant Patterns under Sequential False Discovery Control. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, 14–18 August 2022; Zhang, A., Rangwala, H., Eds.; ACM: New York, NY, USA, 2022; pp. 263–272. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
