Article

A Class of Association Measures for Categorical Variables Based on Weighted Minkowski Distance

Department of Mathematical Sciences, University of Arkansas, Fayetteville, AR 72701, USA
Entropy 2019, 21(10), 990; https://doi.org/10.3390/e21100990
Submission received: 16 August 2019 / Revised: 28 September 2019 / Accepted: 10 October 2019 / Published: 11 October 2019
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

Measuring and testing association between categorical variables is one of the long-standing problems in multivariate statistics. In this paper, I define a broad class of association measures for categorical variables based on weighted Minkowski distance. The proposed framework subsumes some important measures including Cramér’s V, distance covariance, total variation distance and a slightly modified mean variance index. In addition, I establish the strong consistency of the defined measures for testing independence in two-way contingency tables, and derive the scaled forms of unweighted measures.

1. Introduction

Measuring and testing the association between categorical variables from observed data is one of the long-standing problems in multivariate statistics. The observed frequencies of two categorical variables are often displayed in a two-way contingency table, and a multinomial distribution can be used to model the cell counts. To be specific, let X and Y be two categorical random variables with finite sample spaces $\mathcal{X}$ and $\mathcal{Y}$ ($|\mathcal{X}| < \infty$, $|\mathcal{Y}| < \infty$, where $|\cdot|$ denotes the cardinality of a set); a simple random sample of size N can then be summarized in an $|\mathcal{X}| \times |\mathcal{Y}|$ table with count $N_{xy}$ in cell $(x, y)$. Let $f(x, y)$, $f(x)$, and $f(y)$ be the joint and marginal probabilities of X and Y, i.e., $f(x, y) = P(X = x, Y = y)$, $f(x) = P(X = x)$, $f(y) = P(Y = y)$. Statistical independence between X and Y is then defined by $f(x, y) = f(x) f(y)$ for all $(x, y) \in \mathcal{X} \times \mathcal{Y}$, i.e., every joint probability equals the product of the corresponding marginal probabilities. Pearson's chi-squared statistic,
$$X^2 = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{\left( f_N(x, y) - f_N(x) f_N(y) \right)^2}{f_N(x) f_N(y) / N},$$
where $f_N(x, y) = N_{xy}/N$, $f_N(x) = \sum_{y \in \mathcal{Y}} N_{xy}/N$, and $f_N(y) = \sum_{x \in \mathcal{X}} N_{xy}/N$, has been widely used to test independence in two-way contingency tables. Under independence and a sufficient sample size, $X^2$ approximately follows a chi-squared distribution with $df = (|\mathcal{X}| - 1)(|\mathcal{Y}| - 1)$. However, when the sample size is insufficient (e.g., $\min_{x, y} N_{x+} N_{+y} / N < 5$, where $N_{x+} = \sum_{y \in \mathcal{Y}} N_{xy}$ and $N_{+y} = \sum_{x \in \mathcal{X}} N_{xy}$), the chi-squared test tends to be conservative. Zhang (2019) suggested a random permutation test based on the test statistic
$$D^2 = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left( f_N(x, y) - f_N(x) f_N(y) \right)^2,$$
which is derived from the squared distance covariance, a measure of dependence between two random vectors of any type (discrete or continuous) [1,2]. The $D^2$ statistic is closely related to Pearson's chi-squared statistic: both measure the squared distance between $f(x, y)$ and $f(x) f(y)$ over $(x, y) \in \mathcal{X} \times \mathcal{Y}$. In the numerical study of Zhang (2019), the distance covariance test was evaluated in terms of statistical power and type I error rate under various settings (see Figures 1–3 in [1]). For relatively large sample sizes, the distance covariance test performs as well as Pearson's chi-squared test. For relatively small sample sizes, however, the distance covariance test is substantially more powerful while still controlling the type I error rate at the nominal level. For small samples, Pearson's chi-squared test exhibits substantial conservativeness, in the sense that its type I error rate is much lower than the nominal level and it fails to reject many false hypotheses. For instance, in a simulation setting with a 20 × 20 table and only 50 samples, both the statistical power and the type I error rate of Pearson's chi-squared test are close to zero, indicating extreme conservativeness.
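For concreteness, the following sketch (in Python, assuming only numpy; the 2 × 2 example table and function names are illustrative, not from the paper) computes both $X^2$ and $D^2$ from an observed count table.

```python
import numpy as np

def pearson_chi2(counts):
    """Pearson's chi-squared statistic X^2 for an |X| x |Y| count table."""
    N = counts.sum()
    f_xy = counts / N                      # f_N(x, y)
    f_x = f_xy.sum(axis=1, keepdims=True)  # f_N(x)
    f_y = f_xy.sum(axis=0, keepdims=True)  # f_N(y)
    return N * ((f_xy - f_x * f_y) ** 2 / (f_x * f_y)).sum()

def d2(counts):
    """D^2: sum of squared deviations of f_N(x, y) from f_N(x) f_N(y)."""
    N = counts.sum()
    f_xy = counts / N
    f_x = f_xy.sum(axis=1, keepdims=True)
    f_y = f_xy.sum(axis=0, keepdims=True)
    return ((f_xy - f_x * f_y) ** 2).sum()

counts = np.array([[20, 5], [10, 15]])     # hypothetical 2 x 2 table
print(pearson_chi2(counts), d2(counts))
```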
Although the distance covariance test has better empirical performance than Pearson's chi-squared test, especially for small sample sizes, its theoretical properties have not been investigated. In addition, Zhang (2019) studied only two alternative measures, namely distance covariance and projection correlation, and many other association measures in the literature remain unexplored. To name a few, Goodman and Kruskal (1954) introduced two association measures for categorical variables, namely the concentration coefficient and the λ coefficient [3]. Cui et al. (2015) developed a generic association measure based on a mean-variance index [4]. Theil (1970) proposed measuring the association between two categorical variables by the uncertainty coefficient [5]. McCane and Albert (2008) introduced the symbolic covariance, which expresses the covariance between categorical variables in terms of symbolic expressions [6]. In addition, Reshef et al. (2011) proposed a pairwise dependence measure called the maximal information coefficient (MIC), based on the grid that maximizes the mutual information gained from the data [7].
The purpose of this paper is to extend my previous work [1] to a broad class of association measures based on a general weighted Minkowski distance, and to numerically evaluate some selected measures from the proposed class. The proposed class unifies many existing measures, including the ϕ coefficient, Cramér's V, distance covariance, total variation distance, and a slightly modified mean variance index. Furthermore, I establish the strong consistency of the independence tests based on these measures and derive the scaled forms of the unweighted measures. The proposed class provides a rich set of alternatives to the prevailing chi-squared statistic and has many potential applications. For instance, it can be applied to correlation-based modeling, such as correlation-based deep learning [8]. As pointed out by a reviewer, the proposed method may also be applied to pseudorandom number generator tests, and may improve some existing chi-squared-based tests, including the poker test and the gap test [9].
The remainder of this paper is structured as follows: In Section 2, I introduce the defined class of association measures, and study some important special cases. The scaled forms of unweighted measures are also derived. Section 3 compares the performance of selected measures using simulated data. Section 4 discusses some extensions including the application to ordinal data and conditional independence test for three-way tables.

2. Methods

2.1. A Class of Association Measures for Categorical Variables

As the strength of association between two categorical variables is reflected by the distance between $f(x, y)$ and $f(x) f(y)$, I define a class of measures based on the weighted Minkowski distance
$$L_{r,\omega}(X, Y) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f(x, y) - f(x) f(y) \right|^r \omega^r(x, y) \right)^{1/r},$$
where $r \geq 1$, $\omega(x, y) > 0$, and $\omega(x, y)$ depends only on the marginal distributions of X and Y. For $0 < r < 1$, the defined distance violates the triangle inequality and is therefore not a metric. However, $r = \infty$ is allowed, and I denote the maximum norm by $L_{\infty,\omega}(X, Y)$. It can be proved that $L_{1,\omega}(X, Y) \geq L_{2,\omega}(X, Y) \geq L_{\infty,\omega}(X, Y)$ for a given weight $\omega(x, y)$. Throughout this paper, $L_r(X, Y)$ denotes the unweighted measure, i.e., $\omega(x, y) \equiv 1$. The defined class is quite broad, and I begin with some important special cases.
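To make the definition concrete, here is a minimal sketch (assuming numpy; the function name and the optional weight interface are my own, not the paper's) of the plug-in computation of $L_{r,\omega}(X, Y)$ from a joint probability table. It is reused in later examples.

```python
import numpy as np

def minkowski_assoc(f_xy, r=2.0, weight=None):
    """L_{r, omega}(X, Y) for a joint p.m.f. table f_xy (rows: x, columns: y).

    weight: optional function of the marginals (column vector f_x, row vector f_y)
    returning omega(x, y) > 0; weight=None gives the unweighted measure L_r(X, Y).
    """
    f_x = f_xy.sum(axis=1, keepdims=True)          # f(x), shape (|X|, 1)
    f_y = f_xy.sum(axis=0, keepdims=True)          # f(y), shape (1, |Y|)
    dev = np.abs(f_xy - f_x * f_y)                 # |f(x, y) - f(x) f(y)|
    w = np.ones_like(f_xy) if weight is None else weight(f_x, f_y)
    if np.isinf(r):                                # maximum norm L_{inf, omega}
        return (dev * w).max()
    return ((dev * w) ** r).sum() ** (1.0 / r)
```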
Firstly, most of the chi-squared-type measures belong to the defined class. For instance, the ϕ coefficient for 2 × 2 tables, i.e., $|\mathcal{X}| = |\mathcal{Y}| = 2$,
$$\phi(X, Y) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{\left| f(x, y) - f(x) f(y) \right|^2}{f(x) f(y)} \right)^{1/2},$$
is a special case of $L_{2,\omega}(X, Y)$ with $\omega(x, y) = \{ f(x) f(y) \}^{-1/2}$. Extensions of $\phi(X, Y)$ to $I \times J$ tables, including Cramér's V and Tschuprow's T [10,11],
$$V(X, Y) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{\left| f(x, y) - f(x) f(y) \right|^2}{f(x) f(y)} \right)^{1/2} \left( \frac{1}{\min(|\mathcal{X}| - 1, |\mathcal{Y}| - 1)} \right)^{1/2}, \qquad T(X, Y) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{\left| f(x, y) - f(x) f(y) \right|^2}{f(x) f(y)} \right)^{1/2} \left( \frac{1}{\sqrt{(|\mathcal{X}| - 1)(|\mathcal{Y}| - 1)}} \right)^{1/2},$$
are also special cases of $L_{2,\omega}(X, Y)$, with $\omega(x, y) = \{ f(x) f(y) \min(|\mathcal{X}| - 1, |\mathcal{Y}| - 1) \}^{-1/2}$ for Cramér's V and $\omega(x, y) = \{ f(x) f(y) \sqrt{(|\mathcal{X}| - 1)(|\mathcal{Y}| - 1)} \}^{-1/2}$ for Tschuprow's T.
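As an illustration of how these chi-squared-type measures arise from the class, the following hedged sketch (reusing the minkowski_assoc function above; the weight helper names are hypothetical) encodes the corresponding weight functions.

```python
import numpy as np

def w_phi(f_x, f_y):
    # omega(x, y) = {f(x) f(y)}^{-1/2}: L_{2, omega} is the phi coefficient (2 x 2 tables)
    return (f_x * f_y) ** -0.5

def w_cramer(f_x, f_y):
    # omega(x, y) = {f(x) f(y) min(|X| - 1, |Y| - 1)}^{-1/2}: L_{2, omega} is Cramer's V
    k = min(f_x.size, f_y.size) - 1
    return (f_x * f_y * k) ** -0.5

def w_tschuprow(f_x, f_y):
    # omega(x, y) = {f(x) f(y) sqrt((|X| - 1)(|Y| - 1))}^{-1/2}: L_{2, omega} is Tschuprow's T
    k = np.sqrt((f_x.size - 1) * (f_y.size - 1))
    return (f_x * f_y * k) ** -0.5

# e.g. cramers_v = minkowski_assoc(f_xy, r=2, weight=w_cramer)
```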
Distance covariance for categorical variables also belongs to the defined class. Distance covariance is a measure of statistical dependence between two random vectors X and Y, and it is a special case of the Hilbert–Schmidt independence criterion (HSIC) [12]. Let $(X_1, Y_1)$, $(X_2, Y_2)$, and $(X_3, Y_3)$ be three independent copies of $(X, Y)$; the distance covariance between X and Y is defined as the square root of
$$\mathrm{dCov}^2(X, Y) = \mathrm{cov}\left( \|X_1 - X_2\|, \|Y_1 - Y_2\| \right) - 2\, \mathrm{cov}\left( \|X_1 - X_2\|, \|Y_1 - Y_3\| \right),$$
where $\| \cdot \|$ denotes a distance between vectors, e.g., the Euclidean distance. An alternative definition of distance covariance, which uses only two independent copies of $(X, Y)$, is given in Sejdinovic et al. (2013) [12]; a proof of the equivalence between the two definitions is provided in Appendix A.1. A key property of distance covariance is that $\mathrm{dCov}^2(X, Y) = 0$ if and only if X and Y are statistically independent, which makes it capable of capturing nonlinear dependence. Zhang (2019) studied the distance covariance for categorical variables under the multinomial model. Defining $\|X_1 - X_2\| = 0$ if $X_1 = X_2$ and 1 otherwise, one can show that
$$\mathrm{dCov}(X, Y) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f(x, y) - f(x) f(y) \right|^2 \right)^{1/2},$$
and it is easy to see that $\mathrm{dCov}(X, Y) = L_2(X, Y)$. A detailed proof of Equation (3) is provided in Appendix A.2.
Another special case is the total variation distance, which is defined as the largest difference between two probability measures [13]. Let $\mu_0(\cdot)$ and $\mu_\alpha(\cdot)$ be the probability measures under independence and dependence, respectively; the total variation distance between $\mu_0$ and $\mu_\alpha$ can then be used to measure the dependence between X and Y:
$$\delta(\mu_0, \mu_\alpha) = \max_{S \subseteq \mathcal{X} \times \mathcal{Y}} \left| \mu_0(S) - \mu_\alpha(S) \right|.$$
In the case of discrete sample spaces, let $S^+ = \{ (x, y) : f(x, y) > f(x) f(y) \}$ and $S^- = \{ (x, y) : f(x, y) < f(x) f(y) \}$; then we have
$$\delta(\mu_0, \mu_\alpha) = \left| \mu_0(S^+) - \mu_\alpha(S^+) \right| = \left| \mu_0(S^-) - \mu_\alpha(S^-) \right| = \frac{1}{2} \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f(x, y) - f(x) f(y) \right|,$$
therefore $\delta(\mu_0, \mu_\alpha) = L_{1,\omega}(X, Y)$ with $\omega(x, y) = \frac{1}{2}$.
In addition, I point out that the mean variance (MV) index recently developed by Cui et al. [4] also belongs to the defined class, subject to some slight modifications. The MV index between two variables X and Y is defined as $MV(X \mid Y) = E_X\left( V_Y\left( F(X \mid Y) \right) \right)$, where $F(x \mid y)$ denotes the conditional distribution function. It can be proved that $MV(X \mid Y) = 0$ if and only if X and Y are independent. The MV index was originally developed for continuous variables. To make it suitable for categorical variables while preserving its main theoretical property, I modify the definition slightly. First, I replace the conditional c.d.f. $F(x \mid y)$ with the conditional p.m.f. $f(x \mid y)$. Second, as the MV index is generally asymmetric, i.e., $MV(X \mid Y) \neq MV(Y \mid X)$, I consider a symmetric version of the index, $MV(X, Y) = \frac{1}{2}\left( MV(X \mid Y) + MV(Y \mid X) \right)$. With these two modifications, one can prove the following result (a detailed proof is provided in Appendix A.3):
$$MV(X, Y) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{1}{2} \left| f(x, y) - f(x) f(y) \right|^2 \left( \frac{f(x)}{f(y)} + \frac{f(y)}{f(x)} \right) \right)^{1/2},$$
therefore $MV(X, Y) = L_{2,\omega}(X, Y)$ with $\omega(x, y) = \left\{ \frac{1}{2}\left( \frac{f(x)}{f(y)} + \frac{f(y)}{f(x)} \right) \right\}^{1/2}$. As $\frac{1}{2}\left( \frac{f(x)}{f(y)} + \frac{f(y)}{f(x)} \right) \geq 1$, we also have $MV(X, Y) \geq L_2(X, Y)$.
Similar to the MV index, the symmetric versions of some other directional association measures (e.g., the concentration coefficient [3]) are also special cases of $L_{r,\omega}$.

2.2. Sample Estimate and Independence Test

Given a simple random sample of size N, one can estimate $L_{r,\omega}(X, Y)$ by the sample quantity
$$L_{r,\omega,N}(X, Y) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f_N(x, y) - f_N(x) f_N(y) \right|^r \omega_N^r(x, y) \right)^{1/r},$$
where $f_N(x, y)$, $f_N(x)$, and $f_N(y)$ are the maximum likelihood estimates of the joint and marginal probabilities, respectively, i.e., $f_N(x, y) = N_{xy}/N$, $f_N(x) = \sum_{y \in \mathcal{Y}} N_{xy}/N$, and $f_N(y) = \sum_{x \in \mathcal{X}} N_{xy}/N$. The following theorem establishes the strong consistency of the independence test based on $L_{r,\omega,N}(X, Y)$ (a detailed proof is provided in Appendix A.4):
Theorem 1.
Assume that the estimated weights are bounded above by a constant $C > 0$, i.e., $\sup_{x,y} \omega_N(x, y) \leq C$. Then, under independence, for any $r \geq 1$ and $\epsilon > 0$, we have $P\left( L_{r,\omega,N}(X, Y) > \epsilon \right) < \left( 2^{|\mathcal{X}||\mathcal{Y}|} + 2^{|\mathcal{Y}|} + 2^{|\mathcal{X}|} \right) \exp\left( -N\epsilon^2/(18 C^2) \right)$. The inequality also holds for the maximum norm $L_{\infty,\omega,N}(X, Y)$.
It is noteworthy that the asymptotic null distribution of $L_{r,\omega,N}(X, Y)$ is impractical to derive. The theorem above provides a simple way to compute an upper bound on the p-value; however, the bound $\left( 2^{|\mathcal{X}||\mathcal{Y}|} + 2^{|\mathcal{Y}|} + 2^{|\mathcal{X}|} \right) \exp\left( -N\epsilon^2/(18 C^2) \right)$ is generally not tight, so the p-value could be largely overestimated. Here, I suggest a simple permutation procedure to evaluate significance. One can randomly shuffle the observations of X (or, equivalently, the observations of Y) M times and compute the test statistic $L_{r,\omega,N}(X^{perm}, Y)$ for each permuted dataset. The permutation p-value is the proportion of the $L_{r,\omega,N}(X^{perm}, Y)$ values that exceed the observed statistic. I used the permutation p-value to evaluate statistical significance in the simulation studies.
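A minimal sketch of the suggested permutation procedure (assuming numpy and the minkowski_assoc sketch above; variable and function names are illustrative) is given below.

```python
import numpy as np

def permutation_pvalue(x, y, r=2.0, weight=None, M=2000, rng=None):
    """Permutation p-value for the test based on L_{r, omega, N}(X, Y).

    x, y: 1-D integer arrays of category labels (0, ..., k-1) for the N observations.
    """
    rng = np.random.default_rng() if rng is None else rng
    kx, ky = x.max() + 1, y.max() + 1

    def stat(xv):
        counts = np.zeros((kx, ky))
        np.add.at(counts, (xv, y), 1)               # cell counts N_{xy}
        return minkowski_assoc(counts / len(xv), r=r, weight=weight)

    observed = stat(x)
    perms = np.array([stat(rng.permutation(x)) for _ in range(M)])
    return np.mean(perms >= observed)               # proportion exceeding the observed statistic
```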

2.3. Scaled Forms of Unweighted Measures

Motivated by the classic correlation coefficient, I define the following scaled form for the unweighted measure $L_r(X, Y)$:
$$L_r^*(X, Y) = \frac{L_r(X, Y)}{\sqrt{L_r(X, X)\, L_r(Y, Y)}},$$
where $L_r(X, X) = \left( \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} \left| f(x, x') - f(x) f(x') \right|^r \right)^{1/r}$, with $f(x, x') = f(x)$ if $x = x'$ and $f(x, x') = 0$ otherwise.
The term $L_r(X, X)$ can be written as
$$L_r(X, X) = \left( \sum_{x \in \mathcal{X}} \left| f(x) - f^2(x) \right|^r + \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X} \setminus \{x\}} \left| f(x) f(x') \right|^r \right)^{1/r},$$
and, as examples, explicit expressions for $L_1(X, X)$, $L_2(X, X)$, and $L_\infty(X, X)$ are given below:
  • $L_1(X, X) = \sum_{x \in \mathcal{X}} \left( f(x) - f^2(x) \right) + \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X} \setminus \{x\}} f(x) f(x') = 2\left( 1 - \sum_{x \in \mathcal{X}} f^2(x) \right)$
  • $L_2(X, X) = \left( \sum_{x \in \mathcal{X}} f^2(x) \left( \sum_{x \in \mathcal{X}} f^2(x) + 1 \right) - 2 \sum_{x \in \mathcal{X}} f^3(x) \right)^{1/2}$
  • $L_\infty(X, X) = \max\left\{ \max_{x \in \mathcal{X}} \left( f(x) - f^2(x) \right),\; \max_{x \in \mathcal{X}, x' \in \mathcal{X} \setminus \{x\}} f(x) f(x') \right\} = \max_{x \in \mathcal{X}} \left( f(x) - f^2(x) \right)$
It can be seen that $L_2^*(X, Y)$ is the same as the distance correlation between X and Y [1]; therefore $0 \leq L_2^*(X, Y) \leq 1$, with $L_2^*(X, Y) = 0$ if and only if X and Y are independent. In fact, for any $1 \leq r < \infty$, if $f(x) > 0$ and $f(y) > 0$ for all $x \in \mathcal{X}$, $y \in \mathcal{Y}$, it can be proved that $0 \leq L_r^*(X, Y) \leq 1$, where $L_r^*(X, Y) = 0$ if and only if X and Y are independent, and $L_r^*(X, Y) = 1$ if and only if X and Y are perfectly associated, i.e., $|\mathcal{X}| = |\mathcal{Y}|$ and for every $x \in \mathcal{X}$ there exists a unique $y \in \mathcal{Y}$ such that $f(x, y) = f(x) = f(y)$.
For $L_\infty^*(X, Y)$, by the Cauchy–Schwarz inequality,
$$L_\infty(X, Y) = \max_{x \in \mathcal{X}, y \in \mathcal{Y}} \left| f(x, y) - f(x) f(y) \right| = \max_{x \in \mathcal{X}, y \in \mathcal{Y}} \left| \mathrm{cov}\left( I\{X = x\}, I\{Y = y\} \right) \right| \leq \max_{x \in \mathcal{X}, y \in \mathcal{Y}} \sqrt{ V\left( I\{X = x\} \right) V\left( I\{Y = y\} \right) } = \sqrt{ \max_{x \in \mathcal{X}} V\left( I\{X = x\} \right) } \sqrt{ \max_{y \in \mathcal{Y}} V\left( I\{Y = y\} \right) } = \sqrt{ \max_{x \in \mathcal{X}} \left( f(x) - f^2(x) \right) } \sqrt{ \max_{y \in \mathcal{Y}} \left( f(y) - f^2(y) \right) } = \sqrt{ L_\infty(X, X)\, L_\infty(Y, Y) },$$
therefore $0 \leq L_\infty^*(X, Y) \leq 1$. However, in general, $L_\infty^*(X, Y) = 1$ does not imply that X and Y are perfectly associated; Table 1 gives an example where $L_\infty^*(X, Y) = 1$ but X and Y are not perfectly associated.
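The following sketch (again reusing the minkowski_assoc function; names are illustrative) computes the scaled measure $L_r^*(X, Y)$ from a joint probability table, using the expression for $L_r(X, X)$ above.

```python
import numpy as np

def self_distance(f, r=2.0):
    """L_r(V, V) for a marginal p.m.f. f, using f(v, v') = f(v) if v = v' and 0 otherwise."""
    dev = np.abs(np.diag(f) - np.outer(f, f))
    return dev.max() if np.isinf(r) else (dev ** r).sum() ** (1.0 / r)

def scaled_minkowski(f_xy, r=2.0):
    """L_r^*(X, Y) = L_r(X, Y) / sqrt(L_r(X, X) L_r(Y, Y)), for the unweighted measure."""
    f_x, f_y = f_xy.sum(axis=1), f_xy.sum(axis=0)
    return minkowski_assoc(f_xy, r=r) / np.sqrt(self_distance(f_x, r) * self_distance(f_y, r))
```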

3. Numerical Study

Two simulation studies were conducted to compare the performance of selected measures from the defined class. In both simulations, I set $|\mathcal{X}| = |\mathcal{Y}| = 10$ and varied the sample size from 25 to 500, so that the simulated contingency tables were relatively large and sparse (the average cell count $N/(|\mathcal{X}||\mathcal{Y}|)$ is between 0.25 and 5).
In the first simulation study, I considered the independence tests based on different unweighted measures, namely $L_1$, $L_2$, $L_4$, and $L_\infty$, under the following multinomial settings (a data-generation sketch follows the list):
  • Setting 1: $f(x, y) = 0.05$ for 10 randomly selected cells and $f(x, y) = 0.5/90$ for the remaining 90 cells
  • Setting 2: $f(x, y) = 0.08$ for 10 randomly selected cells and $f(x, y) = 0.2/90$ for the remaining 90 cells
  • Setting 3: $f(x, y) = 0.1$ for one randomly selected cell and $f(x, y) = 0.9/99$ for the remaining 99 cells
  • Setting 4: $f(x, y) = 0.2$ for one randomly selected cell and $f(x, y) = 0.8/99$ for the remaining 99 cells
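The sketch below (assuming numpy; the seed and sample size are arbitrary) generates one simulated table under Setting 1; the other settings differ only in the cell probabilities.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed
I, J, n = 10, 10, 100

# Setting 1: probability 0.05 in each of 10 random cells, 0.5/90 in the remaining 90.
probs = np.full(I * J, 0.5 / 90)
probs[rng.choice(I * J, size=10, replace=False)] = 0.05
counts = rng.multinomial(n, probs).reshape(I, J)    # one simulated 10 x 10 table
```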
For each test, the p-values were computed based on 2000 random permutations. Figure 1 summarizes the empirical statistical power of the four tests at significance level 0.05. In settings 1 and 2, the $L_2$ measure (Euclidean distance) performed best overall, with $L_4$ performing comparably, while the maximum norm $L_\infty$ performed worst. In settings 3 and 4, where a single cell accounts for most of the deviation from independence, the maximum norm performed best and the $L_1$ measure (Manhattan distance) gave the lowest power. Figure 2 summarizes the type I error rates, showing that all four tests control the type I error rate at the nominal level of 0.05.
In the second simulation study, I focused on $L_{2,\omega}(X, Y)$, as it subsumes many popular measures. In particular, I compared three weight functions: $\omega(x, y) = 1$ (distance covariance), $\omega(x, y) = \{f(x) f(y)\}^{-1/2}$ (Pearson's chi-squared), and $\omega(x, y) = \left\{ \frac{1}{2}\left( \frac{f(x)}{f(y)} + \frac{f(y)}{f(x)} \right) \right\}^{1/2}$ (modified mean variance index). Figure 3 shows the empirical statistical power of the three measures under settings 1 and 2, where it can be seen that the unweighted $L_2$ compares favorably to the weighted measures.
Based on the simulation studies, I recommend the unweighted $L_r$ measures with a moderate choice of r (for instance, $r = 2, 3, 4$) for large sparse tables, because they give satisfactory and stable statistical power in general scenarios. The maximum norm $L_\infty$ is not recommended unless one is confident that a very small number of cells accounts for most of the deviation from independence.

4. Discussion

In this work, I proposed a rich class of dependence measures for categorical variables based on weighted Minkowski distance. The defined class unifies a number of existing measures including Cramér’s V, distance covariance, total variation distance and a slightly modified mean variance index. I provided the scaled forms of unweighted measures, which range from 0 (independence) to 1 (perfect association). Further, I established the strong consistency of the defined measures and suggested a simple permutation test for evaluating significance. Although I have used nominal and univariate categorical variables for illustrations, the proposed framework can be extended to other data types and problems:
First, the proposed measures can be used to detect ordinal association by assigning proper weights. Similar to Pearson's correlation coefficient, one may assign larger weights to more extreme categories of X and Y. To be specific, let $d(x, x')$ be a predefined distance between categories $X = x$ and $X = x'$, and $d(y, y')$ the distance between $y$ and $y'$; one could then apply the weight function
$$\omega(x, y) = E\left( d(x, X)\, d(y, Y) \right) = \sum_{x' \in \mathcal{X} \setminus \{x\}} \sum_{y' \in \mathcal{Y} \setminus \{y\}} d(x, x')\, d(y, y')\, f(x') f(y'),$$
which assigns larger weights to cells in the corners but smaller weights to cells in the center of the table.
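A hedged sketch of this ordinal weight (using absolute rank differences for $d(\cdot, \cdot)$, which is my choice of distance and not prescribed by the paper) is given below; it is compatible with the minkowski_assoc sketch in Section 2.1.

```python
import numpy as np

def ordinal_weight(f_x, f_y):
    """omega(x, y) = sum_{x', y'} d(x, x') d(y, y') f(x') f(y'), with d = absolute rank difference."""
    dx = np.abs(np.subtract.outer(np.arange(f_x.size), np.arange(f_x.size)))  # d(x, x')
    dy = np.abs(np.subtract.outer(np.arange(f_y.size), np.arange(f_y.size)))  # d(y, y')
    ex = dx @ f_x.ravel()           # E d(x, X) for each category x
    ey = dy @ f_y.ravel()           # E d(y, Y) for each category y
    return np.outer(ex, ey)         # larger in the corners, smaller near the center
```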
Second, my framework can be generalized to random vectors and multi-way tables. In the case of a three-way table $(X, Y, Z)$, one can define the following Minkowski distance between $f(x, y, z)$ and $f(x, y) f(z)$:
$$L_{r,\omega}((X, Y), Z) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} \left| f(x, y, z) - f(x, y) f(z) \right|^r \omega^r(x, y, z) \right)^{1/r},$$
which can be used to test the joint independence between $(X, Y)$ and Z, or, equivalently, to test the homogeneity of the joint distribution of $(X, Y)$ across the levels of Z. A similar permutation procedure can be applied to evaluate the statistical significance. One can also define the distance between $f(x, y, z)$ and $f(x) f(y) f(z)$ to test the mutual independence of $(X, Y, Z)$:
$$L_{r,\omega}(X, Y, Z) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} \left| f(x, y, z) - f(x) f(y) f(z) \right|^r \omega^r(x, y, z) \right)^{1/r}.$$
Furthermore, the framework can be extended to the conditional independence test in three-way tables [14], by defining a distance between the conditional joint probabilities $f(x, y \mid z)$ and the product of conditional marginal probabilities $f(x \mid z) f(y \mid z)$:
$$L_{r,\omega}(X, Y \mid Z) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \sum_{z \in \mathcal{Z}} \left| f(x, y \mid z) - f(x \mid z) f(y \mid z) \right|^r \omega^r(x, y, z) \right)^{1/r}.$$
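As a closing illustration, the sketch below (unweighted case, numpy only; function names are mine, not from the paper) computes the joint-independence and conditional-independence versions of the measure from a three-way probability array.

```python
import numpy as np

def joint_independence(f_xyz, r=2.0):
    """L_r((X, Y), Z): unweighted distance between f(x, y, z) and f(x, y) f(z)."""
    f_xy = f_xyz.sum(axis=2, keepdims=True)
    f_z = f_xyz.sum(axis=(0, 1), keepdims=True)
    return (np.abs(f_xyz - f_xy * f_z) ** r).sum() ** (1.0 / r)

def conditional_independence(f_xyz, r=2.0):
    """L_r(X, Y | Z): unweighted distance between f(x, y | z) and f(x | z) f(y | z)."""
    f_z = f_xyz.sum(axis=(0, 1), keepdims=True)
    f_xy_z = f_xyz / f_z                              # f(x, y | z)
    f_x_z = f_xy_z.sum(axis=1, keepdims=True)         # f(x | z)
    f_y_z = f_xy_z.sum(axis=0, keepdims=True)         # f(y | z)
    return (np.abs(f_xy_z - f_x_z * f_y_z) ** r).sum() ** (1.0 / r)
```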

Author Contributions

Q.Z. conceived of the presented idea, developed the theory, performed the computations and wrote the manuscript.

Funding

This research received no external funding.

Acknowledgments

The author would like to thank the editor and two reviewers for their thoughtful comments and efforts towards improving the manuscript.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MIC: maximal information coefficient
HSIC: Hilbert–Schmidt independence criterion
MV: mean variance index
c.d.f.: cumulative distribution function
p.m.f.: probability mass function

Appendix A. Technical Details

Appendix A.1. Equivalency between Two Definitions of Distance Covariance

  • Definition by Szekely et al. (2007): $\mathrm{dCov}^2(X, Y) = E_{X_1 X_2 Y_1 Y_2}\left( \|X_1 - X_2\| \|Y_1 - Y_2\| \right) + E_{X_1 X_2}\left( \|X_1 - X_2\| \right) E_{Y_1 Y_2}\left( \|Y_1 - Y_2\| \right) - 2 E_{X_1 X_2 Y_1 Y_3}\left( \|X_1 - X_2\| \|Y_1 - Y_3\| \right)$
  • Definition by Sejdinovic et al. (2013): $\mathrm{dCov}^2(X, Y) = E_{X_1 X_2 Y_1 Y_2}\left( \|X_1 - X_2\| \|Y_1 - Y_2\| \right) + E_{X_1 X_2}\left( \|X_1 - X_2\| \right) E_{Y_1 Y_2}\left( \|Y_1 - Y_2\| \right) - 2 E_{X_1 Y_1}\left( E_{X_2}\left( \|X_1 - X_2\| \right) E_{Y_2}\left( \|Y_1 - Y_2\| \right) \right)$
The first two terms are identical, and the equivalence of the third terms can be shown as follows:
$$E_{X_1 Y_1}\left( E_{X_2}\left( \|X_1 - X_2\| \right) E_{Y_2}\left( \|Y_1 - Y_2\| \right) \right) = \int_{x_1} \int_{y_1} \left( \int_{x_2} \|x_1 - x_2\| f(x_2)\, dx_2 \right) \left( \int_{y_2} \|y_1 - y_2\| f(y_2)\, dy_2 \right) f(x_1, y_1)\, dx_1\, dy_1 = \int_{x_1} \int_{y_1} \left( \int_{x_2} \|x_1 - x_2\| f(x_2)\, dx_2 \right) \left( \int_{y_3} \|y_1 - y_3\| f(y_3)\, dy_3 \right) f(x_1, y_1)\, dx_1\, dy_1 = \int_{x_1} \int_{x_2} \int_{y_1} \int_{y_3} \|x_1 - x_2\| \|y_1 - y_3\| f(x_1, y_1) f(x_2) f(y_3)\, dx_1\, dx_2\, dy_1\, dy_3 = E_{X_1 X_2 Y_1 Y_3}\left( \|X_1 - X_2\| \|Y_1 - Y_3\| \right)$$

Appendix A.2. Derivation of Equation (3)

Following Zhang (2019), I rewrite the categorical variables X and Y as two random vectors of dimensions $|\mathcal{X}|$ and $|\mathcal{Y}|$, $\mathbf{X} = \{ I(X = x) \}_{x \in \mathcal{X}}$ and $\mathbf{Y} = \{ I(Y = y) \}_{y \in \mathcal{Y}}$, where $I(\cdot)$ denotes the indicator function. Let $\|X_1 - X_2\|$ equal 0 if $X_1 = X_2$ and 1 otherwise, and let $(X_1, Y_1)$, $(X_2, Y_2)$, $(X_3, Y_3)$ be three independent copies of $(X, Y)$. By Equation (2), the squared distance covariance can also be expressed as
$$\mathrm{dCov}^2(X, Y) = E\left( \|X_1 - X_2\| \|Y_1 - Y_2\| \right) + E\left( \|X_1 - X_2\| \right) E\left( \|Y_1 - Y_2\| \right) - 2 E\left( \|X_1 - X_2\| \|Y_1 - Y_3\| \right).$$
Under the multinomial sampling scheme, it is straightforward to show that
$$E\left( \|X_1 - X_2\| \|Y_1 - Y_2\| \right) = P(X_1 \neq X_2, Y_1 \neq Y_2) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f(x, y)\left[ 1 - f(x) - f(y) + f(x, y) \right],$$ $$E\left( \|X_1 - X_2\| \right) E\left( \|Y_1 - Y_2\| \right) = P(X_1 \neq X_2)\, P(Y_1 \neq Y_2) = \left[ 1 - \sum_{x \in \mathcal{X}} f^2(x) \right] \left[ 1 - \sum_{y \in \mathcal{Y}} f^2(y) \right],$$ $$E\left( \|X_1 - X_2\| \|Y_1 - Y_3\| \right) = P(X_1 \neq X_2, Y_1 \neq Y_3) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} f(x, y)\left[ 1 - f(x) \right]\left[ 1 - f(y) \right].$$
Summarizing the results above, I have
$$\mathrm{dCov}(X, Y) = \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f(x, y) - f(x) f(y) \right|^2 \right)^{1/2}.$$

Appendix A.3. Derivation of the Modified Mean Variance Index

The symmetric mean variance index is defined as
$$MV(X, Y) = \frac{1}{2}\left( MV(X \mid Y) + MV(Y \mid X) \right) = \frac{1}{2}\left( E_X\left( V_Y\left( f(X \mid Y) \right) \right) + E_Y\left( V_X\left( f(Y \mid X) \right) \right) \right).$$
I first derived the explicit formula for $E_X\left( V_Y\left( f(X \mid Y) \right) \right)$:
$$E_X\left( V_Y\left( f(X \mid Y) \right) \right) = E_X\left( E_Y\left( f^2(X \mid Y) \right) \right) - E_X\left( E_Y^2\left( f(X \mid Y) \right) \right) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{f^2(x, y) f(x)}{f(y)} - \sum_{x \in \mathcal{X}} f^3(x) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{f(x)}{f(y)}\left( f^2(x, y) - f^2(x) f^2(y) \right) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{f(x)}{f(y)}\left\{ \left( f(x, y) - f(x) f(y) \right)^2 - 2 f^2(x) f^2(y) + 2 f(x, y) f(x) f(y) \right\} = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{f(x)}{f(y)}\left( f(x, y) - f(x) f(y) \right)^2$$
Similarly, it can be seen that $E_Y\left( V_X\left( f(Y \mid X) \right) \right) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{f(y)}{f(x)}\left( f(x, y) - f(x) f(y) \right)^2$; therefore,
$$MV(X, Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \frac{1}{2}\left( \frac{f(y)}{f(x)} + \frac{f(x)}{f(y)} \right)\left( f(x, y) - f(x) f(y) \right)^2.$$

Appendix A.4. Proof of Theorem 1

Because $L_{1,\omega,N} \geq L_{2,\omega,N} \geq L_{\infty,\omega,N}$, I only need to prove the strong consistency result for $L_{1,\omega,N}$. For a categorical variable X, let $\{f(x)\}_{x \in \mathcal{X}}$ be its probability mass function, N the sample size, and $f_N(x)$ the sample estimate. Biau and Györfi (2005) [15] proved the following result:
Lemma A1.
For any $\epsilon > 0$, $P\left( \sum_{x \in \mathcal{X}} \left| f_N(x) - f(x) \right| > \epsilon \right) < 2^{|\mathcal{X}|} e^{-N\epsilon^2/2}$.
As $\sup_{x,y} \omega_N(x, y) \leq C$ for some constant $C > 0$, I have
$$L_{1,\omega,N}(X, Y) \leq C \left( \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f_N(x, y) - f(x, y) \right| + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f(x, y) - f(x) f(y) \right| + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f_N(x) f_N(y) - f(x) f(y) \right| \right).$$
Under independence, $\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f(x, y) - f(x) f(y) \right| = 0$. By Lemma A1, the first term $\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f_N(x, y) - f(x, y) \right|$ satisfies
$$P\left( C \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f_N(x, y) - f(x, y) \right| > \frac{\epsilon}{3} \right) < 2^{|\mathcal{X}||\mathcal{Y}|} e^{-N\epsilon^2/(18 C^2)}.$$
The third term can be bounded as follows:
$$\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f_N(x) f_N(y) - f(x) f(y) \right| \leq \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f(x) f_N(y) - f(x) f(y) \right| + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f_N(x) f_N(y) - f(x) f_N(y) \right| = \sum_{y \in \mathcal{Y}} \left| f_N(y) - f(y) \right| + \sum_{x \in \mathcal{X}} \left| f_N(x) - f(x) \right|.$$
By Lemma A1, I have
$$P\left( C \sum_{y \in \mathcal{Y}} \left| f_N(y) - f(y) \right| > \frac{\epsilon}{3} \right) < 2^{|\mathcal{Y}|} e^{-N\epsilon^2/(18 C^2)}, \qquad P\left( C \sum_{x \in \mathcal{X}} \left| f_N(x) - f(x) \right| > \frac{\epsilon}{3} \right) < 2^{|\mathcal{X}|} e^{-N\epsilon^2/(18 C^2)},$$
and summarizing the results above, I have
$$P\left( L_{1,\omega,N}(X, Y) > \epsilon \right) \leq P\left( C \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \left| f_N(x, y) - f(x, y) \right| > \frac{\epsilon}{3} \right) + P\left( C \sum_{y \in \mathcal{Y}} \left| f_N(y) - f(y) \right| > \frac{\epsilon}{3} \right) + P\left( C \sum_{x \in \mathcal{X}} \left| f_N(x) - f(x) \right| > \frac{\epsilon}{3} \right) < 2^{|\mathcal{X}||\mathcal{Y}|} e^{-N\epsilon^2/(18 C^2)} + 2^{|\mathcal{Y}|} e^{-N\epsilon^2/(18 C^2)} + 2^{|\mathcal{X}|} e^{-N\epsilon^2/(18 C^2)} = \left( 2^{|\mathcal{X}||\mathcal{Y}|} + 2^{|\mathcal{X}|} + 2^{|\mathcal{Y}|} \right) e^{-N\epsilon^2/(18 C^2)}.$$

References

  1. Zhang, Q. Independence test for large sparse contingency tables based on distance correlation. Stat. Probab. Lett. 2019, 148, 17–22. [Google Scholar] [CrossRef]
  2. Szekely, G.; Rizzo, M.; Bakirov, N. Measuring and testing dependence by correlation of distances. Ann. Stat. 2007, 35, 2769–2794. [Google Scholar] [CrossRef]
  3. Goodman, L.; Kruskal, W. Measures of association for cross classifications, part I. J. Am. Stat. Assoc. 1954, 49, 732–764. [Google Scholar]
  4. Cui, H.; Li, R.; Zhong, W. Model-Free Feature Screening for Ultrahigh Dimensional Discriminant Analysis. J. Am. Stat. Assoc. 2015, 110, 630–641. [Google Scholar] [CrossRef] [PubMed]
  5. Theil, H. On the estimation of relationships involving qualitative variables. Am. J. Sociol. 1970, 76, 103–154. [Google Scholar] [CrossRef]
  6. McCane, B.; Albert, M. Distance functions for categorical and mixed variables. Pattern Recognit. Lett. 2008, 29, 986–993. [Google Scholar] [CrossRef] [Green Version]
  7. Reshef, D.; Reshef, Y.; Finucane, H.; Grossman, S.; McVean, G.; Turnbaugh, P.; Lander, E.; Mitzenmacher, M.; Sabeti, P. Detecting novel associations in large data sets. Science 2011, 334, 1518–1524. [Google Scholar] [CrossRef]
  8. Moews, B.; Herrmann, M.; Ibikunle, G. Lagged correlation-based deep learning for directional trend change prediction in financial time series. arXiv 2018, arXiv:1811.11287. [Google Scholar] [CrossRef]
  9. Knuth, D. The Art of Computer Programming, 3rd ed.; Addison-Wesley: Boston, MA, USA, 1997. [Google Scholar]
  10. Cramér, H. Mathematical Methods of Statistics; Princeton Press: Princeton, NJ, USA, 1946. [Google Scholar]
  11. Tschuprow, A. Principles of the Mathematical Theory of Correlation. Bull. Am. Math. Soc. 1939, 46, 389. [Google Scholar]
  12. Sejdinovic, D.; Sriperumbudur, B.; Gretton, A.; Fukumizu, K. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat. 2013, 41, 2263–2291. [Google Scholar] [CrossRef]
  13. Sriperumbudur, B.; Fukumizu, K.; Gretton, A.; Scholkopf, B.; Lanckriet, G. On the empirical estimation of integral probability metric. Electron. J. Stat. 2012, 6, 1550–1599. [Google Scholar] [CrossRef]
  14. Zhang, Q.; Tinker, J. Testing conditional independence and homogeneity in large sparse three-way tables using conditional distance covariance. Stat 2019, 8, 1–9. [Google Scholar] [CrossRef]
  15. Biau, G.; Györfi, L. On the asymptotic properties of a nonparametric L1-test statistic of homogeneity. IEEE Trans. Inf. Theory 2005, 51, 3965–3973. [Google Scholar] [CrossRef]
Figure 1. Empirical statistical power of four different measures, $L_1$ (blue), $L_2$ (red), $L_4$ (black), and $L_\infty$ (green), under settings 1–4. In each setting, the sample sizes are n = 25, 50, 75, 100, 200, 500, and all results are based on 1000 replications.
Figure 2. Empirical type I error rate of four different measures, $L_1$ (blue), $L_2$ (red), $L_4$ (black), and $L_\infty$ (green), under settings 1–4. In each setting, the sample sizes are n = 25, 50, 75, 100, 200, 500, and all results are based on 1000 replications.
Figure 3. Empirical statistical power of three $L_{2,\omega}$ measures, with $\omega(x, y) = 1$ (distance covariance), $\omega(x, y) = \{f(x) f(y)\}^{-1/2}$ (chi-squared), and $\omega(x, y) = \left\{ \frac{1}{2}\left( \frac{f(x)}{f(y)} + \frac{f(y)}{f(x)} \right) \right\}^{1/2}$ (symmetric mean variance index), under settings 1 and 2. In each setting, the sample sizes are n = 25, 50, 75, 100, 200, 500, and all results are based on 1000 replications.
Table 1. An example where X and Y are not perfectly associated, but $L_\infty^*(X, Y) = 1$.

        | Y = 1 | Y = 2 | Y = 3
X = 1   |  1/2  |   0   |   0
X = 2   |   0   |  1/8  |  1/8
X = 3   |   0   |  1/8  |  1/8
