Article

Optimizing Allocation Rules in Discrete and Continuous Discriminant Analysis

by
Dário Ferreira
* and
Sandra S. Ferreira
Department of Mathematics, Center of Mathematics, University of Beira Interior, 6201-001 Covilhã, Portugal
*
Author to whom correspondence should be addressed.
Mathematics 2024, 12(8), 1173; https://doi.org/10.3390/math12081173
Submission received: 18 March 2024 / Revised: 5 April 2024 / Accepted: 10 April 2024 / Published: 13 April 2024

Abstract

This paper presents an approach for the study of probabilistic outcomes in experiments with multiple possible results. An approach to obtain confidence ellipsoids for the vector of probabilities, which represents the likelihood of specific results, is presented for both discrete and continuous discriminant analysis. The derivation of optimal allocation rules that reduce the allocation costs is then investigated. In the context of discrete discriminant analysis, the approach focuses on assigning elements to specific groups in an optimal way, whereas in the continuous case it involves determining the regions where each action is the optimal choice. The effectiveness of the proposed approach is examined with two numerical applications, one using real data and the other using simulated data.
MSC:
62H11; 62H30; 62F15; 62E20

1. Introduction

The process of allocating mixed data with several possible outcomes begins with the identification of the data. This includes identifying the type of data, the source of the data, and any relevant characteristics. Once the data have been identified, the next step is to determine the type of allocation to be used; there are several types, including random, manual, and automated allocation. Moreover, there are multiple methods for examining data with categorical outcomes. These methods include descriptive statistics, bivariate analyses, log-linear regression, multinomial logistic regression, linear discriminant analysis and quadratic discriminant analysis.
Descriptive statistics are used to summarize the characteristics of a data set.
Bivariate analysis is used to examine the relationship between two variables. In the case of categorical data, bivariate analysis consists of calculating the frequency or percentage of one variable for each category of the other variable.
The purpose of log-linear models is to study the relationship between an outcome variable and one or more explanatory variables, where the logarithm of the expected outcome is expressed as a linear combination of the explanatory variables. The log-linear model is a frequently used and simple structure for a contingency table, see Hand and Christen [1]. It is based on the same principle as the analysis of variance models, as outlined in Birch [2], Fienberg and Rinaldo [3] or Goodman [4], and is formed by calculating the logarithms of the cell probabilities. Examples of log-linear models with frequency tables can be found in Haberman [5].
On the other hand, multinomial logistic regression is a generalized linear model that expresses the log-odds of each response category as a linear combination of predictor variables. It is a significant tool in statistical modelling and prediction, extending the capabilities of the binary logistic regression model to situations with more than two outcomes.
The goal of discriminant analysis is to classify individuals into distinct groups, maximizing the difference between groups and minimizing the variation within groups. Discriminant analysis has several advantages over the other mentioned methods. For example, it is able to identify and measure the effects of one or more independent variables on a dependent variable, making it a powerful tool for predicting the outcome of a certain event. This method was first introduced by Sir Ronald Fisher in 1936, in Fisher [6], and remains a popular research topic. To give a few examples over time, Rao [7] applied discriminant analysis to two types of problems confronted in biological research, while Friedman [8] proposed alternatives to the usual maximum likelihood estimates for the covariance matrices in linear and quadratic discriminant analysis, in small-sample, high-dimensional settings. McFarland and Richards [9] used the theory of Bessel functions to derive stochastic representations for the exact distributions of the “plug-in” quadratic discriminant functions for classifying a newly obtained observation. Discriminant analysis was used in Perriere and Thioulouse [10] to separate Gram-negative bacterial proteins according to their subcellular location. The problem of classifying an individual into one of several populations based on mixed nominal, continuous, and ordinal data was studied in Flury, Boukai and Flury [11]. Modern research continues to find discriminant analysis helpful in problem-solving. For example, dynamic feature extraction combined with a quadratic discriminant analysis classifier can significantly enhance fault classification and diagnosis in dynamic nonlinear processes, as demonstrated by Li, Jia, and Mao [12]. Furthermore, Tong et al. [13] demonstrated the effectiveness of discriminant analysis in their work on bearing fault diagnosis, a crucial development in fault diagnosis for dynamic nonlinear processes. For more information on discriminant analysis, see McLachlan [14].
Discriminant analysis can be divided into two categories: continuous and discrete. In the continuous case, the predictor variables are continuous and the output of the analysis is a continuous function. In the discrete case, the predictor variables are discrete and the output of the analysis is a classification. In both cases, discriminant analysis is used to find the optimal allocation rules. In the continuous case, the optimal allocation rule is determined by finding the optimal separating hyperplane which maximizes the difference between the groups. In the discrete case, the optimal allocation rule is determined by finding the optimal decision tree, which maximizes the accuracy of the classification.
A unified approach, providing optimal allocation rules to minimize expected costs for both the continuous and the mixed cases, is presented in Ferreira et al. [15]. We will follow this latter approach.
Our goal is to obtain confidence ellipsoids for the vector of probabilities, $\mathbf{p}$, of getting particular results in an experiment with $m$ possible outcomes. Through duality, we will also test hypotheses. In addition, we will use support planes for ellipsoids, see Schott [16], to obtain simultaneous confidence intervals for these probabilities. Moreover, we will generalize these results to vectors

$\boldsymbol{\Psi}=\mathbf{G}\,\mathbf{p},$

where $\mathbf{G}$ is a known matrix. Furthermore, we will show how to obtain optimal allocation rules, in order to reduce the allocation costs, first for the discrete case and then for the continuous case.
The rest of this paper is organized as follows. The next section, containing preliminary results, is divided into two subsections. In the first one, we provide a simplified form of the continuous mapping theorem, CMT, based on Kallenberg [17]. This theorem will grant us the ability to derive confidence ellipsoids from the $N(\mathbf{0},W(\mathbf{p}))$ distribution in the second subsection, where we will follow the approach of Scheffé [18]. Section 3 begins with the study of individual samples, followed by the study of pairs and structured families of independent samples. These samples refer to the treatments of a fixed-effects experiment. The optimal allocation rules will be obtained in Section 4. Section 5 contains two numerical applications: the first refers to the discrete case and uses real data; the second refers to the continuous case and uses simulated data. We end the paper with some conclusions.

2. Preliminary Results

2.1. Asymptotic Distributions

The continuous mapping theorem, CMT, is a fundamental result in probability theory and statistics, describing how convergence in distribution is preserved under continuous transformations. Let $Y_n$ be a sequence of random variables that converges in distribution to a limiting random variable $Y$, denoted as $Y_n\xrightarrow{d}Y$, and let $g$ be a continuous function. The CMT, as established in Kallenberg [17], asserts that under these conditions $g(Y_n)$ converges in distribution to $g(Y)$, i.e., $g(Y_n)\xrightarrow{d}g(Y)$.
Moreover, if the convergence in distribution of $Y_n$ to $Y$, given by $F_{Y_n}(y)\rightarrow F_Y(y)$, is uniform on a compact set $L$, then the convergence in distribution of $g(Y_n)$ to $g(Y)$ is also uniform on the same set $L$. This uniform convergence property is valuable in understanding the stability of the distributional characteristics of the transformed random variables.
Further insight is gained from tightness: for any given $\epsilon>0$, there exists a compact set $K(\epsilon)$ such that the probability measure associated with $F_{Y_n}(y)$, denoted $\phi_n(\cdot)$, satisfies $\phi_n(K(\epsilon))\geq 1-\epsilon$. Notably, the Cartesian product

$\dot{K}(\epsilon)=K(\epsilon)\times L$

is also a compact set. Consequently, the convergence in distribution of $g(Y_n)$ to $g(Y)$ is uniform on the set $\dot{K}(\epsilon)$.
In summary, the asymptotic distribution of $g(Y_n)$ can be expressed as $F_{g(Y)}(x)$, provided that $x$ is a point of continuity of $F_{g(Y)}$. This convergence is established through the concept of weak convergence of probability measures, highlighting the broader applicability of the CMT in understanding the behavior of transformed random variables in a distributional sense.
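To make the role of the CMT concrete, the following R sketch (ours, not from the paper; the Bernoulli setup and the map $g(y)=|y|$ are illustrative choices) simulates the normalized estimator $Y_n=\sqrt{n}\,(\tilde{p}_n-p)$ and checks that its continuous transform has approximately the same distribution as the transformed limit.

```r
# Illustrative R sketch (ours): the continuous mapping theorem for a
# Bernoulli proportion. Y_n = sqrt(n)(p_hat - p) -> N(0, p(1 - p)) in
# distribution, so g(Y_n) -> g(Y) for the continuous map g(y) = |y|.
set.seed(123)
p <- 0.3; n <- 2000; reps <- 5000
Yn  <- replicate(reps, sqrt(n) * (mean(rbinom(n, 1, p)) - p))
gYn <- abs(Yn)                                  # g applied to the sequence
gY  <- abs(rnorm(reps, 0, sqrt(p * (1 - p))))   # g applied to the limit
# Matching empirical quantiles indicate matching limiting distributions:
rbind(sequence = quantile(gYn, c(0.5, 0.9, 0.99)),
      limit    = quantile(gY,  c(0.5, 0.9, 0.99)))
```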

2.2. Confidence Ellipsoids

Confidence ellipsoids serve as valuable tools for visualizing and quantifying uncertainty in multivariate data, providing insights into the variability of observations. These ellipsoids are particularly useful for testing hypotheses, constructing simultaneous confidence intervals, and making predictions in statistical analysis.
Consider a set of observations modeled by the linear equation Y = X β + ϵ , where X is a design matrix, β is a vector of fixed effects, and ϵ is a vector of errors. The associated confidence ellipsoid, denoted as E q , can be expressed as
$E_q=\left\{\boldsymbol{\mu}\in\Omega:(\mathbf{Y}-\boldsymbol{\mu})^{\top}\mathbf{C}^{-1}(\mathbf{Y}-\boldsymbol{\mu})\leq x_{h,1-q}\right\},$

where $\boldsymbol{\mu}=\mathbf{X}\boldsymbol{\beta}$ and $\mathbf{C}=\mathbf{X}\mathbf{X}^{\top}$. Here, $\Omega$ represents the feasible region of the ellipsoid, and $x_{h,1-q}$ is the critical value of a chi-square distribution with $h$ degrees of freedom, ensuring a desired confidence level $1-q$.
The confidence ellipsoid E q possesses notable properties. Firstly, it is a convex set, implying that any point within the ellipsoid is a convex combination of points on its boundary. This property facilitates a comprehensive understanding of the distribution of observations.
Secondly, the confidence ellipsoid aids in constructing simultaneous confidence intervals. For a vector of interest, $\boldsymbol{\tau}$, the simultaneous confidence intervals for $\boldsymbol{\tau}^{\top}\boldsymbol{\mu}$ satisfy

$\left|\boldsymbol{\tau}^{\top}(\mathbf{Y}-\boldsymbol{\mu})\right|\leq\sqrt{x_{h,1-q}\,\boldsymbol{\tau}^{\top}\mathbf{C}\,\boldsymbol{\tau}}.$

This provides a range for the linear combination $\boldsymbol{\tau}^{\top}\boldsymbol{\mu}$ with a specified level of confidence.
Finally, confidence ellipsoids are instrumental in hypothesis testing. Consider the following null hypothesis:
$H_0:\ \boldsymbol{\tau}^{\top}\boldsymbol{\mu}=a_0.$

If the hypothesis $H_0$ is true, then $a_0$ will be covered by the corresponding simultaneous interval, centered at $\boldsymbol{\tau}^{\top}\mathbf{Y}$; otherwise, the hypothesis is rejected.
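As a minimal illustration of these constructions, the R sketch below (our reconstruction, assuming unit error variance, and working with the parameter vector $\boldsymbol{\beta}$ so that the quadratic form uses $\mathbf{X}^{\top}\mathbf{X}$) tests a hypothesis through ellipsoid membership and builds a Scheffé-type simultaneous interval; $x_{h,1-q}$ is the chi-square critical value of the text.

```r
# Hedged R sketch (ours): ellipsoid-based test and a Scheffe-type
# simultaneous interval for a linear model y = X beta + e with unit error
# variance, so that (beta_hat - beta)' (X'X) (beta_hat - beta) ~ chi^2_h.
set.seed(1)
X <- cbind(1, 1:20)                        # design matrix
y <- as.vector(X %*% c(2, 0.5) + rnorm(20))
C <- crossprod(X)                          # C = X'X
beta_hat <- drop(solve(C, crossprod(X, y)))
h <- ncol(X); q <- 0.05
x_h <- qchisq(1 - q, df = h)               # critical value x_{h,1-q}
# Duality: H0: beta = beta0 is kept when beta0 lies inside the ellipsoid.
beta0 <- c(2, 0.5)
drop(t(beta_hat - beta0) %*% C %*% (beta_hat - beta0)) <= x_h
# Simultaneous interval for tau' beta, valid for every contrast tau:
tau  <- c(0, 1)
half <- drop(sqrt(x_h * t(tau) %*% solve(C) %*% tau))
c(lower = sum(tau * beta_hat) - half, upper = sum(tau * beta_hat) + half)
```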

3. Inference

In statistical inference, we often encounter experiments with multiple possible outcomes, each associated with a certain probability. To quantify the likelihood of obtaining specific results in such experiments, given the probabilities p 1 , , p m of each outcome, and the observed frequencies n 1 , , n m in n trials, we can employ the following probability function:
$pr\left(\bigcap_{i=1}^{m}\{N_i=n_i\}\right)=\frac{n!}{\prod_{i=1}^{m}n_i!}\prod_{i=1}^{m}p_i^{n_i},$

which corresponds to the singular multivariate (multinomial) distribution $M(\cdot\,|\,n,\mathbf{p})$.
Consider the probability vector $\mathbf{p}=(p_1,\dots,p_m)^{\top}$, and define the vector of estimators

$\tilde{\mathbf{p}}_n=(\tilde{p}_{n,1},\dots,\tilde{p}_{n,m})^{\top},$

with $\tilde{p}_{n,i}=\frac{n_i}{n}$, $i=1,\dots,m$. These estimators represent the observed proportions of each outcome. As the number of trials $n$ approaches infinity, Wilks’ theorem, in Wilks [19], provides insight into the asymptotic behavior of the estimators:
$\sqrt{n}\,(\tilde{\mathbf{p}}_n-\mathbf{p})\xrightarrow{d}N(\mathbf{0},W(\mathbf{p})),$

where $\xrightarrow{d}$ denotes convergence in distribution, and $N(\mathbf{0},W(\mathbf{p}))$ represents a normal distribution with zero mean vector and covariance matrix $W(\mathbf{p})$.
The covariance matrix $W(\mathbf{p})$ is defined as

$W(\mathbf{p})=D(\mathbf{p})-\mathbf{p}\mathbf{p}^{\top},$
where $D(\mathbf{p})$ is a diagonal matrix with the components of $\mathbf{p}$ along its diagonal. The probabilities of the initial distribution are estimated from the observed frequencies by maximum likelihood, through the $\tilde{p}_{n,i}$ above. This result indicates that, as the sample size grows, the distribution of the estimated probabilities becomes approximately normal, facilitating the application of classical statistical inference techniques. The covariance matrix $W(\mathbf{p})$ characterizes the variability of the estimates, and the asymptotic normality allows for the construction of confidence intervals and hypothesis tests based on the estimated probabilities $\tilde{\mathbf{p}}_n$.
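The following minimal R sketch (ours, with invented probabilities) computes the estimators $\tilde{p}_{n,i}=n_i/n$ and the plug-in covariance matrix $W(\tilde{\mathbf{p}}_n)=D(\tilde{\mathbf{p}}_n)-\tilde{\mathbf{p}}_n\tilde{\mathbf{p}}_n^{\top}$ from simulated multinomial counts.

```r
# Minimal R sketch (ours): multinomial estimates and the plug-in covariance
# matrix W(p) = D(p) - p p'; the probabilities below are invented.
set.seed(42)
p <- c(0.5, 0.3, 0.2); n <- 1000
counts  <- as.vector(rmultinom(1, size = n, prob = p))   # n_1, ..., n_m
p_tilde <- counts / n                                    # p~_{n,i} = n_i / n
W <- diag(p_tilde) - tcrossprod(p_tilde)                 # W(p~_n)
W
```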

3.1. One Sample

Let $\boldsymbol{\alpha}_1,\dots,\boldsymbol{\alpha}_m$ constitute an orthonormal basis for the orthogonal complement, $\Omega^{\perp}$, of $\Omega$. Then, we can write

$W(\tilde{\mathbf{p}}_n)=\sum_{j=1}^{m}v_j\,\boldsymbol{\alpha}_j\boldsymbol{\alpha}_j^{\top},$

where $v_j$ is the eigenvalue associated with $\boldsymbol{\alpha}_j$, see Horn [20]. We can further decompose the matrix $W(\tilde{\mathbf{p}}_n)$ as

$W(\tilde{\mathbf{p}}_n)=\sum_{j=1}^{m-1}v_j\,\boldsymbol{\alpha}_j\boldsymbol{\alpha}_j^{\top}+v_m\,\boldsymbol{\alpha}_m\boldsymbol{\alpha}_m^{\top}.$

Let $\boldsymbol{\alpha}_m$ be the eigenvector associated with the largest eigenvalue $v_m$. Then, we can write

$W(\tilde{\mathbf{p}}_n)^{+}=\sum_{j=1}^{m-1}v_j^{-1}\,\boldsymbol{\alpha}_j\boldsymbol{\alpha}_j^{\top}+v_m^{-1}\,\boldsymbol{\alpha}_m\boldsymbol{\alpha}_m^{\top},$

where $+$ denotes the Moore–Penrose inverse, compactly expressed as

$W(\tilde{\mathbf{p}}_n)^{+}=\sum_{j=1}^{m}v_j^{-1}\,\boldsymbol{\alpha}_j\boldsymbol{\alpha}_j^{\top}.$
According to Schott [16], and attending to the continuity of the Moore–Penrose inverse, if

$W(\tilde{\mathbf{p}}_n)\xrightarrow{p}W(\mathbf{p}),$

where $\xrightarrow{p}$ denotes convergence in probability, then

$W(\tilde{\mathbf{p}}_n)^{+}\xrightarrow{p}W(\mathbf{p})^{+}.$

Moreover, the covariance matrix of the vector $\sqrt{n}\,(\tilde{\mathbf{p}}_n-\mathbf{p})$, $\boldsymbol{\Sigma}(\sqrt{n}\,(\tilde{\mathbf{p}}_n-\mathbf{p}))$, will converge in probability to $W(\mathbf{p})$,

$\boldsymbol{\Sigma}\left(\sqrt{n}\,(\tilde{\mathbf{p}}_n-\mathbf{p})\right)\xrightarrow{p}W(\mathbf{p}).$

Furthermore, if $\mathbf{c}$ is a vector, the quadratic form $\mathbf{c}^{\top}W(\tilde{\mathbf{p}}_n)\mathbf{c}$ will converge in probability to $\mathbf{c}^{\top}W(\mathbf{p})\mathbf{c}$, i.e.,

$\mathbf{c}^{\top}W(\tilde{\mathbf{p}}_n)\mathbf{c}\xrightarrow{p}\mathbf{c}^{\top}W(\mathbf{p})\mathbf{c},$

and the variance of $\sqrt{n}\,\mathbf{c}^{\top}(\tilde{\mathbf{p}}_n-\mathbf{p})$ will also converge to $\mathbf{c}^{\top}W(\mathbf{p})\mathbf{c}$, i.e.,

$Var\left(\sqrt{n}\,\mathbf{c}^{\top}(\tilde{\mathbf{p}}_n-\mathbf{p})\right)\longrightarrow\mathbf{c}^{\top}W(\mathbf{p})\mathbf{c}.$
These properties can be used to draw statistical inference from the data: the matrix $W(\tilde{\mathbf{p}}_n)$ estimates the covariance matrix of $\sqrt{n}\,(\tilde{\mathbf{p}}_n-\mathbf{p})$, and $\mathbf{c}^{\top}W(\tilde{\mathbf{p}}_n)\mathbf{c}$ estimates $\mathbf{c}^{\top}W(\mathbf{p})\mathbf{c}$. Besides this, we have

$\sqrt{n}\,(\tilde{\mathbf{p}}_n-\mathbf{p})\xrightarrow{d}N(\mathbf{0},W(\mathbf{p}))\ \Longrightarrow\ \sqrt{n}\,(\mathbf{c}^{\top}\tilde{\mathbf{p}}_n-\mathbf{c}^{\top}\mathbf{p})\xrightarrow{d}N(0,\mathbf{c}^{\top}W(\mathbf{p})\mathbf{c});$

see Tsui [21]. We can then make use of the central limit theorem, together with the convergence results above, to show that

$\frac{\sqrt{n}\,\mathbf{c}^{\top}(\tilde{\mathbf{p}}_n-\mathbf{p})}{\sqrt{\mathbf{c}^{\top}W(\tilde{\mathbf{p}}_n)\mathbf{c}}}\xrightarrow{d}N(0,1).$

Using this result, we can construct a confidence interval for $\mathbf{c}^{\top}\mathbf{p}$, with confidence level $1-q$, given by

$pr\left(\mathbf{c}^{\top}\tilde{\mathbf{p}}_n-z_q\sqrt{\frac{\mathbf{c}^{\top}W(\tilde{\mathbf{p}}_n)\mathbf{c}}{n}}\leq\mathbf{c}^{\top}\mathbf{p}\leq\mathbf{c}^{\top}\tilde{\mathbf{p}}_n+z_q\sqrt{\frac{\mathbf{c}^{\top}W(\tilde{\mathbf{p}}_n)\mathbf{c}}{n}}\right)\longrightarrow 1-q.$

We may then use duality to test the hypothesis

$H_0(\mathbf{c}):\ \mathbf{c}^{\top}\mathbf{p}=\mathbf{c}^{\top}\mathbf{p}_0$

at limit level $q$. This method is useful for testing hypotheses about parameters in a variety of different models, including linear mixed models.
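Continuing the previous sketch, the interval for $\mathbf{c}^{\top}\mathbf{p}$ can be computed as below (ours; we read $z_q$ as the two-sided standard normal critical value).

```r
# Hedged R sketch (ours), reusing p_tilde, W and n from the previous block:
# limit-level (1 - q) confidence interval for c'p, and the dual test of
# H0(c): c'p = c'p0.
c_vec <- c(1, -1, 0)                     # contrast c: here p_1 - p_2
q <- 0.05
z_q <- qnorm(1 - q / 2)                  # z_q, taken two-sided
est <- sum(c_vec * p_tilde)
se  <- sqrt(drop(t(c_vec) %*% W %*% c_vec) / n)
ci  <- c(lower = est - z_q * se, upper = est + z_q * se)
ci
# H0(c) is rejected at limit level q when c'p0 falls outside 'ci'.
```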

3.2. Pair of Samples

We denote by $M(\cdot\,|\,n^{(l)},\mathbf{p}^{(l)})$, $l=1,2$, the singular multivariate distributions for a pair of independent samples, with $\mathbf{p}^{(l)}=(p_1^{(l)},\dots,p_m^{(l)})^{\top}$ and $p_i^{(l)}\geq 0$. Thus, the distributions and the corresponding statistics, such as estimators, will be independent.
We have

$\sqrt{n^{(l)}}\left(\tilde{\mathbf{p}}^{(l)}-\mathbf{p}^{(l)}\right)\xrightarrow{d}N\left(\mathbf{0},W(\mathbf{p}^{(l)})\right).$

We can also take the difference between each of the $p_i^{(1)}$ and $p_i^{(2)}$, $i=1,\dots,m$, and end up with unbiased estimators for the $q_i$, as follows:

$\tilde{q}_i=\tilde{p}_i^{(1)}-\tilde{p}_i^{(2)},\quad i=1,\dots,m.$

If $\frac{n^{(2)}}{n^{(1)}}\rightarrow r$, we will have

$\frac{n^{(1)}}{n}\rightarrow\frac{1}{1+r},\qquad\frac{n^{(2)}}{n}\rightarrow\frac{r}{1+r},$

with $n=n^{(1)}+n^{(2)}$. So,

$Var\left(\sqrt{n}\,\tilde{p}_i^{(1)}\right)\rightarrow(r+1)\,p_i^{(1)}(1-p_i^{(1)}),\qquad Var\left(\sqrt{n}\,\tilde{p}_i^{(2)}\right)\rightarrow\frac{r+1}{r}\,p_i^{(2)}(1-p_i^{(2)}),\quad i=1,\dots,m,$

and, given the independence of the two samples,

$Var\left(\sqrt{n}\,\tilde{q}_i\right)\rightarrow(r+1)\left[p_i^{(1)}(1-p_i^{(1)})+\frac{1}{r}\,p_i^{(2)}(1-p_i^{(2)})\right],\quad i=1,\dots,m.$

Since $Var(\tilde{q}_i)=Var(\sqrt{n}\,\tilde{q}_i)/n$ tends to zero as $n$ grows, $\tilde{q}_i$ is a consistent as well as unbiased estimator of $q_i$. To compare the two distributions, $M(\cdot\,|\,n^{(1)},\mathbf{p}^{(1)})$ and $M(\cdot\,|\,n^{(2)},\mathbf{p}^{(2)})$, we can use Hotelling’s $T^2$ statistic to measure the distance between them:

$T^2=n\left(\tilde{\mathbf{p}}^{(1)}-\tilde{\mathbf{p}}^{(2)}\right)^{\top}W(\mathbf{p}^{(1)})^{-1}\left(\tilde{\mathbf{p}}^{(1)}-\tilde{\mathbf{p}}^{(2)}\right).$
It is well known that if

$\frac{n^{(2)}}{n^{(1)}}\rightarrow r,$

then

$T^2\xrightarrow{d}\chi^2_{m-1}(r).$
Now, reasoning as in the previous subsection, it may be shown that

$\frac{\sqrt{n}\,(\tilde{q}_i-q_i)}{\sqrt{(r+1)\left[\tilde{p}_i^{(1)}(1-\tilde{p}_i^{(1)})+\frac{1}{r}\,\tilde{p}_i^{(2)}(1-\tilde{p}_i^{(2)})\right]}}\xrightarrow{d}N(0,1),\quad i=1,\dots,m,$

and so

$pr\left(\tilde{q}_i-z_q\sqrt{\frac{(r+1)\left[\tilde{p}_i^{(1)}(1-\tilde{p}_i^{(1)})+\frac{1}{r}\,\tilde{p}_i^{(2)}(1-\tilde{p}_i^{(2)})\right]}{n}}\leq q_i\leq\tilde{q}_i+z_q\sqrt{\frac{(r+1)\left[\tilde{p}_i^{(1)}(1-\tilde{p}_i^{(1)})+\frac{1}{r}\,\tilde{p}_i^{(2)}(1-\tilde{p}_i^{(2)})\right]}{n}}\right)\longrightarrow 1-q,\quad i=1,\dots,m.$

Thus, through duality, we get a limit level $q$ test for

$H_0(q_{0,i}):\ q_i=q_{0,i},\quad i=1,\dots,m.$
In addition,

$Var\left(\sqrt{n}\,\mathbf{c}^{\top}\tilde{\mathbf{p}}^{(1)}\right)\rightarrow(r+1)\,\mathbf{c}^{\top}W(\mathbf{p}^{(1)})\mathbf{c},\qquad Var\left(\sqrt{n}\,\mathbf{c}^{\top}\tilde{\mathbf{p}}^{(2)}\right)\rightarrow\frac{r+1}{r}\,\mathbf{c}^{\top}W(\mathbf{p}^{(2)})\mathbf{c},$

as $n\rightarrow\infty$, so that

$Var\left(\sqrt{n}\left(\mathbf{c}^{\top}\tilde{\mathbf{p}}^{(1)}-\mathbf{c}^{\top}\tilde{\mathbf{p}}^{(2)}\right)\right)\rightarrow(r+1)\,\mathbf{c}^{\top}\left[W(\mathbf{p}^{(1)})+\frac{1}{r}\,W(\mathbf{p}^{(2)})\right]\mathbf{c},$

and that

$\frac{\sqrt{n}\left[\left(\mathbf{c}^{\top}\tilde{\mathbf{p}}^{(1)}-\mathbf{c}^{\top}\tilde{\mathbf{p}}^{(2)}\right)-\left(\mathbf{c}^{\top}\mathbf{p}^{(1)}-\mathbf{c}^{\top}\mathbf{p}^{(2)}\right)\right]}{\sqrt{(r+1)\,\mathbf{c}^{\top}\left[W(\tilde{\mathbf{p}}^{(1)})+\frac{1}{r}\,W(\tilde{\mathbf{p}}^{(2)})\right]\mathbf{c}}}=\frac{\sqrt{n}\left(\mathbf{c}^{\top}\tilde{\mathbf{q}}-\mathbf{c}^{\top}\mathbf{q}\right)}{\sqrt{(r+1)\,\mathbf{c}^{\top}\left[W(\tilde{\mathbf{p}}^{(1)})+\frac{1}{r}\,W(\tilde{\mathbf{p}}^{(2)})\right]\mathbf{c}}}\xrightarrow{d}N(0,1).$

Thus,

$pr\left(\mathbf{c}^{\top}\tilde{\mathbf{q}}-z_q\sqrt{\frac{(r+1)\,\mathbf{c}^{\top}\left[W(\tilde{\mathbf{p}}^{(1)})+\frac{1}{r}\,W(\tilde{\mathbf{p}}^{(2)})\right]\mathbf{c}}{n}}\leq\mathbf{c}^{\top}\mathbf{q}\leq\mathbf{c}^{\top}\tilde{\mathbf{q}}+z_q\sqrt{\frac{(r+1)\,\mathbf{c}^{\top}\left[W(\tilde{\mathbf{p}}^{(1)})+\frac{1}{r}\,W(\tilde{\mathbf{p}}^{(2)})\right]\mathbf{c}}{n}}\right)\longrightarrow 1-q.$

And, using duality once again, we can obtain limit level $q$ tests for

$H_0(\mathbf{c}):\ \mathbf{c}^{\top}\mathbf{q}=d.$
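A hedged R sketch (ours, with invented probabilities) of the two-sample interval for $q_i=p_i^{(1)}-p_i^{(2)}$, with $r$ estimated by $n^{(2)}/n^{(1)}$:

```r
# Hedged R sketch (ours): interval for q_i = p_i^(1) - p_i^(2) from two
# independent multinomial samples; the probabilities below are invented.
set.seed(7)
p1 <- c(0.6, 0.3, 0.1); p2 <- c(0.4, 0.4, 0.2)
n1 <- 800; n2 <- 1200
n <- n1 + n2; r <- n2 / n1                         # n^(2)/n^(1) -> r
pt1 <- as.vector(rmultinom(1, n1, p1)) / n1
pt2 <- as.vector(rmultinom(1, n2, p2)) / n2
q_tilde <- pt1 - pt2                               # q~_i, i = 1, ..., m
i <- 1; q <- 0.05; z_q <- qnorm(1 - q / 2)
v <- (r + 1) * (pt1[i] * (1 - pt1[i]) + pt2[i] * (1 - pt2[i]) / r)
c(lower = q_tilde[i] - z_q * sqrt(v / n),
  upper = q_tilde[i] + z_q * sqrt(v / n))
```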

3.3. Structured Families of Samples

Let us consider $d$ treatments of a fixed-effects design, with a total of

$n=\sum_{l=1}^{d}n^{(l)}$

individuals, and $d$ independent samples. This configuration is referred to as a structured family, and it is characterized by the distributions of the samples,

$M(\cdot\,|\,n^{(l)},\mathbf{p}^{(l)}),\quad l=1,\dots,d,$

with $\mathbf{p}^{(l)}=(p_1^{(l)},\dots,p_m^{(l)})^{\top}$, $l=1,\dots,d$, and $p_i^{(l)}\geq 0$, $i=1,\dots,m$, $l=1,\dots,d$. Moreover, consider

$\frac{n^{(l)}}{n}\rightarrow r^{(l)},\quad l=1,\dots,d.$

Now, we will have

$\dot{X}_n^{(l)}=n\,r^{(l)}\left(\tilde{\mathbf{p}}_{n^{(l)}}^{(l)}-\mathbf{p}^{(l)}\right)^{\top}W^{+}(\mathbf{p}^{(l)})\left(\tilde{\mathbf{p}}_{n^{(l)}}^{(l)}-\mathbf{p}^{(l)}\right)\xrightarrow{d}G(\cdot\,|\,m-1),\quad l=1,\dots,d,$

where $G(\cdot\,|\,m-1)$ is the gamma distribution with a shape parameter of $m-1$ and a scale parameter of 1. So, with $\dot{\mathbf{X}}_n=(\dot{X}_n^{(1)},\dots,\dot{X}_n^{(d)})^{\top}$, given the independence of the $\dot{X}_n^{(l)}$, $l=1,\dots,d$, we have

$\dot{\mathbf{X}}_n\xrightarrow{d}\dot{G}(\cdot\,|\,m-1),$

with

$\dot{G}(\mathbf{x}\,|\,m-1)=\prod_{l=1}^{d}G(x_l\,|\,m-1),$

whenever $\mathbf{x}=(x_1,\dots,x_d)^{\top}$.
We now study the action of the factors in the base design, assuming that the hypotheses of absence of effects and interactions are linked to an orthogonal partition,

$\mathbb{R}^{d}=\mathop{\boxplus}\limits_{j=1}^{m}\nabla_j,$

and that the row vectors of $\mathbf{A}_j$ constitute an orthonormal basis for $\nabla_j$, $j=1,\dots,m$. This leads to the hypotheses

$H_{0,j}(\mathbf{c}):\ \mathbf{A}_j\,\boldsymbol{\eta}(\mathbf{c})=\mathbf{0},\quad j=1,\dots,m,$

where $\boldsymbol{\eta}(\mathbf{c})=(\mathbf{c}^{\top}\mathbf{p}^{(1)},\dots,\mathbf{c}^{\top}\mathbf{p}^{(d)})^{\top}$.
More generally,

$H_{0,j}(\mathbf{c}):\ \boldsymbol{\Psi}_j(\mathbf{c})=\boldsymbol{\Psi}_{0,j}(\mathbf{c}),\quad j=1,\dots,m.$

We have

$\boldsymbol{\Psi}_j(\mathbf{c})=\mathbf{A}_j\,\boldsymbol{\eta}(\mathbf{c}).$

So, with $\tilde{\boldsymbol{\eta}}(\mathbf{c})=(\mathbf{c}^{\top}\tilde{\mathbf{p}}^{(1)},\dots,\mathbf{c}^{\top}\tilde{\mathbf{p}}^{(d)})^{\top}$, the estimator $\tilde{\boldsymbol{\Psi}}_j(\mathbf{c})=\mathbf{A}_j\,\tilde{\boldsymbol{\eta}}(\mathbf{c})$ is asymptotically distributed as

$N\left(\boldsymbol{\Psi}_j(\mathbf{c}),\ \mathbf{A}_j D(\mathbf{c})\mathbf{A}_j^{\top}\right),$

where $D(\mathbf{c})$ is the diagonal matrix with

$\mathbf{c}^{\top}W(\mathbf{p}^{(1)})\mathbf{c},\dots,\mathbf{c}^{\top}W(\mathbf{p}^{(d)})\mathbf{c}$

along its diagonal. Moreover, the test statistic satisfies

$\tilde{\boldsymbol{\Psi}}_j(\mathbf{c})^{\top}\left(\mathbf{A}_j D(\mathbf{c})\mathbf{A}_j^{\top}\right)^{+}\tilde{\boldsymbol{\Psi}}_j(\mathbf{c})\xrightarrow{d}G(\cdot\,|\,g_j),\quad j=1,\dots,m,$

and $x_{g_j,1-q}$ will be the critical value for a limit level $q$ test of

$H_{0,j}(\mathbf{c}),\quad j=1,\dots,m.$
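To fix ideas, the R sketch below (ours; $\mathbf{A}_j$, the diagonal of $D(\mathbf{c})$ and the estimates are invented stand-ins for the paper's quantities) computes the quadratic statistic for $H_{0,j}(\mathbf{c})$ and compares it with the gamma critical value $x_{g_j,1-q}$.

```r
# Hedged R sketch (ours): quadratic test statistic for H_{0,j}(c) against the
# gamma critical value x_{g_j, 1-q}; A_j, D(c) and eta are invented stand-ins.
pinv <- function(M, tol = 1e-10) {        # Moore-Penrose inverse via SVD
  s <- svd(M); keep <- s$d > tol
  s$v[, keep, drop = FALSE] %*% diag(1 / s$d[keep], sum(keep)) %*%
    t(s$u[, keep, drop = FALSE])
}
A_j <- matrix(c(1, -1,  0,
                1,  1, -2), nrow = 2, byrow = TRUE)   # contrast rows, d = 3
eta <- c(0.31, 0.29, 0.33)                # estimated c'p^(l), l = 1, ..., d
Dc  <- diag(c(0.0021, 0.0018, 0.0024))    # estimated c'W(p^(l))c diagonal
Psi <- drop(A_j %*% eta)                  # estimated Psi_j(c), under Psi_0 = 0
stat <- drop(t(Psi) %*% pinv(A_j %*% Dc %*% t(A_j)) %*% Psi)
g_j <- nrow(A_j); q <- 0.05
stat > qgamma(1 - q, shape = g_j)         # TRUE: reject H_{0,j}(c)
```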

4. Optimal Allocation Rules

4.1. The Discrete Case

Consider a set of populations, $P_1,\dots,P_m$, mixed together, whose elements are placed into different groups, $\S_1,\dots,\S_w$. In a randomly selected sample of $n$ elements from this mixture, $n_{i,j}$ denotes the number of elements belonging to class $\S_i$ and population $P_j$, $i=1,\dots,w$, $j=1,\dots,m$.
Our objective is to determine an allocation rule that minimizes the overall cost of assigning elements to populations. The cost of assigning to $P_j$ an element of $P_{j'}$ is represented by $c_{j',j}$.
Let $E_{i,j}$ be the event that a randomly chosen element belongs to $\S_i$ and $P_j$, with $E_{i,\cdot}$ [$E_{\cdot,j}$] representing the event that a randomly chosen element belongs to $\S_i$ [$P_j$]. Define

$p_{i,j}=pr(E_{i,j}),\qquad p_{i,\cdot}=pr(E_{i,\cdot}),\qquad p_{\cdot,j}=pr(E_{\cdot,j}),$

for $i=1,\dots,w$ and $j=1,\dots,m$. Additionally, introduce the conditional probabilities

$p_{i|j}=pr(E_{i,\cdot}\,|\,E_{\cdot,j}),\qquad p_{j|i}=pr(E_{\cdot,j}\,|\,E_{i,\cdot}),$

for $i=1,\dots,w$ and $j=1,\dots,m$.
If the elements located in $\S_i$ are assigned to $P_j$, $i=1,\dots,w$, $j=1,\dots,m$, the average cost is given by

$c_i(j)=\sum_{j'=1}^{m}p_{j'|i}\,c_{j',j},\quad i=1,\dots,w,\ j=1,\dots,m.$
This leads to the consistent estimators

$\tilde{p}_{i,j}=\frac{n_{i,j}}{n},\qquad\tilde{p}_{i,\cdot}=\sum_{j=1}^{m}\tilde{p}_{i,j},\qquad\tilde{p}_{j|i}=\frac{\tilde{p}_{i,j}}{\tilde{p}_{i,\cdot}},$

as the sample size increases, and the estimated average cost

$\tilde{c}_i(j)=\sum_{j'=1}^{m}\tilde{p}_{j'|i}\,c_{j',j},\quad i=1,\dots,w,\ j=1,\dots,m.$

An optimum allocation for $\S_i$ occurs if there exists a $j(i)$ such that $c_i(j(i))<c_i(j)$ for $j\neq j(i)$, in which case

$pr\left(\bigcap_{j\neq j(i)}\left\{\tilde{c}_i(j(i))<\tilde{c}_i(j)\right\}\right)\underset{n\rightarrow\infty}{\longrightarrow}1,\quad i=1,\dots,w.$

Furthermore, if there is an optimum allocation for all $\S_i$, $i=1,\dots,w$,

$pr\left(\bigcap_{i=1}^{w}\bigcap_{j\neq j(i)}\left\{\tilde{c}_i(j(i))<\tilde{c}_i(j)\right\}\right)\underset{n\rightarrow\infty}{\longrightarrow}1.$
Discriminant analysis is thus consistent, in the sense that the probability of the allocation rule being optimal tends to one as the initial sample size grows to infinity. This holds for all $\S_i$, $i=1,\dots,w$.
The global average cost of assigning the elements of $\S_i$, $i=1,\dots,w$, to $P_{j(i)}$ is given by

$c_{\cdot}=\sum_{i=1}^{w}p_{i,\cdot}\,c_i(j(i)),$

with the consistent estimator

$\tilde{c}_{\cdot}=\sum_{i=1}^{w}\tilde{p}_{i,\cdot}\,c_i(j(i)),$

and the limit level $1-q$ confidence interval

$\tilde{c}_{\cdot}\pm z_q\sqrt{\mathbf{d}^{\top}W(\tilde{\mathbf{p}})\mathbf{d}},$

where $z_q$ is the critical value of the standard normal distribution at level $1-q$, and $\mathbf{d}$ is the vector whose components are all null except those with indices $1+(i-1)j(i)$, $i=1,\dots,w$, which equal 1.
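The rule can be computed directly from a cross-classification of counts. The R sketch below (ours; the counts and cost matrix are invented illustrative inputs) estimates the conditional probabilities, the average costs $\tilde{c}_i(j)$, and the cost-minimizing allocation $j(i)$.

```r
# Hedged R sketch (ours): optimal discrete allocation. 'counts' (classes in
# rows, populations in columns) and 'cost' (c_{j',j}) are invented inputs.
counts <- matrix(c(60, 10,
                   15, 45,
                    5, 65), nrow = 3, byrow = TRUE)   # n_{i,j}: w = 3, m = 2
cost <- matrix(c(0, 1,
                 1, 0), nrow = 2)                     # c_{j',j}
n <- sum(counts)
p_ij  <- counts / n                   # p~_{i,j}
p_i   <- rowSums(p_ij)                # p~_{i,.}
p_j_i <- p_ij / p_i                   # p~_{j|i}: condition on the class
c_ij  <- p_j_i %*% cost               # c~_i(j) = sum_{j'} p~_{j'|i} c_{j',j}
j_opt <- apply(c_ij, 1, which.min)    # j(i): cheapest population per class
global <- sum(p_i * c_ij[cbind(seq_along(j_opt), j_opt)])
list(rule = j_opt, global_average_cost = global)
```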

4.2. The Continuous Case

For each event $A_i$, $i=1,\dots,k$, which occurs when a randomly chosen element of $P$ belongs to $P_i$, the probability of the event is $p_i=pr(A_i)$. To determine the allocation, a partition of the observation space $\mathbb{R}^{n}$ into regions $\Omega_i$ is created, such that an element is assigned to $P_i$ when $\mathbf{x}\in\Omega_i$, $i=1,\dots,k$. The probability of assigning to $P_j$ an element of $P_i$ is denoted $q_{i,j}$, $i,j=1,\dots,k$, and is calculated from the densities $f_i$ of $\mathbf{X}$ in the populations $P_i$, $i=1,\dots,k$. The expected cost of the decision is then determined by summing, over all pairs of populations, the cost of each pair multiplied by its probability.
This expected cost can be expressed as a sum of integrals, one for each possible action, over the region where that action is taken. That is,

$E(C)=\sum_{j=1}^{k}\int_{\Omega_j}g_j(\mathbf{x})\prod_{l=1}^{n}dx_l,$

with

$g_j(\mathbf{x})=\sum_{i=1}^{k}p_i\,c_{i,j}\,f_i(\mathbf{x}),\quad j=1,\dots,k.$
To minimize the expected cost, one must take as regions those where each action is the optimal choice,

$\Omega_j=\left\{\mathbf{x}:g_j(\mathbf{x})\leq g_i(\mathbf{x}),\ i=1,\dots,k\right\},\quad j=1,\dots,k.$
The situation of identical decision costs, in which $c_{i,j}=c$, $i\neq j$, is especially noteworthy. In that case

$g_j(\mathbf{x})=c\sum_{i\neq j}p_i f_i(\mathbf{x})=c\sum_{i=1}^{k}p_i f_i(\mathbf{x})-c\,p_j f_j(\mathbf{x}),\quad j=1,\dots,k,$

and the sets

$\Omega_j=\left\{\mathbf{x}:p_j f_j(\mathbf{x})\geq p_i f_i(\mathbf{x}),\ i=1,\dots,k\right\},\quad j=1,\dots,k,$
are applicable. In addition, if $p_i=\frac{1}{k}$, $i=1,\dots,k$, then

$\Omega_j=\left\{\mathbf{x}:f_j(\mathbf{x})\geq f_i(\mathbf{x}),\ i=1,\dots,k\right\},\quad j=1,\dots,k.$

This means that, given $\mathbf{x}$, the population under which $\mathbf{x}$ has the greatest density will be chosen.
In summary, the expected cost associated with the decision can be expressed as a sum of integrals over the regions where each action is the optimal choice. The rule simplifies further when the costs of wrong decisions are equal and the prior probabilities are the same: in that situation, the decision is simply to choose the population with the greatest density at the observed point.
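A hedged R sketch of this rule (ours; the densities, prior probabilities and costs below are illustrative, not the paper's): allocate $\mathbf{x}$ to the population minimizing $g_j(\mathbf{x})$, which under equal costs reduces to maximizing $p_j f_j(\mathbf{x})$.

```r
# Hedged R sketch (ours) of the continuous allocation rule: assign x to the
# population j minimizing g_j(x) = sum_i p_i c_{i,j} f_i(x).
g_alloc <- function(x, p, cost, dens) {
  f <- vapply(dens, function(fi) fi(x), numeric(1))  # f_i(x), i = 1, ..., k
  g <- as.vector(t(cost) %*% (p * f))                # g_j(x), j = 1, ..., k
  which.min(g)                                       # the Omega_j containing x
}
dens <- list(function(x) dnorm(x, 0, 1),             # f_1
             function(x) dnorm(x, 3, 1))             # f_2
p <- c(0.5, 0.5)                                     # equal priors
cost <- matrix(c(0, 1,
                 1, 0), nrow = 2)                    # equal misallocation costs
g_alloc(0.2, p, cost, dens)   # -> 1: here p_1 f_1(x) > p_2 f_2(x)
```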

5. Numerical Applications

5.1. The Discrete Case—An Application to Real Data

The purpose of this subsection is to illustrate the theory for the discrete case. To do this, we use an application with real data, provided by the Instituto Superior Dom Bosco [22], related to the impact of four economic policies on greenhouse gas emissions: Policy A, Policy B, Policy C, and Policy D. The data will be used to highlight the effectiveness of the discrimination rule in guiding resource allocation decisions.
Let us assume that for each region, the four policies have four distinct categories. This would create $4^4=256$ possible combinations. If our null hypothesis is true and all of the combinations are equally likely, then any given combination should have a probability of $\frac{1}{256}$.
We present a discrimination rule, which will classify regions into two groups, “High Emission” (Population 1) and “Low Emission” (Population 2), based on the values of the statistic:
$v=|x_1-x_2|+|x_1-x_3|+|x_1-x_4|+|x_2-x_3|+|x_2-x_4|+|x_3-x_4|+1,$

where each policy category takes a value of 1, 2, 3, or 4.
The choice of this statistic is based on its ability to capture the differences between the four economic policies in terms of their impact on greenhouse gas emissions. By calculating the absolute differences between the categories of each policy and summing them, we obtain a single value that represents the overall dissimilarity between the policies. In addition, the use of the absolute differences ensures that the magnitude of the differences is taken into account, regardless of their direction. This is important in the context of evaluating policy effectiveness, as both positive and negative differences can contribute to the overall impact.
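For concreteness, a small R helper (ours) computing $v$ from the four category values:

```r
# Minimal R sketch (ours): the statistic v as the sum of all pairwise
# absolute differences of the four policy categories, plus one.
v_stat <- function(x) {
  stopifnot(length(x) == 4, all(x %in% 1:4))
  sum(abs(combn(x, 2, FUN = diff))) + 1
}
v_stat(c(1, 1, 1, 1))   # identical categories: v = 1
v_stat(c(1, 2, 3, 4))   # v = 1+2+3+1+2+1 + 1 = 11
```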
The collection of potential values for $v$ is $D=\{1,4,5,7,8,9,10,11,12,13\}$ (see Table 1).
Suppose $n_{h,l}$ denotes the number of elements with $v=h$ in population $l$. The sample size for each population is

$\sum_{h\in D}n_{h,l},\quad l=1,2,$

and the probability of obtaining $v=h$ in a randomly chosen element from that sample is

$\frac{n_{h,l}}{\sum_{h\in D}n_{h,l}},\quad h\in D,\ l=1,2.$

The overall sample size will be

$\sum_{l=1}^{2}\sum_{h\in D}n_{h,l},$

while the number of its elements with $v=h$ will be

$\sum_{l=1}^{2}n_{h,l},\quad h\in D.$

The probability of a randomly chosen element of the global sample having $v=h$ [belonging to population $l$] will be

$\frac{\sum_{l=1}^{2}n_{h,l}}{\sum_{l=1}^{2}\sum_{h\in D}n_{h,l}}$

and

$\frac{\sum_{h\in D}n_{h,l}}{\sum_{l=1}^{2}\sum_{h\in D}n_{h,l}}.$

We begin by finding the ratios $r_h=\frac{p_{h,1}}{p_{h,2}}$, $h\in D$, of the conditional probabilities, and set $D(x)=\{h:r_h\leq x\}$, allocating to population 2 [population 1] those with $v\in D(x)$ [$v\notin D(x)$], therefore those with $r_v\leq x$ [$r_v>x$].
Table 1 displays the results, which indicate that $x=1$ should be chosen, with $D(1)=\{1\}$, thus allocating to population 2 the elements with $v=1$, and to population 1 the elements with $v>1$.
By computing

$q_{1,l}=\sum_{v\in D(1)}p_{v,l},\qquad q_{2,l}=\sum_{v\notin D(1)}p_{v,l},\quad l=1,2,$

the outcomes are summarized in Table 2.
The average costs of wrong decisions for the two populations were calculated to be $\bar{c}_1=0.73972$ and $\bar{c}_2=0.02265$. The global average cost of wrong decisions can be estimated by weighting these population average costs by the mixing proportions $\Pi_{\cdot,1}$ and $\Pi_{\cdot,2}$ of the two populations in the mixture, which gives

$\bar{c}=0.73972\,\Pi_{\cdot,1}+0.02265\,\Pi_{\cdot,2}.$
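These quantities can be reproduced from the counts of Table 1; the R sketch below (ours, with the counts transcribed from the table) recovers the ratios $r_h$, the rule $D(1)$, and the average costs.

```r
# R sketch (ours): reproducing the allocation rule and costs from the counts
# transcribed from Table 1.
h  <- c(1, 4, 5, 7, 8, 9, 10, 11, 12, 13)
n1 <- c(17456, 738, 755, 939, 1369, 254, 380, 997, 670, 40)   # n_{h,1}
n2 <- c(38193, 542, 156,  28,   88,   6,  24,  22,  17,  2)   # n_{h,2}
p1 <- n1 / sum(n1); p2 <- n2 / sum(n2)        # conditional probabilities
r_h <- p1 / p2                                # ratios r_h = p_{h,1} / p_{h,2}
D1 <- h[r_h <= 1]                             # D(1) = {1}: sent to population 2
c(cost_pop1 = sum(p1[r_h <= 1]),              # P(v in D(1)    | pop 1) = 0.73972
  cost_pop2 = sum(p2[r_h >  1]))              # P(v notin D(1) | pop 2) = 0.02265
```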

5.2. The Continuous Case—An Application to Simulated Data

In this subsection, we used R software, version 4.3.0, to simulate and study a mixture of three populations with mean vectors

$\boldsymbol{\mu}_1=(1.5,\,1.5),\qquad\boldsymbol{\mu}_2=(3.5,\,3),\qquad\boldsymbol{\mu}_3=(5,\,4.5),$

and identical covariance matrices $\mathbf{I}_2$. All measures were taken based on the corresponding homoscedastic normal distributions. These values were chosen as a reasonable assumption for the data generation process, taking into account the need for clear separation between the populations for effective simulation and analysis. In a real-world context, such as in the field of botany, they could, for example, represent averages of certain measurable traits, such as the lengths of petals from three different types of plants.
Simulated samples of size 1000 were generated from each population. These samples are presented in Figure 1.
Table 3 shows the costs of misallocation. The rows represent the real population of the elements and the columns the populations they were allocated to.
The shapes of the areas used to allocate the elements were established as illustrated in Figure 2. These are appropriate for normally distributed data with identical covariance matrices, and the selected rectangular regions, aligned with the axes, allowed for efficient calculation of the allocation probabilities and misallocation costs.
The average cost will depend mainly on the number of elements that are allocated to $P_1$ from $P_2$ and to $P_2$ from $P_3$, since the probabilities of allocating to $P_3$ an element of $P_1$, or to $P_1$ an element of $P_3$, are very small.
With $C_1$, $C_2$ and $C_3$ representing the regions in which the elements are allocated to populations $P_1$, $P_2$ and $P_3$, respectively, we must minimize

$q=\int_{C_1}dF_2+\int_{C_2}dF_1+\int_{C_3}dF_2+\int_{C_2}dF_3.$
As shown in Figure 2, our approach was to optimize the positions of the vertical and horizontal lines, initially placed at $x=2$ and $x=4$ and at $y=2$ and $y=4$, respectively. These lines delineate the designated regions. The selection of these initial positions was an informed estimation leveraging the mean vectors of the three populations: since $\boldsymbol{\mu}_1=(1.5,\,1.5)$, $\boldsymbol{\mu}_2=(3.5,\,3)$ and $\boldsymbol{\mu}_3=(5,\,4.5)$, the initial cuts were chosen to lie approximately midway between these centroids. Taking into account that the populations are normally distributed with identical covariance matrices, this ensures that the initial delineations of the partitioning areas ($C_1$, $C_2$, $C_3$) are not skewed towards one or more populations. Once these initial parameters were set, the optimization was run with the ‘optim’ function of the ‘stats’ package of R, version 4.3.0, continually adjusting the parameters to find the values that minimized the objective function, in this case the cost of misallocation.
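Our reconstruction of this procedure is sketched below (hedged: the grid-cell allocation pattern mirrors Figure 2 and Table 4, and Nelder–Mead on this piecewise-constant objective is crude, which is why repeated restarts, as used in the paper, help).

```r
# Hedged R sketch (our reconstruction of Section 5.2): simulate the three
# populations, define the 3 x 3 grid induced by two vertical and two
# horizontal cuts, and minimize the misallocation cost with optim().
set.seed(2024)
mu <- list(c(1.5, 1.5), c(3.5, 3), c(5, 4.5))
pops <- lapply(mu, function(m) cbind(rnorm(1000, m[1]), rnorm(1000, m[2])))
cost <- matrix(c(0, 1, 2,
                 1, 0, 1,
                 2, 1, 0), nrow = 3, byrow = TRUE)     # Table 3
alloc <- matrix(c(1, 1, 2,      # allocation of each cell: rows = y bands
                  1, 2, 3,      # (low to high), columns = x bands; the
                  2, 3, 3), nrow = 3, byrow = TRUE)    # pattern mirrors Table 4
region_pop <- function(xy, cuts) {
  col <- findInterval(xy[, 1], sort(cuts[1:2])) + 1
  row <- findInterval(xy[, 2], sort(cuts[3:4])) + 1
  alloc[cbind(row, col)]
}
total_cost <- function(cuts)
  sum(vapply(1:3, function(i) sum(cost[i, region_pop(pops[[i]], cuts)]),
             numeric(1)))
fit <- optim(c(2, 4, 2, 4), total_cost)   # initial cuts x = 2, 4; y = 2, 4
fit$par; fit$value                        # optimized cuts and minimum cost
```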
After 1000 repetitions of this optimization process, the suitable partitions displayed in Figure 3 were obtained. These partitions represent the areas of allocation for the elements within the three populations that produced the minimum misallocation cost, and, therefore, the most efficient allocation of elements within the identified clusters.
The cuts that are most cost-efficient can be represented by the straight lines
$x=2.3,\qquad x=4.75,\qquad y=1.45,\qquad y=4.15.$
Table 4 shows the number of elements in each area.
The corresponding total minimum cost obtained was 638.
So, using simulated samples and different partitions of the sample, we were able to obtain the most cost-efficient cuts and the corresponding total minimum cost.

6. Conclusions

The paper discussed a comprehensive method to obtain optimal allocation rules for fixed-effects experiments. The development was divided into four parts: asymptotic distributions, confidence ellipsoids, inference, and optimal allocation rules.
In the part on asymptotic distributions, the continuous mapping theorem provides the insights needed on the limiting behavior of transformed estimators.
In the part on confidence ellipsoids, the properties of the ellipsoids were studied to develop a comprehensive understanding of the distribution of observations. The convexity of the ellipsoid means that every point within it is a convex combination of points on its boundary, and simultaneous confidence intervals can be constructed using these ellipsoids.
In the section on inference, the probability functions are detailed. These functions quantify the likelihood of obtaining specific results in experiments. As the number of trials increases, the asymptotic behavior of estimators is established. The distribution of estimated probabilities becomes approximately normal as the sample size grows.
The section on optimal allocation rules further elaborates on the methods for determining an allocation rule that minimizes the overall cost of assigning elements to their respective populations.
The paper concludes with two numerical applications. The first is a real data application showcasing the computational techniques developed throughout the paper; the second uses simulated data to validate the findings of the study.

Author Contributions

Conceptualization, D.F.; Methodology, D.F. and S.S.F.; Investigation, S.S.F.; Writing—review & editing, D.F. and S.S.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Portuguese Foundation for Science and Technology through the projects UIDB/00212/2020, UIDB/04630/2020 and UIDB/00297/2020.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hand, D.; Christen, P. A note on using the F-measure for evaluating record linkage algorithms. Stat. Comput. 2018, 28, 539–547.
  2. Birch, M.W. Maximum likelihood in three-way contingency tables. J. R. Stat. Soc. Ser. B Methodol. 1963, 25, 220–233.
  3. Fienberg, S.E.; Rinaldo, A. Three centuries of categorical data analysis: Log-linear models and maximum likelihood estimation. J. Stat. Plan. Inference 2007, 137, 3430–3445.
  4. Goodman, L.A. Interactions in multidimensional contingency tables. Ann. Math. Stat. 1964, 35, 632–646.
  5. Haberman, S.J. Log-Linear Models for Frequency Tables with Ordered Classifications. Biometrics 1974, 30, 589–600.
  6. Fisher, R.A. The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. 1936, 7, 179–188.
  7. Rao, C.R. The utilization of multiple measurements in problems of biological classification. J. R. Stat. Soc. Ser. B 1948, 10, 159–203.
  8. Friedman, J.H. Regularized Discriminant Analysis. J. Am. Stat. Assoc. 1989, 84, 165–175.
  9. McFarland, H.R.; Richards, D.S.P. Exact Misclassification Probabilities for Plug-In Normal Quadratic Discriminant Functions: I. The Equal-Means Case. J. Multivar. Anal. 2001, 77, 21–53.
  10. Perriere, G.; Thioulouse, J. Use of Correspondence Discriminant Analysis to predict the subcellular location of bacterial proteins. Comput. Methods Programs Biomed. 2003, 70, 99–105.
  11. Flury, L.; Boukai, B.; Flury, B.D. The discrimination subspace model. J. Am. Stat. Assoc. 1997, 92, 758–766.
  12. Li, H.; Jia, M.; Mao, Z. Dynamic Feature Extraction-Based Quadratic Discriminant Analysis for Industrial Process Fault Classification and Diagnosis. Entropy 2023, 25, 1664.
  13. Tong, Z.; Li, W.; Zhang, B.; Gao, H.; Zhu, X.; Zio, E. Bearing Fault Diagnosis Based on Discriminant Analysis Using Multi-View Learning. Mathematics 2022, 10, 3889.
  14. McLachlan, G.J. Discriminant Analysis and Statistical Pattern Recognition; Wiley Interscience: New York, NY, USA, 2004; ISBN 978-0-471-69115-0.
  15. Ferreira, S.S.; Ferreira, D.; Nunes, C.; Mexia, J.T. Discriminant Analysis and Decision Theory. Far East J. Math. Sci. 2011, 51, 69–79.
  16. Schott, J.R. Matrix Analysis for Statistics, 3rd ed.; Wiley: New York, NY, USA, 2017.
  17. Kallenberg, O. Foundations of Modern Probability, 2nd ed.; Springer: New York, NY, USA, 2010.
  18. Scheffé, H. The Analysis of Variance; John Wiley & Sons: New York, NY, USA, 1959.
  19. Wilks, S.S. Mathematical Statistics; Wiley: New York, NY, USA, 1962.
  20. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985.
  21. Tsui, K.W. Asymptotic theory of statistical inference for random coefficients regression models. J. Multivar. Anal. 1995, 54, 281–299.
  22. Instituto Superior Dom Bosco—Maputo, Mozambique. Available online: https://isdb.co.mz/ (accessed on 7 August 2018).
Figure 1. Simulated samples, with blue, red, and green points corresponding to populations with means $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\mu}_3$, respectively.
Figure 2. Shape of the zones to be considered.
Figure 3. Considered cuts. The colors of the points, blue, red, and green, represent populations with mean values of $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\mu}_3$, respectively.
Table 1. Numbers, probabilities and ratios of cases for the values of v from the different populations.

h        |      1 |     4 |     5 |      7 |      8 |      9 |     10 |     11 |     12 |     13 | Total
n_{h,1}  | 17,456 |   738 |   755 |    939 |   1369 |    254 |    380 |    997 |    670 |     40 | n_{.,1} = 23,598
n_{h,2}  | 38,193 |   542 |   156 |     28 |     88 |      6 |     24 |     22 |     17 |      2 | n_{.,2} = 39,078
n_{h,.}  | 55,649 |  1280 |   911 |    967 |   1457 |    260 |    404 |   1019 |    687 |     42 | n_{.,.} = 62,676
p_{h,1}  |  0.740 | 0.031 | 0.032 |  0.040 |  0.058 |  0.011 |  0.016 |  0.042 |  0.028 |  0.002 | 1
p_{h,2}  |  0.977 | 0.014 | 0.004 |  0.001 |  0.002 |  0.000 |  0.001 |  0.001 |  0.000 |  0.000 | 1
Π_{h,.}  |  0.888 | 0.020 | 0.015 |  0.015 |  0.023 |  0.004 |  0.006 |  0.016 |  0.011 |  0.001 | 1
r_h      |  0.757 | 2.255 | 8.015 | 55.535 | 25.762 | 70.103 | 26.220 | 75.046 | 65.265 | 33.120 |
Table 2. Contingency table of probabilities.

        | Population 1 | Population 2
q_{1,l} |        0.740 |        0.977
q_{2,l} |        0.260 |        0.023
Table 3. Costs of misallocation.

     | P_1 | P_2 | P_3
P_1  |  0  |  1  |  2
P_2  |  1  |  0  |  1
P_3  |  2  |  1  |  0
Table 4. Number of elements in each area. Each block corresponds to one horizontal band of the partition; the header row of a block gives the population to which each of its three areas allocates.

Allocated to: | P_2       | P_3       | P_3
              | P_1 = 5   | P_1 = 0   | P_1 = 0
              | P_2 = 26  | P_2 = 102 | P_2 = 6
              | P_3 = 1   | P_3 = 261 | P_3 = 364

Allocated to: | P_1       | P_2       | P_3
              | P_1 = 398 | P_1 = 125 | P_1 = 0
              | P_2 = 106 | P_2 = 626 | P_2 = 82
              | P_3 = 0   | P_3 = 159 | P_3 = 213

Allocated to: | P_1       | P_1       | P_2
              | P_1 = 381 | P_1 = 90  | P_1 = 1
              | P_2 = 8   | P_2 = 40  | P_2 = 4
              | P_3 = 0   | P_3 = 1   | P_3 = 1