Article

Partial Least Squares Regression for Binary Data

by
Laura Vicente-Gonzalez
,
Elisa Frutos-Bernal
and
Jose Luis Vicente-Villardon
*
Departamento de Estadística, Facultad de Medicina, Universidad de Salamanca, 37007 Salamanca, Spain
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(3), 458; https://doi.org/10.3390/math13030458
Submission received: 11 December 2024 / Revised: 25 January 2025 / Accepted: 27 January 2025 / Published: 30 January 2025
(This article belongs to the Section D1: Probability and Statistics)

Abstract:
Classical Partial Least Squares Regression (PLSR) models were developed primarily for continuous data, allowing dimensionality reduction while preserving relationships between predictors and responses. However, their application to binary data is limited. This study introduces Binary Partial Least Squares Regression (BPLSR), a novel extension of the PLSR methodology designed specifically for scenarios involving binary predictors and responses. BPLSR adapts the classical PLSR framework to handle the unique properties of binary datasets. A key feature of this approach is the introduction of a triplot representation that integrates logistic biplots. This visualization tool provides an intuitive interpretation of relationships between individuals and variables from both predictor and response matrices, enhancing the interpretability of binary data analysis. To illustrate the applicability and effectiveness of BPLSR, the method was applied to a real-world dataset of strains of Colletotrichum graminicola, a pathogenic fungus. The results demonstrated the ability of the method to represent binary relationships between predictors and responses, underscoring its potential as a robust analytical tool. This work extends the capabilities of traditional PLSR methods and provides a practical and versatile solution for binary data analysis with broad applications in diverse research areas.
MSC:
62A09; 62-08; 62H30; 62H99

1. Introduction

Owing to recent technological advances, it is increasingly common to find datasets containing a large number of variables measured on a given sample. These variables may be quantitative, binary, nominal or ordinal and, very often, a combination of several types.
The rapid growth of data science has led to an increasing demand for statistical techniques that can handle large and complex datasets.
In this context, traditional univariate statistical techniques, repeated for each variable, do not capture the structure of the data, especially the relationships among variables, so it is necessary to use multivariate techniques that offer a more powerful approach. These methods focus on the study of relationships among variables, similarities and differences among individuals and the variables responsible for them, treating all available information simultaneously.
Probably the most widely used multivariate statistical technique is Principal Component Analysis (PCA), proposed by Pearson [1] over 120 years ago and later formalized by Hotelling [2,3]. PCA is a dimensionality reduction technique that summarizes the information contained in a large dataset by finding a set of orthogonal axes, called principal components, that capture the most variation in the data. It was not until 1971 that K.R. Gabriel [4] proposed a simultaneous graphical representation of the rows and columns of a data matrix, related to PCA, which he called the biplot.
This type of method allows for visualizing the structure of the data rather than relying only on a list of p-values from the performed analyses. Biplots can also be used to visualize more complex multivariate models such as Canonical Variate Analysis or Multivariate Analysis of Variance (MANOVA) [5,6].
All of these techniques were developed for continuous variables, but when the measured variables are of a categorical type, either binary, nominal or ordinal, these representations are not adequate.
There are several dimension reduction methods for categorical data, for example, Item Response Theory (see, for example, [7]) or Factor Analysis for categorical data (see, for example, [8]), but none of these techniques have an associated biplot representation.
We focus here on biplot representations based on logistic responses, which are closely related to the techniques mentioned above. It is worth highlighting the work of Vicente-Villardon et al. [9] and Demey et al. [10], who developed biplots for a single matrix of binary data. These proposals are based on the generalized bilinear models proposed by Gabriel [11].
In many practical situations, we have two or more data matrices, with symmetric or asymmetric roles. In this work, we focus on the particular case in which we have two data matrices with asymmetric roles, a set of predictor variables and a set of response variables. Generally, the goal when we have this situation is to find models that allow predicting the responses from the predictors. In both datasets, it is possible to have variables of all the types mentioned above.
For two multivariate data matrices with one playing the role of responses, one of the most used multivariate statistical techniques is Partial Least Squares Regression (PLSR) [12,13]. PLSR is a regression technique that can be used to predict a set of response variables from a set of predictor variables. PLSR works by finding a set of linear combinations of the predictor variables that are most correlated with the response variables. The classical PLSR models were developed mainly for continuous data and to obtain dimension reductions in both responses and predictors, in a similar way to PCA with the difference being that the focus is on prediction rather than variance maximization. A biplot representation for PLSR that can help in the interpretation of the results was proposed by Oyedele and Lubbe [14].
More recently, Vicente-Gonzalez and Vicente-Villardon [15] proposed a PLSR method where all the responses are binary and the predictors are continuous. This extends PLS generalized linear regression [16] for a single response variable (PLS1) to several responses (PLS2) in a non-trivial way, because it implies a dimension reduction not present in the former work. The proposal is accompanied by a biplot for both matrices: a linear one for the predictors and a logistic one for the binary responses.
In the context of genomic data, the authors of [17,18] proposed a procedure for a single response and numerical predictors using Iteratively Re-weighted Least Squares (IRWLS), similar to the procedures of [16]. The work of Vicente-Gonzalez and Vicente-Villardon [15] is also an extension of those proposals to several responses.
When the two matrices are binary, none of the procedures mentioned before are suitable. In this paper, we propose a PLSR (PLS2) procedure when both matrices, responses and predictors, have binary variables. We use the logistic biplots, mentioned before, to make a triplot representation that combines individuals and variables of the two initial matrices.
Our proposal can be framed within the PLS regression method as described in [19], in contrast with other approaches such as Structural Equation Modeling (SEM), which comprises mainly two methods: covariance-based SEM (CB-SEM) and Partial Least Squares SEM (PLS-SEM). Following [20], CB-SEM is used to check (confirm or reject) theories and their underlying hypotheses by determining how closely a proposed theoretical model can reproduce the covariance matrix of an observed sample dataset. On the other hand, PLS-SEM is a causal predictive approach to SEM that focuses on explaining the variance of the dependent variables.
CB-SEM may work with any kind of data, including binary, but the purpose of the method is explaining covariances rather than the variables themselves. As pointed out in [20] (p. 11), PLS-SEM is similar but not equivalent to PLSR as described in [19,21]. PLSR is a regression-based method that explores the (linear) relationships between multiple independent variables and single or multiple dependent variables; it differs from ordinary regression in that it derives composite factors for the independent variables. It differs from PLS-SEM in that the latter relies on prespecified networks of relationships between constructs, as well as between the constructs and their measures. For a more complete comparison of PLS-SEM and PLSR, see [22].
Our proposal is then related to the PLS regression method, using a generalized linear model for several binary responses and including also a biplot representation for the interpretation of the results.
The database that initially motivated the development of this method concerns the relationship between the origins of nine types of corn strains and the presence or absence of genes. The data are composed of two binary matrices, the first containing indicators of the nine classes and the second a set of binary genetic polymorphisms.
In Section 2, we describe Partial Least Squares and its generalization to two sets of binary variables; Section 3 describes biplots and their construction for the PLS models; and Section 4 presents an example illustrating the proposed models. The conclusions are in Section 5, and Section 6 contains software notes and the availability of the data.

2. Partial Least Squares

Let X_{I×J} be a multivariate matrix of predictors with J variables and Y_{I×K} be a multivariate matrix of responses with K variables, both measured on the same I individuals.
When we have only a few predictors (fewer than the number of individuals) that are not strongly correlated, Multivariate Linear Regression (MLR) can be employed to predict the responses. However, when the predictors are highly collinear, or when there is a large number of them, traditional linear regression models may be inadequate because the parameter estimates do not exist or are unstable. This situation commonly arises, among others, in genomic studies, where thousands of gene expressions are available as independent variables.
Partial Least Squares Regression (PLSR) [12,13,19] is one of the most recognized and widely applied methods for modeling the relationships between two datasets, particularly when both sets of variables are continuous. The method is particularly useful in Chemometrics [21], although it has been applied in many other disciplines.
When the responses are continuous, Oyedele and Lubbe [14] introduced an associated biplot representation, although earlier, less formal versions exist, such as in [23]. More recently, PLS-Biplot has been applied to team effectiveness [24].
When the response matrix contains binary variables, Partial Least Squares Discriminant Analysis (PLS-DA) [25] is often employed, which essentially fits a PLS regression to a dummy or fictitious variable. Bastien et al. [16] introduced a PLS model for a single binary response, analogous to the PLS-1 model. More recently, Vicente-Gonzalez and Vicente-Villardon [15] proposed a non-trivial extension for multiple binary variables, incorporating dimension reduction for binary data and resulting in a PLS-2 model, which serves as an alternative to PLS-DA.
To address the discriminant problem in a regression setting, logistic regression (LR) can be used instead of Multivariate Linear Regression, allowing the binary nature of the responses to be captured. While this is often a valid approach, LR faces the same limitations as MLR regarding the number of individuals, the number of variables, and collinearity. Additionally, the separation problem may arise.
Currently, no PLSR alternative exists for situations where both the predictor and response matrices are binary.
The aim of this section is to extend PLSR to cases where both the predictor and response matrices consist of binary variables, by using logistic functions instead of linear ones. This approach, referred to as Binary Partial Least Squares Regression (BPLSR), enables us to account for the binary nature of the responses and addresses the limitations of traditional PLSR methods when dealing with binary data.
This generalization is based on the adaptation of the NIPALS algorithm to binary data proposed in previous work by Vicente-Gonzalez and Vicente-Villardon [15]. The adaptation consists of replacing the linear regression step of the NIPALS algorithm with a logistic regression step, which allows the procedure to capture the binary nature of the data. We first outline the classical PLS-2 model.

2.1. Partial Least Squares Regression: NIPALS Algorithm

PLS seeks to identify linear combinations of continuous predictors X that optimally predict a set of continuous responses Y. Here, we outline the classical NIPALS algorithm [26]. Both sets are usually column-centered and possibly standardized. PLS operates by extracting latent variables that capture the most significant variance and covariance between predictors and responses. It uses the iterative extraction of components to build a predictive model that can handle collinearity and high-dimensional data effectively.
The latent model for predictors X , in S dimensions, is
X = T P^T + E = X̂ + E ,
where P_{J×S} contains the variable loadings, T_{I×S} the scores for the individuals, and E_{I×J} is a matrix of residuals.
In the same way, the latent model for Y is
Y = T Q^T + F = Ŷ + F ,
where Q_{K×S} contains the loadings, T_{I×S} the scores, and F_{I×K} the residuals.
Observe that both models share the matrix of scores that best predict the responses. The NIPALS algorithm to obtain the parameters can be described as follows:
Note that by removing the step where the predictors are updated in the NIPALS algorithm, we would end up with the principal components of the responses. The procedure alternately updates the scores in matrix T using the information from responses and predictors.
Here, P^T P = I and Q^T Q = I , and T contains the scores from X that best explain the set of responses. We have the relationship
T = X P .
A potential issue with this and subsequent algorithms is that they may have multiple starting points, leading to different solutions. The algorithm generates sequences of decreasing residual sum of squares values and eventually converges to at least a local minimum.
If we need the regression coefficients to express the Y variables as functions of the predictors X , we have  
Ŷ = X B .
Considering that Ŷ = T Q^T , we have
Ŷ = T Q^T = X P Q^T .
Thus, the regression coefficients in terms of the original variables are
B = P Q^T .
Note that the updating steps in Algorithm 1 minimize the sum of squares of the residuals, which can also be understood as a loss function:
Σ_{i=1}^{I} Σ_{j=1}^{J} (y_{ij} − ŷ_{ij})² = Σ_{i=1}^{I} Σ_{j=1}^{J} (y_{ij} − Σ_{s=1}^{S} t_{is} q_{js})² = tr((Y − Ŷ)^T (Y − Ŷ)) = tr((Y − T Q^T)^T (Y − T Q^T))
for the responses and in a similar way for the predictors.
Algorithm 1 NIPALS algorithm.
1: procedure NIPALS(X, Y, S)
2:     for s = 1, …, S do
3:         t_(s) ← y_(j) for some j, or any other choice.                        ▹ Init: t_(s)
4:         repeat
5:             p_(s) ← X^T t_(s) / ‖X^T t_(s)‖                                   ▹ Update: p_(s)
6:             t_(s) ← X p_(s)                                                    ▹ Update: t_(s)
7:             q_(s) ← Y^T t_(s) / ‖Y^T t_(s)‖                                   ▹ Update: q_(s)
8:             t_(s) ← Y q_(s)                                                    ▹ Update: t_(s)
9:         until t_(s) does not change
10:        X ← X − t_(s) p_(s)^T                                                  ▹ Update: X
11:        Y ← Y − t_(s) q_(s)^T                                                  ▹ Update: Y
    return T = [t_(1), …, t_(S)], P = [p_(1), …, p_(S)] and Q = [q_(1), …, q_(S)]
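As a concrete illustration, the listing above can be sketched in a few lines of NumPy. This is a minimal sketch following Algorithm 1 as written (function and parameter names are our own, not from the paper); it assumes column-centered continuous inputs.

```python
import numpy as np

def nipals_pls2(X, Y, S, tol=1e-10, max_iter=500):
    # Minimal sketch of Algorithm 1: alternate updates of p, t, q,
    # then deflate X and Y before extracting the next component.
    X, Y = X.copy(), Y.copy()
    T, P, Q = [], [], []
    for _ in range(S):
        t = Y[:, [0]]                        # init t with a response column
        for _ in range(max_iter):
            t_old = t
            p = X.T @ t
            p /= np.linalg.norm(p)           # update p (unit norm)
            t = X @ p                        # update t from the predictors
            q = Y.T @ t
            q /= np.linalg.norm(q)           # update q (unit norm)
            t = Y @ q                        # update t from the responses
            if np.linalg.norm(t - t_old) < tol * (np.linalg.norm(t) + 1e-30):
                break                        # t does not change
        X = X - t @ p.T                      # deflate predictors (step 10)
        Y = Y - t @ q.T                      # deflate responses (step 11)
        T.append(t); P.append(p); Q.append(q)
    return np.hstack(T), np.hstack(P), np.hstack(Q)
```

The sketch keeps the loadings at unit norm, as in steps 5 and 7 of the listing; the regression coefficients can then be recovered as B = P Q^T.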

2.2. PLS for Binary Responses

When the responses are binary, the updating steps of the NIPALS algorithm (steps 7 and 8) are inadequate because of the implicit linear relationship they assume. This is similar to how linear regression is unsuitable for binary responses, logistic regression being the more appropriate method. The deflation of the responses in step 11 likewise makes no sense for binary data.
Recently, Vicente-Gonzalez and Vicente-Villardon [15] generalized the NIPALS algorithm to handle a set of binary responses by incorporating a logistic relationship with the latent variables.
The expected values of the responses are now probabilities, and we use the logit function as the link function. Let Ŷ = Π_y ; then, we have
logit(Π_y) = 1 q_0^T + T Q^T .
This equation generalizes Equation (2), except that we need to include a vector q_0 of intercepts for each variable, as binary variables cannot be centered in the same way as continuous ones. Each probability π_{ik}^y can be expressed as
π_{ik}^y = exp(q_{k0} + Σ_{s=1}^{S} t_{is} q_{ks}) / (1 + exp(q_{k0} + Σ_{s=1}^{S} t_{is} q_{ks})) .
Or in matrix form, if we have
M_y = 1 q_0^T + T Q^T ,
then
Π_y = e^{M_y} ÷ (1 + e^{M_y}) ,
where the operations apply to each element of the matrices.
The updating steps of the responses have to be changed for binary responses. Rather than minimizing the cost function in Equation (7), we minimize the cost defined as
L_y = Σ_{i=1}^{I} Σ_{k=1}^{K} [ −y_{ik} log(π_{ik}^y) − (1 − y_{ik}) log(1 − π_{ik}^y) ] .
We interpret this function as a cost to minimize rather than a likelihood to maximize, and we look for the parameters T, Q, and q_0 that minimize it. There is no closed-form solution for this optimization problem, so an iterative algorithm is used that produces a sequence of decreasing values of the loss function. We use the gradient method recursively, calculating one component at a time while fixing the previous ones. The update for each parameter is as follows:
q_{j0} ← q_{j0} − α ∂L_y/∂q_{j0} = q_{j0} − α Σ_{i=1}^{I} (π_{ij}^y − y_{ij}) ;
t_{is} ← t_{is} − α ∂L_y/∂t_{is} = t_{is} − α Σ_{j=1}^{J} q_{js} (π_{ij}^y − y_{ij}) ;
q_{js} ← q_{js} − α ∂L_y/∂q_{js} = q_{js} − α Σ_{i=1}^{I} t_{is} (π_{ij}^y − y_{ij}) ,
for some α . Or, in matrix form,
q_0 ← q_0 − α (Π_y − Y)^T 1_I ;
t_(s) ← t_(s) − α (Π_y − Y) q_(s) ;
q_(s) ← q_(s) − α (Π_y − Y)^T t_(s) ,
where t_(s) = (t_{1s}, …, t_{Is}) and q_(s) = (q_{1s}, …, q_{Js}). The adaptation of the algorithm is then straightforward. We present a recursive procedure that calculates one component at a time while keeping the previously computed components fixed.
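The matrix-form updates above can be sketched as a single gradient step in NumPy. As a simplification, the sketch updates the intercepts, all scores, and all loadings simultaneously, whereas the paper updates one component at a time; the names are ours.

```python
import numpy as np

def sigmoid(M):
    # elementwise logistic function: e^M / (1 + e^M)
    return 1.0 / (1.0 + np.exp(-M))

def binary_loss(Y, Pi, eps=1e-12):
    # L_y = sum over cells of -y log(pi) - (1 - y) log(1 - pi)
    return -np.sum(Y * np.log(Pi + eps) + (1 - Y) * np.log(1 - Pi + eps))

def gradient_step(Y, q0, T, Q, alpha):
    # One gradient step on (q0, T, Q); R = Pi_y - Y is the residual matrix.
    Pi = sigmoid(q0[None, :] + T @ Q.T)
    R = Pi - Y
    q0_new = q0 - alpha * R.sum(axis=0)   # q0 <- q0 - alpha (Pi_y - Y)^T 1_I
    T_new = T - alpha * R @ Q             # t  <- t  - alpha (Pi_y - Y) q
    Q_new = Q - alpha * R.T @ T           # q  <- q  - alpha (Pi_y - Y)^T t
    return q0_new, T_new, Q_new
```

Iterating this step with a small α produces the decreasing sequence of loss values described above.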
In practice, the choice of α can be avoided by using a pre-programmed optimization routine. This algorithm is implemented in the package MultBiplotR [27] developed in the R language [28].
In the same way as for the continuous case, the regression coefficients in terms of the original variables are
B = P Q^T .
Algorithm 2 is essentially the same as Algorithm 1 except that we have made the necessary adaptations to cope with binary responses. Now, the relation between the manifest and latent variables is logistic rather than linear. Observe that it is still a linear relation on the logits, i.e., a generalized linear model.
Algorithm 2 NIPALS-Binary responses.
1: procedure NIPALSBin(X, Y, S, α, ε)
2:     Set T = [ ], P = [ ] and Q = [ ] (empty matrices)
3:     q_0 = (1/I) Y^T 1_I or any other good guess.                              ▹ Init: q_0
4:     repeat
5:         Π_y = e^{1 q_0^T} ÷ (1 + e^{1 q_0^T})
6:         q_0 ← q_0 − α (Π_y − Y)^T 1_I                                         ▹ Update: q_0
7:     until q_0 does not change
8:     for s = 1, …, S do
9:         Init: t_(s) ← random, or any other good guess.
10:        Init: q_(s) ← random, or any other good guess.
11:        repeat
12:            p_(s) ← X^T t_(s) / ‖X^T t_(s)‖                                   ▹ Update: p_(s)
13:            t_(s) ← X p_(s)                                                    ▹ Update: t_(s)
14:            Q = [Q; q_(s)], P = [P; p_(s)] and T = [T; t_(s)]    ▹ Add column q_(s) to Q, p_(s) to P and t_(s) to T
15:            M_y = 1 q_0^T + T Q^T
16:            Π_y = e^{M_y} ÷ (1 + e^{M_y})
17:            L_old^y ← (−1) 1_I^T (Y ∘ log(Π_y) + (1 − Y) ∘ log(1 − Π_y)) 1_K
18:            repeat
19:                q_(s) ← q_(s) − α (Π_y − Y)^T t_(s)
20:                q_s ← q_(s)        ▹ Replace the s-th column q_s of Q with the new guess q_(s)
21:                M_y = 1 q_0^T + T Q^T
22:                Π_y = e^{M_y} ÷ (1 + e^{M_y})
23:            until q_(s) does not change
24:            repeat
25:                t_(s) ← t_(s) − α (Π_y − Y) q_(s)
26:                t_s ← t_(s)        ▹ Replace the s-th column t_s of T with the new guess t_(s)
27:                M_y = 1 q_0^T + T Q^T
28:                Π_y = e^{M_y} ÷ (1 + e^{M_y})
29:            until t_(s) does not change
30:            L_new^y ← (−1) 1_I^T (Y ∘ log(Π_y) + (1 − Y) ∘ log(1 − Π_y)) 1_K
31:        until L_old^y − L_new^y < ε
32:        X ← X − t_(s) p_(s)^T                                                  ▹ Update: X
    return q_0, T = [t_(1), …, t_(S)], P = [p_(1), …, p_(S)] and Q = [q_(1), …, q_(S)]
For a direct extension of this algorithm, see [29], implemented in the package BiplotML [30].

2.3. Partial Least Squares Regression for Two Binary Matrices

Now, both X_{I×J} and Y_{I×K} contain binary variables measured on the same I individuals. The goal is still to predict Y from X and to describe the common structure of predictors and responses.
We denote the expected values of X as E[X] = Π_x and the expected values of Y as E[Y] = Π_y . Then, we have
logit(Π_y) = 1 q_0^T + T Q^T ;
and
logit(Π_x) = 1 p_0^T + T P^T .
These equations are generalizations of the decompositions of the matrices X and Y for the continuous case in Equations (1) and (2). The only difference is that we need to add an intercept vector for each set of variables ( p_0 and q_0 , respectively), because, unlike the continuous case, the variables cannot be centered beforehand.
The relation among the observed and latent variables can be written as
π_{ik}^y = exp(q_{k0} + Σ_{s=1}^{S} t_{is} q_{ks}) / (1 + exp(q_{k0} + Σ_{s=1}^{S} t_{is} q_{ks})) ;  π_{ij}^x = exp(p_{j0} + Σ_{s=1}^{S} t_{is} p_{js}) / (1 + exp(p_{j0} + Σ_{s=1}^{S} t_{is} p_{js})) .
That is, the relation is logistic rather than linear.
Now, we have to design an algorithm to calculate the model parameters. The procedure alternately updates the common scores in T using the information from responses and predictors. We use the gradient method with the following loss functions:
L_y = Σ_{i=1}^{I} Σ_{k=1}^{K} [ −y_{ik} log(π_{ik}^y) − (1 − y_{ik}) log(1 − π_{ik}^y) ] ,
and
L_x = Σ_{i=1}^{I} Σ_{j=1}^{J} [ −x_{ij} log(π_{ij}^x) − (1 − x_{ij}) log(1 − π_{ij}^x) ] .
We update the scores using a gradient descent procedure as in Equations (16)–(18) for the responses. The extension for the predictors is straightforward:
p_0 ← p_0 − α (Π_x − X)^T 1_I ;
t_(s) ← t_(s) − α (Π_x − X) p_(s) ;
p_(s) ← p_(s) − α (Π_x − X)^T t_(s) .

NIPALS Algorithm for Two Binary Matrices

Then, we search for the parameters T, P and p_0 for the predictors and T, Q and q_0 for the responses. We obtain the scores T for X that best predict Y with a logistic regression model; the scores are also related to the predictors through a logistic response model.
The calculations can be organized in an alternating algorithm that calculates the parameters t_(s) = (t_{1s}, …, t_{Is}), p_(s) = (p_{1s}, …, p_{Js}) and q_(s) = (q_{1s}, …, q_{Ks}) for each dimension s, while fixing the parameters already obtained in the previous dimensions. First, we need to calculate the constants (dimension 0) p_(0) = (p_{10}, …, p_{J0}) and q_(0) = (q_{10}, …, q_{K0}) separately. This is because the data cannot be centered beforehand.
The algorithm is as follows.
In logistic regression, there is a potential problem known as separation. This occurs when the data are perfectly separable by a linear combination of the predictors. In this case, the maximum likelihood estimator does not exist and tends to infinity [31]. Even if the separation is not perfect (quasi-separation), the estimator can be highly unstable.
To avoid separation, the usual solution is to use a penalized likelihood [32]. This adds a penalty term to the likelihood function that encourages the coefficients to be smaller. In our case, we will use the Ridge penalty [33]:
L_y = Σ_{i=1}^{I} Σ_{k=1}^{K} [ −y_{ik} log(π_{ik}^y) − (1 − y_{ik}) log(1 − π_{ik}^y) ] + λ Σ_{k=1}^{K} Σ_{s=1}^{S} q_{ks}² .
Gradient adaptation for the penalized cost function is automatic:
∂L_y/∂q_{ks} = Σ_{i=1}^{I} t_{is} (π_{ik}^y − y_{ik}) + 2 λ q_{ks} .
To address the separation problem, we can introduce a regularization parameter λ . This parameter penalizes large coefficients, making the solution less likely to be separated.
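As a sketch, the penalized gradient step for the loadings can be written as follows (the names are ours; λ and the step size α are assumed fixed):

```python
import numpy as np

def ridge_grad_step_q(Y, Pi, T, Q, alpha, lam):
    # Gradient of the Ridge-penalized cost with respect to Q:
    # dL_y/dq_ks = sum_i t_is (pi_ik - y_ik) + 2 * lam * q_ks
    grad = (Pi - Y).T @ T + 2.0 * lam * Q
    return Q - alpha * grad
```

The penalty term pulls the loadings toward zero, so under (quasi-)separation the coefficients remain bounded instead of diverging.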
The BPLSR algorithm (Algorithm 3) can be sensitive to the random start. To obtain the best possible solution, it is recommended to test several starting points. In practice, it has been observed that the solutions from different starting points produce similar predictions of the responses.
Algorithm 3 PLSR for two binary matrices: BPLSR.
1: procedure BPLSR(X, Y, S, α, ε)
2:     Set T = [ ], P = [ ] and Q = [ ] (empty matrices)
3:     q_0 = (1/I) Y^T 1_I and p_0 = (1/I) X^T 1_I or any other good guesses.     ▹ Init: q_0 and p_0
4:     repeat
5:         Π_y = e^{1 q_0^T} ÷ (1 + e^{1 q_0^T})
6:         q_0 ← q_0 − α (Π_y − Y)^T 1_I                                          ▹ Update: q_0
7:     until q_0 does not change
8:     repeat
9:         Π_x = e^{1 p_0^T} ÷ (1 + e^{1 p_0^T})
10:        p_0 ← p_0 − α (Π_x − X)^T 1_I                                          ▹ Update: p_0
11:    until p_0 does not change
12:    for s = 1, …, S do
13:        Init: t_(s), q_(s) and p_(s) with random values or any other good guesses.
14:        Q = [Q; q_(s)], P = [P; p_(s)] and T = [T; t_(s)]    ▹ Add column q_(s) to Q, p_(s) to P and t_(s) to T
15:        repeat
16:            M_x = 1 p_0^T + T P^T
17:            Π_x = e^{M_x} ÷ (1 + e^{M_x})
18:            repeat
19:                p_(s) ← p_(s) − α (Π_x − X)^T t_(s)
20:                p_s ← p_(s)        ▹ Replace the s-th column p_s of P with the new guess p_(s)
21:                M_x = 1 p_0^T + T P^T
22:                Π_x = e^{M_x} ÷ (1 + e^{M_x})
23:            until p_(s) does not change
24:            repeat
25:                t_(s) ← t_(s) − α (Π_x − X) p_(s)
26:                t_s ← t_(s)        ▹ Replace the s-th column t_s of T with the new guess t_(s)
27:                M_x = 1 p_0^T + T P^T
28:                Π_x = e^{M_x} ÷ (1 + e^{M_x})
29:            until t_(s) does not change
30:            M_y = 1 q_0^T + T Q^T
31:            Π_y = e^{M_y} ÷ (1 + e^{M_y})
32:            L_old^y ← (−1) 1_I^T (Y ∘ log(Π_y) + (1 − Y) ∘ log(1 − Π_y)) 1_K
33:            repeat
34:                q_(s) ← q_(s) − α (Π_y − Y)^T t_(s)
35:                q_s ← q_(s)        ▹ Replace the s-th column q_s of Q with the new guess q_(s)
36:                M_y = 1 q_0^T + T Q^T
37:                Π_y = e^{M_y} ÷ (1 + e^{M_y})
38:            until q_(s) does not change
39:            repeat
40:                t_(s) ← t_(s) − α (Π_y − Y) q_(s)
41:                t_s ← t_(s)        ▹ Replace the s-th column t_s of T with the new guess t_(s)
42:                M_y = 1 q_0^T + T Q^T
43:                Π_y = e^{M_y} ÷ (1 + e^{M_y})
44:            until t_(s) does not change
45:            L_new^y ← (−1) 1_I^T (Y ∘ log(Π_y) + (1 − Y) ∘ log(1 − Π_y)) 1_K
46:        until L_old^y − L_new^y < ε
    return p_0, q_0, T = [t_(1), …, t_(S)], P = [p_(1), …, p_(S)] and Q = [q_(1), …, q_(S)]
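To illustrate the alternating scheme of Algorithm 3, the following is a minimal one-component sketch in NumPy. The names, the fixed step size, and the shortcut of initializing the intercepts from the logits of the column means are our own simplifications; the paper fits the intercepts by gradient steps and uses inner loops with convergence checks.

```python
import numpy as np

def sigmoid(M):
    # elementwise logistic function
    return 1.0 / (1.0 + np.exp(-M))

def bplsr_component(X, Y, alpha=0.005, n_iter=300, seed=0):
    # One-component BPLSR sketch: the shared scores t are updated
    # alternately from the predictor side (X, p) and the response side (Y, q).
    rng = np.random.default_rng(seed)
    I, J = X.shape
    K = Y.shape[1]
    eps = 1e-6
    # Dimension 0: intercepts set to the logits of the column means (a shortcut).
    p0 = np.log((X.mean(axis=0) + eps) / (1.0 - X.mean(axis=0) + eps))
    q0 = np.log((Y.mean(axis=0) + eps) / (1.0 - Y.mean(axis=0) + eps))
    t = 0.1 * rng.standard_normal(I)
    p = 0.1 * rng.standard_normal(J)
    q = 0.1 * rng.standard_normal(K)
    for _ in range(n_iter):
        Rx = sigmoid(p0 + np.outer(t, p)) - X    # residual Pi_x - X
        p -= alpha * Rx.T @ t                     # update predictor loadings
        t -= alpha * Rx @ p                       # update scores from the X side
        Ry = sigmoid(q0 + np.outer(t, q)) - Y    # residual Pi_y - Y
        q -= alpha * Ry.T @ t                     # update response loadings
        t -= alpha * Ry @ q                       # update scores from the Y side
    return p0, q0, t, p, q
```

In the full algorithm, further components are fitted while keeping the previous ones fixed, and a Ridge penalty can be added to the p and q updates to guard against separation.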
The BPLSR algorithm can also suffer from the separation problem. A good strategy is to try different values of λ and check how the loss changes as a function of the regularization parameter. We can also check how the model parameters change with the regularization parameter, as is normally done in Ridge Regression. We include the regularization parameter in the routines implementing the algorithms. Another good strategy is to use generalized cross-validation, as in [34]. As a general rule, the regularization parameter depends on the size of the matrices. We are working on simulations to identify good values of the regularization parameter in different scenarios.
If we eliminate the part where p 0 and P are updated in Algorithms 2 and 3, we obtain a reduced dimension approximation of a single data matrix analogous to a multidimensional logistic two-parameter model in Item Response Theory; see [35,36]. It is also similar to Principal Component Analysis for binary data as in [37,38,39,40], or to Factor Analysis for binary data [8].

3. Biplots

A biplot [4] is a type of graph that simultaneously displays two sets of information: the scores of the observations and the loadings of the variables in a reduced-dimensional space. It is commonly used in techniques like Principal Component Analysis (PCA) and Partial Least Squares (PLS) regression, among many others, to visualize high-dimensional data in two or three dimensions.
Scores represent the projections of the original data points (observations) onto a reduced number of components or latent variables. Loadings indicate the contributions of the original variables to these components, showing how each variable relates to the reduced dimensions.
In a biplot, observations (samples) are often represented by points, while the loadings (variables) are shown as vectors (arrows). The angles and lengths of the arrows help interpret how variables are correlated and how they contribute to the variation in the data.
Formally, a biplot represents the approximate decomposition of a data matrix X into the product of two matrices of lower rank (typically two or three) as follows:
X = E[X] + E = X̂ + E = A B^T + E .
This decomposition approximates the original matrix as accurately as possible based on specific criteria, with A and B capturing the main structure and E representing the residuals or error. In fact, a biplot decomposes the expected approximate structure of the data matrix into the product of two other matrices. The matrix X ^ captures the majority of the data’s characteristics based on a predefined criterion.
Matrices A and B can be represented as points in a two- or three-dimensional space in such a way that the inner product
x_{ij} ≈ x̂_{ij} = a_i^T b_j
approximates the element x_{ij} of the data matrix.
Biplots were initially proposed in connection with PCA [4] but have also been used with other techniques, such as Multivariate Analysis of Variance (MANOVA) [5,41] and Correspondence Analysis [42], among many others. In the context of PLS for continuous data, it is clear that Equations (1) and (2) define a biplot. The properties of those representations were developed by Oyedele and Lubbe [14] and Vicente-Gonzalez and Vicente-Villardon [15].
Most biplots are based on the assumption of linear relationships between observed and latent variables and are primarily designed for continuous data. For binary data, however, these representations are not ideal, much like how linear regression struggles to accurately model relationships with a binary response.
Vicente-Villardon et al. [9] introduced a method where the relationship between the observed variables and the dimensions is modeled using a logistic approach. This method is linked to psychometric techniques, such as Item Response Theory or latent trait analysis. The original paper proposes an algorithm based on alternating generalized regressions using the Newton–Raphson method to maximize the likelihood. However, the algorithm faces the same challenge as logistic regression when separation occurs: for some binary variables, individuals displaying the presence of a characteristic are fully separated (by a hyperplane) from those without it in the final representation. This method is referred to as the logistic biplot.
Demey et al. [10] introduced a new algorithm that combines Principal Component Analysis, Cluster Analysis, and Logistic Regression to create this type of biplot. The result is termed an “external logistic biplot” because it uses a two-step approach (principal coordinates for individuals and logistic regressions for variables) rather than simultaneously obtaining the row and column markers. This simplification helps to avoid the separation problem but comes at the cost of reduced goodness of fit.
More recently, Babativa-Márquez and Vicente-Villardón [29] proposed an algorithm that improves parameter fitting by employing the nonlinear conjugate gradient algorithm.
More recently, in the context of PLS methods, Vicente-Gonzalez and Vicente-Villardon [15] proposed a generalization of the NIPALS algorithm for binary responses, where the components are obtained iteratively using the gradient method. The paper also introduced a procedure to address the separation issue encountered in earlier algorithms for logistic biplots.

3.1. Logistic Biplots

Linear biplots decompose the expected values of the approximation into a product of two lower-rank matrices. In contrast, the logistic biplot decomposes the expected probabilities using the logit link function, similar to how generalized linear models operate.
For any binary data matrix $\mathbf{X}_{I \times J}$, if we denote $E[\mathbf{X}] = \boldsymbol{\Pi}$, the decomposition using the logit link is
$$\operatorname{logit}(\boldsymbol{\Pi}) = \mathbf{1}_I \mathbf{b}_0^T + \mathbf{A}\mathbf{B}^T.$$
The constants $\mathbf{b}_0^T$ have to be included because the data cannot be centered beforehand as in the continuous case. With this decomposition, the inner product of the biplot markers (coordinates) is, except for a constant, the logit of an expected probability,
$$\operatorname{logit}(\pi_{ij}) = b_{j0} + \mathbf{a}_i^T \mathbf{b}_j.$$
The logits are easily converted into probabilities:
$$\pi_{ij} = \frac{e^{b_{j0} + \mathbf{a}_i^T \mathbf{b}_j}}{1 + e^{b_{j0} + \mathbf{a}_i^T \mathbf{b}_j}}.$$
Then, by projecting a row marker $\mathbf{a}_i$ onto a column marker $\mathbf{b}_j$, we obtain, except for a constant, the expected logit and thus the expected probability for the entry $x_{ij}$ of $\mathbf{X}$. The constant $b_{j0}$ determines the exact point where the logit is zero (or any other chosen value).
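A small sketch of this computation, with illustrative marker values (not estimates from any real dataset):

```python
import numpy as np

# Sketch (not the paper's code): recovering an expected probability from
# logistic-biplot markers.  a_i, b_j and b_j0 are made-up illustrative values.
a_i = np.array([1.2, -0.5])            # row (individual) marker
b_j = np.array([0.8, 0.4])             # column (variable) marker
b_j0 = -0.3                            # intercept for variable j

logit = b_j0 + a_i @ b_j               # inner product plus constant
pi_ij = 1.0 / (1.0 + np.exp(-logit))   # inverse logit -> expected probability

x_hat = int(pi_ij >= 0.5)              # predicted binary value (cut point 0.5)
```

The last line anticipates the prediction rule used later: an entry is predicted as a presence when its expected probability reaches 0.5.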
Due to the generalized nature of the model, the geometric interpretation closely resembles that of linear biplots. Computational procedures are analogous to those employed in previous cases, with the addition of the constant term. For example, in two dimensions, we can determine, along the direction of the vector $\mathbf{b}_j = (b_{j1}, b_{j2})$, the point that predicts a given probability $p$. Let $(x, y)$ be the point satisfying the equation
$$y = \frac{b_{j2}}{b_{j1}} x.$$
The prediction also satisfies
$$\operatorname{logit}(p) = b_{j0} + b_{j1} x + b_{j2} y.$$
Therefore, we obtain
$$x = \frac{(\operatorname{logit}(p) - b_{j0})\, b_{j1}}{b_{j1}^2 + b_{j2}^2}; \qquad y = \frac{(\operatorname{logit}(p) - b_{j0})\, b_{j2}}{b_{j1}^2 + b_{j2}^2}.$$
The point in the direction of $\mathbf{b}_j$ that predicts 0.5 ($\operatorname{logit}(0.5) = 0$) is
$$x = \frac{-b_{j0}\, b_{j1}}{b_{j1}^2 + b_{j2}^2}; \qquad y = \frac{-b_{j0}\, b_{j2}}{b_{j1}^2 + b_{j2}^2}.$$
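These coordinates can be sketched numerically; the intercept and direction values below are illustrative, not fitted to any data:

```python
import numpy as np

# Sketch: the point on the direction of b_j that predicts probability p.
def prob_marker(b_j0, b_j, p):
    """(x, y) on the biplot axis of variable j predicting probability p."""
    z = np.log(p / (1 - p))                # logit(p)
    scale = (z - b_j0) / (b_j @ b_j)       # (logit(p) - b_j0) / (b_j1^2 + b_j2^2)
    return scale * b_j                     # point along the direction of b_j

b_j0, b_j = -0.3, np.array([0.8, 0.4])
x50 = prob_marker(b_j0, b_j, 0.5)          # marker for probability 0.5
x75 = prob_marker(b_j0, b_j, 0.75)         # marker for probability 0.75
```

Evaluating the model at each marker returns the requested logit, which is how the graded probability scales along each variable's direction are drawn.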
Using Equation (33), we can position markers for different probabilities along the direction of b j , creating a graded scale similar to a coordinate axis. The interpretation remains fundamentally similar to that of linear biplots; however, markers representing equidistant probabilities may not be spaced equally in the graph.
As a simple illustrative example, we use data taken from [43]. The data that companies collect can include the expected items, such as your name, date of birth, and email address, as well as more unusual details, such as your pets, hobbies, height, weight, and even personal preferences in the bedroom. Companies may also store your banking information and links to your social media accounts, along with any data you share there.
How companies use these data varies depending on their business, but it often leads to targeted advertising and optimizing website management.
The infographic taken from the site https://clario.co/blog/which-company-uses-most-data/, (accessed on 26 January 2025), is shown in Figure 1. The picture shows information on your face, voice, and environment that some internet companies collect.
The original data are transformed into a binary matrix, where the rows represent internet companies, the columns represent the types of information collected, and each entry is marked as 1 if the company collects that information and 0 if it does not. The data matrix is shown in Table 1. With these data, we perform a logistic biplot that summarizes the information in a two-dimensional graph, obtaining a joint representation of the rows and columns of the data matrix. Distances among companies are interpreted as similarities, i.e., companies lying near each other on the biplot display have similar profiles with respect to the collected information. Angles between variables (kinds of information) are interpreted as correlations: small acute angles mean strong positive correlations, nearly straight angles mean strong negative correlations, and nearly right angles mean no correlation. Projecting the row markers onto the column markers, we obtain the expected probability for each entry.
Figure 2 shows the graphical results as a typical logistic biplot with scales for probabilities 0.05, 0.25, 0.5, 0.75, and 0.95. We use percents (5, 25, 50, 75, and 95) rather than probabilities to simplify the plot. We can use any other values for the probabilities.
The plot displays a two-dimensional scattergram, where points represent the rows (companies) of the data matrix, and vectors or directions correspond to the variables. Distances among company points are inversely related to their similarity; for example, the points representing Facebook, Instagram and TikTok coincide, indicating that the three companies have exactly the same profiles (all three collect all the available information). Companies represented by an isolated point, such as Offerup, Uber or eBay, have unique patterns, while groups of close points share several characteristics.
As previously mentioned, the angles between the variable directions approximate the correlations between them. For example, face recognition, environment recognition, and languages have high positive correlations because their increasing probabilities point in the same direction, i.e., there are small acute angles between them. Collecting the contacts is not correlated with this group because it forms a nearly right angle with those directions. Contacts and image library are negatively correlated because their probabilities increase in opposite directions.
The angles are connected to tetrachoric correlations due to the close relationship between this representation and the factorization of the tetrachoric correlation matrix as we will explain later.
Projecting companies onto the direction of a variable, we obtain the expected probabilities given by the approximation. Figure 3 shows the projections onto the variable “Face Recognition”. The point predicting 0.5 is marked with a circle because it is usually the cut point for predicting the presence or absence of the characteristic. We can see that “TikTok”, “Instagram” and “Facebook” project on the upper part of the direction, meaning that they have the highest expected probabilities of using face recognition on their sites. Together with “Twitter”, “Zoom”, “Grinder”, and “Tinder”, they have expected probabilities higher than 0.95. The actual expected probabilities for each company can be read approximately from the graded scale placed along the direction. Companies with expected probabilities higher than 0.5 are normally predicted to present the characteristic.
To simplify the plot, we can place marks for just a few probabilities, for example, 0.5 and 0.75 indicating the point where the expected logit is 0 (probability 0.5) and the direction of increasing probabilities. Figure 4 shows a simplified logistic biplot with probabilities 0.5 and 0.75. The directions are the same as in the previous plots but the scales are simplified, keeping just the cut for the presence prediction and the direction for higher probabilities.
We can use the expected probabilities from Equation (32) to generate predicted binary values: $\hat{x}_{ij} = 1$ if $\pi_{ij} \geq 0.5$ and 0 otherwise. This yields an expected binary matrix $\hat{\mathbf{X}} = (\hat{x}_{ij})$. In the plot, this means that, for each variable, the plane is divided into two prediction regions by a line perpendicular to the variable’s arrow and passing through the point predicting 0.5. The region on the side of the arrow predicts the presence of the characteristic, while the opposite region predicts its absence. Figure 5 illustrates this statement. In the figure, points in light blue have observed presences, and points in red have observed absences. We can see that all the points lie in the correct prediction regions.
A guide for the interpretation of the biplot can also be found in the supplementary material of Demey et al. [10].
A common overall goodness-of-fit measure is the percentage of correctly classified entries of the data matrix. To assess the quality of representation for individual rows (individuals) and columns (variables), we can calculate these percentages separately. We can also calculate the percentage of true positives (sensitivity) and true negatives (specificity).
To evaluate the quality of representation of each binary variable, we need a suitable measure that generalizes the measures of predictiveness for continuous data as in [44]. Equation (32) essentially defines a logistic regression model for each variable when the row coordinates are fixed, which allows us to evaluate goodness of fit using the pseudo-$R^2$ measures commonly associated with that model, for example, McFadden, Cox–Snell or Nagelkerke.
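These pseudo-$R^2$ measures, in their commonly used definitions, can be sketched from observed binary values and fitted probabilities; the vectors below are toy values, not the paper's data:

```python
import numpy as np

# Sketch of pseudo-R^2 measures for one binary variable.
y  = np.array([1, 1, 0, 1, 0, 0, 1, 0])           # observed values
pi = np.array([0.9, 0.8, 0.2, 0.7, 0.3, 0.1, 0.6, 0.4])  # fitted probabilities

def loglik(y, p):
    """Bernoulli log-likelihood of probabilities p for observations y."""
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

n = len(y)
ll_model = loglik(y, pi)
ll_null = loglik(y, np.full(n, y.mean()))          # intercept-only model

mcfadden   = 1 - ll_model / ll_null
cox_snell  = 1 - np.exp(2 * (ll_null - ll_model) / n)
nagelkerke = cox_snell / (1 - np.exp(2 * ll_null / n))
```

Nagelkerke's measure rescales Cox–Snell so that its maximum attainable value is 1, which is why it is always at least as large as Cox–Snell.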
Table 2 contains fit measures for the Internet Companies Data. It contains the fit measures for each column separately and for the complete table.
In the data table, 94% of all entries are correctly predicted by the previously described procedure as shown in Figure 5. However, it is more insightful to examine the indices for each individual variable. This is because, in some cases, only a few variables are accurately represented. Such a situation can arise when dealing with a large set of variables, where only a small subset is actually relevant to the problem. For instance, in a genetic polymorphisms matrix, only certain variables may be of real interest.
In our example, all variables show a reasonable percentage of correct classifications, except for “Image Library”, which correctly classifies only 80% of all values and 66.67% of the negatives. Additionally, we include the deviance for each variable in the model, using the latent dimensions as predictors, along with a p-value as an indicator of model fit. This measure, potentially adjusted for multiple comparisons, could be used to select the most significant variables, as demonstrated in [10]. We also show pseudo-$R^2$ measures. All have high values except for “Image Library” and “Voice Recog.”, which are less well predicted or represented on the graph.
By assuming that the observed categorical responses are discretized versions of continuous processes, we can leverage the close relationship between the presented model and the factorization of tetrachoric correlations to perform a factor analysis.
Consider a binary variable $X_j$. This variable is assumed to arise from an underlying continuous variable $X_j^*$ that follows a standard normal distribution. There is a threshold $\tau_j$ which divides the continuous variable $X_j^*$ into two categories.
The relationship between X j and X j * can be expressed as
$$X_j = \begin{cases} 0 & \text{if } X_j^* \leq \tau_j \\ 1 & \text{if } X_j^* > \tau_j \end{cases}$$
The tetrachoric correlations are the correlations among the X j * , j = 1 , , J . Let R be a J × J matrix containing the tetrachoric correlations among the J binary variables, and let τ j , j = 1 , , J be the thresholds.
We can factorize the matrix $\mathbf{R}$ as
$$\mathbf{R} \approx \boldsymbol{\Lambda} \boldsymbol{\Lambda}^T$$
where Λ contains the loadings of a linear factor model for the underlying variables.
It can be shown that there is a close relation between the factor model in Equation (34) and the model in Equation (32); the details can be found in [8]. The model in Equation (32) is actually equivalent to the multidimensional two-parameter logistic model in Item Response Theory (IRT).
The factor model loadings $\boldsymbol{\Lambda} = (\lambda_{js}),\ j = 1, \ldots, J;\ s = 1, \ldots, S$, and the thresholds $\tau_j,\ j = 1, \ldots, J$, can be calculated from the parameters for the variables in our model as
$$\tau_j = \frac{-b_{j0}}{\left(1 + \sum_{s=1}^S b_{js}^2\right)^{1/2}}$$
$$\lambda_{js} = \frac{b_{js}}{\left(1 + \sum_{s=1}^S b_{js}^2\right)^{1/2}}$$
In this way, we give our model a classical factor interpretation, adding the loadings and communalities (the sum of squares of the loadings of each variable across all dimensions). Loadings measure the correlation of the variables with the dimensions or latent factors, and the communalities measure the amount of variance of each variable explained by the factors. For our data, the loadings and communalities are in Table 3.
The two-dimensional solution explains 88% of the variance. All the communalities are higher than 0.71, meaning that a good amount of the variance is explained by the dimensions. The communalities also serve to select important variables; variables with low communalities may be explained by other dimensions or have little importance for the description of the rows.
We can also see that face recognition, environment recognition and language are mostly related to the first dimension, while product recognition and your contacts are related to the second dimension. Voice recognition is related to both.
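The conversion from biplot column parameters to loadings, thresholds, and communalities can be sketched as follows; the parameter values are illustrative, and the sign convention for the thresholds is an assumption consistent with the threshold definition above:

```python
import numpy as np

# Illustrative biplot column parameters (not the paper's estimates).
b0 = np.array([-0.3, 1.1, 0.2])              # intercepts, one per variable
B  = np.array([[0.8, 0.4],                   # slopes b_js, J x S
               [1.5, -0.2],
               [0.3, 0.9]])

denom = np.sqrt(1 + (B ** 2).sum(axis=1))    # (1 + sum_s b_js^2)^{1/2}
tau      = -b0 / denom                       # thresholds
loadings = B / denom[:, None]                # factor loadings lambda_js
communalities = (loadings ** 2).sum(axis=1)  # variance explained per variable
```

Note that with this scaling the communalities are always below 1, as required for a factor interpretation of the underlying standard normal variables.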

3.2. Triplot Representations for Binary Partial Least Squares Regression

Equations (1) and (2) define two biplots for the continuous case that share the scores for individuals in T . Some of the properties of those biplots have been described by Oyedele and Lubbe [14].
In the same way, Equations (1) and (8) define two biplots that share scores in T , but now the biplot for responses is a logistic biplot as described in Section 3.1. The details of this representation can be found in [15]. For this case, Equation (19) defines a biplot for the regression coefficients in terms of the observed variables that can help in determining the more important predictors to explain the responses.
The equations in (20) and (21) define two logistic biplots with the particularity that both share the row scores in T . We can plot these together with the loadings for the responses Q and the predictors P to obtain a joint representation of individuals, responses, and predictors, i.e., a triplot with three sets of markers. Both biplots, and hence the triplot, are interpreted using the rules in Section 3.1.
The measures of quality (correct classifications, sensitivity, specificity, pseudo- R 2 , factor loadings, or communalities) for matrix Y can be used to identify which responses are correctly explained and what dimensions are needed for the explanation. Global qualities can be used to determine the number of dimensions needed for an adequate prediction, taking into account that some of the responses may not be well predicted by the model. In this case, the angles among the directions for responses or the projections of the scores on them are less important because they are not optimized to explain the variability in data but to capture the relations with the predictors. The same occurs for the joint interpretation of rows and columns.

4. Illustrative Example

Anderson et al. [45] estimate that approximately 30% of plant diseases are caused by phytopathogenic fungi. These fungi can have a significant impact on both ecology and agriculture, causing crop losses and environmental damage.
We focus on a specific fungus, Colletotrichum graminicola (syn. Glomerella graminicola). This fungus is a major pathogen of corn, causing a disease called anthracnose. Anthracnose produces spots on the leaves and stems of corn plants, which can lead to crop losses. Some studies, such as that of O’Connell et al. [46], have documented the economic impact it has had in the United States.
The data used in this study come from a project carried out by the CIALE. The project collected samples of corn plants infected with Colletotrichum graminicola from nine different countries: Argentina (AR), Brazil (BR), Canada (CA), Croatia (HR), Slovenia (SI), France (FR), Portugal (PT), Switzerland (CH), and the United States (US). The response matrix has the indicators (dummy variables) for the countries.
The predictors are a set of binary polymorphisms obtained from the RNA sequencing of the fungus Colletotrichum graminicola, which yielded a total of 13,183 variables. However, the sequencing method generates complementary duplicate information (one variable for the presence of each polymorphism and another for its absence), so we first removed the absence variables. This reduced the number of variables to 6419. We then selected the 2379 variables that were most significant based on their correlation with the response variable, i.e., the presence or absence of anthracnose.
The statistical analysis was carried out with the statistical software R [28]. We used the MultBiplotR package [27], which includes functions for performing the proposed method including biplot analysis.
The biplot representation of the responses is shown in Figure 6. Rather than the individual rows, the convex hulls containing all the points from each origin are represented.
The BPLSR model achieved an overall classification accuracy of 90.94% considering the binary variables separately. The classification accuracy for each country, together with some additional measures of fit, are shown in Table 4.
In our example, all individuals of the strains from Brazil are correctly classified. The rest have lower percentages of correct classification. Argentina, Croatia, France, Portugal, and Slovenia have 0% sensitivity, that is, none of the presences are correctly classified. Switzerland has a sensitivity of 22.22%. This kind of classification is more useful when we have several (possibly) correlated variables rather than a set of indicators for several classes.
In this particular case, in which we have the indicators of disjoint classes, we can compute another classification by assigning each individual to the class with the highest expected probability. The results of this classification are shown in Table 5.
With this classification, only 49.51% of the strains are correctly classified into their origin. All the strains from Brazil, Canada, and France are correctly classified, but the strains from the USA are mistakenly grouped with those from Canada, while the remaining strains (Argentina, Croatia, France, Portugal, Slovenia, and Switzerland) are classified under France. This likely suggests that the characteristics of all European varieties (including Argentina) are quite similar, while Canada and the USA also share many similarities, with only Brazil showing distinct genetic characteristics. Experts have informed us that the Argentinian strains originated in Europe, which further supports this observation.
We do not include the predictors and their interpretation in this plot because of their large number. We show that interpretation in the next step of the analysis.
In conclusion, we identify three groups with different genetic profiles according to the origin of the strains: Brazil, North America (USA and Canada) and Europe (Argentina, Croatia, France, Portugal, Slovenia, and Switzerland). We repeat the analysis, recoding the whole set into those groups named Brazil, North America, and Europe (including Argentina).
The biplot for the new responses is in Figure 7. It is quite clear that all the groups are correctly classified.
Table 6 contains measures of fit for the responses. All measures are very high, so no further comment is needed.
Table 7 contains the loadings and the communalities, which are very high for these data, meaning that the relations between the latent variables and the responses are very strong.
The procedure is also helpful in identifying the variables that contribute to the prediction by examining the predictors most strongly associated with the latent variables used for prediction. We can add the predictors to the previous biplot to obtain a triplot. The resulting plot is shown in Figure 8.
Due to the high number of predictors, the complete plot is too crowded and difficult to read, so we may select just a fraction of the variables. We can select, for example, the predictors with high communalities, or use any other measure of fit.
We are interested in the predictors that point in the direction of each response and are highly related to the latent variables. For example, we can calculate the squared cosines between the directions of predictors and responses to identify the most important polymorphisms for each group. Those cosines are easily calculated from the matrices in Equation (20). If the vector $\mathbf{l}_P = (\mathbf{P}^2 \mathbf{1})^{1/2}$ contains the lengths of the row vectors of $\mathbf{P}$ (the directions of the predictors on the plot) and $\mathbf{l}_Q = (\mathbf{Q}^2 \mathbf{1})^{1/2}$ contains the lengths of the row vectors of $\mathbf{Q}$ (the directions of the responses on the plot), where the exponents apply elementwise, then the cosines are in the matrix
$$\mathbf{C} = \mathbf{D}_{l_P}^{-1} \mathbf{P} \mathbf{Q}^T \mathbf{D}_{l_Q}^{-1}$$
where $\mathbf{D}_{l_P}$ and $\mathbf{D}_{l_Q}$ are the diagonal matrices with the lengths on the diagonal. Each column of $\mathbf{C}$ contains the cosines of all the predictors with a single response; the nearer the cosines are to 1 or −1, the closer the directions of the corresponding response and predictor.
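This cosine computation can be sketched with illustrative loading matrices (random values, not the paper's estimates):

```python
import numpy as np

# Sketch: cosines between predictor and response directions from their
# loading matrices P (predictors x dimensions) and Q (responses x dimensions).
rng = np.random.default_rng(1)
P = rng.normal(size=(6, 2))                  # predictor loadings (illustrative)
Q = rng.normal(size=(3, 2))                  # response loadings (illustrative)

l_P = np.sqrt((P ** 2).sum(axis=1))          # lengths of predictor row vectors
l_Q = np.sqrt((Q ** 2).sum(axis=1))          # lengths of response row vectors

# Equivalent to D_lP^{-1} P Q^T D_lQ^{-1}: normalize rows, then inner products.
C = (P / l_P[:, None]) @ (Q / l_Q[:, None]).T

important = (C ** 2) > 0.9                   # e.g., squared cosine above 0.9
```

Dividing each row by its length is the same as pre- and post-multiplying by the inverse diagonal length matrices, so every entry of C is a cosine between −1 and 1.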
For each group, we select the variables with squared cosines higher than 0.9 and communalities higher than 0.85, so that the selected predictors are also related to the latent dimensions. These thresholds can be changed to select more or fewer predictors.
Figure 9 shows the variables most related to the direction that predicts Brazil, and Table 8 the goodness of fit.
Figure 10 shows the variables most related to the direction that predicts Europe.
The number of variables highly associated with the European strains is significantly lower than that observed for the Brazilian or North American strains. Nevertheless, the goodness of fit, as shown in Table 9, remains robust. Interestingly, all genes are directly related to the response variable except gene “409898”, which shows an inverse relationship. Notably, this gene has a higher goodness of fit than the rest, highlighting its distinct contribution to the model. However, the classification performance for the European strains does not reach the near-perfect results observed for Brazil: the percentage of correctly classified cases, as well as the sensitivity and specificity, do not reach 100%. This indicates a slightly lower overall predictive accuracy for the European dataset compared to the Brazilian one.
Figure 11 shows the variables most related to the direction that predicts North America.
The number of genes associated with the North American strain is significantly higher than those associated with the European and Brazilian strains. Only 10 genes are directly associated with the North American strain, while the rest have an inverse relationship with the response variable. The goodness of fit, as shown in Table 10, is high, indicating good model performance. Nevertheless, the percentage of correctly classified cases does not reach 100%. Interestingly, in certain cases, either sensitivity or specificity achieves perfect accuracy, demonstrating the ability of the method to capture certain aspects of the data with exceptional precision.

5. Conclusions

  • Partial Least Squares Regression (PLSR) has been assessed as a good alternative to Multivariate Linear Regression for exploring the relationships between two datasets, particularly in situations where predictor variables are either highly numerous or exhibit significant collinearity. This evaluation highlights the versatility and effectiveness of PLSR in addressing complex modeling challenges.
  • We propose an extension of the PLS technique applicable to scenarios involving binary predictors and responses. This novel approach, termed Binary Partial Least Squares Regression (BPLSR), is based on a generalization of the NIPALS algorithm, adapted to accommodate binary data structures, thereby broadening the applicability of PLSR techniques.
  • Custom functions implementing the BPLSR method for binary data have been developed in the R programming environment. These functions are included in the MultBiplotR package, which has been prepared for submission to the CRAN repository to ensure broader accessibility and usability for researchers and practitioners.
  • The usefulness and performance of the proposed BPLSR technique was validated through a real-world case study, focusing on the classification of Colletotrichum graminicola strains. The results showed promising classification accuracy, highlighting the potential of BPLSR as a powerful tool for binary data analysis.

6. Software and Data Notes

The logistic biplot example was calculated using the function BinaryLogBiplotGD of the MultBiplotR package [27] for R [28], which is available on CRAN. Future versions of the package will include these data and this example as an illustration of the function.
The PLS for binary data was calculated using the functions BinaryPLSR and Biplot.BinaryPLSR of the same package. These functions will be included in the next CRAN version of the package; in the meantime, the latest version of the package and the actual script can be obtained upon request from the corresponding author.
The data used in the example can also be obtained upon request.

Author Contributions

Conceptualization, L.V.-G., E.F.-B. and J.L.V.-V.; Methodology, L.V.-G., E.F.-B. and J.L.V.-V.; Software, L.V.-G., E.F.-B. and J.L.V.-V.; Investigation, L.V.-G., E.F.-B. and J.L.V.-V.; Writing—original draft, L.V.-G. and E.F.-B.; Writing—review & editing, L.V.-G., E.F.-B. and J.L.V.-V.; Supervision, J.L.V.-V. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by Grants RTI2018-093611-B-100 and PID2021-125349NB-100 from the MCIN of Spain AEI/10.13039/501100011033 and the European Regional Development Fund (ERDF).

Data Availability Statement

No new data were created or analyzed in this study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Pearson, K. On lines and planes of closest fit to systems of points in space. Philos. Mag. 1901, 2, 559–572. [Google Scholar] [CrossRef]
  2. Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 417–441. [Google Scholar] [CrossRef]
  3. Hotelling, H. Relations Between Two Sets of Variates. Biometrika 1936, 28, 321–377. [Google Scholar] [CrossRef]
  4. Gabriel, K.R. The Biplot Graphic Display of Matrices with Application to Principal Component Analysis. Biometrika 1971, 58, 453–467. [Google Scholar] [CrossRef]
  5. Gabriel, K.R. Analysis of Meteorological Data by Means of Canonical Decomposition and Biplots. J. Appl. Meteorol. 1972, 11, 1071–1077. [Google Scholar] [CrossRef]
  6. Gower, J.C.; Hand, D.J. Biplots; Chapman & Hall/CRC Monographs on Statistics & Applied Probability; Taylor & Francis: Abingdon, UK, 1995. [Google Scholar]
  7. Baker, F.B.; Kim, S.H. Item Response Theory: Parameter Estimation Techniques; CRC Press: Boca Raton, FL, USA, 2004. [Google Scholar]
  8. Jöreskog, K.G.; Moustaki, I. Factor analysis of ordinal variables: A comparison of three approaches. Multivar. Behav. Res. 2001, 36, 347–387. [Google Scholar] [CrossRef]
  9. Vicente-Villardon, J.L.; Galindo-Villardón, P.; Blazquez-Zaballos, A. Logistic biplots. In Multiple Correspondence Analysis and Related Methods; Greenacre, M.J., Blasius, J., Eds.; Statistics in the Social and Behavioral Sciences; Chapman & Hall/CRC: New York, NY, USA, 2006; pp. 503–521. [Google Scholar]
  10. Demey, J.R.; Vicente-Villardon, J.L.; Galindo-Villardón, M.P.; Zambrano, A.Y. Identifying molecular markers associated with classification of genotypes by External Logistic Biplots. Bioinformatics 2008, 24, 2832–2838. [Google Scholar] [CrossRef]
  11. Gabriel, K.R. Generalised bilinear regression. Biometrika 1998, 85, 689–700. [Google Scholar] [CrossRef]
  12. Wold, S.; Ruhe, A.; Wold, H.; Dunn, W.J., III. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. Siam J. Sci. Stat. Comput. 1984, 5, 735–743. [Google Scholar] [CrossRef]
  13. Firinguetti, L.; Kibria, G.; Araya, R. Study of partial least squares and ridge regression methods. Commun. Stat.-Simul. Comput. 2017, 46, 6631–6644. [Google Scholar] [CrossRef]
  14. Oyedele, O.F.; Lubbe, S. The construction of a partial least-squares biplot. J. Appl. Stat. 2015, 42, 2449–2460. [Google Scholar] [CrossRef]
  15. Vicente-Gonzalez, L.; Vicente-Villardon, J.L. Partial Least Squares Regression for Binary Responses and Its Associated Biplot Representation. Mathematics 2022, 10, 2580. [Google Scholar] [CrossRef]
  16. Bastien, P.; Vinzi, V.E.; Tenenhaus, M. PLS generalised linear regression. Comput. Stat. Data Anal. 2005, 48, 17–46. [Google Scholar] [CrossRef]
  17. Bazzoli, C.; Lambert-Lacroix, S. Classification based on extensions of LS-PLS using logistic regression: Application to clinical and multiple genomic data. BMC Bioinform. 2018, 19, 314. [Google Scholar] [CrossRef] [PubMed]
  18. Fort, G.; Lambert-Lacroix, S. Classification using partial least squares with penalized logistic regression. Bioinformatics 2005, 21, 1104–1111. [Google Scholar] [CrossRef] [PubMed]
  19. Abdi, H. Partial least squares regression and projection on latent structure regression (PLS Regression). Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 97–106. [Google Scholar] [CrossRef]
  20. Hair, J.F., Jr.; Hult, G.T.M.; Ringle, C.M.; Sarstedt, M.; Danks, N.P.; Ray, S. Partial Least Squares Structural Equation Modeling (PLS-SEM) Using R: A Workbook; Springer Nature: Cham, Switzerland, 2021. [Google Scholar]
  21. Wold, S.; Sjöström, M.; Eriksson, L. PLS-regression: A basic tool of chemometrics. Chemom. Intell. Lab. Syst. 2001, 58, 109–130. [Google Scholar] [CrossRef]
  22. Mateos-Aparicio, G. Partial least squares (PLS) methods: Origins, evolution, and application to social sciences. Commun. Stat.-Theory Methods 2011, 40, 2305–2317. [Google Scholar] [CrossRef]
  23. Vargas, M.; Crossa, J.; Eeuwijk, F.A.V.; Ramírez, M.E.; Sayre, K. Using partial least squares regression, factorial regression, and AMMI models for interpreting genotype by environment interaction. Crop Sci. Genet. Cytol. 1999, 39, 955–967. [Google Scholar] [CrossRef]
  24. Silva, A.; Dimas, I.D.; Lourenço, P.R.; Rebelo, T.; Freitas, A. PLS visualization using biplots: An application to team effectiveness. In Proceedings of the Computational Science and Its Applications—ICCSA 2020, Cagliari, Italy, 1–4 July 2020; Lecture Notes in Computer Science. Springer Science and Business Media Deutschland GmbH: Berlin, Germany, 2020; Volume 12251, pp. 214–230. [Google Scholar] [CrossRef]
  25. Barker, M.; Rayens, W. Partial least squares for discrimination. J. Chemom. 2003, 17, 166–173. [Google Scholar] [CrossRef]
  26. Wold, H. Soft modelling by latent variables: The non-linear iterative partial least squares (NIPALS) approach. J. Appl. Probab. 1975, 12, 117–142. [Google Scholar] [CrossRef]
  27. Vicente-Villardon, J.L.; Vicente-Gonzalez, L.; Frutos Bernal, E. MultBiplotR: Multivariate Analysis Using Biplots in R. R Package Version 1.6.14. 2024. Available online: https://cran.r-project.org/web/packages/MultBiplotR (accessed on 26 January 2025).
  28. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2024. [Google Scholar]
  29. Babativa-Márquez, J.G.; Vicente-Villardón, J.L. Logistic Biplot by Conjugate Gradient Algorithms and Iterated SVD. Mathematics 2021, 9, 2015. [Google Scholar] [CrossRef]
  30. Babativa-Márquez, J.G. BiplotML: Biplots Estimation with Machine Learning Algorithms. 2020. Available online: https://cran.r-project.org/src/contrib/Archive/BiplotML/ (accessed on 20 October 2023).
  31. Albert, A.; Anderson, J.A. On the existence of maximum likelihood estimates in logistic regression models. Biometrika 1984, 71, 1–10. [Google Scholar] [CrossRef]
  32. Heinze, G.; Schemper, M. A solution to the problem of separation in logistic regression. Stat. Med. 2002, 21, 2409–2419. [Google Scholar] [CrossRef] [PubMed]
  33. le Cessie, S.; van Houwelingen, J.C. Ridge estimators in logistic regression. J. R. Stat. Soc. Ser. C (Appl. Stat.) 1992, 41, 191–201. [Google Scholar] [CrossRef]
  34. Roozbeh, M. Optimal QR-based estimation in partially linear regression models with correlated errors using GCV criterion. Comput. Stat. Data Anal. 2018, 117, 45–61. [Google Scholar] [CrossRef]
  35. Liu, Y.; Magnus, B.; O’Connor, H.; Thissen, D. Multidimensional item response theory. In The Wiley Handbook of Psychometric Testing: A Multidisciplinary Reference on Survey, Scale and Test Development; John Wiley & Sons Ltd.: Hoboken, NJ, USA, 2018; pp. 445–493. [Google Scholar] [CrossRef]
  36. Reckase, M.D. Multidimensional Item Response Theory; Springer: New York, NY, USA, 2009. [Google Scholar]
  37. De Leeuw, J. Principal component analysis of binary data by iterated singular value decomposition. Comput. Stat. Data Anal. 2006, 50, 21–39. [Google Scholar] [CrossRef]
  38. Lee, S.; Huang, J.Z.; Hu, J. Sparse logistic principal components analysis for binary data. Ann. Appl. Stat. 2010, 4, 1579. [Google Scholar] [CrossRef] [PubMed]
  39. Song, Y.; Westerhuis, J.A.; Aben, N.; Michaut, M.; Wessels, L.F.; Smilde, A.K. Principal component analysis of binary genomics data. Briefings Bioinform. 2019, 20, 317–329. [Google Scholar] [CrossRef] [PubMed]
  40. Landgraf, A.J.; Lee, Y. Dimensionality reduction for binary data through the projection of natural parameters. J. Multivar. Anal. 2020, 180, 104668. [Google Scholar] [CrossRef]
  41. Sierra, C.; Ruíz-Barzola, O.; Menéndez, M.; Demey, J.; Vicente-Villardón, J. Geochemical interactions study in surface river sediments at an artisanal mining area by means of Canonical (MANOVA)-Biplot. J. Geochem. Explor. 2017, 175, 72–81. [Google Scholar] [CrossRef]
  42. Greenacre, M.J. Biplots in correspondence analysis. J. Appl. Stat. 1993, 20, 251–269. [Google Scholar] [CrossRef]
  43. Slynchuk, A. Big Brother Brands Report: Which Companies Access Our Personal Data the Most? 2022. Available online: https://clario.co/blog/which-company-uses-most-data/ (accessed on 1 December 2024).
  44. Gardner-Lubbe, S.; Le Roux, N.; Gower, J. Measures of fit in principal component and canonical variate analyses. J. Appl. Stat. 2008, 35, 947–965. [Google Scholar] [CrossRef]
  45. Anderson, P.K.; Cunningham, A.A.; Patel, N.G.; Morales, F.J.; Epstein, P.R.; Daszak, P. Emerging infectious diseases of plants: Pathogen pollution, climate change and agrotechnology drivers. Trends Ecol. Evol. 2004, 19, 535–544. [Google Scholar] [CrossRef] [PubMed]
  46. O’Connell, R.J.; Thon, M.R.; Hacquard, S.; Amyotte, S.G.; Kleemann, J.; Torres, M.F.; Damm, U.; Buiate, E.A.; Epstein, L.; Alkan, N.; et al. Lifestyle transitions in plant pathogenic Colletotrichum fungi deciphered by genome and transcriptome analyses. Nat. Genet. 2012, 44, 1060–1065. [Google Scholar] [CrossRef]
Figure 1. The companies collecting your face, voice, and environment data.
Figure 2. Typical logistic biplot representation with probability scales. The point of the scale predicting 0.5 is marked with a circle.
Figure 3. Logistic biplot representation showing the projections of all the companies onto the face recognition variable.
Figure 4. Simplified logistic biplot representation with arrows starting at 0.5 and ending at 0.75 predicted probabilities.
Figure 5. Prediction regions for face recognition variable.
Figure 6. Logistic biplot for the responses (countries) of the anthracnose example.
Figure 7. Logistic biplot for the grouped responses (countries) of the anthracnose example.
Figure 8. Logistic triplot for the grouped countries of the anthracnose example.
Figure 9. Logistic triplot including the most important polymorphisms for classifying Brazil.
Figure 10. Logistic triplot including the most important polymorphisms for classifying Europe.
Figure 11. Logistic triplot including the most important polymorphisms for classifying North America.
Table 1. Internet companies.
Company             Face Rec.  Env. Rec.  Prod. Rec.  Contacts  Voice Rec.  Image Lib.  Language
Facebook                1          1          1          1          1           1           1
Instagram               1          1          1          1          1           1           1
Tinder                  1          1          0          1          0           1           1
Grindr                  1          1          0          1          0           1           1
Uber                    1          0          0          1          0           1           1
TikTok                  1          1          1          1          1           1           1
Strava                  0          0          0          1          0           1           1
Spotify                 0          0          0          0          1           0           1
Myfitnesspal            0          0          0          1          0           1           0
Clubhouse               1          0          0          1          1           0           1
Credit Karma            0          0          0          0          0           1           0
Twitter                 1          1          0          1          1           1           1
Airbnb                  1          1          0          0          0           1           1
Lidl Plus               0          0          1          0          0           0           1
American Airlines       0          0          0          0          0           1           1
eBay                    0          0          1          0          0           1           1
Sleepcycle              0          0          0          0          1           0           1
Paypal                  0          0          0          1          0           0           1
Slimming World          0          0          0          0          0           1           0
Whatsapp                0          0          0          1          0           1           1
Zoom                    1          1          0          1          0           0           1
Protect Scotland        0          0          0          1          0           0           1
CoStar                  0          0          0          1          0           1           0
Offerup                 0          0          0          1          1           0           0
Doordash                0          0          0          1          0           0           1
Facetune                1          1          0          0          0           1           1
Google Docs             0          0          0          1          0           0           1
Google Sheets           0          0          0          1          0           0           1
Gmail                   0          0          0          1          0           0           1
VSCO                    0          0          0          0          0           1           1
Table 2. Measures of fit for columns.
Variable             Deviance   D.F.   p-Val   Nagelkerke   Cox–Snell   McFadden   % Correct   Sensitivity   Specificity
Face recog.            36.60    2.00    0.00      0.96         0.70        0.91      100.00       100.00        100.00
Environment recog.     42.47    2.00    0.00      0.96         0.76        0.92      100.00       100.00        100.00
Product recog.         45.91    2.00    0.00      0.96         0.78        0.90       96.67        80.00        100.00
Your contacts          35.03    2.00    0.00      0.91         0.69        0.82       93.33       100.00         80.00
Voice recog.           16.98    2.00    0.00      0.61         0.43        0.45       90.00        75.00         95.45
Image library          12.69    2.00    0.00      0.47         0.34        0.31       80.00        88.89         66.67
Language               47.98    2.00    0.00      0.97         0.80        0.93      100.00       100.00        100.00
Total                 237.65   14.00    0.00      0.88         0.68        0.77       94.29        94.79         93.86
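For reference, the pseudo-R² columns above follow the standard Cox–Snell, Nagelkerke, and McFadden definitions. The sketch below is ours, not the authors' code: the helper `pseudo_r2` is a hypothetical name, the null deviance is reconstructed from the Table 1 counts (11 of the 30 companies collect face data), and the paper's exact null-model convention may differ slightly (the reported McFadden values suggest so).

```python
import math

def pseudo_r2(null_dev, resid_dev, n):
    """Standard pseudo-R^2 measures for a logistic fit on n observations."""
    lr = null_dev - resid_dev                 # likelihood-ratio statistic (the "Deviance" column)
    cox_snell = 1.0 - math.exp(-lr / n)
    nagelkerke = cox_snell / (1.0 - math.exp(-null_dev / n))
    mcfadden = lr / null_dev                  # 1 - logLik(model) / logLik(null)
    return lr, cox_snell, nagelkerke, mcfadden

# Face recognition: 11 of the 30 companies in Table 1 collect it,
# so the null (intercept-only) binomial deviance is:
n, ones = 30, 11
null_dev = -2 * (ones * math.log(ones / n) + (n - ones) * math.log((n - ones) / n))
resid_dev = null_dev - 36.60                  # deviance reported for this variable in Table 2
lr, cs, nk, mf = pseudo_r2(null_dev, resid_dev, n)
print(round(cs, 2), round(nk, 2))             # 0.7 0.96, matching the Cox–Snell and Nagelkerke entries
```

The two printed values reproduce the face-recognition row of Table 2 under these assumptions.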
Table 3. Thresholds, loadings, and communalities.
Variable             Threshold   Loadings Dim1   Loadings Dim2   Communality
Face recog.            −0.02          0.14            0.96           0.94
Environment recog.     −0.05         −0.14            0.96           0.94
Product recog.         −0.11         −0.83            0.50           0.93
Your contacts           0.04          0.96            0.17           0.95
Voice recog.           −0.10          0.64            0.60           0.77
Image library           0.03         −0.80            0.26           0.71
Language                0.11         −0.06            0.97           0.94
Table 4. Fit measures for columns.
Country        Deviance   D.F.   p-Val   Nagelkerke   Cox–Snell   McFadden   % Correct   Sensitivity   Specificity
Argentina         3.60    2.00    0.17      0.10         0.03        0.08       94.17         0.00        100.00
Brazil           95.40    2.00    0.00      0.96         0.60        0.94      100.00       100.00        100.00
Canada           35.30    2.00    0.00      0.48         0.29        0.37       85.44       100.00         82.35
Croatia           7.48    2.00    0.02      0.13         0.07        0.10       87.38         0.00        100.00
France           20.36    2.00    0.00      0.34         0.18        0.26       85.44         0.00         97.78
Portugal          1.37    2.00    0.50      0.06         0.01        0.05       97.09         0.00        100.00
Slovenia          1.81    2.00    0.41      0.08         0.02        0.07       97.09         0.00        100.00
Switzerland      19.89    2.00    0.00      0.39         0.18        0.33       93.20        22.22        100.00
USA              30.31    2.00    0.00      0.42         0.25        0.32       77.67        72.22         78.82
Total           215.52   18.00    0.00      0.43         0.21        0.35       90.83        51.46         95.75
Table 5. Classification of the countries.
Country        Brazil   Canada   France
Argentina         0        0        6
Brazil           20        0        0
Canada            0       18        0
Croatia           0        0       13
France            0        0       13
Portugal          0        0        3
Slovenia          0        0        3
Switzerland       0        0        9
USA               0       18        0
Table 6. Fit measures for the responses of the grouped data.
Region       Deviance   D.F.   p-Val   Nagelkerke   Cox–Snell   McFadden   % Correct   Sensitivity   Specificity
Brazil          93.55    2.00    0.00      0.95         0.60        0.92      100.00       100.00        100.00
Europe         139.67    2.00    0.00      0.99         0.74        0.98      100.00       100.00        100.00
N. America     129.56    2.00    0.00      0.99         0.72        0.97      100.00       100.00        100.00
Total          362.78    6.00    0.00      0.98         0.69        0.96      100.00       100.00        100.00
Table 7. Thresholds, loadings, and communalities.
Region       Threshold    Dim1    Dim2   Communality
Brazil         −0.02     −0.65   −0.72      0.94
Europe          0.00     −0.03    0.99      0.98
N. America      0.00      0.92   −0.37      0.98
Table 8. Fit measures of the variables most related to Brazil.
Polymorphism   Deviance   D.F.   p-Val   Nagelkerke   Cox–Snell   McFadden   % Correct   Sensitivity   Specificity
725993           79.24    2.00    0.00      0.87         0.54        0.80       99.03       100.00         98.81
851495           78.00    2.00    0.00      0.86         0.53        0.79       99.03       100.00         98.81
852127           78.00    2.00    0.00      0.86         0.53        0.79       99.03       100.00         98.81
1162451          79.95    2.00    0.00      0.86         0.54        0.79       98.06        95.00         98.80
1162499          79.95    2.00    0.00      0.86         0.54        0.79       98.06        95.00         98.80
1092206          87.65    2.00    0.00      0.90         0.57        0.84       99.03        95.24        100.00
453086           83.11    2.00    0.00      0.90         0.55        0.84       99.03       100.00         98.81
453152           83.11    2.00    0.00      0.90         0.55        0.84       99.03       100.00         98.81
198811           80.72    2.00    0.00      0.88         0.54        0.82       99.03       100.00         98.81
519877           83.11    2.00    0.00      0.90         0.55        0.84       99.03       100.00         98.81
416611           90.04    2.00    0.00      0.93         0.58        0.89      100.00       100.00        100.00
416649           90.04    2.00    0.00      0.93         0.58        0.89      100.00       100.00        100.00
416658           90.04    2.00    0.00      0.93         0.58        0.89      100.00       100.00        100.00
443422           85.55    2.00    0.00      0.89         0.56        0.82       99.03        95.24        100.00
443436           85.55    2.00    0.00      0.89         0.56        0.82       99.03        95.24        100.00
443446           85.55    2.00    0.00      0.89         0.56        0.82       99.03        95.24        100.00
141702           90.04    2.00    0.00      0.93         0.58        0.89      100.00       100.00        100.00
82966            83.59    2.00    0.00      0.90         0.56        0.85       99.03       100.00         98.81
79763            80.21    2.00    0.00      0.86         0.54        0.79       98.06        95.00         98.80
Table 9. Fit measures of the variables most related to Europe.
Polymorphism   Deviance   D.F.   p-Val   Nagelkerke   Cox–Snell   McFadden   % Correct   Sensitivity   Specificity
247731           87.77    2.00    0.00      0.77         0.57        0.62       92.23        91.49         92.86
247828           87.77    2.00    0.00      0.77         0.57        0.62       92.23        91.49         92.86
409898          112.74    2.00    0.00      0.89         0.67        0.79       95.15        96.36         93.75
81758            85.96    2.00    0.00      0.76         0.57        0.61       92.23        95.35         90.00
230685          112.20    2.00    0.00      0.89         0.66        0.79       96.12        93.88         98.15
Table 10. Fit measures of the variables most related to North America.
Polymorphism   Deviance   D.F.   p-Val   Nagelkerke   Cox–Snell   McFadden   % Correct   Sensitivity   Specificity
132701           81.10    2.00    0.00      0.73         0.54        0.57       89.32        98.28         77.78
1132920          82.01    2.00    0.00      0.78         0.55        0.66       94.17        91.78        100.00
1156764          71.05    2.00    0.00      0.72         0.50        0.59       92.23        89.33        100.00
1193214          76.35    2.00    0.00      0.75         0.52        0.62       93.20        90.54        100.00
373779           68.57    2.00    0.00      0.65         0.49        0.48       85.44        98.15         71.43
897892           94.50    2.00    0.00      0.81         0.60        0.67       92.23       100.00         81.82
897945           86.08    2.00    0.00      0.77         0.57        0.63       93.20        96.88         87.18
962865           72.23    2.00    0.00      0.69         0.50        0.54       91.26        93.94         86.49
306992           90.08    2.00    0.00      0.79         0.58        0.65       93.20        98.39         85.37
762077           73.39    2.00    0.00      0.71         0.51        0.57       91.26        91.43         90.91
290929           83.85    2.00    0.00      0.74         0.56        0.59       85.44        98.15         71.43
553897           99.66    2.00    0.00      0.84         0.62        0.72       94.17        98.41         87.50
614143           78.79    2.00    0.00      0.71         0.53        0.55       86.41       100.00         72.00
928455           82.02    2.00    0.00      0.73         0.55        0.57       85.44       100.00         70.59
964199           77.15    2.00    0.00      0.70         0.53        0.54       83.50        95.83         72.73
227545           83.45    2.00    0.00      0.74         0.56        0.59       89.32       100.00         76.60
384381           87.77    2.00    0.00      0.77         0.57        0.62       89.32       100.00         76.60
384414           87.77    2.00    0.00      0.77         0.57        0.62       89.32       100.00         76.60
389429           68.86    2.00    0.00      0.65         0.49        0.49       77.67        91.30         66.67
16579            79.07    2.00    0.00      0.72         0.54        0.56       83.50        72.73         95.83
62965            77.16    2.00    0.00      0.72         0.53        0.57       90.29        93.85         84.21
236870          111.89    2.00    0.00      0.91         0.66        0.84       98.06        98.51         97.22
307532           76.90    2.00    0.00      0.71         0.53        0.54       87.38        96.55         75.56
307580           76.90    2.00    0.00      0.71         0.53        0.54       87.38        96.55         75.56
307650           84.81    2.00    0.00      0.75         0.56        0.60       88.35        98.25         76.09
307689           84.81    2.00    0.00      0.75         0.56        0.60       88.35        98.25         76.09
311118           95.52    2.00    0.00      0.83         0.60        0.70       94.17        96.92         89.47
312553           95.52    2.00    0.00      0.83         0.60        0.70       94.17        96.92         89.47
312571           95.52    2.00    0.00      0.83         0.60        0.70       94.17        96.92         89.47
316107           95.52    2.00    0.00      0.83         0.60        0.70       94.17        96.92         89.47
316117           95.52    2.00    0.00      0.83         0.60        0.70       94.17        96.92         89.47
344964           84.37    2.00    0.00      0.77         0.56        0.63       92.23        88.89         94.03
345029           84.37    2.00    0.00      0.77         0.56        0.63       92.23        88.89         94.03
349829           84.37    2.00    0.00      0.77         0.56        0.63       92.23        88.89         94.03
351745           87.15    2.00    0.00      0.79         0.57        0.66       93.20        91.43         94.12
486315           71.90    2.00    0.00      0.70         0.50        0.54       91.26        92.65         88.57
622094           75.72    2.00    0.00      0.70         0.52        0.54       87.38        96.55         75.56
627527           77.08    2.00    0.00      0.71         0.53        0.55       87.38        96.55         75.56
659194           78.95    2.00    0.00      0.72         0.54        0.56       88.35        96.61         77.27
664929           73.80    2.00    0.00      0.68         0.51        0.52       86.41        96.49         73.91
665038           73.80    2.00    0.00      0.68         0.51        0.52       86.41        96.49         73.91
502591           71.17    2.00    0.00      0.67         0.50        0.51       89.32        96.67         79.07
502699           71.17    2.00    0.00      0.67         0.50        0.51       89.32        96.67         79.07
504331           71.17    2.00    0.00      0.67         0.50        0.51       89.32        96.67         79.07
226711           70.15    2.00    0.00      0.66         0.49        0.49       86.41       100.00         72.00
226746           70.15    2.00    0.00      0.66         0.49        0.49       86.41       100.00         72.00
228690           62.55    2.00    0.00      0.61         0.46        0.44       83.50       100.00         67.92
478047           85.07    2.00    0.00      0.79         0.56        0.66       95.15        94.29         96.97
478073           85.07    2.00    0.00      0.79         0.56        0.66       95.15        94.29         96.97
478117           85.07    2.00    0.00      0.79         0.56        0.66       95.15        94.29         96.97
478148           85.07    2.00    0.00      0.79         0.56        0.66       95.15        94.29         96.97
480059           85.07    2.00    0.00      0.79         0.56        0.66       95.15        94.29         96.97
481176           85.07    2.00    0.00      0.79         0.56        0.66       95.15        94.29         96.97
234113           77.47    2.00    0.00      0.71         0.53        0.54       86.41        98.18         72.92
364080           81.82    2.00    0.00      0.74         0.55        0.59       91.26        96.77         82.93
364102           81.82    2.00    0.00      0.74         0.55        0.59       91.26        96.77         82.93
364195           81.82    2.00    0.00      0.74         0.55        0.59       91.26        96.77         82.93
374329           82.47    2.00    0.00      0.73         0.55        0.58       79.61        92.00         67.92
375412           75.50    2.00    0.00      0.70         0.52        0.53       87.38        96.55         75.56
265758           88.81    2.00    0.00      0.80         0.58        0.68       94.17        94.12         94.20
266113           88.81    2.00    0.00      0.80         0.58        0.68       94.17        94.12         94.20
266125           88.81    2.00    0.00      0.80         0.58        0.68       94.17        94.12         94.20
266177           88.81    2.00    0.00      0.80         0.58        0.68       94.17        94.12         94.20
266186           88.81    2.00    0.00      0.80         0.58        0.68       94.17        94.12         94.20
6107             78.86    2.00    0.00      0.72         0.53        0.56       87.38        98.21         74.47
104827           67.60    2.00    0.00      0.64         0.48        0.48       83.50       100.00         69.09
Vicente-Gonzalez, L.; Frutos-Bernal, E.; Vicente-Villardon, J.L. Partial Least Squares Regression for Binary Data. Mathematics 2025, 13, 458. https://doi.org/10.3390/math13030458
