The analysis of multivariate data is a central issue in biomedical research, where accurate patient classification and reliable conclusions are critically important. Linear Discriminant Analysis (LDA) remains one of the most established methods for both dimensionality reduction and classification. In this paper, we examine in detail the theoretical foundations, assumptions, and statistical properties of LDA, and apply the method step by step to real data from the Breast Cancer Wisconsin (Diagnostic) database, which contains cellular features from breast biopsy samples, with the aim of distinguishing benign from malignant tumors. Emphasis is placed on the method’s assumptions, namely multivariate normality, equality of covariance matrices, and absence of multicollinearity, and we demonstrate that satisfying them leads to significant improvements in model performance. Specifically, careful preprocessing and strict adherence to these assumptions increase classification accuracy from [...] (cross-validated) to [...] (cross-validated). To our knowledge, this study is the first to demonstrate the dual use of LDA as both a dimensionality-reduction tool and a predictive classification model for this medical database within the same biomedical analysis framework. Moreover, we provide, for the first time, a systematic comparison between our assumption-aware LDA model and related studies employing the most accurate machine-learning classifiers reported in the literature for this dataset, showing that classical LDA achieves accuracy comparable to these more complex methods. The resulting discriminant model, which uses 13 of the original 30 variables, can be applied easily by clinical researchers to classify new cases as benign or malignant, while also providing interpretable coefficients that reveal the underlying relationships among variables. The implementation is carried out in the SPSS environment, following the theoretical steps described in the paper, thus offering a user-friendly and reproducible framework for reliable application. In addition, the study establishes a structured and transparent workflow for the proper application of LDA in biomedical research by explicitly linking assumption verification, preprocessing, dimensionality reduction, and classification.
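To make the classification step concrete, the following is a minimal sketch of a two-class linear discriminant under the equal-covariance assumption. It is not the SPSS procedure used in the paper: Python is substituted for illustration, the function names (`fit_lda`, `classify`) are hypothetical, only two features are used so that the pooled covariance matrix can be inverted by hand, equal class priors are assumed, and the toy measurements are invented rather than taken from the Breast Cancer Wisconsin (Diagnostic) data.

```python
# Two-class LDA sketch in pure Python (illustrative; not the paper's SPSS workflow).
# All data values below are hypothetical, not drawn from the Wisconsin dataset.

def mean_vector(rows):
    """Per-feature mean of a list of equal-length feature tuples."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter_matrix(rows, mu):
    """Sum of outer products of each row's deviation from the class mean."""
    p = len(mu)
    s = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mu[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                s[a][b] += d[a] * d[b]
    return s

def fit_lda(class0, class1):
    """Return (w, c) so that w . x - c > 0 assigns x to class 1.

    Assumes exactly two features, a shared (pooled) covariance matrix,
    and equal prior probabilities for the two classes.
    """
    mu0, mu1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, mu0), scatter_matrix(class1, mu1)
    dof = len(class0) + len(class1) - 2
    # Pooled within-class covariance matrix (2x2).
    sw = [[(s0[a][b] + s1[a][b]) / dof for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu1[0] - mu0[0], mu1[1] - mu0[1]]
    # Discriminant direction: inverse pooled covariance times mean difference.
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Threshold: projection of the midpoint between the two class means.
    c = w[0] * (mu0[0] + mu1[0]) / 2 + w[1] * (mu0[1] + mu1[1]) / 2
    return w, c

def classify(w, c, x):
    """Return 1 if the discriminant score is positive, otherwise 0."""
    return 1 if w[0] * x[0] + w[1] * x[1] - c > 0 else 0

# Hypothetical two-feature measurements for each group.
benign_like = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5), (1.8, 1.5)]
malignant_like = [(4.0, 5.0), (4.5, 4.8), (5.0, 5.5), (4.2, 5.2), (4.8, 4.5)]

w, c = fit_lda(benign_like, malignant_like)
new_case_label = classify(w, c, (4.6, 5.1))
```

The coefficients in `w` play the role of the interpretable discriminant coefficients mentioned above: a larger absolute weight means the feature contributes more to separating the two groups, given the scale of that feature. A sketch like this omits the assumption checks and cross-validation that the paper treats as essential.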