1. Introduction
Biomedical sensors will clearly keep improving rapidly as they become more accessible; more readily available on the market; more intelligent; smaller and more compact; and integrated into personal belongings such as cellular phones, watches, and eyeglasses. They make an outstanding contribution to human health, entertainment, the military, security, sports, and leisure, as well as to the analysis and interpretation of a patient's physiological data. These sensors are an integral part of biomedical devices; however, this ever-evolving innovation brings with it the challenge of performing intelligently [
2]. The integration of these sensors, which are subjected to different environments in both wired and wireless implementations, adds random noise to the system. A sudden upswing of the noise floor degrades the carrier-to-noise ratio (C/N) and the energy-per-bit to noise-density ratio (Eb/N0) during data processing. The resulting drastic increase in the probability of error (Pe) ultimately decreases the accuracy of such a system [
2]. This interference results in flawed calculation and decision making, especially for artificial neural networks (ANNs) in biomedical applications [
3].
An artificial neural network (ANN) is a computing system that simulates the ability of human neurons to learn the complex characteristics of an environment, to recognise patterns, and to generalise the inter-relationships between features, including those in multidimensional datasets. It falls under the umbrella of artificial intelligence (AI) and deep learning and solves problems that would be difficult or impossible to solve by human effort or statistical criteria alone. ANNs have self-learning skills, becoming more efficient as more reliable data become available to them. Primarily, ANNs enable the complex inter-relationships between the features within a given dataset to be identified and have seen widespread adoption in many applications, including biomedical and signal processing with data that are readily and publicly available from wearable sensors [
4].
Since data are everywhere, such as on the Internet, neglecting the integrity of an information source poses a significant risk of ANNs misinterpreting the data. In critical areas of the medical field, the efficacy of a diagnosis is threatened if the analysis that aids medical experts rests on a weak source, especially when dealing with a multitude of datasets containing noisy and biased data, which often originate from seemingly reliable sources [
5].
Low-powered sensors are a critical source of unreliable information. Sudden changes in conditions can introduce environmental noise, and there are many avenues through which noise and interference can blend into any part of the system. Scalability is another problem, given that today's sensors are low power and have limited computational capability. Placement or location is a further significant constraint: sensors placed too close to other devices may be affected by crosstalk interference [
6].
To understand data, one first needs to understand their composition. Data comprise valid data and dirty data. Valid data contain information that is accurate, holds predictive power, and generalises to the entire set. Dirty data contain information that is misleading, noisy, or erroneous, such as the pragmatic, semantic, and syntactic errors shown in
Figure 1 [
7].
Noisy data are unwanted random fluctuations that are disruptive and hinder generalisation over the entire dataset. For an ANN to analyse the data with predictive power, the dataset needs to be free from noise so that errors in classification and prediction, which would significantly affect medical experts in their professional biomedical interpretation, are reduced. Given recent technological solutions, one relevant algorithm for addressing these challenges is highlighted [
8].
Principal component analysis (PCA) is usually part of data visualisation. PCA converts high-dimensional datasets into low-dimensional ones with the aid of covariance, eigenvalues, and eigenvectors, which identify and rank the strength of the predictive power of all features [9]. PCA allows critical feature vectors to be extracted from multidimensional datasets for data visualisation but loses some crucial information along the way [10], which is a considerable disadvantage, particularly when it is used extensively in machine learning. This study uses PCA not for dimension reduction but for sample reduction, in order to remove unwanted noisy samples from multidimensional datasets.
This paper proposes an application of a PCA–sample reduction process (SRP) to improve the prediction accuracy of ANNs. Publicly available biomedical datasets from different fields are used to provide a qualitative analysis and to demonstrate the effectiveness of the method in improving ANN accuracy. By cleaning these datasets before training and testing the ANN, the classification problem is expected to gain accuracy and to incur a lower computational cost, which is a great help in the analysis and prediction of big multidimensional datasets.
A study reinforcing the ANN methodology with PCA–sample reduction is presented to determine the proposed system's accuracy and performance on multidimensional biomedical datasets of different sizes. Furthermore, we investigate heart disease, voice and speech analysis for gender recognition, breast cancer classification, and cancer patient datasets to demonstrate the versatility and flexibility of the proposed data-cleansing technique.
The rest of the paper is structured as follows.
Section 2 provides a literature review of data cleaning in biomedical applications.
Section 3 discusses the basic concepts of principal component analysis and related topics.
Section 4 discusses how the PCA–SRP is integrated with an ANN.
Section 5 provides the recommended S_c ranges.
Section 6 interprets the result of publicly available biomedical datasets. Lastly,
Section 7 provides a discussion, the conclusion, and future research directions.
2. Data-Cleaning Applications
The volume of data collected nowadays is increasing vastly, and since most acquired data are polluted, their dependability is declining. Various data-cleaning methodologies are available to rectify this issue, but data cleansing remains difficult when working with large-scale data requirements. Data cleaning, also known as data cleansing, is not a new area of research; it aims to increase data quality by detecting and eliminating errors and inconsistencies [
11]. At present, there are two classes of data cleansing: traditional data cleansing and data cleansing for big data. Traditional data-cleansing techniques, such as Potter's Wheel and Intelliclean, are so called because they are not designed to manage massive volumes of data [
12].
Meanwhile, the techniques in
Table 1 such as Cleanix [
13], SCARE [
14], KATARA [
15], and BigDansing [
16] are developed specifically for big data. Regarding the emerging trends in data-cleaning techniques, one of the new challenges that researchers are about to face is scalability [
17]. One of the perennial problems in data analytics is identifying and restoring dirty data, and failure to do so will result in faulty analytics and unreliable decisions. New abstractions and scalability are among the various facets of this issue and are considered when developing data-cleaning methods to cope with the amount and diversity of data [
13,
14,
16]—see
Table 2. Given the significant amount of data, processing it into a form suitable for big data analysis and decision making takes time. The data's volume, veracity, and velocity must also be considered when analysing the proposed approaches; however, as the researchers note, “Data analytics is not about having the information known, but about discovering the predictive power behind the data collected” [
12].
Cleanix, SCARE, and BigDansing focus on the scalability issue in the data-cleansing process. Moreover, SCARE and BigDansing do not require a human domain expert in the cleansing process. SCARE needs an extensive set of rules to update the dataset, yet no expert oversees the process. The process is nevertheless expensive, and if correct fixes for the dirty dataset are not identified, the result is redundancy in the training data and a machine learning threshold parameter that is hard to set precisely [
15]. Furthermore, BigDansing also requires a set of data-quality rules to optimise the cleansing process, which entails calibrating and putting in place many regulations before cleaning can start, as shown in
Table 2 below; however, it needs no human-domain expertise to monitor the whole process, although adjusting such parameters is crucial in maintaining the essence of the information in the datasets [
12].
These data-cleansing techniques support various data-cleaning tasks, such as abnormal value detection and correction, incomplete data filling, de-duplication, and conflict resolution (Cleanix); value modification (SCARE); the identification of correct and incorrect data and the generation of top-k possible repairs for inaccurate data (KATARA); and the translation of rules into a series of transformations that enable distributed computation and several optimisations (BigDansing).
There are two kinds of dirty data: erroneous and noisy—see
Figure 1. The current big-data-cleansing techniques—Cleanix [
13], SCARE [
14], KATARA [
15], and BigDansing [
16]—are focused mostly on erroneous dirty data, offering solutions for duplicate entries, missing values, wrong values, and wrong formats. In medical applications, these big-data-cleansing techniques play a critical role in the management of medical records in medical facilities; however, they do not address noisy data acquired via biomedical sensors. Cleanix, KATARA, and BigDansing cannot predict correct values through a machine learning approach and cannot determine how to eliminate randomly mixed-in values [
12]; however, the SCARE technique, although built on machine learning, can only replace missing values with the most precise estimate. When dealing with noisy data, the common approach in machine learning and artificial neural network (ANN) applications is noise reduction and suppression [
11,
18,
19].
Across all of the above data-cleaning techniques, noise and random-sample reduction remains a significant part of the discussion that is yet to be addressed; it is vital in the analysis of biomedical signals acquired by wearable biomedical sensors, which are exposed to many kinds of environments, often filled with varying sorts of noise such as thermal and acoustic noise and interference. Moreover, since most wearable biomedical sensors are low powered, suppressing these kinds of noise through signal processing with a filtering threshold method is problematic: unsupervised classification is not effective under a low SNR, and when the spectral characteristics of the noise are similar or close to those of the sensor-received signal, the detection performance may be degraded [
20]. When applied to the classification problem of an artificial neural network (ANN), obtaining correct values through comprehensive and extensive quantisation in data-signal processing is essential. Still, it is not sufficient for identifying which of the gathered data have predictive power; as some authors note, “Data analytics is not about having the information known, but about discovering the predictive power behind the data collected” [
12]. Correct data values do not guarantee that they hold valid predictive value for summarising the entirety of a multidimensional dataset. Mining those correct values is an integral part of data mining in every machine learning and big data analysis.
Both complicated and straightforward mistakes are unavoidably present in data input and acquisition. Although much effort can be expended on this front-end procedure to reduce entry mistakes, the truth remains that mistakes in massive datasets are prevalent. Field error rates are usually about 5% or higher unless an organisation takes extraordinary precautions to prevent data inaccuracies [
21]. This rate is high enough that it might lead to erroneous interpretation and decision making.
In the case of cleaning noisy data acquired via biomedical sensors, most researchers use principal component analysis for a reduction in dimensionality or feature space [
22,
23,
24,
25,
26], feature extraction in further data visualisation [
27,
28,
29], and feature selection tools in machine and deep learning applications [
30,
31,
32,
33,
34].
Principal component analysis (PCA) is usually utilised for dimension or feature reduction and, together with other machine learning techniques, provides a significant increase in accuracy and efficiency in many applications beyond the biomedical field [
35,
36,
37,
38,
39]. Nevertheless, some of the information is lost during the process of dimension reduction [
10].
Figure 2 shows a sample Excel dataset: conventional PCA reduces the dimensions vertically (by column), shown in red, whereas the proposed methodology reduces the noisy samples horizontally (by row). We emphasise that feature extraction through covariance, eigenvalues, eigenvectors, and dimension reduction is not the novel technique we propose here; instead, we propose the implementation of sample reduction using PCA. Identifying the noisy samples that cause irregularity in multidimensional datasets and omitting or reducing them is the main focus of this study.
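As a minimal illustration of this distinction on a toy NumPy array (not the authors' implementation; the row-selection rule is only sketched here and is developed properly in Section 3):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                  # 100 samples (rows) x 6 features (columns)

# Conventional PCA: dimension reduction, i.e., fewer COLUMNS, same number of rows.
X_fewer_columns = PCA(n_components=2).fit_transform(X)        # shape (100, 2)

# Proposed use (sketch): sample reduction, i.e., same columns, fewer ROWS.
Xc = X - X.mean(axis=0)                                        # centre the features
pc1 = PCA(n_components=1).fit(X).components_[0]                # first principal axis
loading_scores = Xc @ pc1                                      # one score per row
keep = loading_scores >= np.quantile(loading_scores, 0.05)     # illustrative cut-off
X_fewer_rows = X[keep]                                         # roughly (95, 6)

print(X_fewer_columns.shape, X_fewer_rows.shape)
```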
The PCA–sample reduction process is applied for data cleansing of noisy multidimensional datasets to increase the accuracy of classification problems in artificial neural networks and to identify recommended threshold ranges. To the best of our knowledge, we are the first to propose this approach. We discuss a specific technique for the qualitative detection and omission of noise, explained with a motivating example highlighting deep learning classification problems of artificial neural networks in biomedical applications using publicly available biomedical datasets. The simplicity of this technique makes it portable and applicable to a variety of tasks for the fast and accurate classification of typical, commonly used artificial neural networks.
In terms of software implementation, developers and data engineers prefer Python. It is considered the best option for projects involving AI and big data analysis, as Python is a simple language with a mature and supportive community, an abundance of support from renowned corporate sponsors, an extensive and popular selection of libraries, and the ability to work with heavy-hitting frameworks such as TensorFlow, scikit-learn, OpenCV, and Keras [
40].
3. Principal Component Analysis–Sample Reduction Process
Principal component analysis (PCA) aims to reduce the dimensionality of a dataset with many correlated variables while keeping as much variance as feasible. The data are converted into a new collection of uncorrelated variables, known as principal components (PCs), that preserves most of the variance contained in the original variables. The new dimensions, or orthogonal axes, are linearly independent and ranked according to the data variance, so the most important principal axis comes first (more important = more variance, i.e., more spread-out data). In general, variance, covariance, eigenvalues, and eigenvectors play an essential role in this concept [
41].
Figure 3 shows that a large positive covariance means that X and Y are strongly related, i.e., as X increases, Y also increases. A negative covariance portrays the opposite relation, and zero covariance means that X and Y have no linear relation.
Visualising data is an excellent way to understand the patterns that lie in multidimensional datasets. When information is placed on horizontal and vertical axes (a two-dimensional plane), it is straightforward to discern the pattern that lies in it; however, multidimensional data with many features are difficult to conceive visually, and the corresponding analysis becomes complex. Principal component analysis discovers patterns in how the data are distributed across each dimension by first analysing, through eigenvector analysis, the contribution of each feature to the information in the overall dataset. It then performs dimension reduction by keeping the components with the highest eigenvalues.
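A small worked example of the covariance relationships sketched in Figure 3, using toy data (for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
y_pos = 2.0 * x + rng.normal(scale=0.1, size=500)   # rises with x
y_neg = -2.0 * x + rng.normal(scale=0.1, size=500)  # falls as x rises
y_none = rng.normal(size=500)                       # unrelated to x

print(np.cov(x, y_pos)[0, 1])    # large positive covariance
print(np.cov(x, y_neg)[0, 1])    # large negative covariance
print(np.cov(x, y_none)[0, 1])   # near zero
```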
To demonstrate the efficacy of PCA, Figure 4, which projects a set of multidimensional data onto a two-dimensional space, is used as an example. Due to the high-dimensional nature of the data, it can be challenging to identify a linear correlation between the data points. The points are represented as column vectors before being aggregated into a matrix M. PCA is then applied to M as shown in Equation (1):

cov(M) · V = V · W,  (1)

where W and V are the corresponding eigenvalues and eigenvectors, respectively. The two-dimensional projection is generated using the eigenvectors in V associated with the top k eigenvalues in W, which best represent the data points; these are selected before plotting the projected points as shown in Figure 5 [30]. The principal axes (PC1 and PC2) are denoted by the red lines passing through the points, providing a graphical representation of the covariance amongst the points.
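A minimal NumPy sketch of this step, assuming mean-centred data and the covariance eigendecomposition of Equation (1) (the actual figures use the authors' datasets):

```python
import numpy as np

rng = np.random.default_rng(2)
M = rng.normal(size=(200, 5))            # 200 samples x 5 features
Mc = M - M.mean(axis=0)                  # mean-centre before PCA

C = np.cov(Mc, rowvar=False)             # covariance matrix of the features
W, V = np.linalg.eigh(C)                 # eigenvalues W, eigenvectors V (columns)
order = np.argsort(W)[::-1]              # sort by decreasing variance
W, V = W[order], V[:, order]

k = 2
projection = Mc @ V[:, :k]               # coordinates along the first two principal axes
print(projection.shape)                  # (200, 2), as plotted in Figure 5
```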
Lastly, the vital part of the proposed PCA–sample reduction process (PCA–SRP) is as follows:

L_i = M_i · V_k,  (2)

M′ = { M_i ∈ M : L_i ≥ b },  (3)

where L_i is the loading score of each sample M_i in dataset M and V_k is the matrix of the top k eigenvectors of the samples in matrix M. S is the number of samples whose loading score lies above the set bias b, with S ≤ D, where D is the number of samples in the dataset, including the random and noise samples. The selectivity S_c is the rate of samples cleaned using PCA–SRP in dataset M, based on Equation (4). Any sample above said threshold is accepted, and those below it are rejected, forming the new processed dataset M′ as per Equation (3).
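A short sketch of the sample reduction step under our reading of Equations (2) and (3): the loading score is taken as the projection magnitude onto the top k eigenvectors, and the bias b is set here from a quantile of the scores, which is an assumption rather than the paper's Equation (4). Function and variable names are ours:

```python
import numpy as np

def pca_sample_reduction(M, k=2, bias_quantile=0.05):
    """Drop samples of M whose loading score falls below a bias b (a sketch of PCA-SRP)."""
    Mc = M - M.mean(axis=0)                          # mean-centre the features
    W, V = np.linalg.eigh(np.cov(Mc, rowvar=False))  # eigenvalues W, eigenvectors V
    top_k = V[:, np.argsort(W)[::-1][:k]]            # k directions of largest variance
    scores = np.linalg.norm(Mc @ top_k, axis=1)      # loading score of each sample (Eq. (2))
    b = np.quantile(scores, bias_quantile)           # bias b (assumed quantile rule)
    keep = scores >= b                               # accept above, reject below (Eq. (3))
    return M[keep], keep

# Toy usage: 95 structured samples plus 5 small-energy "noise" samples (cf. Section 7).
rng = np.random.default_rng(0)
D = np.vstack([rng.normal(scale=2.0, size=(95, 4)), rng.normal(scale=0.05, size=(5, 4))])
cleaned, mask = pca_sample_reduction(D, k=2, bias_quantile=0.05)
print(D.shape, "->", cleaned.shape)
```

In this sketch, the retained fraction len(cleaned)/len(D) plays the role of the selectivity discussed in Section 5; the paper's own definition is the one given by its Equation (4).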
5. Sc Range Identification
Accuracy testing was conducted on the specified dataset in order to identify the S_c range shown in Figure 9, maximising both the number of samples retained in the cleaned set S and the number of samples captured in the removed set R of the processed dataset.
Assume that a dataset D consists of two sets: the cleaned data S and the removed (noisy) data R, so that D = S ∪ R.
By indexing the dataset D for identification, where n is the number of samples in the cleaned data S, we have the following: S = {d_1, d_2, …, d_n} and R = {d_{n+1}, …, d_D}.
The set R is then randomly distributed within S to thoroughly mix the noise into the cleaned data, with each sample's index maintained for identification. The mixed dataset is then processed by PCA–sample reduction and sorted based on its loading scores through the concepts of covariance, eigenvalues, eigenvectors, and the selectivity S_c.
S_c is the selectivity of the data, based on the equation in Table 3, where the cut-off bias separates the clean-processed data from the removed processed data. The dataset is thereby transformed from randomly ordered data into sorted, PCA–SR-processed data.
By definition, the true positive rate (TPR, or sensitivity) is the number of correct samples of S retained in the clean-processed set, relative to n. Likewise, the true negative rate (TNR, or specificity) is the number of correct samples of R captured in the removed processed set.
Therefore, the most efficient S_c is where the sensitivity and specificity curves meet.
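A sketch of this diagnostic, assuming ground-truth membership of S and R is known for a test dataset and that higher loading scores indicate cleaner samples (the function name and the synthetic scores are illustrative, not the authors' code):

```python
import numpy as np

def sensitivity_specificity_curves(scores, is_clean, thresholds):
    """Sweep a cut-off over the loading scores and record sensitivity (TPR) and specificity (TNR)."""
    sens, spec = [], []
    for t in thresholds:
        accepted = scores >= t
        sens.append(np.mean(accepted[is_clean]))     # clean samples kept in the processed S
        spec.append(np.mean(~accepted[~is_clean]))   # noisy samples placed in the removed R
    return np.array(sens), np.array(spec)

# The most efficient operating point is where the two curves meet.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(2.0, 1.0, 300), rng.normal(0.0, 1.0, 60)])
is_clean = np.concatenate([np.ones(300, bool), np.zeros(60, bool)])
thresholds = np.linspace(scores.min(), scores.max(), 200)
sens, spec = sensitivity_specificity_curves(scores, is_clean, thresholds)
crossing = thresholds[np.argmin(np.abs(sens - spec))]
print("sensitivity and specificity meet near threshold", round(float(crossing), 2))
```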
5.1. Test Selectivity
To find the test selectivity, the actual cleaned set S and the random set R are needed; the accuracy is identified from the sensitivity and specificity lines by applying Equation (16).
For the test selectivity, both the sensitivity and the specificity are maximised, as expressed in Equation (16). The maximum with respect to S_c gives the test selectivity, satisfying Equation (21).
Note: be cautious when using the test selectivity as S_c; it might remove all of the unwanted samples but may also lose some information.
5.2. Minimum Selectivity
The minimum selectivity is the rate of samples in the processed dataset at which all included samples pass the minimum requirement of sharing the same attributes as the entire set.
Note: by using the minimum selectivity as S_c, important information might not be lost, but the dataset D retains an abundance of random and noise samples.
The acceptable S_c is given in Equation (23) and shown in Figure 10.
Therefore, the recommended S_c lies within the range given in Equation (25); it varies depending on how much cleaning and removal are needed in the dataset D.
5.3. ANN Selectivity
The ANN requires samples that allow for strong predictive power (see
Section 7 for more details concerning the ANN requirements); we assumed a selectivity close to 100%, since the exact ANN selectivity cannot be determined due to the unidentifiable R samples.
7. Discussion and Results
The given datasets, as acquired, are contaminated with noise and random samples. By definition, noise consists of small, unwanted forms of energy or samples; random samples, on the other hand, are unintended and unspecified values lying within the range of normal values.
The datasets with noise and random samples are processed using PCA–SRP with 98% selectivity (S_c = 98%), as shown in Table 5, and then subjected to the following tests:
PCA–SRP + ANN comparison accuracy testing compares the validation model accuracy with and without PCA–SRP in an ANN.
Sensitivity vs. specificity testing is a diagnostic test to find the approximate S_c range values.
Receiver operating characteristic (ROC) curve testing compares the methodology PCA–SRP in different datasets in terms of organisation and classification of samples. Moreover, ROC curves also provide a practical evaluation of machine learning techniques.
Accuracy vs. additional random samples testing is a diagnostic test of the response to a sudden spike in noise and random samples.
7.1. PCA–SRP + ANN Comparison Accuracy Testing
We compared the validation model accuracy with and without PCA–SRP in an ANN classification problem, as presented in Figure 20, and determined its effect when subjected to noise and random samples.
Table 6 and Table 7 display the validation accuracy using ANN + PCA–SRP and ANN only, showing a significant increase (see Figure 21) for both noise and random samples.
The difference between the datasets subjected to noise and those subjected to random samples lies in how well defined the accuracy curves of ANN + PCA–SRP versus ANN only are, even though both show a significant accuracy increase. The accuracy of datasets under the influence of random samples swings and deviates more, and is less well defined, than the accuracy under noise samples.
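A sketch of the kind of comparison run in this test, using the ANN configuration reported in the conclusions (two hidden layers of 32 and 16 ReLU units, a softmax output, 100 epochs, and a learning rate of 0.1, with the optimiser type assumed); the dataset variables X and y and the pca_sample_reduction helper sketched in Section 3 are assumptions, not the authors' code:

```python
import tensorflow as tf

def build_ann(n_features, n_classes):
    """Two hidden layers (32 and 16 ReLU units) with a softmax output, as reported in Table 3."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),  # assumed optimiser
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical comparison on a noisy biomedical dataset (X, y not shown here):
# cleaned_X, mask = pca_sample_reduction(X)                      # sketch from Section 3
# ann_only   = build_ann(X.shape[1], 2).fit(X, y, epochs=100, validation_split=0.2)
# ann_pcasrp = build_ann(X.shape[1], 2).fit(X[mask], y[mask], epochs=100,
#                                           validation_split=0.2)
```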
7.2. Sensitivity vs. Specificity Testing
Sensitivity measures how many true positives remain in the S set and shows a sudden dip as S_c increases. Specificity measures how many true negatives end up in the removed R set and increases as S_c increases.
Figure 22 presents the diagnostic testing yielding the recommended S_c ranges for normal cleaning and ANN applications, as given in Equations (25) and (26), respectively.
Table 8 shows the relation between S_c and the sensitivity and specificity; in practice, however, the optimal S_c is hard to determine owing to the unknown R set and whether it consists of noise or random samples. In general, for the normal cleaning process, S_c is close to or above the sensitivity or specificity rate, and an S_c in this range is suggested for the given datasets.
However, ANN applications need highly predictive samples [12], so it is suggested that the selectivity (S_c) should be nearly 100%; Table 8 shows S_c in this range.
A high S_c value loses information and true positives from the S set but increases the specificity of the dataset; a low S_c value, however, admits noise and random samples, yielding fewer true negatives and increasing sensitivity. Careful adjustment of S_c therefore ensures a good result when it is used as a cleaning agent in the system.
7.3. Receiver Operating Characteristic (ROC) Curve Testing
As observed in
Figure 23, the receiver operating characteristic (ROC) curves of the datasets show a clear organisation of the samples, even though the random and noise samples were strongly mixed into the cleaned set. The larger the area under the curve (AUC), the better the classifier methodology in the true positive rate (sensitivity) vs. false positive rate (1 − specificity) diagram, as seen in
Figure 24. Ideally, the objective is the perfect classifier; nevertheless, a result above the random classifier line would allow us to conclude that the methodology is acceptable.
Among all of the datasets, the cancer patients dataset is acceptable but performs less well than the others, having a smaller AUC.
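A sketch of how such an ROC curve and its AUC can be computed with scikit-learn, given ground-truth clean/noisy labels and the PCA–SRP loading scores (the labels and scores here are synthetic placeholders):

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(2)
y_true = np.concatenate([np.ones(300), np.zeros(60)])   # 1 = truly clean, 0 = injected noise
scores = np.concatenate([rng.normal(2.0, 1.0, 300),     # clean samples tend to score higher
                         rng.normal(0.0, 1.0, 60)])

fpr, tpr, thresholds = roc_curve(y_true, scores)
print("AUC =", round(auc(fpr, tpr), 3))                 # larger AUC -> better separation of S and R
```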
7.4. Accuracy vs. Additional Random Samples Testing
Since the cancer patients dataset shows the least-effective classifier, as shown in
Figure 23, it was tested for its response to the injection of a sudden spike of additional random samples of up to 100%.
Figure 25 shows that, as the number of random samples increases, the performance of PCA–SRP decreases gradually.
Figure 26 and
Figure 27 present the validation accuracy of both ANN + PCA–SRP and ANN only under additional noise of up to 100%, with selectivities (S_c) of 88% and 98%, respectively. They also show a reasonable increase in accuracy when using 98% selectivity instead of 88%.
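A sketch of the noise-injection step used in this test, appending a growing proportion of random samples drawn within each feature's observed range (the injection scheme and names are our assumptions):

```python
import numpy as np

def inject_random_samples(X, y, fraction, rng):
    """Append `fraction` * len(X) random samples, with random labels, to simulate a noise spike."""
    n_extra = int(fraction * len(X))
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_rand = rng.uniform(lo, hi, size=(n_extra, X.shape[1]))   # values within the normal range
    y_rand = rng.integers(0, int(y.max()) + 1, size=n_extra)
    return np.vstack([X, X_rand]), np.concatenate([y, y_rand])

# Usage (hypothetical): sweep the injected fraction from 0% to 100% and record the
# validation accuracy with and without PCA-SRP cleaning at S_c = 88% and 98%.
# for frac in np.linspace(0.0, 1.0, 11):
#     X_noisy, y_noisy = inject_random_samples(X, y, frac, np.random.default_rng(0))
#     ...
```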
The high-valued data are preserved and maintain their accuracy up to a specific level of additional noise; the predictive power remains intact up to that point, even though some data are lost in the process. PCA–SRP maintains the highest performance in cancer patient classification at high S_c values. The methodology thus offers a significant advantage, gradually slowing the decrease in classification accuracy in the face of a sudden increase in noise in the system.
The ANN classification problem requires strong training sets, which calls for many samples with high loading scores while disregarding the low ones; based on the observations in both Figure 26 and Figure 27, a selectivity of 98% is better, as described in Equation (26).
8. Conclusions and Future Research
The material presented in this paper shows a significant improvement in the accuracy of an ANN in classification problems with the aid of the principal component analysis–sample reduction process (PCA–SRP). The ANN used a learning rate of 10%, two hidden layers with 32 and 16 neurons, ReLU activations in the hidden layers, and a SoftMax output activation, trained for 100 epochs, as shown in Table 3. The PCA–SRP and ANN were implemented in Python and applied to the multidimensional datasets gathered, namely the heart disease, gender voice recognition, breast cancer classification, and cancer patients datasets provided in
Table 4. These datasets were then used in the PCA–SRP + ANN accuracy comparison testing, sensitivity vs. specificity testing, receiver operating characteristic (ROC) curve testing, and accuracy vs. additional random samples testing; the results show significant improvements.
PCA–SRP removed dirty and imprecise samples from the datasets, based on the results shown in Table 5, which reduced the number of samples to be processed and allowed for a significant increase in accuracy, as shown in Table 6 and Table 7. Furthermore, we also determined the recommended S_c range values for normal cleaning and for the ANN classification problem.
Future research will further investigate performance on massive biomedical datasets and determine how to load them into the PCA–SRP cleansing agent; one suggestion is loading through batch processing. Another suggestion is to apply the PCA–SRP technique to different neural network architectures such as CNNs, RNNs, LSTMs, and GNNs. Furthermore, investigating the incorporation of the studied cleaning technique into various field applications, such as real-time biomedical automation, image-based medical diagnosis classification, and human thought processes [
48,
49,
50,
51] could be a desirable research avenue.
The proposed methodology could also be applied to a wearable EEG or similar device to extract chaotic data from the brain's unique biometric samples for use in cryptography and especially in steganography. Lastly, this formal basis is necessary to design and construct high-quality, helpful software tools that support the PCA–SRP data-cleansing process and its application to artificial neural networks (ANNs).