#### *5.2. Various Algorithms in ML*

This section presents the basic types of advanced ML algorithms and the procedures for implementing them [133]. Conceptually, ML can be defined as a computer program that gathers information from raw data by extracting features; the information gathered can then support decision-making on real-world problems [134]. In the field of biosensors, especially electrochemical biosensors, ML serves as a tool for analyzing and processing data, for instance to determine analyte concentrations, extract features, and predict the species present. ML is particularly important for building sensing models that predict more than one analyte at a time. Many ML algorithms are now known; those that give the highest accuracy and reveal information hidden in the data are preferred.

ML can be classified into supervised learning and unsupervised learning [135]. In supervised learning, the algorithm is trained on a set of input data paired with the target outputs. During training, the algorithm makes predictions on the input data set and corrects them against the given true values until it reaches an acceptable accuracy. In spectrometric biosensors in particular, supervised algorithms have achieved great progress in regression and classification. In unsupervised learning, by contrast, labelled training data with known outputs are not available. The main aim is to discover groups of similar examples (clustering) or to determine how the data are distributed in the input space (density estimation). One of the most common unsupervised learning algorithms is k-Means clustering [136].
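As an illustration of unsupervised learning, the following is a minimal, dependency-free sketch of k-Means clustering (Lloyd's algorithm). The function names `sq_dist` and `kmeans`, the seed, and the six two-dimensional toy readings are assumptions made for this example only; they do not come from the cited works.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return centroids, labels


# Hypothetical, unlabelled two-feature readings forming two groups.
points = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.0),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

No target outputs are supplied at any point: the grouping emerges from the distances between the inputs alone, which is what distinguishes this from the supervised setting.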

**Figure 10.** (**A**) Electrochemical behaviors of different modified electrodes at a scan rate of 50 mV s−1. (**B**) Corresponding DPV of CBZ at different modified GCEs in 0.1 M PBS. (**C**) DPV of CBZ at Ti2C MXene/Au–Ag NS/GCE in 0.1 M PBS. (**D**) The linear equation of CBZ at different concentrations ranging from 0.006 to 9.8 µM. RVM models with concentration as input (**E**) and current as input (**F**) for estimating the CBZ concentration obtained electrochemically. (**G**) Comparison of the experimental and RVM-predicted concentrations of samples. Reproduced from Reference [132] with permission. Copyright 2021, Elsevier.

#### *5.3. ML Data Analysis*

Biosensor data include both image data sets and sequential data sets. The priority in ML modelling is to develop a model suited to the given data. Once the ML architecture has been designed for a specific biosensor, the workflow shown in Figure 11a is implemented [137]. The first requirement is preprocessing of the raw sensing data. Common preprocessing methods include the Fourier transform, denoising, and derivatives. System-specific preprocessing may additionally involve transformations, normalization, removal of baseline drift, and data compression. The overall efficiency of an ML model depends on how the raw data are preprocessed. For Raman spectroscopy, for example, each spectrum is background-subtracted, Savitzky–Golay-smoothed, and min–max scaled to [0, 1] [138]. It must be pointed out that preprocessing is not guaranteed to yield better results, since it may also accidentally remove informative features from the raw data.
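The three-step spectrum preprocessing mentioned for Raman data (background subtraction, Savitzky–Golay smoothing, min–max scaling) can be sketched without external libraries as follows. This is an illustrative sketch, not the procedure of Reference [138]: the straight-line background model, the fixed 5-point quadratic Savitzky–Golay window, the function names, and the synthetic spectrum are all assumptions. In practice a library routine such as `scipy.signal.savgol_filter` would typically replace the hand-written smoother.

```python
import math

# Standard 5-point quadratic Savitzky-Golay smoothing weights (divide by 35).
SG_COEFFS = (-3, 12, 17, 12, -3)


def subtract_background(y):
    """Subtract a straight line through the first and last points
    (an assumed, very simple background model)."""
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return [v - (y[0] + slope * i) for i, v in enumerate(y)]


def savgol_smooth(y):
    """5-point Savitzky-Golay smoothing; the two points at each edge
    are left unchanged."""
    out = list(y)
    for i in range(2, len(y) - 2):
        window = y[i - 2:i + 3]
        out[i] = sum(c * v for c, v in zip(SG_COEFFS, window)) / 35
    return out


def minmax_scale(y):
    """Rescale values linearly so that min -> 0 and max -> 1."""
    lo, hi = min(y), max(y)
    if hi == lo:  # flat signal: avoid division by zero
        return [0.0] * len(y)
    return [(v - lo) / (hi - lo) for v in y]


def preprocess(y):
    """Background-subtract, smooth, then scale to [0, 1]."""
    return minmax_scale(savgol_smooth(subtract_background(y)))


# Synthetic spectrum: one Gaussian peak at index 25 on a sloping baseline.
raw = [0.02 * i + math.exp(-((i - 25) / 3) ** 2) for i in range(50)]
processed = preprocess(raw)
print(processed.index(max(processed)), min(processed), max(processed))
```

Each step discards part of the signal (the baseline, high-frequency detail, the absolute intensity), which illustrates why preprocessing can also remove informative features.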

**Figure 11.** (**a**) Workflow and the scenarios of overfitting and underfitting. Reproduced from Reference [137] with permission. Copyright 2019, Elsevier; (**b**) results of two combined models, i.e., Principal Component Analysis and Support Vector Machine, which can be used to distinguish cocaine, oxycodone, tetrahydrocannabinol, and heroin. Reproduced from Reference [139] with permission. Copyright 2018, Elsevier; (**c**) predictions of the partial least squares discriminant analysis (PLS-DA) model for all external human blood donor samples. Reproduced from Reference [140] with permission. Copyright 2018, Elsevier.

The preprocessed or raw data set should be split into three subsets: a training set (about 60%), a validation set (about 20%), and a test set (about 20%). The training set is used to extract meaningful information and fit the model parameters. The validation set is used to tune the hyperparameters. The test set is used to report the performance of the algorithms, and it can also reflect the impact of different hyperparameters [134]. A classic loss curve also shows the overfitting and underfitting scenarios and makes convergence and fluctuation clearly visible (Figure 11a). Hyperparameter tuning is a critical task of sensing data analysis in the validation phase. Hyperparameters include the number of hidden neurons, the learning rate, the batch size, and so forth. To find the optimal value of each, approaches such as grid search, random search, or Bayesian optimization can be applied.

Figure 11b shows the combination of two models, Principal Component Analysis and Support Vector Machine, which can be used to distinguish cocaine, oxycodone, tetrahydrocannabinol, and heroin [139]. The developed partial least squares discriminant analysis (PLS-DA) model correctly predicted all external human blood donor samples as human, and 28 of 29 animal blood donor samples as nonhuman (Figure 11c). The ROC curve in Figure 11c had an area under the curve (AUC) of 0.99, i.e., a 99% probability that the model assigns a randomly chosen human sample a higher human-class score than a randomly chosen nonhuman sample [140].

A deep learning model was developed on a SERS data set of exosomes from lung-related cells, and the model was then transferred to predict the lung cancer stage using a SERS data set collected from patient plasma samples [141]. The data set similarity was quantified by the Mahalanobis distance between the cancer cell exosome and plasma exosome clusters. Of 43 cancer patients in stages I and II, 90.7% were predicted accurately by the transferred model. Notably, the similarity between cancer cell exosomes and plasma exosomes correlated positively with the stage of cancer. These results demonstrate that the transferred model can predict lung cancer from the SERS of plasma exosomes. The AUC was 0.910 for stage I patients and 0.912 for the whole cohort.
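The split, tuning, and reporting steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not a workflow from any cited study: the one-feature toy data, the k-nearest-neighbour classifier, the grid of `k` values, and all function names are hypothetical. The AUC is computed directly from its ranking definition, as the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, with ties counted as one half.

```python
import random


def split_60_20_20(samples, seed=0):
    """Shuffle, then cut into ~60% train, ~20% validation, ~20% test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = 60 * len(shuffled) // 100
    n_val = 20 * len(shuffled) // 100
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def knn_score(train, x, k):
    """Fraction of the k nearest training points (1-D feature) labelled 1."""
    nearest = sorted(train, key=lambda s: abs(s[0] - x))[:k]
    return sum(label for _, label in nearest) / k


def accuracy(train, data, k):
    """Share of `data` classified correctly at a 0.5 score threshold."""
    hits = sum((knn_score(train, x, k) > 0.5) == (y == 1) for x, y in data)
    return hits / len(data)


def auc(train, data, k):
    """Fraction of (positive, negative) pairs ranked correctly."""
    pos = [knn_score(train, x, k) for x, y in data if y == 1]
    neg = [knn_score(train, x, k) for x, y in data if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def grid_search(train, val, grid):
    """Pick the hyperparameter value with the best validation accuracy."""
    return max(grid, key=lambda k: accuracy(train, val, k))


def make_toy_data(n=200, seed=1):
    """Hypothetical 1-D sensor responses for two overlapping classes."""
    rng = random.Random(seed)
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]


data = make_toy_data()
train, val, test = split_60_20_20(data)
best_k = grid_search(train, val, grid=[1, 3, 5, 7, 9])  # tuned on validation
test_auc = auc(train, test, best_k)                      # reported on test
print(best_k, round(test_auc, 3))
```

The test set is touched only once, after `best_k` has been fixed, so the reported AUC is not biased by the tuning step. Random search or Bayesian optimization would replace only `grid_search`.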
