#### *3.1. Dataset Preparation*

Attenuated Total Reflectance–Fourier Transform Infrared (ATR-FTIR) spectrometers are increasingly used for in-field screening for illicit drugs, as they are portable and require no sample preparation [13]. The ATR-FTIR spectra used in this study were extracted from the SWGDRUG public spectral library [14]. They comprise 95 spectra of the targeted illicit drugs and of randomly selected negatives of forensic interest, as shown in Table 1. For statistical analysis, the spectra were divided into four classes: Class 1—amphetamines (including 2C-x, DOx, and NBOMe hallucinogens), containing the spectra of 25 substances; Class 2—opioids, with the spectra of 36 compounds; Class 3—cannabinoids, with the spectra of 18 substances; and Class 4—negatives, formed with the spectra of 16 different (randomly selected) compounds. The statistical analysis, autoencoders, and machine learning modeling were performed using the Python packages numpy 1.24.1, scipy 1.10.0, scikit-learn 1.2.1, and sequitur 1.2.4.
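The dataset layout described above can be sketched as follows. This is a hypothetical illustration, assuming the exported spectra are assembled into a matrix with one row per compound and one column per wavenumber; synthetic random intensities and an assumed number of spectral points stand in for the actual SWGDRUG library export.

```python
import numpy as np

# Class sizes from Table 1: 1 = amphetamines, 2 = opioids,
# 3 = cannabinoids, 4 = negatives (95 spectra in total).
class_sizes = {1: 25, 2: 36, 3: 18, 4: 16}
n_points = 1800  # assumed number of wavenumber points per spectrum

rng = np.random.default_rng(0)
# Synthetic stand-in for the real ATR-FTIR intensity matrix.
X = rng.random((sum(class_sizes.values()), n_points))
# Integer class label for each spectrum, in class order.
y = np.concatenate([np.full(n, c) for c, n in class_sizes.items()])
```

With this layout, `X` has shape (95, n_points) and `y` holds the class label of each row, which is the form the exploratory and machine learning steps below assume.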

#### *3.2. Exploratory Data Analysis*

In order to identify patterns and anomalies, an exploratory investigation based on statistics and graphical representations was first performed on the dataset. Two types of exploratory data analysis methods were used, i.e., statistical measures and dimensionality-reduction methods.

#### 3.2.1. Statistical Measures

To gain a better understanding of the trends in our dataset, we used a series of statistical parameters. The mean was used to assess the central tendency of the data, while the standard deviation was used to measure the dispersion of the data. A low standard deviation indicates that the data values tend to be close to the mean of the set, while a higher value indicates that they are spread over a wider interval.




The skewness considers the extremes of the dataset. A distribution is considered symmetrical if the skewness is between −0.5 and 0.5, moderately skewed if the skewness is between −1 and −0.5 or 0.5 and 1, and highly skewed if the skewness is less than −1 or greater than 1. In his paper on series analysis, Grigoletto [15] argued that the more skewed the data, either positive or negative, the less accurate the analysis is.

Excess kurtosis indicates how much the dataset resembles a normal distribution. This parameter has been successfully used by Loperfido [16] for outlier detection. Distributions similar to the normal distribution are called mesokurtic; those with positive excess kurtosis are referred to as leptokurtic, while distributions with negative excess kurtosis are called platykurtic [17]. Minimum and maximum values were also calculated to account for the peaks of the spectra.
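The six statistical measures described above (mean, standard deviation, skewness, excess kurtosis, minimum, and maximum) can be computed per spectrum with numpy and scipy, the packages listed in Section 3.1. A minimal sketch, with a synthetic random spectrum standing in for a real ATR-FTIR spectrum:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
spectrum = rng.random(1800)  # stand-in for one ATR-FTIR spectrum

stats = {
    "mean": spectrum.mean(),               # central tendency
    "std": spectrum.std(ddof=1),           # dispersion
    "skewness": skew(spectrum),            # asymmetry of the intensity distribution
    "excess_kurtosis": kurtosis(spectrum), # Fisher definition: 0 for a normal distribution
    "min": spectrum.min(),                 # accounts for baseline level
    "max": spectrum.max(),                 # accounts for the strongest peak
}
```

Note that `scipy.stats.kurtosis` returns excess kurtosis by default (Fisher's definition), so a mesokurtic distribution scores near 0, a leptokurtic one above 0, and a platykurtic one below 0, matching the terminology used above.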

#### 3.2.2. Principal Component Analysis (PCA)

PCA is a multivariate technique [18] that accomplishes dimensionality reduction by linearly transforming the data into a new coordinate system, where the variation in the data can be described with a set of new orthogonal variables, called principal components (PCs). Its advantage is the ability to plot combinations of PC scores in order to identify clusters of closely related data points. PCA was also used as an exploratory analysis method, in order to evaluate to what extent the chosen classes form well-defined clusters.
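A minimal sketch of the PCA step, assuming scikit-learn's `PCA` (scikit-learn 1.2.1 is listed in Section 3.1); random data stands in for the spectra matrix, and the first two PC scores are the coordinates one would plot to look for clusters:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
X = rng.random((95, 1800))  # stand-in for the 95-spectra matrix

pca = PCA(n_components=2)
scores = pca.fit_transform(X)  # PC1/PC2 scores for a 2-D cluster plot
# pca.explained_variance_ratio_ reports the variance captured by each PC
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by class label, gives the kind of cluster visualization described above.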

#### 3.2.3. Independent Component Analysis (ICA)

ICA is a technique often used in signal processing that separates a multivariate signal into additive subcomponents, under the assumption that the subcomponents are statistically independent of each other and non-Gaussian (at most one subcomponent may be Gaussian) [19]. ICA can also be applied to signals that are not generated by mixing, as in our case, where each ATR-FTIR spectrum is treated as a complex multivariate signal. This technique also uses graphical tools to plot combinations of components in order to identify clusters of similar objects (compounds, in our case).

Similarly to PCA, ICA was used as an exploratory method. Although PCA and ICA serve the same role here, they differ from one another in an important respect: PCA focuses on dimensionality reduction, while ICA concentrates on separating the information into independent components [20].
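The ICA step can be sketched in the same way as PCA, assuming scikit-learn's `FastICA` implementation (one common ICA algorithm; the paper does not name which implementation was used), again with random data standing in for the spectra:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(2)
X = rng.random((95, 1800))  # stand-in for the 95-spectra matrix

ica = FastICA(n_components=2, random_state=0, max_iter=1000)
components = ica.fit_transform(X)  # two independent components per spectrum
```

As with the PC scores, plotting the two columns of `components` against each other, colored by class, is the graphical cluster check described above.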

#### 3.2.4. Autoencoders

Autoencoders represent a subset of artificial neural networks (ANNs) used to obtain efficient representations of data. The algorithm extracts a number of features and then attempts to recreate the original input from these features [21]. An autoencoder is defined by two functions, i.e., the encoder and the decoder. The first step in using these networks is to train the encoder and the decoder simultaneously through gradient descent. The second step consists of removing the decoder part of the model, leaving only the encoder. The output of the model then consists of the key features of the input, which can be used in the same way as the PCA or ICA components for a two-dimensional representation of the spectra. In our paper, we used a linear autoencoder trained on the whole dataset with an encoding dimension of 10. For the graphical representation, we chose the features that best separated the clusters.
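The paper used the sequitur package (PyTorch-based) for the linear autoencoder; as a dependency-light stand-in, the same idea can be sketched with scikit-learn's `MLPRegressor` trained to reproduce its own input. With an identity activation and a single 10-unit hidden layer, the network is a linear autoencoder with encoding dimension 10, trained by gradient-based optimization; after training, keeping only the first layer's weights plays the role of "removing the decoder":

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.random((95, 400))  # stand-in spectra (reduced width for speed)

# Linear autoencoder: 400 -> 10 -> 400, identity activation throughout.
ae = MLPRegressor(hidden_layer_sizes=(10,), activation="identity",
                  solver="adam", max_iter=500, random_state=0)
ae.fit(X, X)  # train encoder and decoder together to reconstruct the input

# "Remove the decoder": the encoder is just the first layer's affine map.
codes = X @ ae.coefs_[0] + ae.intercepts_[0]  # 10 features per spectrum
```

The columns of `codes` are the learned features; the two that best separate the classes would be the ones plotted, as described above.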

#### *3.3. Machine Learning Methods (MLM)*

Multiclass classification of the analyzed spectra was then performed with five machine learning models, i.e., SVM [22], eXtreme Gradient Boosting (XGB) [23], Random Forest [24], Gradient Boosting [25], and KNN [26]. These models were chosen for their efficiency, simplicity, and fast implementation. Such models have been successfully used to classify counterfeit drugs based on their infrared spectra [27].

For all the models, the dataset was randomly split into two partitions: 60% of all spectra for training and 40% for testing. Each model was then trained on the training set and evaluated on the testing set, after which the model and the training and testing sets were discarded. We define this process as a training session. Although the initial dataset was the same for each session, the training and testing sets differed at each iteration, because the entries were randomly selected each time; in other words, the models were trained and evaluated each time on different selections of the same dataset. Each training session was repeated 10 times. Hyperparameter selection was performed using the Optuna 3.1.0 hyperparameter optimization framework.
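A minimal sketch of the repeated training sessions, assuming scikit-learn and using KNN as a stand-in for the five models; synthetic data replaces the real spectra, and the Optuna hyperparameter search is omitted. Each iteration draws a fresh random 60/40 split, trains, evaluates, and discards the model, matching the training-session protocol above:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import balanced_accuracy_score

rng = np.random.default_rng(3)
X = rng.random((95, 200))                 # stand-in spectra
y = rng.integers(1, 5, size=95)           # stand-in class labels (1-4)

scores = []
for session in range(10):
    # Fresh random 60/40 split for every session.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.4, random_state=session)
    model = KNeighborsClassifier(n_neighbors=3)  # stand-in for the 5 models
    model.fit(X_tr, y_tr)
    scores.append(balanced_accuracy_score(y_te, model.predict(X_te)))
# scores now holds one balanced-accuracy value per training session
```

Averaging `scores` over the 10 sessions gives a performance estimate that does not depend on a single lucky or unlucky split.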

The following parameters were calculated in order to assess and compare the performances of the models:

$$\text{Sensitivity} \left( \text{TPR} \right) = \frac{\text{TP}}{\text{TP} + \text{FN}} \tag{1}$$

$$\text{Specificity} \,(\text{TNR}) = \frac{\text{TN}}{\text{TN} + \text{FP}} \tag{2}$$

$$\text{Balanced accuracy} \,(\text{BA}) = \frac{\text{TPR} + \text{TNR}}{2} \tag{3}$$

$$\text{Matthews correlation coefficient} = \frac{\text{TP} \times \text{TN} - \text{FP} \times \text{FN}}{\sqrt{(\text{TP} + \text{FP})(\text{TP} + \text{FN})(\text{TN} + \text{FP})(\text{TN} + \text{FN})}} \tag{4}$$

where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
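Equations (1)–(4) translate directly into a small helper function; a sketch with an illustrative confusion-matrix count (the numbers are hypothetical, not results from the paper):

```python
def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, balanced accuracy, and MCC
    from confusion-matrix counts, per Equations (1)-(4)."""
    tpr = tp / (tp + fn)                      # Eq. (1): sensitivity
    tnr = tn / (tn + fp)                      # Eq. (2): specificity
    ba = (tpr + tnr) / 2                      # Eq. (3): balanced accuracy
    mcc = (tp * tn - fp * fn) / (             # Eq. (4): Matthews corr. coeff.
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) ** 0.5
    return tpr, tnr, ba, mcc

# Illustrative example: 40 TP, 50 TN, 10 FP, 0 FN
tpr, tnr, ba, mcc = classification_metrics(40, 50, 10, 0)
```

For these example counts, sensitivity is 1.0 and specificity 50/60 ≈ 0.833, so the balanced accuracy of ≈ 0.917 penalizes the false positives even though no positives were missed; the MCC (≈ 0.816) summarizes all four cells in a single coefficient.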

#### **4. Results and Discussion**
