*4.5. Data Analysis*

All LFQ data analysis was performed in Anaconda Python 3 using the Pandas and Scikit-learn libraries. The data were filtered in two steps to exclude rarely detected proteins and low-quality samples: proteins detected in fewer than 5 samples, and samples in which fewer than 50 proteins were detected, were removed from further analysis. The LFQ intensities were then normalized as follows: the LFQ level of each protein in each sample was divided by the maximum LFQ level of that protein across all samples, bringing the quantitative values into the range from 0 to 1. Principal component analysis (PCA) was then applied to reduce the dimensionality of the data, replacing the many protein LFQ values with a smaller number of features, namely the projections onto the principal component vectors. The next step checked the consistency of the samples using the pairs of technical replicates. For each sample, the nearest Euclidean neighbor in the space of the principal components was found. A replicate pair was considered consistent if the nearest neighbor of replicate A was the corresponding replicate B and vice versa. The principal component coordinates of such pairs were averaged, and each pair was treated as a single sample. Samples that failed this check were excluded.
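The preprocessing steps above can be sketched as follows. This is a minimal illustration on synthetic data, assuming a samples-by-proteins intensity matrix with NaN for undetected proteins and with technical replicates stored in adjacent rows; the thresholds (5 samples, 50 proteins) and the max-normalization follow the text, while the number of principal components and all data values are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.metrics import pairwise_distances

# Synthetic stand-in: 10 specimens x 2 technical replicates = 20 samples,
# 200 proteins; replicates differ only by small multiplicative noise.
rng = np.random.default_rng(0)
base = rng.uniform(1e5, 1e8, size=(10, 200))
noise = rng.normal(1.0, 0.02, size=(20, 200))
lfq = pd.DataFrame(np.repeat(base, 2, axis=0) * noise)
lfq[lfq < 2e6] = np.nan  # low intensities mimic missed detections

# Step 1: drop proteins detected in fewer than 5 samples,
# then samples with fewer than 50 detected proteins.
lfq = lfq.loc[:, lfq.notna().sum(axis=0) >= 5]
lfq = lfq.loc[lfq.notna().sum(axis=1) >= 50]

# Step 2: divide each protein by its maximum across samples -> values in [0, 1].
lfq_norm = lfq / lfq.max(axis=0)

# Step 3: PCA on the (zero-filled) matrix to reduce dimensionality.
pcs = PCA(n_components=5).fit_transform(lfq_norm.fillna(0).to_numpy())

# Step 4: mutual-nearest-neighbor check for technical replicate pairs.
# This sketch assumes all samples passed the filters, so rows 2i and 2i+1
# are still replicates A and B of specimen i.
dist = pairwise_distances(pcs)          # Euclidean by default
np.fill_diagonal(dist, np.inf)          # ignore self-distances
nearest = dist.argmin(axis=1)
kept = []
for a in range(0, len(pcs), 2):
    b = a + 1
    if nearest[a] == b and nearest[b] == a:   # mutual nearest neighbors
        kept.append((pcs[a] + pcs[b]) / 2)    # average replicate coordinates
consensus = np.array(kept)                    # one row per verified pair
```

Because the synthetic replicates are near-duplicates, every pair passes the mutual-nearest-neighbor test here; on real data, pairs whose nearest neighbors do not match are dropped at this step.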

At the final step of the data analysis, four machine learning algorithms were used: k-nearest neighbors (kNN), logistic regression, support vector machine (SVM), and decision tree. Optimal model hyperparameters were found by grid search. Because the number of experimental samples was small, we did not split the dataset into separate training and test sets. A held-out test set for the final quality verification would make the estimate of model quality highly unstable, since it depends on the random division into training and test sets, and marginal measurements may fall into the test set. Instead, we used leave-one-out cross-validation on the whole dataset to control model quality.
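A sketch of this model selection step, combining grid search with leave-one-out cross-validation in Scikit-learn, is given below. The four estimators match those named in the text, but the hyperparameter grids and the synthetic dataset are illustrative assumptions; the paper does not list the searched values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the PCA coordinates and class labels.
X, y = make_classification(n_samples=40, n_features=5, n_informative=3,
                           random_state=0)

# Candidate models with illustrative (hypothetical) hyperparameter grids.
candidates = [
    (KNeighborsClassifier(), {"n_neighbors": [1, 3, 5]}),
    (LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}),
    (SVC(), {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}),
    (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 3, 5]}),
]

# Grid search scored by leave-one-out cross-validated accuracy:
# each sample serves once as the single-element test fold.
scores = {}
for model, grid in candidates:
    search = GridSearchCV(model, grid, cv=LeaveOneOut(), scoring="accuracy")
    search.fit(X, y)
    scores[type(model).__name__] = search.best_score_
```

With `cv=LeaveOneOut()`, `best_score_` is exactly the leave-one-out accuracy of the best hyperparameter combination, so model comparison and quality control use the same metric.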

Accordingly, we excluded each sample from the full set in turn, trained the model on the remaining samples, and then tested it on the excluded sample.

To estimate quality, we used the mean accuracy metric, i.e., the average proportion of correct answers. This metric seems adequate because the class imbalance is insignificant (the ratio of class sizes does not exceed 3.2).
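The leave-one-out procedure and the mean accuracy metric described above can be written explicitly as follows; the kNN classifier and the synthetic dataset are placeholder assumptions, as any of the four models could stand in its place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the final sample coordinates and labels.
X, y = make_classification(n_samples=40, n_features=5, random_state=0)

# Leave-one-out: hold out each sample in turn, train on the rest,
# and record whether the held-out sample is classified correctly.
hits = []
for i in range(len(X)):
    mask = np.arange(len(X)) != i
    model = KNeighborsClassifier(n_neighbors=3).fit(X[mask], y[mask])
    hits.append(model.predict(X[i:i + 1])[0] == y[i])

# Mean accuracy = average proportion of correct answers over all folds.
mean_accuracy = np.mean(hits)
```

Each held-out prediction is either right or wrong, so averaging the per-fold results gives the proportion of correctly classified samples over the whole dataset.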

These procedures allowed us to use all available data as efficiently as possible and to avoid the randomness involved in choosing a test set.
