*2.3. Differential Diagnosis Based on Urine Proteome and Separation of Group H from the Total Patients*

A total of 409 proteins were identified and quantified by LFQ; after quality control, 241 proteins in 96 samples remained.

PCA showed 13 principal components to be optimal. The consistency check and averaging resulted in 47 samples (group G: 19, group D: 20, and group H: 8). Four machine learning models were tested: a decision tree was added to the three algorithms mentioned above. None of the tested algorithms, even with optimized hyperparameters, showed a high proportion of correct decisions. It was below 80% overall, showing no evident possibility of differentiating the three tested groups on the basis of urine proteome data, and allowing us to conclude that the urine proteome, in comparison with the plasma proteome, differs far less between the tested groups of patients.

The capabilities of "one against all" models for the separation of classes were also tested. The classifier 1-nn (nearest-neighbor algorithm) gave an accuracy of 100% for class H on the entire set of urine proteomics data: it correctly identified all samples of class H and produced no false assignments. Thus, this model showed a good capability to separate patients of group H (hypertensive nephropathy) from the remaining groups of renal patients.
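The "one against all" idea can be illustrated with a minimal sketch. It assumes Euclidean distance and leave-one-out evaluation, neither of which is stated in the text, and it uses invented two-dimensional toy points in place of the real principal-component scores; the function names are hypothetical.

```python
import math

def nearest_neighbor_label(query, points, labels):
    """Return the label of the point closest to `query` (Euclidean distance)."""
    best = min(range(len(points)), key=lambda i: math.dist(query, points[i]))
    return labels[best]

def one_vs_all_loo(points, groups, target):
    """Leave-one-out check of a 1-NN 'target vs. rest' classifier.

    Returns (true_pos, false_pos, false_neg, true_neg) for the target class.
    """
    is_target = [g == target for g in groups]  # collapse to a binary problem
    tp = fp = fn = tn = 0
    for i, (x, truth) in enumerate(zip(points, is_target)):
        # Hold out sample i and classify it using all remaining samples.
        rest_points = points[:i] + points[i + 1:]
        rest_labels = is_target[:i] + is_target[i + 1:]
        predicted = nearest_neighbor_label(x, rest_points, rest_labels)
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Invented toy data (NOT study data): class H lies far from G and D.
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [1.0, 1.1], [1.2, 0.9],
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.3]]
groups = ["G", "G", "G", "D", "D", "H", "H", "H"]

tp, fp, fn, tn = one_vs_all_loo(points, groups, "H")
print(tp, fp, fn, tn)
```

In this toy setting, every H sample is recovered with no false assignments because the H points form a well-separated cluster; with real data, the result depends on how the features and the validation scheme are chosen.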

#### **3. Discussion**

Focused on efficient processing of proteomics data, this work had two goals: 1) development of the concept of a non-invasive early-stage test system for general kidney malfunctions, and 2) differentiation by origin of previously diagnosed renal diseases with similar symptoms. To this end, using our data set, we first tried to distinguish the total group of CKD patients from the healthy group, and then to differentiate the three groups of patients from each other.

At the first step, it was important to find out whether the plasma proteome differs consistently between patients with renal disease and healthy people. This task was not particularly difficult. It turned out that the KNeighbors machine learning model (kNN) is able to differentiate CKD patients from healthy individuals with high confidence (97.8% of correct classifier responses).

Glomerular filtration rate (GFR) remains the most widely used marker of kidney malfunction. It is usually estimated from endogenous filtration markers such as serum creatinine and cystatin C [24], but the accuracy of their measurement is still under discussion [25]. This suggests that new proteomic biomarkers may facilitate more accurate and earlier detection of renal pathologies [26]. Machine learning has been successfully applied to proteomics data for classification of samples and identification of biomarkers, and can be used across a wide range of diseases [27]. Thus, the high accuracy in separating a group of renal patients from healthy individuals, which we demonstrated by processing plasma proteome data with machine learning, supports the introduction of this approach into clinical diagnosis.

The second goal of our research was to examine whether various CKDs can be differentiated simultaneously by plasma proteome data. It is difficult to predict in advance the most efficient processing algorithm for analyzing multidimensional data, so we tested three models. The KNeighbors classifier (Model 1) was chosen as the most effective after comparison with logistic regression and the support vector machine (SVM), because it gave the best proportion of correct classifier responses (87.5%). After the less represented group of hypertensive patients was excluded, the proportion of correct decisions increased above 96%, thus showing a high ability to separate diabetic patients with indirect kidney damage from patients with autoimmune-caused internal kidney degradation (glomerulonephritis).

Diseases of different origins, which are not only related to kidneys but expressed in symptoms of kidney degradation, may be reflected in the plasma proteomic composition. According to our results, diabetic nephropathy has a specific proteomic signature in blood which is independent of renal degradation. The distinct changes in the expression of individual proteins, such as monocyte chemoattractant protein-1 (MCP-1) and transforming growth factor-β1 (TGF-β1) [28], or even in panels of several proteins [29], noted previously as predicting the rate of renal function decline in diabetes, may also be accompanied by other, more complex changes in the plasma proteome. On the other hand, the hypertensive signature of renal degradation, with damage to the tubules and the interstitium, was not expressed in specific changes in the blood proteome, according to our data. Therefore, the proposed approach can be recommended for further development as a medical test system based on the plasma proteome that can separate glomerulonephritis patients from other CKD patients, e.g., diabetics.

Urine sampling is even easier than venous blood collection, and the sample preparation procedure avoids the extra step of depleting the major plasma proteins. We used both biofluids to obtain proteomics data and compared them to find the most suitable way for differential diagnosis of the three types of kidney diseases. Urine samples obtained from healthy people were not used in this study, due to their much lower protein concentrations compared with those of renal patients.

As for the plasma proteome, we tested kNN, SVM, and logistic regression as means of distinguishing the urine proteome datasets. In addition, the decision tree algorithm was added to the comparison. However, none of the tested models gave a proportion of correct decisions above 80%, thus showing no clear ability to differentiate the three tested groups simultaneously based on urine proteome data.

The additional "one against all" model showed the best results in the separation of only one tested group of patients. The classifier 1-nn (nearest-neighbor algorithm, Model 2) gave an accuracy of 100% for class H, with no false assignments across the entire data set. Processing of the urine proteome by machine learning thus showed the ability to distinguish hypertensive nephropathy from other renal diseases. Since the sample set used here is not very representative (only 5 hypertensive patients), this approach should be tested further on larger groups.

Diabetic nephropathy and proliferative forms of glomerulonephritis have a similar histological picture of diffuse glomerulosclerosis, tubulointerstitial fibrosis, and atrophy, and also variable degrees of hyaline arteriolosclerosis and arterial sclerosis. In hypertensive nephropathy, primary pathological changes in afferent arterioles lead to ischemic damage to the glomerular apparatus [30]. Thus, the degradation of different kidney structures can result in a difference in the proteins that pass into the urine of hypertensive patients. Proteinuria of the same origin, associated with glomerular defects, probably did not have specific features depending on the genesis of renal degradation, and therefore diabetes and glomerulonephritis were not distinguishable by urinary protein variations in our study.

To test the differences in pathological renal filtration in the context of protein size, we compared molecular masses of the proteins found in the urine of patients from the three CKD groups. No distinct specificity for a particular disease was found in our case, as shown in Figure 1. Thus, the differences in the total proteome, which can be traced by machine learning, do not concern the molecular masses and sizes of proteins and are manifested in other complex features.

**Figure 1.** Distribution of urine proteins by molecular weight in the three studied groups of patients.
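As an illustration of how such a comparison could be set up, the sketch below bins protein molecular masses and reports the fraction of proteins per bin for each group. The bin edges and the mass values are invented for the example and are not taken from the study; `mass_distribution` is a hypothetical helper.

```python
from bisect import bisect_right

def mass_distribution(masses_kda, bin_edges_kda):
    """Fraction of proteins in each molecular-mass bin.

    With edges [e0, e1, ...] the bins are: < e0, [e0, e1), ..., >= last edge.
    """
    if not masses_kda:
        raise ValueError("at least one protein mass is required")
    counts = [0] * (len(bin_edges_kda) + 1)
    for mass in masses_kda:
        # bisect_right gives the index of the bin that contains `mass`.
        counts[bisect_right(bin_edges_kda, mass)] += 1
    return [c / len(masses_kda) for c in counts]

# Hypothetical bin edges (kDa) and invented masses, NOT study data.
edges = [20, 60, 100]
toy_masses = {
    "G": [12, 25, 66, 150],
    "D": [14, 30, 70, 120],
    "H": [11, 45, 69, 200],
}

distributions = {g: mass_distribution(m, edges) for g, m in toy_masses.items()}
for group, fractions in distributions.items():
    print(group, fractions)
```

Near-identical fractions across groups, as in this toy case, would correspond to the absence of a disease-specific mass profile.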

In this study, we conclude that the urine proteome, compared with the plasma proteome, shows far fewer differences between our tested groups. As a result, urine is not well suited for universal differential diagnosis in this analytical way, but this approach may remain useful in some cases, for example, to isolate patients with hypertensive nephropathy.

In addition, it is possible to use a two-stage approach to the differential diagnosis of CKD (Figure 2): a primary proteomic analysis of urine to identify or rule out a hypertensive origin of the disease (using Model 2), followed by a proteomic analysis of blood plasma to separate diabetic nephropathy from glomerulonephritis (using Model 1).

**Figure 2.** Scheme of two-stage differential diagnosis of chronic kidney disease using proteomic data of urine and plasma.
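The control flow of the two-stage scheme can be sketched as follows. The two classifiers are passed in as plain functions; the threshold-based stand-ins used here are purely hypothetical placeholders for Model 2 (urine) and Model 1 (plasma) and do not reproduce the trained models.

```python
from typing import Callable, Sequence

Profile = Sequence[float]  # hypothetical: a vector of protein-derived features

def two_stage_diagnosis(
    urine_profile: Profile,
    plasma_profile: Profile,
    is_hypertensive: Callable[[Profile], bool],
    classify_plasma: Callable[[Profile], str],
) -> str:
    """Stage 1: urine-based H vs. rest; stage 2: plasma-based D vs. G."""
    if is_hypertensive(urine_profile):
        return "hypertensive nephropathy"
    # Only samples not assigned to group H reach the plasma classifier.
    return classify_plasma(plasma_profile)

# Toy stand-ins with arbitrary thresholds (assumptions for illustration only).
def toy_urine_model(profile: Profile) -> bool:
    return profile[0] > 1.0

def toy_plasma_model(profile: Profile) -> str:
    return "diabetic nephropathy" if profile[0] > 0.5 else "glomerulonephritis"

result_h = two_stage_diagnosis([2.0], [0.1], toy_urine_model, toy_plasma_model)
result_d = two_stage_diagnosis([0.3], [0.9], toy_urine_model, toy_plasma_model)
result_g = two_stage_diagnosis([0.3], [0.2], toy_urine_model, toy_plasma_model)
print(result_h, result_d, result_g, sep="\n")
```

Keeping the two stages as separate callables means either model can be retrained or replaced without changing the decision logic.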

While the application of the presented strategy may be limited for hypertensive nephropathy, because its diagnosis is mostly clinical and proteinuria is absent in most cases, for correctly diagnosed diabetic nephropathy a specific therapy aimed at correcting sugar levels can be applied more actively. Treatment strategies for glomerulonephritis include immunosuppression with glucocorticosteroids and cytostatics, which is associated with the risk of serious infectious complications. Similar therapy can be dangerous for patients with diabetes, since they have a higher risk of purulent complications. Generally, differential diagnosis between glomerulonephritis and diabetic nephropathy should be carried out on the basis of kidney biopsy. However, several complicating points exist: (1) kidney biopsy is not always unambiguous, (2) it is an invasive method, (3) kidney biopsy should be performed in a specialized center, (4) assessing the dynamics of the process requires a second study, which in the case of kidney biopsy means a repeated invasive intervention, and (5) the patient may refuse an invasive procedure. In this work, we studied the phenotypes expressed in the proteomes that distinguish the presented groups. All associated pathologies, including diabetes, should undoubtedly be reflected in the phenotypic proteomes, which we examined by choosing groups in this way (with and without diabetes) and dividing them using a machine learning approach.
