#### *2.5. Multivariate Analysis for Predicting Renal Outcome*

To improve predictive performance and find a meaningful combination of proteins that could distinguish patients who were and were not at risk of disease progression, two classifiers of the 412 proteins were generated, one based on random forest (RF) [19] and the other on support vector machine (SVM) [20]. Five proteins (ACP2, CTSA, GM2A, MUC1, and SPARCL1) were selected for both the RF and SVM methods by an AUC-based RF backward-elimination process [21], according to a >0.3 importance of selection (Table 2). These variables were used to establish an RF model by generating 20,000 decision trees, and a linear SVM model by three repeated iterations of 10-fold cross-validation. Evaluation of the performance of these classifiers showed that the AUC values for RF and SVM were 1.000 and 0.935, respectively (Figure 6A). The nominal binary results of the RF and SVM models were transformed into disease prediction scores, which ranged from 0 to 1 (Figure 6B and Table S6). The two classifiers differed significantly from the albumin-to-creatinine ratio (likelihood ratio test: *p*-value < 0.05). These five proteins are located in extracellular exosomes, vesicles, or organelles: three (ACP2, CTSA, and GM2A) are located in the lysosomal lumen, MUC1 is located in the plasma membrane, and SPARCL1 interacts with collagen in the extracellular matrix.
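The resampling scheme mentioned above (three repeats of 10-fold cross-validation) can be illustrated with a minimal, standard-library sketch. It is not the authors' code: the function name `repeated_kfold`, the random seed, and the unstratified shuffling are assumptions made for illustration, and only the sample count (54), fold count (10), and repeat count (3) come from the text.

```python
import random

def repeated_kfold(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (repeat, train_idx, test_idx) for repeated k-fold cross-validation.

    Hypothetical helper: splits are not stratified by class, which a real
    analysis with 35 vs. 19 samples would probably want.
    """
    rng = random.Random(seed)
    for rep in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a fresh shuffle for every repeat
        # Fold i takes every n_folds-th shuffled index, so sizes differ by at most one.
        folds = [idx[i::n_folds] for i in range(n_folds)]
        for i, test_idx in enumerate(folds):
            train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
            yield rep, train_idx, test_idx

# 54 samples, as in the study cohort; each sample is held out once per repeat.
splits = list(repeated_kfold(54, n_folds=10, n_repeats=3))
```

Each of the 30 resulting splits would be used to fit the classifier on `train_idx` and score it on `test_idx`; averaging over repeats reduces the dependence of the estimate on any single partition.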

**Figure 6.** ROC curves of RF and SVM classifiers for five selected proteins (ACP2, CTSA, GM2A, MUC1 and SPARCL1). Performance of the two classifiers in the set of 54 samples, 35 from patients with good prognosis and 19 from patients with poor prognosis. (**A**) Areas under the curve (AUC) for the RF (1.0) and SVM (0.935) classifiers. (**B**) Clinical indices (0–1) of the two classifiers.
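As a reminder of what the AUC values in Figure 6A measure, the following sketch computes the area under the ROC curve as the probability that a randomly chosen poor-prognosis sample receives a higher score than a randomly chosen good-prognosis sample. The function name `roc_auc` and the scores are hypothetical and are not taken from the study data.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical prediction scores (0 to 1), not data from the study.
poor_prognosis = [0.90, 0.80, 0.65, 0.40]
good_prognosis = [0.10, 0.30, 0.50, 0.20, 0.35]
auc = roc_auc(poor_prognosis, good_prognosis)
```

An AUC of 1.0, as reported for the RF classifier, corresponds to every poor-prognosis score exceeding every good-prognosis score in the evaluated set.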


**Table 2.** AUC-based RF backward-elimination process-based selected feature proteins.

Probability of selection for each variable.

#### *2.6. External Validation of Clinical Models in Public Studies*

Since we were unable to find a benchmarking study of urine protein biomarker discovery that could validate our statistical model, we validated the models with mRNA expression in the kidney, an organ that undoubtedly affects urine samples. The SVM and RF models consisting of five urine proteins were applied to four publicly available GEO datasets (GSE99339 [22], GSE47185 [23], GSE30122 [24], and GSE96804 [25,26]) without model adjustment. The first dataset, GSE99339, contains mRNA expression in the renal glomeruli of 187 patients across 11 disease groups: diabetic nephropathy (DN), rapidly progressive glomerulonephritis (RPGN), tumor nephrectomies (TN), hypertensive nephropathy (HT), IgA nephropathy, membranous glomerulonephritis (MGN), systemic lupus erythematosus (SLE), thin membrane disease (TMD), focal and segmental glomerulosclerosis (FSGS), focal and segmental glomerulosclerosis and minimal change disease (FSGS&MCD), and minimal change disease (MCD). The two classifiers' prognostic probabilities were highly correlated with each other in the 187 samples (ρ = 0.817, Pearson correlation coefficient). In both models, the highest value in the DN group was higher than in the other ten disease groups (Figure 7A). RF prediction values in the DN group were significantly higher than in eight of the other groups, the exceptions being RPGN and HT (Mann-Whitney U test: *p*-value < 0.05). SVM prediction values in the DN group were significantly higher than in nine of the other groups, the exception being RPGN (*p*-value < 0.05). The second dataset, GSE47185, contains a total of 223 mRNA expression profiles from kidney glomeruli (*N* = 122) and tubulointerstitia (*N* = 101). The eight disease groups are DN, RPGN, TN, MGN, TMD, FSGS, FSGS&MCD, and MCD. The two classifiers' prognostic probabilities were also highly correlated in the 223 samples (*r* = 0.637). The trend in predicted values differed by kidney compartment (Figure 7B). In the glomeruli, the predictions of both models were highest in the DN group and differed significantly from those of the other seven groups. However, in the tubulointerstitium, SVM prediction values in DN differed significantly from four of the other groups (all except RPGN, TN, and FSGS&MCD), whereas RF prediction values in DN differed significantly only from MCD. This indicates that the five urine proteins are more closely related to the glomeruli than to the kidney tubulointerstitium.
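The two statistics used in this comparison, the Pearson correlation between classifier outputs and the Mann-Whitney U test between disease groups, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the function names and the scores are hypothetical, and the *p*-value uses the normal approximation without tie or continuity correction, which may differ from the exact procedure used in the study.

```python
import math

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def mann_whitney_u(x, y):
    """Return (U for x, two-sided p-value) using the normal approximation.

    Assumption: no tie or continuity correction is applied.
    """
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    n1, n2 = len(x), len(y)
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u, p_two_sided

# Hypothetical classifier outputs for two disease groups (not study data).
dn_scores = [0.82, 0.77, 0.91, 0.68, 0.85, 0.74]
mcd_scores = [0.31, 0.45, 0.28, 0.52, 0.39, 0.36]
u_stat, p_value = mann_whitney_u(dn_scores, mcd_scores)
```

With many pairwise group comparisons, as in an 11-group dataset, a multiple-testing adjustment would usually be considered; the sketch does not include one.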

Meanwhile, we tested whether the prognostic models could predict DKD. The third dataset, GSE30122, comprises 69 samples: 35 glomerular samples, of which 26 were normal (obtained from living allograft donors) and 9 were DKD, and 34 tubular samples, of which 24 were normal and 10 were DKD. In the glomeruli, the RF model results differed significantly between the normal and disease groups (*p*-value < 0.05), but the SVM model results did not (*p*-value > 0.05; Figure 7C). In the tubules, the results of neither model differed significantly between the normal and disease groups (*p*-value > 0.05). In the fourth dataset, GSE96804, out of a total of 62 samples, 20 glomerular samples came from the non-neoplastic part of tumor nephrectomies and 41 came from patients with DN. The results of neither model differed significantly between the normal and disease groups (*p*-value > 0.05; Figure 7D). This indicates that models that predict kidney prognosis from urine protein markers in diabetic patients have difficulty distinguishing DKD from normal groups on the basis of kidney mRNA expression levels.

**Figure 7.** External validation of RF and SVM clinical models in four public GEO datasets (GSE99339, GSE47185, GSE30122 and GSE96804). (**A**) In the GSE99339 dataset, boxplot of the prognostic probabilities of the two classifiers in 11 disease groups including DN (*N* = 14), RPGN (*N* = 23), TN (*N* = 14), HT (*N* = 15), IgA nephropathy (*N* = 26), MGN (*N* = 21), SLE (*N* = 30), TMD (*N* = 3), FSGS (*N* = 22), FSGS&MCD (*N* = 6), and MCD (*N* = 13). (**B**) In the GSE47185 dataset, the prognostic indexes of the two classifiers in the eight disease groups in the renal glomeruli with DN (*N* = 14), RPGN (*N* = 23), TN (*N* = 17), MGN (*N* = 21), TMD (*N* = 3), FSGS (*N* = 23), FSGS&MCD (*N* = 6), and MCD (*N* = 15) and in the renal tubulointerstitia with DN (*N* = 18), RPGN (*N* = 21), TN (*N* = 6), MGN (*N* = 18), TMD (*N* = 6), FSGS (*N* = 13), FSGS&MCD (*N* = 4), and MCD (*N* = 15). (**C**) In the GSE30122 dataset, the prediction values of the two classifiers in the control and disease groups in renal glomeruli (*N* = 26, control; *N* = 9, disease) and in renal tubules (*N* = 24, control; *N* = 10, disease). (**D**) In the GSE96804 dataset, the prediction probabilities of the two classifiers in the control (*N* = 20) and disease (*N* = 41) groups in renal glomeruli.

#### **3. Discussion**

Urine-based measurements of internal biomolecules can be normalized. Ideally, urine should be collected for 24 h and urinary biomolecules measured. Because this method is practically difficult, urinary proteins in random spot samples were calibrated relative to creatinine concentrations [9]. Storage conditions are important when urine samples are kept for prolonged periods for protein studies, because the activity of urinary proteases depends on temperature and pH [27]. In this retrospective study, urine samples were stored at −80 °C for 7–8 years before LC-MS/MS measurements. In general, urine stored at −70 or −80 °C is known to be stable without preservatives, and urine samples stored for more than 2.3 years show no significant change in most proteins, including albumin, or in metabolites, including creatinine [28–31].

Proteins were extracted from urine samples using an equal volume-based approach similar to ELISA [32]. This procedure for protein standardization was suitable for downstream analysis. Urinary proteins normalized by this method showed lower sample-to-sample variation and higher correlation with immunoassay results.

Albuminuria is primarily used to detect DKD in clinical practice [2,5]. Because glomeruli filter blood, albumin is a good biomarker of chronic kidney disease (CKD) caused by glomerular abnormalities but is insufficient to determine subsequent prognosis [5]. Instead, finding and measuring specific protein markers that affect pathological function has been judged more clinically meaningful [33,34]. Although causality between albuminuria and the prognostic values from the five-protein panel-based clinical models (RF and SVM) cannot be established in this retrospective study, their relationship can be examined by correlation analysis. Correlation analysis between the two classifiers and ACR in the 54 enrolled patients revealed weak, non-significant correlations (SVM: *r* = 0.086, *p*-value > 0.05; RF: *r* = 0.094, *p*-value > 0.05). Thus, no significant correlation was observed, and there was no evidence of a causal relationship. To examine the relationship more closely, we divided the patients into three classes based on ACR (normal, <30 mg/g; microalbuminuria, 30–300 mg/g; and macroalbuminuria, >300 mg/g) and plotted the predicted values of the SVM model for the two prognostic groups (Figure S1). In T2D patients with normal-range ACR or microalbuminuria, the SVM results were almost completely separated between the two groups, whereas predictive power appeared weaker in patients with macroalbuminuria. This suggests that the SVM results did not depend on the development of albuminuria in T2D patients and may be able to predict an earlier disease stage, before albuminuria develops. Moreover, the predicted values of the RF model accurately separated the two prognostic groups regardless of ACR.
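The ACR stratification described above can be written as a short sketch. The thresholds (<30, 30–300, >300 mg/g) come from the text; the function name `acr_class`, the handling of the exact boundary values, and the patient records are assumptions made for illustration and are not study data.

```python
from collections import defaultdict

def acr_class(acr_mg_per_g):
    """Map an albumin-to-creatinine ratio (mg/g) to the three classes in the text.

    Assumption: values of exactly 30 and 300 mg/g fall in the microalbuminuria class.
    """
    if acr_mg_per_g < 30:
        return "normal"
    if acr_mg_per_g <= 300:
        return "microalbuminuria"
    return "macroalbuminuria"

# Hypothetical records: (ACR in mg/g, prognostic group, SVM prediction score).
patients = [
    (12.0, "good", 0.10),
    (25.0, "poor", 0.85),
    (150.0, "good", 0.22),
    (280.0, "poor", 0.78),
    (900.0, "good", 0.55),
    (1200.0, "poor", 0.60),
]

# Group prediction scores by (ACR class, prognostic group), as one would before plotting.
scores_by_stratum = defaultdict(list)
for acr, prognosis, score in patients:
    scores_by_stratum[(acr_class(acr), prognosis)].append(score)
```

Comparing the score distributions of the two prognostic groups within each ACR class is what allows separation to be assessed independently of albuminuria status.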

As a rule, diabetic patients are persistently exposed to various metabolic and hemodynamic risks [35], with DKD resulting from multiple pathophysiological processes. Multiple-biomarker approaches using proteomics and metabolomics may better reveal the complicated disease status thought to be associated with the onset of DKD [4,8]. CKD273, a panel consisting of 273 urinary peptides currently undergoing Phase 3 testing, is a high-performance urine peptidomic classifier for CKD diagnosis [36]. Moreover, this classifier was recently validated as a predictor of the development of microalbuminuria in normoalbuminuric diabetic patients [37]. These 273 intact peptides were derived from 30 independent proteins, 24 of which were quantified in this study. CKD273, which includes cleaved collagenase peptides and SERPINA1 peptides, is a good prognostic marker, showing that the concentrations of cleaved collagenase peptides decrease and those of SERPINA1 peptides increase in the urine of patients with CKD [38,39]. The present study showed a similar pattern of abundance in the urine of PPG patients despite artificial digestion. Our approach, based on protein concentrations in urine samples, could better explain the pathological pathway associated with DKD than the peptidome approach. Indicators of kidney dysfunction include increased blood particles in urine; lysosomal dysfunction in glomerular cells [40], which is related to the autophagy-lysosome pathway [41]; abnormal heterotypic cell–cell adhesion among glomerular, tubular, and immune cell compartments, collagenase, and binding proteins (driven by rapid changes in glycolipids) [42]; and platelet activation [43].

Our clinical models consist of five selected proteins: four (CTSA, MUC1, GM2A, and SPARCL1) are high in PPG and one (ACP2) is low in PPG. Cystatin A (CTSA), in the same protein family as cystatin C, which can be used to measure kidney function [44–46], was identified in our urinary biomarker discovery study. MUC1 is a multifaceted tumor protein whose relationship with the kidney has recently been highlighted [47], and a mutant form has been identified as a cause of the Mendelian disorder medullary cystic kidney disease type 1 [48]. A meta-analysis of rat glomerular transcriptome profiling confirmed that GM2A was highly expressed in various diabetic kidney disease rat models [49]. In a mouse kidney injury model, SPARCL1 mRNA expression was unchanged in the acute phase but elevated during kidney fibrosis [50], and SPARCL1 inhibited the migration and invasion of renal cell carcinoma [51]. Lastly, ACP2, a lysosomal enzyme, is a protein used in assessments of peptiduria [52] or lysosomal enzymuria [53] to measure kidney disease in diabetic patients. In the published external kidney mRNA studies, the urine biomarkers we identified showed differential expression in kidney tissue with DKD.

This study had several limitations. First, the patient population in this study was homogeneous and the sample size was small. These results require further validation in a multiethnic cohort including larger numbers of patients to assess applicability to a wider population with T2D, a study currently in progress. Second, DKD was clinically diagnosed in the absence of renal biopsies. Third, it is unclear from which organ the urinary protein signatures are derived. More research is needed to determine whether urinary protein signatures are biomarkers of tubular damage in pathological conditions with a glomerular protein load.
