1. Introduction
A reliable assessment of the developmental potential of oocytes before fertilisation could significantly change the picture of reproductive medicine. According to the World Health Organisation, around of couples suffer from infertility, with more than at the point of fertilisation. It is believed that more than 186 million women currently have fertility problems worldwide. Similarly, the decision to have a first child has been significantly delayed across the world, exceeding 31 years of age in developed countries at the time of childbirth.
Aside from the increasingly widespread fashion for “baby-free” policies among young adults in developed countries and rare cases of congenital infertility, most fertility problems are age-related. The causes of the age- and lifestyle-related infertility epidemic include genetic defects in embryos, decreased sperm parameters, decreased ovarian reserves, endometriosis, polycystic ovarian syndrome, the hostility of cervical mucus, implantation disorders, the obstruction of the fallopian tubes, and submucosal myomas. Age-related infertility is associated with diseases that are virtually non-existent in youth and therefore do not interfere with achieving pregnancy at the physiologically intended time. This is supported by data from Hutterian reproduction, where early efforts to get pregnant, the lack of premarital sex, and the lack of influence of an economic factor on the decision to obtain a pregnancy result in a fertility problem affecting only about 2% of couples [
1].
Due to delayed attempts at pregnancy and the associated factors that lower couples' fertility, in vitro fertilisation is becoming an increasingly common treatment option. It is the most effective and sometimes the only possible method of treatment. However, its effectiveness is still far from expectations, with only limited progress over the past 20 years. According to its key performance indicators (KPIs) [2], only 75–90% of the oocytes retrieved during the in vitro fertilisation procedure are at the correct stage of development. Only about 80% of them will undergo fertilisation, of which 70% will develop to the cleavage stage by day 3 and 60% will develop to a blastocyst, of which only about 60% will be of so-called top quality (TQ). Despite these losses at every stage, we still lack the tools to initially assess the quality of collected oocytes. The availability of such information would significantly enhance the clinical decision-making process, allowing the right number of cells to be fertilised and helping to predict their development. Furthermore, it would also help to identify the exact problems affecting individual oocytes, paving the way for their personalised culture and ultimately improving the quality of the embryos produced.
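As a rough illustration of how these KPI values compound, the sketch below multiplies the stage-wise rates quoted above. It assumes (an interpretation of the text, not a statement from the cited KPIs) that the 60% blastocyst rate is taken relative to fertilised oocytes and that the TQ rate is taken relative to blastocysts; all variable names are illustrative.

```python
from fractions import Fraction

# Stage-wise rates quoted in the text. The denominators are an assumption:
# blastocyst rate per fertilised oocyte, TQ rate per blastocyst.
mature_low, mature_high = Fraction(75, 100), Fraction(90, 100)
fertilised = Fraction(80, 100)    # of mature oocytes
blastocyst = Fraction(60, 100)    # of fertilised oocytes (assumed)
top_quality = Fraction(60, 100)   # of blastocysts


def tq_yield(mature_rate):
    """Fraction of retrieved oocytes expected to become TQ blastocysts."""
    return mature_rate * fertilised * blastocyst * top_quality


low, high = tq_yield(mature_low), tq_yield(mature_high)
print(f"{float(low):.1%} to {float(high):.1%} of retrieved oocytes")
```

Under these assumptions, only about one in four to five retrieved oocytes would be expected to yield a top-quality blastocyst.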
Several non-invasive methods for assessing oocytes and embryos are currently being studied. Among these, the most commonly used, mainly due to its low cost and widespread availability, is the assessment of embryo morphology performed by embryologists or using automated methods. Metabolomics is also being used experimentally. The metabolic profiling of culture media containing human oocytes can provide information on the metabolic state of the cells, although this requires the integration of automated, high-throughput, real-time metabolomic assessments with microfluidic platforms. However, the most promising is the analysis of the human follicular fluid (hFF) proteome, which can provide a set of indicators of oocyte health based on the presence or absence of specific proteins. It is considered the most promising because of the identification and quantification of hundreds of proteins in a single assay, providing a broad picture of the biological state of the oocytes [
3]. Proteins are key effectors of cellular function. Unlike genomic information, their presence and concentrations directly affect the functionality of the oocyte and its ability to develop into further stages. Previous studies have shown that specific protein profiles in hFF can be correlated with oocyte quality and pregnancy outcomes, offering direct and functional links with oocyte developmental competence [4,5]. Unfortunately, most studies to date have failed to address two limitations of follicular fluid spectrometric studies: their cost and the availability of fluid samples identified and linked to the embryonic development of the originating oocyte. As a result, most of these studies have been based on small patient groups and samples of their follicular fluid obtained from the largest follicle or from pooled follicular fluids from a given patient.
It therefore seems crucial to obtain information linking the proteomic composition of the follicular fluid with the quality of the oocyte and its development after fertilisation. Hence, the aim of this study was to separately examine obtained follicular fluids and identified oocytes and to assess the subsequent development of embryos derived from them.
3. Discussion
In the present study, we investigated the relationship between the composition of the proteins detected and quantified in the follicular fluids and the development of embryos from the derived oocytes. We studied healthy oocyte donors whose cells were fertilised. The male factor is very important and genetically affects 50% of the material. It is less important for the metabolism of the embryo, especially up until the third day of development. After that, full genome activation begins and the influence of the sperm’s DNA becomes visible.
For this reason, we treated the male factor in our study very restrictively. We excluded any patient whose semen parameters deviated from the norms described in the WHO Manual 2021. We also took into consideration the issue of sperm DNA fragmentation, which, in our experience, has a major impact on embryo development. We implemented the TUNEL method based on cytometry. Whilst the generally accepted norm is less than 15%, in this paper we adopted a value of 12% based on our own observations, as we have observed that up to this level of fragmentation we obtain the best embryos.
In principle, three scenarios can be considered: studying the whole material; removing the most abundant proteins so as not to obscure the signal of the less abundant proteins; and focusing on the most regulatory proteins, such as growth factors, hormones, and key regulators of metabolic pathways, using labelled proteins, for which we are preparing an analysis. For our study, we opted for the middle ground—removing proteins with a significant quantitative advantage by immunodepleting approximately 94% of a total of 14 proteins (albumin, IgG, antitrypsin, IgA, transferrin, haptoglobin, fibrinogen, alpha2-macroglobulin, alpha1-acid glycoprotein, IgM, apolipoprotein AI, apolipoprotein AII, complement C3, and transthyretin). We decided to conduct research in this direction because we wanted to obtain a broad overview of the proteins detectable in hFF with the intention of evaluating them in the context of predicting the quality of the oocytes and the resulting embryos. This result has been achieved, as we identified more than 2000 proteins using this approach, and the creation of such a large collection is a very good result compared to the literature data on hFF [
4,14,15,16]. To avoid problems arising from even small shifts in the chromatogram, calibration peptides (iRT peptides) were added to each sample. In addition, removing a significant amount of the above proteins made it possible to quantify and compare the remaining proteins in the study groups.
This also made it possible to study subtle differences in protein abundance in hFFs which are associated with embryonic development. The decision to use the Random Forest to analyse this type of data was based on its ability to detect weak signals in noisy data and also on several methodological considerations related to the nature of mass spectrometry data.
Firstly, this type of data frequently contains outliers, which can significantly disrupt the performance of many classification algorithms. Random Forests inherently mitigate the impact of outliers due to their ensemble approach, where the aggregation of multiple decision trees reduces the influence of any single aberrant data point on the final prediction [
17]. Moreover, the distribution of protein abundances in our dataset varies considerably, with many proteins exhibiting log-normal distributions, while others do not conform to any specific parametric form. Random Forests, as a nonparametric method, do not impose assumptions about the data’s distribution, making them particularly suitable for this heterogeneity. Additionally, Random Forests provide interpretable models through various feature importance metrics, enabling a clearer understanding of the influence of different proteins on the classification outcomes.
In the field of bioinformatics, it is common to encounter datasets characterised by a large number of features and a relatively small number of samples, often including irrelevant variables. Random Forests excel in such high-dimensional settings by effectively managing and utilising a large number of input variables. This adaptability is crucial for enhancing predictive accuracy in scenarios with many irrelevant or noisy features.
Random Forests also demonstrate robust predictive performance in the presence of predominantly noisy variables. The ensemble nature of this algorithm helps in reducing the risk of overfitting, as the errors of individual trees tend to cancel each other out, thereby increasing generalisability of the model. Their consistently high predictive power has positioned Random Forests among the top-performing algorithms in various comparative evaluations. Their ability to extract meaningful insights from complex and noisy biological datasets highlights their utility and effectiveness in bioinformatics research. This performance parity, combined with the added benefits of their interpretability and feature selection, underscores the suitability of Random Forests for tasks requiring both high accuracy and transparency in decision making [
18]. The efficacy of Random Forests in classifying biological data is well documented. Numerous studies have successfully applied this method to classify and analyse various types of biological datasets, validating its robustness and reliability in the domain of bioinformatics [18,19,20,21]. Finally, Random Forests facilitate feature selection, which is crucial for identifying the most relevant genes or proteins associated with different biological categories. Methods such as Recursive Feature Elimination (RFE) provide valuable insights into the most influential features within a dataset, aiding in the interpretation and understanding of bioinformatics data [22,23,24].
Our findings highlighted several key relationships between protein abundance in hFFs and embryo quality. Dickkopf-related protein 3 was most abundant in hFFs associated with the highest quality embryos. In contrast, immunoglobulin heavy constant alpha 1 and moesin were most abundant in hFFs associated with poor-quality embryos. Transthyretin had the lowest abundance in hFFs associated with fair-quality embryos.
Interestingly, some proteins, including transthyretin, exhibited their lowest or highest abundance in hFFs associated with fair-quality embryos, with correspondingly higher or lower levels in hFFs associated with both good- and poor-quality embryos. This surprising observation may be explained by differences in the biological processes that influence the trophectoderm’s quality (which is associated with fair embryo quality) versus those impacting overall blastocyst development.
Significant differences in protein abundance were observed between the hFFs from different patients. In some cases, follicular fluids from the same patient had very similar levels of certain proteins, such as immunoglobulin heavy constant alpha 1. In other cases, significant variance was not related to individual patients, as seen with dickkopf-related protein 3. This pattern might be due to differences in protein abundance and associated relative measurement error differences, which warrants further investigation.
Unfortunately, most studies to date have failed to address two limitations of the spectrometric testing of follicular fluid: its cost and the availability of fluid identified and linked to the embryonic development of the originating oocyte. Studies to date have relied on the examination of a single follicular fluid from the largest follicle or of pooled follicular fluid from a given patient. Each approach introduces its own bias. When testing fluid from the largest follicle, only one fluid is tested, and this may often not be representative. Ovulation stimulation is often an art of compromise between the number of oocytes obtained and their quality. Quite often, it is necessary to sacrifice the largest follicles (which have exceeded their optimum size and thus stage of development) to allow the growth of a greater number of smaller follicles that still need time to mature. Hence, the fluid obtained may come from a follicle with a worse-than-average prognostic status. At the same time, the development of the embryos derived from these fluids is not followed, resulting in the loss of a direct link between the test result and the experimental outcome. On the other hand, combining hFFs does not allow the results to be linked to embryo development (except in rare cases where all follicles develop equally), while introducing a lot of contamination into the study due to mixing fluids containing oocytes at completely different stages of development.
Therefore, the main strength of our study is the material analysed. We were able to collect hFFs from individual ovarian follicles, label them unambiguously, and link them to oocyte quality and development after fertilisation through their individual culture. This allowed us not only to assess the differences in hFF composition between individual donors, but also to investigate the variability in protein composition between individual follicles within the same organism.
A limitation of our study, as with most proteomics studies, is the number of samples tested. Nevertheless, we examined 110 samples, in biological triplicate, from 50 oocyte donors, which is sufficient to start looking for protein differences between the hFFs from oocytes from which we obtained embryos of different qualities. Additionally, the chosen laboratory workflow for the proteomic studies included several factors, such as immunodepletion effects and peptide ion suppression, which could have affected the accuracy of protein quantification. The subsequent analysis used the Random Forest algorithm, which tends to exclude highly correlated features. These sources of bias may have led to the omission of some biomarkers in our study.
The evaluation of follicular fluids requires further research and the results should be collected in databases for comparative re-analysis. Therefore, it seems important to collect individual follicular fluids and to observe the developing embryos derived from them. This will make it possible to modify their stimulation according to its progress and to individualise the culture media according to the metabolic state of the retrieved oocyte.
4. Materials and Methods
4.1. Flow Chart of Patient Recruitment and Fluid Collection and Examination
The study, designed in 2019, was conducted at the Medical University of Gdansk and the Invicta fertility clinics. Donors were deemed to be eligible for the study when it was known that the cells would be fertilised with semen meeting the WHO standards. The women were qualified for in vitro fertilisation due to their willingness to be egg donors. The exclusion criteria were as follows: patients under 18 years of age and over 35 years of age; a sperm donor with reduced semen parameters (below WHO 5th edition standards [
25]); and sperm DNA fragmentation, determined cytometrically by the TUNEL method, above 12%. Cases with abnormal oocyte fertilisation results in previous cycles were also excluded. Due to the pandemic and difficulties in accessing the material, sample collection was extended until early 2023. We recruited 75 egg donors to this study. During follicular fluid collection, in 21 cases individual follicular fluids could not be completely separated from each other. We excluded these cases from further testing. In four additional cases, there were doubts about the compatibility of individual oocytes with their follicular fluids due to possible mislabelling. We finally included 50 donors in the study, and we individually collected a total of 388 cumuli and secured follicular fluids from a minimum of 2 and a maximum of 11 of their follicles. We studied 110 follicular fluids from 50 donors, with 2 to 3 fluids per donor (see
Table 3).
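The recruitment numbers above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check on the figures quoted in this section; it assumes that every included donor contributed exactly 2 or exactly 3 analysed fluids, and the variable names are illustrative.

```python
recruited = 75
excluded_unseparated = 21    # follicular fluids could not be fully separated
excluded_mislabelled = 4     # doubts about oocyte/fluid matching
included = recruited - excluded_unseparated - excluded_mislabelled

fluids_total = 110
# With a donors giving 2 fluids and b donors giving 3 fluids:
#   a + b = included  and  2*a + 3*b = fluids_total
donors_with_3 = fluids_total - 2 * included
donors_with_2 = included - donors_with_3
print(included, donors_with_2, donors_with_3)
```

Under that assumption, the quoted totals imply 40 donors with two analysed fluids and 10 donors with three.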
The experiments conducted are part of the project entitled “Identification of Biomarkers of Early Embryonic Development and Pregnancy”, which was approved by the Independent Bioethics Committee at the Medical University of Gdansk (decision 62/2016). All oocyte donors were informed about the protocol and consented to participating in the study. Their written consent obtained also included their permission to publish data related to their treatment, provided that patient anonymity was maintained.
4.2. IVF Procedure and Embryo Development
4.2.1. Stimulation
All patients were treated with in vitro fertilisation (IVF) using short-protocol stimulation [
26]. Before starting stimulation, ultrasound and hormonal tests were performed to exclude the presence of dominant follicles and to verify that peripheral blood hormone levels were as follows: oestradiol below 50
LH below 6
and progesterone below 0.5
Once the effect of a premature recruitment of the dominant follicle had been ruled out, stimulation with gonadotropins was initiated. Menopausal gonadotropins (Menopur, Ferring) with equal FSH and LH activity were used. Dosing was based on the patient’s baseline AMH level (in the range of 150 to 225
per day) with 0.05
triptorelin administered subcutaneously from the first day of stimulation. On the eighth day of stimulation, the stimulation dose was adjusted to prepare for oocyte retrieval. Stimulation was terminated after obtaining at least 3 follicles with a diameter of more than 18
with the administration of 5000
of hCG intramuscularly (Pregnyl, MSD) for final oocyte maturation 36 h before oocyte retrieval.
4.2.2. Oocyte Retrieval (Pick Up) and Collection of Samples
The oocyte retrieval procedure was performed under brief general anaesthesia with Propofol and Fentanyl. Oocytes were retrieved using disposable oocyte retrieval needles (Gynemed, Sierksdorf, Germany) under the control of ultrasound images obtained using the IC-9-RS vaginal transducer and the GE Voluson P6. The fluid collected from the ovarian follicles was immediately transferred to the embryologist, who continuously reported on the cumuli obtained so far (clusters of granulosa cells from the released ovarian thalamus that may contain an oocyte). If no oocyte was obtained from a given follicle, the attempt was repeated by rinsing the follicle with the same fluid and retrieving it again. After the procedure, the samples were filtered through a 5 mesh at room temperature to remove the erythrocytes, white blood cells, and granulosa cells. The fluid was collected and stored at −20 °C for further analysis. The oocytes were kept separately and labelled with the same number as the collected and frozen fluid.
4.2.3. Embryo Culture
The cumuli obtained were stored at 6% CO2 and low oxygen pressure (5% O2) at 37 °C in incubators (Labotect C18) inside laminar chambers (Lamil 90 or 120). All oocytes were stripped of their surrounding granulosa cells (decoronisation) 2 to 5 h after collection. Their maturity was then graded on a scale: mature cells in the metaphase of their second meiotic division (MII), immature cells in the metaphase of their first meiotic division (MI), immature cells at the germinal vesicle (GV) stage, overripe (atretic) cells, and no oocyte in the cumulus. Only mature cells were fertilised. Immature cells, on the other hand, were subjected to further culture in oocyte maturation medium. After one day, their maturity was assessed and additional mature cells were fertilised. In vitro fertilisation was performed by micromanipulation (intracytoplasmic sperm injection, ICSI). The systems used consisted of Nikon Te2000S, U, or E inverted microscopes equipped with Hoffman modulation contrast and Eppendorf NK2 micromanipulators. Heating tables (Okolab, Pozzuoli, Italy) were used to provide full heating of the surface of the ICSI dishes mounted on three-plate microscope tables. Micromanipulator pumps from Eppendorf (Leipzig, Germany) were also used: an air pump to hold the egg (CellTram Air) and an oil pump with extra precision to deliver the sperm into the oocyte (CellTram vario). The entire procedure was carried out with full video documentation, which was analysed by the embryology team as part of the quality control activities of the procedure.
After fertilisation, the cells were cultured in Labotect C18 incubators for a further 5 to 7 days until full maturation (blastocyst formation) or developmental arrest and the onset of apoptosis. Their culture was performed in G1 and G2 sequential media (Vitrolife, Gothenburg, Sweden). Embryos were assessed on day 1 of culture (evaluation of fertilisation and rejection of abnormally fertilised cells), day 3 (evaluation of cell divisions; Cummins classification [27]), and day 5 (blastocyst maturity; Istanbul criteria).
4.3. Sample Preparation
The experiments included comparative qualitative and quantitative studies and spectral library preparation for the SWATH-MS quantification on our samples. The process of optimising the sample preparation method and instruments’ operation was carried out in several steps. The entire process is summarised in
Table 4. In brief, after thawing, the hFF was additionally centrifuged at 1000×
g for 10 min to separate all morphological structures (cellular debris). Working on a chromatographic system with microfluidics, we had to take additional steps to obtain as many proteins as possible for the library. We used a MARS 14 column (Agilent, Santa Clara, CA, USA) to immunodeplete proteins present at high concentrations. The samples were not fractionated. Protein concentrations were measured using a spectrophotometer by quantifying their absorbance at 280
Protein material was digested with trypsin (1:50 enzyme to protein weight ratio) using a standard Filter-Aided Sample Preparation (FASP) procedure [
28] on a Microcon with 30
of cut-off membrane (Merck-Millipore, Burlington, MA, USA). The Multienzyme Digestion (MED) FASP procedure involved three consecutive digestions with LysC (1:50), trypsin (1:100), and chymotrypsin (1:100) (all enzymes from Promega Corporation, Madison, WI, USA). First, the hFF was lysed using a buffer containing 1% sodium dodecyl sulphate (SDS) and 50
dithiothreitol (DTT) in 100
Tris-HCl at pH 8 for 10
at 95 °C (all reagents from Sigma-Aldrich, St. Louis, MO, USA). A total of 100
of protein was applied to each filter. Briefly, the filters were washed several times with a buffer containing 8
urea in 100
Tris-HCl pH 8.5 by centrifugation at 10,000×
g for 20
Proteins were alkylated with 55
iodoacetamide (IAA, Sigma-Aldrich, St. Louis, MO, USA) for 20
at room temperature in the dark. Finally, traces of IAA and urea were washed away with 100
Tris-HCl pH 8.5 and the enzyme was added to the filters for overnight digestion at 37 °C. The resulting peptides were eluted with 100
Tris-HCl pH 8.5. In the case of MED-FASP, the filters were placed in new tubes and the digestion and elution steps were repeated with different enzymes. Digestion with chymotrypsin was carried out for 3
in a buffer containing 10
CaCl2 in 100
Tris-HCl pH 7.8. The resulting proteolytic peptides were fractionated by RP-HPLC (Reversed-Phase High-Performance Liquid Chromatography) at high pHs and desalted using the STAGE (STop And Go Extraction) tip procedure [
29] on in-house prepared tips filled with C18 solid phase (3M™ Empore™, St. Paul, MN, USA). Briefly, 10
of peptide was added to the tip, which was previously equilibrated with 1% acetic acid in water. After washing, the peptides were eluted with a buffer containing 60% acetonitrile (ACN)/1% acetic acid in water and evaporated in a SpeedVac to obtain volumes ready for Mass Spectrometry (MS) measurements (5
for Q Exactive HF-X or 10
for Triple TOF 5600+). To avoid problems caused by even small shifts in the chromatogram, calibration peptides (iRT peptides) were added to each sample. The iRT (indexed retention time) kit (Biognosys, Zurich, Switzerland) was spiked with samples used for SWATH-MS spectral library preparation or SWATH-MS quantification at a 1:10 standard to sample volume ratio to perform retention time calibration. This allowed for the generation of a collection of over 2000 proteins.
4.4. LC-MS/MS Measurements and Quantitative Data Processing
The LC-MS/MS measurements for the Triple Quad-TOF workflow were acquired on the TripleTOF 5600+ hybrid mass spectrometer with a DuoSpray Ion Source (AB SCIEX, Framingham, MA, USA) coupled with the Eksigent microLC (Ekspert MicroLC 200 Plus System, Eksigent, Redwood City, CA, USA). Samples were loaded onto the LC column using the CTC Pal Autosampler (CTC Analytics AG, Zwinger, Switzerland), using a 5
injection. Buffers A and B consisted of 0.1% (v/v) formic acid in water and ACN, respectively. LC separations were performed on the ChromXP C18CL column (3
, 120
,
; Eksigent, Redwood City, CA, USA) using a gradient of 8–40% Buffer B over 30
with a flowrate of 5
. All measurements were performed in a positive ion mode. The system was controlled by the Analyst TF 1.7.1 software (AB SCIEX, Framingham, MA, USA). Data-dependent acquisition (DDA) analyses consisted of a 250
TOF survey scan in the m/z range of 400–1000
followed by a 100
Product Ion scan in the m/z range of 100–1500
, which resulted in a 2.3
cycle time. The top 20 candidate ions with charge states from 2 to 5 were selected for collision-induced dissociation (CID) fragmentation with rolling collision energy. Former target ions were excluded after 2 occurrences for 5
SWATH-MS [
30] analyses were performed in a looped product ion mode. A set of 25 variable-width windows was constructed via equalized ion frequency distribution with the use of SWATHTuner [
31] to cover the m/z range of 400–1000
The collision energy of each window was calculated for +2 to +5 charged ions centred on the window, with a spread of 5. The SWATH-MS1 survey scan was acquired in high-sensitivity mode in the range of 400–1000
at the beginning of each cycle, with an accumulation time of 50
, and it was followed by 40
accumulation time high-sensitivity product ion scans, which resulted in a total cycle time of 1.1
The database search for spectral library construction was performed in ProteinPilot 4.5 software (AB SCIEX, Framingham, MA, USA) using the Paragon algorithm against the SwissProt Homo sapiens database (ver. 26.07.2019; 20,428 entries) merged with the iRT standard sequence and the following parameters: a TripleTOF 5600+ instrument (AB SCIEX, Framingham, MA, USA); the alkylation of cysteines by iodoacetamide; trypsin enzyme digestion, an ID focus on biological modifications; the search effort “thorough ID”; and a threshold of detected proteins [Conf] > 10%. The resulting group file was loaded into MS/MS All with SWATH Acquisition MicroApp 2.01 in PeakView 2.2 (AB SCIEX, Framingham, MA, USA) to automatically create a spectral library with the following set parameters: modified peptides allowed and shared peptides excluded. The library was processed via SWATH-MS measurements of the samples. Retention time calibration was performed manually with the use of iRT kit peptides. The maximum number of peptides per protein was 6 and the extracted ion chromatogram (XIC) parameters were set to a 10
extraction window width and 75
XIC width. The sample preparation workflow and the final results are summarised in
Table 4.
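The cycle times reported for the two acquisition modes can be related to the scan settings with simple arithmetic. The units of the scan times did not survive in the text above, so the sketch below assumes milliseconds for the accumulation times and seconds for the cycle times; the function name is illustrative.

```python
def cycle_time_ms(survey_ms, n_product_scans, product_ms):
    """Nominal cycle time: one survey scan plus all product ion scans."""
    return survey_ms + n_product_scans * product_ms


# DDA: one survey scan (250) followed by the top 20 product ion scans (100 each).
dda_ms = cycle_time_ms(250, 20, 100)
# SWATH-MS: one survey scan (50) followed by 25 variable windows (40 each).
swath_ms = cycle_time_ms(50, 25, 40)
print(dda_ms / 1000, swath_ms / 1000)
```

Under these assumptions the nominal sums are 2.25 and 1.05, close to the reported cycle times of 2.3 and 1.1; the small difference would be consistent with per-scan overhead.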
There were two normalisation steps involved. First, the spectra of individual samples were normalised in MarkerView using total area sums. Finally, in the second step, SWATH-MS intensities were normalised in Perseus at the level of all samples.
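The first normalisation step can be illustrated with a minimal sketch of total area sum scaling, in which every sample is rescaled so that its summed intensity matches a common target (here the mean total across samples). This is an assumption about the general technique, not a reproduction of the MarkerView implementation, and the sample and protein names are hypothetical.

```python
def total_area_sum_normalise(samples):
    """Scale each sample so that its intensities sum to the mean total."""
    totals = {name: sum(values.values()) for name, values in samples.items()}
    target = sum(totals.values()) / len(totals)
    return {
        name: {protein: value * target / totals[name]
               for protein, value in values.items()}
        for name, values in samples.items()
    }


# Two hypothetical samples with different overall signal.
samples = {
    "hFF_A": {"P1": 100.0, "P2": 300.0},
    "hFF_B": {"P1": 300.0, "P2": 500.0},
}
normalised = total_area_sum_normalise(samples)
print(normalised)
```

After scaling, both samples carry the same total signal, so differences between them reflect relative protein abundance rather than the amount of material injected.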
4.5. The Random Forest Algorithm
The protein abundances obtained from the SWATH-MS workflow were analysed using the Random Forest classifier, which is a versatile and powerful ensemble learning algorithm. Its primary purpose is to create a classification scheme for samples based on features (such as protein abundances) in order to predict associated labels (e.g., embryo quality; see
Section 2.3.1).
The algorithm works by constructing a multitude of decision trees during training, each trained on a different random subset of the dataset. By combining the predictions of these individual trees through averaging, it enhances predictive accuracy and mitigates the risk of overfitting.
A single decision tree within the Random Forest ensemble is constructed using a process that involves recursively partitioning the input feature space based on the values of different features. Here we present an overview of how a single tree is created:
Initialization: The tree starts with a root node that contains a random subset $Q_0$ of the training samples.
Best feature selection: Let us denote the subset of samples under consideration in the training phase of the $m$th node as $Q_m$. First, a random subset $F_m$ containing $n$ features is created (hyperparameter $n$ is kept fixed throughout training and is usually set to be the square root of the number of all features). In the next step, the best of the selected features is found based on a chosen criterion (e.g., a Gini impurity or entropy, cf. Section 2.3.1) called impurity function $H$.
Splitting: The set of samples $Q_m$ is split into two subsets, $Q_m^{\mathrm{left}}$ and $Q_m^{\mathrm{right}}$. The feature $x \in F_m$ and the threshold $t_m$ are selected to minimise the mean impurity:
$$G(Q_m, x, t_m) = \frac{|Q_m^{\mathrm{left}}|}{|Q_m|}\, H\!\left(Q_m^{\mathrm{left}}\right) + \frac{|Q_m^{\mathrm{right}}|}{|Q_m|}\, H\!\left(Q_m^{\mathrm{right}}\right),$$
where
$$Q_m^{\mathrm{left}} = \{\, s \in Q_m : x(s) \le t_m \,\}, \qquad Q_m^{\mathrm{right}} = Q_m \setminus Q_m^{\mathrm{left}}.$$
The goal of the chosen criterion metric $H$ is to maximise the homogeneity of the target variable (i.e., ideally within each subset we would like to have samples belonging mostly to the same class). This process of splitting the dataset yields two new nodes of the decision tree and is repeated recursively for each subset until a stopping criterion is met.
Stopping Criterion: The recursion stops when one of the following conditions is met:
The maximum tree depth is reached;
The number of samples in the current node falls below a certain threshold;
Further splitting does not lead to significant improvement in the chosen metric.
Leaf Nodes: Once the stopping criterion is met, the current node becomes a leaf node, and it is assigned a probability distribution based on the distribution of labels.
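The feature selection and splitting steps above can be sketched in plain Python. This is a minimal illustration, assuming the Gini impurity as the impurity function and candidate thresholds taken at the observed feature values (production implementations typically use midpoints); the function names and the toy data are hypothetical.

```python
import random


def gini(labels):
    """Gini impurity 1 - sum_k p_k^2 for a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(samples, n_candidates, rng):
    """Return (mean impurity, feature index, threshold) of the best split.

    samples: list of (feature_vector, label) pairs held by the node.
    n_candidates: size of the random feature subset examined at this node.
    """
    n_features = len(samples[0][0])
    candidates = rng.sample(range(n_features), n_candidates)
    best = None
    for j in candidates:
        for t in sorted({x[j] for x, _ in samples}):
            left = [y for x, y in samples if x[j] <= t]
            right = [y for x, y in samples if x[j] > t]
            if not left or not right:
                continue  # a split must send samples to both children
            g = (len(left) * gini(left) + len(right) * gini(right)) / len(samples)
            if best is None or g < best[0]:
                best = (g, j, t)
    return best


# Toy node: feature 0 separates the two classes, feature 1 is noise.
node = [([0.1, 5.0], "good"), ([0.2, 1.0], "good"),
        ([0.8, 4.0], "poor"), ([0.9, 2.0], "poor")]
impurity, feature, threshold = best_split(node, n_candidates=2, rng=random.Random(0))
print(impurity, feature, threshold)
```

In this toy node the split on feature 0 at 0.2 produces two pure children, so its mean impurity is zero and it is selected over any split on the noisy feature.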
When making predictions for a new sample using a single decision tree within a Random Forest ensemble, the following two steps are typically followed.
Traversal: The new sample is passed down the tree starting from the root node. At each node, the tree evaluates a specific feature of the sample based on the splitting threshold learned during training. Then, the sample is directed either to the left or right child node of the current node. This process continues recursively, with the sample traversing down the tree from one node to another until it reaches a leaf node.
Leaf Node Prediction: Once the sample reaches a leaf node, the tree assigns a probability distribution associated with the node. If our goal is to predict a single class based on the feature, the class with the highest probability is taken.
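The two prediction steps, together with the averaging over trees discussed in the next paragraph, can be sketched as follows. The nested-dictionary tree representation, the class labels, and the probabilities are hypothetical and chosen only for illustration.

```python
def predict_tree(node, x):
    """Walk from the root to a leaf and return the leaf's class distribution."""
    while "leaf" not in node:
        branch = "left" if x[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["leaf"]


def predict_forest(trees, x):
    """Average the leaf distributions of all trees and pick the top class."""
    distributions = [predict_tree(tree, x) for tree in trees]
    classes = sorted({c for dist in distributions for c in dist})
    mean = {c: sum(dist.get(c, 0.0) for dist in distributions) / len(trees)
            for c in classes}
    return max(mean, key=mean.get), mean


# Two hand-built trees, each with a single split.
tree_a = {"feature": 0, "threshold": 0.5,
          "left": {"leaf": {"good": 0.9, "poor": 0.1}},
          "right": {"leaf": {"good": 0.2, "poor": 0.8}}}
tree_b = {"feature": 1, "threshold": 2.0,
          "left": {"leaf": {"good": 0.4, "poor": 0.6}},
          "right": {"leaf": {"good": 0.7, "poor": 0.3}}}
label, probs = predict_forest([tree_a, tree_b], [0.3, 3.0])
print(label, probs)
```

Averaging probabilities rather than counting votes lets a confident tree outweigh an uncertain one, which is the behavioural difference from majority voting.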
It is important to note that each decision tree in the Random Forest ensemble makes an independent prediction for the new sample. In classification tasks, the final prediction of the Random Forest classifier is determined by aggregating the predictions of all the trees in the ensemble by averaging (note that this approach is slightly different from the original one, where majority voting is used, see [
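32]). An advantage of Random Forests is their ability to rank features by assigning importance to each feature. Typically, the Mean Decrease in Impurity (MDI) is used as an estimate of feature importance. This can be defined separately for each feature $x$ in every individual tree $T$:
$$\mathrm{MDI}_T(x) = \sum_{m \in T:\ m \text{ splits on } x} w_m \, \Delta H_m.$$
The above sum is calculated over the nodes $m$ splitting the samples $Q_m$ into the two subsets $Q_m^{\mathrm{left}}$ and $Q_m^{\mathrm{right}}$ and using feature $x$ in their splitting criterion. Then, the decrease in impurity $\Delta H_m$ for node $m$ is calculated to be
$$\Delta H_m = H(Q_m) - \frac{|Q_m^{\mathrm{left}}|}{|Q_m|}\, H\!\left(Q_m^{\mathrm{left}}\right) - \frac{|Q_m^{\mathrm{right}}|}{|Q_m|}\, H\!\left(Q_m^{\mathrm{right}}\right).$$
Next, the weight $w_m$ of each node considered in the sum is defined as:
$$w_m = \frac{w(Q_m)}{w},$$
where $w(Q_m)$ is the sum of the weights of the samples in $Q_m$ and $w$ denotes the sum of the weights associated with all samples in the training dataset.
Finally, the feature importance of $x$ over the whole Random Forest is defined by simply averaging $\mathrm{MDI}_T(x)$ over all trees $T$ in the ensemble.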
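A small worked example may make the MDI definition concrete. The sketch below assumes unit sample weights, so that the node weight reduces to the fraction of training samples reaching the node; the tree, the feature names "A" and "B", and the helper function are hypothetical.

```python
def mdi(nodes, n_total):
    """Mean Decrease in Impurity per feature for a single tree.

    nodes: internal nodes, each a dict with the splitting feature, the node's
    sample count n and impurity H, and the counts/impurities of both children.
    Sample weights are assumed to be 1, so the node weight is n / n_total.
    """
    importance = {}
    for m in nodes:
        n = m["n"]
        decrease = (m["H"]
                    - m["n_left"] / n * m["H_left"]
                    - m["n_right"] / n * m["H_right"])
        importance[m["feature"]] = (importance.get(m["feature"], 0.0)
                                    + n / n_total * decrease)
    return importance


# Root: 4 "good" + 4 "poor" samples (Gini 0.5). Splitting on A sends 2 "good"
# left (pure) and 2 "good" + 4 "poor" right (Gini 4/9). Splitting the right
# child on B yields two pure leaves.
tree = [
    {"feature": "A", "n": 8, "H": 0.5,
     "n_left": 2, "H_left": 0.0, "n_right": 6, "H_right": 4 / 9},
    {"feature": "B", "n": 6, "H": 4 / 9,
     "n_left": 2, "H_left": 0.0, "n_right": 4, "H_right": 0.0},
]
scores = mdi(tree, n_total=8)
print(scores)
```

Because all leaves of this toy tree are pure, the importances (1/6 for A and 1/3 for B) sum to the root impurity of 0.5, which is a convenient sanity check on the bookkeeping.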
4.6. Recursive Feature Elimination with Cross-Validation: Algorithm Description
In our analysis, we employed a version of RFECV (Recursive Feature Elimination with Cross-Validation) implemented in the Python library scikit-learn v1.2.1. This method requires the classifier used to be capable of computing feature importances, a criterion met by the Random Forest classifier (cf.
Section 4.5). Below is a brief description of the algorithm:
1. Data Partitioning: The data are divided into folds, where each fold uses samples from one patient as test data and the remaining samples to train the classifier. The number of folds equals the number of patients, ensuring each patient’s samples are used as test data exactly once.
2. Feature Elimination: For each fold, the RFE algorithm begins by iteratively removing features. First, the classifier is fitted to compute feature importances. Then, the least important feature is removed, and the model’s score is calculated using the fold’s test data. This process is repeated until only one feature remains.
3. Score Averaging: The scores calculated for each fold and each number of features during step 2 are averaged to obtain mean scores as a function of the number of features. The optimal number of features, $n_{\mathrm{opt}}$, is defined as the number with the highest mean score.
4. Final Model Fitting: Finally, the classifier is fitted over the entire dataset, and the $n_{\mathrm{opt}}$ features with the highest importance are selected.
5. Iteration: Steps 1 to 4 are repeated a predefined number of times or until only one feature remains.
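The data partitioning and score averaging steps can be sketched without any library, as below. The helper names, patient identifiers, and per-fold scores are hypothetical and serve only to show the bookkeeping.

```python
def leave_one_patient_out(patient_ids):
    """Yield (train_idx, test_idx), holding out one patient's samples per fold."""
    for patient in sorted(set(patient_ids)):
        test = [i for i, p in enumerate(patient_ids) if p == patient]
        train = [i for i, p in enumerate(patient_ids) if p != patient]
        yield train, test


def optimal_feature_count(fold_scores):
    """fold_scores[f][k - 1] is the score of fold f with k features retained."""
    n_folds = len(fold_scores)
    mean = [sum(column) / n_folds for column in zip(*fold_scores)]
    best = max(range(len(mean)), key=mean.__getitem__)
    return best + 1, mean


# Seven samples from three patients give three folds.
patients = ["P1", "P1", "P2", "P2", "P2", "P3", "P3"]
folds = list(leave_one_patient_out(patients))

# Hypothetical per-fold scores for 1, 2 and 3 retained features.
n_opt, mean_scores = optimal_feature_count([[0.5, 1.0, 0.5],
                                            [0.0, 1.0, 1.0],
                                            [1.0, 0.5, 0.0]])
print(len(folds), n_opt)
```

Keeping all samples of a patient in the same fold prevents fluids from one donor from appearing in both the training and test data, which would otherwise inflate the scores.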
This algorithm also allows us to assign scores to the selected features. At each iteration in step 4, a subset of features is selected. When a feature is selected, its score is incremented by one. Thus, features that persist longer throughout the iterations will accumulate higher scores.
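A minimal sketch of this procedure with scikit-learn's `RFECV` is given below. It is not the study's code: the synthetic data, the number of trees, the three outer iterations, and the choice to rerun `RFECV` on the previously selected features at each iteration are all assumptions made for illustration. Patient-wise folds are obtained with `LeaveOneGroupOut`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import LeaveOneGroupOut

# Synthetic stand-in for protein abundances: 10 patients, 2 samples each.
rng = np.random.default_rng(0)
n_patients, per_patient, n_features = 10, 2, 6
groups = np.repeat(np.arange(n_patients), per_patient)   # patient id per sample
y = np.tile([0, 1], n_patients)                          # toy quality labels
X = rng.normal(size=(len(y), n_features))
X[:, 0] += 3.0 * y                                       # feature 0 carries signal

scores = np.zeros(n_features, dtype=int)
current = np.arange(n_features)          # indices of features still in play
for _ in range(3):                       # number of iterations (assumed)
    selector = RFECV(
        RandomForestClassifier(n_estimators=25, random_state=0),
        step=1,                          # remove one feature at a time
        cv=LeaveOneGroupOut(),           # one fold per patient
        scoring="accuracy",
    )
    selector.fit(X[:, current], y, groups=groups)
    current = current[selector.support_]
    scores[current] += 1                 # each selected feature gains one point
    if len(current) == 1:
        break
print(scores)
```

Features that keep being selected across iterations accumulate the highest scores, mirroring the scoring rule described above.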
An original version of the RFE approach evaluates feature importance using a support vector machine (SVM) model, selecting features for elimination based on their ranked importance [
33]. This method can also be adapted for other models such as Random Forests (RFs), which have intrinsic mechanisms for evaluating feature importance [
34,
35,
36].