1. Introduction
Reference intervals play a crucial role in the medical interpretation and statistical evaluation of laboratory results [
1]. By definition, reference intervals include the central 95% of results measured in non-diseased reference individuals [
2]. It is mandatory that laboratories verify the reference limits obtained from external sources such as assay inserts or handbooks before using them for routine clinical care [
2,
3]. In principle, this requirement is independent of the size of the laboratory, but it is clear that small laboratories with low test numbers and higher economic pressure are more challenged here than large laboratory institutions.
In conventional approaches, the lower and upper limits of reference intervals are determined using direct methods that involve collecting laboratory results from apparently healthy individuals and calculating the 2.5th and 97.5th percentiles with parametric or non-parametric methods [
2]. Although direct methods are currently considered the guideline-compliant “gold standard”, their practical application is challenging due to the cost and time issues for recruiting a sufficient number of well-defined reference individuals, as well as ethical restrictions, especially in small children, and difficulties in excluding “non-healthy” outliers [
3,
4,
5].
This is the reason why the gold-standard procedure is only binding for the de novo definition of reference limits, while for the mere verification of otherwise defined limits, the guideline recommends a simplified approach, which just verifies that no more than two out of twenty values measured in healthy individuals fall outside the given limits [
2]. However, this alternative approach is neither representative nor reproducible, nor is it able to detect reference intervals that are too wide [
4].
To overcome these issues, so-called indirect methods [
3] have been proposed that can be applied to larger numbers of routine laboratory results. They rely on statistical models rather than clinical measures for the definition of an apparently healthy population and attempt to derive the above percentiles from datasets containing an unknown proportion of pathological results [
6,
7,
8,
9,
10,
11]. The most recent of these methods has also been provided as a free software package called refineR [
9], which can be downloaded from the Comprehensive R Archive Network (
https://cran.r-project.org/web/packages/refineR (accessed on 2 May 2024)).
The major advantages of an R package compared to “home-brew” programs are the access via the official CRAN website, the standardized package-type documentation, and the ease of use in the R software environment. As a disadvantage of refineR, some authors, including our group, mention the relatively long computation time of the algorithm and the uncertainty of finding the right statistical model when the number of cases is below 1000 [
9,
10].
Therefore, we have developed an alternative R package called reflimR as a refinement of our previously published Excel-based indirect method [
7]. Our main goal was to provide a much-needed tool that would allow rapid serial verification of reference intervals under routine clinical laboratory conditions [
4]. To effectively support this intent, we integrated into the package an algorithm that uses traffic light colors to indicate how well the estimated reference intervals match the predefined limits used in one’s own laboratory.
In this article, we explain the functions included in reflimR, present results for the example data of the package, and compare them with those of refineR and the guideline-compliant direct method. As a special feature, the example data offer the possibility to test our method both in a direct and an indirect mode so that the influence of pathological outliers can be assessed.
2. Materials and Methods
All calculations and graphics were made with the free statistical software R (
www.r-project.org (accessed on 2 May 2024)). Eight analytes were measured in 456 healthy controls and 156 patients with different stages of hepatitis C ranging from mild infection without histological signs to severe liver damage in the form of fibrosis and cirrhosis [
12].
Table 1 exemplifies four rows from the livertests dataset included in the reflimR package. They illustrate typical values for controls and patients. The first and third rows represent a female and a male person from the healthy control group. The female patient in row 200 is an example of mild early-stage hepatitis with largely unremarkable results except for a significantly elevated GGT. The male patient in row 610, on the other hand, represents a typical cirrhotic stage with increased AST, BIL, and GGT but decreased ALB, ALT, and CHE. For the full names of the analytes see the list of abbreviations. The package also includes a list of target values (see
Section 3), which were derived from the publicly available handbook of L. Thomas (
https://www.clinical-laboratory-diagnostics.com (accessed on 2 May 2024)). The missing lower limits for ALT, AST, and GGT were supplemented from the manufacturer’s assay sheet.
The reflimR method falls into the category of so-called “modified Hoffmann approaches” [
7,
13,
14], which are based on the original work of Robert G Hoffmann (1963) [
15]. Their common element is that they evaluate the linear part of a regression line, which is obtained by comparing the distribution of the (possibly transformed) values with a standard normal distribution. While in the original method a probability–probability plot is generated [
15], most of the newer modifications, including ours, use a normal quantile–quantile plot [
7].
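The principle of evaluating the linear part of a normal quantile–quantile plot can be illustrated with a minimal sketch. It is not the reflimR implementation: the function name qq_reference_limits and the fixed central fraction of 25% to 75% are assumptions made for illustration only. The intercept of the fitted line estimates the mean and the slope estimates the standard deviation of the presumably non-pathological fraction.

```python
import random
from statistics import NormalDist, mean


def qq_reference_limits(values, central=(0.25, 0.75)):
    """Estimate the 2.5th and 97.5th percentiles from the central part of a normal Q-Q plot.

    The central fraction (25% to 75%) is an illustrative assumption, not the
    truncation rule of any particular package.
    """
    xs = sorted(values)
    n = len(xs)
    std_normal = NormalDist()
    # Plotting positions (i + 0.5) / n keep the theoretical quantiles finite.
    probs = [(i + 0.5) / n for i in range(n)]
    points = [
        (std_normal.inv_cdf(p), x)
        for p, x in zip(probs, xs)
        if central[0] <= p <= central[1]
    ]
    z = [pt[0] for pt in points]
    y = [pt[1] for pt in points]
    # Ordinary least squares: y = intercept + slope * z.
    z_mean, y_mean = mean(z), mean(y)
    slope = sum((a - z_mean) * (b - y_mean) for a, b in zip(z, y)) / sum(
        (a - z_mean) ** 2 for a in z
    )
    intercept = y_mean - slope * z_mean
    # Extrapolate the line to the standard normal quantiles at 2.5% and 97.5%.
    z975 = std_normal.inv_cdf(0.975)
    return intercept - z975 * slope, intercept + z975 * slope


# Simulated Gaussian data (mean 100, standard deviation 10) as a plausibility check.
rng = random.Random(42)
sample = [rng.gauss(100, 10) for _ in range(5000)]
lower, upper = qq_reference_limits(sample)
```

For right-skewed data, the same sketch could be applied to the logarithms of the values, with the resulting limits transformed back by exponentiation.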
The complete list of ten functions included in the reflimR package can be displayed with the command help(package = reflimR). The reflim function is on the highest level and represents the main function of the package. It can be called with a single command reflim(x), where x is the vector of positive numbers to be analyzed. The output of this function is a set of numeric and text results as well as a graphical representation of the calculated reference limits with colored tolerance ranges (
Figure 1). The reflim function also includes a total of ten arguments with default values defining the appearance of the output. It calls the other nine functions that can be arranged as follows:
Group 1: ri_hist, permissible_uncertainty, interpretation
Group 2: lognorm, iboxplot, truncated_qqplot
Group 3: adjust_digits, bowley, conf_int95
Figure 1.
Graphical output of the reflim function. The vertical lines represent the observed and theoretical reference limits with their respective tolerance ranges.
Group 1 comprises three higher-level functions that provide the user with the final results: ri_hist creates a graphical output, permissible_uncertainty calculates the tolerance limits of the results [
1,
2], and interpretation assesses the medical significance of deviations from given target values. Group 2 performs the three underlying statistical operations (see
Section 3), and group 3 contains auxiliary functions for miscellaneous tasks like rounding to a plausible number of digits, calculating Bowley’s quartile skewness and determining 95% confidence intervals. The details of each function are available in the respective help files, which can be addressed with a question mark followed by the function name.
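Bowley's quartile skewness mentioned above follows the textbook definition (Q3 + Q1 − 2·Q2) / (Q3 − Q1). The following minimal sketch computes it with the Python standard library; the function name bowley_skewness and the choice of the "inclusive" quantile method are assumptions and do not necessarily match the bowley function of the package.

```python
from statistics import quantiles


def bowley_skewness(values):
    """Quartile skewness: 0 for symmetric data, positive for right-skewed data."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 + q1 - 2 * q2) / (q3 - q1)


symmetric = bowley_skewness([1, 2, 3, 4, 5])
right_skewed = bowley_skewness([1, 2, 3, 5, 20])
```

Because the measure uses only the quartiles, it is insensitive to extreme values in the tails of the distribution.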
Figure 1 shows an example for the graphical output of the evaluation of 1000 realistic laboratory results (e.g., blood glucose in mg/dL), simulated as three Gaussian distributions representing 80% normal values as well as 10% low and 10% high values (mean values 100, 70, and 125 and standard deviations 10, 15, and 15, respectively). The vertical lines represent the observed and theoretical reference limits. The respective tolerance ranges surrounding these vertical lines were derived from the permissible uncertainty of quantitative laboratory results [
16] in a special application serving as an equivalence test for reference limits [
17].
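A dataset of this kind can be simulated along the following lines. This is a sketch under stated assumptions: the seed, the use of fixed component sizes (800, 100, and 100 values), and the name simulate_mixture are hypothetical and not taken from the package. The theoretical reference limits are the 2.5th and 97.5th percentiles of the normal component alone.

```python
import random
from statistics import NormalDist

# (fraction, mean, standard deviation) of the normal, low, and high components
COMPONENTS = [(0.8, 100, 10), (0.1, 70, 15), (0.1, 125, 15)]


def simulate_mixture(n=1000, seed=1):
    """Draw n values from the three Gaussian components with fixed proportions."""
    rng = random.Random(seed)
    values = []
    for fraction, mu, sd in COMPONENTS:
        values.extend(rng.gauss(mu, sd) for _ in range(round(n * fraction)))
    return values


# Theoretical limits of the non-pathological fraction: mean ± 1.96 standard deviations.
normal_fraction = NormalDist(mu=100, sigma=10)
theoretical_lower = normal_fraction.inv_cdf(0.025)
theoretical_upper = normal_fraction.inv_cdf(0.975)

values = simulate_mixture()
# Share of all simulated values outside the theoretical limits (exceeds 5%
# because of the admixed low and high components).
share_outside = sum(
    v < theoretical_lower or v > theoretical_upper for v in values
) / len(values)
```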
Arbitrary target values of 120, 130, and 140 were set for the upper limit in
Figure 1 to illustrate the traffic light metaphor of the reflimR approach. Green bars mean that the predefined target values lie inside the tolerance ranges of the reflim calculation. Yellow bars mean that the target values lie outside but the tolerance ranges overlap, whereas red bars mean that the tolerance ranges are completely separated. The respective interpretation outputs of the reflim function are “within tolerance” (green), “slightly increased/decreased” (yellow), and “markedly increased/decreased” (red).
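The color logic described above can be expressed as a short decision rule. The sketch below assumes that both the estimated limit and the target value carry a tolerance range given as a (low, high) pair. The function name traffic_light and the ±5% tolerance used in the example are hypothetical; the package derives its tolerance ranges from the permissible uncertainty instead.

```python
def traffic_light(estimate_range, target, target_range):
    """Classify a target value against the tolerance range of an estimated limit.

    estimate_range, target_range: (low, high) tolerance ranges around the
    estimated limit and around the target value, respectively.
    """
    est_low, est_high = estimate_range
    tgt_low, tgt_high = target_range
    if est_low <= target <= est_high:
        return "green"   # target lies inside the tolerance range of the estimate
    if tgt_low <= est_high and est_low <= tgt_high:
        return "yellow"  # target lies outside, but the two ranges overlap
    return "red"         # the two ranges are completely separated


def relative_range(value, tolerance=0.05):
    """Hypothetical symmetric tolerance of ±5% (an assumption for illustration only)."""
    return value * (1 - tolerance), value * (1 + tolerance)


estimate = 120
colors = [
    traffic_light(relative_range(estimate), target, relative_range(target))
    for target in (120, 130, 140)
]
```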
A Shiny application with a graphical user interface is available to facilitate the use of reflimR for those who are not familiar with calling R functions. It can be downloaded from GitHub and installed in the R environment as described on the website (
https://github.com/SandraKla/reflimR_Shiny (accessed on 2 May 2024)).
For a method comparison, the refineR package was used as a published reference [
9]. This package includes two main functions that are called sequentially.
Briefly, the algorithm is based on the assumption that the non-pathological fraction of the data can be modeled with a Box–Cox transformed normal distribution with three parameters (mean, standard deviation, and exponent lambda). In contrast to our method, refineR starts with a sophisticated analysis of the density of the original data aiming to find a lambda value that fits a continuum of right-to-left skewed distributions rather than just our two types, i.e., Gaussian (λ = 1) and lognormal (λ = 0). In a series of complex analytical steps, the roughly transformed values are then transferred to a histogram with optimized bin width, from which a cost-based final model is obtained under various assumptions about the most likely distribution in each bin as well as in the presumably non-pathological fraction. For a more detailed description of the algorithm see Ref. [
9].
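The role of the exponent lambda can be illustrated with the standard one-parameter Box–Cox transformation, in which λ = 1 leaves the shape of the distribution unchanged (Gaussian model) and λ = 0 corresponds to the logarithm (lognormal model). The sketch below uses the textbook definition; the exact parameterization used inside refineR may differ, and the function names are chosen for illustration.

```python
import math


def box_cox(x, lam):
    """Textbook Box-Cox transformation for positive x."""
    if x <= 0:
        raise ValueError("Box-Cox transformation requires positive values")
    if lam == 0:
        return math.log(x)          # limiting case: lognormal model
    return (x ** lam - 1) / lam     # lam = 1 yields x - 1: Gaussian model


def inverse_box_cox(y, lam):
    """Back-transform a value (e.g., an estimated limit) to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1) ** (1 / lam)


shifted = box_cox(5.0, 1)        # linear shift only
logged = box_cox(math.e, 0)      # natural logarithm
round_trip = inverse_box_cox(box_cox(7.3, 0.5), 0.5)
```

Intermediate values of λ between 0 and 1 describe the continuum of skewed distributions mentioned above.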
Finally, we applied two direct methods to the values of the healthy control group to compare our method with the established CLSI/IFCC guideline. The “gold standard” procedure determines the 2.5th and 97.5th percentiles in healthy individuals without making any assumptions regarding the underlying distribution [
2]. A simplified “20-person approach” rejects the specified reference interval if more than two out of twenty reference values fall outside its limits [
2].
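Both direct procedures can be sketched in a few lines. The function names and the "inclusive" quantile method are assumptions; the guideline contains further details that are not reproduced here. The example also illustrates why the 20-person approach cannot detect intervals that are too wide.

```python
from statistics import quantiles


def nonparametric_limits(healthy_values):
    """2.5th and 97.5th percentiles without any distributional assumption."""
    # n=40 yields 39 cut points in steps of 2.5%; the first and last are needed.
    cuts = quantiles(healthy_values, n=40, method="inclusive")
    return cuts[0], cuts[-1]


def twenty_person_check(twenty_values, lower, upper):
    """Return True if the interval is accepted (at most two of twenty values outside)."""
    if len(twenty_values) != 20:
        raise ValueError("exactly twenty reference values are expected")
    outside = sum(1 for v in twenty_values if v < lower or v > upper)
    return outside <= 2


low, high = nonparametric_limits(list(range(1, 101)))

twenty = list(range(1, 21))
accepted_plausible = twenty_person_check(twenty, 0, 21)
accepted_too_wide = twenty_person_check(twenty, -100, 1000)  # also accepted
accepted_too_narrow = twenty_person_check(twenty, 5, 15)     # rejected
```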
4. Discussion
The indirect reflimR method presented here is suitable for a quick and easy-to-interpret verification of specified reference intervals. It works well with mixed datasets containing up to 25% patients with confirmed disease. Taking the livertests dataset included in the package as example, reflimR accepts less than 50% of the literature-derived limits: 34% are rejected and 22% are classified as worth reviewing.
The color system presented here makes it easier to quickly assess the agreement of the reflimR results with reference limits from handbooks or assay package inserts. The traffic light colors are intuitive but not subjective, as they are determined by the specifications of the permissible uncertainty [
16,
17] and cannot be influenced by the user. In contrast to statistical confidence intervals [
2], the permissible uncertainty is independent of the number of observations. The confidence intervals provided by reflimR are a valuable reproducibility measure in cases with low numbers of values but become extremely narrow when several thousand values are analyzed.
If we take a closer look at the red and yellow fields in
Table 3, it is noticeable that more limits are affected in women than in men. Among them are ALB and BIL, where the target values do not differ between the two genders, whereas our data indicate a significant difference in the medians (
p < 0.001). A brief literature search shows that men do indeed have higher albumin and bilirubin concentrations than women [
21,
22], a fact that is rarely considered by assay manufacturers and clinical laboratories.
The results of our method are similar to those of the direct CLSI/IFCC method [
2] as well as the more complex refineR method [
9]. No notable differences are observed when our method is applied to the healthy controls only, i.e., omitting the patient values. In this latter case, reflimR and refineR may even outperform the so-called gold standard in specific situations, i.e., where the healthy control group contains individuals with slightly pathological results (see the green density curve in
Figure 2 and the green boxes for GGT in
Figure 6). Such borderline values are reliably eliminated by the three-stage procedure used here, whereas they are fully reflected in the results when calculating quantiles alone without any model assumptions.
With older methods [
6,
15,
23], an IFCC working group found that the application of indirect methods to mixed populations resulted in some bias as compared to carefully selected healthy reference individuals [
24]. This limitation may also apply to reflimR and refineR (see for example CREA for men in
Figure 6), but does not seem to be too serious if these methods are just used for verification of already existing reference intervals rather than for establishing them de novo.
Compared to refineR, the much higher speed is an outstanding advantage of our method.
Figure 7 shows that the computation time of refineR decreases as the number of observations increases. This counterintuitive behavior can probably be explained by the complex statistical algorithm, which leads to faster convergence for larger sample sizes [
9]. Nevertheless, reflimR is several thousand times faster and therefore qualifies for the rapid verification of reference limits. The long calculation times of refineR may not play a role in individual analyses but can quickly become a problem if the algorithm has to be run repeatedly.
This is particularly the case when confidence intervals are calculated with simulation or bootstrap techniques [
2,
25]. The conf_int95 function of reflimR is based on 100,000 Monte Carlo simulations for each sample size from 200, 400, 600 … to 2000 (see conf_int95 in the package documentation). While this experiment with a total of one million simulations takes about two hours, the corresponding duration with refineR would be roughly a year on the same computer. Very long computation times may also be a reason why refineR does not output any confidence intervals in the default setting. The documentation only contains very rough calculation examples with 30 bootstraps that already take several minutes. For a standard lognormal distribution with 10,000 values, refineR returns plausible reference limits of 0.13 (CI95 0.10 to 0.14) and 6.94 (CI95 4.50 to 7.21) after about three minutes. The same simulation performed with reflimR takes 50 milliseconds and yields almost identical reference limits of 0.14 (CI95 0.10 to 0.19) and 6.88 (CI95 6.05 to 7.70). The reflimR algorithm is so fast because here the confidence intervals are calculated with closed formulas that are based on the 100,000 simulations mentioned above. Such formulas do not exist for refineR.
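Why the run time of a single estimate matters for bootstrap techniques can be seen from a generic percentile bootstrap, sketched below. This is neither the conf_int95 function nor the refineR procedure: the simple 97.5th percentile stands in for an arbitrary reference limit estimator, and all names, the seed, and the number of resamples are assumptions.

```python
import random
from statistics import quantiles


def upper_limit(values):
    """97.5th percentile as a simple stand-in for a reference limit estimator."""
    return quantiles(values, n=40, method="inclusive")[-1]


def bootstrap_ci95(values, estimator, n_boot=200, seed=1):
    """Percentile bootstrap: the estimator is re-run on n_boot resamples."""
    rng = random.Random(seed)
    n = len(values)
    estimates = [estimator(rng.choices(values, k=n)) for _ in range(n_boot)]
    # 2.5th and 97.5th percentiles of the bootstrap estimates.
    cuts = quantiles(estimates, n=40, method="inclusive")
    return cuts[0], cuts[-1]


rng = random.Random(7)
sample = [rng.gauss(100, 10) for _ in range(500)]
ci_low, ci_high = bootstrap_ci95(sample, upper_limit)
```

Since the estimator is executed once per resample, the total run time grows linearly with the run time of a single estimate.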
Several publications have dealt with the minimum sample size required for the different methods. Due to the specifications in the guideline [
2], it has become common practice to consider a sample size of 120 values as a minimum [
9,
25,
26]. A critical IFCC document published in 2010 states that this number is far from optimal and at least 400 healthy individuals are desirable [
27]. The reflimR algorithm issues a warning if fewer than 200 presumably inconspicuous values remain after truncation of the original data. For a very clean sample without pathological outliers, reflimR even works with only 40 values, from which the 39 quantiles of the Q-Q plot can be calculated (see step 3 visualized in
Figure 5). This number is considered the absolute minimum for the reflimR method if performed in a direct mode with healthy subjects only.
In contrast, refineR warns if there are fewer than 1000 values in the total dataset. The latter figure has been confirmed by Anker et al. [
10], who found that refineR is unable to estimate plausible lambda values at sample sizes below 1000. Our results, summarized in
Figure 6, suggest, however, that there are no notable differences between reflimR and refineR even for sample sizes below 1000.
Finally, and most importantly, our method is clearly superior to the simplified guideline approach, which uses just twenty values from healthy controls and counts how many of them fall outside the specified reference interval [
2,
4,
26]. The poor reproducibility of the results (see
Table 3) shows that twenty individuals are not enough for a representative sample of the healthy population and, in addition, the simplified guideline method is inherently flawed because it cannot recognize reference intervals that are too wide [
4]. In our study, this severe limitation applies to ALB and BIL in women and AST in men, as well as CREA in both genders (see empty boxes in
Figure 6). On the other hand, the guideline method tends to reject reference intervals erroneously if the seemingly healthy population includes sick individuals with slightly pathological values (see GGT for men).
Limitations of the Study and Outlook
As a rule, the results of reflimR do not differ significantly from those of refineR and it makes no difference whether reflimR is applied to data from healthy individuals or to mixed populations (see
Figure 6). Slight deviations from this rule, such as for GGT or CREA for men, are probably due to an accumulation of borderline pathological values, which are difficult to separate from the values of healthy individuals using our algorithm (
Figure 2). Such borderline cases are identified by visual inspection of the quantile–quantile plot (
Figure 5). Deviations from linearity indicate the need for analyses using methods like refineR as part of quality assessment. However, objective criteria for non-linearity are still lacking [
7,
10].
Other questions that can be addressed with real-world data include handling small amounts of data in specialty testing [
10], determining the lower reference limits below the detection level of an assay [
8], or integrating reflimR and refineR into laboratory information systems.
Notably, reflimR in its indirect version produced results comparable to those of more sophisticated direct methods and the more time-consuming refineR method. Previous studies have shown similar concordance for the precursor methods of reflimR [
10,
11]. Independent multi-center studies with real laboratory data are needed to validate the performance of reflimR under all conceivable routine conditions and to define when other methods such as refineR need to be used as a control. This also applies to the question of whether the high rate of rejections of predefined reference intervals (see
Table 3) by reflimR is also confirmed by other methods.
The general applicability of indirect methods for testing reference intervals is still a matter of debate [
3,
24,
27], but as direct methods also have their limitations (see
Figure 6 and
Table 3), clear criteria for the use of healthy reference subjects versus mixed populations need to be defined. We hope that our simple, intuitive, and fast method will pave the way for comprehensive investigations and foster collaboration between scientists in laboratory medicine and statistics.