1. Introduction
The goal of forest inventory is to accurately characterize forest attributes and do so with reasonable precision. As such, estimates of forest attributes are typically presented in terms of a confidence interval

$$\bar{x} \pm c \, s_{\bar{x}}$$

where $\bar{x}$ is the attribute mean, $s_{\bar{x}}$ is the standard error of the mean, and $c$ is a constant determined by the desired level of confidence. For any given level of confidence, the width of the confidence interval is determined by the error, or variability, in the sample. In general, there are three sources of error for forest inventories: sampling error, modeling error, and measurement error [1]. Emphasis usually focuses on sampling error, with some attention given to modeling error. Measurement error typically receives little attention because it is commonly assumed that observations are made without any error or with an error that is small and inconsequential when compared to other sources of error [2]. In reality, observational errors are unavoidable and may be quite severe. Their presence can produce biased and imprecise estimates and mask true relationships [3,4].
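As a rough illustration of the interval described above (attribute mean plus or minus a constant times the standard error of the mean), the following Python sketch computes a 95% interval. The per-plot values are hypothetical, the function name is invented for this example, and the standard error assumes simple random sampling, which is a simplification of a real inventory design.

```python
import math
import statistics

def mean_confidence_interval(values, c):
    """Return (lower, upper) for: mean +/- c * standard error of the mean."""
    n = len(values)
    mean = statistics.fmean(values)
    # Assumption: simple random sampling, so SE = sample SD / sqrt(n).
    se = statistics.stdev(values) / math.sqrt(n)
    return mean - c * se, mean + c * se

# Hypothetical per-plot volumes (m^3/ha); not data from the study.
volumes = [112.0, 98.5, 130.2, 104.7, 121.3, 90.8, 117.6, 109.1]

# c for 95% confidence from the standard normal distribution (about 1.96).
c = statistics.NormalDist().inv_cdf(0.975)
low, high = mean_confidence_interval(volumes, c)
print(f"mean = {statistics.fmean(volumes):.1f}, 95% CI = {low:.1f} to {high:.1f}")
```

For a fixed value of c, anything that inflates the standard error, including measurement error, widens the interval.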
Common sources of measurement error in a forest inventory include uncalibrated or faulty equipment, negligent record keeping, and lax or improper field techniques. A less cited source of measurement error occurs when data are coarsened by rounding or other approximation. Number preference (NP), i.e., the human tendency to gravitate toward or away from specific numbers [5], is a type of data coarsening. NP occurs in a variety of circumstances, ranging from baseball players modifying their at-bat strategy to end the season with a batting average over 0.300 [6] to diners leaving gratuities in whole-dollar amounts or in amounts that make the total bill a whole-dollar amount [7]. Other examples of NP have been observed in marathon run times [8], economic and financial environments [9,10], human dimension surveys and behavioral studies [6,11,12,13], and wildlife sampling [14]. The phenomenon occurs not only when individuals self-report metrics such as income [15] but also when instrumentation is used for direct measurements, e.g., blood pressures [16], carcinoma sizes [17], and fish lengths [18].
NP, sometimes referred to as digit preference [19], leads to excessive groupings, or heaps, of observations at specific values. For example, consider a survey asking respondents how many days they spent vacationing last year, with responses showing an unusually high frequency at 14 days. Though some respondents likely spent exactly two weeks vacationing, others may have spent a day or two more (or fewer) on holiday and simply rounded to 14 when completing the survey. Datasets such as this may be inaccurate and biased, though not necessarily so, and can lead to erroneous conclusions because means, variances, and percentiles are all affected [15]. Therefore, examining data for NP is recommended for the data processing and quality control steps of forest inventory to ensure accurate characterization. To that end, the objective of this work was to evaluate the extent of NP in data collected by the U.S. national forest inventory and identify which factors, if any, influence the propensity for NP. In this study, the numbers of preference are those ending with the digit zero or five (NP0,5) and those that are a multiple of four (NP4).
2. Materials and Methods
The data used in this study were collected by the Forest Inventory and Analysis (FIA) program of the U.S. Department of Agriculture, Forest Service (Forest Service). FIA inventory plots are located across the U.S. quasi-systematically with a baseline sampling intensity of 1 plot per 2428 ha [20]. Some states, national forests, and other areas are sampled at intensities two or three times that of the baseline. Each plot consists of four 7.32 m fixed-radius subplots on which trees ≥ 12.7 cm in diameter are measured. Observations on trees < 12.7 cm in diameter are made on a 2.07 m fixed-radius microplot within each subplot. The cluster of subplots is arranged with one central subplot, and three other subplots located 36.58 m from the central subplot at azimuths of 0°, 120°, and 240°. Each plot is monumented, georeferenced, and measured on an ongoing basis once every 5–10 years.
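The plot layout described above can be sketched numerically. The Python snippet below uses only the radii, distance, and azimuths stated in the text; the function names are invented for this example, and it assumes azimuths are measured clockwise from north, which the text does not state explicitly.

```python
import math

SUBPLOT_RADIUS_M = 7.32      # fixed-radius subplot
MICROPLOT_RADIUS_M = 2.07    # fixed-radius microplot
SUBPLOT_DISTANCE_M = 36.58   # centre-to-centre distance from the central subplot
AZIMUTHS_DEG = (0, 120, 240)

def circle_area(radius_m):
    """Area in square metres of a circular plot with the given radius."""
    return math.pi * radius_m ** 2

def subplot_centers():
    """Return (east, north) offsets in metres of the four subplot centres."""
    centers = [(0.0, 0.0)]  # central subplot
    for azimuth in AZIMUTHS_DEG:
        theta = math.radians(azimuth)
        # Assumption: azimuth is clockwise from north, so east = d*sin, north = d*cos.
        centers.append((SUBPLOT_DISTANCE_M * math.sin(theta),
                        SUBPLOT_DISTANCE_M * math.cos(theta)))
    return centers

print(f"subplot area: {circle_area(SUBPLOT_RADIUS_M):.2f} m^2")
print(f"microplot area: {circle_area(MICROPLOT_RADIUS_M):.2f} m^2")
for east, north in subplot_centers():
    print(f"centre offset: east {east:7.2f} m, north {north:7.2f} m")
```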
When plots are partially forested or straddle heterogeneous forest conditions, they are subdivided by a procedure known as condition mapping [20]. Multiple conditions are classified on the basis of reserved status, owner group, forest type, stand size class, regeneration status, and tree density [21]. Several ancillary attributes are used to further describe the condition classes but are not used to delineate new classes. Any number of condition classes may be recorded for each plot.
NP0,5 was evaluated for three tree-level attributes: rotten/missing cull volume, diameter, and actual height. NP0,5 and NP4 were evaluated for the microplot metric seedling tree count. Rotten/missing cull volume (cull) is the estimated percentage of tree volume that is rotten or missing. Cull is visually estimated and recorded to the nearest 1%. The diameter of timberland tree species is recorded at breast height (d.b.h.), typically 1.37 m above the ground line on the uphill side of the tree. For woodland (mostly multi-stemmed) species, diameter is recorded at the stem root collar (d.r.c.) or groundline, whichever is higher. Diameter is measured instrumentally unless circumstances warrant otherwise and recorded to the nearest 0.254 cm. Actual height (height) is the tree length from ground level to the highest remaining portion of the tree still present and attached to the bole. Height is measured instrumentally unless circumstances warrant otherwise and recorded to the nearest 30.48 cm. The seedling count is the number of live trees with a diameter < 2.54 cm present on the microplot. To qualify for counting, conifer (softwood) seedlings must be at least 15.24 cm tall, and hardwood seedlings must be at least 30.48 cm tall. Seedlings are tallied by species and condition class up to a count of five and estimated beyond that.
Tree-, condition-, and subplot-level data for the most recent available inventory year of all states except Hawaii were included in the analysis (Figure 1). Data were collected with protocols outlined in FIA field guide versions 7–9 [21]. Only plots of the baseline sampling frame were included. The number of tree and seedling count observations available for analysis varied by region and ranged from <1000 to >235,000 (Table 1). All data are available to the public through the FIA online database [22] and were downloaded during the first week of July and the second week of August 2022.
To test for NP0,5, each cull, diameter, height, and seedling count observation was assigned its end digit (EDi). For example, the end digits of numbers 7, 19, and 23.6 were 7, 9, and 6, respectively. End-digit assignments for diameter and height were based on the U.S. customary units of measure employed by FIA (inches and feet, respectively). The proportion of observations ending in zero or five ($p_{0,5}$) was estimated for each attribute with a logistic regression model. Cull values = 0%, seedling counts < 11, and seedling counts observed in condition classes with a stocking value < 10, i.e., non-stocked conditions, were not included. Confidence intervals for $p_{0,5}$ (α = 0.01) were computed with a Wald-type interval on the log odds scale and transformed to the probability scale. Analyses were completed with R [23] packages survey [24] and srvyr [25]. Tree and seedling count observations were treated as being clustered on plots by designating plot identification number as the primary sampling unit, i.e., cluster variable, in the survey design specification. Estimations of $p_{0,5}$ were made for each attribute by region: Interior West (IW), Northern, Pacific Northwest (PNW), and Southern (Figure 1). In the absence of NP, the digits i = 0, 1, …, 9 were expected to occur with equal frequency ($p_i$ = 0.1) at the end of a number. Therefore, the null hypothesis for NP0,5 was

$$H_0: p_{0,5} = 0.2$$

The null hypothesis was rejected if the 99% confidence interval for $p_{0,5}$ did not include 0.2.
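The end-digit test described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' R workflow: the function names and example data are hypothetical, the cluster variance uses a simple linearized (ratio-estimator) formula with plots as clusters, and the critical value comes from the standard normal distribution, all of which are assumptions made for this sketch.

```python
import math
from statistics import NormalDist

def end_digit(value, decimals=0):
    """Last recorded digit of a value stored to `decimals` decimal places."""
    return int(round(value * 10 ** decimals)) % 10

def logit_wald_ci(flags_by_plot, alpha=0.01):
    """Proportion of 1-flags with a Wald interval built on the log odds scale.

    flags_by_plot maps a plot id to a list of 0/1 flags (1 = end digit 0 or 5).
    Assumes at least two plots and an estimated proportion strictly between 0 and 1.
    """
    m = len(flags_by_plot)
    n = sum(len(flags) for flags in flags_by_plot.values())
    p = sum(sum(flags) for flags in flags_by_plot.values()) / n
    # Linearized residual total per plot; plots are treated as clusters.
    residuals = [sum(flags) - p * len(flags) for flags in flags_by_plot.values()]
    var_p = (m / (m - 1)) * sum(r * r for r in residuals) / n ** 2
    # Delta method: SE on the log odds scale.
    se_logit = math.sqrt(var_p) / (p * (1 - p))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    logit = math.log(p / (1 - p))
    expit = lambda t: 1 / (1 + math.exp(-t))
    return p, expit(logit - z * se_logit), expit(logit + z * se_logit)

# The end-digit examples given in the text.
print(end_digit(7), end_digit(19), end_digit(23.6, decimals=1))

# Hypothetical flags: 30 plots of 10 trees, with 4 or 6 heights ending in 0 or 5.
plots = {}
for plot_id in range(30):
    k = 4 if plot_id % 2 else 6
    plots[plot_id] = [1] * k + [0] * (10 - k)

p, lower, upper = logit_wald_ci(plots)
reject = not (lower <= 0.2 <= upper)
print(f"p = {p:.3f}, 99% CI = ({lower:.3f}, {upper:.3f}), reject H0: {reject}")
```

Building the interval on the log odds scale and transforming back keeps both limits inside the range 0 to 1, which a symmetric interval on the probability scale does not guarantee.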
The test for NP4 was limited to seedling count and warranted by a recommended tally shortcut: when seedlings are distributed evenly on a microplot, inventory crew members may estimate the total count by multiplying the number of seedlings on one-quarter of the microplot by four [21]. Therefore, all seedling counts > 10 were categorized as either a multiple of four (M4) or not a multiple of four. Procedures used to estimate $p_{0,5}$ were repeated to estimate the proportion of M4 seedling counts ($p_4$). Seedling counts made on microplots with more than one condition class were excluded. The proportion of M4 numbers from 11 to 999 (the maximum seedling count allowed) is approximately 0.25. Thus, the null hypothesis for NP4 was

$$H_0: p_4 = 0.25$$

The null hypothesis was rejected if the 99% confidence interval for $p_4$ did not include 0.25.
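The stated null proportion of roughly 0.25 can be checked by direct enumeration. The short Python sketch below counts the multiples of four among the admissible seedling counts (11 through 999); the variable names are chosen for this example only.

```python
# Admissible seedling counts in the NP4 test: greater than 10, at most 999.
counts = range(11, 1000)

# Multiples of four (M4) within that range.
m4 = [count for count in counts if count % 4 == 0]

proportion = len(m4) / len(counts)
print(f"{len(m4)} of {len(counts)} counts are multiples of four "
      f"(proportion = {proportion:.4f})")
```

The enumeration gives 247 multiples of four among 989 values, so the expected share under no preference is just under one quarter.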
Multivariate logistic regression [26] was used to identify factors associated with ED0,5 and M4. Four tree-level attributes were included as potential predictors of cull ED0,5: species group (hardwood, softwood), species type (timberland, woodland), tree status (live, standing dead), and treetop status (intact, broken/missing). Five tree-level attributes were included as potential predictors of diameter ED0,5: diameter point (at breast height, above breast height, below breast height, root collar), method (measured, estimated, different location), species group, stem size (sapling, tree), and tree status. Five tree-level attributes were included as potential predictors of height ED0,5: method (measured, estimated), species group, species type, stem size, and tree status. A detailed description of these factors is provided in Table S1. Seven condition-level attributes and one subplot-level attribute were included as potential influencers of seedling count ED0,5 and M4: stand size (small, medium, large), stand origin (natural, artificial), disturbance (undisturbed, disturbed), treatment (untreated, treated), depth of water or snow on the subplot (<3 cm, 3–30 cm, >30 cm), owner group (Forest Service, other federal, state/local government, private), physiography (mesic, hydric, xeric), and slope (0%–155%). A detailed description of these factors is provided in Table S2. Some factors were omitted in some regional regressions due to inadequate sample sizes.
For the ED0,5 regression, an end digit of zero or five was considered a success (S = 1), and any other end digit a failure (S = 0). For the M4 regression, multiples of four were considered successes (S = 1), and other values were considered failures (S = 0). The probability that S = 1 was modeled for each attribute by region in the linear form as

$$\ln\left(\frac{P(S=1)}{1-P(S=1)}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

where parameter $\beta_i$ refers to the effect of attribute $x_i$ on the log odds that S = 1, controlling for all other attributes. Dichotomous (0/1) dummy variables were used to represent the categorical attributes. Parameters were estimated with a logit link function under a quasibinomial distribution with R [23] packages survey [24] and srvyr [25]. Tree and seedling count observations were treated as being clustered on plots by designating plot identification number as the primary sampling unit, i.e., cluster variable, in the survey design specification.
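To make the log odds interpretation concrete, the following Python sketch fits a logistic model with an intercept and a single 0/1 dummy variable by Newton-Raphson. It is a simplified stand-in for the survey-weighted quasibinomial fit described above: the data are hypothetical, the function name is invented, and plot clustering is ignored, which would affect standard errors rather than the point estimates shown here.

```python
import math

def fit_logit_one_dummy(x, s, iterations=25):
    """Newton-Raphson fit of logit P(S=1) = b0 + b1*x for a 0/1 dummy x."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, si in zip(x, s):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            # Gradient of the log-likelihood.
            g0 += si - p
            g1 += xi * (si - p)
            # Entries of the (negated) Hessian.
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: x = 0 for measured heights, x = 1 for estimated heights;
# s = 1 when the height ends in 0 or 5.
x = [0] * 100 + [1] * 100
s = [1] * 20 + [0] * 80 + [1] * 40 + [0] * 60

b0, b1 = fit_logit_one_dummy(x, s)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, odds ratio = {math.exp(b1):.3f}")
```

With one dummy predictor, the intercept equals the log odds of success in the reference group (20 of 100 here), and exponentiating the slope gives the odds ratio between the two groups.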