Article

Projection-Uniform Subsampling Methods for Big Data

Yuxin Sun, Wenjun Liu and Ye Tian
1 Key Laboratory of Mathematics and Information Networks (Beijing University of Posts and Telecommunications), Ministry of Education, Beijing 100876, China
2 School of Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(19), 2985; https://doi.org/10.3390/math12192985
Submission received: 28 August 2024 / Revised: 20 September 2024 / Accepted: 24 September 2024 / Published: 25 September 2024
(This article belongs to the Special Issue Advances in Statistical AI and Causal Inference)

Abstract: The idea of experimental design has been widely used in subsampling algorithms to extract a small portion of big data that carries useful information for statistical modeling. Most existing subsampling algorithms of this kind are model-based and designed to achieve the corresponding optimality criteria for the model. However, data-generating models are frequently unknown or complicated. Model-free subsampling algorithms are needed to obtain samples that remain robust under model misspecification and complication. This paper introduces two novel algorithms, called the Projection-Uniform Subsampling algorithm and its extension. Both algorithms aim to extract a subset of samples from big data that is space-filling in low-dimensional projections. We show that subdata obtained by our algorithms perform superiorly under the uniform projection criterion and the centered $L_2$-discrepancy. Comparisons among our algorithms, model-based methods and model-free methods are conducted through two simulation studies and two real-world case studies. We demonstrate the robustness of our proposed algorithms in building statistical models in scenarios involving model misspecification and complication.

1. Introduction

The rapid advancement of computer technology has dramatically accelerated data generation, leading to an explosion in data volumes. Despite improvements in storage and processing power, computing systems continue to struggle to process and analyze the ever-increasing amount of data [1]. Subsampling is a principal way to extract a small number of samples from large datasets for downstream modeling [2]. Drovandi et al. [3] suggested that combining big data subsampling problems with experimental design reduces the challenge to extracting informative design points. This idea has since been studied extensively.
The majority of existing research combining experimental design with subsampling has concentrated on model-based subsampling methods, which rely heavily on model assumptions [4]. Wang et al. [5] proposed the information-based optimal subdata selection (IBOSS) method, which selects the most informative data points under the D-optimality criterion in the context of linear regression. Wang et al. [6] proposed the orthogonal subsampling algorithm for selecting subdata for linear models. We refer to Ma et al. [7], Ma et al. [8], Derezinski and Warmuth [9], Derezinski et al. [10] and Ren and Zhao [11] for other subsampling methods based on linear models. Wang et al. [12] proposed a subsampling approach to efficiently approximate the maximum likelihood estimate in logistic regression. Yu et al. [13] derived a nonuniform subsampling method using Poisson subsampling probabilities for quasi-likelihood estimation. For generalized linear models and nonlinear models, we refer to Ai et al. [14] and Yao and Wang [15] for more discussion. The subdata selected by these approaches perform well when the base model is correctly specified, but their performance becomes unpredictable when the model is incorrect or unknown [4]. In practice, the data-generating model is frequently unknown. It is therefore important to develop subsampling algorithms that can effectively handle model complication or misspecification [1].
Space-filling designs have been used extensively in both computer experiments and big data subsampling because of their robust performance in building statistical surrogate models, especially when the base model is complicated or misspecified. A space-filling design is any design that spreads its points uniformly over the design space. Various types of space-filling designs have been studied from different perspectives, such as distance, uniformity, and orthogonality. Maximin distance designs [16,17] aim to maximize the smallest distance between any two design points. Uniform designs [18] aim to distribute design points uniformly across the design space by minimizing the discrepancy of the point set. Latin hypercube designs [19], derived from Latin hypercube sampling, have been widely used in computer experiments; a Latin hypercube design guarantees maximum stratification in any one dimension of the design region. Tang [20] and He and Tang [21] introduced orthogonal array-based and strong orthogonal array-based Latin hypercube designs that achieve finer stratification in low-dimensional projections. The space-filling idea has also been applied to subsampling algorithms. Zhang et al. [1] introduced a model-free, data-driven subsampling algorithm that takes samples based on uniform designs and evaluates the subdata using a generalized empirical F-discrepancy. Shi and Tang [22] proposed the Model-Robust subsampling method for big data, which selects samples based on the maximin distance criterion. For data in high-dimensional space, however, both discrepancy and distance criteria perform poorly in selecting subdata with low-dimensional space-filling properties [23].
In this paper, we introduce the Projection-Uniform Subsampling (PUS) algorithm and its extension for sampling from high-dimensional big data while ensuring robustness against model misspecification and complication. For high-dimensional data, it is commonly believed that only a few factors are significant, according to the effect sparsity principle [24], so samples with uniform low-dimensional projections are preferred. We develop two heuristic algorithms that focus on selecting subdata filling any two-dimensional projection in a stratified manner. The selected subdata are shown to be space-filling under the uniform projection criterion [25] and the centered $L_2$-discrepancy. We compare the efficiency of our algorithms in building statistical surrogate models with that of the IBOSS and Model-Robust (MR) methods. Both simulations and real-life examples show that our algorithms outperform these two methods when the model is misspecified or complicated.
The remainder of this paper is structured as follows. Section 2 provides notation and background information. Section 3 introduces the two subsampling algorithms and gives examples of their speed. Section 4 demonstrates the effectiveness and applicability of our algorithms through simulations and real-world examples from various perspectives. Section 5 concludes the paper. Proofs of the theorems are given in Appendix A.

2. Notation and Background

Let $D = \{x_1, \ldots, x_n\}$ represent a set of n points constituting a design on the unit hypercube $C^m = [0, 1]^m$, where $x_i = (x_{i1}, \ldots, x_{im})^T$, $i = 1, \ldots, n$. The centered $L_2$-discrepancy (CD) [26] can be computed using the following formula:
$$ CD(D) = \left(\frac{13}{12}\right)^m - \frac{2}{n}\sum_{i=1}^{n}\prod_{j=1}^{m}\left(1 + \frac{1}{2}\left|x_{ij}-0.5\right| - \frac{1}{2}\left|x_{ij}-0.5\right|^2\right) + \frac{1}{n^2}\sum_{i,k=1}^{n}\prod_{j=1}^{m}\left(1 + \frac{1}{2}\left|x_{ij}-0.5\right| + \frac{1}{2}\left|x_{kj}-0.5\right| - \frac{1}{2}\left|x_{ij}-x_{kj}\right|\right). \quad (1) $$
The centered $L_2$-discrepancy of a design D reflects how uniformly the points of D are spread over the unit hypercube $C^m$: it measures the deviation between the empirical distribution of D and the uniform distribution when D is projected onto all possible subsets of dimensions. A design is called a uniform design if it has the minimum centered $L_2$-discrepancy among all comparable designs. In the theory of the quasi-Monte Carlo method, the error of the overall mean model satisfies the famous Koksma–Hlawka inequality, so sample points with low discrepancy are preferred. The space-filling nature of low-discrepancy point sets ensures good performance in building approximate statistical models.
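For reference, a minimal R implementation of (1) might look as follows; the function name is illustrative, and the design is assumed to be an n × m matrix whose entries are already scaled to [0, 1].

```r
# Centered L2-discrepancy (1) of an n x m design with entries in [0, 1]
centered_L2_discrepancy <- function(D) {
  D <- as.matrix(D)
  n <- nrow(D); m <- ncol(D)
  z <- abs(D - 0.5)                                   # |x_ij - 0.5|
  term1 <- (13 / 12)^m
  term2 <- (2 / n) * sum(apply(1 + 0.5 * z - 0.5 * z^2, 1, prod))
  term3 <- 0
  for (i in 1:n) {                                    # double sum over point pairs
    for (k in 1:n) {
      term3 <- term3 + prod(1 + 0.5 * z[i, ] + 0.5 * z[k, ] -
                              0.5 * abs(D[i, ] - D[k, ]))
    }
  }
  term1 - term2 + term3 / n^2
}
```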
Zhou et al. [27] and Sun et al. [25] pointed out that the centered $L_2$-discrepancy can select designs that have inferior performance in low-dimensional projection space. To overcome this shortcoming, Sun et al. [25] proposed a uniform projection criterion based on the centered $L_2$-discrepancy defined in (1). The uniform projection criterion selects designs that minimize $\phi(D)$, where
$$ \phi(D) = \frac{2}{m(m-1)}\sum_{|u|=2} CD(D_u), \quad (2) $$
u is a subset of $\{1, 2, \ldots, m\}$, $|u|$ is the cardinality of u, and $D_u$ is the projection of D onto the dimensions indexed by u. A design that attains the minimum of $\phi(D)$ is called a uniform projection design. The uniform projection criterion focuses on minimizing the centered $L_2$-discrepancy of all two-dimensional projections of D. Uniform projection designs have been shown to be space-filling not only in two dimensions but in all dimensions with regard to distance, uniformity, and orthogonality. For instance, maximin $L_1$-equidistant designs are uniform projection designs.
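A direct R sketch of (2), reusing the centered_L2_discrepancy() helper above (both names are illustrative):

```r
# Uniform projection criterion (2): average CD over all two-dimensional projections
uniform_projection_criterion <- function(D) {
  D <- as.matrix(D)
  m <- ncol(D)
  u <- combn(m, 2)                                    # all index sets with |u| = 2
  cds <- apply(u, 2, function(idx) centered_L2_discrepancy(D[, idx, drop = FALSE]))
  2 / (m * (m - 1)) * sum(cds)
}
```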
Orthogonality is also widely used as a measure of uniformity. An orthogonal array of size n with m factors and strength t, denoted by $OA(n, m, s_1 \times \cdots \times s_m, t)$, is an $n \times m$ matrix whose entries in the jth column are taken from $\{0, 1, \ldots, s_j - 1\}$ and in which every possible level combination appears with the same frequency in every $n \times t$ subarray. If $s_1 = \cdots = s_m = s$, the array is symmetric and is denoted by $OA(n, m, s, t)$. An orthogonal array is called balanced if its strength is no less than one.
He and Tang [21] introduced a novel array known as the strong orthogonal array. A strong orthogonal array of size n with m factors, $s^t$ levels, and strength t, denoted by $SOA(n, m, s^t, t)$, is an $n \times m$ matrix with entries from $\{0, 1, \ldots, s^t - 1\}$ such that any subarray of g columns, for any $1 \le g \le t$, can be collapsed into an $OA(n, g, s^{u_1} \times \cdots \times s^{u_g}, g)$ for any positive integers $u_1, \ldots, u_g$ satisfying $u_1 + \cdots + u_g = t$. Collapsing $s^t$ levels into $s^{u_j}$ levels is performed by $\lfloor a / s^{t-u_j} \rfloor$ for $a = 0, 1, \ldots, s^t - 1$, where $\lfloor x \rfloor$ denotes the largest integer not exceeding x. Strong orthogonal arrays of strength t achieve stratification no matter how the projected design space is divided into $s^t$ grids of equal volume. In response to the challenge that strong orthogonal arrays require a large number of runs to attain effective strength, He et al. [28] proposed strong orthogonal arrays of strength 2+. An $n \times m$ matrix with entries from $\{0, 1, \ldots, s^2 - 1\}$ is called a strong orthogonal array of strength 2+, denoted by $SOA(n, m, s^2, 2+)$, if any subarray of two columns can be collapsed into an $OA(n, 2, s^2 \times s, 2)$ and an $OA(n, 2, s \times s^2, 2)$. Strong orthogonal arrays of strength 2+ are optimal or nearly optimal under the uniform projection criterion.
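The collapsing rule is simple enough to state in one line of R; the values below (s = 2, t = 3) are illustrative and show the $s^t = 8$ levels being collapsed to 2 and to 4 levels.

```r
# Collapse s^t levels {0, ..., s^t - 1} to s^u levels via floor(a / s^(t - u))
collapse_levels <- function(a, s, t, u) floor(a / s^(t - u))

collapse_levels(0:7, s = 2, t = 3, u = 1)   # 0 0 0 0 1 1 1 1  (s^1 = 2 levels)
collapse_levels(0:7, s = 2, t = 3, u = 2)   # 0 0 1 1 2 2 3 3  (s^2 = 4 levels)
```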

3. Projection-Uniform Subsampling Methods

Suppose that a dataset $X = \{x_1, \ldots, x_N\}$ consists of N points, where $x_i = (x_{i1}, x_{i2}, \ldots, x_{im})^T$ for $i = 1, \ldots, N$. We can arrange all points in X into an $N \times m$ matrix $(x_1, \ldots, x_N)^T$; without loss of generality, we do not distinguish between the dataset X and its matrix form. The projection of X onto the dimensions indexed by $u \subseteq \{1, \ldots, m\}$ is the subarray of X consisting of the columns indexed by u. We aim to extract subdata consisting of n points from X that carry information for building approximate statistical models efficiently and that remain robust under model misspecification or complication. We denote the subdata by $X^* = (x_1^*, \ldots, x_n^*)^T$, where $x_i^*$, $i = 1, \ldots, n$, are the points selected from X.
Inspired by the superior properties of the uniform projection criterion, we propose a heuristic algorithm, called the Projection-Uniform Subsampling (PUS) algorithm, for selecting subdata $X^*$ with small $\phi(X^*)$ as defined in (2). Our primary focus is on selecting subdata that are space-filling in every two-dimensional projection. To achieve this, the algorithm selects data points that are closest, in every two-dimensional projection, to combinatorial designs that are optimal under the centered $L_2$-discrepancy. An orthogonal array of strength two is simple and has excellent grid-partitioning characteristics in two-dimensional projection space.
Theorem 1.
OA ( 4 , 2 , 2 , 2 ) is the uniform design among 2-level factorial designs.
OA(4, 2, 2, 2) is one of the most basic orthogonal arrays in the literature. To fit it into the sampling process, we scale OA(4, 2, 2, 2) onto the unit plane $[0, 1]^2$ by the mapping $k \mapsto (2k+1)/4$, $k = 0, 1$. The PUS algorithm sequentially selects data points that have the smallest $L_1$-distance to the points of the scaled OA(4, 2, 2, 2) in every two-dimensional projection. The steps of PUS are listed below.
Algorithm 1 selects subdata $X^*$ whose two-dimensional projections consist of points that are closest to OA(4, 2, 2, 2) in $L_1$-distance. If the full data X contain no sharply isolated points and the collections in Steps 2 and 3 of Algorithm 1 are full, the algorithm has a time complexity of $O(Nm^2)$, which depends only on the dimension and size of the full data. Otherwise, in the worst-case scenario, the time complexity is $O(Nnm^2)$. In such instances, we suggest moving extreme points into the subdata or deleting them based on any available prior knowledge. A short R sketch of the procedure is given after the algorithm listing.
Algorithm 1 PUS
  • Step 1 [Initiation]: Scale the full dataset X onto $[0, 1]^m$. Divide the unit plane $[0, 1]^2$ into four grids by partitioning each axis into two parts at 0.5, and denote them by $G_1, \ldots, G_4$. Denote the 4 points of the scaled OA(4, 2, 2, 2), which are the center points of $G_1, \ldots, G_4$, by $o_1, \ldots, o_4$. Let $X^*$ and $X_{temp}$ be two empty sets. Let $M = n / \left(4\binom{m}{2}\right)$. Denote the two-dimensional projections of X by $X_i^P$, $i = 1, \ldots, \binom{m}{2}$.
  • Step 2 [PUS sampling part 1]: For $i = 1, \ldots, \binom{m}{2}$ and $j = 1, \ldots, 4$, select M points in $X_i^P$ that are located in $G_j$ and have the smallest $L_1$-distances from $o_j$ (select randomly on ties and stop if such points are unavailable). Collect the selected points in $X^*$ and remove them from X.
  • Step 3 [PUS sampling part 2]: For $i = 1, \ldots, \binom{m}{2}$ and $j = 1, \ldots, 4$, select one point in $X_i^P$ that is located in $G_j$ and has the smallest $L_1$-distance from $o_j$ (select randomly on ties and stop if such a point is unavailable). Collect the selected points in $X_{temp}$ and remove them from X.
  • Step 4 [Criterion to stop]: If the total number of points in $X_{temp}$ and $X^*$ is less than n, move the points in $X_{temp}$ to $X^*$ and repeat Steps 3–4. Otherwise, randomly move points from $X_{temp}$ to $X^*$ until $X^*$ contains n points.
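The sketch below is a minimal R transcription of these steps, not the authors' released implementation (that code is linked in Example 2); the function name, the flooring of M, and the simplified handling of the $X_{temp}$ buffer are all illustrative choices.

```r
# Minimal sketch of Algorithm 1 (PUS), under the assumptions noted above.
# X: full data matrix (N x m); n: desired subdata size. Returns row indices of X.
pus_subsample <- function(X, n) {
  X <- as.matrix(X)
  N <- nrow(X); m <- ncol(X)
  # Step 1: scale each column onto [0, 1]; grid centers are the scaled OA(4,2,2,2)
  Xs <- apply(X, 2, function(v) (v - min(v)) / (max(v) - min(v)))
  pairs <- combn(m, 2)                                   # two-dimensional projections
  centers <- rbind(c(0.25, 0.25), c(0.75, 0.25),
                   c(0.25, 0.75), c(0.75, 0.75))         # o_1, ..., o_4
  M <- floor(n / (4 * ncol(pairs)))                      # points per cell in Step 2 (rounded down here)
  selected <- integer(0)
  remaining <- seq_len(N)

  # One sampling pass: up to k nearest points (L1-distance) per grid cell and projection
  pick <- function(k) {
    out <- integer(0)
    for (p in seq_len(ncol(pairs))) {
      proj <- Xs[remaining, pairs[, p], drop = FALSE]
      cell <- 1 + (proj[, 1] >= 0.5) + 2 * (proj[, 2] >= 0.5)   # grids G_1, ..., G_4
      for (j in 1:4) {
        cand <- remaining[cell == j & !(remaining %in% out)]
        if (length(cand) == 0) next
        d <- abs(Xs[cand, pairs[1, p]] - centers[j, 1]) +
             abs(Xs[cand, pairs[2, p]] - centers[j, 2])
        out <- c(out, cand[order(d)][seq_len(min(k, length(cand)))])
      }
    }
    out
  }

  # Step 2: M nearest points per grid cell and projection
  if (M >= 1) {
    sel <- pick(M)
    selected <- c(selected, sel)
    remaining <- setdiff(remaining, sel)
  }
  # Steps 3-4: add one point per cell and projection until n points are collected
  while (length(selected) < n && length(remaining) > 0) {
    sel <- pick(1)
    if (length(sel) == 0) break
    if (length(selected) + length(sel) > n)
      sel <- sel[sample.int(length(sel), n - length(selected))]
    selected <- c(selected, sel)
    remaining <- setdiff(remaining, sel)
  }
  selected
}
```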
The following example shows the subdata selected by PUS and other comparative algorithms. We compare PUS with the MR [22] and IBOSS [5] methods.
Example 1.
Suppose $N = 10^3$ and $m = 5$. We create an $N \times m$ full dataset X by randomly generating $x_i$ from a multivariate normal distribution with mean vector $\mu = (\mu_1, \ldots, \mu_m)$, where $\mu_k = 2$ for $k = 1, \ldots, m$, and covariance matrix $\Sigma$ with $\Sigma_{ii} = 3$ and $\Sigma_{ik} = 0$ for $i \neq k$, $i, k = 1, \ldots, m$. Subdata $X^*$ of size $n = 50$ are taken by PUS, IBOSS and MR, respectively. Figure 1 shows, for each method, the two-dimensional projection plots of X (dots) and $X^*$ (circles) in the first and second dimensions. All three methods show their own distinctive patterns: PUS selects points that cluster around the scaled OA(4, 2, 2, 2) in any two dimensions; IBOSS selects extreme points to gain the most information for linear models; MR selects points that are far from each other.
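A sketch of this data-generation step in R, using mvrnorm() from the MASS package and the PUS sketch above (all object names are illustrative):

```r
library(MASS)

set.seed(1)
N <- 1e3; m <- 5; n <- 50
mu    <- rep(2, m)               # mean vector with mu_k = 2
Sigma <- diag(3, m)              # Sigma_ii = 3, Sigma_ik = 0 for i != k
X <- mvrnorm(N, mu = mu, Sigma = Sigma)   # N x m full dataset

idx_pus <- pus_subsample(X, n)   # hypothetical call to the PUS sketch above
X_star  <- X[idx_pus, ]          # subdata of size n
```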
When M in Step 1 of PUS is larger than two, the points selected in each iteration of Step 2 could be close to each other, potentially damaging the space-filling properties of the samples. We propose an Extended PUS algorithm (EPUS) to address this concern. Sun and Tang [29] stated that a strong orthogonal array of strength 2+ is nearly optimal under the uniform projection criterion. Any two-column subarray of an $SOA(n, m, 4, 2+)$ can be collapsed to an $OA(n, 2, 2 \times 4, 2)$ and an $OA(n, 2, 4 \times 2, 2)$; that is, a strong orthogonal array of strength 2+ achieves stratification on $2 \times 4$ and $4 \times 2$ grids in any two-dimensional projection. We present the following Theorem 2 to support the extended algorithm:
Theorem 2.
OA ( 8 , 2 , 2 × 4 , 2 ) and OA ( 16 , 2 , 4 , 2 ) are the uniform designs among OA ( 8 , 2 , 2 × 4 , 1 ) and 4-level factorial designs, respectively.
OA(8, 2, 2×4, 2) is a basic asymmetric orthogonal array that achieves stratification on $2 \times 4$ grids; it has the minimum centered $L_2$-discrepancy among balanced designs with the same levels. OA(16, 2, 4, 2) has the minimum centered $L_2$-discrepancy among 4-level factorial designs. Both designs have more runs than OA(4, 2, 2, 2). The Extended PUS selects points that are closest to one of the above designs, chosen according to the value of M. We list the steps of the Extended PUS in Algorithm 2; an R sketch of the grid setup follows the listing.
Algorithm 2 Extended PUS
  • Step 1 [Initiation]: Scale the full dataset X onto $[0, 1]^m$. Let $X^*$ and $X_{temp}$ be two empty sets. Let $M = n / \left(4\binom{m}{2}\right)$. Denote the two-dimensional projections of X by $X_i^P$, $i = 1, \ldots, \binom{m}{2}$. If $M \le 1$, let $e = 4$ and let $G_i, o_i$, $i = 1, \ldots, e$, be the regions and points in (a) of Figure 2. If $1 < M \le 2$, let $e = 8$ and let $G_i, o_i$, $i = 1, \ldots, e$, be the regions and points in (b) or (c) of Figure 2, chosen at random in each iteration of Steps 2 and 3. If $M > 2$, let $e = 16$ and let $G_i, o_i$, $i = 1, \ldots, e$, be the regions and points in (d) of Figure 2.
  • Step 2 [Extended PUS sampling part 1]: For $i = 1, \ldots, \binom{m}{2}$ and $j = 1, \ldots, e$, select $4M/e$ points in $X_i^P$ that are located in $G_j$ and have the smallest $L_1$-distances from $o_j$ (select randomly on ties and stop if such points are unavailable). Collect the selected points in $X^*$ and remove them from X.
  • Step 3 [Extended PUS sampling part 2]: For $i = 1, \ldots, \binom{m}{2}$ and $j = 1, \ldots, e$, select one point in $X_i^P$ that is located in $G_j$ and has the smallest $L_1$-distance from $o_j$ (select randomly on ties and stop if such a point is unavailable). Collect the selected points in $X_{temp}$ and remove them from X.
  • Step 4 [Criterion to stop]: If the total number of points in $X_{temp}$ and $X^*$ is less than n, move the points in $X_{temp}$ to $X^*$ and repeat Steps 3–4. Otherwise, randomly move points from $X_{temp}$ to $X^*$ until $X^*$ contains n points.
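Only the reference grid changes relative to Algorithm 1. A small R sketch of this choice is given below; the scaling of the 4-level arrays to cell centers, $k \mapsto (2k+1)/(2s)$, is an assumption extrapolated from the mapping used for OA(4, 2, 2, 2) above.

```r
# Reference grids for the Extended PUS: returns the e x 2 matrix of cell centers
# o_1, ..., o_e on [0, 1]^2 as a function of M = n / (4 * choose(m, 2)).
epus_centers <- function(M) {
  lv <- function(s) (2 * (0:(s - 1)) + 1) / (2 * s)   # assumed scaling k -> (2k + 1) / (2s)
  if (M <= 1) {
    g <- expand.grid(lv(2), lv(2))                    # e = 4:  OA(4, 2, 2, 2)
  } else if (M <= 2) {
    # e = 8: OA(8, 2, 2x4, 2) or OA(8, 2, 4x2, 2), chosen at random (Figure 2b,c)
    g <- if (runif(1) < 0.5) expand.grid(lv(2), lv(4)) else expand.grid(lv(4), lv(2))
  } else {
    g <- expand.grid(lv(4), lv(4))                    # e = 16: OA(16, 2, 4, 2)
  }
  as.matrix(g)
}
```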
The Extended PUS selects subdata $X^*$ whose two-dimensional projections consist of points that are closest to OA(4, 2, 2, 2), OA(8, 2, 2×4, 2) or OA(16, 2, 4, 2), depending on the subdata size. These orthogonal arrays are carefully chosen uniform designs that are simple to implement and robust in practice. Note that when OA(8, 2, 2×4, 2) is applied, we randomize the column order of the projected designs under the assumption that all factors are equally important. The time complexity of the Extended PUS is at most 4 times that of the PUS algorithm. Compared to PUS, the new algorithm sacrifices a tolerable amount of computational effort to ensure a more uniform spread of points in the subdata.
Example 2.
Continuation of Example 1. We conduct the same experiment with the Extended PUS. Figure 3 shows the two-dimensional projection plots of X (dots) and $X^*$ (circles) in the first and second dimensions for PUS and the Extended PUS. Points selected by the Extended PUS no longer cluster around specific points and appear more space-filling than those selected by PUS. The R code implementing the PUS and Extended PUS algorithms is publicly available in the GitHub repository at https://github.com/yetian28/pus, posted on 15 September 2024.
We present a comparison study to illustrate that our algorithms are good at selecting subdata with small $\phi(X^*)$ and centered $L_2$-discrepancy. The full dataset X is generated as described in Example 1 under the following four parameter settings: (a) $N = 10^4$, $m = 5$, $n = 50$; (b) $N = 10^4$, $m = 5$, $n = 100$; (c) $N = 10^5$, $m = 10$, $n = 100$; (d) $N = 10^5$, $m = 10$, $n = 300$. For each generated X, we select subdata by PUS, Extended PUS and MR 100 times in cases (a), (b) and (c) and 10 times in case (d), owing to time complexity constraints. Since IBOSS is a deterministic algorithm, we run it once in each of the four cases. Figure 4 displays boxplots of $\phi(X^*)$ and $CD(X^*)$ for the four algorithms in the four cases.
As clearly shown in Figure 4, subdata selected by PUS and Extended PUS have smaller $\phi(X^*)$ and $CD(X^*)$ than those of the other two methods in all cases. Points selected by our algorithms demonstrate uniformity under both the uniform projection criterion and the centered $L_2$-discrepancy. IBOSS has the worst criterion values in all cases. In cases (b) and (d), where the subdata size is relatively large, PUS outperforms the Extended PUS under both criteria.
Time efficiency is a pivotal factor when evaluating subsampling algorithms. We conducted an experiment to compare the speed of our proposed algorithms with that of IBOSS and MR. The full dataset $X = \{x_1, \ldots, x_N\}$ is generated by $x_i \sim U[-1, 1]^m$, $i = 1, \ldots, N$. We measure the runtime for sampling in the following scenarios: (i) fix $m = 5$, $n = 100$ and vary the size of the full dataset ($N = 10^3$ and $N = 10^6$); (ii) fix $N = 10^5$, $m = 5$ and vary the size of the subdata ($n = 50$ and $n = 500$); (iii) fix $N = 10^5$, $n = 100$ and vary the dimension of the data points ($m = 5$ and $m = 10$). The runtimes of the algorithms, programmed with R and Rcpp and run on a server with an Intel(R) Core(TM) i7-12700 processor under Windows, are listed in Table 1.
In general, the speeds of our algorithms are acceptable and are sensitive mainly to the size of the full data. IBOSS requires the least time in all scenarios. PUS and Extended PUS have similar speeds and are significantly faster than MR. As the runtimes in our experiment show, the speeds of PUS and Extended PUS are nearly unaffected by the size of the subdata; in contrast, MR slows down dramatically as the subdata size grows. In particular, MR needs around 4.5 h to select 500 samples from a dataset with $10^5$ points in 5 dimensions, while our algorithms complete the same task in less than 4 s.

4. Numerical Studies

We use two simulation studies and two real-life examples to demonstrate that the PUS and Extended PUS algorithms are capable of selecting space-filling subdata with superior performance in building statistical surrogate models.

4.1. Simulation Study 1

Consider a full dataset X generated as described in Example 1 with $N = 10^4$ and $m = 5$. We create $Y = (y_1, \ldots, y_N)$ from the following two models (the same as the models in Section 3.2 of [22]):
Model (A): $y_i = \beta_0 + \sum_{j=1}^{m}\beta_j x_{ij} + \varepsilon_i$;
Model (B): $y_i = 30\sin(x_{i1}) + x_{i2}^3 + 2x_{i3}^2 + x_{i4}x_{i5} + \varepsilon_i$,
where $\beta_j = 1$, $j = 1, \ldots, m$, and $\varepsilon_i$, $i = 1, \ldots, N$, independently follow $N(0, \sigma^2)$ with $\sigma = 3$.
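For concreteness, responses under the two models can be simulated as in the short R sketch below, continuing the data-generation sketch from Example 1; the intercept $\beta_0 = 1$ is an illustrative choice, since the paper only specifies $\beta_j = 1$ for $j \ge 1$.

```r
# Simulate responses under Models (A) and (B); X, N, m come from the earlier sketch.
sigma <- 3
beta  <- rep(1, m)
yA <- 1 + as.vector(X %*% beta) + rnorm(N, sd = sigma)        # Model (A); beta_0 = 1 assumed
yB <- 30 * sin(X[, 1]) + X[, 2]^3 + 2 * X[, 3]^2 +
      X[, 4] * X[, 5] + rnorm(N, sd = sigma)                  # Model (B)
```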
We obtain subdata $X^*$ of size $n = 50$ by the PUS, Extended PUS, IBOSS and MR methods, respectively. Assuming no prior knowledge of the models generating Y, as would be true in many cases, we fit a linear model to predict $y_i$ from the subdata $X^*$. The prediction performance is evaluated by the mean squared error (MSE) described in [22] to ensure a fair comparison.
The theoretical mean squared error $E\{E\{(\hat{y}(x) - \mu(x))^2 \mid x\}\}$ can be approximated by $\mathrm{MSE} = (1/N)\sum_{i=1}^{N} E(\hat{y}(x_i) - \mu(x_i))^2$, where $\mu(x) = E(y \mid x)$ and $\hat{y}(x)$ is the predicted value. The MSE can be decomposed into two parts, $\mathrm{MSE} = B + V$, where
$$ B = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Bias}\big(\hat{y}(x_i), \mu(x_i)\big)^2, \qquad V = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Var}\big(\hat{y}(x_i)\big). $$
The response $y_i$ for each $x_i$ is generated $S = 100$ times. Let $\hat{y}_i^{(s)}$, $s = 1, \ldots, S$, denote the prediction of $y_i$ in the s-th repetition. The estimates of B and V are calculated as follows:
$$ \hat{B} = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{S}\sum_{s=1}^{S}\hat{y}_i^{(s)} - \mu(x_i)\right)^2 \quad (3) $$
and
$$ \hat{V} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{S}\sum_{s=1}^{S}\left(\hat{y}_i^{(s)} - \frac{1}{S}\sum_{s'=1}^{S}\hat{y}_i^{(s')}\right)^2. \quad (4) $$
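Given the S repeated predictions, (3) and (4) reduce to simple row-wise averages; a minimal R helper (names are illustrative, Yhat is an N × S matrix with Yhat[i, s] = $\hat{y}_i^{(s)}$ and mu holds $\mu(x_i)$):

```r
# Estimate B, V and MSE = B + V from repeated predictions
bias_variance_mse <- function(Yhat, mu) {
  S    <- ncol(Yhat)
  ybar <- rowMeans(Yhat)                       # average prediction per point
  B    <- mean((ybar - mu)^2)                  # estimate (3)
  V    <- mean(rowSums((Yhat - ybar)^2) / S)   # estimate (4)
  c(B = B, V = V, MSE = B + V)
}
```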
We randomly generate the full dataset X 100 times and, for each subdata $X^*$ selected by the four algorithms, calculate the estimated mean squared error $\mathrm{MSE} = \hat{B} + \hat{V}$ using the definitions in (3) and (4). Figure 5 and Figure 6 show the distribution of the MSE for model (A) and model (B), respectively.
For the linear model (A), we fit the correct model for prediction. IBOSS is designed for linear regression models, and the subdata selected by IBOSS achieve the smallest mean squared errors in the distribution; if the generating model is confirmed to be linear, the IBOSS method is a good choice in terms of accuracy and efficiency. For the nonlinear model (B), on the other hand, we use a misspecified model for prediction. PUS outperforms the other three methods, and the subdata selected by PUS achieve the smallest mean squared errors in the distribution. Extended PUS and MR have similar performance, while IBOSS is not suitable for predicting nonlinear models. The space-filling samples selected by our algorithms demonstrate robustness under model misspecification. The MR method is also designed to be robust; however, its time complexity remains a major drawback.

4.2. Simulation Study 2

Santner et al. [30] pointed out that Gaussian process models are well suited to space-filling designs. A Gaussian process over a space Z is a stochastic process characterized by a mean function $\mu: Z \to \mathbb{R}$ and a kernel function $k: Z \times Z \to \mathbb{R}$ such that
$$ y(x) = \mu(x) + \epsilon(x), \quad \mu(x) = E[y(x)], \quad k(x, x') = E\big[(y(x) - \mu(x))(y(x') - \mu(x'))\big], $$
with $x, x' \in Z$. Prediction is made by the posterior mean of the predictive distribution of $y(x)$ given the data. A constant mean function $\mu$ is usually good enough for a Gaussian process model [31].
We fit Gaussian process models with a squared exponential kernel to make predictions. The full dataset X and its corresponding Y are the same as those used in the previous simulation study. The size of the subdata ranges from $n = 60$ to 420. Subdata $X^*$ of each size are obtained by PUS, Extended PUS, IBOSS and MR, respectively, and a Gaussian process model is fitted to each $X^*$. To evaluate the performance of the fitted models, we calculate the mean squared prediction error (MSPE) by
$$ \mathrm{MSPE} = \frac{1}{N}\sum_{i=1}^{N}\big(\hat{y}(x_i) - y_i\big)^2. \quad (5) $$
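A sketch of the fitting and evaluation step is given below, assuming the DiceKriging package for the Gaussian process (squared exponential kernel via covtype = "gauss"); X, yB and idx_pus are reused from the earlier sketches, and the nugget value is an illustrative choice for numerical stability.

```r
library(DiceKriging)

# Fit a constant-mean GP on the subdata and evaluate (5) on the full data
y     <- yB                                    # responses, e.g., from Model (B)
X_sub <- X[idx_pus, ]; y_sub <- y[idx_pus]     # subdata selected earlier
fit   <- km(formula = ~1, design = data.frame(X_sub), response = y_sub,
            covtype = "gauss", nugget = 1e-6)
pred  <- predict(fit, newdata = data.frame(X), type = "UK")$mean
mspe  <- mean((pred - y)^2)                    # MSPE as in (5)
```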
The mean squared prediction errors defined in (5) are computed from Gaussian process models fitted with subdata of varying sizes to elucidate the performance trends of the subsampling algorithms. Figure 7 and Figure 8 display the outcomes of this study. Note that the presented mean squared prediction errors for the PUS, Extended PUS and MR are the average results of 100 repeated sampling processes.
For the linear model (A), the subdata selected by IBOSS produce more accurate predictions than other algorithms when n < 200 . When n > 200 , PUS achieves the best mean squared prediction errors. MR shows competitive performance across most sizes of the subdata. For the nonlinear model (B), PUS outperforms the other three methods across most subdata sizes (except for n = 60 ). Extended PUS and MR have comparable performance. However, when the subdata size exceeds 200, the sampling process of MR slows down significantly and harms sampling efficiency. IBOSS produces the worst mean squared prediction errors in this case.
We fit different surrogate models to the same dataset and obtain similar conclusions about the performance of the four subsampling algorithms. IBOSS is the best choice for sampling from data generated by linear models. For data originating from nonlinear sources, our algorithms demonstrate both accuracy and efficiency in selecting space-filling samples for prediction.

4.3. Real-Life Case Study 1

We examine the proposed subsampling algorithms under a real-life model. Specifically, we consider a practical drilling scenario in which water flows from an upper aquifer through a borehole to a lower aquifer, the two aquifers being separated by an impermeable rock layer. This illustrative model has been used broadly in the domains of computer experiments and subsampling methodology (see Zhang et al. [1], Tian and Xu [32]). The response variable Y, which represents the borehole flow rate in m³/yr, is determined by the complex nonlinear function
$$ Y = \frac{5\, T_u (H_u - H_l)}{\ln(r/r_w)\left(1.5 + \dfrac{2 L T_u}{\ln(r/r_w)\, r_w^2 K_w} + \dfrac{T_u}{T_l}\right)}, $$
where the 8 input variables and their typical ranges are as follows: $r_w \in [0.05, 0.15]$ denotes the borehole radius in meters (m); $r \in [100, 50{,}000]$ denotes the radius of influence (m); $T_u \in [63{,}070, 115{,}600]$ indicates the transmissivity of the upper aquifer in square meters per year (m²/yr); $T_l \in [63.1, 116]$ indicates the transmissivity of the lower aquifer (m²/yr); $H_u \in [990, 1110]$ represents the potentiometric head of the upper aquifer (m); $H_l \in [700, 820]$ represents the potentiometric head of the lower aquifer (m); $L \in [1120, 1680]$ defines the length of the borehole (m); and $K_w \in [9855, 12{,}045]$ refers to the hydraulic conductivity of the borehole in meters per year (m/yr). The variable $r_w$ follows a normal distribution $N(0.10, 0.016181^2)$ and r follows a lognormal distribution $\mathrm{Lognormal}(7.71, 1.0056^2)$. The distributions of the remaining variables are modeled as continuous uniform distributions on their respective ranges.
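The borehole response and its input distributions can be reproduced with the short R sketch below; the seed is arbitrary and the object names are illustrative.

```r
# Borehole flow rate (m^3/yr) as given above
borehole <- function(rw, r, Tu, Tl, Hu, Hl, L, Kw) {
  5 * Tu * (Hu - Hl) /
    (log(r / rw) * (1.5 + 2 * L * Tu / (log(r / rw) * rw^2 * Kw) + Tu / Tl))
}

set.seed(2)
N  <- 1e6
rw <- rnorm(N, mean = 0.10, sd = 0.016181)
r  <- rlnorm(N, meanlog = 7.71, sdlog = 1.0056)
Tu <- runif(N, 63070, 115600); Tl <- runif(N, 63.1, 116)
Hu <- runif(N, 990, 1110);     Hl <- runif(N, 700, 820)
L  <- runif(N, 1120, 1680);    Kw <- runif(N, 9855, 12045)

X <- cbind(rw, r, Tu, Tl, Hu, Hl, L, Kw)          # 10^6 x 8 full dataset
Y <- log(borehole(rw, r, Tu, Tl, Hu, Hl, L, Kw))  # logarithm of the borehole response
```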
To evaluate the effectiveness of the various subsampling methods in building statistical surrogate models on large-scale datasets, we consider a full dataset of size $10^6$. The size of the subdata ranges from $n = 100$ to 400. We calculate the mean squared prediction errors defined in (5) for approximating the logarithm of the borehole function with Gaussian process models fitted to the samples selected by PUS, Extended PUS, IBOSS and MR. Figure 9 presents the computed mean squared prediction errors. Each value presented for PUS and Extended PUS is the average outcome of 100 repeated executions to eliminate randomness; IBOSS and MR are executed once, because the IBOSS method is deterministic and the MR method is too expensive to run repeatedly.
From Figure 9, Extended PUS outperforms the other three methods across most subdata sizes. MR has the second-best performance, at a notable increase in computational time. The borehole function has a limited number of significant factors together with strong interaction effects, so samples that are space-filling in low-dimensional projections should excel in this simulation. Extended PUS demonstrates its advantage over the other methods in selecting samples for factor screening and interaction-effect detection. PUS performs poorly and encounters instability in this case study, which could be attributed to the following reason: the borehole function is relatively easy for factor screening, so PUS wastes resources on insignificant factors and is thus unable to capture interaction effects efficiently.

4.4. Real-Life Case Study 2

In the subsequent analysis, we evaluate the effectiveness of our subsampling algorithms on a dataset retrieved from the UCI Machine Learning Repository [33]. The dataset concerns the physicochemical properties of protein tertiary structure, derived from the Critical Assessment of Structure Prediction cycles 5–9. The main objective is to model the root mean squared deviation (RMSD), an indicator used in protein structure prediction algorithms. Based on the domain expertise presented by Iraji and Ameri [34], nine normalized factors are used as regressors: total surface area, non-polar exposed area, fractional area of exposed non-polar residue, fractional area of the exposed non-polar part of residue, molecular mass-weighted exposed area, average deviation from the standard exposed area of residue, Euclidean distance, secondary structure penalty, and spatial distribution constraints [35]. The dataset consists of N = 45,730 points of the factors and the corresponding RMSD, with no missing data. Subdata of sizes ranging from n = 60 to 560 are obtained by PUS, Extended PUS, IBOSS and MR, respectively. We fit Gaussian process models for predicting RMSD using the subdata selected by the four algorithms. The mean squared prediction errors of the fitted models, defined in (5), are presented in Figure 10; the values for PUS, Extended PUS and MR are averages over 10 repeated sampling processes.
From Figure 10, PUS performs best in general across the various subdata sizes. The subdata selected by MR have the smallest mean squared prediction error at $n = 60$. When $n \ge 300$, PUS and Extended PUS outperform MR and IBOSS. Extended PUS performs unstably when n is between 150 and 300, where the subdata are selected to be close to OA(8, 2, 2×4, 2) in two-dimensional projections; the reason for the unstable performance may lie in the unbalanced representation of the different factors. The subdata selected by IBOSS produce the largest mean squared prediction errors. Interestingly, the mean squared prediction errors obtained by IBOSS increase as the subdata size increases, which implies that little information is conveyed by data points located on the edges. In some real-life situations, edge data have a greater chance of containing large measurement errors, mistakes and abnormalities; IBOSS is not suitable for such situations.
The RMSD prediction experiment confirms that space-filling subdata are robust in building statistical surrogate models for unknown or complex generating models. Both PUS and Extended PUS show good performance in this study. MR selects space-filling subdata with a huge computation burden and thus is not a good choice in real applications.

5. Conclusions

In this paper, we introduce the PUS and Extended PUS algorithms, which aim to efficiently extract informative subdata from large datasets for building statistical surrogate models. The proposed algorithms select space-filling data points that are closest to uniform designs in two-dimensional projections, and the subdata obtained by our algorithms perform well under the uniform projection criterion and the centered $L_2$-discrepancy. Most existing subsampling algorithms are model-based methods that require the data-generating model to be specified. Our proposed algorithms are model-free and suit situations where there is no clue about the data-generating process or the generating model is complicated. Compared to the MR algorithm, another model-free subsampling method, our approaches show great advantages in computational complexity. Through two simulation studies and two real-life case studies, we demonstrate the advantage of our approaches over IBOSS and MR in developing surrogate models and making predictions in scenarios involving model misspecification and complication. In conclusion, our proposed PUS and Extended PUS algorithms (i) obtain space-filling samples, especially in low-dimensional projections; (ii) are robust under model misspecification and complication; (iii) are fast to compute; and (iv) perform well in developing surrogate models and making predictions.
There are still many ways to improve our algorithms. Currently, the uniform designs used in our algorithms are full factorial designs with $2^p$ levels, where p is a positive integer; other full factorial designs, such as OA(9, 2, 3, 2), offer different options that are worth exploring further. Extreme outliers can slow down the sampling speed and skew the selection process. Our study focuses solely on discrepancy-related criteria without exploring alternative criteria. The computational efficiency of our algorithms also remains burdensome when handling datasets with a substantial number of data points. Future research will focus on refining the following aspects: (i) exploring flexible choices of space-filling criteria and designs used in the algorithms; (ii) proposing a data preparation protocol prior to applying the subsampling algorithms; and (iii) developing strategies to enhance the computational efficiency of our algorithms.

Author Contributions

Conceptualization, Y.T.; methodology, Y.T. and Y.S.; software, Y.S.; validation, Y.T., W.L. and Y.S.; formal analysis, Y.T. and Y.S.; writing—original draft preparation, Y.S.; writing—review and editing, Y.T. and W.L.; visualization, Y.T. and Y.S.; funding acquisition, Y.T. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fundamental Research Funds for the Central Universities.

Data Availability Statement

The data presented in this study are openly available in The UCI Machine Learning Repository [35].

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

In this appendix, we provide the proofs of the theorems. We first introduce some necessary notation. Consider the point set $G_{s,q} = \{(i_1, \ldots, i_s)\}$, where $i_j \in \{0, \ldots, q-1\}$; $G_{s,q}$ has $q^s$ points in total. Let P be a collection of n points in $G_{s,q}$ and let $\mathcal{P}(n; q^s)$ be the set of all such P. P can also be written as a matrix whose rows are the points of P. Denote by $n((i_1, \ldots, i_s); P)$ the number of occurrences of $(i_1, \ldots, i_s)$ in P, and let $y(P)$ be the $q^s$-vector with elements $n((i_1, \ldots, i_s); P)$ arranged lexicographically in $(i_1, \ldots, i_s)$. The following result from [36] is important for the proofs of the theorems:
Lemma A1
([36] (Theorem 1)). Let $P \in \mathcal{P}(n; q^s)$ be a set of n points or an $n \times s$ matrix.
(1) When $q = 2$ or q is odd, P minimizes $CD(P)$ over $\mathcal{P}(n; q^s)$ if and only if $y(P) = (n/q^s)\mathbf{1}$.
(2) When q is even (but not 2), $s = 1, 2$, and P is an orthogonal array of strength one, P minimizes $CD(P)$ over $\mathcal{P}(n; q^s)$ if and only if $y(P) = (n/q^s)\mathbf{1}$.
Proof of Theorem 1.
OA(4, 2, 2, 2) is a point set P in $\mathcal{P}(4; 2^2)$ with $y(P) = (1, 1, 1, 1)$. We complete the proof by a straightforward application of part (1) of Lemma A1. □
$$ OA(4, 2, 2, 2) = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}^T $$
Proof of Theorem 2.
There are 43 different types of OA(8, 2, 2×4, 1), disregarding row order. We calculate the centered $L_2$-discrepancy for all of them and find 5 distinct values. Listing points with multiplicity, the representative designs are
$D_1 = \{(0,0), (0,1), (0,2), (0,3), (1,0), (1,1), (1,2), (1,3)\}$,
$D_2 = \{(0,0), (0,0), (0,1), (0,1), (1,2), (1,2), (1,3), (1,3)\}$,
$D_3 = \{(0,0), (0,1), (0,1), (0,2), (1,0), (1,2), (1,3), (1,3)\}$,
$D_4 = \{(0,0), (0,1), (0,3), (0,3), (1,0), (1,1), (1,2), (1,2)\}$,
$D_5 = \{(0,0), (0,0), (0,3), (0,3), (1,2), (1,2), (1,1), (1,1)\}$,
with $CD(D_1) = 0.1139187$, $CD(D_2) = 0.1197781$, $CD(D_3) = 0.1148953$, $CD(D_4) = 0.1158719$ and $CD(D_5) = 0.117825$. From these numerical results, $D_1 = OA(8, 2, 2 \times 4, 2)$ has the smallest centered $L_2$-discrepancy and is therefore the uniform design among OA(8, 2, 2×4, 1).
OA(16, 2, 4, 2) is a point set P in $\mathcal{P}(16; 4^2)$ with $y(P) = (1, \ldots, 1)$. We complete the proof by a straightforward application of part (2) of Lemma A1. □
$$ OA(16, 2, 4, 2) = \begin{pmatrix} 0 & 1 & 2 & 3 & 0 & 1 & 2 & 3 & 0 & 1 & 2 & 3 & 0 & 1 & 2 & 3 \\ 0 & 1 & 2 & 3 & 1 & 0 & 3 & 2 & 2 & 3 & 0 & 1 & 3 & 2 & 1 & 0 \end{pmatrix}^T $$
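As a quick numerical check, the array above can be verified in R to have strength two (every level pair occurs exactly once); the centered_L2_discrepancy() helper from the Section 2 sketch can then be applied to its scaled version, where the scaling $k \mapsto (2k+1)/8$ is an assumption analogous to the mapping used in Section 3.

```r
# The two columns of OA(16, 2, 4, 2) listed in the proof
oa16 <- cbind(rep(0:3, 4),
              c(0, 1, 2, 3, 1, 0, 3, 2, 2, 3, 0, 1, 3, 2, 1, 0))
all(table(oa16[, 1], oa16[, 2]) == 1)        # TRUE: every level pair occurs once (strength 2)
centered_L2_discrepancy((2 * oa16 + 1) / 8)  # CD of the array scaled by k -> (2k + 1) / 8
```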

References

  1. Zhang, M.; Zhou, Y.; Zhou, Z.; Zhang, A. Model-Free Subsampling Method Based on Uniform Designs. IEEE Trans. Knowl. Data Eng. 2024, 36, 1210–1220. [Google Scholar] [CrossRef]
  2. Mahendran, A.; Thompson, H.; McGree, J.M. A model robust subsampling approach for Generalised Linear Models in big data settings. Stat. Pap. 2023, 64, 1137–1157. [Google Scholar] [CrossRef]
  3. Drovandi, C.C.; Holmes, C.; McGree, J.M.; Mengersen, K.; Richardson, S.; Ryan, E.G. Principles of experimental design for big data analysis. Stat. Sci. A Rev. J. Inst. Math. Stat. 2017, 32, 385. [Google Scholar] [CrossRef] [PubMed]
  4. Yi, S.Y.; Zhou, Y.D. Model-free global likelihood subsampling for massive data. Stat. Comput. 2023, 33, 9. [Google Scholar] [CrossRef]
  5. Wang, H.; Yang, M.; Stufken, J. Information-based optimal subdata selection for big data linear regression. J. Am. Stat. Assoc. 2019, 114, 393–405. [Google Scholar] [CrossRef]
  6. Wang, L.; Elmstedt, J.; Wong, W.K.; Xu, H. Orthogonal subsampling for big data linear regression. Ann. Appl. Stat. 2021, 15, 1273–1290. [Google Scholar] [CrossRef]
  7. Ma, P.; Mahoney, M.; Yu, B. A statistical perspective on algorithmic leveraging. In Proceedings of the International Conference on Machine Learning. PMLR, Beijing, China, 21–26 June 2014; pp. 91–99. [Google Scholar]
  8. Ma, P.; Chen, Y.; Zhang, X.; Xing, X.; Ma, J.; Mahoney, M.W. Asymptotic analysis of sampling estimators for randomized numerical linear algebra algorithms. J. Mach. Learn. Res. 2022, 23, 7970–8014. [Google Scholar]
  9. Derezinski, M.; Warmuth, M.K. Unbiased estimates for linear regression via volume sampling. Adv. Neural Inf. Process. Syst. 2017, 30, 1748. [Google Scholar]
  10. Derezinski, M.; Warmuth, M.K.; Hsu, D.J. Leveraged volume sampling for linear regression. Adv. Neural Inf. Process. Syst. 2018, 31, 1249. [Google Scholar]
  11. Ren, M.; Zhao, S.L. Subdata selection based on orthogonal array for big data. Commun. Stat.-Theory Methods 2023, 52, 5483–5501. [Google Scholar] [CrossRef]
  12. Wang, H.; Zhu, R.; Ma, P. Optimal subsampling for large sample logistic regression. J. Am. Stat. Assoc. 2018, 113, 829–844. [Google Scholar] [CrossRef] [PubMed]
  13. Yu, J.; Wang, H.; Ai, M.; Zhang, H. Optimal distributed subsampling for maximum quasi-likelihood estimators with massive data. J. Am. Stat. Assoc. 2022, 117, 265–276. [Google Scholar] [CrossRef]
  14. Ai, M.; Yu, J.; Zhang, H.; Wang, H. Optimal subsampling algorithms for big data regressions. Stat. Sin. 2021, 31, 749–772. [Google Scholar] [CrossRef]
  15. Yao, Y.; Wang, H. Optimal subsampling for softmax regression. Stat. Pap. 2019, 60, 585–599. [Google Scholar] [CrossRef]
  16. Johnson, M.E.; Moore, L.M.; Ylvisaker, D. Minimax and maximin distance designs. J. Stat. Plan. Inference 1990, 26, 131–148. [Google Scholar] [CrossRef]
  17. Xiao, Q.; Xu, H. Construction of maximin distance Latin squares and related Latin hypercube designs. Biometrika 2017, 104, 455–464. [Google Scholar] [CrossRef]
  18. Fang, K.; Liu, M.Q.; Qin, H.; Zhou, Y.D. Theory and Application of Uniform Experimental Designs; Springer: Berlin/Heidelberg, Germany, 2018; Volume 221. [Google Scholar]
  19. McKay, M.D.; Beckman, R.J.; Conover, W.J. Comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Technometrics 1979, 21, 239–245. [Google Scholar] [CrossRef]
  20. Tang, B. Orthogonal array-based Latin hypercubes. J. Am. Stat. Assoc. 1993, 88, 1392–1397. [Google Scholar] [CrossRef]
  21. He, Y.; Tang, B. Strong orthogonal arrays and associated Latin hypercubes for computer experiments. Biometrika 2013, 100, 254–260. [Google Scholar] [CrossRef]
  22. Shi, C.; Tang, B. Model-robust subdata selection for big data. J. Stat. Theory Pract. 2021, 15, 82. [Google Scholar] [CrossRef]
  23. Joseph, V.R.; Gul, E.; Ba, S. Maximum projection designs for computer experiments. Biometrika 2015, 102, 371–380. [Google Scholar] [CrossRef]
  24. Box, G.E.; Meyer, R.D. An Analysis for Unreplicated Fractional Factorials. Technometrics 1986, 28, 11–18. [Google Scholar] [CrossRef]
  25. Sun, F.; Wang, Y.; Xu, H. Uniform projection designs. Ann. Stat. 2019, 47, 641–661. [Google Scholar] [CrossRef]
  26. Hickernell, F.J. Lattice rules: How well do they measure up? In Random and Quasi-Random Point Sets; Springer: Berlin/Heidelberg, Germany, 1998; pp. 109–166. [Google Scholar]
  27. Zhou, Y.D.; Fang, K.T.; Ning, J.H. Mixture discrepancy for quasi-random point sets. J. Complex. 2013, 29, 283–301. [Google Scholar] [CrossRef]
  28. He, Y.; Cheng, C.S.; Tang, B. Strong orthogonal arrays of strength two plus. Ann. Stat. 2018, 46, 457–468. [Google Scholar] [CrossRef]
  29. Sun, C.Y.; Tang, B. Uniform projection designs and strong orthogonal arrays. J. Am. Stat. Assoc. 2023, 118, 417–423. [Google Scholar] [CrossRef]
  30. Santner, T.J.; Williams, B.J.; Notz, W.I.; Williams, B.J. The Design and Analysis of Computer Experiments; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
  31. Sacks, J.; Welch, W.J.; Mitchell, T.J.; Wynn, H.P. Design and analysis of computer experiments (with discussion). Stat. Sci. 1989, 4, 409–435. [Google Scholar]
  32. Tian, Y.; Xu, H. A minimum aberration-type criterion for selecting space-filling designs. Biometrika 2022, 109, 489–501. [Google Scholar] [CrossRef]
  33. The UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/about (accessed on 22 August 2024).
  34. Iraji, M.S.; Ameri, H. RMSD protein tertiary structure prediction with soft computing. IJMSC-Int. J. Math. Sci. Comput. (IJMSC) 2016, 2, 24–33. [Google Scholar] [CrossRef]
  35. Rana, P. Physicochemical Properties of Protein Tertiary Structure. UCI Machine Learning Repository. 2013. Available online: https://archive.ics.uci.edu/dataset/265/physicochemical+properties+of+protein+tertiary+structure (accessed on 22 August 2024).
  36. Ma, C.X.; Fang, K.T.; Lin, D.K. A note on uniformity and orthogonality. J. Stat. Plan. Inference 2003, 113, 323–334. [Google Scholar] [CrossRef]
Figure 1. Subdata selected by PUS, IBOSS and MR.
Figure 2. OA(4, 2, 2, 2) (a), OA(8, 2, 2×4, 2) (b), OA(8, 2, 4×2, 2) (c) and OA(16, 2, 4, 2) (d) on the unit plane.
Figure 3. Subdata selected by PUS and Extended PUS.
Figure 4. $\phi(X^*)$ and $CD(X^*)$ obtained under the (a–d) settings of parameters.
Figure 5. Distribution of MSE for predicting responses in Model (A).
Figure 6. Distribution of MSE for predicting responses in Model (B).
Figure 7. The mean squared prediction errors for predicting responses in Model (A).
Figure 8. The mean squared prediction errors for predicting responses in Model (B).
Figure 9. The mean squared prediction errors for the borehole experiments.
Figure 10. The mean squared prediction errors for predicting RMSD in the protein tertiary structure data.
Table 1. Comparison of runtime for the different methods.

Methods    m = 5, n = 100                 N = 10^5, m = 5               N = 10^5, n = 100
           N = 10^3       N = 10^6        n = 50        n = 500         m = 5         m = 10
PUS        0.0786 s       20.1868 s       2.1963 s      2.2283 s        1.9579 s      4.2223 s
EPUS       0.2230 s       21.3274 s       1.8430 s      3.8022 s        1.8061 s      8.48189 s
IBOSS      0.0035 s       0.0931 s        0.0091 s      0.0474 s        0.0091 s      0.0173 s
MR         1.7674 s       20.9574 min     25.3371 s     4.5806 h        2.5855 min    4.7106 min