1. Introduction
Population data are indispensable for various sustainable development applications, including disaster assessment, urban planning, and public health management [1,2,3,4,5,6]. While census data serve as the primary source of population data, their coarse resolution limits the revelation of spatial heterogeneity within census units, hindering their application in research related to global social and environmental issues [7]. To address this limitation, several large-scale gridded population datasets have been produced, such as GPWv4, HRSL, LandScan, and WorldPop [8].
The datasets above are generated through top-down population estimation methods, in which census data are disaggregated into uniform grid cells based on population distribution weighting layers [9]. Over the past three decades, various modeling approaches have been developed to calculate the weighting layers, including areal weighting [10,11], negative exponential [12,13], kernel density [14,15,16], and dasymetric mapping models [17]. With the rapid development of AI technology, intelligent dasymetric mapping has gradually become the dominant approach in Gridded Population Mapping (GPM) studies. This approach leverages algorithms to model the unknown prior relationships between auxiliary variables and the population to obtain the weighting layer [18,19,20,21]. A notable example is the Random Forest (RF) model used to generate the WorldPop product [17,22,23]. Additionally, Multiple Linear Regression (MLR) is commonly used in GPM studies [5,19]. While both MLR and RF have shown relatively good performance in urban-scale GPM studies, they may not produce accurate results in large-scale study areas (e.g., China). This is primarily due to the significant regional variations in population distribution patterns across such expansive areas. Specifically, the relationship between population and auxiliary (explanatory) variables is spatially heterogeneous (non-stationary) and multi-scale. Using a single global model (e.g., MLR or RF) can mask interregional heterogeneity behind ‘average’ estimates for the study area as a whole, potentially leading to inaccurate predictions of population distribution in localized regions.
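The top-down disaggregation step described above can be illustrated with a minimal sketch. It assumes that a weighting layer is already available; the function name, the toy grid, and the census totals are hypothetical and are not taken from any of the cited products.

```python
import numpy as np

def disaggregate(unit_ids, weights, unit_totals):
    """Split each census unit's total over its grid cells in proportion to weight."""
    unit_ids = np.asarray(unit_ids)
    weights = np.asarray(weights, dtype=float)
    population = np.zeros_like(weights)
    for unit, total in unit_totals.items():
        in_unit = unit_ids == unit
        weight_sum = weights[in_unit].sum()
        if weight_sum > 0:
            # Dasymetric case: cells with larger weights receive more people
            population[in_unit] = total * weights[in_unit] / weight_sum
        elif in_unit.any():
            # No informative weight in this unit: fall back to areal weighting
            population[in_unit] = total / in_unit.sum()
    return population

# Hypothetical 2 x 3 grid covering two census units (ids 1 and 2)
unit_ids = np.array([[1, 1, 2],
                     [1, 2, 2]])
weights = np.array([[0.0, 3.0, 1.0],
                    [1.0, 1.0, 2.0]])
grid_pop = disaggregate(unit_ids, weights, {1: 400.0, 2: 800.0})
```

Because each unit's cells are rescaled to its census total, the gridded values always sum back to the input counts; the weighting layer only controls how people are distributed within a unit.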
New parameter estimation methods offer promising solutions. For instance, MLR assumes the relationship between the dependent and explanatory variables is spatially stationary [24]. To address this limitation, Geographically Weighted Regression (GWR) was developed by Fotheringham and colleagues in 1996. GWR extends MLR by adopting the non-parametric locally weighted regression used in curve fitting and smoothing [25]. Unlike MLR, GWR considers the non-stationarity of the spatial relationship between the dependent variable and explanatory variables, making it more effective for analyzing factors related to spatial locations [4,5,6,19]. However, both MLR and GWR are limited in revealing spatial scale differences in the relationships between explanatory variables and the dependent variable. Specifically, the influence of different explanatory variables may be similar within a specific range but differ significantly beyond that range. To address this issue, Fotheringham et al. proposed Multiscale Geographically Weighted Regression (MGWR) in 2017 [26]. Yu et al. [27] further supplemented and improved the statistical inference of MGWR, making the method more widely applicable in research. Compared to GWR, MGWR assigns a specific bandwidth to each explanatory variable, allowing spatial relationship models to be established that are closer to reality.
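The difference between a global and a local model can be made concrete with a small sketch of basic GWR, in which one weighted least-squares model is fitted per location. The Gaussian kernel, the fixed bandwidth, and the synthetic two-region data are assumptions made for illustration only. MGWR would additionally calibrate a separate bandwidth for each explanatory variable, which is not shown here.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Fit one weighted least-squares model per location (basic GWR)."""
    coords = np.asarray(coords, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones for the local intercept
    design = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    betas = np.empty((len(y), design.shape[1]))
    for i, location in enumerate(coords):
        dist = np.linalg.norm(coords - location, axis=1)
        w = np.exp(-0.5 * (dist / bandwidth) ** 2)  # Gaussian distance decay
        xtw = design.T * w  # X^T W without building the diagonal matrix
        betas[i] = np.linalg.solve(xtw @ design, xtw @ y)
    return betas

# Synthetic example: the slope is 1 in the western region and 3 in the eastern one
west = np.column_stack([np.arange(20) % 5, np.arange(20) // 5]).astype(float)
east = west + [100.0, 0.0]
coords = np.vstack([west, east])
x = np.tile(np.linspace(1.0, 10.0, 20), 2)
true_slope = np.where(coords[:, 0] < 50, 1.0, 3.0)
y = 2.0 + true_slope * x

local_betas = gwr_coefficients(coords, x, y, bandwidth=10.0)
global_beta = np.linalg.lstsq(np.column_stack([np.ones(40), x]), y, rcond=None)[0]
```

In this toy case the global fit returns a single slope of about 2, which is wrong in both regions, whereas the local fits recover slopes close to 1 in the west and 3 in the east.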
Auxiliary datasets, such as those on land use/land cover, topography, roads, and rivers, are often used in large-scale GPM studies. However, these datasets primarily reflect the potential for human settlement rather than directly indicating whether a specific location is inhabited [8]. Although mobile phone location data can provide real-time insights into population distribution [28,29], their limited accessibility poses challenges for large-scale GPM applications. Compared to data such as land use/cover and topography, human settlement data directly indicate whether a site is inhabited, enabling a more accurate and detailed depiction of the population distribution range. In addition, several openly available global or near-global human settlement datasets have been developed, including Microsoft and Google building footprints, HRSL, the World Settlement Footprint (WSF), and the Global Human Settlement Layer (GHSL). The accuracy of gridded population datasets can be improved by using relatively complete human settlement data as ancillary data. Studies have demonstrated that using these datasets can enhance the internal quantitative and qualitative accuracy of population distribution models by 10% to 15% (depending on the indicator) [21,30,31,32].
Although human settlement data have been applied in GPM, the datasets used in existing large-scale studies usually lack vertical (height or number of floors) and type (residential/non-residential) information [21,33,34]. Currently, mainstream large-scale gridded population products, including HRSL, LandScan, and WorldPop, rely on 2D and non-functional human settlement auxiliary layers [8]. Thomson et al. reported severe underestimation (averaging over 80%) in slum areas of Kenya and Nigeria due to the absence of detailed information about human settlements, such as usage and height, in the products above (and others) [35]. Multiple studies have shown that using building data with vertical information and categorization can significantly improve the accuracy of gridded population outputs [36,37,38,39,40]. This improvement is mainly attributed to considering the vertical distribution across building floors and excluding non-residential buildings. In the past, the lack of openly available large-scale 3D residential/non-residential building datasets strongly limited their application in continental or global-scale GPM. The new GHSL data package offers a potential solution to these challenges by providing high-resolution global human settlement information (hereinafter referred to as GHSL-3D Building): building footprint, building type (residential/non-residential), and building height [41].
Considering the above discussion, we proposed a new large-scale GPM method to generate a map of nighttime population distribution in mainland China (excluding Taiwan, Hong Kong, Macau, and some surrounding islands due to data limitations). This map corresponds to the concept of the resident population. We also assessed the accuracy of this method across provinces and municipalities with varying population densities and levels of economic development. This GPM method utilized 3D residential building data (from the newly released GHSL data package 2023), POI, nighttime light data, and land use/cover data within the MGWR model. The contributions of this paper are as follows:
(1) Three-dimensional residential building data were used in GPM for the entire mainland China, considering the effect of building height on the population distribution during the model training and imposing strict limits on the range of population distribution.
(2) Population distribution across mainland China was modeled based on MGWR, considering the nonstationarity and multiscale nature of the spatial relationship between population and auxiliary variables. This approach addresses regional differences in population distribution patterns.
To the best of our knowledge, this is the first time the MGWR model has been applied in the context of GPM and the first instance of employing 3D residential building data for national-level GPM in China. Previous studies have shown that WorldPop has a general accuracy advantage over other gridded population data products [8]. Due to the improvements in the model and auxiliary data, the method presented in this paper is expected to yield results with higher accuracy than the WorldPop dataset, providing a crucial reference for generating next-generation global demographic maps.
This paper is organized as follows: Section 2 describes the sources of research data and the preprocessing steps. Section 3 details the methodology. The results and discussion are presented in Section 4. Section 5 concludes the paper and outlines directions for future research.
2. Data and Preprocessing
Table 1 presents the primary data used in this paper for modeling and accuracy evaluation. The sources and preprocessing of these data are described below.
2.1. Population Data
In our research, the resident population data for the study area (excluding Taiwan, Hong Kong, Macau, and some surrounding islands) were obtained from the 2018 national sample survey of the resident population published by the National Bureau of Statistics. These data are reported for third-level administrative units (districts and counties), giving 2850 units.
The fourth-level resident population (i.e., the resident population at the level of the fourth administrative units) data for Shanghai, Jiangsu, Jiangxi, and Gansu provinces used for the accuracy test were mainly from the 2018 China Statistical Yearbook (Township).
2.2. Administrative Boundary Data
The third-level administrative boundary data for China were acquired from the official website of the National Catalogue Service For Geographic Information (China) (https://www.webmap.cn/main.do?method=index, accessed on 22 July 2023). Additionally, the administrative boundary data for Beijing, Shanghai, Jiangsu, Jiangxi, and Gansu at the fourth level were sourced from the National Platform for Common Geospatial Information Services (China) (https://www.tianditu.gov.cn/, accessed on 12 September 2024). We linked the population data with the administrative boundary data based on the administrative code and name of the respective administrative units. The distribution of the resident population across the third-level administrative units in the study area is illustrated in Figure 3b. The third-level administrative boundary data were used as the input layer for the “mask” and “processing extent” settings of the geoprocessing tools in ArcGIS Pro 3.0 to ensure that the boundaries of the various types of data were consistent.
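The linking of census records to boundary units can be sketched as a keyed join that also checks the unit name. The codes, names, and population values below are placeholders invented for illustration, and the function name is hypothetical.

```python
def link_population(boundaries, population):
    """Attach population to boundary units that agree on both code and name."""
    linked, unmatched = [], []
    for unit in boundaries:
        record = population.get(unit["code"])
        # Require the name to agree as well, as a consistency check on the code
        if record is not None and record["name"] == unit["name"]:
            linked.append({**unit, "population": record["population"]})
        else:
            unmatched.append(unit["code"])
    return linked, unmatched

# Placeholder records (not real administrative codes or counts)
population = {
    "000001": {"name": "Unit A", "population": 1000},
    "000002": {"name": "Unit B", "population": 2500},
}
boundaries = [
    {"code": "000001", "name": "Unit A"},
    {"code": "000002", "name": "Unit B (renamed)"},
    {"code": "000003", "name": "Unit C"},
]
linked, unmatched = link_population(boundaries, population)
```

Units that fail the join are returned separately so that renamed or merged units can be resolved by hand rather than silently dropped.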
2.3. GHSL-3D Building Data
The global building dataset for 2018 was acquired from the official website of the Global Human Settlement Layer (GHSL) (https://ghsl.jrc.ec.europa.eu/download.php, accessed on 12 September 2024). The dataset comprises three layers: total building footprint data, non-residential building footprint data, and building height data. The two building footprint layers have a spatial resolution of 10 m, and each pixel value represents the building area within the pixel, ranging from 0 to 100 m². The building height data have a spatial resolution of 100 m, and each pixel represents the mean net height of all buildings at that location. With reference to studies [42,43,44,45,46,47] related to building height data, we assessed the accuracy of the GHSL building height dataset in Section S1 of the Supplementary Materials. The results show that the dataset has relatively good accuracy.
Initially, these datasets were in Lambert projection. To suit our study area, we transformed them into the Albers projection using the projection and mask extraction tools in ArcGIS Pro 3.0, with nearest-neighbor resampling. The building height layer was used as the snap raster for all layers in this paper so that their cells align.
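One way in which the three GHSL layers could be combined into a 100 m residential building volume is sketched below. This is an illustrative assumption rather than the procedure used in this paper: the residential footprint is taken as the total footprint minus the non-residential footprint, the 10 m footprints are summed into 100 m cells, and the mean net height is applied to the residential area. All function names and input values are hypothetical.

```python
import numpy as np

def block_sum(fine, factor):
    """Aggregate a fine raster to a coarser one by summing factor x factor blocks."""
    h, w = fine.shape
    return fine.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))

def residential_volume(total_fp_10m, nonres_fp_10m, height_100m):
    """Residential volume (m^3) per 100 m cell from 10 m footprints (m^2) and height (m)."""
    total = np.asarray(total_fp_10m, dtype=float)
    nonres = np.asarray(nonres_fp_10m, dtype=float)
    # Clip at zero in case the non-residential layer exceeds the total in a pixel
    residential_10m = np.clip(total - nonres, 0.0, None)
    residential_100m = block_sum(residential_10m, 10)
    return residential_100m * np.asarray(height_100m, dtype=float)

# Hypothetical 200 m x 200 m tile: 20 x 20 footprint pixels, 2 x 2 height cells
total_fp = np.full((20, 20), 50.0)
nonres_fp = np.full((20, 20), 10.0)
nonres_fp[:10, :10] = 50.0  # top-left 100 m cell is entirely non-residential
height = np.array([[12.0, 6.0],
                   [3.0, 30.0]])
volume = residential_volume(total_fp, nonres_fp, height)
```

Cells without residential footprint receive a volume of zero, which is one simple way to restrict the population to residential buildings.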
2.4. Nighttime Light Data
Nighttime light data are widely used in large-scale GPM [18,48]. We obtained the 2018 global VIIRS nightlight data (VNL V2.1 annual version) from the Earth Observation Group (EOG) website of the Colorado School of Mines (https://eogdata.mines.edu/products/vnl/, accessed on 12 September 2024). This dataset has been carefully processed to exclude the influences of cloud cover and background light. The original spatial resolution of the data is 15 arc seconds, approximately equivalent to 500 m at the equator. To suit our study area, we applied the projection and mask extraction tools available in ArcGIS Pro 3.0 and obtained nighttime light data at a spatial resolution of 100 m. Following Gaughan et al. [22], the nearest neighbor method was used to resample the nighttime light data to avoid changing pixel values.
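The reason nearest-neighbor resampling leaves pixel values unchanged can be seen in a simplified sketch. It assumes an integer refinement factor (roughly 500 m to 100 m) and ignores reprojection, which the actual workflow in ArcGIS Pro also handles; the radiance values are invented.

```python
import numpy as np

def nearest_neighbor_upsample(coarse, factor):
    """Refine a raster by copying each coarse pixel into factor x factor fine pixels."""
    return np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)

# Hypothetical 2 x 2 block of annual radiance values
coarse = np.array([[0.0, 1.5],
                   [7.2, 63.0]])
fine = nearest_neighbor_upsample(coarse, 5)
```

Every fine pixel carries a value that already existed in the source raster. Bilinear or cubic resampling would instead create interpolated radiances along the boundaries between coarse pixels.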
2.5. Land Use/Cover Data
The 2018 land use/cover raster data were obtained from the official website of the Chinese Academy of Sciences Resource and Environmental Science Data Center (https://www.resdc.cn/, accessed on 12 September 2024). The dataset primarily relies on manually interpreted Landsat satellite remote sensing imagery. It follows a two-level classification system: Level 1 includes six land classes, namely, cultivated land, forest land, grassland, water area, built-up land, and unused land; Level 2 subdivides the Level 1 classes into 25 land classes. The original spatial resolution of the data is 30 m, and the data are in an Albers projection based on the Krasovsky ellipsoid. To suit our study area, we converted the data into the Albers projection based on the WGS-84 ellipsoid and resampled them to a spatial resolution of 100 m. To minimize accuracy loss during resampling, we used the majority resampling method. Furthermore, we reclassified the processed land use/cover data by merging all land classes except for urban land, rural residential land (referred to as rural land), and industrial and mining land (referred to as industrial land) into a single class named ‘remaining land’.
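The reclassification and majority resampling can be sketched as follows. The Level 2 codes 51, 52, and 53 for urban, rural, and industrial land are an assumption about the source classification, and the integer block factor is a simplification, since the real 30 m to 100 m resampling is not an integer ratio. The paper resamples before reclassifying; the order is reversed here only to keep the example short.

```python
import numpy as np

# Assumed Level 2 codes: 51 urban, 52 rural residential, 53 industrial and mining
KEPT_CLASSES = {51: 1, 52: 2, 53: 3}  # every other code becomes 0 ('remaining land')

def reclassify(lulc):
    """Merge all classes except urban, rural, and industrial land into class 0."""
    merged = np.zeros(lulc.shape, dtype=np.int64)
    for code, new_class in KEPT_CLASSES.items():
        merged[lulc == code] = new_class
    return merged

def majority_downsample(classes, factor):
    """Assign each coarse cell the most frequent class among its fine pixels."""
    h, w = classes.shape
    rows, cols = h // factor, w // factor
    blocks = (classes[:rows * factor, :cols * factor]
              .reshape(rows, factor, cols, factor)
              .transpose(0, 2, 1, 3)
              .reshape(rows, cols, factor * factor))
    out = np.empty((rows, cols), dtype=classes.dtype)
    for i in range(rows):
        for j in range(cols):
            # Ties are resolved in favor of the smallest class value
            out[i, j] = np.bincount(blocks[i, j]).argmax()
    return out

# Hypothetical 4 x 4 tile of Level 2 codes
lulc = np.array([[51, 51, 21, 21],
                 [51, 11, 21, 52],
                 [52, 52, 53, 53],
                 [52, 31, 53, 53]])
coarse_classes = majority_downsample(reclassify(lulc), 2)
```

Majority resampling keeps the output categorical, whereas averaging class codes would produce values that correspond to no land class.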
2.6. POI Data
POI data can represent various human activities in their location and neighborhood (e.g., companies, restaurants, and financial services) that correlate with population density to varying degrees [18,49]. Therefore, POI data are often used in GPM studies [8,50]. The POI data used in this study were collected in 2017 and obtained from Amap (https://ditu.amap.com/, accessed on 12 September 2024), a leading provider of digital maps, navigation, and location-based services in China. The raw text data were carefully cleaned and transformed into vector points using latitude and longitude information. After this processing, the data were projected for further analysis.
In our research, we utilized 13 types of POI data, including shopping services, government organizations and social groups, health care services, lifestyle services, car maintenance, catering services, sports and leisure services, financial and insurance services, companies and enterprises, car services, education and cultural services, car sales, and motorcycle services. These 13 types collectively amounted to 38,154,240 records, forming the basis for our analyses and investigations.
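POI points must be converted into a raster covariate before modeling, and the conclusions of this paper refer to kernel density layers of the POI categories. A minimal sketch of such a conversion is given below. The Gaussian kernel, the bandwidth, the point coordinates, and the summation of per-category layers into one are all assumptions made for illustration.

```python
import numpy as np

def kernel_density(points_xy, grid_x, grid_y, bandwidth):
    """Gaussian kernel density of points evaluated at grid cell centers."""
    gx, gy = np.meshgrid(grid_x, grid_y)  # arrays of shape (len(grid_y), len(grid_x))
    density = np.zeros(gx.shape, dtype=float)
    for px, py in points_xy:
        squared_dist = (gx - px) ** 2 + (gy - py) ** 2
        density += np.exp(-0.5 * squared_dist / bandwidth ** 2)
    return density / (2.0 * np.pi * bandwidth ** 2)

# Hypothetical projected coordinates (m) for two POI categories
poi = {
    "catering services": [(0.0, 0.0)],
    "education and cultural services": [(100.0, 0.0), (0.0, 100.0)],
}
grid = np.array([-100.0, 0.0, 100.0])  # cell centers of a 3 x 3 grid at 100 m spacing
layers = {name: kernel_density(pts, grid, grid, 100.0) for name, pts in poi.items()}
combined = sum(layers.values())
```

With an unweighted sum, the combined layer is identical to the density of all points pooled together, which illustrates how merging categories discards the category information.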
5. Conclusions
In this study, we applied the MGWR model to disaggregate population data by integrating 3D residential building, nighttime light, POI, and land use/cover data, creating a 100 m gridded population map for mainland China. As far as we know, this is the first time the MGWR model has been used in the context of GPM and the first instance of employing 3D residential building data for national-level GPM in China. The resulting gridded population map exhibits higher accuracy than the existing WorldPop dataset. This improvement can be attributed to the use of 3D residential building data and the MGWR model. Unlike land use/cover data, residential building data more accurately reflect the extent of population distribution and show a stronger correlation with the population; their height information reflects the vertical distribution of the population within buildings, making them an excellent auxiliary variable. In addition, population distribution patterns differ considerably among regions in large countries or regions such as China. The MGWR model, which takes into account the nonstationarity and multiscale nature of the spatial relationship between population and explanatory variables, is therefore well suited to GPM in such study areas.
This study can be a significant reference for developing the next-generation global gridded population product datasets. As GHSL-3D Building and nighttime light data are globally available, and alternatives to land use/cover and POI data, such as ESA/CCI and OpenStreetMap data, exist, the approach presented here can be applied globally. Regarding global population input data, GPWv4 can be a viable substitute for census data, as population grid products like GHS-POP have employed GPWv4 as input data [41]. In contrast to the RF model used by WorldPop, the MGWR model allows for uniform modeling of all input units globally, eliminating the need for zonal modeling by country or region to control the accuracy of population predictions. As a result, the modeling method in this study significantly reduces the complexity and time required for global population modeling.
However, there is still much room for improvement in this study. The MGWR model used in this study can reflect the spatially localized relationship between population and explanatory variables, but it fails to reveal their nonlinear relationship, and the relationship between population and influencing factors is usually nonlinear. In addition, to minimize the problem of collinearity among variables, we combined the kernel density layers of the 13 categories of POI into a single layer, resulting in the loss of a large amount of semantic (category) information in the POI data. To address these shortcomings, we plan to combine the local regression idea with nonlinear machine learning algorithms (e.g., RF) to build a new model in future research. As in GWR, only nearby observations are used to build the local model at each location [52,53]. Such a model can express the spatially nonstationary and nonlinear relationship between the population and the variables and is less sensitive to collinearity among the variables. This is expected to produce results with higher accuracy.