1. Background
Currently, the concept of “health inequalities” refers to the impact that factors, such as wealth; education; employment; racial or ethnic group; exposure to environmental factors, including air pollution or weather variables; urban or rural residences; and/or the social conditions of an individual’s workplace or dwelling, have on the distribution of health and disease among the population. The study of the characteristics of the population and the geographical area of residence is the methodological support that allows for intervention points focused on the prevention and the disappearance of existing health inequalities to be identified.
Initially, socioeconomic inequalities were identified with health inequality [
1]. Health inequality can be defined as an inequity in the spread of a disease. In other words, health inequality is the systematic and potentially avoidable differences in one or more health aspects across socially, economically, demographically, or geographically defined populations or population groups. Two conditions must be met for a difference in health to be considered as an inequality: (1) it must be considered socially unjust and (2) potentially avoidable (i.e., there are instruments available that could be used to avoid it) [
1].
There is evidence that inequalities in health exist. While the Lalonde [
2] and Black [
3] Reports pointed this out, it was the Acheson Report [
1] that firmly concluded that inequalities in health have a socioeconomic explanation. To date, twenty years later, most of these relationships have been demonstrated, and not an insignificant proportion is caused by environmental problems [
1]. These factors are generally, but not exclusively, linked to gender, social and economic conditions [
1,
4,
5].
In general, the living environment, and thus environmental conditions, can contribute to socioeconomic inequalities in health through two mechanisms, acting either independently or, more likely, jointly [1,5]. The first is differential exposure: the most economically disadvantaged groups have a greater exposure to environmental problems, including air pollution. The second is differential susceptibility to exposure (i.e., to the main adverse health effects) resulting from environmental problems, which occurs among the most economically disadvantaged people due to their greater vulnerability.
When we think about a longitudinal study to observe how health inequalities, individuals’ health, income, or another specific characteristic evolve over time, our thoughts very quickly turn to creating a cohort. This is immediately followed by considerations of the high cost and logistical difficulties of managing a cohort in terms of obtaining users, processing the sample, managing the information, and even handling and looking after the sample.
There are many cohorts in which the number of individuals easily surpasses the 100,000 mark, including the Framingham Heart Study [6], the Current Management of Secondary Hyperparathyroidism: A Multicenter Observational Study (COSMOS) [7], and the NutriNet-Santé Study [8]. When the sample is large, the governance of the users and their data becomes extremely costly. Acquiring the sample in the traditional way, via a letter explaining to the individuals concerned that they have been selected to take part in a project and what it consists of, involves costs that are high enough to warrant considering alternatives to the cohort [9,10,11,12]. Another point of consideration is that the cost of increasing or improving the sample, or simply of demonstrating significance for a group or subgroup that was not initially contemplated, can be so high that many researchers decide not to incorporate any more individuals into the cohort beyond a theoretical framework. Financial constraints and a lack of logistical resources generally mean that traditional cohorts have limits.
This is where digital considerations come into play. An electronic cohort, or e-cohort, is a traditional but digitally managed cohort [13]. This management can be entirely digital, via user interactions with websites, platforms, or apps, or it can rely on the post [9]. It can also be of a hybrid nature, depending on the type of information that needs to be collected beforehand and on how difficult it is to obtain the information automatically. Some traditional cohorts, including novel cohorts with a high number of individuals, are starting to test their transformation into electronic cohorts in search of improvements. These improvements basically focus on optimizing the cost-efficiency of the project and the way the data are obtained and managed.
The marginal cost of the sample in an e-cohort is practically zero [11], although some costs inherent to longitudinal studies and linked to maintaining and managing the sample remain. They are, nonetheless, significantly lower than the cost of traditional acquisition. This cost reduction signifies not only monetary savings but also logistical ease in terms of the human factor. Currently, the e-cohorts that have published results focus on using a web app as the working platform, sometimes including external elements, such as smartwatches [14] or diaries that must be kept up [9], with the user being able to choose between different formats. These external elements end up not being used by the individuals, causing sample mortality. This is a weakness of e-cohorts that needs to be addressed [10,11,14] so that data can be obtained without the user having to interact directly with the app or the mobile phone.
The e-cohort also reduces the costs linked to data collection, minimizing the logistical costs of obtaining, cleaning, homogenizing, processing, and automating all the information concerning the sample. In a traditional cohort, the time spent purging everyone’s information quickly adds up to many hours, whereas “interviewing” the sample digitally eliminates the time spent on this task. We must also consider that, in a traditional cohort, the information is obtained just once or twice a year, especially if the sample is large. This lack of information about the user during certain periods causes a data lag, generating an information gap that the traditional cohort cannot resolve. The e-cohort enables several different surveys to be carried out at no extra economic cost, although care must be taken to ensure that the sample is not saturated with activity.
In e-cohorts, the data can be obtained in different ways, which, for the sake of simplification, can be separated into two groups: the first where the user interacts, and the second where the user is “passive”. In the first, the user interacts directly with the website, app, or mobile device, and consciously responds to the information requested, such as answering a survey or a question about their perceived state of health. Although users’ fatigue thresholds have not yet been established, the e-cohort is an attractive option, thanks to the possibility of asking more users more questions at a lower cost. In addition, all the answers enter a digital process where they are easily automated, further reducing the cost and increasing the efficiency of the process. The same logic can be applied to the use of external elements, for example, a smartwatch that can supply minute-by-minute information about the evolution of an individual’s heart rate. The results obtained using these tools are unbiased compared to the data obtained using traditional tools, and they also provide information that is consistent over time.
It has been demonstrated that the most effective way to gather users for a sample is by offering a monetary incentive [
9,
12,
13], which the user receives once they have responded to the questions.
There has been a case in which the sample was opened up by applying citizen science. In these cases, the e-cohorts have to compare their sample with a census, or a similar source, to validate whether the sample obtained is representative of the study population [11,13]. The sample must be validated by separating the different demographic characteristics. In various cases, it has been observed that there are groups that do not tend to take part in these experiences, so additional efforts are required to sample these groups correctly. Conversely, young women with a higher educational level tend to participate most in this type of initiative, leading to their oversampling [14]. This can cause biases, which must be controlled when performing the inferences. It has also been shown that a population with little or no digital skills finds responding to the questions problematic. Despite this limitation, too few such individuals emerge to complicate the sampling of specific groups [11].
One common limitation of cohorts that is not resolved by the e-cohort emerges when seeking a way to use a sample to represent a set of territories. If we want to significantly represent the population of Catalonia, it is sufficient that the sample is random throughout the territory. Meanwhile, if we want to work with a specific axis, such as age, it is sufficient to make a small adjustment and increase the size of the sample.
The Public Health Observatory of Girona Province (Dipsalut) is designing an e-cohort to carry out a longitudinal study to simultaneously examine the health of the population and its socioeconomic situation. The province of Girona is defined as a semi-rural territory [
15], with 221 municipalities and a population of approximately 770,000 people. Fewer than 10% of the municipalities have more than 10,000 inhabitants, which substantially limits statistical significance and brings us up against the constraints of statistical confidentiality (the “statistical secret”).
This e-cohort must not only allow us to obtain a significant representation of all the municipalities in the territory, but it must also optimize the resources and the sample. A municipality codified as LAU level 2 by Eurostat is the smallest existing territorial division at the national level in Spain, where there is a decision-making power over local policies. The present paper explains the process of carrying out clustering in the province of Girona. The clustering must allow similar municipalities to be clustered for the purpose of constructing a representative sample of the different territories. This sample must enable the generation of a set of indicators that present the inequalities that exist in the territories [
16]. Furthermore, its design must revolve around the five major axes of inequality: sex, age, social class, migratory process, and territory. This sample was controlled and had to be regulated, so working with an open sample was not a consideration.
This paper explains the process used to cluster the municipalities into 6 groups according to their similarities, and how 14 clustering algorithms were tested to find the ones that were the most effective and representative of the province. Finally, statistical modeling was used to observe whether there were significant differences between the clusters and to draw the final conclusions.
3. Results
3.1. Area and Period of Study
A process of clustering small areas of Catalonia using a set of 54 variables was carried out. A prior task was performed to select the variables that were most relevant to the different areas, as explained in the following sections.
The study period was initially 2010 to 2018. However, given the small size of both the territory and the population, the data are bound by the obligations of statistical confidentiality (the “statistical secret”), which limits access to the available information. Consequently, the study period was changed to 2015–2017, for which the data are more consistent and relatively unproblematic regarding missing values. All the municipalities were therefore represented with a high level of consistency.
In this study, we considered 221 of the 948 municipalities belonging to the region of Catalonia. The number of inhabitants varied between 83 and 99,013 (average: 3412 inhabitants, standard deviation: 9081.349 inhabitants, median: 746 inhabitants, Q1: 298 inhabitants, and Q3: 2290 inhabitants). The population density varied between 1 and 4493 inhabitants per km2 (average: 45 inhabitants/km2, standard deviation: 464.216 inhabitants/km2, median: 45 inhabitants/km2, Q1: 20 inhabitants/km2, and Q3: 130 inhabitants/km2).
3.2. Variable Selection
To eliminate redundant variables and excessive noise, we carried out a spike-and-slab variable selection process according to the population [52]. The models were based on the relationship with the number of inhabitants in a municipality. The mean squared error (MSE) of the predictions was used as the criterion for comparing methods [70]. The spike-and-slab method presents the smallest MSE (see Table 1).
The final dataset comprises 54 variables for 221 municipalities over 3 years, giving a final sample of 35,802 cases.
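As a quick check of these dimensions, the following sketch multiplies out the counts quoted in the text. The variable names are illustrative, and reading the 35,802 "cases" as individual data values (rather than municipality-year rows) is our interpretation of the figure, not something the text states explicitly.

```python
# Counts taken from the text; the variable names are illustrative only.
n_variables = 54
n_municipalities = 221
n_years = 3

# One row per municipality and year.
n_observations = n_municipalities * n_years

# One value per row and variable (assumed meaning of "cases").
n_values = n_observations * n_variables

print(n_observations, n_values)
```

Under this reading, each year contributes one observation per municipality, which is consistent with cluster sizes later being reported in observations rather than municipalities.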
3.3. Clustering
The number of clusters obtained from the supervised methods was six (
Figure 2). This number was validated based on the application of the Elbow method in a task carried out prior to the process of clustering. The number of optimized clusters does not change in any of the three data sets.
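The elbow method mentioned above can be sketched as follows. This is a minimal illustration on hypothetical toy data (three well-separated groups of points), not the procedure or data used in the study: it runs a plain Lloyd k-means with a deterministic farthest-first seeding (an assumption made here for reproducibility) and records the within-cluster sum of squares (WSS) for each candidate number of clusters.

```python
import math


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point that is farthest from its nearest already-chosen center."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans_wss(points, k, n_iter=100):
    """Plain Lloyd k-means; returns the within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: each center moves to the mean of its group.
        new_centers = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centers[j]
            for j, g in enumerate(groups)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Hypothetical toy data: three tight groups of four points each.
offsets = [(-0.5, 0.0), (0.5, 0.0), (0.0, -0.5), (0.0, 0.5)]
blobs = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + dx, cy + dy) for cx, cy in blobs for dx, dy in offsets]

wss = {k: kmeans_wss(points, k) for k in range(1, 7)}
for k, value in wss.items():
    print(k, round(value, 2))
```

On this toy data the WSS falls steeply up to three clusters and changes little at four, which is the "elbow" one looks for. With real municipal data the bend is usually less sharp and is read from a plot.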
The results of the clustering process are presented in
Table 2 (external and internal validation of clustering),
Table 3 (number of observations for each cluster and data set), and
Figure 3 and
Figure 4 (results of clustering).
The municipalities of Girona present a well-recognized heterogeneity. The capital has a little over 100,000 inhabitants (103,369 inhabitants), while there are fewer than 50,000 (47,235 inhabitants) in the next largest municipality. There is also important diversity in a geographical sense, with a set of municipalities located in mountainous areas and others located on the Mediterranean coast. This heterogeneity across the entire area generates some obvious socioeconomic and health differences. The density-based clustering algorithms do not handle this heterogeneity well. Many municipalities, including the capital of the province, are detected as outliers. This type of algorithm does not allow all the municipalities to be classified, and so these algorithms were ruled out. However, the rest of the models classified all the municipalities (see Figure 3).
An external and internal validation study was carried out to choose between the rest of the algorithms. A graphic validation was later designed using a cloud of points and the mapping of the clusters. The clustering produced by the hierarchical k-means method was consequently chosen.
As shown in
Table 2, the internal validation values [
71] of the algorithms, k-means, hierarchical k-means, PAM, and CLARANS, present the optimum values in the original database. In the nominal and smoothed data set, we observe how the PAM algorithm obtains some internal validation results that are inferior to the rest of the previously mentioned algorithms.
The external validation shows that PAM is the algorithm with the lowest between-cluster difference in all the data sets. However, the intra-cluster difference varies depending on the data set. Three algorithms stand out for presenting the best relationship between intra-cluster and between-cluster differences: k-means, PAM, and hierarchical k-means. The entropy value [71] that indicates the best clustering is obtained by the fuzzy, DIANA, and AGNES algorithms for the different data sets. The CH index [72] shows that the k-means, PAM, CLARANS, and hierarchical k-means algorithms are the ones that construct the clusters best.
Table 3, which shows the distribution of the clusters, helps with the conceptualization of the dimensions of the clusters. It can be observed that, in the original data set, the different clusterings have a main cluster with a greater number of cases than the rest. This main cluster varies from 186 to 627 cases across the different algorithms. There are two types of clustering: those in which the main cluster captures most cases, and those in which the cases are distributed more homogeneously between the clusters. In most of the groupings, there is a second cluster with a weight greater than 20% of all the observations. The groupings in which the main cluster retains at least 50% of the sample are CLARA, CLARANS, hierarchical k-means, fuzzy, BICO, EA, DIANA, and AGNES. Meanwhile, k-means, PAM, and BIRCH are the algorithms that distribute the individuals in the most balanced way. The nominal and smoothed data sets present a more uniform distribution of the municipalities among the clusters.
Once the validations of the clusters and their dimensions have been analyzed, a graphic representation of them must be produced. This representation must allow the algorithms that generate a visually intuitive clustering to be detected to facilitate choosing the final clustering (
Figure 3).
Figure 4 shows how the k-means, PAM, and hierarchical k-means algorithms are the ones that generate a more visually intuitive clustering for the different data sets. The representations based on the nominal data set show how the distribution is reduced. In the smoothed data set, the cases are smoothed in a more obvious manner.
The graphic representation using the clouds of points does not allow a pattern that is significantly better than the rest to be detected. Therefore,
Figure 5 shows the groupings of the k-means, PAM, and hierarchical k-means algorithms on the study map (province of Girona).
3.4. Mapping of the Clustering
The maps illustrate how the clustering carried out using the original data set enables us to detect that the k-means and hierarchical k-means algorithms differentiate between the set of coastal municipalities and some county capitals together. They also cluster the set of inland municipalities that link Barcelona and France. They do not detect a differentiation between the mountain municipalities, although they do differentiate between a subregion of them. A small cluster for some of the municipalities with a high population is generated. Regarding PAM, the mountain and coastal municipalities are clearly differentiated. Some county capitals are also added to these last clusterings. A set of municipalities very close to Barcelona and the municipalities nearest the French border can be identified, as can the inland municipalities dispersed in a first and second ring around the county capitals. In all three clusterings, Girona is grouped independently.
The clusters generated by the k-means, PAM, and hierarchical k-means algorithms, based on the nominal and smoothed data sets, are very similar. The k-means and hierarchical k-means algorithms detect the first grouping of the municipalities located in the mountainous areas. K-means detects a subset of these municipalities since they belong to the inland municipalities. Both algorithms also detect a set of municipalities that belong to the coast, together with some county capitals. The municipalities nearest the French border and those closest to Barcelona are detected. Meanwhile, PAM detects a pattern among the municipalities next to France (
Figure 6 and
Figure 7).
3.5. Descriptive Study of the Clustering
Table 4 shows the variability of the clusterings. Notably, the k-means and hierarchical k-means algorithms are the ones with the least variability in all three data sets, indicating that these clusterings do not undergo changes and are stable over time.
The algorithm chosen is hierarchical k-means, because it presents the best and most stable properties for generating a sample that endures over the years. Six clusters can be detected with this algorithm. The first cluster contains the municipalities near the French border (Empordà), and the second contains the municipalities located in mountainous areas. The third group focuses on the inland municipalities of the territory. The fourth group is made up of the coastal municipalities and some county capitals. The fifth group detects the territory’s important municipalities, be it economically or in terms of population. The sixth and last group separates the capital from the rest of the municipalities.
The results of the descriptive study of the clustering are shown in
Table 5 (descriptive analysis by conglomerates, robust values). As can be observed, the size of the population is very different among the six groups. There is an obvious contrast between the high number of people that live in the capital (98,255) and the median population of the municipalities located in the other county capitals (37,042) and close to the coast (10,709), with lower population numbers than the rest of the cluster. The population density is also higher in these groups, and especially in the capital (2512). It can be observed how the native population figures are quite similar for all the clusters, except the capital, where this figure is higher (40.22). Meanwhile, the ratios of immigrants in the inland municipalities (0.082) and the mountainous areas (0.061) are lower than in the rest of the clusters, with the highest ratios in the coastal municipalities (0.217) and the other county capitals (0.225).
The internal and external flow of movements is greatest in the capitals of the county (28) and in the capital of the province (555). The migratory balance is also higher in the capital than in the rest of the clusters. The different weights in the distribution of jobs in the sectors in each cluster can also be observed. The mountain and border clusters (7.23 and 6.61, respectively) have the highest percentage of the population employed in agriculture. Meanwhile, the inland municipalities (20.98) have a higher percentage of the population employed in the industrial sector. The weight of the construction sector is similar in all the clusters, except for the capital, which has a lower percentage (4.67). The services sector predominates in all the clusters, with the greatest weight (81.94) in the capital. The unemployment rate increases in line with the weight of the population of each cluster. Likewise, the clusters with the highest population densities are where the Gini index is highest.
Inequality is greatest in the capital (36), followed by the coastal municipalities (34.60) and the main county capitals (34.10).
Income from salaries is highest in the capital (10,277) and lower in the coastal municipalities (7393) and the county capitals (7218). Income derived from unemployment benefits is lower in the coastal municipalities (2488) and the county capitals (2221).
The cost of renting housing is similar among the clusters. However, the cadastral value is not, with the highest values in the capital (4,005,166) and the lowest around the French border (22,797.5).
On observing the breakdown of the population balance, it can be observed how this balance is lower in the mountainous areas (1) and the border areas (2), than in the capital (704) and some other municipalities (193). A similar dynamic appears in relation to the natural growth of the population and the dependency index. The border and mountain municipalities have the same negative natural growth rate (−1) and the highest dependency indexes (60.60 and 56.19). Conversely, there is a higher natural growth rate in the capital and county capitals and a lower dependency index (50 and 48.04). The number of traffic accident victims is similar in all the clusters, except in the capital (76). However, more phone calls are registered in the coastal municipalities (5) and the other county capitals (5) than in the rest of the clusters.
Geographically, it can be observed that the municipalities at the highest altitude are found in the mountain cluster (953.5).
3.6. Inference
The clusters represent the variability of the territory, which, as we have shown, is highly heterogeneous; consequently, the variables do not follow a normal distribution. The Kruskal–Wallis [73] and Mann–Whitney [74] tests show that there are significant differences among the clusters. To examine these differences, and assuming the absence of multicollinearity and outliers, a multinomial logistic regression was performed [75]. The odds ratios of the regression are presented in Table 6.
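For readers unfamiliar with the first of these tests, the following is a minimal sketch of the Kruskal–Wallis H statistic on hypothetical data for three groups. It uses average ranks for ties and the usual tie correction; the function names and the values are illustrative, and the study's own computation may differ.

```python
from collections import Counter


def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), tie-corrected.
    Assumes the pooled values are not all identical."""
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    tie_sizes = Counter(pooled).values()
    correction = 1 - sum(t ** 3 - t for t in tie_sizes) / (n ** 3 - n)
    return h / correction


# Hypothetical values of one indicator in three clusters.
groups = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
h_stat = kruskal_wallis_h(groups)
print(round(h_stat, 3))
```

Under the null hypothesis, H is compared with a chi-square distribution with k − 1 degrees of freedom, where k is the number of groups; with six clusters that would mean five degrees of freedom.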
4. Discussion
The execution of the algorithms on the different data sets shows that the validation improves when working with more stable data, such as the nominal values smoothed by the z score. This stability translates into less variability in the construction of the clusters in the three periods. The variability improves when working with the smoothed data set. This is a relevant point when considering the design of a longitudinal study to find the individuals that are representative of the same type of municipality.
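The z-score smoothing referred to here can be sketched as follows. This is a minimal illustration that assumes each variable is standardized separately and that the population standard deviation is used; the text does not state which variant the authors applied, and the column names and values are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of numbers to mean 0 and standard deviation 1.
    Uses the population standard deviation (an assumption of this sketch)."""
    mu = mean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        # A constant variable carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical indicators on very different scales.
columns = {
    "population": [300.0, 750.0, 2300.0, 99000.0],
    "gini_index": [30.1, 31.5, 34.6, 36.0],
}
standardized = {name: z_scores(col) for name, col in columns.items()}

for name, col in standardized.items():
    print(name, [round(z, 2) for z in col])
```

After standardization every variable has the same scale, so distance-based algorithms such as k-means are no longer dominated by the variables with the largest raw values.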
Of the clustering presented, three of the maps can be identified as the most representative of the territory. The first was the map created with the original data set using the PAM algorithm, which managed to determine six clusters: the French border, the mountainous area, the outskirts of Barcelona, the coast, the area inland, and the capital of the province. The other two were the maps generated with the nominal and smoothed data sets, using the hierarchical k-means algorithm, which showed five clusters: the French border, the mountainous area, the coast, the area inland, and the capital of the province, in addition to a sub-cluster of the main county capitals. The more solid algorithm was chosen at the expense of the loss of the cluster adjoining Barcelona.
The multinomial logistic regression shows that there are differences between the clusters and the capital. There are no significant demographic differences between the municipalities grouped as county capitals and the capital of the province. The clusters of the mountainous areas, the French border, and the coast are more likely to have lower population balances and lower population densities than the capital. Consequently, they are also more likely to have a higher global dependency index than the capital.
Economically, there are fewer differences between the clusters. Any differences stem from salaries with respect to the capital, giving the coastal areas a lower probability. However, they have a higher probability of obtaining a gross average income and a pension than the capital. On the coast, both the gross average income and income from salaries have a higher probability of being the same as those of the capital. However, pensions have a lower probability.
The probability of having a rental housing offer equal to that of the capital is less in the mountain, border, and coastal clusters. However, the probability of owning property is higher with respect to the capital. Nonetheless, there is less probability that they are valued the same as the capital.
The job market presents significant differences, except for the municipalities in the county capitals. In the rest of the clusters, there is a greater probability of being unemployed than in the capital. Notably, the probability of having an immigrant unemployment rate equal to that of the capital is lower on the coast and in the mountains. The probability of having workers who are employed in the agricultural sector with respect to the probability of the same in the capital is greater in the coastal municipalities, and lower in the other municipalities. The probability of having workers employed in the services sector works inversely.
The probability of having sports facilities and libraries with respect to the capital is higher in the coastal municipalities. We find the inverse in the border municipalities, which have a negative probability. There are no significant differences in the municipalities of the county capitals.
In terms of interpreting the health variables, there are no significant differences with respect to the municipalities of the county capitals. For the rest of the clusters, the probability of having an aging index similar to that of the capital is negative. The probability of having the same death rate as the capital is higher in the border municipalities and on the coast, and negative in the others. The recovery index also has a negative probability. In the inland municipalities, the probability of having a mean age equal to that of the capital is lower than in the rest of the clusters. Traffic incidents and victims are more probable in the mountain and coastal municipalities than in the capital.
Clear differences were observed between the clusters and the capital. However, few significant differences were observed in the subgroup of the municipalities in the county capitals.
In conclusion, working with microdata is complicated, in terms of both making comparisons and modeling and clustering, especially if they are socioeconomic data. The difficulties of working with indicators, indexes, and rates complicate the data mining process and, later, the reading of the results. A smoothing or standardization process is necessary to work effectively. It must be considered that using percentages with such small data sets means that these can change drastically from year to year. These possible irregularities accentuate the variations and generate an elevated volatility. This volatility affects the clustering and models, making their classification difficult. These factors end up translating into a high variability of the observations in the groups. However, this way of working can end up impeding the detection of new emerging clusters.
The functions based on density do not work optimally with variables that have such different realities as these.
Figure 3 shows how they do not manage to classify all the municipalities. It should be tested whether re-clustering the outliers results in being able to classify all the municipalities, even though this means generating a final clustering superior to the k-number of the chosen clusters. The hierarchical k-means and k-means algorithms generate a cluster that does not present large significant differences with respect to the capital, so we can therefore work with five clusters rather than six. This helps us to design the simplest sample with the possibility of generating the most segregations. Another point for further study is whether the subgroup detected by PAM presents significant differences to the other groups to maintain the six clusters. A priority when designing a clustering to be able to extract a set of individuals to carry out a longitudinal study using digital tools is that these groupings endure for as long as possible.
Another point to bear in mind is that the number of years studied should always be higher than the number of clusters we want to create. This way, we can know in which cluster the municipalities are classified, most times, to be able to find a cluster–territory relationship and a trend. This was not possible in this study due to the lack of data.
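The idea of finding the cluster in which a municipality is classified "most times" can be sketched as a simple majority vote over the yearly assignments. The example below is hypothetical: the municipality names and labels are invented, and it assumes that cluster labels have already been aligned across years so that the same number denotes the same cluster.

```python
from collections import Counter


def modal_cluster(labels_by_year):
    """Return (most frequent label, share of years with that label, tie flag).
    The tie flag is True when several labels share the highest count."""
    counts = Counter(labels_by_year)
    label, hits = counts.most_common(1)[0]
    is_tie = list(counts.values()).count(hits) > 1
    return label, hits / len(labels_by_year), is_tie


# Hypothetical yearly cluster assignments for three municipalities.
assignments = {
    "municipality_a": [4, 4, 2],
    "municipality_b": [1, 1, 1],
    "municipality_c": [1, 2, 3],
}

for name, labels in assignments.items():
    print(name, modal_cluster(labels))
```

The third municipality shows why the number of years should exceed the number of clusters: with only three years and six possible clusters, a municipality can fall into a different cluster every year and no majority exists.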
5. Conclusions
This article aims to help researchers and decision-making institutions compare the structuring and grouping of small areas, especially in those cases where the differences between them are very large. It also endeavors to show an optimal way of transforming and working on datasets to facilitate the resulting groupings. Two of the main limitations in grouping such diverse and small populations are, on the one hand, the lack of data and, on the other, the lack of experiences that have endured over time and whose evolution can be observed.
If we want to analyze the impacts of spatial variables, such as NDVI or the pollutants PM2.5, PM10, NO2, or CO2, it is advisable to generate data at a lower level than the municipality. Municipalities are not the smallest administrative division; they are only the smallest division that has political decision-making power. Working below that level would allow us to segment a cohort from census tracts or districts in the future and reduce the potential ecological fallacies that cohort data may generate. In addition, it would better capture the inequality that can be observed between the rich and poor areas in cities. The lack of experience working in small areas, along with the nature of most indicators, makes these processes difficult.
Currently, it is essential to start generating data at the scale of small areas, even smaller than a municipality, because otherwise we will always be masking the inequalities behind averages and aggregate values of population subsets in which wealth blurs the levels of poverty. In addition, microdata permit the creation and adaptation of new indicators that capture the inequalities and phenomena occurring in the territory more efficiently.
Data protection policies, although necessary, often prevent the study of the reality of territories. They also make it difficult to study individuals in a particular way. These mechanisms end up making it difficult to observe inequalities as well as study the sensitivity that each individual has, with respect to their social conditions and how these affect them.
To facilitate the best clustering process, it would be useful to carry out trend studies and predictive modeling to observe the subsequent years and to be able to forecast where each municipality will be classified, to help create a clustering that endures over time.