1. Introduction
Geographically explicit models have played an instrumental role in advancing our understanding of spatially explicit, dynamic socio-economic systems. By incorporating spatial data and human decision-making processes, these models allow researchers to capture the intricate ways in which individuals, communities, and larger populations interact within defined geographic contexts [
1]. The selection of a grocery retail store by customers living in an urban environment is one such complex socio-economic problem that has a significant spatial aspect and one in which customers and retailers interact dynamically. Numerous studies have demonstrated how geographic proximity, transportation networks, and spatial accessibility profoundly shape consumer choices [
2,
3,
4]. For instance, ref. [
5] found that store convenience and travel costs, rather than other factors, are key determinants of store selection. However, most of these studies primarily focus on store attributes, and the interaction between customers and stores is often overlooked. While spatial elements such as proximity to stores can directly influence a customer’s behaviour, it is also driven by marketing strategies of stores, which are informed by knowledge of their customer base, their profiles, needs, and habits [
6]. Because of the complexities involved, only geographically explicit modelling enables a proper assessment of the diverse factors that drive consumer decisions while also providing a platform for simulating potential policy interventions or market changes [
7].
Among geographically explicit modelling strategies, agent-based models (ABMs) have emerged as a widely adopted approach due to their capacity to simulate complex systems by capturing the behaviours and interactions of autonomous decision-making agents within a spatial context [
1,
8]. Within an ABM, each agent operates according to a set of predefined rules, yet the collective behaviours of these agents can give rise to emergent patterns that may be difficult to predict using traditional modelling approaches [
9]. When combined with Geographical Information Systems (GIS), an ABM becomes particularly powerful [
10]. GIS data supply the spatial and environmental context, while an ABM manages the cognitive and behavioural rules of the agents, creating simulations that more closely represent real-world processes.
However, a key requirement for accurate and reliable ABM simulations lies in constructing heterogeneous agent profiles [
11]. Heterogeneity reflects the diversity among individuals in terms of demographic attributes, socioeconomic status, behavioural tendencies, and preferences [
12]. In the context of food retail choice, for instance, individuals may differ in household size, income, dietary preferences, vehicle ownership, and daily commuting patterns—all of which can greatly influence where and when they choose to shop for groceries. By incorporating such variation, model builders can produce simulations that better approximate real-world decision-making processes.
Despite the clear importance of agent heterogeneity, acquiring detailed, individual-level data to build such profiles can be challenging. Privacy regulations, for example, often limit the extent to which personally identifiable information can be shared or used for research purposes [
13]. In addition, the cost and logistics of large-scale surveys may be prohibitive, and secondary data sources may not always offer the level of granular details required to construct agent profiles. These obstacles have led researchers to seek creative solutions for data gathering and synthesis. When real-world data are sparse or incomplete, synthetic approaches—where populations are generated or augmented based on statistical methods—can fill some of the gaps [
14,
15,
16].
Synthetic population data are generated through techniques that approximate the demographic composition of actual populations. Synthetic populations are constructed based on census information or large-scale surveys, and they have the same statistical characteristics as real-world groups, including household size, age distribution, and income brackets [
14]. These methods have been widely applied in ABM research to represent agent diversity without obtaining personally identifiable information [
17]. Consequently, synthetic population datasets serve as an invaluable resource for modellers who wish to represent a range of demographic attributes while respecting data privacy constraints.
Secondary data sources, such as loyalty card program data, can be combined with synthetic population data to help augment information about customer choice. Since the 1990s, loyalty card programs offered by retailers have become popular in the retail sector. These programs reward registered customers by providing incentives, e.g., special pricing or cash-back bonuses, and allow the retailer to track purchasing patterns. These programs allow retailers to track cardholder interaction with their stores and often capture valuable information on consumer attributes and behaviours, such as customer residential addresses, types of products purchased, frequency of store visits, and total amount spent per visit [
18]. While these data are typically anonymized and only provide limited information on the social demographics to protect individual identities, they can still offer robust insights into the purchasing patterns of diverse customer segments.
Therefore, there is strong potential for the integration of these two types of data, loyalty card program data and synthetic population, to develop richly detailed and heterogeneous agent profiles. Loyalty card program data provide anonymized information about consumers’ behaviour, revealing purchasing habits and frequency. In contrast, synthetic population data addresses the need for demographic and socioeconomic attributes to construct a broader customer landscape. By bringing these complementary datasets together based on residential postal codes, researchers can overcome many of the limitations posed by each data source. Rather than relying solely on synthetic demographic profiles that may or may not accurately reflect actual consumer behaviour or exclusively on loyalty card data that may lack detailed demographic attributes, a combined dataset helps produce a holistic representation of consumers in a specific urban environment.
However, raw data integration alone may not be sufficient to understand and exploit the underlying heterogeneity within a consumer base, which can yield a large database without identifying any patterns. Clustering methods come into play as a powerful analytical tool for grouping individuals who share similar demographic, socioeconomic, and behavioural attributes [
19], and further help build the computational representation of this typology as specific “agent types” in an ABM [
20]. This clustering process not only assists modellers in identifying meaningful segments within large, complex datasets but also aids stakeholders, such as retailers, urban planners, and policymakers, in grasping the diversity of consumer types across geographic areas. Through segmentation, decision-makers can develop tailored strategies by informing marketing strategies, store location planning, or targeted policy interventions.
The combined approach outlined above overcomes two principal limitations of most existing approaches to grocery customer clustering. First, they frequently rely on a single data source—either behavioural data (e.g., loyalty card data) or socioeconomic data [
21]. In studies that emphasize socioeconomic data, census information is commonly employed rather than a synthetic population, thereby restricting customer variability and reducing agent heterogeneity in agent-based models. Consequently, it becomes difficult to capture a comprehensive understanding of customer behaviour. The second limitation is the lack of quantitative and geographical perspectives of customer clusters. For instance, although ref. [
22] created customer segmentation for grocery purchases that incorporates clear geographical insights, the primary focus on model building precluded the inclusion of quantitative representations of clusters and socioeconomic variables. This can limit the utility and interpretability of segmentation for broader analytical or practical applications.
Building on these observations, this paper integrates loyalty card program data with synthetic population data and applies clustering methods to enhance our understanding of grocery customers. By bridging the gap between single-source studies and those lacking quantitative geographic insights, this work seeks to capture both behavioural and socioeconomic dimensions of customer segments within a spatially explicit context. We argue that such an integrated approach not only enriches ABM simulations by representing heterogeneous consumer profiles more accurately but also provides guidance for retailers, urban planners, and policymakers aiming to make data-driven decisions. In the remainder of this paper, we outline our methodology and present a quantitative view of customer segments.
3. Results
Following segmentation using the k-means algorithm, the 295,631 customers in the loyalty card dataset are clustered into seven groups. As shown in
Figure 4, Clusters 2 and 4 have the largest number of total individuals among all the groups, and they together contribute more than half of the total customers recorded in the loyalty card dataset. On the other hand, Clusters 3, 5 and 6 have the lowest numbers at 10,000 individuals, and Clusters 0 and 1 range between 20,000 and 40,000 individuals.
Table 1 illustrates the descriptive statistics of all numerical variables for generating the typology. In the following part, we will describe these clusters from the perspectives of each attribute.
3.1. Total Income
Total individual gross income is a variable obtained from the synthetic population, and according to ref. [
34], income is the most frequently mentioned variable influencing customer choice in the marketing literature. Therefore, we believe including this attribute is imperative to carrying out proper customer segmentation in terms of shopping choices.
Table 1 and
Figure 5 illustrate the total income of each segment of customers. The average income value of most clusters is approximately CAD 40,000 to CAD 50,000, except for Cluster 3. The mean value of Cluster 3 is the highest at CAD 110,898.73, while also having significant variability, which is around 2 times that of the other clusters.
3.2. Frequency of Store Visits
Based on the transaction records in the loyalty card dataset, the frequency of customer shopping in a total period “F” can be calculated as the number of times a customer visits any store of the chain during the total activation period of loyalty cards, which is calculated through the following equation:
The variable “F” thus captures the frequency of visits, e.g., an F value of “7” indicating one visit weekly. In the clustering process, this behavioural attribute is also included in the K-means method, and we can examine this variable by clusters.
Table 1 and
Figure 6 provide a detailed comparison of this variable “F” (representing the average number of days between store visits) across the seven identified customer segments, revealing notable differences in shopping frequency. Cluster 0 stands out as the most frequent shopper segment, with a median interval of only 7.39 days between visits, which means they visit stores every week. By contrast, Cluster 6 has a markedly higher median interval of 1034 days, making its members the least frequent group in the analysis, and it also has the fewest number of customers. This thus identifies this cluster as infrequent visitors who shopped less than 5 times over their entire activation period. Positioned between these two extremes, Clusters 1, 2, 3, and 4 exhibit moderate intervals (ranging from 12.77 to 16.00 days), indicating middle ground in terms of shopping frequency (i.e., approximately once every 2 weeks on average). They also make up the largest group of individuals. Cluster 5, at 34.47 days per visit (i.e., approximately once per month), is substantially less frequent than Clusters 0 through 4 but still far below the considerable gap shown by Cluster 6.
3.3. Distance to Their Most Frequently Visited, as Well as Nearest Store
To examine whether distance influences customers’ store choices, we analyzed the distances to both the most frequently visited store and the nearest store within each customer segment. Distances were obtained from the loyalty card dataset, given the well-established impact of geographical proximity on consumer decision-making. Within the integrated dataset, we have each store’s exact address converted into latitude/longitude, while customer location is approximated by postal code. We compute the straight-line distance between these coordinates.
Table 1 provides the statistics of the distances between customers and their most frequently visited store. Overall, Clusters 0–4 exhibit relatively small mean distances, ranging from approximately 1.20 to 1.74 km, suggesting that customers in these segments tend to frequent a store that is geographically close to them. Cluster 2 has the smallest mean distance (1.20 km), indicating that these customers strongly prefer nearby stores.
By contrast, Cluster 5 stands out with a much larger mean distance (14.27 km) and the highest standard deviation (11.09), implying that some customers within this segment routinely travel farther to shop. Cluster 6 has a moderately larger mean distance than Clusters 0–4 (3.62 km), and its higher standard deviation (4.88) also suggests more variation in store choices based on location.
Figure 7 depicts the box plots for these distances by cluster, highlighting the pronounced difference in distance for Cluster 5 in comparison to the other segments.
Based on the customer and store location, the average distance to the nearest store that each cluster’s members have visited can also be summarized and compared with the most frequently visited store. In general, Clusters 0–4 again show smaller mean distances, all below or around 1 km. Notably, Cluster 2 has the smallest mean distance (0.66 km), suggesting that these customers tend to have at least one store located very close to their residential postal codes.
Cluster 5 still registers the greatest distance (3.18 km) to the nearest store, though notably smaller than its distance to its most frequently visited store. Cluster 6 also has a higher mean distance compared to most other clusters (1.20 km), with a relatively large standard deviation of 1.02. These patterns suggest a greater willingness among individuals in Clusters 5 and 6 to travel farther, even to the nearest store, compared to the other segments.
Figure 8 illustrates these distances via box plots, reflecting the comparatively wide distribution of distances for Cluster 5 and the moderate distances for Cluster 6.
Overall, these findings suggest that while most customer segments prefer stores located close to their primary residence or routine travel paths, certain segments (notably Clusters 5 and 6) are more likely to travel longer distances—whether to their most frequently visited store or to other stores in the chain nearest to them—indicating distinct shopping preferences and behaviours within these groups.
3.4. Average Spending per Transaction
By examining the average spending per transaction, we can identify key differences in how much customers typically spend during each visit, revealing the sales value of each cluster to the retailer. This analysis helps determine which groups are more inclined to spend more and those that exhibit more conservative purchasing habits, allowing for targeted strategies and individual simulation that align with each segment’s distinct financial behaviours.
The spending patterns vary notably across the clusters, with Cluster 1 standing out due to its significantly higher mean spending (CAD 115.71) compared to other clusters, whose average spending ranges between CAD 31.72 and CAD 39.75. Cluster 1 also shows a relatively high standard deviation (CAD 47.83), indicating greater variability in spending compared to other segments.
Figure 9 displays box plots of the average spending per transaction by cluster, further highlighting the disparities among the clusters. Most clusters exhibit compact distributions with lower median spending values and moderate variability, suggesting consistent spending behaviours within these segments. In contrast, Cluster 1 shows a higher median and broader spread, reinforcing the data that this segment spends considerably more per transaction than others. The spread observed in Cluster 1 suggests that customers in this segment are more likely to engage in higher spending transactions.
3.5. The Number of Unique Stores Visited
Examining the number of unique stores each customer visits provides additional valuable insights into their shopping preferences. A higher count may indicate more exploratory or variety-seeking tendencies, whereas a lower count suggests a narrower focus on specific stores or a very low frequency of shopping. In this section, the number of unique stores visited are compared using descriptive statistics and a box plot visualization.
As shown in
Table 1 and
Figure 10, Cluster 0 stands out with the highest mean number of unique stores visited (9.93) and a relatively wide range (std = 3.09). By contrast, Cluster 6 exhibits the lowest mean (1.11) and the narrowest spread (std = 0.31), suggesting that members of this segment tend to visit very few, if any, additional stores apart from their “usual” store. Based on the results of shopping frequency mentioned earlier, we also know that Cluster 6 consists of extremely low-frequency shoppers. This helps explain the narrow choices of store visits as this cluster is made up of very infrequent shoppers.
In between these extremes, Clusters 1, 2, 3, 4, and 5 have average values ranging from 3.40 to 4.50, indicating a moderate number of different stores visited. These patterns are also evident in the box plots, where each cluster’s median, quartiles, and any outliers visually reflect the variability observed in the table. Overall, the data suggest that Cluster 0 customers explore visiting many different stores, whereas Cluster 6 shoppers appear the most limited in their store visits.
3.6. The Percentage of Shopping Done in the Evening
To further examine customer behaviours, we analyzed the proportion of shopping that occurs during the evening. Cluster 2 demonstrates the highest mean percentage of evening shopping at 58.77%, with a relatively low standard deviation (16.66), indicating that most customers in this segment consistently prefer to shop in the evening. Cluster 0 follows with a mean of 32.39%, and Cluster 3 has a comparable mean of 30.76%, though both clusters display wider variability in their distributions, as reflected by their higher standard deviations.
In contrast, Cluster 4 exhibits the lowest mean percentage of evening shopping at 12.64%, suggesting that these customers are least likely to shop in the evening. Cluster 1 also shows a relatively low mean (20.49%) but is more spread out, as indicated by a standard deviation of 19.67. Clusters 5 and 6 occupy the mid-range, with mean values of 25.67% and 27.76%, respectively. The particularly high standard deviation (41.94) for Cluster 6 suggests that while some customers do a considerable portion of their shopping in the evening, others in this cluster are not so consistent.
The box plots in
Figure 11 illustrate these differences clearly. Cluster 2 has the highest interquartile range and median, confirming its strong evening-shopping tendency, whereas Cluster 4 remains consistently low in this behaviour. Clusters 0, 3, 5, and 6 fall between these two extremes but exhibit varying degrees of spread, indicating diverse evening-shopping patterns within each group.
3.7. The Percentage of Shopping on the Weekend
The proportion of shopping conducted on weekends also provides unique insights into customers’ shopping patterns and preferences, as shown in
Figure 12. By examining the mean, median, and spread of weekend shopping percentages across different customer segments, we can identify segments that are more inclined to make purchases on weekends, as well as those that appear less influenced by weekend shopping opportunities.
In general, Cluster 1 exhibits the highest average proportion of weekend shopping (mean = 43.23%), suggesting that these customers are particularly likely to carry out their shopping on weekends. The median value of 41.23% reinforces the tendency toward weekend purchases within this group, although their standard deviation (25.62) indicates a moderate spread in behaviour. By contrast, Cluster 4 shows the lowest mean percentage (25.09%), implying a preference for weekday shopping. Its median is 24.10%, again underscoring the relatively limited weekend engagement among these consumers. Clusters 0, 2, 3, and 5 display mean weekend shopping proportions ranging from roughly 30% to 33%, indicating moderate weekend shopping activity. Cluster 6 is notable for having a large standard deviation (43.67) and a median of 0.00, suggesting the presence of customers who rarely shop on weekends alongside others with very high weekend shopping percentages. This wide variability points to the heterogeneous nature of shopping behaviour within Cluster 6.
3.8. The Most Frequently Purchased Items
Examining the most frequently purchased items can inform the design of micro-level actions for both customers and retailers (e.g., targeted purchases and promotional strategies). However, due to the large amount of missing data, only 40% of the customers could be linked to specific items. We also attempted clustering using only the subset of customers with item-level data, but the results were broadly similar. Consequently, items purchased were not included in the final clustering.
Table 2 presents the top 10 most frequently purchased items.
Overall, the rankings remain consistent across most clusters, with the same 10 items—beverage, chips, juice, cheese, butter, mayonnaise, eggs, yogurt, milk, and sugar—appearing most often. Two exceptions are Cluster 1 and Cluster 6. As mentioned above, Cluster 6 mostly comprises one-time shoppers, leading to a different ranking; milk and sugar are replaced by bread and cereals. In Cluster 1, cereals replace sugar, while the remaining items appear in similar positions as in other clusters.
3.9. The Spatial Distribution of Customer Clusters
Understanding how customer segments are dispersed across a city can offer valuable insight into localized consumer behaviour, potentially guiding marketing strategies, resource allocation, and service planning and further aiding ABM modelling.
Figure 13 illustrates the spatial distribution of the seven identified customer clusters in Montreal, which is similar to a presentation of geodemographics [
35]. Each colour-coded region represents the dominant cluster in that area, allowing for a clear visualization of where each segment is most prevalent. The basic spatial unit is the dissemination area (DA), which is the smallest standard geographical unit in the Canadian census available to the public.
Figure 14 illustrates the store density across Montreal in the unit of store per square km, and areas with the highest store density are predominantly found within the downtown core. By contrast, many peripheral suburbs exhibit far fewer retail establishments. This can also explain Cluster 5, whose members live far from the downtown, have higher than average shopping distances, and shop less frequently; whereas Clusters 0 to 4, whose members mainly live near the downtown area, show the opposite pattern.
As shown in
Figure 13, Clusters 4 and 5 dominate much of the total region, although they occupy distinct areas. Cluster 4, the most populous cluster, is concentrated in central and eastern Montreal, and in the southwestern suburbs. By contrast, Cluster 5 primarily appears in suburban areas, especially those located far from retail stores, as shown in
Figure 14, though it comprises fewer individuals overall. Meanwhile, Clusters 0, 1, and 2 cluster around the downtown core, where they are largely surrounded by Cluster 4. In addition, Clusters 3 and 6 are the least populous, and thus, only a small number of DAs are dominated by these segments.
To further understand the spatial contiguity of these clusters, we also compute the overall join count ratio of this landscape with the rgeoda library [
36]. We rely on the joint-count ratio since some widely used metrics of clustering, such as Moran’s I, are not suitable for the categorical variable with multiple classes [
37]. The overall join count ratio is 0.6793, which means 67.93% of total connections among adjacent objects belong to the same cluster [
38], and this indicates a moderate to strong tendency of spatial clustering.
4. Conclusions and Discussion
In this study, seven distinct customer segments were identified using both demographic and behavioural attributes. Each cluster exhibits unique characteristics that can inform targeted marketing strategies, store location planning, and personalized promotions.
Table 3 summarizes each segment and provides a descriptive label that captures its principal shopping profile. For example, Cluster 4’s label of “Weekday Day Shoppers” represents the average shopping behaviour of the majority of customers. They usually shop at a nearby store, and prefer to shop on weekdays and in the daytime. This clearly distinguishes them from other clusters. For example, Cluster 0 represents the most “Frequent Store Explorers”; they visit the retail chain’s stores on a weekly basis and frequent multiple locations, suggesting variety-seeking and convenience-oriented habits. Cluster 1, “High-Spending Weekend Shoppers”, have a lower shopping frequency but the highest spending per trip, and they strongly favour weekend shopping. Individuals belonging to Cluster 2 usually shop in the evening and near their residences. While people in Cluster 3 have the highest income, their behaviour is quite similar to Cluster 0, yet they have a lower shopping frequency and travel further to get to stores. Cluster 5 represents those who travel far and are infrequent customers, and they also spend moderately. This may result from their shopping in stores under different brands, and this also possibly explains the behaviours of Cluster 6. Cluster 6 appears to be a group of outliers, as its members only shop at the chain’s stores intermittently over a three-year period, suggesting they may rely on other shopping options, have relocated outside the study area, or took the membership for one-off access to promotional pricing.
Segmentation results like this not only enhance the understanding of the local customer landscape but can also support targeted marketing strategies and inform store location planning. Retailers could leverage the findings to refine store placement in areas with dense clusters of high-value or frequent shoppers. Further, they might tailor promotions by linking them to known cluster preferences in some neighbourhoods, as indicated by the mapping of clusters in
Figure 13. For instance, store managers in the downtown core area can hold weekend and after-work promotions and deals for Cluster 1 and Cluster 2, since they are concentrated in that region.
Policy makers and urban planners can also utilize this study to understand how demographics and distances interact with shopping frequency, which may guide further infrastructure development or store location decisions. For example, although it is possible that Cluster 5, who live in suburban areas have other shopping alternatives in their surroundings, managers of the retail store whose loyalty card data we are using can plan to establish new stores in that region to decrease travel distance, and possibly cause them to shift to Cluster 0–4, and thereby further increasing the engagement of these consumers. Furthermore, public-private partnerships could be formed to support community-focused retail initiatives or mobile markets, ensuring that populations with limited access to transportation are not left out of convenient shopping options. Additionally, local governments can subsidize public transportation or develop new bus routes to reduce travel barriers in regions without access to grocery stores. By actively influencing the spatial distribution of grocery services and improving connectivity, policy makers can help align consumer behaviour with healthier, more sustainable purchasing patterns.
Agent modellers also benefit from these insights. They can draw on these findings to capture both the quantitative behaviour of consumers and the heterogeneity represented by the clusters. Additionally,
Figure 15 illustrates a proposed ABM framework based on the clustering results. The elements highlighted in red represent the attributes and behaviours that vary by ABM typology. In the initialization stage, the spatial distribution of these clusters can enhance the realism of more precise and robust simulations intended to reflect actual consumer landscapes. Specifically, during model execution, each customer agent in this model is assigned a cluster label to represent the heterogeneity among them. Consequently, agents from different clusters exhibit diverse behaviours, including decisions on shopping time, frequency, distance, and spending. Through these behaviours, customer agents select stores in the environment, and store agents can also adjust pricing strategies to attract customers. As a result, this proposed ABM can serve as a tool to observe dynamics between customers, stores, and the environment, an aspect often overlooked in previous data-focused studies. With this tool, we can answer research questions such as “How do customers make decisions on when and where to shop for groceries within the urban space?”
However, several limitations and uncertainties should be noted. First, the reliance on loyalty card program data means that only registered members could be analyzed, thereby excluding those who opt out of such programs. Further, there are also uncertainties within the loyalty card dataset [
39], such as the outliers in Cluster 6 (some of whom may have joined for one-time access to promotional pricing). Future research should consider collaborating with multiple retailers or incorporating additional datasets to capture a more representative customer behaviour landscape. Second, while the synthetic population provides valuable demographic attributes, its matching process—based on nearest-neighbour approach—introduces potential spatial uncertainty, especially in densely populated areas. To minimize the impacts of this, we only incorporate income, an attribute that has been shown to have distinct spatial distribution across Montreal and is important to the action of purchase. Third, raw data quality also limits our choice of clustering attributes, for example, missing records of items purchased. In future work, incorporating other sources of data could help to generate greater insights at the micro-level. In addition, since we used the pre-COVID data in this study, the changes brought by COVID pandemics on shopping behaviours are not considered. In future studies, if updated data become available, incorporating this can better capture the post-pandemic landscape.
In summary, this study demonstrates that merging loyalty card program transactions with a synthetic population can yield quantitative and spatially explicit insights into the multifaceted nature of grocery-shopping behaviours. Distinguishing customers into seven segments reveals substantial diversity in the dimension of spatial distribution, shopping behaviours, and demographics. These insights, in turn, can guide data-driven interventions—ranging from marketing campaigns to store location optimizations and urban planning initiatives—to better address the needs of Montreal’s diverse consumer base. Ultimately, this approach provides the basis for the ABM approach, which can enhance our capacity to plan more effectively for the dynamic interplay between customers, retailers, and the broader spatial environment in which these actors operate.