Bayesian Modeling of Travel Times on the Example of Food Delivery: Part 1—Spatial Data Analysis and Processing

Gibas, Justyna; Pomykacz, Jan; Baranowski, Jerzy

doi:10.3390/electronics13173387

by Justyna Gibas, Jan Pomykacz and Jerzy Baranowski *

Department of Automatic Control & Robotics, AGH University of Krakow, 30-059 Krakow, Poland

* Author to whom correspondence should be addressed.

Electronics 2024, 13(17), 3387; https://doi.org/10.3390/electronics13173387

Submission received: 20 June 2024 / Revised: 16 August 2024 / Accepted: 17 August 2024 / Published: 26 August 2024

(This article belongs to the Special Issue Advances in Intelligent Data Analysis and Its Applications, 2nd Edition)

Abstract

Online food delivery services are rapidly growing in popularity, making customer satisfaction critical for company success in a competitive market. Accurate delivery time predictions are key to ensuring high customer satisfaction. While various methods for travel time estimation exist, effective data analysis and processing are often overlooked. This paper addresses this gap by leveraging spatial data analysis and preprocessing techniques to enhance the data quality used in Bayesian models for predicting food delivery times. We utilized the OSRM API to generate routes that accurately reflect real-world conditions. Next, we visualized these routes using various techniques to identify and examine suspicious results. Our analysis of route distribution identified two groups of outliers, leading us to establish an appropriate boundary for maximum route distance to be used in future Bayesian modeling. A total of 3% of the data were classified as outliers, and 15% of the samples contained invalid data. The spatial analysis revealed that these outliers were primarily deliveries to the outskirts or beyond the city limits. It also shows that the Indian OFD market has similar trends to the Chinese and English markets and is concentrated in densely populated areas. By refining the data quality through these methods, we aim to improve the accuracy of delivery time predictions, ultimately enhancing customer satisfaction.

Keywords: food delivery services; travel time estimation; spatial analysis; data preprocessing; Bayesian modeling

1. Introduction

The e-commerce sector is in a constant state of growth and evolution, particularly within its subdomain of online food delivery (OFD) [1,2]. Recent market forecasts indicate a steady rise in revenue for companies offering such services. With numerous players in the market, ensuring customer satisfaction is paramount for a company’s survival. Customers increasingly demand user-friendly applications that simplify the ordering process with just a few taps, while also providing features such as delivery time estimates and communication channels with couriers [3]. However, estimating delivery times accurately without real-time data presents a significant challenge. While companies can track courier positions via GPS, they lack access to real-time information on factors such as traffic, accidents, and roadworks.

Before any model, algorithm, or computational technique can be applied, the initial step involves finding and preparing data. Effective data analysis and processing play a pivotal role in determining the performance and outcomes of models, given that many techniques rely on clean and comprehensive data. Nonetheless, real-world data seldom meet these ideal conditions and are often challenging to interpret. Even though Bayesian models, discussed further in Part 2 of this article [4], are less affected by noisy data and outliers, they still require data processing and reasonable assumptions.

Visualization of data is equally significant. It offers valuable insights, particularly when combined with expertise in the relevant field. For instance, plotting geospatial data on maps can reveal outliers, patterns, or anomalies. Proper visualization is also essential to draw appropriate conclusions and evaluate model outputs [5].

This article leverages spatial data analysis and standard data preprocessing techniques to enhance the data quality utilized in Bayesian models aimed at predicting food delivery times. By examining the spatial distribution of online food deliveries, we identify and rectify faulty data points. It also gives us insight into the Indian food delivery market and customer behavior, which, to the best of our knowledge, has not yet been explored. Drawing from existing literature, we identify the most crucial features for our models. Additionally, we investigate the relationships between variables in our dataset to uncover any patterns that could impact the behavior of our models.

The contributions of this paper can be summarized as follows: (1) By leveraging the OSRM API, we effectively transformed raw data into meaningful predictors, which are essential for the implementation of Bayesian models. (2) Standardizing the data proved to be a crucial step for numerical stability in the models without compromising their interpretability. (3) Visualization provided valuable insights into outliers, enabling us to justifiably exclude them from the training data. (4) Route analysis provides insight into the Indian food delivery market and the ability to compare with other countries (mainly England and China).

The remainder of this paper is organized as follows: In the next section, the relevant studies in the extant literature are reviewed and discussed. Section 3 provides an overview of the data used and describes the data preprocessing and visualization methods in detail. In Section 4, we perform an in-depth analysis of the processing results and a spatial analysis of routes. Finally, in Section 5, we draw conclusions and describe possible future work.

2. Literature Review

2.1. Distance and Travel Time Estimation

Typically, OFD platforms acquire customers’ GPS coordinates when they open the app to place an order [6]. Then, their location is used for restaurant recommendations and to specify the delivery destination. Subsequently, customer GPS coordinates can be used to predict travel time with an origin–destination (OD)-based approach. Although it allows for estimating time without knowing the exact route and distance, a significant issue with this method is that it requires an exact departure time, which is usually unknown. Moreover, most of the research using OD-based methods has focused on other applications such as taxi trip duration [7].

Our problem demands that we provide customers with accurate waiting times without relying on knowledge of the precise preparation time [8] or the courier’s departure. Moreover, variables used in Bayesian models need to be obtainable from the probabilistic distributions. To reduce complexity, we transformed raw timestamps into meal preparation times, aligning the predictor variable’s units with the output variable (minutes). Meal preparation times likely follow well-defined distributions (e.g., normal, gamma, uniform), whereas event timestamps may exhibit complex distributions with peaks corresponding to breakfast, lunch, or dinner order times. We believe that this decision increases the explainability of the models. As for the geographic coordinates, we could try representing them as probability distributions. However, a challenge would likely arise when interpreting the outcomes of our models, since coordinates alone convey very little information about possible routes or the time required to travel them. Instead, we could choose to transform the coordinates into approximate shortest route distances. This approach would provide a clear dependency: the longer the route, the longer it takes to travel. Similarly to transforming timestamps into meal preparation time, we believe that transforming geographical coordinates into route distances increases the explainability and interpretability of our models.

Joshi et al. articulated a compelling argument advocating for the consideration of road distance over geometric distances like the Haversine distance for accurate delivery time estimation. They contend that relying solely on geometric distances can result in oversimplified representations, potentially leading to unrealistic data inputs into predictive models. Their approach does not entail route generation but rather involves mapping GPS coordinates onto a road network map of the analyzed area. They mapped the city network to a graph and assumed that the weight of each edge is the average travel time across all delivery vehicles on the corresponding road. Their study was focused on effective batching of orders and assignment of orders to vehicles [9]. Ji et al. adopted a method for estimating travel time that utilizes GPS trajectories aligned with specific road network segments. They used GPS trajectories of food carriers and removed stay points. Then, they mapped GPS trajectories to road segments and obtained the travel time of carriers on each road segment (which was included in their dataset). Their research, however, is tailored towards optimizing the grouping of tasks related to efficient delivery operations [10]. Ulmer et al. assumed that all drivers follow a mobile navigation device to determine the best paths. They approximated road distance by multiplying the Euclidean distance by a factor of 1.4. It has been demonstrated that this approach closely approximates the relationship between Euclidean and street distances. They chose this method because they were attempting to dynamically control a fleet of drivers and had so many potential paths that they exceeded the limitations of commercial mapping services [11].

Alternatively, some researchers have turned to popular routing tools like the Google Maps API, the Baidu Maps API, and the TomTom API [12,13,14,15,16,17]. Yet, the use of these services is very expensive for OFD companies, often costing more than one million dollars annually. It should be noted that these costs are particularly high for large companies, where the number of orders is the primary factor influencing the price of the service. Companies attempt to reduce these costs by caching a set of routes or utilizing historical delivery data [18]. As a cost-effective alternative, open-source routing services like OSRM (Open Source Routing Machine) have begun to be explored [19].

2.2. Spatial Analysis

Spatial analysis related to food outlets and OFD platforms has also grown in popularity, as researchers are increasingly interested in factors that influence customer behavior. Most of the spatial approaches concentrate on the distribution of food outlets, food accessibility, and its impact on diet and health. The most frequently used methodologies in spatial analysis are statistical methods and GIS [20].

Another objective of spatial analysis in the food delivery context is to explore the factors influencing the utilization of these services, particularly in relation to built infrastructure. Typically, densely populated areas are examined [15,21] because online food outlets tend to be more prevalent in urban regions [22]. Recent studies have examined the distribution of different types of food outlets (e.g., fine dining, fast food), which can affect the usage of OFD platforms [23]. Regardless of the research area, the periphery has the most limited access to food outlets, and OFD platforms cannot improve this [15,22,23,24]. Even in studies that are more focused on food delivery optimization, spatial analysis of the distribution of restaurants and customers plays a crucial role [25].

Spatial analysis of deliveries or journeys is not as common and typically focuses on exploring the connections between distinct regions [10,26]. In the realm of region segmentation, researchers often employ two primary methods: the grid-based approach and the road network-based approach. The grid-based method is particularly useful for visualizing smaller areas, such as those in Wang et al.’s article [26]. The road network-based approach is claimed to be more informative; yet, in densely populated urban areas, this approach may yield excessively small regions, necessitating a merging step prior to further analysis [10].

3. Materials and Methods

3.1. Data

In this study, we employed the Food Delivery Dataset, which is presently accessible on Kaggle [27]. Initially, the dataset was made available by HackerEarth for their machine learning competition. It encompasses more than 45,000 deliveries spanning 21 cities across India. The data span a three-month period, encompassing February through April of 2023. The dataset includes the location of restaurants and delivery destinations as well as other nonspatial information such as weather, traffic conditions, and time and date the order was placed. Table 1 provides a comprehensive list of all variables under consideration, along with their respective meanings.

3.2. Routes Generation

While the exact trajectory of the courier remains uncertain and unpredictable, approximating the route is essential for our purposes. This not only ensures a realistic representation of the delivery, but also allows us to detect incorrect data. Routing engines cannot generate routes between geographical coordinates that are not connected by a road network, and thus implicitly detect improper origin–destination pairs. To achieve this, we have opted to leverage the OSRM API [28], an open-source routing engine. By constructing tailored queries containing GPS coordinates for both the starting and ending points of the route, we can utilize the API to generate optimal route suggestions [29]. Furthermore, customization options such as transportation mode (e.g., car) and route type (e.g., shortest route) allow for further refinement of our queries. After receiving the routing request, the OSRM API passes it on to the OSRM routing engine, which employs the OpenStreetMap (OSM) [30] data to produce the optimal route according to the specified parameters. OpenStreetMap is an open geographic database that is updated and maintained by a community. It can be treated as an alternative to Google Maps. The routing engine considers multiple factors, including road types, speed limits, and turn restrictions, to produce an accurate and efficient route. The output of the routing request contains information about the route, including the route geometry, distance, and estimated travel time. If there is an issue processing the request, the API will provide an error code (e.g., “InvalidUrl”) indicating the reason behind the failure. In our case, there were three types of possible errors: “NoSegment”, “TooBig”, and “TooManyRequests”. The first indicates that one of the supplied input coordinates could not be snapped to a street segment. The second applies if the request size violates one of the service size restrictions. The last appears when the server is overloaded.
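The request and response handling described above can be sketched in a few lines of Python. This is a minimal illustration rather than the client used in the study: the function names and the sample coordinates are hypothetical, and the sketch only builds the URL and interprets an already decoded JSON response, without sending anything over the network. It assumes the public OSRM URL layout (service, version, profile, then "longitude,latitude" pairs) and the `code`/`routes` fields of the JSON response.

```python
from urllib.parse import urlencode

# Public demo server; any self-hosted OSRM instance exposes the same path layout.
OSRM_BASE_URL = "https://router.project-osrm.org"


def build_route_url(origin, destination, profile="driving", **options):
    """Build an OSRM v1 route URL. Points are (latitude, longitude) tuples."""
    # OSRM expects "longitude,latitude" pairs separated by semicolons.
    coords = ";".join(f"{lon},{lat}" for lat, lon in (origin, destination))
    url = f"{OSRM_BASE_URL}/route/v1/{profile}/{coords}"
    return f"{url}?{urlencode(options)}" if options else url


def parse_route_response(payload):
    """Return (distance_km, error_code) from a decoded OSRM JSON response."""
    code = payload.get("code")
    if code != "Ok":
        # e.g. "NoSegment", "TooBig" or "TooManyRequests"
        return None, code
    # OSRM reports route distance in meters.
    return payload["routes"][0]["distance"] / 1000.0, None


# Hypothetical restaurant and customer coordinates as (latitude, longitude).
url = build_route_url(
    (12.9716, 77.5946),
    (12.9352, 77.6245),
    overview="false",
    geometries="geojson",
    steps="true",
)
ok_result = parse_route_response({"code": "Ok", "routes": [{"distance": 5300.0}]})
error_result = parse_route_response({"code": "NoSegment", "message": "..."})
print(url, ok_result, error_result)
```

Keeping the error code next to the distance makes it straightforward to log failed origin–destination pairs for later inspection.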

3.3. Preprocessing

Preparing data as needed by the model is an essential part of delivery time estimation. Figure 1 shows the preprocessing steps, which are described in detail in the following.

Data cleaning is the initial part of preprocessing. It ensures that missing values will not lead to poor results and wrong conclusions. In our approach, we decided to remove rows with missing data and rows where the restaurant or destination GPS coordinates are outside the geographical boundaries of India. Approximately 15% of the samples were removed in this process. The removed geographical coordinates were logically incorrect (points in countries other than India, points in the ocean) and would have led to a large number of errors in later steps when using the OSRM API.
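A minimal sketch of this cleaning step is given below. The field names, the sample rows, and the rectangular bounding box are illustrative assumptions: a rough latitude/longitude box stands in for the geographical boundaries of India, whereas a real check could use the actual country polygon.

```python
import math

# Rough bounding box for India in degrees (an illustrative assumption,
# not the exact boundary used in the study).
LAT_RANGE = (6.0, 37.5)
LON_RANGE = (68.0, 97.5)

# Hypothetical field names for the two coordinate pairs in each row.
COORD_FIELDS = [("restaurant_lat", "restaurant_lon"), ("delivery_lat", "delivery_lon")]


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def in_bounds(lat, lon):
    """Check whether a point falls inside the assumed bounding box."""
    return LAT_RANGE[0] <= lat <= LAT_RANGE[1] and LON_RANGE[0] <= lon <= LON_RANGE[1]


def clean_rows(rows):
    """Drop rows with missing values or with any coordinate outside the box."""
    kept = []
    for row in rows:
        if any(is_missing(value) for value in row.values()):
            continue
        if all(in_bounds(row[lat], row[lon]) for lat, lon in COORD_FIELDS):
            kept.append(row)
    return kept


rows = [
    {"restaurant_lat": 12.97, "restaurant_lon": 77.59, "delivery_lat": 13.02, "delivery_lon": 77.64},
    # Point in the ocean near (0, 0).
    {"restaurant_lat": 0.0, "restaurant_lon": 0.0, "delivery_lat": 0.03, "delivery_lon": 0.03},
    # Latitude with a wrong sign.
    {"restaurant_lat": -27.16, "restaurant_lon": 78.04, "delivery_lat": 27.20, "delivery_lon": 78.08},
    # Missing longitude.
    {"restaurant_lat": 22.57, "restaurant_lon": None, "delivery_lat": 22.60, "delivery_lon": 88.40},
]
cleaned = clean_rows(rows)
print(len(cleaned), "of", len(rows), "rows kept")
```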

Route generation is performed using the OSRM API, which was described in Section 3.2. The created OSRM client uses asynchronous HTTP to find the shortest route between the restaurant and the delivery destination. If the response is successful, the distance and route geometry are saved; otherwise, the unsuccessful request and its reason are logged for future analysis. Out of the expected error types, we only encountered “TooManyRequests”. It was most likely caused by too many users accessing the server at once. To overcome this issue, we logged those failed routes and fetched them again when the service was more responsive. The other two types of errors would be difficult to correct, so the data related to such problematic routes would be excluded from modeling. The distance obtained from the routing API is integrated with the rest of the dataset as an additional column.
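One way to handle the “TooManyRequests” case is a bounded retry with a growing pause, sketched below with `asyncio`. This is a variant of the log-and-refetch procedure described above, not the study's client: the `fetch` coroutine is injected so that any asynchronous HTTP library could be plugged in, and the flaky fake fetcher, the URL, and all function names are hypothetical stand-ins used only to make the example self-contained.

```python
import asyncio

# Error codes worth retrying; other errors point to bad input data.
RETRYABLE_CODES = {"TooManyRequests"}


async def fetch_with_retry(fetch, url, attempts=3, delay_s=0.0):
    """Await fetch(url) and retry while the server reports overload.

    `fetch` is any coroutine function returning the decoded JSON payload.
    The last payload is returned even if it is still an error, so that
    the caller can log it.
    """
    for attempt in range(1, attempts + 1):
        payload = await fetch(url)
        if payload.get("code") not in RETRYABLE_CODES or attempt == attempts:
            return payload
        # Linearly growing pause before the next attempt.
        await asyncio.sleep(delay_s * attempt)


def make_flaky_fetch(failures):
    """Build a fake fetcher that is overloaded for the first `failures` calls."""
    calls = {"count": 0}

    async def fetch(url):
        calls["count"] += 1
        if calls["count"] <= failures:
            return {"code": "TooManyRequests"}
        return {"code": "Ok", "routes": [{"distance": 4200.0}]}

    return fetch, calls


fetch, calls = make_flaky_fetch(failures=2)
result = asyncio.run(fetch_with_retry(fetch, "https://example.invalid/route"))
print(result["code"], "after", calls["count"], "calls")
```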

Data transformation techniques are used to convert data into a suitable format. Date and time are especially difficult to analyze due to various factors. To overcome this, meal preparation time is calculated based on Time_Order and Time_Order_picked. The first variable represents the time the order was placed and the second the time it was picked up by the courier; their difference determines the meal preparation time. This will not correspond to the exact cooking time, as the prepared dish may wait for the courier to arrive. This choice enables the model to be used in cases where the restaurant furnishes an estimated preparation duration for the order, or where the statistics of preparation time can be derived from historical data [8].
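The timestamp transformation can be illustrated as follows. The sketch assumes that both times are stored as "HH:MM:SS" strings without a date, and that a pickup time earlier than the order time means the pickup happened after midnight; both are assumptions about the data format, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def preparation_minutes(time_ordered, time_picked, fmt="%H:%M:%S"):
    """Minutes between placing an order and its pickup by the courier."""
    ordered = datetime.strptime(time_ordered, fmt)
    picked = datetime.strptime(time_picked, fmt)
    if picked < ordered:
        # Assumed wrap-around: the order was picked up after midnight.
        picked += timedelta(days=1)
    return (picked - ordered).total_seconds() / 60.0


print(preparation_minutes("11:30:00", "11:45:00"))
print(preparation_minutes("23:55:00", "00:05:00"))
```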

Feature selection is one of the ways to cope with high dimensionality. The goal is to remove irrelevant and redundant features, which may introduce accidental correlations into models and reduce their generalization abilities. Feature selection also decreases the risk of over-fitting and reduces the search space, making the learning process faster and less memory consuming [31].

Feature scaling is a critical step in constructing effective models as it helps mitigate bias stemming from variations in the ranges and magnitudes of data. Among the most widely used techniques are normalization and standardization. We have chosen to implement standardization for selected features with a continuous distribution. This decision ensures that operations and results of the model will be more straightforward to interpret.
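As a brief illustration of standardization, the sketch below rescales a feature to zero mean and unit standard deviation and keeps the two statistics so that values can be mapped back to the original units. The sample distances are made up, and the use of the population standard deviation is an assumption; the sample standard deviation would work equally well.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores together with the mean and standard deviation used."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(value - mu) / sigma for value in values], mu, sigma


def destandardize(z_score, mu, sigma):
    """Map a standardized value back to the original units."""
    return z_score * sigma + mu


distances_km = [2.0, 4.0, 6.0, 8.0]  # made-up values
z_scores, mu, sigma = standardize(distances_km)
print(z_scores, mu, sigma)
```

Storing `mu` and `sigma` alongside the model is what keeps the standardized inputs interpretable: any standardized value can be converted back to kilometers or minutes.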

Mapping categorical data is essential to use them as input for a model. It is crucial to note that, unlike in machine learning models, for Bayesian models created using the Stan library [32], data must be mapped starting from 1 upwards (as vectors are indexed from 1).
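A small sketch of such a mapping is shown below. The category labels and their order are hypothetical examples; the only point being illustrated is that the integer codes start at 1 rather than 0.

```python
def encode_one_based(values, order=None):
    """Map categories to integer codes starting from 1.

    `order` optionally fixes the category order; otherwise the sorted
    unique values are used.
    """
    categories = list(order) if order is not None else sorted(set(values))
    index = {category: code for code, category in enumerate(categories, start=1)}
    return [index[value] for value in values], index


# Hypothetical traffic density labels.
traffic = ["Low", "Jam", "High", "Medium", "Low"]
codes, index = encode_one_based(traffic, order=["Low", "Medium", "High", "Jam"])
print(codes, index)
```

Fixing the order explicitly keeps the codes stable across runs and lets ordinal categories keep their natural ordering.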

3.4. Visualization

As mentioned previously, visualization is also an important part of preprocessing. Plotting the data can reveal outliers or anomalies that cannot be easily identified in other ways. In this research, we will perform visual analysis of the generated routes followed by analysis of distributions of input data of the models.

The main focus of the route analysis is spatial visualization. We will explore various visualization methods, including heatmaps, interactive maps, and road network graph maps.

For the route analysis, each of the 21 cities is treated separately, allowing for a more detailed examination. The allocation of routes to each city was accomplished through clustering. We utilized the KMeans algorithm provided by scikit-learn [33], with clustering based on the locations of restaurants and delivery destinations.

To implement interactive maps, we used Folium [34], which is a wrapper for the Leaflet.js library. Folium allows for the creation of interactive Leaflet maps and supports a wide range of overlay formats, such as images, videos, and GeoJSON, enabling the embedding of multiple layers. For our maps, we used OSM as the base layer and added routes based on the geometry returned by the OSRM API. For clarity, the routes from each city were plotted on a separate layer that can be activated on the map. The main advantage of this approach is the ability to analyze small areas of the city without compromising image quality. Moreover, routes are embedded on the OSM, which gives the opportunity to check infrastructure near the starting and ending points of the route. This may also reveal anomalies, e.g., if the starting point of the route is not near the restaurant.

Heatmaps are particularly useful for analyzing large datasets or densely located points. Our goal is to depict road usage with an intensity map similar to Navarro’s approach [35]. We divided each city into a grid of square cells of a selected side length, with each cell represented by one pixel of an image. Every route point that falls into a cell increments the value of that cell by one, which changes the color of the pixel and effectively creates a heatmap. However, the raw route geometry data are unevenly distributed, as illustrated in Figure 2. Points are clustered near intersections and turns, whereas on straight sections they are sparsely spaced. To achieve a meaningful scale, we need to interpolate points on these straight sections. We implement linear interpolation for each route segment where the distance between consecutive points exceeds a selected threshold (e.g., 5 m). Linear interpolation is a popular method; however, it has a high error when the distance between interpolated points is too large. This is particularly noticeable on curved road segments [36].

Other widely used methods include cubic interpolation, neighbor-based interpolation (also known as distance-based interpolation), and spherical linear interpolation. Cubic interpolation uses the current value and gradient vector to estimate intermediate points. This method has very low error rates but requires significantly more computational time, often up to ten times longer [37]. Neighbor-based interpolation determines the interpolated value by considering the surrounding points, calculating a weighted average of the nearby observations or using the nearest observation values. Spherical linear interpolation is particularly useful over long distances where the curvature of the Earth must be taken into account. In our case, we try to interpolate data on straight road segments, and the distance between two consecutive coordinates is relatively small, usually smaller than 100 m. We find that linear interpolation is sufficiently precise for this task.
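The densification and counting steps can be sketched as follows. This is an illustrative sketch under simplifying assumptions: points are (latitude, longitude) tuples, interpolation is linear in degrees (reasonable only over short segments), distances are measured with the Haversine formula, and grid cells are defined in degrees rather than meters. All function names and sample values are hypothetical.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def densify(route, max_gap_m=5.0):
    """Insert linearly interpolated points so gaps do not exceed max_gap_m."""
    dense = [route[0]]
    for start, end in zip(route, route[1:]):
        # Number of equal pieces needed to respect the gap threshold.
        pieces = max(1, math.ceil(haversine_m(start, end) / max_gap_m))
        for k in range(1, pieces):
            t = k / pieces
            dense.append((start[0] + t * (end[0] - start[0]),
                          start[1] + t * (end[1] - start[1])))
        dense.append(end)
    return dense


def heatmap_counts(points, cell_deg):
    """Count points per square grid cell with a side of cell_deg degrees."""
    return Counter((math.floor(lat / cell_deg), math.floor(lon / cell_deg))
                   for lat, lon in points)


# A straight segment of roughly 100 m (made-up coordinates).
route = [(28.6000, 77.2000), (28.6009, 77.2000)]
dense = densify(route, max_gap_m=5.0)
counts = heatmap_counts(dense, cell_deg=0.0005)
print(len(dense), "points in", len(counts), "cells")
```

Without densification, this segment would contribute only its two endpoints to the heatmap, regardless of how many cells it crosses.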

Another method we considered was visualizing routes on road network graphs. This required generating a graph representing the road network. We utilized the OSMnx library [38], which provides tools to model, analyze, and visualize street networks and other geospatial features from OSM. The generated graph, along with the routes obtained from the OSRM API, needed to be converted into a common format and then plotted as a high-resolution image. In contrast to interactive maps, differences in road segments’ intensity will be more visible. Additionally, there will be no unnecessary elements like different icon types.

In order to minimize computation time and ensure the appropriate level of map detail, we recommend generating a graph only for the city area where the analyzed routes occur. In our case, the maximum and minimum values of longitude and latitude that occurred on routes in a given city were selected as the limit values. No information about any route will be lost, but four points of all routes are located exactly on the border of the image. This will not influence analysis of routes distribution.
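The limit values can be obtained directly from the route geometries, as in the sketch below. The function name and sample routes are hypothetical, and the output order (north, south, east, west) is a choice made here for readability; before passing the values to a graph-building function, they would need to be arranged in whatever order the installed library version expects.

```python
def bounding_box(routes):
    """Return (north, south, east, west) limits over all route points.

    Each route is a list of (latitude, longitude) tuples.
    """
    lats = [lat for route in routes for lat, _ in route]
    lons = [lon for route in routes for _, lon in route]
    return max(lats), min(lats), max(lons), min(lons)


# Two made-up routes within one city.
city_routes = [
    [(30.31, 78.02), (30.33, 78.05), (30.36, 78.04)],
    [(30.29, 78.00), (30.30, 78.07)],
]
north, south, east, west = bounding_box(city_routes)
print(north, south, east, west)
```

By construction, at least one route point lies on each of the four edges of the box, which matches the observation that the extreme points sit exactly on the image border.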

3.5. Models Overview

For better understanding of the importance of preprocessing steps and conclusions, a short overview of the proposed Bayesian models is in order. Bayesian inference is a method of statistical inference, in which we fit a predefined probability model to a set of data and evaluate outcomes with regards to observed parameters of the model and unobserved quantities, like predictions for new data points [39].

Both models are generalized linear models. We defined the linear predictor as $\eta = X\beta$, where $X$ denotes the vector of features and $\beta$ is the vector of coefficients. Both vectors are of size $N \times 1$ [39]. We then used a logarithmic link function to transform the linear predictor’s domain to positive real numbers. This is a necessary step, as both models are defined by the inverse gamma distribution. This particular distribution effectively models the skewness of the data and provides strictly positive continuous outputs. An in-depth description of the models and an explanation of Bayesian inference are presented in Part 2 of this article [4].
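Written out, the logarithmic link can be stated as follows. The symbol $\mu$ for the transformed predictor is notation introduced here for illustration; how $\mu$ enters the parameters of the inverse gamma distribution is specified in Part 2 and is not assumed here.

```latex
\log \mu = \eta = X\beta
\quad\Longleftrightarrow\quad
\mu = \exp(X\beta) > 0
```

Since the exponential function is positive for every real argument, $\mu$ stays in the positive reals regardless of the sign of the linear predictor.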

4. Results

4.1. Preprocessing Results

Preprocessing was conducted as outlined in Section 3.3. During data transformation, we discovered instances where deliveries, from the moment the courier picked up the order to its delivery, were recorded with zero or negative times. We detected them by comparing the calculated meal preparation time and total delivery time. Consequently, additional data cleaning was necessary, as data used in modeling need to be obtainable from the distributions used. Samples with such travel times were excluded from modeling. Negative travel times can destabilize models and make it impossible to obtain reasonable results. A possible cause for these anomalies is that not all of the data used in calculations were initially recorded and missing values were added manually to the system.
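The check described above can be sketched as a simple filter. The sketch assumes that the travel time is the total delivery time minus the calculated meal preparation time, both in minutes; the field names and sample values are hypothetical.

```python
def travel_minutes(sample):
    """Travel time as total delivery time minus meal preparation time."""
    return sample["total_min"] - sample["prep_min"]


def drop_non_positive_travel(samples):
    """Keep only samples whose travel time is strictly positive."""
    return [sample for sample in samples if travel_minutes(sample) > 0]


samples = [
    {"total_min": 30.0, "prep_min": 10.0},  # 20 min of travel, kept
    {"total_min": 15.0, "prep_min": 15.0},  # zero travel time, dropped
    {"total_min": 10.0, "prep_min": 15.0},  # negative travel time, dropped
]
valid = drop_non_positive_travel(samples)
print(len(valid), "of", len(samples), "samples kept")
```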

To estimate delivery times, companies and researchers commonly utilize a variety of features. These include spatial features (such as the location of the restaurant and destination, and the city road map), cooking time features, order features (like the number of items ordered and the date and time), and courier features (such as workload) [8,9,40,41,42]. Our first model utilizes the distance between the restaurant and delivery destination, meal preparation time, and traffic density. These features are used, respectively, in 42%, 9%, and 12% of the studies [43]. The second model is extended by courier features: the number of simultaneous deliveries and courier rating, which are also crucial in predicting food delivery times [8]. In our dataset, traffic density refers to road traffic intensity; however, the exact measurement methodology is not provided. A detailed analysis of the input data will be presented in Section 4.1.2.

One of the most critical steps in our preprocessing flow was obtaining real-world distances. This allowed us to construct a readily interpretable model. Standardization ensured the numerical stability of our models, which would otherwise be difficult to achieve. Last but not least, careful consideration of the features used in modeling was equally important: our second model, which took into account two additional variables, required over two hours more to perform sampling.

4.1.1. Routes Generation

We used the OSRM API on the public demo server to generate approximate delivery routes. The parameters and their values in our requests were as follows: service: route; version: v1; profile: driving; overview: false; geometries: geojson; steps: true. GeoJSON is a standardized format for encoding geographic data structures. The steps parameter is used to return information related to each part of the route. All approximate distances were successfully generated, with no incorrect routes identified between the restaurant and delivery destination. The histogram of the obtained distances is shown in Figure 3, and their statistics are presented in Table 2. The histogram and statistics reveal that the distances in our dataset are significantly larger than those in other studies. In research conducted by Wang and He, the 95th percentile for distance is 6 km [15], whereas in our data, it is approximately 28.5 km. The shape of the distance distribution in their study is similar to ours, except for two peaks in our distribution. However, they analyzed customer behavior only in Shenzhen, which is a metropolis with a unique population composition and many subcenters. It has a very different dynamic compared to the Indian cities included in this study.

The number of deliveries drops significantly when the distance exceeds 25 km. However, there are routes with distances that are considerably longer. The histogram indicates two distinct groups: one around 65 km and another around 120 km. Both of these groups fall above the 99th percentile, classifying them as outliers. To understand the reasons behind these outliers, we conducted a thorough investigation and compared their routes with other routing engines, particularly Google Maps, which employs different mapping techniques. We used the websites of the individual services (https://www.google.pl/maps, https://www.openstreetmap.org (accessed on 16 August 2024)) and checked each pair of geographical coordinates in the group of outliers.

First, we checked if both engines returned routes with similar distances and paths. This way, we excluded incorrect mappings from the OSRM engine or erroneous point mapping in OSM. Next, we analyzed the surroundings of the starting and ending points. We verified whether there was actually a restaurant near the restaurant locations and whether the delivery location was in the vicinity of buildings. We identified unrealistic delivery locations; we defined these as locations where delivery is highly improbable, such as points on highways and bridges or points far from any buildings. In the future, the analysis of the surroundings could be automated using OSM, which provides the ability to check if selected infrastructure objects (e.g., shops and restaurants) are within a chosen radius of a point.

All routes with road distances around 65 km are concentrated in the Dehradun area. Dehradun is situated in a valley at the foot of the Himalayas, resulting in a limited number of roads leading to the surrounding areas. A comparison of the selected route determined by the OSRM API and Google Maps is shown in Figure 4. Both engines visually identified the same route; however, Google Maps indicates that the route is approximately 2 km longer. This discrepancy may stem from differences in point positioning on various maps and the distinct distance calculation algorithms used. While the exact route lengths also vary among engines utilizing only OSM data, these differences are negligible over shorter distances.

The second problematic group of routes is found in the city of Agra. Analyzing this case is particularly challenging because online routing engines display varying results depending on the day of access. Additionally, Google Maps often selects different destinations for the same location—sometimes directing to an expressway and other times to a parallel street. The OSM data used by OSRM indicate that the given location is situated in one of the lanes on the nearby Yamuna Expressway. This suggests that the routes allowing direct U-turns on this type of road are incorrectly designated. A comparison of different routes for the same locations is illustrated in Figure 5.

Discrepancies between routes generated using OSRM API and Google Maps may be caused by differences in road network representation. Both services use graphs; however, one is a commercial tool and the other is open source. Therefore, there may be slight differences in the positions of features on these maps. There are also limitations to location accuracy. We have noticed that entered coordinates change their values once routing engines start processing. The differences are negligible but may affect the results obtained. Additionally, maps may add some noise to location data due to localization privacy policies.

The discrepancies in the routing engine do not impact the relevance of this paper. Although using different routing engines might produce varying routes, resulting in different distributions in our models, the models proposed in Part 2 of this article focus on the applicability of Bayesian modeling for delivery time prediction. Consequently, the variation in the distribution of one predictor would not affect the study’s findings.

To ensure that our models are trained on meaningful data, we filtered out the outliers. We set a 30 km upper limit for deliveries, which retains over 96% of the data available after preprocessing. Orders that take up to 45 min are usually accepted [25], but an overly long delivery time significantly reduces the freshness and quality of the meal. Therefore, we assume that deliveries over 30 km cannot meet this requirement. Standardization was performed after applying this distance filter so that the outliers would not bias the mean and standard deviation and thereby distort the spread of the standardized data.
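The order of these two operations matters, as the following minimal sketch shows. It uses only the Python standard library; the function name and the sample distances are hypothetical, and the use of the population standard deviation (the convention followed by scikit-learn's `StandardScaler`) is an assumption about the exact standardization used.

```python
from statistics import mean, pstdev

MAX_DISTANCE_KM = 30.0  # upper limit for deliveries stated in the text


def filter_and_standardize(distances_km, limit_km=MAX_DISTANCE_KM):
    """Drop routes longer than the limit, then z-score the remaining distances."""
    kept = [d for d in distances_km if d <= limit_km]
    mu = mean(kept)
    sigma = pstdev(kept)  # population standard deviation (assumption)
    return [(d - mu) / sigma for d in kept], mu, sigma


# Invented sample: five typical routes and two outliers (about 65 km and 120 km).
sample = [2.5, 4.5, 10.0, 13.0, 20.0, 65.0, 121.9]

z, mu, sigma = filter_and_standardize(sample)
print(f"filtered:   mean={mu:.2f} km, std={sigma:.2f} km")
print(f"unfiltered: mean={mean(sample):.2f} km, std={pstdev(sample):.2f} km")
```

Computing the statistics before filtering would let the two long routes inflate both the mean and the standard deviation, compressing the standardized values of all ordinary deliveries.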

4.1.2. Input Data Analysis

Following the described preprocessing and outlier filtering, we ended up with nearly 35,000 data samples. The data prepared for Bayesian models are presented in Figure 6.

The basic distance statistics remained relatively stable after filtration. The average distance decreased to 13.33 km, while the standard deviation reduced to 7.17 km. This outcome was expected, given that the filtered values constituted only about 3% of the dataset eligible for model utilization.

The distribution of order preparation time is puzzling. Although we initially treated it as a continuous variable, its values are distributed almost evenly among 5, 10, and 15 min. This could stem from the order and pick-up times being recorded as approximate rather than precise values. Additionally, the calculated meal preparation time may not reflect the exact cooking time. On the other hand, the time required to prepare the food is related to the restaurant type: expected times for fast, fast casual, and gourmet restaurants are 5, 10, and 15 min, respectively [11]. This allows the type of restaurant to be taken into account indirectly. However, longer cooking times may lead to a significant increase in the expected delivery time. Nevertheless, we found no anomalies in the dataset, because the meal preparation time is assumed to be between 5 and 15 min [25].
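The preparation time discussed here is the difference between the order time and the pick-up time (the `Time_Orderd` and `Time_Order_picked` fields in Table 1). A minimal sketch of that calculation follows. The `HH:MM:SS` string format, the function name, and the handling of orders that cross midnight are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def preparation_minutes(time_ordered, time_picked, fmt="%H:%M:%S"):
    """Minutes between placing an order and courier pick-up.

    Both arguments are time-of-day strings; the format is an assumption.
    """
    ordered = datetime.strptime(time_ordered, fmt)
    picked = datetime.strptime(time_picked, fmt)
    if picked < ordered:
        # Pick-up after midnight: treat it as the following day.
        picked += timedelta(days=1)
    return (picked - ordered).total_seconds() / 60.0


print(preparation_minutes("11:30:00", "11:45:00"))  # 15.0
print(preparation_minutes("23:55:00", "00:05:00"))  # 10.0
```

If the recorded times are rounded to five-minute marks, this difference can only take values such as 5, 10, or 15 min, which would be consistent with the discrete pattern described above.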

Most couriers have high ratings, which results in a high mean of 4.6 being used in the standardization process. Furthermore, the small standard deviation of 0.32 translates into strongly negative standardized values for couriers with lower ratings. Notably, no courier has a rating below 2.5.
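To illustrate the effect, the reported mean and standard deviation can be plugged into the usual z-score formula. In the sketch below, the function name is hypothetical and the top rating of 5.0 is an assumption used only for illustration.

```python
RATING_MEAN = 4.6  # mean courier rating reported in the text
RATING_STD = 0.32  # standard deviation reported in the text


def standardize_rating(rating):
    """z-score of a courier rating under the reported mean and standard deviation."""
    return (rating - RATING_MEAN) / RATING_STD


print(round(standardize_rating(2.5), 2))  # lowest rating in the data: -6.56
print(round(standardize_rating(5.0), 2))  # assumed top rating: 1.25
```

The lowest-rated couriers therefore lie more than six standard deviations below the mean, while the best-rated ones lie only slightly above it, which matches the asymmetric bin range of the rating histogram in Figure 6.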

The categorical variables “traffic density” and “number of deliveries” appear to align with expectations. Couriers typically handle no more than two deliveries concurrently. Interestingly, deliveries occur with equal frequency during rush hours and periods of very low car traffic. However, deliveries in moderate traffic conditions are relatively less common.

4.2. Spatial Analysis

The aim of the spatial analysis was to examine how frequently individual road segments are used. Orders are not evenly distributed among the 21 cities. The analysis included all orders located in India for which a route could be determined (including data considered outliers). Jaipur is the clear leader in terms of the number of orders, with over 3400 deliveries. Eight cities have very similar values of around 3150 deliveries each; these include some of the largest cities in India, such as Bangalore and Mumbai. Cities in which the previously identified outliers occurred have a significantly lower number of deliveries. The number of orders for each city is shown in Figure 7. Discrepancies in orders among cities are expected, as previous studies conducted in China and England have shown that densely populated areas have higher OFD platform usage [15,22]. The limited representation of smaller cities may lead to inaccurate delivery time estimation in these cities. On the other hand, it may improve the generalizability of the model.

Interactive maps created using Folium present the routes on an OSM road map. The color intensity of a given road fragment depends on how often it occurs in the routes. Additionally, for readability and easier analysis, the routes from each city were added as a separate layer that can be toggled. Sample visualization results are shown in Figure 8. The main advantage of this approach is the ability to analyze small areas of a city without compromising image quality. It helped us to identify outliers related to unrealistic delivery locations, e.g., deliveries to places without nearby buildings.
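One possible way to obtain such a frequency-based intensity is to count how many routes contain each road fragment and to map the count to a drawing opacity. The sketch below is an illustration in plain Python, not the plotting code used for Figure 8. The function names, the coordinate rounding, the direction-independent treatment of segments, and the linear opacity scale are all assumptions.

```python
from collections import Counter


def segment_counts(routes, precision=5):
    """Count, for each road fragment, the number of routes that contain it.

    A route is a list of (lat, lon) points; a fragment is a pair of consecutive
    points. Coordinates are rounded so that nearly identical points coincide.
    """
    counts = Counter()
    for route in routes:
        points = [(round(lat, precision), round(lon, precision)) for lat, lon in route]
        seen = set()
        for a, b in zip(points, points[1:]):
            if a == b:
                continue  # skip zero-length fragments
            # Store each fragment in a canonical order so direction is ignored.
            seen.add((a, b) if a <= b else (b, a))
        counts.update(seen)  # each route contributes at most once per fragment
    return counts


def opacity(count, max_count, floor=0.1):
    """Linear mapping from usage count to an opacity in [floor, 1]."""
    return floor + (1.0 - floor) * count / max_count


routes = [
    [(26.90, 75.80), (26.91, 75.80), (26.92, 75.80)],
    [(26.91, 75.80), (26.92, 75.80), (26.93, 75.80)],
    [(26.92, 75.80), (26.91, 75.80)],  # same fragment, opposite direction
]
counts = segment_counts(routes)
busiest = max(counts.values())
for segment, n in sorted(counts.items()):
    print(segment, n, round(opacity(n, busiest), 2))
```

The resulting opacity could then be passed to whichever drawing routine is used, with the routes of each city grouped into their own layer.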

Roads in city centers are clearly the most frequently traveled. The size of the marker is not related to the results because it is automatically adjusted to the scale of the map: the closer the view, the thinner the drawn routes. Routes that were traveled only once or twice are barely visible; one such example can be seen at the left edge of the Jaipur image.

The result of visualization on a city street graph for the same cities is shown in Figure 9. As on the previous maps, the intensity of a road fragment depends on its frequency in the routes. In general, in large-scale images generated using OSMnx, differences in saturation are less noticeable. Due to the dense street network in the centers where deliveries are concentrated, all routes appear to be seldom traveled. However, after zooming in slightly, they are much more visible than on the maps generated using Folium.

The clearly visible difference in the intensity of the bridge on the Mumbai maps produced by the two methods was interesting, so we analyzed it more closely. It turned out that another road runs in the immediate vicinity of the bridge and also carries a route; when viewed from a sufficiently large distance, the two were displayed directly on top of each other, distorting the result. Moreover, the route in question made no sense because it ended in the middle of a bridge (Google Maps even projected the point onto the waters of the bay); it was about 40 km long and was filtered out as an outlier.

Despite using linear interpolation to increase the number of points on the routes, we did not achieve satisfactory results with our heatmap visualizations because we were unable to determine a suitable grid length. With a grid length of 25 m, the resulting images had too high a resolution: they appeared uniform to the human eye and required significant zoom to reveal the intensity of the corresponding road segments. Conversely, a grid length of 1 km made the routes perceptible across the entire city but was too coarse, causing nearby routes to blend together and thus fail to represent the intensity of distinct route segments.
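The trade-off between the two grid lengths can be reproduced on a toy example. The sketch below densifies routes by linear interpolation and counts the routes passing through each square cell. It is a simplified illustration: the function names, the flat-Earth (equirectangular) conversion from degrees to meters, and the two invented parallel routes are assumptions rather than the procedure used in this study.

```python
import math
from collections import Counter

M_PER_DEG_LAT = 111_320.0  # approximate length of one degree of latitude in meters


def densify(route, step_m):
    """Linearly interpolate so that consecutive points are at most about step_m apart."""
    out = [route[0]]
    for (lat1, lon1), (lat2, lon2) in zip(route, route[1:]):
        dy = (lat2 - lat1) * M_PER_DEG_LAT
        dx = (lon2 - lon1) * M_PER_DEG_LAT * math.cos(math.radians((lat1 + lat2) / 2))
        n = max(1, math.ceil(math.hypot(dx, dy) / step_m))
        for i in range(1, n + 1):
            t = i / n
            out.append((lat1 + t * (lat2 - lat1), lon1 + t * (lon2 - lon1)))
    return out


def grid_counts(routes, cell_m, ref_lat):
    """Number of routes passing through each cell of a square grid of size cell_m."""
    m_per_deg_lon = M_PER_DEG_LAT * math.cos(math.radians(ref_lat))
    counts = Counter()
    for route in routes:
        # Sample at half the cell size so that no cell along the route is skipped.
        cells = {
            (math.floor(lat * M_PER_DEG_LAT / cell_m),
             math.floor(lon * m_per_deg_lon / cell_m))
            for lat, lon in densify(route, cell_m / 2)
        }
        counts.update(cells)
    return counts


# Two invented parallel streets about 220 m apart, each roughly 2 km long.
street_a = [(19.000, 72.80), (19.000, 72.82)]
street_b = [(19.002, 72.80), (19.002, 72.82)]

fine = grid_counts([street_a, street_b], cell_m=25, ref_lat=19.0)
coarse = grid_counts([street_a, street_b], cell_m=1000, ref_lat=19.0)
print(len(fine), max(fine.values()))
print(len(coarse), max(coarse.values()))
```

With 25 m cells the two streets stay separate but occupy many tiny cells that are hard to see at city scale, whereas with 1 km cells they fall into the same cells and are counted as one busy corridor.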

5. Discussion

This research underscores the importance of data preprocessing and spatial analysis in the context of online food delivery services. By integrating routing information from the OSRM API, we have demonstrated a method for identifying and eliminating outliers in delivery data, thus enhancing the accuracy of subsequent predictive models.

Based on the route distribution in Figure 3 and direct analysis of the two groups of outliers in Figure 4 and Figure 5, we classify 3% of the data as outliers. The analysis also indicates that outliers are predominantly deliveries to more distant locations, whereas deliveries within city limits exhibit fewer outliers. Outliers located within city boundaries most often correspond to unrealistic destination points (e.g., points on highways or bridges). One of them, on a bridge, can be seen on the map of Mumbai in Figure 8. Additionally, our spatial analysis shows that the Indian OFD market follows trends similar to those of the Chinese [15] and English [22,24] markets. The distribution of orders among cities presented in Figure 7 confirms that the use of this type of platform is much more popular in densely populated areas.

Our approach has some limitations. First, the study area is restricted to India. Consequently, the results may not be generalizable on an international scale. To confirm the generalizability of our findings, we would need to gather and test similar datasets from various global regions. Consistent performance across these diverse datasets would validate our models’ robustness. Second, we mainly used one map provider, OSM, and one routing engine, OSRM; therefore, the results may differ from those obtained using other tools. Third, the data cover a period of three months, yet the number of deliveries per city is relatively low. In this dataset, the average number of deliveries in Jaipur, the city with the most orders, is under 40 per day (about 3400 deliveries over roughly three months). This does not accurately reflect the actual workload of OFD companies, which is likely to be much larger, and some methods may not perform as expected on large amounts of data. The most noticeable problem may be the low responsiveness of Folium’s interactive maps, which would make analysis much more difficult. In the case of Bayesian models, a large amount of data will increase the model fitting time and require more computational power.

Additionally, the use of the OSRM API involves several limitations, especially when a shared server is used. The number of requests per minute is limited and shared among all users. As an open-source project, OSRM does not offer any quality guarantees, and in some regions the data may be sparse or outdated. These issues can affect all open-source routing engines. Moreover, there are geographical areas where access to external maps or GPS services is restricted. These limitations could make it difficult to use OSRM in different parts of the world.

In the second part of this article, we will delve into the application of Bayesian models to the preprocessed dataset, examining their efficacy in predicting delivery times and exploring potential improvements to the modeling approach. Our dataset does not accurately reflect the actual workload of OFD companies; therefore, we strongly recommend that future research evaluate the methods used here on a larger dataset. The analysis of the Indian OFD market is based solely on deliveries; further research may take into account other elements such as social and cultural factors. Additionally, future studies could refine the preprocessing steps, for example through more advanced handling of missing values and more sophisticated outlier detection methods.

Author Contributions

Conceptualization, J.G., J.P. and J.B.; methodology, J.G., J.P. and J.B.; software, J.G. and J.P.; validation, J.G. and J.B.; formal analysis, J.G.; investigation, J.G.; resources, J.B.; data curation, J.G. and J.P.; writing—original draft preparation, J.G.; writing—review and editing, J.G. and J.B.; visualization, J.G. and J.P.; supervision, J.B.; project administration, J.B.; funding acquisition, J.B. All authors have read and agreed to the published version of the manuscript.

Funding

The third author’s work was partially realized within the scope of a project titled “Process Fault Prediction and Detection”. The project was financed by The National Science Centre on the basis of decision no. UMO-2021/41/B/ST7/03851. Part of the work was funded by AGH’s Research University Excellence Initiative under the project “DUDU—Diagnostyka Uszkodzeń i Degradacji Urządzeń”.

Data Availability Statement

All code prepared as part of this project is available in the following repository: https://github.com/JohnnyBeet/Food-delivery-time-prediction/tree/preprocessing (accessed on 16 August 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

OFD	Online food delivery
GPS	Global Positioning System
OD	Origin–destination
OSRM	Open-Source Routing Machine
API	Application Programming Interface
OSM	OpenStreetMap

References

1. Statista. Online Food Delivery—Worldwide. 2024. Available online: https://www.statista.com/outlook/emo/online-food-delivery/worldwide (accessed on 4 May 2024).
2. IMARC Group. India Online Food Delivery Market Report. 2023. Available online: https://www.imarcgroup.com/india-online-food-delivery-market (accessed on 4 May 2024).
3. Alalwan, A.A. Mobile food ordering apps: An empirical study of the factors affecting customer e-satisfaction and continued intention to reuse. Int. J. Inf. Manag. 2020, 50, 28–44.
4. Pomykacz, J.; Gibas, J.; Baranowski, J. Bayesian modelling of travel times on the example of food delivery: Part 2—Model creation and handling uncertainty. Preprints 2024, 2024061443.
5. Unwin, A. Why is data visualization important? what is important in data visualization? Harv. Data Sci. Rev. 2020, 2, 1.
6. Li, B.; Chen, L.; Xiong, D.; Chen, S.; He, R.; Sun, Z.; Lim, S.; Jiang, H. Simultaneous detection of multiple areas-of-interest using geospatial data from an online food delivery platform (industrial paper). In Proceedings of the 30th International Conference on Advances in Geographic Information Systems, Seattle, WA, USA, 1–4 November 2022; pp. 1–10.
7. de Araujo, A.C.; Etemad, A. End-to-end prediction of parcel delivery time with deep learning for smart-city applications. IEEE Internet Things J. 2021, 8, 17043–17056.
8. Zhu, L.; Yu, W.; Zhou, K.; Wang, X.; Feng, W.; Wang, P.; Chen, N.; Lee, P. Order fulfillment cycle time estimation for on-demand food delivery. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 2571–2580.
9. Joshi, M.; Singh, A.; Ranu, S.; Bagchi, A.; Karia, P.; Kala, P. Batching and matching for food delivery in dynamic road networks. In Proceedings of the 2021 IEEE 37th international conference on data engineering (ICDE), Chania, Greece, 19–22 April 2021; pp. 2099–2104.
10. Ji, S.; Zheng, Y.; Wang, Z.; Li, T. Alleviating users’ pain of waiting: Effective task grouping for online-to-offline food delivery services. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 773–783.
11. Ulmer, M.W.; Thomas, B.W.; Campbell, A.M.; Woyak, N. The restaurant meal delivery problem: Dynamic pickup and delivery with deadlines and random ready times. Transp. Sci. 2021, 55, 75–100.
12. Garus, A.; Christidis, P.; Mourtzouchou, A.; Duboz, L.; Ciuffo, B. Unravelling the last-mile conundrum: A comparative study of autonomous delivery robots, delivery bicycles, and light commercial vehicles in 14 varied European landscapes. Sustain. Cities Soc. 2024, 108, 105490.
13. Malhotra, I.; Chandra, P.; Majumdar, S.K. Route Optimization Application using Server-Client Architecture and Google APIs. In Proceedings of the 2019 6th International Conference on Computing for Sustainable Global Development (INDIACom), New Delhi, India, 13–15 March 2019; pp. 210–214.
14. Paithane, P.; Wagh, S.J.; Kakarwal, S. Optimization of route distance using k-NN algorithm for on-demand food delivery. Syst. Res. Inf. Technol. 2023, 85–101.
15. Wang, Z.; He, S.Y. Impacts of food accessibility and built environment on on-demand food delivery usage. Transp. Res. Part D Transp. Environ. 2021, 100, 103017.
16. Abahussein, S.; Ye, D.; Zhu, C.; Cheng, Z.; Siddique, U.; Shen, S. Multi-Agent Reinforcement Learning for Online Food Delivery with Location Privacy Preservation. Information 2023, 14, 597.
17. Muñoz-Villamizar, A.; Solano-Charris, E.L.; Reyes-Rubiano, L.; Faulin, J. Measuring Disruptions in Last-Mile Delivery Operations. Logistics 2021, 5, 17.
18. Yu, X.; Li, X.Y.; Zhao, J.; Shen, G.; Freris, N.M.; Zhang, L. Antigone: Accurate navigation path caching in dynamic road networks leveraging route apis. In Proceedings of the IEEE INFOCOM 2022-IEEE Conference on Computer Communications, Online, 2–5 May 2022; pp. 1599–1608.
19. Fu, J.; Bhatti, H.J.; Eek, M. Optimization of Freight Charging Infrastructure Placement Using Multiday Travel Data. In Proceedings of the 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Bizkaia, Spain, 24–28 September 2023; pp. 1576–1582.
20. Vonthron, S.; Perrin, C.; Soulard, C.T. Foodscape: A scoping review and a research agenda for food security-related studies. PLoS ONE 2020, 15, e0233218.
21. Hu, X.; Zhang, G.; Shi, Y.; Yu, P. How Information and Communications Technology Affects the Micro-Location Choices of Stores on On-Demand Food Delivery Platforms: Evidence from Xinjiekou’s Central Business District in Nanjing. ISPRS Int. J. Geo-Inf. 2024, 13, 44.
22. Keeble, M.; Adams, J.; Bishop, T.R.; Burgoine, T. Socioeconomic inequalities in food outlet access through an online food delivery service in England: A cross-sectional descriptive analysis. Appl. Geogr. 2021, 133, 102498.
23. Maulidi, C.; Dwicaksono, A.; Aritenang, A.F.; Winarso, H. Food service spatial pattern after the emergence of online retail. J. Infrastruct. Policy Dev. 2024, 8, 3005.
24. Janatabadi, F.; Newing, A.; Ermagun, A. Social and spatial inequalities of contemporary food deserts: A compound of store and online access to food in the United Kingdom. Appl. Geogr. 2024, 163, 103184.
25. Jahanshahi, H.; Bozanta, A.; Cevik, M.; Kavuk, E.M.; Tosun, A.; Sonuc, S.B.; Kosucu, B.; Başar, A. A deep reinforcement learning approach for the meal delivery problem. Knowl.-Based Syst. 2022, 243, 108489.
26. Wang, L.; Fu, H.; Wu, S.; Liu, Q.; Tan, X.; Huang, F.; Zhang, M.; Wu, W. CAMLO: Cross-Attentive Multi-View Network for Long-Term Origin-Destination Flow Prediction. In Proceedings of the 2024 SIAM International Conference on Data Mining (SDM), Atlanta, GA, USA, 21–25 October 2024; pp. 454–462.
27. Food Delivery Dataset. 2023. Available online: https://www.kaggle.com/datasets/gauravmalik26/food-delivery-dataset (accessed on 13 May 2024).
28. Open Source Routing Machine. 2024. Available online: https://project-osrm.org/ (accessed on 13 May 2024).
29. Open Source Routing Machine API. 2024. Available online: https://project-osrm.org/docs/v5.5.1/api/#route-service (accessed on 13 May 2024).
30. OpenStreetMap. 2024. Available online: https://www.openstreetmap.org/ (accessed on 13 May 2024).
31. García, S.; Ramírez-Gallego, S.; Luengo, J.; Benítez, J.M.; Herrera, F. Big data preprocessing: Methods and prospects. Big Data Anal. 2016, 1, 1–22.
32. Stan. 2024. Available online: https://mc-stan.org/ (accessed on 14 May 2024).
33. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
34. Folium. 2024. Available online: https://python-visualization.github.io/folium/latest/ (accessed on 14 May 2024).
35. Navarro, D. How to Visualise a Billion Rows of Data in R with Apache Arrow. 2022. Available online: https://blog.djnavarro.net/posts/2022-08-23_visualising-a-billion-rows (accessed on 15 May 2024).
36. Hoteit, S.; Secci, S.; Sobolevsky, S.; Ratti, C.; Pujolle, G. Estimating human trajectories and hotspots through mobile phone data. Comput. Netw. 2014, 64, 296–307.
37. Carfora, M.F. Interpolation on spherical geodesic grids: A comparative study. J. Comput. Appl. Math. 2007, 210, 99–105.
38. Boeing, G. Modeling and Analyzing Urban Networks and Amenities with OSMnx. 2024. Working Paper. Available online: https://geoffboeing.com/publications/osmnx-paper/ (accessed on 5 June 2024).
39. Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis, 3rd ed.; Chapman and Hall/CRC: New York, NY, USA, 2013.
40. The Swiggy Delivery Challenge. 2018. Available online: https://bytes.swiggy.com/the-swiggy-delivery-challenge-part-one-6a2abb4f82f6 (accessed on 18 May 2024).
41. DeepETA: How Uber Predicts Arrival Times Using Deep Learning. 2022. Available online: https://www.uber.com/en-PL/blog/deepeta-how-uber-predicts-arrival-times/ (accessed on 18 May 2024).
42. Predicting Time to Cook, Arrive, and Deliver at Uber Eats. 2019. Available online: https://www.infoq.com/articles/uber-eats-time-predictions/ (accessed on 18 May 2024).
43. Abdi, A.; Amrit, C. A review of travel and arrival-time prediction methods on road networks: Classification, challenges and opportunities. PeerJ Comput. Sci. 2021, 7, e689.

Figure 1. Preprocessing steps: deletion of incomplete or out-of-India samples, generation of routes via the OSRM API, conversion of the date and time into a suitable format, selection of the predictors with the greatest information value, and mapping of the data to a format appropriate for the Bayesian model.

Figure 2. Raw distribution of the points in a sample route geometry. GPS coordinates are clustered near intersections and turns, whereas on straight sections they are sparsely spaced. Given this raw distribution of route geometry data, interpolation is necessary for some visualization methods.

Figure 3. Histogram of total distance of generated routes; each bin represents 1 km interval. Distribution of route distances is skewed; however, in Bayesian modeling, such distributions are not problematic. Models can handle various data distributions including joint distributions. Most of the deliveries have a distance under 25 km. Two peaks in the distribution correspond to distances of 2–3 km and 4–5 km. Previous research shows that most orders are carried out within a distance of 2–3 km [15]; however, it analyzes only one city. Our data include deliveries from cities that vary in size and number of inhabitants; therefore, the first peak may correspond to high-density cities and the second to smaller ones. Moreover, two outlying groups can be identified (around 65 km and 120 km).

Figure 4. Comparison of routes determined using different routing engines. (a) Route obtained using OSRM on OSM. (b) Route obtained using Google Maps. Both services generate approximately the same route (difference in distance compared to the length of the entire route is negligible).

Figure 5. Comparison of route results. (a) Route obtained using OSRM on OSM, which involves a direct U-turn on the Yamuna Expressway (obtained 18 May 2024). (b) Route obtained using OSRM on OSM (obtained 16 April 2024). (c) Route obtained using Google Maps, which maps the location onto a road near the Yamuna Expressway (obtained 18 May 2024). The lack of repeatability in the obtained routes may result from the lack of a direct point corresponding to the coordinate from the dataset.

Figure 6. Histograms of data used in inference. (a) Standardized distance, bins defined as <−1.5;2> with steps of 0.1. (b) Standardized meal preparation time, 20 bins equally spaced, automatically defined by program. (c) Categories of road traffic, from highest to lowest. (d) Distinct deliveries count. (e) Standardized delivery person rating, bins defined as <−7;2> with steps of 0.5.

Figure 7. Number of orders in each city. Larger Indian cities have a higher number of orders, while smaller cities have up to 6 times fewer. This confirms that ordering food online is a typical urban phenomenon.

Figure 8. Route visualizations using Folium for (a) Mumbai, (b) Jaipur, and (c) Bangalore. Sections of roads that were heavily trafficked with deliveries have a more intense color, while sections that have been traveled once or twice are much less visible.

Figure 9. Route visualizations on a city street graph made in OSMnx for (a) Mumbai, (b) Jaipur, and (c) Bangalore. Sections of roads that were heavily trafficked with deliveries have a more intense color, while sections that were traveled once or twice are much less visible.

Table 1. Data available in the dataset and their description.

Variable name	Meaning
Restaurant_latitude	Latitude of the restaurant
Restaurant_longitude	Longitude of the restaurant
Delivery_location_latitude	Latitude of the delivery destination
Delivery_location_longitude	Longitude of the delivery destination
Road_traffic_density	Road traffic intensity (Low, Medium, High or Jam)
Weatherconditions	Current weather conditions (e.g., Sunny, Stormy, Fog)
multiple_deliveries	Quantity of simultaneous deliveries (number 0–4)
Delivery_person_Ratings	Average rating of the courier
Delivery_person_Age	Age of the courier
Order_Date	Date of placing the order
Time_Orderd	Time of placing the order
Time_Order_picked	Time of picking up by courier
Time_taken	Delivery time in minutes

Table 2. Statistics of routes generated using the OSRM API.

Total number of routes	41522
Average route distance	13.99 km
Standard deviation of route distance	8.42 km
Minimum route distance	1.49 km
95th percentile	28.53 km
99th percentile	36.42 km
Maximum route distance	121.89 km
Distance intervals with the highest number of routes	2–3 km (2331), 4–5 km (2153)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
