Cart-State-Aware Discovery of E-Commerce Visitor Journeys with Process Mining

Topaloglu, Bilal; Oztaysi, Basar; Dogan, Onur

doi:10.3390/jtaer19040138

Open Access Article

by Bilal Topaloglu ¹,*,†, Basar Oztaysi ¹,† and Onur Dogan ²,³,†

¹ Department of Industrial Engineering, Istanbul Technical University, Istanbul 34367, Turkey
² Department of Management Information Systems, Izmir Bakircay University, Izmir 35665, Turkey
³ Department of Mathematics, University of Padua, 35121 Padua, Italy
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

J. Theor. Appl. Electron. Commer. Res. 2024, 19(4), 2851-2879; https://doi.org/10.3390/jtaer19040138

Submission received: 9 September 2024 / Revised: 13 October 2024 / Accepted: 16 October 2024 / Published: 17 October 2024

(This article belongs to the Topic Online User Behavior in the Context of Big Data)

Abstract

Understanding customer journeys is key to e-commerce success. Many studies have been conducted to obtain journey maps of e-commerce visitors. To our knowledge, a complete, end-to-end and structured map of e-commerce journeys is still missing. In this research, we proposed a four-step methodology to extract and understand e-commerce visitor journeys using process mining. In order to obtain more structured process diagrams, we used techniques such as activity type enrichment, start and end node identification, and Levenshtein distance-based clustering in this methodology. For the evaluation of the resulting diagrams, we developed a model utilizing expert knowledge. As a result of this empirical study, we identified the most significant factors for process structuredness and their relationships. Using a real-life big dataset which has over 20 million rows, we defined activity-, behavior-, and process-level e-commerce visitor journeys. Exploitation and exploration were the most common journeys, and it was revealed that journeys with exploration behavior had significantly lower conversion rates. At the process level, we mapped the backbones of eight journeys and tested their qualities with the empirical structuredness measure. By using cart statuses at the beginning and end of these journeys, we obtained a high-level end-to-end e-commerce journey that can be used to improve recommendation performance. Additionally, we proposed new metrics to evaluate online user journeys and to benchmark e-commerce journey design success.

Keywords: e-commerce; online user behavior; customer journey; process mining; big data

1. Introduction

The conversion rate (CR) is one of the most important key performance indicators of the retail industry. For the traditional brick-and-mortar retail channel, the conversion rates are generally between 20 and 40% and are rarely below 15% [1]. According to one report, the global average conversion rate for e-commerce is 1.79% [2]. Compared with the traditional channel, e-commerce conversion rates are significantly lower, which suggests that the main problem that needs to be solved, even for the most successful e-commerce companies, is how to increase the conversion rates.

In addition to its many advantages, e-commerce also provides the means to solve its own problems. For example, in e-commerce, huge amounts of visitor data can be collected and analyzed to find solutions. Unlike traditional channels, e-commerce provides quick, easy, and unobtrusive records of visitors’ activities, which are known as clickstream data [3]. To obtain the maximum value from clickstream data, the approaches and strategies used in the traditional channel can serve as a good model. Understanding consumer behavior is vital in the retail industry [4], and various types of qualitative and quantitative methods are utilized for this purpose [5]. A thorough analysis of the studies performed with e-commerce logs revealed that most of the studies focused on either prediction or product recommendation to website visitors [6,7,8], and exploratory e-commerce studies are relatively few. As a result, despite many valuable studies in e-commerce research, a complete and easy-to-understand map of customer journeys is still missing. We propose that to make accurate predictions or recommendations, first, there needs to be a holistic understanding of e-commerce visitor journeys.

Traditionally, customer journey mapping is performed manually by analysts and consultants using inputs from subject matter experts and is supported by customer surveys. Due to the limited number of inputs, this approach may not represent the majority of real journeys. In the age of data, data mining and process mining introduce new approaches and techniques for understanding online customer journeys [9,10]. By utilizing big data, it is possible to analyze online customer journeys in an advanced way. Our criticism of the e-commerce process mining literature is that all of the existing studies assumed that e-commerce consumer behavior was a structured process. In contrast to this assumption, customer processes are generally regarded as unstructured [11]. In a recent study on the challenges and opportunities for customer journeys and process mining, researchers pointed out the complexity of customer journeys compared to business processes [12]. As a consequence, process discovery with clickstream data results in diagrams such as the example in Figure 1, which are not easy to interpret. For example, the starting and ending nodes of this diagram are not clear, and each activity can initiate the process and be followed by any of the other activities.

We believe that process mining results for e-commerce need to be structured to provide a better understanding of e-commerce visitor journeys. Structured processes generally have a clear start and end, defined activities with explicit activity names in object + verb form, and a defined flow, and they are less complex. With this research, we aim to focus on structuring e-commerce visitor journeys for the first time in the literature. Many methods, such as log filtering, clustering, and abstraction, have been developed to cope with unstructured processes. To evaluate the success of the proposed methods, researchers have generally used visual proofs, metrics developed for manually created process diagrams, or quality criteria specific to process mining [11]. However, visual proofs are not always supported with objective measures, and process complexity metrics are not suitable for process mining. In addition, process mining quality criteria are not defined for the most commonly used notation in practice, namely, the Directly Follows Graph (DFG) [13], which consists of only nodes and arcs. Furthermore, when big data comprise the input for process discovery, DFG notation remains the sole alternative due to performance-related concerns. Therefore, there is a need for a practical measure specifically suitable for this notation, and it needs to be based on empirical evidence. To address the abovementioned challenges and gaps in existing research, we examine the following research questions in this paper:

RQ1: What is the most important process structuredness factor, and what are the other significant factors?
RQ2: What are the most frequent e-commerce visitor journeys that can be observed using clickstream data?
RQ3: What is the end-to-end e-commerce journey?

In this study, a model was developed that assisted in objectively evaluating the resulting diagrams to measure process structuredness [11]. This measure acted as a litmus test from a customer-centric approach. First, a visual dataset in DFG notation was created that included several attributes. These diagrams were classified as either structured or not structured by a group of experts who can be regarded as the customers of the resulting e-commerce visitor journey diagrams. After the evaluation phase, using logistic regression [14], a model was obtained that was capable of classifying process diagrams from the structuredness perspective. Following the empirical study, the focus was to obtain structured e-commerce visitor journeys. Consumer purchasing journeys and corporate procurement processes are at opposite ends of the continuum of purchasing process structuredness. Thus, corporate procurement processes were taken as a benchmark for structuring online visitor paths. A comparison of the process structuredness continuum is available in Table 1.

In line with the research motivation, a four-step methodology was developed to understand e-commerce visitor journeys. Activity type enrichment, start and end node identification, and Levenshtein distance-based clustering were included in this methodology. In line with the research objectives, to obtain a holistic understanding of e-commerce behaviors, the complete dataset was used without removing any events, unlike in existing studies. As a result, activity-, behavior-, and process-level journeys were defined. The structuredness of the process-level journeys was evaluated with the empirical structuredness measure. In this way, a complete and easy-to-understand map of e-commerce visitor journeys was obtained.

The remainder of the paper is organized as follows: the backgrounds of process mining and logistic regression are described in Section 2, together with previous studies on web usage mining, process mining applications in e-commerce, and process structuredness; Section 3 explains our methodology for obtaining structured e-commerce visitor journeys; the setting used for the evaluation of our approach is given in detail in Section 4; and the results are provided in Section 5.

2. Background, Related Work, and Contributions

2.1. Background of Web Usage Mining

Analyzing consumer behavior on the Internet has great potential for e-commerce companies [15]. Web mining is a technique that has been used for developing solutions to increase low conversion rates in e-commerce [16]. It is a branch of data mining that uses web logs [17]. It has three subcategories: web content mining, web structure mining, and web usage mining [18]. This study is related to web usage mining, which involves the application of statistical and data mining methods to understand users’ behavior through web log data [19].

Electronic recordings of visitors’ actions on the Internet are defined as clickstream data [3]. These data consist of the path a visitor follows on the web, and by analyzing these paths, the choices that visitors made within and across different websites can be found [3]. A web session is an ordered sequence of page views or actions that a visitor performs before logging out or being timed out automatically [20]. Depending on the website settings, the visits made in the morning and in the evening can be counted as two sessions. Raw clickstream data need to be converted into sessions for use in web mining [20].

Electronic commerce is abbreviated as e-commerce and mainly utilizes the Internet to provide goods and services to customers [21]. In broad terms, e-commerce covers all commercial activities related to building and maintaining a website with product and price information, making promotions, and campaigns for selling goods and services and then delivering them to customers. The conversion rate is a key performance indicator in e-commerce that is calculated by dividing the number of visitors purchasing a product by the total number of visitors of a website [22]. Customer journey mapping is a tool that visually depicts the events that customers experience with an organization during their purchase process [23]. An online customer journey can be defined as the path including all the touch-points an individual customer is following in all the online channels, leading to a potential purchase decision [24].

2.2. Related Work on Web Usage Mining

Researchers are increasingly using clickstream data, which were originally collected for website performance analysis. In one of the most prominent studies, Moe introduced four shopping typologies in the quadrant of purchasing horizons and search behavior, and the typologies were directed buying, hedonic browsing, search/deliberation, and knowledge building [15]. In 2004, the browsing behavior of customers at multiple web sites was modeled to understand the complete customer journey [25]. Later, in 2012, studies on customer typologies were summarized in a review in which a new set of typologies was distilled and tested using clickstream data. Only one of the eight articles analyzed used web logs, and validations were performed with surveys for all other studies [26]. In the same year, visualized behavioral patterns of customers were obtained utilizing web logs [27]. Influenced by Moe’s typologies, in 2016, Schellong et al. used clickstream data as a source to discover online shopper types, and in a quadrant of consumer involvement and search behavior, they proposed Buying, Searching, Browsing, and Bouncing customer typologies [28]. Another study in that year investigated the navigational path differences in mobile and PC devices using a footstep graph [29]. In that year, Park’s research focused on modeling customer behavior using logs from multiple websites [30]. The following year, researchers predicted session outcomes with the help of multiple action motifs [31]. After modeling three, eight, and fourteen action purchase models, the repeating motifs in each model were compared, and the results were used for outcome predictions [31]. Sequential Event Pattern Mining (SEPM) was also used to identify e-commerce visitor behaviors. In a 2020 study, SEPM was applied with the session duration, and discovered types included classic, observer, comparing, and research behaviors [32]. 
The differences between hedonic and utilitarian purchases were the subject of an article in 2020, and Li et al. analyzed and listed the usage of various channels using clickstream data [33]. The latest research on web usage mining continues to focus on path visualization. A study from 2021 visualized browsing paths using a hybrid of Hierarchical Recurrent Neural Networks (HRNNs) and Hopfield Neural Networks (HNNs), associated with a visualization graph [34]. An empirical study in 2022 considered online shopping cart use and online shopping cart abandonment to model customer behavior in an online setting. As an important managerial implication, the findings in this research revealed that product page browsing was negatively correlated with cart use and positively correlated with cart abandonment [35].

2.3. Definitions Related to Process Mining and Techniques Used

A process consists of activities that are organized to achieve an outcome, whereas a business process is a group of activities that creates value for customers when performed together [36]. Process mining is an emerging discipline that utilizes a set of techniques to discover, control, and enhance processes through the systematic use of event data [13].

Process mining is a set of data-driven techniques for process discovery, conformance checking, and process enhancement [37]. It utilizes the event logs that are generated when a system is running, and process analysis can be performed with it in an algorithmic and objective way [11]. Process mining has been used in a wide range of domains, such as healthcare, manufacturing, and financial services [38,39,40].

Process mining requires event data that are recorded during the execution of corporate IT systems or the devices that we use [11]. In general, events are observable phenomena related to sequential or parallel steps performed [11]. This definition has even widened with the advancement of technologies such as the Worldwide Web and the Internet of Things (IoT). Therefore, events can be initiated by any means, and in addition to business processes, other types of processes can be improved with recorded event data [9].

Events grouped under the same process instance form a case. Process mining requires at least three attributes in an event: the related case of the event, a timestamp indicating the time or the sequence of the events, and the activity or the action that generated the event [41]. In addition, an event can have other attributes, such as cost, organization, and the user who performed it. When events are collected together, they form event logs, and each event occurs only once in a specific collection [11]. A finite sequence of events is called a trace, where each event is listed once. A trace is a mandatory attribute of a case [11].

Table 2 depicts an example for an e-commerce event log. Each row in Table 2 is an event with the attributes Case ID, Activity, Timestamp, Price, and Product. The first four rows belong to the same case because they have the same Case ID. The collection of such events makes an entire event log. In real-life examples, an event log can include thousands or even millions of events.
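
As a minimal sketch of how such a log maps to cases and traces, the Python snippet below groups events by Case ID and orders each group by timestamp. The rows are hypothetical and are not the rows of Table 2; the field names only mirror the attributes listed above, and the helper name `traces_by_case` is an assumption made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events (NOT the rows of Table 2), deliberately out of order.
event_log = [
    {"case_id": "s1", "activity": "Purchase", "timestamp": "2020-01-01 10:03:00", "price": 12.5, "product": "p7"},
    {"case_id": "s2", "activity": "View", "timestamp": "2020-01-01 11:00:00", "price": 8.0, "product": "p2"},
    {"case_id": "s1", "activity": "View", "timestamp": "2020-01-01 10:00:05", "price": 12.5, "product": "p7"},
    {"case_id": "s1", "activity": "Cart", "timestamp": "2020-01-01 10:01:10", "price": 12.5, "product": "p7"},
    {"case_id": "s2", "activity": "View", "timestamp": "2020-01-01 11:02:00", "price": 9.9, "product": "p3"},
]

def parse_time(event):
    """Parse the timestamp attribute of one event."""
    return datetime.strptime(event["timestamp"], "%Y-%m-%d %H:%M:%S")

def traces_by_case(events):
    """Return {case_id: [activity, ...]} with activities ordered by timestamp."""
    cases = defaultdict(list)
    for event in events:
        cases[event["case_id"]].append(event)
    return {
        case_id: [e["activity"] for e in sorted(case_events, key=parse_time)]
        for case_id, case_events in cases.items()
    }

traces = traces_by_case(event_log)
```

Here the three mandatory attributes (case, timestamp, activity) are enough to recover each trace; price and product are carried along as optional attributes.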

The main input for process mining algorithms is an event log, and a process model is created using event logs. Some important definitions of process mining are given below:

Definition 1. (Process model). Let $U_{act}$ be the universe of activity names, $U_M$ be the universe of process models, and trace $\sigma = \langle t_1, t_2, \dots, t_n \rangle \in U_{act}^*$ be a sequence of activities. A process model $M \in U_M$ defines a set of traces $lang(M) \subseteq U_{act}^*$ [13].

Definition 2. (Transition system). A transition system is a triplet $TS = (S, A, T)$ consisting of states, actions, and transitions [11], where:

S: set of states;

A: set of actions or activities;

T: set of transitions, where $T \subseteq S \times A \times S$.

Definition 3. (Petri net). A Petri net is a triplet $N = (P, T, F)$ with a finite set of places $P$ and a finite set of transitions $T$ such that $P \cap T = \emptyset$, and with a flow relation $F \subseteq (P \times T) \cup (T \times P)$ given by a set of directed arcs. A pair $(N, M)$ is a marked Petri net, where the marking $M$ is a multiset over $P$ [11].

Definition 4. (Directly Follows Graph). A DFG is a pair $G = (A, F)$, where $A$ is a set of activities and $F$ is a multiset of arcs such that $A \subseteq U_{act}^*$ and $F \in \mathcal{B}((A \times A) \cup (\triangleright \times A) \cup (A \times \square) \cup (\triangleright \times \square))$. In this notation, $\triangleright$ represents the start node and $\square$ represents the end node, where $\{\triangleright, \square\} \cap U_{act}^* = \emptyset$, and $U_G$ is the set of all Directly Follows Graphs such that $U_G \subseteq U_M$ [13].
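
To make Definition 4 concrete, the following sketch builds a DFG from a list of traces in Python. It is an illustration under stated assumptions rather than the implementation used in this study: the function name `directly_follows_graph` and the sample traces are hypothetical, and the multiset of arcs is represented with a `Counter`.

```python
from collections import Counter

START, END = "▹", "□"  # artificial start and end nodes, not activity names

def directly_follows_graph(traces):
    """Build G = (A, F): A is the set of activities, F a multiset of arcs."""
    activities = set()
    arcs = Counter()
    for trace in traces:
        activities.update(trace)
        path = [START, *trace, END]
        # Every pair of neighbours in the padded trace contributes one arc.
        for source, target in zip(path, path[1:]):
            arcs[(source, target)] += 1
    return activities, arcs

# Hypothetical traces used only for illustration.
sample_traces = [
    ["View", "Cart", "Purchase"],
    ["View", "View"],
    ["View", "Cart", "Purchase"],
]
A, F = directly_follows_graph(sample_traces)
```

Because only pairs of neighbouring activities are counted, the graph can be built in a single pass over the log, which is consistent with the performance argument made for DFG notation on large datasets.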

Logistic regression is considered a specialized form of linear regression for predicting and explaining a categorical variable instead of a metric response [14]. In binary classification problems, the outcome is always one of two categories. This requires calculating the probability of being in one of these categories. When the probability is greater than 0.5, the observation is assigned to one of the categories; otherwise, it is assigned to the other category. Linear regression may result in probability results below zero or greater than one [42]. The solution is to use the logistic function which gives results in (0, 1) [43].
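
The role of the logistic function and the 0.5 cut-off can be sketched as follows. The features, weights, and bias in this Python example are hypothetical placeholders and are not the coefficients of the model fitted in this study.

```python
import math

def logistic(z):
    """Logistic function; computed in a way that avoids overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def classify(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one observation in a binary problem."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    probability = logistic(z)
    return probability, int(probability > threshold)

# Hypothetical diagram described by two made-up features.
probability, label = classify(features=[1.0, 2.0], weights=[0.8, -0.3], bias=-0.1)
```

A linear score `z` can take any real value, whereas `logistic(z)` stays between 0 and 1, which is what allows it to be read as a probability.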

Levenshtein distance is a string metric that measures the difference between two strings by counting the minimum number of edit operations (insertions, deletions, and substitutions) needed to convert one string into the other [44].
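
A minimal dynamic-programming sketch of this distance in Python is given below. It assumes that a trace is encoded as a string with one letter per activity; the function name and the sample strings are chosen only for illustration.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # delete x from a
                current[j - 1] + 1,          # insert y into a
                previous[j - 1] + (x != y),  # substitute x by y (free if equal)
            ))
        previous = current
    return previous[-1]

# Two hypothetical traces that differ by one extra activity.
distance = levenshtein("DDAB", "DAB")
```

Since the function only indexes and compares elements, it also accepts lists of activity names instead of strings.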

2.4. Process Mining Applications in E-Commerce

Process mining has great potential for discovering visitor journeys in e-commerce, and in a recent study, the processual pattern was suggested as the primary form of representation for explaining digital traces [45]. Compared with the existing techniques, process mining is more path-oriented and has different notations for visualization and analysis. As a result, it provides alternative ways to understand visitor journeys in e-commerce. The first research on this subject was published in 2013. In this study, after preprocessing the data to make them suitable for process mining, researchers obtained several process models with buyer-ends and exits without making purchases [9]. Four years later, the focus of another article was to model the workflows of bargain shoppers and surgical shoppers, and using clickstream data, differences and deviations of these typologies were observed [46]. In the same year, Quality of Service (QoS) was also the focus of an e-commerce study. A model was proposed for modeling a customer behavior graph that was sensitive to the QoS provided to customers during their navigation to formulate QoS-aware offers [47]. Another paper in that year used an LTL-based model-checking approach to analyze customer behavior with declarative modeling [48]. Following these, Terragni and Hassani published two clickstream studies in the process mining discipline. Their dataset was not from e-commerce but from advertising, which we considered to be a closer area to e-commerce. In both of the papers, recommendations for product pages were made for the user to visit [49,50]. In 2018, researchers used online ticket sales data to create a process model to make predictions and recommendations [51]. In 2019, process mining was used to improve the usability of a website by taking into account dynamic aspects of user activity [52]. After 2020, two articles were published in which web logs were analyzed via process mining. 
In one of them, an approach for designing and deploying a customer journey management system was developed and tested with the Google Merchandise Store dataset [53]. The other one was related to a data preprocessing method (CPA-PM) for event logs generated by cloud-based information systems with an emphasis on clickstream data [54]. A summary of these studies is given in Appendix A.

2.5. Evaluating Process Structuredness

Unstructured, spaghetti-like, complex, and hard-to-understand are the adjectives frequently used to describe process models discovered via process mining [55,56,57]. A process that is not simple is classified as complex [58], and a process model with structure, as in the second column of Table 1, helps its audience to comprehend it. Having close meanings, these adjectives are mostly used interchangeably. Van der Aalst defined a continuum of processes ranging from “structured” to “unstructured” processes [59], and he mentioned that unstructured processes had great potential for improvement [60]. In this section, we list the most prominent studies from the literature on measuring process structuredness before and after process mining.

According to a survey conducted in 2007, the number of arcs was found to be the most significant factor influencing process complexity [61] among the other metrics investigated in this study, such as Density, Coefficient of Network Connectivity (CNC), and Average Connector Degree (ACD). In a complementary study of that research, empirical evidence was found to support that process understandability was correlated with Density, CNC, and ACD [62]. According to the Seven Process Modeling Guidelines (7PMG), the number of elements, the number of routing paths per element, and the model layout were listed as factors affecting process model complexity [63].

Structuring process mining outcomes has been an important research topic since the inception of this discipline. In the Process Mining Manifesto, obtaining understandable process diagrams was listed as a challenge [37]. In a more recent article, it was shown that creating understandable process models is still an improvement opportunity for process mining [64].

Process mining quality criteria, namely, fitness, precision, generalization, and simplicity, are commonly accepted and widely used metrics for evaluating process mining outcomes [11]. In practice, Petri Nets with a fitness greater than 0.8 were regarded as structured [11]. During the development of the discipline, several other metrics were suggested for measuring process structuredness. In 2009, Minimum Description Length was used as a model quality measure [65]. In 2012, BPMN process models were investigated, and several complexity measures were tested to understand the costs and benefits of structuredness [66]. One year later, researchers proposed thresholds for model quality [67]. In that study, a model size less than 37 or a model with at most 31 activities was considered non-complex. In a recent review article, more than forty process modeling guidelines were distilled from nearly 800 studies, and 30% of the selected studies did not provide empirical evidence related to the proposed guidelines [68]. In 2022, event log complexity and existing measures for process complexity were analyzed using a total of 32 event logs for BPMN models, and new measures were defined based on graph entropy [58]. Recently, the relationship between process performance and process complexity was studied, and a very high positive correlation was observed [69], which highlights the need to design easier e-commerce visitor journeys.

2.6. Main Contributions

The main objective of this research was to enhance our understanding of e-commerce visitor behaviors, which are considered to be a significant prerequisite for improving online customer journeys, making more accurate predictions and recommending more relevant products. The contributions of this study can be summarized as follows:

We developed an empirical process structuredness measure using expert knowledge.
We proposed a methodology for structuring the outcomes of e-commerce visitor journeys and tested it with real-life data.
By treating the cases with an account balance approach that is applied similarly in accounting, we calculated the cart status at the beginning and end of a journey and used this information for grouping the sessions.
We identified three levels of e-commerce visitor journeys and explained them.
By using cart statuses at the beginning and end of these journeys, we obtained a high-level end-to-end e-commerce journey.
We proposed new metrics to evaluate online user journeys and to benchmark e-commerce journey design success.

3. Materials and Methods

Figure 2 schematically depicts the proposed methodology. The novel approach developed to measure the structuredness of processes is explained in Appendix B. Details of the proposed methodology are given below:

Step 1: Filtering Activity Level Journeys

To make the journeys at all levels more structured, before filtering the most basic e-commerce journeys, a preprocessing step was applied by setting a defined endpoint per the 7PMG. We assumed “Purchase” as the last activity for all the cases ending with a purchase, similar to Lin et al. [31]. Below is a sample with multiple “Purchase” (shown as B) activities:

DDDDDAAAACBBBDDDDDAACCDDDAABBDD

Applying this preprocessing step gives these three cases:

DDDDDAAAACBBB
DDDDDAACCDDDAABB
DD

Another assumption in this preprocessing step was to keep consecutive purchases in the same case to prevent one-activity cases. For non-purchase sessions, this step did not modify the web logs.
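
The splitting rule described above can be sketched in Python as follows. This is an illustrative reading of the preprocessing step, not the original implementation; the function name `split_at_purchase` is hypothetical, and the one-letter encoding follows the sample above, where B stands for “Purchase”.

```python
def split_at_purchase(events, purchase="Purchase"):
    """Close a case after each run of consecutive purchase events."""
    cases, current = [], []
    for index, activity in enumerate(events):
        current.append(activity)
        following = events[index + 1] if index + 1 < len(events) else None
        # Consecutive purchases stay together; the case ends with the last one.
        if activity == purchase and following != purchase:
            cases.append(current)
            current = []
    if current:  # trailing events without a purchase form their own case
        cases.append(current)
    return cases

session = list("DDDDDAAAACBBBDDDDDAACCDDDAABBDD")
cases = ["".join(case) for case in split_at_purchase(session, purchase="B")]
```

For the sample session, `cases` holds the same three cases listed above, and a session without any purchase is returned unchanged as a single case.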

After the preprocessing step, cases with only one or two events were filtered out and saved in another file for further analysis. Due to the nature of e-commerce, clickstream data contain sessions with very few events. Web crawlers collecting competitor data and clicks on ads resulting in unintended access to the websites are some of the reasons for short cases [9,49]. In addition, some visitors may leave websites due to poor journey design. In this research, we regarded activity level journeys as an improvement opportunity and analyzed them separately. Filtering out activity level journeys is also an important step for obtaining better results at other journey levels. Otherwise, depending on their rate in the dataset, behaviors or process diagrams may be misleading when interpreting e-commerce visitor journeys.

Step 2: Event Log Enrichment and Repair

Guiding Principle 1 of Process Mining Manifesto states that [37]:

“Event data should be treated as first class citizens”.

According to our experience, data from corporate systems such as ERP and CRM are generally more suitable for process mining. Web logs contain less-structured data. To align with the 7PMG, we applied enrichment and repair operations to the event log.

(a) Activity Type Enrichment: Activity types in the data of corporate systems mostly tend to correspond to a defined process step. However, as web logs are usually not designed from a process perspective, redefinition of activity types is needed. For example, a visitor may visit a specific product category or check multiple categories. In the log file, both events will probably have “View” as the activity type. From a customer journey perspective, changing a category can provide important information for understanding customers. Therefore, the activity types in the dataset were examined and redefined. Details are given in Appendix C.

After this step, four new activity types were created, and in total, there were eight activity types. In the process mining literature, modifications of event logs were studied under abstraction and log repair topics [70,71]. To our knowledge, linking events with previous events and checking the gross total of matching events using an accounting-like approach is novel.
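
One possible reading of the accounting-like linking is sketched below in Python. The exact rules are those given in Appendix C and may differ; here it is assumed that a remove or purchase event is relabeled as previously carted when the current session holds no unmatched add-to-cart event for the same product. The base activity names “Cart”, “Remove”, and “Purchase” and the function name are also assumptions.

```python
from collections import Counter

def enrich_cart_activities(events):
    """events: list of (activity, product) tuples in session order.

    Assumed rule: a Remove or Purchase with no open add-to-cart for the same
    product in this session refers to a product carted in an earlier session.
    """
    open_adds = Counter()  # per product: adds not yet matched in this session
    enriched = []
    for activity, product in events:
        if activity == "Cart":
            open_adds[product] += 1
        elif activity in ("Remove", "Purchase"):
            if open_adds[product] > 0:
                open_adds[product] -= 1  # matched against an add in this session
            else:
                activity = f"{activity}-Previously-Carted"
        enriched.append((activity, product))
    return enriched

# Hypothetical session used only for illustration.
session = [("View", "p1"), ("Remove", "p9"), ("Cart", "p1"),
           ("Purchase", "p1"), ("Purchase", "p2")]
enriched_session = enrich_cart_activities(session)
```

In this example, the removal of p9 and the purchase of p2 have no matching add-to-cart event in the session, so both are relabeled, while the purchase of p1 keeps its original type.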

(b) Adding Start and End Nodes in Line With Cart Status: In the e-commerce sales funnel, adding a product to the cart is an important action. To our knowledge, previous studies on e-commerce that used process mining techniques did not consider cart status when analyzing customer journeys. In our methodology, we attempted to identify under which conditions an online session was starting and ending. For example, a visitor may revisit a website and remove previously added products from the cart. When such a journey is analyzed without cart status, it will be misleading for the analyst. This step was also essential from the process modeling perspective according to the 7PMG.

Since cart status information was not readily available in the web logs used in this research, the algorithm inserted start nodes by searching each case for at least one event with the “Remove-Previously-Carted” or “Purchase-Previously-Carted” activity type. If one of these activities existed, then a starting node named “Start-with-Stocked-Cart” was inserted. Otherwise, the “Start-with-Unknown-Cart” node was added, meaning that the visitor either had no product in the cart from previous sessions or was not interested in the product(s) remaining in the cart from previous sessions.

End nodes were inserted using an accounting-like approach. If, for a specific product, the number of add-to-cart activities was greater than the total number of remove and purchase activities, then a node named “Exit-with-Stocked-Cart” was inserted for non-purchase cases, and for the cases with at least one purchase, the end node name was “Order-Given-Stocked-Cart”. For such cases, we assumed that the visitor was stocking some products in the cart for buying or deciding in the future. If the number of products added to the cart in a session was equal to the total number of remove and purchase activities for a specific product, either the “Exit-with-Unknown-Cart” or “Order-Given-Unknown-Cart” node was added depending on whether the case was a non-purchase or purchase type. The same remark provided for the start nodes applies here: in such a session, the visitor might not be interested in the products added to the cart in previous sessions, or the cart might be empty.

We can refer to Table 3 for an explanation of this step. In the “New Activity Type” column, there are three activities related to previous sessions. Using this information, we understand that there were products in the cart at the beginning of the session. Therefore, the starting node was named “Start-with-Stocked-Cart”. When we check the products added to the cart, removed from the cart and purchased in this session, we can identify that three products were added to the cart, one of which was removed and one of which was purchased. One product remained in the cart at the end of the session. With the help of this calculation, the end node was set to “Order-Given-Stocked-Cart”.
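
The start and end node rules can be sketched in Python as follows. The events below are hypothetical and only mirror the counts of the example above; they are not the rows of Table 3. The base activity names “Cart”, “Remove”, and “Purchase”, and the choice to treat any activity whose name begins with “Purchase” as a purchase, are assumptions.

```python
from collections import Counter

PREVIOUS = {"Remove-Previously-Carted", "Purchase-Previously-Carted"}

def start_node(events):
    """Stocked cart if any event refers to a product carted in an earlier session."""
    if any(activity in PREVIOUS for activity, _ in events):
        return "Start-with-Stocked-Cart"
    return "Start-with-Unknown-Cart"

def end_node(events):
    """Per-product balance: adds minus removes and purchases within the session."""
    balance = Counter()
    purchased = False
    for activity, product in events:
        if activity == "Cart":
            balance[product] += 1
        elif activity in ("Remove", "Purchase"):
            balance[product] -= 1
        if activity.startswith("Purchase"):  # assumption: counts as a purchase case
            purchased = True
    stocked = any(count > 0 for count in balance.values())
    prefix = "Order-Given" if purchased else "Exit-with"
    return f"{prefix}-{'Stocked' if stocked else 'Unknown'}-Cart"

# Hypothetical session: three events tied to earlier sessions, three products
# added, one of them removed and one purchased.
session = [
    ("Remove-Previously-Carted", "p0"), ("Purchase-Previously-Carted", "p8"),
    ("Purchase-Previously-Carted", "p9"),
    ("Cart", "p1"), ("Cart", "p2"), ("Cart", "p3"),
    ("Remove", "p1"), ("Purchase", "p2"),
]
nodes = (start_node(session), end_node(session))
```

Product p3 keeps a positive balance at the end of the session, so the session is labeled as starting and ending with a stocked cart.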

Step 3: Filtering Behavior Level Journeys

In the first step of the methodology, the threshold for activity level journeys was defined to be a maximum of two events in the same session. On the other hand, while developing the process structuredness measure, a process was assumed to have at least three steps or three different activities. Therefore, we defined every journey in between as the behavior level. Figure 3 depicts three levels of e-commerce visitor journeys.
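The three levels can be expressed as a small classification rule. The sketch below assumes that a session is given as a list of activity labels without the artificial start and end nodes; the function name `journey_level` is hypothetical.

```python
def journey_level(activities):
    """Classify one session as an activity-, behavior-, or process-level journey."""
    if len(activities) <= 2:
        # At most two events in the session.
        return "activity"
    if len(set(activities)) <= 2:
        # Any session length, but at most two distinct activity types.
        return "behavior"
    # At least three distinct activity types.
    return "process"


levels = [
    journey_level(["View"]),
    journey_level(["View", "Add-to-Cart", "View", "Add-to-Cart"]),
    journey_level(["View", "Add-to-Cart", "Purchase"]),
]
```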

Step 4: Process Level Journey Classification

Filtering out activity-level and behavior-level journeys gave the most complex and valuable e-commerce visitor journeys: process-level journeys. To explain these journeys, the following algorithmic steps were applied:

(a) Decreasing Number of Variants (Optional): This step can be applied depending on the log complexity and the need for each activity type for understanding customer behavior. The main objective for renaming purchase activities was to determine the cart status. A side effect of this manipulation was the increase in the number of variants, which made the analysis more difficult. As an optional step in the methodology, some activity types were reverted back to the original type. With this step, it became easier to analyze some e-commerce consumer behaviors.

(b) Grouping Journeys in Line with Cart Status: At this stage, the dataset contained cases with at least 3 different activity types. A further grouping was made at this point by using the artificial nodes added for the cart status. Each case had one of the two start node types and one of the four end node types. As depicted in Table 4, eight categories were identified.

(c) Grouping by Variant Frequency: This step was required because of the complexity of e-commerce logs. Compared with those in corporate system logs, the number of variants in web logs is usually too high. As running the algorithm with a high number of variants took considerable time, this step was required to shorten the execution time. After all the variants in the remaining dataset were identified and grouped by variant frequency, each group was divided into two subgroups using the trace variant frequency of the cases. If a case had a variant frequency greater than one, it was placed in the high-frequency set. Otherwise, it was put in the low-frequency set.

(d) Clustering of High-Frequency Variants: Using Levenshtein distance-based clustering [72], cases with high variant frequency were split into at least two clusters. Depending on the evaluation performed using the logistic regression model, the number of clusters increased. The objective of the algorithm was to minimize the maximum distance within a cluster. As average distance works in favor of the cluster with the highest number of cases assigned, it was not preferred. Initially the most frequent variants are assigned as the seeds. Then, for each case, the algorithm calculates the Levenshtein distance with all elements of each cluster. The maximum distance to each cluster is found, and the case is assigned to the cluster with the minimum distance if the distance is smaller than a predefined threshold parameter.

(e) Assigning Low-Frequency Variants: Process variants generally show a high level of internal heterogeneity [73]. E-commerce web logs usually contain a large number of trace variants. Leaving low-frequency variants out means ignoring significant details. Therefore, low-frequency variants were assigned to the nearest cluster. In machine learning terms, the process variants were finally grouped as a cluster of process trace [73]. The proposed algorithm calculates the distance between each unassigned case and the assigned cases and assigns it to the cluster with the case with the minimum distance. If a case had equal minimum distance to more than one cluster, then the cluster with the highest number of cases with minimum distance was selected. When there was more than one cluster satisfying this condition, the case was assigned to the first cluster.
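Steps (c) and (d) can be sketched as follows. This is a simplified illustration under stated assumptions: it works on distinct variants rather than individual cases, the names `levenshtein` and `cluster_high_frequency` are chosen here, and the seed count and threshold are free parameters rather than the values tuned in the study.

```python
from collections import Counter


def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def cluster_high_frequency(traces, n_clusters=2, threshold=3):
    """Split variants by frequency, then cluster the high-frequency ones."""
    counts = Counter(tuple(t) for t in traces)
    high = [v for v, c in counts.most_common() if c > 1]   # variant frequency > 1
    low = [v for v, c in counts.items() if c == 1]          # handled in step (e)

    # The most frequent variants act as the initial seeds.
    clusters = [[v] for v in high[:n_clusters]]
    unassigned = []
    for variant in high[n_clusters:]:
        # Maximum distance to the members of each cluster (complete linkage).
        linkage = [max(levenshtein(variant, m) for m in members) for members in clusters]
        best = min(range(len(clusters)), key=linkage.__getitem__)
        if linkage[best] < threshold:
            clusters[best].append(variant)
        else:
            unassigned.append(variant)
    return clusters, unassigned, low


# Toy log: each character stands for one activity (V = view, A = add, R = remove, P = purchase).
traces = ["VAP", "VAP", "VAP", "RRV", "RRV", "VAAP", "VAAP", "RRVV", "RRVV", "VRA"]
clusters, unassigned, low = cluster_high_frequency(traces)
```

Using the maximum rather than the average distance keeps a large cluster from attracting cases merely because most of its members happen to be close.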

An illustration for the grouping and clustering processes of the variants is given in Figure 4. In the upper part, grouping is achieved using variant frequencies according to the defined threshold parameter. The larger circles at the bottom of the illustration depict the clusters that consist of high-frequency variants. At the bottom right, Variant8 is assigned to the blue cluster, as Variant2 is the nearest to it according to the calculated Levenshtein distance. Similarly, Variant9 and Variant10 are placed in the red cluster as a result of a comparison conducted using Levenshtein distance.

With this approach, it was possible to assign all remaining cases to a cluster, which decreased the in-cluster similarity. To prevent this, a distance limit was used as a threshold, which was set by trial and error. The guiding principle for setting the threshold limit was to have 80% of the cases resulting from the preprocessing phase assigned to a cluster. Next, a process diagram was created for each cluster and evaluated with the proposed structuredness measure.
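The nearest-cluster assignment of low-frequency variants, including the tie-breaking rule and the distance limit, might look like the sketch below. It is an assumption-laden illustration: `assign_low_frequency` and `limit` are names introduced here, ties are counted over variants rather than cases, and treating the limit as inclusive is a choice made for the example.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]


def assign_low_frequency(variant, clusters, limit):
    """Return the index of the cluster holding the nearest member, or None.

    clusters: non-empty lists of already assigned variants.
    """
    best_index, best_key = None, None
    for index, members in enumerate(clusters):
        distances = [levenshtein(variant, m) for m in members]
        nearest = min(distances)
        # Smaller distance wins; on a tie, more members at that distance win.
        key = (nearest, -distances.count(nearest))
        if best_key is None or key < best_key:   # strict '<' keeps the first cluster
            best_index, best_key = index, key
    if best_key[0] > limit:
        return None   # too far from every cluster; the case stays unassigned
    return best_index


clusters = [[tuple("VAP"), tuple("VAAP")], [tuple("RRV"), tuple("RRVV")]]
near_first = assign_low_frequency(tuple("VAPP"), clusters, limit=2)
near_second = assign_low_frequency(tuple("RVV"), clusters, limit=2)
too_far = assign_low_frequency(tuple("XYZXYZ"), clusters, limit=2)
```

Raising `limit` assigns more cases but lowers in-cluster similarity, which is the trade-off behind the 80% guiding principle described above.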

4. Evaluation and Results

The proposed methodology was tested using web logs from an online cosmetics shop [74]. The dataset contained clickstreams of five months between October 2019 and February 2020, and it had more than 20 million rows. Originally, the dataset was composed of five separate files. First, using KNIME, these files were combined and ordered by session ID so that the run time of the algorithms could be improved. After that, the proposed four-step methodology was applied with a set of algorithms as follows:

Step 1—Filtering Activity Level Journeys: An algorithm was developed to process the dataset in line with the methodology described in Section 3. “Purchase” was considered to be the last activity in the customer journey. Each purchase was recorded as a separate event in the dataset; because of this, consecutive purchases in the same case were kept together to prevent one-activity cases. Cases without any “Purchase” activities were not changed. Following this, cases with just one or two events were removed from the dataset and saved in another file for further analysis as the activity-level journeys.
Step 2—Event Log Enrichment and Repair: After the initial dataset examination, it was decided that activity types in the dataset did not reflect the intentions of the visitors. For data enrichment purposes, activity types in Appendix C were created with an algorithm. Since cart status information was not readily available in the web logs used in this research, cart statuses were calculated and start and end nodes in line with cart status were added to each case as explained in Table 3.
Step 3—Filtering Behavior Level Journeys: In line with Figure 3, we created an algorithm that filtered out behavior-level journeys regardless of the session length but at a maximum of two activity types. Then, each combination of activities was saved in separate files for analyzing these behavior-level journeys.
Step 4—Process Level Journey Classification: In order to decrease the number of variants, first of all, “Purchase-Previously-Carted” activities were converted back to “Purchase”. Then, a case grouping was made with start and end nodes, as in Table 4, and eight groups were identified. Following that, each group was processed with the algorithm illustrated in Figure 4. The structuredness of the process of each resulting cluster was tested using the empirical measure developed in this study.
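One possible reading of the purchase handling in Step 1 is that a session is cut into cases after each run of consecutive “Purchase” events, so that a lone purchase never forms a case on its own. The sketch below follows that reading; it is an interpretation rather than the published algorithm, and the function name `split_at_purchases` is hypothetical.

```python
def split_at_purchases(activities):
    """Cut a session into cases that end after a run of consecutive purchases."""
    cases, current = [], []
    for position, activity in enumerate(activities):
        current.append(activity)
        is_last = position + 1 == len(activities)
        next_activity = None if is_last else activities[position + 1]
        # Close the case only when the purchase run is over.
        if activity == "Purchase" and next_activity != "Purchase":
            cases.append(current)
            current = []
    if current:
        cases.append(current)   # trailing events without a purchase
    return cases


session = ["View", "Add-to-Cart", "Purchase", "Purchase", "View", "View"]
cases = split_at_purchases(session)
```

A session without any purchase comes back as a single unchanged case.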

In this study, preprocessing and case groupings were performed using Python 3.10 in Google Colab Pro, which had 51 GB of RAM. During the tests, the amount of RAM used never exceeded 10 GB. The analysis unit was the sessions. In addition, “session-id”, “event-time” and “event-type” columns were also used for event log grouping and process discovery. “event-type” had “view”, “cart”, “remove-from-cart”, and “purchase” activity types, which were similar to the other web logs openly available [75].

In this research, we managed to understand and explain more than 95% of the cases in the combined dataset. The quality of the process diagrams obtained in this research was objectively measured and verified with the model developed using expert knowledge. The results of this study were limited to the dataset used. Analyzing the web logs with the proposed methodology resulted in the identification of eight behavior-level and eight process-level journeys. In the middle of Figure 5, all journey types are given in a funnel, and at the end of the funnel is the ultimate objective, making the visitors “Purchase”. Charts in the figure show the overall shares of journey levels with buyer-end-included and -excluded versions and the share of purchase for each journey level.

4.1. Activity Level E-Commerce Visitor Journeys

Approximately three-quarters of the analyzed dataset contained sessions with activity-level journeys. The vast majority of these were “View” activity type cases, 83.51% of which involved a single event, and 9.49% of which involved two events. This fraction of the dataset held only 0.03% of “Purchase” activities. A total of 5.12% of the dataset was “Add-to-Cart”, and 0.085% was “Remove-from-Cart”. These kinds of journeys did not require visual inspection.

4.2. Behavior Level E-Commerce Visitor Journeys

In this group, there were 23 distinct activity type combinations in the dataset analyzed. As part of the analysis, each combination was visualized via process mining and statistically examined. As a result, eight behavior-level journeys were discovered in the dataset. Exploitation, Exploration, and Purchase behaviors were already available in the literature [15,76]. Other behaviors were defined as a result of the analysis. A summary of these behaviors is given in Table 5, and explanations are provided below.

Exploitation: Visiting product pages only in a specific category. In Table 5, the joint occurrence of each behavior with other behaviors is also given. At the behavior level, this kind of journey was observed in 59.92% of cases without taking any other action, and in 40.08% of the cases, other behaviors were detected.
Exploration: Visiting product pages in at least two categories.
Selection: Adding products to the cart without visiting related product pages.
Handpicking: After visiting a product page, adding that product to the cart and sometimes viewing another product in the same category and adding it to the cart.
Elimination: Removing products from cart which were added in previous sessions or in the current session after visiting a product page. This behavior was always observed with other behaviors.
Cancellation: Removing products from cart that were added in previous sessions or in the current session. This behavior is identical to the invalidation of purchase requests in corporate procurement processes.
Replenishment: Removing products from the cart that were added in previous sessions before or after adding new products to the cart from a list. This behavior was always observed with other behaviors.
Purchase: Purchasing products that were added to the cart in the previous sessions or in the current session.
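For view-only sessions, the distinction between Exploitation and Exploration reduces to counting distinct categories. The following sketch assumes that a category identifier is available for every viewed product; the function name `view_behavior` and the category values are hypothetical.

```python
def view_behavior(viewed_categories):
    """Label a view-only session by the number of distinct categories visited."""
    distinct = set(viewed_categories)
    if not distinct:
        return None   # no product views in the session
    return "Exploitation" if len(distinct) == 1 else "Exploration"


labels = [
    view_behavior(["lipstick", "lipstick", "lipstick"]),
    view_behavior(["lipstick", "nail-polish"]),
]
```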

4.3. Process Level E-Commerce Visitor Journeys

At the process level, there were eight e-commerce visitor journey categories according to the state of the cart at the beginning and end of the sessions, which are available in Appendix D. Since classification with cart status yielded unstructured processes, the algorithm based on the Levenshtein distance was used to discover comprehensible processes. For each process-level journey, clustering was applied, and the resulting process variants were evaluated with the proposed structuredness measure. Basically, after the clusters were obtained, process diagrams were generated, the inputs for the structuredness measure were calculated, and the structuredness value was found with Equation (A2). With this approach, the main flows of each journey were discovered. Process mining was also utilized to obtain statistical information. Statistics and structuredness values for process-level journeys are given in Table 6.

By using the cart statuses at the beginning and end of the journeys, the end-to-end e-commerce high-level journey in Figure 6 was mapped. While designing this diagram, journeys with unknown cart statuses were assumed not to have a start or end connection with other journeys. For example, P2 starts with an unknown cart and it was connected to “Unknown Cart” node. At the end of P2, remaining items were detected in the cart. Using this information P2 was linked with P1, P3, P5, and P7, which had stocked carts at the process initiation. With this approach, a complete picture of e-commerce journeys was obtained from a cart status perspective and it provided an advanced understanding compared with other approaches which superpose multiple diagrams.

Each process level journey discovered with process mining was considered a subprocess of this high-level journey. The percentages inside the steps depict the overall shares of the journey categories. Dashed lines were used for following the flow easily. Diagrams for process-level structured journeys 1–6 are given in Figure 7. In line with 7PMG, activity labels were converted to verb–object format before process discovery. P7 consisted of Cancellation and Exploration behaviors. On the other hand, P8 was composed of Cancellation and Exploitation behaviors. Therefore, process diagrams were not necessary for P7 and P8.

5. Discussion

In this research, a more advanced understanding of e-commerce visitor journeys was obtained with the proposed methodology compared with existing research given in Appendix A. The main difficulty of the analysis was the structure of the dataset. As the start and end status of the cart was not available, we developed an alternative approach to identify the cart status. In addition, activity types were not designed from a process discovery perspective. We also suggested a solution for this problem. We believe that better results could be achieved with datasets that are more suitable for our approach. This implies the importance of data for process mining, as stated in the Process Mining Manifesto [37].

The execution time of the proposed methodology was approximately two hours. With a dataset of 20 million rows, the algorithm performance was satisfactory. Clusters of each process-level journey were visually analyzed, and algorithm parameters were manipulated to improve the structuredness of the resulting diagrams. With this approach, at the end of the analysis, over 95.42% of the cases were in a cluster that could be interpreted, which was in line with the initial data representation target.

At this point, we can discuss the results of this study in terms of research questions:

RQ1: In the existing studies in the literature, the number of arcs in a diagram was considered to be an important factor for the process structuredness [61]. In this study, it was revealed that process experts do not consider self-loops of the nodes, and Arcs per Node Excluding Self Loops is the most significant factor for process structuredness, as shown in Table A3. Moreover, Number of Nodes is a factor influencing process structuredness; however, it is not as significant as Arcs per Node Excluding Self Loops.
RQ2: By applying the four-step methodology, we identified that activity-level e-commerce visitor journeys were the most common journey type and that they carried importance for e-commerce journey design. After these, at the behavior level, Exploitation and Exploration were the most common journeys, and it was revealed that journeys with Exploration behavior had significantly lower CR. We think that for prediction and recommendation in e-commerce, researchers and e-practitioners can benefit from analyzing and studying these behaviors. Lastly, at the process level, P5 and P6 were the most common journeys, which needed to be focused on for higher CR.
RQ3: To our knowledge, for the first time in the e-commerce literature, we mapped an end-to-end process of an e-commerce journey, which is given in Figure 6. This journey map is the top-level process, and most common lower-level processes are shown in Figure 7. In line with the research objective, these processes are structured, and they provided many insights for the dataset analyzed. The implications of these are discussed in the following subsections.

5.1. Implications for E-Commerce Visitor Journeys

The Cosmetics Shop dataset analyzed in this research included more than 4.5 million cases. When the activity level was split, the other two journey levels consisted of 1.17 million cases. Excluding almost three-quarters of the data meant losing most of the information in the dataset. Therefore, the complete dataset was analyzed in this research, unlike other studies performed with process mining.

If similar rates of activity-level journeys were observed in the brick-and-mortar shops, we would see that the majority of the customers would enter and exit the store, which might require immediate action from the managers. Hence, instead of ignoring them like in existing studies [9,49,50], we considered activity-level journeys to be an improvement opportunity. By using complementary analysis methods, e-commerce practitioners can benefit from the results:

To enhance e-commerce journey design and thus convert activity-level journeys to more advanced journeys;
To obtain a KPI to measure competition by filtering web crawler sessions;
To evaluate advertisement efficiency by reporting access points.

In addition, an important implication arising from activity-level journeys was that any predictions or recommendations made during the first two events of a session would fail most of the time. Therefore, running models only after the third event could yield higher accuracy rates for approaches such as those in the reviewed process mining studies in e-commerce [49,50,51] and could improve recommendation performance from the e-commerce practitioners’ perspective.

The journey-level conversion rates given in Figure 5 were in line with our initial expectations. At the activity level, purchasing was seen only in 0.02% of the sessions. At the behavior level, the conversion rate was 0.73%, and at the process level, it was 2.72%. The buyer-end share at the process level was thus approximately four times that at the behavior level. This finding verified that when engagement in the visitor journey increased, the probability of purchasing also increased.

At the behavior level, Exploitation, Exploration, and Purchase were the behaviors that were identified in previous studies [15,76]. To our knowledge, Selection, Handpicking, Elimination, Cancellation, and Replenishment behaviors were identified for the first time in clickstream data analysis.

Interestingly, more than half of the sessions consisted of only product views. A total of 38.82% of these cases included Exploitation and 28.21% of them were Exploration-based. In addition, Exploitation jointly occurred with other behaviors in 40.08% of the cases. This statistic was 16.69% for Exploration. It can be concluded that Exploitation was a preferable behavior from an e-commerce point of view because it had a greater probability of evolving into a process-level behavior which had a four-times-higher conversion rate in the dataset. Additionally, product recommendations from the same category would be a better strategy for e-commerce practitioners for higher conversion rates.

Process-level journeys from P1 to P4 were the buyer-end part of the dataset. Among these, P4 was the only process in which all products in the cart were purchased. In addition, it was the quickest in terms of the median duration. P4 was the journey where the visitors were the most dedicated and the buying intention was greater than that of the other three journeys. From an e-commerce conversion perspective, it can be beneficial to focus on and understand P4, and e-commerce experience managers can use this information to redesign e-commerce visitor journeys to increase the ratio of P4. Another insight was that the number of product pages visited per category was lower for buyer-end journeys than for others, which is in line with the existing literature [35].

P5 and P6 contained the majority of the cases at the process level. Since both of these journeys ended with a stocked cart, we named them Stocking journeys. P6 mainly exhibited Exploration behavior, and as previously mentioned, this behavior was a sign of not making a purchase. In this journey, visitors first explored products, then added products to the cart mostly without visiting the product page; they rarely removed a product added to the cart in that session, and they exited the website with a stocked cart. P5 was considered the continuation of P6. In the main flow of P5, visitors first removed products from the cart that were added in the previous sessions either directly or after Exploitation. Following that, they explored other products, added a product after viewing its page, added products to the cart mostly without visiting the product page, sometimes removed a product added to the cart in that session, and exited the website with a stocked cart. From an e-commerce management perspective, P5 and P6 were the processes to focus on in order to increase CR, as they represented nearly 70% of process-level behaviors.

Among process level journeys, P7 and P8 were Cancellation-based processes in which the visitors had no intention to buy. These journeys can be considered the counterpart of cancelling a procurement request [77] in corporate processes. These processes took considerably less time than did the other six processes. At the behavior level, Cancellation was 9.41%, whereas at the process level, it was 7.37%, which supported the importance of visitor engagement in e-commerce.

In contrast with the highest average product page visit per session, the net cart change per session metric was the lowest for P7, and it was zero for P8, meaning that no products remained in the cart at the end of the session. Finally, we can discuss the end-to-end high-level e-commerce visitor journey in Figure 6. The importance of the process-level journeys arose from the fact that they had the highest purchase rate among the three journey levels. In one respect, they were the golden part of the clickstream data. In our opinion, the share of P4 is a metric indicating the success of journey design for the e-commerce practitioners. When the share of P4 is high, website visitors follow the happy path to satisfy their needs in the quickest way, as P4 had the lowest median duration among buyer-end journeys. On the other hand, P8 is the unhappy path, as the journey ends without any items in the cart. In this dataset, journeys other than P4 and P8 spanned across sessions, and their total share was 95.32%. Obviously, this was the result of both the journey design and the variety of competitors and channels available.

5.2. Implications for Structuredness Measure

The objective of this study was to discover comprehensible e-commerce visitor journeys using process mining. In line with this objective, a structuredness measure was developed using inputs from five process analysis experts. In existing research, the number of arcs, Density, CNC, ACD, and model layout were listed among the significant factors for process understandability [61,62,63]. According to the logistic regression model, the proposed CNCX metric was identified as the most significant factor for detecting the structuredness of DFG process diagrams. Size was the other factor impacting structuredness. The superiority of the proposed structuredness measure lay in its ability to distinguish between process models numerically. In one of the previous studies, 31 nodes were identified as the limit for process understandability [67]. The structuredness measure proposed in this research showed that a process diagram with 31 nodes was structured with 52 arcs and unstructured with 53 arcs. In this research, there were at most 10 nodes in the process diagrams. If the 31-node criterion had been used, the process diagrams obtained without any clustering could have been considered not complex. By using the structuredness measure, we obtained process diagrams that were acceptable to the process experts in this study. Structuredness was found to be a combination of several factors and was not related to just one measure. On the other hand, the results indicated that some structuredness measures, such as Density and Total Number of Elements, were not needed because CNCX and Size performed better. Additionally, when the number of self-loops is known, CNCX should be preferred over CNC.
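To make the difference between CNC and CNCX concrete, the sketch below computes both from a set of directed arcs. It assumes that CNC is the number of arcs divided by the number of nodes and that CNCX applies the same ratio after dropping self-loops, as the metric names suggest; the function name `connectivity` and the toy arcs are illustrative only.

```python
def connectivity(arcs):
    """Compute node count, CNC, and CNCX for a directly-follows graph.

    arcs: set of (source, target) pairs; a pair with source == target is a self-loop.
    """
    nodes = {node for arc in arcs for node in arc}
    self_loops = sum(1 for source, target in arcs if source == target)
    return {
        "nodes": len(nodes),
        "cnc": len(arcs) / len(nodes),
        "cncx": (len(arcs) - self_loops) / len(nodes),
    }


# Toy diagram: a three-node cycle plus one self-loop on B.
toy = connectivity({("A", "B"), ("B", "C"), ("C", "A"), ("B", "B")})
```

Here CNC is 4/3 while CNCX is 1.0, so the self-loop raises CNC but leaves CNCX unchanged.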

Alongside the abovementioned implications, the logistic regression model obtained in this study allows for more fine-grained insights. Similar to process mining quality criteria [11], the proposed structuredness measure gives results in the range (0, 1). On top of that, it is also possible to compare process diagrams using the odds ratio. For example, both P2 and P4 in Figure 7 have eight nodes and P2 has one more arc, and as a result, the respective CNCX values are 1.625 and 1.500. The structuredness values are 14.62% and 12.25%, as given in Table 6. Since logistic regression results are not directly comparable, we can compare the odds, which are 0.171 and 0.139, respectively. We can say that for a one-unit increase in the number of arcs, the odds of being unstructured increase by a factor of 1.226. If the CNCX value increases by one unit (adding eight arcs), the odds of being unstructured increase by a factor of 3.400.
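The comparison of P2 and P4 can be retraced with a few lines of arithmetic. The sketch below converts the two reported values into odds p / (1 - p) and takes their ratio; the variable names are chosen here, and small differences from the reported figures stem from rounding of the inputs.

```python
def odds(probability):
    """Convert a probability into odds."""
    return probability / (1 - probability)


odds_p2 = odds(0.1462)          # value reported for P2 in Table 6
odds_p4 = odds(0.1225)          # value reported for P4 in Table 6
odds_ratio = odds_p2 / odds_p4  # effect of the one additional arc in P2

# With eight nodes, 13 and 12 arcs (self-loops excluded) give the reported CNCX values.
cncx_p2 = 13 / 8
cncx_p4 = 12 / 8
```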

Taking these into account, we can say that the proposed structuredness model has a high-precision and comparable measurement capability.

5.3. Limitations and Validity

The structuredness measure obtained in the empirical study with process experts was limited to the DFG notation. In addition, since the selection of process experts was made on a voluntary basis, the experts who participated in the research may not represent the whole population.

There was also room for improvement in the model performance. The logistic regression model obtained in this study could classify 97% of the cases accurately. The remaining 3% was due to factors not included in the model and, naturally, to evaluation errors that caused incorrect classifications. These misclassifications resulted from having a limited number of attributes for developing the structuredness measure, and better accuracy values can be achieved with more attributes. Moreover, having at most 43 nodes in the visual dataset can also be considered a limitation of this research.

On the other hand, the findings of the e-commerce visitor journey analysis were valid for the dataset used, and different datasets may yield different results. It is possible to find new enriched activity types with different datasets, and this can be a threat to the validity of the research.

5.4. Future Research Directions

In the future, we are planning to support our findings using different datasets. At this point, an improvement is needed to enhance the methodology to obtain a higher data representation rate. In addition, in the analysis, we used limited attributes from the dataset. By utilizing attributes such as duration, brand, and price, our understanding of e-commerce visitor journeys can be further enhanced. Finally, we plan to utilize customer behaviors in this research to make online product recommendations.

Author Contributions

Conceptualization, B.T., B.O., and O.D.; methodology, B.T.; software, B.T.; validation, B.O. and O.D.; formal analysis, B.T.; investigation, B.O. and O.D.; data curation, B.T.; writing—original draft preparation, B.T.; writing—review and editing, B.O. and O.D.; visualization, B.T.; supervision, B.O. and O.D.; project administration, B.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data supporting the conclusions of this article can be made available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

CR	Conversion Rate
DFG	Directly Follows Graph
RQ	Research Question
SEPM	Sequential Event Pattern Mining
HRNNs	Hierarchical Recurrent Neural Networks
HNNs	Hopfield Neural Networks
QoS	Quality of Service
LTL	Linear Temporal Logic
CPA-PM	Cloud Pattern API-Process Mining
CNC	Coefficient of Network Connectivity
CNCX	Coefficient of Network Connectivity Excluding Self Loops
ACD	Average Connector Degree
7PMG	Seven Process Modeling Guidelines
BPMN	Business Process Modeling Notation
IoT	Internet of Things
ERP	Enterprise Resource Planning
CRM	Customer Relationship Management
KPI	Key Performance Indicator
AUC	Area Under Curve
MAP	Mean Average Precision

Appendix A

A summary of the literature for process mining usage in e-commerce is given in Table A1 and Table A2.

Table A1. Overview of process mining research in e-commerce.

#	Reference	Main Focus	Contribution	Findings
1	Poggi et al. [9]	Process discovery	Knowledge-based Miner algorithm which was also capable of using prior knowledge; first e-commerce research with process mining; and transforming URLs to activities	General customer behavior with different algorithms
2	Padidem and Nalini [46]	Process discovery	Four shopper types assumed, expected behaviors guessed and then two of the types were discovered as processes; expected shopper types verified and behavior of shopper types analyzed
3	Ghavamipoor et al. [47]	Quality of Service (QoS) sensitive customer behavior model graph discovery	Proposed a Customer Behavior Model Graph that was sensitive to the QoS provided to customers during their navigation to formulate QoS-aware offers to them	Customer Behavior Model Graphs for buyer and visitor customer types were analyzed
4	Hernández et al. [48]	Analysis of customers’ purchasing behavior	LTL-based model checking approach to analyze customer behavior with declarative modeling was developed	Purchasing rates were found to be differing for different categories
5	Terragni and Hassani [49]	Analyzing customer journeys to make recommendations	Used process mining on the web logs to explore the customer journey; predicted their activities and recommended actions that maximize particular KPIs	Attribute (mobile and non-mobile) based customer journeys discovered and analyzed
6	Terragni and Hassani [50]	Analyzing customer journeys to make recommendations	Data-driven customer journey mapping and recommendations using the mapped journey	General customer journey map obtained
7	Goossens et al. [51]	Order aware recommendations	Used a process model to do predictions and recommendations within the customer journey; explicitly used the order of events during predictions and recommendations and optimized recommendations for any chosen KPI	No information
8	Filipowska et al. [52]	Usability	Proposed a model for improving usability of the website taking into account dynamic aspects of user’s activity on the portal	No information
9	Nguyen et al. [53]	Customer Journey Management	Proposed an approach for designing and deploying a customer journey management system	No information
10	El-Gharib and Amyot [54] (2022)	Data preprocessing	A data preprocessing method (CPA-PM) for event logs generated by cloud-based information systems, with an emphasis on clickstream data	No information

Table A2. Details of process mining research in e-commerce.

#	Industry	# of Events	Tool	Notation	Algorithm	Prediction	Recommendation	Metrics
1	Online Travel and Booking	4 million+	Business Process Insights (BPI)	DFG	Knowledge-based Miner, Heuristic Miner, Fuzzy miner	No	No	None
2	No information	No information	No information	Petri Nets	No information	No	No	None
3	Supermarket	200,000	ProM	Dependency Frequency Diagram	Heuristic Miner	Transition probabilities calculated	No	Average Absolute Error
4	Gift Shop	8,607,625	Model Checker	Declarative Model	No information	Behavioral patterns occurrence	No	No information
5	Advertising	10 million	Disco	DFG	Fuzzy Miner	Next activity prediction	Product pages recommended to the visitors	AUC (Area Under Curve)
6	Advertising	10 million	Disco	DFG	Fuzzy Miner	No	Product pages recommended to the visitors	MAP@5 and AUC
7	Online Ticket Sales	141,510	Disco	DFG	Fuzzy Miner	Next activity prediction	Recommendations were made for new sales to customers	F1 Score
8	No information	76,975	ProM	No information	No information	No	No	No information
9	Software	18,077	PM4Py	Petri Nets	Fuzzy Miner, Alpha Algorithm, Heuristic Miner, Inductive Miner	No	No	Fitness and Simplicity
10	Software	1,602,438 and 2,144,210	Disco and ProM	DFG	Fuzzy Miner	No	No	None

Appendix B

In this paper, we developed a measure for assessing the structuredness of process models in DFG notation. The proposed measure was based on the evaluations of DFGs by five process analysis experts and obtained with these steps:

Step 1: Visual Dataset Creation

To create a DFG, the main input is a transition matrix that is built using event logs. Disregarding the repeating number of loops between the nodes in a control-flow diagram and including self-loops, there can be at most n^{2} distinct process diagrams with n nodes. For a 10-step process, the number of possible process diagrams is one hundred. From a practical point of view, a diagram with fewer than three nodes was not considered a process, and creating process diagrams using more than forty-three nodes caused algorithm performance problems. Thus, the range for nodes in the dataset was specified as [3, 43].

Starting from three nodes, the process diagram generation algorithm randomly assigned the number of arcs and created the transition matrices randomly. Then, process diagrams were created with this information, and the numbers of nodes, arcs, and self-loops were saved in a spreadsheet. For simplification purposes, standardized activity names such as Activity1 and Activity2 were used, the thickness was the same for all arcs, artificial start/end nodes were not included, and the diagram direction was from left to right, which was the reading direction for all the experts in this study. At the end of this step, 3000 process diagrams were randomly created using Python 3.10 and Graphviz [78].
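The generation step can be illustrated with a minimal sketch. It is not the script used in the study: the function names, the uniform choice of the arc count, and the 0/1 encoding of the transition matrix are assumptions made for illustration, and the rendering of each matrix with Graphviz is omitted.

```python
import random


def random_transition_matrix(n_nodes, rng):
    """Return an n x n 0/1 matrix with a randomly chosen number of arcs."""
    cells = [(i, j) for i in range(n_nodes) for j in range(n_nodes)]
    # Assumption: the arc count is drawn uniformly from 1..n^2.
    n_arcs = rng.randint(1, len(cells))
    matrix = [[0] * n_nodes for _ in range(n_nodes)]
    for i, j in rng.sample(cells, n_arcs):
        matrix[i][j] = 1  # arc from Activity(i+1) to Activity(j+1)
    return matrix


def describe(matrix):
    """Count the quantities that were saved for each generated diagram."""
    n = len(matrix)
    arcs = sum(sum(row) for row in matrix)
    self_loops = sum(matrix[i][i] for i in range(n))
    return {"nodes": n, "arcs": arcs, "self_loops": self_loops}


rng = random.Random(42)  # fixed seed so the sketch is repeatable
rows = []
for _ in range(5):  # the study generated 3000 diagrams; 5 keeps the sketch small
    n = rng.randint(3, 43)  # node range [3, 43] as stated in the text
    rows.append(describe(random_transition_matrix(n, rng)))
```

Each entry of `rows` corresponds to one line of the spreadsheet described above.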

Step 2: Preparation and Scale Design of the Evaluation Criteria

The evaluation criteria were distilled from the articles in the literature review section. The adjectives structured/unstructured, simple/complex, spaghetti-like, and easy/hard to understand were identified as the guiding criteria together with the following questions:

Can you follow the flow in the diagram and read it as a process?
How much effort is needed to turn the process diagram into a more structured one?
As a process analysis expert, is this diagram acceptable from a customer-centric perspective?

The outcome of the evaluation process was to have the diagrams in the visual dataset classified as either “Structured” or “Unstructured”. Initially, a pilot classification was made with a 0–1 (Structured/Unstructured) scale. Since it turned out to be a challenging task to classify some of the process diagrams with this scale, a Structured–Semi-structured–Unstructured continuum [11] was used as the idea from which to create the scale below:

Structured: Without any hesitation, the process diagram can be tagged as structured.
More or Less Structured: There are a few parts complicating the process, and by easily lining these parts, the process can be turned into a structured process.
Rather Structured: It is possible to partially understand the flow; however, it is difficult to determine whether it is structured. If it is possible to make it structured with less effort according to the reviewer, then this tag is assigned.
Close to Unstructured: It is possible to partially understand the flow; however, it is difficult to determine whether it is structured. If the required effort is expected to be high by the reviewer, then this tag is assigned.
Unstructured: There is no block structure in the process; it is very difficult to follow the flow, and visually, it is spaghetti-like without any hesitation. From a practical point of view, this type of output is not acceptable by a customer.
Extremely Unstructured: It is impossible to follow or understand the flow. This type of output cannot be produced manually.

In this scale, the counterpart options are “Structured”/“Unstructured” and “Rather Structured”/“Close to Unstructured”. The reason for having the “More or Less Structured” option was to help the reviewer make a smooth transition from “Structured” to “Rather Structured”. Leaving the “Extremely Unstructured” option out, the suggested scale can also be used for evaluating manually created process diagrams.

Step 3: Two-step Dataset Evaluation

A target list of reviewers was prepared, twenty-three process experts were invited to the evaluation phase, and five of them attended dataset evaluation sessions. On average, the evaluators had 18.8 years of professional process analysis experience using various notations in process projects with customer involvement. The minimum experience was 10 years, and the maximum was 25 years.

First of all, evaluation criteria were provided to all reviewers, and the proposed scale was explained. After examining the evaluation criteria, the complete dataset was rated by one of the process experts with the proposed scale. Pauses or breaks were allowed during the review process. The reviewer had the options to review the evaluation criteria again, skip any of the process diagrams to evaluate later on and change any of the ratings when needed. At the end of the first evaluation step, a rating was obtained for all of the process diagrams.

Following the complete evaluation of the synthetic process diagrams, the ratings were converted to a 0–1 scale. The first three options in the scale were considered “Structured”, and the remaining options were converted to “Unstructured” tags. Then, a classification model was built using logistic regression [14]. Evaluations on structuredness of process diagrams in DFG notation were used as the dependent variable for this classification problem together with the following input variables, which were either obtained during dataset creation or calculated using other variables:

Number of Nodes (Size);
Number of Arcs;
Total Number of Elements (including all nodes and arcs);
Number of Self Loops (Arcs starting and ending at the same node);
Percentage of All Possible Journeys (Density);
Arcs per Node (CNC: Coefficient of Network Connectivity);
Arcs per Node Excluding Self Loops (CNCX: Coefficient of Network Connectivity Excluding Self Loops).
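The input variables listed above can be derived from a transition matrix as in the following sketch. The function name and dictionary keys are hypothetical, and the density formula (arcs divided by the n^{2} possible arcs) is an assumption, since the text does not spell it out. The CNCX formula matches the worked examples given later in this appendix.

```python
def dfg_features(matrix):
    """Compute the candidate input variables from a 0/1 transition matrix."""
    n = len(matrix)
    arcs = sum(1 for i in range(n) for j in range(n) if matrix[i][j])
    self_loops = sum(1 for i in range(n) if matrix[i][i])
    return {
        "size": n,                        # Number of Nodes
        "arcs": arcs,                     # Number of Arcs
        "total_elements": n + arcs,       # nodes plus arcs
        "self_loops": self_loops,         # arcs starting and ending at the same node
        "density": arcs / (n * n),        # assumption: share of the n^2 possible arcs
        "cnc": arcs / n,                  # Arcs per Node
        "cncx": (arcs - self_loops) / n,  # Arcs per Node Excluding Self Loops
    }


# Small illustrative matrix: 3 nodes, 4 arcs, one of which is a self-loop.
example = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
]
features = dfg_features(example)
```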

Using the probabilities that the logistic regression model calculated for each diagram, a subset of the visual dataset was created. First, the complete dataset was sorted in ascending order of these probabilities, and for each decile, twenty images were randomly selected. Thus, a sample of two hundred process diagrams was obtained. After that, the process diagrams in the sample dataset were rated by the other reviewers with the same guidelines. When the ratings were completed, they were converted to a 0–1 (Structured/Unstructured) scale with the same approach as in the first evaluation. At this step, there were five evaluations for each diagram in the sample. Final classifications were calculated by taking the maximum of the evaluations. For example, when four evaluators rated a diagram as “Structured” and one of the evaluators rated the same diagram as “Unstructured”, the diagram was tagged as “Unstructured” because max[0, 0, 0, 0, 1] = 1.
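The rating conversion, the maximum-based aggregation, and the decile sampling described in this step can be sketched as follows. The function names are hypothetical, the scores are random placeholders rather than data from the study, and splitting the sorted list into ten equal slices is one possible reading of "for each decile".

```python
import random

SCALE = [
    "Structured",
    "More or Less Structured",
    "Rather Structured",
    "Close to Unstructured",
    "Unstructured",
    "Extremely Unstructured",
]


def to_binary(rating):
    """First three scale options -> 0 (Structured), the rest -> 1 (Unstructured)."""
    return 0 if SCALE.index(rating) < 3 else 1


def final_label(ratings):
    """A diagram is Unstructured (1) if any evaluator rated it so."""
    return max(to_binary(r) for r in ratings)


def decile_sample(probabilities, per_decile, rng):
    """Pick `per_decile` item indices at random from each decile of the sorted scores."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i])
    size = len(order) // 10
    chosen = []
    for d in range(10):
        end = (d + 1) * size if d < 9 else len(order)
        chosen.extend(rng.sample(order[d * size:end], per_decile))
    return chosen


rng = random.Random(0)
scores = [rng.random() for _ in range(3000)]  # placeholder probabilities
sample = decile_sample(scores, 20, rng)       # 10 deciles x 20 diagrams
label = final_label(["Structured"] * 4 + ["Unstructured"])
```

With four "Structured" ratings and one "Unstructured" rating, `label` is 1, matching the max[0, 0, 0, 0, 1] example in the text.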

Step 4: Classification Model Training and Evaluation

In the model training phase, K-fold Cross Validation was applied [42]. The results were evaluated with the confusion matrix and classification matrix [79], and the significance of the input variables was tested. Since expert evaluations resulted in imbalanced groups, weighting was used to improve the results. The model was built by adding one variable at a time, a procedure known as Forward Stepwise Selection [42]. Variables were added, and the weighted averages of the classification metrics were recalculated, until the average started decreasing. Wald’s test was applied to determine the significant variables. The results are given in Table A3.

Table A3. Results of logistic regression.

Input Variable	Wald’s Test	p-Value	Coefficient
(Intercept)	320.453	<0.001	−4.993
CNCX	264.367	<0.001	1.632
Size	36.838	<0.001	0.072
Number of Self Loops	0.67	0.682	Insignificant

As a result, a model was developed that can be used as a practical measure to assess the structuredness of a process in DFG notation. Equation (A1) can be used to assess the structuredness of a process diagram in DFG notation:

y = -4.993 + 1.632 \, \mathrm{CNCX} + 0.072 \, \mathrm{Size}

(A1)

By substituting Equation (A1) into the logistic function, Equation (A2) was obtained. As mentioned earlier, this equation returns values in the range (0, 1), and 0.5 is the breakpoint for making the classifications with this binary measure.

p(X) = \frac{1}{1 + e^{-(-4.993 + 1.632 \, \mathrm{CNCX} + 0.072 \, \mathrm{Size})}}

(A2)

The logistic regression model obtained in this study could classify 97% of the cases accurately. The remaining 3% were due to factors not included in the model, and naturally, there were evaluation errors that caused incorrect classifications.

As an example, a diagram with 14 nodes and 29 arcs, one of which is a self-loop, has CNCX = 2 and Size = 14, which gives p(X) = 0.327. Since 0.5 is the crossover point in logistic regression and 0.327 is below it, the diagram in this example is structured. On the other hand, a diagram with 9 nodes and 29 arcs, one of which is a self-loop, has CNCX = 3.11 and Size = 9, which gives p(X) = 0.675; it is unstructured because this value is greater than 0.5. Increasing the number of arcs by one for the latter example increases p(X) to 0.713.
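The two worked examples can be reproduced with a short sketch of Equations (A1) and (A2). The coefficients are taken from Table A3; the function names are hypothetical, and treating a value of exactly 0.5 as "Structured" is an assumption, since the text only names 0.5 as the breakpoint.

```python
import math

# Coefficients from Table A3.
INTERCEPT, B_CNCX, B_SIZE = -4.993, 1.632, 0.072


def structuredness(nodes, arcs, self_loops):
    """Equation (A2): probability that a DFG is unstructured."""
    cncx = (arcs - self_loops) / nodes               # arcs per node, excluding self-loops
    y = INTERCEPT + B_CNCX * cncx + B_SIZE * nodes   # Equation (A1)
    return 1.0 / (1.0 + math.exp(-y))


def classify(p, threshold=0.5):
    """Assumption: values strictly above the threshold count as Unstructured."""
    return "Unstructured" if p > threshold else "Structured"


p1 = structuredness(nodes=14, arcs=29, self_loops=1)  # first example
p2 = structuredness(nodes=9, arcs=29, self_loops=1)   # second example
print(round(p1, 3), classify(p1))  # 0.327 Structured
print(round(p2, 3), classify(p2))  # 0.675 Unstructured
```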

In addition to the hypothetical examples provided above, Figure 1 and diagrams from a study on spaghetti-like processes were also evaluated with the proposed structuredness measure [80]. For Figure 1, the structuredness value was 0.995, which means that it was not understandable to the process analysis experts who participated in this study. On the second page of the sample process mining study mentioned above, process diagrams for the complete log and for the clusters obtained using trace clustering are given. Visual examination revealed that the spaghetti-ness of the resulting process diagrams decreased after clustering. With the help of the proposed structuredness measure, process mining researchers and practitioners can decide whether the results are at an acceptable structuredness level. For demonstration purposes, the structuredness of these process diagrams was calculated. The numbers of nodes, arcs, and self-loops for each diagram were counted and plugged into Equations (A1) and (A2). Structuredness for the diagram obtained with the complete log was 0.999, and for the diagrams of clusters 1, 2, and 3, the values were 0.866, 0.991, and 0.987, respectively. It can be concluded that there is improvement in terms of structuredness; however, there is still room for enhancement, as all of the values are above 0.5, meaning that the results of trace clustering are unstructured processes.

Appendix C

For data enrichment purposes, the following activity types were identified:

Changing Category: All “view” activities after an event with a different category than the previous event were converted to “Change-Category-and-View” activity. In this way, visitors moving to another category were identified.
Adding to Cart: As the “cart” activity type did not cover the actual visitor experience, adding a product to the cart after viewing that specific product page was named “View-and-Cart”. If the visitor added a product without viewing its product page, then this activity was considered “Cart-from-List”, meaning that after searching, the visitor directly added that product from a search list or used a similar functionality.
Removing Carted Products: After examining the dataset, it was observed that in some cases, the user was removing a product that was not added to the cart in that session. These activities were renamed “Remove-Previously-Carted”. If a product was added to the cart and removed in the same session, no modifications were needed.
Purchasing: Similar to removing from cart, it was observed that some products were purchased that were not added to cart in a specific session. These activities were renamed “Purchase-Previously-Carted”. If a product was added to the cart and purchased in the same session, no modifications were needed.

An example clarifying the algorithm for detecting events related to previous sessions is given in Table 3. In this table, the events of Case “bcda45” are listed in chronological order. The first event is removing product “12345” from the cart. As this is the first event, we conclude that this product was added to the cart in a previous session. Therefore, the activity type was changed to “Remove-Previously-Carted”. Then, three products were added to the cart without any “view” activities. Because of this, the activities in the next three events were converted to “Cart-from-List”. Then, product “12347”, which had been added to the cart in the active session, was removed from the cart. Thus, no changes were needed. In the sixth event, product “12348” was removed from the cart. As no product with this ID had been added to the cart earlier in this session, the activity type was set to “Remove-Previously-Carted”. The case ended with two purchase events. The first purchase was for product “12346”, which was added to the cart in the active session, so no changes were needed. However, the second purchase was related to product “12349”, which was not added to the cart in the active session. The last change therefore involved setting the activity type to “Purchase-Previously-Carted” for the last event.
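The cart-related enrichment rules walked through above can be sketched as follows. This is an illustration, not the code used in the study: the raw activity names (`view`, `cart`, `remove_from_cart`, `purchase`) and the function name are assumptions, "after viewing" is read here as "viewed at any earlier point in the session", the “Change-Category-and-View” rule is omitted because it needs a category field, and product ID "12350" is a hypothetical placeholder for the third carted product, which the walkthrough does not name.

```python
def enrich(events):
    """Relabel one session's events; `events` is a chronological list of
    (activity, product_id) tuples."""
    viewed, carted = set(), set()  # products viewed / carted in this session
    enriched = []
    for activity, product in events:
        new_activity = activity
        if activity == "view":
            viewed.add(product)
        elif activity == "cart":
            new_activity = "View-and-Cart" if product in viewed else "Cart-from-List"
            carted.add(product)
        elif activity == "remove_from_cart" and product not in carted:
            # Removed without being carted in this session.
            new_activity = "Remove-Previously-Carted"
        elif activity == "purchase" and product not in carted:
            # Purchased without being carted in this session.
            new_activity = "Purchase-Previously-Carted"
        enriched.append(new_activity)
    return enriched


# Events of the example case, following the walkthrough in the text.
session = [
    ("remove_from_cart", "12345"),
    ("cart", "12346"),
    ("cart", "12347"),
    ("cart", "12350"),  # hypothetical ID, not given in the text
    ("remove_from_cart", "12347"),
    ("remove_from_cart", "12348"),
    ("purchase", "12346"),
    ("purchase", "12349"),
]
result = enrich(session)
```

Events that need no modification keep their raw activity name in `result`.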

Appendix D

The compositions of the journeys are given in Table A4. In the column header of the table, P stands for process; for example, P1 is the process-level journey category starting with “Stocked Cart” and ending with an “Order” and “Stocked Cart”. With this approach, each journey had a specific start and end. Table A4 also depicts the percentages of the original and enriched activities for each process.

Table A4. Details and percentages for process level journey categories*.

Node	P1	P2	P3	P4	P5	P6	P7	P8
START	✓	✓	✓	✓	✓	✓	✓	✓
Start with Unknown Cart	X	✓	X	✓	X	✓	X	✓
Start with Stocked Cart	✓	X	✓	X	✓	X	✓	X
VIEW	84%	81%	78%	83%	91%	99%	87%	98%
View	71%	63%	67%	68%	77%	89%	83%	86%
Change Category and View	73%	60%	60%	57%	75%	85%	78%	51%
CART	✓	✓	92%	✓	✓	✓	47%	✓
View and Cart	68%	75%	54%	72%	56%	69%	17%	51%
Cart from List	96%	87%	76%	78%	90%	82%	32%	63%
REMOVE	91%	64%	82%	51%	✓	25%	✓	✓
Remove from Cart	77%	64%	53%	51%	63%	25%	43%	✓
Remove from Previous Cart	82%	X	74%	X	✓	X	✓	X
PURCHASE	✓	✓	✓	✓	X	X	X	X
Purchase	99%	✓	88%	✓	X	X	X	X
Purchase Previously Carted	69%	X	82%	X	X	X	X	X
END	✓	✓	✓	✓	✓	✓	✓	✓
Exit with Unknown Cart	X	X	X	X	X	X	✓	✓
Exit with Stocked Cart	X	X	X	X	✓	✓	X	X
Order given Stocked Cart	✓	✓	X	X	X	X	X	X
Order given Unknown Cart	X	X	✓	✓	X	X	X	X

* X: Not seen in the process, ✓: Always seen in the process.

References

Available online: https://www.retaildogma.com/conversion-rate/ (accessed on 30 August 2024).
Available online: https://www.oberlo.com/statistics/average-ecommerce-conversion-rate (accessed on 31 August 2024).
Bucklin, R.E.; Sismeiro, C. Click here for Internet insight: Advances in clickstream data analysis in marketing. J. Interact. Mark. 2009, 23, 35–48.
Alawadh, M.; Barnawi, A. A Consumer Behavior Analysis Framework toward Improving Market Performance Indicators: Saudi’s Retail Sector as a Case Study. J. Theor. Appl. Electron. Commer. Res. 2024, 19, 152–171.
Schiffman, L.; O’Cass, A.; Paladino, A.; Carlson, J. Consumer Behaviour, 6th ed.; Pearson Higher Education: Melbourne, VIC, Australia, 2013; pp. 1–71.
Zhang, X.; Guo, F.; Chen, T.; Pan, L.; Beliakov, G.; Wu, J. A Brief Survey of Machine Learning and Deep Learning Techniques for E-Commerce Research. J. Theor. Appl. Electron. Commer. Res. 2023, 18, 2188–2216.
Stalidis, G.; Karaveli, I.; Diamantaras, K.; Delianidi, M.; Christantonis, K.; Tektonidis, D.; Katsalis, A.; Salampasis, M. Recommendation Systems for e-Shopping: Review of Techniques for Retail and Sustainable Marketing. Sustainability 2023, 15, 16151.
Al-Hasan, T.M.; Sayed, A.N.; Bensaali, F.; Himeur, Y.; Varlamis, I.; Dimitrakopoulos, G. From Traditional Recommender Systems to GPT-Based Chatbots: A Survey of Recent Developments and Future Directions. Big Data Cogn. Comput. 2024, 8, 36.
Poggi, N.; Muthusamy, V.; Carrera, D.; Khalaf, R. Business process mining from e-commerce web logs. In Proceedings of the Business Process Management: 11th International Conference, BPM 2013, Beijing, China, 26–30 August 2013; pp. 65–80.
Kakalejčík, L.; Bucko, J.; Vejačka, M. Differences in buyer journey between high-and low-value customers of e-commerce business. J. Theor. Appl. Electron. Commer. Res. 2019, 14, 47–58.
Van der Aalst, W.M.P. Process Mining: Data Science in Action, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2016; pp. 125–153, 387–420.
Halvorsrud, R.; Mannhardt, F.; Prillard, O.; Boletsis, C. Customer journeys and process mining–challenges and opportunities. In Proceedings of the ITM Web of Conferences, International Conference on Exploring Service Science (IESS 2.4), Brno, Czech Republic, 8–9 February 2024.
Van der Aalst, W.M.P.; Carmona, J. Process Mining Handbook, 1st ed.; Springer Nature: Berlin/Heidelberg, Germany, 2022; pp. 37–76, 125–153.
Hair, J.F.; William, C.B.; Barry, J.; Rolph, E.A. Multivariate Data Analysis, 7th ed.; Pearson: Edinburgh Gate, UK, 2009; pp. 313–341.
Moe, W.W. Buying, searching, or browsing: Differentiating between online shoppers using in-store navigational clickstream. J. Consum. Psychol. 2003, 13, 29–39.
Kumar, B.; Roy, S.; Sinha, A.; Iwendi, C.; Strážovská, L. E-Commerce Website Usability Analysis Using the Association Rule Mining and Machine Learning Algorithm. Mathematics 2022, 11, 25.
Kumar, A.; Singh, R.K. Web mining overview, techniques, tools and applications: A survey. Int. Res. J. Eng. Technol. 2016, 3, 1543–1547.
Mughal, M.J.H. Data mining: Web data mining techniques, tools and algorithms: An overview. Int. J. Adv. Comput. Sci. Appl. 2018, 9, 208–215.
Eirinaki, M.; Vazirgiannis, M. Web mining for web personalization. ACM Trans. Internet Technol. 2003, 3, 1–27.
Hu, X.; Cercone, N. A data warehouse/online analytic processing framework for web usage mining and business intelligence reporting. Int. J. Intell. Syst. 2004, 19, 585–606.
Rosário, A.; Raimundo, R. Consumer Marketing Strategy and E-Commerce in the Last Decade: A Literature Review. J. Theor. Appl. Electron. Commer. Res. 2021, 16, 3003–3024.
McDowell, W.C.; Wilson, R.C.; Kile, C.O., Jr. An examination of retail website design and conversion rate. J. Bus. Res. 2016, 69, 4837–4842.
Tueanrat, Y.; Papagiannidis, S.; Alamanos, E. Going on a journey: A review of the customer journey literature. J. Bus. Res. 2021, 125, 336–353.
Anderl, E.; Becker, I.; Wangenheim, F.V.; Schumann, J.H. Mapping the customer journey: A graph-based framework for online attribution modeling. SSRN 2014.
Park, Y.H.; Fader, P.S. Modeling browsing behavior at multiple websites. Mark. Sci. 2004, 23, 280–303.
Liu, F.; Wang, R.; Zhang, P.; Zuo, M. A Typology of Online Window Shopping Consumers. In Proceedings of the PACIS, Ho Chi Minh City, Vietnam, 11–15 July 2012; p. 128.
Wei, J.; Shen, Z.; Sundaresan, N.; Ma, K.L. Visual cluster exploration of web clickstream data. In Proceedings of the 2012 IEEE Conference on Visual Analytics Science and Technology (VAST), Seattle, WA, USA, 14–19 October 2012; pp. 3–12.
Schellong, D.; Kemper, J.; Brettel, M. Clickstream Data as a Source to Uncover Consumer Shopping Types in a Large-Scale Online Setting. In Proceedings of the Twenty-Fourth European Conference on Information Systems (ECIS) AISeL (2016), Istanbul, Turkiye, 15 June 2016.
Raphaeli, O.; Goldstein, A.; Fink, L. Analyzing online consumer behavior in mobile and PC devices: A novel web usage mining approach. Electron. Commer. Res. Appl. 2017, 26, 1–12.
Park, C.H. Online purchase paths and conversion dynamics across multiple websites. J. Retail. 2017, 93, 253–265.
Lin, W.; Milic-Frayling, N.; Zhou, K.; Ch’ng, E. Predicting outcomes of active sessions using multi-action motifs. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, Thessaloniki, Greece, 14–17 October 2019; pp. 9–17.
Kanaan, M.; Cazabet, R.; Kheddouci, H. Temporal pattern mining for e-commerce dataset. Trans. Large-Scale Data Knowl.-Centered Syst. 2020, XLVI, 67–90.
Li, J.; Abbasi, A.; Cheema, A.; Abraham, L.B. Path to purpose? How online customer journeys differ for hedonic versus utilitarian purchases. J. Mark. 2020, 84, 127–146.
Ho, H.F. A novel approach for exploring channel dependence of consumers’ latent shopping intent and the related behaviors by visualizing browsing patterns. Data Technol. Appl. 2021, 55, 715–733.
Kukar-Kinney, M.; Scheinbaum, A.C.; Orimoloye, L.O.; Carlson, J.R.; He, H. A model of online shopping cart abandonment: Evidence from e-tail clickstream data. J. Acad. Market. Sci. 2022, 50, 961–980.
Hammer, M. The Agenda: What Every Business Must Do to Dominate the Decade, 1st ed.; Crown Business: New York, NY, USA, 2003; pp. 1–68.
IEEE Task Force on Process Mining. Process Mining Manifesto. In Proceedings of the BPM 2011 International Workshops, Clermont-Ferrand, France, 29 August 2011; pp. 169–194.
Garcia, C.d.S.; Meincheim, A.; Junior, E.R.F.; Dallagassa, M.R.; Sato, D.M.V.; Carvalho, D.R.; Santos, E.A.P.; Scalabrin, E.E. Process mining techniques and applications–A systematic mapping study. Expert Syst. Appl. 2019, 133, 260–295.
Lorenz, R.; Senoner, J.; Sihn, W.; Netland, T. Using process mining to improve productivity in make-to-stock manufacturing. Int. J. Prod. Res. 2021, 59, 4869–4880.
Pereira Detro, S.; Santos, E.A.P.; Panetto, H.; Loures, E.D.; Lezoche, M.; Cabral Moro Barra, C. Applying process mining and semantic reasoning for process model customisation in healthcare. Enterp. Inf. Syst. 2019, 14, 983–1009.
Teinemaa, I.; Dumas, M.; Rosa, M.L.; Maggi, F.M. Outcome-oriented predictive process monitoring: Review and benchmark. ACM Trans. Knowl. Discov. Data 2019, 13, 1–57.
James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning, 1st ed.; Springer: New York, NY, USA, 2013; pp. 130–210.
Sharma, S. Applied Multivariate Techniques, 1st ed.; John Wiley & Sons, Inc.: Upper Saddle River, NJ, USA, 1995; pp. 317–342.
Wilbik, A.; Kaymak, U. Linguistic Summarization of Processes–a research agenda. In Proceedings of the 2015 Conference of the International Fuzzy Systems Association and the European Society for Fuzzy Logic and Technology (IFSA-EUSFLAT-15), Gijón, Spain, 30 June–3 July 2015; pp. 1636–1643.
Grisold, T.; Kremser, W.; Mendling, J.; Recker, J.; Vom Brocke, J.; Wurm, B. Generating impactful situated explanations through digital trace data. J. Inf. Technol. 2024, 39, 2–18.
Padidem, D.K.; Nalini, D.C. Process Mining Approach to Discover Shopping Behavior Process Model In Ecommerce Web Sites Using Click Stream Data. Int. J. Civ. Eng. Technol. 2017, 8, 948–955.
Ghavamipoor, H.; Hashemi Golpayegani, S.A.; Shahpasand, M. A QoS-sensitive model for e-commerce customer behavior. J. Res. Interact. Mark. 2017, 11, 380–397.
Hernández, S.; Álvarez, P.; Fabra, J.; Ezpeleta, J. Using Model Checking to Identify Customers Purchasing Behaviour in an E-Commerce. In Proceedings of the ATAED@ Petri Nets/ACSD, Zaragoza, Spain, 25–30 June 2017; pp. 158–164.
Terragni, A.; Hassani, M. Analyzing customer journey with process mining: From discovery to recommendations. In Proceedings of the 2018 IEEE 6th International Conference on Future Internet of Things and Cloud (FiCloud), Barcelona, Spain, 6–8 August 2018; pp. 224–229.
Terragni, A.; Hassani, M. Optimizing customer journey using process mining and sequence-aware recommendation. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, Limassol, Cyprus, 8–12 April 2019; pp. 57–65.
Goossens, J.; Demewez, T.; Hassani, M. Effective steering of customer journey via order-aware recommendation. In Proceedings of the 2018 IEEE International Conference on Data Mining Workshops (ICDMW), Singapore, 17–20 November 2018; pp. 828–837.
Filipowska, A.; Kałużny, P.; Skrzypek, M. Improving user experience in e-commerce by application of process mining techniques. Zesz. Nauk. Politech. Częstochowskiej Zarządzanie 2019, 33, 30–40.
Nguyen Chan, N.; Nguyen Vo, D.L.; Pham-Nguyen, C.; Le Dinh, T.; Dam, N.A.K.; Pham Thi, T.T.; Vu Thi, M.H. Design and deployment of a customer journey management system: The CJMA approach. In Proceedings of the 5th International Conference on Future Networks & Distributed Systems, Dubai, United Arab Emirates, 15–16 December 2021; pp. 8–16.
El-Gharib, N.M.; Amyot, D. Data preprocessing method and API for mining processes from cloud-based application event logs. Algorithms 2022, 15, 180.
Augusto, A.; Conforti, R.; Dumas, M.; La Rosa, M.; Bruno, G. Automated discovery of structured process models from event logs: The discover-and-structure approach. Data Knowl. Eng. 2018, 117, 373–392.
Diamantini, C.; Genga, L.; Potena, D. Behavioral process mining for unstructured processes. J. Intell. Inf. Syst. 2016, 47, 5–32.
Veiga, G.M.; Ferreira, D.R. Understanding spaghetti models with sequence clustering for ProM. In Proceedings of the Business Process Management Workshops: BPM 2009 International Workshops, Ulm, Germany, 7 September 2009; pp. 92–103.
Augusto, A.; Mendling, J.; Vidgof, M.; Wurm, B. The connection between process complexity of event sequences and models discovered by process mining. Inf. Sci. 2022, 598, 196–215.
Van der Aalst, W.M.P. Process mining: Overview and opportunities. ACM Trans. Manag. Inf. Syst. 2012, 3, 7.1–7.17.
Van der Aalst, W.M.P. Process mining: Discovering and improving Spaghetti and Lasagna processes. In Proceedings of the 2011 IEEE Symposium on Computational Intelligence and Data Mining (CIDM), Paris, France, 11–15 April 2011; pp. 1–7.
Mendling, J.; Reijers, H.A.; Cardoso, J. What makes process models understandable? In Proceedings of the Business Process Management: 5th International Conference, BPM 2007, Brisbane, Australia, 24–28 September 2007; pp. 48–63.
Reijers, H.A.; Mendling, J. A study into the factors that influence the understandability of business process models. IEEE Trans. Syst. Man. Cybern. Part A Syst. Hum. 2010, 41, 449–462.
Mendling, J.; Reijers, H.A.; Van der Aalst, W.M.P. Seven process modeling guidelines (7PMG). Inf. Softw. Technol. 2010, 52, 127–136.
Martin, N.; Fischer, D.A.; Kerpedzhiev, G.D.; Goel, K.; Leemans, S.J.; Röglinger, M.; Aalst, W.M.P.v.; Dumas, M.; Rosa, M.L.; Wynn, M.T. Opportunities and challenges for process mining in organizations: Results of a Delphi study. Bus. Inf. Syst. Eng. 2021, 63, 511–527.
Calders, T.; Günther, C.W.; Pechenizkiy, M.; Rozinat, A. Using minimum description length for process mining. In Proceedings of the 2009 ACM Symposium on Applied Computing, Honolulu, HI, USA, 12 March 2009; pp. 1451–1455.
Dumas, M.; La Rosa, M.; Mendling, J.; Mäesalu, R.; Reijers, H.A.; Semenenko, N. Understanding business process models: The costs and benefits of structuredness. In Proceedings of the Advanced Information Systems Engineering: 24th International Conference, CAiSE 2012, Gdansk, Poland, 25–29 June 2012; pp. 31–46.
Sánchez-González, L.; Ruiz, F.; García, F.; Piattini, M. Improving quality of business process models. In Proceedings of the Evaluation of Novel Approaches to Software Engineering: 6th International Conference, ENASE 2011, Beijing, China, 8–11 June 2011; pp. 130–144.
Avila, D.T.; dos Santos, R.I.; Mendling, J.; Thom, L.H. A systematic literature review of process modeling guidelines and their empirical support. Bus. Process Manag. J. 2020, 27, 1–23.
Vidgof, M.; Wurm, B.; Mendling, J. The Impact of Process Complexity on Process Performance: A Study Using Event Log Data. Lect. Notes Comput. Sci. 2023, 14159, 413–429.
Marin-Castro, H.M.; Tello-Leal, E. Event log preprocessing for process mining: A review. Appl. Sci. 2021, 11, 10556.
Van Zelst, S.J.; Mannhardt, F.; de Leoni, M.; Koschmider, A. Event abstraction in process mining: Literature review and taxonomy. Granul. Comput. 2020, 6, 719–736.
Bose, R.J.C.; Van der Aalst, W.M.P. Context aware trace clustering: Towards improving process mining results. In Proceedings of the 2009 SIAM International Conference on Data Mining, Denver, CO, USA, 6–8 July 2009; pp. 401–412.
Fang, H.; Liu, W.; Wang, W.; Zhang, S. Discovery of process variants based on trace context tree. Connect. Sci. 2023, 35, 2190499.
Kaggle. Ecommerce Events History in Cosmetics Shop. Kaggle Dataset, Provided by the REES46 Marketing Platform (Kechinov, M.). Available online: https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop (accessed on 15 December 2022).
Liu, B.; Zhang, H.; Kong, L.; Niu, D. Factorizing historical user actions for next-day purchase prediction. ACM Trans. Web 2021, 16, 1–26.
McInerney, J.; Lacker, B.; Hansen, S.; Higley, K.; Bouchard, H.; Gruson, A.; Mehrotra, R. Explore, exploit, and explain: Personalizing explainable recommendations with bandits. In Proceedings of the 12th ACM Conference on Recommender Systems, Vancouver, BC, Canada, 31 August 2018; pp. 31–39.
Diba, K.; Remy, S.; Pufahl, L. Compliance and performance analysis of procurement processes using process mining. In Proceedings of the International Conference on Process Mining, Aachen, Germany, 24–26 June 2019.
Ferreira, D.R. A primer on Process Mining: Practical Skills with Python and Graphviz, 2nd ed.; Springer International Publishing: Cham, Switzerland, 2017; pp. 1–93.
Luque, A.; Carrasco, A.; Martín, A.; de Las Heras, A. The impact of class imbalance in classification performance metrics based on the binary confusion matrix. Pattern Recognit. 2019, 91, 216–231.
Bose, R.J.C.; Van der Aalst, W.M.P. Trace clustering based on conserved patterns: Towards achieving better process models. In Proceedings of the Business Process Management Workshops: BPM 2009 International Workshops, Ulm, Germany, 7 September 2009; pp. 170–181.

Figure 1. E-commerce visitor journey discovered using process mining.

Figure 2. Proposed methodology.

Figure 3. Three levels of e-commerce visitor journeys.

Figure 4. Grouping and clustering variants using frequency and Levenshtein distance.

Figure 5. E-commerce visitor journeys and their shares.

Figure 6. End-to-end high-level e-commerce visitor journey.

Figure 7. Process-level structured e-commerce visitor journeys.

Table 1. Comparison of purchasing process structuredness continuum.

Consumer Purchase Behavior	Corporate Procurement Process
Unclear start and end	Clear start and end
Undefined process activities	Defined process activities
Flows in random order	Defined flow order
No rules	Defined rules
High number of process variants	Low number of process variants
Hard to understand process diagrams	Understandable process diagrams
Events are logged if possible	Events are mostly logged

Table 2. Sample event log.

Case ID	Activity	Timestamp	Price	Product
abcd45	Add to Cart	2024-08-08 13:05:01:034
abcd45	Add to Cart	2024-08-08 13:05:03:055	150
abcd45	Purchase	2024-08-08 13:05:04:077	210
abcd45	View	2024-08-08 13:05:05:066		Lipstick
bcda71	Add to Cart	2024-08-08 13:05:06:041	500
bcda71	Remove from Cart	2024-08-08 13:05:10:064	350
bcda71	Purchase	2024-08-08 13:05:11:094

Table 3. Example for detecting events related to previous sessions and cart status.

Case ID	Order	Activity Type	Product ID	New Activity Type
bcda45	1	Remove-from-Cart	12345	Remove-Previously-Carted
bcda45	2	Cart	12346	Cart-from-List
bcda45	3	Cart	12347	Cart-from-List
bcda45	4	Cart	12347	Cart-from-List
bcda45	5	Remove-from-Cart	12347	Remove-from-Cart
bcda45	6	Remove-from-Cart	12348	Remove-Previously-Carted
bcda45	7	Purchase	12346	Purchase
bcda45	8	Purchase	12349	Purchase-Previously-Carted
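
The relabeling of Remove-from-Cart and Purchase events in Table 3 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a remove or purchase event whose product was not carted earlier in the same session refers to a product carted in a previous session. The function name `enrich_session` and the mapping `PREVIOUSLY_CARTED_LABEL` are hypothetical, and the Cart-from-List relabeling of Cart events is deliberately left out because its rule is not visible in the table alone.

```python
# Hypothetical relabeling map; the target labels are the ones shown in Table 3.
PREVIOUSLY_CARTED_LABEL = {
    "Remove-from-Cart": "Remove-Previously-Carted",
    "Purchase": "Purchase-Previously-Carted",
}


def enrich_session(events):
    """Relabel one session's events.

    `events` is a list of (activity_type, product_id) pairs in session order.
    Assumption: a product that is removed or purchased without having been
    carted earlier in this session was carted in a previous session.
    """
    carted_in_session = set()
    enriched = []
    for activity, product_id in events:
        if activity == "Cart":
            # Remember that this product entered the cart during this session.
            carted_in_session.add(product_id)
        if activity in PREVIOUSLY_CARTED_LABEL and product_id not in carted_in_session:
            enriched.append(PREVIOUSLY_CARTED_LABEL[activity])
        else:
            # Cart events and in-session removes/purchases keep their label here.
            enriched.append(activity)
    return enriched


# The session from Table 3 (case bcda45), in event order.
TABLE_3_SESSION = [
    ("Remove-from-Cart", "12345"),
    ("Cart", "12346"),
    ("Cart", "12347"),
    ("Cart", "12347"),
    ("Remove-from-Cart", "12347"),
    ("Remove-from-Cart", "12348"),
    ("Purchase", "12346"),
    ("Purchase", "12349"),
]

labels = enrich_session(TABLE_3_SESSION)
for (activity, product_id), label in zip(TABLE_3_SESSION, labels):
    print(product_id, activity, "->", label)
```

A set is enough for this sketch because only the question "was this product carted in the current session?" matters. Tracking quantities would require a counter instead.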

Table 4. Categorization with start and end nodes.

Group	Start Node	End Node
1	Start-with-Stocked-Cart	Order-Given-Stocked-Cart
2	Start-with-Unknown-Cart	Order-Given-Stocked-Cart
3	Start-with-Stocked-Cart	Order-Given-Unknown-Cart
4	Start-with-Unknown-Cart	Order-Given-Unknown-Cart
5	Start-with-Stocked-Cart	Exit-with-Stocked-Cart
6	Start-with-Unknown-Cart	Exit-with-Stocked-Cart
7	Start-with-Stocked-Cart	Exit-with-Unknown-Cart
8	Start-with-Unknown-Cart	Exit-with-Unknown-Cart
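
The eight groups in Table 4 are the combinations of two start nodes and four end nodes. The sketch below rebuilds that numbering as a lookup table; the names `START_NODES`, `END_NODES`, `GROUPS`, and `group_of` are hypothetical helpers introduced only for illustration.

```python
from itertools import product

START_NODES = ["Start-with-Stocked-Cart", "Start-with-Unknown-Cart"]
END_NODES = [
    "Order-Given-Stocked-Cart",
    "Order-Given-Unknown-Cart",
    "Exit-with-Stocked-Cart",
    "Exit-with-Unknown-Cart",
]

# In Table 4 the end node changes every two rows and the start node alternates,
# so the end node is the outer (slow) factor of the product.
GROUPS = {
    (start_node, end_node): number
    for number, (end_node, start_node) in enumerate(
        product(END_NODES, START_NODES), start=1
    )
}


def group_of(start_node, end_node):
    """Return the Table 4 group number for a (start node, end node) pair."""
    return GROUPS[(start_node, end_node)]


print(group_of("Start-with-Unknown-Cart", "Exit-with-Stocked-Cart"))
```

Keeping the groups in a dictionary keyed by the node pair makes it straightforward to tag each session once its start and end nodes are known.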

Table 5. Percentages for behavior-level journeys.

Behavior	Share %	Median Duration (mins)	Joint Occurrence %
Exploitation	38.82	3.42	4.08
Exploration	28.21	6.45	16.69
Selection	24.46	4.04	67.88
Handpicking	2.01	1.79	51.60
Elimination	3.22	3.00	100.00
Cancellation	9.41	1.78	66.93
Replenishment	0.77	1.50	100.00
Purchase	5.27	2.35	53.47

Table 6. Session statistics for the discovered process-level journeys *.

Metric	P1	P2	P3	P4	P5	P6	P7	P8
Explained Cases %	35.68	60.31	69.94	73.10	64.63	81.18	57.48	43.87
Median Duration (minutes)	16.80	28.00	17.60	12.40	13.40	10.70	7.30	4.20
Structuredness %	12.05	14.62	10.26	12.25	8.49	10.36	26.07	20.28
Product Pages Visited	1.82	1.92	2.48	2.02	3.55	3.89	6.39	2.20
Product Categories Visited	1.86	1.80	2.00	1.64	2.54	2.62	3.80	1.00
Product Pages Visited per Category	0.98	1.06	1.24	1.23	1.40	1.49	1.68	2.20
Net Products Added to Cart in The Session	7.72	8.38	2.40	3.65	5.22	5.13	0.00	0.00
Net Cart Change	4.73	8.38	−0.43	3.65	−0.31	5.13	−7.47	0.00
Purchases from The Session	[0.00, 6.72]	4.47	2.40	3.65	0.00	0.00	0.00	0.00
Purchases from Previous Session(s)	[2.68, 9.40]	0.00	5.82	0.00	0.00	0.00	0.00	0.00
Purchase per Session	9.40	4.47	8.22	3.65	0.00	0.00	0.00	0.00
Products Stocked for Upcoming Session(s)	[1.00, 7.72]	3.92	0.00	0.00	5.22	5.13	0.00	0.00

* Start and end nodes characterize process-level journeys, which could be named accordingly. To keep the names short, we use codes (P1, P2, etc.) instead. The numbers in the codes do not imply any precedence relation.
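
As a reading aid for Table 6, the sketch below checks one relationship between its rows: Purchase per Session appears to equal the sum of Purchases from The Session and Purchases from Previous Session(s). This is an assumption inferred from the reported values, not a definition taken from the text. The helper `rows_add_up` is hypothetical, and the tolerance only absorbs floating-point and rounding noise.

```python
import math

# Values transcribed from Table 6 for P2..P8 (P1 is reported as intervals).
from_session = [4.47, 2.40, 3.65, 0.00, 0.00, 0.00, 0.00]
from_previous = [0.00, 5.82, 0.00, 0.00, 0.00, 0.00, 0.00]
per_session = [4.47, 8.22, 3.65, 0.00, 0.00, 0.00, 0.00]


def rows_add_up(first, second, totals, tol=0.005):
    """Check element-wise that first + second matches totals within tol."""
    return all(
        math.isclose(a + b, total, abs_tol=tol)
        for a, b, total in zip(first, second, totals)
    )


point_values_consistent = rows_add_up(from_session, from_previous, per_session)

# P1 is given as intervals. Assumption: the lower bound of one row pairs with
# the upper bound of the other, so both pairings should give the same total.
p1_session = (0.00, 6.72)
p1_previous = (2.68, 9.40)
p1_total = 9.40
p1_consistent = rows_add_up(p1_session, p1_previous[::-1], [p1_total, p1_total])

print(point_values_consistent, p1_consistent)
```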

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
