1. Introduction
As is the case with nearly all fields of study within the social sciences, much of the body of knowledge in consumer studies is based on statistical results from conventional methodological approaches, with regression procedures dominating the way researchers describe variable relationships and explain phenomena. Traditional regression techniques are designed to identify the marginal effects of factors specified and pre-selected on the basis of theory and the existing literature. These techniques have been refined over the past half-century, yet even with such advancements they remain limited in their explanatory power. Factors that might be related to an outcome of interest, but that have not been reported in the literature or deemed theoretically relevant, are generally excluded from analysis. As a result, the amount of explained variance across a wide number and variety of consumer studies outcomes is inevitably limited.
Big data analytical techniques, which tend to be atheoretical, have increasingly gained traction across the social sciences to acquire a deeper understanding of human attitudes and behaviors. Machine learning (ML)—a type of artificial intelligence application—is both a field of study and an umbrella term that describes algorithms that are built in such a way that hidden layers of information can be identified through a learning process based on training data and computational proofs. ML approaches are intended to supplement the role of researchers by showing that variables that might have once been discarded in previous studies or not included at all in an empirical analysis can add insight into describing and explaining outcomes.
The purpose of this study is to illustrate how ML can be used, from a consumer studies perspective, to improve data descriptions relative to a conventional regression approach. The outcome of interest is a household’s degree of financial preparedness as indicated by the presence of an emergency fund (i.e., a measure based on household liquidity). As discussed later in this paper, numerous researchers have examined factors associated with holding an emergency fund, explained the components of emergency savings, and predicted which households are most likely to meet liquidity ratio guidelines. A notable feature of much of the existing literature is that, regardless of the research purpose, analysts tend to use similar variables when describing and predicting household emergency funds. These variables have come to represent the basis of many consumer-focused financial recommendations. A cursory review of this literature suggests, however, that other variables, or relationships among variables, are needed to gain a more comprehensive understanding of consumer financial preparedness and to improve prediction rates.
When asked, financial service professionals, financial counselors, and financial educators tend to agree that managing household emergency funds involves the ongoing management of interacting variables. This is one reason why ecological systemic theory is prominently mentioned as a key explanatory model when emergency fund analyses are conducted at the household level [
1,
2]. As previously mentioned, much of the existing research has primarily sought to understand emergency funds within the confines of economic or financial theories using a delimited number of factors such as financial status or sociodemographic variables (e.g., [
3,
4]). While such studies have contributed positively to the literature by reinforcing existing theories and research findings, they may overlook the potential relevance of variables highly pertinent to how households manage emergency funds in practice. Methodologically, this signifies the need for an approach centered on pattern recognition and classification, as opposed to the identification of linear relationships upon which conventional studies have been based (e.g., [
3,
4,
5]). Consequently, the combination of ecological systemic theory, pattern recognition, and classification underscores the necessity to consider complex system science models [
6,
7]. Furthermore, in the context of the social sciences and economics, where complex system science models are gaining acceptance, there is a need for research in personal finance utilizing ML techniques [
6,
8].
This study adds to the existing literature in several important ways. First, it employs ML in the context of a consumer studies topic. While some prior attempts within the field have been made (e.g., [
9,
10,
11,
12,
13,
14,
15]), these efforts have been limited in their ability to compare various ML methods comprehensively. Another limitation is that some prior studies have relied on macro, rather than micro or household, data, which produce outcomes that are disconnected from a household’s actual financial management activities. Consequently, this study is one of the few initial attempts to explain emergency fund management by integrating various ML techniques at the household level.
Second, previous studies have been limited to the assessment of a few central variables, including financial factors and sociodemographic factors, when studying emergency funds (e.g., [
3,
4]); this study is more expansive. Specifically, the analyses conducted in this study relied on a diverse set of variables that align with the research objectives. For instance, in addition to financial and sociodemographic factors, this study introduces a broad array of variables, including financial education, psychological factors, COVID-19-related factors, distance to financial service providers, and types of loans. This approach aligns well with the strengths of ML, which are designed to enhance predictive capabilities by combining numerous variables when classifying and describing relationships [
16]. This study carries the potential to discover meaningful variables that have been previously unnoticed in existing research by supplementing ML predictions with additional variables potentially related to the management of emergency funds at the household level.
Third, as mentioned earlier, previous studies have typically assumed that variable relationships are linear, even when this assumption may not be practically relevant. Rather than rely on a linear assumption, this study is premised on pattern recognition and classification, distinct from models based on linear assumptions. Specifically, this study utilizes six ML algorithms as complex systems science models. While the six ML methods in this study have been widely used in empirical studies, their application in comparison to traditional linear assumption-based analytical methods is limited, particularly in relation to personal finance and consumer studies topics.
In summary, this paper contributes to the methodological literature in consumer studies by showing that when prediction is the main purpose of analysis (i.e., for use when making policy, creating education interventions, and advice giving), conventional analytical techniques may not always be the best solution. ML incorporating a larger set of variables that accounts for interactions between and among factors can offer a more robust and powerful way to increase predictive validity. In this regard, the research questions associated with this study are (a) What is the optimal ML algorithm to predict the presence of an emergency fund? (b) How do ML predictions perform when compared to a conventional logistic regression analysis? and (c) What are the most important factors associated with holding an emergency fund when viewed with an ML algorithm lens?
This study is organized into the following sections to answer these questions.
Section 2 includes a background discussion about emergency funds and the methodological background of ML.
Section 3 introduces the empirical model based on the background and methodological review.
Section 4 describes the data and measurements utilized in the ML and logistic models.
Section 5 illustrates the findings from each ML and the logistic model.
Section 6 discusses the results. This paper concludes by describing this study’s implications in
Section 7.
3. Empirical Model Flow
3.1. Research Purpose and Analysis Structure
The overarching purpose of this study was to determine which modeling technique offers the best prediction rate when describing the presence of an emergency fund. As noted above, this study employed and compared various ML algorithms. A four-step analytical process was used, and the steps are described below.
Step 1: Find the best parameters across the various ML algorithms
Multiple sub-algorithms exist within nearly all ML algorithms (Naïve Bayes is an exception). For instance, for kNN, either the Euclidean method or the Manhattan method can be used to measure distance. For Gradient Boosting, four sub-algorithms are widely used: categorical, Extreme, Extreme with random forest, and scikit-learn. In the case of SVM, the kernel can be assumed to be linear, polynomial, RBF, or sigmoid. Three sub-algorithms exist for SGD (i.e., elastic, lasso, and ridge). At this step of the analytical process, each sub-algorithm was tested. For the conventional analysis (i.e., logistic regression), three types of feature selection were utilized (i.e., all variables, forward stepwise selection, and backward stepwise selection).
In addition to sub-algorithms, each ML algorithm can be affected by internal settings (i.e., parameter settings). Based on the parameter setting, the same algorithm may exhibit different degrees of performance robustness [
80]. To account for this possibility, this study tested different parameters for each algorithm. For
kNN, the number of neighbors can affect classification performance. Therefore, different numbers of neighbors (i.e., from 1 to 100) were employed and compared to find the best tuning for the
kNN algorithm. Regarding Gradient Boosting, the learning rate may affect the algorithm’s performance. As such, various learning rate settings (i.e., 0.10, 0.15, 0.20, 0.25, and 0.30) were employed and compared to find the best application. For SVM, cost values are known to affect classification performance. To account for this, different cost values (i.e., 0.10, 1.00, 5.00, 10.00, 50.00, and 100.00) were employed and compared. For SGD, the learning rate may likewise affect performance, so various learning rate settings (i.e., 0.001, 0.005, 0.010, 0.050, and 1.000) were employed and compared. For NN, the number of neurons can change the algorithm’s performance. Therefore, different numbers of neurons (i.e., 1 and 5 through 100 in increments of 5) were utilized and compared to find the best performance outcome. As shown in
Figure 1 (Part A and Part B and Line a), the first step in the analysis involved selecting the best performing sub-algorithms and the best tuning for each algorithm.
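Step 1 can be sketched with scikit-learn on synthetic data (a hypothetical stand-in for the Orange 3 workflow and survey sample actually used), searching a kNN’s sub-algorithms (distance metrics) and tuning parameter (number of neighbors) jointly:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the survey data (the real study used 987 households).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Step 1: search sub-algorithms (distance metrics) and parameters (k = 1..100)
# jointly, scoring each combination by cross-validated AUC.
param_grid = {
    "metric": ["euclidean", "manhattan"],   # kNN sub-algorithms
    "n_neighbors": list(range(1, 101)),     # tuning parameter
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, scoring="roc_auc", cv=5)
search.fit(X, y)

best_metric = search.best_params_["metric"]
best_k = search.best_params_["n_neighbors"]
```

The same grid-search pattern applies to the learning rates, cost values, and neuron counts listed above; only the estimator and parameter names change.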
Step 2: Find the best ML prediction algorithm among the various ML algorithms
It is unrealistic to assume that any one ML algorithm will show dominant performance across all predictions and classifications. Rather, depending on the topic under study and the nature of the predictive dataset, different ML algorithms can be expected to show better or worse prediction and classification performance [
27]. Given the binary nature of the dependent variable in this study, various classification algorithms were selected, as explained above. As shown in part A with line b in
Figure 1, the second step in the analytical process involved finding the optimal ML algorithm from the selected six ML algorithms. The best prediction performance was selected as the most appropriate for use within the dataset.
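Step 2 can be illustrated in the same hedged way: score each tuned candidate algorithm by cross-validated AUC and keep the best performer. Here scikit-learn estimators and synthetic data stand in for the Orange 3 components and survey sample actually used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# The six candidate algorithms, each with illustrative (not the study's) tuning.
models = {
    "kNN": KNeighborsClassifier(n_neighbors=15),
    "Gradient Boosting": GradientBoostingClassifier(learning_rate=0.10, random_state=0),
    "SVM": SVC(C=1.0, kernel="rbf"),
    "SGD": SGDClassifier(penalty="elasticnet", learning_rate="constant", eta0=0.01, random_state=0),
    "Naive Bayes": GaussianNB(),
    "NN": MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0),
}

# Step 2: rank the tuned algorithms by cross-validated AUC and keep the best.
auc = {name: cross_val_score(m, X, y, scoring="roc_auc", cv=5).mean()
       for name, m in models.items()}
best_algorithm = max(auc, key=auc.get)
```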
Step 3: Check whether ML accuracies are higher than those offered by a conventional analysis
Even if a selected ML algorithm shows excellent performance relative to the other tested ML algorithms, it may still offer a lower level of predictive accuracy than a conventional analytical technique like logistic regression. Therefore, the third step involves comparing the prediction performance of the selected ML algorithm and the conventional analysis (see parts A and B with line b,
Figure 1).
Step 4: Determine which factors are associated with holding an emergency fund
Assuming the selected ML algorithm performs better than the conventional analysis, the rank of influential input factors can be derived by evaluating algorithm outcomes. This ranking may resemble, or differ from, the list of significant variables generated by a regression model. By checking the similarities and differences between the ranked influencing factors (ML algorithm) and the significant factors (logistic regression), it is possible to establish variable importance and possible linkages across variables that can be examined at a later date. This step in the analytical process is crucial because some variables that emerge from an ML algorithm may not be significant in a traditional sense. Therefore, as shown in
Figure 1 (line c for both parts A and B), the final step involves checking the variable list generated from the ML algorithms and the logistic analysis.
3.2. Analytic ML and the Conventional Analysis Process
Each ML algorithm test was conducted by dividing the sample into a training dataset and a test dataset. As shown in
Figure 2, using the training dataset, each ML algorithm was used to identify the best prediction model. Data were split into training and testing datasets using a 50:50 random split ratio. As noted by Joseph [
81], the split ratio varies by study, with 80:20, 70:30, 60:40, and 50:50 divisions all common. The literature shows a conspicuous absence of definitive guidelines delineating the optimal or preferred data split ratio for a given dataset. As such, based on the comparatively small size of the dataset used in this study, the research team concluded that a 50:50 ratio was appropriate (see also [
82,
83]). Moreover, this split ratio allowed for robust validation of the data (i.e., k-fold validation). After a model was identified, the test dataset was utilized to validate the results. If the model still showed a robust prediction outcome, the model was defined as optimal. Python with the Orange 3 visualization tool was used for all the tests. The conventional analysis utilized a similar procedure: a logistic regression model was estimated utilizing the training dataset, and results were validated using the test dataset. Stata 17.0 was used to estimate the logistic models.
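A minimal sketch of this split-train-validate procedure, assuming scikit-learn and synthetic data in place of the actual Orange 3 and Stata workflows:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in matching the study's sample size of 987 households.
X, y = make_classification(n_samples=987, n_features=10, random_state=0)

# 50:50 random split, mirroring the ratio used in the study.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Fit the selected ML algorithm and the conventional logistic model on the
# training half, then validate both on the held-out half via test-set AUC.
ml_model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)

ml_auc = roc_auc_score(y_test, ml_model.predict_proba(X_test)[:, 1])
logit_auc = roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1])
```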
3.3. The Accuracy Estimation Method
To measure prediction accuracy, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were utilized. An ROC curve is produced using two inputs: a true positive (TP) rate and a false positive (FP) rate [
84]. The TP rate is the ratio between correctly classified positives and total positives. The FP rate is the ratio between negatives incorrectly classified as positive and total negatives. A classifier is more precise when the TP rate is close to 1.00 and when the FP rate is close to zero. An ROC curve shows the TP rate on the vertical axis and the FP rate on the horizontal axis. When an ROC curve bows convexly upward to the left, the accuracy is considered to be more precise. The area under the curve is called the AUC, which indicates the power of the ROC (i.e., measured from 0.00 to 1.00) [
44]. Because the ROC curve’s vertical axis (TP rate) and horizontal axis (FP rate) both range from zero to 1.00, the area under the curve can range from zero (zero times zero) to 1.00 (one times one).
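For illustration, the ROC curve and AUC described above can be computed directly from TP and FP rates. The following is a minimal NumPy sketch (it ignores tied scores for simplicity):

```python
import numpy as np

def roc_curve_points(y_true, scores):
    """TP rate and FP rate at every score threshold (thresholds descending)."""
    order = np.argsort(-np.asarray(scores))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                  # true positives as the threshold is lowered
    fp = np.cumsum(1 - y)              # false positives
    tpr = tp / y.sum()                 # TP rate: correct positives / total positives
    fpr = fp / (1 - y).sum()           # FP rate: false alarms / total negatives
    return np.concatenate(([0.0], fpr)), np.concatenate(([0.0], tpr))

def auc(fpr, tpr):
    """Area under the ROC curve via the trapezoidal rule (0.00 to 1.00)."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

# A perfect classifier scores every positive above every negative, so its
# curve rises to a TP rate of 1.00 at an FP rate of zero and AUC equals 1.00.
fpr, tpr = roc_curve_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```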
3.4. The Factor Ranking Method
In Step 4, the rank of variables, in terms of prediction, is represented numerically (i.e., RReliefF). Whereas predictors in a logistic analysis can be evaluated using significance estimates and marginal effects (i.e., coefficients), identifying high-ranking predictors using ML algorithms is more complex. For example, in the case of NN, all input variables connect to the outcome variable through neurons, and multiple weights connect a particular input variable to the outcome variable, so no single coefficient exists. As such, the evaluation of ML algorithms tends to focus on complex combinations of input factors and their joint effects on an outcome variable instead of the unique association between an input variable and the outcome variable.
For this study, variable ranks were identified using RReliefF. RReliefF is an advanced version of Relief [
85] and ReliefF [
86], which are generally accepted attribute estimators. Robnik-Šikonja and Kononenko [
87] introduced RReliefF as a regressional extension of Relief. The diff function, shown below, is used to measure the distance between instances, which in turn identifies the nearest neighbors [
87]. Equation (18) is used for categorical attributes, and Equation (19) is for continuous attributes:
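In the standard notation of Robnik-Šikonja and Kononenko [87], these two forms of the diff function are:

$$
\mathrm{diff}(A, I_1, I_2) =
\begin{cases}
0; & \mathrm{value}(A, I_1) = \mathrm{value}(A, I_2) \\
1; & \text{otherwise}
\end{cases}
\tag{18}
$$

$$
\mathrm{diff}(A, I_1, I_2) = \frac{\left|\mathrm{value}(A, I_1) - \mathrm{value}(A, I_2)\right|}{\max(A) - \min(A)}
\tag{19}
$$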
These equations are used when investigating a dataset that comprises multiple examples, denoted as
I1,
I2, ...,
In, situated within an instance space. Each example is characterized by a set of attributes, represented as
Ai. Using the diff function, the weight (
W) of attribute
A can be estimated as Relief by following Equation (20) [
86]:
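In its standard form [86], for a randomly sampled instance $R_i$ with nearest hit $H$ (same class) and nearest miss $M$ (different class), averaged over $m$ sampled instances, the Relief update is:

$$
W[A] := W[A] - \frac{\mathrm{diff}(A, R_i, H)}{m} + \frac{\mathrm{diff}(A, R_i, M)}{m}
\tag{20}
$$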
Based on the fundamental Relief framework, regressional ReliefF was introduced using Equation (21):
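In the form given by Robnik-Šikonja and Kononenko [87], where $P_{\mathrm{diff}A}$ is the probability that two nearest instances have different attribute values, $P_{\mathrm{diff}C}$ the probability that they have different predictions, and $P_{\mathrm{diff}C \mid \mathrm{diff}A}$ the conditional probability of different predictions given different attribute values, the RReliefF weight is:

$$
W[A] = \frac{P_{\mathrm{diff}C \mid \mathrm{diff}A}\, P_{\mathrm{diff}A}}{P_{\mathrm{diff}C}}
- \frac{\left(1 - P_{\mathrm{diff}C \mid \mathrm{diff}A}\right) P_{\mathrm{diff}A}}{1 - P_{\mathrm{diff}C}}
\tag{21}
$$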
Compared to other attribute estimators (e.g., the root mean squared error and the mean absolute error), RReliefF evaluates each factor contextually, taking its interactions with other factors into account. A higher RReliefF value for a specific variable indicates that the factor is expected to predict the outcome with better (optimized) performance. Therefore, in this study, RReliefF was used to rank the factors.
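To make the underlying logic concrete, the following is a minimal NumPy sketch of the basic binary-class Relief update of Equation (20). It is illustrative only: the study itself used Orange’s RReliefF, which generalizes this scheme to regression, and the function name and data here are hypothetical.

```python
import numpy as np

def relief_weights(X, y, rng=None):
    """Basic Relief attribute weights for a binary outcome.

    For each sampled instance, each attribute's weight is decreased by its
    diff to the nearest same-class neighbor (hit) and increased by its diff
    to the nearest other-class neighbor (miss), averaged over n instances.
    """
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Normalize so the diff for continuous attributes falls in [0, 1].
    span = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / np.where(span == 0, 1, span)
    w = np.zeros(p)
    for i in rng.permutation(n):
        d = np.abs(Xn - Xn[i]).sum(axis=1)   # Manhattan distance to instance i
        d[i] = np.inf                         # exclude the instance itself
        hit = np.argmin(np.where(y == y[i], d, np.inf))
        miss = np.argmin(np.where(y != y[i], d, np.inf))
        w += (np.abs(Xn[i] - Xn[miss]) - np.abs(Xn[i] - Xn[hit])) / n
    return w
```

On data where one attribute separates the classes and another is pure noise, the separating attribute receives the higher weight, which is the ranking behavior Step 4 relies on.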
4. Data and Measurement
4.1. Data
Data were collected in 2021 using an online survey distributed in the United States. A survey agency invited 5900 consumer households to participate in this study; 1000 respondents answered all the questions, but 13 respondents provided inaccurate information (e.g., reporting an age of two years), resulting in a usable sample of 987. Descriptive information for the sample is shown in
Appendix A Table A1.
4.2. Measurement
The outcome variable was whether a respondent held an emergency fund or not. The variable was coded dichotomously (Have = 1; Not have = 0) based on an answer to the following question, “Have you set aside emergency or rainy day funds that would cover your expenses for three months, in case of sickness, job loss, economic downturn, or other emergencies?”.
The input variables (i.e., predictors) were split into the following five categories in alignment with [
88] and [
89]: (a) financial statements and resources, (b) financial literacy and education, (c) psychological factors, (d) demographic factors, and (e) COVID-associated factors (used to account for the period of data collection).
The following binary-coded variables comprised the financial statements and resources category: (a) have auto loan or not; (b) have student loan or not; (c) have farm loan or not; (d) have equity loan or not; (e) have mortgage loan or not; (f) own house or not; (g) have savings account or not; (h) have checking account or not; (i) own term life insurance or not; (j) own whole life insurance or not; (k) ever used a payday loan or not; and (l) have health insurance or not. In addition, a categorical variable was included to account for the possibility of receiving financial advice when making financial decisions (i.e., 1 = have; 2 = do not know; 3 = no). Finally, a respondent’s physical distance from their closest financial professional was asked and coded as follows: 1 = less than 5 miles; 2 = 5 to 10 miles; 3 = 10 to 20 miles; 4 = 20 to 50 miles; 5 = over 50 miles; and 6 = n/a or do not know.
Three variables comprised the financial literacy and education category: (a) had financial courses in high school (1 = Yes; 0 = otherwise); (b) had financial courses in college (1 = Yes; 0 = otherwise); and (c) objective financial literacy. The objective financial literacy variable was based on answers to three true/false questions [
90], resulting in scores that could range from 0 (no correct answers) to 3 (all correct answers).
The psychological factors category comprised the following variables: (a) financial risk tolerance; (b) financial satisfaction; (c) financial stress; (d) financial self-efficacy; (e) locus of control; (f) life satisfaction; (g) the Rosenberg self-esteem scale; and (h) job insecurity. Financial risk tolerance was assessed using the Grable and Lytton risk-taking propensity scale [
91]. Scores ranged from 13 to 42. Financial satisfaction was measured using seven items on a five-point scale (min = 7; max = 35) (see [
92]). Financial stress was measured using 24 items on a five-point scale (min = 24; max = 120) (see [
88]). Financial self-efficacy was measured using six items, also on a five-point scale (min = 6; max = 30) (see [
93]). Locus of control was measured using seven items on a five-point scale (min = 7; max = 35) (see [
94]. Higher scores were representative of an external locus of control. Life satisfaction was measured using five items on a seven-point scale (min = 5; max = 35) (see [
95]). Self-esteem was measured with Rosenberg’s 10-item scale that was assessed using a four-point scale (see [
96]). Finally, job insecurity was measured using seven items on a five-point scale (min = 7; max = 35) (see [
97]).
Demographic factors included (a) a variable representing the region of the country where a respondent lived, (b) work status, (c) agricultural working status, (d) education level, (e) marital status, (f) gender, (g) age, (h) whether a respondent lived in an urban area, (i) ethnicity, (j) income level, (k) number of children in a respondent’s household, and (l) perceived health status. The region represented a respondent’s state of residence. Work status was coded categorically as 1 = Full-Time; 2 = Part-Time; 3 = Self-Employed; 4 = Homemaker; 5 = Full-Time Student; and 6 = Not Working. Agricultural working status was coded as a categorical variable (1 = farm; 2 = ranch; 3 = agri-business; and 4 = not working in agriculture). Education level was coded categorically as 1 = high school or lower; 2 = some college; 3 = college; and 4 = postgraduate. Gender was coded as female or otherwise. Marital status was coded as a binary variable (i.e., single or otherwise). Age was measured in years. Living in an urban area was coded categorically as follows: 1 = urbanized area of 50,000 or more people; 2 = suburban area, near an urbanized area, with at least 2500 and less than 50,000 people; and 3 = rural area, comprising all population, housing, and territory not included within any urban area. Ethnicity was coded as a categorical variable, where 1 = White or Caucasian; 2 = Hispanic or Latino/a; 3 = Black or African American; 4 = Asian; 5 = Pacific Islander/Native American or Alaskan Native; and 6 = Other. Income level was coded categorically as 1 = Less than USD 15,000; 2 = USD 15,000 to USD 25,000; 3 = USD 25,000 to USD 35,000; 4 = USD 35,000 to USD 50,000; 5 = USD 50,000 to USD 75,000; 6 = USD 75,000 to USD 100,000; 7 = USD 100,000 to USD 150,000; and 8 = Over USD 150,000. The number of children living in a respondent’s household was measured as a reported number.
The perceived health status of a respondent was measured as a categorical variable (i.e., 1 = Excellent; 2 = Good; 3 = Fair; and 4 = Poor).
Finally, COVID factors were measured with items that asked how a respondent was affected by the COVID-19 virus and pandemic, how long a respondent expected the COVID-19 pandemic to last, and the receipt and timing of a stimulus check. The following items were used to evaluate perceptions of the COVID-19 pandemic: (a) how a respondent’s financial situation was affected by COVID-19; (b) how a respondent’s health condition was affected by COVID-19; (c) how a respondent’s general well-being was affected by COVID-19; and (d) how a respondent’s work–life balance was affected by COVID-19. Answers were coded as 1 = almost no impact to 4 = serious impact. Perceptions about the duration of the pandemic were assessed by asking whether (a) my financial situation will get better, get worse, or stay the same in three months; (b) my financial situation will get better, get worse, or stay the same in six months; or (c) my financial situation will get better, get worse, or stay the same in one year. Answers were coded as 1 = get better; 2 = get worse; or 3 = stay the same. The timing of receiving a stimulus check was measured nominally as 1 = received stimulus check in April; 2 = received stimulus check in May; 3 = received stimulus check in June; 4 = received stimulus check in July; 5 = received stimulus check after July; 6 = do not know; 7 = do not want to answer; 8 = had not yet received a stimulus check; and 9 = not eligible for a stimulus check.
6. Discussion
ML and big data analytical techniques have, over the past decade, garnered increasing attention among researchers, educators, and policy makers as a way to obtain deeper insight into social science phenomena. This study adds to the growing consumer studies methodological literature by illustrating how ML techniques can be applied to assessing household consumer attitudes and behaviors and how ML methods can improve prediction rates.
The outcome variable in this study was whether a household held an emergency fund, which was used to indicate a household’s degree of financial preparedness. The existing financial ratio literature is relatively consistent in reporting that those who hold emergency savings share a common demographic profile [
3,
4]. They tend to have high income, are more educated, and have greater wealth. It is important to note, however, that nearly all profiles reported in the literature were constructed using traditional methodologies, primarily regression techniques. At the outset of this paper, it was hypothesized that while existing profiles may remain valid, other variables might also be influential in describing who does and does not hold emergency savings. Traditional regression modeling techniques do not account for hidden layers between and among variables. While it is possible to create moderation and mediation models, to do so with large data is nearly impossible when the constraints associated with regression modeling are applied. This study’s methodological approach dealt with this issue by showing that when prediction or profiling is the main purpose of a study, ML algorithms can provide a more nuanced insight into consumer behavior compared to more commonly used statistical analysis techniques [
7,
16].
This study compared and tested several ML algorithms to determine which offers the most robust prediction rate. The ML algorithm outputs were compared to estimates derived from logistic regression models. Several takeaways emerged from these analyses. First, those using ML techniques must recognize that parameter tuning is not optional; incorrect parameter tuning lowers prediction and classification rates. Those who adopt ML algorithms in consumer studies should consider this point and compare tuning performance when conceptualizing studies. Second, sub-algorithms should be considered, because using an inappropriate sub-algorithm will almost always lower prediction and classification validity. Third, when evaluating ML algorithm outputs, it is important to remember that ML algorithms do not show marginal effects; instead, they provide a ranked ordering of predictors. As such, the interpretation of an ML analysis should not be considered deterministic but rather exploratory.
In this study, Gradient Boosting,
kNN, and SVM were found to provide the most robust degrees of prediction and classification. Gradient Boosting offered the best prediction rate, which aligns with what others have reported in the literature (e.g., [
9,
10,
15,
44]). Gradient boosting is an ensemble modeling technique that integrates classification and regression methods [
42,
43]. The ensemble of classification and regression estimation works well when optimizing prediction accuracy [
31] and minimizing error levels [
44]. What is particularly interesting in this study is that income and wealth—factors generally considered the most descriptive of financial preparedness—were not highly ranked in the Gradient Boosting algorithm, nor with
kNN or SVM. This insight differs from what is generally shown using regression techniques [
3]. However, educational factors and the existence of financial obligations were more important. It appears that a consumer must possess the financial literacy to anticipate the need for emergency savings, formulate a plan to build an emergency fund, and implement the plan. The consumer must also have an objective reason to hold emergency fund assets. The existence of loans is one reason a consumer may opt to hold assets in an emergency fund. Likewise, a consumer needs to hold an attitudinal disposition that values one’s future self or the well-being of household members. The consistently high ranking of life insurance in the ML algorithms suggests that the ability to plan for the future is an important characteristic among those holding emergency fund assets. The region variable in the
kNN model is worthy of future research. The variable represents the state where a respondent resided at the time of the survey. It appears that some consumers are more likely than others to take financial preparedness steps. Specifically, those living in rural areas who also hold existing debt are predicted to be more likely to hold an emergency fund.
This study represents a noteworthy advancement in consumer studies literature, particularly in the domains of personal finance and financial planning. This paper illustrates the value of ML techniques when predicting behavior. While numerous researchers have utilized ML methodologies with social science datasets (e.g., [
9,
10,
11,
12,
13,
14,
15]), these efforts have sometimes suffered from limitations, such as their inability to comprehensively compare diverse ML methods or their focus on non-household factors. This means that the practical relevance of findings about household financial management has notable limitations. This paper is one of the few studies to comprehensively analyze the nuances associated with holding an emergency fund at the household level.
Another significant contribution of this paper is the expanded scope of variables that were used to predict holding an emergency fund. Rather than rely on a limited set of preexisting variables as described in the literature (i.e., primarily financial factors and sociodemographic attributes) (e.g., [
3,
4]), this study introduced a broader range of variables, including financial education, psychological aspects, COVID-19-related factors, distance to financial service providers, and holding various types of loans. This approach aligns well with ML’s capacity to leverage multiple variables [
16], potentially unveiling overlooked variables that could significantly contribute to understanding the dynamics of emergency fund management.
Moreover, this study departs from the prevailing practice of assuming linear relationships between and among variables. ML techniques use pattern recognition and classification approaches, making it possible to move beyond traditional linear assumptions. To achieve this, six distinct ML algorithms were employed as complex systems science models. The application of these algorithms allowed for a comprehensive investigation of the potential contributions of ML to the field of consumer studies. Notably, each ML algorithm underwent careful parameter tuning and calibration, demonstrating not merely the use of these algorithms but their application to complex research questions. This comprehensive approach underscores the study’s commitment to advancing the understanding of emergency fund management dynamics and enhancing the practical applicability of ML in consumer studies.
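The parameter tuning and calibration step described above can be sketched as follows. This is a minimal illustrative example in Python using scikit-learn on synthetic data; the two classifiers shown (kNN and Gradient Boosting) were among those evaluated in the study, but the parameter grids, sample sizes, and variable names here are assumptions for demonstration only, not the study’s actual specification.

```python
# Illustrative sketch: cross-validated grid search over two of the
# study's classifier families. All grids and data are hypothetical.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the survey data (binary outcome: holds an
# emergency fund or not).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One candidate parameter grid per algorithm (assumed values).
grids = {
    "kNN": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 11]}),
    "GradientBoosting": (
        GradientBoostingClassifier(random_state=0),
        {"n_estimators": [50, 100], "learning_rate": [0.05, 0.1]},
    ),
}

results = {}
for name, (estimator, grid) in grids.items():
    # 5-fold cross-validated search for the best parameter combination.
    search = GridSearchCV(estimator, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    # Record the tuned parameters and held-out test accuracy.
    results[name] = (search.best_params_, search.score(X_test, y_test))
```

In practice, each of the six algorithms would receive its own grid, and the scoring metric would be chosen to match the research question (e.g., balanced accuracy for an imbalanced outcome).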
In summary, the results from this study advance the methodological body of literature for those working in the consumer studies field. This study shows that ML algorithms can be used to improve predictions and classifications of consumer attitudes and behaviors. Future research should align the results from this study with existing models and profiles of those who hold emergency savings. Information from such studies can be used by financial educators, consumer advocates, and policy makers when helping households achieve greater levels of financial preparedness.
7. Conclusions
This study is noteworthy in making significant theoretical, practical, and methodological contributions to consumer studies. The theoretical contribution lies in its application of ML techniques to the study of household financial decision making. Unlike traditional linear models, this study used a pattern recognition and classification methodology, shedding light on the intricate complexities underlying emergency fund management. The findings from this study challenge conventional beliefs by highlighting the importance of financial literacy, financial obligations, and a positive attitude towards future financial well-being as key factors in predicting who is more likely to hold emergency savings, with income and wealth taking a secondary role.
On a practical level, findings from this study underscore the critical importance of parameter tuning and sub-algorithm selection when employing ML techniques in consumer studies. This paper offers valuable insights into the use of ML algorithms when predicting and classifying consumer attitudes and behaviors, which can have direct applications for financial service providers, financial educators, consumer advocates, and policy makers. Moreover, this study expands the spectrum of variables considered, incorporating financial education, psychological factors, COVID-19-related variables, and others, thereby enhancing the predictive capacity of models to understand the dynamics of emergency fund management.
Even in the context of these significant contributions, limitations need to be acknowledged. ML techniques, while improving prediction rates, do not readily provide straightforward marginal effects. Thus, some researchers use ML algorithms as a starting point for identifying key variables for use in secondary models. While this study evaluated six robust ML algorithms, including Gradient Boosting, kNN, and SVM, further research is needed to determine when a particular approach should be used to address a specific research question. Moreover, the algorithms evaluated here are all well-known; more advanced ML algorithms, such as Generative Adversarial Networks, Recurrent Neural Networks, or Convolutional Neural Networks, should be evaluated in future studies. Additional research is also needed to decipher regional variations in holding an emergency fund. Future studies should likewise aim to integrate these findings with existing models and profiles of emergency savings holders; doing so will contribute to a better understanding of the financial preparedness of households. Even in the context of these limitations and opportunities for future work, this study advances the consumer studies methodological landscape by showcasing how ML techniques can enrich the field’s comprehension of consumer attitudes and behaviors, particularly within the context of holding an emergency fund.
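The two-stage workflow noted in the limitations, in which an ML algorithm identifies key variables that are then carried into a secondary model yielding interpretable marginal effects, can be illustrated with a short Python sketch. This is a hypothetical example on synthetic data using scikit-learn; the model choices, number of retained features, and variable names are the author’s illustrative assumptions, not the study’s actual procedure.

```python
# Illustrative sketch: use an ML model's feature importances to select
# variables for a secondary, interpretable logistic regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the survey data.
X, y = make_classification(
    n_samples=400, n_features=12, n_informative=4, random_state=1
)

# Stage 1: fit a flexible ML model and rank variables by importance.
gb = GradientBoostingClassifier(random_state=1).fit(X, y)
top = np.argsort(gb.feature_importances_)[-4:]  # four most important features

# Stage 2: fit a secondary logistic model on the selected subset,
# whose coefficients admit a conventional marginal-effect reading.
logit = LogisticRegression(max_iter=1000).fit(X[:, top], y)
coefs = logit.coef_  # one coefficient per retained variable
```

The design choice here is deliberate: the ML stage is used only to narrow the variable space, while inference about effect direction and magnitude is left to the familiar regression framework.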