### *4.1. Research Context*

For several decades, scholars have used business games to investigate the accuracy of managerial decisions and their application by analyzing the performance of experiment participants (Lainema and Makkonen 2003; Faria et al. 2009; Kim et al. 2013; Henriksen and Børgesen 2016; Korchinskaya et al. 2020). This section briefly describes the simulation game 'General Management Business Game (GMBG)' (provided by Artémat), which was used during the lectures of a general management course at one large Italian university, both to train students in the management of firms and to conduct this research. In this simulation game, students take the part of the General Manager of a small manufacturing firm specialized in the production of mobile devices (e.g., tablets). Players aim to win contracts to manufacture and sell a large selection of articles, while tuning their businesses against operations governance benchmarks. As the organization develops, participants must take care of both human and productive resources to optimize capacity; they therefore make several management decisions that touch all aspects of the organization. This is most evident in the module of the business game in which participants are asked to run all the managerial functions of the firm with the aim of achieving the highest possible performance for the firm, measured in terms of net worth, number of contracts signed, and client satisfaction. Students face "real-job" alternatives that allow them to work on management decision-making processes by collecting and analyzing data. The main performance indicator used to assess how the simulated firm was run is net worth: the value of all the non-financial and financial assets owned by the firm minus the value of all its outstanding liabilities.

### *4.2. Experimental Design, Procedure and Measurements of Variables*

To test the three developed hypotheses and address the aim of this paper, 120 graduate students (57 male, 63 female; average age = 21.2 years, standard deviation (SD) = 1.3 years) attending an optional management course at one large Italian university took part in this research. Participants, who can be considered 'Millennials', were rewarded for the experiment with university credits. A laboratory experiment followed by quantitative analysis is considered the most suitable research design for fields that are well past the nascent stage of development (Edmondson and McManus 2007). These intermediate or mature fields of research, such as the role of personal traits in decision-making processes (see the review by Cristofaro 2017a), are challenged by "focused questions and/or hypothesis relating existing constructs" (p. 1160).

Sampled students, as gleaned from informal conversations, had no or limited work experience; however, we did not control for work experience empirically. Participants were selected through convenience sampling (Given 2008), a non-probability sampling approach that draws the sample from a part of the population that is close at hand. Although one stream of scholars holds that student samples are not suitable for behavioral research aiming to provide working implications for practitioners (e.g., Gallander Wintre et al. 2001), another equally important stream of research (e.g., Lucas 2003; Thomas 2011) holds that these samples are appropriate when the research emphasizes basic psychological processes. Concerning the latter, according to Berkowitz and Donnerstein (1982): "the meaning the subjects assign to the situation they are in and the behavior they are carrying out plays a greater part in determining the generalizability of an experiment's outcome than does the sample's demographic representativeness" (p. 249). It is therefore strongly believed here that, given the aim of this research (finding connections among CSE, overconfidence and decision-making performance) and the settings of the laboratory experiment, at least the internal validity of the research is guaranteed. This follows similar contributions on management decisions (see Cristofaro 2016) that use student samples and investigate psychological variables against decision-making performance.

At the beginning of the course, each participant was free to form a group with three other members; a total of 30 groups were formed. The experiment took place when participants were asked to run the final module of the simulation game, in which they had all the managerial functions under their control, with the aim of achieving the highest possible performance for the firm. On the day of the experiment (when students were asked to play this final module), the lead researcher explained the interest in studying the relationship between their personality traits, some behavioral decision variables, and their connections with performance in the simulation game.

Before running the final module of the simulation game, each participant was first invited to complete the 12-item CSE Scale (CSES) (Appendix A) and the seven-item Cognitive Reflection Test (CRT) (Appendix B), which measure their degree of CSE and their tendency to adopt intuitive or reflective thinking (explained later). They were then asked to play the last module of the simulation game, which lasted 3 hours. During this period, participants were not aware of their performance (in terms of net worth, which was tracked by the lead researcher). At the end of the simulation game, each individual, within groups, was asked to estimate the final net worth of their firm and to write it down on paper. The variables at the center of the hypotheses to be tested were measured as follows.

*Core Self-Evaluations:* The CSE score of each group was calculated as the average of the CSE scores of its members. The individual-level CSE score was obtained by asking participants to complete the 12-item Core Self-Evaluations Scale (CSES; Judge et al. 2003; see Appendix A; test-retest reliability was 0.81 over a one-month span), which, rather than measuring the four traits comprised within CSE independently and weighting the scores, provides an explicit and integrative estimate of a person's core self-perception. Respondents were asked to rate their agreement with each of the 12 items on a five-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree) (see also Joo et al. 2012). It is worth noting that the values assigned to reverse-scored items were subtracted; consequently, the maximum and minimum values that can be reached on the CSES are +24 and −24, respectively, with a neutral point at 0. Using the STATA function 'egen'<sup>1</sup>, three main clusters were derived from the average CSE values of groups: (i) low CSE groups (whose averages range from −24 to −9), (ii) average CSE groups (whose averages range from −8 to +8), and (iii) high CSE groups (whose averages range from +9 to +24). The initial 30 groups were thus allocated to these three clusters, with 10 groups in each.
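The scoring and clustering steps described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually run in STATA: the function names are hypothetical, the choice of the even-numbered items as the reverse-scored ones is an assumption (the source only implies that half of the items are reversed), and so is the handling of non-integer group averages that fall between the stated cluster ranges.

```python
from statistics import mean

# Assumption: items 2, 4, ..., 12 (1-based) are the reverse-scored ones.
REVERSE_ITEMS = {2, 4, 6, 8, 10, 12}


def cse_score(responses):
    """Individual score: regular items are added, reverse-scored items subtracted."""
    if len(responses) != 12 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("expected 12 responses on a 1-5 Likert scale")
    return sum(-r if item in REVERSE_ITEMS else r
               for item, r in enumerate(responses, start=1))


def group_cse(member_responses):
    """Group score: mean of the members' individual CSE scores."""
    return mean(cse_score(r) for r in member_responses)


def cse_cluster(group_score):
    """Map a group average to a cluster.

    Assumption: averages strictly between -9 and +9 count as "average",
    which settles non-integer values such as -8.5.
    """
    if group_score <= -9:
        return "low"
    if group_score >= 9:
        return "high"
    return "average"
```

With six regular and six reverse-scored items, the score ranges from 6·1 − 6·5 = −24 to 6·5 − 6·1 = +24, and answering 3 to every item gives the neutral point 0, which matches the bounds reported above.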

<sup>1</sup> Computed as: egen CSE = cut(CSE), at(−9,8,24).

*Reflective and intuitive thinking:* To assess the cognitive predisposition of respondents toward reflective or intuitive thinking, the seven-item CRT by Toplak et al. (2014) was adopted (Appendix B), which extends the three-item CRT formulated by Frederick (2005). The seven questions are constructed so that each has an intuitive but incorrect answer that comes to mind quickly and a correct answer that is easy to grasp once explained. The test is therefore presumed to estimate a person's tendency to engage in intuitive or reflective thinking (Patel et al. 2019). Each correct answer was scored as 1 and each incorrect answer as 0; the maximum and minimum values that can be reached on the CRT are therefore 7 and 0, respectively. Each individual's total score was computed, and the scores of the members of each group were then averaged to obtain the group score. Among the participants who did not give the correct answer, the intuitive one was usually the most frequent answer.
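As a minimal sketch of the CRT scoring just described, the following Python functions count correct answers per participant and average them within a group. The function names and the boolean encoding of answers are illustrative assumptions, not part of the original study materials.

```python
from statistics import mean


def crt_score(answers_correct):
    """Individual CRT score: 1 point per correct answer, 0 otherwise (range 0-7)."""
    if len(answers_correct) != 7:
        raise ValueError("expected answers to the seven CRT items")
    return sum(1 for correct in answers_correct if correct)


def group_crt(members):
    """Group CRT score: mean of the members' individual scores."""
    return mean(crt_score(m) for m in members)
```

A higher group average indicates a stronger tendency toward reflective thinking, and a lower one a stronger reliance on intuitive answers.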

*Overconfidence* (overestimation): As already introduced, Moore and Healy (2008) noted that overconfidence has usually been measured in terms of overestimation, and this is the definition of overconfidence operationalized in this study. In particular, during the simulation game participants were not made aware of their current net worth; at the end of the game, each individual, within groups, was asked to estimate their final net worth. The individual estimates of the members of each group were then averaged. To calculate the overestimation of each group, the group's average estimated net worth was compared with its actual net worth, and the difference between the two was taken as the group's overall overestimation. This procedure is in line with the one adopted by Hoppe and Kusterer (2011).
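The group-level measure described above can be written compactly. The symbols are notation introduced here for illustration and do not appear in the original study: $\hat{W}_{ig}$ is the final net worth estimated by member $i$ of group $g$, $n_g$ is the number of members in the group (four in this experiment), and $W_g$ is the group's actual final net worth as reported by the software. The sign convention (estimate minus actual) is an assumption, chosen so that positive values correspond to overestimation.

$$
O_g = \frac{1}{n_g}\sum_{i=1}^{n_g} \hat{W}_{ig} - W_g
$$

Under this convention, $O_g > 0$ indicates that the group overestimated its net worth on average, while $O_g < 0$ indicates underestimation.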

*Performance*. To measure the groups' performance during the simulation game (that is, the outcome of the management decisions collectively made by each group), the net worth variable (as suggested by McGraw Hill's instruction material) was taken as the best representative of the performance reached within the simulation game. Net worth is the value of all the non-financial and financial assets owned by the firm minus the value of all its outstanding liabilities; it is automatically calculated by the software. This variable has also been used as an indicator of firms' performance by other authors in the past (Penrose 1956; Carlstrom and Fuerst 1997).

*Data analysis*: To test the three developed hypotheses, three one-way analyses of variance (ANOVAs) were performed. ANOVA is the most suitable statistical technique for comparing the means of two or more samples to detect significant differences, if any (Field 2013). A Tukey post hoc test was conducted after each one-way ANOVA to determine which groups differed significantly.
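To make the test statistic concrete, the sketch below computes the one-way ANOVA F statistic from first principles using only the Python standard library. It is an illustration under stated assumptions, not the analysis pipeline used in the study: the function name is hypothetical, the data in the usage example are made up, and neither the p-value nor the Tukey post hoc comparisons are computed here (a statistics package would normally be used for those).

```python
from statistics import mean


def one_way_anova_f(*samples):
    """Return (F, df_between, df_within) for a one-way ANOVA over the samples."""
    k = len(samples)
    n_total = sum(len(s) for s in samples)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two samples and more observations than samples")

    grand_mean = mean(x for s in samples for x in s)
    # Between-group variation: how far each sample mean lies from the grand mean.
    ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in samples)
    # Within-group variation: spread of observations around their own sample mean.
    ss_within = sum((x - mean(s)) ** 2 for s in samples for x in s)
    if ss_within == 0:
        raise ValueError("no within-group variance; F is undefined")

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical values for three clusters (low, average, high CSE); not study data.
low, average, high = [1, 2, 3], [2, 3, 4], [3, 4, 5]
f_stat, df_b, df_w = one_way_anova_f(low, average, high)
# F = 3.0 with (2, 6) degrees of freedom
```

In the study's design, each ANOVA would compare the three CSE clusters of 10 groups each, giving 2 and 27 degrees of freedom.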
