**2. Related Work**

#### *2.1. Collective Intelligence and Social Learning*

There is a rich literature on how decentralized information processing, learning, and decision-making affect the performance of collectives and swarms [32–36]. Here, we focus on how platforms can be designed so that people make high-performing predictions, a central question in the Wisdom of the Crowd literature [22,23,37].

The temporal dynamics of influence and information flow between individuals have been shown to strongly affect collective performance. On the one hand, prior work has shown that exposure to social information can degrade the accuracy of aggregate guesses [26,37,38]. For example, increasing the strength of social influence has been shown to increase inequality [39], whereas selecting the predictions of people who are resistant to social influence improves collective accuracy [27]. Highly influential peers have been theoretically shown to prevent a group from converging on the true estimate [26], and exposure to the confidence levels of others has been shown to lead people to change their predictions for the worse [40].

On the other hand, social learning has also been shown to enable groups to outperform their best individuals working separately [41], and a collective intelligence factor has been shown to predict team performance better than the maximum intelligence of any team member [35]. Similarly, human-inspired social communication between agents has been shown to improve collective performance in optimization algorithms [5,42].

The role of social learning in collective performance therefore remains unresolved. Our contribution to this line of research is a more complete characterization of performance in terms of not just accuracy but also risk, which provides avenues for future work toward reconciling these disagreements about the role of social influence. This is especially important because many crowd-sourcing platforms and applications already have strong social components [43–48] that could be harnessed more effectively to improve performance.

#### *2.2. Accuracy-Risk Trade-Off*

Previous work has investigated several avenues to optimize crowd accuracy, such as recalibrating predictions against the systematic biases of individuals [26] and selecting participants who are resistant to social influence [27]. Rewiring the network topology of information-sharing between subjects [25,41] and optimally allocating tasks to individuals [49] have also improved collective accuracy. However, these studies focused on accuracy with little regard to risk. There is a rising movement to go beyond accuracy and to fully characterize performance, at both the individual and the collective level, in terms of both accuracy and risk. Some call this emerging line of work going beyond the 'bias bias' (in the statistics literature, bias is another name for accuracy; this movement suggests that research should go beyond its current focus on bias alone and also study risk).
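To make this distinction concrete, the standard bias–variance decomposition of mean squared error separates the two quantities (the notation here is ours, chosen to match the ground truth *V* introduced in Section 3):

$$
\mathbb{E}\!\left[(\hat{V} - V)^2\right]
= \underbrace{\left(\mathbb{E}[\hat{V}] - V\right)^2}_{\text{bias}^2\ \text{(accuracy)}}
+ \underbrace{\operatorname{Var}(\hat{V})}_{\text{variance (risk)}}
$$

where $\hat{V}$ is a prediction of the ground truth $V$. Two predictors with the same expected squared error can thus occupy very different points on the accuracy-risk spectrum.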

At the individual level, there is increasing evidence that people preferentially optimize for risk rather than accuracy in a variety of domains [50]. Cognitively, people have been observed to employ decision heuristics [51] that keep them conservative in the face of uncertainty [52,53]. For example, rice farmers have been observed not to adopt a technology that significantly improves harvests because of the risk that a single failure could cause family ruin [54]. Evolutionarily, risk aversion has been shown to emerge when rare events have a large impact on individual fitness [52]. Furthermore, in a meta-study of 105 forecasting papers, 102 support prioritizing lower risk to achieve higher overall performance [55]. At the collective level, there is limited work characterizing the performance of collectives and swarms in terms of both accuracy and risk, although there is a large literature on related trade-offs such as that between speed and accuracy [56–60].

From a system design perspective, crowd-sourcing platform designers should characterize their performance in terms of both accuracy and risk, given theoretical results [8,9] and observations in applications [10,11] showing that the performance of any prediction system is subject to a fundamental trade-off between accuracy and risk. This is especially important in our domain of predicting financial asset prices, as risk is already known to harm the efficiency of markets, for example through the phenomenon of implied volatility [61].
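Such a trade-off can be pictured as a Pareto frontier in the (error, risk) plane: no strategy on the frontier can improve one quantity without worsening the other. The following Python sketch shows how non-dominated strategies would be identified; the strategy names and numbers are invented purely for illustration and are not the paper's data.

```python
# Hypothetical illustration: find the accuracy-risk Pareto frontier
# among candidate prediction strategies (lower is better on both axes).
strategies = {
    "independent":      {"error": 0.042, "risk": 0.020},
    "social_exposure":  {"error": 0.035, "risk": 0.046},
    "variance_reduced": {"error": 0.038, "risk": 0.024},
    "follow_the_crowd": {"error": 0.051, "risk": 0.044},
}

def pareto_frontier(points):
    """Return the names of strategies not dominated on both error and risk."""
    frontier = []
    for name, p in points.items():
        dominated = any(
            q["error"] <= p["error"] and q["risk"] <= p["risk"]
            and (q["error"] < p["error"] or q["risk"] < p["risk"])
            for other, q in points.items() if other != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

print(pareto_frontier(strategies))
# -> ['independent', 'social_exposure', 'variance_reduced']:
# no single strategy minimizes both error and risk at once.
```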

#### **3. Materials and Methods**

#### *3.1. Experimental Design*

To test our hypothesis that a Pareto frontier exists between risk and accuracy—i.e., that there is a trade-off between risk and accuracy of prediction across several prediction rounds—and that it is mediated by social learning, we need a dataset with the following requirements:


Given the above requirements, we designed the experimental procedure as follows: we recruited a total of 2037 participants to predict the future prices of financial assets (the S&P 500, WTI Oil, and gold) over seven separate, consecutive 3-week prediction rounds spanning 6 months, resulting in 9268 predictions (i.e., 4634 prediction pairs or sets). We focused on financial prices because they constitute a hard prediction problem [62,63]. Our participants were mid-career financial professionals with years of financial experience. They consented to their data being used in this study, and we obtained prior IRB approval. One of our prediction rounds happened to end on the day of the Brexit vote, which means that we have prediction data from a particularly volatile market period [31], as described in Supplementary Section A.5.

During each round, participants predicted the same asset's closing price for the same final day of the round, and we use that day's closing market price as our measure of ground truth. We carefully instrumented the social and non-social information that our participants were exposed to, and collected their predictions both before and after exposure to this information. We also deployed one round during a high-uncertainty period to understand whether variance reduction strategies make the crowd resistant to risk.
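As a purely illustrative example of how a round could be scored under this setup, one simple operationalization treats the distance of the crowd's mean post-exposure prediction from the closing price as the accuracy term and the dispersion of predictions as the risk term; all values below are invented, and the paper's own risk measure may differ.

```python
import statistics

# Hypothetical scoring sketch for one prediction round.
closing_price = 2045.30                      # ground truth V for the round
post_predictions = [2051.0, 2029.5, 2040.2, 2066.8, 2025.0]  # B_post values

crowd_estimate = statistics.mean(post_predictions)
accuracy_error = abs(crowd_estimate - closing_price)  # accuracy: distance to truth
risk = statistics.stdev(post_predictions)             # risk: dispersion of guesses

print(f"crowd error: {accuracy_error:.2f}, dispersion: {risk:.2f}")
# -> crowd error: 2.80, dispersion: 16.91
```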

We did not opt for an A/B testing experimental design [64], in which we would have split participants into groups and shown each group either the social information or the historical price time series, because we wanted participants to naturally choose whichever source of information to use when updating their beliefs. This was an important experimental design choice: we wanted to understand, as close to in situ as possible, how people update their beliefs in the real world, where they are already exposed both to their peers' beliefs and to price history information, such as through financial news. This contrasts with prior work in which experiments were deployed within carefully controlled laboratory set-ups [25,37,40].

#### *3.2. Data Collection*

As shown in the screenshot of the user interface in Figure 1, we designed the data collection process as follows: every time a participant predicts an asset's future price through our platform, a prediction set comprising *Bpre*, *BH*, *BT*, and *Bpost* is collected:


**Figure 1.** An annotated screenshot of how data were collected: the pre-exposure prediction *Bpre* is made first, after which the social histogram *BH* and the price history *BT* are shown. Finally, the updated prediction *Bpost* is collected. The ground truth is the asset's final closing price *V* (not shown here; realized at the end of the round).

Overall, we ensure that the pre-exposure prediction is made before any social information or price history is shown. We present a unique histogram for every new prediction (built from all past predictions up to that point), as well as a unique price history time series (showing the 6-month price data up to the time of prediction). We require all participants to make a post-exposure prediction even if they decide to keep it at the pre-exposure value.
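For concreteness, below is a minimal sketch of how one such prediction set could be represented. The class and field names are our own, chosen to mirror the notation above; the actual platform schema is not published here.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class PredictionSet:
    """One prediction set, collected per participant per asset per round."""
    b_pre: float          # B_pre: prediction made before any exposure
    b_h: list[float]      # B_H: peer predictions shown as a histogram
    b_t: list[float]      # B_T: 6-month price history up to prediction time
    b_post: float         # B_post: required post-exposure prediction,
                          # even if unchanged from b_pre
    timestamp: datetime   # when the prediction was made
    # The ground truth V (the round's final closing price) is joined in
    # only after the round ends.
```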
