*3.3. Modeling and Estimation*

Using the data collected in the live experiments, we want to test our hypothesis that a Pareto frontier exists between risk and accuracy and that it is mediated by social learning. In this section, we describe all the modeling and estimation steps required to investigate our hypothesis:

• In Section 3.3.1, we describe how we model participants' belief updates.

• In Section 3.3.2, we describe how we select subsets of predictions based on their relative amount of social vs. non-social learning. This allows us to make aggregate predictions—at the platform level—based on a pre-specified amount of social learning.

• In Section 3.3.3, we detail how the accuracy and risk—at the platform level—of selected subsets are measured, and how they are used to investigate whether a Pareto trade-off exists between accuracy and risk and whether it is mediated by the relative amount of social vs. non-social learning.

#### 3.3.1. Modeling Belief Updates

Using formalism inspired by Bayesian models of cognition [29], we can model the 4634 prediction sets collected over many rounds, at a high level, as a Bayesian update. To use this formalism, we need to select a prior distribution for each individual's belief before exposure to any information and a likelihood (evidence) distribution to model the data participants are exposed to. Additionally, a sampling or approximate method is required to use the prior and evidence to compute the posterior distribution (the updated belief after information exposure). Here, we describe the modeling assumptions and procedure at a high level; we detail our assumptions more thoroughly and present our derivations in Supplementary Section A.3.

Fundamentally, we are interested in how participants predict an asset's future price (ground-truth) *V* based on the information we expose them to. The choice of the prior distribution is straightforward: *Pprior*(*V*) ≈ *P*(*Bpre*), the distribution of belief of an individual before they are exposed to any information. We discuss in our model derivation (Supplementary Section A.3) how, when needed, we approximate the full distribution *P*(*Bpre*) since we obtain only one sample, *Bpre*, for each participant and cannot observe the full distribution *P*(*Bpre*).

After participants input their pre-exposure belief *Bpre*, there are two main likelihood (evidence) distributions participants employ: they are exposed to the assets' price history *BT*, giving us *Plikelihood*(*V*) ≈ *P*(*BT*), or analogously, the social histogram *BH*, giving us *Plikelihood*(*V*) ≈ *P*(*BH*). In the modeling stage here, we assume that participants used these two likelihood distributions separately to update their beliefs, but we relax this assumption in the estimation stage, where we estimate the relative amount of social vs. non-social learning for each prediction. We detail in Supplementary Section A.3 how likelihood distributions are built from the information that participants are exposed to. In Supplementary Section A.2, we formally detail how we transform the price history into a cognitively accurate 'rates histogram' using price momentum. In summary, because it has been shown that people process time series as a distribution of changes as opposed to a distribution of the quantity itself [65–67], we convert the price history time series into a histogram of daily changes (slopes) in prices, which is used for both the simple Gaussian models and the numerical models for price prediction.
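As an illustrative sketch of this transformation (the function name and bin count are ours, not the paper's), converting a price series into a histogram of daily changes amounts to:

```python
import numpy as np

def rates_histogram(prices, n_bins=20):
    """Turn a price time series into a histogram of day-to-day changes (slopes)."""
    slopes = np.diff(np.asarray(prices, dtype=float))  # one slope per consecutive day pair
    counts, edges = np.histogram(slopes, bins=n_bins)
    return counts, edges

prices = [100.0, 101.5, 101.0, 102.2, 103.0, 102.5]
counts, edges = rates_histogram(prices, n_bins=5)
assert counts.sum() == len(prices) - 1  # every slope falls into exactly one bin
```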

Given the prior and likelihood, the *modeled* posterior prediction *Pposterior*(*V*), can, therefore, be approximated as *Pposterior*(*V*) ∝ *P*(*BH*) · *P*(*Bpre*) in the case of exposure to social information, and *Pposterior*(*V*) ∝ *P*(*BT*) · *P*(*Bpre*) when participants are exposed to the past price history. We do not make any other assumptions in terms of what data to use to approximate the likelihood and prior distributions. Given these distributions, the question is then how to compute the posterior (updated) belief of an individual.
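Under the Gaussian conjugate assumption discussed below, the product of prior and likelihood has a closed form. A minimal sketch, assuming both distributions are summarized by a mean and a variance (the function name is ours):

```python
def gaussian_posterior(mu_prior, var_prior, mu_lik, var_lik):
    """Conjugate update: renormalized product of two Gaussian densities.

    Precisions (inverse variances) add; the posterior mean is the
    precision-weighted average of the two means.
    """
    precision = 1.0 / var_prior + 1.0 / var_lik
    mu_post = (mu_prior / var_prior + mu_lik / var_lik) / precision
    return mu_post, 1.0 / precision

# With equal variances, the posterior mean is the midpoint of the two means.
mu, var = gaussian_posterior(100.0, 4.0, 110.0, 4.0)
assert abs(mu - 105.0) < 1e-9 and abs(var - 2.0) < 1e-9
```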

Although we focus on Bayesian models in this work, we include one popular model commonly used as a benchmark in the literature, the DeGroot model [68]. In this model, an individual updates their belief as the weighted average belief of their peers where weights can be, for example, trust values of the individual for their peers. Here we set the weights (trust values) equal for all peers, as we have no data to estimate these weights, and therefore assume a uniform prior.
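A minimal sketch of one such equal-weight DeGroot step (naming is ours):

```python
def degroot_update(own_belief, peer_beliefs):
    """One DeGroot step with equal weights (trust values) over self and all peers."""
    beliefs = [own_belief] + list(peer_beliefs)
    return sum(beliefs) / len(beliefs)

# The updated belief is simply the average of one's own and the peers' beliefs.
assert degroot_update(100.0, [110.0, 120.0]) == 110.0
```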

Although the space of possible distributions and posterior computation approaches is very large, we focus here on using two simple, interpretable, and theoretically motivated approaches from prior work [28]. We either use Gaussian (normal) conjugate distributions to approximate priors and likelihoods due to strong evidence of their ubiquity as Bayesian models of cognition [29], or use a full Monte Carlo numerical sampling approach to calculate the posterior from the actual distributions of prices that participants were exposed to. We leave to future work the exploration of richer distributions and approaches to modeling belief update as it is beyond the scope of this study.
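As a simplified stand-in for the numerical approach (a grid-based product of empirical densities rather than a full Monte Carlo sampler; all names and the smoothing constant are ours), a posterior mean could be estimated as:

```python
import numpy as np

def numerical_posterior_mean(prior_samples, evidence_samples, grid_size=200):
    """Posterior mean from two empirical sample sets.

    Densities are approximated with histograms on a shared grid; the
    posterior is their renormalized pointwise product.
    """
    lo = min(np.min(prior_samples), np.min(evidence_samples))
    hi = max(np.max(prior_samples), np.max(evidence_samples))
    edges = np.linspace(lo, hi, grid_size + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    p = np.histogram(prior_samples, bins=edges)[0] + 1e-9    # smooth empty bins
    q = np.histogram(evidence_samples, bins=edges)[0] + 1e-9
    post = p * q
    post /= post.sum()
    return float(np.dot(centers, post))

rng = np.random.default_rng(0)
m = numerical_posterior_mean(rng.normal(100, 5, 5000), rng.normal(110, 5, 5000))
assert 100.0 < m < 110.0  # posterior mean lies between the two sample means
```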

#### 3.3.2. Subsetting Predictions Based on Social Learning

Given how participants update their beliefs, we would like to select subsets of predictions according to whether they were more likely updated using social or non-social information. This approach of using characteristics of how predictions are updated is standard in the Wisdom of the Crowd literature. For example, prior work has estimated resistance to social influence [27] and influenceability in revising judgments after seeing the opinions of others [69,70], and used them to improve collective performance. However, no prior work has investigated whether modeling belief update strategies themselves can be leveraged for improved collective performance.

Using the previously modeled posteriors, we can *estimate* how much of each information source—social information and price history—each participant used to update their belief by comparing the residual errors of models using either only social information or only price history as the likelihood. As will be shown in Section 4, although we explored many models of belief update, the simple conjugate Gaussian models best capture how participants update their beliefs. This is in line with previous research showing that, although simple, they are highly accurate models of mental estimation in a variety of domains [28].

Therefore, for the purposes of selecting subsets of predictions based on their relative amount of social vs. non-social learning, we choose to focus on the GaussianSocial and GaussianPrice models. These models assume the likelihood (evidence) data distribution to be built, respectively, from the social information and the price history participants are exposed to.

Our approach is illustrated in Figure 2: using the predictions of the GaussianSocial and GaussianPrice models, we calculate a residual *εH* for belief updates using the social information *BH* and a residual *εT* for updates from the price history *BT*, as *εH* = |GaussianSocial − *Bpost*| / *Bpost* and *εT* = |GaussianPrice − *Bpost*| / *Bpost*, respectively. We define *α* = *εT* − *εH*, and we use it to measure how likely it is that a participant used each source of information to update their prediction. For example, for a prediction set [*Bpre*, *BH*, *BT*, *Bpost*], if *α* > 0 (i.e., *εT* > *εH*), the prediction set is better modeled using the social histogram of peers' beliefs *BH* than using the price history *BT*.
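A minimal sketch of this residual comparison (we write the residuals as `eps_h` and `eps_t`, and assume the model predictions are point estimates):

```python
def alpha(gaussian_social_pred, gaussian_price_pred, b_post):
    """alpha = eps_t - eps_h: positive when social information models the update better."""
    eps_h = abs(gaussian_social_pred - b_post) / b_post  # residual vs. GaussianSocial
    eps_t = abs(gaussian_price_pred - b_post) / b_post   # residual vs. GaussianPrice
    return eps_t - eps_h

# Updated belief sits closer to the social model's prediction, so alpha > 0.
assert alpha(gaussian_social_pred=104.0, gaussian_price_pred=98.0, b_post=105.0) > 0
```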

Using *α*, which we re-scale to be in the interval [−1, 1] for each round, we can select a subset *Sαs* of the prediction sets such that the *α* of these prediction sets lies in the range 0 ≤ *α* < *αs* (or *αs* < *α* ≤ 0 when *αs* < 0). *αs* is the one-sided boundary we vary to measure how much more likely a participant updated their belief from the social information instead of the price history. For example, the higher *αs* is, the more likely a prediction set is better modeled using the social histogram of peers' beliefs *BH* instead of the price history *BT*.
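A sketch of the one-sided selection rule (the per-round rescaling to [−1, 1] is implemented here as division by the maximum absolute *α*, which is our assumption):

```python
import numpy as np

def select_subset(alphas, alpha_s):
    """Indices of prediction sets whose rescaled alpha lies in [0, alpha_s)
    (or (alpha_s, 0] when alpha_s < 0)."""
    a = np.asarray(alphas, dtype=float)
    a = a / np.max(np.abs(a))  # rescale this round's alphas into [-1, 1]
    if alpha_s >= 0:
        mask = (a >= 0) & (a < alpha_s)
    else:
        mask = (a <= 0) & (a > alpha_s)
    return np.flatnonzero(mask)

idx = select_subset([0.2, -0.4, 0.1, 0.8], alpha_s=0.5)
assert list(idx) == [0, 2]  # rescaled alphas 0.25 and 0.125 fall in [0, 0.5)
```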

It is important to note that the residuals we use to select subsets are belief update model residuals (between the observed updated belief and the predicted modeled updated belief) which are uncorrelated with the crowd residual (between the crowd's aggregate prediction and the ground-truth).

**Figure 2.** An example belief update: for each prediction set, a participant updates their belief from the pre-exposure prediction *Bpre* to the updated prediction *Bpost* by learning from the social histogram *BH* and/or the price history *BT*. *εH* is the residual between the *modeled* updated prediction GaussianSocial and the participant's updated prediction *Bpost*; *εT* is the residual between GaussianPrice and *Bpost*. *α* is the difference between *εT* and *εH*.

#### 3.3.3. Evaluating Improvement of Subsets

Our hypothesis is that a Pareto frontier exists between risk and accuracy and that this trade-off is mediated by the relative amount of social vs. non-social learning.

To test this hypothesis, we investigate how the accuracy and variance of subsets *Sαs* of predictions selected using *αs* (a measure of the relative amount of social vs. non-social learning) compare to the current standard Wisdom of the Crowd approach whereby all predictions are used.

From the perspective of platform designers who want to be able to select predictions based on required levels of accuracy or risk (e.g., to fit a certain portfolio of risk), it is important to measure improvement of subsets relative to the full collection of predictions. This is because, currently, platform designers only have access to one global measure of risk and accuracy—that of the whole set of predictions (when there is no subset filtering). To demonstrate that selecting subsets of predictions can lead to significant *improvements* in accuracy and risk, we therefore need to calculate these improvements.

We therefore define improvement *ISαs* as the absolute difference between the error *eSαs* when using a subset *Sαs* and the error *eSall* when using the full set of predictions *Sall*, the Wisdom of the Crowd, where *Sall* is defined as the set of all predictions with −1 ≤ *α* ≤ 1.

The error *ei*,*Sαs* over all predictions *j* ∈ *Sαs*, for an estimated amount *αs* of relative social vs. non-social information during experiment round *i*, is defined as *ei*,*Sαs* = |𝔼*j*∈*Sαs*[*Bpost*,*j*] − *Vi*| / *Vi*. To allow for estimation uncertainty over the improvement in accuracy and risk of subsets, we use 100 bootstraps with replacement. This procedure is formally described in Supplementary Section A.3.4.
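The error, improvement, and bootstrap procedure above can be sketched as follows (names are ours; the crowd aggregate is taken as the mean of the subset's predictions):

```python
import numpy as np

def crowd_error(predictions, v_true):
    """Relative error of the mean prediction against the ground truth V."""
    return abs(np.mean(predictions) - v_true) / v_true

def improvement(subset_preds, all_preds, v_true):
    """Absolute difference in error between a subset and the full crowd."""
    return abs(crowd_error(subset_preds, v_true) - crowd_error(all_preds, v_true))

def bootstrap_improvements(subset_preds, all_preds, v_true, n_boot=100, seed=0):
    """Resample the subset with replacement to estimate uncertainty."""
    rng = np.random.default_rng(seed)
    subset_preds = np.asarray(subset_preds, dtype=float)
    return [improvement(rng.choice(subset_preds, size=len(subset_preds)),
                        all_preds, v_true)
            for _ in range(n_boot)]

all_preds = [95.0, 100.0, 105.0, 120.0]
subset = [95.0, 100.0, 105.0]
v = 100.0
assert crowd_error(subset, v) < crowd_error(all_preds, v)
boots = bootstrap_improvements(subset, all_preds, v)
assert len(boots) == 100
```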

We use an analogous approach to estimate the risk of the platform by calculating the standard deviation, instead of the mean, of the improvements over experiment rounds. This allows platform designers to estimate, over a basket of prediction rounds, the variance of improvements over this basket. This is analogous to understanding the variance of error of a statistical prediction model (e.g., a machine learning model), so that both the accuracy and variance of the model can be calibrated over a portfolio of predictions.

#### **4. Results**

Here we present our results. In Section 4.1, we detail our supporting result related to how different belief update models perform. Next, in Section 4.2, we present our main result about the trade-off between accuracy and risk in the Wisdom of the Crowd. Lastly, we present the supporting result regarding the effect of social learning during the high uncertainty period before the Brexit vote in Section 4.3.

#### *4.1. Belief Update Models*

As discussed in Section 3.3.1, although the space of possible prior and likelihood distributions and posterior computation approaches is very large, we focus on simple, interpretable, and theoretically motivated approaches from prior work [28], leaving richer distributions and belief update models to future work. We detail how model error and confidence intervals are evaluated in Supplementary Section A.3.3.

As can be seen in Figure 3, models that use social information as the likelihood for modeling the belief update of participants (GaussianSocial, GaussianSocialModes, NumericalSocial) outperform models that use the price history (GaussianPrice, NumericalPrice). This suggests that our participants more likely use social information instead of the price history to update their beliefs, in line with previous work showing that participants often prefer using social information [71,72].

**Figure 3.** The y-axis shows the relative residual between *modeled* belief update and *actual* updated belief. Simple approximated models do better at modeling belief update than numerical models, and models using social histograms as likelihood perform better than models using the price history. Error bars represent 95% CI.

Specifically, GaussianSocial, our simple Gaussian model that assumes the data follows a single-mode Gaussian distribution, outperforms GaussianSocialModes, a model that identifies when the social histogram is non-unimodal (using Hartigan's dip test of unimodality [73]) and uses the largest mode as the mean of the distribution. This suggests that participants assume the data they learn from to be unimodal even when it is not, in line with prior work [74,75] showing that this might be because using multi-modal data is cognitively costly.

Additionally, GaussianSocial outperforms the more precise numerical model NumericalSocial, which makes no parametric assumption on the data distributions and uses a Monte Carlo procedure to estimate the posterior distribution. This suggests that participants employ simple heuristics when learning from their peers, in line with the attribute substitution heuristic of human decision-making [30]. However, when participants are learning from the price history, the dominance of simpler models is not as clear, because the performance of the simple GaussianPrice model is indistinguishable from that of the numerical model (NumericalPrice).

GaussianSocial also outperforms the popular DeGroot model [68] described in Section 3.3.1, in which an individual updates their belief as the weighted (here, equal) average belief of their peers. It is interesting to note that GaussianSocial is equivalent to the DeGroot model when a participant's weight on their own prior belief equals the total weight of all other participants. This agrees with previous work showing that participants put a disproportionately larger weight on their own prior belief [76,77].

Overall, the superiority of GaussianSocial in predicting belief update suggests that participants use a heuristic, unimodal, and simple belief update procedure, and that they predominantly update their predictions using social information instead of the price history. It is important to note that approximate (non-Monte Carlo) models such as GaussianSocial and GaussianPrice are parameter-less and required no fitting, which makes their success in modeling belief update all the more notable.

#### *4.2. Accuracy-Risk Trade-Off*

Here, we present our main result about the trade-off between accuracy and risk in the Wisdom of the Crowd. Using a Pareto curve, we compare the improvement in prediction accuracy and risk (variance) of each subset *Sα<sup>s</sup>* as defined by *αs*, a measure of the relative amount of social vs non-social learning.

As shown in Figure 4, we observe that with improvements in accuracy of subsets comes increased risk, mediated by the relative amount of social vs. non-social learning *αs*, suggesting a trade-off between accuracy and risk. As formally described earlier in Section 3.3.3, improvement is a measure of the additional accuracy gained from a subset of predictions compared to when using all predictions by the crowd (the de-facto Wisdom of the Crowd) over all prediction rounds. Similarly, risk is a measure of the risk of this subset compared to when using all predictions over all rounds. From a system design perspective, we choose these measures of improvement and risk as they allow us to understand how choices over subsets of participants might affect performance, allowing us to calibrate the crowd as per the platform designer's risk preferences.

**Figure 4.** (**A**): In this Pareto curve, we plot the improvement of each subset vs. the risk (standard deviation) in improvement within this subset. We see a risk-return trade-off: predictions made with price history are more accurate, but with higher risk (standard deviation). Fitted line has *R*<sup>2</sup> of 0.49, and *p*-value < 0.001. Horizontal and vertical error bars represent 95% CI from 100 bootstraps. (**B**,**C**): Instead of plotting risk vs. improvement (as in (**A**)), here we plot the same values of improvement ((**B**), *R*<sup>2</sup> = 0.82, *p*-value < 0.001) or risk ((**C**), *R*<sup>2</sup> = 0.50, *p*-value < 0.001) against the relative amount of social vs. non-social learning, *αs*, that generated these values of improvement or risk.

Additionally, since we observe that the variance of improvement (risk) decreases with increased social learning, our result replicates prior findings that exposure to social information decreases the variance of the crowd [37]. Note that the decrease in risk from social learning is not because participants are simply converging towards the crowd's mean: as detailed in Section 4.1, the social histogram participants are shown is quite often non-unimodal (tested using Hartigan's dip test of unimodality [73]), which means that participants are intentionally collapsing multiple distribution modes in the observed data.

Such a Pareto trade-off between risk and accuracy is common in financial forecasting [15,16] and statistical prediction [8–11], but has not been typically observed in the literature on the Wisdom of Crowds. This has strong implications for the design of crowdsourced prediction platforms as described in the Discussion Section 5.1.

#### *4.3. Performance under High Uncertainty*

A supporting result of our work comes from investigating the crowd's performance during a period of high uncertainty, using data from the prediction round that happened during the Brexit vote (see Supplementary Section A.5 for details about this round).

Following the same procedure described in the Methods Section 3.3.3, we bin all *α*'s from the prediction sets and investigate the improvements of subsets of predictions compared to the whole crowd. The main difference here is that, unlike in all previous results where we took care not to use the last week of data to calculate collective accuracy (so that prediction was not too easy), we do use it here, as the high uncertainty only occurred in the last week (as shown in Supplementary Figure S1). This last week of data is a *disjoint subset* from the data we previously used.

As can be seen in Figure 5, as *αs* decreases (i.e., we select predictions that were more likely updated using the price history instead of the social information, *αs* < 0), the improvement in accuracy of subsets compared to the Wisdom of the Crowd (all predictions) degrades substantially.

**Figure 5.** Improvement when selecting predictions based on how much more they were likely made using social information (*αs* > 0) vs. price history (*αs* < 0). 95% confidence intervals obtained through 100 bootstraps.

Conversely, as subsets of predictions updated using the social histogram (*αs* > 0) are selected, the improvement in their accuracy is stable.

Given that such high market uncertainty only occurred during one round, we do not have enough data to produce a Pareto curve over multiple rounds. Additionally, note that although a smaller number of predictions were made during the last week before Brexit (52 prediction sets compared to 284 during the open period of prediction used earlier), we have sufficient data to afford statistically significant results as shown by the 95% confidence intervals of our findings.

This supporting result suggests that during periods of high uncertainty, social learning leads to higher accuracy, in contrast to the result in the previous section where asset prices were more predictable. This result has implications for platform designers, such as the potential to leverage social learning as a tool that minimizes catastrophic performance during high-uncertainty prediction regimes.

#### **5. Discussion**

Our main result (the trade-off seen in Figure 4) supports our hypothesis that a Pareto frontier exists between risk and accuracy—similarly to what has been observed in statistical modeling [8–10] and financial [14–16] forecasting systems. This trade-off is mediated by the relative amount of social vs. non-social learning. Additionally, as supporting results, we observe that simple approximate models outperform more complicated Monte Carlo approaches in modeling the belief update process of participants. This suggests that participants use several heuristics, and that during periods of high uncertainty, social learning leads to higher accuracy.

Here, we discuss the implications of our results for platform designers in Section 5.1 and describe the contributions of our work to the literature on heuristics in information processing and decision-making in Section 5.2. We end with a description of the limitations of this work in Section 5.3.

#### *5.1. Collective Intelligence System Design Implications*

If we are to deploy crowd-sourced financial prediction and speculation systems at scale, it will be important to fully characterize the performance of these systems. This is especially true given the growing importance of decentralized financial prediction and speculation, including very recent events during which retail investors self-organized using social media and drove up asset and derivative prices [3,4]. However, crowd-sourced prediction systems and the surrounding literature have so far focused on measuring and optimizing the accuracy of predictions with little regard to their risk, even though measuring both accuracy and risk is standard in machine learning [8–10] and financial [14–16] forecasting applications. More generally, proper modeling and estimation of risk will support more sophisticated and versatile applications of crowd-sourced predictions, such as hedging risks over portfolios of prediction tasks.

Additionally, beyond the passive monitoring and reporting of risk, a practical question for designers is how to *tune* the platform to reach a desired value of risk and accuracy. Our result that social learning can mediate the accuracy-risk trade-off provides a practical means to attain performance along this frontier. Specifically, our results suggest that social learning within a crowd-sourcing platform could be more purposefully leveraged to fit the task at hand. For example, platform designers could incentivize social learning between participants to lower risk. This might be especially needed during highly uncertain times, as our results from the Brexit prediction round (Figure 5) showed. Past work has already shown that crowd-sourcing platforms can be incentivized to be more social [43,44].

Beyond platform design considerations, our results also add to the rich study of social learning and its impact on collective intelligence within the Wisdom of the Crowd domain [25,27,37,40,41] by adding the novel perspective that risk is an important dimension of the behavior of crowds to be measured.

More generally, our work brings together two previously disjoint lines of study by showing that it is possible to improve collective intelligence by modeling individual belief update. Our results therefore suggest a connection between the field of collective intelligence [78] (of which the Wisdom of the Crowd is one domain) and the field of computational cognitive science [79] (of which Bayesian models of cognition is an area). Until now, the latter literature has mostly focused on individual models of belief update, such as computational models of how people perform sampling [80], what their priors are [81], and how they perform inference [82], sometimes in social situations [83]. Yet, there is little work that looks at the impact of individual belief update on collective performance. Conversely, the collective intelligence literature has rarely leveraged the modeling of individual belief update to improve group performance, focusing instead on personal characteristics such as resistance to social learning [27].

#### *5.2. Information Processing and Decision-Making Heuristics*

Our results also have implications for the literature on decision heuristics and biases [75,84]. Through the modeling of belief update, we observe that our subjects exhibit the attribute substitution heuristic of human decision-making [30]. This information processing heuristic describes when people attempt to solve a complicated problem by approximating it with a simpler, less accurate model. We observe this heuristic as our participants' updated beliefs are better modeled by the GaussianSocial model (which assumes the data to be unimodal) than by the multi-modal belief update model GaussianSocialModes. This indicates that our participants assume the data to be unimodal even when it is not, in line with previous studies that have shown that people wrongly assume data to be unimodal [74,85,86]. This is hypothesized to be because updating belief using multi-modal data is cognitively costly [87]. Additional evidence of this substitution heuristic comes from the fact that simpler, approximate models better predict the updated beliefs of participants than the more complicated Monte Carlo numerical models.

Another decision heuristic that we observe is that participants prefer to use social information rather than the underlying price history of an asset to update their beliefs, as models which use social information (GaussianSocial, GaussianSocialModes, and NumericalSocial) outperform models that use price history (GaussianPrice and NumericalPrice), as shown in Figure 3. This is surprising given that our participants were mid-career finance professionals with strong financial experience who should know that price information is generally better for predicting future prices [88,89]. However, such behavior was observed in prior work where even experts performing a familiar task demonstrate sub-optimal decision heuristics [90,91], and often over-rely on social information [71,72].

Generally, such information processing and decision-making heuristics have been seen as irrational and sub-optimal. Our results suggest that, within the full specification of both accuracy and risk, perhaps participants are preferentially aiming for lower risk instead of higher accuracy. This preference for social information especially pays off during the high uncertainty period before the Brexit vote. Our results support growing evidence that heuristics and biases are not merely *defects* of human decision-making, but that perhaps they optimize for richer objectives or are optimized for more time- or data-constrained decision-making [92–98]. For example, when individual decision-making is viewed through the lens of more realistic requirements such as limited time [99,100] or attention [101], heuristics and biases have been shown to act as helpful priors that facilitate fast and risk-averse decision-making [102,103].
