### *3.3. Summary (1)*

My objective here was not to show that simplicity and likelihood approaches can account for all contrast polarity phenomena (on their own, they can certainly account for some of these phenomena, but probably not for all of them). Instead, my objective was to show that Pinna and Conti applied these approaches incorrectly, even though they had been warned about this. In doing so, they knowingly ignored the fact that these approaches are far more flexible than they assumed them to be. In my view, this is scientifically inappropriate.

### **4. Simplicity and Likelihood Are Not Equivalent**

Pinna and Conti formulated their second claim, about the alleged equivalence of simplicity and likelihood, as follows.

"[...] the visual object that minimizes the description length is the same one that maximizes the likelihood. In other terms, the most likely hypothesis about the perceptual organization is also the outcome with the shortest description of the stimulus pattern." [2] (p. 3 of 32)

This is an extraordinary claim. It therefore requires extraordinary evidence, but Pinna and Conti actually provided no corroboration at all (in their earlier draft, they cited Chater [46]; see Section 4.1). Instead, they seem to have jumped on the bandwagon of an idea that, for the past 25 years, has lingered on in the literature in spite of refutations. As said, Pinna and Conti had been informed about its falsehood but chose to persist. It is therefore expedient to revisit the alleged equivalence of simplicity and likelihood (see Table 1 for a synopsis of relevant issues and terminologies).

Before going into specific equivalence claims, I must say that, to me, it is hard to even imagine that simplicity and likelihood might be equivalent. Notice that descriptive simplicity is a fairly stable concept. That is, as has been proved in modern information theory (IT) in mathematics, every reasonable descriptive coding language yields about the same complexity ranking for things [43–45]. Probabilities, conversely, come in many shapes and forms. For instance, on the one hand, in technical contexts like communication theory, the to-be-employed probabilities may be (approximately) known—though notice that they may vary with the situation at hand. For known probabilities, one may aim at minimal long-term average code length for large sets of identical and nonidentical messages (i.e., Shannon's [38] optimal coding), and by the same token, at compounds of label codes that yield data compression for large compounds of identical and nonidentical messages (see, e.g., in [47,48]). On the other hand, the Helmholtzian likelihood principle in perception is now and again taken to rely on objective "real" probabilities of things in the world. This would give it an explanatory nature, but by all accounts, it seems impossible to assess such probabilities (see, e.g., in [49,50]). In between are, for instance, Bayesian models in cognitive science. In general, as said, such models employ free-to-choose probabilities for free-to-choose things, where both those things and their probabilities may be chosen subjectively on the basis of experimental data or modeller's intuition. Therefore, all in all, how could one ever claim that fairly stable descriptive complexities are equivalent to every set of probabilities employed or proposed within the probabilistic framework?
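As a concrete aside on the known-probabilities case mentioned above, the following minimal Python sketch, using an assumed toy distribution rather than anything from the literature, illustrates what Shannon-style optimal coding amounts to: assigning each message a code length equal to its surprisal, so that the long-term average code length is minimal for those known probabilities.

```python
import math

# Toy sketch (distribution is made up, not from the article): for a message
# source with known probabilities, Shannon-style optimal coding assigns each
# message a code length close to its surprisal, -log2(p), so that the
# long-term average code length approaches the source entropy.
probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}  # assumed toy distribution

surprisals = {m: -math.log2(p) for m, p in probs.items()}            # ideal code lengths (bits)
expected_length = sum(p * surprisals[m] for m, p in probs.items())   # average code length
entropy = -sum(p * math.log2(p) for p in probs.values())

print(surprisals)        # {'A': 1.0, 'B': 2.0, 'C': 3.0, 'D': 3.0}
print(expected_length)   # 1.75
print(entropy)           # 1.75 -> average code length is minimal for these known probabilities
```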

Yet, notice that Pinna and Conti are not alone in their equivalence claim. Equivalence has also been claimed by, for instance, Friston [51], Feldman [52,53], and Thornton [54]. They too failed to provide explicit corroboration, which raises the question of where the claim actually comes from. As a matter of fact, for alleged support, they all referred consistently to either Chater [46] or MacKay [55], or to both. These sources are discussed next (for more details, see in [17,22,56,57]).

### *4.1. Chater (1996)*

The main issue in the well-cited article by Chater [46] may be explained by means of Figure 3, starting at the left-hand side. The upper-left quadrant indicates that, for some set of probabilities *p*, one can maximize certainty via Bayes' rule, that is, by combining prior probabilities *p*(*H*) and conditional probabilities *p*(*D*|*H*) for data *D* and hypotheses *H* to obtain posterior probabilities *p*(*H*|*D*). [*Note:* in general, priors account for viewpoint-independent aspects (i.e., how good is hypothesis *H* in itself?), whereas conditionals account for viewpoint-dependent aspects (i.e., how well do data *D* fit hypothesis *H*?).] The lower-left quadrant indicates information measurement in the style of classical IT, that is, by the conversion of probabilities *p* to surprisals *I* = − log *p* (term coined by Tribus [58]; concept developed by Nyquist [36] and Hartley [37]). As said, the surprisal can be used to achieve optimal coding [38], but as indicated in Figure 3, prior and conditional surprisals can, analogous to Bayes' rule, also be combined to minimize information as quantified in classical IT. The latter constitutes the minimum message length principle (MML) [40], which, considering the foregoing, clearly is a full Bayesian approach that merely has been rewritten in terms of surprisals [59].
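To make the point about MML concrete, here is a minimal Python sketch with hypothetical priors and conditionals (the numbers are illustrative only, not taken from any model discussed here): because the surprisal conversion is monotone, maximizing the Bayesian posterior and minimizing the summed prior and conditional surprisals single out the same hypothesis.

```python
import math

# Hypothetical numbers, for illustration only: two hypotheses H1, H2 with
# priors p(H) and conditionals p(D|H). Maximizing the Bayesian posterior
# (up to the constant p(D)) and minimizing the summed surprisals
# I = -log2(p) single out the same hypothesis, because -log is monotone.
prior = {"H1": 0.2, "H2": 0.8}
conditional = {"H1": 0.9, "H2": 0.1}   # p(D|H) for the observed data D

posterior_score = {h: prior[h] * conditional[h] for h in prior}         # Bayes (unnormalized)
message_length = {h: -math.log2(prior[h]) - math.log2(conditional[h])   # MML: prior + conditional surprisals
                  for h in prior}

best_bayes = max(posterior_score, key=posterior_score.get)
best_mml = min(message_length, key=message_length.get)
assert best_bayes == best_mml   # same winner: MML is Bayes rewritten in surprisals
print(best_bayes, posterior_score, message_length)
```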

**Figure 3.** Surprisals versus precisals. For data *D* and hypotheses *H*, probabilities *p* can be used to maximize Bayesian certainty under these probabilities (top left), and via the surprisal conversion *I* = − log *p*, also to minimize information as quantified in classical information theory (IT) (bottom left). Descriptive complexities *C* can be used to minimize information as quantified in modern IT (bottom right), and via the precisal conversion *p* = 2<sup>−*C*</sup>, also to maximize Bayesian certainty under these precisals (top right) (adapted from [57]).

Turning to the right-hand side of Figure 3, the lower-right quadrant indicates that, for some descriptive coding language yielding complexities *C*, one can combine prior and conditional complexities to minimize information as quantified in modern IT. This is the minimum description length principle (MDL) [42], which can be seen as a modern version of Occam's razor [60]. It also reflects the current take on the simplicity principle in perception [16,17]. The upper-right quadrant indicates that complexities *C* can be converted to what are called algorithmic probabilities *p* = 2<sup>−*C*</sup>, also called precisals [17]. These are artificial probabilities but, just as holds for other probabilities, prior and conditional precisals can, for instance, be combined to maximize certainty via Bayes' rule. This reflects Solomonoff's [44,45] Leitmotif: because classical IT relies on known probabilities, he wondered if one could devise "universal" probabilities, that is, probabilities that can be used fairly reliably whenever the actual probabilities are unavailable. In modern IT, precisals are proposed to be such universal probabilities and much research goes into their potential reliability. In cognitive science, they can be used, for instance, to predict the likelihood of empirical outcomes according to simplicity (i.e., rather than assuming that the brain itself uses them to arrive at those outcomes).
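For concreteness, the following minimal Python sketch, with made-up complexities rather than values from any actual coding language, shows the precisal conversion at work: minimizing the summed complexities (MDL) and maximizing the product of prior and conditional precisals via Bayes' rule select the same hypothesis.

```python
# Illustrative sketch with assumed, made-up complexities C (in bits), not
# values from the article. The precisal conversion p = 2**(-C) turns
# descriptive complexities into artificial probabilities; combining prior
# and conditional precisals via Bayes' rule then selects the same hypothesis
# as minimizing the summed complexities C(H) + C(D|H) (the MDL principle).
complexity_prior = {"H1": 12.0, "H2": 7.0}        # C(H): complexity of the hypothesis itself
complexity_conditional = {"H1": 3.0, "H2": 11.0}  # C(D|H): complexity of data D given H

mdl_score = {h: complexity_prior[h] + complexity_conditional[h] for h in complexity_prior}
precisal_score = {h: 2 ** (-complexity_prior[h]) * 2 ** (-complexity_conditional[h])
                  for h in complexity_prior}

assert min(mdl_score, key=mdl_score.get) == max(precisal_score, key=precisal_score.get)
print(mdl_score)       # {'H1': 15.0, 'H2': 18.0} -> MDL prefers H1
print(precisal_score)  # precisals agree, but that does not make them "real" probabilities
```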

The surprisal and precisal conversions are convenient in that they allow for sophisticated theoretical comparisons between simplicity and likelihood approaches (see, e.g., in [59,60]). Chater, however, jumped to the conclusion that these conversions imply that simplicity and likelihood are equivalent. Notice that the left-hand and right-hand sides in Figure 3 represent fundamentally different starting points and lines of reasoning. Therefore, equivalence would hold only if, in the lower half, the left-hand probability-based quantification of information and the right-hand content-based quantification of information, or, in the upper half, the related left-hand and right-hand sets of probabilities, are identical. Apart from the fundamental questionability thereof, these were not issues Chater addressed. It is true that the conversions imply that simplicity and likelihood can use the same minimization and maximization formulas, but Chater fatally overlooked that equivalence depends crucially on what they substitute in those formulas; here, it is clear that they substitute fundamentally different things. Chater's mistake is in fact like claiming that Newton's formula *ma* for force *F* is equivalent to Einstein's formula *mc*<sup>2</sup> for energy *E*, allegedly because both could have used a formula like *mX*, while fatally ignoring that *X* is something fundamentally different in each case. Therefore, all in all, Chater provided no evidence for equivalence of simplicity and likelihood at all.
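The point can be made concrete with a minimal Python sketch, again with hypothetical numbers: both approaches can be cast as "minimize a prior term plus a conditional term", yet because they substitute different quantities into that shared formula, they need not select the same hypothesis.

```python
import math

# A minimal sketch of the point above, with hypothetical numbers: the
# likelihood approach (MML) and the simplicity approach (MDL) can both be
# written as "minimize a prior term plus a conditional term", but they
# substitute different quantities into that formula (surprisals -log2(p)
# versus descriptive complexities C), so sharing the formula does not by
# itself make them equivalent; with these made-up inputs they even select
# different hypotheses.
def select(prior_term, conditional_term):
    """Generic 'minimize prior + conditional' rule shared by MML and MDL."""
    return min(prior_term, key=lambda h: prior_term[h] + conditional_term[h])

# Likelihood side: surprisals derived from assumed probabilities p(H), p(D|H).
surprisal_prior = {"H1": -math.log2(0.7), "H2": -math.log2(0.3)}
surprisal_cond = {"H1": -math.log2(0.6), "H2": -math.log2(0.2)}

# Simplicity side: assumed descriptive complexities C(H), C(D|H) in bits.
complexity_prior = {"H1": 9.0, "H2": 4.0}
complexity_cond = {"H1": 6.0, "H2": 3.0}

print(select(surprisal_prior, surprisal_cond))    # H1: likelihood winner
print(select(complexity_prior, complexity_cond))  # H2: simplicity winner (same formula, different inputs)
```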
