*Article* **Partial Exchangeability for Contingency Tables**

**Persi Diaconis**

Department of Mathematics and Statistics, Sequoia Hall, Stanford University, Stanford, CA 94305, USA; diaconis@math.stanford.edu

**Abstract:** A parameter free version of classical models for contingency tables is developed along the lines of de Finetti's notions of partial exchangeability.

**Keywords:** algebraic statistics; contingency tables; de Finetti representation theorem; Markov basis; partial exchangeability

## **1. Introduction**

Consider cross-classified data $X_1, X_2, \dots, X_n$, where $X_a = (i_a, j_a)$, $i_a \in [I]$, $j_a \in [J]$ (with $[I] = \{1, 2, \dots, I\}$). Such data are often presented as an $I \times J$ contingency table $T = (t_{ij})$, where $t_{ij}$ is the number of times $(i, j)$ occurs. Suppose that $X_1, \dots, X_n$ are exchangeable and extendible. Then, de Finetti's theorem says:

**Theorem 1.** *For exchangeable $\{X_i\}_{i=1}^{\infty}$ taking values in $[I] \times [J]$,*

$$P[X_1 = (i_1, j_1), \dots, X_n = (i_n, j_n)] = \int_{\Delta_{I \times J}} \prod_{i,j} p_{ij}^{t_{ij}} \, \mu(dp),$$

*where $\Delta_{I \times J} = \{p_{ij} \ge 0,\ \sum_{i,j} p_{ij} = 1\}$. The representing measure $\mu$ is unique.*

A popular model for cross-classified data is

$$p_{ij} = \theta_i \eta_j.$$

Here is a Bayesian, parameter free, description.

**Theorem 2.** *For exchangeable $\{X_i\}_{i=1}^{\infty}$ taking values in $[I] \times [J]$, a necessary and sufficient condition for the mixing measure $\mu$ in Theorem 1 to be supported on $\Delta_I \times \Delta_J$ (with $\Delta_I = \{p_1, \dots, p_I : p_i \ge 0,\ \sum_i p_i = 1\}$), so that*

$$P[X_1 = (i_1, j_1), \dots, X_n = (i_n, j_n)] = \int_{\Delta_I \times \Delta_J} \prod_i \theta_i^{t_{i*}} \prod_j \eta_j^{t_{*j}} \, \mu(d\theta, d\eta),$$

*is that*

$$\begin{aligned} P[X_1 = (i_1, j_1), X_2 = (i_2, j_2), X_3 = (i_3, j_3), \dots, X_n = (i_n, j_n)] &= \\ P[X_1 = (i_1, j_2), X_2 = (i_2, j_1), X_3 = (i_3, j_3), \dots, X_n = (i_n, j_n)]. \end{aligned} \tag{1}$$

*Condition* (1) *is to hold for all $n \ge 2$ and all $(i_a, j_a)$, $1 \le a \le n$.*

**Proof.** Condition (1) implies, for all $n$ and $h \ge 1$ (suppressing "$P$-a.s." throughout),

$$\begin{aligned} P[X_1 = (i_1, j_1), X_2 = (i_2, j_2) \mid X_n = (i_n, j_n), \dots, X_{n+h} = (i_{n+h}, j_{n+h})] &= \\ P[X_1 = (i_1, j_2), X_2 = (i_2, j_1) \mid X_n = (i_n, j_n), \dots, X_{n+h} = (i_{n+h}, j_{n+h})]. \end{aligned} \tag{2}$$

**Citation:** Diaconis, P. Partial Exchangeability for Contingency Tables. *Mathematics* **2022**, *10*, 442. https://doi.org/10.3390/math10030442

Academic Editors: Emanuele Dolera and Federico Bassetti

Received: 30 December 2021 Accepted: 20 January 2022 Published: 29 January 2022


Let $h \uparrow \infty$ and then $n \uparrow \infty$. Let $\mathcal{T}$ be the tail field of $\{X_i\}_{i=1}^{\infty}$. Then Doob's increasing and decreasing martingale theorems show

$$P[X\_1 = (i\_1, j\_1), X\_2 = (i\_2, j\_2) | \mathcal{T}] = P[X\_1 = (i\_1, j\_2), X\_2 = (i\_2, j\_1) | \mathcal{T}].$$

However, a standard form of de Finetti's theorem says that, given $\mathcal{T}$, the $\{X_i\}_{i=1}^{\infty}$ are i.i.d. with $P[X_1 = (i, j)] = p_{ij}$. Thus

$$p_{ij}\, p_{i'j'} = p_{ij'}\, p_{i'j} \quad \text{for all } i, i', j, j'. \tag{3}$$

Finally, observe that (3) implies (writing $p_{i*} := \sum_j p_{ij}$, $p_{*j} := \sum_i p_{ij}$)

$$p_{i*}\, p_{*j} = \sum_{i',j'} p_{ij'}\, p_{i'j} = \sum_{i',j'} p_{ij}\, p_{i'j'} = p_{ij}. \qquad \square$$

We remark the following points.

1. If $X_i = (Y_i, Z_i)$, condition (2) is equivalent to

$$\mathcal{L}((Y_1, Z_1), (Y_2, Z_2), \dots, (Y_n, Z_n)) = \mathcal{L}((Y_1, Z_{\sigma(1)}), \dots, (Y_n, Z_{\sigma(n)}))$$

for all $n$ and $\sigma \in S_n$ ($S_n$ is the symmetric group on $1, 2, \dots, n$). Since $\{(Y_i, Z_i)\}_{i=1}^n$ are exchangeable, this is equivalent to saying the law is invariant under $S_n \times S_n$.


**Theorem 3.** *Let $X_i = (Y_i, Z_i)$ be exchangeable with $Y_i \in \mathcal{Y}$, $Z_i \in \mathcal{Z}$, complete separable metric spaces, $1 \le i < \infty$. Suppose*

$$\begin{aligned} P[X\_1 \in (A\_1, B\_1), X\_2 \in (A\_2, B\_2), \dots, X\_n \in (A\_n, B\_n)] &= \\ P[X\_1 \in (A\_1, B\_2), X\_2 \in (A\_2, B\_1), \dots, X\_n \in (A\_n, B\_n)] \end{aligned}$$

*for all measurable Ai*, *Bi and all n. Then,*

$$P(X_1 \in (A_1, B_1), \dots, X_n \in (A_n, B_n)) = \int_{\mathcal{P}(\mathcal{Y}) \times \mathcal{P}(\mathcal{Z})} \prod_{1}^{n} \theta(A_i) \eta(B_i) \, \mu(d\theta, d\eta),$$

*with $\mathcal{P}(\mathcal{Y})$, $\mathcal{P}(\mathcal{Z})$ the probabilities on the Borel sets of $\mathcal{Y}$, $\mathcal{Z}$. The mixing measure $\mu$ is unique.*


As usual, the mixing measure can be recovered from the empirical measures: for $\mu$-almost every $(\theta, \eta)$,

$$\frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}(A \times B) \to \theta(A)\,\eta(B) \quad \text{almost surely}.$$
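To illustrate, this convergence can be checked by simulation when the mixing measure is a point mass at a fixed pair $(\theta, \eta)$; the particular weights and events below are arbitrary choices for the sketch.

```python
import random

rng = random.Random(0)

# Point-mass mixing measure: a fixed theta on {0,1,2} and eta on {0,1}.
theta = [0.5, 0.3, 0.2]
eta = [0.6, 0.4]

# Draw n i.i.d. pairs X_i = (Y_i, Z_i) with Y_i ~ theta, Z_i ~ eta independent.
n = 200_000
ys = rng.choices(range(3), weights=theta, k=n)
zs = rng.choices(range(2), weights=eta, k=n)

# Empirical measure of a product event A x B versus theta(A) * eta(B).
A, B = {0, 1}, {0}
empirical = sum(1 for y, z in zip(ys, zs) if y in A and z in B) / n
limit = sum(theta[i] for i in A) * sum(eta[j] for j in B)
print(abs(empirical - limit))  # small for large n
```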

One object of this paper is to develop similar parameter free de Finetti theorems for widely used log-linear models for discrete data. Section 2 begins by relating this to an ongoing conversation with Eugenio Regazzini. Section 3 provides needed background on discrete exponential families and algebraic statistics. Sections 4 and 5 apply those tools to give de Finetti style partially exchangeable theorems for some widely used hierarchical and graphical models for contingency tables. Section 6 shows how these exponential family tools can be used for other Bayesian tasks: building "de Finetti priors" for "almost exchangeability" and running the "exchange" algorithm for doubly intractable Bayesian computation. Some philosophy and open problems are in the final section.

#### **2. Some History**

I was lucky enough to be able to speak at Eugenio Regazzini's 60th birthday celebration, in Milan, in 2006. My talk began this way:

"Hello, my name is Persi and I have a problem."

For those of you not aware of the many "12-step programs" (Alcoholics Anonymous, Gamblers Anonymous, ...), they all begin this way, with the participants admitting to having a problem. In my case the problem was this:


There is a lot of nice mathematics and hard work in (b), but such tests violate the likelihood principle and lead to poor scientific practice. Hence my problem (I still have it): (a) and (b) are incompatible.

There has been some progress. I now see how some of the tools developed for (b) can be usefully employed for natural tasks suggested by (a). Not so many people care about such inferential questions in these 'big data' days. However, there are also lots of small datasets where the inferential details matter. There are still useful questions for people like Eugenio (and me).

#### **3. Background on Exponential Families and Algebraic Statistics**

The following development is closely based on [5], which should be consulted for examples, proofs, and more details.

Let $\mathcal{X}$ be a finite set. Consider the exponential family:

$$p\_{\theta}(\mathbf{x}) = \frac{1}{Z(\theta)} e^{\theta \cdot T(\mathbf{x})} \quad \theta \in \mathbb{R}^d, \mathbf{x} \in \mathcal{X} \tag{4}$$

Here, $Z(\theta)$ is a normalizing constant and $T : \mathcal{X} \to \mathbb{N}^d - \{0\}$. If $X_1, X_2, \dots, X_n$ are independent and identically distributed from (4), the statistic $t = T(X_1) + \cdots + T(X_n)$ is sufficient for $\theta$. Let

$$\mathcal{Y}_t = \{(x_1, \dots, x_n) : T(x_1) + \cdots + T(x_n) = t\}.$$

Under (4), the distribution of $X_1, \dots, X_n$ given $t$ is uniform on $\mathcal{Y}_t$. It is usual to write

$$t = \sum_{i=1}^{n} T(X_i) = \sum_{x \in \mathcal{X}} \sigma(x) T(x) \quad \text{with } \sigma(x) = \#\{i : X_i = x\}.$$

Let

$$\mathcal{F}\_t = \{ f : \mathcal{X} \to \mathbb{N} : \sum f(\mathbf{x}) T(\mathbf{x}) = t \}.$$

**Example 1.** *For contingency tables, $\mathcal{X} = \{(i, j) : 1 \le i \le I,\ 1 \le j \le J\}$. The usual model for independence has $T(i, j) \in \mathbb{N}^{I+J}$, a vector of length $I + J$ with two non-zero entries equal to 1. The 1's in $T(i, j)$ are in the $i$th place and in position $j$ of the last $J$ places. The sufficient statistic $t$ contains the row and column sums of the contingency table associated to the first $n$ observations. The set $\mathcal{F}_t$ is the set of all $I \times J$ tables with these row and column sums.*

*A Markov chain on this $\mathcal{F}_t$ can be based on the following moves: pick $i \ne i'$, $j \ne j'$ and change the entries of the current $f$ by adding $\pm 1$ in the pattern*

$$
\begin{array}{c|cc}
 & j & j' \\
\hline
i & + & - \\
i' & - & +
\end{array}
\qquad \text{or} \qquad
\begin{array}{c|cc}
 & j & j' \\
\hline
i & - & + \\
i' & + & -
\end{array}
$$

*This does not change the row sums and it does not change the column sums. If told to go negative, just pick new $i, i', j, j'$. This gives a connected, aperiodic Markov chain on $\mathcal{F}_t$ with uniform stationary distribution. See [6].*
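A minimal sketch of this swap chain, with helper names of my own choosing; it checks only the invariance of the line sums, the uniformity of the stationary distribution being the assertion of [6] quoted above.

```python
import random

def margins(T):
    """Row and column sums of a table given as a list of lists."""
    return [sum(r) for r in T], [sum(c) for c in zip(*T)]

def swap_step(T, rng):
    """One +/- move on rows {i, i'} and columns {j, j'}; stays put if the
    move would drive an entry negative."""
    I, J = len(T), len(T[0])
    i, ip = rng.sample(range(I), 2)
    j, jp = rng.sample(range(J), 2)
    s = rng.choice([1, -1])
    new = [row[:] for row in T]
    new[i][j] += s;  new[i][jp] -= s
    new[ip][j] -= s; new[ip][jp] += s
    return T if min(new[i][j], new[i][jp], new[ip][j], new[ip][jp]) < 0 else new

rng = random.Random(1)
T = [[3, 1, 0], [2, 4, 1]]
m0 = margins(T)
for _ in range(1000):
    T = swap_step(T, rng)
assert margins(T) == m0  # the moves never change row or column sums
```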

Returning to the general case, an analog of the $\begin{smallmatrix} + & - \\ - & + \end{smallmatrix}$ moves is given by the following:

**Definition 1** (Markov basis)**.** *A Markov basis is a set of functions f*1, *f*2, ..., *fL from* X *to* Z *such that*

$$\sum_{x \in \mathcal{X}} f_l(x) T(x) = 0, \quad 1 \le l \le L, \tag{5}$$

*and that for any $t$ and $f, f' \in \mathcal{F}_t$ there are $(t_1, f_{i_1}), \dots, (t_A, f_{i_A})$ with $t_j = \pm 1$, such that*

$$f' = f + \sum\_{j=1}^{A} t\_j f\_{i\_j} \quad \text{and} \quad f + \sum\_{j=1}^{a} t\_j f\_{i\_j} \ge 0, \text{ for } 1 \le a \le A. \tag{6}$$

This allows the construction of a Markov chain on $\mathcal{F}_t$: from $f$, pick $l \in \{1, 2, \dots, L\}$ and $t = \pm 1$ at random and consider $f + t f_l$. If this is non-negative, move there. If not, stay at $f$. Assumptions (5) and (6) ensure that this Markov chain is symmetric and ergodic with uniform stationary distribution. Below, I will use a Markov basis to formulate a de Finetti theorem characterizing mixtures of the model (4).

One of the main contributions of [5] is a method of effectively constructing Markov bases using polynomial algebra. For each $x \in \mathcal{X}$, introduce an indeterminate, also called $x$. Consider the ring of polynomials $k[\mathcal{X}]$ in these indeterminates, where $k$ is a field, e.g., the complex numbers. A function $g : \mathcal{X} \to \mathbb{N}$ is represented as a monomial $\mathcal{X}^g = \prod_{x \in \mathcal{X}} x^{g(x)}$. The function $T : \mathcal{X} \to \mathbb{N}^d$ gives a homomorphism

$$\begin{aligned} \varphi_T \colon k[\mathcal{X}] &\longrightarrow k[t_1, \dots, t_d] \\ x &\longmapsto t_1^{T_1(x)} t_2^{T_2(x)} \cdots t_d^{T_d(x)} \end{aligned}$$

extended linearly and multiplicatively ($\varphi_T(x + y) = \varphi_T(x) + \varphi_T(y)$, $\varphi_T(x^2) = \varphi_T(x)^2$, and so on). The basic object of interest is the kernel of $\varphi_T$:

$$I_T = \{p \in k[\mathcal{X}] : \varphi_T(p) = 0\}.$$

This is an ideal in $k[\mathcal{X}]$. A key result of [5] is that a generating set for $I_T$ is equivalent to a Markov basis. To state this, observe that any $f : \mathcal{X} \to \mathbb{Z}$ can be written $f = f_+ - f_-$ with $f_+(x) = \max(f(x), 0)$ and $f_-(x) = \max(-f(x), 0)$. Observe that $\sum f(x) T(x) = 0$ iff $\mathcal{X}^{f_+} - \mathcal{X}^{f_-} \in I_T$. The key result is:

**Theorem 4.** *A collection of functions f*1, *f*2, ..., *fL is a Markov basis if and only if the set*

$$\mathcal{X}^{f_{l+}} - \mathcal{X}^{f_{l-}}, \quad 1 \le l \le L,$$

*generates the ideal IT.*

Now, the Hilbert Basis Theorem shows that ideals in *k*[X ] have finite bases and modern computer algebra packages give an effective way of finding bases.
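As a small illustration of such a computation (a sketch, not how one would work at scale; dedicated systems such as 4ti2 or Macaulay2 are the usual tools), SymPy's Gröbner bases can recover $I_T$ for the $2 \times 2$ independence model of Example 1 by eliminating the parameter variables.

```python
from sympy import symbols, groebner

# Indeterminates for the four cells and for row/column parameters.
x11, x12, x21, x22 = symbols('x11 x12 x21 x22')
r1, r2, c1, c2 = symbols('r1 r2 c1 c2')

# phi_T sends x_ij -> r_i * c_j.  Its kernel I_T is the elimination ideal of
# <x_ij - r_i c_j> obtained by removing r, c with a lex Groebner basis.
G = groebner(
    [x11 - r1*c1, x12 - r1*c2, x21 - r2*c1, x22 - r2*c2],
    r1, r2, c1, c2, x11, x12, x21, x22,
    order='lex',
)

# The 2x2 independence binomial lies in the kernel: it reduces to zero.
_, remainder = G.reduce(x11*x22 - x12*x21)
print(remainder)  # 0
```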

I do not want (or need) to develop this further. See [5] or the book by Sullivant [7] or Aoki et al. [8]. There is even a Journal of Algebraic Statistics.

I hope that the above gives a flavor for what I mean by "working in (b) is hard honest work". Most of the applications are for standard frequentist tasks. In the following sections, I will give Bayesian applications.

#### **4. Log Linear Model for Contingency Tables**

Log linear models for multiway contingency tables are a healthy part of modern statistics. The index set is $\mathcal{X} = \prod_{\gamma \in \Gamma} [I_\gamma]$, with $\Gamma$ indexing categories and $I_\gamma$ the levels of $\gamma$. Let $p(x)$ be the probability of falling into cell $x \in \mathcal{X}$. A log linear model can be specified by writing:

$$\log p(x) = \sum_{a \subseteq \Gamma} \varphi_a(x).$$

The sum ranges over subsets $a$ of $\Gamma$ and $\varphi_a(x)$ denotes a function that depends on $x$ only through the coordinates in $a$. Thus, $\varphi_\emptyset(x)$ is a constant and $\varphi_\Gamma(x)$ is allowed to depend on all coordinates. Specifying $\varphi_a \equiv 0$ for some class of sets $a$ determines a model. Background and extensive references are in [9]. If the sets $a$ with $\varphi_a \not\equiv 0$ permitted form a simplicial complex $\mathcal{C}$ (so $a \in \mathcal{C}$ and $\emptyset \ne a' \subseteq a \Rightarrow a' \in \mathcal{C}$), the model is called *hierarchical*. If $\mathcal{C}$ consists of the cliques of a graph, the model is called *graphical*. If the graph is chordal (every cycle of length $\ge 4$ contains a chord), the graphical model is called *decomposable*.

**Example 2** (3-way contingency tables)**.** *The graphical models for three way tables are: complete independence of the three variables, one variable independent of the other two, two variables conditionally independent given the third, and the saturated model. The first three are treated in Examples 4–6 below.*

The simplest hierarchical model that is not graphical is the *no three way interaction* model. This can be specified by saying 'the odds ratio of any pair of variables does not depend on the third'. Thus,

$$\frac{p\_{ijk}p\_{i'j'k}}{p\_{ij'k}p\_{i'jk}}\quad\text{is constant in }k\text{ for fixed }i,i',j,j'.\tag{7}$$

As one motivation, recall that for two variables, the independence model is specified by

$$p_{ij} = \theta_i \eta_j.$$

For three variables, suppose there are parameters $\theta_{ij}$, $\eta_{jk}$, $\psi_{ik}$ satisfying:

$$p_{ijk} = \theta_{ij}\, \eta_{jk}\, \psi_{ik} \quad \text{for all } i, j, k. \tag{8}$$

It is easy to see that (8) entails (7), hence 'no three way interaction'. Cross-multiplying (7) gives

$$p_{ijk}\, p_{i'j'k}\, p_{ij'k'}\, p_{i'jk'} = p_{ij'k}\, p_{i'jk}\, p_{ijk'}\, p_{i'j'k'}. \tag{9}$$

This is the form we will work with for the de Finetti theorems below.
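A quick numerical sanity check that the parametrization (8) satisfies the cross-multiplied condition (9); the dimensions and the random positive parameters below are arbitrary choices.

```python
import random

rng = random.Random(2)
I, J, K = 3, 4, 2

# Hypothetical positive parameters theta_ij, eta_jk, psi_ik as in (8).
theta = [[rng.uniform(0.1, 1.0) for _ in range(J)] for _ in range(I)]
eta = [[rng.uniform(0.1, 1.0) for _ in range(K)] for _ in range(J)]
psi = [[rng.uniform(0.1, 1.0) for _ in range(K)] for _ in range(I)]

p = [[[theta[i][j] * eta[j][k] * psi[i][k] for k in range(K)]
      for j in range(J)] for i in range(I)]
# (Normalizing p to sum to one would not change the cross ratios below.)

# Check (9) for every i, i', j, j' and the two levels k = 0, k' = 1.
for i in range(I):
    for i2 in range(I):
        for j in range(J):
            for j2 in range(J):
                lhs = p[i][j][0] * p[i2][j2][0] * p[i][j2][1] * p[i2][j][1]
                rhs = p[i][j2][0] * p[i2][j][0] * p[i][j][1] * p[i2][j2][1]
                assert abs(lhs - rhs) <= 1e-12 * abs(lhs)
print("condition (9) holds for the parametrization (8)")
```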

For background, history, and examples (and some nice theorems) see ([10], Section 8.2) and [11,12]. Simpson's 'paradox' [13] is based on understanding the no three way interaction model. Further discussion is in Section 5 below.

## **5. From Markov Bases to de Finetti Theorems**

Suppose $\mathcal{X}$ is a finite set, $T : \mathcal{X} \to \mathbb{N}^d - \{0\}$ is a statistic, and $\{f_i\}_{i=1}^L$ is a Markov basis as in Section 3. The following development shows how to translate this into de Finetti theorems for the contingency table examples of Section 4. The first argument abstracts the one used for Theorem 2 above.

**Lemma 1** (Key Lemma)**.** *Let $\mathcal{X}$ be a finite set and $\{X_i\}_{i=1}^\infty$ an exchangeable sequence of $\mathcal{X}$-valued random variables. Suppose, for all $n > m$,*

$$\begin{aligned} P[X_1 = x_1, \dots, X_m = x_m, X_{m+1} = x_{m+1}, \dots, X_n = x_n] &= \\ P[X_1 = y_1, \dots, X_m = y_m, X_{m+1} = x_{m+1}, \dots, X_n = x_n]. \end{aligned} \tag{10}$$

*In* (10)*, $x_1, \dots, x_m, y_1, \dots, y_m$ are fixed and $x_{m+1}, \dots, x_n$ are arbitrary. Then, if $\mathcal{T}$ is the tail field of $\{X_i\}_{i=1}^\infty$ and $p(x) = P[X_1 = x \mid \mathcal{T}]$,*

$$\prod_{i=1}^{m} p(x_i) = \prod_{i=1}^{m} p(y_i). \tag{11}$$

**Proof.** From (10) and exchangeability

$$\begin{aligned} P[X_1 = x_1, \dots, X_m = x_m, X_{n+1} = x_{n+1}, \dots, X_{n+h} = x_{n+h}] &= \\ P[X_1 = y_1, \dots, X_m = y_m, X_{n+1} = x_{n+1}, \dots, X_{n+h} = x_{n+h}], \end{aligned}$$

so

$$\begin{aligned} P[X_1 = x_1, \dots, X_m = x_m \mid X_{n+1} = x_{n+1}, \dots, X_{n+h} = x_{n+h}] &= \\ P[X_1 = y_1, \dots, X_m = y_m \mid X_{n+1} = x_{n+1}, \dots, X_{n+h} = x_{n+h}]. \end{aligned}$$

Let $h \uparrow \infty$ and then $n \uparrow \infty$; use Doob's upward and then downward martingale convergence theorems to see:

$$P[X\_1 = x\_1, \ldots, X\_m = x\_m | \mathcal{T}] = P[X\_1 = y\_1, \ldots, X\_m = y\_m | \mathcal{T}].$$

Now, de Finetti's theorem implies (11). $\square$

**Remark 1.** *The Key Lemma shows that the $p(x)$ satisfy certain relations. Using choices of $\{x_i\}$, $\{y_i\}$ derived from a Markov basis will show that the $p(x)$ satisfy the required independence properties. Suppose that $\sum_{\mathcal{X}} f(x) T(x) = 0$, $\sum_{\mathcal{X}} f(x) = 0$ and $f(x) \in \{0, \pm 1\}$. Let $S_+ = \{x : f(x) = 1\}$, $S_- = \{y : f(y) = -1\}$, so $|S_+| = |S_-| = m$. Enumerate $S_+ = \{x_1, \dots, x_m\}$, $S_- = \{y_1, \dots, y_m\}$. Assumption (10) and conclusion (11) will give our theorems.*

**Example 3** (Independence in a two way table)**.** *Let $\mathcal{X} = [I] \times [J]$. A minimal basis for the independence model is given by $f_{i,j,i',j'}$:*

$$
\begin{array}{c|cc}
 & j & j' \\
\hline
 i & + & - \\
 i' & - & + \\
\end{array}
\quad (all\ other\ entries=0).
$$

*The condition of the Key Lemma becomes:*

$$\begin{aligned} P[X_1 = (i, j), X_2 = (i', j'), X_3 = (i_3, j_3), \dots, X_n = (i_n, j_n)] &= \\ P[X_1 = (i, j'), X_2 = (i', j), X_3 = (i_3, j_3), \dots, X_n = (i_n, j_n)]. \end{aligned}$$

*Passing to the limit gives*

$$p_{ij}\, p_{i'j'} = p_{ij'}\, p_{i'j}$$

*and so*

$$p_{i*}\, p_{*j} = \sum_{i',j'} p_{ij'}\, p_{i'j} = p_{ij}.$$

*This is precisely Theorem 2 of the Introduction.*

**Example 4** (Complete independence in a three way table)**.** *The sufficient statistics are $T_{i**}, T_{*j*}, T_{**k}$. From [5], there are two kinds of moves in a minimal basis: up to symmetries, moves lying within a single plane of the array and moves straddling two parallel planes.*

*Passing to the limit, this entails:*

$$p_{ijk}\, p_{i'j'k} = p_{ij'k}\, p_{i'jk} \quad \text{and} \quad p_{ijk}\, p_{i'j'k'} = p_{ij'k}\, p_{i'jk'}.$$

*These may be read as 'the product of any $p_{ijk}, p_{i'j'k'}$ remains unchanged if the middle coordinates are exchanged'. By symmetry, this remains true if the first or the last coordinates are exchanged. As above, this entails*

$$p_{i**}\, p_{*j*}\, p_{**k} = p_{ijk}.$$

*These observations can be rephrased into a statement that looks more like the classical de Finetti theorem, using symmetry:*

**Theorem 5.** *Let $\{X_i\}_{i=1}^\infty$ be exchangeable, taking values in $[I] \times [J] \times [K]$. Then*

$$\begin{aligned} P[X_1 = (i_1, j_1, k_1), \dots, X_n = (i_n, j_n, k_n)] &= \\ P[X_1 = (\sigma(i_1), \zeta(j_1), \eta(k_1)), \dots, X_n = (\sigma(i_n), \zeta(j_n), \eta(k_n))] \end{aligned}$$

*for all $n$, $\{(i_a, j_a, k_a)\}_{a=1}^n$ and $(\sigma, \zeta, \eta) \in S_I \times S_J \times S_K$ is necessary and sufficient for there to exist a unique $\mu$ on $\Delta_I \times \Delta_J \times \Delta_K$ with*

$$P[X_a = (i_a, j_a, k_a),\ 1 \le a \le n] = \int_{\Delta_I \times \Delta_J \times \Delta_K} \prod_{a=1}^n p_{i_a}\, q_{j_a}\, r_{k_a} \, \mu(dp, dq, dr).$$

**Example 5** (One variable independent of the other two)**.** *Suppose, without loss of generality, that variable 1 is independent of the pair of variables 2 and 3; the graph has vertices 1, 2, 3 and a single edge joining 2 and 3.*

*Identify the pairs $(j, k)$ with $\{1, 2, \dots, L\}$, $L = JK$. The problem reduces to Example 3. A minimal basis consists of (again, up to relabeling)*

$$
\begin{array}{c|cc}
 & l & l' \\
\hline
i & + & - \\
i' & - & +
\end{array}
$$

*We may conclude*

**Theorem 6.** *Let $\{X_i\}_{i=1}^\infty$ be exchangeable, taking values in $[I] \times [J] \times [K]$. Then*

$$\begin{aligned} P[X_1 = (i_1, j_1, k_1), \dots, X_n = (i_n, j_n, k_n)] &= \\ P[X_1 = (\sigma(i_1), \zeta(j_1, k_1)), \dots, X_n = (\sigma(i_n), \zeta(j_n, k_n))] \end{aligned}$$

*for all $n$, $\{(i_a, j_a, k_a)\}_{a=1}^n$ and $(\sigma, \zeta) \in S_I \times S_{J \times K}$ is necessary and sufficient for there to exist a unique $\mu$ on $\Delta_I \times \Delta_{JK}$ with*

$$P[X_a = (i_a, j_a, k_a),\ 1 \le a \le n] = \int_{\Delta_I \times \Delta_{JK}} \prod_{a=1}^n p_{i_a}\, q_{(j_a, k_a)} \, \mu(dp, dq).$$

**Example 6** (Conditional independence)**.** *Suppose variables $i$ and $j$ are conditionally independent given $k$; the graph is the path with edges $\{1, 3\}$ and $\{2, 3\}$.*

*Rewrite the parameter condition of Section 4 as*

$$p_{**k}\, p_{ijk} = p_{i*k}\, p_{*jk} \quad \text{for all } i, j, k.$$

*The sufficient statistics are $\{T_{i*k}\}_{i,k}$, $\{T_{*jk}\}_{j,k}$. From [5], a minimal generating set is*

$$\begin{array}{c|cc} & jk & j'k \\ \hline ik & + & - \\ i'k & - & + \end{array} \qquad\qquad K \times \frac{I(I-1)}{2} \times \frac{J(J-1)}{2} \text{ moves in all.}$$

*From this, the Key Lemma shows (for all i*, *j*, *k)*

$$p_{ijk}\, p_{i'j'k} = p_{ij'k}\, p_{i'jk}.$$

*This entails:*

$$p_{i*k}\, p_{*jk} = \sum_{i',j'} p_{ij'k}\, p_{i'jk} = \sum_{i',j'} p_{ijk}\, p_{i'j'k} = p_{ijk}\, p_{**k}.$$

*Again, condition* (10) *can be phrased in terms of symmetry:*

**Theorem 7.** *Let $\{X_i\}_{i=1}^\infty$ be exchangeable, taking values in $[I] \times [J] \times [K]$. Then,*

$$\begin{aligned} P[X_1 = (i_1, j_1, k_1), \dots, X_n = (i_n, j_n, k_n)] &= \\ P[X_1 = (\sigma^{k_1}(i_1), \zeta^{k_1}(j_1), k_1), \dots, X_n = (\sigma^{k_n}(i_n), \zeta^{k_n}(j_n), k_n)] \end{aligned} \tag{12}$$

*for all $n$, $\{(i_a, j_a, k_a)\}_{a=1}^n$ and $(\sigma^k, \zeta^k) \in S_I \times S_J$, $1 \le k \le K$, is necessary and sufficient for there to exist a unique probability $\mu$ on $\Delta_K$ and a family $\{\mu_{b,r}\}_{b=1}^K$ on $(\Delta_I \times \Delta_J)^K$ with*

$$P[X_a = (i_a, j_a, k_a),\ 1 \le a \le n] = \int_{\Delta_K \times (\Delta_I \times \Delta_J)^K} \prod_{a=1}^n r_{k_a}\, p^{k_a}_{i_a}\, q^{k_a}_{j_a} \, \prod_{b=1}^K \mu_{b,r}(dp^b, dq^b)\, \mu(dr). \tag{13}$$

Both (12) and (13) have a simple interpretation. For (12), $\{X_i\}_{i=1}^n$ are exchangeable 3-vectors. For any $k$ and specified sequence of values $\{(i_a, j_a, k)\}_{a=1}^n$, the chance of observing these values is unchanged under permuting the $(i_a, j_a, k)$ by permutations $\sigma^k \in S_I$, $\zeta^k \in S_J$. Here $\sigma^k, \zeta^k$ are allowed to depend on $k$.

On the right of (13), the mixing measure may be understood as follows. There is a probability $\mu$ on $\Delta_K$. Pick $r = (r_1, \dots, r_K) \in \Delta_K$. Given this $r$, pick $(p^k, q^k)$ from $\mu_{k,r}$ on the $k$th copy of $\Delta_I \times \Delta_J$. These choices are allowed to depend on $r$ but are independent, conditional on $r$, $1 \le k \le K$.

All of this simply says that, conditional on the tail field,

$$P[X_1 = (i, j, k) \mid \mathcal{T}]\; P[X_1 = (*, *, k) \mid \mathcal{T}] = P[X_1 = (i, *, k) \mid \mathcal{T}]\; P[X_1 = (*, j, k) \mid \mathcal{T}].$$

The first two coordinates are conditionally independent given the third.

**Example 7** (No three way interaction)**.** *The model is described in Section 4. The sufficient statistics are* $\{T_{ij*}\}, \{T_{i*k}\}, \{T_{*jk}\}$. *Minimal Markov bases have proved intractable; see [5] or [8]. For any fixed $I, J, K$, the computer can produce a Markov basis, but these can have a huge number of terms. See [7,8] and their references for a surprisingly rich development.*

*There is a pleasant surprise. Markov bases are required to connect the associated Markov chain. There is a natural subset, the first moves anyone considers, and these are enough for a satisfactory de Finetti theorem (!).*

*Described informally, for an $I \times J \times K$ array, pick a pair of parallel planes, say the $k, k'$ planes in the three dimensional array, and consider moves depicted as*

$$\begin{array}{c|cc} k & j & j' \\ \hline i & + & - \\ i' & - & + \end{array} \qquad\qquad \begin{array}{c|cc} k' & j & j' \\ \hline i & - & + \\ i' & + & - \end{array}$$

*These moves preserve all line sums (the sufficient statistics). They are not sufficient to connect any two datasets with the same sufficient statistics. Using the prescription in the Key Lemma, suppose:*

$$\begin{aligned} P[X_1 = (i, j, k), X_2 = (i', j', k), X_3 = (i, j', k'), X_4 = (i', j, k'),\ X_a = (i_a, j_a, k_a),\ 5 \le a \le n] &= \\ P[X_1 = (i, j', k), X_2 = (i', j, k), X_3 = (i, j, k'), X_4 = (i', j', k'),\ X_a = (i_a, j_a, k_a),\ 5 \le a \le n]. \end{aligned} \tag{14}$$

*Passing to the limit gives*

$$p_{ijk}\, p_{i'j'k}\, p_{ij'k'}\, p_{i'jk'} = p_{ij'k}\, p_{i'jk}\, p_{ijk'}\, p_{i'j'k'}. \tag{15}$$

*This is exactly the no three way interaction condition. Or, equivalently:*

$$\frac{p_{ijk}\, p_{i'j'k}}{p_{ij'k}\, p_{i'jk}} = \frac{p_{ijk'}\, p_{i'j'k'}}{p_{ij'k'}\, p_{i'jk'}}.$$

*The odds ratios are constant on the $k$th and $k'$th planes (of course, they depend on $i, j, i', j'$). These considerations imply:*

**Theorem 8.** *Let $\{X_i\}_{i=1}^\infty$ be exchangeable, taking values in $[I] \times [J] \times [K]$. Then, condition* (14) *is necessary and sufficient for the existence of a unique probability $\mu$ on $\Delta_{IJK}$, supported on the no three way interaction variety* (15)*, satisfying*

$$P[X_a = (i_a, j_a, k_a),\ 1 \le a \le n] = \int_{\Delta_{IJK}} \prod_{i,j,k} p_{ijk}^{t_{ijk}} \, \mu(dp).$$

We remark on the following points.

1. It follows from theorems in [11,12] that, if all $p_{ijk} > 0$, condition (15) is equivalent to the unique representation

$$p_{ijk} = r\, \alpha_{jk}\, \beta_{ki}\, \gamma_{ij}, \tag{16}$$

where *r*, *α*, *β*, *γ* have positive entries and satisfy

$$\sum_{k} \alpha_{jk} = \sum_{i} \beta_{ki} = \sum_{j} \gamma_{ij} = 1 \quad \text{for all } i, j, k$$

and

$$r \sum_{i,j,k} \alpha_{jk}\, \beta_{ki}\, \gamma_{ij} = 1.$$

The integral representation in the theorem can be stated in this parametrization. The condition *pijk* > 0 is equivalent to *P*(*X*<sup>1</sup> = (*i*, *j*, *k*)) > 0 on observables.


#### **6. Discussion and Conclusions**

The tools of algebraic statistics have been harnessed above to develop partial exchangeability for standard contingency table models. I have used them for two further Bayesian tasks: approximate exchangeability and the problem of 'doubly intractable priors'. As both are developed in papers, I will be brief.

*Approximate exchangeability.* Consider $n$ men and $m$ women along with a binary outcome. If the men are judged exchangeable (for fixed outcomes for the women) and vice versa, and if both sequences are extendable, de Finetti [1] shows that there is a unique prior on the unit square $[0, 1]^2$ such that, for any outcomes $t_1, \dots, t_n, \sigma_1, \dots, \sigma_m$ in $\{0, 1\}$,

$$P[X_1 = t_1, \dots, X_n = t_n, Y_1 = \sigma_1, \dots, Y_m = \sigma_m] = \int_{[0,1]^2} p^S (1-p)^{n-S}\, \theta^T (1-\theta)^{m-T} \, \mu(dp, d\theta),$$

with $S = \sum_{i=1}^n t_i$, $T = \sum_{j=1}^m \sigma_j$.

If, for the outcome of interest, $\{X_i, Y_j\}$ were almost fully exchangeable (so the men/women difference is judged practically irrelevant), the prior $\mu$ would be concentrated near the diagonal of $[0, 1]^2$. De Finetti suggested implementing this by considering priors of the form

$$\mu(dp, d\theta) = Z^{-1} e^{-A(p-\theta)^2} dp d\theta$$

for *A* large.
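A minimal sketch of sampling from such a near-diagonal prior by rejection; the function name and the choice $A = 200$ are mine, for illustration.

```python
import math
import random

def sample_near_diagonal_prior(A, rng):
    """Rejection sampling from mu(dp, dtheta) proportional to
    exp(-A (p - theta)^2) on the unit square [0, 1]^2.

    Relative to the uniform proposal, the unnormalized density is at most 1,
    so a uniform draw (p, theta) is accepted with probability
    exp(-A (p - theta)^2)."""
    while True:
        p, theta = rng.random(), rng.random()
        if rng.random() < math.exp(-A * (p - theta) ** 2):
            return p, theta

rng = random.Random(3)
draws = [sample_near_diagonal_prior(A=200.0, rng=rng) for _ in range(2000)]
mean_gap = sum(abs(p - t) for p, t in draws) / len(draws)
print(mean_gap)  # of order 1/sqrt(A): the prior concentrates near the diagonal
```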

In joint work with Sergio Bacallado and Susan Holmes [3], multivariate versions of such priors are developed. These are required to concentrate near sub-manifolds of cubes or products of simplices; think of 'approximate no three way interaction'. We used the tools of algebraic statistics to find appropriate many-variable polynomials which vanish on the submanifold of interest. Many ad hoc choices were involved. Sampling from such priors or posteriors is a fresh research area. See [2,14,15].

Doubly intractable priors. Consider an exponential family as in Section 3:

$$p_\theta(x) = \frac{1}{Z(\theta)} e^{\theta \cdot T(x)}.$$

Here $x \in \mathcal{X}$, a finite set, $T : \mathcal{X} \to \mathbb{R}^d$ and $\theta \in \mathbb{R}^d$. In many real examples, the normalizing constant $Z(\theta)$ will be unknown and unknowable. For a Bayesian treatment, let $\Pi(d\theta)$ be a prior distribution on $\mathbb{R}^d$, for example, the conjugate prior.

If $X_1, X_2, \dots, X_n$ is an i.i.d. sample from $p_\theta$, then $T$ is a sufficient statistic and the posterior has the form

$$\bar{Z}^{-1}\, Z(\theta)^{-n}\, e^{\theta \cdot F}\, \Pi(d\theta),$$

with $F = \sum_{i=1}^n T(X_i)$ and $\bar{Z}$ another normalizing constant. The problem is that $Z(\theta)^{-n}$ depends on $\theta$ and is unknown!

The exchange algorithm and many variants offer a useful solution. See [16,17].
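A sketch of the exchange algorithm on a hypothetical one-parameter family on a five-point space, where the auxiliary draw can be made exactly; in real applications that draw is itself done with samplers such as the Markov-basis chains of Section 3. All names and tuning constants below are mine.

```python
import math
import random

rng = random.Random(4)

# Toy family on X = {0,...,4} with T(x) = x, so p_theta(x) is proportional
# to exp(theta * x).  Z(theta) is never computed below.
STATES = range(5)

def sample_model(theta, n, rng):
    w = [math.exp(theta * x) for x in STATES]
    return rng.choices(STATES, weights=w, k=n)

def log_prior(theta):
    return -0.5 * theta * theta   # standard normal prior

def exchange_step(theta, F, n, step, rng):
    """One exchange update: the intractable Z(theta) factors cancel exactly."""
    prop = theta + step * rng.gauss(0.0, 1.0)
    G = sum(sample_model(prop, n, rng))   # auxiliary sufficient statistic
    log_a = log_prior(prop) - log_prior(theta) + (prop - theta) * (F - G)
    return prop if math.log(rng.random()) < log_a else theta

theta_true, n = 0.7, 200
F = sum(sample_model(theta_true, n, rng))  # observed sufficient statistic

theta, trace = 0.0, []
for _ in range(3000):
    theta = exchange_step(theta, F, n, step=0.1, rng=rng)
    trace.append(theta)
post_mean = sum(trace[1000:]) / len(trace[1000:])
print(post_mean)  # close to theta_true
```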

In practical implementations, there is an intermediate step requiring a sample from $p_\theta^T$, the measure induced by $p_\theta^n$ under $(x_1, \dots, x_n) \mapsto \sum_{i=1}^n T(x_i)$. This is a discrete sampling task, and Markov basis techniques have proved useful. See [16].

*A philosophical comment.* The task undertaken above, finding believable Bayesian interpretations for widely used log linear models, goes somewhat against the grain of standard statistical practice. I do not think anyone takes a reasonably complex, high dimensional hierarchical model seriously. They are mostly used as a part of exploratory data analysis; this is not to deny their usefulness. Making any sense of a high dimensional dataset is a difficult task. Practitioners search through huge collections of models in an automated way. Usually, any reflection suggests the underlying data are nothing like a sample from a well specified population. Nonetheless, models are compared using product likelihood criteria. This is a far cry from being based on anyone's reasoned opinion.

I have written elsewhere about finding Bayesian justification for important statistical tasks such as graphical methods or exploratory data analysis [18]. These seem like tasks similar to 'how do you form a prior', different from the focus of even the most liberal Bayesian thinking.

*The sufficiency approach.* There is a different approach to extending de Finetti's theorem, using 'sufficiency'. Consider exchangeable $\{X_i\}_{i=1}^\infty$. For each $n$, suppose $T_n : \mathcal{X}^n \to \mathcal{Y}$ is a function. The $\{T_n\}$ have to fit together according to simple rules satisfied in all of the examples above. Call $\{X_i\}$ *partially exchangeable with respect to* $T_n$ if $P[X_1 = x_1, \dots, X_n = x_n \mid T_n = t_n]$ is uniform. Then, Diaconis and Freedman [19] show that a version of de Finetti's theorem holds: the law of $\{X_i\}$ is a mixture of extremal laws. In dozens of examples, these extremal laws can be identified with standard exponential families. This last step remains to be carried out in the generality of Section 3 above. What is required is a version of the Koopman–Pitman–Darmois theorem for discrete random variables. This is developed in [19] when $\mathcal{X} \subseteq \mathbb{N}$ and $T_n(X_1, \dots, X_n) = X_1 + \cdots + X_n$. Passing to interpretation, this version of partial exchangeability has the following form:

$$\begin{aligned} \text{if } T_n(x_1, \dots, x_n) &= T_n(y_1, \dots, y_n), \\ \text{then } P[X_1 = x_1, \dots, X_n = x_n] &= P[X_1 = y_1, \dots, X_n = y_n]. \end{aligned}$$

This is neat mathematics (and allows a very general theoretical development). However, it does not seem as easy to think about in natural examples. Exchangeability via symmetry is much easier. The development above is a half-way house between symmetry and sufficiency. A close relative of the sufficiency approach is the topic of 'extremal models' as developed by Martin-Löf and Lauritzen. See [20] and its references. Moreover, Refs. [21,22] are recent extensions aimed at contingency tables.

*Classical Bayesian contingency table analysis.* There is a healthy development of parametric analysis for the examples of Section 5. This is based on natural conjugate priors. It includes nice theory and R packages to actually carry out calculations in real problems. Papers that I like are [23–26]. The many wonderful contributions by I.J. Good are still very much worth consulting; see [27] for a survey. Section 5 provides 'observable characterizations' of the models. The problem of providing 'observable characterizations' of the associated conjugate priors (along the lines of [28]) remains open.

**Funding:** This research received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No. 817257) and from NSF grant No. 1954042.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The author would like to thank Paula Gablenz, Sourav Chatterjee and Emanuele Dolera for help throughout.

**Conflicts of Interest:** The author declares no conflict of interest.

## **References**

