## 2.2. Methods

We used eight machine-learning models to predict gastrointestinal morbidity from food-contamination data. The aim of model training was to minimize the root mean squared error (RMSE) between the actual model outputs and the expected outputs over the training set:

$$\min \text{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \tag{1}$$

where *N* is the number of tuples in the training set, *yi* is the actual model output for the *i*-th tuple, and *ŷi* is the expected (labeled) output of the *i*-th tuple. In this study, the output morbidity *yi* is calculated as the ratio of the number of incidences to the resident population of the investigated region (the floating population is not taken into account because of the difficulty of data collection).

A model is evaluated by its prediction accuracy on the test set. We used fivefold cross-validation: we partitioned the dataset into five equal-sized folds and ran the validation five times, each time using four folds as the training set and the remaining fold as the test set. Prediction accuracy was averaged over the five runs.
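A minimal Python sketch of this procedure (not from the original study; `model_factory` is a hypothetical callable that returns a fresh, unfitted model with scikit-learn-style `fit`/`predict` methods):

```python
import numpy as np

def five_fold_rmse(model_factory, X, y, seed=0):
    """Fivefold cross-validation: train on four folds, test on the fifth,
    and average the test RMSE of Eq. (1) over the five runs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), 5)  # five equal-sized folds
    scores = []
    for k in range(5):
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        model = model_factory()                         # fresh model per run
        model.fit(X[train], y[train])
        pred = model.predict(X[folds[k]])
        scores.append(np.sqrt(np.mean((y[folds[k]] - pred) ** 2)))
    return float(np.mean(scores))
```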

### 2.2.1. Multiple Linear Regression (MLR)

The MLR method calculates an output *y* from an *n*-dimensional input **x** as:

$$y = a_0 + \sum_{i=1}^{n} a_i x_i \tag{2}$$

where *a0* is the intercept and *ai* are the regression coefficients (*i* = 1, 2, ..., *n*). Here, *n* = 27,013; if a value *xi* is missing, it is imputed with the mean of the nonmissing values of *xi* over the training tuples.
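A minimal numpy sketch of this fit with mean imputation (not the authors' code; assumes missing values are encoded as NaN and function names are illustrative):

```python
import numpy as np

def fit_mlr_with_mean_imputation(X_train, y_train):
    """Impute missing entries with per-feature training means, then solve
    the least-squares problem of Eq. (2)."""
    means = np.nanmean(X_train, axis=0)              # means over nonmissing values
    X = np.where(np.isnan(X_train), means, X_train)  # fill gaps feature-by-feature
    A = np.hstack([np.ones((len(X), 1)), X])         # column of ones for a_0
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef, means                               # coef = (a_0, a_1, ..., a_n)
```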

### 2.2.2. Shallow Neural Network

We used a three-layer feed-forward artificial neural network (ANN) trained by the back-propagation algorithm. Each neuron in the input layer directly accepts an input component *xi*, while each neuron *j* in the hidden layer calculates an intermediate output *zj* as:

$$z_j = s\left(\sum_{i=1}^{n} w_{ij} x_i - \theta_j\right) \tag{3}$$

where *θj* is the threshold of the neuron, *wij* is the connection weight between the *i*-th input neuron and hidden neuron *j*, and *s* is the sigmoid activation function:

$$s(\mu) = \frac{1}{1 + e^{-\mu}} \tag{4}$$

Similarly, the output neuron calculates the final output *y* as:

$$y = s\left(\sum_{j=1}^{m} w_j z_j - \theta_0\right) \tag{5}$$

Empirically, we set the number of neurons *m* in the hidden layer to √*n*.
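A minimal numpy sketch of the forward pass defined by Equations (3)-(5) (illustrative names, not the authors' code):

```python
import numpy as np

def sigmoid(u):
    """Sigmoid activation, Eq. (4)."""
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, W_hidden, theta_hidden, w_out, theta_out):
    """Forward pass of the three-layer network.
    W_hidden: (m, n) input-to-hidden weights; theta_hidden: (m,) thresholds;
    w_out: (m,) hidden-to-output weights; theta_out: output threshold;
    with m set to round(sqrt(n)) per the text."""
    z = sigmoid(W_hidden @ x - theta_hidden)  # hidden outputs z_j, Eq. (3)
    return sigmoid(w_out @ z - theta_out)     # final output y, Eq. (5)
```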

### 2.2.3. Deep Belief Network (DBN)

A DBN [23] consists of a stack of Restricted Boltzmann Machines (RBMs) [27]. An RBM, consisting of a visible input layer and a hidden layer, is an energy-based probabilistic model that defines a joint probability distribution over an input vector **x** and a hidden vector **z** as:

$$P(\mathbf{x}, \mathbf{z}) = \frac{\exp(-E(\mathbf{x}, \mathbf{z}))}{\sum_{\mathbf{x}', \mathbf{z}'} \exp(-E(\mathbf{x}', \mathbf{z}'))} \tag{6}$$

where the denominator (the partition function) sums over all visible/hidden configurations, *E*(**x**, **z**) = −**b**<sup>T</sup>**x** − **c**<sup>T</sup>**z** − **x**<sup>T</sup>**Wz**, **b** and **c** are the bias vectors of the visible and hidden layers, respectively, and **W** is the visible-to-hidden interaction weight matrix. Note that a basic RBM learns distributions over binary vectors, but a Gaussian-Bernoulli energy function can be used to transform a real-valued vector into a binary one [28], after which the DBN learns distributions over the transformed binary vectors [29].
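For reference, a minimal numpy sketch of the energy and the unnormalized probability (function names are illustrative):

```python
import numpy as np

def rbm_energy(x, z, b, c, W):
    """Energy of a visible/hidden configuration of a binary RBM:
    E(x, z) = -b^T x - c^T z - x^T W z."""
    return -(b @ x) - (c @ z) - (x @ W @ z)

def unnormalized_prob(x, z, b, c, W):
    """exp(-E(x, z)); dividing by the partition function (the sum over
    all configurations) yields the joint probability of Eq. (6)."""
    return np.exp(-rbm_energy(x, z, b, c, W))
```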

After tuning the structural parameters of the DBN on the training sets, we set the number of hidden layers to four, with 3860, 550, 80, and 12 neurons in the respective hidden layers. A Gaussian mixture model was attached to the topmost RBM of the DBN to produce the output morbidity *y* from the topmost hidden vector **z**. DBN training consists of two stages. The first stage is pretraining, which maximizes the joint distribution of each RBM over the training set layer by layer:

$$\underset{\mathbf{b},\mathbf{c},\mathbf{W}}{\arg\max}\; \mathcal{J} = \frac{1}{N} \sum_{i=1}^{N} \log P(\mathbf{x}_i, \mathbf{z}_i) \tag{7}$$

The second stage is fine-tuning, which minimizes the RMSE of the whole DBN over the training set.
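As an illustration of the first stage (not the authors' implementation), a greedy layer-wise stack can be built with scikit-learn's `BernoulliRBM`, which trains by persistent contrastive divergence, an approximation to maximizing Equation (7):

```python
from sklearn.neural_network import BernoulliRBM

def pretrain_stack(X, layer_sizes=(3860, 550, 80, 12)):
    """Greedy layer-wise pretraining: each RBM is fit on the hidden
    activations of the previous one."""
    rbms, data = [], X
    for n_hidden in layer_sizes:
        rbm = BernoulliRBM(n_components=n_hidden, learning_rate=0.01, n_iter=10)
        data = rbm.fit_transform(data)  # hidden activations feed the next layer
        rbms.append(rbm)
    return rbms
```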

### 2.2.4. Evolutionary Deep Belief Network (EvoDBN)

A classical DBN is trained by a gradient-based, layerwise training algorithm [30], which is easily trapped in local optima, especially when the dimension is high. This issue can be tackled by evolutionary training algorithms, which evolve populations of solutions to explore multiple regions of the solution space simultaneously, increasing the chance of escaping local optima [31]. Here, we employed a recent efficient evolutionary algorithm called water wave optimization (WWO) [32], which has exhibited competitive performance compared with many other popular evolutionary algorithms in neural-network training [33].

To solve an optimization problem, WWO evolves a population of candidate solutions by mimicking wave propagation and breaking in shallow water. In WWO, each solution *X* is analogous to a wave: the higher its energy (fitness) *f*(*X*), the smaller its wavelength *λX*, and thus the smaller the range over which the wave propagates. *λX* is initially set to 0.5 and then updated at each generation as:

$$\lambda_X = \lambda_X \cdot \alpha^{-(f(X) - f_{\min} + \varepsilon)/(f_{\max} - f_{\min} + \varepsilon)} \tag{8}$$

where *f*max and *f*min are the maximum and minimum fitness values in the population, respectively, *α* is the wavelength-reduction coefficient with a suggested value of 1.0026, and *ε* is a very small number added to avoid division by zero. At each generation, *X* is propagated by adding an offset proportional to *λX* to each dimension *Xi* as follows:

$$X_i' = X_i + \lambda_X \cdot \mathrm{rand}(-1, 1) \cdot L_i \tag{9}$$

where *Li* is the length of the *i*-th dimension of the solution space, and rand(−1, 1) denotes a random number uniformly distributed in (−1, 1). Whenever a propagation produces a new best solution *X*<sup>∗</sup>, it is broken into several solitary waves, each of which moves a small distance from *X*<sup>∗</sup> in a random dimension *i*:

$$X_i' = X_i^{*} + \mathcal{N}(0, 1) \cdot \beta L_i \tag{10}$$

where *β* is the breaking coefficient, and *N*(0, 1) denotes the standard normal distribution. The best solitary wave, if better than *X*<sup>∗</sup>, replaces *X*<sup>∗</sup> in the population.
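A minimal Python sketch of the propagation and breaking operators (Equations (8)-(10)); the refraction operator of the full WWO algorithm is omitted, and all names and default values are illustrative:

```python
import numpy as np

def wwo_maximize(f, lower, upper, pop_size=10, generations=200,
                 alpha=1.0026, beta=0.01, eps=1e-31, seed=0):
    """Maximize fitness f over the box [lower, upper] by WWO propagation
    and breaking; refraction is omitted for brevity."""
    rng = np.random.default_rng(seed)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    L = upper - lower                                  # dimension lengths L_i
    dim = len(L)
    pop = rng.uniform(lower, upper, (pop_size, dim))
    lam = np.full(pop_size, 0.5)                       # initial wavelengths
    fit = np.array([f(x) for x in pop])
    x_best, f_best = pop[fit.argmax()].copy(), fit.max()
    for _ in range(generations):
        f_min, f_max = fit.min(), fit.max()
        for k in range(pop_size):
            # Eq. (8): higher fitness -> smaller wavelength -> finer search
            lam[k] *= alpha ** (-(fit[k] - f_min + eps) / (f_max - f_min + eps))
            # Eq. (9): propagate within a lambda-scaled neighborhood
            x_new = np.clip(pop[k] + lam[k] * rng.uniform(-1, 1, dim) * L,
                            lower, upper)
            f_new = f(x_new)
            if f_new > fit[k]:
                pop[k], fit[k] = x_new, f_new
            if f_new > f_best:
                # Eq. (10): break the new best into a few solitary waves
                for _ in range(3):
                    i = rng.integers(dim)
                    sw = x_new.copy()
                    sw[i] += rng.standard_normal() * beta * L[i]
                    f_sw = f(sw)
                    if f_sw > f_new:
                        x_new, f_new = sw, f_sw
                pop[k], fit[k] = x_new, f_new
                x_best, f_best = x_new.copy(), f_new
    return x_best, f_best
```

For example, `wwo_maximize(lambda x: -float(np.sum(x**2)), -5*np.ones(4), 5*np.ones(4))` drives the population toward the origin of a 4-dimensional sphere function.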

The EvoDBN uses the same architecture as the DBN, and also employs a Gaussian mixture model to produce the output morbidity. When training the EvoDBN, WWO is first applied to optimize the {**b**, **c**, **W**} parameters of each RBM layer by layer, where *f*(*X*) corresponds to the objective function in Equation (7). After pretraining, WWO is applied to optimize the parameters of the DBN as a whole, where *f*(*X*) is inversely proportional to the RMSE.
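A sketch of the whole-network fitness used in the second stage; the exact mapping from RMSE to fitness is not specified in the text, so the `1/(1 + RMSE)` form below is an assumption, and `predict` is a hypothetical function evaluating the DBN under a flattened parameter vector:

```python
import numpy as np

def wwo_fitness(predict, params, X_val, y_val):
    """Fitness inversely proportional to the RMSE of Eq. (1): lower
    prediction error gives a wave higher energy."""
    pred = predict(params, X_val)          # hypothetical DBN evaluation
    rmse = np.sqrt(np.mean((y_val - pred) ** 2))
    return 1.0 / (1.0 + rmse)              # assumed inverse mapping
```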
