Article

A Hierarchical Bayesian Model for Inferring and Decision Making in Multi-Dimensional Volatile Binary Environments

1 State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang 110016, China
2 Institutes for Robotics and Intelligent Manufacturing, Chinese Academy of Sciences, Shenyang 110169, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
4 Beijing Key Laboratory of Applied Experimental Psychology, School of Psychology, Beijing Normal University, Beijing 100875, China
5 State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing 100875, China
6 School of Systems Science, Beijing Normal University, Beijing 100875, China
7 Chinese Institute for Brain Research, Beijing 102206, China
* Author to whom correspondence should be addressed.
Mathematics 2022, 10(24), 4775; https://doi.org/10.3390/math10244775
Submission received: 15 November 2022 / Revised: 30 November 2022 / Accepted: 5 December 2022 / Published: 15 December 2022
(This article belongs to the Special Issue Mathematical and Computational Neuroscience)

Abstract

The ability to track changes in the surrounding environment is critical for humans and animals to adapt their behaviors. In high-dimensional environments, the interactions among dimensions need to be estimated for better perception and decision making, for example in volatile or social cognition tasks. We develop a hierarchical Bayesian model for inference and decision making in multi-dimensional volatile environments. The hierarchical Bayesian model is composed of a hierarchical perceptual model and a response model. Using the variational Bayes method, we derive closed-form update rules. These update rules also constitute a complete predictive coding scheme. To validate the effectiveness of the model in multi-dimensional volatile environments, we define a probabilistic gambling task modified from a two-armed bandit. Simulation results demonstrate that an agent endowed with the proposed hierarchical Bayesian model is able to infer and update its internal belief on the tendency and volatility of the sensory inputs. Based on this internal belief, the agent yields near-optimal behavior following its response model. Our results point to this model as a viable framework for explaining the temporal dynamics of human decision behavior in complex, high-dimensional environments.
MSC:
62C10; 62C12; 62M45; 68T07

1. Introduction

Natural environments are volatile, with ever changing sensory distributions and reward contingencies [1]. In a volatile environment, a biological agent must maintain stable internal states and be able to efficiently capture effective sensory information at the same time [2,3]. These seemingly contradictory requirements are unified into Bayesian inference [4], an optimal probability inference process.
Neuroscience research has shown that Bayesian inference underlies brain functions, such as perception, memory and decision-making, and resulting adaptive animal behaviors [3,5,6,7,8,9,10,11]. Adaptive behaviors are rooted in perceptual inferences and adaptive behavioral responses [12,13,14,15,16]. To understand the mechanisms of adaptive behaviors, one basic approach is to employ a generative model to infer the probabilistic distribution of sensory information and reproduce the temporal dynamics of human perception and decision-making in dynamic environments [17,18]. In this view, a Bayesian agent with a generative model is able to transform sensory inputs into behavioral responses [19]. With appropriate choice of parameters, a Bayesian agent could account for human decision behaviors.
“Observing the observer” is a meta-Bayesian framework to simulate the perception processes of humans [20,21]. Perceptual and response models are two key components of this framework. According to this theoretical framework, inversion of the perceptual and response models can map from sensory inputs to response actions based on the variational free energy principle [15,22,23] or Bayes’ rule.
To deal with volatile environments, volatility models, such as Hierarchical Gaussian Filtering [23], have been developed to estimate changes in the environment. Accumulating evidence from research on human learning and perception shows that volatile Bayesian models (e.g., Hierarchical Gaussian Filtering) explain human behavior well, especially in changing environments. For example, saccadic response speed can be modulated by the prediction precision of the belief on sensory inputs [24]. Adults with autism spectrum disorders overestimate the volatility of the sensory environment and changes in sensory inputs [14]. This overestimation of volatility reduces the precision of prior beliefs on sensory inputs. In human social learning, hierarchical prediction errors are encoded by midbrain and septum activity [25]. This evidence suggests that hierarchical Bayesian inference provides an optimal scheme to diminish surprise and reduce uncertainty in a volatile world.
To gain a theoretical understanding of decision making under uncertainty with finite resources, the multi-armed bandit problem has been formulated as an abstraction [26,27,28,29]. The goal of the multi-armed bandit problem is to maximize the overall reward through a series of choices. In neuroscience, the multi-armed bandit problem is widely used to investigate economic decision making, contingent learning and human social behavior [30,31,32,33,34,35]. Animals and humans often have to make perceptual inferences and settle on a series of decisions in a complex volatile environment. In general, the state space of decision making is high-dimensional. For example, in social interactions, the behaviors of multiple agents play important roles in the decisions of each individual. In a particular situation, agents employ internal models to observe other agents’ behaviors and to simulate their beliefs about actions [34,35]. The interactions between agents result in complex and correlated behaviors such as competition, cooperation, prediction and judgment. To describe the behaviors of multiple agents in social tasks, models that are able to capture dynamic information and noisy correlations in a multi-dimensional state space need to be developed [36].
Bayesian networks are widely used for the inference of features from observed data [37,38,39,40]. In recent years, hierarchical Bayesian networks have been developed to model the compositional nature of complex features for recognition tasks [41,42]. To solve perceptual inference and decision-making problems in high-dimensional volatile binary environments, in this paper we develop a hierarchical Bayesian model to infer time-varying hidden states of multi-armed bandits and maximize rewards given uncertain high-dimensional sensory inputs.
In summary, our model is a promising approach to complex inference and decision-making problems in realistic environments, which are intrinsically dynamic and high-dimensional. In addition, our model could be applied to reveal computational mechanisms underlying human cognition and behavior [43,44].
The rest of this paper is structured as follows. Section 2 introduces the hierarchical Bayesian perceptual model in high-dimensional volatile binary environments. Section 3 derives a set of closed-form update equations for perceptual inference. Section 4 develops a response model for reward maximization in volatile multi-armed bandits as a typical example. Experimental results are given in Section 5. Finally, the paper is concluded with discussions.

2. Hierarchical Bayesian Perceptual Model

2.1. Beyond Independency

As a classic task in neuroscience and reinforcement learning, a multi-armed bandit challenges the agent with uncertain reward distribution, revealing rewards probabilistically. Since the agent has to estimate both the mean reward (for exploitation) and the precision of mean reward (for exploration), the multi-armed bandit captures the exploration–exploitation tradeoff dilemma in reward maximization under uncertainty [26,45].
Put simply, a one-armed bandit can be considered as a random binary number generator described by a Bernoulli distribution
$$\mathrm{Bern}(x_0; \mu_0) = \mu_0^{x_0} (1 - \mu_0)^{1 - x_0},$$
where $x_0 \in \{0, 1\}$, the state of the one-armed bandit, represents “reward” ($x_0 = 1$) or “no reward” ($x_0 = 0$). $\mu_0 \in [0, 1]$ is the probability of being in the reward state. For a multi-armed bandit, $x_0^{(i)}$ and $\mu_0^{(i)}$ denote the observation and expectation of reward in the i-th arm. The binary vector $x_0 = [x_0^{(1)}, x_0^{(2)}, \ldots, x_0^{(d_0)}]^T$ constitutes a binary pattern corresponding to the state of rewarding or non-rewarding of the arms at time t, with $d_0$ being the total number of arms. Throughout the paper we use the notation $(i)$ in the superscript to indicate the i-th element of a vector.
Assuming independence between the reward distributions of the arms, the joint probability of being in state $x_0$ is given by the product of the reward probabilities of the arms, equivalent to
$$p(x_0) = \exp\left( \sum_{i=1}^{d_0} \left[ x_0^{(i)} \ln \mu_0^{(i)} + (1 - x_0^{(i)}) \ln (1 - \mu_0^{(i)}) \right] \right).$$
However, this independent model is not able to capture possible interaction structure of the arms.
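As a minimal sketch of the independent model, the joint probability can be evaluated directly from the log-sum form above. The function name `joint_reward_prob` and the example probabilities are illustrative assumptions, not part of the original model specification.

```python
import math

def joint_reward_prob(x0, mu0):
    """Joint probability of binary pattern x0 under independent Bernoulli arms.

    Assumes every mu0[i] lies strictly inside (0, 1) so the logarithms are finite.
    """
    log_p = 0.0
    for x, mu in zip(x0, mu0):
        # Each arm contributes x*ln(mu) + (1 - x)*ln(1 - mu) to the exponent.
        log_p += x * math.log(mu) + (1 - x) * math.log(1.0 - mu)
    return math.exp(log_p)

# Hypothetical two-armed example: arm A rewards, arm B does not.
p = joint_reward_prob([1, 0], [0.8, 0.3])
```

Because the exponent is a plain sum over arms, this form cannot express any dependence between arms, which motivates the interaction structure introduced next.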
In volatile environments, the reward distributions are non-stationary and often evolve dependently on each other, showing time-variant interaction strength. To quantitatively describe the interactions among the arms of a multi-armed bandit, we introduce the concept that there are low-order interactions among the natural parameters of the underlying multivariate Bernoulli distribution. Denoted by $x_1$, the natural parameter vector is mapped to a point $\mu_0$ in the probability space through a multivariate element-wise sigmoid function $s$
$$\mu_0(t) = s(x_1(t), \zeta_1),$$
with the i-th element of $\mu_0(t)$ being
$$\mu_0^{(i)}(t) = s(x_1^{(i)}(t), \zeta_1^{(i)}) = \frac{1}{1 + \exp(-\zeta_1^{(i)} x_1^{(i)}(t))}, \quad i \in \{1, 2, \ldots, d_1\}.$$
$d_1 = d_0$ is the dimension of $x_1$. $s(x_1, \zeta_1)$ is a vector-valued function defined by
$$[s(x_1^{(1)}, \zeta_1^{(1)}), s(x_1^{(2)}, \zeta_1^{(2)}), \ldots, s(x_1^{(d_1)}, \zeta_1^{(d_1)})]^T.$$
The parameter vector
$$\zeta_1 = [\zeta_1^{(1)}, \zeta_1^{(2)}, \ldots, \zeta_1^{(d_1)}]^T$$
is the inverse temperature, with positive elements ($\zeta_1^{(i)} > 0$).
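The element-wise mapping from natural parameters to reward probabilities can be sketched as follows. The function name `sigmoid` and the numeric inputs are assumptions chosen only for illustration.

```python
import math

def sigmoid(x1, zeta1):
    """Element-wise sigmoid with a per-element inverse temperature zeta1[i] > 0."""
    return [1.0 / (1.0 + math.exp(-z * x)) for x, z in zip(x1, zeta1)]

# Hypothetical natural parameters and inverse temperatures for two arms.
mu0 = sigmoid([0.0, 2.0], [1.0, 0.5])
```

A natural parameter of zero maps to a reward probability of 0.5, and a larger inverse temperature makes the mapping steeper around that point.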

2.2. Perceiving Tendency and Volatility

In volatile environments, variables of interest, such as reward, are subject to changes. The changes of a variable are again subject to changes, and so forth. The nested nature of volatility is a hallmark of collective phenomena as observed in animal swarms, the financial market and social behavior. To quantitatively describe volatility and pairwise correlations of high-dimensional variables, we have developed a hierarchical volatility model, called General Hierarchical Brownian Filter (GHBF), based on the idea of nested Brownian motions [46]. Following this framework, we develop here a hierarchical perceptual model to estimate both the tendency and the volatility in the states of a multi-armed bandit (Figure 1). More specifically, the natural parameters $x_1$ of the underlying multivariate Bernoulli distribution are modeled by a general Brownian motion with pervasion matrix $\Sigma_1 \in \mathbb{R}^{d_1 \times d_1}$
$$x_1 = B(t; \Sigma_1).$$
This Brownian motion captures the tendency of the learned parameter vector $x_1$. The volatility (i.e., uncertainties and pairwise correlations) in $x_1$ is given by $\Sigma_1 \in \mathbb{R}^{d_1 \times d_1}$, which is a symmetric positive definite matrix by definition.
Because the pervasion intensity $\Sigma_1$ is a symmetric positive definite matrix, it can be uniquely represented by a lower triangular matrix $L_1 \in \mathbb{R}^{d_1 \times d_1}$ according to the Cholesky decomposition
$$\Sigma_1 = L_1 L_1^T.$$
To further evaluate the volatility $\Sigma_1$ (i.e., uncertainties and pairwise correlations) in $x_1$, we assume that its decomposition $L_1$ is modeled by a general Brownian motion in its parameterized space. To be exact, the elements of $L_1$ are parametrized by a $d_2 = d_1 (d_1 + 1) / 2$ dimensional vector $y_2$, which results from concatenating the lower triangle elements of $L_1$ in a column-wise fashion. The element in the i-th row and j-th column of $L_1$ is given by
$$L_1^{(i,j)} = l_1^{(i,j)} = \begin{cases} 2 \sinh\left( y_2^{\left( \frac{(2 d_1 - j + 2)(j - 1)}{2} + i - j + 1 \right)} \right), & 1 \le j < i \le d_1 \\ \exp\left( y_2^{\left( \frac{(2 d_1 - i + 2)(i - 1)}{2} + 1 \right)} \right), & j = i \end{cases}$$
where $\sinh(\cdot)$ denotes the hyperbolic sine function. The vector $y_2$ gives the logarithm of volatility in the second level
$$y_2 = W_2 x_2 + b_2.$$
The coefficient matrix $W_2$ is a $d_2$-by-$d_2$ diagonal matrix and represents the coupling strength from level two to level one. Here, $W_2$ can simply take the form of a diagonal matrix spanned from a column vector $w_2$ with all positive elements
$$W_2^{(i,i)} = w_2^{(i)}.$$
$b_2$ and $x_2 \in \mathbb{R}^{d_2}$ represent the trend and the time-varying fluctuation in the log-volatility of the natural parameter, respectively. We further assume that $x_2$ evolves as a general Brownian motion with pervasion matrix $\Sigma_2 \in \mathbb{R}^{d_2 \times d_2}$
$$x_2 = B(t; \Sigma_2).$$
We can rewrite the coupling (Equations (6) and (7)) as
$$L_1 = F_2(x_2; w_2, b_2).$$
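The coupling from the second level to the Cholesky factor can be sketched as below: the diagonal entries pass through an exponential, the strictly lower entries through twice a hyperbolic sine, and the entries of $y_2$ are addressed by column-wise stacking of the lower triangle. The helper names `lvec_index`, `cholesky_factor` and `pervasion_matrix` are hypothetical, and the zero-valued inputs are illustrative only.

```python
import math

def lvec_index(i, j, d1):
    """1-based position in y2 of element (i, j), i >= j, under column-wise stacking."""
    return (2 * d1 - j + 2) * (j - 1) // 2 + i - j + 1

def cholesky_factor(x2, w2, b2, d1):
    """Map the second-level state x2 to the lower triangular factor L1."""
    # y2 = W2 x2 + b2 with a diagonal W2 spanned by w2.
    y2 = [w * x + b for w, x, b in zip(w2, x2, b2)]
    L1 = [[0.0] * d1 for _ in range(d1)]
    for j in range(1, d1 + 1):
        for i in range(j, d1 + 1):
            y = y2[lvec_index(i, j, d1) - 1]
            # exp keeps the diagonal positive; 2*sinh lets off-diagonals take any sign.
            L1[i - 1][j - 1] = math.exp(y) if i == j else 2.0 * math.sinh(y)
    return L1

def pervasion_matrix(L1):
    """Sigma1 = L1 L1^T."""
    d = len(L1)
    return [[sum(L1[i][k] * L1[j][k] for k in range(d)) for j in range(d)]
            for i in range(d)]

# Two arms: y2 has d2 = 3 entries, ordered (1,1), (2,1), (2,2).
L1 = cholesky_factor([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], d1=2)
Sigma1 = pervasion_matrix(L1)
```

With all second-level states at zero, the factor is the identity, so the two natural parameters diffuse with unit variance and no correlation.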
In the second level, the pervasion matrix $\Sigma_2$ is chosen as a diagonal matrix. Let $L_2 \in \mathbb{R}^{d_2 \times d_2}$ be the unique Cholesky decomposition of $\Sigma_2$. We simply assume that $L_2$ is a constant diagonal matrix spanned by the vector $\lambda_{top} \in \mathbb{R}^{d_2}$ with all positive components.
Figure 1 shows an overview of the hierarchical perceptual model. With this model, a Bayesian agent receives a series of sensory inputs or observations $u_s = \{u(t_1), u(t_2), \ldots, u(t_K)\}$, where $K$ is the total number of trials. At time $t_k$, the sensory input $u(t_k)$ to the agent is determined by the state $x_0(t_k)$ of the bandit deterministically, i.e., with a delta distribution $\delta(\cdot)$
$$P(u(t_k) \mid x_0(t_k)) = \delta(u(t_k) = x_0(t_k)).$$
In summary, the hierarchical perceptual model constitutes a generative model for sensory observations $u(t)$ based on hidden representations of the tendency ($x_1$) and the volatility ($x_2$) of the observations.

3. Perceptual Inference Approximated by Variational Approximation

The aforementioned hierarchical perceptual model is constructed based on general continuous Brownian motions. It remains to develop update rules to estimate the posterior distributions for the hidden representations $x_1$ and $x_2$. In order to derive a family of analytical and efficient updates, we discretize the continuous Brownian motions by applying the Euler method. The sampling interval (SI) $\epsilon(t_k) = t_k - t_{k-1}$ is defined as the time that elapses between the arrival of consecutive sensory inputs $u(t_{k-1})$ and $u(t_k)$.
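A minimal sketch of one discretized Brownian step is given below. It assumes the Euler step takes the form $x(t_k) = x(t_{k-1}) + \sqrt{\epsilon(t_k)}\, L z$ with $z$ a standard normal vector, so that each increment has covariance $\epsilon(t_k) L L^T$; this explicit form, the function name `euler_step` and the numeric values are assumptions for illustration.

```python
import math
import random

def euler_step(x_prev, L, eps, rng):
    """One Euler step of a Brownian motion with pervasion matrix L L^T.

    Assumed form: x_k = x_{k-1} + sqrt(eps) * L z, with z ~ N(0, I).
    """
    d = len(x_prev)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [x_prev[i] + math.sqrt(eps) * sum(L[i][k] * z[k] for k in range(d))
            for i in range(d)]

rng = random.Random(0)                 # fixed seed so the path is reproducible
L = [[1.0, 0.0], [0.5, 1.0]]           # hypothetical lower triangular factor
path = [[0.0, 0.0]]
for _ in range(5):
    path.append(euler_step(path[-1], L, eps=1.0, rng=rng))
```

The sampling interval enters only through the square-root scaling, so irregularly spaced inputs can be handled by passing a different `eps` on each step.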
We use the variational Bayesian method [15,20,22,23] to reach an approximation to the posterior distributions of $x_1(t)$ and $x_2(t)$ given the sensory input $u(t)$ (i.e., observation). To this end, we maximize the negative free energy, which is the lower bound of the log-model evidence, to yield a variational approximation posterior (cf. Appendix A and Appendix B)
$$q(x_h(t_k)) = \frac{1}{Z_h} \exp(V_h(x_h(t_k))), \quad h = 1, 2,$$
where $Z_h$ is a normalization constant. $V_h(x_h(t_k))$ is the variational energy given by
$$V_h(x_h(t_k)) = E_{q(x_s \setminus \{x_h\}(t_k))} \left[ \ln p(x_s(t_k), u(t_k) \mid \psi_s, \epsilon(t_k)) \right].$$
Here we introduced the notation $x_s = \{x_0, x_1, x_2\}$ to denote the set of all hidden states, $\psi_s = \{w_2, b_2, \lambda_{top}, \zeta_1\}$ for the hyperparameters of the model, $x_s \setminus \{x_h\}$ for excluding $x_h$ from the set $x_s$, and $E_{q(x)}(v)$ for the expectation of $v$ under the distribution $q(x)$.
In order to complete the derivations, a Gaussian quadratic form approximation is used as in [46]. In general, the variational energy $V_h(x_h(t_k))$ will deviate from a Gaussian quadratic form. We therefore use a Gaussian quadratic form
$$\bar{V}_h(x_h(t_k)) = -\frac{1}{2} (x_h(t_k) - \mu_h(t_k))^T P_h(t_k) (x_h(t_k) - \mu_h(t_k))$$
as an efficient approximation of $V_h(x_h(t_k))$. $P_h(t_k)$ is given by the negative Hessian matrix of the variational energy at the last state $\mu_h(t_{k-1})$, $P_h(t_k) = (C_h(t_k))^{-1} = -\nabla^2 V(\mu_h(t_{k-1}))$, and then a local maximum point $\mu_h(t_k)$ is found as the mode of the posterior Gaussian distribution. This approximation is made by neglecting higher-order terms of the logarithm of $q(x_h(t_k))$ and assuming Gaussian quadratic forms
$$x_h(t_k) \mid u(t_k), \psi_s \sim N(\mu_h(t_k), C_h(t_k)), \quad h = 1, 2.$$
Under this approximation, the inference of the posterior distributions of $x_h$ is reduced to the estimation of the mean $\mu_h(t_k)$ and the covariance matrix $C_h(t_k)$, or equivalently the precision matrix $P_h(t_k) \triangleq (C_h(t_k))^{-1}$. Following [46], the update rules for the posterior distributions of $x_1$ and $x_2$ are derived.
At the bottom (zeroth) level of the hierarchical perceptual model, we can directly determine the multivariate Bernoulli distribution $q(x_0(t_k))$ with expectation:
$$\mu_0(t_k) = u(t_k).$$
At the first level, following Equation (11), $V_1(x_1)$ can be calculated as
$$\begin{aligned} V_1(x_1(t_k)) &= E_{q(x_s \setminus \{x_1\}(t_k))} \left[ \ln p(x_s(t_k), u(t_k) \mid \psi_s, \epsilon(t_k)) \right] \\ &= \ln p(u(t_k) \mid x_0(t_k)) + E_{q(x_0(t_k))} \left[ \ln p(x_0(t_k) \mid x_1(t_k)) \right] + E_{q(x_1(t_k), x_2(t_k))} \left[ \ln p(x_1(t_k) \mid x_2(t_k), W_2, b_2, \epsilon(t_k)) \right] \\ &\approx \mu_0^T(t_k) \ln s(x_1(t_k); \zeta_1) + (\mathbf{1} - \mu_0(t_k))^T \ln (\mathbf{1} - s(x_1(t_k); \zeta_1)) \\ &\quad - \frac{1}{2} (x_1(t_k) - \mu_1(t_{k-1}))^T \left( \epsilon(t_k) \hat{\Sigma}_1(t_k) + C_1(t_{k-1}) \right)^{-1} (x_1(t_k) - \mu_1(t_{k-1})), \end{aligned}$$
where $\mathbf{1}$ is a $d_0$-dimensional column vector in which all elements are 1. Here we use the approximation
$$\left( \epsilon(t_k) \Sigma_1(t_k) + C_1(t_{k-1}) \right)^{-1} \approx \left( \epsilon(t_k) \hat{\Sigma}_1(t_k) + C_1(t_{k-1}) \right)^{-1},$$
with $\hat{\Sigma}_1(t_k)$ computed from the second level
$$\hat{\Sigma}_1(t_k) = \hat{L}_1(t_k) \hat{L}_1^T(t_k), \qquad \hat{L}_1(t_k) = F_2(\mu_2(t_{k-1}); w_2, b_2).$$
The variational energy $V_1(x_1(t_k))$ is not a standard Gaussian quadratic form, so we have to employ a Gaussian quadratic form $\bar{V}_1(x_1(t_k))$ to approximate it. To obtain this approximation form, we give the gradient and Hessian matrix of $V_1(x_1(t_k))$ as follows:
$$\nabla V_1(x_1(t_k)) = \mu_0(t_k) - s(x_1(t_k); \zeta_1) - \frac{1}{2} \left( \epsilon(t_k) \hat{\Sigma}_1(t_k) + C_1(t_{k-1}) \right)^{-1} (x_1(t_k) - \mu_1(t_{k-1}))$$
and
$$\nabla^2 V_1(x_1(t_k)) = -\mathrm{diag}\left( s(x_1(t_k); \zeta_1) \odot (\mathbf{1} - s(x_1(t_k); \zeta_1)) \right) - \frac{1}{2} \left( \epsilon(t_k) \hat{\Sigma}_1(t_k) + C_1(t_{k-1}) \right)^{-1}$$
where the operator $\odot$ is the Hadamard product. The operation $\mathrm{diag}(v)$ transforms a vector $v$ into a diagonal square matrix with the elements of $v$ on the principal diagonal.
Under the Gaussian quadratic form approximation, which is based on a single-step Newton method, the tendency of $x_0(t_k)$ is captured by
$$\mu_1(t_k) = \mu_1(t_{k-1}) + \mathrm{diag}(\zeta_1) C_1(t_k) \mathrm{PE}_0(t_k)$$
where $\mathrm{PE}_0(t_k)$ is the prediction error
$$\mathrm{PE}_0(t_k) = \mu_0(t_k) - \hat{\mu}_0(t_k),$$
where $\hat{\mu}_0(t_k) \triangleq [\hat{\mu}_0^{(1)}(t_k), \hat{\mu}_0^{(2)}(t_k), \ldots, \hat{\mu}_0^{(d_0)}(t_k)]^T$ is the prediction according to Equation (3)
$$\hat{\mu}_0(t_k) = s(\mu_1(t_{k-1}), \zeta_1).$$
In Equation (19), the prediction error is scaled by the covariance matrix $C_1(t_k)$ of the approximate Gaussian distribution, which is converted from the precision matrix
$$C_1(t_k) \triangleq P_1(t_k)^{-1}, \qquad P_1(t_k) = \hat{\Pi}_1(t_k) + \mathrm{diag}(\zeta_1)^2 \hat{C}_0(t_k).$$
Here $\hat{C}_0(t_k) = \mathrm{diag}(\hat{\sigma}_0(t_k))$ is the diagonal square matrix containing the observed variance
$$\hat{\sigma}_0(t_k) = \begin{bmatrix} \hat{\mu}_0^{(1)}(t_k) (1 - \hat{\mu}_0^{(1)}(t_k)) \\ \hat{\mu}_0^{(2)}(t_k) (1 - \hat{\mu}_0^{(2)}(t_k)) \\ \vdots \\ \hat{\mu}_0^{(d_0)}(t_k) (1 - \hat{\mu}_0^{(d_0)}(t_k)) \end{bmatrix}.$$
The prediction precision $\hat{\Pi}_1(t_k)$ is given by
$$\hat{\Pi}_1(t_k) = \left( \epsilon(t_k) \hat{\Sigma}_1(t_k) + C_1(t_{k-1}) \right)^{-1}.$$
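The first-level update can be sketched for the two-armed case as follows. This is an illustrative sketch restricted to $d_1 = 2$ so that the matrix inverse can be written in closed form; the names `inv2` and `first_level_update` and all numeric inputs are assumptions, and $\hat{\Sigma}_1$ is passed in directly rather than computed from the second level.

```python
import math

def inv2(M):
    """Inverse of a 2-by-2 matrix given as nested lists."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def first_level_update(u, mu1_prev, C1_prev, Sigma1_hat, zeta1, eps=1.0):
    """One update of the tendency (mu1, C1) after observing the binary input u."""
    # Prediction of the reward probabilities from the previous tendency.
    mu0_hat = [1.0 / (1.0 + math.exp(-z * m)) for m, z in zip(mu1_prev, zeta1)]
    # Prediction error at the zeroth level (mu0 equals the input u).
    pe0 = [ui - mi for ui, mi in zip(u, mu0_hat)]
    # Prediction precision: inverse of (eps * Sigma1_hat + C1_prev).
    Pi1_hat = inv2([[eps * Sigma1_hat[i][j] + C1_prev[i][j] for j in range(2)]
                    for i in range(2)])
    # Posterior precision adds diag(zeta1)^2 times the predicted Bernoulli variance.
    P1 = [row[:] for row in Pi1_hat]
    for i in range(2):
        P1[i][i] += zeta1[i] ** 2 * mu0_hat[i] * (1.0 - mu0_hat[i])
    C1 = inv2(P1)
    # Mean update: the prediction error is weighted by diag(zeta1) C1.
    mu1 = [mu1_prev[i] + zeta1[i] * sum(C1[i][j] * pe0[j] for j in range(2))
           for i in range(2)]
    return mu1, C1, mu0_hat

identity = [[1.0, 0.0], [0.0, 1.0]]
mu1, C1, mu0_hat = first_level_update([1, 0], [0.0, 0.0], identity, identity, [1.0, 1.0])
```

In this hypothetical call the prediction is 0.5 for both arms, so observing a reward on arm A and none on arm B pushes the two tendencies in opposite directions by the same amount.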
At the second level, the volatility, consisting of the uncertainties and pairwise correlations in the natural parameters, is inferred by a similar variational approximation method [46]. The mean is updated by
$$\mu_2(t_k) = \mu_2(t_{k-1}) + \epsilon(t_k) C_2(t_k) W_2^T \cdot \hat{L}_{g1}(t_k) \left( \Omega_1(t_k) \otimes I_{d_1} \right) \mathrm{vec}\left( \Delta_1^T(t_k) \right).$$
Here the function $\mathrm{vec}(M_{m \times n})$ is the vectorization of a matrix $M$, a linear operation that yields a column vector of length $m \times n$ by concatenating the columns of the matrix $M$ consecutively from column 1 to column n. The operator $\otimes$ is the Kronecker product. $\Delta_1(t_k)$ is given by
$$\Delta_1(t_k) = \left( C_1(t_k) + \mathrm{PE}_1(t_k) \mathrm{PE}_1^T(t_k) \right) \hat{\Pi}_1(t_k) - I_{d_1}.$$
The constant matrix $I_d$ is a d-by-d identity matrix. $\mathrm{PE}_1(t_k)$ is the prediction error on the hidden state $x_1$
$$\mathrm{PE}_1(t_k) = \mu_1(t_k) - \mu_1(t_{k-1}).$$
$\hat{L}_{g1}(t_k)$ is given by
$$\hat{L}_{g1}(t_k) = \begin{bmatrix} \exp\left( (W_2^{(1)})^T \mu_2(t_{k-1}) + b_2^{(1)} \right) e_2^T(1) \\ 2 \cosh\left( (W_2^{(2)})^T \mu_2(t_{k-1}) + b_2^{(2)} \right) e_2^T(2) \\ \exp\left( (W_2^{(3)})^T \mu_2(t_{k-1}) + b_2^{(3)} \right) e_2^T(3) \\ 2 \cosh\left( (W_2^{(4)})^T \mu_2(t_{k-1}) + b_2^{(4)} \right) e_2^T(4) \\ \vdots \\ \exp\left( (W_2^{(d_2)})^T \mu_2(t_{k-1}) + b_2^{(d_2)} \right) e_2^T(d_2) \end{bmatrix},$$
where the constant vector $e_2(i)$ is a $d_1^2$-dimensional column vector whose j-th component is 1 if $j = i$ and 0 if $j \ne i$. The column vector $W_2^{(i)}$ is the i-th row of the coefficient matrix $W_2$. $\Omega_1(t_k)$ is
$$\Omega_1(t_k) = \hat{L}_1^T(t_k) \hat{\Pi}_1(t_k).$$
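The vectorization and Kronecker operators used in the second-level mean update can be checked numerically with the standard identity $\mathrm{vec}(AXB) = (B^T \otimes A)\,\mathrm{vec}(X)$, which gives $(\Omega_1 \otimes I_{d_1})\,\mathrm{vec}(\Delta_1^T) = \mathrm{vec}(\Delta_1^T \Omega_1^T)$. The sketch below uses small hypothetical matrices; the helper names and numeric values are assumptions for illustration.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def vec(A):
    """Stack the columns of A into one flat list, column 1 first."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

def kron(A, B):
    """Kronecker product of two matrices given as nested lists."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

# Hypothetical 2-by-2 stand-ins for Omega_1 and Delta_1.
Omega = [[1.0, 2.0], [0.0, 3.0]]
Delta = [[0.5, -1.0], [2.0, 0.25]]
I2 = [[1.0, 0.0], [0.0, 1.0]]

v = vec(transpose(Delta))
lhs = [sum(k * x for k, x in zip(row, v)) for row in kron(Omega, I2)]
rhs = vec(matmul(transpose(Delta), transpose(Omega)))
```

Both sides evaluate to the same vector, so the Kronecker form in the update is an equivalent way of writing the vectorized matrix product $\Delta_1^T \Omega_1^T$.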
The precision matrix is updated by
$$\begin{aligned} P_2(t_k) = {} & \hat{\Pi}_2(t_k) + W_2^T \hat{L}_{g1}(t_k) \Big\{ \epsilon(t_k)^2 K_{d_1 d_1} \big( \Omega_1^T(t_k) \otimes [\Omega_1(t_k) \Delta_1(t_k)] + [\Delta_1^T(t_k) \Omega_1^T(t_k)] \otimes \Omega_1(t_k) + \Omega_1^T(t_k) \otimes \Omega_1(t_k) \big) \\ & + \epsilon(t_k)^2 \big( [L_1^T(t_k) \Delta_1^T(t_k) \Omega_1^T(t_k)] \otimes \hat{\Pi}_1(t_k) + [L_1^T(t_k) \Omega_1^T(t_k)] \otimes [\hat{\Pi}_1(t_k) \Delta_1(t_k)] + [L_1^T(t_k) \Omega_1^T(t_k)] \otimes \hat{\Pi}_1(t_k) \big) \\ & - \epsilon(t_k) I_{d_1} \otimes [\hat{\Pi}_1(t_k) \Delta_1(t_k)] \Big\} \hat{L}_{g1}^T(t_k) W_2 - W_2^T \mathrm{diag}\big( \mathrm{lvec}(\delta_1(t_k)) \big) W_2 \end{aligned}$$
where the function $\mathrm{lvec}(L)$ transforms a lower triangular matrix $L$ into a column vector obtained by column stacking, omitting the constant zero elements in the upper triangle part of the matrix. The prediction precision matrix $\hat{\Pi}_2$ at the second level is given by
$$\hat{\Pi}_2(t_k) = \left( \epsilon(t_k) \Sigma_2 + C_2(t_{k-1}) \right)^{-1}.$$
The notation $K_{mn}$ denotes an $mn$-by-$mn$ commutation matrix. $\delta_1(t_k)$ is defined as
$$\delta_1(t_k) = \epsilon(t_k) [\Delta_1^T(t_k) \Omega_1^T(t_k)] \odot \hat{L}_1(t_k).$$

4. Decision Making in Volatile Multi-Armed Bandits

To illustrate decision making on the basis of perceptual inference in volatile environments, we introduce, as a toy example, a two-armed bandit problem, which is a complex variant of the one-armed bandit gambling task in [30,47]. In this task, a cautious gambler is asked to bet on the outcomes of a two-armed bandit and to maximize its overall score (Figure 2). We use upper-case letters A and B to denote the two arms of the bandit, and the notations $x_0^{(1)}$ and $x_0^{(2)}$ for the states of arms A and B, respectively. On each trial, the states of the two arms, i.e., the binary vector $x_0 = [x_0^{(1)}, x_0^{(2)}]^T$, are revealed to the gambler at the same time after the gambler makes a choice. There are two options available for the gambler to choose from. The first option, “Same”, represents congruent states of the two arms, i.e., $[0, 0]^T$ or $[1, 1]^T$. The second option, “Different”, represents incongruent states of the two arms, i.e., $[1, 0]^T$ or $[0, 1]^T$. Once the gambler makes a decision (to choose “Same” or “Different”), the two arms randomly generate their states by employing two univariate Bernoulli distributions (Equation (1)). To model a volatile environment, the state distributions of the arms are time-varying (Figure 3).
The gambler’s response $a$ is encoded as:
$$a = \begin{cases} 0, & \text{for choice Different} \\ 1, & \text{for choice Same.} \end{cases}$$
The gambler is rewarded if its choice matches the outcome of the bandit. To include volatility in the rewards as well, the magnitude of the reward varies from trial to trial. The reward is sampled from a reward set $S_r = \{1, 2, 3, \ldots, N_r\}$, with each reward being chosen with equal probability $P(k) = 1 / N_r$, $k \in S_r$.
The gambler starts the experiment with zero score. On each trial, once the chosen option turns out to be correct, the corresponding reward associated to the choice will be added to its overall score.
To maximize reward, a response model has to be defined. To this end, we first denote the rewards obtained for the correct choice of “Different” and “Same” as $r_0$ and $r_1$, respectively, and construct a reward table for each trial (Table 1).
Then we write a reward (utility) function $r(x_0, a)$ on a trial basis according to the reward table
$$r(x_0, a) = \left( 1 - (x_0^{(1)} - x_0^{(2)})^2 \right) \left[ a - (x_0^{(1)} - x_0^{(2)})^2 \right]^2 r_1 + (x_0^{(1)} - x_0^{(2)})^2 \left[ a - (x_0^{(1)} - x_0^{(2)})^2 \right]^2 r_0.$$
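The reward function can be checked against the four possible outcomes with a short sketch. The function name `reward` and the reward magnitudes used in the example are illustrative assumptions.

```python
def reward(x0, a, r0, r1):
    """Trial reward for action a (1 = Same, 0 = Different) given the arm states x0."""
    diff = (x0[0] - x0[1]) ** 2      # 1 if the two arms differ, 0 if they match
    correct = (a - diff) ** 2        # 1 if the choice matches the outcome, else 0
    return (1 - diff) * correct * r1 + diff * correct * r0

# Hypothetical reward magnitudes: r0 = 2 for "Different", r1 = 3 for "Same".
same_hit = reward([1, 1], 1, r0=2, r1=3)
diff_hit = reward([1, 0], 0, r0=2, r1=3)
same_miss = reward([1, 0], 1, r0=2, r1=3)
```

The squared differences act as indicator functions on binary values, which is why the polynomial form reproduces a lookup table.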
Given the predicted state $\hat{\mu}_0$ (Equation (21)), the expected reward of decision $a$ under the corresponding predicted distribution $q(x_0, \hat{\mu}_0)$ is given by the value function
$$\begin{aligned} Q(a, \hat{\mu}_0) &= \sum_{x_0} r(x_0, a) \, \mathrm{Bern}(x_0; \hat{\mu}_0) \\ &= \sum_{x_0} r(x_0, a) \, \mathrm{Bern}(x_0^{(1)}; \hat{\mu}_0^{(1)}) \, \mathrm{Bern}(x_0^{(2)}; \hat{\mu}_0^{(2)}) \\ &= a^2 r_1 \left[ (1 - \hat{\mu}_0^{(1)}) (1 - \hat{\mu}_0^{(2)}) + \hat{\mu}_0^{(1)} \hat{\mu}_0^{(2)} \right] + (a - 1)^2 r_0 \left[ (1 - \hat{\mu}_0^{(1)}) \hat{\mu}_0^{(2)} + \hat{\mu}_0^{(1)} (1 - \hat{\mu}_0^{(2)}) \right] \\ &= \begin{cases} r_1 \left[ (1 - \hat{\mu}_0^{(1)}) (1 - \hat{\mu}_0^{(2)}) + \hat{\mu}_0^{(1)} \hat{\mu}_0^{(2)} \right], & a = 1 \\ r_0 \left[ (1 - \hat{\mu}_0^{(1)}) \hat{\mu}_0^{(2)} + \hat{\mu}_0^{(1)} (1 - \hat{\mu}_0^{(2)}) \right], & a = 0 \end{cases} \end{aligned}$$
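The closed form of the value function can be verified by enumerating the four binary patterns, as in the sketch below. All function names and the numeric inputs are assumptions chosen for illustration.

```python
from itertools import product

def bern(x, mu):
    """Univariate Bernoulli probability of outcome x in {0, 1}."""
    return mu ** x * (1 - mu) ** (1 - x)

def reward(x0, a, r0, r1):
    diff = (x0[0] - x0[1]) ** 2
    correct = (a - diff) ** 2
    return (1 - diff) * correct * r1 + diff * correct * r0

def q_enumerated(a, mu_hat, r0, r1):
    """Expected reward obtained by summing over all four state patterns."""
    return sum(reward(x0, a, r0, r1) * bern(x0[0], mu_hat[0]) * bern(x0[1], mu_hat[1])
               for x0 in product([0, 1], repeat=2))

def q_closed_form(a, mu_hat, r0, r1):
    """Expected reward from the closed-form case expression."""
    m1, m2 = mu_hat
    p_same = (1 - m1) * (1 - m2) + m1 * m2
    p_diff = (1 - m1) * m2 + m1 * (1 - m2)
    return r1 * p_same if a == 1 else r0 * p_diff

# Hypothetical predictions and reward magnitudes.
mu_hat, r0, r1 = [0.8, 0.3], 2, 3
q_same = q_closed_form(1, mu_hat, r0, r1)
q_diff = q_closed_form(0, mu_hat, r0, r1)
```

With these hypothetical values the arms are predicted to disagree more often than not, and the smaller reward for “Different” is not enough to reverse the ranking of the two options.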
The agent makes decisions according to a Boltzmann distribution constructed from the value function. The probability of choosing action $a$ is defined by
$$P_a = \frac{\exp(Q(a, \hat{\mu}_0))}{\sum_b \exp(Q(b, \hat{\mu}_0))}.$$
For the binary decision-making task considered here, the probability of choosing action $a = 1$ reduces to a sigmoid function
$$P_1 = \frac{1}{1 + \exp\left( -(Q(1, \hat{\mu}_0) - Q(0, \hat{\mu}_0)) \right)} = s(Q(1, \hat{\mu}_0) - Q(0, \hat{\mu}_0), 1),$$
where $s(\cdot, \cdot)$ is the sigmoid function defined in Equation (4).
In fact, a biological agent maximizes long-term rewards instead of immediate rewards, using decision noise as a mechanism to trade off exploration and exploitation. We introduce a probability weighting function [47,48] with a noise parameter $\zeta_a > 0$ to include decision noise. The probability of choosing action $a = 1$ is
$$P(a = 1 \mid \hat{\mu}_0, \zeta_a) = \frac{P_1^{\zeta_a}}{P_1^{\zeta_a} + (1 - P_1)^{\zeta_a}}.$$
Up to now, we have defined a response model (Equations (33)–(37)) based on Bayesian decision theory to maximize expected rewards. The response model is a function of the decision evidence $(Q(1, \hat{\mu}_0) - Q(0, \hat{\mu}_0))$, i.e., the difference between the expected rewards for the two options (“Different”, “Same”). If the decision evidence is positive, the probability of choosing “Same” exceeds 0.5, and the optimal action is to choose “Same” or $a = 1$. If the decision evidence is negative, the probability of choosing “Different” exceeds 0.5 and the optimal action is the option “Different” or $a = 0$.
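The response model can be sketched in a few lines. Dividing numerator and denominator of the weighting function by $P_1^{\zeta_a}$ shows that it equals a sigmoid of the decision evidence scaled by $\zeta_a$, which the sketch uses as a check. The function name `choice_probability` and the numeric values are assumptions for illustration.

```python
import math

def choice_probability(q1, q0, zeta_a):
    """Probability of choosing 'Same' (a = 1) with decision-noise parameter zeta_a > 0."""
    # Binary Boltzmann rule: sigmoid of the decision evidence q1 - q0.
    p1 = 1.0 / (1.0 + math.exp(-(q1 - q0)))
    # Probability weighting: sharpens (zeta_a > 1) or flattens (zeta_a < 1) the choice.
    return p1 ** zeta_a / (p1 ** zeta_a + (1.0 - p1) ** zeta_a)

# Hypothetical expected rewards for "Same" and "Different".
p_same = choice_probability(1.14, 1.24, zeta_a=2.0)
p_equiv = 1.0 / (1.0 + math.exp(-2.0 * (1.14 - 1.24)))   # sigmoid(zeta_a * evidence)
```

Here the decision evidence is negative, so the probability of choosing “Same” falls below 0.5, in line with the rule stated above.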

5. Simulation Results

The combination of the perceptual model (Equations (5)–(10)) and the response model (Equations (33)–(37)) constitutes a Bayesian model (denoted by $M_1$) for decision making in volatile multi-armed bandits. To assess the model’s ability to adapt to volatility, we simulated a gambler with the proposed Bayesian decision model to solve the two-armed bandit task (Figure 2). In the simulation, trials are organized into seventeen blocks, each of which contains 15 trials (Figure 3). The state expectations of the bandit change across blocks, resulting in volatility in sensory inputs. The reward set is specified as $S_r = \{1, 2, 3, 4\}$.
An ideal observer has access to the actual state $u(t) = [u^{(1)}(t), u^{(2)}(t)]^T$ generated by the bandit at each time $t$ (Figure 3). Given this ideal information, the ideal observer can take the ideal actions $a_{ideal}(t)$
$$a_{ideal}(t) = \begin{cases} 0, & \text{if } u^{(1)}(t) \ne u^{(2)}(t) \\ 1, & \text{if } u^{(1)}(t) = u^{(2)}(t). \end{cases}$$
Based on this series of ideal actions, the cumulative reward obtained by the ideal observer can be computed.
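The ideal actions and the resulting cumulative reward can be sketched as follows. The sketch assumes, for simplicity, a single reward magnitude per trial that is earned whenever the chosen action matches the ideal one; the function names and the short example sequences are hypothetical.

```python
def ideal_action(u):
    """1 ('Same') if the two arm states match, 0 ('Different') otherwise."""
    return 1 if u[0] == u[1] else 0

def cumulative_reward(inputs, actions, rewards):
    """Sum the per-trial rewards over all trials where the action was correct."""
    total = 0
    for u, a, r in zip(inputs, actions, rewards):
        if a == ideal_action(u):
            total += r
    return total

# Hypothetical three-trial example.
inputs = [[1, 1], [1, 0], [0, 0]]
rewards = [2, 4, 1]
ideal = [ideal_action(u) for u in inputs]
best = cumulative_reward(inputs, ideal, rewards)
always_same = cumulative_reward(inputs, [1, 1, 1], rewards)
```

The ideal observer collects every reward by construction, so its cumulative score is an upper bound for any agent facing the same sequence.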
To measure the performance of decision-making behavior in the above gambling task, we define a probabilistically optimal reference for comparison. For this purpose, we consider an informed agent, who is given the expectation of the states of the volatile bandit $[P(x_0^{(1)}(t) = 1), P(x_0^{(2)}(t) = 1)]^T$. The informed agent need not learn the states of the bandit, and it uses the same action selection mechanism (Equations (34)–(37)) of the response model $\pi_r$ to obtain the probabilistically optimal expectation of the response action $a(t)$, denoted by $P^*(a(t) = 1)$. The expectation of an agent’s action $a(t)$ coincides completely with $P^*(a(t) = 1)$ only if the agent fully understands the volatile environment. Overestimating or underestimating the environmental states causes the expectation of the agent’s action $a(t)$ to deviate from the optimal behavior. Therefore, $P^*(a(t) = 1)$ constitutes the optimal decision-making behavior that a learning agent could reach. The deviation of a learning agent with sensory inputs $u_s$, actions $a_s$ and rewards $r_s$ from the informed agent in decision-making behavior is measured by the regret $R(P(a(t_k) = 1) \mid u_s, a_s, r_s)$, defined by
$$R(P(a(t_k) = 1) \mid u_s, a_s, r_s) = \sum_{k=1}^{K} \left| P(a(t_k) = 1) - P^*(a(t_k) = 1) \right|,$$
where $P(a(t_k) = 1)$ is generated by the learning agent.

5.1. Dynamics of Bayesian Decision Making

We employed a Bayesian agent $M_1$, which is endowed with the proposed hierarchical perceptual model and binary response model (Figure 4), to perform the above gambling task (Figure 2). All free parameters of our Bayesian agent $M_1$ are defined in Appendix C and form a random variable vector denoted by $\xi_1$. The initial sufficient statistics of all parameters are listed in Table 2. In detail, the optimization of the free parameters was carried out in the following three steps before the model was used for the gambling task.
(1) Generating synthetic data. According to the expected states of the arms (Figure 3), we randomly generated a sequence of multivariate binary inputs
$$u_s = \{u(t_1), u(t_2), u(t_3), \ldots, u(t_K)\}, \quad (K = 255).$$
Then the series of ideal actions $a_s = \{a_{ideal}(t_1), a_{ideal}(t_2), \ldots, a_{ideal}(t_K)\}$ is computed by Equation (38). The random reward sequence $r_s = \{r(t_1), r(t_2), r(t_3), \ldots, r(t_K)\}$ is generated from the uniform distribution $U(1, 4)$ based on the reward set $S_r = \{1, 2, 3, 4\}$.
(2) Initializing sufficient statistics of all random parameters. To allow our model to work well for sensory inputs, we chose particular initial sufficient statistics of the random parameter vector $\xi_1$ and determined the prior distribution of $\xi_1$. The configuration for the parameters of the Bayesian agent (Figure 4) is shown in Table 2.
(3) Maximizing negative free energy. To obtain the optimal prior parameters $\mu_{\xi_1}^*$, $C_{\xi_1}^*$ of the parameter $\xi_1$, we maximize the negative free energy (Equations (A19)–(A21)) by using the quasi-Newton Broyden–Fletcher–Goldfarb–Shanno method based on a line search framework [49].
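Step (1) above can be sketched as follows. The per-block reward probabilities in `block_probs` are placeholders invented for this sketch and are not the values shown in Figure 3; only the block layout (17 blocks of 15 trials, one uncorrelated block, nine incongruent blocks, seven congruent blocks) and the reward set follow the text. The function name and the seed are likewise assumptions.

```python
import random

def generate_inputs(block_probs, trials_per_block, rng):
    """Draw a binary input pair per trial from the block's reward probabilities."""
    inputs = []
    for p_a, p_b in block_probs:
        for _ in range(trials_per_block):
            inputs.append([int(rng.random() < p_a), int(rng.random() < p_b)])
    return inputs

rng = random.Random(42)   # fixed seed for reproducibility

# Placeholder expectations (NOT the Figure 3 values): block 1 uncorrelated,
# blocks 2-10 incongruent, blocks 11-17 congruent.
block_probs = [(0.5, 0.5)] + [(0.8, 0.2)] * 9 + [(0.8, 0.8)] * 7

u_s = generate_inputs(block_probs, 15, rng)            # K = 17 * 15 = 255 inputs
a_s = [1 if u[0] == u[1] else 0 for u in u_s]          # ideal actions
r_s = [rng.choice([1, 2, 3, 4]) for _ in u_s]          # rewards drawn uniformly from S_r
```

Steps (2) and (3) depend on the prior configuration in Table 2 and on the free-energy expressions in the appendices, so they are not reproduced in this sketch.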
To reveal the dynamic interaction between the natural parameters of the two-armed bandit, we show one example gambling process performed by the Bayesian agent with the optimal parameters. The Bayesian agent tracked online the tendency $\mu_1$ of the natural parameters $x_1$ associated with the bandit (Figure 5), so that it could make decisions based on the estimated decision evidence. The evolution of $\mu_1$ closely follows the trend of the expected states of the bandit (Figure 3), yielding good predictions of the states (Figure 6b,c).
After observing the tentative rewards (Figure 6a) during the cue phase of a trial, the Bayesian agent makes a choice in the decision phase according to the perceptual model and the response model. In the first block of the simulation (trials 1 to 15, Figure 3), the two arms have the same expected states of maximal uncertainty, i.e., $P(x_0^{(1)} = 1) = P(x_0^{(2)} = 1) = 0.5$, and the binary state patterns of the two-armed bandit are equally probable. Both belief states $\mu_0^{(1)}, \mu_0^{(2)}$ of the two arms fluctuate around 0.5 (Figure 6). During this block, the prediction correlation $\hat{\rho}_1$ fluctuates and decreases slightly towards zero, reflecting the fact that the states of the two arms are uncorrelated. From the second block to the tenth block (trials 16 to 150, Figure 3), the expected states of the two arms are incongruent. Therefore, the changes in the prediction tendency $\hat{\mu}_1^{(1)}$ (Figure 5a, as well as in the predicted mean $\hat{\mu}_0^{(1)}$ of arm A in Figure 6b) are on average in the opposite direction to the changes in the prediction tendency $\hat{\mu}_1^{(2)}$ (Figure 5b, as well as in the predicted mean $\hat{\mu}_0^{(2)}$ of arm B in Figure 6c). Meanwhile, the prediction correlation $\hat{\rho}_1$ continues to decrease during this stage (Figure 6d), manifesting the incongruency of the two arms. From the eleventh block to the seventeenth block (trials 151 to 255, Figure 3), the changes in $\hat{\mu}_1^{(1)}$ and $\hat{\mu}_1^{(2)}$ share the same trend (Figure 5), as do the changes in $\hat{\mu}_0^{(1)}$ and $\hat{\mu}_0^{(2)}$ (Figure 6b,c), because the two arms have the same expected states. Consequently, the prediction correlation $\hat{\rho}_1$ increases throughout this stage (Figure 6d).
The log-volatility in the natural parameters ($\mu_2^{(1)}$ and $\mu_2^{(3)}$, i.e., the internal representation of the expected states) of the two arms changes notably from the third block to the fourteenth block (trials 31 to 210, Figure 7a,c). The changes are more evident from the sixth block to the fourteenth block, during which the volatility is stronger. From the second to the tenth block (trials 16 to 150), the expected states of the two arms are not equal. Instead, they are incongruent (Figure 3). During this period, the log-volatility state $\mu_2^{(2)}$, corresponding to the prediction correlation $\hat{\rho}_1$, decreased and kept a descending trend (Figure 7b). This is consistent with the fact that the two arms are incongruent during this period. In contrast, from the eleventh block to the seventeenth block (trials 151 to 255, Figure 3), the expected states of the two arms are equal; therefore, the Bayesian learner discovered an increasing log-volatility state $\mu_2^{(2)}$ during this stage (Figure 7b).
In Figure 8, our Bayesian agent M 1 is compared with the informed agent. In our Bayesian decision model, the evidence for decision-making is quantified by the probability P ( a ( t ) = 1 ) (red solid line in Figure 8a). It is close to the probabilistically optimal expectation of response action P * ( a ( t ) = 1 ) given by the informed agent (blue dashed line in Figure 8a). The action selection behavior of our Bayesian agent was similar to the optimal probability decision pattern. In this experiment, the regret R ( P ( a ( t k ) = 1 ) | u s , a s , r s ) is worked out by substituting P ( a ( t ) = 1 ) generated by our Bayesian agent into Equation (39), and is 27.3588.
Since the Bayesian agent made decisions based on the estimated decision evidence, it could be distracted by high rewards associated with the wrong action. In the third block (trials 31 to 45), the rewards of the option “Same” were higher than those of the option “Different”, and the Bayesian agent was biased towards choosing “Same”, reflecting the fact that less likely but highly rewarded actions are worth trying (Figure 6e). This phenomenon was also evident at the beginning of the eleventh block (trials 151 to 155), where high rewards were more often assigned to the option “Different”. The Bayesian agent from time to time reduced the probability of choosing the option “Same”, leading it to select “Different” more often (Figure 6e). The cumulative reward obtained by the Bayesian agent maintains a linearly increasing trend irrespective of the volatility (red line in Figure 6f and Figure 8b), staying close to the reward gained by the informed agent (blue line in Figure 8b).

5.2. Bayesian Model Selection

To evaluate the proposed hierarchical Bayesian model for inferring and decision making, we adopt the Bayesian model selection methodology [50]. It is a general principle to favor a model that achieves a balanced tradeoff between complexity and flexibility. The proposed hierarchical Bayesian model has the complexity needed to capture volatility in a multiscale fashion. We compare it with a well-known baseline model in psychology and Reinforcement Learning (RL), namely the Rescorla–Wagner (RW) model [26,51]. As a special case of the Temporal-Difference Learning method, the Rescorla–Wagner model updates value estimates based on prediction errors [26].
To perform fair comparisons, we construct a variant of the RW model using the same response model as the proposed hierarchical Bayesian model (cf. Appendix G). The agent with the RW model and the above response model is denoted by $M_2$. Under the same variational Bayesian learning scheme, we search for the optimal parameters of $M_2$ on each sequence of sensory inputs (Appendix D).
We conducted a Bayesian model selection experiment to compare the proposed Bayesian agent $M_1$, based on a variant of GHBF, with the agent $M_2$, based on the RW model. The simulation was performed in the following steps.
(1)
Generating synthetic dataset $D$. According to Figure 3, we randomly generated 100 sequences of multivariate binary inputs $u_s = \{u(t_1), u(t_2), u(t_3), \ldots, u(t_K)\}$ ($K = 255$). Then the series of ideal actions $a_s = \{a_{ideal}(t_1), a_{ideal}(t_2), \ldots, a_{ideal}(t_K)\}$ are computed according to Equation (38). Random reward sequences $r_s = \{r(t_1), r(t_2), r(t_3), \ldots, r(t_K)\}$ are generated from the uniform distribution $U(1, 4)$ based on the reward set $S_r$. Here we use the notation $D$ to denote the set of sensory, reward and action sequences
$$D = \{(u_s, r_s, a_s) \mid u_s \text{ and } r_s \text{ are repeatedly generated}\}.$$
(2)
Initializing sufficient statistics of all random parameters in our Bayesian agent $M_1$. We chose particular initial sufficient statistics of the parameter vector $\xi_1$ to allow the Bayesian agent $M_1$ to work well on all sequences of sensory inputs. Then we determined the prior distribution of $\xi_1$. All parameter configurations of the agent based on GHBF (Figure 4) are shown in Table 2.
(3)
Initializing sufficient statistics of all random parameters in the RW-agent $M_2$. We determined a particular initial value of the parameter vector $\xi_2$ for the agent $M_2$. All parameter configurations of the agent based on the Rescorla–Wagner model are shown in Table A2. The response model of the RW-agent uses the same parameter configuration as the Bayesian agent in step 2.
(4)
Maximizing negative free energy. On each sequence of sensory inputs, we ran an optimization to obtain the optimal prior parameters $\mu_{\xi_1}^*, C_{\xi_1}^*$ of the parameter $\xi_1$ for the agent $M_1$ and the optimal prior parameters $\mu_{\xi_2}^*, C_{\xi_2}^*$ of the parameter $\xi_2$ for the agent $M_2$, according to Equation (A21). In this paper, we used the quasi-Newton Broyden–Fletcher–Goldfarb–Shanno method within a line search framework [49] to maximize the negative free energy (Equations (A19)–(A21)).
(5)
Evaluating negative free energy. On each sequence of sensory inputs, we can evaluate the maximum negative free energies F ξ 1 * for the agent M 1 and F ξ 2 * for the agent M 2 according to Equation (A22). Then Bayesian Factors are evaluated by Equations (A33) and (A34).
In each gambling task, the two agents $M_1$, $M_2$ generate their time-courses of the predicted states $\mu_0(t)$ on the expectation of the states of the two-armed bandit. The predicted states are recorded into a $d_0 \times K$ matrix $T$
$$T = [\mu_0(t_1), \mu_0(t_2), \ldots, \mu_0(t_K)] \in \mathbb{R}^{d_0 \times K}.$$
Given any triple of sequences (randomly generated in step 1) $(u_s, a_s, r_s) \in D$, our Bayesian agent $M_1$ and the RW agent $M_2$ dynamically infer the states, and form inference trajectories $T_1(u_s, a_s, r_s)$ and $T_2(u_s, a_s, r_s)$ respectively. The mean dynamic inference trajectory $\bar{T}_i$ and the standard deviation $\sigma_{T_i}$ are computed as
$$\bar{T}_i = \frac{1}{|D|} \sum_{(u_s, a_s, r_s) \in D} T_i(u_s, a_s, r_s), \qquad \sigma_{T_i} = \sqrt{\frac{1}{|D| - 1} \sum_{(u_s, a_s, r_s) \in D} \left( T_i(u_s, a_s, r_s) - \bar{T}_i \right)^2},$$
where the notation | D | is the number of elements in the dataset D (e.g., | D | = 100 ).
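The element-wise mean and sample standard deviation of the trajectories can be sketched as follows. This is a minimal illustration, assuming each trajectory has been flattened into a plain list of numbers; the function name `trajectory_stats` and the toy data are hypothetical.

```python
import statistics

def trajectory_stats(trajectories):
    """Element-wise mean and sample standard deviation across trajectories.

    trajectories: list of equally long lists, one per sequence in the dataset.
    """
    columns = list(zip(*trajectories))
    mean = [statistics.fmean(col) for col in columns]
    # statistics.stdev uses the 1 / (n - 1) normalization, matching |D| - 1.
    std = [statistics.stdev(col) for col in columns]
    return mean, std

# Toy data: three "trajectories" with two time points each.
toy = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
mean_traj, std_traj = trajectory_stats(toy)
```

The same helper applies to scalar regrets by passing one-element lists.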
Figure 9 shows that both the Bayesian agent (red lines) and the RW agent (blue dashed lines) are able to track the ground-truth probabilities well, showing quick jumps at the points where the ground truth changes. However, the RW agent produces more variable predictions, as shown by its larger standard deviation (blue shaded areas vs. red shaded areas). The RW agent also often overshoots in its estimates (blue dashed lines vs. black lines). These results suggest that the RW agent overfits the observations.
In each gambling task, given the sensory inputs $u_s$, actions $a_s$ and rewards $r_s$, the two agents $M_1$, $M_2$ generate their series of the expectations of the response action $P(a(t) = 1)$, denoted by $\{P_1(a(t_k) = 1) \mid k = 1, 2, \ldots, K\}$ and $\{P_2(a(t_k) = 1) \mid k = 1, 2, \ldots, K\}$ respectively. Their regrets can be evaluated by substituting $P_i(a(t_k) = 1)$ into Equation (40). The mean regret $\bar{R}_i$ and the standard deviation $\sigma_{R_i}$ on the synthetic dataset $D$ are computed as
$$\bar{R}_i = \frac{1}{|D|} \sum_{(u_s, a_s, r_s) \in D} R(P_i(a(t_k) = 1) \mid u_s, a_s, r_s), \qquad \sigma_{R_i} = \sqrt{\frac{1}{|D| - 1} \sum_{(u_s, a_s, r_s) \in D} \left( R(P_i(a(t_k) = 1) \mid u_s, a_s, r_s) - \bar{R}_i \right)^2}.$$
The mean R ¯ 1 and standard deviation σ R 1 of our Bayesian agent (based on GHBF) M 1 are smaller than the mean R ¯ 2 and standard deviation σ R 2 of the RW-agent (based on the RW model) M 2 (Figure 10).
To evaluate the two models more formally, given the three sequences $(u_s, r_s, a_s)$, we computed the Bayesian Factor $BF$ without the Bayesian Information Criterion (BIC)
$$BF := \frac{p(u_s, r_s, a_s \mid M_1)}{p(u_s, r_s, a_s \mid M_2)} \approx \exp(F_{\xi_1}^* - F_{\xi_2}^*),$$
and the Bayesian Factor $BF_{BIC}$ with BIC (cf. Appendices E and F)
$$BF_{BIC} := \exp\left( F_{\xi_1}^* - F_{\xi_2}^* - \frac{d_{\xi_1} - d_{\xi_2}}{2} \ln(K) \right),$$
where $d_{\xi_i}$ is the number of free parameters estimated by model $M_i$. The notations $F_{\xi_1}^*$, $F_{\xi_2}^*$ are respectively the maximal negative free energies of the two agents $M_1$, $M_2$ on the given triple of sequences $(u_s, r_s, a_s)$. Under both measures, the Bayesian Factors on the observation dataset $D$ are concentrated in the range above 100 (i.e., $BF > 100$, $BF_{BIC} > 100$) (Figure 11a,b), which constitutes decisive evidence that the Bayesian agent outperforms the RW agent according to Table A1.
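As a small illustration of the two Bayesian Factors, the sketch below evaluates them from a pair of maximal negative free energies. The function name `bayes_factors` and all numerical values are hypothetical and are not results from the paper.

```python
import math

def bayes_factors(f1, f2, d1, d2, k):
    """Return (BF, BF_BIC) from maximal negative free energies f1 and f2.

    d1, d2: numbers of free parameters of the two models.
    k: number of trials in the sequence.
    """
    log_bf = f1 - f2
    # The BIC-style correction penalizes the model with more free parameters.
    log_bf_bic = log_bf - 0.5 * (d1 - d2) * math.log(k)
    return math.exp(log_bf), math.exp(log_bf_bic)

# Hypothetical values chosen only to exercise the formulas.
bf, bf_bic = bayes_factors(f1=-100.0, f2=-130.0, d1=7, d2=3, k=255)
```

In practice it is often safer to compare the log Bayesian Factors directly, because exponentiating large free energy differences can overflow.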

6. Discussion

6.1. Contributions of This Work

In this article, we have introduced a hierarchical Bayesian model that describes how to infer volatility (i.e., environmental uncertainty and correlations) in a multi-dimensional space. In this model, the bottom level learns the state expectation of a multi-armed bandit, which is described by a multivariate Bernoulli distribution. The natural parameter $x_1$ of the Bernoulli distribution is learned by the first level. Under the Brownian and Gaussian assumption on $x_1$, volatility can be strictly determined by the Cholesky decomposition of the pervasion intensity of the Brownian motion $x_1$ [46]. Therefore, we can define the volatility in $x_1$ as the Cholesky decomposition of the pervasion intensity of $x_1$. Next, the volatility in $x_1$ can be represented by $x_2$, which evolves as a Brownian motion. The low-order interactions between the dimensions of the Bernoulli distribution and the environmental uncertainties are captured in the second level, corresponding to $x_2$.
The hierarchical Bayesian model assumes that the tendency of a binary pattern evolves as a general Brownian motion at the first level. The tendency can be updated by Equation (19), where the prediction error $PE_0(t_k)$ is the information gap between the agent’s belief and the sensory input. This quantity is a target that the agent should learn to diminish. The parameter vector $\zeta_1$ functions as a weighting vector for the prediction error $PE_0(t_k)$. The covariance $C_1(t_k)$ plays the role of a complex adaptive learning rate in Equation (19).
In principle, the proposed model could be easily generalized to a Bayesian framework for decision making in high-dimensional multinary environments, by defining appropriate forms of perceptual models and response models. In this study, the input space was assumed to be binary. For multinary environments, the representations of the tendency of the inputs could be defined accordingly to form a hierarchical perceptual model. Here we derived a response model from Bayesian decision theory with the goal of maximizing expected rewards or minimizing expected risk or loss [47,52]. For other problems of interest, it is sufficient to construct a compatible response model addressing the particular optimization criteria of the question. For example, recognition and navigation tasks could be formulated in the proposed Bayesian framework to cope with the interactions between multimodal information [53,54].
In summary, the main contributions of this work are twofold. First, the model captures the correlations between the dimensions of the sensory space, and is able to make decisions contingent on the structure of the sensory inputs. Simulations show that our model is applicable to complex inference and decision making tasks that could not be tackled by methods with independence assumptions of the high dimensional input features. Second, the model represents the tendency and volatility of the sensory inputs in a hierarchical manner based on the idea of nested Brownian motions. The resulting hierarchical computational framework naturally allows the interactions between layers, and is able to track the dynamics of the environment.

6.2. Related Works

The proposed hierarchical Bayesian model is most closely related to the Rescorla–Wagner model [35,51,55,56]. Equation (18) is a generalized form of the Rescorla–Wagner equation in reinforcement learning [26]
$$\mu_1(t_k) = \mu_1(t_{k-1}) + \Gamma \Delta\mu_1(t_k),$$
where $\Delta\mu_1(t_k)$ is an error signal (or learned target) at time $t_k$. In the cognitive neuroscience field, some variants of the RW model have been introduced for the behavioral paradigm of multi-armed bandits. However, the RW model has difficulty capturing the volatility of the signal. More importantly, since the learning rate in the RW model is constant, it is difficult to interpret the subject’s dynamic process of capturing effective information during the experiment. As an example, given the reward $R(t_{k-1})$ at time $t_k$, the standard RW model estimates the value state variable $V$ by
$$V(t_k) = V(t_{k-1}) + \alpha\, PE(t_k), \qquad PE(t_k) = R(t_{k-1}) - V(t_{k-1}),$$
which simplifies to
$$V(t_k) = (1 - \alpha) V(t_{k-1}) + \alpha R(t_{k-1}).$$
Clearly, the learning rate $\alpha$ acts as the weight of an exponentially weighted moving average over the initial value $V(t_0)$ and the reward sequence $R(t_1), R(t_2), \ldots, R(t_k)$. This is an inflexible filtering method for coping with volatility. With a small learning rate $\alpha$, the RW model predicts mainly from the input history, which suits slowly changing signals. With a large learning rate, it relies mainly on the most recent rewards, which suits fast-changing signals. However, the RW model does not unify the two learning regimes (i.e., the learning rate cannot be adapted to the environment and the agent state). In this sense, our hierarchical Bayesian model provides a theoretically justified mechanism to adapt the learning rate dynamically according to the volatility of the environment and the states of the agent.
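The moving-average reading of the RW update can be checked numerically. The sketch below is an illustration with hypothetical function names and toy rewards; it indexes rewards from zero rather than following the time indices of the text. It compares the recursive update with its unrolled form, in which the initial value is discounted by $(1-\alpha)^k$ and each reward by a geometric weight.

```python
import math

def rw_filter(rewards, alpha, v0=0.0):
    """Apply V <- V + alpha * (R - V) once per reward; return all values."""
    v = v0
    values = []
    for r in rewards:
        v = v + alpha * (r - v)
        values.append(v)
    return values

def rw_closed_form(rewards, alpha, v0=0.0):
    """Unrolled recursion: exponentially weighted average of v0 and the rewards."""
    k = len(rewards)
    weighted = sum(alpha * (1 - alpha) ** (k - 1 - j) * r
                   for j, r in enumerate(rewards))
    return (1 - alpha) ** k * v0 + weighted

toy_rewards = [1, 4, 2, 3, 3, 1]
recursive_value = rw_filter(toy_rewards, alpha=0.3, v0=2.0)[-1]
unrolled_value = rw_closed_form(toy_rewards, alpha=0.3, v0=2.0)
```

Because the weights depend only on the fixed $\alpha$, the effective memory length of the filter cannot change during a run, which is the inflexibility discussed above.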
For a single-step update, the time complexity of our hierarchical Bayesian model is $O(d_0^4)$ (Equations (13)–(31)), while the time complexity of the RW model is $O(d_0)$. In our model, capturing volatility to form an adaptive learning rate leads to a higher computational cost. Experiments suggest that this computational cost is needed for the model to adapt flexibly to volatile environments. On the synthetic dataset $D$, the trajectories of the state estimation formed by our hierarchical Bayesian model (light red shaded area in Figure 9) are more narrowly distributed than those of the RW model (light blue shaded area in Figure 9), indicating the stability and robustness of the proposed model.

6.3. Strengths and Limitations

Our hierarchical Bayesian model is general enough to be easily applied in high-dimensional environments. The number of parameters of the model scales quadratically with respect to the dimension of the input space. Given the number of dimensions $d_0$, corresponding to the dimension of $x_0(t)$ at the bottom level (i.e., sensory input $u(t)$), the dimension of the parameter $\xi_1$ of our perceptual model is $d_0 + 2 d_1 + 5 d_2 = \frac{d_0 (d_0 + 5)}{2}$ (cf. Appendix C). In the Bayesian learning process, the optimization algorithm (i.e., the quasi-Newton Broyden–Fletcher–Goldfarb–Shanno method) needs to numerically evaluate the gradient of the negative free energy with respect to each component of the model parameter $\xi_1$. For a large number of dimensions $d_0$, parallel computing frameworks based on CPUs and GPUs need to be developed in order to improve the efficiency of evaluating numerical gradients.
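To make the cost of numerical gradients concrete, the sketch below uses central finite differences, which need two evaluations of the objective per parameter component. This is an assumed scheme for illustration; the paper does not state which differencing rule its implementation uses, and the objective `toy_objective` is a hypothetical stand-in for the negative free energy.

```python
def numerical_gradient(f, xi, h=1e-5):
    """Central-difference gradient of f at the point xi (a list of floats)."""
    grad = []
    for i in range(len(xi)):
        forward = list(xi)
        backward = list(xi)
        forward[i] += h
        backward[i] -= h
        # Two evaluations of f per component: 2 * len(xi) evaluations in total.
        grad.append((f(forward) - f(backward)) / (2 * h))
    return grad

def toy_objective(xi):
    """Hypothetical concave objective with its maximum at (1, 2, 3)."""
    targets = [1.0, 2.0, 3.0]
    return -sum((x - t) ** 2 for x, t in zip(xi, targets))

toy_gradient = numerical_gradient(toy_objective, [0.0, 0.0, 0.0])
```

Since each component is handled independently, the loop over `i` is the natural place to parallelize when the number of parameters grows.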

6.4. Future Work

In this paper, we constructed a hierarchical Bayesian model for inferring and decision-making in multivariate volatile binary environments, and tested and validated it on a synthetic dataset. We plan to use this model to explain human decision-making behaviors and brain activities. To this end, we need to collect behavioral and neuroimaging data while human subjects perform the same multi-armed bandit task as defined in this paper. On the theoretical side, the mechanisms of the adaptive learning rate and of the correlation among natural parameters in our hierarchical model are worth further clarification, and we look forward to analyzing these critical mechanisms in future investigations.

7. Conclusions

We have introduced a hierarchical Bayesian model for decision making in high-dimensional volatile environments, and derived a family of interpretable closed-form update rules. Based on this framework, we defined a Bayesian agent endowed with the proposed hierarchical Bayesian model, as a mentalizing model of a biological agent, to perform an abstract multi-armed bandit task. Simulations show that our model is applicable to complex tasks that could not be tackled by models with independence assumptions. Crucially, the proposed model contains a hierarchical perceptual model that is able to capture different covariances (e.g., prediction covariance, posterior covariance, likelihood covariance). As an important indicator of mental process, prediction correlation is dynamically estimated in the second level of the hierarchical perceptual model. Prediction correlation quantitatively describes (weak) pairwise interactions among different perception quantities (e.g., natural parameters of multi-armed bandits). In conclusion, the proposed hierarchical Bayesian model provides a powerful tool to solve complex perception and decision making problems in high-dimensional volatile environments [57], as well as to quantify complex phenomena such as perceptual decision making, spatial navigation, social interactions and exploratory behaviors [58,59,60,61,62,63].

Author Contributions

C.Z. and B.S. conceived the model. C.Z. conducted the simulations, with inputs from B.S., K.Z., F.T., Y.T. and X.L. C.Z. and B.S. analyzed the results. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Science and Technology Innovation 2030 Major Program of China grant number 2022ZD0205000.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Custom code was written by the authors in MATLAB. Raw data and the MATLAB code used to analyze the results can be accessed at https://github.com/changbozhu/GHBF-mvBern-simulation.git (accessed on 8 December 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this paper:
GHBF: General Hierarchical Brownian Filter
SI: Sampling interval
RL: Reinforcement Learning
RW: Rescorla–Wagner
BF: Bayesian Factor
BIC: Bayesian Information Criterion

Appendix A. Bayesian Agent

A Bayesian agent summarizes its past experience of perceptions and decisions to adapt to the external environment. After observing sensory input $u$, the agent integrates internal salability priors obtained from past experience with the current information provided by the sensory input. It then yields inferences and predictions about the external environment. Based on the estimated states of the external environment, the agent makes a decision to manipulate the external environment. Bayes’ rule supports optimal probabilistic inference and provides a calculus for representing beliefs and acting in the external environment in an efficient and consistent manner [22,23,64,65].
More specifically, a Bayesian agent $M$ is defined by the likelihood $p(u \mid x, M)$ and a prior $p(x \mid M)$ on the hidden state $x$. After receiving sensory input $u$, the agent infers a posterior distribution $p(x \mid u, M)$ according to Bayes’ rule for future perception and action
$$p(x \mid u, M) = \frac{p(x \mid M)\, p(u \mid x, M)}{\int p(x \mid M)\, p(u \mid x, M)\, dx}. \tag{A1}$$
To act in the external environment, the agent selects an action $a^*$ from an action set $A$ according to the prediction or posterior distribution of hidden states $p(x \mid u, M) \approx q(x; \chi)$, with $\chi$ being the sufficient statistics of the posterior hidden states $x$. Here $q(x; \chi)$ is an approximation of the true posterior $p(x \mid u, M)$. In general, a response model $\pi_r$ is defined to map hidden states into actions, which can be a deterministic or stochastic mapping.
Figure A1. Interaction between an agent and the external environment.
Unfortunately, the integral in Equation (A1) is generally intractable. To calculate the posterior $p(x \mid u, M)$, we resort to variational Bayesian methods [22] to approximate Bayesian inference efficiently. This is done by finding a lower bound on the logarithm of the model evidence $\ln p(u \mid M)$, called the negative free energy $F(q(x; \chi), u)$
$$\begin{aligned} \ln p(u \mid M) &= \ln \int q(x; \chi) \frac{p(u, x \mid M)}{q(x; \chi)}\, dx \geq \int q(x; \chi) \ln \frac{p(u, x \mid M)}{q(x; \chi)}\, dx \\ &= \int q(x; \chi) \ln p(u, x \mid M)\, dx - \int q(x; \chi) \ln q(x; \chi)\, dx \\ &= H(x; \chi) - U(x; \chi) \\ &= \ln p(u \mid M) - D_{KL}[q(x; \chi) \,\|\, p(x \mid u, M)] \\ &= F(q(x; \chi), u), \end{aligned} \tag{A2}$$
where $H(x; \chi) = -\int q(x; \chi) \ln q(x; \chi)\, dx$ is the entropy, and
$$U(x; \chi) = -\int q(x; \chi) \ln p(u, x \mid M)\, dx$$
is the internal energy. Equation (A2) shows that the lower bound is the negative free energy, i.e., the entropy $H(x; \chi)$ minus the internal energy $U(x; \chi)$. The Kullback–Leibler divergence $D_{KL}[q(x; \chi) \,\|\, p(x \mid u, M)] \geq 0$ measures the difference between the approximation and the true posterior. The better the approximation $q(x; \chi)$ is, the smaller the divergence is. The minimal divergence 0 occurs when the approximation $q(x; \chi)$ is equal to $p(x \mid u, M)$. The agent can therefore obtain the optimal approximate posterior $q(x; \chi)$ by maximizing the negative free energy $F(q(x; \chi), u)$
$$q(x; \chi_0) = \arg\max_{q} F(q(x; \chi), u). \tag{A3}$$
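The relation between the model evidence, the negative free energy and the Kullback–Leibler divergence can be verified on a toy problem. The sketch below assumes a discrete hidden state with two values, so that all integrals become sums; the prior, likelihood and trial distribution `q` are hypothetical numbers.

```python
import math

def bayes_posterior(prior, likelihood):
    """Exact posterior and evidence for a discrete hidden state."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

def negative_free_energy(q, prior, likelihood):
    """F = H - U, assuming all entries of q are strictly positive."""
    entropy = -sum(qi * math.log(qi) for qi in q)
    internal_energy = -sum(qi * math.log(p * l)
                           for qi, p, l in zip(q, prior, likelihood))
    return entropy - internal_energy

def kl_divergence(q, p):
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

prior = [0.5, 0.5]        # p(x | M)
likelihood = [0.9, 0.2]   # p(u | x, M) for the observed u
post, evidence = bayes_posterior(prior, likelihood)
q = [0.6, 0.4]            # an arbitrary approximate posterior
gap = math.log(evidence) - negative_free_energy(q, prior, likelihood)
```

Here `gap` equals the divergence between `q` and the exact posterior, and it vanishes when `q` is replaced by `post`.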
We use the Lagrange method to solve this maximization problem. The Lagrangian functional is defined as
$$\bar{F}(q(x; \chi), u) = F(q(x; \chi), u) + \upsilon \left[ \int q(x; \chi)\, dx - 1 \right], \tag{A4}$$
where $\upsilon$ is a Lagrange multiplier. The solution of the optimization problem (Equation (A3)) is also the solution of the variational equation (Equation (A5))
$$\frac{\delta \bar{F}(q(x; \chi), u)}{\delta q} = 0. \tag{A5}$$

Appendix B. Variational Bayesian Inference

Given a Bayesian perceptual model $p(x_s, u \mid \psi, \epsilon)$, where $\psi$ and $\epsilon$ are parameters, the model evidence $p(u \mid \psi, \epsilon)$ is often analytically intractable. Therefore, exact Bayesian posteriors could not be analytically calculated. We apply variational Bayesian methods to transform the calculation of exact Bayesian posteriors $p(x_s \mid u, \psi, \epsilon)$ into finding the optimal variational posteriors $q(x_s)$ (cf. Equations (A2)–(A5) in Appendix A). The lower bound on the logarithm of the model evidence $p(u \mid \psi, \epsilon)$ is given by
$$\begin{aligned} \ln p(u \mid \psi, \epsilon) &= \ln \int q(x_s) \frac{p(u, x_s \mid \psi, \epsilon)}{q(x_s)}\, dx_s \geq \int q(x_s) \ln \frac{p(u, x_s \mid \psi, \epsilon)}{q(x_s)}\, dx_s \\ &= \int q(x_s) \ln p(u, x_s \mid \psi_s, \epsilon)\, dx_s - \int q(x_s) \ln q(x_s)\, dx_s \\ &= -U(x_s) + H(x_s) = F(q(x_s)). \end{aligned} \tag{A6}$$
Then we use an important assumption that marginal variational posteriors over latent variables are independent, i.e., the joint variational posterior distribution factorizes with respect to all marginal posteriors
$$q(x_s) = \prod_{x_h \in x_s} q(x_h), \tag{A7}$$
where x h is one element of x s . The factorized form in Equation (A7) corresponds to the so-called mean field approximation, an approximation scheme developed in statistical mechanics.
We now wish to maximize the negative free energy $F(q(x_s))$ with respect to each approximate posterior $q(x_h)$ under the normalization constraint $\int q(x_h)\, dx_h = 1, \forall h$. The Lagrangian functional $\bar{F}(q(x_s))$ is defined as
$$\begin{aligned} \bar{F}(q(x_s)) &= \bar{F}(q(x_{s \setminus h}), q(x_h)) \triangleq -F(q(x_s)) + \sum_{h=1}^{H} \kappa_h \left[ \int q(x_h)\, dx_h - 1 \right] \\ &= -\int q(x_s) \ln p(x_s, u \mid \psi_s, \epsilon)\, dx_s + \int q(x_s) \ln q(x_s)\, dx_s + \sum_{h=1}^{H} \kappa_h \left[ \int q(x_h)\, dx_h - 1 \right] \\ &= -\iint q(x_{s \setminus h})\, q(x_h) \ln p(x_s, u \mid \psi_s, \epsilon)\, dx_{s \setminus h}\, dx_h + \iint q(x_{s \setminus h})\, q(x_h) \ln \left[ q(x_{s \setminus h})\, q(x_h) \right] dx_{s \setminus h}\, dx_h \\ &\quad + \sum_{i \in \mathcal{H} \setminus \{h\}} \kappa_i \left[ \int q(x_i)\, dx_i - 1 \right] + \kappa_h \left[ \int q(x_h)\, dx_h - 1 \right], \end{aligned} \tag{A8}$$
where $\kappa_h$ is a Lagrange multiplier. We use $x_{s \setminus h}$ to denote the set difference $x_s \setminus \{x_h\}$, the notation $\mathcal{H}$ for the index set $\{1, 2, 3, \ldots, H\}$, and the notation $\mathcal{H} \setminus \{h\}$ for the index set with $h$ removed. The variation of Equation (A8) with respect to $q(x_h)$ is
$$\begin{aligned} \frac{\delta \bar{F}(q(x_s))}{\delta q(x_h)} &= \frac{\delta \bar{F}(q(x_{s \setminus h}), q(x_h))}{\delta q(x_h)} \\ &= -\int q(x_{s \setminus h}) \ln p(x_s, u \mid \psi_s, \epsilon)\, dx_{s \setminus h} + \int q(x_{s \setminus h}) \ln q(x_{s \setminus h})\, dx_{s \setminus h} + \ln q(x_h) + \kappa_h + 1 = 0. \end{aligned} \tag{A9}$$
The optimal variational approximate posterior $q(x_h)$ has the form of a Boltzmann distribution
$$\begin{aligned} q(x_h) &= \frac{1}{Z_h} \exp\left( V_h(x_h) \right) \\ V_h(x_h) &= \int q(x_{s \setminus h}) \ln p(x_s, u \mid \psi_s, \epsilon)\, dx_{s \setminus h} \\ H_e(x_{s \setminus h}) &= -\int q(x_{s \setminus h}) \ln q(x_{s \setminus h})\, dx_{s \setminus h} \\ Z_h &= \exp\left( -H_e(x_{s \setminus h}) + \kappa_h + 1 \right), \end{aligned} \tag{A10}$$
where $V_h(x_h) = \int q(x_{s \setminus h}) \ln p(x_s, u \mid \psi_s, \epsilon)\, dx_{s \setminus h}$ corresponds to the negative internal energy over the hidden variable $x_h$. The quantity $V_h(x_h)$ is often called the variational energy.
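The Boltzmann-form update can be tried on a toy model with two binary hidden variables, where the variational energy is a finite sum. The sketch below is an illustration only: the joint table is hypothetical, and the updates are applied in alternation (coordinate ascent), which never decreases the negative free energy.

```python
import math

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def free_energy(joint, q1, q2):
    """Negative free energy E_q[ln p(x1, x2, u)] - E_q[ln q] for factorized q."""
    value = 0.0
    for i in range(2):
        for j in range(2):
            q = q1[i] * q2[j]
            value += q * (math.log(joint[i][j]) - math.log(q))
    return value

def mean_field(joint, sweeps=50):
    """Alternate the updates q(x_h) proportional to exp(V_h(x_h))."""
    q1, q2 = [0.5, 0.5], [0.5, 0.5]
    for _ in range(sweeps):
        # V_1(x1) = sum over x2 of q2(x2) * ln p(x1, x2, u)
        q1 = normalize([
            math.exp(sum(q2[j] * math.log(joint[i][j]) for j in range(2)))
            for i in range(2)])
        # V_2(x2) = sum over x1 of q1(x1) * ln p(x1, x2, u)
        q2 = normalize([
            math.exp(sum(q1[i] * math.log(joint[i][j]) for i in range(2)))
            for j in range(2)])
    return q1, q2

# Hypothetical joint p(x1, x2, u) for a fixed observation u (entries sum to p(u)).
joint = [[0.30, 0.05], [0.10, 0.15]]
f_start = free_energy(joint, [0.5, 0.5], [0.5, 0.5])
q1, q2 = mean_field(joint)
f_end = free_energy(joint, q1, q2)
log_evidence = math.log(sum(sum(row) for row in joint))
```

The final value `f_end` stays below `log_evidence`; the remaining gap reflects the correlation between the two variables that a factorized posterior cannot represent.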

Appendix C. Probabilistic Representation of Parameters

In a Bayesian model, an unknown parameter can be treated as a random variable. Probability models could be employed to determine the parameters. Put simply, the probability density function of each random parameter is modeled by a delta-function at each time, and their values follow various multivariate Gaussian distributions [22,23]. In addition, different parameters may have different constraints, therefore we introduce parameterizations to represent these constrained parameters [46].
The coupling mapping $F_2$ contains the bias $b_2$ and the coupling strength $w_2$ as parameters (Equation (9)). We assume that $b_2$ follows a multivariate Gaussian distribution with mean $\mu_{b_2}$ and covariance $C_{b_2}$
$$q(b_2) = \mathcal{N}(b_2; \mu_{b_2}, C_{b_2}). \tag{A11}$$
In principle, there should not be any constraints on the coupling strength $w_2$. However, there is no reason to allow negative elements in $w_2$, since negativity in $w_2$ could be counterbalanced by negativity in $x_2$. Therefore, the lower bound on each component $w_2^{(i)}$ is chosen to be 0. In addition, considering that $w_2$ is involved in the update of the positive definite precision matrix $P_2$ (Equation (30)), each component $w_2^{(i)}$ ($i = 1, 2, \ldots, d_2$) should have an upper bound. If the value of $w_2^{(i)}$ is too large, $P_2$ would become degenerate. To avoid such violations, we set the upper bound of $w_2^{(i)}$ to be a constant value $\alpha_{w_2}^{(i)} > 0$, i.e., the $i$-th component of a constant column vector $\alpha_{w_2}$. We use a sigmoid function to map a multivariate Gaussian variable into the bounded variable $w_2$. This transformation and the priors on $w_2$ are given as
$$\begin{aligned} w_2^{(i)} &= W_2^{(i,i)} = \frac{\alpha_{w_2}^{(i)}}{1 + \exp(-w_2^{(i)G})}, \quad \forall i \in \{1, 2, \ldots, d_2\} \\ w_2 &= \alpha_{w_2} \odot s(w_2^{G}, 1) \\ q(w_2) &= q(w_2^{G}) = \mathcal{N}(w_2^{G}; \mu_{w_2^{G}}, C_{w_2^{G}}). \end{aligned} \tag{A12}$$
The parameter $\lambda_{top}$ naturally has a lower bound of 0 because it is constrained by variances, but if $\lambda_{top}$ is not bounded from above, it may cause some violations: a large $\lambda_{top}$ yields a small prediction precision $\hat{\Pi}_2$ where all variances are close to 0 and causes the posterior precision $P_2$ not to be a positive definite matrix. That is to say, an unbounded vector $\lambda_{top}$ violates the conditions of the update equations, yielding an improbable perceptual inference. Therefore, we set an upper bound $\alpha_{\lambda_{top}}$ on $\lambda_{top}$, through a bounded sigmoid function similar to that in Equation (A12)
$$\begin{aligned} \lambda_{top}^{(i)} &= \frac{\alpha_{\lambda_{top}}^{(i)}}{1 + \exp(-\lambda_{top}^{(i)G})}, \quad \forall i \in \{1, 2, \ldots, d_2\} \\ \lambda_{top} &= \alpha_{\lambda_{top}} \odot s(\lambda_{top}^{G}, 1) \\ q(\lambda_{top}) &= q(\lambda_{top}^{G}) = \mathcal{N}(\lambda_{top}^{G}; \mu_{\lambda_{top}^{G}}, C_{\lambda_{top}^{G}}). \end{aligned} \tag{A13}$$
In the hierarchical model, we have introduced sensory noise parameters ζ 1 , with all positive components. We represent these parameters in logarithmic space to preserve nonnegativity. More specifically, ζ 1 is expressed in its log-space by a Gaussian random vector ζ 1 G
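$$\begin{aligned} \zeta_1^{(i)} &= \exp(\zeta_1^{(i)G}), \quad \zeta_1^{(i)G} \in \mathbb{R} \\ \zeta_1 &= \exp(\zeta_1^{G}) \\ q(\zeta_1) &= q(\zeta_1^{G}) = \mathcal{N}(\zeta_1^{G}; \mu_{\zeta_1^{G}}, C_{\zeta_1^{G}}). \end{aligned} \tag{A14}$$
Here, we employ an element-wise exponential function $\exp(\cdot)$ to map a multivariate Gaussian random variable $\zeta_1^{G}$ into $\zeta_1$.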
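The two reparameterizations used in this appendix can be sketched as plain functions that map an unconstrained real value to a constrained parameter. The function names and the upper bound of 2.0 are hypothetical choices for illustration.

```python
import math

def to_bounded(g, upper):
    """Map an unconstrained value g into (0, upper) with a scaled sigmoid."""
    # Note: math.exp(-g) can overflow for very negative g (roughly below -700).
    return upper / (1.0 + math.exp(-g))

def to_positive(g):
    """Map an unconstrained value g to a positive value via the exponential."""
    return math.exp(g)

# The optimizer works on the unconstrained values; the model sees the mapped ones.
w_mid = to_bounded(0.0, upper=2.0)    # midpoint of the interval
w_low = to_bounded(-5.0, upper=2.0)   # close to the lower bound 0
w_high = to_bounded(5.0, upper=2.0)   # close to the upper bound
zeta = to_positive(-3.0)              # small but still positive
```

Placing the Gaussian prior on the unconstrained value keeps every sample inside the admissible range after the mapping.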
Aside from these structural parameters, the initial priors on all hidden states are determined in a similar way. In detail, we use a Gaussian random variable to express the initial mean $\mu_h(t_0)$
$$q(\mu_h(t_0)) = \mathcal{N}(\mu_h(t_0); \mu_{\mu_h(t_0)}, C_{\mu_h(t_0)}), \quad \forall h \in \{1, 2\}. \tag{A15}$$
Each of the initial prior covariances $C_h(t_0)$, $h = 1, 2$, is restricted to be a positive definite diagonal matrix. All principal diagonal elements of $C_h(t_0)$ form a column vector $c_h$. Since the components of $c_h$ are positive, they are represented by multivariate Gaussian random variables in log-space
$$\begin{aligned} c_h &= \exp(c_h^{G}) \\ q(C_h) &= q(c_h^{G}) = \mathcal{N}(c_h^{G}; \mu_{c_h^{G}}, C_{c_h^{G}}), \quad \forall h \in \{1, 2\}. \end{aligned} \tag{A16}$$
For the response model expressed by Equations (34)–(37), there is only one inverse temperature parameter $\zeta_a$, which is also restricted to be positive. We use the same representation as for $\zeta_1$ to express $\zeta_a$
$$\begin{aligned} \zeta_a &= \exp(\zeta_a^{G}) \\ q(\zeta_a) &= q(\zeta_a^{G}) = \mathcal{N}(\zeta_a^{G}; \mu_{\zeta_a^{G}}, C_{\zeta_a^{G}}), \end{aligned} \tag{A17}$$
where $\mu_{\zeta_a^{G}}$, $C_{\zeta_a^{G}}$ are the mean and variance of the Gaussian random variable $\zeta_a^{G}$ respectively.

Appendix D. Variational Bayesian Learning

A Bayesian agent receives and encodes sensory input $u(t)$, and then makes a perceptual decision (i.e., action) $a(t) \in A$ based on random reward $r(t)$ and perceptual evidence. These two successive processes correspond to the two main functional models of an agent: a perceptual model to encode sensory inputs and a response model to make perceptual decisions [20,21,23]. Here, we employ a GHBF as the perceptual model $M_p$ with perceptual parameter vector $\psi$ and a simple response model defined by Equations (34)–(37) as the response model $\pi_r$ with the response parameter vector $\psi_r$. The combined model is denoted by $M = (M_p, \pi_r)$. All its parameters are denoted by $\xi$.
We introduce the following mean field approximation to fit the parameters of the combined model with the sensory inputs u s = u ( t 1 ) , u ( t 2 ) , , u ( t K ) , actions a s = a ( t 1 ) , a ( t 2 ) , , a ( t K ) and random rewards r s = r ( t 1 ) , r ( t 2 ) , , r ( t K )
q ( ξ ) q ( ψ ) q ( ψ r ) = q ( λ t o p ) q ( σ u ) h = 2 H q ( w h ) p ( b h ) h = 1 H q ( μ h ( t 0 ) ) q ( C h ( t 0 ) ) .
Then
$$\begin{aligned}
\ln p(u_s, r_s, a_s | M) &= \ln \int p(u_s, r_s, a_s, \xi | M)\, d\xi = \ln \int \frac{p(u_s, r_s, a_s, \xi | M)}{q(\xi)}\, q(\xi)\, d\xi \\
&\geq \int q(\xi) \ln \left( \frac{p(u_s, r_s, a_s, \xi | M)}{q(\xi)} \right) d\xi \\
&= \int q(\xi) \ln p(u_s, r_s, a_s, \xi | M)\, d\xi - \int q(\xi) \ln q(\xi)\, d\xi \triangleq F(q(\xi)).
\end{aligned}$$
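The bound can be checked numerically on a toy problem. The sketch below replaces the continuous parameter $\xi$ with a discrete one taking three values, so that the integrals become sums; the joint values are hypothetical numbers chosen only for illustration.

```python
import math

def negative_free_energy(joint, q):
    """F(q) = sum_i q_i * [ln p(data, xi_i) - ln q_i] for a discrete parameter xi."""
    return sum(qi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(joint, q) if qi > 0.0)

# Hypothetical joint values p(data, xi_i | M) for three candidate parameter values.
joint = [0.02, 0.05, 0.01]
log_evidence = math.log(sum(joint))            # ln p(data | M)

uniform_q = [1.0 / 3.0] * 3                    # an arbitrary variational posterior
exact_posterior = [p / sum(joint) for p in joint]

f_uniform = negative_free_energy(joint, uniform_q)
f_exact = negative_free_energy(joint, exact_posterior)
```

In this toy case any choice of `q` gives a value no larger than `log_evidence`, and the bound is tight when `q` equals the exact posterior.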
We use the Lagrange multiplier method to derive the optimal variational posterior
$$q(\xi) = \frac{1}{Z_\xi} \exp\left(V(\xi)\right), \qquad V(\xi) = \ln p(u_s, r_s, a_s, \xi | M).$$
We then apply the Laplace approximation to obtain a Gaussian approximation of the variational posterior (Equation (A21))
$$\begin{aligned}
\xi^* &= \arg\max_{\xi} V(\xi) = \arg\max_{\xi} \ln p(u_s, r_s, a_s, \xi | M) \\
&= \arg\max_{\xi} \ln \left[ p(\xi, a_s | u_s, r_s, M)\, p(u_s) \right] \\
&= \arg\max_{\xi} \ln p(\xi, a_s | u_s, r_s, M) \\
&= \arg\max_{\xi} \left[ \ln p(a_s | \xi, u_s, r_s, M) + \ln p(\xi) \right] \\
&= \arg\max_{\xi} \left[ \sum_{k=1}^{K} \ln p\left(a(t_k) | u(t_k), r(t_k), \xi, M\right) + \ln p(\xi) \right] \\
&= \arg\max_{\xi} \left[ \sum_{k=1}^{K} \ln p\left(a(t_k) | r(t_k), \chi_s(t_k) = M_p(u(t_k), \psi), \psi_r\right) + \ln p(\psi) \right], \\
\mu_\xi^* &= \xi^*, \qquad C_\xi^* = -\left[ \frac{\partial^2 V(\xi^*)}{\partial \xi\, \partial \xi^T} \right]^{-1},
\end{aligned}$$
where $p\left(a(t_k) | \chi_s(t_k) = M_p(u(t_k), \psi), \psi_r\right)$ is given by a particular response model, and $\chi_s(t_k)$ is the set of sufficient statistics of the posterior hidden states in our hierarchical Bayesian perceptual model at time $t_k$
$$\chi_s(t_k) = \left\{ \mu_1(t_k), C_1(t_k), \mu_2(t_k), C_2(t_k) \right\}.$$
Finally, the maximum value $F_\xi^*$ of the negative free energy $F_\xi$ is given by
$$F_\xi \leq F_\xi^* = V(\mu_\xi^*) + \frac{d_\xi}{2} \ln 2\pi e + \frac{1}{2} \ln \det(C_\xi^*).$$
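The two steps of the Laplace approximation (locating the mode of $V$, then taking the negative inverse curvature as the covariance) can be sketched on a one-parameter toy problem. Everything below is a hypothetical stand-in rather than the paper's response model: a sigmoid choice rule with inverse temperature $\exp(\theta)$, a standard normal prior on $\theta$, made-up value differences and actions, and a grid search with golden-section refinement in place of whatever optimizer the authors used.

```python
import math

def log_sigmoid(x):
    """Numerically stable ln(1 / (1 + exp(-x)))."""
    if x >= 0.0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_joint(theta, value_diffs, actions, prior_mean=0.0, prior_var=1.0):
    """V(theta) = sum_k ln p(a_k | theta) + ln p(theta) for a toy response model."""
    beta = math.exp(theta)                     # positive inverse temperature
    log_lik = 0.0
    for d, a in zip(value_diffs, actions):
        sign = 1.0 if a == 1 else -1.0         # P(a=0) = sigmoid(-beta * d)
        log_lik += log_sigmoid(sign * beta * d)
    log_prior = (-0.5 * math.log(2.0 * math.pi * prior_var)
                 - 0.5 * (theta - prior_mean) ** 2 / prior_var)
    return log_lik + log_prior

def laplace_1d(v, lo=-4.0, hi=4.0, grid=801, h=1e-4):
    """Return (mode, variance) with variance = -1 / V''(mode)."""
    step = (hi - lo) / (grid - 1)
    best = max((lo + i * step for i in range(grid)), key=v)    # coarse search
    a, b = best - step, best + step
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while b - a > 1e-9:                        # golden-section refinement
        c, d = b - ratio * (b - a), a + ratio * (b - a)
        if v(c) < v(d):
            a = c
        else:
            b = d
    mode = 0.5 * (a + b)
    curvature = (v(mode + h) - 2.0 * v(mode) + v(mode - h)) / (h * h)
    return mode, -1.0 / curvature

# Hypothetical data: value differences between two options and the chosen actions.
value_diffs = [1.0, -0.5, 2.0, 0.3, -1.5, 0.8, -0.2, 1.2]
actions = [1, 0, 1, 0, 0, 1, 1, 1]
v = lambda theta: log_joint(theta, value_diffs, actions)
mode, variance = laplace_1d(v)
```

In more than one dimension the scalar second derivative would be replaced by the Hessian matrix and the reciprocal by a matrix inverse.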

Appendix E. Evaluating Negative Free Energy

For a Bayesian agent $M$ with parameters $\xi$, the posterior $p(\xi | u_s, r_s, a_s, M)$ on the parameters $\xi$ is approximated by a multivariate Gaussian distribution $q(\xi)$ under the Laplace approximation
$$p(\xi | u_s, r_s, a_s, M) \approx q(\xi) = \mathcal{N}(\xi; \mu_\xi, C_\xi),$$
where $C_\xi$ is a covariance matrix. The mean $\mu_\xi$ is determined by maximizing the posterior $p(\xi | u_s, r_s, a_s, M)$
$$\mu_\xi^* = \arg\max_{\xi} p(\xi | u_s, r_s, a_s, M) = \arg\max_{\xi} \frac{p(\xi, u_s, r_s, a_s | M)}{p(u_s, r_s, a_s | M)} = \arg\max_{\xi} p(\xi, u_s, r_s, a_s | M).$$
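The optimal $q(\xi)$ is determined by maximizing the negative free energy $F_\xi$
$$\max_{\xi} \ln p(u_s, r_s, a_s | \xi, M) \approx \max_{q(\xi)} F_\xi = \max_{q(\xi)} \left[ \int q(\xi) \ln p(u_s, r_s, a_s, \xi | M)\, d\xi - \int q(\xi) \ln q(\xi)\, d\xi \right].$$
We use the notation $V(\xi)$ to denote the quantity $\ln p(u_s, r_s, a_s, \xi | M)$ and then use Taylor's theorem to expand $V(\xi)$ at the point $\mu_\xi^*$
$$V(\xi) \approx V(\mu_\xi^*) + \frac{\partial V(\mu_\xi^*)}{\partial \xi} (\xi - \mu_\xi^*) + \frac{1}{2} (\xi - \mu_\xi^*)^T \frac{\partial^2 V(\mu_\xi^*)}{\partial^2 \xi} (\xi - \mu_\xi^*).$$
The first term $\int q(\xi) V(\xi)\, d\xi$ in the negative free energy $F_\xi$ is evaluated by
$$\begin{aligned}
\int q(\xi) V(\xi)\, d\xi &\approx V(\mu_\xi^*) + \frac{\partial V(\mu_\xi^*)}{\partial \xi} E_{q(\xi | \mu_\xi^*, C_\xi)}\left[ \xi - \mu_\xi^* \right] + \frac{1}{2} E_{q(\xi | \mu_\xi^*, C_\xi)}\left[ (\xi - \mu_\xi^*)^T \frac{\partial^2 V(\mu_\xi^*)}{\partial^2 \xi} (\xi - \mu_\xi^*) \right] \\
&= V(\mu_\xi^*) + \frac{1}{2} \operatorname{tr}\left( C_\xi \frac{\partial^2 V(\mu_\xi^*)}{\partial^2 \xi} \right).
\end{aligned}$$
The last term $H_e(\xi) = -\int q(\xi) \ln q(\xi)\, d\xi$ is given by
$$\begin{aligned}
H_e(\xi) &= -\int q(\xi) \ln q(\xi)\, d\xi = -E_{q(\xi | \mu_\xi^*, C_\xi)}\left[ \ln q(\xi | \mu_\xi^*, C_\xi) \right] \\
&= -E_{q(\xi | \mu_\xi^*, C_\xi)}\left[ -\frac{d_\xi}{2} \ln 2\pi - \frac{1}{2} \ln \det(C_\xi) - \frac{1}{2} (\xi - \mu_\xi^*)^T C_\xi^{-1} (\xi - \mu_\xi^*) \right] \\
&= \frac{d_\xi}{2} \ln 2\pi + \frac{1}{2} \ln \det(C_\xi) + \frac{1}{2} \operatorname{tr}(I_{d_\xi}) = \frac{d_\xi}{2} \ln 2\pi e + \frac{1}{2} \ln \det(C_\xi).
\end{aligned}$$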
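Therefore, the negative free energy $F_\xi$ is calculated as
$$F_\xi = E_{q(\xi)}\left[ V(\xi) \right] + H_e(\xi) \approx V(\mu_\xi^*) + \frac{1}{2} \operatorname{tr}\left( C_\xi \frac{\partial^2 V(\mu_\xi^*)}{\partial^2 \xi} \right) + \frac{d_\xi}{2} \ln 2\pi e + \frac{1}{2} \ln \det(C_\xi).$$
$F_\xi$ is a scalar function of the covariance $C_\xi$. The optimal (stationary) point $C_\xi^*$ is where $F_\xi$ reaches its maximum. It is found by setting the partial derivative $\partial F_\xi / \partial C_\xi$ to the zero matrix $O$
$$\frac{\partial F_\xi}{\partial C_\xi} = \frac{1}{2} \frac{\partial^2 V(\mu_\xi^*)}{\partial^2 \xi} + \frac{1}{2} C_\xi^{-1} = O \quad \Longrightarrow \quad C_\xi^* = -\left[ \frac{\partial^2 V(\mu_\xi^*)}{\partial^2 \xi} \right]^{-1}.$$
At the optimal point $C_\xi^*$, the maximal value of $F_\xi$ is
$$F_\xi^* = V(\mu_\xi^*) + \frac{d_\xi}{2} \ln 2\pi e + \frac{1}{2} \ln \det(C_\xi^*) \approx \max_{\xi} \ln p(u_s, r_s, a_s | \xi, M) = \ln p(u_s, r_s, a_s | \mu_\xi^*, M).$$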
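The stationarity condition for the covariance can be checked numerically. The sketch below evaluates the expression for $F_\xi$ as a function of $C_\xi$ (with the trace term kept explicit) under the simplifying assumption that both the Hessian and the covariance are diagonal, so that traces and determinants reduce to sums and products; the values of $V(\mu_\xi^*)$ and of the Hessian diagonal are hypothetical.

```python
import math

def negative_free_energy(v_mode, hess_diag, cov_diag):
    """F = V(mu*) + 0.5 tr(C H) + (d/2) ln(2 pi e) + 0.5 ln det(C), for diagonal H and C."""
    d = len(hess_diag)
    trace_term = 0.5 * sum(c * h for c, h in zip(cov_diag, hess_diag))
    entropy = (0.5 * d * math.log(2.0 * math.pi * math.e)
               + 0.5 * sum(math.log(c) for c in cov_diag))
    return v_mode + trace_term + entropy

v_mode = -42.0                                  # hypothetical V(mu*)
hess_diag = [-4.0, -0.5]                        # hypothetical negative definite Hessian diagonal
cov_star = [-1.0 / h for h in hess_diag]        # C* = -H^{-1}
f_star = negative_free_energy(v_mode, hess_diag, cov_star)

# Rescaling the covariance away from C* should only lower F.
f_perturbed = [negative_free_energy(v_mode, hess_diag, [s * c for c in cov_star])
               for s in (0.5, 0.9, 1.1, 2.0)]
```

Each perturbed covariance yields a smaller value than `f_star`, which is what the zero-derivative condition predicts for a concave function of the covariance.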

Appendix F. Bayesian Model Selection

Grounded in probability theory, Bayesian model selection evaluates different models based on the observed data, favoring the model with the best tradeoff between complexity and flexibility. Given a series of sensory inputs $u_s = \{u(t_1), u(t_2), \ldots, u(t_K)\}$, a series of actions $a_s = \{a(t_1), a(t_2), \ldots, a(t_K)\}$ and a series of random rewards $r_s = \{r(t_1), r(t_2), \ldots, r(t_K)\}$, Bayesian model selection selects the agent $M^*$ that best explains the sensory inputs and actions
$$M^* = \arg\max_{M} p(M | u_s, a_s, r_s).$$
Taking two different agents $M_1$, $M_2$ into account, we can define the Bayes factor ($BF$) as
$$\begin{aligned}
p(M_2 | u_s, a_s, r_s) &= \frac{p(M_2)\, p(u_s, a_s, r_s | M_2)}{p(u_s, a_s, r_s)}, \\
p(M_1 | u_s, a_s, r_s) &= \frac{p(M_1)\, p(u_s, a_s, r_s | M_1)}{p(u_s, a_s, r_s)}, \\
\frac{p(M_1 | u_s, a_s, r_s)}{p(M_2 | u_s, a_s, r_s)} &= BF \cdot \frac{p(M_1)}{p(M_2)}, \\
BF &= \frac{p(u_s, a_s, r_s | M_1)}{p(u_s, a_s, r_s | M_2)},
\end{aligned}$$
where $p(M_i)$ is the prior probability of $M_i$. Here, we make the general assumption that the prior over agents is non-informative. Under this assumption, the prior is equivalent to a uniform distribution, so that $p(M_1) / p(M_2) = 1$. The ratio of the posterior probabilities $p(M_1 | u_s, a_s, r_s) / p(M_2 | u_s, a_s, r_s)$ is then simply given by the Bayes factor.
The Bayesian model selection problem thus reduces to selecting the agent with maximal model evidence $p(u_s, a_s, r_s | M_i)$. In the Bayesian learning framework, the log-model evidence $\ln p(u_s, a_s, r_s | M)$ can be approximated by the optimal negative free energy
$$F_\xi^* \approx \ln p(u_s, r_s, a_s | \mu_\xi^*, M)$$
defined in Equation (A30). By computing the negative free energies $F_{\xi_1}^*$, $F_{\xi_2}^*$ of two different agents, the Bayes factor is given by
$$BF = \frac{p(u_s, a_s, r_s | M_1)}{p(u_s, a_s, r_s | M_2)} = \exp\left( \ln p(u_s, a_s, r_s | M_1) - \ln p(u_s, a_s, r_s | M_2) \right) \approx \exp\left( F_{\xi_1}^* - F_{\xi_2}^* \right).$$
To ease the use of the Bayes factor, Harold Jeffreys gave a scale for its interpretation (Table A1) [66]. If $BF > 1$, the agent $M_1$ is more strongly supported by the observed data, and vice versa (if $BF < 1$, the agent $M_2$ is more strongly supported).
Table A1. Bayes Factors and interpretations.

| Bayesian Factor $BF$ | Interpretations |
| --- | --- |
| $0 < BF < \frac{1}{100}$ | Decisive evidence for $M_2$ |
| $\frac{1}{100} < BF < \frac{1}{10}$ | Strong evidence for $M_2$ |
| $\frac{1}{10} < BF < \frac{1}{3}$ | Moderate evidence for $M_2$ |
| $\frac{1}{3} < BF < 1$ | Weak evidence for $M_2$ |
| $1 < BF < 3$ | Weak evidence for $M_1$ |
| $3 < BF < 10$ | Moderate evidence for $M_1$ |
| $10 < BF < 100$ | Strong evidence for $M_1$ |
| $BF > 100$ | Decisive evidence for $M_1$ |
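The scale in Table A1 can be applied mechanically once the two negative free energies are available. The following is a minimal sketch with hypothetical function names; since the table uses strict inequalities, the handling of the exact boundary values (3, 10, 100 and 1) is an assumption made here for completeness.

```python
import math

def bayes_factor(f1, f2):
    """BF ~ exp(F1* - F2*), using free energies as log-evidence approximations."""
    return math.exp(f1 - f2)

def interpret_bayes_factor(bf):
    """Map a Bayes factor to the verbal grades of Table A1."""
    if bf <= 0.0:
        raise ValueError("Bayes factor must be positive")
    if bf == 1.0:
        return "No preference"                 # not covered by the table (assumption)
    favored = "M1" if bf > 1.0 else "M2"
    strength = bf if bf > 1.0 else 1.0 / bf    # the scale is symmetric around BF = 1
    if strength > 100.0:
        grade = "Decisive"
    elif strength > 10.0:
        grade = "Strong"
    elif strength > 3.0:
        grade = "Moderate"
    else:
        grade = "Weak"
    return f"{grade} evidence for {favored}"

# Hypothetical free energies of two agents.
verdict = interpret_bayes_factor(bayes_factor(-100.0, -103.0))
```

For large free-energy differences the exponential can overflow, so comparing the difference $F_{\xi_1}^* - F_{\xi_2}^*$ against the logarithms of the thresholds is a safer variant.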
According to the Bayesian Information Criterion (BIC) [67], the log-model evidence $\ln p(u_s, a_s, r_s | M_i)$ can be approximated by
$$\ln p(u_s, a_s, r_s | M_i) \approx \ln p(u_s, a_s, r_s | \mu_{\xi_i}^*, M_i) - \frac{d_{\xi_i}}{2} \ln(K), \qquad \ln p(u_s, a_s, r_s | M_i) = F_{\xi_i}^* - \frac{d_{\xi_i}}{2} \ln(K),$$
where $K$ is the number of sensory inputs in $u_s$ and $d_{\xi_i}$ is the number of free parameters estimated by the model. The Bayes factor is therefore modified to
$$BF_{BIC} = \frac{p(u_s, a_s, r_s | M_1)}{p(u_s, a_s, r_s | M_2)} = \exp\left( \ln p(u_s, a_s, r_s | M_1) - \ln p(u_s, a_s, r_s | M_2) \right) \approx \exp\left( F_{\xi_1}^* - F_{\xi_2}^* - \frac{d_{\xi_1} - d_{\xi_2}}{2} \ln(K) \right).$$
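The effect of the complexity penalty can be seen in a short sketch that works with the logarithm of $BF_{BIC}$. The free energies and parameter counts below are hypothetical; the value $K = 255$ corresponds to the 17 blocks of 15 trials described for the gambling task.

```python
import math

def log_bayes_factor_bic(f1, f2, d1, d2, n_inputs):
    """ln BF_BIC = F1* - F2* - 0.5 * (d1 - d2) * ln(K)."""
    return (f1 - f2) - 0.5 * (d1 - d2) * math.log(n_inputs)

# Hypothetical agents: M1 fits slightly better but has ten more free parameters.
log_bf = log_bayes_factor_bic(f1=-150.0, f2=-152.0, d1=12, d2=2, n_inputs=255)
```

With these made-up numbers the uncorrected difference favors $M_1$ by 2 nats, whereas the penalized value is negative and therefore favors the simpler agent $M_2$.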

Appendix G. Rescorla–Wagner Model

The Rescorla–Wagner (RW) model is a basic model in reinforcement learning (RL) and cognitive neuroscience [26,51]. As a baseline model for comparison, we construct a two-dimensional RW model to capture the dynamic expectation $\mu_0 = E[x_0]$ of the two-armed bandit in the above gambling task
$$\mu_0^{(i)}(t_k) = \mu_0^{(i)}(t_{k-1}) + \alpha\, \Delta\mu_0^{(i)}(t_k), \qquad \Delta\mu_0^{(i)}(t_k) = u^{(i)}(t_k) - \mu_0^{(i)}(t_{k-1}), \qquad \forall t_k, \; \mu_0^{(i)}(t_k) \in [0, 1], \; i = 1, 2,$$
where $\alpha \in (0, 1)$ is the learning rate. To yield a prediction $\hat{\mu}_0(t_k)$ of $x_0(t_k)$ before receiving the actual sensory input $u(t_k)$ at time $t_k$, the RW model uses its most recent state, i.e., $\mu_0(t_{k-1})$, as the prediction
$$\hat{\mu}_0(t_k) := \mu_0(t_{k-1}).$$
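The RW recursion can be sketched in a few lines. The sketch assumes the standard delta rule, in which the prediction error is the difference between the current input and the previous estimate; the function names and the example inputs are hypothetical.

```python
def rw_update(mu_prev, u, alpha):
    """One Rescorla-Wagner step per dimension: mu <- mu + alpha * (u - mu)."""
    return [m + alpha * (x - m) for m, x in zip(mu_prev, u)]

def rw_run(inputs, alpha, mu_init):
    """Return the prediction issued before each input and the final state."""
    mu = list(mu_init)
    predictions = []
    for u in inputs:
        predictions.append(list(mu))           # prediction = most recent state
        mu = rw_update(mu, u, alpha)           # update after observing u
    return predictions, mu

# Hypothetical binary outcomes of the two arms on three trials.
inputs = [[1, 0], [1, 1], [0, 1]]
predictions, final_state = rw_run(inputs, alpha=0.5, mu_init=[0.5, 0.5])
```

Because each update is a convex combination of the previous estimate and a binary input, the estimates remain in $[0, 1]$ whenever the initial state does.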
To produce an action based on the predicted state $\hat{\mu}_0 = [\hat{\mu}_0^{(1)}, \hat{\mu}_0^{(2)}]^T$, the RW model needs a response model to work with. We use the same response model based on Bayesian decision theory (Section 4) for fair comparison.
To perform the variational Bayesian learning scheme (cf. Appendix D), we assume that all parameters of the RW model are random variables. Following a similar treatment as in Appendix C, the initial prior state $\mu_0(0) = [\mu_0^{(1)}(0), \mu_0^{(2)}(0)]^T$ is represented in logit-space by $\mu_0^{G}(0)$
$$\mu_0(0) = s\left(\mu_0^{G}(0), 1\right),$$
where $\mu_0^{G}(0)$ follows a two-dimensional Gaussian distribution with mean $\mu_{\mu_0^{G}(0)}$ and covariance $C_{\mu_0^{G}(0)}$
$$p\left(\mu_0^{G}(0)\right) = \mathcal{N}\left(\mu_0^{G}(0); \mu_{\mu_0^{G}(0)}, C_{\mu_0^{G}(0)}\right).$$
Since the learning rate $\alpha$ is a value between 0 and 1, it is represented by a random variable $\alpha^{G}$ in logit-space. We further assume that $\alpha^{G}$ is a Gaussian random variable with mean $\mu_{\alpha^{G}}$ and variance $C_{\alpha^{G}}$
$$\alpha = s\left(\alpha^{G}, 1\right),$$
$$p\left(\alpha^{G}\right) = \mathcal{N}\left(\alpha^{G}; \mu_{\alpha^{G}}, C_{\alpha^{G}}\right).$$
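The logit-space representation can be sketched as follows. The sketch assumes that $s(x, \zeta)$ is the logistic sigmoid $1 / (1 + \exp(-\zeta x))$, which is not restated in this appendix; the helper `logit` and the sample size are hypothetical, while the prior mean 0 and variance 0.01 are the values listed for $\alpha^{G}$ in Table A2.

```python
import math
import random

def s(x, zeta=1.0):
    """Assumed logistic sigmoid with slope zeta: 1 / (1 + exp(-zeta * x))."""
    return 1.0 / (1.0 + math.exp(-zeta * x))

def logit(p):
    """Inverse of s(., 1): maps a value in (0, 1) back to logit-space."""
    return math.log(p / (1.0 - p))

rng = random.Random(1)
mu_alpha_g, var_alpha_g = 0.0, 0.01            # prior on alpha^G (Table A2)
alpha_samples = [s(rng.gauss(mu_alpha_g, math.sqrt(var_alpha_g)))
                 for _ in range(1000)]
mean_alpha = sum(alpha_samples) / len(alpha_samples)
```

Under this assumption, a prior mean of 0 in logit-space corresponds to a learning rate of 0.5, and every sample of $\alpha$ lies strictly between 0 and 1.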
In this paper, all parameter configurations for the RW model are listed in Table A2.
Table A2. Parameters of the Rescorla–Wagner model. Parameters labeled ‘Free’ are optimized by the inversion of the model. Fixed parameters are constant and not optimized. The notation $\mathbf{0}$ is a zero vector. Given all initial priors, we search for the optimal priors on the free parameters $\mu_\xi$ according to the free energy principle (Equations (A19) and (A21)).

| Name | Description | Initial Value | Fixed or Free |
| --- | --- | --- | --- |
| Parameters of Rescorla–Wagner model | | | |
| $d_u$ | Dimension of $u$ | 2 | constant |
| $d_0$ | Dimension of $\mu_0$ | 2 | constant |
| $\mu_0(t_0)$ | Prior initial state | | Fixed |
| $\mu_{\mu_0^{G}(t_0)}$ | Mean of $\mu_0^{G}(t_0)$ | $\mathbf{0}$ | |
| $C_{\mu_0^{G}(t_0)}$ | Covariance of $\mu_0^{G}(t_0)$ | $I_{d_0}$ | |
| $\alpha$ | Learning rate $\alpha$ | | Free |
| $\mu_{\alpha^{G}}$ | Mean of $\alpha^{G}$ | 0 | |
| $C_{\alpha^{G}}$ | Covariance of $\alpha^{G}$ | 0.01 | |

References

1. Cisek, P.; Puskas, G.A.; El-Murr, S. Decisions in changing conditions: The urgency-gating model. J. Neurosci. 2009, 29, 11560–11571.
2. Weiss, A.; Chambon, V.; Lee, J.K.; Drugowitsch, J.; Wyart, V. Interacting with volatile environments stabilizes hidden-state inference and its brain signatures. Nat. Commun. 2021, 12, 2228.
3. Vargas, D.V.; Lauwereyns, J. Setting the space for deliberation in decision-making. Cogn. Neurodyn. 2021, 15, 743–755.
4. Knill, D.C.; Richards, W. Perception as Bayesian Inference; Cambridge University Press: Cambridge, UK, 1996.
5. Ernst, M.O.; Banks, M.S. Humans integrate visual and haptic information in a statistically optimal fashion. Nature 2002, 415, 429–433.
6. Weilnhammer, V.A.; Stuke, H.; Sterzer, P.; Schmack, K. The neural correlates of hierarchical predictions for perceptual decisions. J. Neurosci. 2018, 38, 5008–5021.
7. Zhang, W.; Wu, S.; Doiron, B.; Lee, T.S. A Normative Theory for Causal Inference and Bayes Factor Computation in Neural Circuits. In Advances in Neural Information Processing Systems; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: New York, NY, USA, 2019; Volume 32.
8. Friston, K.; FitzGerald, T.; Rigoli, F.; Schwartenbeck, P.; Pezzulo, G. Active inference and learning. Neurosci. Biobehav. Rev. 2016, 68, 862–879.
9. Shikauchi, Y.; Miyakoshi, M.; Makeig, S.; Iversen, J.R. Bayesian models of human navigation behaviour in an augmented reality audiomaze. Eur. J. Neurosci. 2021, 54, 8308–8317.
10. Zhang, J.; Gu, Y.; Chen, A.; Yu, Y. Unveiling Dynamic System Strategies for Multisensory Processing: From Neuronal Fixed-Criterion Integration to Population Bayesian Inference. Research 2022, 2022, 9787040.
11. Zhou, L.; Gu, Y. Cortical Mechanisms of Multisensory Linear Self-motion Perception. Neurosci. Bull. 2022, 1–13.
12. Chikkerur, S.; Serre, T.; Tan, C.; Poggio, T. Attention as a Bayesian inference process. In Human Vision and Electronic Imaging XVI; Rogowitz, B.E., Pappas, T.N., Eds.; Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series; SPIE Press: California, USA, 2011; Volume 7865, p. 786511.
13. Vossel, S.; Mathys, C.; Stephan, K.E.; Friston, K.J. Cortical coupling reflects Bayesian belief updating in the deployment of spatial attention. J. Neurosci. 2015, 35, 11532–11542.
14. Lawson, R.P.; Mathys, C.; Rees, G. Adults with autism overestimate the volatility of the sensory environment. Nat. Neurosci. 2017, 20, 1293–1299.
15. Friston, K. The free-energy principle: A unified brain theory? Nat. Rev. Neurosci. 2010, 11, 127–138.
16. Friston, K. A theory of cortical responses. Philos. Trans. R. Soc. B Biol. Sci. 2005, 360, 815–836.
17. Stefanics, G.; Heinzle, J.; Horváth, A.A.; Stephan, K.E. Visual mismatch and predictive coding: A computational single-trial ERP study. J. Neurosci. 2018, 38, 4020–4030.
18. Wang, B.A.; Schlaffke, L.; Pleger, B. Modulations of insular projections by prior belief mediate the precision of prediction error during tactile learning. J. Neurosci. 2020, 40, 3827–3837.
19. Sun, Y.; Gomez, F.; Schmidhuber, J. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments. In Artificial General Intelligence; Schmidhuber, J., Thórisson, K.R., Looks, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 41–51.
20. Daunizeau, J.; den Ouden, H.E.M.; Pessiglione, M.; Kiebel, S.J.; Stephan, K.E.; Friston, K.J. Observing the Observer (I): Meta-Bayesian Models of Learning and Decision-Making. PLoS ONE 2010, 5, e15554.
21. Daunizeau, J.; Den Ouden, H.E.; Pessiglione, M.; Kiebel, S.J.; Friston, K.J.; Stephan, K.E. Observing the observer (II): Deciding when to decide. PLoS ONE 2010, 5, e15555.
22. Beal, M.J. Variational Algorithms for Approximate Bayesian Inference. Ph.D. Thesis, University College London (UCL), London, UK, 2003.
23. Mathys, C.D.; Daunizeau, J.; Friston, K.J.; Stephan, K.E. A Bayesian Foundation for Individual Learning Under Uncertainty. Front. Hum. Neurosci. 2011, 5, 39.
24. Vossel, S.; Mathys, C.; Daunizeau, J.; Bauer, M.; Driver, J.; Friston, K.; Stephan, K. Spatial Attention, Precision, and Bayesian Inference: A Study of Saccadic Response Speed. Cereb. Cortex 2013, 24, 1436–1450.
25. Diaconescu, A.O.; Mathys, C.; Weber, L.A.; Kasper, L.; Mauer, J.; Stephan, K.E. Hierarchical prediction errors in midbrain and septum during social learning. Soc. Cogn. Affect. Neurosci. 2017, 12, 618–634.
26. Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction; MIT Press: Cambridge, MA, USA, 2018.
27. Si, B.; Herrmann, J.M.; Pawelzik, K. Gain-based Exploration: From Multi-armed Bandits to Partially Observable Environments. In Proceedings of the International Conference on Natural Computation, Haikou, China, 24–27 August 2007; pp. 177–182.
28. Atan, O.; Tekin, C.; van der Schaar, M. Global bandits. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5798–5811.
29. Xu, X.; Xie, H.; Lui, J.C.S. Generalized Contextual Bandits with Latent Features: Algorithms and Applications. IEEE Trans. Neural Netw. Learn. Syst. 2021, 1–13.
30. Behrens, T.E.J.; Woolrich, M.W.; Walton, M.E.; Rushworth, M.F.S. Learning the value of information in an uncertain world. Nat. Neurosci. 2007, 10, 1214–1221.
31. Walton, M.E.; Behrens, T.E.; Buckley, M.J.; Rudebeck, P.H.; Rushworth, M.F. Separable learning systems in the macaque brain and the role of orbitofrontal cortex in contingent learning. Neuron 2010, 65, 927–939.
32. Costa, V.D.; Mitz, A.R.; Averbeck, B.B. Subcortical substrates of explore-exploit decisions in primates. Neuron 2019, 103, 533–545.
33. Hampton, A.N.; Bossaerts, P.; O’Doherty, J.P. Neural correlates of mentalizing-related computations during strategic interactions in humans. Proc. Natl. Acad. Sci. USA 2008, 105, 6741–6746.
34. Heuer, L.; Orland, A. Cooperation in the Prisoner’s Dilemma: An experimental comparison between pure and mixed strategies. R. Soc. Open Sci. 2019, 6, 182142.
35. Hill, C.A.; Suzuki, S.; Polania, R.; Moisa, M.; O’Doherty, J.P.; Ruff, C.C. A causal account of the brain network computations underlying strategic social behavior. Nat. Neurosci. 2017, 20, 1142–1149.
36. Bolis, D.; Balsters, J.; Wenderoth, N.; Becchio, C.; Schilbach, L. Beyond autism: Introducing the dialectical misattunement hypothesis and a Bayesian account of intersubjectivity. Psychopathology 2017, 50, 355–372.
37. Konishi, T.; Kubo, T.; Watanabe, K.; Ikeda, K. Variational Bayesian Inference Algorithms for Infinite Relational Model of Network Data. IEEE Trans. Neural Netw. Learn. Syst. 2015, 26, 2176–2181.
38. Chien, J.T.; Ku, Y.C. Bayesian Recurrent Neural Network for Language Modeling. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 361–374.
39. Qi, Y.; Liu, B.; Wang, Y.; Pan, G. Dynamic ensemble modeling approach to nonstationary neural decoding in Brain-computer interfaces. In Proceedings of the Advances in Neural Information Processing Systems 32 (NeurIPS 2019), Vancouver, BC, Canada, 8–14 December 2019.
40. Li, H.; Barnaghi, P.; Enshaeifar, S.; Ganz, F. Continual Learning Using Bayesian Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 4243–4252.
41. Wang, H.; Yeung, D.Y. Towards Bayesian deep learning: A framework and some existing methods. IEEE Trans. Knowl. Data Eng. 2016, 28, 3395–3408.
42. Du, C.; Zhu, J.; Zhang, B. Learning Deep Generative Models With Doubly Stochastic Gradient MCMC. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3084–3096.
43. Mirza, M.B.; Adams, R.A.; Mathys, C.; Friston, K.J. Human visual exploration reduces uncertainty about the sensed world. PLoS ONE 2018, 13, e0190429.
44. Adolphs, R. Cognitive neuroscience of human social behaviour. Nat. Rev. Neurosci. 2003, 4, 165–178.
45. Pezzulo, G.; Friston, K.J. The value of uncertainty: An active inference perspective. Behav. Brain Sci. 2019, 42, e47.
46. Zhu, C.; Zhou, K.; Han, Z.; Tang, Y.; Tang, F.; Si, B. General hierarchical Brownian filter in multi-dimensional volatile environments. 2022; submitted.
47. Mathys, C.D.; Lomakina, E.I.; Daunizeau, J.; Iglesias, S.; Brodersen, K.H.; Friston, K.J.; Stephan, K.E. Uncertainty in perception and the Hierarchical Gaussian Filter. Front. Hum. Neurosci. 2014, 8, 825.
48. Al-Nowaihi, A.; Dhami, S. Probability Weighting Functions; University of Leicester: Leicester, UK, 2010.
49. Nocedal, J.; Wright, S.J. Numerical Optimization; Springer: New York, NY, USA, 2006.
50. Ando, T. Bayesian Model Selection and Statistical Modeling; CRC Press: Cleveland, OH, USA, 2010.
51. Zhang, L.; Gläscher, J. A brain network supporting social influences in human decision-making. Sci. Adv. 2020, 6, eabb4159.
52. Berger, J.O. Statistical Decision Theory and Bayesian Analysis; Springer Inc.: New York, NY, USA, 2013.
53. Zeng, T.; Si, B. A brain-inspired compact cognitive mapping system. Cogn. Neurodyn. 2021, 15, 91–101.
54. Chen, S.; Tang, J.; Zhu, L.; Kong, W. A multi-stage dynamical fusion network for multimodal emotion recognition. Cogn. Neurodyn. 2022, 1–10.
55. Walkenbach, J.; Haddad, N.F. The Rescorla-Wagner theory of conditioning: A review of the literature. Psychol. Rec. 1980, 30, 497–509.
56. Zhang, L.; Lengersdorff, L.; Mikus, N.; Gläscher, J.; Lamm, C. Using reinforcement learning models in social neuroscience: Frameworks, pitfalls and suggestions of best practices. Soc. Cogn. Affect. Neurosci. 2020, 15, 695–707.
57. Zheng, N.; Ding, J.; Chai, T. DMGAN: Adversarial Learning-Based Decision Making for Human-Level Plant-Wide Operation of Process Industries Under Uncertainties. IEEE Trans. Neural Netw. Learn. Syst. 2021, 32, 985–998.
58. Chen, X.; Yang, T. A neural network model of basal ganglia’s decision-making circuitry. Cogn. Neurodyn. 2021, 15, 17–26.
59. Mao, D. Neural Correlates of Spatial Navigation in Primate Hippocampus. Neurosci. Bull. 2022, 1–13.
60. Zheng, L.; Liu, W.; Long, Y.; Zhai, Y.; Zhao, H.; Bai, X.; Zhou, S.; Li, K.; Zhang, H.; Liu, L.; et al. Affiliative bonding between teachers and students through interpersonal synchronisation in brain activity. Soc. Cogn. Affect. Neurosci. 2020, 15, 97–109.
61. Wang, Y.; Yang, X.; Tang, Z.; Xiao, S.; Hewig, J. Hierarchical neural prediction of interpersonal trust. Neurosci. Bull. 2021, 37, 511–522.
62. Wang, W.; Fu, C.; Kong, X.; Osinsky, R.; Hewig, J.; Wang, Y. Neuro-behavioral dynamic prediction of interpersonal cooperation and aggression. Neurosci. Bull. 2022, 38, 275–289.
63. Dong, W.; Chen, H.; Sit, T.; Han, Y.; Song, F.; Vyssotski, A.L.; Gross, C.T.; Si, B.; Zhan, Y. Characterization of exploratory patterns and hippocampal–prefrontal network oscillations during the emergence of free exploration. Sci. Bull. 2021, 66, 2238–2250.
64. Friston, K.; FitzGerald, T.; Rigoli, F.; Schwartenbeck, P.; Pezzulo, G. Active Inference: A Process Theory. Neural Comput. 2017, 29, 1–49.
65. Friston, K.; Mattout, J.; Trujillo-Barreto, N.; Ashburner, J.; Penny, W. Variational free energy and the Laplace approximation. NeuroImage 2007, 34, 220–234.
66. Jeffreys, H. Theory of Probability; Clarendon Press: Oxford, UK, 1961.
67. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
Figure 1. Overview of the hierarchical perceptual model.
Figure 2. A gambling task. Cautious gamblers participated in a simple decision-making task in a volatile environment. Each trial consisted of four phases. (1) Cue: two options and their rewards were presented. (2) Decision: once the gambler had made a choice, the chosen option was enlarged and highlighted. (3) Outcome: once the two arms (denoted by letters A and B) had randomly generated their states, the outcome of the choice was displayed, and the score was incremented only if the choice was correct. (4) Fixation: this phase was the interval between trials, during which the screen presented only the score until the beginning of the next trial.
Figure 3. Expected states of the two-armed bandit. (a) Expected states of arm A. The mean of the state of arm A changes over time in a block fashion (black line). (b) Expected states of arm B. The mean of the state of arm B evolves over time (black line), showing variable correlations with that of arm A. By manipulating the expectations of the states of the two arms, we constructed a volatile environment. There were 17 blocks in the experiment. Each block consists of 15 trials.
Figure 4. A Bayesian agent consists of the proposed hierarchical perceptual model and a binary response model based on Bayesian decision theory. The reward $r(t) = [r_0(t), r_1(t)]^T$ is drawn uniformly from a set on each trial.
Figure 5. Temporal dynamics of the tendency $\mu_1$ of the natural parameter at the first level. (a) The evolution of $\mu_1^{(1)}$, the first component of $\mu_1$, is shown in red. The time-varying trajectory of the prediction error $PE_1^{(1)}$ is shown in blue. (b) The evolution of $\mu_1^{(2)}$, the second component of $\mu_1$, is shown in red. The time-varying trajectory of the prediction error $PE_1^{(2)}$ is shown in blue. Light-red shaded area represents the uncertainty of each quantity (i.e., $\mu_1^{(i)}(t) \pm C_1^{(i,i)}(t)$, $i \in \{1, 2\}$). The red markers represent the priors on the standard deviation and mean of each quantity.
Figure 6. Temporal dynamics of the posterior states in a gambling task. (a) Rewards for the two choices “Different” and “Same” were randomly generated by a discrete uniform distribution $U(1, 4)$. Blue dots represent the reward value for option “Same” on each trial, and red dots for option “Different”. (b) The green dots are the sensory inputs of $u^{(1)}$ (i.e., states of arm A). The red line represents the estimated probability $\hat{\mu}_0^{(1)}(t_k) = s(\mu_1^{(1)}(t_{k-1}), \zeta_1^{(1)})$. (c) The green dots are the sensory inputs of $u^{(2)}$ (i.e., states of arm B). The red line represents the estimated probability $\hat{\mu}_0^{(2)}(t_k) = s(\mu_1^{(2)}(t_{k-1}), \zeta_1^{(2)})$. (d) Prediction correlation $\hat{\rho}_1(t)$ is extracted from the inverse prediction precision $\hat{\Pi}_1(t)$ generated by the second (log-volatility) level. (e) Blue dots denote the optimal choice $a_{ideal}$ on each trial. The red line is the trajectory of the expected probability that the states of the two arms of the bandit are the same (i.e., $P(a = 1)$). The orange dots are the response action $a$ generated by the agent on each trial. (f) The green dashed line is the cumulative reward of the ideal observer taking the ideal actions $a_{ideal}$. The red line shows the cumulative reward obtained by the Bayesian agent.
Figure 7. Temporal dynamics of the expectation of the logarithm of volatility $\mu_2$ in the natural parameter $x_1$ at the second level. Each panel shows the evolution of one element of $\mu_2$ in red and the corresponding element of $PE_2$ in blue. Light-red shaded area represents the uncertainty of each quantity (i.e., $\mu_2^{(i)}(t) \pm C_2^{(i,i)}(t)$, $i \in \{1, 2, 3\}$). The red markers represent the priors of the standard deviation and mean of each quantity.
Figure 8. Temporal dynamics of the expectation of response action P ( a ( t ) = 1 ) . (a) The expectation of response action P ( a ( t ) = 1 ) generated by our Bayesian agent M 1 (red solid line) match closely to the probabilistically optimal expectation of response action P * ( a ( t ) = 1 ) (blue dashed line). (b) The cumulative reward obtained by our Bayesian agent (red solid line) tightly follows the cumulative reward of the probabilistically optimal expectation of response action P * ( a ( t ) = 1 ) given by the informed agent (blue dashed line). The ideal observer, who knows the actual outcomes of the bandit in advance, has the highest cumulative reward of the task.
Figure 9. The mean inference trajectory $\bar{T}_i$ ($i = 1, 2$) of the predicted states of the two-armed bandit $x_0$. (a) The evolution of mean inference trajectories corresponding to $\mu_0^{(1)}$ over the dataset $D$. (b) The evolution of mean inference trajectories corresponding to $\mu_0^{(2)}$ over the dataset $D$. In both panels, the ground truth is shown by black lines. The mean inference trajectories given by the Bayesian agent and the RW agent are in red and blue, respectively. The shaded areas correspond to the standard deviations $\pm\sigma_{T_i}$.
Figure 10. The mean $\bar{R}_i$ and standard deviation $\sigma_{R_i}$ of the regrets on the synthetic dataset $D$.
Figure 11. Histogram of the Bayes factor. (a) Bayes factor without the Bayesian Information Criterion, $BF$. (b) Bayes factor with the Bayesian Information Criterion, $BF_{BIC}$.
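For readers unfamiliar with the quantity in panel (b), the sketch below shows the textbook BIC approximation to a Bayes factor between two models. It is an illustration under stated assumptions, not necessarily the exact definition of $BF_{BIC}$ used for this figure: the function names `bic` and `bayes_factor_bic` are hypothetical, and the inputs (log-likelihood, number of free parameters, number of observations) are whatever a model fit provides.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Standard BIC: k * ln(n) - 2 * ln(L). Lower values are better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def bayes_factor_bic(bic_1, bic_2):
    """Textbook approximation BF_12 ~ exp((BIC_2 - BIC_1) / 2).

    Values above 1 favor model 1, values below 1 favor model 2.
    Assumption: this generic form may differ from the paper's BF_BIC.
    """
    return math.exp((bic_2 - bic_1) / 2.0)
```

A histogram such as the one in panel (b) would then collect one such value per simulated or fitted dataset.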
Table 1. Reward table.
| $x_0$ | $a = 0$ | $a = 1$ |
|---|---|---|
| (0,0) | 0 | $r_1$ |
| (1,1) | 0 | $r_1$ |
| (1,0) | $r_0$ | 0 |
| (0,1) | $r_0$ | 0 |
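The payoff rule in Table 1 can be read as a small function: action $a = 1$ pays $r_1$ when the two arms are in the same state, and action $a = 0$ pays $r_0$ when they differ. The following is a minimal sketch of that reading; the function names `reward` and `ideal_action` are hypothetical and introduced only for illustration.

```python
def reward(x0, a, r0, r1):
    """Payoff from Table 1 for bandit state x0 = (arm A, arm B) and action a."""
    same = x0[0] == x0[1]
    if a == 1:
        # Betting that the two arms are in the same state.
        return r1 if same else 0
    # a == 0: betting that the two arms are in different states.
    return 0 if same else r0

def ideal_action(x0):
    """Action that earns the non-zero payoff when the true state x0 is known."""
    return 1 if x0[0] == x0[1] else 0
```

An observer who knows $x_0$ in advance would always pick `ideal_action(x0)` and so collect the non-zero entry of each row.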
Table 2. Parameters of our hierarchical Bayesian model. Parameters labeled ‘Free’ are optimized by the inversion of the model. Fixed parameters are constant and not optimized. The notation $\mathbf{1}$ is a constant column vector with all components equal to 1. The notation $\mathbf{0}$ is a zero vector. The matrix $O_d$ is a $d \times d$ constant matrix in which all elements are 0. The notation $\operatorname{logit}(\cdot)$ denotes the logit function $\operatorname{logit}(x) = \ln\left(\frac{x}{1 - x}\right)$. Given all initial priors, we search for the optimal priors on all optimized parameters $\mu_{\xi}$ according to the free energy principle (Equations (A19) and (A21)).
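| Name | Description | Initial Value | Fixed or Free |
|---|---|---|---|
| Parameters of our Bayesian perceptual model | | | |
| $d_0 = d_u$ | Dimension of sensory input $u$ | 2 | constant |
| $d_1$ | Dimension of $x_1$ | 2 | constant |
| $d_2$ | Dimension of $x_2$ | 3 | constant |
| $\epsilon(t_k)$ | Sampling interval $\epsilon(t_k)$ | 1 | constant |
| $\alpha_{\lambda_{top}}$ | Upper bound on $\lambda_{top}$ | $0.1 \cdot \mathbf{1}$ | constant |
| $\lambda_{top}$ | Volatility of $x_2$ | | Free |
| $\mu_{\lambda_{top}^{G}}$ | Mean of $\lambda_{top}^{G}$ | $\operatorname{logit}(0.1) \cdot \mathbf{1}$ | |
| $C_{\lambda_{top}^{G}}$ | Covariance of $\lambda_{top}^{G}$ | 1 × 10 2 $I_{d_2}$ | |
| $\alpha_{w_2}$ | Upper bound on $w_2$ | $1 \cdot \mathbf{1}$ | constant |
| $w_2$ | Coupling strength | | Free |
| $\mu_{w_2^{G}}$ | Mean of $w_2^{G}$ | $\operatorname{logit}(0.25) \cdot \mathbf{1}$ | |
| $C_{w_2^{G}}$ | Covariance of $w_2^{G}$ | 1 × 10 2 · $I_{d_2}$ | |
| $b_2$ | Coupling bias | $\mathbf{0}$ | Fixed |
| $\mu_{b_2}$ | Mean of $b_2$ | $\mathbf{0}$ | |
| $C_{b_2}$ | Covariance of $b_2$ | $O_3$ | |
| $\mu_2(t_0)$ | Prior mean of $x_2$ | | Free |
| $\mu_{\mu_2(t_0)}$ | Mean of $\mu_2(t_0)$ | $\ln(0.16) \cdot \mathbf{1}$ | |
| $C_{\mu_2(t_0)}$ | Covariance of $\mu_2(t_0)$ | 1 × 10 2 · $I_3$ | |
| $C_2(t_0)$ | Prior covariance of $x_2$ | | Free |
| $\mu_{c_2^{G}}$ | Mean of $c_2^{G}$ | $\ln(1)$ | |
| $C_{c_2^{G}}$ | Covariance of $c_2^{G}$ | $I_{d_2}$ | |
| $\mu_1(t_0)$ | Prior mean of $x_1$ | | Free |
| $\mu_{\mu_1(t_0)}$ | Mean of $\mu_1(t_0)$ | $\mathbf{0}$ | |
| $C_{\mu_1(t_0)}$ | Covariance of $\mu_1(t_0)$ | $O_{d_2}$ | |
| $C_1(t_0)$ | Prior covariance of $x_1$ | | Free |
| $\mu_{\sigma_1^{G}}$ | Mean of $\sigma_1^{G}$ | $\ln(0.16) \cdot \mathbf{1}$ | |
| $C_{\sigma_1^{G}}$ | Covariance of $\sigma_1^{G}$ | $0.1\, I_{d_1}$ | |
| $\zeta_1$ | Coefficient | | Fixed |
| $\mu_{\zeta_1^{G}}$ | Mean of $\zeta_1^{G}$ | $\mathbf{0}$ | |
| $C_{\zeta_1^{G}}$ | Covariance of $\zeta_1^{G}$ | $O_2$ | |
| Parameters of our response model | | | |
| $d_a$ | Dimension of $a$ | 1 | Fixed |
| $\zeta_a$ | Coefficient | | Fixed |
| $\mu_{\zeta_a^{G}}$ | Mean of $\zeta_a^{G}$ | $\ln(2)$ | |
| $C_{\zeta_a^{G}}$ | Covariance of $\zeta_a^{G}$ | 0 | |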
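Several prior means in Table 2 are given in logit space next to an upper bound $\alpha$. The sketch below illustrates one common way such a pairing is used: a Gaussian-space value is passed through a sigmoid and scaled by the upper bound to give a parameter confined to $(0, \alpha)$. This mapping is an assumption made here for illustration, since the transform itself is not stated in the table, and the helper names `sigmoid` and `to_native` are hypothetical.

```python
import math

def logit(x):
    """logit(x) = ln(x / (1 - x)) for 0 < x < 1, as defined in the Table 2 caption."""
    return math.log(x / (1.0 - x))

def sigmoid(y):
    """Inverse of logit: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def to_native(y_gauss, upper):
    """Assumed mapping from Gaussian space to a parameter bounded in (0, upper)."""
    return upper * sigmoid(y_gauss)

# Under this assumption, the prior mean logit(0.25) with upper bound 1
# corresponds to a coupling strength of about 0.25.
w2_prior = to_native(logit(0.25), 1.0)
```

Working in the unbounded Gaussian space lets a Gaussian prior be placed on a parameter while the native value stays inside its allowed range.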
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Zhu, C.; Zhou, K.; Tang, F.; Tang, Y.; Li, X.; Si, B. A Hierarchical Bayesian Model for Inferring and Decision Making in Multi-Dimensional Volatile Binary Environments. Mathematics 2022, 10, 4775. https://doi.org/10.3390/math10244775

AMA Style

Zhu C, Zhou K, Tang F, Tang Y, Li X, Si B. A Hierarchical Bayesian Model for Inferring and Decision Making in Multi-Dimensional Volatile Binary Environments. Mathematics. 2022; 10(24):4775. https://doi.org/10.3390/math10244775

Chicago/Turabian Style

Zhu, Changbo, Ke Zhou, Fengzhen Tang, Yandong Tang, Xiaoli Li, and Bailu Si. 2022. "A Hierarchical Bayesian Model for Inferring and Decision Making in Multi-Dimensional Volatile Binary Environments" Mathematics 10, no. 24: 4775. https://doi.org/10.3390/math10244775

APA Style

Zhu, C., Zhou, K., Tang, F., Tang, Y., Li, X., & Si, B. (2022). A Hierarchical Bayesian Model for Inferring and Decision Making in Multi-Dimensional Volatile Binary Environments. Mathematics, 10(24), 4775. https://doi.org/10.3390/math10244775

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
