Article

Siamese Reconstruction Network: Accurate Image Reconstruction from Human Brain Activity by Learning to Compare

PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2019, 9(22), 4749; https://doi.org/10.3390/app9224749
Submission received: 10 September 2019 / Revised: 31 October 2019 / Accepted: 2 November 2019 / Published: 7 November 2019


Featured Application

This study can be applied in brain decoding, to reconstruct the visual information perceived by the brain.

Abstract

Decoding human brain activity, especially reconstructing visual stimuli via functional magnetic resonance imaging (fMRI), has gained increasing attention in recent years. However, the high dimensionality and small quantity of fMRI data impose restrictions on satisfactory reconstruction, especially for deep learning methods that require large amounts of labelled samples. Unlike these methods, humans can recognize a new image because the human visual system is naturally capable of extracting features from any object and comparing them. Inspired by this visual mechanism, we introduced the mechanism of comparison into deep learning to achieve better visual reconstruction, making full use of each sample and of the relationships between sample pairs by learning to compare. On this basis, we propose a Siamese reconstruction network (SRN). The SRN improved the reconstruction results on two fMRI recording datasets, providing 72.5% accuracy on the digit dataset and 44.6% accuracy on the character dataset. Essentially, this approach increases the training data from about n samples to 2n sample pairs, taking full advantage of the limited number of training samples. The SRN learns to converge sample pairs of the same class and to disperse sample pairs of different classes in feature space.

1. Introduction

Scientists have always been fascinated by how the human brain works and have carried out many studies to understand it [1]. An influential predictive coding hypothesis states that brain activity can be decoded to reconstruct and classify what a person has seen [2,3,4]. Human brain decoding [5,6,7] plays an important role in brain-machine interfaces, for example by helping disabled people to communicate and move, and it promotes our understanding of brain mechanisms. Since its development, functional magnetic resonance imaging (fMRI) has become an effective tool for studying brain activity, for example in brain decoding. Brain decoding can be divided into three tasks: classification, identification, and reconstruction [8]. Some progress has been made in this area, but most previous research has focused on predicting the category of an image stimulus [9,10] or identifying the image stimulus from a candidate set [11]. One important and difficult problem in neuroscience is the reconstruction of image stimuli, that is, reading out, or decoding, mental content from brain activity. However, two problems restrict the accuracy of visual reconstruction: the confounding fMRI measurement noise, and the high dimensionality and limited quantity of fMRI data. These two problems lead to unsatisfactory reconstruction results and even wrong categories. The first problem is caused by physics and the acquisition equipment, and can only be solved by a significant breakthrough in hardware. Therefore, we are committed to solving the second problem.
In brain decoding, few studies have reported on perceptual image reconstruction, and current reconstruction methods mainly focus on methodological innovation rather than on the use of the limited instances. Thirion et al. [12] reconstructed simple images with an explicit inverse reconstruction approach for the same subject. Miyawaki et al. [13] achieved the reconstruction of simple visual images by combining local image bases of multiple scales. van Gerven et al. [14] reconstructed handwritten digits 6 and 9 from fMRI data by employing a hierarchical generative model to learn a hierarchy of features [15]. Schoenmakers et al. [16] reconstructed the handwritten letters of “BRAINS” from fMRI data with a straightforward linear Gaussian approach relying on the inversion of properly regularized encoding models. Yargholi et al. [17] reconstructed handwritten digits 6 and 9 using Bayesian networks. Fujiwara et al. [18] proposed a model based on a probabilistic extension of the CCA model [19] to establish mappings between the stimulus and the brain. Wang et al. [20] achieved good results on many tasks with deep canonically correlated autoencoders (DCCAE) for correlation-based representation learning. Recent studies have shown that CNNs can be applied to visual reconstruction, because the features of some CNN layers are strongly correlated with the activity of the visual cortex [21,22,23]. Wen et al. [1] proposed encoding and decoding models based on multivariate linear regression and a deconvolutional neural network [24] to describe the bi-directional relationships between a CNN and the brain and to reconstruct dynamic video frame by frame. Du et al. [25] improved the reconstruction results by casting the reconstruction of visual stimuli as Bayesian inference of the missing view in a multi-view latent variable model. Taken together, these methodological innovations promote a better understanding of human brain mechanisms. However, for deep learning, the accurate reconstruction of image stimuli remains challenging due to the limited number of instances.
In deep learning, the conventional wisdom is that deep neural networks can learn from high-dimensional data, such as images or languages, because they have many labelled samples to train on. These supervised learning models need large amounts of labelled data and many iterations to train their large number of parameters. However, fMRI measurements always yield few instances. As a consequence, common deep learning methods applied to visual reconstruction perform poorly due to overfitting. It is hard to obtain more instances because of equipment limitations and task complexity; moreover, acquiring fMRI data takes a long time. The performance of visual reconstruction methods can be improved by first performing feature mapping and then reconstruction, but given the lack of data, it is hard for an end-to-end method to achieve state-of-the-art performance. Therefore, we need a method that solves the problem of limited data instances and realizes learning from few samples. We believe that current convolutional neural networks do not make full use of these instances: when a batch of samples is sent to the network, the network learns the average features of those inputs and updates its weights using this information, but it does not consider the relationship between different batches of samples. Taking this relationship into account might yield better feature expression, just as humans recognize images by comparing their differences, even when they have no concept of a particular object category. Aside from that, the confusion caused by the high dimensionality of fMRI data needs to be resolved; we propose a fully connected network for selecting voxels to reduce the dimensionality before training.
To make full use of the data by considering the relationships between samples, we note that humans have an inherent ability to acquire and recognize new patterns. When presented with image stimuli, people seem to quickly understand new concepts [26]. Humans are very good at recognizing objects with very little direct supervision, or none at all, i.e., few-shot [27] or zero-shot [28] learning. Aiming to replicate this human ability, one-shot learning is addressed by developing domain-specific features or inference procedures that possess highly discriminative properties for the target task. Siamese nets were first introduced in the early 1990s by Bromley and LeCun to solve signature verification as an image matching problem [29]. The seminal work toward one-shot learning dates back to the early 2000s with the work of Li [30]. Koch et al. [31] explored a method for learning Siamese neural networks that employs a unique structure to naturally rank the similarity between inputs. A Siamese neural network consists of twin networks that accept distinct inputs and are joined by an energy function at the top. In short, the Siamese network makes the output features cluster or disperse by learning a similarity function. Moreover, because it takes input pairs, it increases the number of training samples, which counters overfitting and further helps to take full advantage of the data. With the further development of Siamese neural networks, triplet network models were proposed to learn useful representations by distance comparisons [32,33]. However, such networks are usually used for classification problems, whereas in this study we utilize them for visual reconstruction. In our opinion, the Siamese neural network is useful for end-to-end visual reconstruction, converging or dispersing input pairs in feature space according to whether they belong to the same or different classes.
Humans need only a few images, or none at all, to recognize an object: based on a description of the object and past experience, the object can be identified. Humans can recognize new things because our visual system is naturally capable of extracting features from any object and comparing them. We do not need to have seen an object before, because we are able to compare different objects, and this is the key to our few-sample learning ability. A neural network, by contrast, only has the ability to extract features, not to compare them. Humans can compare because we have prior knowledge to learn from [34]. How, then, do we achieve this kind of fast learning? In our study, we take the comparative relationship into consideration to realize few-shot learning. We believe that people recognize an image by comparing its features, which is, in effect, learning from few samples. Children can quickly identify a “horse” and a “zebra” even after seeing them only a few times, because our visual cells automatically extract image features (e.g., outlines, lighting) and compare them with previous experience to identify the image [34].
In this study, we propose a Siamese reconstruction network (SRN) for visual reconstruction based on the Siamese neural network. For reconstructed images that come from the same class, the relevant dimensions of similarity can be constructed in the course of learning to learn. The SRN can construct meaningful representations of what makes two objects deeply similar, going substantially beyond low-level image features; our method draws its strength from making full use of the data by learning to compare when sample pairs are presented to the network. In this way, the network changes the distribution of features in certain layers, clustering the feature vectors of the same class and enlarging the distance between the feature vectors of different classes before reconstructing the visual image. It also increases the number of training samples, similar to the Siamese neural network. Through this effective use of the data, end-to-end reconstruction performance is improved. Finally, the satisfying results on two fMRI recording datasets demonstrate that our approach can reconstruct visual images from the limited instances of fMRI measurements.
In summary, we make the following contributions: (1) Inspired by the visual mechanism, we introduce the mechanism of comparison into deep learning to realize better visual reconstruction via the Siamese neural network. (2) A novel Siamese reconstruction network (SRN) was developed to reconstruct image stimuli from human fMRI data. (3) The proposed SRN clearly outperformed the state of the art, exceeding it by about 10%.

2. Materials and Methods

2.1. Experiment Data

Two public fMRI datasets were obtained from van Gerven et al. [14,35]. Dataset 1 contains 100 handwritten grayscale digits (with equal numbers of the classes “6” and “9”) at a 28 × 28 pixel resolution, taken from the training set of the MNIST database, together with fMRI data from V1, V2, and V3. Dataset 2 contains 360 grayscale handwritten characters (with equal numbers of the classes “B”, “R”, “A”, “I”, “N”, and “S”) at a 56 × 56 pixel resolution, taken from van Gerven et al., together with fMRI data from V1 and V2 of three subjects [16]. These handwritten grayscale digits and characters were presented to one subject. The experiment comprised four runs, interspersed with 30 s rest periods, for 100 trials in total. In each trial, a handwritten digit or character was presented to the subject, and each block lasted 12.5 s. Blood-oxygenation-level-dependent (BOLD) signals were acquired with a Siemens 3T MRI system using an EPI sequence with a repetition time (TR) of 2.5 s and an isotropic voxel size of 2 × 2 × 2 mm³. The fMRI data of each image stimulus contain 3092 voxels in total from the V1, V2, and V3 regions. Detailed information about the fMRI data can be found in van Gerven et al., and the two public datasets can be downloaded from http://artcogsys.com. We conducted experiments on the two datasets to test the SRN. The visual images from Dataset 2 were downsampled from 56 × 56 pixels to 28 × 28 pixels in our experiments.

2.2. Selecting Voxels

Voxel selection is an important component of fMRI brain decoding, because many voxels may be only weakly correlated with the visual stimulus, or may not respond to it at all. Such voxels not only inflate the dimensionality of the fMRI data but also cause overfitting during training. Although reducing the voxel dimensionality usually loses part of the information, it mitigates overfitting by significantly increasing the ratio of sample size to dimensionality.
To select appropriate voxels, we proposed a method that builds an encoding model mapping image stimuli to voxels and chooses the voxels that are maximally correlated with the image stimuli, measured by the goodness of fit, namely the encoding performance, as shown in Figure 1. First, we establish a classification network with five fully connected layers and train it separately on the original images of Dataset 1 and Dataset 2, achieving high classification accuracies of 98% on Dataset 1 and 94% on Dataset 2. Next, each image from the training set is input to the network, and the vectors of the penultimate layer are output as high-dimensional features. We then choose the voxels for which the model provides better predictability (encoding performance). This codifies our intuition that the voxels better predicted from the visual images are those to be included in the decoding model. The goodness of fit between model predictions and measured voxel activities was quantified using the coefficient of determination (R²), which indicates the percentage of variance explained by the model. In the experiments, we computed the R² of each voxel using 10-fold cross validation on the training data, and voxels with positive R² were retained for further analysis. Finally, we selected the 200 voxels with the highest R², reducing the dimensionality of the fMRI data from 3092 to 200, as sketched below.
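To make the selection step concrete, the following is a minimal sketch of per-voxel encoding-performance scoring, assuming `features` holds the penultimate-layer activations of the trained classifier (n_samples × d) and `voxels` holds the fMRI responses (n_samples × 3092); the variable names and the ridge regressor are our assumptions, as the text specifies only a linear encoding model scored by R² under 10-fold cross validation.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

def select_voxels(features, voxels, n_keep=200):
    """Score each voxel by cross-validated R^2 and keep the best n_keep."""
    r2 = np.empty(voxels.shape[1])
    for v in range(voxels.shape[1]):
        # Predict the voxel response from image features with 10-fold CV.
        pred = cross_val_predict(Ridge(alpha=1.0), features, voxels[:, v], cv=10)
        ss_res = np.sum((voxels[:, v] - pred) ** 2)
        ss_tot = np.sum((voxels[:, v] - voxels[:, v].mean()) ** 2)
        r2[v] = 1.0 - ss_res / ss_tot
    # Indices of the voxels best explained by the encoding model.
    return np.argsort(r2)[-n_keep:]
```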

2.3. Overview of the Proposed Siamese Reconstruction Network

To achieve accurate visual reconstruction from limited samples, we proposed the SRN to make full use of the data. Inspired by humans’ ability to recognize images by learning to compare, the network uses the correlation between a pair of inputs as the pattern of the brain’s response to image differences. It considers not only the relationships within the same batch of samples, but also those between samples in different batches. Moreover, it increases the compactness of the feature vectors of the same class and enlarges the distance between the feature vectors of different classes during learning. The method is inspired by the Siamese neural network, which consists of twin networks that accept distinct inputs but are joined by an energy function at the top. In particular, we were influenced by Koch et al. [31], who proposed Siamese neural networks for one-shot image recognition. Such a network is symmetric: whenever two distinct images are presented to the twin networks, the top conjoining layer computes the same metric as it would if the same two images were presented to the opposite twins. After learning, the network increases the compactness of the feature vectors of the same class and enlarges the distance between the feature vectors of different classes. Accordingly, our proposed reconstruction network consists of twin sequences of fully connected layers. The SRN accepts a pair of inputs consisting of anchor fMRI data and target fMRI data, both selected randomly from the training set. The anchor fMRI data are input to the main channel, which processes the sample to be reconstructed, as shown in the upper portion of Figure 2. The target fMRI data are input to the auxiliary channel, which processes the sample used to constrain the feature expression of the network, as shown in the bottom portion of Figure 2. At test time, the validation sets of fMRI data are input to the main channel, where they can be reconstructed more accurately by the trained Siamese network.
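As a concrete illustration, the following is a minimal sketch of how anchor/target pairs could be drawn, assuming `fmri` is the array of selected voxel responses and `labels` the class labels; the uniform random pairing follows the text, while the function itself is our construction.

```python
import numpy as np

def sample_pairs(fmri, labels, batch_size=64, rng=None):
    """Draw random (anchor, target) pairs with same/different-class labels."""
    if rng is None:
        rng = np.random.default_rng(0)
    anchor_idx = rng.integers(0, len(fmri), size=batch_size)
    target_idx = rng.integers(0, len(fmri), size=batch_size)
    # y = 1 when the pair comes from the same class, 0 otherwise.
    y = (labels[anchor_idx] == labels[target_idx]).astype(np.float32)
    return fmri[anchor_idx], fmri[target_idx], y
```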
We now describe the structure of the Siamese reconstruction network in detail, together with the specifics of the learning algorithm used in our experiments.

2.4. Model

Figure 2 summarizes the framework of our network. It contains only fully connected layers: $L$ layers, each with $N_l$ units, where $h_{1,l}$ denotes the hidden vector in layer $l$ for the first twin and $h_{2,l}$ the corresponding vector for the second twin. We use rectified linear units (ReLU) exclusively in all layers. The network consists of twin networks that accept distinct inputs but are joined by an energy function at the penultimate layer, with weight matrices shared at each layer. The output of the last layer is compared with the original image using the L2 component-wise distance. The detailed network architecture is shown in Table 1, where FC(K) denotes a fully connected layer with K units.
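The twin architecture of Table 1 can be sketched in tf.keras as follows; the layer sizes follow the table, while the dropout rate is our assumption, since the paper does not report it. Weight sharing is obtained simply by applying the same trunk model to both inputs.

```python
import tensorflow as tf

def build_trunk(n_voxels=200, drop=0.5):
    """One twin: FC(2048)-FC(1024)-FC(1024)-FC(512)-FC(784), all ReLU."""
    x = inp = tf.keras.Input(shape=(n_voxels,))
    for units in (2048, 1024, 1024):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
        x = tf.keras.layers.Dropout(drop)(x)
    # 512-unit embedding: the layer at which the twins are joined.
    embed = tf.keras.layers.Dense(512, activation="relu")(x)
    recon = tf.keras.layers.Dense(784, activation="relu")(
        tf.keras.layers.Dropout(drop)(embed))  # 28 x 28 reconstruction
    return tf.keras.Model(inp, [embed, recon])

trunk = build_trunk()
anchor_in = tf.keras.Input(shape=(200,))
target_in = tf.keras.Input(shape=(200,))
a_embed, a_recon = trunk(anchor_in)   # main channel
t_embed, _ = trunk(target_in)         # auxiliary channel, shared weights
```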
Loss Function. In this paper, we compute two losses: the Siamese loss, the L2 component-wise distance between high-dimensional features, and the MSE loss, the L2 component-wise distance between reconstructed and original images.
Siamese Loss. At the penultimate layer, the anchor fMRI data and target fMRI data, drawn from distinct data batches, are embedded by $f(x) \in \mathbb{R}^d$, which maps the voxel responses $x$ into a $d$-dimensional Euclidean space. We want to ensure that the high-dimensional features $f(x_i^a)$ (anchor) and $f(x_i^t)$ (target) of a given voxel vector are closer when the pair comes from the same class than when it comes from different classes. Figure 3 visualizes this.
Let $M$ be the minibatch size, where $i$ indexes the $i$-th minibatch. Let $y(x_i^a, x_i^t)$ be a length-$M$ vector containing the labels for the minibatch, where $y(x_i^a, x_i^t) = 1$ whenever $x^a$ and $x^t$ come from the same character class and $y(x_i^a, x_i^t) = 0$ otherwise.
The Siamese loss to be minimized is then

$$L_s(x_i^a, x_i^t) = -\left[\, y(x_i^a, x_i^t) \log p(x_i^a, x_i^t) + \left(1 - y(x_i^a, x_i^t)\right) \log\left(1 - p(x_i^a, x_i^t)\right) \right],$$

where $p(x_i^a, x_i^t)$ denotes the predicted probability that the pair belongs to the same class.
MSE Loss. At the last layer, the anchor network outputs a batch of 784-dimensional vectors $f(x_i^a)$, which are compared with the original images $y_i$ to make them as similar as possible. The MSE loss, in which the L2 component-wise distance between the vectors is computed, is then

$$L_m = \sum_{i=1}^{N} \left\| f(x_i^a) - y_i \right\|_2^2.$$
We then combine the Siamese loss $L_s$ and the MSE loss $L_m$ as follows, where $\alpha$ and $\beta$ are hyperparameters weighting the Siamese loss and the MSE loss, respectively, $w$ denotes the network weights, and $\lambda \|w\|^2$ is a regularization term that avoids overfitting. The total loss to be minimized is then

$$L_t = \alpha L_s + \beta L_m + \lambda \|w\|^2.$$
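Putting the three terms together, a sketch of the total loss is given below. The mapping from the L2 distance between the twin embeddings to a pair probability $p$ (here a sigmoid of the negative distance) and the default values of $\alpha$, $\beta$, and $\lambda$ are our assumptions; the paper specifies only the form $L_t = \alpha L_s + \beta L_m + \lambda \|w\|^2$.

```python
import tensorflow as tf

def total_loss(a_embed, t_embed, a_recon, images, y, weights,
               alpha=1.0, beta=1.0, lam=1e-4):
    # Siamese loss: cross-entropy on the pair-similarity probability p,
    # with p derived from the L2 distance between the twin embeddings.
    dist = tf.reduce_sum(tf.square(a_embed - t_embed), axis=1)
    p = tf.math.sigmoid(-dist)
    eps = 1e-7  # numerical safety for the logarithms
    l_s = -tf.reduce_mean(y * tf.math.log(p + eps)
                          + (1.0 - y) * tf.math.log(1.0 - p + eps))
    # MSE loss: component-wise L2 distance between reconstruction and image.
    l_m = tf.reduce_mean(tf.reduce_sum(tf.square(a_recon - images), axis=1))
    # L2 regularization over all trainable weights.
    l_reg = tf.add_n([tf.nn.l2_loss(w) for w in weights])
    return alpha * l_s + beta * l_m + lam * l_reg
```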

2.5. Implementation

In our experiments, we used TensorFlow for the implementation and ran it on an NVIDIA Tesla K100 GPU cluster in an NVIDIA DGX station. We trained the neural network for 100 epochs with a batch size of 64 and a learning rate of 0.01, decayed by 10% every 20 steps, using Adam as the solver.
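The reported schedule can be expressed with a tf.keras learning-rate schedule as sketched below; whether “every 20 steps” refers to optimizer steps or epochs is ambiguous in the text, so this sketch assumes optimizer steps.

```python
import tensorflow as tf

# Learning rate 0.01, decreased by 10% every 20 steps (decay_rate = 0.9).
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01, decay_steps=20, decay_rate=0.9,
    staircase=True)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)
```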

3. Experimental Results

3.1. The Results of Reconstruction

To demonstrate the effectiveness of the proposed SRN, we trained the network on both datasets. The handwritten digits and characters reconstructed by the proposed SRN are shown in Figure 4 and Figure 5, respectively, where the first row shows the presented images and the second row shows the images reconstructed by the SRN.
Overall, the images reconstructed by the SRN show good reconstruction quality for handwritten digits and characters, especially in capturing the essential features of the presented images. Although there are subtle differences in the strokes, the reconstructions of handwritten digits and characters share distinct characteristics with their corresponding original images. In addition, some original images with complex shapes and structures are difficult to reconstruct, but the reconstructions produced by the proposed SRN remain recognizable as the correct class.
We attribute this to the fact that, when training the SRN for visual reconstruction, we take the relationships between different batches into consideration, which exploits the data despite the limited sample size. After training, the feature vectors of the same class are clustered while those of different classes are dispersed, so the features extracted by a well-trained SRN from the same class are similar, and the images reconstructed from these features have a similar appearance. Evaluating the quality of reconstructed images is an open and difficult problem [36]. To evaluate the reconstruction performance quantitatively, we used several standard image similarity metrics: Pearson’s correlation coefficient (PCC), mean squared error (MSE), and the structural similarity index (SSIM) [20]. However, traditional metrics such as per-pixel MSE do not evaluate the joint statistics of the results, and therefore do not measure the very structure that structured losses aim to capture. Consequently, MSE and PCC serve only as auxiliary metrics and do not indicate similarity well. By contrast, SSIM addresses this shortcoming by taking structure into account, making it a more convincing measure of structural similarity.
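For reference, the three metrics can be computed for a reconstructed/original image pair as sketched below, using the standard SciPy and scikit-image implementations; the assumption here is that both images are 28 × 28 arrays scaled to [0, 1].

```python
import numpy as np
from scipy.stats import pearsonr
from skimage.metrics import structural_similarity

def evaluate(recon, original):
    """Return (PCC, MSE, SSIM) for one image pair."""
    pcc = pearsonr(recon.ravel(), original.ravel())[0]
    mse = np.mean((recon - original) ** 2)
    ssim = structural_similarity(recon, original, data_range=1.0)
    return pcc, mse, ssim
```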
In the experiments, we employed 10-fold cross validation to test the proposed Siamese method and list the PCC, MSE, and SSIM of the reconstructed images. The results for Dataset 1 and Dataset 2 are shown in Table 2 and Table 3, respectively.
Next, we present the quantitative results on the two test datasets, based on 10-fold cross validation, in comparison with several state-of-the-art methods. As shown in Table 4, on both Dataset 1 and Dataset 2 the PCC and MSE of our proposed Siamese method were slightly weaker than those of the DGMM [25]. Compared with the latest methods, however, the Siamese reconstruction network provided about 8% higher accuracy on the most important metric, SSIM, on Dataset 1 and 10% higher on Dataset 2. The better performance stems from converging or dispersing input pairs in feature space according to whether they belong to the same or different classes, and from making full use of the limited training samples by learning to compare. The weaker MSE and PCC values are not a cause for concern, because pixel-level comparison matters less given the complex noise in the fMRI data and the limited samples. Moreover, in line with the attention mechanism, humans attend to structure rather than to detailed pixel-level information; this is why we disperse the features of different classes to distinguish between them. Consequently, these two metrics serve only as auxiliary measures, and our proposed method performed better overall.
To further illustrate the effect of the Siamese reconstruction network, we set up a comparative experiment with a single fully connected network (FCN), as shown in Table 4. Compared with the FCN, the Siamese reconstruction network provided about 12% higher accuracy on the most important metric, SSIM, on Dataset 1 and 13% higher on Dataset 2.
The images reconstructed by the Siamese reconstruction network performed worse on some metrics and better on others compared with existing methods on each dataset. We attribute this to the fact that handwritten digits have simpler characteristics and less variety, so their image similarity and its quantitative evaluation are significantly better than those of characters; as a result, the images reconstructed by the Siamese reconstruction network can hardly be improved further on Dataset 1. The handwritten characters, by contrast, have more complex characteristics and greater variety, which increases the difficulty of reconstruction but improves the class separability achieved by the Siamese reconstruction network.

3.2. The Clustering of Visual Voxel Responses

To further illustrate the effect of the Siamese reconstruction network, we output the high-dimensional features of the penultimate layer and mapped the features of the different classes to 2-D space with t-SNE [37]. As shown in Figure 6 below, the red points represent the features of samples from class “9” and the green points represent the features of samples from class “6”.
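A minimal sketch of this projection, assuming `feats` holds the penultimate-layer features and `labels` the digit classes, could look as follows.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

emb = TSNE(n_components=2, random_state=0).fit_transform(feats)
for cls, color in ((6, "green"), (9, "red")):
    pts = emb[labels == cls]
    plt.scatter(pts[:, 0], pts[:, 1], c=color, s=10, label=f"class {cls}")
plt.legend()
plt.show()
```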
The figure shows the effect of clustering and dispersing features. With the Siamese reconstruction network, features from different classes tend to be relatively dispersed, while those from the same class tend to cluster. However, the clustering of same-class features is not as tight as we expected. Our analysis is that the proposed method should not cluster same-class features as tightly as in classification problems, because the purpose of the network is to accurately reconstruct the visual images. The original images, even within the same class, have complex shapes and varied structures, so training does not maximize the compactness of same-class feature vectors; instead, it mainly enlarges the distance between the feature vectors of different classes.

4. Discussion

Advantages. Compared with current convolutional neural networks, the Siamese neural network, which accepts sample pairs, can exploit not only every sample but also the relationship between every pair of samples during training, and thus may work better than a plain deep network. In this study, we introduced the SRN into visual reconstruction and improved the reconstruction by converging or dispersing input pairs in feature space according to whether they belong to the same or different classes. Essentially, this operation, called “learning to compare”, can be regarded as a regularization term and is thus potentially applicable to other visual decoding tasks, which usually suffer from limited training samples.
Selective Reconstruction. Given the perplexing fMRI measurement noise and the high dimensionality of the limited data instances, it is important and challenging to remove ineffective voxels, reducing the dimensionality and avoiding interference while keeping the necessary information. In this study, by selecting valuable voxels as in many previous studies, we obtained improved reconstructions. However, some information was also removed during the selection, and making full use of the acquired voxels is hard to achieve. How to keep the necessary information while avoiding interference when developing the computational model therefore deserves more attention.
Inspiration from the Human Visual Mechanism. Humans can recognize a new image without having seen many images, because the human visual system is naturally capable of extracting features from objects and comparing them with past experience. We therefore introduced the comparative relationship into the deep neural network and improved the reconstruction. There are other intriguing mechanisms in the human visual system, such as visual attention, by which humans quickly locate important target areas and conduct detailed analysis. For the further development of brain decoding, other human visual mechanisms could likewise be introduced into deep neural networks.

5. Conclusions

This study proposed the SRN, based on the Siamese neural network, for visual reconstruction, inspired by humans’ inherent ability to recognize an image from few samples by comparing images. fMRI measurements always yield few instances, while supervised learning models need large amounts of labelled data to train their many parameters, so common deep learning methods applied to visual reconstruction perform poorly due to overfitting. To make full use of the data, the proposed method takes the relationships between sample pairs into account by changing the distributions in feature space, clustering the feature vectors of the same class and enlarging the distance between the feature vectors of different classes during learning. Given the high dimensionality of the limited data instances, it is essential to select voxels to reduce the dimensionality of the fMRI data and to learn the mapping from fMRI data to extracted features in visual reconstruction. First, we used features extracted by trained classification networks to estimate the encoding performance of each voxel and thereby select voxels. Second, we trained the reconstruction network on the selected voxels. Based on qualitative and quantitative results, the proposed Siamese reconstruction network performs considerably better on both datasets in the most important metric, SSIM, providing 8% and 10% higher accuracy on Dataset 1 and Dataset 2, respectively, than the state-of-the-art methods. These results demonstrate that our proposed Siamese reconstruction network architecture plays a role in clustering the features of the same class and is effective in visual reconstruction. To the best of our knowledge, this study is the first to introduce a mechanism of comparison, based on the Siamese neural network, for clustering features in visual image reconstruction. Visual reconstruction was realized by an end-to-end method instead of first mapping the voxel responses to a feature space. In this work, we did not explore the full range of possibilities that the Siamese reconstruction network potentially enables; our future work includes applying the method to fMRI-based mental illness classification.

Author Contributions

Conceptualization, L.J., H.B. and B.Y.; Formal analysis, L.W.; Methodology, L.J., K.Q. and C.Z.; Software, L.J. and J.C.; Validation, J.C. and L.Z.; Writing-original draft, L.J.; Writing-review & editing, K.Q., L.W., H.B. and B.Y.

Funding

This work was funded by the National Key R&D Program of China under grant 2017YFB1002502 and National Natural Science Foundation of China (No. 61701089 and No. 61601518).

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Wen, H.; Shi, J.; Zhang, Y.; Lu, K.-H.; Cao, J.; Liu, Z. Neural Encoding and Decoding with Deep Learning for Dynamic Natural Vision. Cereb. Cortex 2018, 28, 4136–4160.
  2. Von Helmholtz, H. Handbuch der Physiologischen Optik; Leopold Voss: Leipzig, Germany, 1867; Volume 9.
  3. Barlow, H.B. Possible principles underlying the transformation of sensory messages. Sens. Commun. 1961, 1, 217–234.
  4. Rao, R.P.; Ballard, D.H. Predictive coding in the visual cortex: A functional interpretation of some extra-classical receptive-field effects. Nat. Neurosci. 1999, 2, 79.
  5. Cox, D.D.; Savoy, R.L. Functional magnetic resonance imaging (fMRI) “brain reading”: Detecting and classifying distributed patterns of fMRI activity in human visual cortex. Neuroimage 2003, 19, 261–270.
  6. Haynes, J.; Rees, G. Decoding mental states from brain activity in humans. Nat. Rev. Neurosci. 2006, 7, 523.
  7. Norman, K.A.; Polyn, S.M.; Detre, G.J.; Haxby, J.V. Beyond mind-reading: Multi-voxel pattern analysis of fMRI data. Trends Cogn. Sci. 2006, 10, 424–430.
  8. Kay, K.N.; Gallant, J.L. I can see what you see. Nat. Neurosci. 2009, 12, 245.
  9. Damarla, S.R.; Just, M.A. Decoding the representation of numerical values from brain activation patterns. Hum. Brain Mapp. 2013, 34, 2624–2634.
  10. Mokhtari, F.; Hossein-Zadeh, G.-A. Decoding brain states using backward edge elimination and graph kernels in fMRI connectivity networks. J. Neurosci. Methods 2013, 212, 259–268.
  11. Kay, K.N.; Naselaris, T.; Prenger, R.J.; Gallant, J.L. Identifying natural images from human brain activity. Nature 2008, 452, 352.
  12. Thirion, B.; Duchesnay, E.; Hubbard, E.; Dubois, J.; Poline, J.-B.; Lebihan, D.; Dehaene, S. Inverse retinotopy: Inferring the visual content of images from brain activation patterns. Neuroimage 2006, 33, 1104–1116.
  13. Miyawaki, Y.; Uchida, H.; Yamashita, O.; Sato, M.; Morito, Y.; Tanabe, H.C.; Sadato, N.; Kamitani, Y. Visual Image Reconstruction from Human Brain Activity using a Combination of Multiscale Local Image Decoders. Neuron 2008, 60, 915–929.
  14. van Gerven, M.A.; de Lange, F.P.; Heskes, T. Neural decoding with hierarchical generative models. Neural Comput. 2010, 22, 3127–3142.
  15. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
  16. Schoenmakers, S.; Barth, M.; Heskes, T.; van Gerven, M. Linear reconstruction of perceived images from human brain activity. NeuroImage 2013, 83, 951–961.
  17. Yargholi, E.; Hossein-Zadeh, G.-A. Reconstruction of digit images from human brain fMRI activity through connectivity informed Bayesian networks. J. Neurosci. Methods 2016, 257, 159–167.
  18. Fujiwara, Y.; Miyawaki, Y.; Kamitani, Y. Modular encoding and decoding models derived from Bayesian canonical correlation analysis. Neural Comput. 2013, 25, 979–1005.
  19. Hardoon, D.R.; Szedmak, S.; Shawe-Taylor, J. Canonical correlation analysis: An overview with application to learning methods. Neural Comput. 2004, 16, 2639–2664.
  20. Wang, W.; Arora, R.; Livescu, K.; Bilmes, J. On Deep Multi-View Representation Learning. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 1083–1092.
  21. Yamins, D.L.; Hong, H.; Cadieu, C.; DiCarlo, J.J. Hierarchical modular optimization of convolutional networks achieves representations similar to macaque IT and human ventral stream. In Proceedings of the Advances in Neural Information Processing Systems 26, Lake Tahoe, NV, USA, 5–8 December 2013; pp. 3093–3101.
  22. Yamins, D.L.; Hong, H.; Cadieu, C.F.; Solomon, E.A.; Seibert, D.; DiCarlo, J.J. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. USA 2014, 111, 8619–8624.
  23. Kriegeskorte, N. Deep neural networks: A new framework for modeling biological vision and brain information processing. Annu. Rev. Vis. Sci. 2015, 1, 417–446.
  24. Zeiler, M.D.; Taylor, G.W.; Fergus, R. Adaptive deconvolutional networks for mid and high level feature learning. In Proceedings of the 13th International Conference on Computer Vision, Barcelona, Spain, 6–13 November 2011; Volume 1, p. 6.
  25. Du, C.; Du, C.; He, H. Sharing deep generative representation for perceived image reconstruction from human brain activity. arXiv 2017, arXiv:1704.07575.
  26. Lake, B.; Salakhutdinov, R.; Gross, J.; Tenenbaum, J. One shot learning of simple visual concepts. In Proceedings of the 33rd Annual Meeting of the Cognitive Science Society, Boston, MA, USA, 20–23 July 2011; Volume 33.
  27. Fei-Fei, L.; Fergus, R.; Perona, P. One-shot learning of object categories. IEEE Trans. Pattern Anal. Mach. Intell. 2006, 28, 594–611.
  28. Lampert, C.H.; Nickisch, H.; Harmeling, S. Attribute-based classification for zero-shot visual object categorization. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 36, 453–465.
  29. Bromley, J.; Guyon, I.; LeCun, Y.; Säckinger, E.; Shah, R. Signature verification using a “siamese” time delay neural network. In Proceedings of the Advances in Neural Information Processing Systems 7, Denver, CO, USA, 28 November–1 December 1994; pp. 737–744.
  30. Fe-Fei, L. A Bayesian approach to unsupervised one-shot learning of object categories. In Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice, France, 13–16 October 2003; pp. 1134–1141.
  31. Koch, G.; Zemel, R.; Salakhutdinov, R. Siamese neural networks for one-shot image recognition. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; Volume 2.
  32. Hoffer, E.; Ailon, N. Deep Metric Learning Using Triplet Network; Springer: Berlin, Germany, 2015; pp. 84–92.
  33. Harwood, B.; Kumar, B.; Carneiro, G.; Reid, I.; Drummond, T. Smart mining for deep metric learning. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2821–2829.
  34. Sung, F.; Yang, Y.; Zhang, L.; Xiang, T.; Torr, P.H.S.; Hospedales, T.M. Learning to Compare: Relation Network for Few-Shot Learning. arXiv 2017, arXiv:1711.06025.
  35. Van Gerven, M.A.; Cseke, B.; De Lange, F.P.; Heskes, T. Efficient Bayesian multivariate fMRI analysis using a sparsifying spatio-temporal prior. NeuroImage 2010, 50, 150–161.
  36. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training GANs. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016; pp. 2234–2242.
  37. Pezzotti, N.; Lelieveldt, B.P.; van der Maaten, L.; Höllt, T.; Eisemann, E.; Vilanova, A. Approximated and user steerable tSNE for progressive visual analytics. IEEE Trans. Vis. Comput. Graph. 2016, 23, 1739–1752.
Figure 1. Selecting voxels with a classification network of five fully connected layers, in order to choose the voxels that are maximally correlated with the visual images during training.
Figure 2. The fully connected architecture selected for the reconstruction task. The twin networks join immediately after the 512-unit fully connected layer, where the L2 component-wise distance between vectors is computed.
Figure 3. The Siamese loss minimizes the distance between an anchor and a positive target, both coming from the same class, and maximizes the distance between an anchor and a negative target coming from a different class.
Figure 4. Examples of 10 distinct reconstructed handwritten digits obtained from Dataset 1.
Figure 5. Examples of 12 distinct reconstructed handwritten characters obtained from Dataset 2.
Figure 6. Projection showing the effect of clustering and dispersing features.
Table 1. Best fully connected architecture selected for the reconstruction task. Anchor data are fed to the anchor network. The Siamese twin is not depicted, but joins immediately after the 512-unit fully connected layer, where the L2 component-wise distance between vectors is computed.
Anchor Network
FC(2048) + ReLU + Dropout
FC(1024) + ReLU + Dropout
FC(1024) + ReLU + Dropout
FC(512) + ReLU + Dropout
FC(784) + ReLU
Table 2. SSIM of the images reconstructed by the Siamese reconstruction network (SRN) for each cross-validation fold on Dataset 1.
Fold    A       B       C       D       E       F       G       H       I       J
SRN     0.739   0.747   0.735   0.737   0.720   0.717   0.731   0.675   0.684   0.761
Table 3. SSIM of the images reconstructed by the SRN for each cross-validation fold on Dataset 2.
Fold    A       B       C       D       E       F       G       H       I       J
SRN     0.422   0.430   0.483   0.457   0.442   0.465   0.444   0.433   0.459   0.427
Table 4. Performance of several image reconstruction methods on the test sets. Results were averaged over 10-fold cross validation.
Dataset     Algorithm               PCC     MSE     SSIM
Dataset 1   Miyawaki et al. [13]    0.767   0.042   0.466
            BCCA [18]               0.411   0.119   0.192
            DCCAE-A [20]            0.548   0.074   0.358
            DGMM [25]               0.803   0.037   0.645
            FCN                     0.594   0.065   0.601
            SRN                     0.736   0.045   0.725
Dataset 2   Miyawaki et al. [13]    0.481   0.067   0.191
            BCCA [18]               0.348   0.128   0.058
            DCCAE-A [20]            0.354   0.073   0.186
            DGMM [25]               0.498   0.058   0.340
            FCN                     0.386   0.071   0.310
            SRN                     0.461   0.068   0.446
