1. Introduction
The goal of this paper is to develop a versatile deep learning neural network classification model that improves the interpretation of ambiguous and degraded stimuli through the inclusion of context during the training and testing phases. The deep learning network selected for the classification model is the convolutional neural network (CNN) because it offers an effective way to integrate context stimuli with a target stimulus for the purpose of extracting features that are coupled across the target and context stimuli. The resulting context-integrating CNN classification model is referred to as the CINET. The CINET is inspired by the context effect, which is the influence of the surrounding environment on the perception of stimuli [1,2,3]. Numerous studies related to the context effect have shown that the integration of contextual information improves the interpretation of spoken words [4,5], written letters and words [6,7,8], physical objects [9,10,11], sounds [12,13], smells [14], tastes [15], threats [16], colors [17], and facial emotions [18,19,20]. The context effect has also been widely studied to show how contextual information is used to uniquely resolve the interpretation of ambiguous stimuli [7,8,21,22,23,24,25,26]. Ambiguous stimuli contain conflicting sensory information which provides the brain with multiple, mutually exclusive interpretations [24].
Figure 1 is a simplified illustration of an example that is often used to show how the brain exploits contextual information to correctly interpret an ambiguous letter. In isolation, the letter in Figure 1a is equally likely to be interpreted as an A or an H. However, as shown in Figure 1b,c, the same ambiguous letter is uniquely interpreted as an H in the word THE and as an A in the word CAT.
The CINET attempts to emulate the brain’s ability to resolve the interpretation of ambiguous and degraded stimuli; however, it is not aimed at modelling the internal mechanisms of the brain involved in context integration. Instead, the aim is to model, at the input-output level, how context included in the learning phase influences the resolution of stimuli in the classification phase. Specifically, the goal is to demonstrate that the CINET parameters can be manipulated to emulate various aspects of the Context Shift Decrement (CSD) principle [27] and the related Context Reinstatement Effect (CRE) [28], which are central to explaining how context influences perception. Together, the CSD principle and the CRE state that recognition is more accurate if the relationship between the context and target is strong, and that recognition accuracy decreases when this relationship is weak or the context is changed during the recognition phase. A letter classification problem is selected because it can elegantly demonstrate the capabilities and performance of the CINET by incorporating context letters to form meaningful words. The model, however, is equally applicable to more complex problems, such as the interpretation of ambiguous objects in the visual domain and of ambiguous words in spoken sentences in the auditory domain. Furthermore, the target and context stimuli can be from different modalities to emulate multisensory context integration.
The structure of the paper is as follows: Section 2 describes the structure and parameters of the generalized CINET classifier model. The CNN implementation of the CINET for multidimensional inputs is described in Section 3. The visual stimuli used in the experiments and the methods used to manipulate target and context stimuli are described in Section 4. The series of experiments designed to demonstrate the capabilities and properties of the CINET, the results, and a discussion of the results are presented in Section 5. Finally, the contributions of the study are summarized in Section 6.
2. The Generalized CINET Classifier Model
The interpretation of ambiguous stimuli is formulated as a pattern recognition problem; therefore, the focus is on modelling the mapping between an ambiguous stimulus (system input) and the class of the ambiguous stimulus (system output). Due to the inclusion of context, the design of the CINET classifier is unlike the design of most pattern classifiers, which mainly focus on training and testing with isolated, context-free patterns. In the formulations, the stimulus classes are represented by ω_i, i = 1, 2, …, L, where L is the number of stimulus classes.
The proposed CINET classifier model is illustrated in Figure 2. This section focuses on the input and context-integration component of the model; the CNN classifier component is described in detail in the next section. The model comprises the target stimulus, the context stimuli, the context weights, the stimulus noise, the weighted and noisy stimuli, the context-integrated stimulus, and the classifier output. The context-integrated stimulus, which is the input to the CNN classifier, is given by Equation (1), in which a general context-integration operation combines the weighted, noisy target and context stimuli. Equation (1) can be written more compactly as Equation (2).
In this generalized formulation, the subscript can represent a position (spatial) index or a time (temporal) index. The transformed target stimulus is padded on the left by S_L and on the right by S_R transformed context stimuli, where S_L and S_R are integers greater than or equal to zero. The context “span” is defined as S = S_L + S_R, and the resulting classifier is referred to as a CINET(S) classifier. The model is symmetrical if S_L = S_R and asymmetrical if S_L ≠ S_R. Furthermore, if S_L = S_R = 0, that is, the context span S = 0, the CINET(S) classifier reduces to a context-free classifier represented by CINET(0). For the CINET(0) classifier, the input stimulus is simply the weighted, noisy target stimulus. The weight assigned to a context stimulus can be varied from zero (no influence) to one (full influence) in order to control the strength of the target-context relationship. The noise added to the target and context stimuli accounts for randomness in the stimuli.
The context-integration operation in Equation (2) is critical because it specifies the manner in which the target and context stimuli are integrated to form the input to the CNN classifier, which in turn determines the type of features that are extracted from the context-integrated input. For example, if the target and the S context stimuli are M × N arrays, they can be integrated into a single M × N array through averaging, a larger array through concatenation, or an M × N × (S + 1) cuboid through a stacking operation. The averaging operation mixes the target and context stimulus arrays into a single array; as a result, there is no control over the strength of the coupling between the target and context stimuli. The concatenation operation also suffers from a lack of controlled coupling. The cuboid option is selected for the development of the CINET classifier model because it offers the most flexible choices for selecting features that are not only coupled across the target and context stimuli, but also features with controlled coupling.
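The three integration options can be sketched on tiny arrays. The following is a minimal illustration assuming 2 × 2 stimuli represented as plain Python lists; the helper names and sizes are illustrative assumptions, not definitions from the model.

```python
# Sketch of the three context-integration options, using nested Python
# lists as stand-ins for M x N stimulus arrays. Names and sizes are
# illustrative, not taken from the CINET formulation.

def integrate_average(target, contexts):
    """Average target and context arrays elementwise -> one M x N array."""
    stimuli = [target] + contexts
    rows, cols = len(target), len(target[0])
    return [[sum(s[r][c] for s in stimuli) / len(stimuli)
             for c in range(cols)] for r in range(rows)]

def integrate_concatenate(target, contexts, left):
    """Place `left` context arrays before the target and the rest after,
    concatenated along the width -> one M x (N * (S + 1)) array."""
    ordered = contexts[:left] + [target] + contexts[left:]
    return [sum((s[r] for s in ordered), []) for r in range(len(target))]

def integrate_stack(target, contexts, left):
    """Stack the arrays along a new depth axis -> an M x N x (S + 1)
    cuboid, the option selected for the CINET."""
    return contexts[:left] + [target] + contexts[left:]

target = [[1, 2], [3, 4]]
contexts = [[[5, 6], [7, 8]], [[9, 10], [11, 12]]]

avg = integrate_average(target, contexts)         # 2 x 2 array
cat = integrate_concatenate(target, contexts, 1)  # 2 x 6 array
cub = integrate_stack(target, contexts, 1)        # 3 layers, target centered
```

Note that only the stacking option preserves the target and each context stimulus as separate layers, which is what allows filters to couple them in a controlled manner.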
3. CNN Implementation of the CINET Classifier Model
In the most general case, the CNN classifier in Figure 2 can be replaced with any classifier. As noted in the Introduction, the CNN is selected because it is ideal for combining target and context stimuli and for extracting coupled target-context features with controlled coupling. This section begins with a brief introduction to CNNs, followed by a detailed description of the multidimensional CINET and its special cases.
3.1. Convolutional Neural Networks
CNNs, inspired by the pioneering work of Nobel laureates David Hubel and Torsten Wiesel on information processing in the visual cortex [29,30,31], are a class of deep learning networks that have proven to be very effective for large-scale object classification and detection in images [32,33,34,35,36,37,38]. Common CNN architectures generally consist of a series of convolution and pooling layers followed by a fully connected network (FCN). The function of the convolution operations in each layer is to detect features in the output of the previous layer; as a result, the complexity of the detected features increases with the number of convolution layers in the network. The pooling layer reduces the spatial dimension of the convolution layer output through subsampling. The most frequently used pooling operation is max-pooling, in which a block of features is replaced by its maximum value in order to select the most robust feature in the block. The FCN is a standard feed-forward network that uses sigmoid or tanh activation functions in the hidden layers and softmax activations in the output layer so that the network outputs can be interpreted as class posterior probabilities. The gradient descent backpropagation algorithm is used to train the network.
Designing a CNN for a given problem involves specifying the architecture, which includes the number of convolution layers; the number, stride, padding, and dimensions of the filters in each convolution layer; the size, stride, and operation (maximum, average) of the filters in the pooling layers; the sequence of the convolution and pooling layers; the number of layers in the FCN; and the activation functions in the convolution and FCN layers. The hyperparameters that need to be specified for the training phase include the loss function, weight initialization, learning rate, momentum term, convergence criterion, and batch size.
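The design choices enumerated above can be collected into a simple specification object. The following is a minimal sketch; the field names and default values are illustrative assumptions, not the settings used in this study.

```python
# Illustrative container for the CNN design choices listed above.
# All names and defaults are hypothetical examples.
from dataclasses import dataclass

@dataclass
class ConvLayerSpec:
    num_filters: int
    filter_size: int       # height = width of each filter
    stride: int = 1
    padding: str = "same"  # "same" (zero-padded) or "valid"

@dataclass
class PoolLayerSpec:
    size: int
    stride: int
    op: str = "max"        # "max" or "average"

@dataclass
class CNNSpec:
    layers: list           # interleaved ConvLayerSpec / PoolLayerSpec
    fcn_hidden: list       # sizes of the fully connected hidden layers
    hidden_activation: str = "tanh"     # or "sigmoid"
    output_activation: str = "softmax"  # class posterior probabilities
    # training hyperparameters
    loss: str = "cross_entropy"
    learning_rate: float = 0.01
    momentum: float = 0.9
    batch_size: int = 32

spec = CNNSpec(
    layers=[ConvLayerSpec(8, 3), PoolLayerSpec(2, 2), ConvLayerSpec(16, 3)],
    fcn_hidden=[64],
)
```

Grouping the architecture and training hyperparameters this way makes it easy to sweep one choice (e.g., the number of filters) while holding the rest fixed.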
3.2. The Multidimensional CINET(S) Model
The most general formulation of the CINET(S) in Figure 2 is obtained by assuming that the target and context stimuli are multidimensional (arrays with more than two dimensions). A color image comprised of red, green, and blue component images may be regarded as a three-dimensional stimulus. Examples of three-dimensional signals include seismic volumes, X-ray computed tomography, and LIDAR data. In the generalized formulation, the multidimensional input into the CNN can include higher-dimensional arrays, such as multisensor satellite images and hyperspectral images. Each multidimensional input can be represented by a cuboid, and the cuboids from multiple inputs can be integrated, using the stacking operation, into hypercuboids. The height, width, and depth of hypercuboids and cuboids will be represented by the variables H, W, and Z, respectively. Note that Z does not represent the depth (number of layers) of the CNN; to avoid this confusion, the cuboid depth will be referred to as the “z-depth.”
The CINET(S) for multidimensional stimuli, shown in Figure 3, is described in detail first. It is then shown that the models for one-dimensional and two-dimensional stimuli are special cases of the multidimensional stimulus model. If H, W, and Z are the height, width, and z-depth of the multidimensional input stimuli, respectively, each weighted, noisy stimulus in Equation (2) is a cuboid of dimension H × W × Z. The input to the CNN is then the hypercuboid in Equation (1), of dimension H × W × (S + 1)Z, formed by stacking the weighted, noisy target and context stimuli, as shown in the figure.
In order to simplify the formulations, it is assumed that the convolutions in each layer are “same” convolutions: the input is zero-padded so that the filter outputs have the same height and width as the input. Moreover, it is assumed that the heights and widths of the filters are the same in all convolution layers. If the convolution is “valid,” the dimensions of the filtered outputs can easily be adjusted according to the height and width of the filter. In the first convolution layer, each filter is selected to be a cuboid filter with the same z-depth as the input hypercuboid so that the target and context are fully coupled within the receptive field of each neuron in the layer. The convolution of the input hypercuboid with a cuboid filter having the same z-depth results in a two-dimensional array with the same height and width as the input. A bias is added to each filtered output, which is then passed through the nonlinear activation function, and the activations of all the filters in the first layer are combined into a cuboid whose z-depth equals the number of filters in the layer. If a pooling layer follows, the height and width of the pooled output are determined by the size and stride of the pooling filter. In the next convolution stage, the resulting cuboid is convolved with cuboid filters whose z-depth matches that of the cuboid and, as in the previous step, a bias is added to each filtered output, the result is passed through the activation function, and the activations are combined into a cuboid. If a pooling layer follows, the height and width of the cuboid are adjusted accordingly. The convolution and pooling operations are repeated and terminate in a flattening operation in which the elements of the final cuboid are combined into a vector, which is the input to a fully connected feed-forward neural network.
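The first-layer operation described above can be sketched as follows: a cuboid filter with the same z-depth as the input slides over height and width only, with zero-padding (“same” convolution), so that every output value couples all target and context layers; max-pooling then subsamples the result. The sizes, filter values, and function names below are illustrative assumptions.

```python
# Minimal sketch of a full-depth ("same") cuboid convolution followed by
# max-pooling, on nested Python lists. Sizes and values are illustrative.

def conv_same_full_depth(cuboid, filt):
    """cuboid: Z x H x W nested lists; filt: Z x m x m with m odd.
    Returns an H x W array (the z-depth collapses)."""
    Z, H, W = len(cuboid), len(cuboid[0]), len(cuboid[0][0])
    m = len(filt[0])
    r = m // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for z in range(Z):
                for a in range(m):
                    for b in range(m):
                        y, x = i + a - r, j + b - r
                        if 0 <= y < H and 0 <= x < W:  # zero padding
                            s += cuboid[z][y][x] * filt[z][a][b]
            out[i][j] = s
    return out

def max_pool(arr, size, stride):
    """Replace each size x size block with its maximum value."""
    H, W = len(arr), len(arr[0])
    return [[max(arr[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, W - size + 1, stride)]
            for i in range(0, H - size + 1, stride)]

# A 2-layer cuboid (e.g., one target and one context layer), 4 x 4 each.
cuboid = [[[float(i + j) for j in range(4)] for i in range(4)]
          for _ in range(2)]
# A 3 x 3 filter per layer with a single center tap: output sums the layers.
filt = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
        for _ in range(2)]

feat = conv_same_full_depth(cuboid, filt)  # 4 x 4 feature map
pooled = max_pool(feat, 2, 2)              # 2 x 2 pooled output
```

Because the filter spans the full z-depth, each feature value depends jointly on the target and context layers, which is precisely the coupling the CINET relies on.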
The fully connected network (FCN) uses the sigmoid or tanh activation function for the intermediate hidden layers, the softmax activation function for the output layer, and the cross-entropy loss function. As noted earlier, it is assumed that a target stimulus belongs to one of the L classes ω_i, i = 1, 2, …, L. The softmax layer will, therefore, have L outputs, one for each class of the target stimulus. If a_i is the weighted sum of the inputs into neuron i of the softmax layer, the softmax outputs are given by y_i = exp(a_i) / Σ_{k=1}^{L} exp(a_k), and the cross-entropy cost function is given by E = −Σ_{i=1}^{L} t_i ln y_i, where t_i is the one-hot encoded class label. During testing, the softmax outputs can be regarded as estimates of class posterior probabilities; therefore, the target stimulus can be assigned to the class yielding the highest posterior probability, that is, the class ω_i for which y_i is maximum (Equation (3)).
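The output-layer computations can be sketched directly: softmax activations over the weighted sums, the cross-entropy loss against a one-hot label, and assignment to the class with the highest estimated posterior. The variable names and input values below are illustrative.

```python
# Sketch of the output-layer computations: softmax, cross-entropy, and
# the highest-posterior classification rule. Values are illustrative.
import math

def softmax(a):
    m = max(a)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def cross_entropy(y, t):
    """y: softmax outputs; t: one-hot target label."""
    return -sum(ti * math.log(yi) for yi, ti in zip(y, t) if ti > 0)

a = [2.0, 1.0, 0.1]       # weighted sums into the softmax layer
y = softmax(a)            # estimated class posteriors; they sum to 1
predicted = max(range(len(y)), key=lambda i: y[i])  # highest posterior
loss = cross_entropy(y, [1, 0, 0])
```

Subtracting the maximum before exponentiating does not change the softmax outputs but avoids overflow for large weighted sums.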
3.3. Special Cases of the CINET(S) Model
As mentioned earlier, the one-dimensional and two-dimensional inputs into the CNN are special cases of the multidimensional inputs. For the two-dimensional case, the main difference is that the z-depth of the target and context stimuli is unity. Therefore, the input cuboid formed by stacking the weighted, noisy target and the S context stimuli has z-depth S + 1, and, in order to match the z-depth of the input cuboid, the filters in the first convolution layer also have z-depth S + 1. Other than the changes in the dimensions of the cuboid input and the filters in the first layer, the convolution, pooling, and FCN layer operations are identical to the operations in the multidimensional input case.
For one-dimensional inputs, the heights and z-depths of the target and context stimuli are unity; the stimuli are, therefore, vectors. Note that, although each stimulus is a one-dimensional array, it is written as a cuboid with unity height and z-depth for consistency. The input to the CNN is the cuboid formed by stacking the S + 1 weighted, noisy stimuli, which has unity height and z-depth S + 1. Each filter in the first convolution layer, therefore, also has unity height and z-depth S + 1, and each filtered output is a vector, which can be written as a cuboid with unity height and z-depth. A bias is added to each filtered output and passed through the activation function, and the filtered outputs are combined into a cuboid. The width of the cuboid is adjusted if a pooling layer follows the convolution layer. Subsequent convolutions are also unit-height cuboid convolutions, which result in vectors that are combined into unit-height cuboids. An FCN with softmax outputs is implemented after the last pooling layer, and a target stimulus is assigned to a class using the rule in Equation (3).
5. Classification Experiments and Results
In the experiments that follow, the context-free CINET(0) and context-integrating CINET(S) classifiers were trained to classify only the target letters. The training sets were generated by adding a small level of random noise to the noise-free target and context letters. The inclusion of noise at such small levels introduces minor variations in the letters; therefore, the resulting training sets are referred to as “noise-free training sets” in the experiments. The networks were initialized with random weights, and training was terminated when the cross-entropy fell below 0.001. The CINET(0) classifier was tested on the three ambiguous letters at varying noise levels. The CINET(S) classifiers were tested on the three ambiguous letters in congruent and incongruent environments. In the experiments conducted, a total of one hundred distorted and noisy versions of each ambiguous letter were generated to form the test set at each noise level. Because the classification results of a CNN depend on the initial weights, a total of thirty CNNs were initialized with random weights, and the performance of each network was evaluated using the test sets. Consequently, the total number of tests conducted for each ambiguous letter at a given noise level was 30 × 100 = 3000. The results for each ambiguous letter were averaged across the 3000 tests, and the final classification probability was obtained by averaging these results across the three ambiguous letters. The following experiments were designed:
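The evaluation protocol just described can be sketched as follows, with a random stand-in in place of the trained classifier; the stand-in, its seeding scheme, and the accuracy model are purely illustrative.

```python
# Sketch of the evaluation protocol: each of the 30 randomly initialized
# networks is tested on 100 noisy versions of each ambiguous letter, and
# the results are averaged. The classifier is a random stand-in whose
# accuracy degrades with the noise level; it is not the CINET itself.
import random

N_NETWORKS, N_VERSIONS = 30, 100

def dummy_classify(seed, noise_level):
    """Stand-in classifier: returns 1 (correct) or 0."""
    rng = random.Random(seed)  # deterministic per (letter, net, version)
    return 1 if rng.random() > noise_level / 2 else 0

def average_probability(letters, noise_level):
    per_letter = []
    for letter in letters:
        correct = sum(
            dummy_classify(f"{letter}-{net}-{version}", noise_level)
            for net in range(N_NETWORKS)
            for version in range(N_VERSIONS))
        # 30 x 100 = 3000 tests per letter at this noise level
        per_letter.append(correct / (N_NETWORKS * N_VERSIONS))
    # final probability: average over the ambiguous letters
    return sum(per_letter) / len(per_letter)

p = average_probability(["A/H", "O/U", "P/R"], noise_level=0.1)
```

Averaging over both the network initializations and the noisy test versions separates the effect of the noise level from the run-to-run variability of training.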
Set 1: Context-free classification with the CINET(0) classifier
The first set of experiments was aimed at demonstrating the performance of the classifier when no context is integrated into the training and testing phases. A CINET(0) classifier was trained to classify the six noise-free isolated target letters {A,H,O,U,P,R}, shown in Figure 4, and was tested with the three isolated ambiguous letters {[A/H], [O/U], [P/R]}, shown in Figure 5. The noise level in the test ambiguous letters was varied from 0.1 to 2. It is important to note that in the absence of context, the true class of an ambiguous letter is unknown. An ambiguous letter could be classified into any one of the six classes; however, the interest is mainly in estimating the probability of an ambiguous letter being classified into one of its two possible categories. The classification probabilities are summarized in the row labeled Set 1 in Table 1. The classification probability can be interpreted by considering the first entry in the Set 1 row of the table, which shows that the average probability of classifying the three ambiguous characters into their possible classes was 0.48 when the noise level was 0.1. That is, the average of the probabilities of classifying [A/H] as an A or an H, [O/U] as an O or a U, and [P/R] as a P or an R was 0.48 when the noise level was 0.1. Observe that the probabilities drop as the noise increases because it becomes increasingly difficult to classify each ambiguous letter into one of its two possible classes.
Set 2: Training and testing the CINET(2) classifier with congruent context
These experiments were aimed at demonstrating the improvement in performance when strong context (unity weights) is incorporated in training and the same noise-free (congruent) context is used during testing, in order to emulate learning and testing in the same environment. A symmetrical CINET(2) classifier with unity weights was trained with the noise-free context-augmented training set {BAG, THE, MOW, FUN, SPY, IRK} to classify the six target letters in the center. That is, B and G were the context for target stimulus A, T and E were the context for target H, and so on. The training set is shown in Figure 9a. The classifier was tested with the center letters replaced with ambiguous letters, as shown in Figure 9b. That is, the test set was {B[A/H]G, T[A/H]E, M[O/U]W, F[O/U]N, S[P/R]Y, I[P/R]K}. The same noise levels used in Set 1 were added to the center ambiguous letters. Correct classification occurred if the noisy ambiguous letter, surrounded by congruent context letters, was resolved correctly. The results, shown in the row labelled Set 2 in the table, can be interpreted by considering the first entry, 0.99, which shows that when the noise level was 0.1, the average of the probabilities of correctly classifying [A/H] as an A when the input was B[A/H]G, [A/H] as an H when the input was T[A/H]E, [O/U] as an O when the input was M[O/U]W, [O/U] as a U when the input was F[O/U]N, [P/R] as a P when the input was S[P/R]Y, and [P/R] as an R when the input was I[P/R]K, was 0.99. By including context, the CINET(2) classifier is able to resolve stimulus ambiguities quite effectively. As expected, the classification probabilities dropped as the noise levels increased.
Set 3: Training and testing the CINET(4) classifier with congruent context
These experiments were aimed at demonstrating the improvement in performance when additional strong context (unity weights) is incorporated in training and congruent context is used during testing. A symmetrical CINET(4) classifier with unity weights was trained with the context-augmented training set {BEAST, ETHYL, FLUID, GNOME, IMPLY, SCREW} to classify the six target letters in the center. The training set is shown in Figure 10a. As in Set 2, the classifier was tested with the center letters replaced with ambiguous letters. Figure 10b shows the test set with the noise-free ambiguous letters surrounded by noise-free congruent context. The test results under varying noise levels in the ambiguous letters are shown in the row labelled Set 3 in Table 1. The probabilities are interpreted just as they were for Set 2. It is clear that, for the same range of noise levels, the performance of the CINET(4) classifier is much better than that of the CINET(2) classifier. It can, therefore, be concluded that incorporating additional context improves the classification of ambiguous letters.
Set 4: Testing the CINET(4) classifier with weighted incongruent context
This set of experiments was aimed at demonstrating how the weights can be manipulated to simulate incongruent testing environments and how the performance is affected by varying the context weights during testing. The CINET(4) classifier designed in Set 3 was tested with two different sets of context weights. The first set of weights was selected to show how the attenuation of context affects the performance. The second set of context weights was selected to have a decaying influence as the separation span (spatial/temporal lag) between the target and context stimuli increased. The resulting noise-free test sets with weighted incongruent context are shown in Figure 11. The results for this set of experiments are presented in the rows labelled Set 4(A) and Set 4(B) in Table 1, respectively. As expected, the performance declines as the incongruency in the testing environment increases.
Set 5: Testing the CINET(4) classifier with noisy incongruent context
This set of experiments was aimed at demonstrating how the performance is affected when noise is added to the context stimuli during testing to generate incongruent context environments. The CINET(4) classifier designed in Set 3 with unity weights was tested with noise in the ambiguous letters, as well as statistically equivalent noise in the context letters. An example of a test set is shown in Figure 12. The results are presented in the row labelled Set 5. As expected, performance declines when context incongruency is increased by adding noise. However, the general trend observed in the Set 3 results is maintained.
Set 6: Testing the CINET(2) classifier with flipped incongruent context
The last set of experiments was different in the sense that the CINET(2) classifier trained in Set 2 was tested with “flipped” context to demonstrate how performance is affected if incorrect context is used during testing. The test set, therefore, was {G[A/H]B, E[A/H]T, W[O/U]M, N[O/U]F, Y[P/R]S, K[P/R]I}. The test set is shown in Figure 13, and the results at varying noise levels are shown in the row labelled Set 6. Despite the fact that the same context letters were used, the results are quite poor. This, however, is not unexpected because the temporal pattern of the context was changed.
The best result at each noise level is shown in boldface in Table 1. For comparison purposes and to observe the trends, Figure 14 summarizes the correct resolution probabilities from the experiments of Sets 2–5. The Set 1 results are also included in the figure to serve as the context-free reference. The results from Set 6 are not included in the figure.
Although not the primary focus of this study, the CINET(S) model can also be used to design experiments that demonstrate the influence of context on the recognition of non-ambiguous target stimuli in varying congruent and incongruent environments, simply by testing the targets instead of the ambiguous stimuli. In general, it can be expected that performance will improve when congruent context is included in the learning and recognition phases. This was confirmed by repeating all six experiments with the target letters {A,H,O,U,P,R} as the test stimuli. The average classification probabilities are summarized in Table 2 and Figure 15. The best results are shown in boldface. As expected, the classification probabilities are higher for non-ambiguous targets. Comparing Figure 14 and Figure 15, it is interesting to observe that the performance trends for the classification of ambiguous and non-ambiguous stimuli are quite similar.
Conclusions from the Experiments
The results in Table 1 and Table 2 and the trends in Figure 14 and Figure 15 show that the CINET(S) classifiers perform in a desirable manner in the sense that various aspects of the CSD principle and the CRE are demonstrated. That is, congruent context helps resolve classification ambiguities, and this ability decreases as the ambiguity levels and context incongruencies are increased. The CNN offers an effective method for extracting features that are coupled across the target and context stimuli. Moreover, the random stimulus noise and context weights offer an effective way of manipulating the strength of the target-context coupling.
The six sets of experiments and the results obtained demonstrate, quite effectively, the performance trends of the CINET(S) classifier. It can be expected that other forms of ambiguity and context manipulations will result in similar trends. Furthermore, similar results would be obtained even if the letters used for context did not form meaningful words, as long as the same context letters were used for both training and testing. Also noteworthy is that the use of simulated ambiguities and context environments enabled a systematic and quantifiable evaluation of the CINET(S) classifier model under a wide range of conditions. Such extensive experimentation and evaluation would not be possible with real data unless an enormously large data set with quantifiable ambiguities and context were collected; the CINET(S) classifier can, however, be expected to perform similarly on real data.
Experiments can also be designed to demonstrate the influence of context on perceiving a missing stimulus, for example, a missing letter in a learned word. Because the stimulus index can be temporal, the model can also be applied to resolve ambiguities in sequentially occurring events, such as garbled words in a sentence. Context that is not inherently sequential can also be accommodated in the model. For example, if the background of an object in an image is regarded as the context, the image can be segmented into two components: object (target) and background (context). The input to the CINET(S) classifier would then be a concatenation of the target features and context features. Finally, it is important to note that the target and context stimuli can be from mixed modalities (e.g., visual and auditory stimuli) for multisensory target-context integration, which is yet another way to combine multisensory information in brain-inspired classification systems [42].