Article

Hybrid Spiking Fully Convolutional Neural Network for Semantic Segmentation

Tao Zhang, Shuiying Xiang, Wenzhuo Liu, Yanan Han, Xingxing Guo and Yue Hao
1 State Key Laboratory of Integrated Service Networks, Xidian University, Xi’an 710071, China
2 State Key Discipline Laboratory of Wide Bandgap Semiconductor Technology, Xidian University, Xi’an 710071, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(17), 3565; https://doi.org/10.3390/electronics12173565
Submission received: 30 June 2023 / Revised: 18 August 2023 / Accepted: 21 August 2023 / Published: 23 August 2023
(This article belongs to the Special Issue Advances in Photonic Neural Networks and Neuromorphic Computation)

Abstract

The spiking neural network (SNN) exhibits distinct advantages in terms of low power consumption due to its event-driven nature. However, it has been limited to simple computer vision tasks because SNNs are challenging to train directly. In this study, we propose a hybrid architecture called the spiking fully convolutional neural network (SFCNN) to expand the application of SNNs to semantic segmentation. To train the SNN, we employ the surrogate gradient method along with backpropagation. On the VOC2012 dataset, the mean intersection over union (mIoU) reaches 39.6%, almost 30% higher than that of the existing spiking FCN. Moreover, the proposed hybrid SFCNN achieves excellent segmentation performance on other datasets such as COCO2017, DRIVE, and Cityscapes. Our hybrid SFCNN is a valuable contribution to extending the functionality of SNNs, especially for power-constrained applications.

1. Introduction

Image segmentation is one of the fundamental tasks in computer vision and is critical for real-world applications such as autonomous driving and medical diagnosis. It can be divided into two types: semantic segmentation and instance segmentation. In semantic segmentation, every pixel in an image is assigned to a category, achieving pixel-level classification; all instances of the same class share a single label. Instance segmentation not only requires pixel-level classification but also the ability to distinguish different instances within each category.
In the past few decades, artificial neural networks (ANNs) have shown outstanding performance in the image segmentation field, for instance, FCN [1], U-Net [2], LinkNet [3], DeepLab [4], SegNet [5], Graph-FCN [6], and GPSNet [7]. However, their computing cost is huge [8,9], which is a serious bottleneck in power-constrained systems such as Internet of Things devices. As a low-power alternative to ANNs, spiking neural networks (SNNs) have attracted increasing attention. In SNNs, information is transmitted as binary spikes, resulting in high energy efficiency [10,11,12,13,14,15]. Supervised training algorithms for image classification with both shallow and deep SNNs have developed rapidly, for example, biologically plausible spike-timing-dependent plasticity (STDP)-based supervised training algorithms [16,17,18], ANN-SNN conversion methods [19,20,21,22], and backpropagation based on surrogate gradients [23,24,25,26]. At present, however, SNNs are mainly limited to classification tasks. Thus, it is still highly desirable to explore novel architectures and algorithms that extend the application of SNNs to more complex tasks such as image segmentation.
In this paper, we propose a hybrid spiking fully convolutional neural network (SFCNN) architecture to extend the application of SNNs to semantic segmentation. The main contributions are as follows. Firstly, we introduce a hybrid spiking fully convolutional network to solve the semantic segmentation problem. Secondly, in the encoder stage, information is transmitted as binary spikes, and the surrogate gradient method is adopted to train the hybrid SFCNN directly through backpropagation; the computational complexity is lower than that of floating-point operations. Thirdly, the accuracy on the VOC2012 dataset (39.6%) is much higher than that of the existing spiking FCN (9.9%) [27]. The rest of the paper is structured as follows. In Section 2, the methods and materials are presented, including the network structure, surrogate gradient, and training algorithm. Section 3 presents the results: the semantic segmentation results are quantified and compared under different conditions, several datasets (VOC2012 [28], COCO2017 [29], and DRIVE [30]) are considered, and the testing results are compared with other well-known models. Finally, Section 4 summarizes the conclusions.

2. Methods and Materials

2.1. Network Architecture

The network structure of the hybrid SFCNN is presented in Figure 1. The whole network consists of a spike encoder, a four-layer encoding module, a four-layer decoding module, and a final 1 × 1 convolution layer. Since the images in the dataset vary in size, each image is resized to 384 × 384 and then passed to the first convolution layer for spike coding. The parameters of this convolution layer are the same as those of the following encoding modules. Each encoding block consists of a convolution layer followed by spiking neurons for spike activation. The convolution layer computes the convolution of the input RGB image; its output is a four-dimensional tensor, which becomes a five-dimensional tensor after being expanded in the time domain. The five-dimensional tensor is passed through the spiking neurons and activated into a five-dimensional coded tensor composed of binary spikes; that is, the floating-point tensor is replaced with a discrete spike tensor. After that, each layer of the encoding network uses spikes to transmit information and perform computation. A representative example of the spike-encoding result is shown in Figure 2. Here, each pixel is encoded into 64 spike sequences of length T.
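To make the spike-coding step concrete, the following is a minimal PyTorch sketch of the first convolution layer, the time-domain expansion, and the IF spike activation. The IFNeuron class here is a simplified stand-in for the IFNode provided by SpikingJelly; the soft-reset behaviour and the module names are our own assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class IFNeuron(nn.Module):
    """Minimal integrate-and-fire neuron: accumulates the input into a membrane
    potential and emits a binary spike whenever the threshold is crossed."""
    def __init__(self, v_threshold=1.0):
        super().__init__()
        self.v_threshold = v_threshold

    def forward(self, x):                          # x: (T, B, C, H, W)
        v = torch.zeros_like(x[0])                 # membrane potential per neuron
        spikes = []
        for t in range(x.shape[0]):
            v = v + x[t]                           # integrate the input current
            s = (v >= self.v_threshold).float()    # fire where the threshold is reached
            v = v - s * self.v_threshold           # soft reset of the firing neurons
            spikes.append(s)
        return torch.stack(spikes)                 # binary tensor, (T, B, C, H, W)

class SpikeEncoder(nn.Module):
    """First convolution layer, time-domain expansion, and IF spike activation."""
    def __init__(self, T=6):
        super().__init__()
        self.T = T
        self.conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
        self.neuron = IFNeuron()

    def forward(self, img):                        # img: (B, 3, 384, 384)
        feat = self.conv(img)                      # (B, 64, 384, 384), floating point
        feat = feat.unsqueeze(0).repeat(self.T, 1, 1, 1, 1)   # (T, B, 64, 384, 384)
        return self.neuron(feat)                   # discrete spike tensor
```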
After spike encoding, the tensor size is T × 64 × 384 × 384. The four encoding modules share the same structure, presented in Figure 3a. Each encoder block consists of a max pooling layer, a convolution layer, and an activation layer. For the max pooling layer, the pooling kernel size is 2 and the stride is 2. For the convolution layer, the kernel size is 3, the stride is 1, and the padding is 1. For the activation layer, the IF neuron is employed as the spiking neuron, as its response resembles the ReLU function. After four encoding blocks, the tensor size becomes T × 1024 × 24 × 24.
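A sketch of one encoder block under the same assumptions, reusing the IFNeuron class from the sketch above; applying the pooling and convolution per time step by merging the time and batch dimensions is our own implementation choice, not taken from the paper.

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder stage: 2x2 max pooling, 3x3 convolution, IF spike activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.neuron = IFNeuron()                       # from the previous sketch

    def forward(self, x):                              # x: (T, B, C_in, H, W), binary
        T, B, _, H, W = x.shape
        y = self.conv(self.pool(x.flatten(0, 1)))      # merge T and B for the 2D ops
        y = y.view(T, B, -1, H // 2, W // 2)           # restore the time dimension
        return self.neuron(y)                          # (T, B, C_out, H/2, W/2)

# Channel progression from Table 1: 64 -> 128 -> 256 -> 512 -> 1024, so the
# spatial size shrinks from 384 x 384 to 24 x 24 after the four blocks.
encoder = nn.Sequential(
    EncoderBlock(64, 128), EncoderBlock(128, 256),
    EncoderBlock(256, 512), EncoderBlock(512, 1024),
)
```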
For the decoding block, there are two commonly adopted approaches: upsampling and deconvolution. The principle of upsampling is presented in Figure 4. As illustrated in Figure 4a, assuming that the coordinates of points A, B, C, and D and their pixel values f(A), f(B), f(C), and f(D) are known, the pixel value of point P can be calculated as follows:
f(E) \approx \frac{x - x_1}{x_2 - x_1} f(B) + \frac{x_2 - x}{x_2 - x_1} f(A), \quad E = (x, y_2)
f(F) \approx \frac{x - x_1}{x_2 - x_1} f(C) + \frac{x_2 - x}{x_2 - x_1} f(D), \quad F = (x, y_1)
f(P) \approx \frac{y - y_1}{y_2 - y_1} f(E) + \frac{y_2 - y}{y_2 - y_1} f(F)
For clarity, an example is illustrated in Figure 4b. After the upsampling process, the matrix size is changed from 2 × 2 to 4 × 4.
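As a quick illustration (the input values are chosen arbitrarily and are not those of Figure 4b), bilinear upsampling in PyTorch reproduces the 2 × 2 to 4 × 4 size change:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[[[1.0, 2.0],
                    [3.0, 4.0]]]])          # one-channel 2 x 2 feature map
y = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
print(y.shape)                              # torch.Size([1, 1, 4, 4])
```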
The principle of deconvolution is presented in Figure 5 and includes three steps. The first step is inserting zeros. More precisely, for a given matrix with height = 2, as in Figure 5a, and stride = 2, we insert stride − 1 zeros between adjacent elements to obtain a new feature map, as in Figure 5b. The size of the new feature map can be calculated by
\text{height}' = \text{height} + (\text{stride} - 1) \times (\text{height} - 1) = 3
where height and height’ represent the matrix sizes before and after inserting zeros. The second step is padding. We take kernel size’ = kernel size = 2 and stride’ = 1; then padding’ = kernel size − padding − 1. For the matrix shown in Figure 5b, padding = 0, so padding’ = 1, as illustrated in Figure 5c. Thus, after padding, the matrix size changes from 3 × 3 to 5 × 5. The third step is convolution with kernel size’, after which the matrix size is 4 × 4.
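The same size change can be obtained with a transposed convolution, which internally performs the zero-insertion, padding, and convolution steps described above. A small check (random weights, purely illustrative):

```python
import torch
import torch.nn as nn

deconv = nn.ConvTranspose2d(in_channels=1, out_channels=1,
                            kernel_size=2, stride=2, padding=0)
x = torch.randn(1, 1, 2, 2)                 # 2 x 2 input, as in Figure 5a
print(deconv(x).shape)                      # torch.Size([1, 1, 4, 4])
```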
Because of the gradient vanishing and gradient explosion problems caused by network depth, we adopt the upsampling method in the decoding block, similar to the decoding module of U-Net. Note that the time dimension needs to be removed, which is done by calculating the firing rate of the spike sequence. Each decoding module in the network has two inputs: one is the convolution result of the previous layer, and the other is the output of the corresponding encoder block. The structure of the decoding module is presented in Figure 3b. Taking the fourth decoding module as an example, the output of encoder block 4 is first converted to a firing rate, changing from a five-dimensional tensor to a four-dimensional tensor, and fed into decoder block 4. After upsampling with bilinear interpolation, the resulting tensor size is 1024 × 48 × 48. Then, the output of encoder block 3 is also converted to a firing rate, giving a tensor of size 512 × 48 × 48. The two results are concatenated along the channel dimension to form thicker features, resulting in a size of 1536 × 48 × 48. After two convolution layers with the same settings as in the encoding module, the tensor size is 512 × 48 × 48, and this tensor serves as one of the inputs of decoder block 3. After four decoding blocks, the tensor size is 64 × 384 × 384. Here, we first consider the VOC2012 dataset. As VOC2012 includes 21 classes, after a 1 × 1 convolution layer for dimension reduction, the final tensor size is 21 × 384 × 384. The output tensor and the label are then used to calculate the loss function and gradients, enabling training through backpropagation. The number of parameters in the encoding blocks and the corresponding feature map sizes are shown in Table 1.
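The following is a minimal sketch of one decoder block following this description. The firing-rate computation (mean over the time dimension), the bilinear upsampling, the channel concatenation, and the two 3 × 3 convolutions follow the text; the ReLU activations after the decoder convolutions are our own assumption, since the text does not specify which activation the non-spiking decoder uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Decoder stage: upsample the previous decoder output, concatenate it with
    the firing rate of the matching encoder block, then apply two 3x3 convolutions."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip_spikes):
        # skip_spikes: (T, B, C, H, W) binary encoder output; the mean over the
        # time dimension is the firing rate, a float tensor of shape (B, C, H, W).
        skip = skip_spikes.float().mean(dim=0)
        x = F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.cat([x, skip], dim=1)        # 1024 + 512 = 1536 channels for block 4
        return self.conv(x)

# Decoder block 4: bottleneck firing rate (B, 1024, 24, 24) upsampled to 48 x 48,
# fused with the encoder block 3 skip (B, 512, 48, 48), reduced to 512 channels.
dec4 = DecoderBlock(in_ch=1024, skip_ch=512, out_ch=512)
```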

2.2. Surrogate Gradient and Training Algorithms

During training, cross entropy loss is used and can be calculated as follows:
\text{loss} = -\sum_{j=1}^{m} \sum_{i=1}^{n} \left[ y_{ji} \log(\hat{y}_{ji}) + (1 - y_{ji}) \log(1 - \hat{y}_{ji}) \right]
where m is the sample size of the current batch, n is the number of categories, y is the vector output by the network, and ŷ is the corresponding label vector. The training algorithm for the hybrid SFCN network is presented in Algorithm 1.
Here, the IF neuron is employed as the spiking neuron. In training, 0 and 1 are used to pass information in the encoder blocks. The difficulty in direct training is that the output of the spiking neuron is not differentiable, so backpropagation cannot be carried out directly. This is solved by replacing the derivative of the spike with a differentiable surrogate function during backpropagation, without pre-training or conversion.
In the experiment, the softsign function is used as a surrogate function, which can be expressed as follows:
g(x) = \frac{1}{2} \left( \frac{\alpha x}{1 + |\alpha x|} + 1 \right)
The shape of the softsign function and the corresponding gradient are presented in Figure 6.
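A minimal sketch of how such a surrogate can be wired into autograd: the forward pass is the non-differentiable Heaviside step, while the backward pass uses the derivative of the softsign surrogate, g′(x) = α / (2(1 + |αx|)²). The value of α here is illustrative only.

```python
import torch

class SpikeFunction(torch.autograd.Function):
    """Heaviside step in the forward pass, softsign surrogate gradient in the backward pass."""
    alpha = 2.0   # sharpness of the surrogate (illustrative value)

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0).float()                    # binary spike

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        a = SpikeFunction.alpha
        surrogate_grad = a / (2 * (1 + a * x.abs()) ** 2)   # g'(x)
        return grad_output * surrogate_grad

spike = SpikeFunction.apply   # drop-in replacement for the thresholding step
```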
Algorithm 1: Hybrid SFCN network for pixel-level semantic segmentation
Input: RGB image(3 × 384 × 384), T = 6, max_epoch = 300
Output: semantic segmentation result
1:   for epoch to max_epoch do
2:    for pixel in image do
3:      for t to T do
4:      Spike Encoder → spike sequence (length = 64)
5:      end for
6:      pixel Spiking Encoder → spike sequence (T × 64)
7:    end for
8:    image Spiking Encoder → feature map (T × 64 × 384 × 384)
9:    Spike Feature Extractor → feature map(T × 1024 × 24 × 24)
10:     Fire rate → feature map (1024 × 24 × 24)
11:     Upsample (four Decoder Blocks) → feature map (64 × 384 × 384)
12:     1 × 1 convolution → feature map (21 × 384 × 384)
13:     calculate loss
14:     Do backpropagation and weight update
15: end for

3. Experiments and Results

3.1. Dataset and Training Parameters

The VOC2012 dataset contains 21 categories when the background is included. There are 2913 images covering the Person, Animal, Vehicle, and Indoor categories in the training and validation sets, and 1449 images in the test set. In the ground-truth images, the outline of each object is drawn in a class-specific color, such as red for motorcycles and green for people.
The hardware environment is as follows: two Intel(R) Xeon(R) E5-2620 v4 CPUs running at 2.10 GHz, 64 GB of memory, and two NVIDIA RTX 2070 Super GPUs with 16 GB of video memory. The software environment is as follows: Python 3.8, the deep learning framework PyTorch 1.8.0 [31], and the spiking neural network framework SpikingJelly [32]. The parameters used in the training process are shown in Table 2. Since a fixed learning rate makes the training loss fluctuate dramatically, a learning rate scheduler is applied after each optimizer update. The adjustment strategy used in this experiment is the cosine annealing scheduler.
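A sketch of the corresponding training configuration, assuming the values listed in Table 2 (Adam, initial learning rate 0.0005, batch size 8) and cosine annealing over the 300 training epochs; `model` is only a placeholder for the hybrid SFCNN defined above.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Conv2d(3, 21, kernel_size=1)      # placeholder for the hybrid SFCNN
criterion = nn.CrossEntropyLoss()            # pixel-wise cross-entropy over 21 classes
optimizer = optim.Adam(model.parameters(), lr=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)

for epoch in range(300):
    # forward pass, loss, backpropagation, and optimizer.step() for each batch ...
    scheduler.step()                         # anneal the learning rate once per epoch
```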

3.2. Quantifiers for the Testing Accuracy

The intersection over union (IoU) is the ratio of the intersection of the ground-truth and predicted pixel sets to their union:
\text{IoU} = \frac{\text{Intersection}(A, B)}{\text{Union}(A, B)}
where A represents the set of ground-truth pixels and B represents the set of predicted pixels. For multi-class tasks, the mean intersection over union (mIoU) is further employed, calculated as the average of the IoU over all classes.
\text{mIoU} = \frac{\text{sum of the IoU of each class}}{\text{number of classes}}
In addition, pixel accuracy is another index for image segmentation, representing the percentage of correctly classified pixels among all pixels in an image.
\text{Pixel accuracy} = \frac{\text{total pixels classified correctly}}{\text{total number of pixels}} = \frac{TP + TN}{TP + TN + FP + FN}
Precision, Recall, and F1 are also used to evaluate the performance of image segmentation.
\text{Precision} = \frac{TP}{TP + FP}
\text{Recall} = \frac{TP}{TP + FN}
F1 = \frac{2TP}{2TP + FN + FP}
TP (true positive) denotes a pixel that belongs to a certain category and is correctly predicted as that category. TN (true negative) denotes a pixel that does not belong to the category and is correctly predicted as not belonging to it. FP (false positive) denotes a pixel that is incorrectly predicted as belonging to the category. FN (false negative) denotes a pixel that belongs to the category but is incorrectly predicted as not belonging to it.
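All of the metrics above can be derived from a per-pixel confusion matrix. The sketch below assumes macro-averaging of the per-class precision, recall, and F1, which the paper does not state explicitly.

```python
import numpy as np

def confusion_matrix(pred, target, num_classes):
    """Per-pixel confusion matrix for integer label maps of the same shape."""
    mask = (target >= 0) & (target < num_classes)
    idx = num_classes * target[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def segmentation_metrics(cm):
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                 # predicted as the class, but wrong
    fn = cm.sum(axis=1) - tp                 # pixels of the class that were missed
    iou = tp / np.maximum(tp + fp + fn, 1)
    return {
        "mIoU": iou.mean(),
        "PixelAcc": tp.sum() / cm.sum(),     # fraction of correctly classified pixels
        "Precision": (tp / np.maximum(tp + fp, 1)).mean(),
        "Recall": (tp / np.maximum(tp + fn, 1)).mean(),
        "F1": (2 * tp / np.maximum(2 * tp + fp + fn, 1)).mean(),
    }
```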

3.3. Results

The loss during training is shown in Figure 7a. As the number of training epochs increases, the loss decreases; after 250 epochs, the rate of decrease slows gradually, and after 270 epochs the loss becomes stable. The values of mIoU and pixel accuracy (PixelAcc) are shown in Figure 7b,c. Both the mIoU and the PixelAcc increase with the number of training epochs. When training converges, the mIoU is stable at around 0.45 and the PixelAcc is stable at around 0.808. The learning rate is presented in Figure 7d.
VOC2012 and COCO2017 (Microsoft Common Objects in Context) are the most commonly used datasets for semantic segmentation, so we considered it necessary to test on them. Since our model is inspired by the U-Net architecture, which was proposed for processing medical images, we also used the DRIVE (Digital Retinal Images for Vessel Extraction) dataset. In addition, to further verify the validity of our model, we conducted tests on the Cityscapes dataset.
The mIoU and PixelAcc for different observation times T, different upsampling methods, and different surrogate functions are calculated and compared. For comparison, we also consider a conventional CNN. The results are presented in Table 3. For the conventional CNN, the mIoU reaches 0.498 and the PixelAcc reaches 0.781. For the hybrid SFCNN, the highest mIoU is 0.396 and the highest PixelAcc is 0.732.
To compare the influence of different parameters on the results, we consider different time steps T, upsampling methods, and surrogate gradient functions. Sequences 1 and 2 show the effect of different T: the longer the time window, the more accurate the result. However, the time step cannot be increased indefinitely; otherwise, the training time increases sharply and the gradient may even vanish. Sequences 1 and 3 show the effect of different upsampling methods: the upsampling approach performs better than transposed convolution. Comparing sequence 1 and sequence 4, which use different surrogate functions (softsign and sigmoid), we find that softsign performs better.
We reached the same conclusions on the DRIVE and COCO2017 datasets. The results on Cityscapes differ slightly, but the difference is only 0.006, which we consider reasonable.
The quality of our model depends heavily on the choice of time steps. If the time step is too small, the performance drops considerably; for example, with T = 2, the mIoU is 0.125. If the time step is too large, e.g., T = 8, the model becomes difficult to train.
The semantic segmentation results for some randomly selected samples are further presented in Figure 8. For samples (i), (m), and (n), which contain a large number of targets, our network still achieves a good segmentation effect. Of course, from the perspective of visualization, our network still needs significant improvement. For example, for the samples in Figure 8a,e, only some of the objects are recognized, and for the samples in Figure 8f,g, the horses are poorly recognized due to occlusion. We believe that a pre-trained model is very important, but the SNN field currently lacks an authoritative and effective pre-trained model.
To verify the robustness of the proposed hybrid SFCNN, we also conducted tests on the COCO2017 dataset, a large-scale object detection and segmentation dataset captured mainly from complex everyday scenes. It is the largest dataset for semantic segmentation so far, with 80 categories (excluding the background), more than 330,000 images, and over 1.5 million labeled object instances. The results are shown in Table 4. Therefore, we believe that the proposed hybrid SFCNN has practical application value in real life.
We also implement semantic segmentation on medical images using the DRIVE dataset, which was collected from 453 individuals aged 25 to 90 years, from which 40 images were randomly selected. Each image has a resolution of 565 × 584 pixels. As the number of images is small, simple data augmentation, including rotation and cropping, is applied during training to avoid overfitting. Note that the mask requires the same augmentation operations as the image. This dataset is commonly used to measure the performance of retinal vessel segmentation methods. As shown in Table 4, the PixelAcc reaches 0.963. The semantic segmentation results for some randomly selected samples are further presented in Figure 9; the samples are well segmented.
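A sketch of such paired augmentation, where the same random rotation and crop are applied to the retinal image and its vessel mask; the rotation range and the 384 × 384 crop size are our own assumptions for illustration, not values taken from the paper.

```python
import random
import torchvision.transforms.functional as TF

def paired_augment(image, mask, crop=384):
    """image, mask: tensors of shape (C, H, W); returns the augmented pair."""
    angle = random.uniform(-30.0, 30.0)             # identical rotation for both
    image = TF.rotate(image, angle)
    mask = TF.rotate(mask, angle)

    top = random.randint(0, image.shape[-2] - crop)
    left = random.randint(0, image.shape[-1] - crop)
    image = TF.crop(image, top, left, crop, crop)   # identical crop window
    mask = TF.crop(mask, top, left, crop, crop)
    return image, mask
```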
For the VOC2012 dataset, the mIoU of our proposed model is about 9% higher than that of FCN and about 30% higher than that of Spiking-FCN. We attribute this to two main factors: our model is a hybrid SNN, and our network is deeper. The trade-off is that our model requires more training time. We also compare it with classical CNNs [33,34]. Our model shows a decline in metrics compared with the CNN; however, it has an advantage in energy consumption. To further verify the validity of our model, we also compared the precision, recall, and F1 scores on VOC2012, COCO2017, DRIVE, and Cityscapes.
For statistical analysis, we calculated the standard deviations over five independent experiments on the VOC2012 dataset for both our model and the CNN; the results are presented in Table 4. No significant fluctuations are observed.
Finally, to verify the advantage of the SNN in forward propagation, we randomly selected an image from the test set to evaluate the energy consumption. Because our network is a hybrid network, we only calculated the energy consumption of the convolutional coding layer and the four-layer coding module during forward propagation; we expect the energy consumption of the decoding part to differ little between the two models. In addition, we ignore the energy consumed by memory accesses and other peripheral circuits, and focus only on multiply and add operations.
According to Ref. [27], calculating the energy consumption of an SNN requires the spike rate of each convolution layer, given by:
\text{Spike rate}(l) = \frac{\text{total spikes of layer } l \text{ over all time-steps}}{\text{number of neurons of layer } l}
In detail, we calculate the energy consumption of the ANN from the total number of floating-point operations (FLOPs) [35,36,37]. The energy consumption of the SNN is then estimated from the ANN FLOPs of each convolution layer scaled by that layer's spike rate. The entire calculation is based on 45 nm CMOS technology [38], as shown in Table 5.
The formula for calculating the number of FLOPs for convolution in ANN is as follows:
\text{FLOPs}_{\text{ANN}}(l) = k^2 \times O^2 \times C_{in} \times C_{out}
where k is the convolution kernel size, O is the output feature map size, and Cin and Cout are the input and output dimensions, respectively. The specific parameters can be found in Table 1.
The energy consumption in ANN is then calculated as:
E_{\text{ANN}} = \sum_{l} \text{FLOPs}_{\text{ANN}}(l) \times E_{MAC}
Thus, the total number of FLOPs in SNN can be obtained as
\text{FLOPs}_{\text{SNN}}(l) = \text{FLOPs}_{\text{ANN}}(l) \times \text{Spike rate}(l)
Since the SNN transmits binary spikes and therefore involves no multiplication, the energy consumption of the entire coding part can be obtained as:
E_{\text{SNN}} = \sum_{l} \text{FLOPs}_{\text{SNN}}(l) \times E_{AC}
The final calculated energy consumption is presented in Table 6. Obviously, the energy consumption of the SNN is much lower than that of the ANN.
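For reference, a small sketch of this estimate using the Table 5 energy constants; the layer shapes follow Table 1, while the per-layer spike rates would have to be measured at inference, so the values below are placeholders rather than the measured ones.

```python
E_MAC, E_AC = 4.6e-12, 0.9e-12          # 32-bit FP MAC / AC energy in joules (Table 5)

def conv_flops(k, o, c_in, c_out):
    """FLOPs of one convolution layer: k^2 * O^2 * C_in * C_out."""
    return k ** 2 * o ** 2 * c_in * c_out

# (k, O, C_in, C_out) for the spike encoder and the four encoder blocks (Table 1)
layers = [(3, 384, 3, 64), (3, 192, 64, 128), (3, 96, 128, 256),
          (3, 48, 256, 512), (3, 24, 512, 1024)]
spike_rates = [0.1, 0.1, 0.1, 0.1, 0.1]  # placeholder per-layer spike rates

e_ann = sum(conv_flops(*l) for l in layers) * E_MAC
e_snn = sum(conv_flops(*l) * r for l, r in zip(layers, spike_rates)) * E_AC
print(f"ANN: {e_ann:.4f} J, SNN: {e_snn:.4f} J")
```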

4. Conclusions

In summary, we proposed a hybrid spiking fully convolutional neural network to solve the semantic segmentation task. The surrogate gradient method was employed to address the difficulty of training SNNs directly. The effects of the observation time window length T, the upsampling method, and the surrogate function on the segmentation performance were examined. The experiments showed that the highest mIoU was 0.396 and the highest PixelAcc was 0.732 on the VOC2012 dataset, which is significantly better than the Spiking-FCN. Our network also performed well on other datasets: the mIoU was 0.421 and the PixelAcc was 0.769 on COCO2017, and the PixelAcc was 0.963 on DRIVE. Therefore, we believe that the proposed hybrid SFCNN has practical application value in real life.

Author Contributions

Conceptualization, S.X. and T.Z.; methodology, S.X. and T.Z.; software, T.Z.; validation, W.L. and T.Z.; formal analysis, S.X. and T.Z.; data curation, T.Z. and Y.H. (Yanan Han); writing—original draft preparation, T.Z. and S.X.; writing—review and editing, W.L. and X.G.; visualization, T.Z.; supervision, S.X. and Y.H. (Yue Hao); funding acquisition, S.X. and Y.H. (Yue Hao). All authors have read and agreed to the published version of the manuscript.

Funding

The National Key Research and Development Program of China (2021YFB2801900, 2021YFB2801901, 2021YFB2801902, 2021YFB2801904); the National Natural Science Foundation of China (No. 61974177, No. 61674119); the National Outstanding Youth Science Fund Project of National Natural Science Foundation of China (62022062); the Fundamental Research Funds for the Central Universities (QTZX23041).

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; Volume 10, pp. 3431–3440. [Google Scholar]
  2. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. Med. Image Comput. Comput.-Assist. Interv. 2015, 9351, 234–241. [Google Scholar]
  3. Chaurasia, A.; Culurciello, E. Linknet: Exploiting encoder representations for efficient semantic segmentation. In Proceedings of the 2017 IEEE Visual Communications and Image Processing (VCIP), St. Petersburg, FL, USA, 10–13 December 2017; pp. 1–4. [Google Scholar]
  4. Chen, L.C.; Papandreou, G.; Kokkinos, I. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 834–848. [Google Scholar] [CrossRef] [PubMed]
  5. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  6. Lu, Y.; Chen, Y.; Zhao, D.; Chen, J. Graph-FCN for Image Semantic Segmentation; Springer International Publishing: Cham, Switzerland, 2019; Volume 11554, pp. 97–105. [Google Scholar]
  7. Geng, Q.; Zhang, H.; Qi, X.; Huang, G.; Yang, R.; Zhou, Z. Gated path selection network for semantic segmentation. IEEE Trans. Image Process. 2021, 30, 2436–2449. [Google Scholar] [CrossRef]
  8. Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient processing of deep neural networks: A tutorial and survey. Proc. IEEE Inst. Electr. Electron. Eng. 2017, 105, 2295–2329. [Google Scholar] [CrossRef]
  9. Jegou, H.; Douze, M.; Schmid, C. Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 117–128. [Google Scholar] [CrossRef]
  10. Akopyan, F.; Sawada, J.; Cassidy, A.; Alvarez-Icaza, R.; Arthur, J.; Merolla, P.; Imam, N.; Nakamura, Y.; Datta, P.; Nam, G.J.; et al. Truenorth: Design and tool flow of a 65 mw 1 million neuron programmable neurosynaptic chip. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2015, 34, 1537–1557. [Google Scholar] [CrossRef]
  11. Davies, M.; Srinivasa, N.; Lin, T.; Chinya, G.; Cao, Y.; Choday, S.H.; Dimou, G.; Joshi, P.; Imam, N.; Jain, S.; et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro 2018, 38, 82–99. [Google Scholar] [CrossRef]
  12. Li, S.; Zhang, Z.; Mao, R.; Xiao, J.; Chang, L.; Zhou, J. A fast and energy-efficient SNN processor with adaptive clock/event-driven computation scheme and online learning. IEEE Trans. Circuits Syst. I Regul. Pap. 2021, 68, 1543–1552. [Google Scholar] [CrossRef]
  13. Roy, K.; Jaiswal, A.; Panda, P. Towards spike-based machine intelligence with neuromorphic computing. Nature 2019, 575, 607–617. [Google Scholar] [CrossRef]
  14. Huynh, P.K.; Varshika, M.L.; Paul, A.; Isik, M.; Balaji, A.; Das, A. Implementing spiking neural networks on neuromorphic architectures: A review. arXiv 2022, arXiv:2202.08897. [Google Scholar]
  15. Shastri, B.J.; Tait, A.N.; de Lima, T.F.; Pernice, W.H.P.; Bhaskaran, H.; Wright, C.D.; Prucnal, P.R. Photonics for artificial intelligence and neuromorphic computing. Nat. Photon. 2021, 15, 102–114. [Google Scholar] [CrossRef]
  16. Xiang, S.; Zhang, Y.; Gong, J.; Guo, X.; Lin, L.; Hao, Y. STDP-based unsupervised spike pattern learning in a photonic spiking neural network with VCSELs and VCSOAs. IEEE J. Quantum Electron. 2019, 25, 1–9. [Google Scholar] [CrossRef]
  17. Xiang, S.; Ren, Z.; Zhang, Y.; Song, Z. Training a multi-layer photonic spiking neural network with modified supervised learning algorithm based on photonic STDP. IEEE J. Quantum Electron. 2020, 27, 1–9. [Google Scholar] [CrossRef]
  18. Ferré, P.; Mamalet, F.; Thorpe, S.J. Unsupervised feature learning with winner-takes-all based stdp. Front. Comput. Neurosci. 2018, 12, 24. [Google Scholar] [CrossRef] [PubMed]
  19. Rueckauer, B.; Lungu, I.-A.; Hu, Y.; Pfeiffer, M.; Liu, S.-C. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Front. Neurosci. 2017, 11, 682. [Google Scholar] [CrossRef]
  20. Midya, R.; Wang, Z.; Asapu, S. Artificial neural network (ANN) to spiking neural network (SNN) converters based on diffusive memristors. Adv. Electron. Mater. 2019, 5, 1900060. [Google Scholar] [CrossRef]
  21. Ding, J.; Yu, Z.; Tian, Y.; Huang, T. Optimal ANN-SNN conversion for fast and accurate inference in deep spiking neural networks. arXiv 2021, arXiv:2105.11654. [Google Scholar]
  22. Bu, T.; Fang, W.; Ding, J.; Dai, P.; Yu, Z.; Huang, T. Optimal ANN-SNN conversion for high-accuracy and ultra-low-latency spiking neural networks. arXiv 2023, arXiv:2303.04347. [Google Scholar]
  23. Neftci, E.; Mostafa, H.; Zenke, F. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Process. Mag. 2019, 36, 51–63. [Google Scholar] [CrossRef]
  24. Safa, A.; Catthoor, F.; Gielen, G. Convsnn: A surrogate gradient spiking neural framework for radar gesture recognition. Softw. Impacts 2021, 10, 100131. [Google Scholar] [CrossRef]
  25. Zenke, F.; Vogels, T. The remarkable robustness of surrogate gradient learning for instilling complex function in spiking neural networks. Neural Comput. 2021, 33, 899–925. [Google Scholar] [CrossRef] [PubMed]
  26. Kim, S.; Park, S.; Na, B.; Yoon, S. Spiking-yolo: Spiking neural network for energy-efficient object detection. Proc. AAAI Conf. Artif. Intell. 2020, 34, 11270–11277. [Google Scholar] [CrossRef]
  27. Kim, Y.; Chough, J.; Panda, P. Beyond classification: Directly training spiking neural networks for semantic segmentation. Neuromorphic Comput. Eng. 2022, 2, 044015. [Google Scholar] [CrossRef]
  28. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  29. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Zitnick, C.L.; Dollár, P. Microsoft coco: Common objects in context. arXiv 2014, arXiv:1405.0312, 740–755. [Google Scholar]
  30. Al-Rawi, M.; Qutaishat, M.; Arrar, M. An improved matched filter for blood vessel detection of digital retinal images. Comput. Biol. Med. 2007, 37, 262–267. [Google Scholar] [CrossRef]
  31. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 2019, 32, 8024–8035. [Google Scholar]
  32. Github. Available online: https://github.com/fangwei123456/spikingjelly (accessed on 17 December 2019).
  33. Yuan, Q.; Chen, K.; Yu, Y.; Le, N.Q.K.; Chua, M.C.H. Prediction of anticancer peptides based on an ensemble model of deep learning and machine learning using ordinal positional encoding. Brief. Bioinform. 2023, 24, 630. [Google Scholar] [CrossRef]
  34. Kha, Q.H.; Ho, Q.T.; Le, N.Q.K. Identifying SNARE Proteins Using an Alignment-Free Method Based on Multiscan Convolutional Neural Network and PSSM Profiles. J. Chem. Inf. Model. 2022, 62, 4820–4826. [Google Scholar] [CrossRef]
  35. Lee, J.; Delbruck, T.; Pfeiffer, M. Training deep spiking neural networks using backpropagation. Front. Neurosci. 2016, 10, 508. [Google Scholar] [CrossRef] [PubMed]
  36. Park, S.; Kim, S.; Na, B.; Yoon, S. T2FSNN: Deep spiking neural networks with time-to-first-spike coding. In Proceedings of the 2020 57th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 20–24 July 2020; pp. 1–6. [Google Scholar]
  37. Rathi, N.; Roy, K. Diet-SNN: Direct input encoding with leakage and threshold optimization in deep spiking neural networks. arXiv 2020, arXiv:2008.03658. [Google Scholar]
  38. Horowitz, M. 1.1 Computing’s energy problem (and what we can do about it). In Proceedings of the 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), San Francisco, CA, USA, 9–13 February 2014; pp. 10–14. [Google Scholar]
Figure 1. Overall block diagram of the hybrid SFCN network.
Figure 2. Schematic diagram of spike encoding over time window of T.
Figure 3. Schematic diagram of (a) encoder block and (b) decoder block.
Figure 4. Principle of upsampling (a) and an example (b).
Figure 5. Principle of deconvolution. (a) Feature map of size 2 × 2; (b) insert zeros; (c) padding; (d) convolution. The red squares represent the part of the feature map to be convolved, the convolution kernel and the result after convolution.
Figure 6. (a) Softsign function and (b) corresponding gradient.
Figure 7. (a) Training loss, (b) mIoU, (c) Pixel accuracy, and (d) learning rate during the training process.
Figure 8. Semantic segmentation results for some randomly selected samples (a–n) from the VOC2012 dataset.
Figure 9. Semantic segmentation results for some randomly selected samples from the DRIVE dataset. The left column is the original image, the middle column is the ground-truth mask, and the right column is the predicted output.
Table 1. Training parameters of the encoder block.
Layer | Param | Feature Map Size
Conv2d | 1792 | B, 64, 384, 384
IFNode | 0 | T, B, 64, 384, 384
MaxPool2d | 0 | T, B, 64, 192, 192
Conv2d | 73,856 | T, B, 128, 192, 192
IFNode | 0 | T, B, 128, 192, 192
MaxPool2d | 0 | T, B, 128, 96, 96
Conv2d | 295,168 | T, B, 256, 96, 96
IFNode | 0 | T, B, 256, 96, 96
MaxPool2d | 0 | T, B, 256, 48, 48
Conv2d | 1,180,160 | T, B, 512, 48, 48
IFNode | 0 | T, B, 512, 48, 48
MaxPool2d | 0 | T, B, 512, 24, 24
Conv2d | 4,719,616 | T, B, 1024, 24, 24
IFNode | 0 | T, B, 1024, 24, 24
Table 2. Parameters used for training.
Parameter | Description | Value
T | Time window | 6 ms
dt | Simulation time step | 1 ms
Vth | Membrane voltage threshold | 1 mV
opt | Optimizer | Adam
r | Initial learning rate | 0.0005
bs | Batch size | 8
Lr_sch | Lr_scheduler | Cos
Table 3. Values of mIoU and Pixel accuracy for different experiment conditions.
Seq | Compare Experiments | Dataset | mIoU | Pixel Acc
1 | T = 6, upsample = Upsampling, surrogate_function = SoftSign | VOC2012 | 0.397 | 0.737
2 | T = 4, upsample = Upsampling, surrogate_function = SoftSign | VOC2012 | 0.319 | 0.708
3 | T = 6, upsample = ConvTransposed, surrogate_function = SoftSign | VOC2012 | 0.377 | 0.729
4 | T = 6, upsample = Upsampling, surrogate_function = Sigmoid | VOC2012 | 0.375 | 0.727
5 | T = 6, upsample = Upsampling, surrogate_function = SoftSign | COCO2017 | 0.421 | 0.796
6 | T = 4, upsample = Upsampling, surrogate_function = SoftSign | COCO2017 | 0.401 | 0.708
7 | T = 6, upsample = ConvTransposed, surrogate_function = SoftSign | COCO2017 | 0.396 | 0.698
8 | T = 6, upsample = Upsampling, surrogate_function = Sigmoid | COCO2017 | 0.413 | 0.736
9 | T = 6, upsample = Upsampling, surrogate_function = SoftSign | DRIVE | 0.397 | 0.737
10 | T = 4, upsample = Upsampling, surrogate_function = SoftSign | DRIVE | 0.319 | 0.708
11 | T = 6, upsample = ConvTransposed, surrogate_function = SoftSign | DRIVE | 0.377 | 0.729
12 | T = 6, upsample = Upsampling, surrogate_function = Sigmoid | DRIVE | 0.375 | 0.727
13 | T = 6, upsample = Upsampling, surrogate_function = SoftSign | Cityscapes | 0.602 | 0.617
14 | T = 4, upsample = Upsampling, surrogate_function = SoftSign | Cityscapes | 0.608 | 0.616
15 | T = 6, upsample = ConvTransposed, surrogate_function = SoftSign | Cityscapes | 0.526 | 0.561
16 | T = 6, upsample = Upsampling, surrogate_function = Sigmoid | Cityscapes | 0.603 | 0.616
Table 4. Values of mIoU and Pixel accuracy for different datasets. The corresponding results for other networks are included for the purpose of comparison. T = 6, upsample = Upsampling, surrogate_function = SoftSign.
Method | Dataset | mIoU | PixelAcc | Precision | Recall | F1
FCN [1] | VOC2012 | 0.309 | - | - | - | -
DeepLab [4] | VOC2012 | 0.323 | - | - | - | -
Spiking-FCN [27] | VOC2012 | 0.099 | - | - | - | -
Spiking-DeepLab [27] | VOC2012 | 0.223 | - | - | - | -
CNN | VOC2012 | 0.494 ± 0.003 | 0.782 ± 0.004 | 0.300 | 0.234 | 0.236
ours | VOC2012 | 0.391 ± 0.007 | 0.737 ± 0.005 | 0.295 | 0.253 | 0.235
CNN | COCO2017 | 0.521 | 0.823 | 0.356 | 0.313 | 0.343
ours | COCO2017 | 0.421 | 0.769 | 0.348 | 0.291 | 0.283
CNN | DRIVE | - | 0.986 | 0.922 | 0.920 | 0.921
ours | DRIVE | - | 0.967 | 0.796 | 0.845 | 0.823
CNN | Cityscapes | 0.652 | 0.637 | 0.932 | 0.423 | 0.582
ours | Cityscapes | 0.602 | 0.617 | 0.921 | 0.413 | 0.566
Table 5. Energy consumption for the 45 nm CMOS process.
Operation | Energy (pJ)
32 bit FP MULT | 3.7
32 bit FP ADD | 0.9
32 bit FP MAC | 4.6
32 bit FP AC | 0.9
Table 6. Energy consumption comparison.
Method | Layer | Energy of Every Layer (J) | Energy (J)
ANN | Encoder Block 0 | 0.0117209 | 0.5118161 (total)
ANN | Encoder Block 1 | 0.1250238 |
ANN | Encoder Block 2 | 0.1250238 |
ANN | Encoder Block 3 | 0.1250238 |
ANN | Encoder Block 4 | 0.1250238 |
SNN | Spike Encoder | 0.0001753 | 0.0255342 (total)
SNN | Encoder Block 1 | 0.0006791 |
SNN | Encoder Block 2 | 0.0069469 |
SNN | Encoder Block 3 | 0.0120946 |
SNN | Encoder Block 4 | 0.0056383 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
