Memory-Efficient Feature Merging for Residual Connections with Layer-Centric Tile Fusion
Abstract
1. Introduction
- Layer-centric tile fusion (LCTF) reduces overlaps through receptive field alignment and integrates residual connections via feature merging, thereby eliminating redundant data loading.
- A reuse-distance-aware caching strategy provides flexible and optimized on-chip storage by prioritizing data types with shorter reuse distances. A systematic modeling framework is developed to analyze the impact of various design choices on memory efficiency.
- Experimental results on ResNet-18 and SRGAN demonstrate that the proposed methods achieve 5.04–43.44% reduction in energy-delay product (EDP) and 20.28–58.33% reduction in on-chip memory usage compared to state-of-the-art methods.
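The reuse-distance-aware caching idea in the highlights can be sketched as a greedy priority scheme: data types with shorter reuse distances are granted on-chip storage first, and whatever does not fit spills to external memory. The data types, sizes, and reuse distances below are illustrative assumptions, not the paper's actual configuration.

```python
def allocate_on_chip(data_types, capacity_kb):
    """Greedily assign on-chip storage by ascending reuse distance.

    data_types: list of (name, size_kb, reuse_distance) tuples.
    Returns the names that fit on-chip; the rest would be
    reloaded from external memory.
    """
    on_chip, used = [], 0
    for name, size_kb, _dist in sorted(data_types, key=lambda d: d[2]):
        if used + size_kb <= capacity_kb:
            on_chip.append(name)
            used += size_kb
    return on_chip

# Hypothetical candidates (sizes in KB, reuse distance in tiles):
candidates = [
    ("tile overlap rows", 48, 1),   # reused by the very next tile
    ("residual features", 96, 4),   # reused when the residual branch merges
    ("weights", 512, 16),           # reused once per full feature map
]
print(allocate_on_chip(candidates, capacity_kb=200))
# the short-reuse-distance data fits on-chip; weights spill off-chip
```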
2. Proposed Method
2.1. Layer-Centric Tile Fusion
2.2. Feature Merging for Residual Connections
3. Memory-Efficient Modeling Framework
3.1. Reuse-Distance-Aware Caching Strategy
3.2. Modeling Framework Architecture
4. Evaluation
4.1. Experiment Setup
4.2. Design Space Exploration
4.3. State-of-the-Art Comparison
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
| --- | --- |
| CNN | Convolutional Neural Network |
| EMA | External Memory Access |
| EDP | Energy-Delay Product |
| LCTF | Layer-Centric Tile Fusion |
| RDA | Reuse-Distance-Aware Caching Strategy |
| MAC | Multiply-Accumulate |
| SotA | State-of-the-Art |

References
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
- Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307. [Google Scholar] [CrossRef]
- Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407. [Google Scholar]
- Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138. [Google Scholar] [CrossRef]
- Chen, Y.; Luo, T.; Liu, S.; Zhang, S.; He, L.; Wang, J.; Li, L.; Chen, T.; Xu, Z.; Sun, N.; et al. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, 13–17 December 2014; pp. 609–622. [Google Scholar] [CrossRef]
- Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. SIGARCH Comput. Archit. News 2017, 45, 1–12. [Google Scholar] [CrossRef]
- Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Republic of Korea, 18–22 June 2016; pp. 243–254. [Google Scholar] [CrossRef]
- Alwani, M.; Chen, H.; Ferdman, M.; Milder, P. Fused-layer CNN accelerators. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15–19 October 2016; pp. 1–12. [Google Scholar] [CrossRef]
- Li, Z.; Kim, S.; Im, D.; Han, D.; Yoo, H.J. An Efficient Deep-Learning-Based Super-Resolution Accelerating SoC with Heterogeneous Accelerating and Hierarchical Cache. IEEE J. Solid-State Circuits 2023, 58, 614–623. [Google Scholar] [CrossRef]
- Goetschalckx, K.; Verhelst, M. Breaking High-Resolution CNN Bandwidth Barriers with Enhanced Depth-First Execution. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 323–331. [Google Scholar] [CrossRef]
- Goetschalckx, K.; Wu, F.; Verhelst, M. DepFiN: A 12-nm Depth-First, High-Resolution CNN Processor for IO-Efficient Inference. IEEE J. Solid-State Circuits 2023, 58, 1425–1435. [Google Scholar] [CrossRef]
- Colleman, S.; Verhelst, M. High-Utilization, High-Flexibility Depth-First CNN Coprocessor for Image Pixel Processing on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 461–471. [Google Scholar] [CrossRef]
- Lee, J.; Lee, J.; Yoo, H.J. SRNPU: An Energy-Efficient CNN-Based Super-Resolution Processor with Tile-Based Selective Super-Resolution in Mobile Devices. IEEE J. Emerg. Sel. Top. Circuits Syst. 2020, 10, 320–334. [Google Scholar] [CrossRef]
- Huang, C.T.; Ding, Y.C.; Wang, H.C.; Weng, C.W.; Lin, K.P.; Wang, L.W.; Chen, L.D. eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; MICRO ’52. pp. 182–195. [Google Scholar] [CrossRef]
- Lee, J.; Shin, D.; Lee, J.; Lee, J.; Kang, S.; Yoo, H.J. A Full HD 60 fps CNN Super Resolution Processor with Selective Caching based Layer Fusion for Mobile Devices. In Proceedings of the 2019 Symposium on VLSI Circuits, Kyoto, Japan, 9–14 June 2019; pp. C302–C303. [Google Scholar] [CrossRef]
- Mo, H.; Zhu, W.; Hu, W.; Wang, G.; Li, Q.; Li, A.; Yin, S.; Wei, S.; Liu, L. 9.2 A 28nm 12.1TOPS/W Dual-Mode CNN Processor Using Effective-Weight-Based Convolution and Error-Compensation-Based Prediction. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; Volume 64, pp. 146–148. [Google Scholar] [CrossRef]
- Ding, Y.C.; Lin, K.P.; Weng, C.W.; Wang, L.W.; Wang, H.C.; Lin, C.Y.; Chen, Y.T.; Huang, C.T. A 4.6-8.3 TOPS/W 1.2-4.9 TOPS CNN-based Computational Imaging Processor with Overlapped Stripe Inference Achieving 4K Ultra-HD 30fps. In Proceedings of the ESSCIRC 2022—IEEE 48th European Solid State Circuits Conference (ESSCIRC), Milan, Italy, 19–22 September 2022; pp. 81–84. [Google Scholar] [CrossRef]
- Ma, Y.; Kim, M.; Cao, Y.; Vrudhula, S.; Seo, J.s. End-to-end scalable FPGA accelerator for deep residual networks. In Proceedings of the 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Baltimore, MD, USA, 28–31 May 2017; pp. 1–4. [Google Scholar] [CrossRef]
- Shi, M.; Houshmand, P.; Mei, L.; Verhelst, M. Hardware-Efficient Residual Neural Network Execution in Line-Buffer Depth-First Processing. IEEE J. Emerg. Sel. Top. Circuits Syst. 2021, 11, 690–700. [Google Scholar] [CrossRef]
- Wan, W.; Kubendran, R.; Schaefer, C.; Eryilmaz, S.B.; Zhang, W.; Wu, D.; Deiss, S.; Raina, P.; Qian, H.; Gao, B.; et al. A compute-in-memory chip based on resistive random-access memory. Nature 2022, 608, 504–512. [Google Scholar] [CrossRef] [PubMed]
- Wang, Z.; Nalla, P.S.; Krishnan, G.; Joshi, R.V.; Cady, N.C.; Fan, D.; Seo, J.s.; Cao, Y. Digital-Assisted Analog In-Memory Computing with RRAM Devices. In Proceedings of the 2023 International VLSI Symposium on Technology, Systems and Applications (VLSI-TSA/VLSI-DAT), Hsinchu, Taiwan, 17–20 April 2023; pp. 1–4. [Google Scholar] [CrossRef]
- Vignali, R.; Zurla, R.; Pasotti, M.; Rolandi, P.L.; Singh, A.; Gallo, M.L.; Sebastian, A.; Jang, T.; Antolini, A.; Scarselli, E.F.; et al. Designing Circuits for AiMC Based on Non-Volatile Memories: A Tutorial Brief on Trade-Off and Strategies for ADCs and DACs Co-Design. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 1650–1655. [Google Scholar] [CrossRef]
- Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
- Bhardwaj, K.; Milosavljevic, M.; O'Neil, L.; Gope, D.; Matas, R.; Chalfin, A.; Suda, N.; Meng, L.; Loh, D. Collapsible Linear Blocks for Super-Efficient Super Resolution. Proc. Mach. Learn. Syst. 2022, 4, 529–547. [Google Scholar]
- Huang, C.T. ERNet Family: Hardware-Oriented CNN Models for Computational Imaging Using Block-Based Inference. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1643–1647. [Google Scholar] [CrossRef]
- Mei, L.; Goetschalckx, K.; Symons, A.; Verhelst, M. DeFiNES: Enabling Fast Exploration of the Depth-First Scheduling Space for DNN Accelerators Through Analytical Modeling. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Montreal, QC, Canada, 25 February–1 March 2023; pp. 570–583. [Google Scholar] [CrossRef]
| Methods | Wout | Hout | Overlap (Sizeolp) Management | Tile Position Shift |
| --- | --- | --- | --- | --- |
| Line-buffer-based | Configurable | Fixed (1) | Stored on-chip | Yes |
| Pyramid-based | Configurable | Configurable | Stored on-chip / recomputed / reloaded from off-chip memory | No |
| Proposed LCTF | Configurable | Configurable | Flexible storage | Yes |
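The overlap size (Sizeolp) that the table's methods must manage grows with the number of fused layers, following the standard receptive-field recurrence: each layer adds (kernel − 1) rows, scaled by the product of the strides of the layers before it. The layer parameters below are hypothetical; this is the generic recurrence, not the paper's exact model.

```python
def overlap_rows(layers):
    """Rows of overlap a tile needs so that fused layers produce
    exact (non-approximated) outputs at tile borders.

    layers: list of (kernel_size, stride) pairs, in execution order.
    """
    olp, jump = 0, 1
    for k, s in layers:
        olp += (k - 1) * jump  # each layer widens the receptive field
        jump *= s              # stride compounds the step between outputs
    return olp

# Three hypothetical fused 3x3, stride-1 convolution layers:
print(overlap_rows([(3, 1), (3, 1), (3, 1)]))  # -> 6 rows of overlap
```

The linear growth shown here is why fusing more layers makes overlap management (store, recompute, or reload) the dominant design choice in the table above.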
| | Type 0 | Type 1 | Type 2 | Type 3 | Type 4 | Type 5 | Type 6 | Type 7 | Type 8 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Padding | Top and left side | Top only | Top and right side | Left only | – | Right only | Bottom and left side | Bottom only | Bottom and right side |
| Wolp | Produce | Produce and Consume | Consume | Produce | Produce and Consume | Consume | Produce | Produce and Consume | Consume |
| Holp | Produce | Produce | Produce | Produce and Consume | Produce and Consume | Produce and Consume | Consume | Consume | Consume |
| Tile Width Variation | Decrease | Consistent | Increase | Decrease | Consistent | Increase | Decrease | Consistent | Increase |
| Tile Height Variation | Decrease | Decrease | Decrease | Consistent | Consistent | Consistent | Increase | Increase | Increase |
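The nine types in the table above follow a regular pattern: a tile's type is determined entirely by its row and column position in the output tile grid ({top, middle, bottom} × {left, middle, right}, row-major). A minimal sketch of that classification; the function name and encoding are ours, for illustration only.

```python
def tile_type(row, col, n_rows, n_cols):
    """Map a tile's grid position to Type 0-8.

    Rows map to {0: top, 1: middle, 2: bottom} and columns to
    {0: left, 1: middle, 2: right}; the type is 3 * row_class + col_class.
    """
    r = 0 if row == 0 else (2 if row == n_rows - 1 else 1)
    c = 0 if col == 0 else (2 if col == n_cols - 1 else 1)
    return 3 * r + c

# In a 4x4 tile grid, matching the table's layout:
print(tile_type(0, 0, 4, 4))  # -> 0 (top-left: padding on top and left)
print(tile_type(1, 2, 4, 4))  # -> 4 (interior tile, no padding)
print(tile_type(3, 3, 4, 4))  # -> 8 (bottom-right)
```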
| Network | Works | EDP Reduction (Under Same Memory) | Memory Reduction (Under Same EDP) |
| --- | --- | --- | --- |
| ResNet-18 | ISSCC'21 [17] | 19.41% | 58.33% |
| ResNet-18 | ESSCIRC'22 [18] | 32.23% | 57.89% |
| ResNet-18 | ISCAS'17 [19] | 43.44% | / |
| SRGAN | ISSCC'21 [17] | 5.04% | 20.28% |
| SRGAN | ESSCIRC'22 [18] | 22.30% | / |
| SRGAN | ISCAS'17 [19] | 40.29% | / |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Zhang, H.; He, J.; Gui, Y.; Peng, S.; Huang, L.; Yan, X.; Fan, Y. Memory-Efficient Feature Merging for Residual Connections with Layer-Centric Tile Fusion. Electronics 2025, 14, 3269. https://doi.org/10.3390/electronics14163269