
Memory-Efficient Feature Merging for Residual Connections with Layer-Centric Tile Fusion

1 The State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai 200433, China
2 The Institute of Microelectronic Circuits and Systems, East China Normal University, Shanghai 200062, China
3 The School of Electronic Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710061, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(16), 3269; https://doi.org/10.3390/electronics14163269
Submission received: 23 July 2025 / Revised: 10 August 2025 / Accepted: 14 August 2025 / Published: 18 August 2025
(This article belongs to the Special Issue Research on Key Technologies for Hardware Acceleration)

Abstract

Convolutional neural networks (CNNs) have achieved remarkable success in computer vision tasks, driving the rapid development of hardware accelerators. However, memory efficiency remains a key challenge, as conventional accelerators adopt layer-by-layer processing, leading to frequent external memory accesses (EMAs) of intermediate feature data, which increase energy consumption and latency. While layer fusion has been proposed to enhance inter-layer feature reuse, existing approaches typically rely on fixed data management tailored to specific architectures, introducing on-chip memory overhead and requiring trade-offs with EMAs. Moreover, prevalent residual connections further weaken fusion benefits due to diverse data reuse distances. To address these challenges, we propose layer-centric tile fusion, which integrates residual data loading with feature merging by leveraging receptive field relationships among feature tiles. A reuse distance-aware caching strategy is introduced to support flexible storage for various data types. We also develop a modeling framework to analyze the trade-off between on-chip memory usage and EMA-induced energy-delay product (EDP). Experimental results demonstrate that our method achieves 5.04–43.44% EDP reduction and 20.28–58.33% memory usage reduction compared to state-of-the-art designs on ResNet-18 and SRGAN.

1. Introduction

Convolutional neural networks (CNNs) have become a cornerstone of computer vision tasks [1,2,3,4]. However, their complex architectures and intensive computations impose considerable demands on hardware memory systems [5,6,7,8]. On the one hand, the overhead introduced by frequent off-chip feature data transfers presents a major challenge. These external memory accesses (EMAs) not only lead to significant power consumption but also suffer from limited memory bandwidth, resulting in increased latency. On the other hand, on-chip memory resources are also under pressure. With increasing image resolutions and growing network depth, large volumes of intermediate data must be accessed with low latency, thereby significantly increasing on-chip memory requirements.
Prior research on CNN accelerators has revealed that layer fusion effectively improves memory efficiency by optimizing EMA [9,10]. The core idea involves leveraging on-chip memory to reuse feature data and fuse computations from adjacent layers into unified processing stacks [9]. Within each stack, feature maps are typically partitioned into tiles, enabling the reuse of output tile data as inputs for subsequent layers without intermediate off-chip transfers. Since each tile is processed independently, additional features must be loaded from adjacent tiles at boundaries to ensure complete receptive field coverage for each layer; these extra data are referred to as overlaps. Different strategies for managing tiles and overlaps have resulted in two categories of fusion approaches: line-buffer-based methods [10,11,12,13] and pyramid-based methods [9,14,15,16].
As illustrated in Figure 1a, line-buffer-based methods constrain tile height to a single row. Consequently, multiple feature rows must be cached as overlaps per layer, incurring significant memory overhead when processing high-resolution images. Furthermore, generating only a single output row per operation limits computational parallelism and restricts the exploration of memory impacts from different tile dimensions. In contrast, pyramid-based methods depicted in Figure 1b utilize variable tile dimensions. However, their overlap sizes are influenced by the cumulative receptive fields of all fused layers, rendering them inherently stack-centric. As stack depth increases, the progressively enlarged overlaps limit fusion capability and increase storage requirements. These constraints highlight the need for a more flexible and memory-efficient data management method for layer fusion.
Moreover, the prevalent use of residual connections in CNNs diminishes the advantages of layer fusion due to varying data reuse distances. Layer fusion facilitates the reuse of features between adjacent layers; however, residual connections typically span multiple layers, necessitating either increased on-chip memory usage or additional off-chip memory accesses [17,18,19]. Shiman et al. [20] proposed a hybrid approach to optimize residual connections within line-buffer-based layer fusion. Nevertheless, their analysis lacks a thorough examination of the relationship between the receptive fields of residual data and layer fusion. Furthermore, an in-depth exploration of different design trade-offs for memory efficiency in layer fusion scenarios involving residual connections remains necessary.
Beyond accelerators built on conventional memory technologies, another line of research investigates resistive memory for in-memory computing, where analog vector–matrix multiplication is performed directly within memory arrays to minimize data movement. Recent prototypes, such as NeuRRAM [21], have demonstrated competitive accuracy and energy efficiency through coordinated optimization across device, circuit, and architecture levels. At the circuit level, hybrid and digitally assisted designs address device non-idealities and mitigate converter overhead by integrating analog crossbars with calibrated digital paths, thereby improving the accuracy–efficiency trade-off [22]. Additionally, recent studies emphasize that the design of analog-to-digital and digital-to-analog converters significantly impacts system-level efficiency, with precision, energy consumption, and variability tolerance closely tied to dataflow characteristics [23]. In contrast, our work focuses on a complementary aspect of network acceleration by enhancing data reuse and memory scheduling in conventional architectures. It leverages tile-level receptive field analysis and a dynamic caching strategy to reduce off-chip data transfers and align reuse across layers. The proposed method thus complements resistive memory approaches by relieving bandwidth bottlenecks and offering tiling patterns that remain applicable when resistive arrays perform core computation while digital circuits handle control and accumulation.
Based on the above analysis, this paper makes the following contributions:
  • Layer-centric tile fusion (LCTF) reduces overlaps through receptive field alignment and integrates residual connections via feature merging, thereby eliminating redundant data loading.
  • A reuse-distance-aware caching strategy provides flexible and optimized on-chip storage by prioritizing data types with shorter reuse distances. A systematic modeling framework is developed to analyze the impact of various design choices on memory efficiency.
  • Experimental results on ResNet-18 and SRGAN demonstrate that the proposed methods achieve 5.04–43.44% reduction in energy-delay product (EDP) and 20.28–58.33% reduction in on-chip memory usage compared to state-of-the-art methods.
The remainder of this paper is structured as follows: Section 2 introduces the layer-centric tile fusion and the feature merging approach for optimizing residual data loading. Section 3 details the reuse-distance-aware caching strategy, and further describes the modeling framework for evaluating memory efficiency. Section 4 quantitatively analyzes the proposed methods’ performance regarding energy, delay, and on-chip memory usage. Finally, Section 5 concludes the paper.

2. Proposed Method

2.1. Layer-Centric Tile Fusion

As depicted in Figure 1b, under stack-centric methods the receptive field of the first layer's input tile is jointly determined by the overlaps of intermediate layers and the newly generated output tile. We propose a layer-centric approach that minimizes overlap size through layer receptive field alignment, where the input tile's dimension is governed solely by the current layer's newly generated output tile. Configurable-sized tiles are adopted as the basic unit. For a single-layer convolution with kernel size $K \times K$, input tile dimension (width or height) $W_{in}$ ($H_{in}$), overlap size $Size_{olp}$, and padding $P$ around the feature map, the corresponding output tile dimension $W_{out}$ ($H_{out}$) follows from the principles of convolution:
$$W_{out}(H_{out}) = W_{in}(H_{in}) + Size_{olp} - (K - 1) + P$$
In LCTF, $Size_{olp}$ is set to its minimum value of $K - 1$. As shown in Figure 2, this ensures that each layer's overlap size is fixed and depends only on the current layer's receptive field, independent of the number of fused layers.
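As a quick sanity check, the relation above can be expressed in a few lines of Python; the function and argument names are illustrative, not taken from the authors' implementation.

```python
def output_dim(dim_in: int, k: int, size_olp: int, padding: int = 0) -> int:
    """Output tile width/height of one K x K convolution at stride 1."""
    return dim_in + size_olp - (k - 1) + padding

# With the LCTF minimum overlap Size_olp = K - 1 and no padding,
# the output tile keeps the input tile's dimension:
assert output_dim(16, k=3, size_olp=2) == 16
```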
LCTF manages two types of overlap. As shown in the orange portion of Figure 2, Wolp, the overlap in the width dimension, is supplemented on the left side of the input tile. Relative to the tile currently being processed, it is produced while processing the tile to its left, one tile processing cycle earlier. Wolp is released from memory once it has been reused in the layer's computation. Therefore, the reuse distance of Wolp, i.e., the period from data production to consumption, equals one tile processing cycle. The blue portion of Figure 2 depicts Holp (the overlap in the height dimension), which is supplemented above the input tile and generated during the computation of the tile above. Assuming the feature map is partitioned into t tiles along the width dimension, Holp is consumed t processing cycles after its generation; its reuse distance thus equals t tile processing cycles, substantially exceeding that of Wolp. Note that both types of overlap are supplemented on the left and upper sides of the tile, which assumes a processing sequence that proceeds left to right, then top to bottom. The proposed method also supports altering this sequence, which is equivalent to rotating the feature map by 90 degrees and interchanging the reuse distances of Holp and Wolp; since this introduces no fundamental differences, we do not analyze this variant further.
Supplementing layer-centric overlaps shifts each layer's tile position upward and leftward by $(K-1)/2$. As illustrated in the right portion of Figure 2, for a tile with its top-left corner at $(h, w)$, passing through a layer with kernel size 3 shifts this position to $(h-1, w-1)$ in the next layer. This coordinate shift uniformly affects the relative position of every point within the tile on the feature map.
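The cumulative shift over a fused stack follows directly from the per-layer shift; below is a minimal sketch, assuming odd kernel sizes and stride 1 (names are ours, for illustration only).

```python
def tile_corner_after_stack(h: int, w: int, kernel_sizes: list[int]) -> tuple[int, int]:
    """Top-left corner of a tile after passing all layers of a fused stack."""
    shift = sum((k - 1) // 2 for k in kernel_sizes)  # (K-1)/2 per layer
    return h - shift, w - shift

# A tile whose corner starts at (8, 8) ends at (6, 6) after two 3x3 layers:
assert tile_corner_after_stack(8, 8, [3, 3]) == (6, 6)
```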
Due to varying padding and overlap conditions at the edges, an image can be divided into nine tile types, each exhibiting different dimension changes and positional offsets. As shown in Figure 3, tiles on the left side (Types 0, 3, 6) include padding, producing Wolp without consumption, causing the tile width to decrease across layers. Middle-positioned tiles (Types 1, 4, 7) only consume Wolp, thus maintaining a constant width. Right-side tiles (Types 2, 5, 8) incorporate both padding and Wolp, leading to an increase in width with each layer. A similar pattern applies vertically regarding tile height and Holp.
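Since the type index depends only on the tile's position within the tiling grid, classification is mechanical. The following sketch assumes the row-major type numbering of Figure 3 (left column = Types 0, 3, 6; middle column = Types 1, 4, 7; right column = Types 2, 5, 8); it is an illustration, not the paper's code.

```python
def tile_type(row: int, col: int, n_rows: int, n_cols: int) -> int:
    """Type 0-8: row band {top, middle, bottom} x column band {left, middle, right}."""
    r = 0 if row == 0 else (2 if row == n_rows - 1 else 1)
    c = 0 if col == 0 else (2 if col == n_cols - 1 else 1)
    return 3 * r + c

assert tile_type(0, 0, 4, 5) == 0   # top-left corner  -> Type 0
assert tile_type(1, 2, 4, 5) == 4   # interior tile    -> Type 4
assert tile_type(3, 4, 4, 5) == 8   # bottom-right     -> Type 8
```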
Table 1 summarizes the characteristics of various layer fusion methods concerning tile dimensions, overlap size, overlap data management, and tile position. The line-buffer-based layer fusion method allows configurability solely in tile width, while the proposed LCTF and pyramid-based methods enable configurability in both width and height dimensions. $Size_{olp}$, $W_{out}$, and $H_{out}$ collectively determine the input tile dimension. LCTF and line-buffer-based methods both ensure a fixed minimum $Size_{olp}$ per layer, whereas the stack-centric nature of pyramid-based methods results in cumulative increases in both $Size_{olp}$ and input tile dimensions as layers are fused. The line-buffer-based methods are essentially LCTF with $H_{out}$ fixed to one and all overlaps stored on-chip. In contrast, our method enables configurable tile sizes and flexible overlap management, as detailed in Section 3.1. Lastly, owing to their inherently layer-centric characteristics, both LCTF and line-buffer-based methods incur positional shifts of tiles.

2.2. Feature Merging for Residual Connections

For the commonly utilized residual blocks described in [2,24,25,26], residual connections sum feature data at identical spatial positions from the block entrance and exit, where the entrance refers to the input of the initial layer and the exit refers to the output of the terminal layer. Previous layer fusion methods typically partition residual connections into distinct stacks and require residual data reloading or recomputation at the exit point [20], which inevitably introduces additional EMA and computational overhead.
Feature merging addresses this inefficiency by thoroughly analyzing receptive field relationships and integrating the loading of residual data seamlessly with LCTF. As illustrated in Figure 4a, the tile position at the exit layer is projected back onto the entrance feature map. Due to the intrinsic tile position shift in LCTF, the required residual data encompasses portions originating from the entrance layer's input tile, Wolp, and Holp. These feature data are merged and subsequently reused, with the section corresponding to the input tile referred to as Tile-Merged, and the sections corresponding to Wolp and Holp referred to as W-Merged and H-Merged, respectively.
Merged feature data achieves dual-phase utilization in on-chip memory: it is loaded once and reused during both the convolutional processing phase and the residual summation phase. While original layer fusion methods immediately consume these feature data at the entrance layer, feature merging demands that these data be retained in on-chip storage until residual summation has concluded, thereby extending the reuse distance explicitly to the exit layer. Notably, as indicated by the red highlighted regions in Figure 4b, when the receptive field of residual data exceeds the range of the entrance input tile, the dimensions of W-Merged and H-Merged regions require proportional enlargement to incorporate supplementary feature data outside the overlap and input tile area. Despite necessitating additional on-chip memory allocation for these supplemental feature data segments, the proposed feature merging approach maintains substantial efficiency advantages when compared to the conventional method of separately loading overlap, input tile, and residual data.
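The back-projection underlying Figure 4 reduces to undoing the cumulative tile shift; the sketch below assumes stride-1 convolutions with odd kernels, and the function and argument names are ours rather than the paper's.

```python
def residual_window(entry_top: int, entry_left: int, out_h: int, out_w: int,
                    kernel_sizes: list[int]) -> tuple[int, int, int, int]:
    """Entrance-map rectangle (top, left, height, width) holding the residual
    data that must be merged for the exit summation of this tile."""
    shift = sum((k - 1) // 2 for k in kernel_sizes)  # cumulative up/left shift
    return entry_top - shift, entry_left - shift, out_h, out_w

# Two 3x3 layers shift the tile by 2, so the residual window for an entrance
# tile at (8, 8) starts at (6, 6); this is exactly covered by the input tile
# plus its Wolp/Holp (the Figure 4a case). Deeper stacks or larger kernels
# push the window beyond that region, the Figure 4b case that requires
# supplementary merged data.
assert residual_window(8, 8, 16, 16, [3, 3]) == (6, 6, 16, 16)
```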

3. Memory-Efficient Modeling Framework

3.1. Reuse-Distance-Aware Caching Strategy

Prior works have commonly employed fixed storage strategies tailored for various data types in layer fusion [10,14,18], with the aim of balancing the on-chip memory footprint against EMA. In contrast, we introduce a reuse-distance-aware caching strategy (RDA), which dynamically formulates optimal caching policies for all involved data types by accounting for the available on-chip memory capacity. The reuse distance denotes the interval from data production to its subsequent consumption. Data characterized by shorter reuse distances typically exhibit more frequent access patterns, thereby warranting higher prioritization for on-chip memory storage.
Figure 5 illustrates the reuse distances associated with various data types in LCTF after feature merging. Specifically, input and output tiles are refreshed at each computational step within a layer, yielding the shortest reuse distance of one layer. Given that layer fusion necessitates the on-chip reuse of input/output feature tiles, the on-chip memory must be sufficient to accommodate both these tiles. The Tile-Merged data are generated at the entrance point of the residual stack and subsequently consumed post-summation at the exit point, resulting in a reuse distance equal to the processing cycle of one stack. Similarly, both W-Merged data and Wolp are produced during the computation of the immediately preceding tile to the left, and they are consumed and refreshed during the current tile’s processing, indicating a reuse distance equivalent to a single tile processing cycle. However, considering the dual-phase utilization of W-Merged data in both convolutional processing and residual summation phases, W-Merged data obtains higher on-chip storage prioritization relative to Wolp. Concurrently, during the processing of the current tile, prefetching of the subsequent input tile occurs when buffer capacity permits, thus maintaining a consistent one-tile-cycle reuse distance. H-Merged data and Holp exhibit the longest reuse distances, corresponding to the complete processing cycles of all t tiles across the width dimension of the feature map. Similarly, H-Merged data attains higher priority for on-chip storage compared to Holp.
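The resulting priority order can be written down explicitly; the list below paraphrases Figure 5 (shorter reuse distance first, with dual-phase merged data outranking plain overlaps at equal distance). The type names are our own labels, not the authors'.

```python
# Ordinal reuse distances from Figure 5: 1 layer < 1 stack < 1 tile < t tiles,
# where t is the number of tiles per feature-map row.
RDA_PRIORITY = [
    "in_out_tile",  # 1 layer: refreshed every layer, must stay on-chip
    "tile_merged",  # 1 stack: held from the entrance until the exit summation
    "w_merged",     # 1 tile:  dual-phase use, above plain Wolp
    "wolp",         # 1 tile
    "next_tile",    # 1 tile:  prefetched when capacity permits
    "h_merged",     # t tiles: dual-phase use, above plain Holp
    "holp",         # t tiles
]
```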
By implementing this dynamic reuse-distance-aware caching strategy, the system adaptively determines optimal storage policies for diverse data types in response to varying on-chip memory capacity. Moreover, this strategy facilitates the systematic exploration of memory-efficient design trade-offs.

3.2. Modeling Framework Architecture

In the hardware acceleration of CNNs, frequent off-chip memory accesses and intensive multiply-accumulate (MAC) operations lead to considerable power consumption. Concurrently, limited memory bandwidth and the high computational demands introduce substantial hardware delay. These two factors jointly dominate the EDP, which serves as a critical metric for evaluating memory-efficient architectures. Although the aforementioned optimization methods have notably enhanced data loading and storage efficiency within the layer fusion dataflow, the underlying relationship between on-chip memory footprint and EDP has yet to be thoroughly elucidated. Furthermore, a quantitative evaluation of the proposed methods across diverse architectural configurations and network models is required to validate their generality and effectiveness. To tackle these issues, we propose a systematic modeling framework comprising five processing steps.
(1) Inputs decoding. The framework inputs comprise the workload description, the hardware architecture specification, and the layer fusion parameters. The workload description is used to model a residual network that incorporates a variety of representative operators. It includes both the structural aspects of the network, such as convolution layers, residual connections, and pooling, as well as the detailed parameters of each layer, including kernel size, feature map dimensions, and stride. The hardware specification abstracts the critical resources of neural network accelerators, encompassing the number of compute units, on-chip memory, off-chip bandwidth constraints, and operating frequency. The compute units support configurable parallelism across feature map dimensions, while the on-chip memory accommodates diverse capacities and bit-width interfaces. Regarding layer fusion parameters, the adoption of the reuse-distance-aware caching strategy circumvents fixed management for different data types. Consequently, the relevant parameters need only specify the number of fused layers within each stack and the tile sizes. LCTF supports configurable tile sizes, yielding distinct feature access patterns that facilitate more flexible exploration of the design space. Upon completing these input configurations, the framework iteratively proceeds through the subsequent modeling processes.
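For concreteness, the three input groups might be captured as plain configuration records; the field names below are assumptions for illustration, not the authors' actual schema.

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:                 # workload description (per layer)
    kernel: int                  # K for a K x K convolution
    stride: int = 1
    residual_exit: bool = False  # terminal layer of a residual block

@dataclass
class HardwareSpec:              # hardware architecture specification
    n_mac: int                   # number of MAC units
    onchip_bytes: int            # on-chip feature memory capacity
    bw_bits_per_cycle: int = 64  # off-chip bandwidth constraint
    freq_mhz: int = 250

@dataclass
class FusionParams:              # layer fusion parameters
    layers_per_stack: int
    tile_h: int
    tile_w: int
```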
(2) Feature map tiling. For each stack of the network, the output feature maps are partitioned based on predefined tile size configurations. As summarized in Table 2, different tile types exhibit distinct characteristics during the process, including padding positions, overlap data management, and dimensional variations across layers, which impact the scheduling and storage of feature data. Therefore, each tile type requires independent and meticulous modeling. The framework first performs automatic identification of all tile types based on the spatial dimensions of the feature maps and the given tile size settings. It then enumerates the total number of tiles corresponding to each type. In addition, when the feature map dimensions and tile size are not divisible or do not satisfy the division into nine tile types, the framework automatically adjusts the edge tiles and the number of tile types according to the actual dimensions. For every identified tile type, the framework further computes the tile dimensions at each layer according to Table 2 and proceeds to conduct a detailed analysis of the associated data volumes, enabling accurate estimation of memory access and processing demands.
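Step (2) can be sketched as follows; ceiling division lets edge tiles absorb non-divisible remainders, as described above. This is a minimal illustration under the nine-type numbering of Figure 3, not the framework's actual code.

```python
import math

def count_tile_types(fm_h: int, fm_w: int, tile_h: int, tile_w: int) -> dict[int, int]:
    """Count tiles of each of the nine types for one feature map."""
    def band(i: int, n: int) -> int:      # 0 = first, 2 = last, 1 = middle
        return 0 if i == 0 else (2 if i == n - 1 else 1)
    n_rows, n_cols = math.ceil(fm_h / tile_h), math.ceil(fm_w / tile_w)
    counts: dict[int, int] = {}
    for row in range(n_rows):
        for col in range(n_cols):
            t = 3 * band(row, n_rows) + band(col, n_cols)  # type 0-8
            counts[t] = counts.get(t, 0) + 1
    return counts

# A 270 x 480 map with 16 x 16 tiles gives a 17 x 30 grid; its
# (17 - 2) * (30 - 2) = 420 interior tiles are all Type 4:
assert count_tile_types(270, 480, 16, 16)[4] == 420
```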
(3) Tile data quantity analysis. For each identified tile type within a specific stack, the receptive field shift of tiles across layers is determined based on the computed tile dimensions and convolution kernel sizes. Subsequently, the size of the merged data resulting from these receptive field shifts is calculated explicitly. The modeling framework then thoroughly analyzes the quantity of each feature data type involved in processing each layer, ensuring comprehensive and accurate data quantification.
(4) Reuse-distance-aware data allocation. The framework prioritizes different types of feature data according to the reuse-distance-aware strategy. Data with shorter reuse distances, indicating more frequent reuse, are dynamically allocated to the lower levels of the on-chip memory hierarchy to improve storage efficiency and minimize redundant memory transfers. These high-priority data are retained in memory until completely reused, after which the occupied memory space is released. Given the on-chip memory capacity specified in the hardware configuration, when the available memory becomes insufficient to hold all required feature data, the remaining data must be reloaded from off-chip memory. This incurs EMA overhead and increases memory access latency, which impacts overall system memory efficiency.
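A minimal greedy sketch of this allocation step, assuming a priority list like the one in Section 3.1; per-type sizes and the capacity are in bytes, and all names are illustrative.

```python
def allocate(sizes: dict[str, int], capacity: int,
             priority: list[str]) -> tuple[set[str], set[str]]:
    """Pin data types on-chip in priority order; the rest spills to off-chip."""
    on_chip: set[str] = set()
    spilled: set[str] = set()
    free = capacity
    for dtype in priority:
        need = sizes.get(dtype, 0)
        if need <= free:
            on_chip.add(dtype)
            free -= need
        else:
            spilled.add(dtype)  # reloaded from off-chip -> extra EMA and latency
    return on_chip, spilled
```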
(5) Cost model and output. After completing the preceding four modeling steps, the framework derives the total number of MAC operations ($N_{op}$) and memory accesses ($N_{mem}$) for each tile. These two metrics predominantly determine the overall energy consumption and hardware delay [20]. Based on these results, the EDP overhead during acceleration can be quantitatively estimated using the developed cost model. Specifically, assuming the hardware accelerator contains $N_{mac}$ MAC units and is limited by an off-chip memory bandwidth of $BW$, the energy and delay of the entire process are calculated as follows:
$$energy = \sum_{Tiles} \left( N_{mem} \times E_{access} + N_{op} \times E_{mac} \right)$$
$$delay = \sum_{Tiles} \max\left( \frac{N_{ema}}{BW}, \frac{N_{op}}{N_{mac}} \right)$$
Here, $E_{access}$ and $E_{mac}$ represent the energy consumed per memory access (including both loading and write-back) and per MAC operation, respectively. Their values are referenced from [27].
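Put together, the cost model is a short fold over tiles. A minimal sketch, assuming the per-tile counts have already been derived in steps (2)-(4); the record keys are our own naming.

```python
def edp(tiles: list[dict], e_access: float, e_mac: float,
        bw: float, n_mac_units: int) -> float:
    """tiles: per-tile records with keys 'n_mem', 'n_ema', 'n_op'."""
    energy = sum(t["n_mem"] * e_access + t["n_op"] * e_mac for t in tiles)
    delay = sum(max(t["n_ema"] / bw, t["n_op"] / n_mac_units) for t in tiles)
    return energy * delay  # pJ x cycles when inputs use those units
```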

4. Evaluation

4.1. Experiment Setup

To evaluate the effectiveness of the proposed methods, we adopt a hardware template representing neural network accelerators operating at 250 MHz. The number of MAC units is configured as 512 and 2048 to explore the accelerator's performance under compute-bound and memory-bound conditions, respectively. The off-chip memory bandwidth is limited to 64 bits per cycle at a 100 MHz operating frequency. Weights are preloaded into the on-chip weight memory on a stack basis to support the LCTF processing flow. We sweep a range of on-chip memory footprints for feature data to investigate their relationship with EDP. To validate the general applicability of our methods across different networks, we evaluate two representative residual networks: ResNet18 [2], which is tailored for image classification, and SRGAN [24], which is designed for image enhancement tasks.

4.2. Design Space Exploration

To evaluate memory efficiency, we first fix a set of layer fusion parameters. For residual networks, each residual block is typically treated as an individual fusion stack, so we primarily explore the impact of tile sizes on performance. In image enhancement tasks, the resolution of input feature maps is relatively large, allowing an extensive evaluation of tile size combinations in the height and width dimensions. Taking input feature maps with a resolution of 480 × 270 as an example, we compare the EDP across different tile size configurations. As illustrated in Figure 6, excessively small or large tile sizes increase EDP overhead, whereas moderately sized tiles achieve optimal performance. A tile size of 16 in both height and width dimensions is selected, yielding an EDP overhead of approximately $1.7 \times 10^{19}$ pJ·cycles. For ResNet18, the presence of multiple down-sampling layers progressively reduces input feature map dimensions. Considering this reduction, and to maintain computational unit utilization, we choose a tile size of 2 × 2.
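Such a sweep is straightforward to script against the framework. In the sketch below, evaluate_edp is a hypothetical callable that runs steps (1)-(5) for one square tile size; it stands in for the full pipeline and is not part of the paper.

```python
def sweep_tile_sizes(evaluate_edp, candidates=(4, 8, 16, 32, 64)) -> int:
    """Return the square tile size minimizing EDP over the candidate set."""
    return min(candidates, key=evaluate_edp)
```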
Figure 7 illustrates the results of design space exploration for ResNet18 and SRGAN. The vertical axis indicates the EDP values resulting from network acceleration given specific on-chip memory footprints. Typically, an increased memory footprint enables the accommodation and reuse of additional feature data on-chip, thereby decreasing energy consumption and latency associated with EMA. The green design points represent baseline implementations without optimization, wherein on-chip memory primarily stores data types required for layer fusion operations, with residual data allocated only within the leftover memory capacity. This storage strategy neglects prioritizing data types characterized by shorter reuse distances, thus yielding the highest EDP among the evaluated points. The blue design points reflect configurations adopting only the reuse-distance-aware caching strategy. Results from these configurations highlight that prioritizing the storage of data with higher reuse frequencies leads to significant EDP reductions. In comparison, the orange points represent configurations integrating both feature merging and the reuse-distance-aware caching. The combination facilitates loading residual data only once through LCTF integration. Consequently, these configurations achieve the lowest EDP values within equivalent memory footprints and require 9 KB and 62.5 KB less on-chip memory to achieve optimal EDP (i.e., full on-chip reuse of all feature data types).
Considering the configuration with 512 MAC units, both models operate in compute-bound scenarios, where the computational demand ($N_{op}/N_{mac}$) generally exceeds the memory access constraint ($N_{ema}/BW$). Taking SRGAN with an on-chip memory footprint of 100 KB as a specific instance, the orange design point employing both RDA and feature merging attains a 24.85% EDP reduction compared to the green baseline configuration. This improvement predominantly arises from a notable decrease in EMA-induced energy. When scaling the number of MAC units to 2048, the majority of tiles transition to a memory-bound regime, where the memory access constraint ($N_{ema}/BW$) becomes predominant. In this memory-bound context, the EDP reduction for SRGAN significantly increases to 50.34%, comprising a 30% decrease in energy and a 29.05% decrease in latency. The ResNet18 model, due to its network structure, consistently remains highly compute-bound; its EDP reduction is therefore predominantly driven by energy savings. These findings substantiate that the proposed optimization strategies effectively enhance both compute-bound and memory-bound workloads, demonstrating particularly substantial gains in memory-bound scenarios.

4.3. State-of-the-Art Comparison

To quantitatively evaluate the proposed methods, we perform a comparative analysis against layer fusion approaches adopted in state-of-the-art residual network accelerators [17,18,19]. These accelerators apply diverse layer fusion strategies and employ fixed feature management approaches, consequently leading to predetermined on-chip memory footprints when accelerating networks at specified input resolutions. To ensure fair and accurate comparisons, we replicate these accelerators within our modeling framework, uniformly scaling the number of computing units to 512 and maintaining identical energy consumption and bandwidth parameters. Figure 7 demonstrates that these approaches result in greater hardware overhead compared to the proposed methods. We further quantify the reduction ratios achieved by our methods relative to SotA approaches under two constraints: identical memory capacity or equivalent EDP, as summarized in Table 3.
Specifically, Ref. [17] employs line-buffer-based layer fusion, caching complete rows of feature maps and residual data to reduce EDP. This, however, results in the highest on-chip memory requirements. Conversely, our methods achieve significant memory reductions of 58.33% and 20.28% under equivalent EDP conditions through the integration of LCTF and feature merging.
The approach in [18] retains all residual data on-chip using pyramid-based layer fusion, caching Wolp on-chip and recomputing Holp. This method introduces substantial recomputation overhead due to accumulated overlap sizes, thereby increasing the EDP significantly. Under identical memory constraints, our methods demonstrate EDP reductions of 32.23% and 22.30%. This recomputation overhead becomes particularly severe when processing high-resolution tasks such as SRGAN, causing its EDP to exceed all design points in our work.
Lastly, Ref. [19] minimizes on-chip memory usage by only reusing input and output features, leading to extensive repeated off-chip memory accesses for residual data and intermediate features. Consequently, it incurs the highest EDP. In contrast, our proposed methods demonstrate EDP reductions of 43.44% and 40.29%.
Overall, the proposed methods enable flexible and efficient feature data management under diverse design constraints, achieving a well-balanced trade-off between on-chip memory usage and EDP. Compared to fixed-strategy implementations, they demonstrate superior adaptability and performance across various scenarios.

5. Conclusions

This work addresses the memory inefficiency of residual network acceleration under layer fusion from the perspectives of both EDP and on-chip memory usage, proposing three key innovations: layer-centric tile fusion, which minimizes overlap data through receptive field alignment; feature merging, which integrates residual data loading within LCTF to eliminate redundant memory accesses; and a reuse-distance-aware caching strategy, which dynamically determines optimal on-chip data storage policies based on data reuse characteristics. The proposed framework facilitates systematic exploration and evaluation of memory-efficiency trade-offs across diverse networks and hardware configurations. Experimental results demonstrate that our methods improve upon state-of-the-art accelerators, yielding reductions of 5.04–43.44% in EDP and 20.28–58.33% in on-chip memory usage. Future research will aim to enhance the generality of the proposed methods by incorporating more detailed modeling of data transfer and computation processes and by extending the framework to support advanced network structures such as residual-in-residual connections; currently, our framework handles outer residual connections through off-chip data reloads to avoid excessive on-chip memory overhead. In addition, we plan to enable modeling support for transformer-based architectures, complementing the existing focus on CNNs.

Author Contributions

Conceptualization, H.Z. and J.H.; methodology, H.Z. and J.H.; software, H.Z. and J.H.; validation, H.Z., Y.G. and S.P.; formal analysis, H.Z.; investigation, Y.G. and S.P.; data curation, H.Z.; writing—original draft preparation, H.Z.; writing—review and editing, L.H.; visualization, H.Z.; supervision, X.Y.; project administration, Y.F.; funding acquisition, X.Y. and Y.F. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key R&D Program of China (2023YFB4502802), in part by the National Natural Science Foundation of China under Grant 62031009 and 62304179, in part by the “Ling Yan” Program for Tackling Key Problems in Zhejiang Province (No.2022C01098), in part by Alibaba Innovative Research (AIR) Program, in part by Alibaba Research Fellow (ARF) Program, in part by the Fudan-ZTE Joint Lab, and in part by CCF-Alibaba Innovative Research Fund For Young Scholars.

Data Availability Statement

The raw data supporting the conclusions are included in the article. Further inquiries can be directed to the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CNN: Convolutional Neural Network
EMA: External Memory Access
EDP: Energy-Delay Product
LCTF: Layer-Centric Tile Fusion
RDA: Reuse-Distance-Aware Caching Strategy
MAC: Multiply-Accumulate
SotA: State-of-the-Art

References

  1. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  2. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  3. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image Super-Resolution Using Deep Convolutional Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 38, 295–307.
  4. Dong, C.; Loy, C.C.; Tang, X. Accelerating the super-resolution convolutional neural network. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 391–407.
  5. Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138.
  6. Chen, Y.; Luo, T.; Liu, S.; Zhang, S.; He, L.; Wang, J.; Li, L.; Chen, T.; Xu, Z.; Sun, N.; et al. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, 13–17 December 2014; pp. 609–622.
  7. Jouppi, N.P.; Young, C.; Patil, N.; Patterson, D.; Agrawal, G.; Bajwa, R.; Bates, S.; Bhatia, S.; Boden, N.; Borchers, A.; et al. In-Datacenter Performance Analysis of a Tensor Processing Unit. SIGARCH Comput. Archit. News 2017, 45, 1–12.
  8. Han, S.; Liu, X.; Mao, H.; Pu, J.; Pedram, A.; Horowitz, M.A.; Dally, W.J. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In Proceedings of the 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Republic of Korea, 18–22 June 2016; pp. 243–254.
  9. Alwani, M.; Chen, H.; Ferdman, M.; Milder, P. Fused-layer CNN accelerators. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, China, 15–19 October 2016; pp. 1–12.
  10. Li, Z.; Kim, S.; Im, D.; Han, D.; Yoo, H.J. An Efficient Deep-Learning-Based Super-Resolution Accelerating SoC with Heterogeneous Accelerating and Hierarchical Cache. IEEE J. Solid-State Circuits 2023, 58, 614–623.
  11. Goetschalckx, K.; Verhelst, M. Breaking High-Resolution CNN Bandwidth Barriers with Enhanced Depth-First Execution. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 323–331.
  12. Goetschalckx, K.; Wu, F.; Verhelst, M. DepFiN: A 12-nm Depth-First, High-Resolution CNN Processor for IO-Efficient Inference. IEEE J. Solid-State Circuits 2023, 58, 1425–1435.
  13. Colleman, S.; Verhelst, M. High-Utilization, High-Flexibility Depth-First CNN Coprocessor for Image Pixel Processing on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2021, 29, 461–471.
  14. Lee, J.; Lee, J.; Yoo, H.J. SRNPU: An Energy-Efficient CNN-Based Super-Resolution Processor with Tile-Based Selective Super-Resolution in Mobile Devices. IEEE J. Emerg. Sel. Top. Circuits Syst. 2020, 10, 320–334.
  15. Huang, C.T.; Ding, Y.C.; Wang, H.C.; Weng, C.W.; Lin, K.P.; Wang, L.W.; Chen, L.D. eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’52), Columbus, OH, USA, 12–16 October 2019; pp. 182–195.
  16. Lee, J.; Shin, D.; Lee, J.; Lee, J.; Kang, S.; Yoo, H.J. A Full HD 60 fps CNN Super Resolution Processor with Selective Caching based Layer Fusion for Mobile Devices. In Proceedings of the 2019 Symposium on VLSI Circuits, Kyoto, Japan, 9–14 June 2019; pp. C302–C303.
  17. Mo, H.; Zhu, W.; Hu, W.; Wang, G.; Li, Q.; Li, A.; Yin, S.; Wei, S.; Liu, L. 9.2 A 28nm 12.1TOPS/W Dual-Mode CNN Processor Using Effective-Weight-Based Convolution and Error-Compensation-Based Prediction. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; Volume 64, pp. 146–148.
  18. Ding, Y.C.; Lin, K.P.; Weng, C.W.; Wang, L.W.; Wang, H.C.; Lin, C.Y.; Chen, Y.T.; Huang, C.T. A 4.6-8.3 TOPS/W 1.2-4.9 TOPS CNN-based Computational Imaging Processor with Overlapped Stripe Inference Achieving 4K Ultra-HD 30fps. In Proceedings of the ESSCIRC 2022—IEEE 48th European Solid State Circuits Conference (ESSCIRC), Milan, Italy, 19–22 September 2022; pp. 81–84.
  19. Ma, Y.; Kim, M.; Cao, Y.; Vrudhula, S.; Seo, J.s. End-to-end scalable FPGA accelerator for deep residual networks. In Proceedings of the 2017 IEEE International Symposium on Circuits and Systems (ISCAS), Baltimore, MD, USA, 28–31 May 2017; pp. 1–4.
  20. Shi, M.; Houshmand, P.; Mei, L.; Verhelst, M. Hardware-Efficient Residual Neural Network Execution in Line-Buffer Depth-First Processing. IEEE J. Emerg. Sel. Top. Circuits Syst. 2021, 11, 690–700.
  21. Wan, W.; Kubendran, R.; Schaefer, C.; Eryilmaz, S.B.; Zhang, W.; Wu, D.; Deiss, S.; Raina, P.; Qian, H.; Gao, B.; et al. A compute-in-memory chip based on resistive random-access memory. Nature 2022, 608, 504–512.
  22. Wang, Z.; Nalla, P.S.; Krishnan, G.; Joshi, R.V.; Cady, N.C.; Fan, D.; Seo, J.s.; Cao, Y. Digital-Assisted Analog In-Memory Computing with RRAM Devices. In Proceedings of the 2023 International VLSI Symposium on Technology, Systems and Applications (VLSI-TSA/VLSI-DAT), Hsinchu, Taiwan, 17–20 April 2023; pp. 1–4.
  23. Vignali, R.; Zurla, R.; Pasotti, M.; Rolandi, P.L.; Singh, A.; Gallo, M.L.; Sebastian, A.; Jang, T.; Antolini, A.; Scarselli, E.F.; et al. Designing Circuits for AiMC Based on Non-Volatile Memories: A Tutorial Brief on Trade-Off and Strategies for ADCs and DACs Co-Design. IEEE Trans. Circuits Syst. II Express Briefs 2024, 71, 1650–1655.
  24. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690.
  25. Bhardwaj, K.; Milosavljevic, M.; O’Neil, L.; Gope, D.; Matas, R.; Chalfin, A.; Suda, N.; Meng, L.; Loh, D. Collapsible Linear Blocks for Super-Efficient Super Resolution. Proc. Mach. Learn. Syst. 2022, 4, 529–547.
  26. Huang, C.T. ERNet Family: Hardware-Oriented CNN Models for Computational Imaging Using Block-Based Inference. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 1643–1647.
  27. Mei, L.; Goetschalckx, K.; Symons, A.; Verhelst, M. DeFiNES: Enabling Fast Exploration of the Depth-First Scheduling Space for DNN Accelerators Through Analytical Modeling. In Proceedings of the 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Montreal, QC, Canada, 25 February–1 March 2023; pp. 570–583.
Figure 1. Example of two convolutional layers with a kernel size of 3. (a) Line-buffer-based layer fusion. (b) Pyramid-based layer fusion.
Figure 2. Illustration of the proposed layer-centric tile fusion using two convolutional layers with a kernel size of 3 as an example. The figure demonstrates the management of different types of feature data and the position shift of tiles across feature maps.
Figure 3. The nine tile types, along with their position shift and tile size variation, during the processing of each layer. The red dashed line corresponds to the tile position and size prior to convolution.
Figure 4. Examples of residual blocks with feature merging. The black shading represents the projection of the output feature tile onto the input feature map. (a) For two consecutive 3 × 3 convolutions, the receptive field of the residual data aligns with the input feature tiles of layer fusion, allowing direct merging. (b) When the receptive field of the residual data extends beyond the input tiles of layer fusion, additional feature data must be merged accordingly.
Figure 5. Different reuse distances and corresponding data types after feature merging. Different colors illustrate the four reuse distances corresponding to 1 layer, 1 stack, 1 tile, and t tiles.
Figure 6. Comparison of EDP under different tile size configurations for the SRGAN network.
Figure 7. Design space exploration on networks. Orange points indicate configurations with both feature merging and reuse-distance-aware caching. Blue points represent designs employing only reuse-distance-aware caching, with residual data processed separately. Green points denote baseline unoptimized designs that prioritize layer fusion-related feature data for on-chip storage. The star-shaped markers represent the modeling results of layer fusion approaches employed in state-of-the-art (SotA) residual network accelerators [17,18,19]. (a) ResNet18 exploration results; (b) SRGAN exploration results.
Table 1. Comparison of different layer fusion methods in terms of output tile dimensions ($W_{out}$, $H_{out}$), overlap size ($Size_{olp}$), overlap management, and tile position shift characteristics.

| Methods | $W_{out}$ | $H_{out}$ | $Size_{olp}$ | Overlap Management | Tile Position Shift |
|---|---|---|---|---|---|
| Line-buffer-based | Configurable | Fixed (1) | $K - 1$ | Stored on-chip | Yes |
| Pyramid-based | Configurable | Configurable | $\sum_{i \in Stack} (K_i - 1)$ | Stored on-chip / Recompute / Reload from off-chip memory | No |
| Proposed LCTF | Configurable | Configurable | $K - 1$ | Flexible storage | Yes |
Table 2. Comparison of tile types in terms of padding positions, overlap data management, and tile dimension variations across layers.

| | Type 0 | Type 1 | Type 2 | Type 3 | Type 4 | Type 5 | Type 6 | Type 7 | Type 8 |
|---|---|---|---|---|---|---|---|---|---|
| Padding | Top and Left side | Top only | Top and Right side | Left only | None | Right only | Bottom and Left side | Bottom only | Bottom and Right side |
| Wolp | Produce | Produce and Consume | Consume | Produce | Produce and Consume | Consume | Produce | Produce and Consume | Consume |
| Holp | Produce | Produce | Produce | Produce and Consume | Produce and Consume | Produce and Consume | Consume | Consume | Consume |
| Tile Width Variation | Decrease | Consistent | Increase | Decrease | Consistent | Increase | Decrease | Consistent | Increase |
| Tile Height Variation | Decrease | Decrease | Decrease | Consistent | Consistent | Consistent | Increase | Increase | Increase |
Table 3. Comparison with state-of-the-art methods under identical memory capacity or equivalent EDP constraints.

| Network | Works | EDP Reduction (Under Same Memory) | Memory Reduction (Under Same EDP) |
|---|---|---|---|
| ResNet18 | ISSCC’21 [17] | 19.41% | 58.33% |
| ResNet18 | ESSCIRC’22 [18] | 32.23% | 57.89% |
| ResNet18 | ISCAS’17 [19] | 43.44% | / |
| SRGAN | ISSCC’21 [17] | 5.04% | 20.28% |
| SRGAN | ESSCIRC’22 [18] | 22.30% | / |
| SRGAN | ISCAS’17 [19] | 40.29% | / |