Article

Energy-Constrained Deep Neural Network Compression for Depth Estimation

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(3), 732; https://doi.org/10.3390/electronics12030732
Submission received: 25 December 2022 / Revised: 17 January 2023 / Accepted: 18 January 2023 / Published: 1 February 2023
(This article belongs to the Section Artificial Intelligence)

Abstract

Many applications, such as autonomous driving and robotics, require accurately estimating depth in real time. Currently, deep learning is the most popular approach to stereo depth estimation. Many of these models have to operate in highly energy-constrained environments, yet they are usually computationally intensive, containing massive parameter sets ranging from thousands to millions. This makes them hard to run on low-power devices with limited storage in practice. To overcome this shortcoming, we model the training process of a deep neural network (DNN) for depth estimation under a given energy constraint as a constrained optimization problem and solve it through a proposed projected adaptive cubic quasi-Newton method (termed ProjACQN). Moreover, the trained model is also deployed on a GPU and an embedded device to evaluate its performance. Experiments show that the stage-four results of ProjACQN on the KITTI-2012 and KITTI-2015 datasets under a 70% energy budget achieve (1) 0.13% and 0.61%, respectively, lower three-pixel error than the state-of-the-art ProjAdam when run on a single RTX 3090Ti; (2) 4.82% and 7.58%, respectively, lower three-pixel error than the pruning method Lottery-Ticket; (3) 5.80% and 0.12%, respectively, lower three-pixel error than ProjAdam on the embedded device Nvidia Jetson AGX Xavier. These results show that our method can reduce the energy consumption of depth estimation DNNs while maintaining their accuracy.

1. Introduction

As a classical computer vision problem, stereo image-based depth estimation has a wide range of applications such as autonomous driving, robotics, 3D scene understanding, etc. [1,2,3,4,5]. Many of these applications have to operate in highly energy-constrained environments. However, most of the existing deep-learning-based depth estimation methods focus on designing more powerful network architectures to obtain more accurate depth images. Their strong performance therefore often requires massive computing resources, making these models too heavy to run on embedded devices.
There are several state-of-the-art studies [6,7,8,9,10,11] that attempt to address this problem by building lightweight networks; that is, they trade some accuracy for lower computation and inference time. These “mini” models could be further reduced in size to save more energy while maintaining similar performance. An open-source study [12] proposes an end-to-end DNN training framework that provides quantitative energy consumption guarantees via weighted sparse projection and input masking. The input mask allows the input sparsity to be controlled by a trainable parameter, and thus increases the opportunities for energy reduction. The authors also present a projected version of Adam (termed ProjAdam) to train the model under a quantitatively estimated energy consumption. However, the method only applies to image classification tasks, and the network only includes fully connected layers and 2D convolution layers, which differ from many depth estimation networks.
In this paper, we also formulate the training of a depth estimation DNN under a given energy budget as a constrained optimization problem. Different from prior work, we model the energy consumption of the depth estimation network after analyzing its architecture and taking 3D convolution layers into consideration. Furthermore, we propose a projected adaptive cubic quasi-Newton optimizer (termed ProjACQN) to obtain a better solution to such a complex optimization problem. Unlike the commonly used first-order projected optimizers [12,13] or proximal optimizers [14,15,16], which only utilize gradient information, ProjACQN incorporates Hessian information weighted by the norm of the difference between the two previous estimates, and performs the projection onto the energy constraint after the parameter update.
We evaluate the efficiency of our method on KITTI-2012 [17] and KITTI-2015 [18] using the depth estimation DNN AnyNet [6]. Experiments show that our method can reduce the energy consumption of AnyNet by 30% while improving its accuracy by 0.13% and 0.62% compared to ProjAdam [12] on KITTI-2012 and KITTI-2015, respectively. Our method is also compared with three existing pruning methods (L1-norm pruning [19], BN pruning [20] and Lottery-Ticket [21]) under the same energy budget, and achieves the best result. In addition, we run the models on the embedded device Nvidia Jetson AGX Xavier [22], and find that ProjACQN with a 70% energy budget is able to outperform the dense model without an energy budget in terms of both three-pixel error and time consumption.

2. Related Work

Generally, research on depth estimation using DNNs mainly focuses on designing architectures with better performance. These methods can be classified into three classes: supervised, semi-supervised and unsupervised. Taking [23,24] as examples of supervised methods, ref. [23] performs prediction at adaptively selected locations, which are easier to estimate accurately, thereby alleviating excessive computation. Ref. [24] proposes a depth refinement architecture using 3D dilated convolutions to predict geometrically consistent disparity images. These supervised stereo depth estimation methods have achieved great success, but they require per-pixel ground-truth depth data, which are often hard to acquire. To resolve this issue, an alternative is the class of unsupervised methods, which use the geometric constraints between stereo images as the supervisory signal instead of ground-truth disparity. Ref. [25] combines an unsupervised stereo disparity estimation network with a perceptual loss network, which enables it to refine the predicted disparity. Ref. [26] designs a Siamese autoencoder architecture to extract mutual information between the rectified stereo images. It also exploits mutual epipolar attention, and uses the optimal transport algorithm to refine the depth image. In addition, there are very few works on semi-supervised stereo depth estimation using DNNs. Ref. [27] trains in a semi-supervised manner to combine information from LIDAR and photometric data; the resulting model achieves better performance than those trained only with LIDAR.
The methods mentioned above often contain massive parameter sets to achieve strong performance, but they require too much computation and GPU memory to be deployed on embedded devices. Recently, many studies have focused on improving energy efficiency through artificial intelligence [6,7,8,9,10,11,28,29]. This has inspired many studies to explore lightweight DNNs for stereo depth estimation. Ref. [10] performs online adaptive stereo depth estimation through a self-supervised learning method, which helps to save computation and GPU memory. Ref. [11] is based on a Max-tree hierarchical representation of image pairs, and is able to identify matching regions along image scan-lines. Ref. [8] estimates depth via a series of binary classifications. Instead of obtaining an accurate depth map, it classifies objects according to their relative distance. Ref. [7] proposes an efficient neural network whose computation and latency savings mostly come from depth-wise separable convolutions and network pruning. Refs. [6,30] propose AnyNet, which covers a wide accuracy range of the disparity map according to the permitted inference time. In practice, AnyNet has four stages, and the higher the stage, the better the accuracy at a higher time cost.

3. Problem Formulation

In this section, we provide an energy model for a depth estimation DNN which consists of fully connected layers, 2D convolution layers and 3D convolution layers.

3.1. Energy Consumption Modeling

Generally, the problem of training a typical depth estimation DNN under an energy budget, assuming the popular systolic array hardware architecture [31], can be formulated as [12]
\min_{M, W} \; L_{W_{dense}}(M, W) \quad \mathrm{s.t.} \quad E(M, W) \le E_{budget} \qquad (1)
where $L(\cdot)$ is the loss function, $W$ the weight tensor of the sparse model, $W_{dense}$ the weight tensor of the original dense model, $M$ the input mask, $E_{budget}$ the given energy budget and $E$ the total energy cost. The depth estimation DNN mainly consists of a sequence of fully connected layers, 2D convolution layers and 3D convolution layers, and the energy consumption of each layer can be decomposed into two parts: computation energy and data access energy. Let $\mathcal{U}$, $\mathcal{V}$ and $\mathcal{W}$ be the sets of fully connected layers, 2D convolution layers and 3D convolution layers in the DNN, respectively; $E_{comp}^{(i)}$ the energy consumed by the computation units in layer $i$; and $E_{data}^{(i)}$ the energy consumed in layer $i$ when accessing data from the hardware memory. The total energy cost can be computed as
E = \sum_{u \in \mathcal{U}} \left( E_{comp}^{(u)} + E_{data}^{(u)} \right) + \sum_{v \in \mathcal{V}} \left( E_{comp}^{(v)} + E_{data}^{(v)} \right) + \sum_{w \in \mathcal{W}} \left( E_{comp}^{(w)} + E_{data}^{(w)} \right) \qquad (2)
Assume $e_{MAC}$ denotes the energy consumption of one Multiply-and-Accumulate (MAC) operation, $X^{(i)}$ the input tensor of layer $i$, and $\mathrm{supp}(\cdot)$ a function returning a binary tensor that marks the nonzero positions of its argument. Then, the computation energy cost of a fully connected layer is
E_{comp}^{(u)} = e_{MAC} \, \mathrm{sum}\!\left( \mathrm{supp}(X^{(u)}) \cdot \mathrm{supp}(W^{(u)}) \right) \qquad (3)
The computation energy cost of a 2D convolution layer is
E_{comp}^{(v)} = e_{MAC} \, \mathrm{sum}\!\left( \mathrm{supp}(X^{(v)}) \circledast \mathrm{supp}(W^{(v)}) \right) \qquad (4)
Generally, to accelerate loading, it is common to move data from DRAM to the cache and then from the cache to the register file (RF) during DNN inference [12,32]. Thus, each layer has to load its input and weights three times, once each from DRAM, the cache and the RF. Let $e_{DRAM}$, $e_{RF}$ and $e_{cache}$ be the unit energy costs of DRAM, RF and cache accesses, respectively. The data access energy cost of each layer can then be formulated as
E_{data}^{(layer)} = e_{DRAM} \left( N_{DRAM}^{input} + N_{DRAM}^{weight} \right) + e_{RF} \left( N_{RF}^{input} + N_{RF}^{weight} \right) + e_{cache} \left( N_{cache}^{input} + N_{cache}^{weight} \right) \qquad (5)
where $N_{DRAM}^{input}$, $N_{RF}^{input}$ and $N_{cache}^{input}$ denote the total numbers of DRAM, RF and cache accesses related to the input, respectively, and $N_{DRAM}^{weight}$, $N_{RF}^{weight}$ and $N_{cache}^{weight}$ the corresponding numbers related to the weights. Ref. [12] only gives the detailed computation of the energy cost for the fully connected layer and the 2D convolution layer.
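To make the per-layer model concrete, the following NumPy sketch evaluates the computation energy of Equation (3) for a fully connected layer and the generic data-access energy of Equation (5). The function names, the unit energy costs and the access counts in the example are illustrative placeholders, not values or code from the paper.

import numpy as np

def fc_comp_energy(x, W, e_mac):
    # Eq. (3): one MAC for every (output, input) pair whose input element
    # and corresponding weight are both nonzero.
    sx = (x != 0).astype(np.int64)   # supp(X)
    sW = (W != 0).astype(np.int64)   # supp(W)
    return e_mac * int((sW @ sx).sum())

def data_access_energy(n_input, n_weight, e_dram, e_rf, e_cache):
    # Eq. (5): n_input and n_weight are (DRAM, RF, cache) access counts
    # for the layer input and the layer weights, respectively.
    return (e_dram * (n_input[0] + n_weight[0])
            + e_rf * (n_input[1] + n_weight[1])
            + e_cache * (n_input[2] + n_weight[2]))

# Toy example with placeholder unit costs (arbitrary units).
x = np.array([0.0, 1.2, 0.0, -0.7])
W = np.array([[0.5, 0.0, 0.0, 1.0],
              [0.0, 0.0, 2.0, 0.0],
              [1.5, -1.0, 0.0, 0.0]])
E_fc = fc_comp_energy(x, W, e_mac=1.0) + data_access_energy(
    n_input=(4, 16, 8), n_weight=(12, 24, 12),
    e_dram=200.0, e_rf=1.0, e_cache=6.0)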

3.1.1. Computation Energy for 3D Convolution Layer

Let $W^{(w)} \in \mathbb{R}^{c_{out} \times c_{in} \times r \times r \times r}$ be the weight tensor, where $c_{in}$ and $c_{out}$ are the numbers of input and output channels, respectively, and let $X^{(w)} \in \mathbb{R}^{c_{in} \times h \times w \times d}$ be the input tensor, where $h$, $w$ and $d$ are the height, width and depth of $X^{(w)}$, respectively. The 3D convolution operation can be formulated as
\left( X^{(w)} \circledast W^{(w)} \right)_{j,x,y,z} = \sum_{i=1}^{c_{in}} \sum_{x',y',z'=0}^{r-1} X^{(w)}_{i,\, x - \lfloor r/2 \rfloor + x',\, y - \lfloor r/2 \rfloor + y',\, z - \lfloor r/2 \rfloor + z'} \; W^{(w)}_{j,i,x',y',z'} \qquad (6)
where $x$, $y$ and $z$ index the position of the output element and $x'$, $y'$ and $z'$ index the kernel. Assume the convolution padding is $p$ and the convolution stride is $s$, and let $h' = \lfloor (h + 2p - r)/s \rfloor + 1$, $w' = \lfloor (w + 2p - r)/s \rfloor + 1$ and $d' = \lfloor (d + 2p - r)/s \rfloor + 1$. Then, the size of the output tensor is $c_{out} \times d' \times h' \times w'$. Thus, the number of MAC operations is $\mathrm{sum}(\mathrm{supp}(X^{(w)}) \circledast \mathrm{supp}(W^{(w)})) \le h' w' d' \, \|W^{(w)}\|_0$, and the computation energy cost can be approximated through
E_{comp}^{(w)} \approx e_{MAC} \, h' w' d' \, \| W^{(w)} \|_0 \qquad (7)
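For instance, a 3D convolution with $r = 3$, stride $s = 1$ and padding $p = 1$ on a $16 \times 32 \times 32 \times 32$ input keeps the spatial size ($h' = w' = d' = 32$), so Equation (7) reduces to $E_{comp} \approx e_{MAC} \cdot 32^3 \cdot \|W\|_0$. The NumPy sketch below is an illustrative evaluation of this approximation; the unit MAC cost and the toy sparsity pattern are placeholders.

import numpy as np

def conv3d_comp_energy(weight, in_shape, stride, padding, e_mac):
    # weight: (c_out, c_in, r, r, r); in_shape: (c_in, h, w, d).
    # Eq. (7): E_comp ~= e_MAC * h' * w' * d' * ||W||_0.
    _, _, r, _, _ = weight.shape
    _, h, w, d = in_shape
    h_out = (h + 2 * padding - r) // stride + 1
    w_out = (w + 2 * padding - r) // stride + 1
    d_out = (d + 2 * padding - r) // stride + 1
    return e_mac * h_out * w_out * d_out * np.count_nonzero(weight)

# Example: a roughly half-pruned 3x3x3 kernel bank on a 16x32x32x32 input.
W = np.random.randn(32, 16, 3, 3, 3)
W[np.random.rand(*W.shape) > 0.5] = 0.0
E_3d = conv3d_comp_energy(W, (16, 32, 32, 32), stride=1, padding=1, e_mac=1.0)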

3.1.2. Data Access Energy for 3D Convolution Layer

According to Equation (5), computing the data access energy of the 3D convolution layer requires the numbers of cache, RF and DRAM accesses. Let $N^{(w)}_{W\_cache}$, $N^{(w)}_{W\_RF}$ and $N^{(w)}_{W\_DRAM}$ be the access numbers of the weight tensor, respectively, and $N^{(w)}_{X\_cache}$, $N^{(w)}_{X\_RF}$ and $N^{(w)}_{X\_DRAM}$ those of the input tensor. For simplicity, we unfold the input tensor as $\bar{X}^{(w)} \in \mathbb{R}^{h'w'd' \times c_{in} r^3}$ according to the output channel, where each column contains all the elements of $X^{(w)}$ that operate with one element of $W^{(w)}$. The access numbers can be formulated as
\begin{aligned}
N^{(w)}_{W\_cache} &= \lceil h' w' d' / s_h \rceil \, \| W^{(w)} \|_0 \\
N^{(w)}_{W\_RF} &= \lceil h' w' d' / s_h \rceil \, \| W^{(w)} \|_0 \\
N^{(w)}_{W\_DRAM} &= \lceil h' w' d' / s_h \rceil \, \max\!\left( 0, \| W^{(w)} \|_0 - k_W \right) + \min\!\left( k_W, \| W^{(w)} \|_0 \right) \\
N^{(w)}_{X\_cache} &= \lceil c_{out} / s_w \rceil \, \| \bar{X}^{(w)} \|_0 \\
N^{(w)}_{X\_RF} &= c_{out} \, \| \bar{X}^{(w)} \|_0 + 2 h' w' d' \, \| W^{(w)} \|_0 \\
N^{(w)}_{X\_DRAM} &= \| X^{(w)} \|_0 + N_{overlap} + c_{out} \, h' w' d'
\end{aligned} \qquad (8)
where $k_W$ represents the cache size for the weight tensor, $N_{overlap}$ the number of overlapped input elements caused by the nature of the 3D convolution operation, and $s_h$ and $s_w$ the height and width of the systolic array, respectively. According to Equations (5) and (8), the total data access energy consumption of a 3D convolution layer can be formulated as
E_{data}^{(w)} = e_{DRAM} \left( N^{(w)}_{X\_DRAM} + N^{(w)}_{W\_DRAM} \right) + e_{RF} \left( N^{(w)}_{X\_RF} + N^{(w)}_{W\_RF} \right) + e_{cache} \left( N^{(w)}_{X\_cache} + N^{(w)}_{W\_cache} \right) \qquad (9)
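To illustrate how Equations (8) and (9) are evaluated in practice, the sketch below counts the weight-related memory accesses of a sparse 3D convolution and converts them into energy. The systolic array height $s_h$, the weight cache size $k_W$ and the unit energy costs are hypothetical placeholders, and the input-related terms of Equation (8) are omitted for brevity.

import numpy as np
from math import ceil

def conv3d_weight_access_counts(weight, out_spatial, s_h, k_w):
    # out_spatial = (h', w', d') computed from the padding/stride formulas above.
    h_out, w_out, d_out = out_spatial
    nnz_w = np.count_nonzero(weight)                # ||W||_0
    passes = ceil(h_out * w_out * d_out / s_h)      # systolic-array passes
    n_cache = passes * nnz_w                        # Eq. (8): cache accesses
    n_rf = passes * nnz_w                           # Eq. (8): register-file accesses
    n_dram = passes * max(0, nnz_w - k_w) + min(k_w, nnz_w)   # Eq. (8): DRAM accesses
    return n_dram, n_rf, n_cache

def weight_access_energy(n_dram, n_rf, n_cache, e_dram=200.0, e_rf=1.0, e_cache=6.0):
    # Weight-related part of Eq. (9); unit costs are arbitrary placeholders.
    return e_dram * n_dram + e_rf * n_rf + e_cache * n_cache

# Example: a sparse 32x16x3x3x3 kernel bank producing a 32x32x32 output volume.
W = np.random.randn(32, 16, 3, 3, 3)
W[np.random.rand(*W.shape) > 0.4] = 0.0
E_w = weight_access_energy(*conv3d_weight_access_counts(W, (32, 32, 32), s_h=128, k_w=4096))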

4. Optimization Algorithm

In this section, we first provide the formulation of our projected adaptive cubic quasi-Newton optimizer. Then, we utilize the optimizer to solve the constrained optimization problem (1).

4.1. Projected Cubic Quasi-Newton Optimizer

Consider the following optimization problem:
\min_{W \in \mathcal{C}} \; L(W) \qquad (10)
where the constraint set $\mathcal{C}$ is convex and compact. We can approach the optimum through the following augmented second-order approximation:
\min_{W \in \mathcal{C}} \; L(W_k) + g_k^{T} (W - W_k) + \frac{1}{2} (W - W_k)^{T} B_k (W - W_k) + \frac{\rho}{6} \| W - W_k \|^3 \qquad (11)
where $g_k$ represents the gradient at iteration $k$, $B_k$ an approximation to the Hessian matrix at $W_k$, and $\rho$ a positive constant. Ignoring the constraint for the moment, a stationary point is obtained by setting the derivative of the objective function to zero:
g_k + B_k (W - W_k) + \frac{\rho}{2} \| W - W_k \| \, (W - W_k) = 0 \qquad (12)
Approximating the step length $\| W - W_k \|$ in Equation (12) by that of the previous iteration, $\| W_k - W_{k-1} \|$, makes the equation explicitly solvable, and the update of the stationary point becomes
W_{k+1} = W_k - \left( B_k + \frac{\rho}{2} \| W_k - W_{k-1} \| \cdot I \right)^{-1} g_k \qquad (13)
To avoid numerical difficulties when inverting a (near-)degenerate matrix, we constrain the absolute values of the diagonal elements to be no less than a given positive parameter $\theta$:
W_{k+1} = W_k - \max\!\left( \mathrm{abs}\!\left( B_k + \frac{\rho}{2} \| W_k - W_{k-1} \| \cdot I \right), \, \theta \cdot I \right)^{-1} g_k \qquad (14)
Then, the constrained optimum can be approximated by projecting the stationary point onto the constraint set. This yields the update
W_{k+1} = \mathrm{proj}_{\mathcal{C}}\!\left( W_k - \max\!\left( \mathrm{abs}\!\left( B_k + \frac{\rho}{2} \| W_k - W_{k-1} \| \cdot I \right), \, \theta \cdot I \right)^{-1} g_k \right) \qquad (15)
where the projection operation $\mathrm{proj}_{\mathcal{C}}$ projects the stationary point onto the constraint set $\mathcal{C}$. The detailed procedure of ProjACQN is shown in Algorithm 1. We then use this algorithm to solve problem (1) in the following subsection.
Algorithm 1 ProjACQN
Require: Mini-batch size n_g, stepsize η, exponential decay rates β_1, β_2 ∈ [0, 1), positive parameters ε, ρ, θ, constraint set C.
Require: W_0, g_0, B_0, m_0, V_0        // Initialize variables
Require: k ← 0, t ← 0        // Initialize timestep
1:  while W_k not converged do
2:      k ← k + 1
3:      g_k ← ∇L(W_k; n_g)        // Stochastic gradient at timestep k
4:      s_k ← W_k − W_{k−1}        // Parameter difference
5:      ĝ_k ← g_k − g_{k−1}        // Gradient difference
6:      B_k ← B_{k−1} + ((s_k^T ĝ_k − s_k^T B_{k−1} s_k) / (‖s_k‖_4^4 + ε)) Diag(s_k^2)        // Update diagonal Hessian approximation
7:      D_k ← max(abs(B_k + (ρ/2) ‖W_k − W_{k−1}‖ · I), θ · I)
8:      m_k ← ((1 − β_1^{k−1}) β_1 m_{k−1} + (1 − β_1) g_k) / (1 − β_1^k)        // Update first moment
9:      V_k ← ((1 − β_2^{k−1}) β_2 V_{k−1} + (1 − β_2) D_k^2) / (1 − β_2^k)        // Update high-order second moment
10:     W_{k+1} ← proj_C(W_k − η V_k^{−1/2} m_k)        // Update parameters with the projection operation
11: end while
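As an illustration only, the following NumPy sketch implements one ProjACQN-style iteration on a flattened parameter vector. It assumes the diagonal Hessian approximation B_k is stored as a vector and that the constraint set is supplied as a generic projection callback; the function names and the toy box projection are hypothetical, not the authors' implementation.

import numpy as np

def projacqn_step(w, w_prev, g, g_prev, B, m, V, k, proj,
                  eta=5e-4, beta1=0.9, beta2=0.999, eps=1e-8, rho=1.0, theta=1.0):
    # One ProjACQN-style iteration (Algorithm 1, lines 4-10); k starts at 1.
    s = w - w_prev                                   # parameter difference s_k
    y = g - g_prev                                   # gradient difference
    # Line 6: diagonal quasi-Newton update of the Hessian approximation.
    B = B + (s @ y - s @ (B * s)) / (np.sum(s ** 4) + eps) * s ** 2
    # Line 7: cubic-regularized, safeguarded curvature (diagonal of D_k).
    D = np.maximum(np.abs(B + 0.5 * rho * np.linalg.norm(s)), theta)
    # Lines 8-9: bias-corrected first and second moments.
    m = ((1 - beta1 ** (k - 1)) * beta1 * m + (1 - beta1) * g) / (1 - beta1 ** k)
    V = ((1 - beta2 ** (k - 1)) * beta2 * V + (1 - beta2) * D ** 2) / (1 - beta2 ** k)
    # Line 10: projected parameter update.
    w_new = proj(w - eta * m / np.sqrt(V))
    return w_new, B, m, V

# Toy usage with a box projection onto [-1, 1] as a stand-in constraint set.
proj_box = lambda v: np.clip(v, -1.0, 1.0)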

4.2. Update Weight Tensor and Input Mask

The problem in Equation (1) has two variables, $M$ and $W$. For simplicity, it can be split into two sub-problems and solved by alternately updating the following two equations:
W = \arg\min_{W \in \mathbb{W}} L(M, W) \qquad (16)
M = \arg\min_{M \in \mathbb{M}} L(M, W) \qquad (17)
where $\mathbb{W} = \{ W \mid E(M, W) \le E_{budget} \}$ and $\mathbb{M} = \{ M \mid \|M\|_0 \le q, \, M \in [0, 1] \}$. The sub-problems (16) and (17) are similar to (10), and can be optimized with the projected cubic quasi-Newton optimizer. According to Equations (15)–(17), the update of the weight tensor can be formulated as
W_{k+1} = \mathrm{proj}_{\mathbb{W}}\!\left( W_k - \max\!\left( \mathrm{abs}\!\left( B_k + \frac{\rho}{2} \| W_k - W_{k-1} \| \cdot I \right), \, \theta \cdot I \right)^{-1} g_k \right) \qquad (18)
The update of the mask tensor can be formulated as
M_{k+1} = \mathrm{proj}_{\mathbb{M}}\!\left( M_k - \max\!\left( \mathrm{abs}\!\left( B_k + \frac{\rho}{2} \| M_k - M_{k-1} \| \cdot I \right), \, \theta \cdot I \right)^{-1} g_k \right) \qquad (19)
According to [12], the projection problem (18) can be transformed into a knapsack problem, and the projection problem (19) is the well-known $L_0$-norm projection (a minimal sketch of the latter is given after Algorithm 2). To summarize, the complete procedure for training an energy-constrained depth estimation DNN is given in Algorithm 2.
Algorithm 2 Energy-Constrained Depth Estimation DNN Training
Require: Mini-batch size n_g, stepsize η, exponential decay rates β_1, β_2 ∈ (0, 1), mask sparsity decay step Δq, positive parameters ε, ρ, θ.
Require: q ← ‖M_t‖_0 − Δq        // Initialize the mask sparsity constraint
Require: k ← 0, t ← 0        // Initialize timestep
1:  while True do
2:      while not converged do
3:          Update the weight tensor W_k by solving (18)
4:          k ← k + 1
5:      end while        // If accuracy decreases, exit the loop with W_k and M_t
6:      while not converged do
7:          Update the input mask M_t by solving (19)
8:          t ← t + 1
9:      end while
        Round the values of M_{t+1} into {0, 1}        // Decay the sparsity constraint: q ← q − Δq
10: end while
11: return W_{k+1}, M_{t+1}
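As a small illustration of the mask projection in Equation (19), the $L_0$-norm projection onto $\{ \|M\|_0 \le q, M \in [0, 1] \}$ can be approximated by keeping the $q$ largest-magnitude entries, zeroing the rest and clipping the survivors into $[0, 1]$. The NumPy sketch below follows this simple recipe and is not the authors' implementation.

import numpy as np

def project_mask(M, q):
    # Approximate projection onto { ||M||_0 <= q, M in [0, 1] } (illustrative).
    flat = M.ravel().copy()
    if np.count_nonzero(flat) > q:
        # Zero out everything except the q entries with the largest magnitude.
        drop = np.argsort(np.abs(flat))[:-q] if q > 0 else np.arange(flat.size)
        flat[drop] = 0.0
    # Clip the remaining entries into the box constraint [0, 1].
    return np.clip(flat, 0.0, 1.0).reshape(M.shape)

# Example: keep at most 6 active entries of a random 4x4 mask.
M = np.random.rand(4, 4)
M_sparse = project_mask(M, q=6)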

5. Experiment

To validate the efficiency of our method, we perform extensive experiments on KITTI-2012 [17] and KITTI-2015 [18] using the depth estimation network AnyNet [6]. The experiments are divided into three parts. First, we test the performance of AnyNet trained on a GPU under different energy budgets, and the results are compared with the projected first-order optimizer ProjAdam. Then, we apply three existing pruning methods (L1-norm pruning [19], BN pruning [20] and Lottery-Ticket [21]) to AnyNet under the same energy budget for comparison. Finally, the performance of the trained models is also assessed on the embedded device Nvidia Jetson AGX Xavier [22].
Implementation Details We implement AnyNet with four stages; the outputs of higher stages are more accurate at a higher time cost. The hyper-parameters of the network are set to their default values. The experiments are implemented in PyTorch 1.10.0 with Python 3.6 and are GPU-accelerated through CUDA 11.3. The hardware is a single RTX 3090Ti with an i9-10920X CPU and 32 GB of RAM. Following AnyNet, we use the three-pixel error metric to evaluate performance (lower is better). In addition, the predicted depth maps are also enhanced through histogram equalization [33] for comparison.
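For reference, the three-pixel error counts the percentage of valid pixels whose predicted disparity deviates from the ground truth by more than three pixels. The sketch below uses this simple absolute-threshold definition (some KITTI evaluations additionally require the relative error to exceed 5%); it is illustrative only.

import numpy as np

def three_pixel_error(pred_disp, gt_disp, threshold=3.0):
    # Percentage of valid pixels with |prediction - ground truth| > threshold.
    valid = gt_disp > 0                      # KITTI marks invalid pixels with 0
    err = np.abs(pred_disp[valid] - gt_disp[valid])
    return 100.0 * float(np.mean(err > threshold))

# Toy example on random disparities; lower is better.
gt = np.random.rand(4, 5) * 80.0
pred = gt + np.random.randn(4, 5)
print(three_pixel_error(pred, gt))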
Dataset The training set of KITTI-2012 contains 194 image pairs, while the test set contains 195 image pairs. Both the training set and test set of KITTI-2015 have 200 image pairs.
Experiment Setup We perform careful hyper-parameter tuning for the optimizers in the experiments as follows:
ProjACQN: We set $\theta = 1$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ and $\rho = 1$. The learning rates for the weight and input mask updates are set to $\eta = 5 \times 10^{-4}$ and $\eta = 1 \times 10^{-4}$, respectively, while the weight decays for the weight and input mask updates are set to $1 \times 10^{-4}$ and $1 \times 10^{-5}$, respectively.
ProjAdam: The learning rates are searched among $\{ a \times 10^{-b} \}$, where $a \in \{1, 2, \ldots, 9\}$ and $b \in \{3, 4, 5\}$, while the weight decays for the weight and input mask updates are set to the same values as for ProjACQN. Other parameters are set to their default values in the literature.

5.1. Results on GPU under Different Energy Budgets

Here, we test the performance of AnyNet trained on the GPU under 50%, 60% and 70% energy budgets. Quantitative results on KITTI-2012 and KITTI-2015 are shown in Table 1 and Table 2, respectively. On KITTI-2012, the results of ProjACQN and ProjAdam over all four stages under a 70% energy budget are comparable to the dense model, while ProjACQN achieves a much lower three-pixel error under a 60% energy budget. On KITTI-2015, the results of ProjACQN over all four stages under 60% and 70% energy budgets achieve a lower three-pixel error than ProjAdam and even outperform the dense model, which may be due to the removal of redundant information. Furthermore, ProjACQN achieves a comparable result under a 50% energy budget. We also show the training loss curves on KITTI-2012 and KITTI-2015 under a 70% energy budget in Figure 1 and Figure 2, respectively. From these figures, we can see that ProjACQN achieves the best convergence speed. Figure 3 gives some visual examples of the predicted disparity.

5.2. Comparison between Different Pruning Methods

To comprehensively compare our method with prior work, we also slim AnyNet through three existing pruning methods (L1-norm pruning, BN pruning and Lottery-Ticket) on KITTI-2012 and KITTI-2015. The results under a 70% energy budget are listed in Table 3. Our method achieves a clearly lower three-pixel error than the other methods.

5.3. Results on Embedded Device

In this section, we run the models trained with ProjAdam and ProjACQN on an Nvidia Jetson AGX Xavier under 50%, 60% and 70% energy budgets. Quantitative results on KITTI-2012 and KITTI-2015 are shown in Table 4 and Table 5, respectively. The stage-four result of ProjACQN has an obvious advantage on both KITTI-2012 and KITTI-2015, while the performance of the other three stages is comparable. The FPS of stage four using the dense model, ProjAdam and ProjACQN is 11.5, 20.4 and 20.31, respectively. It should be noted that the three-pixel errors increase mostly due to the Float16 data type of the embedded device, which could be addressed through quantization. Figure 4 and Figure 5 present visual examples of disparity predictions from stage four of AnyNet under 50%, 60% and 70% energy budgets on KITTI-2012 and KITTI-2015, respectively. We can see that the predictions of ProjACQN are closer to those of the dense model than ProjAdam's. It is worth noting that the results are noisy under the 50% and 60% energy budgets due to the input mask.

6. Conclusions

We have presented an approach to compressing deep neural networks for depth estimation under a given energy constraint. The training of the depth estimation DNN is formulated as a constrained optimization problem, which is solved through the proposed projected adaptive cubic quasi-Newton optimizer. Experiments show that our method can reduce the energy consumption of AnyNet by 30% while improving accuracy by 0.13% and 0.62% compared to the state-of-the-art method ProjAdam on KITTI-2012 and KITTI-2015, respectively. Compared with existing pruning methods, ProjACQN also achieves the best three-pixel error. It is worth mentioning that, when running the models on the embedded device Nvidia Jetson AGX Xavier, ProjACQN with a 70% energy budget is able to outperform the dense model without an energy budget in terms of both three-pixel error and time consumption.

Author Contributions

Conceptualization, X.Z., M.Z. and Y.L.; methodology, X.Z.; software, Y.L.; validation, Z.Z. and Y.L.; formal analysis, X.Z., M.Z., Z.Z. and Y.L.; investigation, X.Z. and Y.L.; resources, Z.Z.; data curation, Z.Z. and Y.L.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z., M.Z. and Y.L.; visualization, Z.Z.; supervision, M.Z.; project administration, X.Z.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported in part by the Hunan Provincial Natural Science Foundation (Grant No. 2019JJ50746) and the National Natural Science Foundation of China (Grant No. 61602494).

Data Availability Statement

Data openly available in a public repository.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Wang, Y.; Chao, W.L.; Garg, D.; Hariharan, B.; Campbell, M.; Weinberger, K.Q. Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving. 2018. Available online: https://arxiv.org/abs/1812.07179 (accessed on 1 February 2020).
  2. Tian, Y.; Du, Y.; Zhang, Q.; Cheng, J.; Yang, Z. Depth estimation for advancing intelligent transport systems based on self-improving pyramid stereo network. IET Intell. Transp. Syst. 2020, 14, 338–345. [Google Scholar] [CrossRef]
  3. Dong, X.; Garratt, M.A.; Anavatti, S.G.; Abbass, H.A. Towards Real-Time Monocular Depth Estimation for Robotics: A Survey. 2021. Available online: https://arxiv.org/abs/2111.08600 (accessed on 1 November 2021).
  4. Zhou, Y.; Gallego, G.; Rebecq, H.; Kneip, L.; Li, H.; Scaramuzza, D. Semi-dense 3D Reconstruction with a Stereo Event Camera. Proc. Eur. Conf. Comput. Vis. 2018, 11205, 242–258. [Google Scholar]
  5. Bardozzo, F.; Collins, T.; Forgione, A.; Hostettler, A.; Tagliaferri, R. StaSiS-Net: A stacked and siamese disparity estimation network for depth reconstruction in modern 3D laparoscopy. Med. Image Anal. 2022, 77, 102380. [Google Scholar] [CrossRef]
  6. Wang, Y.; Lai, Z.; Huang, G.; Wang, B.H.; Van Der Maaten, L.; Campbell, M.; Weinberger, K.Q. Anytime Stereo Image Depth Estimation on Mobile Devices. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA), Montreal, QC, Canada, 20–24 May 2019; pp. 5893–5900. [Google Scholar]
  7. Wofk, D.; Ma, F.; Yang, T.; Karaman, S.; Sze, V. FastDepth: Fast Monocular Depth Estimation on Embedded Systems. In Proceedings of the 2019 International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 6101–6108. [Google Scholar]
  8. Badki, A.; Troccoli, A.J.; Kim, K.; Kautz, J.; Sen, P.; Gallo, O. Bi3D: Stereo Depth Estimation via Binary Classifications. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1597–1605. [Google Scholar]
  9. Aguilera, C.A.; Aguilera, C.; Navarro, C.A.; Sappa, A.D. Fast CNN Stereo Depth Estimation through Embedded GPU Devices. Sensors 2020, 20, 3249. [Google Scholar] [CrossRef]
  10. Gan, W.; Wong, P.; Yu, G.; Zhao, R.; Vong, C. Light-weight network for real-time adaptive stereo depth estimation. Neurocomputing 2021, 441, 118–127. [Google Scholar] [CrossRef]
  11. Brandt, R.; Strisciuglio, N.; Petkov, N. MTStereo 2.0: Accurate Stereo Depth Estimation via Max-Tree Matching. Int. Conf. Comput. Anal. Images Patterns 2021, 13052, 110–119. [Google Scholar]
  12. Yang, H.; Zhu, Y.; Liu, J. Energy-Constrained Compression for Deep Neural Networks via Weighted Sparse Projection and Layer Input Masking. arXiv 2018, arXiv:1806.04321. [Google Scholar]
  13. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  14. Yun, J.; Lozano, A.C.; Yang, E. A General Family of Stochastic Proximal Gradient Methods for Deep Learning. arXiv 2020, arXiv:2007.07484. [Google Scholar]
  15. Yang, Y.; Yuan, Y.; Chatzimichailidis, A.; van Sloun, R.J.G.; Lei, L.; Chatzinotas, S. ProxSGD: Training Structured Neural Networks under Regularization and Constraints. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  16. Bai, Y.; Wang, Y.; Liberty, E. ProxQuant: Quantized Neural Networks via Proximal Operators. arXiv 2018, arXiv:1810.00861. [Google Scholar]
  17. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar]
  18. Menze, M.; Geiger, A. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3061–3070. [Google Scholar]
  19. Li, H.; Kadav, A.; Durdanovic, I.; Samet, H.; Graf, H.P. Pruning Filters for Efficient ConvNets. arXiv 2016, arXiv:1608.08710. [Google Scholar]
  20. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning Efficient Convolutional Networks through Network Slimming. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2755–2763. [Google Scholar]
  21. Frankle, J.; Carbin, M. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. arXiv 2018, arXiv:1803.03635. [Google Scholar]
  22. NVIDIA. Nvidia Jetson AGX Xavier. 2022. Available online: https://www.nvidia.cn/autonomous-machines/jetson-agx-xavier/ (accessed on 1 December 2022).
  23. Liao, J.; Fu, Y.; Yan, Q.; Luo, F.; Xiao, C. Adaptive depth estimation for pyramid multi-view stereo. Comput. Graph. 2021, 97, 268–278. [Google Scholar] [CrossRef]
  24. Chabra, R.; Straub, J.; Sweeney, C.; Newcombe, R.A.; Fuchs, H. StereoDRNet: Dilated Residual StereoNet. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 11786–11795. [Google Scholar]
  25. Wang, B.; Feng, Y.; Fu, H.; Liu, H. Unsupervised Stereo Depth Estimation Refined by Perceptual Loss. In Proceedings of the 2018 Ubiquitous Positioning, Indoor Navigation and Location-Based Services (UPINLBS), Wuhan, China, 22–23 March 2018; pp. 1–6. [Google Scholar]
  26. Huang, B.; Zheng, J.; Giannarou, S.; Elson, D.S. H-Net: Unsupervised Attention-based Stereo Depth Estimation Leveraging Epipolar Geometry. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  27. Smolyanskiy, N.; Kamenev, A.; Birchfield, S. On the Importance of Stereo for Accurate Depth Estimation: An Efficient Semi-Supervised Deep Neural Network Approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1007–1015. [Google Scholar]
  28. Salem, H.; El-Hasnony, I.M.; Kabeel, A.E.; El-Said, E.M.; Elzeki, O.M. Deep Learning model and Classification Explainability of Renewable energy-driven Membrane Desalination System using Evaporative Cooler. Alex. Eng. J. 2022, 61, 10007–10024. [Google Scholar] [CrossRef]
  29. Abdel-Razek, S.A.; Marie, H.S.; Alshehri, A.; Elzeki, O.M. Energy Efficiency through the Implementation of an AI Model to Predict Room Occupancy Based on Thermal Comfort Parameters. Sustainability 2022, 14, 7734. [Google Scholar] [CrossRef]
  30. Meta, A.I. Stereo Depth Estimation on KITTI2012. 2022. Available online: https://paperswithcode.com/sota/stereo-depth-estimation-on-kitti2012 (accessed on 1 December 2022).
  31. Kung, H.T. Why systolic architectures? IEEE Comput. 1982, 15, 37–46. [Google Scholar] [CrossRef]
  32. Mishra, R. Design of A Large Signal Memory Array for High Frequency Microprocessors. Int. J. Electr. Electron. Data Commun. 2015, 3, 53–56. [Google Scholar]
  33. Gonzalez, R.C. Digital Image Processing; PEARSON INDIA: Tamil Nadu, India, 2019. [Google Scholar]
Figure 1. The training loss curves of four stages on KITTI-2012 under 70% energy cost.
Figure 2. The training loss curves of four stages on KITTI-2015 under 70% energy cost.
Figure 3. Disparity predictions using ProjACQN from four stages of AnyNet under 70% energy cost on KITTI-2012 and KITTI-2015. Three-pixel errors are shown as red numbers, the lower the better.
Figure 4. Disparity predictions from stage four of AnyNet under 50%, 60% and 70% energy budgets on KITTI-2012. The percentages in the figure represent the three-pixel errors, the lower the better. Evaluation results of ProjACQN are marked in blue.
Figure 5. Disparity predictions from stage four of AnyNet under 50%, 60% and 70% energy budgets on KITTI-2015. The percentages in the figure represent the three-pixel errors, the lower the better. Evaluation results of ProjACQN are marked in blue.
Table 1. Three-pixel error (%) and resulting energy consumption of AnyNet with 50%, 60% and 70% energy budgets on the KITTI-2012 dataset. A lower three-pixel error is better.

Method      | Energy Budget | Energy | Stage 1 | Stage 2 | Stage 3 | Stage 4
Dense Model | 100%          | 97.6%  | 14.65   | 9.27    | 6.21    | 5.60
ProjAdam    | 70%           | 67.4%  | 15.42   | 11.87   | 6.97    | 5.98
ProjAdam    | 60%           | 55.4%  | 16.12   | 13.83   | 10.40   | 11.22
ProjAdam    | 50%           | 49.5%  | 16.06   | 13.98   | 10.84   | 40.55
ProjACQN    | 70%           | 65.5%  | 16.11   | 9.38    | 6.48    | 5.85
ProjACQN    | 60%           | 55.8%  | 16.35   | 9.32    | 6.63    | 6.09
ProjACQN    | 50%           | 49.3%  | 17.15   | 12.24   | 11.13   | 13.65
Table 2. Three-pixel error (%) and resulting energy consumption of AnyNet with 50%, 60% and 70% energy budgets on the KITTI-2015 dataset. A lower three-pixel error is better.

Method      | Energy Budget | Energy | Stage 1 | Stage 2 | Stage 3 | Stage 4
Dense Model | 100%          | 97.6%  | 13.24   | 9.12    | 6.30    | 5.75
ProjAdam    | 70%           | 67.1%  | 13.20   | 9.01    | 6.50    | 5.74
ProjAdam    | 60%           | 55.3%  | 20.63   | 17.70   | 14.15   | 13.40
ProjAdam    | 50%           | 49.8%  | 16.06   | 13.98   | 10.84   | 40.55
ProjACQN    | 70%           | 67.2%  | 12.79   | 8.60    | 5.66    | 5.13
ProjACQN    | 60%           | 57.4%  | 13.64   | 8.83    | 6.29    | 5.64
ProjACQN    | 50%           | 49.0%  | 17.12   | 12.25   | 11.13   | 13.69
Table 3. Three-pixel error (%) of different pruning methods using AnyNet under a 70% energy budget on the KITTI-2012 and KITTI-2015 datasets.

Pruning Method | Dataset    | Stage 1 | Stage 2 | Stage 3 | Stage 4
BN             | KITTI-2012 | 27.29   | 16.27   | 10.73   | 10.97
L1-Norm        | KITTI-2012 | 26.14   | 19.99   | 10.01   | 9.83
Lottery-Ticket | KITTI-2012 | 24.25   | 19.14   | 10.71   | 10.67
ProjACQN       | KITTI-2012 | 16.11   | 13.98   | 6.48    | 5.85
BN             | KITTI-2015 | 39.62   | 26.20   | 15.13   | 15.02
L1-Norm        | KITTI-2015 | 29.42   | 20.63   | 11.92   | 12.54
Lottery-Ticket | KITTI-2015 | 28.28   | 22.48   | 13.03   | 12.71
ProjACQN       | KITTI-2015 | 12.79   | 8.60    | 5.66    | 5.13
Table 4. Three-pixel error (%) of AnyNet with 50%, 60% and 70% energy budgets on the KITTI-2012 dataset using the AGX Xavier. A lower three-pixel error is better.

Method      | Energy Budget | Stage 1 | Stage 2 | Stage 3 | Stage 4
Dense Model | 100%          | 18.23   | 19.91   | 11.21   | 10.64
ProjAdam    | 70%           | 22.76   | 15.37   | 14.20   | 14.05
ProjAdam    | 60%           | 21.52   | 17.37   | 12.16   | 13.74
ProjAdam    | 50%           | 18.90   | 14.06   | 9.30    | 64.20
ProjACQN    | 70%           | 15.27   | 12.70   | 8.14    | 8.25
ProjACQN    | 60%           | 15.20   | 13.73   | 8.32    | 8.48
ProjACQN    | 50%           | 16.74   | 16.21   | 12.18   | 12.44
Table 5. Three-pixel error (%) of AnyNet with 50%, 60% and 70% energy budgets on the KITTI-2015 dataset using the AGX Xavier. A lower three-pixel error is better.

Method      | Energy Budget | Stage 1 | Stage 2 | Stage 3 | Stage 4
Dense Model | 100%          | 14.11   | 17.10   | 10.41   | 9.03
ProjAdam    | 70%           | 16.66   | 14.42   | 9.97    | 8.50
ProjAdam    | 60%           | 17.20   | 13.63   | 9.06    | 10.22
ProjAdam    | 50%           | 21.76   | 20.40   | 17.17   | 40.63
ProjACQN    | 70%           | 15.30   | 13.92   | 9.87    | 8.38
ProjACQN    | 60%           | 16.28   | 14.39   | 9.24    | 9.32
ProjACQN    | 50%           | 18.15   | 16.71   | 13.86   | 13.91

Share and Cite

Zeng, X.; Zhang, M.; Zhong, Z.; Liu, Y. Energy-Constrained Deep Neural Network Compression for Depth Estimation. Electronics 2023, 12, 732. https://doi.org/10.3390/electronics12030732
