3.1. Overview
This section describes the unsupervised depth estimation distillation framework, which takes a single RGB image as input and produces a dense depth map. The details of our model are shown in Figure 1. To achieve unsupervised monocular depth prediction distillation, we follow Liu et al. [15] and propose a dense prediction distillation strategy. This strategy adopts the pixel-wise similarity between the outputs of the teacher and student networks, the structured pair-wise similarity between intermediate feature maps, and the holistic similarity produced by a CNN adversarial module, and combines these similarities into an integrated loss function. In addition to these elements, Liu et al. [15] also employ the ground truth of related CV tasks. In this work, we abandon the ground-truth module and regard the rest of the framework in [15] as our baseline model. Moreover, we introduce a pose network as an extra module to assist the convergence of the student network. Although pose networks are commonly used in self-supervised monocular depth estimation models to produce photometric differences as supervision, no such design has been reported for the student network during distillation. Following this, the idea of GAN [15,31] is retained, and we replace the traditional CNN with a Transformer [16] as the discriminator, given its advantages in capturing global information and in training parallelism.
As shown in Figure 1, our framework is composed of three main blocks: a Knowledge Distillation module, a Pose Network module, and a Transformer module. The Transformer [16] is employed as the discriminator of the GAN [15,31]. At the same time, the student network is regarded as the generator that produces fake samples, while the outputs of the teacher network are the real samples. More detailed descriptions of these blocks are given in the following three sections.
3.2. Preliminaries
The current Knowledge Distillation network is based on [15], which deals with dense prediction CV tasks such as semantic segmentation, object detection, and depth estimation. We adopt the pixel-wise loss, pair-wise loss, and holistic loss introduced in that work.
Pixel-Wise Distillation. In the original paper [15], the task is semantic segmentation, in which the teacher and student networks output probability distributions over classes. That loss design is not suitable for estimating depth, which is spatially continuous, so in this paper we calculate the difference between the depth maps from the teacher and student networks with the loss function defined as follows:

$$\ell_{pi}^{S} = \frac{1}{W \times H} \sum_{i \in R} \left( d_i^{T} - d_i^{S} \right)^2$$

where $R$ represents the set of all depth map pixels, $d_i^{T}$ and $d_i^{S}$ denote the depth value of the $i$th pixel from the teacher network and student network, respectively, $W$ and $H$ denote the width and height of the depth map, respectively, and the superscript $S$ indicates that the loss is used to update the parameters of the student module. The pixel-wise loss term is shown as $\ell_{pi}^{S}$ in Figure 1.
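For concreteness, the following is a minimal PyTorch-style sketch of this term, assuming dense depth tensors of shape (B, 1, H, W); the function name pixel_wise_loss and the squared-error form are illustrative rather than the released implementation.

```python
import torch

def pixel_wise_loss(depth_t: torch.Tensor, depth_s: torch.Tensor) -> torch.Tensor:
    """Mean squared difference between teacher and student depth maps.

    depth_t, depth_s: (B, 1, H, W) dense depth predictions.
    The teacher output is detached so only the student receives gradients.
    """
    return ((depth_t.detach() - depth_s) ** 2).mean()
```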
Pair-Wise Distillation. In addition to the straightforward pixel-level difference between the depth maps, a pair-wise loss is also applied in our model. This loss pays more attention to the structural similarity between the intermediate feature maps produced by the teacher and student networks. In this part, two hyperparameters, i.e., the connection range $\alpha$ and the granularity $\beta$, are defined to control the range on the maps over which the structure is computed. Assuming the dimensions of the feature map are $W' \times H' \times C$, the feature map can be transformed into an affinity graph for further comparison. Each node of the affinity graph aggregates $\beta$ pixels in a spatial local patch of the feature map, and we only consider structural similarity over the top-$\alpha$ closest nodes; i.e., the affinity graph contains $W'H'/\beta$ nodes and $W'H'\alpha/\beta$ connections. The aggregation method is average pooling. If we let $f_i$ (of dimension $C$) represent the $i$th node of the affinity graph, the affinity between nodes $i$ and $j$ is

$$a_{ij} = \frac{f_i^{\top} f_j}{\left\| f_i \right\|_2 \left\| f_j \right\|_2}$$

and $a_{ij}^{T}$ and $a_{ij}^{S}$ represent the structural similarity between the $i$th and $j$th nodes of the teacher module and student module, respectively. The function used to describe the discrepancy between the feature maps of the two modules can then be defined as follows:

$$\ell_{pa}^{S} = \frac{1}{|R'| \times \alpha} \sum_{i \in R'} \sum_{j \in \alpha(i)} \left( a_{ij}^{T} - a_{ij}^{S} \right)^2$$

where $R'$ denotes the node set of the affinity graph and $\alpha(i)$ denotes the top-$\alpha$ nodes connected to node $i$. The values of $\alpha$ and $\beta$ are set according to the experiments in [15]. The superscript $S$ again indicates that the loss is used to update the student module parameters. The pair-wise loss term is shown as $\ell_{pa}^{S}$ in Figure 1.
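The construction of the affinity graph can be sketched as follows, assuming PyTorch feature maps of shape (B, C, H, W); here beta is taken as the side length of the pooled patch (so each node aggregates beta × beta pixels) and alpha=None denotes a fully connected graph, both illustrative choices rather than the exact implementation in [15].

```python
import torch
import torch.nn.functional as F

def pair_wise_loss(feat_t, feat_s, beta=2, alpha=None):
    """Structured pair-wise distillation over affinity graphs.

    feat_t, feat_s: (B, C, H, W) intermediate feature maps.
    beta: side of the average-pooled patch; the graph then has
          (H // beta) * (W // beta) nodes.
    alpha: connection range; None keeps all pairwise connections.
    """
    def affinity(feat):
        f = F.avg_pool2d(feat, kernel_size=beta)   # aggregate local patches
        f = f.flatten(2).transpose(1, 2)           # (B, N, C) node features
        f = F.normalize(f, p=2, dim=2)             # unit-norm nodes
        return f @ f.transpose(1, 2)               # (B, N, N) cosine affinities

    a_t, a_s = affinity(feat_t).detach(), affinity(feat_s)
    diff = (a_t - a_s) ** 2
    if alpha is not None:                          # keep only top-alpha neighbours
        idx = a_t.topk(alpha, dim=2).indices
        diff = diff.gather(2, idx)
    return diff.mean()
```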
Holistic Distillation. Another strategy that we integrate into our framework is holistic distillation. This part maps the depth maps generated by the teacher and student modules to a high-order space and computes their holistic loss. Specifically, the original paper [15] employs a traditional self-attention CNN as the discriminator of a GAN and treats the outputs of the student and teacher networks as fake and real samples, respectively. If the fake and real samples are represented as $D^{S}$ and $D^{T}$, the holistic loss function can be written as follows:

$$\ell_{ho}^{D} = \mathbb{E}_{D^{S} \sim p_{s}(D^{S})}\left[ \mathcal{D}(D^{S} \mid I) \right] - \mathbb{E}_{D^{T} \sim p_{t}(D^{T})}\left[ \mathcal{D}(D^{T} \mid I) \right]$$

where $\mathcal{D}(\cdot)$ is the embedding network, i.e., the discriminator in the GAN. The depth maps $D^{T}$ and $D^{S}$ from the teacher and student networks are concatenated with the color image $I$ and then regarded as the inputs of the discriminator. The module is composed of two self-attention layers with four convolution blocks, and it projects the concatenated input to a high-order embedding score. The superscript $D$ indicates that the loss is used to update the discriminator module parameters. As can be seen from the formula, the discriminator is updated to produce lower embedding scores for the compact network (student) and higher embedding scores for the dense network (teacher). In this training process, the discriminator becomes better at distinguishing fake samples from real ones. For the student network, the holistic loss can be defined as follows:

$$\ell_{ho}^{S} = -\,\mathbb{E}_{D^{S} \sim p_{s}(D^{S})}\left[ \mathcal{D}(D^{S} \mid I) \right]$$

This loss updates the student module parameters by maximizing the scores generated by the discriminator. The training process contains two steps:
- 1. Fix the student module parameters; minimize $\ell_{ho}^{D}$ to train the discriminator so that the module has enough capacity to distinguish the fake samples from the real ones.
- 2. Fix the discriminator parameters; use $\ell_{ho}^{S}$ along with the other loss terms to train the compact network, thus generating high-quality depth maps.
By iterating the above two steps, the adversarial training generates a compact module with better convergence.
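To make the alternation concrete, the following is a minimal PyTorch-style sketch of one iteration, assuming teacher, student, and discriminator are nn.Module instances and keeping only the holistic term for brevity; adversarial_step and the optimizer handles opt_d/opt_s are illustrative names, not the authors' released code.

```python
import torch

def adversarial_step(img, student, teacher, discriminator, opt_d, opt_s):
    """One iteration of the alternating holistic-distillation scheme."""
    with torch.no_grad():
        depth_t = teacher(img)                       # real sample
    depth_s = student(img)                           # fake sample

    # Step 1: fix the student, update the discriminator so that it scores
    # teacher outputs higher and student outputs lower.
    loss_d = discriminator(torch.cat([depth_s.detach(), img], 1)).mean() \
           - discriminator(torch.cat([depth_t, img], 1)).mean()
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Step 2: fix the discriminator, update the student to maximize its
    # embedding score (other loss terms omitted here for brevity).
    loss_s = -discriminator(torch.cat([depth_s, img], 1)).mean()
    opt_s.zero_grad(); loss_s.backward(); opt_s.step()
```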
The above three loss terms are regarded as our baseline.
3.3. Pose-Assisted Network
There is little research on unsupervised depth model distillation. Inspired by previous unsupervised depth estimation networks, we integrate a pose network [8] into the student network to improve the performance of the baseline network. This module can be trained to predict the relative pose between consecutive frames, and the pose information can be utilized to construct a photometric reprojection error. In our framework, it is used to assist the convergence of the student depth network and is employed only in the training stage. This module needs two consecutive frames as input, unlike the distillation part above, where a single color image is enough. We address this mismatch by loading 12 images per batch: the batch size is 12 single images for the distillation part but six consecutive-frame pairs for the pose module.
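As an illustration of this batching scheme (a sketch only; the loader layout beyond the counts above is an assumption, including the interleaved ordering of consecutive frames):

```python
import torch

def split_batch(images: torch.Tensor):
    """images: (12, 3, H, W), with consecutive frames assumed to sit in
    adjacent positions (an illustrative convention)."""
    depth_batch = images        # all 12 frames feed the distillation losses
    targets = images[0::2]      # 6 target frames for the pose module
    sources = images[1::2]      # 6 source frames paired with the targets
    return depth_batch, targets, sources
```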
Just as in [8,9], $T_{t \to s}$ is defined as the relative pose of the source image $I_s$ with respect to the target image $I_t$. The reprojected image from $I_s$ can be written as follows:

$$I_{s \to t} = I_s \left\langle \operatorname{proj}\left( D_t, T_{t \to s}, K \right) \right\rangle$$

where $D_t$ is the depth map of $I_t$, $K$ is the intrinsic parameter matrix of the camera lens, $\operatorname{proj}(\cdot)$ denotes the resulting 2D coordinates of $D_t$ projected into $I_s$, and the angle brackets represent the sampling operator used to align the image sizes. With the target image $I_t$ and the synthesized image $I_{s \to t}$, the photometric reprojection loss can be calculated as follows:

$$L_{ph} = \frac{1}{|V|} \sum_{p \in V} \left( \lambda_1 \left\| I_t(p) - I_{s \to t}(p) \right\|_1 + \lambda_2 \frac{1 - \operatorname{SSIM}\left( I_t(p), I_{s \to t}(p) \right)}{2} \right)$$

where $\lambda_1$ and $\lambda_2$ weight the two terms, the first term is an L1 norm that is robust to outliers, SSIM [45] estimates the pixel-wise similarity, and $V$ represents the valid pixels that are reprojected from the $I_s$ plane to the $I_t$ plane.
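A sketch of the per-pixel photometric term follows, assuming img_warp has already been synthesized by projecting $D_t$ with the predicted pose and bilinearly sampling $I_s$; ssim_fn and the weight names lambda_l1/lambda_ssim are placeholders for the paper's SSIM module and hyperparameter values, which are not reproduced here.

```python
import torch

def photometric_loss(img_t, img_warp, lambda_l1, lambda_ssim, ssim_fn):
    """Per-pixel photometric reprojection error between the target image
    and the image synthesized from the source view.

    img_t, img_warp: (B, 3, H, W); ssim_fn returns a per-pixel SSIM map.
    """
    l1 = (img_t - img_warp).abs().mean(1, keepdim=True)
    dssim = (1.0 - ssim_fn(img_t, img_warp)) / 2.0
    return lambda_l1 * l1 + lambda_ssim * dssim    # (B, 1, H, W) error map

# Usage: average the returned map over the valid pixel set V to obtain L_ph.
```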
In addition, we can add another constraint between the two consecutive frames. Because of the widespread geometry inconsistency in depth estimation, we enforce the consistency of the consecutive depth maps [10] by minimizing their discrepancy. Using the pose information from the pose network, we can warp the source image depth to the target image plane, denoted $D_{s \to t}$. If we define the depth inconsistency for each pixel $p$ as:

$$D_{diff}(p) = \frac{\left| D_{s \to t}(p) - D_t(p) \right|}{D_{s \to t}(p) + D_t(p)}$$

then the geometry consistency loss for each depth map can be defined as follows:

$$L_{GC} = \frac{1}{|V|} \sum_{p \in V} D_{diff}(p)$$

We use the sum of the corresponding depths to normalize the depth inconsistency, thus avoiding a discrepancy in distribution. For pixels belonging to dynamic objects and occlusions, $D_{diff}$ takes unreasonably large values. Therefore, we can use a weight mask $M = 1 - D_{diff}$ to weigh $L_{ph}$, thereby reducing the side effect of these pixels, i.e.,:

$$L_{ph}^{M} = \frac{1}{|V|} \sum_{p \in V} M(p) \cdot L_{ph}(p)$$
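The inconsistency map and weight mask can be sketched as follows, assuming aligned PyTorch depth tensors; the eps guard against division by zero is an illustrative addition.

```python
import torch

def geometry_consistency(depth_warp, depth_t, eps=1e-7):
    """Depth inconsistency map D_diff and weight mask M between the source
    depth warped to the target plane and the target depth itself.

    depth_warp, depth_t: (B, 1, H, W) aligned depth maps.
    """
    d_diff = (depth_warp - depth_t).abs() / (depth_warp + depth_t + eps)
    mask = 1.0 - d_diff        # down-weights dynamic objects and occlusions
    return d_diff, mask

# Usage: L_GC is d_diff.mean() over valid pixels, and the masked photometric
# loss averages mask * photometric error map over the same region.
```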
Normally, edge-aware smoothness [20] ensures that the smoothness is guided by the edges of the images. The smoothness loss is defined as follows:

$$L_{s} = \sum_{p} \left( e^{-\nabla I_t(p)} \cdot \nabla D_t(p) \right)^2$$

where $\nabla$ denotes the first derivative along the $X$ and $Y$ directions.
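A sketch of the edge-aware term, assuming PyTorch tensors and using forward differences for $\nabla$; the reduction to a mean is an illustrative choice.

```python
import torch

def smoothness_loss(depth: torch.Tensor, img: torch.Tensor) -> torch.Tensor:
    """Edge-aware smoothness: depth gradients are penalized less where the
    image itself has strong gradients (edges).

    depth: (B, 1, H, W); img: (B, 3, H, W).
    """
    dx_d = depth[:, :, :, 1:] - depth[:, :, :, :-1]
    dy_d = depth[:, :, 1:, :] - depth[:, :, :-1, :]
    dx_i = (img[:, :, :, 1:] - img[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (img[:, :, 1:, :] - img[:, :, :-1, :]).abs().mean(1, keepdim=True)
    # Squared, image-gradient-weighted depth gradients, as in the loss above.
    return ((torch.exp(-dx_i) * dx_d) ** 2).mean() + \
           ((torch.exp(-dy_i) * dy_d) ** 2).mean()
```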