1. Introduction
Over the past several years, deep-learning-based computer vision techniques have been extensively applied to computer-aided diagnosis (CAD). In computational pathology, deep-learning-based pathological image analysis has proven powerful in improving the efficiency and accuracy of cancer detection [1]. Nuclear morphology is an essential feature used by pathologists in cancer diagnosis and further cancer prognoses, such as survival prediction [2] and pathological tumor grading [3]. Accurate nuclei segmentation and classification can also advance the quality of tissue segmentation [4,5]. Nuclei segmentation is the crucial first step in obtaining the morphological features used in downstream analysis. However, the morphological heterogeneity of nuclei makes such studies challenging: karyomorphism varies widely, and different diseases may cause chromatin abnormalities that exhibit diverse size and shape patterns. Another severe problem is that the cells in a cancerous tumor are usually densely packed and may even contain more than one nucleus, causing overlapping nuclei. This overlap makes it difficult for automatic segmentation to separate neighboring instances.
Extracting each nucleus and distinguishing its type can increase the diagnostic potential of present-day digital pathology pipelines. For instance, precisely distinguishing tumor nuclei from lymphocytes can significantly facilitate downstream analysis of tumor-infiltrating lymphocytes (TILs), which has proven effective in predicting cancer recurrence [6]. Nucleus-by-nucleus classification has therefore become another problem of recent research interest, owing to the high variability and diversity of nuclei appearance in a whole-slide image.
Current deep models for histopathology image diagnosis are mainly based on single-task learning, in which a model is designed for one specific task and then optimized iteratively. In this setting, nuclei segmentation and classification require two independent models: one to detect the location of each nucleus and another to classify its type [7,8]. For more complicated tasks, it is common to decompose the problem and model each part separately. However, this approach has an obvious drawback: when modeling each sub-task, it is easy to ignore the relationships, conflicts, and constraints between different sub-tasks, degrading the overall performance on the entire task.
To address the above issue, multi-task models have drawn much attention [9,10,11,12]. Multi-task models have the following advantages: (1) multiple tasks share the same model, reducing memory consumption; (2) multiple tasks obtain their results in a single forward pass, increasing inference speed; (3) associated tasks share information and complement each other, improving each task's performance.
Recently, several multi-task deep models for histopathology image diagnosis have been proposed and have achieved promising results [13,14,15]. Unfortunately, these approaches still suffer from efficiency issues, such as cumbersome models with huge numbers of parameters. In addition, the classification scheme of the CoNSeP dataset [13] hardly meets the needs of practical pathological diagnosis.
The present paper proposes a lightweight, multi-task deep learning framework for segmenting and classifying nuclei simultaneously. To address the network instability encountered by batch normalization (BN) when dealing with small batch sizes, we introduce two newly designed blocks. We devise an efficient encoder-decoder architecture, in which the encoder adopts our proposed RGS for down-sampling, while the decoder uses the Dense-Ghost-Module (DGM) and convolution for up-sampling. By encoding the horizontal and vertical (HV) distances of nuclei pixels, we obtain more representative instance-level features with fewer layers; the HV distances allow overlapping nuclei instances to be segmented accurately. The decoder then uses the output features of the encoder to predict nuclei types. Based on these characteristics, we call the proposed network GSN-HVNET. Our experimental results show that the proposed model retains shallow features of nuclei, improving segmentation and classification accuracy. Our main contributions are outlined below:
We propose a novel, lightweight, multi-task deep learning framework: a unified model that performs segmentation and classification of nuclei instances simultaneously with superior efficiency and accuracy.
We propose the newly designed RGS and DGS to improve accuracy and compress the training model.
We redefine the classification principle of the CoNSeP dataset so that the auxiliary diagnostic results have practical significance in pathological diagnosis.
Our experiments on the CoNSeP, Kumar, and CPM-17 datasets confirm the improvements over existing works [13,14]. Compared with the state-of-the-art HoVer-Net [13], the number of parameters is reduced by 64%. In addition, we evaluate different batch sizes in our experiments and show that batch size is no longer a strict limitation of the proposed network; even with a small batch, the network maintains high performance.
The remainder of this paper is organized as follows: Section 2 introduces current research on applying learning algorithms to histopathology image analysis. Our new network architecture is presented in Section 3. We conduct experiments and present the results in Section 4. Finally, Section 5 concludes our work and briefly discusses future work.
3. Proposed Method
Figure 1 shows an overview of GSN-HVNET for simultaneous nuclei instance segmentation and classification. The network takes as input center patches cropped from the sample images. The model simultaneously segments the nuclei and predicts nuclei types and HV maps (horizontal and vertical maps). After a post-processing procedure, the nuclei instances are obtained from the HV maps and the nuclei pixel predictions. The final output is obtained by combining the segmentation results with the nuclei-type predictions; in other words, the network completes the segmentation and classification of nuclei instances in the last step.
3.1. Network Architecture
Figure 2 illustrates the detailed structure of the proposed GSN-HVNET. The network consists of an encoder and a decoder for automatic segmentation and classification of nuclei instances. The encoder extracts an effective set of features, whose output then serves as the decoder input. The decoder contains three branches: branch I (NSS) performs nuclei semantic segmentation, branch II (HV) predicts the HV distances of nuclei pixels to their mass centers, and branch III (NC) predicts nuclei types. We combine the outputs of branches I and II to accomplish instance segmentation; the instance segmentation result is then combined with the branch III output to accomplish automatic segmentation and classification of nuclei instances. The encoder employs the proposed RGS, as discussed in Section 3.1.1. The details of GBS and RGS are introduced in Section 3.1.2 and Section 3.1.3, respectively. Section 3.1.4 describes the decoder designed with DGS, and Section 3.1.5 presents the details of DGS.
3.1.1. Encoder
To extract an effective set of features, we design a novel residual ghost network as part of the encoder in the overall network. The network employs a Conv2D-SN-ReLU (CSR) block and a series of four Residual-Ghost-Modules (RGMs) for down-sampling. Here, the CSR block is composed of a Conv2D, SN, and ReLU. An RGM consists of multiple instances of our improved Ghost-Block, the Residual-Ghost-Block with switchable normalization (RGS) [35]. Benefiting from ghost convolution, our network requires far fewer parameters to generate abundant feature maps than ordinary convolution, improving the computational efficiency of the encoder. Moreover, SN selects an optimal combination of different normalizers for each normalization layer, improving network stability, i.e., the accuracy is not affected by the batch size. Each RGM performs down-sampling by a factor of 2, i.e., it halves the spatial resolution of its input. We give a detailed discussion of RGS in the two subsequent subsections.
3.1.2. Ghost Block with Switchable Normalization
Figure 3 compares the structure of Ghost-Block-BN (GBB) [36] and our proposed Ghost-Block-SN (GBS). A Ghost-Block helps a convolutional neural network generate more features at a much lower cost: it first generates several intrinsic feature maps using an ordinary convolution and then applies cheap linear operations to expand these features and increase the number of channels. The computational cost of the linear operations on feature maps is much lower than that of traditional convolution and outperforms other existing efficient designs. The kernel size of the primary convolution in a Ghost-Block can be customized; a point-wise convolution is employed in this paper. In the Residual-Ghost-Block (RGB), each Ghost-Block is followed by a BN layer, which offers several advantages, including stabilizing and speeding up training. However, the performance of GBB is severely restricted by the batch size, because BN is the only normalizer used in the entire network and can become unstable and hurt accuracy when the batch size is small.
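As a rough illustration of the cost saving, the parameter count of a Ghost-Block can be compared with that of an ordinary convolution producing the same number of output channels. The sketch below is a simplification: the expansion ratio `s = 2` and the 3×3 depthwise kernels for the cheap linear operations are assumed values, not the exact hyper-parameters of our blocks.

```python
def conv_params(c_in, c_out, k):
    # parameters of an ordinary k x k convolution (bias omitted)
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k, d=3, s=2):
    # the primary convolution generates c_out // s intrinsic feature maps;
    # (s - 1) cheap d x d depthwise operations then expand them to c_out channels
    intrinsic = c_out // s
    primary = c_in * intrinsic * k * k
    cheap = intrinsic * (s - 1) * d * d
    return primary + cheap

# point-wise (1 x 1) primary convolution, as used in our Ghost-Blocks
print(conv_params(256, 256, 1))   # 65536
print(ghost_params(256, 256, 1))  # 33920
```

With these assumed settings the compression ratio approaches the expansion ratio `s`, which is where the roughly halved parameter count of ghost convolution comes from.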
To solve the above problem, we apply switchable normalization (SN), which is robust to a wide range of batch sizes, small or large. As shown in Figure 4, SN measures channel-wise, layer-wise, and minibatch-wise statistics using instance normalization (IN) [37], layer normalization (LN) [38], and batch normalization (BN) [39], respectively, and finds an optimal combination by learning their importance weights, ensuring the stability and accuracy of the network even for small batch sizes.
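A minimal functional sketch of this idea, assuming a plain softmax over the importance weights and omitting the learnable affine parameters, is:

```python
import numpy as np

def switchable_norm(x, w_mean, w_var, eps=1e-5):
    # x: (N, C, H, W); w_mean, w_var: logits over the [IN, LN, BN] statistics
    axes_list = [(2, 3), (1, 2, 3), (0, 2, 3)]  # IN, LN, BN estimation axes
    mus = [x.mean(axis=a, keepdims=True) for a in axes_list]
    vs = [x.var(axis=a, keepdims=True) for a in axes_list]
    wm = np.exp(w_mean) / np.exp(w_mean).sum()  # softmax importance weights
    wv = np.exp(w_var) / np.exp(w_var).sum()
    mu = sum(w * m for w, m in zip(wm, mus))
    var = sum(w * v for w, v in zip(wv, vs))
    return (x - mu) / np.sqrt(var + eps)
```

When the learned weights concentrate on a single normalizer, SN degenerates to IN, LN, or BN; otherwise it interpolates among them, which is what makes it insensitive to the batch size.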
3.1.3. Residual Ghost Block with Switchable Normalization
Our RGS adopts the structure of the residual block, the essential building unit of the residual neural network (ResNet) [40], owing to its outstanding performance. The key idea behind the residual block is to reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. As shown in Figure 5, we embed the proposed GBS in a residual block to form the RGS. Several RGSs are then stacked to form an RGM. Our network contains four stacked RGMs with 1, 2, 3, and 1 RGS, respectively. Compared with the original ResNet-50, our network employs fewer building units to extract feature maps and reduces redundant features, leading to a smaller model. In addition, the proposed RGS is generic and can be used to construct other lightweight deep learning architectures.
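The residual formulation itself is compact. The toy sketch below abstracts the GBS units as callables and assumes two GBS units per block with an optional projection shortcut; the exact internal configuration follows Figure 5 and may differ from this simplification.

```python
import numpy as np

def rgs_block(x, gbs1, gbs2, shortcut=None):
    # learn a residual function F(x) = gbs2(gbs1(x)) and add the input back;
    # the shortcut projects x when the shapes of F(x) and x differ
    identity = x if shortcut is None else shortcut(x)
    return gbs2(gbs1(x)) + identity
```

If the residual branch learns the zero function, the block reduces to an identity mapping, which is what makes deep stacks of such blocks easy to optimize.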
3.1.4. Decoder
As aforementioned, the decoder contains three branches to achieve accurate nuclei instance segmentation and classification simultaneously. The three branches adopt the same architecture, consisting of a series of up-sampling operations and two Dense-Ghost-Modules. A DGM contains a series of cascading DGSs. By stacking multiple DGSs, we enlarge the receptive field with relatively fewer parameters than the popular Dense-Block, increasing computational efficiency. Low-level information is critical in segmentation tasks because it helps to determine object boundaries precisely. To exploit it, we adopt skip connections that merge the features from each RGS in the encoder via concatenation. A DGM follows each of the first and second up-sampling operations, with eight and four DGSs in the first and second DGM, respectively. Each of the three branches contains three up-sampling steps, making the output features the same spatial dimensions as the input image. By combining the results of the two up-sampling branches, NSS and HV, we obtain accurate boundaries of each individual nucleus and thereby accomplish nuclei instance segmentation. Compared with independent networks for different tasks, the proposed network is a unified model that accomplishes nuclei segmentation and classification simultaneously, reducing the total training time.
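One up-sampling step of a decoder branch can be sketched as follows. Nearest-neighbour up-sampling is an assumption made for illustration (the actual up-sampling operator may differ), and the function names are hypothetical; the point is the channel-wise concatenation of the encoder skip feature after each resolution increase.

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour up-sampling: (C, H, W) -> (C, 2H, 2W)
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decoder_step(x, encoder_skip):
    # double the spatial resolution, then merge the encoder feature map
    # from the corresponding RGS level via channel-wise concatenation
    up = upsample2x(x)
    return np.concatenate([up, encoder_skip], axis=0)
```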
3.1.5. Dense Ghost Module with Switchable Normalization
In this part, we propose a novel module applied to the decoder of GSN-HVNET. An example of the proposed module is shown in Figure 6. Each DGS connects to the other DGSs with forward feedback and employs GBS to extract feature maps. The feature maps from all preceding layers are used as the current input, and the feature maps output by a DGS serve as inputs to all subsequent layers. Thus, the proposed module retains more abundant features as inputs for subsequent layers.
Similarly, benefiting from the lightweight nature of GBS, our proposed DGS uses fewer parameters to generate abundant feature maps and valid features than the Dense-Block [41]. Moreover, it avoids unnecessary computation by reducing redundant feature maps. In particular, the DGM can maintain its performance under a small mini-batch size.
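The dense connectivity pattern described above can be sketched functionally, with each DGS abstracted as a callable (its ghost-convolution internals are omitted here):

```python
import numpy as np

def dense_ghost_module(x, dgs_layers):
    # dense connectivity: every DGS consumes the channel-wise concatenation
    # of the module input and all preceding DGS outputs
    feats = [x]
    for dgs in dgs_layers:
        feats.append(dgs(np.concatenate(feats, axis=0)))
    return np.concatenate(feats, axis=0)
```

Because every layer sees all earlier feature maps, shallow features survive to the end of the module, which is how the decoder retains abundant features with few parameters.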
3.1.6. Joint Loss Function of GSN-HVNET
We design a different loss function for each task. In Table 1, we define the notations used in our work. The joint loss function $\mathcal{L}$ is defined by
$$\mathcal{L} = \mathcal{L}_{NSS} + \mathcal{L}_{HV} + \mathcal{L}_{NC},$$
where $\mathcal{L}_{NSS}$, $\mathcal{L}_{HV}$, and $\mathcal{L}_{NC}$ denote the losses of the NSS, HV, and NC branches, respectively.
The NSS branch corresponds to a semantic segmentation task, and its loss function is designed using BCE loss and dice loss. It is defined by
$$\mathcal{L}_{NSS} = \lambda_{b}\,\mathcal{L}_{bce} + \lambda_{d}\,\mathcal{L}_{dice},$$
where $\mathcal{L}_{bce}$ and $\mathcal{L}_{dice}$ represent the binary cross-entropy loss function and the dice loss function for the output of the NSS branch, respectively, and $\lambda_{b}$ and $\lambda_{d}$ are scalars that weight their associated loss functions. The above two functions are defined by
$$\mathcal{L}_{bce} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} X_{i,k}\,\log Y_{i,k}$$
and
$$\mathcal{L}_{dice} = 1 - \frac{2\sum_{i} X_{i}\,Y_{i} + \epsilon}{\sum_{i} X_{i} + \sum_{i} Y_{i} + \epsilon},$$
where $X$ represents the ground truth, $Y$ denotes the prediction, $K$ represents the number of categories, and $N$ is the number of pixels. In order to avoid a zero denominator, we add a small constant $\epsilon$ to the numerator and denominator.
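A minimal NumPy sketch of the two terms above is given below; the concrete $\epsilon$ values are illustrative assumptions, not the ones used in our experiments.

```python
import numpy as np

def bce_loss(X, Y, eps=1e-7):
    # binary cross-entropy between ground truth X and prediction Y in [0, 1]
    Y = np.clip(Y, eps, 1.0 - eps)  # avoid log(0)
    return -np.mean(X * np.log(Y) + (1 - X) * np.log(1 - Y))

def dice_loss(X, Y, eps=1e-6):
    # dice loss with the smoothing constant eps in numerator and denominator
    inter = np.sum(X * Y)
    return 1.0 - (2.0 * inter + eps) / (np.sum(X) + np.sum(Y) + eps)
```

A perfect prediction drives the dice loss to zero, while the BCE term penalizes per-pixel confidence, which is why the two are commonly combined for segmentation.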
The loss function for the HV branch is defined by
$$\mathcal{L}_{HV} = \lambda_{m}\,\mathcal{L}_{mse} + \lambda_{g}\,\mathcal{L}_{msge},$$
where $\mathcal{L}_{mse}$ represents the mean squared error measuring the difference between the HV distance predictions and the ground truth, $\mathcal{L}_{msge}$ is used to calculate the mean squared error between the gradients of the HV maps and those of the ground truth, and $\lambda_{m}$ and $\lambda_{g}$ are the weights of their associated loss functions. $\mathcal{L}_{mse}$ and $\mathcal{L}_{msge}$ are defined by
$$\mathcal{L}_{mse} = \frac{1}{N}\sum_{i=1}^{N}\big(p_{i}(I) - \Gamma_{i}(I)\big)^{2}$$
and
$$\mathcal{L}_{msge} = \frac{1}{|M|}\sum_{i \in M}\big(\nabla p_{i}(I) - \nabla \Gamma_{i}(I)\big)^{2},$$
where $I$ represents the input image, $p(I)$ is defined as the regression output of the HV branch, $\Gamma(I)$ denotes the ground truth of the HV distances of nuclei pixels to their mass centers, and $M$ is the set of nuclei pixels. The pixel-wise softmax predictions of the NSS and NC branches are represented by $q$ and $r$, respectively.
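The two HV terms can be sketched in NumPy as follows. Here `np.gradient` (central finite differences) stands in for the Sobel-based gradient used in the actual pipeline, so this is an approximation for illustration only.

```python
import numpy as np

def mse_loss(pred, gt):
    # mean squared error between predicted and ground-truth HV distances
    return np.mean((pred - gt) ** 2)

def msge_loss(pred, gt, nuclei_mask):
    # mean squared error between the gradients of the HV maps,
    # evaluated only on nuclei pixels (the set M in the text)
    gy_p, gx_p = np.gradient(pred)
    gy_t, gx_t = np.gradient(gt)
    diff = (gx_p - gx_t) ** 2 + (gy_p - gy_t) ** 2
    return diff[nuclei_mask > 0].mean()
```

Restricting the gradient term to nuclei pixels focuses the penalty on instance boundaries, where the HV maps change sharply.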
The loss function of the NC branch is defined by
$$\mathcal{L}_{NC} = \lambda_{b}'\,\mathcal{L}_{bce} + \lambda_{d}'\,\mathcal{L}_{dice}.$$
Similarly, $\lambda_{b}'$ and $\lambda_{d}'$ are used to balance the two loss functions $\mathcal{L}_{bce}$ and $\mathcal{L}_{dice}$.
3.2. Post-Processing
The proposed network produces three outputs. To obtain the nuclei locations and separate overlapping or clustered nuclei, we post-process the outputs of the NSS and HV branches. Within each HV map, there are significant differences between pixels of adjacent instances. Using this property, we can calculate the gradient so as to separate the clustered nuclei. To do that, we have
$$S_{m} = \max\big(S_{x}(p_{x}),\, S_{y}(p_{y})\big),$$
where $p_{x}$ and $p_{y}$ represent the horizontal and vertical predictions produced by the HV branch, and $S_{x}$ and $S_{y}$ refer to the horizontal and vertical components, respectively, of the Sobel operator, which calculates the horizontal and vertical derivative approximations. In Figure 1, $S_{m}$ highlights the regions where pixels in adjacent regions of two instances differ significantly in the horizontal and vertical maps.
We compute the marker $M$ according to
$$M = \sigma\big(\tau(q, h) - \tau(S_{m}, k)\big),$$
where $q$ is the output probability map of the NSS branch, $\tau(\cdot, t)$ is a threshold function that sets values above the threshold $t$ to 1 and to 0 otherwise, $\sigma$ is a rectifier setting all negative values to 0, and $M$ is the output marker. We can obtain the desired segmentation results by choosing appropriate $h$ and $k$.
Next, we compute the energy landscape $E$ according to
$$E = \big(1 - \tau(S_{m}, k)\big)\cdot \tau(q, h).$$
Finally, given the energy landscape $E$, a marker-controlled watershed is carried out using $M$ as the marker to determine how to split $\tau(q, h)$. The task of joint segmentation and classification of nuclei requires converting the per-pixel nuclei-type predictions of the NC branch to predictions of the type of each nuclei instance. To do so, we combine the post-processing result with the NC branch output.
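The marker and energy computation can be sketched in NumPy as below. The Sobel filtering is hand-rolled for self-containment, the values of `h` and `k` are illustrative assumptions, and the final marker-controlled watershed step (e.g., via an image-processing library) is omitted.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)

def sobel(img, kernel):
    # 3x3 correlation with edge padding: derivative approximation
    p = np.pad(img, 1, mode="edge")
    out = np.zeros_like(img, dtype=float)
    h, w = img.shape
    for i in range(3):
        for j in range(3):
            out += kernel[i, j] * p[i:i + h, j:j + w]
    return out

def tau(x, t):
    # threshold function: 1 above t, 0 otherwise
    return (x > t).astype(float)

def markers_and_energy(q, px, py, h=0.5, k=0.4):
    sx = np.abs(sobel(px, SOBEL_X))     # horizontal derivative of the H map
    sy = np.abs(sobel(py, SOBEL_X.T))   # vertical derivative of the V map
    sm = np.maximum(sx, sy)
    sm = sm / (sm.max() + 1e-8)         # normalize S_m to [0, 1]
    M = np.maximum(tau(q, h) - tau(sm, k), 0.0)  # sigma acts as a rectifier
    E = (1.0 - tau(sm, k)) * tau(q, h)
    return M, E
```

On a flat HV map (no instance boundaries), $S_m$ vanishes, so the marker simply equals the thresholded NSS probability map; strong HV gradients carve the marker apart at touching-nuclei boundaries.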
5. Conclusions
In this paper, we designed a lightweight, multi-task deep learning framework for nuclei segmentation and classification. Our model follows an encoder-decoder architecture, and the decoder consists of three branches, each outputting a prediction for one sub-task. To exploit the correlation among the three branches, we employ the NSS and HV branches to complete nuclei instance segmentation and use the NC branch to predict the class of each nucleus within a single learning process. Two newly designed blocks, Residual-Ghost-SN and Dense-Ghost-SN, are employed in the encoder and decoder, respectively, to reduce the computational cost and improve network stability under small batch sizes. Extensive experiments have been carried out on the CoNSeP, Kumar, and CPM-17 datasets, and the results demonstrate that our model offers a state-of-the-art trade-off between computational efficiency and both segmentation and classification accuracy.
Ultimately, our idea is generic and can easily be deployed to other histopathology image analysis tasks. Moreover, the blocks proposed in this paper, including Residual-Ghost-SN and Dense-Ghost-SN, are also generic and can be flexibly embedded into other deep CNNs for histopathology image diagnostic tasks. However, we have not conducted experiments on natural images, so their effectiveness in that domain cannot be guaranteed. We therefore pose this as an open problem and expect to provide a theoretical analysis with complete proofs in future research.