Author Contributions
Conceptualization, L.Y. and H.C.; Methodology, L.Y. and H.C.; Software, L.Y.; Validation, L.Y., W.Z. and H.C.; Data Curation, H.C.; Writing—Original Draft Preparation, L.Y.; Writing—Review and Editing, H.C. and S.P.; Supervision, H.C.; Project Administration, H.C. All authors have read and agreed to the published version of the manuscript.
Figure 1.
An overview of a click-based interactive segmentation system. “pc” and “nc” stand for positive (green points in RS images) and negative (red points in RS images) clicks placed by a human for foreground and background selection, respectively. Users can obtain the desired high-quality masks with just a few simple clicks.
Figure 2.
The pipeline of DRE-Net. The symbols “⊗” and “⊕” indicate concatenation and addition operations, respectively.
Figure 3.
Example of the Zoom-In technique applied to remote sensing images. The green and red points stand for positive and negative clicks, respectively. The yellow bounding box is determined by the extreme points of the previous mask.
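As a rough illustration of how such a crop can be derived from the previous mask, the sketch below uses NumPy; the helper name and the expansion ratio are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def zoom_in_box(prev_mask: np.ndarray, expand_ratio: float = 1.4):
    """Bounding box from the extreme points of the previous mask, expanded with a margin.

    prev_mask: (H, W) boolean array, the prediction from the previous iteration.
    expand_ratio: hypothetical margin factor so some context around the object is kept.
    Returns (y0, y1, x0, x1) in pixel coordinates, or None if the mask is empty.
    """
    ys, xs = np.nonzero(prev_mask)
    if ys.size == 0:
        return None  # no previous prediction, so no zoom-in yet
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0
    h, w = (y1 - y0 + 1) * expand_ratio, (x1 - x0 + 1) * expand_ratio
    H, W = prev_mask.shape
    y0, y1 = max(int(cy - h / 2), 0), min(int(cy + h / 2), H - 1)
    x0, x1 = max(int(cx - w / 2), 0), min(int(cx + w / 2), W - 1)
    return y0, y1, x0, x1
```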
Figure 4.
Visualization of three common click-encoding methods. Distance transformation calculates the Euclidean distance between pixels and clicks, while Gaussian transformation makes a Gaussian distribution centered on each click. Disk encoding uses a binary disk with a defined radius.
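For illustration, the following NumPy/SciPy sketch builds one map per encoding for a set of clicks; the radius and sigma values are placeholders, not the settings used in the paper.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def encode_clicks(shape, clicks, radius=5, sigma=5.0):
    """Return distance-transform, Gaussian, and disk maps for a list of (y, x) clicks."""
    H, W = shape
    click_mask = np.zeros((H, W), dtype=bool)
    for y, x in clicks:
        click_mask[y, x] = True

    # Distance transformation: Euclidean distance from each pixel to its nearest click.
    dist = distance_transform_edt(~click_mask)

    # Gaussian transformation: a Gaussian bump centred on each click (max over clicks).
    yy, xx = np.mgrid[0:H, 0:W]
    gauss = np.zeros((H, W))
    for y, x in clicks:
        gauss = np.maximum(gauss, np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2)))

    # Disk encoding: binary disk of fixed radius around each click.
    disk = dist <= radius
    return dist, gauss, disk
```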
Figure 5.
Visualization of the automatic update of the radii. “R7” means that the click is encoded with a radius of 7 pixels. The green and red points stand for positive and negative clicks, respectively. The radius is set to the mid-value when the click is first placed and is updated in the next iteration. Thus, the radius of the negative click is 5 pixels in the 2nd iteration and is updated to 7 pixels in the 3rd iteration.
Figure 6.
Distribution of the minimum distance with HRNet32-OCR [44] as the backbone on the Potsdam dataset. We sampled the clicks generated in each corrective sampling segmentation.
Figure 7.
An example for disk encoding and DRE. The green and red points stand for positive and negative clicks, respectively. All clicks in disk encoding are encoded with the same radius, while the clicks close to the mask boundary have a smaller encoding radius in DRE. Through such a differentiated encoding method, we tell the network the different influences of each click and achieve higher accuracy with the same number of clicks.
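A minimal sketch of this idea follows; the radius range and the linear mapping from boundary distance to radius are illustrative assumptions, not the paper's exact rule.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def dre_radius(click_yx, prev_mask, r_min=3, r_max=7):
    """Pick a disk radius for one click from its distance to the previous mask's boundary.

    Clicks near the boundary get a small radius (fine correction); clicks far from it
    get a large radius (broad region). A click placed before any mask exists simply
    uses the mid-value, as described for Figure 5.
    """
    if prev_mask is None or not prev_mask.any():
        return (r_min + r_max) // 2
    # Distance of every pixel to the mask boundary, valid both inside and outside the mask.
    d_out = distance_transform_edt(~prev_mask)  # distance to the mask for background pixels
    d_in = distance_transform_edt(prev_mask)    # distance to the background for mask pixels
    dist_to_boundary = np.maximum(d_out, d_in)[click_yx]
    # Illustrative linear mapping, clipped to [r_min, r_max].
    return int(np.clip(r_min + dist_to_boundary // 4, r_min, r_max))
```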
Figure 8.
The architecture of the IIP module for HRNet32-OCR [44]. “Conv, 3 × 3/2, 16→64” means a 3 × 3 conv layer with a stride of 2, 16 input channels, and 64 output channels. The ScaleLayer divides each element of the feature maps by a learnable parameter k.
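To make the caption's notation concrete, here is a minimal PyTorch-style sketch of a ScaleLayer and the quoted conv spec; the padding value and the initial k are assumptions, and this is not the full IIP module.

```python
import torch
import torch.nn as nn

class ScaleLayer(nn.Module):
    """Divides each element of the input feature maps by a learnable scalar k."""
    def __init__(self, init_k: float = 1.0):
        super().__init__()
        self.k = nn.Parameter(torch.tensor(init_k))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x / self.k

# "Conv, 3 x 3/2, 16 -> 64": 3 x 3 kernel, stride 2, 16 input and 64 output channels
# (padding=1 is an assumption made here to preserve the spatial grid before striding).
conv = nn.Conv2d(in_channels=16, out_channels=64, kernel_size=3, stride=2, padding=1)
```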
Figure 9.
(a) Procedure of the incremental training strategy. It includes (b) the RoC-first training stage and (c) the incremental training stage. Both stages use the segmentor described in Section 3.1.4 with different click-encoding methods. We call the result of the RoC-first training stage ToR-Net (the model trained on the largest radius). The training strategy of the RoC-first training stage is consistent with those in recent works [22,24]. “RS” and “CS” in (b,c) stand for “random sampling” and “corrective sampling”, respectively.
Figure 10.
Visualization of the process of constructing a sample set with image–ground-truth pairs.
Figure 11.
Mean IoU@k curves for the Potsdam dataset and Vaihingen dataset.
Figure 12.
Comparison of the CG on the surface category in the Potsdam and Vaihingen datasets. A smoothly rising curve shows better CG performance.
Figure 13.
Visualization of the interaction process for FocalClick [24], RITM [22], and our method. The green and red points stand for positive and negative clicks, respectively. The clicks were generated according to the rules described in Section 4.2, and their coding radii are presented. We changed the colors of ground-truth masks (cars and surfaces) to show them clearly. The images in the red block show the CG deterioration problem. The abbreviation “1pc, 2nc” indicates one positive and two negative clicks in the white block. Our method achieved high accuracy earlier while ensuring the CG of the network.
Figure 14.
RoC ablation results for the Potsdam and Vaihingen datasets. The abbreviations “BS” and “IT” indicate the baseline and incremental training strategy, respectively.
Figure 15.
CG ablation results for the surface category in the Potsdam and Vaihingen datasets. The abbreviations “BS” and “IT” indicate the baseline and incremental training strategy, respectively. A smoothly rising curve shows better CG performance.
Table 1.
Statistics of the datasets, including the resolution, size, image number, and mask number.
Dataset | Resolution | Size | Split | Images | Samples |
---|---|---|---|---|---|
Potsdam | 5 cm | | train | 2960 | 14,675 |
Potsdam | 5 cm | | test | 740 | 3689 |
Vaihingen | 9 cm | | train | 1322 | 5570 |
Vaihingen | 9 cm | | test | 329 | 1418 |
Table 2.
Comparison of the NoC@85 and NoC@90 on the Potsdam dataset. (The best is bolded and the runner-up is underlined).
Method | Buildings NoC@85 | Buildings NoC@90 | Cars NoC@85 | Cars NoC@90 | Surfaces NoC@85 | Surfaces NoC@90 | Trees NoC@85 | Trees NoC@90 | Low Vegetation NoC@85 | Low Vegetation NoC@90 | All Six Categories NoC@85 | All Six Categories NoC@90 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FocalClick | 3.70 | 4.87 | 12.82 | 16.76 | 9.55 | 13.04 | 16.24 | 18.46 | 14.41 | 16.93 | 9.55 | 13.04 |
BRS | 2.90 | 3.99 | 6.72 | 10.71 | 8.20 | 11.36 | 13.43 | 16.83 | 12.66 | 15.61 | 9.74 | 12.59 |
RGB-BRS | 2.67 | 3.55 | 5.66 | 9.43 | 6.73 | 10.00 | 11.62 | 15.53 | 10.97 | 14.46 | 8.47 | 11.53 |
f-BRS | 2.59 | 3.37 | 5.69 | 9.49 | 6.96 | 10.38 | 12.64 | 16.19 | 11.47 | 14.72 | 8.81 | 11.72 |
RITM | 2.52 | 3.24 | 5.17 | 8.52 | 6.04 | 9.03 | 10.95 | 14.61 | 9.82 | 13.34 | 7.73 | 10.67 |
-Net | 1.96 | 2.49 | 4.10 | 7.25 | 4.98 | 7.83 | 9.07 | 13.16 | 8.57 | 12.26 | 6.67 | 9.63 |
-Net | 2.05 | 2.56 | 4.10 | 7.24 | 4.98 | 7.78 | 9.18 | 13.24 | 8.64 | 12.28 | 6.74 | 9.67 |
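Assuming the two columns per category are the usual NoC@85 and NoC@90 metrics, i.e., the number of clicks needed to reach the target IoU, capped at 20 clicks as in Table 4, the metric can be written as

$$\mathrm{NoC@}q \;=\; \frac{1}{N}\sum_{i=1}^{N}\min\Bigl(20,\ \min\{\,k : \mathrm{IoU}_i(k)\ge q\,\}\Bigr),$$

where $\mathrm{IoU}_i(k)$ is the IoU of sample $i$ after $k$ clicks and $N$ is the number of test samples.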
Table 3.
Comparison of the NoC@85 and NoC@90 on the Vaihingen dataset. (The best is bolded and the runner-up is underlined).
Method | Buildings NoC@85 | Buildings NoC@90 | Cars NoC@85 | Cars NoC@90 | Surfaces NoC@85 | Surfaces NoC@90 | Trees NoC@85 | Trees NoC@90 | Low Vegetation NoC@85 | Low Vegetation NoC@90 | All Six Categories NoC@85 | All Six Categories NoC@90 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
FocalClick | 5.27 | 7.33 | 14.26 | 17.64 | 6.49 | 9.29 | 12.28 | 16.17 | 11.83 | 15.11 | 10.80 | 13.98 |
BRS | 2.40 | 3.66 | 7.51 | 13.42 | 5.87 | 8.97 | 11.35 | 15.44 | 12.80 | 15.84 | 8.24 | 11.55 |
RGB-BRS | 2.24 | 3.35 | 6.29 | 11.50 | 5.01 | 7.76 | 9.65 | 13.89 | 11.24 | 14.80 | 7.04 | 10.23 |
f-BRS | 2.21 | 3.15 | 6.32 | 11.33 | 4.98 | 7.81 | 9.72 | 13.96 | 11.45 | 14.76 | 7.09 | 10.18 |
RITM | 2.20 | 3.05 | 5.67 | 10.54 | 4.36 | 6.79 | 8.46 | 12.33 | 9.72 | 13.39 | 6.34 | 9.39 |
-Net | 1.92 | 2.45 | 5.49 | 10.86 | 4.00 | 6.50 | 7.64 | 11.73 | 9.17 | 13.14 | 5.74 | 8.84 |
-Net | 1.94 | 2.53 | 5.02 | 10.97 | 3.97 | 6.42 | 7.41 | 11.48 | 8.81 | 12.84 | 5.55 | 8.73 |
Table 4.
Evaluation results for the NoF metric and the inference speed on the Potsdam dataset and Vaihingen dataset. NoF@85/NoF@90 indicates the number of failures to reach the target IoU of 85%/90% within 20 clicks. SPC indicates the average running time in seconds per click, and Time measures the total running time taken to process a dataset. (The best is bolded and the runner-up is underlined).
Dataset | Method | NoF@85 | NoF@90 | SPC, s | Time, H:M:S |
---|---|---|---|---|---|
Potsdam | FocalClick | 1373 | 2052 | 0.123 | 2:31:15 |
Potsdam | BRS | 1224 | 1841 | 1.020 | 20:54:16 |
Potsdam | RGB-BRS | 730 | 1380 | 1.406 | 28:48:55 |
Potsdam | f-BRS | 1011 | 1635 | 0.176 | 3:36:25 |
Potsdam | RITM | 489 | 1081 | 0.106 | 2:10:21 |
Potsdam | -Net | 351 | 911 | 0.110 | 2:15:11 |
Potsdam | -Net | 373 | 916 | 0.110 | 2:15:16 |
Vaihingen | FocalClick | 393 | 699 | 0.104 | 0:49:09 |
Vaihingen | BRS | 381 | 629 | 0.772 | 6:04:54 |
Vaihingen | RGB-BRS | 218 | 433 | 1.064 | 8:22:55 |
Vaihingen | f-BRS | 274 | 503 | 0.104 | 0:49:09 |
Vaihingen | RITM | 128 | 342 | 0.096 | 0:45:23 |
Vaihingen | -Net | 96 | 309 | 0.097 | 0:45:53 |
Vaihingen | -Net | 97 | 311 | 0.097 | 0:45:51 |
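Restating the caption's description of the failure count as an equation (a paraphrase for clarity, not a formula quoted from the paper):

$$\mathrm{NoF@}q \;=\; \sum_{i=1}^{N}\mathbf{1}\bigl[\mathrm{IoU}_i(20) < q\bigr],$$

i.e., the number of test samples whose IoU still falls short of the target $q$ after 20 clicks.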
Table 5.
Comprehensive comparison of the different methods.
Method | RoC | CG | Generalizability | Inference Speed |
---|---|---|---|---|
FocalClick | Bad | Good | Bad | Fast |
BRS | Bad | Very Good | Bad | Slow |
RGB-BRS | Good | Very Good | Good | Very Slow |
f-BRS | Good | Bad | Bad | Fast |
RITM | Very Good | Bad | Very Good | Very Fast |
DRE-Net | Best | Very Good | Best | Very Fast |
Table 6.
RoC ablation results for the Potsdam and Vaihingen datasets. The abbreviations “BS” and “IT” indicate the baseline and incremental training strategy, respectively. (The best is bolded and the runner-up is underlined).
Dataset | Method | NoC@80 | NoC@85 | NoC@90 | NoF@85 | NoF@90 |
---|---|---|---|---|---|---|
Potsdam | BS | 5.88 | 7.73 | 10.67 | 489 | 1081 |
Potsdam | +DRE | 5.55 | 7.27 | 10.12 | 370 | 921 |
Potsdam | +IT | 5.15 | 6.92 | 9.81 | 381 | 926 |
Potsdam | +DRE+IT | 4.98 | 6.67 | 9.63 | 351 | 911 |
Vaihingen | BS | 4.56 | 6.34 | 9.39 | 128 | 342 |
Vaihingen | +DRE | 4.22 | 5.92 | 8.99 | 122 | 317 |
Vaihingen | +IT | 4.18 | 5.89 | 8.98 | 130 | 328 |
Vaihingen | +DRE+IT | 3.99 | 5.74 | 8.84 | 96 | 309 |
Table 7.
Test results on the Potsdam dataset under different thresholds and different hyperparameters. The subscripted model names stand for the models trained on a radius of 7 pixels with the corresponding hyperparameter value. mIoU@k represents the mean intersection over union (IoU) between the predictions and the ground-truth masks at the kth iteration. (The best is bolded).
Threshold | Model | mIoU@1 | mIoU@2 | mIoU@3 | mIoU@5 | mIoU@10 | mIoU@20 |
---|---|---|---|---|---|---|---|
3 | | 47.04% | 59.36% | 67.84% | 75.89% | 83.69% | 88.81% |
4 | | 47.43% | 66.88% | 73.44% | 80.16% | 86.66% | 90.72% |
5 | | 48.56% | 66.76% | 72.77% | 79.56% | 86.49% | 90.74% |
6 | | 43.63% | 66.12% | 73.31% | 80.67% | 87.42% | 91.36% |
7 | | 44.53% | 67.54% | 73.69% | 81.04% | 87.76% | 91.73% |
8 | | 36.97% | 53.91% | 66.75% | 76.39% | 85.15% | 89.88% |
7 | | 44.53% | 67.54% | 73.69% | 81.04% | 87.76% | 91.73% |
7 | | 48.47% | 67.43% | 74.33% | 81.00% | 87.21% | 91.12% |
7 | | 52.92% | 69.51% | 74.92% | 81.23% | 87.68% | 91.62% |
7 | | 44.53% | 67.54% | 73.69% | 81.04% | 87.76% | 91.73% |
7 | | 52.92% | 69.51% | 74.92% | 81.23% | 87.68% | 91.62% |
7 | -Net | 55.14% | 70.13% | 76.75% | 83.29% | 89.27% | 92.73% |
7 | -Net | 56.53% | 70.55% | 76.76% | 83.14% | 89.05% | 92.63% |
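Written out, the mIoU@k metric defined in the caption of Table 7 is

$$\mathrm{mIoU@}k \;=\; \frac{1}{N}\sum_{i=1}^{N}\mathrm{IoU}\bigl(\hat{M}_i^{(k)},\,M_i\bigr),$$

where $\hat{M}_i^{(k)}$ is the predicted mask for sample $i$ after $k$ clicks, $M_i$ is its ground-truth mask, and $N$ is the number of test samples.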