*Appendix A.4 Pairwise Comparisons between Proposed NI-U-Net++ and Related Studies*

This research uses Figures 3 and A3 in Section 2.3 to discuss the pairwise comparisons between NI-U-Net++ and the related studies in Table 5. Figure A3 uses NI-U-Net++ as the background, with several highlights added for further comparison. The red arrows indicate the sub-U-Nets; each of them forms a complete encoder–decoder process. Here, we define the compression ratio as the ratio between the input and output sizes (height or width) of the encoder. "Sub-U-Net No. 1" has the highest compression ratio, while "Sub-U-Net No. 4" has the lowest. The blue dashed frame highlights the deep supervision mentioned in Section 2.3, and the orange frames indicate the micro-networks.
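The compression-ratio concept can be sketched in a few lines. The sketch below assumes that each encoder pooling stage halves the spatial side length and that the sub-U-Nets contain four down to one pooling stages; both the pooling counts and the 256-pixel input size are illustrative assumptions, not values taken from the paper.

```python
def compression_ratio(input_size, num_pool, pool_stride=2):
    """Side-length ratio between the encoder input and its deepest feature
    map, assuming each pooling stage divides the side by `pool_stride`."""
    output_size = input_size // (pool_stride ** num_pool)
    return input_size / output_size

# Hypothetical 256-pixel input; assumed pooling counts per sub-U-Net.
for name, pools in [("No. 1", 4), ("No. 2", 3), ("No. 3", 2), ("No. 4", 1)]:
    ratio = compression_ratio(256, pools)
    print(f"Sub-U-Net {name}: compression ratio = {ratio:.0f}")
```

Under these assumptions, "Sub-U-Net No. 1" compresses the side length by a factor of 16, while "Sub-U-Net No. 4" compresses it only by a factor of 2, matching the ordering described above.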

Compared with U-Net:

	- a. U-Net only has "Sub-U-Net No. 1"; therefore, its compression ratio is constant at a high level.
	- b. U-Net does not have a deep supervision design.
	- c. U-Net utilizes 3 × 3 convolution layers and "ReLU" activation instead of the micro-network shown in Figures 3, 4 and A3.

Compared with U-Net++:

	- a. U-Net++ also has four sub-U-Nets, as in NI-U-Net++.
	- b. U-Net++ also has deep supervision, as in NI-U-Net++.
	- c. However, U-Net++ applies the 3 × 3 convolution layer and "ReLU" activation as in U-Net, instead of the micro-network in NI-U-Net++.

Compared with NI-U-Net:

	- a. NI-U-Net only has "Sub-U-Net No. 1"; therefore, its compression ratio is constant at a high level.
	- b. NI-U-Net does not have a deep supervision design.
	- c. NI-U-Net utilizes the same micro-network as NI-U-Net++.
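The convolution-block difference in item (c) can be made concrete with a rough parameter count. The sketch below assumes a NIN-style micro-network (one 3 × 3 convolution followed by two 1 × 1 convolutions); the actual micro-network is specified in Figures 3, 4 and A3, so this structure, and the channel counts used, are illustrative assumptions only.

```python
def conv_params(c_in, c_out, k):
    """Parameter count of a single 2-D convolution with bias."""
    return c_in * c_out * k * k + c_out

def plain_block(c_in, c_out):
    # U-Net / U-Net++ style: two 3x3 convolutions (each followed by ReLU).
    return conv_params(c_in, c_out, 3) + conv_params(c_out, c_out, 3)

def micro_network_block(c_in, c_out):
    # Assumed NIN-style micro-network: one 3x3 convolution followed by
    # two 1x1 convolutions (the exact design is given in Figure 4).
    return conv_params(c_in, c_out, 3) + 2 * conv_params(c_out, c_out, 1)

for c_in, c_out in [(3, 64), (64, 128)]:
    print(f"{c_in}->{c_out}: plain = {plain_block(c_in, c_out)}, "
          f"micro-network = {micro_network_block(c_in, c_out)}")
```

Under this assumption, the 1 × 1 convolutions add cross-channel nonlinearity at a far lower parameter cost than a second 3 × 3 convolution, which is the usual motivation for NIN-style micro-networks.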

**Figure A3.** The pairwise comparisons for U-Net, U-Net++, NI-U-Net, and proposed NI-U-Net++.

*Appendix A.5 Additional Results of the Pre-Training Process*


**Table A2.** The Dice score of U-Net, U-Net++, NI-U-Net, and NI-U-Net++.

**Figure A4.** The loss and accuracy curves of U-Net [14] using the synthetic dataset. The green "A" and "B" correspond to the two highlights mentioned in Section 3.2. (**a**) The epoch-wise loss curves on the training and validation sets. (**b**) The epoch-wise accuracy curves on the training and validation sets. The horizontal dashed lines indicate the final convergence levels.

**Figure A5.** The loss and accuracy curves of U-Net++ [15] using the synthetic dataset. The green "A" and "B" correspond to the two highlights mentioned in Section 3.2. (**a**) The epoch-wise loss curves on the training and validation sets. (**b**) The epoch-wise accuracy curves on the training and validation sets. The horizontal dashed lines indicate the final convergence levels.

**Figure A6.** The loss and accuracy curves of NI-U-Net [57] using the synthetic dataset. The green "A" and "B" correspond to the two highlights mentioned in Section 3.2. (**a**) The epoch-wise loss curves on the training and validation sets. (**b**) The epoch-wise accuracy curves on the training and validation sets. The horizontal dashed lines indicate the final convergence levels.

**Figure A7.** The loss and accuracy curves of Chiodini2020 [44] using the synthetic dataset. (**a**) The epoch-wise loss curves on the training and validation sets. (**b**) The epoch-wise accuracy curves on the training and validation sets.

**Figure A8.** The loss and accuracy curves of Furlan2019 [43] using the synthetic dataset. (**a**) The epoch-wise loss curves on the training and validation sets. (**b**) The epoch-wise accuracy curves on the training and validation sets.

Table A3 presents the results of training NI-U-Net++ with different numbers of synthetic images. This research chooses about 50% (7000 images) and 10% (1000 images) as two experimental settings to evaluate the impact of decreasing the number of synthetic images. All results degrade as the number of images decreases. Since the synthetic algorithm aims to generate a large amount of valid data, applying all available data is better aligned with the target of this research.

**Table A3.** The quantitative results of NI-U-Net++ tested using a different number of synthetic images.


<sup>1</sup> "Number" refers to the number of synthetic images used in the corresponding experiment.
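The two reduced-data settings in Table A3 amount to drawing random subsets of the synthetic pool. The sketch below is illustrative only: the pool size (14,000) and the file names are hypothetical, and only the subset sizes (7000 and 1000) come from the text above.

```python
import random

def subsample(image_paths, n, seed=0):
    """Draw a reproducible random subset of n images (without replacement)."""
    rng = random.Random(seed)
    return rng.sample(image_paths, n)

# Hypothetical pool of synthetic image file names.
pool = [f"synthetic_{i:05d}.png" for i in range(14000)]
subset_50 = subsample(pool, 7000)   # ~50% setting from Table A3
subset_10 = subsample(pool, 1000)   # ~10% setting from Table A3
print(len(subset_50), len(subset_10))
```

Fixing the random seed keeps the reduced-data experiments repeatable, so differences in the results can be attributed to the amount of training data rather than to which images happened to be drawn.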

*Appendix A.6 Additional Results of the Transfer-Training Process*

**Figure A9.** The inference time record. The maximum, mean, and minimum inference times are 0.0364, 0.0307, and 0.0294 s, respectively.

**Figure A10.** Examples of the real-life rover vision and the corresponding predictions.
