Article

Advantages of Using Transfer Learning Technology with a Quantitative Measurement

1 National Land Survey of Finland (NLS), 00521 Helsinki, Finland
2 Finnish Geospatial Research Institute (FGI), 02150 Espoo, Finland
* Author to whom correspondence should be addressed.
Remote Sens. 2023, 15(17), 4278; https://doi.org/10.3390/rs15174278
Submission received: 7 July 2023 / Revised: 24 August 2023 / Accepted: 29 August 2023 / Published: 31 August 2023

Abstract

The number of people living in cities is continuously growing, and the buildings in topographic maps need frequent updates, which are costly to perform manually. This makes automatic building extraction a significant research subject. Transfer learning, on the other hand, offers solutions in situations where the data of a target area are scarce, making it a valuable research subject. Moreover, previous studies have lacked metrics for quantifying the accuracy improvement achieved with transfer learning techniques. This paper investigated various transfer learning techniques and their combinations with U-Net for the semantic segmentation of buildings from true orthophotos, and the results were analyzed using quantitative methods. Open-source remote sensing data from Poland were used for pretraining a model for building segmentation. The fine-tuning techniques studied included fine-tuning the contracting path, fine-tuning the expanding path, retraining the contracting path, and retraining the expanding path. These fine-tuning techniques and their combinations were tested with three local datasets from diverse environments in Finland: urban, suburban, and rural areas. Knowledge from the pretrained model was transferred to the local datasets from Helsinki (urban), Kajaani (suburban), and selected areas across Finland (rural). Three models with no transfer learning were trained from scratch with the three sets of local data for comparison with the fine-tuning results. Our experiments focused on how various transfer learning techniques perform on datasets from different environments (urban, suburban, and rural areas) and multiple locations (southern, northern, and across Finland), and a quantitative assessment of the performance improvement obtained with transfer learning techniques was conducted. Despite the differences in the datasets, the results showed that several different transfer learning techniques achieved at least 5% better accuracy than a model trained from scratch. In addition, the effect of the size of the training dataset was studied.

1. Introduction

Map updating, object detection, and change detection accomplished manually from remote sensing data remain laborious [1]. The population living in cities has been growing constantly in recent years, causing continuous changes in urban environments and highlighting the need for updating buildings in topographic maps [2]. Automating the extraction of buildings will reduce the time consumed by map change detection and is, at the same time, a step toward the automation of the map updating process [3]. Deep learning methods, and especially convolutional neural networks (CNN), have proven successful in building extraction and automation tasks. CNN-based image classification methods have evolved in recent years and have become an important part of semantic segmentation tasks in building extraction [4]. Classical CNN architectures for image processing include VGG [5], GoogLeNet [6], and ResNet [7]. The fully convolutional network (FCN) [8] is a landmark pixel-based segmentation method that provided new inspiration for applying CNNs to building extraction research [4]. The idea is to use existing CNNs as encoders to generate hierarchical features and to use upsampling methods as decoders to reconstruct the images and generate the semantic segmentations, eliminating the fully connected layers entirely. FCN, SegNet [9], and U-Net [10] all employ an encoder–decoder architecture and have been widely used for building extraction tasks, as noted by Luo et al. (2021) in their review of deep learning-based building extraction [4]. U-Net in particular, introduced by Ronneberger et al. in 2015, has been a popular choice for building segmentation because of its simple architecture and strong performance [10,11,12,13]. In addition, unsupervised domain adaptation (UDA) algorithms are rapidly evolving, and many of these methods are suitable for remote sensing data [14,15,16]; the Segment Anything Model (SAM) has also achieved promising results in segmentation tasks [17]. However, accurate building extraction remains a challenging topic even when high-quality remote sensing data and advanced CNNs are available, due to vegetation, nearby objects, shadows, and different rooftop materials [2,18].
The quantity of suitable high-quality remote sensing data available can often also be scarce, as data can be difficult, slow, and expensive to collect [19]. At the same time, traditional machine learning methods assume that training data and testing data come from the same domain, an assumption that does not hold in many real-world scenarios. Transfer learning solves this problem by allowing the transfer of information between different domains, and studying transfer learning methods and identifying the best ways to fine-tune the neural network being used have therefore proven important. Transfer learning was first introduced in 1976 by Bozinovski and Fulgosi [20], who addressed its use in the training of neural networks. As transfer learning transfers information between different domains, it makes it possible to train a deep learning model in an environment with a notable amount of data available for a semantic segmentation task and then adapt it to another environment with only a small amount of data.
Some common transfer learning techniques include freezing all other layers of a neural network with their weights and retraining only the output layer, freezing some layers of the network while retraining others, and continuing to train the neural network with the new data using a lower learning rate. One typical approach for fine-tuning a network is to freeze the shallow layers while modifying the deeper layers according to a new dataset. However, this approach may not always work for different types of data [12]. Transfer learning techniques with U-Net have mostly been studied for medical image segmentation [12,21,22,23] but also for seismic data [24]. In 2020, Amiri et al. [12] observed that with some data types—for example, ultrasound data—it might be more appropriate to fine-tune the shallow layers of U-Net, while with other data types, such as X-ray data, fine-tuning the deeper layers was more likely to yield better segmentation results.
However, in recent years, some studies have examined other types of transfer learning techniques with U-Net: one paper in 2019 [25] studied transfer learning and U-Net for segmenting buildings from the INRIA dataset with a method called discriminative fine-tuning, which allowed the use of different learning rates for different layers. In [26], U-Net-based methods achieved the best segmentation accuracy compared with older machine learning methods like a two-scale FCN and a multilayer perceptron (MLP), and radiometric augmentation was studied for direct transfer learning from an aerial dataset to a satellite dataset. In 2019, Liu et al. [13] chained two U-Net models into a chain network with transfer learning for building extraction; at best, the model achieved a 6.61% higher intersection over union (IoU) score than a normal U-Net [13].
In addition, some studies have examined transfer learning with neural networks other than U-Net in remote sensing image classification and segmentation, with a couple of studies focusing on buildings: [27] proposed a scale-robust FCN for building extraction and introduced a data augmentation strategy for transfer learning from an aerial dataset to a satellite dataset, while noting that larger datasets and better data augmentation strategies might be needed to increase the models’ generalization ability. In 2022, Lin et al. [28] used transfer learning to assess seismic building damage by proposing a data transfer algorithm for identifying potentially beneficial samples from the historical data of earthquake-affected areas. Nonetheless, studies focusing on fine-tuning methods for building extraction with U-Net and other neural network architectures remain limited, and misclassifying other objects like roads as buildings remains a problem [13].
In a recent publication, Pinto et al. (2022) [29] reviewed the applications of transfer learning for smart buildings. By analyzing 77 papers, the authors suggested three future research opportunities regarding transfer learning, one of which was that different transfer learning approaches need to be compared. The main contributions of this paper are as follows: 1. A quantitative assessment of the accuracy improvement obtained with transfer learning techniques was conducted; 2. Various transfer learning techniques and their combinations were investigated; 3. Datasets from multiple environments, including urban, suburban, and rural areas, were tested; 4. The effect of dataset size on different fine-tuning techniques was studied.

2. Material

The open-source LandCover.ai data from Poland [30] were exploited for pretraining a model. The LandCover.ai data include 33 orthophotos with a 25 cm per pixel resolution and 8 orthophotos with a 50 cm per pixel resolution. The images have three spectral bands (RGB) and are provided as GeoTIFFs. The data cover a total area of 216.27 km². Open-source datasets suitable for building extraction were sought for the purposes of transfer learning and training a pretrained model. The two main requirements for an open-source dataset were that it should include aerial images, like the Finnish data, and that the pixel resolution of the images should be close to that of the Finnish data, making it easier for the deep learning models to adjust to it. The LandCover.ai data fulfilled these conditions and provided a large amount of easily available data from an environment quite close to the Finnish one, although with some differences in building style, which led to it being chosen.
The local data consisted of three datasets from different environments in Finland: urban, suburban, and rural areas. There were three sets of true orthophotos. According to previous studies [31,32,33], adding 3D information from DSMs and DEMs to aerial images can be useful and lead to better segmentation results. Therefore, DEMs and DSMs were also used together with true orthophotos in parts of this study. Both the orthophotos and the true orthophotos used the ETRS-TM35FIN coordinate system.
Table 1 presents the datasets that were used. A training sample refers to one image of a dataset: a 2000 × 2000 pixel image, a 1000 × 1000 pixel image, or a 512 × 512 pixel image, depending on the dataset. On the other hand, one datapoint refers to a randomly cropped image piece when data augmentation methods are being used with training sets: either a 256 × 256 pixel image crop (D1, D2, D3, D4, D5, D7) or a 512 × 512 pixel image crop (D6). The data augmentation performed for all the training data included random cropping, as well as horizontal and vertical flipping, both carried out with a 50% chance.
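As an illustration, a minimal sketch of this augmentation is given below, assuming the training samples and their 0–1 labels are NumPy arrays; the 256 × 256 pixel crop size corresponds to the datapoints used for D1–D5 and D7, and the function name is illustrative.

```python
import random
import numpy as np

def augment(image: np.ndarray, label: np.ndarray, crop: int = 256):
    """Random crop plus 50% horizontal and vertical flips, applied
    identically to an image tile (H x W x C) and its 0-1 label (H x W)."""
    h, w = label.shape
    top = random.randint(0, h - crop)
    left = random.randint(0, w - crop)
    image = image[top:top + crop, left:left + crop]
    label = label[top:top + crop, left:left + crop]
    if random.random() < 0.5:                      # horizontal flip with 50% chance
        image, label = image[:, ::-1], label[:, ::-1]
    if random.random() < 0.5:                      # vertical flip with 50% chance
        image, label = image[::-1, :], label[::-1, :]
    return image.copy(), label.copy()
```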

2.1. Datasets from Urban Environment

True orthophotos from the Helsinki urban area had a pixel size of 0.3 m and were 2000 × 2000 pixels in image size. Thus, one 2000 × 2000 pixel true orthophoto covered an area of 600 m × 600 m. The true orthophotos were produced from aerial images acquired in 2020 and DSMs using the SURE for ArcGIS software from Esri, installed in the CSC supercomputer environment [35]. The data contained the information x, y, R, G, B, and near-infrared (NIR). Building labels for the true orthophoto data were collected manually based on the vector data of the topographic database. The building vector data from the topographic database were measured from aerial stereo-images and describe the bases of the buildings, while the building labels based on true orthophotos were collected from the tops of the buildings and contain detailed edge information. The correction was performed manually in order to ensure the building label quality. Figure 1 shows the data from different locations, and Figure 2 shows the high-quality building labels with detailed information.
The building label vector file was cropped into 2000 × 2000 pixel pieces, each corresponding to one true orthophoto, using QGIS and Python, and converted into 0–1 array labels. The true orthophotos were cropped into 1000 × 1000 pixel pieces. Three datasets of different sizes were constructed for fine-tuning and for model training without transfer learning: D1, D2, and D3. These training datasets contained 500, 300, and 100 samples of 1000 × 1000 pixel crops of the Helsinki urban area and covered areas of 45 km², 27 km², and 9 km², respectively. For these datasets, a true orthophoto stripe from the Helsinki urban area was used for validation, early stopping, and testing. This set consisted of 96 samples of 1000 × 1000 pixels with a 0.3 m pixel resolution, covering an area of 8.64 km². The three different-sized datasets were used to test how the different transfer learning techniques perform with respect to the amount of local training data available for fine-tuning the model.

2.2. Datasets from Suburban Environment

The model from the Helsinki urban area was provided by the ATMU project team for fine-tuning with the Kajaani suburban area data. This U-Net model was trained with 86 km² of true orthophotos, DSMs, DEMs, and vector data from the topographic database. Overall, 20 km² of the data were from the Pieksämäki suburban area of Finland, and the remaining 66 km² were from the Helsinki urban area.
To study transfer learning in the suburban area with this model, true orthophotos, DSMs, DEMs, and vector data were used. The true orthophotos had a pixel resolution of 0.3 m and were 2000 × 2000 pixels in size, with one true orthophoto covering an area of 600 m × 600 m. They contained the x, y, R, G, B, and NIR information. DEMs for the corresponding area were produced from a 2 m elevation model with a point density of at least 0.5 points per square meter; the pixel resolution of the DEMs was therefore 2 m. Labels were obtained and corrected from the topographic database’s building vector data.
The training data, containing true orthophotos, DEMs, DSMs, and labels, were processed into corresponding 1000 × 1000 pixel crops. The true orthophotos, DEMs, and DSMs were combined into five-channel images, the first three channels containing the true orthophotos’ R, G, and B information and the last two channels the height information from the DSMs and DEMs. The training set, D4, covered an area of 25.38 km². The validation set included 12 true orthophotos of 2000 × 2000 pixels and 52 true orthophotos of 1000 × 1000 pixels with their labels, covering 9 km². The areas covered by buildings remained small, because the suburban area has a lot of vegetation and forests and is sparsely populated in comparison to the Helsinki urban area.
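A minimal sketch of how such a five-channel training array could be assembled with rasterio is given below; the file paths, the bilinear resampling of the height rasters to the 1000 × 1000 pixel grid, and the function name are assumptions.

```python
import numpy as np
import rasterio
from rasterio.enums import Resampling

def make_five_channel(ortho_path, dsm_path, dem_path, size=1000):
    """Stack true orthophoto R, G, B with DSM and DEM height rasters
    into one 5 x H x W training array (channels first)."""
    with rasterio.open(ortho_path) as src:
        rgb = src.read([1, 2, 3], out_shape=(3, size, size),
                       resampling=Resampling.bilinear).astype(np.float32)

    def read_height(path):
        # Resample the coarser height raster to the orthophoto grid (assumed choice).
        with rasterio.open(path) as src:
            return src.read(1, out_shape=(size, size),
                            resampling=Resampling.bilinear).astype(np.float32)

    dsm, dem = read_height(dsm_path), read_height(dem_path)
    return np.concatenate([rgb, dsm[None], dem[None]], axis=0)
```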

2.3. Dataset from Rural Areas

The multi-location orthophotos were collected by the NLS. The pixel resolution was 0.5 m, and the orthophotos contained the information x, y, R, G, and B. Buildings were annotated manually with the VGG Image Annotator (VIA) [34]. After the labeling process, the orthophotos were 1000 × 1000 pixels in size; thus, one orthophoto covered an area of 500 m × 500 m. The building labels were processed into 0–1 arrays with the aid of the dhSegment framework [36]. The multi-location orthophotos formed dataset D5, which covered 198 km² (Table 1).

2.4. Datasets for Pretraining a Model

The LandCover.ai (Land Cover from Aerial Imagery) dataset includes mapped buildings, woodlands, water, and roads from Poland [30]. The dataset has eight large orthophotos of about 4200 × 4700 pixels with a 0.5 m pixel resolution and thirty-three orthophotos of about 9000 × 9500 pixels with a 0.25 m pixel resolution. The data cover a total area of 216.27 km² and contain the information x, y, R, G, and B. The labels were modified to include only the building class and were converted into 0–1 arrays.
The LandCover.ai orthophotos with a 0.5 m pixel resolution were cropped into 1000 × 1000 pixel pieces and combined with the multi-location dataset D5 to form the pretraining dataset D7 (Table 1). D7 consisted of 160 LandCover.ai crops of 1000 × 1000 pixels with a 0.5 m pixel resolution, covering an area of 40 km², together with the multi-location orthophotos; it therefore contained 952 images of 1000 × 1000 pixels and covered a total area of 238 km². The whole LandCover.ai dataset was additionally cropped into 512 × 512 pixel pieces with a script from the LandCover.ai project and divided into training, validation, and test sets according to the project’s ready-made division [30]. The dataset D6 comprised the whole LandCover.ai data, amounting to 10,674 samples of 512 × 512 pixels after cropping.
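A minimal sketch of reducing a multi-class LandCover.ai mask to a 0–1 building array is given below; the building class index is an assumption based on the LandCover.ai class definitions, and the function name is illustrative.

```python
import numpy as np
from PIL import Image

BUILDING_CLASS = 1  # building class index in the LandCover.ai masks (assumed)

def building_mask(mask_path: str) -> np.ndarray:
    """Collapse a multi-class LandCover.ai mask into a 0-1 building array."""
    mask = np.array(Image.open(mask_path))
    return (mask == BUILDING_CLASS).astype(np.uint8)
```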

3. Methods

3.1. U-Net Architecture

The building extraction was performed using a CNN. U-Net was chosen as the CNN, because it has shown potential in many different segmentation tasks like building, road, and medical image segmentation [10,12,13,37]. The architecture used resembled that introduced by Ronneberger et al. [10]. The input layer consisted of two convolutional layers, each followed by batch normalization and ReLU. The next four layers, including the bottom layer, were similar, with the only difference being that they had a max-pooling layer before the double convolution was applied. This part of U-Net is also called the contracting path. The contracting path of the neural network focuses on learning to recognize semantically lower features like edges and corners. U-Net can perform localization by combining high-resolution features from the contracting path with an upsampled output. A convolutional layer can learn to assemble an even more precise output based on this combination [10].
After the contracting path, the rest of U-Net is called the expanding path. The bottom layer can be counted as part of both paths. The expanding path consists of four up-convolutional layers after the bottom layer and an output layer, which is a convolutional layer. Each of the up-convolutional layers consists first of one transposed convolutional layer and then a double convolutional structure similar to that of the contracting path, but without max-pooling. The expanding path learns to recognize higher-level features than the contracting path; such features are built up from the lower-level ones. The expanding path is more or less symmetrical to the contracting path, and U-Net thus receives the U-shaped architecture after which the network is named. The network has no fully connected layers, and its convolutions only make use of the segmentation maps containing pixels for which the full context is available in the input image. Prediction for border pixels, for which the context is unavailable, is accomplished by mirroring the input image [10].
The U-Net architecture includes skip-connections between corresponding encoder–decoder layers. The skip-connections allow a progressive reintroduction of high-frequency information from the encoder layers and a reduction of spatial ambiguities in the upsampling stage in the expanding path of U-Net [38]. The architecture included four skip-connections. To enable seamless tiling of the U-Net output segmentation maps, the input tile size should be selected so that all 2 × 2 max-pooling operations are applied to a layer with an even x-size and y-size [10]. U-Net with an almost identical structure was used to train a local model, which was then adapted with D5 and D6. The differences in the implementation of this model lay in the use of dropout with a rate of 0.25, always after a sequence of a convolutional layer, batch normalization, and ReLU. Upsampling was also performed a little differently: It consisted of a sequence of PyTorch’s classes UpsamplingNearest2d, ConstantPad2d, and a 2D convolution. This model had only three layers in the expanding and contracting paths in addition to the bottom layer, while the model fine-tuned with datasets D1–D3 had four layers in both paths (Figure 3). Thus, this model’s architecture also had only three skip-connections.
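As an illustration, a compact PyTorch sketch of the four-level U-Net variant described above (the model fine-tuned with D1–D3) is given below. The channel widths are assumptions, and padded 3 × 3 convolutions are used here instead of the unpadded convolutions and input mirroring of the original U-Net [10].

```python
import torch
import torch.nn as nn

def double_conv(in_ch, out_ch):
    """Two 3x3 convolutions, each followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    """Four-level encoder-decoder U-Net with skip connections (a sketch;
    channel widths are assumptions, not values reported in the paper)."""
    def __init__(self, in_channels=3, out_channels=1, widths=(64, 128, 256, 512, 1024)):
        super().__init__()
        self.downs = nn.ModuleList()
        ch = in_channels
        for w in widths[:-1]:                          # contracting path
            self.downs.append(double_conv(ch, w))
            ch = w
        self.pool = nn.MaxPool2d(2)
        self.bottom = double_conv(widths[-2], widths[-1])
        self.ups, self.up_convs = nn.ModuleList(), nn.ModuleList()
        for w_in, w_out in zip(widths[::-1][:-1], widths[::-1][1:]):   # expanding path
            self.ups.append(nn.ConvTranspose2d(w_in, w_out, 2, stride=2))
            self.up_convs.append(double_conv(w_in, w_out))
        self.out = nn.Conv2d(widths[0], out_channels, 1)

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)                            # feature maps reused via skip-connections
            x = self.pool(x)
        x = self.bottom(x)
        for up, conv, skip in zip(self.ups, self.up_convs, reversed(skips)):
            x = up(x)
            x = conv(torch.cat([skip, x], dim=1))      # concatenate skip features before double conv
        return self.out(x)                             # logits; sigmoid applied in the loss
```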

3.2. Training and Transfer Learning Techniques

To study transfer learning in the metropolitan area, pretrained models were trained with NLS orthophoto data and open-source data. Loss functions are used to minimize a deep learning model’s error and optimize its parameters [39]. The training loss indicates how well the model has fit the training data and is typically lower than the validation loss, which reveals how well the model performs on the validation set. As the validation set is used in the selection of model parameters, it can offer a biased estimate of model performance, and a separate test set offers a less biased estimate. Binary cross-entropy was selected as the loss function. A model trained with the dataset D6 was selected for fine-tuning and for studying different transfer learning techniques for the U-Net architecture. This pretrained model was adapted to the metropolitan area with local true orthophoto data. The domain differences addressed by transfer learning in this paper were the difference in the pixel resolution of the data and the differences in the appearance of buildings and the environment. To optimize model performance, early stopping based on the validation loss was used.
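A minimal sketch of such a training loop with binary cross-entropy and early stopping on the validation loss is given below. The Adam optimizer matches the one mentioned in Section 5.2, while the learning rate, patience, epoch count, and label layout (0–1 masks shaped like the model output) are assumptions.

```python
import copy
import torch
import torch.nn as nn

def train(model, train_loader, val_loader, epochs=100, lr=1e-4, patience=10, device="cuda"):
    """Train with binary cross-entropy on logits; stop early when the
    validation loss has not improved for `patience` epochs (values assumed)."""
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    best_loss, best_state, bad_epochs = float("inf"), None, 0
    model.to(device)
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            # labels are assumed to be 0-1 masks shaped like the model output (N, 1, H, W)
            images, labels = images.to(device), labels.to(device).float()
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for images, labels in val_loader:
                images, labels = images.to(device), labels.to(device).float()
                val_loss += criterion(model(images), labels).item()
        val_loss /= len(val_loader)
        if val_loss < best_loss:                 # keep the best weights seen so far
            best_loss, best_state, bad_epochs = val_loss, copy.deepcopy(model.state_dict()), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:           # early stopping on validation loss
                break
    model.load_state_dict(best_state)
    return model
```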
Multiple fine-tuning techniques and their combinations were used to fine-tune the pretrained model with three local true orthophoto datasets of different sizes. It was investigated whether fine-tuning the contracting or the expanding path of U-Net yielded better segmentation results, whether several techniques would yield good results, and whether the size of the local dataset used for fine-tuning affected the result. After separately retraining each U-Net path, it was investigated whether additionally fine-tuning the remaining path with a smaller learning rate would further improve the segmentation results, and which combination achieved the best segmentation results. Fine-tuning with a lower learning rate without retraining any layers of the pretrained model was also investigated, comparing whether the best results were obtained by fine-tuning the contracting path, the expanding path, or the whole network.
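A minimal PyTorch sketch of one of the studied combinations (retraining the expanding path while fine-tuning the contracting path with a lower learning rate) is shown below, using the attribute names of the U-Net sketch in Section 3.1. The learning rates, the interpretation of "retraining" as re-initializing the path, and the assignment of the bottom layer to the expanding path are assumptions.

```python
import torch

def retrain_expanding_finetune_contracting(model, lr_retrain=1e-4, lr_finetune=1e-5):
    """Retrain the expanding path from re-initialized weights at a normal
    learning rate while fine-tuning the contracting path with a lower
    learning rate (rates and re-initialization are assumptions)."""
    # The bottom layer can be counted as part of either path; here it is
    # grouped with the expanding path (an assumed choice).
    expanding = [model.bottom, *model.ups, *model.up_convs, model.out]
    contracting = list(model.downs)
    for module in expanding:                     # re-initialize the path to be retrained
        for layer in module.modules():
            if hasattr(layer, "reset_parameters"):
                layer.reset_parameters()
    optimizer = torch.optim.Adam([
        {"params": [p for m in expanding for p in m.parameters()], "lr": lr_retrain},
        {"params": [p for m in contracting for p in m.parameters()], "lr": lr_finetune},
    ])
    return optimizer

def freeze_path(modules):
    """Alternative technique: freeze one path entirely so that only the
    parameters of the other path are updated during training."""
    for module in modules:
        for p in module.parameters():
            p.requires_grad_(False)
```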
To evaluate the usefulness of retraining some layers and fine-tuning other layers of the pretrained model with a lower learning rate, models were also trained with the true orthophoto data without transfer learning. It was investigated how the available sample quantity affected the viability of transfer learning and fine-tuning compared to training a whole new model without transfer learning, and the results achieved with and without transfer learning were compared. The Helsinki area model’s performance in the suburban area was tested, and the model was then fine-tuned with the suburban area’s true orthophoto data. Early stopping based on the validation loss was used in a similar fashion to the metropolitan area fine-tuning datasets. The performance of the initial Helsinki area model and the fine-tuned models was compared.

3.3. Evaluation

The main evaluation metric for assessing building segmentation performance was the F1-score, which emphasizes the effect of correctly labeled building pixels. The F1-score is defined through precision and recall, which are also known as correctness and completeness:
$$\mathrm{Precision} = \frac{TP}{TP + FP},$$
where TP stands for true positives, and FP for false positives. Recall is defined as
$$\mathrm{Recall} = \frac{TP}{TP + FN},$$
where FN stands for false negatives. F1-score can then be defined as
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
In addition, the pixel error was used to investigate the amount of wrongly labeled pixels. When the objects to be segmented cover only a very small part of the whole image, the pixel error metric can be misleading, because it can remain very low even if the desired objects are wrongly categorized. This is common in the countryside, for example, where populations are small and the area is mostly covered by forests and cultivated fields; in other words, the building class representation within the images was small in this case. This was the main reason for selecting the F1-score as the main evaluation metric.
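A minimal sketch of computing these metrics for a predicted building mask is shown below; the 0.5 threshold follows the output thresholding mentioned in Section 4.1, and the function name is illustrative.

```python
import numpy as np

def evaluate(pred_probs: np.ndarray, labels: np.ndarray, threshold: float = 0.5):
    """Pixel-wise precision, recall, F1-score, and pixel error for a
    binary building mask; predictions are thresholded at 0.5."""
    pred = (pred_probs >= threshold).astype(np.uint8)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    pixel_error = np.mean(pred != labels)          # share of wrongly labeled pixels
    return precision, recall, f1, pixel_error
```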

4. Results

4.1. Pretrained Models

The model trained with D7 and the model trained with D6 achieved approximately equal performance on the validation/test set (Table 2). The pixel error of both models was 0.055, while the model trained with D6 achieved a smaller test stripe loss but also a smaller F1-score. This indicates that the model trained with D7 was better at labeling parts with buildings correctly, as the F1-score emphasizes the effect of true positive values. Both models labeled an equal share of pixels inaccurately, and together with the smaller F1-score, this indicates that the D6 model made more mistakes on true positive building pixels than the combination model. It can also be observed that the model trained with D6 achieved a notably lower training and validation loss. One of the reasons for this was the smaller building class representation in the dataset D6 compared to D7, which contained a larger area of buildings in relation to the dataset size. In addition, there may also have been label inconsistencies due to the different label requirements of different organizations.
Another pretrained model, trained with true orthophotos, DSMs, DEMs, and vector data from the Helsinki urban and Pieksämäki suburban areas, was obtained to be adapted to the Kajaani suburban area of Finland. The image properties of the data used for training this pretrained model and for fine-tuning it were similar; in both cases, the true orthophotos had a 0.3 m pixel resolution. The greatest difference was in the environment and infrastructure: in the Helsinki urban area, there are more buildings, especially multistory buildings, and the environment is more urban, whereas in the Kajaani suburban area, there are many fields, forests, and smaller houses, which are farther apart. The initial performance of the Helsinki area pretrained model on the validation set was an F1-score of 0.322, a pixel error of 0.089, and a loss of 0.234 when thresholding at 0.5 was used for the outputs. The performance was therefore not very good before adaptation to the new data.

4.2. Performance without Transfer Learning

Three models were trained without transfer learning with the three different-sized datasets of true orthophotos. With these comparison models, the usefulness of transfer learning was assessed. The model trained with the largest dataset achieved the best F1-score of the three on the true orthophoto stripe from the Helsinki urban area, while the model trained with the smallest local dataset achieved the lowest F1-score, indicating that the semantic segmentation results improved as the number of local samples in the datasets increased (Table 3).
The model trained with D2 achieved a 4.2% higher F1-score (0.861) and a 25% lower pixel error (0.018) than the model trained with D3, while the model trained with D1 achieved a 7.9% higher F1-score (0.891) than the model trained with D3 and a 3.5% higher F1-score than the model trained with D2. A notable decrease in loss as the dataset size increased was also observed. For the Helsinki urban area model obtained for fine-tuning with D4, the initial performance on the validation set was an F1-score of 0.322, a pixel error of 0.089, and a loss of 0.234.

4.3. Fine-Tuning U-Net

Different fine-tuning techniques and their combinations were tried for fine-tuning the pretrained D6 model with true orthophoto datasets D1–D3. The techniques used included fine-tuning the whole U-Net with a smaller learning rate and fine-tuning only a contracting or expanding path. In addition, retraining either the contracting or the expanding layers with the fine-tuning datasets was attempted, and afterwards it was investigated whether fine-tuning the remaining path with a smaller learning rate would lead to an even better performance. The same techniques were tested in the suburban area of Finland.

4.3.1. Fine-Tuning with D3

The best performance was achieved by retraining the expanding path of U-Net with the dataset D3 and afterwards fine-tuning the contracting path with a lower learning rate (Table 4). The performance of the fine-tuned model with only a retrained expanding path was almost as good on the validation set. Fine-tuning the whole network with a lower learning rate was the next best choice based on the results, and it achieved a performance that was about as good as the comparison model with no transfer learning. Retraining or fine-tuning only the contracting path did not prove profitable with the dataset. When the performance of the best-performing fine-tuned model was compared with the model trained from scratch with the same fine-tuning dataset, it was observed that the fine-tuned model achieved a 4.7% better F1-score and a 16.6% lower pixel error.

4.3.2. Fine-Tuning with D2

The best performance with the dataset D2 was achieved by retraining the U-Net contracting layers and then fine-tuning the expanding layers with a lower learning rate (Table 5). The F1-score of this model on the validation set was 0.892, and the pixel error was 0.017. It was observed that retraining the expanding path of the network and fine-tuning the contracting path with a lower learning rate achieved performance that was about as good, and the performance of the model with only a retrained contracting path was almost as good. Fine-tuning parts of the U-Net or the whole network with a lower learning rate did not achieve performance that was as good as the comparison model with no transfer learning.
The comparison model trained from scratch achieved an F1-score of 0.861 (Table 3), while the best fine-tuned model achieved a 3.6% higher F1-score, 0.892. The pixel error decreased by 5.6% when the fine-tuned model was compared with the model not using transfer learning. This best model fine-tuned with the dataset D2 achieved performance that was about as good as the model trained from scratch with the dataset D1.

4.3.3. Fine-Tuning with D1

It was observed that the best performance with the dataset D1 was achieved by first retraining the contracting layers of U-Net and then fine-tuning the expanding layers with a lower learning rate. This model achieved an F1-score of 0.905 and a pixel error of 0.015 (Table 6). The model with retrained expanding layers and contracting layers fine-tuned with a lower learning rate achieved performance metrics that were almost as good, with an F1-score of 0.896 and a pixel error of 0.016. The performance difference between the fine-tuned model with only retrained contracting layers and the model that additionally had its expanding layers fine-tuned with a lower learning rate was minimal: the model with only retrained contracting layers achieved an F1-score of 0.898 and a pixel error of 0.015. When the pretrained model was fine-tuned with only a lower learning rate, the achieved performance reached the same level as the comparison model with no transfer learning that was trained with the dataset D2. The comparison model (Figure 4) achieved an F1-score of 0.891 on the validation set, while the best fine-tuned model’s F1-score was only 1.6% higher. Both models had the same pixel error of 0.015.

4.3.4. Kajaani Suburban Area and Transfer Learning

The best fine-tuning results with the Helsinki urban area pretrained model were obtained by fine-tuning the whole network, and by retraining the expanding path combined with fine-tuning either the whole network or the contracting path. When fine-tuning or retraining only one of the two paths was considered, choosing the expanding path led to the best results in every case with the dataset D4 and the Helsinki area pretrained model.
The Helsinki urban area pretrained model was trained using aerial images, DSMs, and DEMs, and Table 7 presents its performance after the different fine-tuning methods. It was observed that the best performance this time was obtained by fine-tuning all the layers with a lower learning rate. Because the Helsinki urban area data contained a large building area and were very similar in their properties to the suburban area’s true orthophotos, the pretrained model probably learned simple building features like corners and edges well and was able to adapt to the new data more easily. Adding information from the DSMs and DEMs probably also affected the result. The F1-scores remained lower than the fine-tuning results from the urban area, but this was because some true orthophotos obtained a very low F1-score, or even an F1-score of 0, as the model failed to detect small cottages under trees that even human sight could not identify in the images. It was also observed that if only one part of U-Net was either retrained or fine-tuned, selecting the expanding path always resulted in better performance with the Helsinki area pretrained model, the dataset D4, and the validation set.

5. Discussion

In one approach, when a network trained on a large dataset is fine-tuned, the shallow layers focusing on the lower features of data are kept unchanged, and the deeper layers are modified according to the new data to which the network will be adapted. However, it should be noted that the data and their features can affect whether the best performance is achieved by fine-tuning the deeper layers of a network, or whether the adjustment of the shallow layers’ parameters can yield better results. For example, a study by Amiri et al. published in 2020 concluded that when U-Net was being fine-tuned with X-ray data, it might be more appropriate to fine-tune the shallow layers, while with ultrasound data, the fine-tuning of deeper layers was more likely to yield better segmentation results, because these two types of data have different salient features [12].
It was observed in the fine-tuning results shown in Table 4, Table 5 and Table 6 that when aerial images were used and the pretrained network was fine-tuned with a lower learning rate, fine-tuning the contracting path of U-Net, which focuses on the lower features of the data, always achieved better performance than fine-tuning the expanding path, which focuses on higher features, with the LandCover.ai pretrained model and the urban area true orthophoto datasets. However, when the retraining of either path was considered, retraining the expanding path of U-Net led to considerably better performance than retraining the contracting path with the smallest dataset (Table 4). Table 5 shows that when the dataset size increased, retraining either path led to approximately equal performance, and Table 6 shows that with the largest dataset, retraining the contracting path was the better choice.
Both the training and validation loss curves were very smooth when the expanding path was retrained with the different-sized true orthophoto datasets, and with the largest dataset, the gap between the training and validation loss curves became very small. When the contracting path of the pretrained model was retrained, both curves showed a lot of spiking, which decreased to some extent when the dataset size became larger (Figure 5). The model performance was more stable when the expanding path was retrained. This could indicate that the shallower features, like the edges of buildings, were harder for the model to learn.
In the suburban area with the dataset D4, the best results were obtained by fine-tuning the whole network, and by retraining the expanding path combined with fine-tuning either the whole network or the contracting path. When fine-tuning or retraining only one of the two paths was considered, choosing the expanding path led to the best results in every case with this dataset and the local pretrained model. This indicates that both the dataset size and the pretrained model and its properties should be considered when choosing the most suitable fine-tuning method, because both have an impact.
Finally, it is important to consider that the data types used can affect the transfer learning results. In the urban area, the transfer happened from orthophotos to true orthophotos, and in the suburban area, from true orthophotos, DEMs, and DSMs to true orthophotos, DEMs, and DSMs. We tested datasets from multiple environments, and the performance appeared similar across the different datasets and transfer learning techniques, as several different transfer learning techniques achieved higher performance than training from scratch with each of the datasets. This might be due to the diversity of the pretraining dataset, which contained data from various environments. The building extraction work is planned to be continued alongside studying change detection methods.

5.1. Effect of Training Sample Quantity

The effect of the available training sample quantity on model performance was observed in the pretrained models, the models trained without transfer learning, and the fine-tuned models. The pretrained model trained with the dataset D5 achieved the lowest F1-score on the validation/test set compared to the models trained with the D7 and D6 datasets. The D6 dataset had the largest sample size, and the model trained with it achieved the most stable results.
The effect of sample size was also observed in the models trained without transfer learning. The model trained with the D3 dataset achieved the lowest F1-score, while the model trained with D1 achieved the highest score on the validation/test set. The model trained with D1 also achieved a smaller loss and pixel error than the models trained with D2 and D3. In addition, it was observed that the growth in the F1-score was slightly smaller between the datasets D2 and D1 than between the datasets D3 and D2, because at some point a model may reach a level where more data no longer improve its performance. It is also possible that increasing the dataset size decreases model performance and has detrimental effects if the added data are of poor quality [40], whereas adding data of the same quality can further increase the performance.
The fine-tuning results showed that when fewer data were available, fine-tuning the whole network with a smaller learning rate or retraining the expanding path resulted in better model performance and more accurate segmentation of buildings. When the dataset size increased, retraining only the contracting path of U-Net became more profitable than fine-tuning the whole network or retraining the expanding path. The results of fine-tuning also improved with increasing dataset size: fine-tuning with D3 achieved a lower F1-score than fine-tuning with datasets D2 and D1. The D4 results are close to the performance achieved with the similarly sized dataset D2, although the DSMs and DEMs helped the suburban area model achieve accurate performance.

5.2. Limitations

Several factors limited the study. First, all the models, including the pretrained models, the models with no transfer learning, and the fine-tuned models, were trained only once. By training all the models multiple times, some variation could be removed from the results. In addition, computational cost was not measured during model training. Some spiking was seen in the validation curves of the comparison models with no transfer learning and of the fine-tuned models (Figure 5), which may indicate that the Helsinki urban area true orthophoto stripe used for validation and early stopping was rather small, although the spiking may also have been caused by the Adam optimizer used and by inaccuracies in the labels. The final performance metrics of the fine-tuned models were reported on the validation set, for which the parameters were optimized; if a larger, separate test set were used, the performance would be expected to be lower. The difference in performance between using and not using transfer learning was nevertheless observed on the validation true orthophoto stripe, but more extensive performance testing should be conducted before using a model in production.
In general, the label quality and possible inaccuracies in labeling affected the segmentation results and may have limited the accuracy achieved in the experiments. The labels of the Helsinki urban area true orthophoto data were manually corrected to be as accurate as possible, but it is likely that some inaccuracies remain, because the labels were not closely inspected again after the correction before being used for training. Especially in the Kajaani suburban area, only some of the label data used were corrected as carefully as in the Helsinki urban area, and this probably degraded the results. The inaccuracies probably limit the results achieved, because the models can learn to recognize undesired features in the data. None of the orthophotos used was processed any further before being fed to U-Net, which may also have affected the models’ building extraction capability. Some of the true orthophotos were rather dark, and the buildings were overshadowed, which made them difficult to detect.
As the fine-tuned models were adjusted with local data from the Helsinki urban area, and the comparison models with no transfer learning were trained with the same data, they performed well in this region and in regions environmentally and infrastructurally similar to the Helsinki urban area. This was also observed in the further studies, because the local model obtained for fine-tuning did not perform very well in the Kajaani suburban area before adjustment to this new area. In addition, it should be noted that only models without pretraining and models pretrained on remote sensing data were compared in this study. Existing general methods, such as VGG [5] and ResNet [7], utilize weights pretrained on, for example, ImageNet. These methods should also be considered for transfer learning purposes, as they can also achieve great results [41].

5.3. SAM

Minor testing with the promising SAM was conducted to assess its potential. The model was tested as is with the prompt “Buildings” and after fine-tuning it with D3. Some examples of the results from the test can be found in Figure 6. The results with the prompt were in general decent, with some buildings missing and some trees and parking lots being falsely detected. For the fine-tuning, bounding boxes were utilized in addition to the building labels. Fine-tuning with D3 produced slightly better results but with similar problems. Based on the test, SAM showed potential for fine-tuning with remote sensing data. Fine-tuning the model further could achieve accurate segmentation results, and the topic should be investigated further in the future.
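As a sketch of how box-prompted inference could be performed with the publicly released segment-anything package, the snippet below predicts one mask per building bounding box and merges them; the checkpoint file, model size, and merging strategy are assumptions, and the text-prompt variant used in the test above is not covered by this API.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Checkpoint path and model size are assumptions; the released SAM API accepts
# point and box prompts, so building bounding boxes are used as prompts here.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

def segment_buildings(image: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """Predict one binary mask per building bounding box (x0, y0, x1, y1)
    and merge them into a single 0-1 building mask."""
    predictor.set_image(image)                       # RGB uint8 image, H x W x 3
    merged = np.zeros(image.shape[:2], dtype=np.uint8)
    for box in boxes:
        masks, _, _ = predictor.predict(box=box, multimask_output=False)
        merged |= masks[0].astype(np.uint8)
    return merged
```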

6. Conclusions

In this paper, various fine-tuning techniques and their combinations with U-Net were investigated. The open-source LandCover.ai data from Poland were employed to pretrain the U-Net model. Diverse datasets (true orthophotos) from multiple environments in Finland were tested with the various fine-tuned models, and models with no transfer learning were trained from scratch with the local datasets for comparison with the fine-tuning results. In addition, training datasets of different sizes were studied. The experiments showed that with a small dataset for the target model, retraining the expanding path of the pretrained U-Net model and then fine-tuning the remaining layers with a lower learning rate gave the best performance. With a larger dataset, retraining the contracting path of the pretrained U-Net model and then fine-tuning the remaining layers with a lower learning rate achieved the best segmentation performance. A further finding was that when the pretrained model had been trained with high-quality data from the same country, the best results were obtained by fine-tuning all the layers instead of retraining layers; additionally, when some layers were retrained, retraining the expanding path rather than the contracting path proved the better choice. For example, the best suburban area model achieved an F1-score of 0.877 when a training dataset covering an area of 25.38 km² was used.
If only a small dataset for the target area for building extraction is available, open-source data, simple data augmentation methods, and transfer learning techniques can achieve performance that is at least about 5% more accurate than if a model is trained from scratch with the target area data. Finding the most suitable fine-tuning method for a dataset is important to obtain the best possible segmentation performance and take steps toward automatized building extraction and map updating.
In future research, the possibility of solving the limiting factors should be investigated. In addition, we want to suggest studying UDA algorithms for transfer learning in the field of remote sensing, as they show great potential. Further investigation and comparison with the recent SAM model [17] should be conducted, as it can be applied in a zero-shot way.

Author Contributions

Conceptualization, E.H. and L.Z.; formal analysis, E.H.; funding acquisition, L.Z., J.H. and J.O.; investigation, E.H.; methodology, E.H.; project administration, L.Z.; software, E.H. and J.R.; supervision, L.Z., J.H. and J.O.; validation, E.H.; visualization, E.H. and J.R.; writing—original draft, E.H.; writing—review and editing, E.H., L.Z., J.R., J.O. and J.H. All authors have read and agreed to the published version of the manuscript.

Funding

The work was supported by the National Land Survey’s ATMU project (Advanced Technology for topographic Map Updating), which was funded by the Ministry of Finance in Finland (Valtiovarainministeriö) for the period of 1 January 2021–31 December 2022, as well as the Finnish Geospatial Research Institute’s (FGI) projects, Forest-Human-Machine Interplay Building Resilience, Redefining Value Networks and Enabling Meaningful Experiences, and Future Forest Information System at Individual Tree Level—Tulevaisuuden Yksittäisen Puun Tietoihin Pohjautuva Metsätietojärjestelmä. The FGI’s projects were funded by the Academy of Finland. We made use of geocomputing tools provided by the Open Geospatial Information Infrastructure for Research (Geoportti, urn:nbn:fi:research-infras-2016072513) funded by the Academy of Finland, CSC—IT Center for Science, and other Geoportti consortium members.

Data Availability Statement

Some data used for this study are not publicly available. For the datasets obtained online, see the References.

Acknowledgments

This study began as a thesis study, “Transfer learning technology for building extraction from orthophotos and open-source data” [42]. Many thanks to CSC [35] for providing free high-performance computing support for the research purposes of the project.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Schwert, B.; Rogan, J.; Giner, N.M.; Ogneva-Himmelberger, Y.; Blanchard, S.D.; Woodcock, C. A comparison of support vector machines and manual change detection for land-cover map updating in Massachusetts, USA. Remote Sens. Lett. 2013, 4, 882–890. [Google Scholar] [CrossRef]
  2. Schlosser, A.D.; Szabó, G.; Bertalan, L.; Varga, Z.; Enyedi, P.; Szabó, S. Building Extraction Using Orthophotos and Dense Point Cloud Derived from Visual Band Aerial Imagery Based on Machine Learning and Segmentation. Remote Sens. 2020, 12, 2397. [Google Scholar] [CrossRef]
  3. Vosselman, G.; Kessels, P.; Gorte, B. The utilisation of airborne laser scanning for mapping. Data Quality in Earth Observation Techniques. Int. J. Appl. Earth Obs. Geoinf. 2005, 6, 177–186. [Google Scholar] [CrossRef]
  4. Luo, L.; Li, P.; Yan, X. Deep Learning-Based Building Extraction from Remote Sensing Images: A Comprehensive Review. Energies 2021, 14, 7982. [Google Scholar] [CrossRef]
  5. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  6. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  7. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  8. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  9. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv 2016, arXiv:1511.00561. [Google Scholar] [CrossRef]
  10. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  11. Livne, M.; Rieger, J.; Aydin, O.U.; Taha, A.A.; Akay, E.M.; Kossen, T.; Sobesky, J.; Kelleher, J.D.; Hildebrand, K.; Frey, D.; et al. A U-Net Deep Learning Framework for High Performance Vessel Segmentation in Patients with Cerebrovascular Disease. Front. Neurosci. 2019, 13, 97. [Google Scholar] [CrossRef]
  12. Amiri, M.; Brooks, R.; Rivaz, H. Fine tuning U-Net for ultrasound image segmentation: Which layers? arXiv 2020, arXiv:2002.08438. [Google Scholar]
  13. Liu, W.; Yang, M.; Xie, M.; Guo, Z.; Li, E.; Zhang, L.; Pei, T.; Wang, D. Accurate Building Extraction from Fused DSM and UAV Images Using a Chain Fully Convolutional Neural Network. Remote Sens. 2019, 11, 2912. [Google Scholar] [CrossRef]
  14. Yan, L.; Fan, B.; Liu, H.; Huo, C.; Xiang, S.; Pan, C. Triplet Adversarial Domain Adaptation for Pixel-Level Classification of VHR Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3558–3573. [Google Scholar] [CrossRef]
  15. Zhang, L.; Lan, M.; Zhang, J.; Tao, D. Stagewise Unsupervised Domain Adaptation with Adversarial Self-Training for Road Segmentation of Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5609413. [Google Scholar] [CrossRef]
  16. Wang, J.; Ma, A.; Zhong, Y.; Zheng, Z.; Zhang, L. Cross-sensor domain adaptation for high spatial resolution urban land-cover mapping: From airborne to spaceborne imagery. Remote Sens. Environ. 2022, 277, 113058. [Google Scholar] [CrossRef]
  17. Kirillov, A.; Mintun, E.; Ravi, N.; Mao, H.; Rolland, C.; Gustafson, L.; Xiao, T.; Whitehead, S.; Berg, A.C.; Lo, W.Y.; et al. Segment Anything. arXiv 2023, arXiv:2304.02643. [Google Scholar]
  18. Lazaro, A.; Nahhas, F.H.; Shafri, H.Z.M.; Sameen, M.I.; Pradhan, B.; Mansor, S. Deep Learning Approach for Building Detection Using LiDAR—Orthophoto Fusion. J. Sensors 2018, 2018, 7212307. [Google Scholar] [CrossRef]
  19. Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 9. [Google Scholar] [CrossRef]
  20. Bozinovski, S. Reminder of the First Paper on Transfer Learning in Neural Networks, 1976. Informatica 2020, 44. [Google Scholar] [CrossRef]
  21. Nazi, Z.A.; Abir, T.A. Automatic Skin Lesion Segmentation and Melanoma Detection: Transfer Learning Approach with U-Net and DCNN-SVM. In Proceedings of the International Joint Conference on Computational Intelligence, Budapest, Hungary, 2–4 November 2020; Uddin, M.S., Bansal, J.C., Eds.; Springer: Singapore, 2020; pp. 371–381. [Google Scholar]
  22. Zhao, X.; Wang, S.; Zhao, J.; Wei, H.; Xiao, M.; Ta, N. Application of an attention U-Net incorporating transfer learning for optic disc and cup segmentation. Signal Image Video Process. 2021, 15, 913–921. [Google Scholar] [CrossRef]
  23. Raj, R.; Londhe, N.D.; Sonawane, R. Automated psoriasis lesion segmentation from unconstrained environment using residual U-Net with transfer learning. Comput. Methods Programs Biomed. 2021, 206, 106123. [Google Scholar] [CrossRef]
  24. Wang, B.; Li, J.; Luo, J.; Wang, Y.; Geng, J. Intelligent Deblending of Seismic Data Based on U-Net and Transfer Learning. IEEE Trans. Geosci. Remote Sens. 2021, 59, 8885–8894. [Google Scholar] [CrossRef]
25. Adiba, A.; Hajji, H.; Maatouk, M. Transfer Learning and U-Net for Buildings Segmentation. In Proceedings of the New Challenges in Data Sciences: Acts of the Second Conference of the Moroccan Classification Society (SMC ’19), Kenitra, Morocco, 28–29 March 2019.
26. Ji, S.; Wei, S.; Lu, M. Fully Convolutional Networks for Multisource Building Extraction From an Open Aerial and Satellite Imagery Data Set. IEEE Trans. Geosci. Remote Sens. 2019, 57, 574–586.
27. Ji, S.; Wei, S.; Lu, M. A scale robust convolutional neural network for automatic building extraction from aerial and satellite imagery. Int. J. Remote Sens. 2019, 40, 3308–3322.
28. Lin, Q.; Ci, T.; Wang, L.; Mondal, S.K.; Yin, H.; Wang, Y. Transfer Learning for Improving Seismic Building Damage Assessment. Remote Sens. 2022, 14, 201.
29. Pinto, G.; Wang, Z.; Roy, A.; Hong, T.; Capozzoli, A. Transfer learning for smart buildings: A critical review of algorithms, applications, and future perspectives. Adv. Appl. Energy 2022, 5, 100084.
30. Boguszewski, A.; Batorski, D.; Ziemba-Jankowska, N.; Dziedzic, T.; Zambrzycka, A. LandCover.ai: Dataset for Automatic Mapping of Buildings, Woodlands, Water and Roads from Aerial Imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Nashville, TN, USA, 19–25 June 2021; pp. 1102–1110.
31. Guan, H.; Li, J.; Chapman, M.; Deng, F.; Ji, Z.; Yang, X. Integration of orthoimagery and lidar data for object-based urban thematic mapping using random forests. Int. J. Remote Sens. 2013, 34, 5166–5186.
32. Maltezos, E.; Doulamis, N.; Doulamis, A.; Ioannidis, C. Deep convolutional neural networks for building extraction from orthoimages and dense image matching point clouds. J. Appl. Remote Sens. 2017, 11, 42620.
33. Gilani, S.A.N.; Awrangjeb, M.; Lu, G. An Automatic Building Extraction and Regularisation Technique Using LiDAR Point Cloud Data and Orthoimage. Remote Sens. 2016, 8, 258.
34. Dutta, A.; Zisserman, A. The VIA Annotation Software for Images, Audio and Video. In Proceedings of the 27th ACM International Conference on Multimedia (MM ’19), New York, NY, USA, 21–25 October 2019.
35. ICT Solutions for Brilliant Minds—CSC. Available online: https://www.csc.fi/ (accessed on 29 September 2021).
36. Ares Oliveira, S.; Seguin, B.; Kaplan, F. dhSegment: A generic deep-learning approach for document segmentation. In Proceedings of the 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), Niagara Falls, NY, USA, 5–8 August 2018; pp. 7–12.
37. Abderrahim, N.Y.Q.; Abderrahim, S.; Rida, A. Road Segmentation using U-Net architecture. In Proceedings of the 2020 IEEE International Conference of Moroccan Geomatics (Morgeo), Casablanca, Morocco, 11–13 May 2020; pp. 1–4.
38. Liu, Y.; Nguyen, D.; Deligiannis, N.; Ding, W.; Munteanu, A. Hourglass-ShapeNetwork Based Semantic Segmentation for High Resolution Aerial Imagery. Remote Sens. 2017, 9, 522.
39. Alzubaidi, L.; Zhang, J.; Humaidi, A.J.; Al-Dujaili, A.; Duan, Y.; Al-Shamma, O.; Santamaría, J.; Fadhel, M.A.; Al-Amidie, M.; Farhan, L. Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. J. Big Data 2021, 8, 53.
40. Barragán-Montero, A.M.; Thomas, M.; Defraene, G.; Michiels, S.; Haustermans, K.; Lee, J.A.; Sterpin, E. Deep learning dose prediction for IMRT of esophageal cancer: The effect of data quality and quantity on model performance. Phys. Medica 2021, 83, 52–63.
41. Abou Baker, N.; Zengeler, N.; Handmann, U. A Transfer Learning Evaluation of Deep Neural Networks for Image Classification. Mach. Learn. Knowl. Extr. 2022, 4, 22–41.
42. Hattula, E. Transfer Learning Technology for Building Extraction from Orthophotos and Open-Source Data. Master’s Thesis, Aalto University, Helsinki, Finland, 2022.
Figure 1. Examples of data used. (a) Example of multi-location orthophoto from Finland. (b) Example of true orthophoto from Finland. (c) Example of LandCover.ai orthophoto from Poland. (d) Example of multi-location orthophoto label. (e) Example of true orthophoto label. (f) Example of LandCover.ai orthophoto label.
Figure 2. Accurate building labels were collected from true orthophotos.
Figure 3. Implemented U-Net structures. (a) U-Net architecture fine-tuned with datasets D1–D4. (b) U-Net architecture fine-tuned with datasets D5–D6.
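Figure 3 shows which parts of the U-Net were adapted for each dataset. As a rough illustration of what such path-wise adaptation can look like in code, the sketch below freezes the contracting (encoder) path of a small U-Net-style model and either fine-tunes or re-initializes the expanding (decoder) path. The PyTorch framework, the TinyUNet class, the checkpoint name, and the reading of "fine-tuning" as low-learning-rate updates versus "retraining" as re-initialization are illustrative assumptions, not the implementation used in the paper.

```python
# Hedged sketch: path-wise transfer learning on a U-Net-style model.
# TinyUNet, the checkpoint path, and the fine-tune/retrain interpretation are
# illustrative assumptions, not the authors' exact implementation.
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net used only to make the two paths explicit."""
    def __init__(self, in_ch=3, n_classes=1):
        super().__init__()
        # Contracting (encoder) path
        self.enc1 = conv_block(in_ch, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        # Expanding (decoder) path
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.head(d1)

model = TinyUNet()
# model.load_state_dict(torch.load("pretrained_landcover_unet.pt"))  # hypothetical checkpoint

contracting = [model.enc1, model.enc2]
expanding = [model.up, model.dec1, model.head]

# "Fine-tuning the expanding path": freeze the pretrained contracting path and
# update only the expanding path with a small learning rate.
for module in contracting:
    for p in module.parameters():
        p.requires_grad = False

# "Retraining the expanding path" (one possible reading): also reset the
# expanding-path weights so they are learned from scratch on the target data.
for module in expanding:
    for layer in module.modules():
        if isinstance(layer, (nn.Conv2d, nn.ConvTranspose2d)):
            layer.reset_parameters()

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```

Which path is frozen, fine-tuned, or reset is exactly the choice compared across the techniques listed in Tables 4–7.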
Figure 4. Comparison of segmentation results from urban-area models trained without transfer learning and from fine-tuned LandCover.ai models. (a) Input image. (b) Segmentation result with D3 from the model trained without transfer learning. (c) Segmentation result with D2 from the model trained without transfer learning. (d) Segmentation result with D1 from the model trained without transfer learning. (e) Ground-truth label. (f) Segmentation result with fine-tuning and D3. (g) Segmentation result with fine-tuning and D2. (h) Segmentation result with fine-tuning and D1.
Figure 5. Training and validation loss curves from retraining the expanding and contracting paths of the pretrained model with different-sized true orthophoto datasets. (a) Retraining the pretrained model's expanding path with D3. (b) Retraining the pretrained model's contracting path with D3. (c) Retraining the pretrained model's expanding path with D2. (d) Retraining the pretrained model's contracting path with D2. (e) Retraining the pretrained model's expanding path with D1. (f) Retraining the pretrained model's contracting path with D1.
Figure 6. Building detection with SAM. The left-hand images show building detection results with fine-tuning, and the right-hand images show results without fine-tuning.
Table 1. The remote sensing data used. Building vector labels for the true orthophotos were obtained from the topographic database and corrected to be as accurate as possible. The orthophoto data from Finland were manually labeled with VIA [34]; the open-source orthophotos from Poland came with ready-made labels [30]. For all datasets other than the LandCover.ai data, the building area was calculated from the building pixels in the labels. The datasets are named D1–D7 (D for dataset).
Dataset | Dataset Location(s) | Dataset Area(s) | Pixel Resolution(s) | Original Image Size(s) | Used Image Size(s)
D1 | Finland, Helsinki | 45 km² | 0.3 m | 2000 × 2000 pixels | 1000 × 1000 pixels
D2 | Finland, Helsinki | 27 km² | 0.3 m | 2000 × 2000 pixels | 1000 × 1000 pixels
D3 | Finland, Helsinki | 9 km² | 0.3 m | 2000 × 2000 pixels | 1000 × 1000 pixels
D4 | Finland, Kajaani | 25.38 km² | 0.3 m | 2000 × 2000 pixels | 1000 × 1000 pixels
D5 | Multi-locations in Finland | 198 km² | 0.5 m | 12,000 × 12,000 pixels | 1000 × 1000 pixels
D6 | Poland | 216.27 km² | 0.25 m, 0.5 m | About 4200 × 4700 pixels, about 9000 × 9500 pixels | 512 × 512 pixels
D7 | Poland and Finland | 238 km² | 0.5 m | About 4200 × 4700 pixels, 12,000 × 12,000 pixels | 1000 × 1000 pixels
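Table 1 lists both the original image sizes and the smaller sizes actually used, which implies that the large orthophotos were cut into tiles before training. A minimal sketch of such tiling is shown below; the function name, the non-overlapping tiling scheme, and the use of NumPy arrays are assumptions for illustration, not the preprocessing code used in the study.

```python
# Hedged sketch: splitting a large orthophoto into the smaller tile sizes
# listed in Table 1 (e.g., 1000 x 1000 px tiles). Illustrative only.
import numpy as np

def tile_image(image: np.ndarray, tile_size: int = 1000):
    """Yield non-overlapping tile_size x tile_size tiles (ragged edges are dropped)."""
    h, w = image.shape[:2]
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            yield image[y:y + tile_size, x:x + tile_size]

# Toy usage: a fake 3-band orthophoto of 3000 x 3000 pixels (a real
# multi-location orthophoto in D5 would be 12,000 x 12,000 pixels).
ortho = np.zeros((3000, 3000, 3), dtype=np.uint8)
tiles = list(tile_image(ortho, 1000))
print(len(tiles))  # 9 tiles of 1000 x 1000 pixels
```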
Table 2. The performance of the three different pretrained models. The table shows the smallest training and validation loss achieved by each model, as well as how well each model performed in segmenting the buildings of the test set D4, reported as loss, F1-score, and pixel error.
Dataset Used for Model Training | D7 | D9 | D8
Training loss | 0.029 | 0.053 | 0.006
Validation loss | 0.043 | 0.027 | 0.008
True orthophoto test stripe loss | 0.244 | 0.229 | 0.182
True orthophoto test stripe F1-score | 0.456 | 0.532 | 0.478
True orthophoto test stripe pixel error | 0.073 | 0.055 | 0.055
Table 3. The performance on the validation set of models trained from scratch with different-sized true orthophoto datasets, D1–D3.
Segmentation Performance without Transfer Learning
Dataset | Loss | F1-Score | Pixel Error
D3 | 0.074 | 0.826 | 0.024
D2 | 0.048 | 0.861 | 0.018
D1 | 0.041 | 0.891 | 0.015
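Tables 2–7 report loss, F1-score, and pixel error. Under the common definitions (F1-score computed on the building class and pixel error as the share of misclassified pixels), these metrics could be computed as in the sketch below; the exact formulations used by the authors are given in their methods section, so this is only an assumed illustration, with the function name chosen for this example.

```python
# Hedged sketch of per-image segmentation metrics under common definitions:
# F1-score on the building class and pixel error as the fraction of
# misclassified pixels. Illustrative only, not the authors' exact code.
import numpy as np

def f1_and_pixel_error(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-9):
    """pred and truth are binary masks of the same shape (1 = building)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    pixel_error = np.not_equal(pred, truth).mean()  # share of wrong pixels
    return f1, pixel_error

# Toy usage with random 1000 x 1000 masks
rng = np.random.default_rng(0)
pred = rng.integers(0, 2, size=(1000, 1000))
truth = rng.integers(0, 2, size=(1000, 1000))
print(f1_and_pixel_error(pred, truth))
```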
Table 4. Fine-tuning results of the pretrained U-Net on the validation set, fine-tuned with the true orthophoto dataset D3. The left-hand column shows the fine-tuning technique used, and the remaining columns report the loss, F1-score, and pixel error achieved with each technique.
Final Segmentation Performance, D3
Fine-Tuning Technique | Loss | F1-Score | Pixel Error
Fine-tuning whole network | 0.060 | 0.829 | 0.023
Fine-tuning contracting path | 0.074 | 0.813 | 0.025
Fine-tuning expanding path | 0.077 | 0.765 | 0.029
Retraining contracting path | 0.083 | 0.771 | 0.031
Retraining contracting path and fine-tuning expanding path | 0.071 | 0.808 | 0.026
Retraining expanding path | 0.058 | 0.863 | 0.021
Retraining expanding path and fine-tuning contracting path | 0.054 | 0.865 | 0.020
Table 5. Fine-tuning results of the pretrained U-Net on the validation set, fine-tuned with the true orthophoto dataset D2. The left-hand column shows the fine-tuning technique used, and the remaining columns report the loss, F1-score, and pixel error achieved with each technique.
Final Segmentation Performance, D2
Fine-Tuning Technique | Loss | F1-Score | Pixel Error
Fine-tuning whole network | 0.053 | 0.853 | 0.021
Fine-tuning contracting path | 0.056 | 0.850 | 0.020
Fine-tuning expanding path | 0.068 | 0.817 | 0.025
Retraining contracting path | 0.047 | 0.880 | 0.018
Retraining contracting path and fine-tuning expanding path | 0.046 | 0.892 | 0.017
Retraining expanding path | 0.049 | 0.881 | 0.018
Retraining expanding path and fine-tuning contracting path | 0.045 | 0.891 | 0.017
Table 6. Fine-tuning results of the pretrained U-Net on the validation set, fine-tuned with the true orthophoto dataset D1. The left-hand column shows the fine-tuning technique used, and the remaining columns report the loss, F1-score, and pixel error achieved with each technique.
Final Segmentation Performance, D1
Fine-Tuning Technique | Loss | F1-Score | Pixel Error
Fine-tuning whole network | 0.050 | 0.865 | 0.019
Fine-tuning contracting path | 0.051 | 0.868 | 0.019
Fine-tuning expanding path | 0.061 | 0.845 | 0.023
Retraining contracting path | 0.041 | 0.898 | 0.015
Retraining contracting path and fine-tuning expanding path | 0.040 | 0.905 | 0.015
Retraining expanding path | 0.046 | 0.894 | 0.017
Retraining expanding path and fine-tuning contracting path | 0.042 | 0.896 | 0.016
Table 7. Fine-tuning results of the U-Net pretrained on the Helsinki urban area, fine-tuned with the Kajaani suburban-area dataset D4, which includes DSM and DEM information. The table reports the performance of the fine-tuned models on the 9 km² Kajaani true orthophoto validation set.
Segmentation Performance
Fine-Tuning Technique | Loss | F1-Score | Pixel Error
Fine-tuning whole network | 0.028 | 0.877 | 0.009
Fine-tuning contracting path | 0.041 | 0.624 | 0.016
Fine-tuning expanding path | 0.035 | 0.833 | 0.010
Retraining contracting path | 0.047 | 0.652 | 0.015
Retraining contracting path and fine-tuning expanding path | 0.050 | 0.668 | 0.015
Retraining contracting path and fine-tuning whole network | 0.045 | 0.751 | 0.012
Retraining expanding path | 0.037 | 0.813 | 0.011
Retraining expanding path and fine-tuning contracting path | 0.043 | 0.837 | 0.011
Retraining expanding path and fine-tuning whole network | 0.040 | 0.852 | 0.010