1. Introduction
A land-cover map is an image where each pixel value corresponds to a land-cover class label. Land-cover maps are essential in many Earth-monitoring applications, such as forest evolution [1], urban expansion [2], or climate modelling [3]. An efficient way to produce land-cover maps is to use satellite images, especially from the optical domain [4,5]. Recent decades have seen the launch of numerous satellites dedicated to Earth monitoring which provide high-quality data for land-cover mapping, such as the SPOT and Landsat families and, more recently, the Sentinel-2 constellation. Using satellite image time series improves classification accuracy compared to single-date approaches [6]. Land-cover maps can be produced with two main procedures: human photo-interpretation and automatic mapping using machine-learning classifiers. In practice, photo-interpretation is a time-consuming manual task, so automated processing is preferred for operational land-cover mapping over large areas. Classification algorithms based on machine learning are commonly applied; they consist of a set of decision rules assigning a label to each pixel. Two categories of classification algorithms are considered for land-cover mapping: supervised and unsupervised. The state of the art shows that supervised classifiers such as Support Vector Machines (SVM) [7] or Random Forests (RF) [8] outperform unsupervised approaches [9]. Maximum Likelihood classifiers have also been used for land-cover mapping [10], but they do not behave well in high-dimensional feature spaces such as those encountered with multispectral image time series. SVM and RF can handle data sets with large numbers of features, such as the various spectral bands for the different dates of a satellite image time series, and with many training samples covering a wide geographical extent. In addition to image time series, high-quality labelled reference data are essential for an efficient classification [11].
Reference data can come from various sources, such as dedicated field surveys or existing databases. Field surveys are expensive and tedious to perform, since the fields must be visited several times a year to capture seasonal land-cover classes such as crops. Therefore, reference data are generally available only after the corresponding mapping period, whereas the remote-sensing images are available a few days after their acquisition. In essence, the delays in obtaining reference data are the main obstacle to the timely and frequent production of land-cover maps. These delays are even longer when using publicly available databases, which have their own production and updating cycles. Naturally, the delays grow with the area covered and the kinds of classes considered. Consequently, large-scale (i.e., over large areas) land-cover mapping [12] at an annual frequency can rely neither on field surveys nor on existing databases.
In this study, we propose and evaluate methods to overcome the absence of reference data for the period to be mapped. The purpose is to classify an image time series by using prior information, namely past image time series and past reference data.
However, the use of previous data is not straightforward due to changes between years. These changes can occur at two levels. At the image level, the behavior of the same vegetation class may differ between years due to climate differences, and the number and dates of the acquisitions may vary. At the class level, the same point of the map can belong to different classes depending on the year, which requires the production of a new land-cover map each year. Land-cover changes, or transitions, can be natural or artificial. A natural transition occurs, for instance, when a river-edge pixel becomes soil instead of water because of a change in water level. An artificial transition is often due to urban expansion, when a new building is constructed on farmland. In our application case, the main artificial transitions correspond to crop classes, due to crop rotation, which mainly consists of alternating winter and summer crops in two consecutive years. For all these reasons, reference data have only a limited validity period and must be used only with the contemporaneous image time series.
In a previous work, we proposed simple strategies that use the history of the mapped area and fusion schemes [13]. First, a naive baseline was defined, consisting of training a classifier using past (image and reference) data and classifying the image time series of the current period. The naive classification considers only one previous period at a time. This approach yields maps with an OA between 50% and 70%.
A second method, the Single Classifier, was trained using all the available previous data. When only one year is used, it is equivalent to the naive classifier. This approach improves the OA of the naive case by 5% to 10%, depending on the amount of previous data used. The Single Classifier has a major limitation: it requires a large number of training samples to be efficient, so all the past data must be kept, which may cause issues for operational processing.
A third approach used voting methods. Majority voting over the naive maps of several previous periods, and a variant weighted by the confidence of the naive classifiers, yielded interesting results approaching an OA of 82%. The main drawback of these approaches was the need for several past periods.
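The naive baseline and the confidence-weighted voting described above can be sketched as follows, using scikit-learn's RandomForestClassifier on synthetic stand-ins for the per-pixel time-series features. All names, shapes, and values here are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Illustrative stand-ins: per-pixel feature vectors (spectral bands x dates,
# flattened) and class labels for two past years, plus an unlabelled current year.
n_pixels, n_features = 500, 40
past_years = []
for shift in (0.0, 0.3):  # two past years with a small radiometric shift
    X = rng.normal(shift, 1.0, (n_pixels, n_features))
    y = rng.integers(0, 3, n_pixels)   # 3 hypothetical land-cover classes
    X[y == 1] += 1.5                   # make the classes separable
    X[y == 2] -= 1.5
    past_years.append((X, y))
X_current = rng.normal(0.1, 1.0, (n_pixels, n_features))

# Naive baseline: train on one past year, classify the current year.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(*past_years[0])
naive_map = clf.predict(X_current)

# Confidence-weighted voting: sum each naive classifier's class probabilities
# over all past years, then take the argmax per pixel.
proba_sum = np.zeros((n_pixels, 3))
for X_past, y_past in past_years:
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    clf.fit(X_past, y_past)
    proba_sum += clf.predict_proba(X_current)
voted_map = proba_sum.argmax(axis=1)

print(naive_map.shape, voted_map.shape)
```

In the real setting, each classifier is trained on a full image time series of one past period, and the voting combines the resulting land-cover maps of the current period.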
In the literature, the lack of reference data is handled with Domain Adaptation (DA) algorithms. DA considers two domains: the source domain, which has reference data, and the target domain, for which few or no reference data are available. For land-cover mapping, two cases are generally considered:
The reference data of the source domain are not representative of the area covered by the image [14].
The source and target domains are two different images [15]. For this use of DA methods, both domains must be related in terms of land-cover classes and climate.
Tuia et al. [16] recently surveyed DA methods used for remote-sensing data classification. The authors identified four categories of algorithms:
Adapting classifiers with unsupervised or semi-supervised approaches.
Adapting classifiers by Active Learning.
Selecting invariant features for classification.
Adapting data distributions.
The first two categories adapt the decision rules of the classifier using the target domain data, whereas the last two adapt the data. The first step consists of finding common information in both domains. Then, the source domain is adapted to take the shifts into account. Once this is done, the target domain can finally be classified.
Adapting the classifier is one of the most complex tasks. The main idea is to train a classifier on the source domain, then identify useful elements in the target domain and use them to modify the decision rules of the classifier. The first category of algorithms is automated using unsupervised or semi-supervised approaches. Bruzzone et al. [14] consider a Maximum Likelihood classifier and propose an unsupervised update of its decision rules. This classifier, based on Bayes decision theory, uses the a priori probability and the conditional probability density distribution of each class to determine a set of decision functions. Different methods are then used to compute the same probabilities for the target domain and iteratively adapt the decision rules. In a similar context, Rajan et al. [17] use a knowledge-transfer approach to compute weights based on target-domain samples in an ensemble of Binary Hierarchical Classifiers (BHC) [18].
In our context, the land-cover maps are produced using RF. The decision trees in the RF are more elaborate than the ones constituting a BHC ensemble. Indeed, the BHC transforms a C-class classification problem into a hierarchy of binary classification problems, instead of considering all the classes at once, as the RF does.
The second category of classifier adaptation algorithms comprises methods based on Active Learning. Active Learning approaches achieve satisfying results because the user interacts several times during the process. Indeed, the term 'Active' refers to the fact that the user must select domain samples and label them, for instance by visual interpretation. This type of method is prohibitive in our case because it is unusable for operational production at the national scale.
The feature-extraction approach involves finding a set of features usable in both domains. The two domains are considered at the same time to determine a set of invariant features, i.e., the ones least affected by shifts. The invariant features also allow the shifts to be estimated. A projection matrix is computed and used to transform both domains into a new one, in which the source and target data points can be processed indiscriminately. In the literature, the authors mainly propose methods for invariant-feature detection and selection [15,19], with different distance measures to determine the shifts between domains. These methods are widely used to correct two kinds of shifts: illumination variations and effects due to the sensor viewing angle.
In our case, the data are acquired with a constant viewing angle, and the radiometric variations due to illumination changes are already corrected and expressed in surface reflectance, meaning the images are invariant to these shifts. The second use of feature extraction is to smooth the differences between two parts of an image. The problem of classifying a large image with a small area covered by reference data is not addressed in this work: in current practice the images are split according to eco-climatic areas before mapping [12], and the reference data are an excellent representation of the mapped area. In this work, the Formosat-2 images are included in a single eco-climatic area. A split into eco-climatic areas seems more efficient than this use of DA in our case, since a complete satellite image can cover very different landscapes (seaside, mountains and plains, for example), and an essential premise of DA is that the two domains must be similar. Moreover, removing features is not considered at the moment, as the RF selects relevant features during the training phase. Coupled with reference data scattered over the image, the RF has a great generalization capability [20].
The last category of DA methods aims to adapt the data. These approaches seem to be the most interesting in our case. The main idea is to find a transformation that makes the source-domain data match the distribution of the target-domain data. Once the source data are projected into the target domain, the supervised classifier can be used. The main difference with the feature-extraction methods is that all the features are kept during the projection. In the literature, two major approaches are presented: estimating a projection matrix, or aligning the data distributions. The methods proposed to align the distributions estimate the distance between two different time series, for instance with Dynamic Time Warping (DTW) [21]. In another example, the alignment is done using histogram matching [22]. The most common methods to estimate the projection matrix are based on a kernel matrix, and most existing works propose different ways to estimate this matrix [23,24]. This category is widely referenced in the literature, with many algorithms; the interested reader can refer to the surveys [16,25].
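As a sketch of the distribution-alignment idea, histogram matching of one feature can be implemented with quantile (CDF) mapping in plain numpy. The data below are synthetic stand-ins, not the paper's imagery:

```python
import numpy as np

def match_histogram(source, target):
    """Map the values of `source` so that their empirical distribution
    matches `target`'s, via quantile (CDF) matching."""
    src_sorted = np.sort(source)
    tgt_sorted = np.sort(target)
    quantiles = np.linspace(0, 1, len(src_sorted))
    # Rank of each source value -> value at the same quantile of the target.
    src_ranks = np.searchsorted(src_sorted, source, side="left") / (len(source) - 1)
    return np.interp(src_ranks, quantiles, tgt_sorted)

rng = np.random.default_rng(1)
source = rng.normal(0.0, 1.0, 1000)   # e.g., one band of the source year
target = rng.normal(2.0, 0.5, 1000)   # the same band in the target year
matched = match_histogram(source, target)
print(round(matched.mean(), 2), round(matched.std(), 2))
```

After matching, the mapped source values follow the target distribution (here, mean near 2.0 and standard deviation near 0.5), while each pixel keeps its rank in the distribution.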
Recently, Optimal Transport (OT) was used for DA [26]. The OT problem is a well-known problem in statistics and mathematics, which saw significant progress during the Second World War. Kantorovich [27] describes the transportation problem as follows: m stations attached to a railway network deliver goods at rates of s_1, ..., s_m wagons per day, respectively. On the same railway network, n stations consume these goods at rates of t_1, ..., t_n wagons per day, respectively. At the end of the day, the equality s_1 + ... + s_m = t_1 + ... + t_n must be respected, meaning that all wagons sent by the supplying S stations have been received by the consuming T stations. The cost c_ij, representing the energy required to carry one wagon from station S_i to station T_j, is known. The problem is, therefore, to assign to each consumption station a set of production stations so that the total transport cost is minimal.
By considering the quantities of goods transported by the wagons as probability masses, OT has become a DA method, used to find an optimal transformation from source to target. Once an optimal coupling between the source and target domains is found, the coupling matrix is used to project the source data onto the target domain. OT is presented in detail in Section 2.2.
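The transportation problem above can be written as a small linear program and solved exactly. The sketch below uses scipy.optimize.linprog on a toy two-station example; the numbers and the notation (s, t, C, G) are illustrative, not the paper's data:

```python
import numpy as np
from scipy.optimize import linprog

# Kantorovich transportation problem with 2 source and 2 target stations.
# s, t: wagons produced/consumed per day; C[i, j]: cost per wagon from S_i to T_j.
s = np.array([0.6, 0.4])
t = np.array([0.5, 0.5])
C = np.array([[0.0, 1.0],
              [1.0, 0.0]])

# Unknowns: the coupling G (flattened row-major). Minimize sum(C * G)
# subject to row sums = s, column sums = t, and G >= 0.
n, m = C.shape
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0          # row-sum (supply) constraints
for j in range(m):
    A_eq[n + j, j::m] = 1.0                   # column-sum (demand) constraints
b_eq = np.concatenate([s, t])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
G = res.x.reshape(n, m)
print(np.round(G, 3))      # optimal coupling
print(round(res.fun, 3))   # minimal total cost
```

Here the excess 0.1 of supply at S_1 must be routed at unit cost, so the minimal total cost is 0.1. In DA, the rows and columns become the source and target sample weights, and G is the coupling used to project the source data.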
Most of the methods listed above have limited use cases, making a transition to operational use complex. Indeed, most of the time the set of classes is restricted to classes with very different characteristics, such as water, buildings, forest, pasture, or vineyards, but rarely crops. The study area is generally reduced to a small number of pixels. Domains are generally composed of a single satellite image, which raises the question of dimensionality when using a complete time series.
In this paper, we propose to use OT to produce the current land-cover map without using the corresponding reference data. The OT algorithms are compared to the naive classifier, which can be seen as a very simple DA approach. The use of OT in fusion schemes is also shown.
The remainder of the paper is organized as follows. Section 2 introduces the data and the methods used. Section 3 presents the results obtained. Section 4 discusses the different methods. Conclusions and perspectives for future work are drawn in Section 5.
3. Results
In this section, we present the evaluation of the performance of the proposed algorithms. First, a general overview is given by presenting results averaged for each year, followed by the results of all pairs of years. Second, the EMD-PerClass approach is analyzed in detail, giving the performance for each class. Finally, majority voting using the maps produced with the EMD-PerClass approach is compared to majority voting using the naive maps.
3.1. Global Performance
To give an overall assessment of the proposed approaches and compare them to the supervised and naive baselines, the global averaged OA is presented. For the supervised case, the average is computed over ten random draws of the training data. For all other cases, the average includes the OA for all the combinations of a source-domain year with every other year as target domain. For example, considering 2007 as the source domain, a pair is formed with each of the other years as target; the OA values obtained over the 10 draws for each pair of years are then averaged and the 90% OA confidence interval is computed.
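The averaging and interval computation can be sketched in a few lines of numpy, assuming a normal approximation for the 90% confidence interval (z = 1.645). The OA values below are made up for illustration, not the paper's results:

```python
import numpy as np

# OA values from 10 random training-data draws for one (source, target) pair
# (illustrative numbers only).
oa_draws = np.array([0.71, 0.69, 0.72, 0.70, 0.68,
                     0.73, 0.71, 0.70, 0.69, 0.72])

mean_oa = oa_draws.mean()
# 90% confidence interval under a normal approximation, using the
# standard error of the mean (sample standard deviation, ddof=1).
sem = oa_draws.std(ddof=1) / np.sqrt(len(oa_draws))
ci_low, ci_high = mean_oa - 1.645 * sem, mean_oa + 1.645 * sem
print(round(mean_oa, 3), round(ci_low, 3), round(ci_high, 3))
```

The same computation, applied to the OA values pooled over all target years for a given source year, produces the per-source-year means and intervals plotted in Figure 4.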
The OA obtained for each method are presented in Figure 4. Two types of plots are shown. The dotted curves represent methods that cannot be used in a real case because they use the labels of the target domain, which are not available in operational settings. The full curves indicate methods usable in an operational use case. In this figure, 12 curves are drawn, with the source-domain years on the x-axis and the OA on the y-axis. The supervised and naive cases are represented by the blue and dark curves, with OA around 90% and 70% respectively. These two baselines being the upper and lower bounds of this work, only methods with an OA between them are interesting for land-cover map production. The PAS approach is represented by the yellow curve but, unexpectedly, its performance is similar to or lower than the naive case. The Sinkhorn-based approaches, SUP-Sinkhorn (dashed green curve) and SP-Sinkhorn (cyan curve), are inefficient for all considered source domains, with OA between 25% and 45%. With more than 80% OA, SUP-EMD-WL (orange dashed curve) is the best method, although it is not usable in a real case. The five other methods are close to the naive OA. SK-Cluster (grey curve) and SP-EMD (brown curve) have lower OA than the naive case, but with overlapping confidence intervals. The two baseline curves, SUP-EMD (light blue dashed curve) and SUP-Sinkhorn-WL (pink dashed curve), have slightly higher OA, but again within the confidence intervals. These two methods cannot be considered for operational production as they use the reference data of the selected samples. The EMD-PerClass performance is represented by the red curve, showing a slight improvement with respect to the naive case. Finally, the last method is PAS-EMD (purple curve). This approach produces maps with a statistically significant increase in OA compared to the naive case. However, as with all the proposed methods, the OA obtained remains very close to, and within the confidence interval of, the naive case.
Despite this fact, three interesting results can be highlighted. First, considering the Sinkhorn algorithm, the OA obtained by SUP-Sinkhorn and SUP-Sinkhorn-WL shows that the problem is too complex to be correctly solved with regularization. With an OA difference of around 30%, this result is confirmed for all the source domains.
Second, the confidence intervals are narrow for all pairs of years, showing the stability of the methods for a given source domain. As five methods give similar results, the confidence intervals help compare them. The SP-EMD approach has the widest confidence intervals, showing the importance of balancing the classes before the transport estimation. This point is illustrated by the OA obtained for 2011 as the source domain: for this year, the reference data cover almost the entire image, as they were produced by merging two reference data sets. Since the whole image is covered, the hypothesis of the SP approach that crop rotations preserve the class distribution between two years (Section 2.3.2) is more likely to hold. The confidence intervals confirm the inefficiency of SUP-Sinkhorn and SP-Sinkhorn, as the difference between the maximum OA for Sinkhorn and the minimum for any other method is more than 10%. Conversely, SUP-EMD-WL and SUP-Sinkhorn-WL have the narrowest confidence intervals, showing again the importance of class balance during OT estimation.
Third, the OA of PAS-EMD and EMD-PerClass is clearly linked to that of the naive case, suggesting that improving the latter should improve the former. This result opens the way to an iterative approach, where a new PAS-EMD is estimated using the map resulting from the previous iteration.
Impact of Domain Samples Selection
Figure 5 represents the matrices of differences between the OA of the naive case and all the other methods. Each matrix maps the source and target periods; the diagonal therefore represents the supervised case (set to zero here). Except for the first matrix, the values represent the OA obtained by producing the map of the year on the y-axis from the data of the year on the x-axis, according to the method indicated under the x-axis. The first matrix represents the difference between the supervised and naive cases. For this particular case, the OA used for the supervised case is the one obtained for the target domain indicated on the y-axis, and the value used for the naive case is the one corresponding to the years on the x-axis and y-axis. In the color scale, blue hues indicate an OA increase and red hues a decrease. The color bar is not symmetrical, showing that the OA degradations are larger than the improvements, but it is centered on zero: the lighter the red, the smaller the difference and the better the produced map. The best cases are shown in dark blue.
In the majority of cases, the proposed methods improve on the naive case. The worst results are obtained by the Sinkhorn methods using labels on the source domain only, SUP-Sinkhorn and SP-Sinkhorn, with an OA more than 40% lower than the naive case. For these two methods, the OA difference stays between −20% and −50%, which is consistent with the overall view given by Figure 4. The two methods show notable differences for some pairs of years, such as (2008, 2013) and (2010, 2012), which means that the selected samples have an impact on the regularization term. The impact of the sample selection is also shown by SUP-EMD and SP-EMD, which provide different results.
The use of labels in both domains has a strong impact, as shown by the SUP-Sinkhorn-WL and SUP-EMD-WL graphs. For the Sinkhorn algorithm, using labels improves the OA in most cases, bringing it above SUP-Sinkhorn and SP-Sinkhorn. This result shows that simplifying the problem is required for the Sinkhorn algorithm to be usable. The highest improvement is obtained with SUP-EMD-WL, with an OA increase of nearly 10% in all cases. Again, using labels in the transport estimation improves the performance.
The PAS approach provides results similar to the naive case, with very slight OA differences. By contrast, PAS-EMD provides a significant improvement over PAS and, in most cases, is better than using EMD-PerClass alone. Moreover, the processing time is shorter for PAS-EMD than for EMD-PerClass, as fewer classes have to be transported.
The baseline cases provide three significant results:
The results are consistent since variations caused by the choices of the pairs of years used are limited to about 5 to 10%.
The used samples have an impact on the efficiency of the estimated transport.
Estimating the transport independently for each class seems more efficient because it is less complex.
The gain of the per-class approach is shown by the SK-Cluster, EMD-PerClass and PAS-EMD methods. EMD seems to work better than SK-Cluster. Introducing the class diversity in the regularization term is not as efficient as in SUP-Sinkhorn-WL, the main reason being that the quality of the naive classification affects the quality of the labels used: the noise introduced by errors in the naive classification degrades the efficiency of the regularization term. In addition to the OA values obtained, the per-class EMD method is faster than the Sinkhorn processing, due in large part to the absence of the regularization step.
An interesting result is the absence of an optimal year pair for all methods. One can also observe from the structure of the matrices that each source domain does not have the same impact on a single target domain. This aspect opens the way to a fusion step of the maps produced after transport estimation if several source domains are available.
3.2. Particular Case Analysis: PAS-EMD
We focus the analysis on the method providing the best results while remaining usable in an operational context. In this section, a comparison between naive and PAS-EMD is provided, both for OA and per-class FScore.
Before considering the performance obtained by PAS-EMD, the FScore obtained by the naive classification is shown, as it represents the lower baseline in this work. Figure 6 represents heat maps, where the color scale gives the FScore value. The source domain (the year of the trained model) is given on the x-axis and the target domain (the year of the produced land-cover map) on the y-axis. The figure shows three categories of FScore: the first with very high values for all pairs of years (blue), the second, at the opposite end, with very low values (red), and the last with average values depending on the considered years. Broad-leaved tree and maize are majority classes and geographically stable in our data, which explains the excellent FScore obtained for all pairs of years. The worst FScore values are obtained for minority classes such as hemp or soybean. Several classes are known to be highly confused with majority classes, such as barley with wheat or sorghum with maize, which could explain the low FScores observed for these classes. The heat maps are not symmetric, and there is no single optimal pair of years for all classes shared by all the methods. These results are consistent with the global analysis.
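Per-class FScore values such as those in the heat maps can be obtained with scikit-learn's f1_score using average=None, which returns one value per class. The labels and class names below are placeholders for the actual nomenclature:

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative reference labels and a predicted map for one pair of years;
# the class names are placeholders for the land-cover nomenclature.
classes = ["wheat", "barley", "maize", "broad-leaved"]
rng = np.random.default_rng(2)
y_true = rng.integers(0, 4, 1000)
y_pred = y_true.copy()
flip = rng.random(1000) < 0.2                  # ~20% of pixels perturbed
y_pred[flip] = rng.integers(0, 4, flip.sum())

# Per-class FScore, as used in the heat maps: one value per class.
fscores = f1_score(y_true, y_pred, average=None, labels=[0, 1, 2, 3])
for name, f in zip(classes, fscores):
    print(f"{name}: {f:.2f}")
```

Passing an explicit labels list keeps the per-class ordering stable across year pairs, even if a minority class is absent from one prediction.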
Figure 7 shows heat maps where the color represents the difference in FScore between PAS-EMD and naive, computed as the PAS-EMD FScore minus the naive FScore. The classes are ordered as in Table 2, from left to right. The x-axis of each heat map represents the source-domain year and the y-axis the target-domain year. The values are represented along a color bar where a blue hue means a higher FScore difference and a red hue a lower one; white equals zero, meaning equality between the two methods. The color bar extends from −0.2 to 0.8, but these extrema are seldom reached. In most cases, the dominant color is blue, meaning a global improvement. Each class has a different margin of improvement depending on the FScore value obtained by the naive case: FScore values lie between 0 and 1, so for classes where the FScore is already higher than 0.8, an improvement of 0.1 is already satisfactory.
Therefore, Figure 6 and Figure 7 should be compared to understand the improvement brought by PAS-EMD for each class. A significant improvement is shown for the river class, where a difference of about 0.8 is achieved. In general, perennial classes benefit from the PAS-EMD approach with an overall improvement of the FScore. The only exception is the artificial-surface class, for which a degradation is observed when 2007 is the target domain. The FScore degradation cases are mainly grouped in the lower diagonal part of the matrix, showing a sensitivity to the source domain used.
For the annual classes, the results are quite robust, with an overall improvement between 0.1 and 0.3. Maize is the annual class that benefits the least from OT, but its FScore degradation is limited to 0.05, which remains negligible.
3.3. Majority Voting after OT
The previous results are obtained using only one source domain at a time. We investigate here the use of majority voting (MV) to exploit several source domains for predicting one target domain. Figure 8 represents the OA (y-axis) as a function of the number of source domains used for voting (x-axis). Two kinds of curves are presented: horizontal lines represent methods using only one source domain as input, and full curves represent the MV performance. The horizontal lines correspond to the mean values over all source domains from Figure 4. The reference cases are the supervised (blue dashed line), naive (black dashed line) and PAS (yellow dashed line) cases, the latter two being mostly confounded. The two methods providing the best results, EMD-PerClass (light blue dashed line) and PAS-EMD (orange dashed line), are used as input for voting and are also shown for reference. MV is then applied using the maps produced with these methods: the naive maps are used to produce the MV-Naive maps, and likewise for the other methods. The OA shown are those of MV-Naive (green curve), MV-PAS (pink curve), MV-EMD-PerClass (brown curve) and finally MV-PAS-EMD (purple curve).
Averaged over all source domains, the difference between naive and PAS is negligible. The use of transport alone improves performance by 2.5%, and the combined use of transport and PAS increases performance by 5% compared to the naive case. This gain carries over to the MV, with an average gap between 2% and 5% between MV-Naive and MV-PAS-EMD. The use of MV with PAS-EMD increases the OA from 0.75 to 0.85 and is better than the other voting approaches. The confidence intervals are very narrow, which shows the robustness of the voting methods to isolated classification errors. All voting methods show the same behavior: a very high OA when two years are used, then a decrease of 2% to 5% when three or more years are used.
As introduced earlier, standard measures are insufficient for the evaluation of voting methods. Therefore, the OA obtained must be tempered by looking at the ratio of undecided pixels. Figure 9 is a ratio graph, in which the x-axis represents the number of source domains used for voting, and the y-axis represents the ratio of correctly classified (top plot), undecided (middle plot), and incorrectly classified pixels (bottom plot). The three plots do not have the same range: the top one varies between 0.6 and 0.9, and the two others between 0 and 0.33. The presented methods are the same as in Figure 8, with the same color code and legend. Each method has three associated curves, one per category in each subplot, except for the reference methods, for which undecided pixels do not exist.
An interesting result highlighted by this figure is the percentage of incorrectly classified pixels for each method. Voting methods reduce the ratio of incorrectly classified pixels by about 10% to 20% compared to the reference methods. When only two years are used, MV produces a map with fewer errors than the supervised case, but the ratio of undecided pixels is very high. This explains the behavior of the voting curves: the ratio of undecided pixels is around 35% for two years and falls below 10% when more than three source domains are used.
This figure shows that the results of the MV-PAS-EMD are better than the others because it has the highest number of correctly classified pixels, the fewest undecided, and especially the fewest incorrectly classified pixels.
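A majority vote that marks tied pixels as undecided can be sketched in numpy as follows. The label maps and the UNDECIDED sentinel value are illustrative assumptions:

```python
import numpy as np

UNDECIDED = -1

def majority_vote(maps):
    """Per-pixel majority vote over a stack of label maps;
    pixels whose top count is tied are marked UNDECIDED."""
    maps = np.asarray(maps)                        # shape (n_maps, n_pixels)
    n_classes = int(maps.max()) + 1
    # counts[c, p] = number of maps voting class c at pixel p
    counts = np.stack([(maps == c).sum(axis=0) for c in range(n_classes)])
    best = counts.max(axis=0)
    tie = (counts == best).sum(axis=0) > 1         # several classes at the max
    voted = counts.argmax(axis=0)
    voted[tie] = UNDECIDED
    return voted

# With two source domains, ties produce undecided pixels.
two_maps = [np.array([0, 1, 2, 2]), np.array([0, 2, 2, 1])]
voted2 = majority_vote(two_maps)
print(voted2, (voted2 == UNDECIDED).mean())        # undecided ratio

# A third source domain resolves the ties.
three_maps = two_maps + [np.array([0, 1, 2, 2])]
voted3 = majority_vote(three_maps)
print(voted3)
```

This reproduces the behavior discussed above: an even number of inputs favors ties, so the undecided ratio is highest with two source domains and drops as more are added.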
Figure 10 presents the FScore value of each class (y-axis) as a function of the number of input source domains (x-axis). The color bar goes from red hues for values near 0 to blue hues for FScore values near 1. There are four heat maps in this figure, one per voting method. As expected, the FScore is always higher in the first column, due to the high number of undecided pixels when only two inputs are used for voting. This phenomenon is also observed for any even number of inputs, which favors cases of indecision. The figure highlights the FScore improvement compared to voting based on the naive case. MV-PAS-EMD is better than the other methods with respect to OA, pixel ratio, and FScore.
4. Discussions
Figure 4 and Figure 5 allow a quick comparison of the OA obtained with the different methods. Two types of analyses are carried out: first, baselines using reference data from both domains and, second, methods using only the reference data of the source domain. The baseline methods, where samples are controlled, define the trends that can be expected from OT in the best cases. The SUP-Sinkhorn case thus highlights the inefficiency of the algorithm, which is explained by the complexity of the data used: there are too many classes for an optimal minimization.
The management of samples is double-edged. On the one hand, the number of samples must be high enough to represent each class correctly. On the other hand, it must be small enough for the cost matrices to be computable. Otherwise, the algorithm does not find the optimal solution because of the high number and diversity of the classes considered.
This interpretation is confirmed by the results obtained by SUP-Sinkhorn-WL, which show that, with sufficient simplifications, the algorithm can provide a robust transport with an OA improvement of about 5% over the naive case, and of more than 35% over SUP-Sinkhorn. All these results confirm that our problem is too complex for a direct use of the Sinkhorn algorithm.
The EMD algorithm achieves satisfactory results with shorter processing times, but it requires some guidance to be efficient, as shown by the use of labels in both domains. The main impact of using labels is on the cost matrix: for samples belonging to two different classes, the cost is set to a very high value. As the EMD is a linear problem, it can then be separated into C independent transport estimations, where C is the number of classes. This approach yields significant gains in several cases, whereas the OA obtained by the standard transport estimation, which considers all classes at the same time, oscillates and remains close to the naive case. Its main drawback is that a first classification is needed, and its quality significantly impacts the transport efficiency.
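The label-aware cost masking described above can be sketched as follows. This is an illustrative reimplementation solving the exact transport program with SciPy's LP solver on a toy problem (the function name and penalty value are ours; a dedicated EMD solver such as POT's `ot.emd` would be used in practice): with a prohibitive cost on cross-class pairs, the optimal plan only moves mass within each class, which is what makes the split into C independent transports possible.

```python
import numpy as np
from scipy.optimize import linprog

def class_aware_emd(Xs, ys, Xt, yt, penalty=1e6):
    """Exact OT plan with a label-aware cost matrix.

    Sample pairs with different class labels receive a prohibitive
    cost, so the optimal plan transports mass within classes only.
    """
    ns, nt = len(Xs), len(Xt)
    # squared Euclidean cost between source and target samples
    M = ((Xs[:, None, :] - Xt[None, :, :]) ** 2).sum(-1)
    M = M + penalty * (ys[:, None] != yt[None, :])   # mask cross-class pairs
    a = np.full(ns, 1.0 / ns)                        # uniform sample weights
    b = np.full(nt, 1.0 / nt)
    # marginal constraints of the transport polytope (row-major flattening)
    A_eq = np.zeros((ns + nt, ns * nt))
    for i in range(ns):
        A_eq[i, i * nt:(i + 1) * nt] = 1.0           # row i sums to a[i]
    for j in range(nt):
        A_eq[ns + j, j::nt] = 1.0                    # column j sums to b[j]
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.x.reshape(ns, nt)
```

Because the masked cost matrix is block-structured, solving one small transport per class gives the same plan far more cheaply than the dense LP above.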
When the time series of the two domains are very different, OT provides a clear gain over the naive classification. This is the case, for example, for 2008, for which many winter dates are missing, and for 2011, for which no image is available for August.
The best case is achieved when the initial classification is done in a supervised way, but this case does not need transport and is only studied here for comparison purposes.
The maximum OA of each method corresponds to a different year pair, which indicates that the essential element of transport is the selection of relevant samples.
The major drawback of the naive case is the loss of the minority classes, which are drowned out by the majority classes in the training data sets. This is slightly corrected by transport. However, if two classes are too similar (for instance wheat and barley, or the water classes), separating them remains the main difficulty. Indeed, the analysis of the confusion matrices shows that most of the errors made by the classifiers remain consistent: for instance, vegetation classes are confused with grass, and pixels at lake boundaries are classified as gravel pit. The FScore analysis gives good insight into the advantage of using transport. This result is confirmed by the analysis of the impact of the per-class EMD on the MV FScore (
Figure 10), which shows that, even though the per-class FScore of the minority classes increases only slightly, the voting still predicts some samples as members of these classes. In other words, OT keeps the minority and hard-to-classify classes alive during the processing, and paves the way for an iterative method to improve their prediction.
These conclusions are confirmed by the combined use of the PAS and EMD-PerClass approaches. The OA is improved in most cases, and the FScore values are also improved in comparison to the naive case. The magnitude of the FScore improvement depends on its initial value, but in most cases this approach provides excellent results.
A limitation of the PAS approach is revealed by the FScore analysis: the lower diagonal of the FScore matrix of the artificial surfaces class shows a degradation of the FScore values. This result is specific to this class and can be explained by its nature.
Indeed, unlike other classes, artificial areas tend to appear at some point in time and then remain stable, rather than transition to another class. In particular, this means that a position indicating a building in a previous year will nearly always still be valid in a future year, whereas the opposite is not systematically true. Hence, not all samples used in the transport and training phases are valid, which explains the observed performance.
The results obtained by OT alone are not very satisfactory, providing in the best cases a relative gain of 5% in OA. Voting methods allow several source domains to be used jointly, and can therefore benefit from the gain due to transport.
Figure 8 and
Figure 9 show that this gain remains minimal given the workload involved in the effective use of OT. Again, the performances are boosted by the combined use of
PAS and
EMD-PerClass, allowing an improvement in OA near to 10% in comparison to the
naive-based voting. This approach provides a better OA with mostly correctly classified pixels and fewer undecided and incorrectly classified pixels. The results of this study provide a negative overview of the use of OT, which is mainly due to the quality of the dataset, and its strong heterogeneity. The 2008 case is significant because the absence of images for a long duration (see
Table 1) makes the use of this year as a source domain difficult. It is, therefore, evident that in similar cases, the use of OT makes the improvement of the classification accuracy possible. Because of all the difficulties encountered for using OT for land-cover mapping, the relative improvement of 5% OA is a therefore positive result.
In the particular case where only one source domain is available, the use of transport is recommended because it produces a more accurate classification. On the other hand, if two or more years are available, producing naive classifications and fusing them with a simple MV is faster and produces a map of similar quality; the trade-off between complexity and accuracy must therefore be considered.