3.1. LUCAS and Flickr Images Classification
The automatic classification of natural scene images is the key component of this framework, as it directly determines the reliability of the land cover verification results. The NSIC-Inception model was evaluated on the test set: its top-1 accuracy was 95.48% and its top-3 accuracy was 99.47%. The precision and recall of each category are shown in Table 4.
The precision of A1, B15, and B2 was above 90%, with A1 the highest at 99.80%; the precision of B16, however, was only 60.32%. The recall of all four categories was above 90%, with B16's recall the lowest of the four, and the recall of A1, B2, and B15 increasing in that order.
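Per-class precision and recall follow directly from the confusion matrix; a minimal sketch is given below. The counts are illustrative values for four classes (A1, B15, B16, B2), not the paper's data.

```python
import numpy as np

def per_class_precision_recall(cm):
    """Precision and recall per class from a square confusion matrix.

    cm[i, j] = number of images whose true class is i and predicted class is j.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                  # correctly classified images per class
    precision = tp / cm.sum(axis=0)   # column sums = all images predicted as class j
    recall = tp / cm.sum(axis=1)      # row sums = all images truly of class i
    return precision, recall

# Hypothetical counts for (A1, B15, B16, B2); B16 draws false positives
cm = [[950, 10, 30, 10],
      [  5, 900, 20,  5],
      [  2,   3, 90,  1],
      [  1,   2, 10, 95]]
p, r = per_class_precision_recall(cm)
```

With these toy counts, B16's column collects misclassifications from the other classes, reproducing the pattern of high recall but low precision reported for B16.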
The NSIC-Inception, ResNet50, and VGG16 models were trained with the same transfer learning procedure, and their model accuracy and land cover verification accuracy (Table 5) were compared. The three models differ clearly in structure. The NSIC-Inception model sets multiple convolutional kernels of different sizes in parallel to extract image features. VGG16 is a typical deep network consisting of five groups of convolutions, two fully connected layers, and one classification layer; all convolutional layers use 3 × 3 kernels, and the network is built by vertical expansion and deepening. ResNet50 is designed to overcome the inefficient learning and degraded accuracy caused by deepening the network: shortcut connections feed the output of an earlier layer directly into the input of a later layer, skipping several layers in between.
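The shortcut connection idea can be illustrated with a minimal fully-connected analogue (a sketch only; ResNet50 itself uses convolutional bottleneck blocks): the input skips the weighted layers and is added back before the final activation, so the block only has to learn a residual correction.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Toy residual block: output = relu(F(x) + x), where F is two
    weighted layers. The '+ x' term is the shortcut connection."""
    out = relu(x @ w1)     # first transformation
    out = out @ w2         # second transformation (no activation yet)
    return relu(out + x)   # shortcut: add the untouched input, then activate

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
w1 = rng.normal(size=(8, 8)) * 0.01
w2 = rng.normal(size=(8, 8)) * 0.01
y = residual_block(x, w1, w2)

# With zero weights, F(x) = 0 and the block reduces to the identity
# (followed by ReLU) -- the property that keeps deep networks trainable.
y_identity = residual_block(x, np.zeros((8, 8)), np.zeros((8, 8)))
```

Because the identity is trivially representable, stacking many such blocks cannot make the network worse than a shallower one, which is the degradation problem the text describes.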
Both the model accuracy and the agreement of the image classification results with CCI-LC were slightly higher for the NSIC-Inception pre-trained model than for the VGG16 and ResNet50 pre-trained models. Moreover, the agreement of the LUCAS images was much higher than that of the Flickr images. This shows that, although the natural scene image classification model has an impact on the verification results of the land cover data, the impact of the image source is more pronounced than that of the classification model.
In summary, the NSIC-Inception model proposed in this study performed better and could classify natural scene images of categories A1, B15, and B2 with accuracy high enough for land cover map verification. Therefore, the 12,879 natural scene images used for CCI LC verification were classified by the NSIC-Inception model, and only those with a classification confidence greater than 0.9 were retained. The results are shown in Table 6.
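The confidence-filtering step can be sketched as follows. The probability rows below are illustrative, not the paper's data; the 0.9 threshold matches the one used in the study.

```python
import numpy as np

def filter_by_confidence(probs, threshold=0.9):
    """Keep only predictions whose top class probability exceeds threshold.

    probs: (n_images, n_classes) array of softmax probabilities.
    Returns the indices of retained images and their predicted class labels.
    """
    probs = np.asarray(probs)
    conf = probs.max(axis=1)                 # top-1 confidence per image
    keep = conf > threshold
    return np.flatnonzero(keep), probs[keep].argmax(axis=1)

# Hypothetical probabilities for 4 images over (A1, B15, B16, B2)
probs = [[0.97, 0.01, 0.01, 0.01],   # confident A1  -> kept
         [0.40, 0.35, 0.15, 0.10],   # ambiguous     -> dropped
         [0.02, 0.95, 0.02, 0.01],   # confident B15 -> kept
         [0.30, 0.30, 0.20, 0.20]]   # ambiguous     -> dropped
idx, labels = filter_by_confidence(probs)
```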
As can be seen from Table 6, a total of 9997 geo-tagged natural scene images were classified with a confidence above 0.9. Category A1 accounts for about 92.84% of the 2015 LUCAS image dataset (7719 images), while the set contained only 108 B2 images (1.30%). Even though the LUCAS survey sampled space homogeneously, the dataset showed a strongly uneven distribution, with far more A1 images than images of any other category. In contrast, the 2017 Flickr images were distributed more evenly across categories.
3.2. LC CCI Verification and Analysis in the UK
Verifying a land cover map first requires the spatial locations of the samples. The land cover types can then be verified against the "ground truth" at the sampled points, and the quantity and proportion of samples per class can be calculated. A spatial pre-sampling method based on the image spatial distribution was used, so the samples share the spatial distribution of the geo-tagged images. The geo-tagged 2017 Flickr images were used to sample the 2017 CCI LC, and the geo-tagged 2015 LUCAS images were used to sample the 2015 CCI LC. The distribution of land cover categories at the sampling points is shown in Table 7.
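Sampling the land cover map at the photo locations amounts to converting each geo-tagged coordinate to a raster cell index and reading the class code. A minimal pure-Python sketch is shown below; the raster values, origin, and 300 m cell size are hypothetical stand-ins for the CCI LC grid, and a projected CRS in metres is assumed.

```python
import numpy as np

def sample_land_cover(raster, origin_x, origin_y, cell_size, points):
    """Look up the land cover class under each geo-tagged photo.

    raster: 2-D array of class codes; row 0 is the northern edge.
    origin_x, origin_y: coordinates of the raster's top-left corner.
    cell_size: pixel size in map units (e.g. 300 m for CCI LC).
    points: iterable of (x, y) photo coordinates in the same CRS.
    """
    classes = []
    for x, y in points:
        col = int((x - origin_x) // cell_size)
        row = int((origin_y - y) // cell_size)  # y decreases downward
        classes.append(raster[row, col])
    return classes

# Toy 3 x 3 raster (codes: 1 = A1, 2 = B15, 3 = B2), 300 m cells
raster = np.array([[1, 1, 2],
                   [1, 2, 2],
                   [3, 3, 2]])
pts = [(150, -150), (750, -750)]   # centres of cells (0, 0) and (2, 2)
sampled = sample_land_cover(raster, 0, 0, 300, pts)
```

In practice a GIS library would handle the affine transform and CRS, but the index arithmetic is the same.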
The very small percentage (0.24%) of the UK covered by class B16 resulted in even fewer B16 samples in the pre-sampling. It has been shown that a small sample size affects the reliability of verification results [30,31]. Consequently, type B16 was removed from the image dataset verification, and A1, B15, and B2 were retained for the agreement analysis in this study. As shown in Table 6, there were a total of 1683 LC CCI samples in 2017 and 8314 in 2015.
A confusion matrix was built using pivot tables (Table 6 and Table 7), as shown in Figure 6. The natural scene images served as the ground reference for judging whether the land cover class at each sample was correct.
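Building the confusion matrix is a cross-tabulation of image-derived reference classes against map classes; a minimal sketch with hypothetical labels follows.

```python
import numpy as np

def confusion_matrix(reference, mapped, n_classes):
    """Cross-tabulate image-based reference classes against map classes.

    reference[i]: class of the i-th natural scene image (ground reference).
    mapped[i]:    land cover class at the same location in the map.
    cm[i, j] counts samples with reference class i and map class j;
    the diagonal holds the consistent (correctly mapped) samples.
    """
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for r, m in zip(reference, mapped):
        cm[r, m] += 1
    return cm

# Hypothetical labels for six samples over classes (0 = A1, 1 = B15, 2 = B2)
ref = [0, 0, 1, 1, 2, 0]
lc  = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(ref, lc, 3)
```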
The overall accuracy, producer's accuracy, user's accuracy, and Kappa coefficient of the LC CCI map verification are shown in Table 8. The geo-tagged images of both sources were classified with the same NSIC-Inception model, yet the verification results varied widely between LUCAS and Flickr. The Kappa coefficients of the CCI LC verification with LUCAS and Flickr were 0.57 and 0.46, respectively, both indicating medium consistency, but the LUCAS coefficient was 0.11 higher than that of Flickr.
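Overall accuracy and the Kappa coefficient both derive from the confusion matrix; a short sketch is given below with an illustrative 2 × 2 matrix (not the paper's data).

```python
import numpy as np

def oa_kappa(cm):
    """Overall accuracy and Cohen's kappa from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    po = np.trace(cm) / n                                 # observed agreement (OA)
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2   # chance agreement
    return po, (po - pe) / (1 - pe)                       # kappa corrects OA for chance

cm = [[40,  5],
      [10, 45]]
oa, kappa = oa_kappa(cm)
```

Kappa discounts the agreement expected by chance, which is why two verifications with similar overall accuracy can still differ in Kappa when their class distributions differ.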
In terms of overall accuracy, the verification result for the LUCAS images was 30.83% higher than that for the Flickr images. However, the per-category producer's accuracy difference between the LUCAS and Flickr image sets was at least 20.83% smaller than the difference in overall accuracy, and the per-category consistency behaved differently in the two datasets. This may be related to factors such as image sample size, spatial distribution, and the spatial heterogeneity of land cover.
3.3. Comparison and Analysis of LUCAS vs. Flickr
The differences between LUCAS and Flickr were compared and analyzed regarding the quantity of images, spatial distribution, representativeness of images, and the camera angle.
In order to quantitatively measure these four impacts, the verification area was divided into 3000 × 3000 m grid cells. The quantities of LUCAS and Flickr images and the number of land cover types were counted within each grid cell. The number of land cover types ranged from 1 to 4, i.e., how many of the types A1, B15, B16, and B2 occurred within the cell, and was used to express the spatial heterogeneity of land cover. The average verification accuracy in each grid cell and at each heterogeneity level was calculated as the ratio of the number of consistent verifications to the total number of images.
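The per-grid heterogeneity count can be sketched as follows; the coordinates and class labels are hypothetical, and a projected CRS in metres is assumed so that 3000-unit cells correspond to 3000 m.

```python
def grid_heterogeneity(points, classes, cell=3000.0):
    """Count distinct land cover classes per 3000 x 3000 m grid cell.

    points:  (x, y) coordinates of verification samples in metres.
    classes: land cover class at each sample.
    Returns {(col, row): number of distinct classes}, i.e. the spatial
    heterogeneity of the cell (1-4 for classes A1, B15, B16, B2).
    """
    seen = {}
    for (x, y), c in zip(points, classes):
        key = (int(x // cell), int(y // cell))   # integer grid index
        seen.setdefault(key, set()).add(c)
    return {k: len(v) for k, v in seen.items()}

# Hypothetical samples: three fall in cell (0, 0), one in cell (1, 0)
pts = [(100, 100), (2900, 200), (500, 2500), (3100, 100)]
cls = ['A1', 'B15', 'A1', 'A1']
het = grid_heterogeneity(pts, cls)
```

The same grouping, with consistency flags instead of class labels, yields the per-cell average verification accuracy described above.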
- (1) The quantity of images
Flickr had a much smaller total number of images than LUCAS. For category A1 in particular, Flickr had 7180 fewer images than LUCAS, as shown in Table 5, and the PA of LUCAS was 9.06% higher than that of Flickr. This large volume gap made both the PA and the OA of Flickr lower than those of LUCAS. Conversely, in category B2 Flickr had 224 more images than LUCAS, and the PA of Flickr was 10.0% higher than that of LUCAS. Thus, the quantity of images plays an important role in the verification results.
- (2) The spatial distribution of images
When images are used for verification, the spatial distribution of the samples is completely determined by the distribution of the images. The spatial distributions of the LUCAS and Flickr image collections were therefore compared. LUCAS photos are taken according to fixed sampling rules, while the spatial distribution of Flickr photos is random; this can influence the verification of the land cover maps [32].
As shown in Figure 7, LUCAS images were mainly distributed in areas containing one or two land cover types, accounting for 57% and 37%, respectively, whereas Flickr photos accounted for 25% and 45% in areas with one and two land cover types. There were more Flickr images than LUCAS images in areas with higher spatial heterogeneity of land cover, which may lead to verification inconsistency more often.
In addition, for both LUCAS and Flickr, the average verification accuracy in grid cells with low spatial heterogeneity was higher than in cells with high spatial heterogeneity. As spatial heterogeneity increased from 1 to 4, the average verification accuracy of LUCAS decreased from 0.98 to 0.78, and that of Flickr from 0.70 to 0.46.
It can be assumed that the consistency of the verification results is related to the diversity of land cover at the sample locations: the lower the spatial heterogeneity of the area over which the images are distributed, the more reliable the validation of this method. In areas with high spatial heterogeneity of land cover, verification may require more representative images.
- (3) The representativeness of images
Theoretically, the layout of sampling points should account for the spatial heterogeneity of land cover: locations with high spatial heterogeneity need more samples, while locations with low spatial heterogeneity need fewer. In other words, the sample points should be as representative as possible. There was a gap in representativeness between the Flickr and LUCAS images, as can be confirmed from Figure 8 and Figure 9.
The quantities of LUCAS and Flickr images in each 3000 × 3000 m grid cell were counted and mapped over the UK, as shown in Figure 8. A scatter plot of image quantity versus average accuracy in each grid cell was drawn, and a linear fit with 0.95 confidence intervals was computed, as shown in Figure 9.
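The linear fit of per-grid accuracy against image quantity can be sketched as below. The data points are hypothetical, chosen only to mimic the reported pattern of consistency rising with image count; the paper's own fit uses the actual per-grid values.

```python
import numpy as np

# Hypothetical per-grid data: quantity of images vs. average consistency
quantity = np.array([1, 5, 10, 20, 40, 60, 80], dtype=float)
accuracy = np.array([0.55, 0.58, 0.62, 0.68, 0.75, 0.82, 0.90])

# Least-squares linear fit: accuracy ~ slope * quantity + intercept
slope, intercept = np.polyfit(quantity, accuracy, 1)

# A positive slope means consistency rises with the number of images
# per grid cell; the intercept is the expected accuracy near zero images.
```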
From Figure 8a, LUCAS images were distributed across almost the whole UK, with only 1–4 images per grid cell. Nevertheless, the average consistency within the cells was as high as 93.78%, as shown in Figure 9a. This indicates that the LUCAS images, as records of land cover type, are highly representative.
From Figure 9b, the Flickr images had a limited distribution range, mostly clustered in London, Manchester, York, and other large cities in England. The quantity of images per grid cell ranged from 0 to 84, mostly concentrated between 0 and 20, but the corresponding average consistency fluctuated over a wide range, and the starting value of the fit was only 57.99%, which is 35.79% lower than that of LUCAS. This suggests that, as records of land cover type, Flickr images are still not spatially representative enough. However, the slope of the fitted line is positive, so consistency increases with the quantity of images. That is, although Flickr images are not highly representative of land cover types spatially, increasing the quantity of images per grid cell can make the verification result more reliable.
The LUCAS and Flickr images performed differently in the same land cover verification case because of their different shooting purposes, photographers, and sharing methods. In addition to the reasons identified in the analysis above, some subjective behaviors also played a role.
For example, the LUCAS field teams were explicitly required to find a fixed shooting angle before photographing, and images taken from the central perspective were used in this study. Such an angle mostly captures the main content of the scene and, for ground-level or nearby natural scenes, reflects the "ground truth". In contrast, the images shared by the public on Flickr may be tourist records or commemorations of particular scenes, such as a building standing conspicuously in grassland or an ornamental flower bed in a city. The NSIC-Inception model can correctly identify the building as B15 and the flower bed as A1, yet at the 300 m resolution of the map the underlying land cover is grassland and artificial surface, respectively. Such mismatches between the minimum unit recorded by a photo and the minimum mapping unit of the land cover product are probably more common in Flickr collections than in LUCAS.
Therefore, the differences caused by these subjective factors also affect the verification results against the land cover dataset.
3.4. Verification of CCI LC and GLC-FCS
The verification of GLC-FCS was compared with that of CCI LC using the LUCAS images, which gave more reliable results than Flickr. Since the spatial resolution of the GLC-FCS dataset differs from that of CCI LC, the comparison allows the impact of spatial heterogeneity on the proposed method to be explored.
The results of verifying CCI LC and GLC-FCS with the LUCAS images were compared. The confusion matrix of GLC-FCS and its difference from the confusion matrix of CCI LC were drawn, as shown in Figure 10. When the spatial resolution of the land cover map was improved, the number of B15 sample points increased, and compared with CCI LC there were 71 additional verification samples whose photo category was consistent with the land cover category. However, the sample points of B2 and A1 decreased, and the consistent verification samples decreased by 23 and 170, respectively.
As shown in Table 9, the overall accuracy of GLC-FCS was slightly lower than that of CCI LC. Notably, the PA of B2 was higher for GLC-FCS than for CCI LC. The PA and UA of A1 changed only slightly in either direction, which may be caused by errors in the products themselves. The B2 sample points were too few to analyze further here.
We presumed that the change in the verification results of B15 was due to the higher spatial heterogeneity of artificial surface land cover types, such as cities, parks, and neighborhoods. Land cover maps with higher resolution bring the images closer to the mapping units of the land cover, so the PA improved relative to the map with 300 m spatial resolution, as shown in Figure 11.
In the CCI LC map, only A1 land cover was found near the sampling points, whereas both A1 and B15 land cover were found near the sampling points in GLC-FCS. Comparing the land cover at high and low resolution, B15 shows stronger spatial heterogeneity than A1, as expected. The minimum mapping unit of the land cover map and that of the images do not match: when the spatial resolution is increased from 300 × 300 m to 30 × 30 m, many sample points become consistent, but not all. The key to this method is the match between the smallest mapping unit and the smallest unit recorded by the image.