The proposed dataset covers regions from five EU countries (Austria, Belgium, Spain, Denmark, and the Netherlands). AgriSen-COG comprises 6,972,485 parcel observations, grouped in 41,100 patches of size 366 × 366 pixels. Each observation is described by the parcel’s spatial and temporal characteristics or as a univariate time series, with an aggregated version for each polygon.
5.3. Dataset Creation Methodology
Crop type mapping datasets are built upon publicly available data. Several countries have already published their LPIS information for the EU region, making it possible to create large-scale crop type mapping datasets for ML/DL. In this paper, we analyze the available parcel information from all open-source EU crop data, resulting in the collection, processing, and analysis of five different areas (Austria, Belgium, Catalonia, Denmark, and the Netherlands). Since there is no standard regarding LPIS data, creating a dataset raises several challenges due to the nonuniformity of the data. We also propose a detailed and reproducible methodology that accounts for the heterogeneity of the LPIS input and may be applied further to extend the current dataset with additional areas. Including multiple regions in an ML/DL dataset for crop type mapping can increase a model’s generalization capability or help with domain adaptation. In this context, a common processing method is crucial for providing consistency among the data. Our dataset creation process is divided into three stages: (1) LPIS processing, to obtain a standard crop description for each polygon; (2) preparing the rasterized data, to obtain pairs of input data and GT; and (3) improving dataset quality using anomaly detection. In Figure 2 and Figure 3, we present the workflow of our methodology and the challenges each stage handles. The final dataset and several intermediate preprocessing steps are available for download from our repository. The entire processing code may be accessed as Python scripts on the project’s GitHub, and serves as a quickstart for modifying or extending the proposed dataset.
- (1) LPIS processing
Step 1.1 is the LPIS data collection. In the absence of a shared database, retrieving each piece of LPIS information is time-consuming and performed manually. Most of the LPIS datasets are available on the webpage of each country’s Ministry of Agriculture. However, navigation is hampered, as most websites use specific acronyms in the original language to denote the files. We collected and published the original files involved in the proposed dataset for easier access (Output 1.1). The original files helped to test the proposed workflow and to conduct different analyses/processing.
Step 1.2 consists of converting the LPIS data to a unified format, which helps with further processing. The lack of a standard is visible first in the output format used to deliver the LPIS data. We encountered Shapefiles [65], Geopackages [66], Geodatabases [67], and GeoJSON [68] files. The output format is not consistent even within the same area, as different years are delivered in distinct formats. Therefore, we unified the file types, providing the data in two formats (Output 1.2): Geopackage and Parquet. Geopackage was chosen due to its popularity among the geocommunity and its integration with geotools (e.g., QGIS [69]). We also provided the data in a partitioned Parquet [70] format that allows for further distributed processing, which is needed when handling a large number of polygons. For our end goal, we require a list of geometrical shapes mapped with a label for each polygon. Step 1.2 is crucial in standardizing the LPIS data to apply uniform processing algorithms later.
Step 1.3 continues our standardization by selecting a set of columns of interest, renaming them based on a chosen convention, and translating the corresponding values to English. For each LPIS, we considered the crop type, crop group, area, and geometry-related columns. We consider crop type and geometries mandatory fields, as they include the primary information we need for the proposed dataset. Crop group and area information are supplementary fields useful for statistical analysis. If the area for each polygon was not provided, we automatically computed it. Next, we proceeded with the English translation of the unique labels from the original LPIS. For a proper translation, we identified the correct encoding for each original LPIS file. The translations were manually corrected for each country to remove errors. This was a time-consuming manual process, but it was necessary to align the labels among various LPIS systems. Output 1.3 includes the English version of the LPIS data, with the same column naming for each country, together with a list of the corresponding encodings and the translation of each label. Both types of information serve as a starting point for further extension of the current dataset. Sen4AgriNet also offers translation data for France and Catalonia. However, there are differences between our proposed translation and theirs, probably due to our translation revision process, which improves the quality of the final output.
Step 1.4 addresses another issue created by the absence of a standard: distinct names for the same crop label class. This problem exists even for the same region when analyzing data from different years. We used the FAO Indicative Crop Classification (ICC) [15] categories to solve this and map each crop type to a new label. We followed the same naming convention as Sen4AgriNet, as we created AgriSen-COG to integrate with existing datasets. Furthermore, we believe that using a clear standard for crop labeling is helpful in further dataset extensions. FAO ICC uses a taxonomy that divides the crop types based on group > class > subclass > order. For AgriSen-COG, we chose the innermost ICC label and attached a number code to each crop type. There are 168 classes and subclasses in all, making up the custom FAO/CLC classification scheme. We incorporated two supplementary classes, Fallow land and background, in our final GT. In addition, each region’s resulting file (Output 1.4) contains all the upper levels from FAO ICC, including group, class, subclass, and order.
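The label harmonization in Step 1.4 can be pictured as a lookup from translated LPIS labels to a numeric code. The following is a minimal sketch only: the table `ICC_LOOKUP`, the codes, and the helper `harmonize_label` are hypothetical placeholders for illustration and are not the values or functions shipped with AgriSen-COG.

```python
# Hypothetical lookup: translated LPIS label -> (numeric code, harmonized name).
# Codes and names are illustrative placeholders, not the AgriSen-COG mapping.
ICC_LOOKUP = {
    "winter wheat": (111, "Wheat"),
    "spring wheat": (111, "Wheat"),
    "grain maize": (112, "Maize"),
    "silage maize": (112, "Maize"),
}
FALLOW = (998, "Fallow land")  # placeholder code for the supplementary class


def harmonize_label(label):
    """Map a translated LPIS label to a harmonized (code, name) pair.

    Returns None for unknown labels so they can be reviewed manually.
    """
    key = " ".join(label.lower().split())  # normalize case and whitespace
    if "fallow" in key:
        return FALLOW
    return ICC_LOOKUP.get(key)


# Two differently spelled regional labels end up in the same class.
codes = [harmonize_label(name) for name in ["Winter wheat", "  spring   WHEAT "]]
```

Returning `None` for unmapped labels, rather than silently assigning a default, keeps the manual revision step visible when a new region is added.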
- (2) Preparing rasterized data
Step 2.1 starts our rasterization process for converting the LPIS data into actual raster data. It consists of finding the exact boundaries of each area of interest (AOI). The limits are needed in the next step to find the intersecting Sentinel-2 tiles. To retrieve the boundaries, one may either use a region/country’s actual border coordinates or extract them from the LPIS file. We tested both approaches and decided to follow the latter. Even though the first option is faster, as the border files are publicly available, we are only interested in the region with agricultural representation. Therefore, we computed the boundaries from the previously generated LPIS. In this way, we eliminated from the start all of the Sentinel-2 tiles that do not intersect with any crop polygons.
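As a rough illustration of extracting the AOI limits from the parcels themselves, the sketch below computes a bounding box over all polygon vertices. The helper `total_bounds` and the sample coordinates are hypothetical; in a GeoPandas workflow, the `total_bounds` attribute of a GeoDataFrame gives the equivalent result.

```python
def total_bounds(polygons):
    """Return (min_x, min_y, max_x, max_y) over all polygon vertices.

    Each polygon is given as a list of (x, y) vertices of its outer ring.
    """
    xs = [x for ring in polygons for x, _ in ring]
    ys = [y for ring in polygons for _, y in ring]
    if not xs:
        raise ValueError("at least one polygon is required")
    return (min(xs), min(ys), max(xs), max(ys))


# Two made-up parcels in lon/lat coordinates.
parcels = [
    [(4.10, 51.00), (4.12, 51.00), (4.12, 51.02), (4.10, 51.02)],
    [(4.50, 50.90), (4.55, 50.90), (4.55, 50.95)],
]
aoi_bounds = total_bounds(parcels)
```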
Step 2.2 continues with the discovery of the Sentinel-2 tiles that will serve as input data for the proposed dataset. To ease the searching process, we used the S2 Amazon STAC catalogue (AWS S2 COGs STAC: https://registry.opendata.aws/sentinel-2-l2a-cogs/, accessed on 4 June 2023). We conducted STAC search queries based on each AOI’s boundary, cloud percentage, and our dates of interest.
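A STAC item search of this kind is typically expressed as a small JSON body combining a bounding box, a date interval, and a cloud-cover filter. The sketch below only builds such a body and does not contact any server. The helper `build_stac_search` is hypothetical, the collection identifier is an assumption about the catalogue, and the filter relies on the `eo:cloud_cover` property together with the STAC API query extension.

```python
def build_stac_search(bounds, start, end, max_cloud,
                      collection="sentinel-s2-l2a-cogs"):  # assumed collection id
    """Build a STAC API item-search body for one AOI and date range."""
    min_x, min_y, max_x, max_y = bounds
    return {
        "collections": [collection],
        "bbox": [min_x, min_y, max_x, max_y],
        "datetime": f"{start}/{end}",  # closed interval, ISO 8601 dates
        # Keep only scenes below the cloud-cover limit (in percent).
        "query": {"eo:cloud_cover": {"lt": max_cloud}},
    }


search_body = build_stac_search(
    bounds=(4.10, 50.90, 4.55, 51.02),
    start="2019-01-01",
    end="2019-12-31",
    max_cloud=30,
)
```

The resulting dictionary could then be posted to a STAC API search endpoint or translated into the keyword arguments of a STAC client library.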
Step 2.3 starts the actual rasterization of our LPIS information. The advantage of using S2 COGs is that we can study a Sentinel-2 tile without needing to download it. We took the unique tile regions identified in the previous step and used their bounding boxes to map the LPIS polygons on a new raster for each tile. The result (Output 2) was a georeferenced array for each tile. These constitute the ground truth data of the proposed AgriSen-COG. We used the previously mapped FAO ICC code to generate the pixel values for each geometry. The raster images were generated by matching each Sentinel-2 tile’s coordinate reference system (CRS). The GT raster was released in the following formats: Geotiff [71], as it is a popular geo-format; Zarr [72], to enable distributed processing; and COGs (cloud-optimized Geotiff: https://www.cogeo.org, accessed on 4 June 2023), to allow image access without downloading the data.
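The idea behind burning parcel codes into a tile-aligned grid can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the released processing code: a pixel takes a parcel's code when the pixel centre falls inside the polygon (a ray-casting test), whereas production workflows usually rely on a dedicated routine such as `rasterio.features.rasterize`. All function names and sample values are hypothetical.

```python
def point_in_polygon(x, y, ring):
    """Ray-casting test: True if (x, y) lies inside the polygon ring."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def rasterize(parcels, x_min, y_max, pixel_size, width, height, background=0):
    """Burn (ring, code) parcels into a grid whose origin is the top-left corner."""
    grid = [[background] * width for _ in range(height)]
    for row in range(height):
        cy = y_max - (row + 0.5) * pixel_size  # pixel-centre y (north-up grid)
        for col in range(width):
            cx = x_min + (col + 0.5) * pixel_size  # pixel-centre x
            for ring, code in parcels:
                if point_in_polygon(cx, cy, ring):
                    grid[row][col] = code
                    break
    return grid


# One 20 m x 20 m parcel with code 111 on a 4 x 4 grid of 10 m pixels.
parcels = [([(0, 0), (20, 0), (20, 20), (0, 20)], 111)]
mask = rasterize(parcels, x_min=0, y_max=40, pixel_size=10, width=4, height=4)
```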
- (3) Improving dataset quality with Anomaly Detection
We integrated the identification of mislabeled GT as a preprocessing step in our dataset creation. To our knowledge, AgriSen-COG is the first crop type dataset to incorporate an anomaly detection step to curate the data. Our goal is to identify the mislabeled crop parcels. Therefore, we need to prepare aggregated information at the polygon level. The workflow for our data preparation and anomaly detection process is described in Figure 4 and Figure 5.
Step 3.1 starts our data preparation for the anomaly detection task. First, we computed the NDVI index (NDVI = (NIR − Red)/(NIR + Red)) to capture the characteristics of our crop vegetation while reducing the multichannel structure to a one-channel image. The NDVI was computed for each tile at a pixel level to preserve the temporal and spatial dimensions.
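A per-pixel NDVI computation can be sketched as follows, assuming the near-infrared and red reflectances (Sentinel-2 bands B08 and B04) are available as nested lists of equal shape. The helper names are hypothetical, and returning `None` when both bands are zero is a choice made for this sketch.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index for one pixel."""
    denom = nir + red
    if denom == 0:
        return None  # undefined when both bands are zero
    return (nir - red) / denom


def ndvi_image(nir_band, red_band):
    """Apply NDVI pixel-wise to two bands of identical shape."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]


# Dense vegetation (high NIR, low red) versus a pixel with no signal.
image = ndvi_image([[3.0, 0.0]], [[1.0, 0.0]])
```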
Step 3.2 continues with cloud masking the NDVI image. As the NDVI was later used for anomaly detection, clouds would alter the process and introduce bias in a polygon’s time series. Therefore, we applied a cloud mask on each image. We decided to use the SCL mask, already delivered with the Sentinel-2 product, eliminating the need for another cloud processing algorithm to be added to our workflow. The SCL mask offers comparable results to top cloud masking methods [73]. From the 12 labels present in the SCL mask, we masked out the cloud- and snow-related pixel classes, namely saturated or defective, cast shadows, cloud shadows, cloud medium probability, cloud high probability, thin cirrus, and snow or ice.
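The masking itself reduces to a membership test on the scene classification value of each pixel. In the sketch below, the numeric codes follow the usual Sentinel-2 Level-2A SCL convention (1 saturated or defective, 2 cast shadows, 3 cloud shadows, 8 and 9 cloud medium and high probability, 10 thin cirrus, 11 snow or ice); the function name and the use of `None` for masked pixels are assumptions of this example.

```python
# SCL classes treated as invalid for the NDVI time series.
MASKED_SCL = {1, 2, 3, 8, 9, 10, 11}


def apply_scl_mask(ndvi_img, scl_img):
    """Replace NDVI values with None wherever the SCL class is masked."""
    return [
        [None if scl in MASKED_SCL else value
         for value, scl in zip(ndvi_row, scl_row)]
        for ndvi_row, scl_row in zip(ndvi_img, scl_img)
    ]


# SCL 4 (vegetation) and 5 (not vegetated) are kept; 9 and 3 are masked.
masked = apply_scl_mask([[0.5, 0.6], [0.7, 0.8]], [[4, 9], [3, 5]])
```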
Step 3.3 assembles our NDVI time series. Each pixel from our dataset has the following properties: (1) a sequence of NDVI values, representing the vegetation characteristics captured at different moments; (2) a crop label, describing the corresponding crop class (from LPIS); (3) a polygon identifier, i.e., a number assigned to each polygon from the LPIS data. Initially, the information is stored as multiple matrices of pixels, which are transformed into sequences having the values mentioned earlier.
Step 3.4 is the final data processing step before applying the LSTM Autoencoder for anomaly detection. As we are only interested in detecting anomalies at the polygon level, we aggregated the time series corresponding to pixels from the same polygon. Possible aggregations include the polygon median, the mean, or the barycenter. In our time series, we might have missing data for the same polygon due to cloud masking or just missing data from the original Sentinel-2 image. The median and mean are more sensitive to such missing data. Therefore, we computed the barycenter for each polygon, capturing the time-related variability of each polygon. The barycenter (Equation (1)) was computed using the DTW barycenter averaging (DBA) [54] algorithm. The barycenter is a sequence for each polygon that reflects a crop’s growing cycle from the respective parcel.
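To make the aggregation concrete, the following is a compact pure-Python sketch of DTW barycenter averaging. It is a simplified illustration, not the implementation used for the dataset: the barycenter is initialised with the first series and refined for a fixed number of iterations, both of which are assumptions of this sketch. Established libraries such as tslearn provide tested DBA implementations.

```python
def dtw_path(a, b):
    """Return the DTW alignment between a and b as a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end of both sequences to recover the alignment.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
    path.reverse()
    return path


def dba(series, n_iter=10):
    """DTW barycenter averaging over a list of numeric sequences."""
    center = list(series[0])  # simple initialisation (assumption of this sketch)
    for _ in range(n_iter):
        buckets = [[] for _ in center]
        for s in series:
            # Collect every value that DTW aligns with each barycenter position.
            for ci, si in dtw_path(center, s):
                buckets[ci].append(s[si])
        center = [sum(b) / len(b) for b in buckets]
    return center


# Two pixel series with the same growth peak, shifted by one time step.
barycenter = dba([[0, 0, 1, 0], [0, 1, 0, 0]])
```

In this toy case, a point-wise mean would flatten the peak to 0.5 spread over two time steps, whereas the barycenter keeps a single peak of height 1, which is the kind of time-related variability the aggregation is meant to preserve.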
Step 3.5 starts our anomaly detection process by grouping the barycenter time series for each crop type. As in [55,62], we expected most crop labels from the same class to be correct, and aimed to identify the outliers only. As we had a large variability regarding the number of time series for each category (from a few hundred to ten thousand), we chose an autoencoder network instead of a KNN with DTW. We eliminated the dates without input for each time series and used interpolation to fill in missing values.
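The grouping and gap filling can be sketched as below. Linear interpolation between the nearest valid neighbours, with edge gaps copied from the closest valid value, is an assumption made for this example, as the text does not specify the interpolation scheme; the function names are hypothetical.

```python
def group_by_crop(records):
    """Group (crop_code, series) records into {crop_code: [series, ...]}."""
    groups = {}
    for crop_code, series in records:
        groups.setdefault(crop_code, []).append(series)
    return groups


def interpolate_gaps(values):
    """Linearly fill None gaps; gaps at the edges take the nearest valid value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no valid observations")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


groups = group_by_crop([(111, [2.0, None, 6.0]), (112, [1.0, 1.0, 1.0])])
filled = interpolate_gaps([None, 2.0, None, 6.0, None])
```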
Step 3.6 corresponds to the actual model training, as we trained an LSTM Autoencoder for each category. The architecture of our models is described in Table 2 and is based on the LSTM autoencoders from the sequitur library (LSTM autoencoders: https://github.com/shobrook/sequitur, accessed on 4 June 2023). The proposed network follows a classic autoencoder structure composed of encoder and decoder parts. In our case, the encoder and decoder use two LSTM layers, followed by a fully connected layer at the end of the decoder.
Step 3.7 consists of passing the data again through the trained autoencoder model in prediction mode to record the prediction error. The autoencoder tries to reconstruct the input by minimizing the reconstruction loss. We chose the mean squared error (MSE, Equation (2)) loss for our model. We saved the value of the MSE for each sequence and used a threshold to identify the anomalies based on it.
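Recording the per-sequence error amounts to comparing each input sequence with its reconstruction. In the minimal sketch below, the reconstructed values are made up for illustration and do not come from a trained model; the function name is hypothetical.

```python
def reconstruction_mse(original, reconstructed):
    """Mean squared error between a sequence and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("sequences must have the same length")
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return sum(squared) / len(squared)


# One well-reconstructed sequence and one poorly reconstructed sequence.
losses = [
    reconstruction_mse([0.0, 2.0], [0.0, 2.0]),
    reconstruction_mse([0.0, 2.0], [0.0, 4.0]),
]
```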
Step 3.8 identifies the outliers based on the MSE loss values determined in the previous step. We have an array of prediction losses for each crop type label, on which we apply a threshold to separate the regular class from the possible abnormalities. Even though the threshold might be chosen by a visual analysis of the distribution, in our case, we have more than 50 label types for each country. Therefore, we proceeded with a dynamic threshold, as in [62], namely Otsu thresholding. This technique, which was initially developed to convert gray-level photos into black-and-white images, enables the separation of a histogram with two spikes. It looks for the binary threshold that minimizes the weighted average of the intraclass variances of the two groups. We computed the Otsu threshold for each category and eliminated the crop parcels with loss values higher than the corresponding threshold for each class. The Otsu thresholding is defined in Equation (3), where $\omega_0(t)$ and $\omega_1(t)$ are the empirical probabilities that the loss is equal to or below $t$, respectively above it. The variances of the normal and abnormal values are reflected in $\sigma_0^2(t)$ and $\sigma_1^2(t)$.
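The threshold search can be sketched in a few lines. This is an exhaustive variant written for readability, assuming the losses of one crop class are given as a plain list: every observed loss value is tried as a candidate threshold and the one with the smallest weighted intraclass variance is kept. It is not the histogram-based formulation commonly found in image processing libraries, and the function names are hypothetical.

```python
def variance(values):
    """Population variance of a non-empty list."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def otsu_threshold(losses):
    """Return the loss value t minimizing the weighted intraclass variance.

    Returns None when all losses are identical (no split is possible).
    """
    best_t, best_score = None, float("inf")
    for t in sorted(set(losses))[:-1]:  # the largest value cannot split the data
        low = [x for x in losses if x <= t]
        high = [x for x in losses if x > t]
        w0 = len(low) / len(losses)   # share of losses at or below t
        w1 = len(high) / len(losses)  # share of losses above t
        score = w0 * variance(low) + w1 * variance(high)
        if score < best_score:
            best_t, best_score = t, score
    return best_t


# Four regular parcels and two with a clearly higher reconstruction loss.
losses = [0.1, 0.2, 0.15, 0.12, 5.0, 5.5]
threshold = otsu_threshold(losses)
anomalies = [x for x in losses if x > threshold]
```

The exhaustive search is quadratic in the number of parcels per class; for classes with many thousands of parcels, a histogram-based computation would be the more practical choice.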
5.4. Dataset Description
The resulting AgriSen-COG is a multiyear, multicountry dataset for crop type mapping. It includes 2019 and 2020 data covering the following five areas: Austria, Belgium, Catalonia, Denmark, and the Netherlands. We used the corresponding LPIS information for each region, distributed under the Open Data Commons Attribution License. We selected the years 2019 and 2020, summing up to 10.2 M parcels (the detailed distribution of polygons is presented in Table 3). Each original AOI includes a large and varied number of unique labels (Table 3), mapped to the FAO ICC standard, resulting in, at most, 102 common crop classes, including the additional Fallow category. The noncrop pixels are marked with the background label. In Figure 6, we present a sample of the proposed dataset from all five regions, including both years for the same area. The figure highlights the spatial variability and temporal changes included in the AgriSen-COG dataset. Our GitHub repository provides a more thorough explanation, examples of data loading functions, and graphic demonstrations. Additionally, code samples are offered to help users reproduce the logic presented in the creation methodology, regarding both LPIS processing and data rasterization.
The proposed dataset contains two subsets created to match the two approaches in crop type mapping: pixel-level patch subset (for temporal semantic segmentation) and parcel-level aggregated subset (for time-series classification). Both subsets contain data from all five regions, covering two years (2019, 2020). Therefore, we enable further research focused on a single area or studying how models handle different geographical characteristics.
We followed the methodology described earlier in creating AgriSen-COG. It relies on a total of 62 Sentinel-2 tiles. We selected the tiles that intersected with the LPIS bounds of each region and retrieved the Sentinel-2 Level-2A tiles with less than 30% cloud coverage. Next, we rasterized the LPIS polygons, following the bounding boxes of each tile, and discarded the parcels with an area of less than 0.1 ha. After rasterization, we applied our anomaly detection algorithm and identified possibly anomalous fields. We then eliminated the corresponding polygons by labeling them as the background class. Ultimately, we divided each tile into patches so that our data fit the hardware restrictions.
The proposed dataset includes the 10 m resolution bands only (red, green, blue, and near-infrared), as they provide most of the vegetation-related information and do not require further upsampling. We chose a size of 366 × 366 for each patch, an integer divisor of the tile size for the 10 m resolution bands. This patch size also makes AgriSen-COG compatible with other datasets, such as Sen4AgriNet. Each tile yields a total of 900 patches. However, we discarded the patches that did not include any crop-related polygon, leaving 41,100 patches in AgriSen-COG. Table 4 presents a detailed description of the polygons and patches eliminated during each stage.
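The patch arithmetic can be checked with a short sketch. It assumes the standard Sentinel-2 tile size of 10,980 × 10,980 pixels for the 10 m bands, which is consistent with the 900 patches per tile reported above; the function names and the background value of 0 are hypothetical.

```python
TILE_SIZE = 10980  # pixels per side for the 10 m bands of a Sentinel-2 tile
PATCH_SIZE = 366


def patch_windows(tile_size=TILE_SIZE, patch_size=PATCH_SIZE):
    """Return (row_offset, col_offset, height, width) for every patch of a tile."""
    if tile_size % patch_size:
        raise ValueError("patch size must divide the tile size evenly")
    n = tile_size // patch_size
    return [
        (r * patch_size, c * patch_size, patch_size, patch_size)
        for r in range(n)
        for c in range(n)
    ]


def has_crops(patch, background=0):
    """True if the ground-truth patch contains at least one non-background pixel."""
    return any(value != background for row in patch for value in row)


windows = patch_windows()  # 30 x 30 windows per tile
```

A patch would be kept only if `has_crops` is true for its ground-truth window, which mirrors the removal of patches without any crop-related polygon.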
The patches are saved as Zarr arrays, a self-describing format compatible with Xarray. The selected format is also compatible with distributed processing frameworks (such as Dask) and is the preferred format for cloud-stored data. The ground truth (LPIS masks) data files are stored using the COG format and are available on a public S3 bucket. This way, we enrich the existing COG databases and make the proposed data easily accessible (no download needed) and findable (indexed in STAC catalogues). The aggregated data for each field (the barycenters) constitute the proposed time-series dataset, and are distributed in Parquet format to ensure a smaller size and to allow distributed processing if needed.
The five AOIs cover around 62 Sentinel-2 tiles. For 2019–2020, the patch dataset sums up to 6,972,485 fields, distributed over 41,100 patches.