*4.1. Standardization and Unification*

The use of food composition data from different countries requires a high level of harmonization of both food values and the nutrients that are included [48]. Data processing requires precise nomenclature and standardized methods, such as ontologies or tags that allow correct classification and description [86]. In the TEDDY study, nutrients were compared between four countries: according to Uusitalo et al. [67], harmonizing the datasets before calculations generally made the results comparable, as systematic and random errors were minimized. This approach had previously been used for ten European countries in the EPIC study, with similar results [51,66].

Given the large number of available food items, the implementation of artificial intelligence and computational approaches is recommended [87]. Many automatic and semi-automatic tools are now extensively used to classify FCDBs [9,41,87,88]. A clear example is the ASA24 system, which uses automated methods for several databases [16,87]. Another example is StandFood, a semi-automated system that achieved an overall accuracy of 79% [41]. New techniques of natural language processing [88], machine learning, and statistical models, such as Monte Carlo simulations [12] or extraction of 'big data' [20], make the process faster than manual work [16]. However, because the task is complex, a manual *post hoc* review is always required [82,87]. After a first approach using different methodologies, we decided to carry out the harmonization and standardization of the S4H FCDB manually and semi-automatically. Although human errors are still possible, this work provides higher accuracy than automated predictions when comparing the same foods from different FCDBs [41].

The first step was to harmonize the classification of foods. Durazzo et al. [37] classified foods based on different criteria. One of the classifications used is FoodEx2, implemented by EFSA [40]. We selected this classification because of its hierarchical nature and its widespread use. All foods were harmonized and linked between the different FCDBs. This classification makes it possible to match foods, although full comparability is not guaranteed [2,59]. Secondly, since all nutrients and compounds have to be made comparable, they were defined in the same way, according to measurement units [51,67]. The tagnames proposed by INFOODS [43], which indicate the name of the component, the units and the analytical method [60], have been implemented in different FCDBs around the world [45]. INFOODS tagnames allowed us to normalize variables from all databases to reference units (such as μg or mg) more quickly. They also ensured that, when two nutrients were unified, both were expressed in the same way and were comparable. We modified 8.4% of the tagname units to obtain a more functional FCDB. Most of the individual phenolic compounds did not have a tagname, so new tagnames were created to facilitate their integration into the S4H FCDB. After standardizing both foods and nutrients, we were able to unify those foods that were categorized as identical. This allows the imputation of missing data and foods in the national FCDBs.
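The unit normalization described above can be illustrated with a minimal sketch. The tagnames PROCNT, VITC and FE are INFOODS tagnames, but the reference unit assigned to each, the conversion table and the function names `convert` and `normalize` are assumptions made for illustration and do not reproduce the S4H FCDB configuration.

```python
# Illustrative sketch only: the reference units and the conversion table are
# assumptions, not the S4H FCDB configuration.
MICROGRAMS_PER_UNIT = {"g": 1_000_000.0, "mg": 1_000.0, "ug": 1.0}

# Hypothetical reference unit for each tagname.
REFERENCE_UNIT = {"PROCNT": "g", "VITC": "mg", "FE": "mg"}


def convert(value, from_unit, to_unit):
    """Convert a mass between g, mg and ug (micrograms)."""
    return value * MICROGRAMS_PER_UNIT[from_unit] / MICROGRAMS_PER_UNIT[to_unit]


def normalize(record):
    """Return a copy of `record` expressed in its tagname's reference unit."""
    target = REFERENCE_UNIT[record["tagname"]]
    return {
        "tagname": record["tagname"],
        "value": convert(record["value"], record["unit"], target),
        "unit": target,
    }


# The same nutrient reported in different units by two hypothetical databases.
from_db_a = normalize({"tagname": "VITC", "value": 0.05, "unit": "g"})
from_db_b = normalize({"tagname": "VITC", "value": 50_000.0, "unit": "ug"})
```

Once both records are expressed in the same reference unit, their values can be compared or unified directly.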

Several studies claim that, for research purposes in nutritional epidemiology, it is better to approximate nutrient values than to leave them as missing [51]. Not imputing data could lead to systematic underestimations of nutrient intake [18]. Although some authors and institutions recognize this as a reliable method [33], others are critical, arguing that food composition changes considerably from one country to another [2]. The S4H FCDB imputed values from a weighted estimate of several FCDBs, which yields high-quality values and takes into account the biodiversity of foods, thereby improving the estimations [14]. The imputation of missing values is a frequent procedure when using FCDBs with recognized data quality [88]; typically, the data come from FCDBs from the United States, Europe, or other countries in the same region [5,12]. An example is the FCDBs from countries in sub-Saharan Africa, which import up to 88% of data about animal-source food [22]. Another example is the Middle East FCDBs, which imputed food data from the United Kingdom FCDB [81]. The S4H FCDB uses an ad hoc approach to standardize the FCDB, as was done in the EPIC study [66]. This approach will make it possible to add foods or to replace the missing value of a compound with comparable estimated and weighted values from other FCDBs [51,66].
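As a rough illustration of imputation from a weighted estimate, the sketch below fills the missing values of one food with a weighted mean of the values reported by donor FCDBs and flags each filled value. The weighting scheme, the donor names and the function `impute_missing` are hypothetical; the text does not specify how the S4H FCDB weights were defined.

```python
def impute_missing(national, donors, weights):
    """Fill None values in `national` with a weighted mean of donor values.

    national: {tagname: value or None} for one food in a national FCDB.
    donors:   {fcdb_name: {tagname: value or None}} for the matched food.
    weights:  {fcdb_name: weight}; hypothetical, chosen by the analyst.
    """
    filled = {}
    for tag, value in national.items():
        if value is not None:
            filled[tag] = {"value": value, "imputed": False}
            continue
        # Keep only donors that actually report this component.
        pairs = [
            (donors[name][tag], weight)
            for name, weight in weights.items()
            if donors.get(name, {}).get(tag) is not None
        ]
        if not pairs:
            # No donor value available: the value stays missing, never 0.
            filled[tag] = {"value": None, "imputed": False}
            continue
        total_weight = sum(weight for _, weight in pairs)
        estimate = sum(v * weight for v, weight in pairs) / total_weight
        filled[tag] = {"value": estimate, "imputed": True}
    return filled


# Hypothetical example: iron (FE) is missing in the national FCDB.
national = {"FE": None, "CA": 120.0}
donors = {"fcdb_a": {"FE": 2.0}, "fcdb_b": {"FE": 4.0}, "fcdb_c": {"FE": None}}
weights = {"fcdb_a": 1.0, "fcdb_b": 3.0, "fcdb_c": 2.0}
result = impute_missing(national, donors, weights)
```

Keeping an `imputed` flag next to each value preserves traceability, so that borrowed values can be distinguished from those originally reported.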

During the first unification tests between foods, large standard deviations were identified for some macronutrients and micronutrients, largely coming from beverages and spices. The reason was that most of the 0 values for a compound or nutrient were not of the 'logical zero' type. Authors such as Pérez Grana or Westenbrink et al. [1,25] recommend that missing values should never be replaced by 0, and even that 'logical zero' values be modified so as not to affect the estimations. Before unifying the values, all the 0 values were removed. Unifying the remaining values then homogenized most of the data, thus improving the results. The loss of the 'logical zero' values does not affect the calculations, since they should remain at 0 and can be incorporated later.
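The effect of removing 0 values before unification can be shown with a small sketch. The data are invented for illustration and the helper `drop_zeros` is hypothetical; it simply discards missing entries and every 0, without trying to distinguish logical zeros.

```python
from statistics import stdev


def drop_zeros(values):
    """Remove missing entries and every 0, whether or not it is a logical zero."""
    return [v for v in values if v is not None and v != 0]


# Hypothetical values (g/100 g) for one nutrient of one food in five FCDBs.
# The two zeros stand for values that were recorded as 0 instead of missing.
raw = [0.0, 0.0, 3.1, 2.9, 3.3]
cleaned = drop_zeros(raw)

spread_before = stdev(raw)      # inflated by the spurious zeros
spread_after = stdev(cleaned)   # spread of the values actually measured
```

In this toy example the standard deviation falls sharply once the spurious zeros are dropped, which mirrors the behaviour observed during the first unification tests.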

Although both the mean and the median were calculated, the median was chosen as the reference value after unification. Some authors choose the mean [9,25], but the median is, in some cases, a better measure of central tendency [88], especially when national FCDBs contain extreme values. This helps to homogenize the data and prevents wrong estimations. The unification allowed the inclusion of many foods and compounds. Figure 3 shows how the unification homogenizes the values. Once the values were unified and cleaned, as recommended by FAO or EuroFIR [2,33], estimated energy was recalculated using the Atwater coefficients [62].
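The two steps above, unification by the median and recalculation of energy, can be sketched as follows. The general Atwater factors (4, 9, 4 and 7 kcal/g for protein, fat, available carbohydrate and alcohol) are used here as an assumption, since the exact factor set applied in the S4H FCDB is not stated; the data and the functions `unify` and `energy_kcal` are likewise illustrative.

```python
from statistics import median

# General Atwater factors in kcal per gram; which factor set the S4H FCDB
# used is not stated in the text, so these are an assumption.
ATWATER_KCAL_PER_G = {"PROCNT": 4.0, "FAT": 9.0, "CHOAVL": 4.0, "ALC": 7.0}


def unify(values_by_tagname):
    """Take the median of the non-missing values reported for each component."""
    return {
        tag: median([v for v in values if v is not None])
        for tag, values in values_by_tagname.items()
    }


def energy_kcal(nutrients_g):
    """Estimate energy (kcal/100 g) from energy-yielding nutrients (g/100 g)."""
    return sum(
        factor * nutrients_g.get(tag, 0.0)
        for tag, factor in ATWATER_KCAL_PER_G.items()
    )


# Hypothetical values for one food in three FCDBs; 9.0 is an outlier for FAT.
reported = {
    "PROCNT": [3.0, 3.4, 3.2],
    "FAT": [1.0, 1.5, 9.0],
    "CHOAVL": [5.0, 4.8, 4.6],
}
unified = unify(reported)
energy = energy_kcal(unified)
```

With these invented data the median keeps fat at 1.5 g/100 g, whereas the mean would be pulled towards the outlier, and the energy value is then recalculated from the unified nutrients rather than copied from any single source.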

Organizations such as FAO work with spreadsheets due to their simplicity, wide availability and familiarity to users [44,89]. Our work started out using spreadsheets, but the amount of data quickly became difficult to handle [89]. We therefore moved to MySQL, whose interrelated tables allowed us to store and retrieve data [45,90]. This ensured traceability and quality controls, and also facilitated linking the S4H FCDB with the recipe tables for subsequent calculations.
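A minimal sketch of how interrelated tables support traceability is given below. Python's built-in `sqlite3` module stands in for MySQL so that the example runs without a server, and the table names, column names and sample rows are hypothetical rather than the actual S4H FCDB schema.

```python
import sqlite3

# sqlite3 stands in for MySQL here; the schema is a hypothetical illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE food (
        food_id INTEGER PRIMARY KEY,
        classification_code TEXT,   -- placeholder for a FoodEx2 code
        name TEXT
    );
    CREATE TABLE component (tagname TEXT PRIMARY KEY, unit TEXT);
    CREATE TABLE composition (
        food_id INTEGER REFERENCES food(food_id),
        tagname TEXT REFERENCES component(tagname),
        value REAL,
        source_fcdb TEXT,           -- keeps every value traceable to its origin
        PRIMARY KEY (food_id, tagname, source_fcdb)
    );
    """
)
conn.execute("INSERT INTO food VALUES (1, 'CODE-1', 'Example food')")
conn.execute("INSERT INTO component VALUES ('VITC', 'mg')")
conn.executemany(
    "INSERT INTO composition VALUES (?, ?, ?, ?)",
    [(1, "VITC", 48.0, "fcdb_a"), (1, "VITC", 52.0, "fcdb_b")],
)

# Retrieve every reported value for one food together with its unit and source.
rows = conn.execute(
    """
    SELECT f.name, c.tagname, c.value, k.unit, c.source_fcdb
    FROM composition AS c
    JOIN food AS f ON f.food_id = c.food_id
    JOIN component AS k ON k.tagname = c.tagname
    WHERE f.food_id = ?
    ORDER BY c.source_fcdb
    """,
    (1,),
).fetchall()
```

Storing the source database alongside each value is one way to keep the unified figures auditable; a recipe table could reference `food_id` in the same manner.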
