*3.2. Methodology*

The methodology followed in this paper consisted of six phases (Figure 5). The preparatory stage (stage zero), described in the previous subsection, concluded with the creation of the database. The remaining five phases covered modelling and testing. The first data pre-processing step was to assess the importance of the input variables, in order to better understand their behaviour and to obtain additional information on their usefulness in the final model. This was done using Multivariate Adaptive Regression Splines (MARS, Step 1). The next phase was to determine the first-year corrosion loss of galvanised steel. For this, Self-Organising Maps (SOM) with several layers (supersom) were used, combining supervised and unsupervised learning. The next two steps used the results of the different layers of this algorithm. The first layer was the result of an unsupervised SOM, organised according to the relationships between the 7 main variables. Zinc corrosion loss during the first year of exposure (Corr\_Zn, in μm) was the output variable to be predicted (Step 2).

**Figure 5.** Flow chart of the methodology followed in this paper, showing the six proposed phases.

The advantage of SOM is that, in addition to assigning an individual value, it provides an uncertainty range, given by the minimum and maximum values within each neuron. Moreover, besides self-organising according to the input variables, supersom networks are intended to group the data according to the various corrosivity categories. The second of the two output layers is therefore a supervised layer that assigns to each node the corresponding 'corrosivity category' defined by the standard (Step 3). Furthermore, corrosivity is not constant with respect to exposure time: in most cases, it decreases with increasing exposure owing to the accumulation of corrosion products on the surface. Step 4 consists of optimising the formula that allows these results to be extrapolated to the long term. Using Newton's method, a nonlinear regression of the formula used by ISO 9224 (Equation (1)) was performed to optimise the value of the variable *b*.
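As an illustration of Step 4, the following minimal sketch fits the exponent *b* by Newton's method applied to the least-squares objective. It assumes that Equation (1) has the ISO 9224 power-law form D = r_corr · t^b, with D the accumulated corrosion loss, r_corr the first-year loss and t the exposure time in years. The function name, the starting value and the data are hypothetical: the data are synthetic and noise-free, generated with b = 0.8 purely to check that the iteration recovers it.

```python
import math

def fit_time_exponent(times, losses, r_corr, b0=1.0, tol=1e-12, max_iter=50):
    """Fit b in D = r_corr * t**b by Newton's method on S(b) = sum of squared residuals."""
    b = b0
    for _ in range(max_iter):
        grad = 0.0  # (1/2) dS/db
        hess = 0.0  # (1/2) d2S/db2
        for t, d in zip(times, losses):
            log_t = math.log(t)
            pred = r_corr * t ** b
            resid = d - pred
            grad += -resid * pred * log_t
            hess += (pred * log_t) ** 2 - resid * pred * log_t ** 2
        step = grad / hess
        b -= step          # Newton update: b <- b - S'(b) / S''(b)
        if abs(step) < tol:
            break
    return b

# Synthetic data (assumption): first-year loss of 2.0 and a true exponent of 0.8.
r_corr = 2.0
times = [1.0, 2.0, 5.0, 10.0, 20.0]
losses = [r_corr * t ** 0.8 for t in times]

b_hat = fit_time_exponent(times, losses, r_corr)
print(round(b_hat, 6))
```

With real, noisy measurements the second derivative can become non-positive far from the optimum, so a sensible starting value (or a damped step) would be needed; the sketch does not handle that case.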

Finally, to test the quality of the predictions, a model based on Euclidean distances was used (Step 5). This model takes the input variables and searches the database for the most similar cases, returning their corrosion values and their degree of similarity (quality). In this fifth phase, the results obtained were thus compared with existing real cases to measure the quality of the predictions. Although the supersom and distance models start from the same database and have the same inputs, their purposes differ: the supersom model gives a corrosion prediction and a corrosivity category, whereas the distance model establishes the quality of that prediction.
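The idea behind the distance model can be sketched as a nearest-case lookup. The example below is an assumption-laden illustration, not the implementation used in this work: the three input columns, the data values, the standardisation of the variables and the similarity score 1 / (1 + distance) are all hypothetical choices made for the sake of a runnable example.

```python
import math

def standardise(rows):
    """Scale every column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds

def most_similar(query, inputs, corrosion, k=2):
    """Return the k closest database cases as (distance, corrosion, similarity)."""
    scaled, means, stds = standardise(inputs)
    q = [(v - m) / s for v, m, s in zip(query, means, stds)]
    ranked = []
    for row, corr in zip(scaled, corrosion):
        dist = math.dist(q, row)                   # Euclidean distance
        ranked.append((dist, corr, 1.0 / (1.0 + dist)))  # hypothetical similarity score
    ranked.sort(key=lambda item: item[0])
    return ranked[:k]

# Hypothetical database: [temperature, relative humidity, SO2 deposition] per case,
# with the measured first-year corrosion loss of each case.
inputs = [[10.0, 70.0, 5.0], [25.0, 85.0, 40.0], [12.0, 72.0, 6.0], [18.0, 60.0, 20.0]]
corrosion = [0.8, 3.1, 0.9, 1.7]

matches = most_similar([12.0, 72.0, 6.0], inputs, corrosion, k=2)
for dist, corr, sim in matches:
    print(f"distance={dist:.3f}  corrosion={corr}  similarity={sim:.3f}")
```

A query that coincides with a stored case has distance zero and therefore the maximum similarity; a query far from every stored case returns low similarities, signalling that the corresponding prediction is less well supported by the database.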

Techniques

• Multivariate Adaptive Regression Splines (MARS)

MARS is one of the most widely used algorithms for solving adaptive computing problems [56]. The method approximates an unknown function by a linear combination of a set of basis functions (products of the model variables) [57]. A key feature of the algorithm is that it autonomously selects the relevant variables, and the interactions between them, for each subregion. The dimensionality reduction of the problem is thus performed directly by the model, with the advantage of being carried out locally. This property can be exploited to analyse the relevance of the variables that may subsequently take part in the model.
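To make the form of such a model concrete, the sketch below evaluates a small MARS-style expansion: an intercept plus a weighted sum of basis functions, each basis function being a product of hinge functions max(0, ±(x − knot)), as in the standard formulation of MARS. The coefficients, knots and variables are hypothetical and are not taken from the model fitted in this work. The third input variable does not appear in any term, which mimics how a variable that the algorithm does not select has no influence on the output.

```python
def hinge(x, knot, sign=1):
    """Hinge function: max(0, x - knot) for sign=+1, max(0, knot - x) for sign=-1."""
    return max(0.0, sign * (x - knot))

def mars_predict(x, intercept, terms):
    """Evaluate intercept + sum(coef * product of hinges).

    terms is a list of (coef, [(variable_index, knot, sign), ...]).
    """
    total = intercept
    for coef, factors in terms:
        basis = 1.0
        for var, knot, sign in factors:
            basis *= hinge(x[var], knot, sign)
        total += coef * basis
    return total

# Hypothetical fitted model with two main effects and one interaction term.
terms = [
    (0.5, [(0, 10.0, 1)]),                # variable 0 above its knot at 10
    (-0.2, [(1, 3.0, -1)]),               # variable 1 below its knot at 3
    (0.1, [(0, 10.0, 1), (1, 3.0, 1)]),   # interaction between variables 0 and 1
]

y = mars_predict([12.0, 5.0, 99.0], 1.0, terms)
y_other = mars_predict([12.0, 5.0, -7.0], 1.0, terms)  # only the unused variable changes
print(y, y_other)
```

In an actual MARS fit, the terms, knots and coefficients are chosen by the algorithm's forward and backward passes; counting how often each variable enters the retained terms is one way of reading off its relevance.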
