*4.2. Performance Metrics*

We quantitatively evaluate the classification performance with label-based and instance-based metrics. Label-based metrics evaluate each label separately and return the average (micro or macro) value across all appliances. In contrast, instance-based metrics evaluate the predicted label set of each instance as a whole. To this end, two metrics are used: example-based *F*1 (*F*1-eb) and macro-averaged *F*1 (*F*1-macro). Example-based *F*1 (*F*1-eb) is an instance-based metric that measures the ratio of correctly predicted labels to the sum of the total true and predicted labels:

$$F\_1\text{-eb} = \frac{\sum\_{i=1}^{M} 2 \cdot t\_{pi}}{\sum\_{i=1}^{M} |y\_i| + \sum\_{i=1}^{M} |\hat{y}\_i|} \tag{9}$$
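
As a minimal sketch of Eq. (9), assuming the true and predicted labels are stored as binary indicator matrices of shape (instances × appliances) (the names `Y` and `Y_hat` and this layout are illustrative assumptions, not from the original):

```python
import numpy as np

def f1_example_based(Y, Y_hat):
    """Eq. (9): pooled ratio of 2*TP to total true plus predicted labels.

    Y, Y_hat: binary indicator arrays of shape (n_instances, n_labels).
    (Names and shapes are illustrative assumptions.)
    """
    tp = np.logical_and(Y == 1, Y_hat == 1).sum()  # sum_i t_pi
    denom = Y.sum() + Y_hat.sum()                  # sum_i |y_i| + sum_i |y_hat_i|
    return 2.0 * tp / denom if denom > 0 else 0.0

# Toy example: 2 instances, 3 appliances
Y     = np.array([[1, 0, 1], [0, 1, 0]])   # true appliance states
Y_hat = np.array([[1, 0, 0], [0, 1, 1]])   # predicted states
print(f1_example_based(Y, Y_hat))          # 2*2 / (3 + 3) ≈ 0.667
```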

The *F*1-macro is derived from the *F*1 score; it averages the label-based *F*1 score over all labels and is defined as:

$$F\_1\text{-macro} = \frac{1}{M} \sum\_{i=1}^{M} \frac{2 \cdot t\_{pi}}{2 \cdot t\_{pi} + f\_{pi} + f\_{ni}} \tag{10}$$

where *tp*, *fp*, and *fn* denote the numbers of true positives, false positives, and false negatives, computed per label. A high *F*1-macro usually indicates high performance on less frequent labels [45].
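
As a companion sketch of Eq. (10), under the same assumed indicator-matrix representation, with the per-label counts computed column-wise:

```python
import numpy as np

def f1_macro(Y, Y_hat):
    """Eq. (10): per-label F1 averaged over all M labels.

    Y, Y_hat: binary indicator arrays of shape (n_instances, n_labels).
    (Names and shapes are illustrative assumptions.)
    """
    Y, Y_hat = Y.astype(bool), Y_hat.astype(bool)
    tp = (Y & Y_hat).sum(axis=0)    # true positives per label
    fp = (~Y & Y_hat).sum(axis=0)   # false positives per label
    fn = (Y & ~Y_hat).sum(axis=0)   # false negatives per label
    denom = 2 * tp + fp + fn
    # Per-label F1; labels with an empty denominator contribute 0.
    per_label = np.divide(2 * tp, denom,
                          out=np.zeros_like(denom, dtype=float),
                          where=denom > 0)
    return per_label.mean()
```

For cross-checking, `sklearn.metrics.f1_score(Y, Y_hat, average='macro')` computes the same macro average on indicator matrices; note that its `average='samples'` option instead averages per-instance *F*1 scores, which differs from the pooled form of Eq. (9).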
