*Proceeding Paper* **Quality of Labeled Data in Machine Learning: Common Sense and the Controversial Effect for User Behavior Models †**

**Maxim Bakaev \* and Vladimir Khvorostov**

Faculty of Automation and Computer Engineering, Novosibirsk State Technical University, Pr. K. Marksa 20, 630073 Novosibirsk, Russia; xvorostov@corp.nstu.ru

**\*** Correspondence: bakaev@corp.nstu.ru

† Presented at the 15th International Conference "Intelligent Systems" (INTELS'22), Moscow, Russia, 14–16 December 2022.

**Abstract:** Intelligent systems today are increasingly required to predict or imitate human perception and behavior. For this purpose, feature-based Machine Learning (ML) models are still common, since collecting appropriate training data from human subjects for the data-hungry Deep Learning models is costly. Considerable effort is put into ensuring data quality, particularly in crowd-annotation platforms (e.g., Amazon MTurk), where fees of top workers can be several times higher than the median. The common belief is that higher-quality input data benefit the end quality of ML models, though quantitative estimates of the effect are rare. In our study, we investigate how labeled data quality affects the accuracy of models that predict users' subjective impressions, per the scales of Complexity, Aesthetics and Orderliness assessed by 70 subjects. The material, about 500 web page screenshots, was also labeled by 11 workers of varying diligence, whose work quality was validated by another 20 verifiers. Unexpectedly, we found significant *negative* correlations between the workers' precision and the *R*<sup>2</sup>s of the models for two out of the three scales (*r*<sub>11</sub> = −0.768 for Aesthetics, *r*<sub>11</sub> = −0.644 for Orderliness). We speculate that the controversial effect might be explained by a bias in the indiligent labelers' output that corresponds to subjectivity in human perception of visual objects.

**Keywords:** web interfaces; intelligent systems; machine learning; image recognition

### **1. Introduction**

One of the implicit assumptions in Machine Learning (ML) is that the data that get through the preliminary screenings and tweaks to the model training stage are appropriate. As for ML models that seek to predict or simulate human behavior, such as user behavior models (UBMs) in the field of Human–Computer Interaction (HCI), the situation is more complicated. The actual interaction-related data, which are generally the input of the predictive UBMs [1], arguably cannot be "bad", as long as they reflect the human "imperfection". However, there are also increasingly important subjective dimensions, from the perceptual "how pleasant is our website design" in HCI to "how likely is it that you would recommend our service to a friend" in marketing. By definition, the subjective impressions are usually provided directly by human subjects, although indirect methods do exist, e.g., facial emotion recognition. Correspondingly, Deep Learning is slow to take off in this field, and an ample share of the models are feature based and rely on labeled data and subjective assessments.

There is a general consensus that inaccurately annotated data are a hindrance and that the labeled data quality does not come for free. In micro-task platforms, such as Amazon Mechanical Turk (MTurk), filtering of crowdworkers can be carried out by a reputation that is principally based on the Approval Rate supplied by task requesters [2]. The fees charged by higher-paid workers are about four times above the *median* ones in MTurk [3], even though it has been shown that even top workers can be indiligent [4].

**Citation:** Bakaev, M.; Khvorostov, V. Quality of Labeled Data in Machine Learning: Common Sense and the Controversial Effect for User Behavior Models. *Eng. Proc.* **2023**, *33*, 3. https://doi.org/10.3390/engproc2023033003

Academic Editors: Askhat Diveev, Ivan Zelinka, Arutun Avetisyan and Alexander Ilin

Published: 9 May 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Reputation might have seemed an easy solution to crowd-labeled data quality a decade ago [2], but the arsenal of methods and tools has been rapidly expanding since then [5], as we subsequently outline in Section 2.1. The currently mainstream data quality control methods are *majority/group consensus* and *ground truth*, which necessarily imply redundancy (several workers performing the same task), wasting up to 33% of the output.

Even if data labeling work is carried out by volunteers and is technically free, their limited effort should be used efficiently too. Although volunteers generally have higher motivation than crowdworkers, redundancy might still be necessary to reach the certainty thresholds [6]. Setting the latter is actually a major problem for a requester, which we believe is not adequately covered in existing studies. As with software debugging, more is always better, and there is no hard threshold for improving the quality of the data, only the one advised by practicability. Many developments to improve input data for UBMs, e.g., the enhanced version of the robust *Aalto Interface Metrics* (https://github.com/aalto-ui/aim, accessed on 1 June 2022) [7], are underway with the best intentions. Unfortunately, estimating the concrete "return on investment" in data quality remains problematic, as quantitative studies of its end effect in ML are scarce.

In our paper, we explore the relation between the completeness and precision of the input data produced by 11 human labelers and the quality of the ensuing 33 user behavior models built for 487 web page screenshots assessed by another 70 participants. Rather unexpectedly, we find that the significant correlation between the labelers' precision and the quality of the models constructed for the subjective scales of aesthetics and orderliness is **negative**. We attribute this counterintuitive result to a bias in indiligent labelers that brings their output closer to some subjective dimensions of human visual perception. We did not find any significant correlations for the labeling completeness, even for complexity, which is known to be affected by the number of visual elements. Our results question the applicability of the traditional data quality measures to human-related data, although further research is necessary.

The outcome was preliminarily reported and discussed at the *2021 Fall Conference of Ergonomic Society of Korea (ESK)*. In the current paper, we present the extended version of our results, referencing some of our previous related publications, such as [8,9]. In Section 2, we briefly review the research relevant to human behavior data quality in ML and describe our experiment. In Section 3, we construct the models and analyze the effects of the input data on their quality. In the final section, we discuss the findings and their possible causes, and outline directions for further research.

### **2. Method and Related Work**

### *2.1. Data Quality Control in ML*

As noted by philosophers long ago, the concept of *good* is very subjective. In relation to ML, it was recently demonstrated that the understanding of "good data" varies considerably for different stakeholders [10]. The concept of *quality*, though more objective and operational, is domain specific [11] and multi-dimensional [12]. With respect to data, it commonly involves the aspects of completeness, consistency, lack of duplicates, accuracy, timeliness for the purpose, and so on—some researchers identify as many as 20 dimensions. Since ML is predominantly concerned with *precision* and *recall* of the models, it associates data quality for the most part with *completeness* and *accuracy*.

The importance of these two dimensions in data quality was well recognized even before the ML era, and the related methodologies were classified as the ones helping "selection, customization, and application of data quality assessment and improvement techniques" [13]. Currently, data quality control incorporates techniques for data collection planning, cleaning, profiling, evaluation, monitoring, etc. About a decade ago, a strong focus on crowd data was established in the field, due to the rapid advancement of crowdworking platforms such as Amazon MTurk (2005), microworkers.com (2009), Yandex.Toloka (2014), etc. A whole family of related meta-tools dedicated to data quality control emerged, such as CDAS (2012), Crowd Truth (2014), iCrowd (2015), DOCS for AMT (2015), and others [9]. A comprehensive review of quality control in crowdsourcing can be found in [5], where the methods are organized into three major groups: individual, group and computation based. The first two generally imply involvement of humans in the assessment of the annotators or of the tasks' output.

It should be noted though that there has been a certain decrease in research enthusiasm towards crowd data since then, as the involved disadvantages were acknowledged [4]. ML and Intelligent Systems came to rely more on unstructured and uncontrolled data sources [14], such as Big Data [11,15] and data scraped from the web [16]. A recent related publication carefully catalogs the software tools for data quality measurement and monitoring, listing a whopping 667 of them [17]. All in all, the quantitative engineering of data quality is better developed in the fields where data generation is easier to control. A recent example of such a field is IoT (see review in [18]), while the most established one is industrial data, where datasets are well structured and plentiful. Researchers in the field of industrial data quality already formulate it as a *dataset selection problem* and propose, e.g., the criteria of *estimated relative return improvement* and *estimated action stochasticity* [19]. However, those working with human-related data more often than not do not have the luxury of choosing between several datasets relevant to their specific problem.

### *2.2. Human Factor in Data Quality*

The comparative rarity of reusing human behavior-related data outside of reproducibility and meta-analysis studies (e.g., [20] using the dataset from [8]) is partially due to their high value. The latter mainly comes from the costly human time needed to generate or label the data, but their potential economic value may be involved too: think of social network users' behavior data. Another reason that decreases the chance of finding appropriate data for a specific problem is that human data are also task- and context-dependent, and there is never a perfect match of factors and conditions. Moreover, their quality is arguably less formalizable on the scale from "good" to "bad", and the emerging concept of "fitness-for-use" [21] might prove to be more appropriate than "quality".

At the dawn of the AI/ML era, the human factor in data was rather considered a nuisance (cf. user needs in the era of mainframes). For instance, the *10 reasons for bad data quality* comprehensively listed by Lee et al. in 2006 include "subjective judgments during data generation" [22]. Lately though, there is more recognition that human-related data are special, and specific quality dimensions are introduced, such as ethical ones [23]. The latter are arguably a response to the recently highlighted "inappropriate" behavior of trained AIs, which started to demonstrate "racist", "sexist", or "offensive" behavior [24], just in accordance with the patterns they found in human-generated training data.

Still, the need for ML methods that describe human behavior is widely recognized, and UBMs that both incorporate domain knowledge and are trained on practical data are a popular implementation. The models' outputs are certain key performance characteristics; examples in the HCI field are success rate, time to complete a task, dimensions of subjective satisfaction, etc. The corresponding input data generally need to specify the characteristics of the target users and the parameters of a candidate UI [1]. The techniques for parameterizing the UI can rely on manual labeling, on automated design mining algorithms (see [25]), or on their combination [9]. Indeed, the algorithms for calculating the features are already numerous, and the developers put considerable effort into improving them [7]. However, the data quality studies in the field rather focus on adherence to "best practices" [26] and the reasons leading to "bad" models [27]. Quantitative studies of the data effect are rare, if they exist at all.

So, we undertook the following experimental study to relate the measured dimensions of the input data quality and the quality parameters for some simple UBMs.

### *2.3. The Experiment Description*

#### 2.3.1. Material

The material in our experiment was screenshots of website homepages belonging to universities and colleges from all over the world (but only their English versions). First, we automatically collected 10,639 screenshots in PNG format using a dedicated Python script crawling through various catalogs, DBPedia, etc. Then, we manually selected 497 of them for the experiment (see [8] for more detail), hereafter referred to as the UIs. To ensure better diversity of UI elements, the screenshots were made for full web pages, not just of the part above the fold or of a fixed size.

#### 2.3.2. Procedure

#### UI Assessment

In a dedicated online survey (see details in [8]), the participants provided the subjective assessments of their impressions for each UI, per the three visual perception scales that we employed:


Complexity and aesthetics were selected as arguably the most popular dimensions in studies of subjective visual perception [20]. Orderliness was added mostly for the purpose of validity control of the assessments, as most studies in HCI agree on the positive correlation of UI regularity with aesthetics and the negative one with complexity. For each of the three scales, Likert ratings were used (1 = lowest, 7 = highest). The participants were instructed to provide their honest subjective assessments and were told that there are no right or wrong answers. The screenshots were randomly assigned to each participant successively, and the completeness of the assessment for all three scales per UI was mandatory and controlled by the survey software.

#### UI Labeling

The labelers used LabelImg (Version 1.8.1, from https://github.com/tzutalin/labelImg, accessed on 1 June 2022), a third-party dedicated software tool that saves the output as XML files in PASCAL VOC format. They were asked to draw bounding rectangles around UI elements in the screenshots, as precisely as possible, and to choose one of the 20 pre-defined classes for the element: *image, background image, text, textinput, link, button, etc.* (see the complete list in [9]). The participants were provided with written instructions on UI labeling and on the technical usage of LabelImg and were asked to process as many UI elements in each UI as possible. The screenshots were distributed among them nearly evenly, but no random assignment was performed.
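To make the labeling output more concrete, the following minimal Python sketch reads one annotation in the PASCAL VOC XML layout and derives simple per-class counts and area shares. The sample file content, the function names `parse_voc` and `area_share`, and the two derived quantities are illustrative assumptions; they are not the script or the factors used in the study.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in the PASCAL VOC layout (not taken from the study's data).
SAMPLE_XML = """<annotation>
  <filename>example_university.png</filename>
  <size><width>1280</width><height>3000</height><depth>3</depth></size>
  <object>
    <name>text</name>
    <bndbox><xmin>100</xmin><ymin>50</ymin><xmax>500</xmax><ymax>90</ymax></bndbox>
  </object>
  <object>
    <name>image</name>
    <bndbox><xmin>0</xmin><ymin>120</ymin><xmax>1280</xmax><ymax>620</ymax></bndbox>
  </object>
  <object>
    <name>text</name>
    <bndbox><xmin>100</xmin><ymin>700</ymin><xmax>900</xmax><ymax>760</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return (page_width, page_height, [(cls, xmin, ymin, xmax, ymax), ...])."""
    root = ET.fromstring(xml_text)
    size = root.find("size")
    width = int(size.findtext("width"))
    height = int(size.findtext("height"))
    elements = []
    for obj in root.iter("object"):
        box = obj.find("bndbox")
        coords = tuple(int(box.findtext(tag)) for tag in ("xmin", "ymin", "xmax", "ymax"))
        elements.append((obj.findtext("name"),) + coords)
    return width, height, elements


def area_share(elements, cls, width, height):
    """Share of the page area covered by boxes of one class (overlaps are counted twice)."""
    area = sum((x2 - x1) * (y2 - y1) for c, x1, y1, x2, y2 in elements if c == cls)
    return area / (width * height)


width, height, elements = parse_voc(SAMPLE_XML)
counts = Counter(cls for cls, *_ in elements)  # number of labeled elements per class
image_share = area_share(elements, "image", width, height)
text_share = area_share(elements, "text", width, height)
```

In this toy page, `counts` holds two *text* elements and one *image* element, and the image covers one sixth of the page area.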

#### The Labeling Verification

For each UI element in each screenshot, the verifiers could specify the labeling as *correct* or *incorrect*. In addition, for each UI they were asked to subjectively assess completeness, i.e., whether all the visible UI elements had been labeled, on the scale from 1 (very few elements) to 100 (all elements). The verifiers had written instructions with recommendations for making the *correct/incorrect* decision, based on the precision of the UI element's bounding box and the correct specification of its class. To support the verification procedure, we developed custom web-based software. The previously labeled UIs were distributed among the verifiers nearly evenly, but without a random assignment.

#### 2.3.3. Subjects

There were 3 groups of human participants, mostly students of Novosibirsk State Technical University (NSTU), who performed the aforementioned activities:


All the participants took part in the study voluntarily, and informed consent was obtained. They had normal or corrected-to-normal vision and reasonably high experience in the general usage of IT.

#### 2.3.4. Design

The mean UI assessment ratings per the screenshots on the three scales of complexity, aesthetics, and orderliness (ScaleC, ScaleA, and ScaleO, respectively) were used as the *output variables* for the 3 × 11 = 33 user behavior models that we would construct for each scale and each labeler.

The *input data* for the models were 8 factors, whose values we automatically calculated for each UI from the labeling data, using our dedicated Python script:


From the 20 labeled classes, we deliberately chose the most visually prominent ones: *text*, *image* and *background image*, since our experiment implied visual perception of the material, but no interaction with the UIs—hence, no *link*, *radiobutton*, *selectbox*, *textinput*, and so on.

So, our experiment had a between-subject design. The main independent variables were *subjective completeness (SC)* and *Precision*, averaged for each of the 11 labelers:

$$Precision = \frac{correct}{correct + incorrect}.\tag{1}$$
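Equation (1) can be illustrated with a short Python sketch. The record layout, the labeler IDs "A" and "B", and the function names are hypothetical; pooling all verified elements per labeler is also an assumption, since the averaging could equally be done over per-UI values.

```python
from collections import defaultdict

# Hypothetical verification records: (labeler_id, verdict) for each labeled UI element.
records = [
    ("A", "correct"), ("A", "correct"), ("A", "incorrect"), ("A", "correct"),
    ("B", "correct"), ("B", "incorrect"), ("B", "incorrect"), ("B", "correct"),
]


def precision(correct, incorrect):
    """Equation (1): share of correctly labeled elements among all verified ones."""
    total = correct + incorrect
    if total == 0:
        raise ValueError("no verified elements")
    return correct / total


def precision_per_labeler(records):
    """Pool the verdicts per labeler (an assumption) and apply Equation (1)."""
    tallies = defaultdict(lambda: {"correct": 0, "incorrect": 0})
    for labeler, verdict in records:
        tallies[labeler][verdict] += 1
    return {
        labeler: precision(t["correct"], t["incorrect"])
        for labeler, t in tallies.items()
    }


per_labeler = precision_per_labeler(records)

# Pooled precision over the verification totals reported in Section 3.1.
pooled = precision(37053, 4967)
```

With the totals from Section 3.1, the pooled value is about 0.882. It need not coincide with the mean of per-labeler precisions, because labelers contributed different numbers of elements.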

The (derived) dependent variables were the quality parameters (*R*<sup>2</sup>s) of the user behavior models. We also controlled for another derived variable, the number of screenshots processed by each labeler (UI).

Our **hypothesis** was that higher SC and precision, corresponding to better quality of the labeling data, should result in better quality of the models (*R*<sup>2</sup>s).

### **3. Results**

### *3.1. Descriptive Statistics*

In total, we collected 12,705 assessments for the 497 UIs. Further, the 11 labelers specified 42,716 elements in 495 UIs (see Table 1 in [9]), and the quality of their work was evaluated by 20 verifiers. Some UIs had technical problems or incomplete evaluations, so we were left with 487 valid UIs (98.0%), for which the descriptive statistics are presented in Table 1. The first and second names of the labelers are abbreviated in the IDs.


**Table 1.** The descriptive statistics per the labelers (M ± SD).

To check for the homogeneity of the UI assessments per the 11 labelers, we ran ANOVA tests for all three scales. We found a barely significant effect of ID only on ScaleO (*F*<sub>10,476</sub> = 1.87, *p* = 0.047), but not on ScaleC (*F*<sub>10,476</sub> = 1.21, *p* = 0.284) or ScaleA (*F*<sub>10,476</sub> = 1.63, *p* = 0.096). The post-hoc test for ScaleO (Tukey HSD, since there were many levels of the independent variable) found a significant difference (at *α* = 0.05) only between labelers PV and PE (*p* = 0.012). The variances were not different (*p* = 0.372), so the ANOVA assumptions were met. Pearson correlations for the assessments per UI were highly significant between ScaleA and ScaleO (*r*<sub>487</sub> = 0.771, *p* < 0.001), as well as between ScaleC and ScaleO (*r*<sub>487</sub> = −0.145, *p* = 0.001), but not between ScaleC and ScaleA.
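The homogeneity check can be sketched in plain Python as a one-way ANOVA *F* statistic. The toy groups below are hypothetical and stand in for the mean ratings of the UIs processed by each labeler; the function name is ours, and obtaining the *p* value (which needs the *F* distribution) is left out.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of ratings."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more observations than groups")
    grand_mean = sum(sum(g) for g in groups) / n
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical toy data: three "labelers" with three rated UIs each.
toy_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova(toy_groups)
```

For 11 labelers and 487 valid UIs, the same formulas give 11 − 1 = 10 and 487 − 11 = 476 degrees of freedom, which matches the reported *F*<sub>10,476</sub>.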

In the verification, 37,053 labeled elements were specified as correct and 4967 as incorrect, and the mean Precision per labelers was 88.7%, which indicates a reasonably good work quality. The Pearson correlation between Precision and SC per labelers was not significant (*p* = 0.727), which suggests that these two aspects of UI labeling quality are distinct. The correlation between SC and the average number of correct objects was significant (*r*<sub>11</sub> = 0.622, *p* = 0.041), unlike for the number of all labeled objects (*r*<sub>11</sub> = 0.170, *p* = 0.618), which reinforces the meaningfulness of the verification.

### *3.2. The Effect of the Input Data Quality in the Models*

To construct the UBMs, we relied on simple linear regression, since we only had a limited number of data samples (41–54) for each labeler. So, we built 33 models, each having the same 8 factors calculated from each labeler's output. The *R*<sup>2</sup>s obtained for the models are presented in Table 2, together with the mean labelers' quality parameters obtained from the verification of the UIs.
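As an illustration of how an *R*<sup>2</sup> value is obtained for one such model, here is a minimal pure-Python sketch of ordinary least squares with an intercept. The two toy factors and the ratings are hypothetical, and solving the normal equations directly is our simplification; a statistics package would normally be used for 8 factors and 41–54 samples.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: the factors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_r2(rows, y):
    """Fit y ~ intercept + factors by least squares; return (coefficients, R^2)."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    pred = [sum(c * v for c, v in zip(coefs, d)) for d in design]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return coefs, 1.0 - ss_res / ss_tot


# Hypothetical toy data: two factors per UI and a mean rating per UI.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y_exact = [1 + 2 * a - b for a, b in rows]   # perfectly linear ratings
y_noisy = [1.5, 3.5, 3.0, 6.5, 5.5]          # the same ratings with small deviations
coefs_exact, r2_exact = fit_r2(rows, y_exact)
_, r2_noisy = fit_r2(rows, y_noisy)
```

The exact data yield *R*<sup>2</sup> = 1, while the noisy ratings give a value below 1; repeating such a fit per labeler and per scale would produce the 3 × 11 grid of quality parameters.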

Since the number of screenshots processed by each labeler (UI) was not exactly the same (see Table 1), we checked its correlations with the *R*<sup>2</sup>s for each of the three scales. We found that none of the Pearson correlations was significant at *α* = 0.05, so treating all labelers' models uniformly is justified.

The subsequent Pearson correlation analysis revealed that the SC did not have a significant correlation (at *α* = 0.05) with the models' quality parameter (*R*<sup>2</sup>) for any of the scales. Even for ScaleC, the correlation was *r*<sub>11</sub> = −0.062 (*p* = 0.856), whereas the visual complexity of a user interface is known to be influenced by the number of elements [25]. For the sake of checking the conceptual validity of our SC variable, we also checked the association between the factual *average number of elements per UI* for each labeler and the *R*<sup>2</sup>s. Again, none of the Pearson correlations was significant (at *α* = 0.05), the correlation for ScaleC being *r*<sub>11</sub> = 0.274 (*p* = 0.415).


**Table 2.** The labelers' and the models' quality.

For precision, we found significant **negative** correlations with the *R*<sup>2</sup>s for ScaleA (*r*<sub>11</sub> = −0.768, *p* = 0.006) and ScaleO (*r*<sub>11</sub> = −0.644, *p* = 0.032), but not for ScaleC (*r*<sub>11</sub> = −0.051, *p* = 0.883). Recognizing the possible inaccuracy of our quality measures, we also tried treating *R*<sup>2</sup> and the precision as ordinal variables, which is rather practical, since task requesters are often interested in only accepting the output from the best labelers. The results were broadly similar for Kendall's tau-b correlation measure: *τ*<sub>11</sub> = −0.491, *p* = 0.036 for ScaleA and *τ*<sub>11</sub> = −0.418, *p* = 0.073 for ScaleO, although the latter no longer reached significance at *α* = 0.05.
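The two correlation measures used here can be sketched in plain Python as follows. The function names and the toy sequences are hypothetical; the *t* statistic is the standard test of a Pearson correlation against zero with *n* − 2 degrees of freedom, applied to the reported coefficients for the 11 labelers.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length with at least 3 items")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation with a correction for tied pairs."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # pairs tied on both variables enter neither count
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom


# t values for the reported correlations (n = 11 labelers, 9 degrees of freedom).
t_scale_a = t_statistic(-0.768, 11)
t_scale_o = t_statistic(-0.644, 11)
t_scale_c = t_statistic(-0.051, 11)

# Hypothetical toy data, only to exercise the functions.
toy_x = [1, 2, 3, 4, 5]
toy_y = [1, 3, 2, 5, 4]
toy_r = pearson_r(toy_x, toy_y)
toy_tau = kendall_tau_b(toy_x, toy_y)
```

The resulting |*t*| values are about 3.60 for ScaleA and 2.53 for ScaleO, both above the two-tailed critical value of about 2.262 for 9 degrees of freedom at *α* = 0.05, whereas the value for ScaleC (about 0.15) is far below it. This is consistent with the reported *p* values.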

### **4. Discussion and Conclusions**

Seeking to explore the effect of input data quality, we undertook an experimental study with 101 human participants and 497 web UIs. Our assumption was that better quality of the UI labeling should result in better quality of UBMs.

Contrary to our expectations, we found significant *negative* correlations between the labelers' precision and the resulting models' quality (see Table 2) for the subjective impression dimensions of aesthetics (*r*<sub>11</sub> = −0.768) and orderliness (*r*<sub>11</sub> = −0.644). Before deciding to report the negative research results in the current paper, we revisited the possible biases. However, the following considerations reinforce the validity of our findings:


The discovered negative correlations between the labelers' precision and the quality of the resulting models are not entirely clear to us, and we do not yet have a convincing explanation. We would like to note that the effect was found for the scale of aesthetics and the related scale of orderliness, but not for the less subjective scale of complexity. It is believed that aesthetic judgements of visual objects are rather high level, involving the factors of layout, visual hierarchy, colors, etc. Individual elements are grouped according to Gestalt principles, and imprecisions and omissions might even contribute to that: think of an Impressionist painting. Correspondingly, we might speculate that the indiligent workers would have a bias towards picking the UI elements and labeling them in a way matching the actual human perception. However, a much closer look at their output would be required before drawing any justified conclusions.

Among the limitations of our study, we see the relative minimalism of the linear regression UBMs. We only employed eight factors, and often they would not even be significant in the models. Correspondingly, the absolute quality levels of some models were rather modest, while the average *R*<sup>2</sup> per the 33 models turned out to be 0.281. The latter is arguably acceptable for our small-scale study that deliberately incorporated the potentially low-quality input data. For instance, in another study of ours with the same set of university website screenshots, *R*<sup>2</sup>s ranged from 0.105 to 0.248 (similarly, aesthetics had the highest *R*<sup>2</sup> of the three scales) [8]. However, admittedly, the number of factors there was smaller, and the number of samples was higher. In any case, in the current study we were interested in the relative values and never intended to use the models in production.

Our further research prospects involve experimentation with more labelers and a more diverse set of the web UIs. Having collected more data, we plan to employ artificial neural network (ANN) models, instead of the simple linear regression ones. ANNs are known as universal approximators and can naturally handle systematic bias in data.

**Author Contributions:** Conceptualization, M.B.; methodology, M.B.; software, V.K.; validation, M.B. and V.K.; formal analysis, M.B.; investigation, M.B.; resources, M.B. and V.K.; data curation, M.B. and V.K.; writing—original draft preparation, M.B.; writing—review and editing, M.B.; visualization, M.B.; supervision, M.B.; project administration, M.B.; funding acquisition, M.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by RFBR, grant number 19-29-01017.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and approved by the Ethics Committee of Faculty of Humanities of Novosibirsk State Technical University (protocol code 7\_02\_2019).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The data presented in this study are available on request from M.B. (the first author). The data are not publicly available due to privacy reasons.

**Acknowledgments:** We would like to thank those who contributed to the project: Sebastian Heil, Martin Gaedke, Anna Stepanova and Galina Hamgushkeeva, as well as Alyona Bakaeva, who provided inspiration for this work.

**Conflicts of Interest:** The authors declare no conflicts of interest.

### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

**Marina Barulina 1,2,3,\*,‡, Yuliya Gergenreter 4,‡, Natalia Zakharova 4,‡, Vladimir Maslyakov 3,4,‡, Vladimir Fedorov 4,‡ and Ivan Ulitin 1,2,‡**


**Abstract:** A predictive model for the early diagnosis of breast cancer based on the concentration of some cytokines in the tumor microenvironment in the blood was built in this paper. In the work, the influence of the following cytokines was studied: monocytic chemoattractant protein-1, vascular endothelial growth factor, tumor necrosis factor-alpha, interferon gamma, transforming growth factor-beta1, granulocyte colony stimulating factor, and granulocyte-macrophage colony stimulating factor. As a result of preliminary statistical analysis, some combinations of these cytokines that allowed for almost reliable detection of the presence or absence of breast cancer were identified. Based on the identified combinations, new features were constructed. A machine learning model was trained using gradient boosting for its classification method. The built model has an accuracy equal to 1.0 at this stage, so the authors find it reasonable to carry out additional tests of the model for more patients. However, even at this stage, it can be concluded that the concentration of cytokines in the blood serum is applicable for the early diagnosis of breast cancer.

**Keywords:** tumor microenvironment; cytokines; machine learning; breast cancer; blood analysis; predictive analysis
