speaker verification datasets, comprising more than 100 thousand and 1 million utterances with 1251 and 6112 speakers, respectively.

**Table 5.** Dataset statistics for both VoxCeleb1 and VoxCeleb2. There are no duplicate utterances between VoxCeleb1 and VoxCeleb2 (POI = person of interest).

| Dataset | VoxCeleb1 | VoxCeleb2 |
|---|---|---|
| # of hours | 352 | 2442 |
| Average # of utterances per POI | 116 | 185 |
| Average length of utterances (s) | 8.2 | 7.8 |

We used the VoxCeleb1 evaluation dataset, which includes 40 speakers and 37,720 pairs in the official test protocol [27], as shown in Figure 5. The test protocol comprises eight pairs per utterance of the VoxCeleb1 evaluation set (four pairs of the same speaker and four pairs of different speakers). Among all 38,992 (4874 × 8) possible pairs, 37,720 pairs were selected. The pairs were chosen to balance gender, utterance length, and the number of pairs per speaker. In addition, it is an open-set test in which all evaluated speaker pairs are unavailable in the training dataset.
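The trial-list format described above can be sketched as follows. This is an illustrative parser, not the authors' code; the function name `parse_trials` is our own, and the file paths are sample entries from the official protocol shown in Figure 5.

```python
# Sketch of parsing a VoxCeleb1-style trial list: each line holds a label
# (1 = same speaker, 0 = different speakers) followed by the two
# utterances to compare.

def parse_trials(lines):
    """Return a list of (label, enrol_path, test_path) tuples."""
    trials = []
    for line in lines:
        label, enrol, test = line.split()
        trials.append((int(label), enrol, test))
    return trials

protocol = [
    "1 id10270/x6uYqmx31kE/00001.wav id10270/8jEAjG6SegY/00008.wav",
    "0 id10270/x6uYqmx31kE/00001.wav id10300/ize_eiCFEg0/00003.wav",
]
trials = parse_trials(protocol)
same = sum(1 for label, _, _ in trials if label == 1)
print(len(trials), same)  # 2 trials, 1 same-speaker pair
```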


**Figure 5.** Example of official test protocol from VoxCeleb1 evaluation dataset (In the first column, 1 refers to the same speaker and 0 refers to different speakers. The second and third columns refer to the speakers to be compared).

#### *4.2. Experimental Setup*


During data preprocessing, we used 64-dimensional log Mel-filter-bank energies with a 25 ms frame length and a 10 ms frame shift, mean-and-variance normalized over a 3 s sliding window. For each training step, a 12 s interval was extracted from each utterance through cropping or padding. In addition, a preprocessing method was used to conduct time and frequency masking on the input features [28].
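The cropping/padding and masking steps can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the padding-by-repetition strategy, and the mask widths are our assumptions; only the 12 s target (1200 frames at a 10 ms shift) and the 64 filter-bank dimensions come from the text.

```python
import numpy as np

def crop_or_pad(feats, target_frames=1200):
    """Crop or pad a (frames, 64) feature matrix to a fixed 12 s length."""
    n = feats.shape[0]
    if n >= target_frames:            # crop a random window
        start = np.random.randint(0, n - target_frames + 1)
        return feats[start:start + target_frames]
    # pad by repeating the utterance until it is long enough (assumption)
    reps = int(np.ceil(target_frames / n))
    return np.tile(feats, (reps, 1))[:target_frames]

def mask(feats, max_t=100, max_f=8):
    """SpecAugment-style time and frequency masking [28] (widths assumed)."""
    feats = feats.copy()
    t0 = np.random.randint(0, feats.shape[0] - max_t)
    f0 = np.random.randint(0, feats.shape[1] - max_f)
    feats[t0:t0 + np.random.randint(1, max_t), :] = 0.0   # time mask
    feats[:, f0:f0 + np.random.randint(1, max_f)] = 0.0   # frequency mask
    return feats

x = np.random.randn(800, 64)          # an 8 s utterance
x = mask(crop_or_pad(x))
print(x.shape)                        # (1200, 64)
```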

The model training specifications are as follows: we used a standard cross-entropy loss function with a standard stochastic gradient descent optimizer, with a momentum of 0.9, a weight decay of 0.0001, and an initial learning rate of 0.1, reduced by a 0.1 scale factor on the plateau [29]. All experiments were trained for 200 epochs with a 96 mini-batch size. The scaling constant α was set to 10, and the reduction ratio *r* was set to 8 [11,22]. As shown in Figure 6, we confirmed that the training loss converges for both the baseline model, described in Section 2.2, and the proposed model, described in Section 3.1.
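The reduce-on-plateau schedule above can be sketched in pure Python. This is a hedged sketch, not the authors' code: the paper states only the initial rate (0.1) and the 0.1 factor, so the `patience` value is an assumption; in PyTorch the equivalent would be `torch.optim.lr_scheduler.ReduceLROnPlateau` combined with `torch.optim.SGD(momentum=0.9, weight_decay=1e-4)`.

```python
# Minimal "reduce on plateau" schedule: multiply the learning rate by
# `factor` whenever the monitored loss has not improved for `patience`
# consecutive epochs.

class PlateauScheduler:
    def __init__(self, lr=0.1, factor=0.1, patience=5):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, loss):
        if loss < self.best:          # improvement: reset the counter
            self.best = loss
            self.bad_epochs = 0
        else:                         # plateau: count and maybe decay
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```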


**Figure 6.** Training loss curve of (**a**) the baseline model and (**b**) the proposed model.

From the trained model, we generated a 512-dimensional speaker embedding for each utterance, as shown in Figure 7. The standard cosine similarity is computed for each speaker pair, and the equal error rate (EER, %) is calculated. The EER value is the crossing point of the two curves, the false rejection rate and the false acceptance rate, according to the decision threshold. This can also be expressed on the receiver operating characteristic (ROC) curve using the true-positive rate and false-positive rate. All of our proposed methods were implemented using the PyTorch toolkit [30].
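The scoring and EER computation described above can be sketched as follows. This is an illustrative sketch under our own naming (`cosine`, `eer`): cosine similarity between two embeddings, then a threshold sweep to find the point where the false rejection rate (FRR) and false acceptance rate (FAR) cross.

```python
import math

def cosine(a, b):
    """Standard cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def eer(genuine, impostor):
    """Approximate EER (%) by sweeping the decision threshold over
    all observed scores and taking the point where FRR and FAR meet."""
    best = None
    for thr in sorted(genuine + impostor):
        frr = sum(s < thr for s in genuine) / len(genuine)
        far = sum(s >= thr for s in impostor) / len(impostor)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2)
    return 100 * best[1]
```

With perfectly separated scores, e.g. `eer([0.8, 0.7, 0.9], [0.2, 0.3, 0.4])`, the sweep finds a threshold with zero error on both curves.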


*Electronics* **2020**, *9*, x FOR PEER REVIEW 9 of 15

```
1 id10270/x6uYqmx31kE/00001.wav id10270/8jEAjG6SegY/00008.wav
0 id10270/x6uYqmx31kE/00001.wav id10300/ize_eiCFEg0/00003.wav
1 id10270/x6uYqmx31kE/00001.wav id10270/GWXujl-xAVM/00017.wav
0 id10270/x6uYqmx31kE/00001.wav id10273/0OCW1HUxZyg/00001.wav
1 id10270/x6uYqmx31kE/00001.wav id10270/8jEAjG6SegY/00022.wav
0 id10270/x6uYqmx31kE/00001.wav id10284/Uzxv7Axh3Z8/00001.wav
1 id10270/x6uYqmx31kE/00001.wav id10270/GWXujl-xAVM/00033.wav
0 id10270/x6uYqmx31kE/00001.wav id10284/7yx9A0yzLYk/00029.wav
1 id10270/x6uYqmx31kE/00002.wav id10270/5r0dWxy17C8/00026.wav
0 id10270/x6uYqmx31kE/00002.wav id10285/m-uILToQ9ss/00009.wav
1 id10270/x6uYqmx31kE/00002.wav id10270/GWXujl-xAVM/00035.wav
0 id10270/x6uYqmx31kE/00002.wav id10306/uzt36PBzT2w/00001.wav
1 id10270/x6uYqmx31kE/00002.wav id10270/GWXujl-xAVM/00038.wav
0 id10270/x6uYqmx31kE/00002.wav id10307/kp_GCjLq4qA/00004.wav
1 id10270/x6uYqmx31kE/00002.wav id10270/GWXujl-xAVM/00033.wav
0 id10270/x6uYqmx31kE/00002.wav id10275/Mdk1SXywHck/00024.wav
...
```




**Figure 7.** Examples of the 512-dimensional speaker embedding in one utterance of (**a**) the baseline model and (**b**) the proposed model (we converted the 512 dimensions to 32 × 16).

#### *4.3. Experimental Results*

To evaluate the proposed methods, we first tested the baseline using different encoding methods and other networks, and then we compared our proposed method with state-of-the-art encoding methods.

Table 6 presents the results of baseline modifications, as described in Section 2.2. It demonstrates the effectiveness of modifications to the encoding methods. We experimented with basic encoding layers, such as GAP and SAP. We then combined the proposed methods individually with the baseline, for example, self-attentive multi-layer aggregation, feature recalibration, and deep length normalization. Specifically, the scaled ResNet-34 with GAP and SAP achieved EER values of 6.85% and 6.68%, respectively. Because multi-layer aggregation was not applied with these encoding methods, the number of dimensions of the speaker embedding was 256. In addition, the gap in performance between GAP and SAP was not large. We then applied the multi-layer aggregation for scaled ResNet-34 with GAP and SAP. In particular, the scaled ResNet-34 using multi-layer aggregation and GAP is our baseline system described in Section 2.2. Although the speaker embedding dimensions and model parameters were larger in number than those of GAP and SAP, the EER value was reduced from 6.85% to 5.83% and from 6.68% to 5.42%, respectively. Additional applications to self-attentive multi-layer aggregation using feature recalibration and deep length normalization also achieved EER values of 5.07% and 4.95%, respectively. In addition, the ROC curve of the proposed model showed the EER point, as shown in Figure 8. Consequently, the experimental results showed that when all of the proposed methods were applied, the model parameters increased by approximately 0.5 M compared to the scaled ResNet-34 with GAP, whereas the EER value improved by 1.9%.

**Table 6.** Experimental results for modifying the baseline construction, using the VoxCeleb1 training and evaluation dataset (Dim = speaker embedding dimension; Params = model parameters; EER = equal error rate; GAP = global average pooling; SAP = self-attentive pooling; MLA = multi-layer aggregation; FR = feature recalibration; DLN = deep length normalization).

| Model | Encoding Method | Dim | Params | EER (%) |
|---|---|---|---|---|
| Scaled ResNet-34 | GAP | 256 | ≈5.6 M | 6.85 |
| | SAP | 256 | ≈5.7 M | 6.68 |
| | GAP-MLA | 512 | ≈5.9 M | 5.83 |
| | SAP-MLA | 512 | ≈6.0 M | 5.42 |
| | SAP-MLA-FR | 512 | ≈6.1 M | 5.07 |
| | SAP-MLA-FR-DLN | 512 | ≈6.1 M | 4.95 |


**Figure 8.** ROC curve of the proposed model (threshold value is 0.3362 and EER value is 4.95% using VoxCeleb1 training and evaluation dataset in Table 6).

Table 7 shows a comparison of our proposed methods with other networks. All experiments used the VoxCeleb1 training and evaluation datasets. First, the *i*-vector extractor was trained according to the implementation in [27]. After generating 400-dimensional *i*-vectors, PLDA was applied to reduce the number of dimensions of *i*-vectors to 200. The *i*-vector with the PLDA system achieved an EER value of 8.82%. In addition, an *x*-vector system was trained according to the implementation in [18]. The *x*-vector system is based on the use of time-delay neural networks (TDNN) using an SP method, which is commonly applied for text-independent speaker verification along with a ResNet-based system. The 1500-dimensional *x*-vector was extracted from the TDNN, which achieved an EER value of 8.19%. Our proposed methods based on the scaled ResNet-34 showed an improved performance, compared to the previous systems (i.e., EER value of 4.95%).

**Table 7.** Experimental results comparing our proposed methods with other networks using the VoxCeleb1 training and evaluation dataset (Dim = speaker embedding dimension; EER = equal error rate; SP = statistical pooling; GAP = global average pooling; SAP = self-attentive pooling; MLA = multi-layer aggregation; FR = feature recalibration; DLN = deep length normalization).

| Model | Encoding Method | Dim | EER (%) |
|---|---|---|---|
| *i*-vector + PLDA | – | 200 | 8.82 |
| *x*-vector | SP | 1500 | 8.19 |
| Scaled ResNet-34 | SAP-MLA-FR-DLN | 512 | 4.95 |


Tables 8 and 9 show a comparison of our proposed methods with state-of-the-art encoding approaches. Here, we compared encoding methods using a ResNet-based model and the cross-entropy loss function. Various encoding methods were compared, including TAP [10,16], learnable dictionary encoding (LDE) [10], SAP [10], GAP [15], NetVLAD [7], and GhostVLAD [7].

**Table 8.** Experimental results comparing our proposed methods with state-of-the-art encoding methods using the VoxCeleb1 training and evaluation dataset (Dim = speaker embedding dimension; EER = equal error rate; TAP = temporal average pooling; LDE = learnable dictionary encoding; SAP = self-attentive pooling; GAP = global average pooling; MLA = multi-layer aggregation; FR = feature recalibration; DLN = deep length normalization).


**Table 9.** Experimental results comparing our proposed methods with state-of-the-art encoding methods using the VoxCeleb2 training datasets and the VoxCeleb1 evaluation datasets (Dim = speaker embedding dimension; EER = equal error rate; TAP = temporal average pooling; SAP = self-attentive pooling; MLA = multi-layer aggregation; FR = feature recalibration; DLN = deep length normalization).


In Table 8, all experiments used the VoxCeleb1 training and evaluation datasets. ResNet-34 with TAP, LDE, SAP, or GAP achieved EER values of 5.48%, 5.21%, 5.51%, and 5.39%, respectively [10,15]. The speaker embedding dimensions of these systems were 128 or 256, which were smaller than those of the proposed methods. However, our proposed encoding methods based on the scaled ResNet-34 achieved an EER value of 4.95%, an improvement over the other systems.

In Table 9, all experiments used the VoxCeleb2 training datasets and VoxCeleb1 evaluation datasets. As presented in Table 5, the VoxCeleb2 training dataset is roughly seven times larger than the VoxCeleb1 training dataset. Table 9 shows that increasing the amount of training data was effective for improving performance. ResNet-34 and ResNet-50 with TAP achieved EER values of 5.04% and 4.95%, respectively [16]. In addition, a thin-ResNet-34 with NetVLAD and GhostVLAD achieved EER values of 3.57% and 3.22%, respectively [7]. The number of speaker embedding dimensions of these systems was 512, which is the same as that of our proposed methods. Our proposed encoding methods based on the scaled ResNet-34 achieved an EER value of 2.86%. Consequently, the experimental results showed that our proposed methods were superior to other state-of-the-art methods.

Furthermore, in the case of on-device speaker verification, the lower the speaker embedding dimension, the faster the system. Compared to other state-of-the-art encoding methods, our proposed methods are limited by their high-dimensional speaker embeddings. Therefore, future research is required to address this dimensionality problem. In a future study, on-device speaker verification using low-dimensional speaker embeddings will be conducted.

#### **5. Conclusions**

In previous multi-layer aggregation methods for text-independent speaker verification, the number of model parameters was relatively large, and unspecified variations increased during training. Therefore, we proposed a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. First, we set the ResNet with the scaled channel width and layer depth as a baseline. Second, self-attentive multi-layer aggregation was applied when training the frame-level features of each residual layer in the scaled ResNet. Finally, the feature recalibration layer and deep length normalization were applied to train the utterance-level feature in the encoding layer. The experimental results using the VoxCeleb1 evaluation dataset showed that the proposed method achieved an EER value performance comparable to that of state-of-the-art models.

**Author Contributions:** Conceptualization, S.S.; methodology, S.S.; software, S.S.; validation, S.S.; formal analysis, S.S.; investigation, S.S.; resources, S.S.; data curation, S.S.; writing—original draft preparation, S.S.; writing—review and editing, S.S. and J.-H.K.; visualization, S.S.; supervision, J.-H.K.; project administration, J.-H.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Acknowledgments:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No.2020R1F1A1076562).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Domain Usability Evaluation**

**Michaela Bačíková \*, Jaroslav Porubän, Matúš Sulír, Sergej Chodarev, William Steingartner and Matej Madeja**

> Department of Computers and Informatics, Faculty of Electrical Engineering and Informatics, Technical University of Košice, Letná 9, 042 00 Košice, Slovakia; jaroslav.poruban@tuke.sk (J.P.); matus.sulir@tuke.sk (M.S.); sergej.chodarev@tuke.sk (S.C.); william.steingartner@tuke.sk (W.S.); info@madeja.sk (M.M.)

**\*** Correspondence: michaela.bacikova@tuke.sk

**Abstract:** Contemporary software systems focus on usability and accessibility from the point of view of effectiveness and ergonomics. However, the correct usage of the domain dictionary and the description of domain relations and properties via their user interfaces are often neglected. We use the term *domain usability (DU)* to describe the aspects of the user interface related to the terminology and domain. Our experience showed that poor domain usability reduces the memorability and effectiveness of user interfaces. To address this problem, we describe a method called *ADUE (Automatic Domain Usability Evaluation)* for the automated evaluation of selected DU properties on existing user interfaces. As prerequisites to the method, metrics for the formal evaluation of domain usability, a form stereotype recognition algorithm, and a general application term filtering algorithm have been proposed. We executed ADUE on several real-world Java applications and report our findings. We also provide proposals to modify existing manual usability evaluation techniques for the purpose of domain usability evaluation.

**Keywords:** human–computer interaction; user experience; usability evaluation methods; domain usability; domain-specific languages; graphical user interfaces

#### **1. Introduction**

User experience (UX) and usability are already ingrained in our everyday lives. Nielsen's concept of "usability engineering" [1] and Norman's [2] practical user interface (UI) design have become an inseparable part of design policies in many large companies, setting an example to the UX field throughout the world. Corporations such as Apple, Google, Amazon, and Facebook realized that when designing UIs, it is not only about how pretty the UI looks: from a long-term perspective, usability and UX bring economic benefits over competitors. Usability and UX are related to many aspects of the design, including consistency, efficiency, error rate, learnability, ease of use, utility, credibility, accessibility, desirability, and many more [1–5].

However, when analyzing common UIs of medium and small companies, we still find UIs that are developed with respect to practical usability and UX but not to the user's domain. From our experience, such cases are very common. The situation has slowly become better with the introduction of UX courses into university curricula and with the foundation of UX organizations spreading the word. The more specific the domain, the more evident the problem of designs that focus on usability but neglect the domain aspect. This fact has been identified by multiple researchers around the globe [6–9].

#### *1.1. Domain Usability*

We describe *Domain Usability (DU)* in terms of five UI aspects: domain content, consistency, world language, an adequate level of specificity, and language barriers and errors. For the purpose of clarity, we will present the full definition [10] of all five aspects here:

**Citation:** Bačíková, M.; Porubän, J.; Sulír, M.; Chodarev, S.; Steingartner, W.; Madeja, M. Domain Usability Evaluation. *Electronics* **2021**, *10*, 1963. https://doi.org/10.3390/electronics10161963

Academic Editor: George A. Tsihrintzis

Received: 19 July 2021 Accepted: 10 August 2021 Published: 15 August 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


Domain usability is not a separate aspect of a UI. On the contrary, it is a part of the general usability property. The *overall usability* is defined as a *combination* of ergonomic and domain usability. Successful completion of a task in a UI is affected by both ergonomic and domain factors:


As we described in our previous works, all aspects of the overall usability (as defined by Nielsen [1]) are affected by DU. For more details on the definition of DU, we encourage the reader to see our earlier work [11].

#### *1.2. Problem and Motivation*

To summarize our knowledge, we identified the main issues in this area as follows:


We have addressed issues (i) to (v) and also partially (vi) in our previous works:


The *main contribution* of this paper concerns issue (vi), in which we focus on automated evaluation. We summarize and put into context our existing findings in this area and describe the final design, implementation, and validation of this method. The novel additions and improvements include but are not limited to the General Application Terms Ontology (Section 4.2), the Form Stereotype Recognition algorithm (Section 4.3), the computation and display of the Domain Usability Score in the ADUE tool (Section 6), and a detailed presentation of the evaluation results on real-world applications (Section 7). As a secondary contribution, we propose modifications of existing general usability evaluation techniques to make them suitable for domain usability evaluation (Section 8).

#### *1.3. Paper Structure*

In Section 2, we introduce our DU metrics that can be used for formal DU evaluation. The metrics were used to calculate the DU score in our automated evaluation approach.

In Sections 3–5, we explain the design of our automated approach to DU evaluation. First, we explain the concept (Section 3), then describe the prerequisites needed for the approach to work (Section 4), and then we describe the method itself (Section 5). To verify the approach and show its viability, we implemented its prototype (Section 6) and used it to analyze multiple open-source applications (Section 7).

We summarize both manual and automated techniques of usability evaluation in Section 8, and for some of them, we comment on their potential to evaluate DU. Section 9 represents related work focused on DU and its references in the literature.

#### **2. Domain Usability Metrics Design**

As we have mentioned, DU is defined by five main aspects. In our previous research [24], we tried to determine whether all DU aspects impact usability equally. Several preliminary experiments we performed in the domain of gospel music suggested that this is not the case [10,23]; e.g., consistency issues had a stronger impact on usability than language errors.

We decided to conduct two surveys [24] to evaluate the effect of the five DU aspects on DU. Using the results of the surveys, we designed a metric for the formal evaluation of DU. The metric can be used in manual or automated evaluation to represent a *formal measurement of the target UI's DU*. Next, we will explain the design of the DU metric.

To formally measure the target UI's DU, we first determine the number of all user interface components containing textual data or icons. Next, we analyze the components to find out which of them have DU issues. Since we have the number of all terms *n* and the number of DU issues, we can compute the percentage of the UI's correctness, where 100% represents the highest possible DU and 0% is the lowest one. Note that each component can have multiple issues at the same time (e.g., an incorrect term and a grammar error). If all UI components had multiple issues, the result would be lower than zero, so it is necessary to limit the minimum value. Given that each DU aspect has a different weight, we defined the formula to measure DU as follows:

$$du = \max\left(0, \ 100\left(1 - \frac{e}{n}\right)\right) \tag{1}$$

where *e* is calculated as:

$$e = w_{dc}\,n_{dc} + w_{ds}\,n_{ds} + w_{c}\,n_{c} + w_{eb}\,n_{eb} + w_{l}\,n_{l} \tag{2}$$

Coefficients *wx* (where *x* stands for *dc*, *ds*, *c*, *eb*, or *l*) are the weights of the particular DU aspects, and *nx* are the corresponding issue counts:

• *ndc*—the number of domain content issues;
• *nds*—the number of domain specificity issues;
• *nc*—the number of consistency issues;
• *neb*—the number of language barrier and error issues;
• *nl*—the number of world language issues.

The weights *w<sup>x</sup>* were determined by performing two surveys, first a general one with 73 respondents aged between 17 and 44 years and then a domain-specific one with 26 gospel singers and guitar players aged between 15 and 44 years. The general group consisted of general computer users, and the domain-specific group was selected from the participants of previous DU experimentation with manual DU evaluation techniques [10,23], as they experienced DU issues first-hand.

The questionnaires consisted of two parts. The first part contained five DU aspects represented by visual examples—screenshots from domain-specific UIs. To ensure that participants understood the issues, supplementary textual explanations were provided. The task of the participants was to study the provided examples and rate the importance of a particular DU aspect using a number from the Likert scale [26] with a range from 1 to 5 (1 being the least important).

In the second part, the task was to order the five aspects of DU from the least to the most important. The questionnaires given to the general and domain-specific group can be found at: http://hornad.fei.tuke.sk/~bacikova/domain-usability/surveys (accessed on 9 August 2021). Details about the surveys can be found in [24].

We merged the results of the first (rating) and second (ordering) part of the domain-specific questionnaire and computed the weight of each aspect. Therefore, we can substitute the weights *w<sub>x</sub>* (where *x* ∈ {*dc*, *ds*, *c*, *eb*, *l*}) in Equation (2):

$$e = 2.9\, n_{dc} + 2.6\, n_{ds} + 2.6\, n_{c} + 1.7\, n_{eb} + 1.54\, n_{l} \tag{3}$$

Equation (1) then represents *the metric of DU considering its aspects*, with the result as a percentage. To interpret the results, evaluators can follow Table 1. The interpretation corresponds to the scale on which the participants rated the particular aspects in the surveys.
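As a minimal illustration of Equations (1)–(3), the metric can be computed from the term total *n* and the per-aspect issue counts; the function name and the example counts below are hypothetical:

```python
def du_score(n, n_dc, n_ds, n_c, n_eb, n_l):
    """DU score per Equations (1)-(3), as a percentage.
    n is the total number of terms; the remaining arguments are the
    issue counts for the five DU aspects."""
    # Weighted error sum, Equation (3):
    e = 2.9 * n_dc + 2.6 * n_ds + 2.6 * n_c + 1.7 * n_eb + 1.54 * n_l
    # Equation (1), clamped at zero since weighted issues can exceed n:
    return max(0.0, 100.0 * (1 - e / n))

# A hypothetical UI with 50 terms, 2 domain content issues, 1 language error:
print(round(du_score(50, 2, 0, 0, 1, 0), 2))  # 85.0
```

Note how the `max(0, …)` clamp matters: with enough weighted issues, the unclamped expression drops below zero, so a UI where every component has multiple issues simply scores 0%.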

**Table 1.** Interpretation of the rating computed via the proposed DU metric.


#### **3. Automatic Evaluation of Domain Usability Aspects**

In this section, we will analyze the boundaries of DU evaluation automation and the possibilities related to individual DU aspects. We explain the design of an automated approach to DU evaluation at a high level of abstraction.

#### *3.1. Domain Content and Specificity*

Domain content and specificity are the most difficult aspects to evaluate in a domain-specific application. Since a particular UI is usually designed for a domain expert, the domain content in the UI must be specific to the particular domain. Because of the ambiguity of natural language, the line determining whether a given word pertains to a particular domain may be very thin. We admit that evaluation performed by a domain expert should be considered the most appropriate in such cases. However, when no expert is available or when first UI prototypes are to be evaluated, automated evaluation might be a helpful, fast, and cheap way to remove issues in the early stages. We will try to outline a situation in which such an automated evaluation would be utilized.

Imagine we have an existing user interface that has been used in some specific domain for ages. Although this UI is usable, the technology it uses has become obsolete. The time has come to develop and deploy a new application version. The technologies will change, but for the domain users to accept the new UI, at least the terminology should stay consistent with the previous version. However, testing the whole UI for domain-related content manually is a time-consuming, attention-demanding, and tiresome task. It would be helpful to have an automated way to compare both UIs. Suppose there is a way of extracting the terminology of both UIs into a formal form (e.g., an ontology). Then it would be possible to compare the results using a comparator tool. The result of the comparison would show the following:


Illogical changes are the following: (i) from a text input component (e.g., text boxes and text areas) to a descriptional component (e.g., labels) and vice versa, (ii) from a textual to a functional component (e.g., buttons and menu items) and vice versa, (iii) from a functional to a descriptional component and vice versa, and (iv) from a grouping component (containers, button groups, etc.) to other types of components and vice versa. For example, the term "Analyze results" was represented by a button in the old UI, but in the new UI it is a label—i.e., the representing component changed its type from functional to descriptional. When checking the mentioned type changes, we can confirm the term against its representing component in the old and new UI versions.

The scenario described above is rather specific for situations in which there are two versions of the particular UI—whether it is an old UI and a new one, or two separate UIs from the same domain are developed by different vendors. However, when the UI is freshly designed specifically for the particular business area, there is usually only one UI available. In this case, some other source of ontological information is needed, which may be:


In these cases, the feasibility of analysis strongly depends on the reference resources. The disadvantage of the first option is the necessity of the reference ontology, which would have to be created manually by the domain expert. On the other hand, such a manually created ontology would be of higher quality than an automatically generated one, presumably having defined all necessary domain objects, properties, and relations. Thus, it would be easier to check the correctness of the target UI than by applying the approach as with two UIs, since it is usually not possible to extract 100% of data from both UIs.

As for ontological dictionaries or web search, again, the analysis strongly depends on the resources. Current ontological dictionaries are quite good, but their size is limited, and their ontologies are not very usable in any specific domain. It would be best to have a domain-specific ontological dictionary. Since we assume that domain-specific ontologies [27] will grow in both size and quality in the future, the approach proposed here will become applicable with even greater value.

Current technologies and resources allow us only to use general ontologies to check *hierarchies of terms for linguistic relations using natural language processing*. Let us take an example of a *Person* form. The form has a list of check-box buttons for selecting a favorite color with values *red*, *yellow*, *blue*, and *green*. The task is to check whether the parent–child relation between the *Favorite color* term and individual color values is correct (Listing 1).

```
favoriteColor {children}: [
    red
    yellow
    blue
    green
]
```
**Listing 1.** Hierarchy of terms for selecting favorite color in the domain dictionary of the Person form.

From the linguistic point of view, *Favorite color* is a hypernym of the individual color values (or conversely, the latter are hyponyms of *Favorite color*). Similar relations are *holonymy* and *meronymy*, which represent a "part" or "member" relationship.

Provided that we can automatically determine the hierarchy of terms in the UI (we know that components labeled by the color names are hierarchically child components of the container labeled by the term *Favorite color*), we can check whether these linguistic relations exist between the parent term and its child terms.

Existing available ontological dictionaries (such as WordNet) usually provide word–attribute relations, including linguistic relations such as hyponymy and holonymy. In the domain analysis process, all children and parents should be checked from the linguistic point of view, but mainly enumerations, button groups, and menu items, because they are designed with the "grouping" relation in mind. The same can be achieved by using web search instead of ontological dictionaries (more on using web search in Section 5.2).
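The parent–child check can be sketched as follows. The hypernym map below is a toy stand-in for WordNet or web search results (in the real algorithm each entry would be filled from the hypernyms of the word's synsets); the function name is hypothetical:

```python
# Toy hypernym map standing in for WordNet/web search lookups.
HYPERNYMS = {
    "red": {"color", "chromatic color"},
    "yellow": {"color", "chromatic color"},
    "blue": {"color", "chromatic color"},
    "green": {"color", "chromatic color"},
}

def parent_is_valid(parent, children):
    """Check that some hypernym shared by all child terms occurs
    in the parent term; otherwise a warning would be raised."""
    shared = set.intersection(*(HYPERNYMS.get(c.lower(), set()) for c in children))
    return any(h in parent.lower() for h in shared)

print(parent_is_valid("Favorite color", ["red", "yellow", "blue", "green"]))  # True
print(parent_is_valid("Size", ["red", "yellow"]))  # False
```

When the check fails, the shared hypernyms themselves (here, "color") are natural candidates to suggest as alternative parent terms.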

As the last process of checking UI domain content, we propose to check the presence of *tooltips*. A tooltip is a small description of a graphical component, which explains its functionality or purpose. Tooltips are displayed after the mouse cursor rests over the component for a short time. Tooltips are often not necessary for general-purpose components, e.g., the OK, Cancel, Close, or Reset buttons. However, they can be extremely important for explaining the purpose of *domain-specific* functional components (components performing domain-specific operations) or when the description would take too much space if put on the component's label. Our experiment with open-source applications [25] showed that developers almost never use tooltips for functional components, even in cases when their label is not quite understandable even for domain-specific users. The common cases are acronyms and abbreviations used when the full name or description of the domain operation would take too much space on the display.

#### *3.2. Consistency*

All domain terminology should be checked for consistency and, thus, marked for checking. We can search for equal terms with case inconsistencies (Name-NAME-naMe) and/or similar terms (Cancel, Canceled) and their case inconsistencies.

*Note*: currently, it is not possible to automatically evaluate the so-called *feature consistency*, i.e., whether the same functionality is represented by the same term. The reason is the inability of current technologies to make this information available programmatically.

#### *3.3. Language Barriers and Errors*

Language errors and the completeness of language alternatives can be checked using standard spell-checking methods. For example, dictionaries (e.g., those bundled with open-source text editors such as OpenOffice) may be leveraged to mark all incorrect and untranslated words, similarly to spell checking in modern text editors.

#### **4. Prerequisites**

In order to analyze the domain dictionary in any application, the means of extracting that dictionary into a formal form is necessary. For this extraction, we can use the DEAL (Domain Extraction ALgorithm) method described in [28,29].

In this section, we will describe the DEAL tool needed for extracting domain information from existing user interfaces. We also describe the design and implementation of supplementary algorithms that we implemented into DEAL to be able to focus on DU issues, namely:


#### *4.1. DEAL Method*

DEAL (Domain Extraction ALgorithm) (https://git.kpi.fei.tuke.sk/michaela.bacikova/DEAL; accessed on 9 August 2021) is a method for extracting domain information from user interfaces of applications. Its implementation currently supports Java (Swing), HTML, and Windows applications (\*.exe). The Windows application analyzer utilizes the output of Ranorex Spy (https://www.ranorex.com/help/latest/ranorex-studio-advanced/ranorexspy/introduction/; accessed on 9 August 2021), which means it supports programs that are analyzable by Ranorex. The list of supported components is located at https://git.kpi.fei.tuke.sk/michaela.bacikova/DEAL/-/wikis/analyzing-windows-applications (accessed on 9 August 2021).

Except for loading the input application, the whole process is fully automated and takes place in two phases: *Extraction* and *Simplification*. The result of the *Extraction* phase is a domain model in the form of a graph. Nodes of the graph correspond to terms (concepts) of the analyzed user interface. Each such node contains information about:


The extraction is followed by the *Simplification* phase, where structural components without domain information (e.g., panels and containers) are filtered out unless they are necessary to maintain the term hierarchy.

Properties of the terms and their hierarchy are used to check for the missing domain information in order to identify incorrect or missing data types and lexical relations between terms such as hyponymy, hypernymy, holonymy, and meronymy.

For example, let us have a form for entering a person's data, such as name, surname, date of birth, marital status, or favorite color. The *Person* dialog contains the fields for entering the data. The resulting domain model can be seen in Listing 2. It contains the term *Person* with child nodes corresponding to the fields of the form. The *status* term has the *enumeration* type with mutually exclusive values because in the UI it is represented by multiple options as radio buttons. The *favorite color* term, on the other hand, uses check-box components, so it contains *child terms* with all offered values, and they are not mutually exclusive. The *Person* term also contains children corresponding to functional components, e.g., menu items or buttons (such as *OK* or *Close*). A similar graph of terms is created for every window in the user interface.

DEAL is able to export this hierarchy into the standard OWL ontological format.

```
domain: 'Person' {children}: [
   'Name' {string}
   'Surname' {string}
   'Date of birth' {date}
   'Status' {mutually-exclusive}
       {enumeration}[
          'Single'
          'Married'
          'Divorced'
          'Widowed'
       ]
   'Favorite color' {mutually-not-exclusive}
       {children}: [
      'red'
      'yellow'
      'blue'
      'green'
   ]
   'OK'
   'Close'
   'Reset'
]
```
**Listing 2.** The domain model of the Person form.

#### *4.2. General Application Terms Ontology*

In the Person form example in Listing 2, we have three terms (represented by three buttons) not related to the domain of Persons. If we are to analyze *domain* objects, properties, and relations, we need to filter out any terms potentially unrelated to the domain. To do so, we will use a new reference ontology that will list domain-independent general-purpose terms commonly used in applications, their alternatives, and their forms.

We built this ontology manually by analyzing 30 open-source Java applications from SourceForge, 4 operating systems and their applications (system applications, file managers, etc.), and 5 software systems from the domain of integrated development environments (IDEs). The specific domain of IDEs was selected to observe and compare the occurrence of domain-specific versus general application terms. We listed and counted the occurrence of all terms in all analyzed UIs. Then, we selected only those that had an occurrence rate over 50%.

The list of the most common terms can be seen in Table 2 (the *General Application Terms Ontology* can be found at https://bit.ly/2R6bm6p; accessed on 9 August 2021). Based on this ontology, we implemented an additional module in DEAL, which automatically filters out such terms from the domain model immediately after the Extraction and Simplification phases and prior to the DU evaluation process.
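The filtering step itself is a simple set-membership test. In the sketch below, the term set is an illustrative subset only (the real ontology contains every term with an occurrence rate over 50%, including alternatives and word forms):

```python
# Illustrative subset of the General Application Terms Ontology.
GENERAL_TERMS = {"ok", "cancel", "close", "reset", "file", "edit", "view", "help"}

def filter_general(terms):
    """Remove domain-independent general-purpose terms from the domain
    model, keeping only candidate domain-specific terms."""
    return [t for t in terms if t.lower() not in GENERAL_TERMS]

print(filter_general(["Name", "Favorite color", "OK", "Close", "Reset"]))
# ['Name', 'Favorite color']
```

Applied to the *Person* form of Listing 2, this removes the *OK*, *Close*, and *Reset* buttons before the DU analyses run.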

The analysis of application terms in a specific domain showed that domain-specific terminology is more common there than general application terms.

#### *4.3. Recognizing Form Stereotypes*

Another drawback of the DEAL method is its insufficient form analysis. In more than 50 open-source applications we have analyzed, the most common problem was missing references between the actual form data components (i.e., text fields) and their textual labels readable in the UI. Such a missing reference causes a component to be extracted without any label or description; therefore, it has no term to be represented by. As a result, it is filtered out in DEAL's domain model Simplification phase as a component with no domain-specific content and is therefore excluded from the consecutive analyses.


**Table 2.** List of the most frequently occurring terms in UIs (the vertical bar character '|' denotes alternatives).

However, such components are necessary to determine the data type of their input values, which is reflected in the domain model. For example, in Listing 2, *Name* is of data type *string* and *Date of birth* is of data type *date*.

For the developers of component-based applications, it is usually possible to set a "labelFor" (Java) or "for" (HTML) attribute of the label component (from this point, we will refer to this attribute as *labelFor*). However, since this attribute is not mandatory in most programming languages, the result is usually a large number of components with no label assigned.

To solve this issue, we designed a *Form Stereotype Recognition (FSR) algorithm* to recognize form stereotypes in target UIs and implemented it into the DEAL tool.

Prior to the implementation, we manually analyzed the source code of existing user interfaces for the most common *form stereotypes*. We selected web applications instead of desktop ones for better accessibility and higher occurrence of forms. Thirty web applications were analyzed, and we focused on registration forms, login forms, and their client applications. Based on the analyzed data we identified the five most common form stereotypes shown in Figure 1.


common in modern web applications, although it is marked as less usable. In this case, there is rarely any other label around the form component.


**Figure 1.** The most frequent form stereotypes.

The FSR algorithm analyzes these form stereotypes in the target UI, and based on the identified stereotype, it assigns a label to each form data component. In short, the main principle of the FSR algorithm is to find all form components around each particular label in a form container. Then for all labels (excluding the ones that have the *labelFor* attribute set), the FSR counts the number of components around them as displayed in the UI. The resulting form stereotype is the direction in which the largest number of form components is located relative to each label. If there is no explicit maximum (e.g., five components have labels on their left and five other components have labels on their right), then the form stereotype cannot be identified and is marked as MIXED.
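The stereotype decision reduces to counting the direction in which form components cluster around each label and requiring a unique maximum. A minimal sketch (the function name and direction labels are illustrative):

```python
from collections import Counter

def recognize_stereotype(label_directions):
    """label_directions holds, for each label without a labelFor
    attribute, the direction in which most nearby form components lie
    (e.g., 'LEFT', 'RIGHT', 'ABOVE'). The stereotype is the direction
    with the unique maximum count; ties yield 'MIXED'."""
    counts = Counter(label_directions)
    top_count = max(counts.values())
    top = [d for d, c in counts.items() if c == top_count]
    return top[0] if len(top) == 1 else "MIXED"

print(recognize_stereotype(["LEFT"] * 6 + ["ABOVE"] * 2))  # LEFT
print(recognize_stereotype(["LEFT"] * 5 + ["RIGHT"] * 5))  # MIXED
```

The second call reproduces the tie case from the text: five labels on the left and five on the right leave no explicit maximum, so the form is marked as MIXED.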

The targets of the FSR algorithm are common form components, namely:


If the target container was identified as a form stereotype, FSR pairs the form components with their labels by defining their *labelFor* attribute. This step also enables us to mark all form components that have no automatically assignable label and represent them as *recommendations* for correction to the user. If there is any label that has no stereotype, then it is considered a *usability issue*, and a recommendation for assigning a label to the most probable component (closest according to one of the possible stereotypes) is displayed. An example of both issues can be seen in Figure 2 extracted from the OpenRocket (https://sourceforge.net/projects/openrocket/; accessed on 9 August 2021) user interface.

By using the FSR algorithm, we were able to successfully recognize the correct stereotypes of most of the tested form components.

**Figure 2.** DEAL—Example of a recommendation indicating the successful recognition of a form stereotype and an issue because of a missing *labelFor* attribute. The domain model shown in this figure was extracted from OpenRocket.

#### **5. ADUE Method**

The ADUE method uses the techniques mentioned in Sections 3 and 4. To sum up, we propose the following approaches to the automatized analysis of DU:


In the next subsections, we describe each of the methods in more detail (except the form analyzer that was already explained in Section 4.3). We use example applications to explain each approach and show the identification of usability issues and recommendations for fixing them.

#### *5.1. Ontological Analysis*

As mentioned in Section 3.1, the first option is to use two ontologies extracted from new and old application versions. In case there is only one ontology, only specificity (Section 5.2) and grammar evaluation (Section 5.3) are executed for this ontology. If there are two ontologies, both specificity and grammar evaluations are performed on the newer one along with ontological comparison. Now we will describe the ontological comparison approach.

The process is depicted in Figure 3. For technological reasons, DEAL is able to run only one application at a time; therefore the ontology extraction happens in two steps. First, we use the DEAL tool to extract domain information from the first application without any DU analysis, and export it into an ontological format (the top-left part of Figure 3). Then, we run DEAL again with the new application version (the top-right part), import the previously extracted ontology, and run the ADUE comparison and evaluation algorithm (the bottom part of Figure 3). The ontology evaluation results are then displayed to the user.

Each item in an extracted ontology represents one component of the UI, and it contains:


**Figure 3.** ADUE method—a high-level overview of the ontological evaluation process with two ontology versions. Processes are marked as ellipses, data as rectangles.

The algorithm compares the original ontology with the new one, searching for new, deleted, changed, and retained elements. We consider two elements equal if all their attributes (text, ID, class, parent, children) are equal. As a part of the evaluation, we consider the impact of the changes as follows:


The whole process is noted as the "Evaluation process" in Figure 3.
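The comparison step can be sketched as follows. For brevity, each ontology here maps an element ID to its remaining attributes (in the paper, equality requires all attributes, including the ID, to match); the element names are hypothetical:

```python
def compare_ontologies(old, new):
    """Classify elements of two extracted ontologies as added, deleted,
    changed, or retained. old/new map element IDs to attribute dicts
    (text, class, parent, children)."""
    result = {"added": [], "deleted": [], "changed": [], "retained": []}
    for eid, attrs in new.items():
        if eid not in old:
            result["added"].append(eid)
        elif attrs == old[eid]:
            result["retained"].append(eid)
        else:
            result["changed"].append(eid)
    result["deleted"] = [eid for eid in old if eid not in new]
    return result

old = {"btnAnalyze": {"text": "Analyze results", "class": "Button"},
       "lblName": {"text": "Name", "class": "Label"}}
new = {"btnAnalyze": {"text": "Analyze results", "class": "Label"},  # type changed
       "lblName": {"text": "Name", "class": "Label"},
       "btnExport": {"text": "Export", "class": "Button"}}
print(compare_ontologies(old, new))
```

In this example, the changed `class` of `btnAnalyze` (functional to descriptional) is exactly the kind of illogical type change the evaluation flags as a potential issue.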

All results are stored in a list and then displayed to the evaluator in the UI. There, the user can see a list of all terms in the application. After selecting a specific term, details about the changes between the old and new ontology versions are shown, along with an error or a warning in case a potential issue was found.

After the comparison, *specificity evaluation* (Section 5.2) and *grammar evaluation* (Section 5.3) are performed on the new ontology version.

#### *5.2. Specificity Evaluation*

The goal of the *specificity evaluation* is to linguistically verify hierarchical relations found in the user interface. It uses ontological dictionaries and web search as a source of linguistic relations.

The algorithm traverses all grouping elements in the domain model graph. For each group, it selects the names of child terms and creates a *child word set*. From each child word set, we remove all forms of reflexive pronouns and auxiliary verbs (is, are, have, etc.) to get more precise results. The algorithm also uses natural language processing to recognize the word class of each word and keeps only nouns, verbs, and adjectives.
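Building a child word set can be sketched as below. The part-of-speech step (keeping only nouns, verbs, and adjectives) is stubbed with a toy tag map instead of a real NLP library, and the auxiliary-verb list is abbreviated:

```python
AUXILIARY = {"is", "are", "have", "has", "be", "was", "were"}
# Toy POS tags standing in for natural language processing output:
POS = {"red": "adj", "yellow": "adj", "color": "noun",
       "the": "det", "myself": "pron"}

def child_word_set(child_terms):
    """Collect cleaned words from the child terms of one grouping element."""
    words = set()
    for term in child_terms:
        for w in term.lower().split():
            if w in AUXILIARY or POS.get(w) not in {"noun", "verb", "adj"}:
                continue  # drop auxiliaries, pronouns, and other word classes
            words.add(w)
    return words

print(sorted(child_word_set(["red", "yellow", "the color is red"])))
# ['color', 'red', 'yellow']
```

The resulting word set is what the OD&GS algorithm described next queries against WordNet, Urban Dictionary, and Google search.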

We use the *Ontological Dictionaries and Google Search evaluation algorithm (OD&GS)* to get a list of the *most probable parent terms* (hypernyms or holonyms) for each child word set. The algorithm combines three sources: WordNet, Urban Dictionary, and Google web search. To optimize the results, it defines the following order in which the sources are utilized:


After that, the *OD&GS* algorithm returns the list of possible parent terms. The number of results is limited to nine. This number was determined empirically based on the number of correct results in our experiments with the terminology of multiple existing UIs.

For each child term set, it is checked if the parent of the set is found in possible parent terms generated by *OD&GS* algorithm. If it is not the case, a warning is shown, and terms obtained by the *OD&GS* are suggested as alternatives.

The results of the *OD&GS* algorithm strongly depend on the quality of the used ontological dictionaries. In the next sections, we explain how each of the data sources is used.

#### 5.2.1. WordNet

WordNet (https://wordnet.princeton.edu; accessed on 9 August 2021) is a dictionary and a lexical database. The dictionary provides direct and inherited hypernyms as a part of word definition for nouns, adjectives, and verbs. As a query result, WordNet returns so-called *synsets*, containing the information about words including the given word class. We filter out synsets with different word classes compared to the child word. To ensure higher accuracy of the results, we include only direct hypernyms. As a result, we construct a list of hypernyms for each child word set.

#### 5.2.2. Urban Dictionary

Urban Dictionary (http://www.urbandictionary.com; accessed on 9 August 2021) is a crowdsourced dictionary. For each queried word, it returns the seven most popular definitions based on the votes of Urban Dictionary users. For each query, we collect all meaningful words from the definitions. The words are sorted by the frequency of their occurrence. The result is a list of the words with the highest frequency, which can be considered possible parent terms.
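The frequency ranking over collected definitions can be sketched as follows (the stopword list is an abbreviated illustration of "meaningful word" filtering, and the limit of nine matches the OD&GS result limit):

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "is", "of", "and", "to", "it", "that"}

def probable_parents(definitions, limit=9):
    """Collect meaningful words from dictionary definitions and rank
    them by frequency; the top entries are candidate parent terms."""
    words = re.findall(r"[a-z]+", " ".join(definitions).lower())
    counts = Counter(w for w in words
                     if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(limit)]

defs = ["red is a warm color", "the color of blood", "a primary color"]
print(probable_parents(defs))
```

With these three toy definitions for *red*, "color" occurs most often and therefore ranks first among the candidate parent terms.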

#### 5.2.3. Google Web Search

While Google is not a linguistic tool, the current state of its multi-layered semantic network, the *Knowledge Graph* [30,31], makes it possible to obtain quite accurate results confirming linguistic relations such as hyponymy, hypernymy, meronymy, and holonymy by using web search queries. The efficiency of Google's data collection enables its semantic network to grow to gigantic dimensions compared to any other semantic network, including WordNet and Urban Dictionary; for that reason, we see greater potential in web search than in current ontological dictionaries.

Based on our tests, Google search provides the most precise results compared to other sources we have used. On the other hand, it is not very suitable for automated requests. Because the Google web search approach provides results with high reliability, we present it in this paper despite the restrictions.

To search potential parent terms, we use two queries with the list of child words:


For example: "red, green, blue, brown are common values for" or "red, green, blue, brown are".

We parse the returned HTML documents and count the most common words. The probability of each word in the result is based on the frequency of its occurrence. Additionally, we ignore words of a different word class from the class of child words.

To verify the gained results we use the reverse queries for each child word: "is {a possible parent term} value/kind of {word}", for example, "is color kind of blue", "is color kind of yellow".

The number of occurrences of both words found in the resulting HTML page is used to determine the probability of the found words being the correct parent terms for the particular child word set. If there is low or no occurrence of a particular pair, this pair has the lowest probability in the result list.

#### *5.3. Grammar Evaluation*

There are two common grammatical issues occurring in user interfaces: an incorrectly written word (a typo), or a word that was not translated into the languages of the user interface. The second case is especially common in applications that are localized in multiple languages.

For this reason, the usual spell checking is supplemented with translation checking. If a word is not found in the dictionary for the current language, the algorithm checks the default language (usually English). If it is found there, its translations are added to the recommended replacements. Otherwise, the recommendations are based on similar words, in the same way as in modern text editors. In the end, a list of recommended corrections is provided to the evaluator.
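The combined check can be sketched as below; the function name and the tiny Slovak/English dictionaries are hypothetical placeholders for the real OpenOffice word lists:

```python
import difflib

def check_word(word, current_dict, default_dict, translations):
    """Spell and translation check for one UI term.
    current_dict: words of the UI language; default_dict: words of the
    default language (English); translations: known translations of
    default-language words into the UI language."""
    w = word.lower()
    if w in current_dict:
        return ("ok", [])
    if w in default_dict:
        # a correct default-language word that was left untranslated
        return ("untranslated", translations.get(w, []))
    # otherwise a typo: recommend similar words, as modern text editors do
    return ("typo", difflib.get_close_matches(w, current_dict))

# Hypothetical Slovak UI dictionary with an untranslated word and a typo:
sk = {"meno", "priezvisko", "farba"}
en = {"name", "surname", "color"}
print(check_word("name", sk, en, {"name": ["meno"]}))  # ('untranslated', ['meno'])
print(check_word("farba", sk, en, {}))                 # ('ok', [])
```

A misspelling such as "mena" would fall through to the typo branch and receive "meno" as a similarity-based recommendation.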

#### *5.4. Tooltip Analysis*

The Tooltip analysis algorithm (TTA) selects all *functional* terms, i.e., terms extracted from functional components, from the domain model. Then for every such term, the presence of a tooltip is checked—either by inspecting the representing component or by checking the description property of the term node, where the component's tooltip text is usually stored. If no tooltip is found, this information is added to the list of warnings, and we advise the developer to add it.
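A minimal sketch of TTA over the extracted domain model (representing each term as a hypothetical `(name, is_functional, tooltip)` triple):

```python
def tooltip_warnings(terms):
    """terms: (name, is_functional, tooltip) triples taken from the
    domain model after general-purpose terms have been filtered out.
    Returns a warning for each functional term lacking a tooltip."""
    return [f"'{name}': functional component has no tooltip"
            for name, is_functional, tooltip in terms
            if is_functional and not tooltip]

model = [("Analyze results", True, None),
         ("Name", False, None),                      # not functional
         ("Export", True, "Export data to a file")]  # tooltip present
print(tooltip_warnings(model))
# ["'Analyze results': functional component has no tooltip"]
```

Only the functional term without a tooltip is reported; descriptional terms and components that already carry a tooltip pass silently.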

Because general-purpose components (*OK, Open, Save, Exit, Cancel*, etc.) are common, frequently used, and generally understood, we presume that the importance of tooltips for such components is very small. Their purpose is clear from their description and/or icon. For this reason, we only analyze domain-specific components. General-purpose components are removed in the DEAL's *Extraction* phase using the general application terms ontology described in Section 4.2.

If no tooltip is found for some functional component, the result is displayed to the evaluator in one of two ways:


An example of the usability issue and its report to the user can be seen in the JSesh interface menu items (Figure 4) where there are two items with no visible textual information and/or tooltip.

**Figure 4.** Example of JSesh menu items both without a tooltip and label (**top**) and a usability issue reported to the user (**bottom**).

#### **6. Prototype**

All processes mentioned in Section 5 were implemented and integrated into the DEAL tool. The results of tooltip and form stereotype analysis are displayed as tooltips in the DEAL's domain model as seen in Figures 2 and 4.

The process of domain usability evaluation can be activated using a menu item in DEAL. Results of the analysis are displayed as errors (highlighted with red color) and recommendations (highlighted with orange) in the DEAL's component tree. Recommendations for corrections are displayed in tooltips. DEAL enables us to look up any component in the application by clicking on it in the component tree. As a result, the component is highlighted by the yellow color directly in the analyzed application. This way the analyst can locate the component needing the recommended modification.

Ontological evaluation, grammar evaluation, and specificity evaluation are implemented in a tool called ADUE (Figure 5), which can be started directly from DEAL or as a standalone process. In the case of starting from DEAL, the newest ontology is automatically extracted from the application currently analyzed by the DEAL tool. In the latter case, both ontologies (old and new) have to be imported manually.

When running the process with only one ontology, then only grammar and specificity evaluation is performed, and results are displayed only in the right column.

When loading two ontologies, the former processes are performed on the newer ontology as an additional process, and both ontologies are compared. Results are similar to one ontology analysis, but in the left column, we can see the components (terms) in the older application.


**Figure 5.** The ADUE evaluation tool displaying the results from comparing two sample applications.

Different types of errors are displayed using colors. Red is used for grammar errors. Orange means an incorrectly defined parent term (hypernym, holonym). Recommendations are displayed in a tooltip. The pink color is used for illogically changed components. The evaluator can also see all terms that were retained, added, deleted, or changed. In all cases, we display recommendations for change in the *Table of suggestions* (bottom right).

We used the metrics described in Section 2 to calculate the overall DU score of the evaluated user interface (the percentage in the bottom part of Figure 5). The errors are included in the DU as follows:


As explained in the paper, we were not able to analyze consistency issues, and world language issues are indistinguishable from grammar errors; therefore, the number of errors for these two aspects remains 0 and does not affect the DU score calculation.

#### *ADUE for Java Applications*

To be able to extract data from Java applications, DEAL uses Java *reflection* and *aspect-oriented programming* (AOP). Load-time weaving with AOP enables us to analyze even applications with custom class loaders, which would be problematic using simple reflection. There are still limitations in some cases; e.g., AOP is not able to weave directly into Java packages such as *javax.swing*. Weaving directly into the JDK source code, and thus creating our own version of Java to run the target application, would solve the issue.

To extract, traverse, and compare ontologies, we used the *OWL API* library (https://github.com/owlcs/owlapi/wiki; accessed on 9 August 2021). As a dictionary in the grammar evaluation, we used the US English dictionary from the *OpenOffice* text editor (https://www.openoffice.org; accessed on 9 August 2021). We chose this dictionary because of the simple textual format with words separated by newline characters and because it can be freely edited and complemented by new words. In the same package, there are also multiple languages available, so they can be used for the evaluation of applications in other languages. To check the grammar, the *JAZZY* library (http://jazzy.sourceforge.net; accessed on 9 August 2021) was used. After identifying a typo in a text, *JAZZY* returns multiple replacement recommendations of the incorrect word. For natural language processing needed in the specificity evaluation, we used the *Apache OpenNLP* library (https://opennlp.apache.org; accessed on 9 August 2021), which can identify the word classes such as verbs, nouns, or adjectives. To query the WordNet dictionary, the *JAWS* library (https://github.com/jaytaylor/jaws; accessed on 9 August 2021) was used. Urban Dictionary does not provide a special API for machine usage. Therefore, we used standard HTTP GET requests to query the dictionary and then analyzed the source code of the response pages statically. To query the Google search engine, we used the publicly available API (https://developers.google.com/custom-search/v1/overview; accessed on 9 August 2021).

Ontologies were used because of their good support for export and comparison engines. However, in our approach, we consider the main *limitation* of ontologies to be the inability to use special characters and spaces in identifiers. When comparing ontologies, this does not represent a problem; however, when analyzing grammar and specificity, it is usually the main issue.

#### **7. Evaluation**

In this section, we will assess the possibility of using ADUE on existing applications. Our main questions are whether ADUE is applicable to real-world programs and to what degree these programs contain domain usability errors.

#### *7.1. Method*

Since the implementation of ADUE for Java program analysis is the most mature one, we used several open-source Java GUI applications as study subjects. To obtain such applications, we utilized the SourceForge website (http://sourceforge.net; accessed on 9 August 2021). We selected programs from diverse domains and of various sizes to maximize generalizability. To simplify the interpretation of the results, we focused only on applications in the English language.

Specifically, the following applications were used to evaluate the ADUE prototype: Calculator, Sweet Home 3D, FreeMind (2014), FreePlane (2015), Finanx, JarsBrowser, JavaNotePad, TimeSlotTracker, Gait Monitoring+, Activity Prediction Tool, VOpR (a virtual optical rail), GDL Editor 0.9, and GDL Editor 0.95. The specific versions of the applications can be downloaded from https://git.kpi.fei.tuke.sk/michaela.bacikova/DEAL/-/tree/master/DEALexamples/examples (accessed on 9 August 2021).

We executed the complete analysis using our implemented ADUE tool and recorded the results. The form stereotype analysis, tooltip detection, grammar error evaluation, parent term evaluation, and the overall domain usability computation were executed on all applications. For some of the applications, we performed an ontology comparison between two different versions (GDL Editor 0.9 and 0.95) or editions (FreeMind and FreePlane). We also recorded the execution times of the analysis process. All results were written to a spreadsheet.

#### *7.2. Results*

We were able to successfully execute ADUE on all mentioned applications. Table 3 presents an overview of the obtained results. For each application, we can see the number of extracted terms and the different kinds of errors and warnings detected by the ADUE prototype. There is also a weighted number of errors (*e*) calculated using Equation (3) and the final domain usability index (*du*). The results of the two-ontology comparison are available in Table 4. The complete results can be viewed via Google Sheets using the following URL: http://bit.ly/3hZBImy (accessed on 9 August 2021).

**Table 3.** Results of the evaluation (applications where ontology comparison was used are marked with \*).


**Table 4.** Results of the ontology comparison.


#### 7.2.1. Tooltip Analysis

By using the tooltip verifier process, we extracted 136 components per application on average. Of those, 52 function components per application on average had no tooltip defined (38%), of which 46 were recommendations (34%) and 6 were errors (4%). We manually checked the components associated with the errors and confirmed that these issues were correctly identified.
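The reported percentages follow directly from the per-application averages; a quick arithmetic check:

```java
// Quick arithmetic check of the reported tooltip statistics: each percentage
// is a per-application average divided by the average component count (136).
public class TooltipStats {

    // Rounded percentage of part in total.
    public static long pct(double part, double total) {
        return Math.round(100.0 * part / total);
    }

    public static void main(String[] args) {
        System.out.println(pct(52, 136)); // components with no tooltip -> 38
        System.out.println(pct(46, 136)); // recommendations -> 34
        System.out.println(pct(6, 136));  // errors -> 4
    }
}
```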

The results show that DU issues concerning tooltips are very common in applications. Developers are probably not fully aware that tooltips are necessary for application usability.

#### 7.2.2. Grammar and Specificity Evaluation

From each listed application, we extracted an ontology using the DEAL tool and performed the grammar evaluation on it. On average, we extracted 146 items per application from which 15 grammar errors and 11 incorrectly defined parents were identified.

Some of the detected issues represented acronyms, abbreviations, and proper nouns. It is questionable to what degree acronyms and abbreviations are comprehensible to the application users. A portion of the grammar errors was caused by the fact that we were using the US English dictionary, but some applications used British English (or possibly used a combination of US and British English, which is inconsistent).

#### 7.2.3. Ontological Comparison

The two-ontology evaluation was applied only to the FreeMind/FreePlane and GDL Editor 0.9/0.95 applications since they are two versions of the same applications. As we can see in Table 4, numerous elements were added, deleted, or changed in the case of FreeMind/FreePlane since this version change represents a major redesign of the application. On the other hand, in GDL Editor, a smaller proportion of the terms was changed because this version update is minor.

Note that there were no incorrectly changed components detected in either application.
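The core of such a two-ontology comparison can be illustrated by simple set differences over the extracted terms. This is a simplification of the actual engine, which also detects changed components, and the terms below are made up:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

// Simplified sketch of a two-version term comparison: terms present only in
// the new version are "added", terms present only in the old one "deleted".
public class OntologyDiff {

    public static Set<String> added(Set<String> oldTerms, Set<String> newTerms) {
        Set<String> result = new TreeSet<>(newTerms);
        result.removeAll(oldTerms);
        return result;
    }

    public static Set<String> deleted(Set<String> oldTerms, Set<String> newTerms) {
        Set<String> result = new TreeSet<>(oldTerms);
        result.removeAll(newTerms);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical menu terms of two versions of an application.
        Set<String> oldVersion = new LinkedHashSet<>(Arrays.asList("File", "Open", "Quit"));
        Set<String> newVersion = new LinkedHashSet<>(Arrays.asList("File", "Open", "Exit"));
        System.out.println("added: " + added(oldVersion, newVersion));
        System.out.println("deleted: " + deleted(oldVersion, newVersion));
    }
}
```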

#### 7.2.4. Overall Domain Usability

As we can see in Table 3, the computed domain usability ranged from 5% to 96%. The computed mean value is 47%. Therefore, the variability of the overall domain usability among the analyzed applications is relatively large.

Applications with low computed domain usability tend to have mainly a high number of detected grammar errors, but also incorrectly defined parent terms and missing tooltips in places where they are necessary.

#### 7.2.5. Execution Time

The execution process of DEAL and ADUE includes the traversal of the GUI elements of the applications, querying web services, and other time-consuming operations. For this reason, we wanted to determine whether the execution time of the domain usability evaluation process is low enough for practical utilization.

According to our results, the execution time on the listed applications ranges from 0 s to 5 min and 6 s, with a mean of 1 min and 7 s. This means that automated domain usability evaluation could be potentially performed in a variety of contexts, including continuous integration (CI) builds.

#### *7.3. Examples of Issues*

To help the reader understand the nature of domain usability issues, we will now mention a few examples of specific issues found by ADUE.

OpenRocket is a model rocket simulator, containing buttons to zoom out and zoom in. Each of them contains an icon with a magnifying glass and a small sign "−" and "+", respectively. However, these buttons do not contain any textual label or a tooltip. ADUE suggests adding tooltips to these buttons.

OpenRocket also contains multiple sliders, e.g., to control the wind direction or various angles. Next to each slider, there is a numeric input field and a textual descriptive label. However, there is no programmatic connection between the label ("Wind direction:"), the numeric value ("0°"), and the graphical slider. ADUE reports the missing *labelFor* attributes.

An example of a questionable grammar error can be found in the financial calculator Finanx. It contains a list of languages, each translated into the corresponding language instead of English (e.g., Français instead of French). Technically, the word is incorrect, and it should be translated into the language of the application (English). On the other hand, in some contexts, e.g., UI language selection, it can practically help the user to find his or her language in the list, particularly if the person does not speak English.

#### *7.4. Threats to Validity*

Regarding the internal validity, a portion of the detected issues might have been false positives. To mitigate this threat, for selected analysis types, we manually verified a subset of the results to check their correctness. To improve grammar error detection, in the future, we should implement an option to add a word to the dictionary in ADUE, similarly to traditional spell-checking applications.

The largest threat to the external validity is the selection of applications, which might not be representative of the whole set of Java GUI programs. However, we tried to select applications from multiple different domains and ranging from small one-window utilities to complex software systems.

#### *7.5. Evaluation Conclusion*

From the results, we can conclude that ADUE can be successfully used on existing real-world Java applications with graphical user interfaces. The tool discovered many domain usability errors, including tooltip errors and warnings, grammar errors, and incorrect parent terms. The overall domain usability of the analyzed applications has high variability (5–96%), which points to the fact that developers are often not aware of domain usability problems, and we need to raise awareness of domain usability issues among them.

#### **8. Potential of Existing Methods for DU Evaluation**

After describing the results of the evaluation of our prototype, in the next two sections, we will try to put our work into the context of existing approaches and propose their extensions if suitable.

The goals of usability evaluation methods are usually to specify the requirements for the UI design, evaluate design alternatives, identify specific usability issues, and improve UI performance [16]. In this section, we will summarize existing general techniques of usability evaluation, and for some of them, we will propose modifications that could make them suitable to evaluate domain usability.

#### *8.1. Universal Techniques*

Simple, universally usable techniques that include users, such as *thinking aloud* [32], *question-asking protocol* [33], *performance measurement*, or *log file analysis* [34,35], can be easily altered to focus on the domain dictionary by just changing the questions or tasks included in the process to obtain the desired outcome. If there is a recording output, it can be analyzed with respect to DU. Informal or structured *interviews* and *focus groups* [36] might also be directed at the domain user dictionary by asking the participants (i) whether they understand particular terminology in the UI, (ii) whether they use it in their everyday work life in their own domain, and (iii) if not, what they would use instead.

#### *8.2. User Testing Techniques*

There are multiple types of *user testing* [37] differentiated by automation, distance from the user (in the room, in the observation lab, remote testing), and recording outputs (sound or image recording of user and/or screen, user logs, notes, software usage records, eye tracking, brain waves, etc.). All of them are usually connected by a more or less functioning system or prototype and users performing pre-prepared scenarios.

Possible alterations to the user testing technique are the following:


#### *8.3. Inspection Methods*

In general usability inspection methods described by Boehm et al. [38] and Nielsen and Mack [39], the expert in usability usually performs the inspection of guidelines, heuristic rules, product consistency, or standards compliance of a prototype.

#### 8.3.1. Specializations of General Methods

Narrowing down to DU, we propose the following alterations to the general techniques:

- different terms naming the same functionality or concepts (e.g., *OK* in one place, *Confirm* in another);
- same terms naming different functionality or concepts;
- consistency of uppercase and lowercase letters (e.g., *File, file, FILE*);
- consistency of term hierarchies, properties, and relations.

#### 8.3.2. Cognitive Walkthrough

As for the *Cognitive Walkthrough (CW)*, we propose an alteration of the latest method by Wharton et al. [40], marked by Mahatody et al. [41] as *CW3* (this notation will be used further in this subsection).

The evaluator in CW3 should imagine a specific scenario for each action that the target users must accomplish to achieve the completion of their task. To achieve the best results, again, the evaluator should be a domain expert. A scenario should also be credible according to Wharton et al. [40], which means that the user's background knowledge and the feedback from the interface should be justified when evaluating each action. When evaluating domain usability, we recommend focusing on the user's background and knowledge first.

We propose to answer the following supplements to CW3's questions [41] related to various user thoughts and actions (Note: Question Q1 remains unchanged, and our supplements are marked by italic font):


Given that the target user is the best source of domain knowledge, it would be possible to use an alteration of the "CW with users" approach by Gonz et al. [42]. However, it is questionable whether "CW with users" is still a CW, since the essence of CW techniques is evaluation by experts, excluding users. If available, we recommend using domain experts instead of target users.

#### *8.4. Inquiry*

Inquiry techniques are those that focus on user feedback. They include focus groups [43], interviews and surveys [44], questionnaires [45,46], and others. There are two categories of inquiry techniques we would like to focus on: in-system user feedback, and surveys and questionnaires.

#### 8.4.1. In-System User Feedback

General techniques are based on the user sending feedback in a form of recorded events [47], captured screens, or submitted comments. We propose the following techniques for evaluating DU:


#### 8.4.2. Surveys and Questionnaires

Most of the common standard usability surveys and questionnaires [48] are defined too generally to be usable for DU evaluation. This was the primary reason for our proposal of a novel SDUS (System Domain Usability Scale) technique in 2018 [10]. SDUS is based on the common standardized System Usability Scale (SUS) [49,50], which is widely used in the user experience evaluation practice.

Similarly to SUS, our proposal also includes a questionnaire with 10 statements targeting all DU aspects. We designed SDUS similarly to SUS, which means that odd-numbered items are positive statements and even-numbered items are negative ones. The answers are given on a standard five-point Likert scale (*1—Disagree, 5—Agree*). The overall DU metric is a sum of the values of all answers. The calculation of the SDUS score is the same as for the standard SUS [51].
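For reference, the standard SUS scoring that SDUS reuses can be expressed as follows: odd (positive) items contribute their answer minus one, even (negative) items contribute five minus their answer, and the sum of the ten contributions is scaled by 2.5 to a 0–100 range:

```java
// Standard SUS scoring, which SDUS reuses: odd items (index 0, 2, ...) are
// positive statements, even items (index 1, 3, ...) are negative statements.
public class SusScore {

    public static double score(int[] answers) { // answers[0] = item 1, Likert 1..5
        if (answers.length != 10) {
            throw new IllegalArgumentException("10 items expected");
        }
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            sum += (i % 2 == 0) ? answers[i] - 1 : 5 - answers[i];
        }
        return sum * 2.5;
    }

    public static void main(String[] args) {
        // Best possible answers: agree with positives, disagree with negatives.
        System.out.println(score(new int[] { 5, 1, 5, 1, 5, 1, 5, 1, 5, 1 })); // 100.0
    }
}
```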

#### *8.5. Analytical Modeling Techniques*

The goal of *GOMS (Goals, Operators, Methods, Selection rules)* analysis [52,53] is to predict user execution and learning time. Learning time is partially determined by the appropriateness of the terminology, but without a domain expert, it is not easy to evaluate it either automatically or manually. Calculating the overall *appropriateness* of terms [8] per system might provide a good view of the system's improvement since the last prototype.

In *cognitive task analysis* [54], the evaluators try to predict usability problems. We claim that it is partially possible to semi-automatically evaluate existing UIs to find potential DU problems. We propose several techniques to support this claim in Section 3.

*Knowledge analysis* is aimed at system learnability prediction. It is only logical that the more appropriate the domain content of the UI is, the more learnable it is. This relates not only to the terminology but also to icons, which should be domain-centric, especially in cases when the particular feature or item is domain-related. Several techniques proposed in Section 3 address this issue.

*Design analysis* aims to assess the design complexity. From the point of view of DU, the complexity of textual content in web UIs can already be assessed by multiple online tools such as Readable (http://readable.com; accessed on 9 August 2021). Readable evaluates a given text or URL and determines multiple reading complexity indices, including Flesch–Kincaid [55,56], Keyword Density, and similar.
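For instance, the Flesch–Kincaid grade level mentioned above is a simple formula over word, sentence, and syllable counts (with the standard coefficients 0.39, 11.8, and 15.59):

```java
// Flesch–Kincaid grade level computed from raw text counts:
// 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59.
public class Readability {

    public static double fkGrade(int words, int sentences, int syllables) {
        return 0.39 * ((double) words / sentences)
             + 11.8 * ((double) syllables / words)
             - 15.59;
    }

    public static void main(String[] args) {
        // 100 words in 5 sentences with 150 syllables -> grade level ~9.9
        System.out.printf("%.2f%n", fkGrade(100, 5, 150));
    }
}
```

Counting syllables automatically is the hard part in practice; tools such as Readable approximate it heuristically.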

The goal of *Programmable User Models* [57] is to write a program that acts similarly to a user. Currently, our proposed tool is able to simulate users on existing UIs using a domain-specific language [58]. This automated approach was developed with the goal of testing user interfaces from the domain task-oriented point of view, and it is not related to the main goal of this paper.

#### *8.6. Simulation Techniques*

Simulation techniques, similarly to *Programmable User Models*, try to mimic user interaction. Many tools for end-to-end user testing exist, e.g., Protractor (Protractor end-to-end testing framework: https://www.protractortest.org; accessed on 9 August 2021) for Angular. However, similarly to *Programmable User Models*, simulation represents a general technique that is not specifically related to DU and therefore exceeds the focus of this paper.

#### *8.7. Automated Evaluation Methods*

Thus far, we have focused on manual or semi-automated techniques. As for automated approaches, as mentioned in the introduction, we found only one, by Mahajan and Shneiderman [59], that enables consistency checking of UI terminology. Their tool is quite obsolete and does not evaluate whether different terms describe the same functionality. However, the methodical approach is applicable to all UIs. In Section 3, we introduced a novel approach to semi-automatic DU evaluation of existing UIs that includes consistency checking similar to Mahajan and Shneiderman's, but extends the approach with multiple evaluation techniques.

#### **9. Related Work**

In this section, we selected the most important state-of-the-art works that refer to the aspects of DU, although they might have used different terminology compared to our definition. The number of works referring to matching the application's content to the real world indicates the importance of DU.

#### *9.1. Domain Content*

Most often, the existing literature refers to the *domain content* aspect of DU as to one of the following:


#### *9.2. Consistency*

Among other aspects, Badashian et al. [12] stress the importance of *consistency* in usable UIs. The survey by Ivory and Hearst [16] contains a wide list of automatic usability methods and tools. From over 100 works, only Mahajan and Shneiderman [59] deal with the domain content of applications, and their Sherlock tool is able to automatically check the consistency of UI terminology. Sherlock, however, does not evaluate whether different terms describe the same functionality or not.

#### *9.3. World Language, Language Barriers, Errors*

In addition to complexity, Becker [65] also deals with the *translation* of UIs, which corresponds to the *world language* DU aspect. In the area of web accessibility [13], the *understandability* of web documents is defined by W3C. Compared to our definition, however, it deals only with some of the attributes: *world language* of web UIs, *language barriers*, and *errors*. It focuses on web pages specifically, not on UIs in general.

#### *9.4. All Domain Usability Aspects*

Isohella and Nissila [8] evaluate the *appropriateness* of UI terminology based on the evaluation of users. In a broader sense, appropriateness is equivalent to our DU definition, but Isohella and Nissila do not go deeper into the definition's aspects. According to the authors, appropriate terminology can increase the quality of information systems; the terminology should be selected, formed, evaluated, and used.

#### **10. Conclusions**

In this paper, we described the design and implementation of a method for automated DU evaluation of existing user interfaces. The method not only evaluates user interfaces for domain usability but also (probably even more importantly) provides recommendations for their improvement. The method was verified using the implemented prototype on several existing open-source Java applications with graphical user interfaces. Among other findings, we conclude that the variability of the computed domain usability of individual applications is high. Many components do not contain tooltips or have grammatical errors.

As a secondary contribution, we proposed several modifications of existing manual techniques of usability evaluation to utilize them specifically for domain usability evaluation.

Ontologies provide good tools for content comparison, but they have restrictions (such as identifier uniqueness) that limit our approach, and the ontological format is rather extensive. Therefore, in the future, we plan to define a new domain-specific language (DSL) for formal domain model description [71] and a custom comparison engine for domain models exported in the DSL.

We believe that the ADUE method contributes to the field of UX and usability and hope that it improves the situation in DU of new user interfaces.

**Author Contributions:** Conceptualization, M.B. and J.P.; methodology, M.B., J.P., M.S., S.C., W.S. and M.M.; software, M.B.; formal analysis, M.B.; investigation, M.B.; data curation, M.B.; writing—original draft preparation, M.B.; writing—review and editing, M.B., J.P., M.S., S.C., W.S. and M.M.; visualization, M.B. and M.S.; funding acquisition, J.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by Project VEGA No. 1/0762/19 Interactive pattern-driven language development.

**Data Availability Statement:** The evaluation results of ADUE can be found at http://bit.ly/3hZBImy (accessed on 9 August 2021). The General Application Terms Ontology can be found at https: //bit.ly/2R6bm6p (accessed on 9 August 2021).

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.

#### **References**


## *Article* **Posting Recommendations in Healthcare Q&A Forums**

**Yi-Ling Lin <sup>1</sup> , Shih-Yi Chien 1,\* and Yi-Ju Chen <sup>2</sup>**


**Abstract:** Online Q&A forums, unlike search engines, allow posting of various types of queries, thus attracting users to seek information and solve problems in specific domains. However, as insufficient knowledge leads to incomprehensible queries, unsuitable responses are common. We develop posting recommendation systems (RSs) to support users in composing reasonable posts and receiving effective answers. The posting RSs were evaluated in a user study with 27 participants and three tasks to examine whether users engaged more in the question generation process. Two medical experts were recruited to verify whether professionals can understand and answer posts supported by RSs. The results show that the proposed mechanism enables askers to produce posts with better understandability, which leads experts to devote more attention to answering their questions.

**Keywords:** question-answering forum; healthcare informatics; recommendation system; word embedding; user study

#### **1. Introduction**

Although search engines are the most popular channel for information retrieval, the retrieved results are often too general to find solutions that fulfill user needs. Information retrieved from search engines is usually selected and sorted using custom algorithms, which favor preselected hosts or Wikipedia results. When looking for information on an unfamiliar topic, users may lack the knowledge to formulate good search queries, resulting in improper or unexpected search results. The difficulty in composing concise queries for search engines has popularized online Q&A forums, which serve as alternatives by which to find detailed answers to questions. Online Q&A websites attract users because they can respond to detailed questions and query experts without time or geographical constraints [1]; however, for user questions that are incomplete or ambiguous, the resulting answers may not be what the user was looking for; finding professional and reliable answers can be difficult. This has led to many unsolved and unclear questions in online forums.

Generating effective questions on Q&A websites is not easy, particularly in highly specialized domains. In the healthcare field, for instance, people may possess little background on the questions and may not understand the relevant jargon, resulting in ambiguous questions. Most users can only think of simple terms to describe their disease and medical conditions: phrases used in the queries often do not reflect standard medical terminology. Sometimes, even the asker is not sure how to describe his/her medical condition, or describes the encountered situation in various ways (e.g., different descriptions of the pain scale for the same illness) [2]. Lexical barriers such as partial misspellings and the use of abbreviations also make questions hard to understand. For example, a typical general question is "Recently I have been suffering from back pain. What kind of lifestyle would help prevent back pain?" A more informative or knowledgeable post would be "I am staying at a healthy weight, but I recently began to suffer from severe back pain. I searched for information online and found that smoking ages the spine. I seldom smoke but my husband smokes a lot. Is it because of inhaling too much secondhand smoke?"

**Citation:** Lin, Y.-L.; Chien, S.-Y.; Chen, Y.-J. Posting Recommendations in Healthcare Q&A Forums. *Electronics* **2021**, *10*, 278. https://doi.org/10.3390/electronics10030278

Academic Editors: Matus Pleva, Yuan-Fu Liao and Patrick Bours. Received: 11 December 2020; Accepted: 18 January 2021; Published: 25 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Since this difficulty in formulating effective posts on Q&A websites is rarely addressed, we propose a design that recommends more concepts (e.g., topics and terminology) to users to help them formulate their posts with more reasonable details, rather than presenting existing questions from the search pool (i.e., routing answers) or finding possible answerers, as most Q&A websites do [3–5]. While most products provide recommendations to users at querying and browsing moments, only a few mechanisms (e.g., spellchecks) focus on helping users formulate posts, particularly for an online Q&A forum with a specific subject such as healthcare. As shown in a related research thread [6–8], enhancing the quality of the input content not only increases the user's ability to get useful answers but also yields high-quality solutions faster. Recommendation systems (RSs) applied when composing posts could be enhanced by suggesting to users what content should be posted and how situations should be described in the post, leading to high input quality and better answers [8]. In this study, we seek to help participants who are unfamiliar with a domain to compose queries with a posting recommendation mechanism.

We propose two posting RSs: a word embedding-based RS and a semantic-based RS. Word embedding is a well-known technique for mapping words into a vector space to improve the automatic understanding of human language. Our word embedding-based posting RS ("the embedding model" in the following content), implemented with a Word2Vec model [9], is trained on 5319 questions and 500 publication abstracts crawled from health-related websites. For the semantic-based posting RS, we adopt the WordNet (https://wordnet.princeton.edu/) model ("the semantic model" in the following content), a lexical database for English in which synonyms are manually grouped. It groups words by their meanings for computational linguistics and natural language processing (NLP). Both the embedding and semantic models are meant to recommend ideas and terminology that users may need in their current posts. These feature-based recommendations are expected to help users make more subject-specific posts.
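The core ranking step of an embedding-based recommender can be sketched as cosine similarity between a query vector and candidate term vectors. The three-dimensional vectors below are toy values for illustration, not the trained Word2Vec vectors used in the study:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Minimal sketch of embedding-based term recommendation: rank vocabulary
// terms by cosine similarity between their vectors and a query vector.
public class EmbeddingRecommender {

    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static List<String> recommend(double[] query, Map<String, double[]> vocab, int k) {
        List<String> terms = new ArrayList<>(vocab.keySet());
        // Sort by descending similarity to the query vector.
        terms.sort((x, y) -> Double.compare(cosine(query, vocab.get(y)),
                                            cosine(query, vocab.get(x))));
        return terms.subList(0, Math.min(k, terms.size()));
    }

    public static void main(String[] args) {
        // Toy 3-dimensional "embeddings"; real ones would come from Word2Vec.
        Map<String, double[]> vocab = new LinkedHashMap<>();
        vocab.put("spine", new double[] { 0.9, 0.1, 0.0 });
        vocab.put("smoking", new double[] { 0.1, 0.9, 0.2 });
        vocab.put("weather", new double[] { 0.0, 0.1, 0.9 });
        System.out.println(recommend(new double[] { 0.8, 0.2, 0.1 }, vocab, 2));
    }
}
```

The semantic model differs only in the candidate source: instead of nearest vectors, it draws synonyms and related senses from WordNet.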

We believe that using text analytics to engage users more in the asking process is a good approach to supporting users in formulating posts and enhancing their clarity, which would encourage domain experts to reply to the posts and answer the questions. To verify whether reformulated queries yield better query wording and help users to find the desired answers more easily, we conducted a user study and a satisfaction questionnaire to understand user perspectives on our RSs. In addition, posts written by our study's participants were evaluated by experts with a health-related background to determine whether they could be easily solved. The research questions are posited as follows:

RQ1: Does the posting RS help users formulate questions in healthcare Q&A forums?

RQ2: Is it easier for experts to understand questions supported by the posting RS?

#### **2. Related Work**

Traditionally, healthcare professionals are the primary sources of health information; they provide and manage health information for their patients [10]. With the spread of the Internet, sources of health information have become more diverse and accessible to individuals and families. Despite this easy access to health information, its main use remains focused on supporting healthcare professionals, such as in hospital information systems [11]. Isern and Moreno [12] organize various agents in healthcare to inform decisions on cure plans and to alert patients when abnormal messages are detected. Although health information has been widely applied to support professionals, Frost et al. [13] state that health information is also beneficial for patients and people in need. Effective support in terms of health information can improve the doctor–patient relationship as well as the completeness and quality of diagnosis [13]. Thus, it is crucial to provide an effective communication channel between professionals and general users in the health information domain.

As the Internet provides a convenient way to access health information, people tend to seek health-related support online [10]. It is estimated that approximately 12.5 million of the 278 million daily Internet searches are health-related [14]. To find the most relevant answers on the Internet, RSs are essential for Q&A forums. Existing RSs generally focus on routing answers or finding answerers. Among various recommendation mechanisms, question routing and grouping are two main approaches to finding potential answers and answerers (people who have similar experiences in a specific area) in Q&A forums [3–5]. These methods consider underlying social network features (e.g., which query gets more hits), user activity (e.g., which category do experts tend to be active in and receive honor for the best answer), and public personal data on websites to improve system usability.

Most studies about RSs in online Q&A forums focus on general aspects rather than a specific subject such as healthcare. Budalakoti et al. [15] present an RS with three different methods for selecting the most appropriate responder for a given question on Yahoo! Answers. One calculates the cosine similarity between the words from an individual's (the author's) historical Q&A data and his/her current question; another groups documents using K-means clustering; and the third discovers the author-topic distributions as the general model and recommends responders based on the marginalized probabilities. Yang and Amatriain [16] analyze the application of RSs at Quora and build a platform for developers to experiment with different machine learning models. While most studies work on general Q&A forums, few focus on specific professional Q&A forums. Xin et al. [17] developed TagCombine, an automatic tag recommendation method that analyzes objects on both the Stack Overflow and Freecode websites to facilitate search and identify software objects. Pedro and Karatzoglou [18] presented a supervised Bayesian approach to model expertise with similar topics to support question recommendation and to avoid question starvation on Stack Exchange (http://stats.stackexchange.com). Wang et al. [19] also provided an enhanced tag recommendation system, ENTAGREC++, for organizing questions and facilitating question browsing on Stack Overflow. Singh and Simperl [20] implemented a system, Suman, which combines semantic keyword search with traditional text search to find answers for unanswered questions on Reddit and Stack Overflow. Even fewer studies focus on the healthcare Q&A domain. McCray et al. [21] developed a web-based terminology server that allows a diverse audience to easily access current health information by enforcing a flexible query grammar, expanding synonyms and lexical variants for a term, and generating alternative spellings for unknown words. Cho et al. [22] helped users to receive satisfactory responses by improving the baseline retrieval model with semantic information to generate the top five discussion threads that are potential responses to unresolved medical case-based queries. Although RSs have been widely employed in health areas, Jacobs et al. [10] state that the extant mechanisms for online health information search are insufficient.

Despite the popularity of Q&A forums, many questions lack answers due to ambiguous or misleading terms [20]. Baltadzhieva and Chrupała's study [8] on Stack Overflow (a programming Q&A forum) shows that the terms used, the tags added, and the length of questions influence question quality. They conclude that questions that are too localized or that have incorrect tags or terms are considered to be of poor quality [8]. In the healthcare domain, Bochet et al. [23] demonstrated that most users are too inexperienced to formulate an effective search query for health information. Spink et al. [24] also showed that when posting medical and health queries, many users fail to retrieve information relevant to their condition due to ignorance of specialized vocabulary or precise medical terms. Zhang [25] showed that queries posted about health support are usually simple and short and lack other aspects of individual information. For recommendation systems to facilitate the formulation of online questions that are more likely to be answered, it is essential to make posts more comprehensive.

Thus, in this study we focus on generating and improving questions to enhance the recommendation mechanisms in the healthcare domain. We develop posting RSs to suggest potential ideas, formulate user questions, and eliminate ambiguities that might decrease the likelihood of the question being answered or increase the time it takes for the question to be responded to.

#### **3. Posting Recommender Systems (RSs)**

#### *3.1. Interface Design*

After looking over Q&A online forums (e.g., Quora.com, Yahoo! Answer, Stack Overflow (https://stackoverflow.com/), and English Language & Usage (https://english.stackexchange.com/)), we included an input area and a recommendation area in the system layout (see Figure 1). The first column of the recommendation area (the table part) shows topics that askers may focus on, and the remaining columns show the top 10 terms related to the particular topic.

**Figure 1.** Interface of the posting RS.

Askers compose multi-sentence posts in the green input region (Figure 1). While users compose their posts, Grammarly (https://www.grammarly.com/), an auto-spellcheck extension from the Chrome web store, is activated to eliminate careless typos. If askers need ideas or assistance in generating the appropriate terms to pose their questions, they click on execute to receive system suggestions. In the recommendation table, askers click on the copy button to fetch the required terminology. Askers can click on execute at any time to receive new system recommendations. When askers are satisfied with the post, they click on finish to complete the question content. Figure 2 demonstrates how users interact with the proposed RSs.

**Figure 2.** User perspective of interactive data flow.

The post recommendation mechanism is composed of three phases. First, the user inputs keywords, terms, or sentences to describe her questions. After receiving the user's queries, we attempt to understand what the user is asking or what concepts she is interested in. A post on a Q&A forum is a kind of user-generated content (UGC), usually consisting of a question or a narrative. To identify user intentions from posts, we use a noun phrase extractor to extract the main topics from each post. Noun phrases are usually the core topics or objects in a sentence, whereas verb phrases describe actions between the objects in a sentence.
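The paper does not name a specific extractor, but the noun-phrase step can be sketched with NLTK's `RegexpParser` and a simple chunk grammar. The grammar and the hand-tagged example below are illustrative assumptions, not the authors' exact setup.

```python
from nltk import RegexpParser

# Illustrative chunk grammar: optional determiner, any adjectives, 1+ nouns.
chunker = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")

def extract_noun_phrases(tagged_tokens):
    """Return the noun phrases found in a POS-tagged sentence."""
    tree = chunker.parse(tagged_tokens)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees(lambda t: t.label() == "NP")]

# Hand-tagged example; in practice nltk.pos_tag() would supply the tags.
tagged = [("My", "PRP$"), ("grandpa", "NN"), ("had", "VBD"),
          ("a", "DT"), ("heart", "NN"), ("attack", "NN"),
          ("yesterday", "RB"), (".", ".")]
print(extract_noun_phrases(tagged))  # ['grandpa', 'a heart attack']
```

The extracted phrases then serve as candidate topics that are fed to the recommendation models.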

Second, we use embedding and semantic models to provide recommendations that help users construct their posts. In this study we use Word2Vec, a two-layer neural network model [9], as the embedding model. In addition, as the semantic model we use WordNet (https://wordnet.princeton.edu/), a well-curated English lexical database, and we implemented the application using Python NLTK's WordNet package to generate recommendations. Both embedding-based and semantic-based recommendations are triggered by clicking on execute (more details about the recommendation models and dataset are provided in Sections 3.2 and 4.1).

Third, the recommendations made by these models are prioritized, and the top 10 recommended terms are displayed so that users can fetch the required content. Users continue to modify the post (sentences and terms) until they are satisfied with it, or until the recommendations yield nothing new.

#### *3.2. Recommendation Models*

Several state-of-the-art recommendation methods, such as content-based [26–28], collaborative filtering-based [29], and hybrid methods [30], have been proposed to generate personalized recommendations based on the relationship between users and items. To provide recommendations that help users construct their posts, we use embedding-based and semantic-based RSs that concentrate on the interactive items (i.e., posts) without knowing the user's previous interactions. To algorithmically understand the post and provide recommendations, text representation is important. Different from traditional text representations such as continuous bag-of-words or Term Frequency–Inverse Document Frequency (TF-IDF) [31], WordNet and Word2Vec bring extra semantic features that help in identifying textual content. WordNet [32,33] is a human-curated ontological symbolic representation based on the similarity between words; it is often limited by its hierarchical representation. Word2Vec [9] is an unsupervised neural network method that determines a word's meaning from its surrounding context and represents it as a vector. The input words are mapped into an n-dimensional vector space, in which similar words lie near the input vector. It performs effectively regardless of how many words are in the input, but is constrained by the corpus underlying the vector space. By incorporating generalizable contexts into the model, Word2Vec has been shown to be more accurate than other models [9,34,35].

Pre-processing is essential for RS models. Data collected from websites and online forums often contain colloquial sayings and abbreviations (e.g., please → plz, pls). To eliminate meaningless words and punctuation (e.g., "?", ".", ";"), we tokenized sentences, removed stopwords using the NLTK corpus, and normalized terms before training. To reduce the number of inflectional forms, we lemmatized the words (e.g., am, are, is → be) using NLTK to obtain the general patterns of words. We then put the word lists of all sentences into a collection and used the gensim package (https://pypi.org/project/gensim/), a Python library for scalable statistical semantics.
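As a rough sketch of that pipeline (with toy stand-ins for NLTK's stopword list and WordNet lemmatizer, since the paper does not list the exact resources):

```python
import re

# Toy stand-ins; the actual pipeline uses nltk.corpus.stopwords and
# nltk.stem.WordNetLemmatizer.
STOPWORDS = {"the", "a", "an", "am", "are", "is", "to", "of", "and", "i"}
LEMMAS = {"am": "be", "are": "be", "is": "be", "symptoms": "symptom"}

def preprocess(sentence):
    """Lowercase, strip punctuation, drop stopwords, then lemmatize."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return [LEMMAS.get(t, t) for t in tokens if t not in STOPWORDS]

sentences = ["What are the symptoms of the flu?", "I am allergic to pollen."]
corpus = [preprocess(s) for s in sentences]
print(corpus)  # [['what', 'symptom', 'flu'], ['allergic', 'pollen']]
```

Each post becomes a list of normalized tokens, which is the input format gensim's `Word2Vec` expects.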

To give suitable ideas to help users compose their posts, we developed two models. We implemented the embedding model using Word2Vec, a shallow, two-layer neural network model that uses a large corpus of text to perform unsupervised learning [36] and produces a vector space that reconstructs the linguistic contexts of words. In this vector space, words sharing common contexts in the corpus are located in close proximity to each other. Vector relationships can be represented as "Kitten:Cat = Puppy:Dog"; thus, given an expression such as "Kitten:Cat = Dog: ?", we can infer what word should be inserted. There are two kinds of Word2Vec models: skip-gram (which infers context words from the input word) and continuous bag of words (CBOW, which infers the input word from its context words). In this work, we followed the gensim tutorial (https://radimrehurek.com/gensim/tutorial.html) and used skip-gram to train Word2Vec on a corpus of medical terms and healthcare forum wording.

We also implemented another recommendation system using the WordNet semantic model. We changed little here because its database is already well organized: given the input sentence, the recommendations are generated from the English lexical database using the Python NLTK WordNet package.

We utilized Selenium web browser automation (https://www.seleniumhq.org/) to help users eliminate misspellings when formulating their posts. When a user types a period or clicks the execute button, the system treats the preceding text as a sentence, automatically normalizes its wording, and feeds it into the two recommendation models. The embedding model maps the input words to their context words and offers recommendations; the semantic model maps the input words onto its pre-generated semantic graph of lexical items and provides similar terms as recommended ideas.

#### **4. Research Design**

To assess whether the proposed posting RSs help users formulate queries that increase the probability of being answered, we conducted a user study to collect and analyze content written by users. We implemented two posting RSs: a word embedding model based on Word2Vec (suggesting ideas (terminology) related to the main topics of the input content), and a semantic WordNet-based model (suggesting synonyms (terminology) for the main topics of the input content), for comparison with the baseline model (no recommendation). We collected the participant behavior and posts using the three models for further analysis and expert evaluation.

#### *4.1. Dataset*

WebMD (https://www.webmd.com/) is one of the few healthcare Q&A forums in which medical specialists (called experts in the forum) offer suggestions to askers about their illness or concerns. The dataset was crawled from WebMD from March 2010 to September 2014 and contains 25,319 questions.

Apart from the daily conversations from WebMD, we also collected medical terminology and specialist wording from other professional healthcare-related websites (such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/)). Lai et al. (2016) suggest that for word embedding models, the domain of the corpus is more important than its size. Thus, we crawled the abstracts of biotechnology-related publications from PubMed to create the Word2Vec model.

#### *4.2. Tasks and Experimental Materials*

To generate posting ideas for the participants, the experiment provides a short introduction with a background story to simulate possible healthcare conditions. To complete the task, the participants were asked to compose a post associated with the background story.

To evaluate the experimental design, a pilot test was conducted in which three health tasks were examined: flu, asthma (https://www.webmd.com/a-to-z-guides/common-topics), and pregnancy. The results showed that it was difficult for participants to compose posts about asthma and pregnancy because they had little daily experience in these areas. Therefore, we changed the health tasks to flu, allergy, and foodborne illness, which are more common among the public, and employed these in the official study (see Appendix A).

As a short introduction lacks sufficient information to formulate posts under a simulation condition, for each task we prepared supportive paragraphs from relevant medical websites. To cover various aspects of health situations, articles, news, and reports from healthcare agencies were collected as our materials. Finally, we selected supportive excerpts from a health agency's announcement with statistical data (the rate of an illness in a region) and sections gathered from news reports with common knowledge that the public can understand. Participants were to imagine the assigned task and write down their own or the character's experiences of specific illness after understanding the background information.

In addition, we prepared an example question in the try-out (Figure 3) to encourage participants to produce longer questions and not simply question sentences like "what are the symptoms of heart disease."

#### *4.3. Participants and Procedure*

Twenty-seven participants (14 females and 13 males, average age 27.7) were recruited from a social media website (i.e., Facebook). For fifteen of the 27 participants, experience with Q&A forums was limited to browsing discussion threads rather than composing or answering posts. Only five participants had experience using "professional" Q&A forums.

The within-subject design was used in the experiments. The three tasks were performed along with three algorithmic models (without RS, with Word2Vec RS, and with WordNet RS). Thus, each participant was asked to complete a total of six posts in three tasks. The Latin square design was applied to avoid the order effect [37]. The experiment used the following procedure (Figure 4):


## *[Task 1]*

#### **Background:**

Sandy's grandfather has a family history of heart attacks. Unluckily, he had an attack yesterday and was sent to the hospital. After receiving a phone call from Dad, Sandy tried to search for some information about the sickness. She will go to pick up Grandpa Johnson tomorrow on her way home, but she has no idea what she should know in advance. The following is the information she has now. If you were Sandy and wanted to get help on a healthcare online forum, what would you say?

#### **Supportive paragraphs of a daily scenario task:**

#### A guide to a heart attack

When blood can't get to your heart, your heart muscle doesn't get the oxygen it needs. Without oxygen, its cells can be damaged or die. Over time, cholesterol and a fatty material called plaque can build up on the walls inside blood vessels that take blood to your heart, called arteries. This makes it harder for blood to flow freely. Most heart attacks happen when a piece of this plaque breaks off. A blood clot forms around the broken-off plaque, and it blocks the artery.

#### The following is the call, from Sandy's dad:

"If Tracy (paid cleaner) wasn't there at that time, it may have been too late to rescue your grandpa. You know, Grandpa Johnson had a heart attack. He told me before that his chest was sometimes painful and that made it difficult for him to breath. And our hometown was pretty cold in the winter. I'm afraid that if Grandpa forgets to dress warm enough, the low temperature may stimulate another heart attack. Do you think I should find a personal physician for grandpa? Near his house? We are all working outside the county. When emergency happens, this protection may work."

#### **Example question from Sandy:**


My grandpa's heart attack occurred yesterday. I'm going to ask the cleaner who found my grandpa fainted what happened before talking to the doctor. What should I know? The body reaction at that moment? In addition, Grandpa is an emotional person, so I consider talking to the doctor by myself first and decide what can tell him directly. Is that good? By the way, the temperature here is pretty low. Does anyone know what things should be prepared for when grandpa goes back home?

**Figure 3.** Material read by participants before composing a post in the try-out.

(1) After signing the consent form, the participants took a pre-test questionnaire on their background and past experience using Q&A forums.

(2) A training task was then provided to ensure that participants fully understood the experimental systems and task requirements. An example of an expected post was given to encourage the participant to compose complete questions. Participants were allowed to ask any questions during this step.

(3) A brief description of the assigned model was provided, and the participant was given sufficient time to become familiar with the system.

(4) A description of the general context of the assigned task was provided to the participant.

(5) Another description of the complex context of the task was given to the participant. Then, the participant began her posting. Please note that as each participant completed all three tasks with the three models, she completed (3)–(5) three times.

(6) Finally, the participants took a post-test questionnaire on the extent to which they would prefer using our RSs.

**Figure 4.** Experiment procedure.

#### *4.4. Analysis Method*

To answer the first research question, "Does the posting RS help users formulate questions in healthcare Q&A forums?", five measurements from the literature were used to evaluate the outcomes: (1) input content length, (2) amount of medical terminology in the input content [24], (3) presence of a condition or self-description [25], (4) amount of recommended terminology adopted by the user, and (5) total time used to formulate the post.

We used generalized estimating equations (GEE) [38] to analyze the data. GEE can estimate parameters regardless of whether a variable is continuous or nominal. Even with missing data in a variable column, GEE can still calculate results from the other columns containing data. GEE is suitable for repeated-measures experiments even if the input parameters are dependent or independent and even if the population does not have a normal distribution. Lastly, the main effects and interaction terms of variables can be chosen under GEE manipulation.

To answer the second research question, "Is it easier for experts to understand questions supported by the posting RS?", we invited experts whose jobs are related to the medical professions to rate the quality of the posts composed by the participants. They noted that answering questions on a healthcare forum is similar to diagnosing patients in a clinic: after the patient describes the condition, the professional suggests possible solutions. The main risk is misunderstanding, as experts must judge the illness from the post alone, without a face-to-face diagnosis.

Since an illness may present with different symptoms and complications in different people due to age, constitution, medical history, etc., it is difficult to draw conclusions when an asker sends ambiguous messages. To narrow down the range of possible solutions, it is necessary to obtain more details and transparent objectives (e.g., at least a query sentence and a self-description in a post). Therefore, when users post questions in the forum, the posting RS should assist them to adopt meaningful terminology and compose complete but concise posts. We requested that the professionals rate every post from one to five points (low to high quality) on three measurements: willingness, completeness, and clarity. Willingness evaluates whether the professionals were willing to answer the post; completeness and clarity concern the reasons for their assessment. For example, informative posts were rated high in completeness, and posts with sufficient descriptions of what happened, as well as timing and location, were rated high in clarity.

#### **5. Analysis of Results**

The study was conducted with two posting RSs and one baseline model over three tasks. Each participant was requested to generate six posts in total. The system log was analyzed to objectively investigate user behavior under the various RS and task conditions, and the participants' posts were used to investigate whether recommendation support helps experts better understand the posts and encourages them to answer the posted questions. This section is organized into log analysis and opinion analysis based on the research questions.

#### *5.1. Log Analysis*

This section examines effectiveness in terms of post length, the number of medical-related features, and the existence of detailed descriptions among the three models, as well as the number of adopted recommended features between the two experimental recommender models, and examines efficiency with versus without an RS. Table 1 shows basic descriptive statistics for the three models. We applied a linear function with GEE to evaluate the association between post length and three within-subject variables: model, task, and operating order. No significant effect was found for model, but there is a main effect of task [*χ*²(2) = 11.758, *p* < 0.003]. Investigating pairwise comparisons with the least significant difference (LSD) reveals a significant difference in post length between allergy (mean = 63.02, S.E. = 5.381) and foodborne illness (mean = 45.72, S.E. = 4.096) (*p* < 0.001), and between flu (mean = 58.64, S.E. = 2.725) and foodborne illness (*p* < 0.004), suggesting that foodborne illness elicits significantly shorter posts than both allergy and flu. No significant difference was observed between allergy and flu.

**Table 1.** Descriptive statistics of three main measurements among models. (Note. "A" denotes word embedding model, "B" denotes semantic model, and "C" denotes baseline model.).


Using GEE, a linear function was applied to evaluate the association between medical-related terminology and three within-subject variables: model, task, and operating order. This reveals a main effect of model [*χ*²(2) = 23.941, *p* < 0.000] but no significant effect of task. A pairwise comparison with LSD reveals that participants composed posts using significantly more medical-related terminology (*p* < 0.002) with the semantic model (mean = 4.65, S.E. = 0.236) than with the word embedding model (mean = 3.21, S.E. = 0.560).

When a post includes more detailed context information, experts may better understand the user's questions and expectations. We asked the three curators to note whether posts contained descriptions of the patient's background and the timing of the illness outbreak. If the majority of the curators believed the post to be informative, it was labeled "T" (True); otherwise, it was labeled "F" (False). Using the GEE binary logistic function, we analyzed the association between the existence of descriptions and model, task, and operating order, revealing main effects of model [*χ*²(2) = 11.765, *p* < 0.003] and task [*χ*²(2) = 25.799, *p* < 0.000] on the existence of descriptions. In contrast to the baseline model (mean = 0.50, S.E. = 0.006), when using the word embedding model (mean = 0.48, S.E. = 0.006), participants were less likely to augment posts with descriptive information (OR = 0.910, *p* < 0.002). A pairwise model comparison demonstrated that participants were significantly less likely to add details when using the word embedding model than when using the semantic model (*p* < 0.014).

When comparing flu to the other two tasks, participants dealing with foodborne illness were more likely to add descriptions to their posts (OR = 1.107, *p* < 0.004). A pairwise task comparison indicated that (1) foodborne illness (mean = 0.51, S.E. = 0.006) and allergy (mean = 0.48, S.E. = 0.005) and (2) foodborne illness and flu (mean = 0.49, S.E. = 0.006) were significantly different. That is, participants were significantly more likely to include descriptions in a post about foodborne illness than about allergy (*p* < 0.000) or flu (*p* < 0.004). There was no significant difference between allergy and flu in terms of the existence of descriptions.

We also evaluated which RS better supported users in generating posts by examining the adoption of the two experimental models (word embedding and semantic) and the usage of medical-related terms. To gauge the quality of the embedding and semantic models, we first counted the number of adoptions during the asking process. In terms of total acceptance of the recommended terminology, the embedding model (43 times) yielded more adoptions than the semantic model (31 times). However, no significant effect of model or task on the number of adoptions was observed. Apart from effectiveness, we also used the GEE method to examine the time spent with an RS (i.e., the embedding model or the semantic model) versus the baseline model without recommendations. No significant effect was found, which demonstrates that users did not spend more time when using the RSs than with the baseline model.

In summary, in terms of effectiveness, applying an RS (i.e., the embedding model or the semantic model) does affect asker posting behavior, encouraging askers to use medical-related terminology and include more description in their posts. There was no significant relation between post length and whether askers used an RS. Participants were less likely to describe situations in detail with the word embedding system than with the semantic system. When analyzing tasks, the results show that posts about foodborne illness were significantly shorter than those about allergy and flu, yet participants included more details for foodborne illness scenarios than for allergy and flu. In terms of efficiency, applying the RSs did not cost users more time to formulate their posts, even though they provided more details in their questions.

#### *5.2. Opinion Analysis*

We recruited two experts—one a pharmacist and the other a physician—to go through three lists of posts categorized by different tasks. Before asking the experts for their opinions, we interviewed them to determine how they judge their willingness to answer questions. Both experts indicated that complete and clear descriptions of conditions provide better information to help users. We use "willingness" to indicate their willingness to provide answers, and "completeness" and "clarity" as two factors that affect their willingness. The experts were asked to rate the three factors of posts on a five-point Likert scale (ranging from strongly disagree "1" to strongly agree "5"). The descriptive results of the three factors from the two experts are provided in Table 2.

**Table 2.** Descriptive statistics of three factors.


Inter-rater reliability with Cohen's kappa [39] was adopted to evaluate the rating agreement of the two experts, yielding low kappa values for willingness, completeness, and clarity [40], which could be attributable to their different backgrounds (pharmacist vs. physician), leading to different opinions in communicating with their patients [41]. Since there was no significant difference among the models for each expert, a linear function with GEE was applied to evaluate the association between willingness, completeness, and clarity and the existence of a description, separately for each expert's judgment. The judgment of both pharmacist and physician showed that willingness (pharmacist: *χ*²(1) = 22.194, *p* < 0.001; physician: *χ*²(1) = 9.693, *p* < 0.002) and completeness (pharmacist: *χ*²(1) = 62.246, *p* < 0.001; physician: *χ*²(1) = 87.103, *p* < 0.001) are highly related to the existence of a description. The pairwise comparison of "False" and "True" description labels for the physician's willingness (False: mean = 3.61, S.E. = 0.42; True: mean = 3.78, S.E. = 0.40), the pharmacist's completeness (False: mean = 3.82, S.E. = 0.35; True: mean = 4.14, S.E. = 0.39), and the physician's completeness (False: mean = 2.32, S.E. = 0.73; True: mean = 3.13, S.E. = 0.48) indicates that "True" posts are more likely to receive high scores from the experts. In terms of the effect of clarity, a significant effect was found in the physician's judgment (*χ*²(1) = 36.817, *p* < 0.001). If askers did not include greater detail in posts, there was a 65.3% chance of receiving lower clarity scores from the physician.
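For reference, inter-rater agreement of the kind reported above can be computed with scikit-learn's `cohen_kappa_score`. The Likert ratings below are made-up examples, not the experts' actual scores.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical five-point Likert ratings for ten posts.
pharmacist = [4, 3, 5, 2, 4, 3, 4, 5, 2, 3]
physician = [3, 3, 4, 2, 5, 2, 4, 4, 3, 3]

kappa = cohen_kappa_score(pharmacist, physician)
print(f"Cohen's kappa: {kappa:.3f}")

# For ordinal ratings, a quadratic-weighted kappa penalizes large
# disagreements more heavily than near-misses.
weighted = cohen_kappa_score(pharmacist, physician, weights="quadratic")
print(f"Quadratic-weighted kappa: {weighted:.3f}")
```

A weighted variant is often preferred for Likert-scale data, since raters who disagree by one point are treated as closer to agreement than raters who disagree by four.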

As we found that having a description contributes to higher scores from experts, we directly compared posts with sufficient description across models to gain further insight. The judgment of both experts indicates a significant effect on completeness. The pairwise comparisons show that the semantic model is more likely to yield higher completeness scores than word embedding and the baseline, and the word embedding model is in turn more likely to receive high completeness scores than the baseline.

Clarity, the last measurement, was found to be significantly different between (1) word embedding and baseline (*p* < 0.001) and (2) semantic and baseline (*p* < 0.002). This suggests that using the posting RSs with sufficient post details is more likely to yield a high expert rating.

To examine the relationship between the quality of a user's question and expert opinion, the question's length (i.e., word count and medical-related word count) and the experts' ratings (willingness, completeness, and clarity) were investigated. As the recruited experts had diverse medical backgrounds and non-identical perspectives, their opinions were analyzed separately. The correlation results revealed that the experts' completeness and clarity ratings were greatly affected by the word count and medical-related word count (the results in Table 3 show marginal differences for expert 1 and statistically significant differences for expert 2); however, the experts' willingness was less likely to be influenced by question length. In addition, the results indicated that the word count significantly impacted expert 2's opinions.
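Since expert ratings are ordinal Likert scores, a rank correlation such as Spearman's is a reasonable stand-in for the correlations behind a table like Table 3 (the paper does not state which correlation coefficient was used). The data below are synthetic.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
word_count = rng.integers(20, 120, size=54)  # hypothetical post lengths
# Hypothetical completeness ratings loosely increasing with length, plus noise.
completeness = np.clip(1 + word_count // 25 + rng.integers(-1, 2, size=54), 1, 5)

rho, p = spearmanr(word_count, completeness)
print(f"Spearman rho = {rho:.2f}, p = {p:.4f}")
```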


**Table 3.** Correlation results between word counts and expert opinions.

#### **6. Discussion**

It is easy to find online Q&A forums with mechanisms to support finding existing relevant questions, but it is hard to find supportive systems that focus on post composition during the query process. This study demonstrates that the proposed posting RSs are more effective and efficient than the baseline (with no RS support).

The amount of medical-related terminology showed a significant effect of model, indicating that using an RS yields more medical-related terminology than not using one. The semantic model had a stronger influence than the embedding model, even though a word embedding model usually surfaces more relevant topics based on common wordings than a dictionary-based semantic corpus. The semantic corpus, constructed by manipulating WordNet, performs well particularly when askers are able to query more professionally. The weaker performance of the embedding model might be due to the small training dataset, leading to imprecise or ambiguous recommendations. To improve the usefulness of the word embedding model, future work should collect larger amounts of in-domain data and re-train the model.
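The contrast between the two recommenders can be sketched minimally as follows. The tiny "semantic corpus" dictionary is a toy stand-in for the WordNet-derived corpus, and the hand-written 3-d vectors stand in for a trained word embedding model; none of the terms or values come from the study's actual data.

```python
import numpy as np

semantic_corpus = {  # hypothetical dictionary: term -> related medical terms
    "fever": ["temperature", "chills", "antipyretic"],
    "rash": ["hives", "itching", "dermatitis"],
}

embeddings = {  # hypothetical 3-d word vectors, not a trained model
    "fever": np.array([0.9, 0.1, 0.0]),
    "chills": np.array([0.8, 0.2, 0.1]),
    "rash": np.array([0.0, 0.9, 0.3]),
    "itching": np.array([0.1, 0.8, 0.4]),
}

def semantic_recs(term: str) -> list[str]:
    """Dictionary-based model: a direct lookup in the curated corpus."""
    return semantic_corpus.get(term, [])

def embedding_recs(term: str, k: int = 2) -> list[str]:
    """Embedding model: nearest neighbours by cosine similarity."""
    if term not in embeddings:
        return []
    v = embeddings[term]
    def cos(u, w):
        return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
    ranked = sorted(
        (w for w in embeddings if w != term),
        key=lambda w: cos(v, embeddings[w]),
        reverse=True,
    )
    return ranked[:k]

print(semantic_recs("fever"))   # curated related terms
print(embedding_recs("fever"))  # distributionally similar terms
```

The sketch makes the trade-off visible: the dictionary returns precise, curated terminology but only for terms it covers, whereas the embedding lookup generalizes to any term in its vocabulary at the cost of occasionally imprecise neighbours.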

Detail in a post is an important element for experts when evaluating posts because they cannot diagnose a person through back-and-forth interaction: a single question is usually not enough for experts to solve the problem. Our data reveal main effects of model and task on the presence of descriptions. A deeper investigation indicates that the embedding model is less likely to elicit more details in a post than the baseline. This suggests that people are still accustomed to a posting procedure without interference. In addition, as it was merely a simulated scenario, most participants lacked a strong motivation to find a solution; they felt more comfortable writing posts in a stress-free situation without interruptions. Moreover, as allergy and flu are common experiences, participants may assume that most readers are familiar with them and thus omit details when describing the malady. In contrast, when generating posts about foodborne illness, which is less familiar, participants provided more details when describing their conditions. Post length did not differ significantly between models but did between tasks, which indicates that different illnesses do affect post length. The interviews revealed that most participants were not familiar with foodborne illness; this unfamiliarity caused participants to compose posts that were shorter than those for allergy and flu.

Suggestions from the embedding and semantic models were adopted 43 and 31 times, respectively. However, each model had 54 posts, and adoption was unevenly distributed across them: for many posts, none of the recommended terminology was selected. This indicates that more resources will be needed in future work to build a robust word embedding model. Even though the average score for "want to use this kind of topic RS someday" was 3.81/5.00 in the post-questionnaire, if a recommendation looks strange, the resulting poor user experience would make it difficult to hold users' attention.

To explain the connection between a post RS and higher scores from experts, we further conducted a pairwise evaluation between the interaction of models with descriptions labeled "True" and three measurements. According to the result of the first expert (the pharmacist), completeness is higher when using a posting RS with detailed descriptions. Completeness and clarity of the second expert (physician) are increased if an asker uses the RS and provides more details in a post. Although results vary between experts, we conclude it is possible to elicit a response from experts after using an RS and adding details. In addition, we found that willingness is not significantly affected by a post RS that adds details, because the professional ethic of medical experts is to answer patient questions; thus, they seldom refuse to answer such requests. Therefore, willingness may not be a good measurement.

As both physicians and pharmacists are highly specialized and regulated professions with rigorous medical training, we assumed that individual differences in attitude would exert little influence on the collected expert opinions. Nevertheless, since professionals from different disciplines have different norms in communicating with their patients, it is difficult to find common ground between physicians and pharmacists [41]. For physicians, the priority is to thoroughly understand the situation and any information that relates to the patient's symptoms [41,42], whereas pharmacists tend to focus on medicinal instructions and materials; it is more important for them to gather all of the critical information than to understand the situation as a whole [41]. Despite the marked difference between the two experts' evaluations, both pharmacist and physician consider willingness and completeness to depend greatly on the existence of sufficient detail in the problem description. Posts labeled "False" are less likely to earn points from the experts. The physician's judgment also demonstrates that clarity is an important factor. According to the interviews with the experts, a good score from the physician means the post is easily understood, and easily comprehensible posts are more likely to be solved. Although some interesting results were observed, given the small number of experts in this study, future work should recruit more medical specialists to further validate our findings and to exclude potential issues arising from the sample size limitation.

#### **7. Conclusions**

In this work, we present a post RS that suggests relevant and useful ideas and terminology to support users who are composing posts to ask questions. Effectiveness and efficiency are evaluated in terms of the usability of the proposed post RSs (RQ1). Combining the result with RQ1, we evaluate the feasibility of the resultant posts to see if experts assign them higher scores (RQ2).

This research reveals that current Q&A forum RSs have reached a plateau because they only recommend relevant questions based on the words in the query and then send query requests to those who might be able to help the askers. These supportive methods may be infeasible when posts are difficult for the system to classify and users may decline to bother people who are reluctant to answer. In addition, most Q&A online forums do nothing about post actions in the asking process. Also, the existence of unanswered posts underlines the necessity of optimizing the posting process. After this user study, we found it is possible to change user posting behaviors by participating more in the asking process via a posting RS. Askers are also willing to be supported by the RS feature when formulating questions in unfamiliar domains. Whether the recommended terminology can be adopted directly or is relevant enough to modify posts conceptually, our RS suggests concrete and possible ideas to askers, which constitutes a new type of manipulation in the Q&A domain. We therefore anticipate that the posting RS will support users to better formulate posts and find solutions in a more efficient manner.

The proposed posting RS is also applicable to domains other than healthcare. Take e-commerce for example: when people are purchasing products that they are not familiar with, it is common for them to ask for details before and after the purchase. If there were a system that would help users compose better questions, the resultant posts would better match the FAQs. If solutions are still not found in the FAQs, websites present previous posts from other askers. An advanced posting RS could attempt to resolve questions before posting to the forum. The unanswered rate would decrease and the likelihood of getting a solution would increase. Any industry that fields many queries is suitable for more participation in the user's asking process.

For future work, the number of participants should be increased, the illness selection should be reconsidered, the data resources used to build an RS should be expanded, and the recommendation presentation should be made more user-friendly. Some participants felt the selected tasks were so general that they did not need an RS to complete the post, whereas others considered them too difficult to compose a post about; feedback thus varied widely among participants. A study with more varied tasks and more participants might yield further insights for designing posting recommender systems. The quality assessment of our posting RS is also important. Collecting more data from healthcare forums is the most direct way to improve the performance of posting RSs. However, what kind of data resources should be selected to build the posting RS? If the quality of the input (existing posts on online forums) is low, there is little chance of producing a high-quality RS. Therefore, training models on high-quality posts is one way to enhance the usefulness of the RS.

Regardless of whether the RS data sources support high quality revisions, the quality of posting RSs should be evaluated in advance. One potential approach is to take the first sentence of good WebMD questions to see whether the proposed RS can suggest sufficient terminology to formulate the subsequent sentences. Sufficient terminology could be identified by mapping the recommendations to the rest of the sentences of good questions. Then we could observe if the relevant terminology suggested matches the terminology used in the subsequent sentences of every post. Further study with eye-tracking augmentation could be useful to learn more about interactions between the process of decision-making and types of posting RSs.
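The evaluation idea above can be sketched as a simple coverage metric: feed the RS the first sentence of a known-good question, then measure what fraction of its recommended terms actually appear in the question's remaining sentences. The recommender output and sentences below are toy stand-ins, not WebMD data.

```python
def term_coverage(recommended: list[str], later_sentences: list[str]) -> float:
    """Fraction of recommended terms that occur in the follow-up text."""
    text = " ".join(later_sentences).lower()
    hits = sum(1 for term in recommended if term.lower() in text)
    return hits / len(recommended) if recommended else 0.0

# Hypothetical RS suggestions after seeing a question's first sentence.
recs = ["fever", "cough", "sore throat", "antiviral"]
# The rest of that (hypothetical) good question.
rest = [
    "I have had a fever of 39 degrees for two days.",
    "My cough is getting worse and my throat hurts.",
]
print(term_coverage(recs, rest))  # "fever" and "cough" match -> 0.5
```

Averaging this score over a corpus of good questions would give a rough, annotation-free proxy for how well the RS anticipates the terminology askers eventually need; a production version would want stemming or fuzzy matching rather than exact substring tests.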

In addition, while posting recommendations can help to compose posts in a more detailed way to attract experts to answer, the more detailed content provided the more sensitive data releases online. This is always a dilemma between efficiency and privacy.

Practitioners might need to pay attention to the forum policy when providing a posting recommender system.



**Author Contributions:** Y.-L.L.: conceptualization, methodology, writing—original draft. S.-Y.C.: methodology, writing—review & editing. Y.-J.C.: software, validation, formal analysis, investigation, data curation. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Ministry of Science and Technology, Taiwan, under Grant MOST 107-2410-H-004-098-MY3 and MOST 109-2410-H-004-067-MY2.

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki, and approved by the Research Ethics Committee of National Chengchi University (protocol code: NCCU-REC-201709-I036; date of approval: 16 August 2019).

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy concerns.

**Acknowledgments:** The authors thank Tsung-Hua Shen, a research assistant in the College of Pharmacy at Taipei Medical University, and Po-Yu Liao, a Physician at Liao ENT Clinic, for their valuable assistance in reviewing and categorizing the participants' queries in the opinion analysis phase.

**Conflicts of Interest:** The authors declare no conflict of interest.

*Electronics* **2021**, *10*, x FOR PEER REVIEW 15 of 20

#### **Appendix A. Supportive Paragraphs for Participants**


Food poisoning symptoms vary with the source of contamination. Most types of food poisoning cause nausea, vomiting, watery or bloody diarrhea, abdominal pain, and cramps and fever. Signs and symptoms may start within hours after eating the contaminated food, or they may begin days or even weeks later. Sickness caused by food poisoning generally lasts from a few hours to several days. Sometimes, there are serious complications. Whether you become ill after eating contaminated food depends on the organism, the amount of exposure, your age, and your health. High-risk groups include older adults, pregnant women, infants and young children, and people with chronic disease, who are highly affected by their immune system or changes in metabolism and circulation. Food poisoning is especially serious and potentially life-threatening for them. At home, people can stay safe by taking precautions such as separating raw foods from ready-to-eat foods, washing hands before eating, and defrosting foods safely.

Food poisoning syndrome results from the ingestion of water and a wide variety of food contaminated with pathogenic organisms (bacteria, viruses, parasites, and fungi), their toxins, and chemicals. Food poisoning must be suspected when an acute illness with gastrointestinal or neurological manifestations affects two or more persons or animals who have shared a meal during the previous 72 h. The term generally used encompasses both food-related infection and food-related intoxication. Some microbiologists consider microbial food poisoning to be different from foodborne infections. In microbial food poisoning, the microbes multiply readily in the food prior to consumption, whereas in foodborne infection, food is merely the vector for microbes that do not grow on their transient substrate. Others consider food poisoning as intoxication of food by chemicals or toxins from bacteria or fungi.

Foodborne illness (FBI), often called food poisoning, is caused by pathogens or certain chemicals present in ingested food: bacteria, viruses, molds, and worms. Disease-causing protozoa are all pathogens, although there are also harmless and beneficial bacteria that are used to make yogurt and cheese. Some chemicals that cause foodborne illness are natural components of food, whereas others may be accidentally added during production and processing, either through carelessness or pollution. The two most common types of foodborne illness are intoxication and infection. Intoxication occurs when toxins produced by the pathogens cause food poisoning, whereas infection is caused by the ingestion of food containing pathogens.

*[Reference]*
https://www.omicsonline.org/open-access/a-review-on-major-food-borne-bacterial-illnesses-2329-891X-1000176.pdf
https://www.mayoclinic.org/diseases-conditions/food-poisoning/symptoms-causes/syc-20356230

Some people suffer with seasonal allergies for years before learning about effective treatments. If allergy symptoms are not treated early, they can actually worsen over time. Here are five symptoms you should not ignore: runny or stuffy nose, sinus pressure, sneezing, itchy eyes, and postnasal drip. You may avoid your allergy triggers or ask doctors about other ways to get relief. Food allergies are an immune system reaction that occurs soon after eating a certain food. It is easy to confuse a food allergy with a much more common reaction known as food intolerance. While bothersome, food intolerance is a less serious condition that does not involve the immune system. Itching in the mouth and swelling of the lips, face, or other parts of the body are common signs of food allergies. People who have similar symptoms should keep away from food triggers, for example, shellfish, peanuts, and fish.

Allergies involve almost every organ of the body in variable combinations with a broad spectrum of possible symptoms; thus, their manifestations cover a wide range of phenotypes. Studies in Europe have shown that up to 30% of the population suffer from allergic rhinoconjunctivitis, whereas up to 20% suffer from asthma and 15% from allergic skin conditions. These numbers match those reported for other parts of the world, such as the USA and Australia. Food allergies are becoming more frequent and severe; occupational allergies, drug allergies, and allergies to insect stings (occasionally fatal) further aggravate the burden of the allergy epidemic. Despite the popular belief that allergies are mild conditions, a considerable and increasing proportion of patients (15–20%) have severe, debilitating disease and are under constant fear of death from a possible asthma attack or anaphylactic shock. Within the EU, there are nevertheless wide geographical variations in the incidence of allergies, with a south-to-north and east-to-west gradient. An alarming observation is that most allergic conditions start in childhood and peak during the highly productive years of individuals, with allergic rhinitis affecting up to 45% of 20- to 40-year-old Europeans. The numbers may even be an underestimation, as many patients do not report their symptoms or are not properly diagnosed. Indeed, it is estimated that approximately 45% of patients have never received a diagnosis. Notwithstanding evidence suggesting a plateau in some areas, the European Academy of Allergy and Clinical Immunology (EAACI) warns that in less than 15 years more than half of the European population will suffer from some type of allergy!

*[Reference]*

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3539924/
https://www.webmd.com/allergies/features/allergy-symptoms#2
https://www.mayoclinic.org/diseases-conditions/food-allergy/symptoms-causes/syc-20355095

I. Seasonal influenza (or "flu") is most often caused by type A or B influenza viruses. Symptoms include a sudden onset of fever, cough, headache, muscle and joint pain, sore throat, and a runny nose. The cough can be severe and can last 2 or more weeks. Most

people recover from fever and other symptoms within a week without requiring medical attention. However, influenza can cause severe illness or death in high-risk groups.


II. Someone with the flu may have a high fever, for example, their temperature may be around 104 °F (40 °C). People with the flu often feel achy and extra tired. They may lose their appetites. The fever and aches usually disappear within a few days, but the sore throat, cough, stuffy nose, and tiredness may continue for a week or more. The flu also can cause vomiting, belly pain, and diarrhea. Most people who get the flu get better on their own after the virus runs its course. However, call your doctor if you have the flu and any of these things happen: (a) you are getting worse instead of better; (b) you have trouble breathing or develop other complications, such as a sinus infection; or (c) you have a medical condition (for example, diabetes, heart problems, asthma, or other lung problems). Most teens can take acetaminophen or ibuprofen to help with fever and aches.

What scientists dream of is a vaccine that can protect against any flu strain for years or even a lifetime. This so-called universal flu vaccine is still a long way off, if it is even possible. However, many labs are dusting off past projects on broad flu vaccines, spurred by new funding and fears that H5N1, the deadly avian influenza that has swept across half the world, could acquire the ability to be transmitted from human to human. Until now, "flu has never been before high enough on the radar screen" for companies in particular to follow through with a strong push for a universal vaccine, says Gary Nabel, director of the Vaccine Research Center at the U.S. National Institute of Allergy and Infectious Diseases (NIAID) in Bethesda, Maryland.

Doing so, however, means coming up with an alternative way to stimulate immunity to the virus. The tried-and-true technique for seasonal flu uses a killed virus vaccine that works mainly by triggering antibodies to hemagglutinin (HA), the glycoprotein on the virus's surface that it uses to bind to human cells. Hemagglutinin and neuraminidase (NA), another surface glycoprotein that helps newly made viruses exit cells, give strains their names (H5N1, for example). The sequences of HA and NA mutate easily, which is why each season's flu strain—although it may be the same in subtype, such as H3N2—"drifts" slightly from the previous year's, and the annual vaccine must be tailor-made.

To make a universal vaccine for influenza A, which includes the main seasonal flu strains and bird flu, as well as past pandemic strains, some scientists are hoping to use "conserved" flu proteins that do not mutate much year to year. (Influenza B, the other type, occurs only in humans and causes milder symptoms.) Some of the conserved protein vaccines in the works stimulate the production of antibodies as do conventional flu vaccines, whereas others rouse certain immune system cells to battle the virus.

*[Reference]*

http://science.sciencemag.org/content/312/5772/380
http://www.who.int/features/qa/seasonal-influenza/en/
https://kidshealth.org/en/teens/flu.html

#### **References**


## *Article* **Sentiment Level Evaluation of 3D Handicraft Products Application for Smartphones Usage**

**Natinai Jinsakul 1,2, Cheng-Fa Tsai 2,\* and Paohsi Wang 3**


**Abstract:** Three-dimensional (3D) technology has attracted users' attention because it creates objects that can interact with a given product in a system. Nowadays, Thailand's government encourages sustainability projects through advertising, trade shows and information systems for small rural entrepreneurship. However, the government's systems do not include virtual products with a 3D display. The objective of this study was four-fold: (1) develop a prototype of a 3D handicraft product application for smartphones; (2) create an online questionnaire to collect user usage assessment data in terms of five sentiment levels—strongly negative, negative, neutral, positive and strongly positive—in response to the usage of the proposed 3D application; (3) evaluate users' sentiment level in 3D handicraft product application usage; and (4) investigate attracting users' attention to handicraft products after using the proposed 3D handicraft product application. The results indicate that 78.87% of participants' sentiment was positive or strongly positive regarding acceptance of the 3D handicraft product application, and evaluations assessing the attention participants paid to the handicraft products revealed positive or strongly positive sentiment in 79.61% of participants. The participants' evaluation results in this study show that our proposed 3D handicraft product application attracted users' attention towards handicraft products.

**Keywords:** sentiment level evaluation; handicraft product; 3D handicraft products; smartphone applications; user interaction; user's attracting attention

#### **1. Introduction**

Various advanced and interactive technologies have displayed their efficiency in the processing, promotion and demonstration of products by displaying three-dimensional (3D) products [1,2] of high quality on a screen. Users can access these via their own digital devices such as smartphones [3,4]. The capability and advantages of mobile technology have resulted in the incremental influence and utilization of smartphones, which have also led to the BYOD (Bring Your Own Device) policy. Using a smartphone for e-commerce has led to a gradual increase in online shopping [5]. Increased interest in online shopping [6] has resulted in various studies on 3D technology [1,2,7,8]. This is because 3D technology can show how objects interact with products. Consequently, this affects the shopping motivation of consumers, attracting their interest in the products.

Regarding the context of Thailand, in 2019, the Thai economy was projected to grow moderately by 2.7% in 2020, and it continues to experience growth due to foreign demand [9]. The agricultural sector grew by 1.5% in Quarter 3 of 2019 in accordance with the government's policy [9]. In addition, Thailand's digital economy has gained importance since the establishment of a new ministry, the Ministry of Digital Economy and Society,

**Citation:** Jinsakul, N.; Tsai, C.-F.; Wang, P. Sentiment Level Evaluation of 3D Handicraft Products Application for Smartphones Usage. *Electronics* **2021**, *10*, 199. https://doi. org/10.3390/electronics10020199

Received: 9 November 2020 Accepted: 14 January 2021 Published: 16 January 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

in 2016 [10]. This was because the utilization of the Internet, accessed by smartphones, became more widespread in Thai society; by 2020, Internet usage among Thai people exceeded the global average, with 59% [11] of Thailand's population (65.9 million) having access [10]. This has influenced the generation of new business forms; the Thai government has announced significant projects related to the digital economy's growth, such as Digital Thailand, Thailand 4.0, Digital Park and University 4.0. The digital strategy of Thailand is described as follows [10]:


Thai handicraft products can provide people in rural areas with supplementary income through products from their local resources. Various previous studies have recognized the importance of handicraft products [7,12–15]. Areas of community development expertise have been developed, with the One Tambon (subdistrict) One Product (OTOP) project supporting the sustainability of products from rural areas [16–18]. This scheme is modeled on One Village One Product (OVOP) in Japan [19]. OTOP is a small rural entrepreneurship that produces several kinds of products from raw natural resources, using inherited abilities that rely on the local area's ancient wisdom. Products such as textiles, wooden products, baskets and food include Thai handicraft products sold to tourists [16–18].

Thailand's government encouraged OTOP marketing through advertising, established trade show exhibitions and generated an information system for trading between manufacturers and consumers [16–18]. Our study refers to the government-provided information system, which does not support 3D product displays. Such displays could better attract consumers' attention to products than the current system can and could be one channel through which to support local handicraft producers and small rural entrepreneurship. This work investigates customers' sentiment level evaluation regarding the proposed 3D handicraft product application prototype we developed, and for which the study seeks to determine the users' feelings when using the application by collecting sentiment data utilizing an online questionnaire.

In this study, inspired by the idea of sentiment level evaluation, we used data that came from participants' questionnaire answers. The concept diagram of this work is shown in Figure 1. In the first step, a participant, using our 3D handicraft product application on their smartphone, completed an online questionnaire, with the answers collected in cloud storage.

In the second step, data preprocessing downloaded the participants' answer data from the cloud drive, imported the data into statistical software and removed useless attributes. These steps prepare the data for the final process of sentiment level evaluation, which applies the statistical software's average function to the 3D handicraft product application usage scores across five levels: strongly negative, negative, neutral, positive and strongly positive.
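The preprocessing and averaging steps above can be sketched in a few lines of Python. The field names, the attributes treated as "useless", and the sample answers are illustrative assumptions, not the authors' actual export schema.

```python
# Sketch of the preprocessing and averaging steps described above.
# Field names and sample answers are illustrative assumptions, not the
# actual schema exported from the authors' cloud storage.

LEVELS = {1: "strongly negative", 2: "negative", 3: "neutral",
          4: "positive", 5: "strongly positive"}

def preprocess(rows, useless=("timestamp", "email")):
    """Drop attributes that are not needed for sentiment evaluation."""
    return [{k: v for k, v in row.items() if k not in useless} for row in rows]

def average_sentiment(rows, item):
    """Mean five-point score for one questionnaire item, with the
    sentiment level label nearest to that mean."""
    scores = [row[item] for row in rows]
    mean = sum(scores) / len(scores)
    return mean, LEVELS[round(mean)]

# Hypothetical answers from three participants for one item
answers = [
    {"timestamp": "2019-06-01", "PEOU_1_1": 4},
    {"timestamp": "2019-06-02", "PEOU_1_1": 5},
    {"timestamp": "2019-06-03", "PEOU_1_1": 4},
]
clean = preprocess(answers)
mean, label = average_sentiment(clean, "PEOU_1_1")  # mean ≈ 4.33, "positive"
```

Mapping the mean back to the nearest of the five levels mirrors how an average function in statistical software would summarize a Likert item.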

The purpose of this study was to examine how users feel when using a 3D handicraft product application on a smartphone, using questionnaire data of participants' answers, which offers the advantage of representing the user sentiment level after application usage. Consequently, this study was divided into four objectives: (1) develop a 3D handicraft product application for smartphones; (2) create an online questionnaire to collect user usage assessment data; (3) evaluate users' sentiment level in 3D handicraft product application usage; and (4) investigate attracting users' attention to handicraft products after using the proposed 3D handicraft product application.

**Figure 1.** The concept of sentiment level evaluation approach employs data from an online questionnaire.

This paper is structured as follows. Section 1 provides the rationale and objective of the research. Section 2 includes related works and strengths for comparing with our work. Section 3 includes the 3D handicraft product application, the literature review for generating the online questionnaire and the data collection method. Section 4 gives the experimental results of sentiment level evaluation and attracting users' attention. Section 5 provides a discussion of the research findings and suggests directions for future study. Finally, Section 6 presents our conclusions.

#### **2. Related Works**

There are several related works regarding sentiment analysis applied for tracking human behavior. For example, the authors of [20] utilized data from online social media posts for training with the proposed machine learning techniques to generate a dynamic dictionary system for separating people's opinions, reaching good accuracy results of 90.21%. The authors of [21] produced machine learning models to identify people's activity in social networking, with the activity providing the emotional sentiment via the proposed models, which obtained a positive rate of 87.50% and a negative rate of 95.90%. The authors of [22] generated machine learning for sentiment prediction based on people's ranking in online social media by combining behavior and social data with word polarity classes, and the proposed method obtained an accuracy of 85%.

Because sentiment analysis using natural language processing can be adopted in the area of linguistics, the authors of [23] created an experiment to compare machine learning approaches for the human language in text data for sentiment classification by defining the verb, adjective and adverb classes, gaining an accuracy of 88.74%. The authors of [24] used a sentiment dataset to train a machine learning method for extracting the meaning from each vocabulary item to the sentences in the micro-blog, achieving a precision of 92.87%.
Furthermore, the authors of [25] also proposed machine learning for sentiment analysis by using text and message data in English and Chinese from micro-blogs to match in sentiment classes and then provided an indication represented by an emoticon—the performance obtained the accuracy of 88.30%. The authors of [26] considered that the conversation in social networking has several topics for which the researchers established multi-sentiment classification using the proposed machine learning method trained with a domain sentiment media dataset. The proposed model gained an overall sentiment classification accuracy of 71.79%.


Sentiment analysis is also applied in the field of education. For example, the authors of [27] investigated student satisfaction with massive open online courses (MOOCs) by employing supervised machine learning models to identify the course features, where the capability evaluation indicated an F-score of 88.32% for student satisfaction for learning via video instruction. In comparison, the authors of [28] focused on using sentiment classification to enhance higher education standards by adopting machine learning as the classifier to isolate students' comments, achieving an accuracy of 83% for classification performance.

Several works in the literature cover e-commerce and online shopping by conducting the sentiment analysis concentrating on customer comments on products and services. The authors of [29] developed a voting classification technique with machine learning by using data from customer reviews for customers' decisions. The results show that the proposed approach increases classification ability by producing an accuracy of 86.13%. The authors of [30] analyzed posts and discussions regarding multi-sentiment class across several topics in employing products and services with machine learning algorithms. The proposed method obtained an accuracy of 60.2% for seven categories and two classes produced an accuracy of 81.3%. The authors of [31] implemented machine learning techniques with customer experiences in reviews of products and service quality in e-commerce, where the information can be represented in emotions and opinions by the results in terms of precision at 80.10%. The authors of [32] generated a machine learning technique for multi-domains for e-commerce goods reviews and sentiment classification by gaining the average classification accuracy for cross-domain sentiment classification of 77.52% and average accuracy for domain-specific classification of 85.58%. The authors of [33] applied machine learning algorithms for identifying sentiment by big consumer review data for the experience in using e-commerce and real-time shopping. The ability of the system is effective, achieving accuracy close to 98%.

Machine learning for sentiment analysis has been applied in the case of hotel and tourism services. For example, the authors of [34] utilized the contextual data in the text comments of hotel service training with ensemble learning by achieving an accuracy of 96.03%. The authors of [35] developed machine learning methods for sentiment analysis of online tourist comments to provide good comments and suggestions for other interested tourists, with the classification results obtaining an accuracy of 81.87%.

The entertainment area can also utilize sentiment analysis. For example, the authors of [36] conducted the extraction of machine learning models, with the results showing a suitable machine learning algorithm that obtained a classification accuracy of 82.50%. The authors of [37] studied the machine learning technique for sentiment analysis by creating a sentiment dictionary for users to message online while watching shows in a real-time video on the screen, for which the proposed technique obtained a classification accuracy of 88.20% and extracted emotional data from the video by using words consisting of several emotions.

In cases of disaster and security, the authors of [38] applied an approach driven by big data for disaster response via sentiment classification, with the data of the disaster gathered from social networks and classified information following the affected people's requirements, categorized with the machine learning algorithm for analyzing the people's sentiment by extracting features of parts of speech and lexicon, indicating good results and achieving high classification precision up to 95%. The authors of [39] investigated sentiment analysis in terms of authentication, availability, integrity and confidentiality to estimate that reviews are trustworthy, by using the machine learning categorization. The outcome showed that 23% of applications have reliability over 0.5. In comparison, 77% of other remaining applications had reliability lower than 0.5. The appropriate application related to topical reliability contained poor security.

The above related works apply machine learning techniques for sentiment analysis in several areas, which differ from this study. We provide the summarized strengths and key differences between these related works and our work in Table 1.


**Table 1.** Related works and strengths for comparing with our work.

#### **3. Materials and Methods**

To prepare for this study, we created the proposed 3D handicraft product application and user interaction. We also reviewed the literature to generate a questionnaire, the data collection methods, the evaluation of sentiment level of proposed application usage and sentiment in terms of attracting participants' attention to handicraft product.

#### *3.1. 3D Handicraft Product Application*

Each 3D handicraft model was created in open-source software named Blender (version 2.80, Blender Foundation, Amsterdam, The Netherlands) [40] and used to develop an application for a smartphone. The 3D handicraft product application was developed using game engine software Unity (version 2019.4.9, Unity Technologies, San Francisco, CA, USA) [41] and an installed Android SDK to generate an application that would be compatible with an Android system in a smartphone. The application is in both Thai and English; users can change between languages and use the buttons to select and view the 3D handicraft products (Figure 2).

**Figure 2.** Menu options of several handicraft products created from various rural natural materials: (**a**) sedge; (**b**) mangrove palm leaves; (**c**) timber; and (**d**) coconut shell.

The Thai handicraft products in the application focused on products from rural resources in four categories: sedge, mangrove palm leaves, timber, and coconut shell (Figure 3). A participant can use the application on a smartphone to get a 360-degree look at the product (Figure 4); in addition, the 3D product system displayed text with details on each 3D product, including the product's name, size, price, and usability.

#### *3.2. Questionnaire Creation*

The literature review investigated several related works in the field of technology assessment, as presented in Table 2, including the perceived ease of use, perceived usefulness, user attitudes, behavioral intention, user interaction, 3D product display and ability to attract attention.
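The seven assessed constructs lend themselves to a small lookup table. The sketch below groups hypothetical per-item scores by construct prefix; the abbreviations follow those defined later in Section 3.3, while the item naming scheme and sample scores are assumptions for illustration only.

```python
# The seven questionnaire constructs (abbreviations as defined in the
# text); the item naming scheme below is an illustrative assumption.
CONSTRUCTS = {
    "PEOU": "perceived ease of use",
    "PU": "perceived usefulness",
    "AT": "attitude toward",
    "BT": "behavioral intention",
    "3DUI": "3D user interaction",
    "3DPD": "3D product display",
    "AA": "attracting attention",
}

def construct_mean(item_scores, prefix):
    """Average the five-point scores of all items belonging to one construct."""
    scores = [s for item, s in item_scores.items() if item.startswith(prefix)]
    return sum(scores) / len(scores)

# Hypothetical per-item scores for one participant
scores = {"PEOU_1": 4, "PEOU_2": 5, "PEOU_3": 4, "PU_1": 3}
```

Grouping items by a construct prefix keeps the per-construct averaging a one-liner once the answers are in a flat item-to-score mapping.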

#### *3.3. Data Collection and Demographic Information*

In total, 2500 participants, who were alumni and students from universities in several locations (thus explaining why the study had many participants), were sent online questionnaires constructed via Google Forms and distributed to their email addresses. Participants were recruited from six regions of Thailand (north, northeast, east, west, central and south). The email included a hyperlink for downloading the 3D handicraft product application for use with the participant's smartphone. Furthermore, participants had to provide primary data and smartphone use habits as smartphone application interaction information. Both the questionnaire and the 3D handicraft product application were in the Thai language; the participants' data, including full name and identifying details, were not shown in the collected data. The preliminary participants' information appears, with the totals and demographics as percentages, in Table 3; the totals and percentages of participants' smartphone-use answers are shown in Table 4. The online questionnaire answers were developed on a five-point scale (strongly negative = 1, negative = 2, neutral = 3, positive = 4 and strongly positive = 5) to evaluate sentiment in questions assessing perceived ease of use (PEOU), perceived usefulness (PU), attitude toward (AT), behavioral intention (BT), 3D user interaction (3DUI), 3D product display (3DPD) and attracting attention (AA). After sending the online questionnaire to the 2500 participants' email addresses, 1775 questionnaires were received by the scheduled deadline, for a response rate of 71% (questionnaire sending began 1 June 2019; we waited for responses until 31 December 2019).
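The response-rate arithmetic above can be checked directly. In the sketch below, the per-level counts are invented for illustration only; the study reports aggregate percentages (such as 78.87% positive or strongly positive), not a per-level breakdown at this point.

```python
# Checking the response-rate arithmetic reported above.
sent, received = 2500, 1775
response_rate = received / sent * 100  # 71.0%

# Share of responses at the two highest levels of the five-point scale.
# These per-level counts are invented for illustration; only aggregate
# percentages are reported in the text.
counts = {"strongly negative": 25, "negative": 50, "neutral": 300,
          "positive": 900, "strongly positive": 500}
total = sum(counts.values())
positive_share = (counts["positive"] + counts["strongly positive"]) / total * 100
```

With these example counts, the positive-or-strongly-positive share works out to roughly 78.9%, the same order as the aggregate figure the study reports.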

**Figure 3.** 3D handicraft products and details in the smartphone application: (**a**) bag created from sedge, (**b**) bin crafted from mangrove palm leaves, (**c**) vase produced from timber, and (**d**) cup made from coconut shell.

**Figure 4.** User interaction with a 3D handicraft product.

*3.2. Questionnaire Creation*

The literature review investigated several related works in the field of technology assessment, as presented in Table 2, including perceived ease of use, perceived usefulness, user attitudes, behavioral intention, user interaction, 3D product display and the ability to attract attention.

**Table 2.** Several related works in the field of technology assessment for generating the questionnaire.

| Questionnaire Item | Literature |
|---|---|
| 1. Perceived Ease of Use (PEOU)<break/>1.1 How easy is it to learn how to use a 3D handicraft product application on a smartphone?<break/>1.2 How convenient is it to use a 3D handicraft product application on a smartphone?<break/>1.3 How flexible is the 3D handicraft product application on a smartphone? | Users' feelings toward an application are easy to assess. The perceived ease of use indicates the type of user who is expected to employ the application without any difficulty [2,42]. Perceived ease of use (PEOU) was mentioned in the technology acceptance model (TAM) [43]. The TAM in terms of PEOU was tested in earlier studies to describe users' acceptance of the system and related applications, namely mobile entertainment [44], virtual worlds [45] and the social virtual world market [46]. This indicated that the application system's performance was appropriate, and … |

**Table 3.** Total and percentage of participants' demographics.

**Table 4.** Total and percentage of participants' smartphone use information for evaluation.


Demographic information indicates that, of the 1775 total participants, 986 (55.55%) were female and 789 (44.45%) were male. Most were in the age range of 25–30 years old (546; 30.76%), while the smallest age group comprised participants older than 50 (151; 8.51%). The largest regional group was from the northeast (330; 18.86%) and the smallest from the west (257; 14.54%). In terms of occupation, the largest group was self-employed (367; 20.77%) and the smallest unemployed (153; 8.66%). The average income per month for most participants was 10,001–15,000 baht (291; 24.84%), while the smallest group, 151 participants (8.51%), earned more than 25,000 baht.

The most frequent main purpose of using a smartphone application in a day was social media (403; 33.70%); the least popular answer was playing games (371; 18.03%). A question regarding using 3D applications on a smartphone revealed that 1189 participants (66.99%) had used them, while 586 (33.01%) had never used them. A question about using 3D products on a smartphone revealed that 705 (39.72%) had used this function, while 1070 (60.28%) had never used it.

#### *3.4. Data Preparation and Sentiment Level Evaluation Method*

The questionnaire data were divided into two parts. The part used for the sentiment level evaluation is the second part of the questionnaire, which refers to application usage. The first part covers the participants' demographics; it contains only preliminary information with no identifiable details, and, as it is unnecessary for sentiment level evaluation, we decided not to apply it. After obtaining the questionnaire data, we used statistical software to prepare our data and evaluate participants' sentiment level by calculating, for each of the seven main attributes (PEOU, PU, AT, BI, 3DUI, 3DPD and AA), the average of its sub-attributes. After that, we computed the average of all seven main attribute values and created a new attribute for the average total, using the average function provided by the statistical software (see Figure 5 for a more in-depth explanation). Once the average total value was obtained, we used this attribute to determine the sentiment level by condition (see Table 5).
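The averaging-and-thresholding procedure described above can be sketched as follows. The grouping of answers by attribute and the five equal-width score bands are illustrative assumptions only; the paper's actual cut-offs are defined in Table 5.

```python
# Sketch of the data-preparation step: average each main attribute's
# sub-attribute scores, average the seven results into an "average total",
# then map that value to a sentiment level. The band boundaries below are
# assumed equal-width intervals, not the paper's exact Table 5 condition.

def attribute_average(sub_scores):
    """Average the sub-attribute scores of one main attribute."""
    return sum(sub_scores) / len(sub_scores)

def sentiment_level(avg_total):
    """Map the average total onto the five-point sentiment scale
    (assumed bands; the study's exact condition is given in Table 5)."""
    if avg_total < 1.8:
        return "strongly negative"
    elif avg_total < 2.6:
        return "negative"
    elif avg_total < 3.4:
        return "neutral"
    elif avg_total < 4.2:
        return "positive"
    else:
        return "strongly positive"

# One participant's answers grouped by main attribute (illustrative values).
answers = {
    "PEOU": [4, 5, 4], "PU": [4, 4, 4], "AT": [5, 4, 5],
    "BI":   [4, 4, 5], "3DUI": [4, 3, 4], "3DPD": [5, 5, 4], "AA": [4, 5, 4],
}

main_averages = {attr: attribute_average(s) for attr, s in answers.items()}
average_total = attribute_average(list(main_averages.values()))
print(round(average_total, 2), sentiment_level(average_total))
```

Applied per participant, this yields the average-total attribute and sentiment label used in the statistics of Section 4.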


**Figure 5.** Average results of seven main attributes, average total results and the calculation of sentiment level.

**Table 5.** The condition to gain the sentiment level result.


#### **4. Results**


This study's contribution is that our proposed 3D handicraft product application is appropriate for usage and can attract users' or customers' attention to handicraft products after using the proposed application. The results for the assessment of the attention paid by participants and the utilization of the statistical software for sentiment level evaluation after using the proposed 3D product application are described in this section.

#### *4.1. Results of Participants' Sentiment Level Evaluation of Proposed 3D Product Application Usage*

As mentioned, data preparation employed statistical software to evaluate participants' sentiment level and users' attention by calculating the average of each of the seven main attributes from its sub-attributes and then creating a new attribute for the average total to determine the sentiment level by condition. For the sentiment level evaluation of 3D handicraft product application usage, we investigated the general statistics, namely the number and percentage of the 1775 participants at each sentiment level, as indicated in Table 6. Participants whose sentiment was strongly negative, negative or neutral were combined under the heading of "reject" using 3D handicraft products; they totaled 375 participants, accounting for 21.13%, while participants who showed positive or strongly positive sentiment totaled 1400 (78.87%). This demonstrates that participants accept the 3D handicraft product application and generally have positive feelings.

**Table 6.** Total number and percentage of sentiment levels evaluation.


#### *4.2. Results for the Sentiment of Attracting Attention to Handicraft Products*

By collecting data from participants' usage evaluations, we can analyze the attention paid by participants by applying only the attracting attention (AA) attribute to calculate the percentages presented in Table 7. The participants' post-use sentiment evaluations, assessing the attention paid to the handicraft products, revealed that a positive sentiment was expressed by 688 participants (38.76%) and a strongly positive sentiment by 725 participants (40.85%), for a total positive score of 1413 participants (79.61%). Compared with the combined strongly negative, negative and neutral scores of 362 participants (20.39%), participants expressing positive sentiment in relation to their attention being drawn towards handicraft products were far more numerous, and the evaluation results of this survey show that our proposed 3D handicraft product application affected users by attracting their attention towards handicraft products.
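The reported percentages follow directly from the stated counts; a quick arithmetic check, using only numbers given in this section:

```python
# Recompute the attracting-attention (AA) percentages from the participant
# counts stated in the text: 688 positive and 725 strongly positive of 1775.
TOTAL = 1775
positive, strongly_positive = 688, 725

def pct(n):
    """Percentage of all participants, rounded to two decimals."""
    return round(100 * n / TOTAL, 2)

total_positive = positive + strongly_positive   # 1413 participants
rest = TOTAL - total_positive                   # 362 participants

print(pct(positive), pct(strongly_positive), pct(total_positive), pct(rest))
```

This reproduces the 38.76%, 40.85%, 79.61% and 20.39% figures quoted above.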



#### **5. Discussion**

The return of 1775 completed questionnaires from the 2500 emails sent in Thailand represents a response rate of 71%. The demographic data show that females returned more questionnaires than males, and most participants were in the age range of 25–30 years old. In terms of occupation, the largest share of participants were self-employed, and the average income per month for most participants was 10,001–15,000 baht.

According to the evaluation of the 1775 participants' sentiment level in using the proposed 3D handicraft product application, 1400 participants (78.87%) expressed positive or strongly positive sentiment. This demonstrates that participants accept the 3D handicraft product application and generally have positive feelings. Regarding attracting users' attention to handicraft products, the participants were evaluated in terms of sentiment after using the proposed application: the attention paid to the handicraft products received a total positive score from 1413 participants, accounting for 79.61%. Participants expressing positive sentiment in relation to the attention paid to handicraft products far outnumbered those expressing negative sentiment, and the evaluation results of this survey show that our proposed 3D handicraft product application affected users by attracting their attention towards handicraft products.

Further studies should consider collecting more types of handicraft products to add to the 3D product application. Regarding sentiment level categories, this study defined only a five-point scale; future research could use scales of 1–7 or 1–3 and compare the sentiment level outcomes. The questionnaire in this study only collected numeric answers for estimating the sentiment level, while a subsequent study could use interviews to obtain users' comments and suggestions and apply machine learning for text classification. This study only sampled participants living in different parts of Thailand; a future study could extend the sample to expatriates or international tourists, who may be more likely to pay attention to Thai handicraft products.

#### **6. Conclusions**

This research developed a 3D handicraft product application for smartphones. 3D technology can promote the sustainability of small rural entrepreneurship by advertising products and attracting consumers' attention. User reactions to the proposed 3D application must be investigated to improve the application and provide greater usage capability. The purpose of this study was to examine how users feel about handicraft products, and how attention can be drawn towards them, through a 3D handicraft product application on a smartphone; to this end, we developed the application and used an online questionnaire to collect user assessment data.

The proposed questionnaire of this work was divided into two parts. The demographic part illustrated that, of the 1775 total participants, the most frequent main purpose of daily smartphone application use was social media. A question regarding using 3D applications on a smartphone revealed that 66.99% of participants had used them, while 33.01% had never used them. A question about using 3D products on a smartphone revealed that 39.72% had used this function, while 60.28% had never used it.

In the second part of the questionnaire, the answers were measured on a five-point scale (strongly negative, negative, neutral, positive and strongly positive). The attributes assessed included perceived ease of use, perceived usefulness, attitude toward, behavioral intention, user interaction, 3D product display and attracting attention. Participants whose sentiment fell under "reject" for using 3D handicraft products totaled 21.13%, while participants who showed positive or strongly positive sentiment totaled 1400, accounting for 78.87%. This demonstrates that participants accept the 3D handicraft product application and generally have positive feelings. The participants were also evaluated in terms of sentiment after using the proposed application, assessing the attention paid to the handicraft products: a positive sentiment was described by 38.76% and a strongly positive sentiment by 40.85%, so the total positive score of 1413 participants accounted for 79.61%. Participants expressing positive sentiment related to attracting attention far outnumbered those expressing negative sentiment, and the evaluation results of this study show that our proposed 3D handicraft product application affected users by attracting their attention towards handicraft products.

**Author Contributions:** Conceptualization, N.J. and C.-F.T.; Data curation, P.W.; Formal analysis, N.J.; Funding acquisition, C.-F.T.; Investigation, N.J. and C.-F.T.; Methodology, N.J. and C.-F.T.; Project administration, C.-F.T.; Resources, N.J.; Software, N.J.; Supervision, C.-F.T.; Validation, P.W.; Visualization, N.J.; Writing—original draft, N.J.; and Writing—review and editing, C.-F.T. and N.J. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Ministry of Science and Technology, Republic of China, Taiwan, grant numbers MOST-108-2637-E-020-003 and MOST-108-2321-B-020-003.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. This data can be found here: https://zenodo.org/record/4442207.

**Acknowledgments:** The authors would like to express their sincere gratitude to the anonymous reviewers for their useful comments and suggestions for improving the quality of this paper, as well as the Department of Tropical Agriculture and International Cooperation, Department of Management Information Systems, National Pingtung University of Science and Technology, Taiwan, Ministry of Science and Technology, Republic of China, Taiwan and Suratthani Rajabhat University, Suratthani, Thailand for supporting this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**

