1. Introduction
In qualitative research, researchers typically use coding to label and group similar, non-numerical data and thereby generate themes and concepts that make data analysis more manageable. Coding is a sophisticated process that involves reviewing the data, developing a set of codes that accurately characterize the material, and classifying the data accordingly. Especially when dealing with large quantities of data, it is not surprising that researchers turn to digital technologies for support, where specialized software has been developed under the umbrella term of qualitative data analysis (QDA) tools (e.g., [
1,
2]).
With the emergence of chatbots such as ChatGPT and Gemini (formerly named Google Bard) as a form of generative artificial intelligence (AI), qualitative data researchers have become interested in their usability in a coding process [
3,
4]. Researchers therefore face the dilemma of choosing between specialized QDA tools and generative chatbots to find the solution that best fits their needs. Multiple material and non-material factors can influence this decision. As qualitative researchers and educators, we were interested in how, among many other factors, user experience (UX), usability, trust, affect, and mental load can serve as predictors of user perception when evaluating artificial intelligence-driven chatbots and traditional qualitative data analysis tools. The strengths and weaknesses of chatbots compared to other digital technologies performing the same tasks therefore need to be carefully analyzed, including through studies of the interaction between humans and computers.
The field of human-computer interaction (HCI) has been thoroughly researched, but studies combining HCI and chatbots, although numerous, are still at an early stage. A further obstacle for researchers in practice is that the functions and capabilities that can influence the UX of chatbots change almost daily. Usability criteria such as effectiveness, efficiency, and satisfaction, for example, have been used to determine how successfully users can learn and use chatbots to achieve their goals and how satisfied they are with their use [
5,
6,
7,
8,
9,
10,
11,
12]. Comparative studies between tasks performed by chatbots and “traditional digital tools” are even rarer, necessitating additional research [
13].
Even though the number of publications evaluating the usability and UX of chatbots has started to increase over the last five years, the existing literature reveals only a very limited number of studies that specifically analyze the usability and UX of chatbots when this type of technology is used for qualitative data analysis. This highlights the need for further research in this area. Our study aims to (1) identify and adapt standard measurement instruments for evaluating chatbots in the context of qualitative data analysis, and (2) evaluate and compare the usability and UX of chatbots for qualitative analysis and determine their (dis)advantages over non-AI-based tools. Based on previous research findings [
5,
7,
9,
10,
11,
12], our study aims to adapt and combine several different measurement instruments that will enable a more holistic analysis of the usability and UX of chatbots in the context of qualitative data analysis. At the same time, one of the goals of this study is to determine whether there are differences in usability and UX perceptions between chatbots and QDA tools that are not based on AI or natural language processing (NLP). This study may be regarded as a pilot in HCI research, providing new insights into end users’ perceptions of the usability and UX of chatbots used in qualitative research. Additionally, the measurement instruments validated through the study’s results provide researchers with a tool for future research in this field.
The primary goal of the research is to analyze users’ perceptions of the usability and UX of AI-based and non-AI-based tools for qualitative data analysis. To accomplish this, the following research questions served as the foundation for our study design:
- RQ1.
Are there significant differences in the usability and UX of chatbots and non-AI tools in the qualitative data analysis process?
- RQ2.
Are there significant differences in trust, task difficulty, emotional affect evaluation, and mental workload experienced when using chatbots and non-AI tools in qualitative data analysis?
3. Materials and Methods
3.1. Study Design
This study’s main goal was to compare how users’ perceived usability, ease of use, UX, mental effort, and trust differ between non-AI and AI-based tools when conducting qualitative data analysis. Simultaneously, we aimed to understand whether user perceptions varied according to the kind of chatbot tool utilized. The basic conceptual model of the main variables that we aimed to investigate, evaluate, and compare in our research is shown in
Figure 1.
This study used the Taguette tool as an example of a tool that does not use AI for qualitative data analysis. Taguette is a free QDA tool that can be used to analyze qualitative data such as interview transcripts, survey responses, and open-ended survey questions by coding different text segments and allowing multiple tags per segment, making it easier to identify patterns or themes [
20]. As examples of two AI-based tools in our study, we chose ChatGPT (version 3.5) and Gemini, which also proved to be tools with high accuracy, comprehensiveness, and self-correction capabilities [
74].
The study’s main objective was to evaluate user perception of the observed AI-based chatbots and non-AI-based tools in the context of qualitative analysis. We focused primarily on the users’ perceptions of using a specific tool for qualitative data analysis. The goal was also to establish measurement instruments to capture quantitative data and compare users’ perceptions of different tools.
Participants in this research were placed in the role of UX researchers and tasked with a qualitative evaluation using the individual tools. Since the goal was not to compare the tools’ analysis outputs directly, we provided users with different sets of qualitative data obtained in preliminary research about the content, UX, and structure of the website of the Faculty of Electrical Engineering, Computer Science and Informatics, University of Maribor. The role of the participants in this study was to analyze the obtained qualitative data and to present both positive and negative aspects of the quality of the faculty website’s content, UX, and structure.
Participants were presented with three separate text documents (content.txt, ux.txt, and structure.txt), each containing 100 user responses. These responses were collected through an online survey designed to evaluate the faculty’s website
https://feri.um.si/en/ (accessed on 15 December 2024). In the survey, 100 users, including students and faculty members (professors, teaching assistants, and others), answered three open-ended questions addressing weaknesses in the website’s content, user experience, and structure. The data, gathered as part of a prior user study, served as the basis for the qualitative analysis conducted in our experiments.
Responses were mostly short and presented in separate lines. Examples of the responses included “Too much content at once”, “I find the content OK, as it is related to the study or fields of study”, “A lot of irrelevant content. The front page should not contain news/announcements”, “Too few colours, page looks monochrome”, “Maybe someone can’t find what they’re looking for because they can’t find the relevant subcontent”, “I didn’t notice any shortcomings”, and so on. The data were organized into three separate documents to prevent participants from becoming overly familiar with a single dataset, which could have influenced their perception across the tools. By introducing a fresh dataset for each tool, we ensured that any differences in users’ perceptions were more reflective of the tool than of prior knowledge of the data. The first file, content.txt, contained answers to questions concerning the quality of the content provided on the faculty website. The second file, ux.txt, included a set of answers to the open-ended question concerning the UX quality of the faculty’s website. Finally, the third file, structure.txt, included answers about users’ perceptions of the website’s structure quality.
This study’s process involved several steps that included practical tasks as presented in
Figure 2. In the beginning, we presented the basic purpose and goals of the experiment to the participants and briefly explained the basic steps (without a detailed explanation of the individual tasks). In this initial step, the participants signed a statement agreeing to participate in the research and to the use of the data for further research. A short survey followed the introduction step, the aim of which was to collect basic demographic information about the participants of the experiment. This included information on participants’ gender, tool familiarity level, and frequency of AI chatbot use. User familiarity with AI tools was assessed for ten of the most common AI-based chatbots available on the market at the time of the study (i.e., ChatGPT, Gemini, Bing Chat, OpenAI Playground, Jasper, Perplexity AI, HuggingChat, Chatsonic, YouChat, Socrates.ai). A five-point Likert scale with answers ranging from 1—Not familiar at all to 5—Very familiar was used. The frequency of AI tool use was assessed on a similar five-point Likert scale (1—Never, 2—Very rarely, 3—Occasionally, 4—Frequently, 5—Very frequently). All aforementioned demographic data were self-reported. While user familiarity with a broader range of chatbots was assessed to contextualize participant experience, ChatGPT and Gemini were specifically selected for the experimental tasks due to their advanced functionalities, alignment with the requirements of qualitative data analysis, and their popularity at the time of the study. Other chatbots listed in
Table 2 were not included in the experiments as they did not fully meet the criteria for this study.
After users had completed the demographic questionnaire, we presented the first task using the Taguette tool. In the first task, they were instructed to perform data annotation on the content.txt file—tagging the survey answers with the core challenges they highlighted. Participants were free to define as many tags as needed. The participants did not receive instructions or restrictions on how to name the tags. In this task, the participants had to identify and prepare a list of the top ten most crucial challenges related to the website’s content based on the qualitative data. When the participants were sure that they had completed the task, they had to submit the result or solution and fill in the questionnaire to assess the implementation of Task 1 with the selected tool. In the second task, the participants were asked to categorize the answers in the same dataset based on their sentiment as positive or negative. When the participants were sure they had completed the second task, they had to submit their solution via the online system and fill out a questionnaire to evaluate the implementation of Task 2 with the selected tool. After participants completed and submitted both task results and answered the task-related surveys, users were asked to answer the tool survey. With the tool survey, we captured additional quantitative data on user perceptions of usability and UX when using the tool for qualitative data analysis. After completing the survey for evaluating the first tool (Taguette), the process continued with the same tasks and surveys for evaluating the two selected AI-based tools. When users evaluated the ChatGPT tool, they conducted the qualitative data analysis on the ux.txt file, and with Gemini, participants conducted the qualitative data analysis of the data provided in the structure.txt file.
The tasks were designed to reflect key activities in QDA and align with the study objectives of comparing the usability and UX of AI-based and non-AI-based tools. The first task focused on data annotation, a foundational QDA activity (e.g., [
2]). Participants categorized and tagged qualitative data based on themes or patterns, simulating realistic coding activities that require interpretation, pattern recognition, and thematic categorization. This task highlighted the tools’ flexibility and ability to support subjective judgment, which is critical in qualitative research. The second task introduced sentiment analysis, requiring participants to categorize data as either positive or negative. This task was chosen to test the tools’ efficiency and usability in guiding users through structured decision-making processes. Sentiment analysis reflects a common extension of QDA, where researchers analyze emotional tones in textual data (e.g., [
7,
24]). These tasks were intentionally chosen to examine the tools’ ability to handle open-ended and structured qualitative analysis, covering varying levels of cognitive complexity and decision-making processes. To ensure consistency and avoid familiarity bias, the datasets for each tool were specifically assigned: the content.txt file was analyzed using the traditional tool (Taguette; [
30]), while the ux.txt and structure.txt files were analyzed using AI-based tools (ChatGPT and Gemini). The task design supports the study’s primary research questions, particularly comparing usability, UX, trust, and mental workload across different tools during realistic QDA scenarios. Relevant usability criteria, including effectiveness, efficiency, and satisfaction (e.g., [
7,
13]), informed the task design to ensure alignment with theoretical foundations and prior research.
3.2. Participants
Participants in this research’s experiments were recruited at the University of Maribor, Faculty of Electrical Engineering and Computer Science. Two groups of Master’s students in the Informatics and Data Technologies study program were included in this research. These individuals were chosen based on their prior experience in UX evaluations, ranging from basic to advanced levels, making them well-suited candidates for this experiment. Additionally, due to their prior academic background, participants possessed the requisite skills to meaningfully engage with the data coding tasks at hand. At the same time, their lack of prior exposure to the datasets used in the study ensured neutrality, reducing any potential bias stemming from familiarity with the data. Altogether, 85 students participated in the research, of whom 63 were male and 22 female. The first group consisted of 54 first-year Master’s students, and the second group of 31 second-year Master’s students.
All participants were familiar with intelligent chatbots based on large language models. The participants’ tool familiarity level was evaluated before the exercises. The results were included in the demographic presentation of the user sample, and user familiarity was checked for ten of the most common AI-based chatbots available on the market, on a five-point Likert scale (1—Not familiar, 5—Very familiar). The results are presented in
Table 2. The highest mean familiarity was observed with ChatGPT (M = 4.62, Std = 0.60). All other chatbots reached lower mean familiarity levels, with Gemini (M = 2.65, Std = 1.23) and Bing (M = 2.55, Std = 1.04) reaching the next highest mean familiarity, as self-reported by participants.
Additionally, participants were asked to rate the frequency of their intelligent chatbot use on a scale of 1–5 (1—Never, 2—Very rarely, 3—Occasionally, 4—Frequently, 5—Very frequently). The results are presented in
Table 2. Participants expressed frequent use of ChatGPT (M = 4.39, Std = 0.86), rare use of Gemini (M = 1.85, Std = 1.02), and very infrequent use of the other tools.
Figure 3 visualizes the mean values of the participants’ familiarity level with AI chatbots and frequency of use for the selected chatbots.
3.3. Measures
Evaluating the QDA tools involved a comprehensive assessment from two perspectives: the task’s complexity and the user experience (
UX) during task completion. Each tool was assessed using standardized surveys designed to evaluate five dimensions:
UX,
usability,
trust,
mental workload, and
task difficulty. To achieve this, three questionnaires were designed, each tailored to specific stages of the evaluation process. These surveys captured multiple aspects of UX and usability, and
Figure 4 provides an overview of the evaluated factors and their corresponding measurement instruments.
The evaluations relied on self-reported scales, reflecting participants’ perceptions rather than objective behavior or cognitive process measures. It is important to note that reported trust levels may not directly correspond to trust exhibited in decision-making, and perceptions of workload or difficulty may diverge from actual task performance or cognitive load.
Participants’ general information, including gender, familiarity with chatbots, and the frequency of chatbot use, was collected using a concise Demographic Survey.
Following task completion, a second questionnaire (
Task Survey) was administered to gather insights into users’ effort and experience while completing tasks. This survey measured three key factors:
mental workload,
perceived task difficulty, and
emotional affect. The
NASA-TLX instrument [
75] was used to assess six dimensions of workload: mental demand, physical demand, temporal demand, performance, effort, and frustration. Each subscale was rated on a 1–20 scale, with an overall score of 0–100 calculated as a simple sum (referred to as Raw TLX or RTLX [
76]). Task difficulty was evaluated using the
Single Ease Question (SEQ), a validated 7-point Likert scale that performs comparably to other task complexity measures [
77]. Additionally, emotional responses were assessed using the
I-PANAS-SF, a ten-item scale that measures both positive and negative affect [
78].
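To make the Task Survey scoring concrete, the following minimal Python sketch aggregates the three measures described above. It is an illustration only: the function and variable names are ours, the example ratings are hypothetical, and it assumes the simple-sum aggregation of the six NASA-TLX subscale ratings (Raw TLX) and of the five positive and five negative I-PANAS-SF items described in the text.

```python
from typing import Dict, List

TLX_SUBSCALES = ["mental", "physical", "temporal",
                 "performance", "effort", "frustration"]

def raw_tlx(ratings: Dict[str, int]) -> int:
    """Raw TLX (RTLX): unweighted sum of the six subscale ratings,
    following the simple-sum aggregation described in the text."""
    return sum(ratings[s] for s in TLX_SUBSCALES)

def ipanas_sf(positive_items: List[int], negative_items: List[int]) -> Dict[str, int]:
    """I-PANAS-SF: separate sums of the five positive-affect and
    five negative-affect items (each rated 1-5)."""
    assert len(positive_items) == 5 and len(negative_items) == 5
    return {"positive_affect": sum(positive_items),
            "negative_affect": sum(negative_items)}

# Hypothetical ratings for one participant and one task:
tlx = raw_tlx({"mental": 12, "physical": 3, "temporal": 7,
               "performance": 5, "effort": 10, "frustration": 6})
affect = ipanas_sf([4, 3, 4, 3, 2], [1, 1, 2, 1, 1])
seq = 2  # Single Ease Question: a single 7-point rating, used as-is
print(tlx, affect, seq)
```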
A third survey (
Tool Survey) focused on the
usability,
UX, and
trust associated with each tool.
UX was evaluated using the
UEQ-S instrument, a shorter version of the User Experience Questionnaire [
79]. This tool measures pragmatic and hedonic qualities with 8 items, on a scale ranging from −3 to +3 and allows benchmarking against reference values established from 468 studies [
80].
Usability was assessed with the
System Usability Scale (SUS), which provides scores ranging from 0 to 100 and allows comparisons with established benchmarks [
81,
82]. Furthermore, a
net promoter score (NPS) was derived from the SUS results as a measure of user loyalty [
83]. The final dimension,
trust, was evaluated using the
TOAST instrument [
84], which features two subscales measuring performance trust and understanding trust. This tool uses a seven-point Likert scale. Higher scores on the performance subscale indicate that the user trusts the system to help them perform their tasks, while higher scores on the understanding subscale indicate users’ confidence in the appropriate calibration of their trust. Detailed translations and descriptions of the survey items are provided in the
Supplementary Materials to ensure transparency and replicability.
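For illustration, the sketch below shows how the Tool Survey scales described above are typically scored in Python. It follows the standard published scoring rules for the SUS and the UEQ-S; the NPS cut-off values are left as parameters because the exact thresholds used in the study come from [83] and are not reproduced here, and all function names and example ratings are ours.

```python
from typing import Dict, List

def sus_score(item_ratings: List[int]) -> float:
    """Standard SUS scoring: 10 items rated 1-5; odd items contribute
    (rating - 1), even items contribute (5 - rating); the sum is
    multiplied by 2.5 to yield a 0-100 score."""
    assert len(item_ratings) == 10
    contrib = [(r - 1) if i % 2 == 0 else (5 - r)
               for i, r in enumerate(item_ratings)]
    return sum(contrib) * 2.5

def nps_category(sus: float, promoter_cutoff: float, detractor_cutoff: float) -> str:
    """Map a SUS score to an NPS-style loyalty category; the actual
    cut-off values used in the study come from [83]."""
    if sus >= promoter_cutoff:
        return "promoter"
    if sus <= detractor_cutoff:
        return "detractor"
    return "passive"

def ueq_s_scales(item_ratings: List[int]) -> Dict[str, float]:
    """UEQ-S: 8 items on a 7-point scale transformed to -3..+3
    (assuming responses are already oriented so that higher is more
    positive); items 1-4 form the pragmatic and items 5-8 the hedonic
    quality, and the overall score is the mean of all items."""
    assert len(item_ratings) == 8
    t = [r - 4 for r in item_ratings]
    return {"pragmatic": sum(t[:4]) / 4,
            "hedonic": sum(t[4:]) / 4,
            "overall": sum(t) / 8}

# Hypothetical single-respondent example:
print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))   # -> 85.0
print(ueq_s_scales([6, 5, 6, 5, 4, 5, 3, 4]))
```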
3.4. Statistical Analysis
This study applied different statistical methods to analyze the collected data. Descriptive statistics were used to summarize participant familiarity and frequency of chatbot use, including means, standard deviations, ranges, and percentages. The Kruskal–Wallis H test was used for comparative analysis to assess differences between the tools across factors such as mental workload, trust, and emotional affect. Where significant differences were found, Dunn’s post hoc test was used for pairwise comparisons between tools. Effect sizes were calculated using eta-squared (η²) to measure the magnitude of these differences.
Additionally, reliability analyses were performed using Cronbach’s alpha to evaluate the internal consistency of the measurement instruments. Correlation analysis, including Kendall’s tau-b and point-biserial correlation, was conducted to explore relationships between variables like task difficulty and mental workload. These statistical techniques provided a robust framework for analyzing the study’s data, enabling in-depth comparisons and validations across different tools and user perceptions.
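As a concrete illustration of this pipeline, the sketch below runs the named tests in Python using SciPy, scikit-posthocs, and pingouin. It is a simplified example under our own assumptions: the data frame, column names, and the Bonferroni adjustment for Dunn's test are ours, and the eta-squared formula shown is the common H-based approximation for the Kruskal–Wallis test.

```python
import pandas as pd
from scipy import stats
import scikit_posthocs as sp   # Dunn's post hoc test
import pingouin as pg          # Cronbach's alpha

# Hypothetical long-format data: one SUS score per participant per tool.
df = pd.DataFrame({
    "tool": ["Taguette"] * 3 + ["ChatGPT"] * 3 + ["Gemini"] * 3,
    "sus":  [72.5, 70.0, 77.5, 80.0, 82.5, 75.0, 75.0, 77.5, 72.5],
})

groups = [g["sus"].values for _, g in df.groupby("tool")]
H, p = stats.kruskal(*groups)            # Kruskal-Wallis H test
k, n = df["tool"].nunique(), len(df)
eta_squared = (H - k + 1) / (n - k)      # H-based eta-squared approximation

# Pairwise comparisons with Dunn's test (the adjustment method is our choice).
dunn = sp.posthoc_dunn(df, val_col="sus", group_col="tool", p_adjust="bonferroni")

# Cronbach's alpha for a wide item matrix (hypothetical ratings for ten
# items; negatively worded items are assumed to be reversed beforehand).
items = pd.DataFrame([[4, 4, 5, 3, 4, 4, 5, 4, 4, 3],
                      [3, 4, 4, 3, 3, 4, 4, 3, 4, 3],
                      [5, 5, 4, 4, 5, 5, 4, 5, 5, 4]])
alpha, ci = pg.cronbach_alpha(data=items)

print(H, p, eta_squared)
print(dunn)
print(alpha, ci)
```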
4. Results
4.1. Qualitative Data Analysis Results
As presented in
Section 3, participants performed QDA with three tools. For each tool, they were presented with a new dataset and two repeating tasks. The first task focused on data annotation and recognizing recurring challenges. The second task focused on categorizing all instances within the dataset based on their sentiment, as either positive or negative.
In the first task, participants recognized more than seventy differently worded categories of challenges. There was considerable overlap among the recognized categories due to the use of synonyms and slightly different wording. The most commonly identified categories were “Too much information”, “Too much irrelevant content”, “Confusing navigation”, “Too much text”, “Lack of content clarity”, “Missing information”, “Missing important information”, “Missing search function”, and “No issues/No comments”.
In the second task, participants labeled the answers as positive or negative and submitted the number of recognized positive and negative responses and the ratio between them. Some participants also recognized selected responses as neutral. Not all users submitted valid responses; some only responded with the (private) hyperlink to the chat, which limited the data analysis. For Taguette, 80 valid responses were gathered, for ChatGPT 64, and for Gemini 58. The mean ratio between positive and negative responses, as recognized by participants from the gathered results, was 36:74 for Taguette (Std = 14.2), 31:69 for ChatGPT (Std = 27.2), and 64:36 for Gemini (Std = 63.2). A larger standard deviation was observed with the use of the AI tools. A categorization error was also recognized in this task: some users categorized responses multiple times, as the sum of their counts was higher than 100 (the total number of analyzed answers). This error had low representation in the analysis conducted in Taguette (3 responses reported categorization values higher than the initial number of data instances) and Gemini (4 incorrect responses), and was higher with ChatGPT, where 16 out of 64 respondents (25%) made this error. Participants most likely continued their work in the same chat instance in which they conducted the first task, which led to duplicate counting due to instances in the first task being assigned multiple tags.
4.2. Usability Evaluation
Usability was evaluated using the SUS questionnaire. The results of the SUS score are presented in
Table 3. ChatGPT obtained the highest SUS score (SUS = 79.03). Taguette and Gemini achieved similar results, with SUS values of 74.95 and 75.08, respectively. Based on the scale’s benchmark, all usability scores are acceptable and can be categorized as good. Based on some interpretations of the lower limits, ChatGPT’s score could also place its users in the promoter range. Detailed SUS interpretation for all tools is visualized in
Figure 5. The reliability of the results obtained with the SUS questionnaire was measured with Cronbach’s alpha and reached α = 0.814, indicating a very good level of reliability. Before analysis, all negatively stated items were reversed to avoid negative alpha values. Detailed analysis of reliability for SUS showed very good reliability for ChatGPT and Gemini and an acceptable level of reliability for Taguette. Detailed results are presented in
Table 3. Overall, the usability of all three evaluated tools can be considered good, although some positive variance was observed in the ChatGPT score.
The difference in SUS scores between the tools was tested with the Kruskal–Wallis H test, which showed no significant difference between the observed tools, with a rather small effect size. An additional pairwise comparison between the SUS scores of the three tools with Dunn’s test showed a significant difference between the SUS scores of Taguette and ChatGPT; no other differences were statistically significant. Results are presented in
Table 4. This further indicates that there was no statistical difference between the usability of the AI tools used for the observed two UX evaluation tasks.
4.3. User Experience Evaluation
As presented in
Section 3, UX was evaluated with the UEQ-S tool. Initially, 82 responses were received for Taguette, 83 for ChatGPT, and 84 for Gemini. The results of the pragmatic and hedonic scales and the overall results are presented in
Table 5. They indicate an overall positive evaluation, except for Taguette, which was evaluated negatively on hedonic quality (falling slightly below the benchmark threshold of 0.8). The Taguette tool was evaluated as neutral overall, while ChatGPT and Gemini achieved a positive overall evaluation.
Mean values per UEQ-S item are presented in
Figure 6. They show very similar results for ChatGPT and Gemini and some disparities for Taguette. The Taguette tool was perceived as similarly supportive, easy, and clear as the other tools, but less efficient than them. As previously mentioned, the difference in the hedonic quality items is visible; users evaluated Taguette as more boring, less interesting, and more conventional and usual compared to the other two tools. Cronbach’s alpha values were analyzed separately for the pragmatic and hedonic scales. They reached α = 0.814 and α = 0.803 in the whole sample. In the Taguette sample, they reached α = 0.70 and α = 0.80, in the ChatGPT sample they reached α = 0.90 and α = 0.82, and in the Gemini sample they reached α = 0.82 and α = 0.80. All values indicate good to very good consistency.
The results of all three observed tools compared to the benchmark values are presented in
Figure 7. The pragmatic quality of the Taguette and ChatGPT tools is considered ‘Good’, while Gemini reached the threshold for ‘Excellent’ quality. The hedonic quality of the Taguette tool was evaluated as ‘Bad’, which places it in the range of the 25% worst results. On the contrary, ChatGPT and Gemini were positioned as ‘Excellent’, i.e., in the range of the 10% best results. The difference in the hedonic value for the Taguette tool is also reflected in its positioning on the overall scale.
The difference in the UEQ-S scale scores between tools was analyzed with a Kruskal–Wallis H test. The results showed a significant difference in the overall UEQ score, with a very large effect size. Taguette reached the lowest mean rank with 58.43, followed by ChatGPT with 149.09 and Gemini with 160.48. An additional pairwise comparison conducted with Dunn’s test indicated that there was a statistically significant difference in the overall UEQ score between Taguette and ChatGPT and between Taguette and Gemini. There was no statistically significant difference between the observed AI tools. The results of Dunn’s test are presented in
Table 6. They confirm the previously indicated difference in the UX of the AI and non-AI tools in this context.
4.4. Cognitive Load and Emotion Correlation
We further explored the relationship between cognitive load and emotional response to understand how workload influenced users’ emotions. No meaningful correlation between positive emotion and NASA-TLX scores (workload, frustration, and others) was identified. Correlation estimates for the individual tools also showed no significant correlation between workload and positive emotions for Taguette, ChatGPT, or Gemini. However, a statistically significant moderate to strong positive correlation existed between negative emotion and NASA-TLX scores. This result suggests that higher cognitive demands, effort, and frustration are associated with increased negative emotions: QDA tools that lead to higher cognitive demands and effort tend to evoke more negative emotions in users.
There were significant correlations between NASA-RTLX scores and negative emotions such as nervousness, upset, hostility, and guilt. These correlations suggest that as users’ workload and effort increase, they tend to report higher negative emotional states, such as feeling more nervous, upset, or hostile. There were weak and primarily negative correlations between workload and positive emotions, such as determination and inspiration. These results indicate that a higher workload is associated with lower positive emotions like determination and inspiration, although these correlations are weaker than those with negative emotions. The analysis reveals a clear emotional response associated with perceived workload, particularly for negative emotions like nervousness, frustration, hostility, and guilt, which increase with higher workload scores. On the other hand, emotions like determination and inspiration decrease with increased workload, showing that cognitive and emotional load affect the user’s emotional experience. While some emotions, like alertness and attention, are less affected, the overall trend suggests that workload negatively impacts users’ emotional state.
4.5. Trust Evaluation
As presented in the methodology
Section 3, trust was evaluated with the TOAST questionnaire. Altogether, 249 valid responses were obtained (three participants did not finish the trust survey—two responses were missing for the Taguette evaluation and one for ChatGPT). Results were analyzed according to the two recognized subscales: performance and understanding. The reliability of the obtained results was measured with Cronbach’s alpha and reached α = 0.779 for the items in the understanding subscale (items one, three, four, and eight of the questionnaire) and α = 0.805 for the items in the performance subscale (items two, five, six, seven, and nine of the questionnaire). Reliability remained high when observing the results per tool—the performance subscale reached α values of 0.805, 0.853, and 0.805, and the understanding subscale α values of 0.735, 0.826, and 0.813 across the three tools. All the observed reliability values range from acceptable to good.
The results of the TOAST understanding and performance subscales per tool are presented in
Figure 8. Observing the performance subscale, which indicates users’ trust that the system will help them perform their task, Taguette reached the highest mean value with 5.38 (Std = 1.00), followed by ChatGPT with 5.17 (Std = 0.96) and Gemini with 5.10 (Std = 1.05). Therefore, users trusted Taguette the most to help them perform their UX evaluation tasks. Observing the understanding subscale, which indicates users’ confidence in the calibration of their trust, ChatGPT reached the highest mean value with 5.88 (Std = 0.80), followed by Gemini with 5.53 (Std = 0.89) and Taguette with 5.47 (Std = 1.03). The results of the understanding subscale align with the participants’ familiarity with and frequency of use of the AI tools, which were highest for ChatGPT, followed by Gemini, in both cases (data previously presented in
Table 2). The users had previous experience with these two tools, which allowed them to calibrate their trust.
The difference in both trust subscales between the tools was tested with a Kruskal–Wallis H test, which showed a statistically significant difference in understanding subscale scores between the observed tools, with mean ranks of 143.58 for ChatGPT, 116.60 for Taguette, and 114.85 for Gemini. Additionally, Dunn’s test for pairwise comparison was performed, the results of which are presented in
Table 7; it showed a statistically significant difference in the understanding subscale between ChatGPT and Gemini and between Taguette and ChatGPT, meaning users were more confident in the calibration of their trust when using ChatGPT than when using Gemini or Taguette. No significant difference was observed between Taguette and Gemini. The Kruskal–Wallis H test additionally showed a statistically significant difference on the performance subscale, with mean ranks of 138.48 for Taguette, 120.95 for ChatGPT, and 115.85 for Gemini. Dunn’s test showed a statistically significant difference on the performance subscale between Gemini and Taguette, meaning users trusted Taguette more than Gemini to help them perform their tasks. The effect sizes of both Kruskal–Wallis tests were quite small.
4.6. Mental Workload Evaluation
Mental workload was evaluated separately for both tasks (categorization and sentiment analysis) using the NASA-RTLX. An overview of the resulting values (sums of the six subscales) by tool and task is presented in
Figure 9. It is visible that users reported a higher mental workload when using the Taguette tool in both tasks. The mean NASA-RTLX value for Taguette was 57 for the first task and 49 for the second, while the mean values for the other tools were between 26 and 32 for both tasks. The mean values for the tools remained quite consistent between the first and second tasks. The difference between the observed tools was further confirmed with a Kruskal–Wallis H test, which showed a statistically significant difference in the NASA-RTLX score between the different tools used for the UX evaluation in both tasks.
For the first task, this was confirmed with H(2) = 36.607, with mean NASA-RTLX ranks of 69.99 for ChatGPT, 103.20 for Gemini, and 189.16 for Taguette. The effect size, calculated using eta-squared, was 0.136, indicating a medium effect. For the second task, the difference was also significant, with mean NASA-RTLX ranks of 114.18 for ChatGPT, 99.90 for Gemini, and 164.90 for Taguette; the effect size for the second task was very large, indicating a substantial difference in perceived workload across the three systems. An additional pairwise comparison, conducted with Dunn’s test, indicated a clear difference between the AI and non-AI tools. For both tasks, the difference was statistically significant in the comparison of ChatGPT and Taguette and in the comparison of Gemini and Taguette. The results of Dunn’s test are presented in
Table 8.
The detailed overview of the subscale values is presented in
Table 9. Taguette consistently shows higher mental demand compared to the AI tools, especially in the first task. Both ChatGPT and Gemini present significantly lower mental demands, with ChatGPT reaching the lowest mean value for the first task and Gemini for the second. With the use of Taguette, users experienced higher temporal demands, particularly in the second task, suggesting that the AI tools helped them manage perceived time pressure better (the experiment had no time restraints for the tasks). The performance subscale revealed that users were more satisfied with their performance when using Taguette in the first task, though this satisfaction slightly decreased in the second task, while satisfaction with their performance when using the AI tools improved. However, Taguette reached the highest mean values for perceived performance in both tasks. The effort required to complete the task was considerably higher for the Taguette tool, especially in the first task, compared to the AI tools. This could indicate that the use of AI tools can aid efficiency. Observing the frustration subscale, Taguette users experienced higher frustration levels across both tasks, while with the AI tools, particularly ChatGPT, users showed much lower frustration levels. Mean NASA-TLX values for each tool are visualized in
Figure 10, with
Figure 10a representing the results for Task 1 and
Figure 10b representing the results for Task 2.
Kruskal–Wallis H tests were used to analyze the difference between the tools for each subscale and task. The results are also reported in
Table 9. The tests indicated a significant difference between tools in all six subscales. Observing the first task, for mental demand, Taguette reached the highest mean rank with 186.43, followed by Gemini with 104.26 and ChatGPT with 98.64. For physical demand, Taguette also reached the highest mean rank with 167.84, followed by Gemini with 113.92 and ChatGPT with 107.21. For temporal demand, Taguette again reached the highest mean rank with 146.62, followed by ChatGPT with 121.36 and Gemini with 120.83. The results were similar for the performance subscale, where Taguette reached the highest mean rank with 161.35, followed by ChatGPT with 119.63 and Gemini with 108.40. A significant difference was also observed for effort, as Taguette reached the highest mean rank with 190.41, followed by Gemini with 110.93 and ChatGPT with 87.54. Lastly, for frustration, Taguette reached the highest mean rank with 173.35, followed by Gemini with 118.93 and ChatGPT with 96.35. Similar results were observed for the second task, with slight changes in the mean rank order for ChatGPT and Gemini. For mental demand, Taguette reached the highest mean rank with 152.32, followed by ChatGPT with 118.98 and Gemini with 107.80. For physical demand, Taguette also reached the highest mean rank with 155.31, followed by ChatGPT with 114.90 and Gemini with 109.09. The results were similar for temporal demand, as Taguette reached the highest mean rank with 159.75, followed by ChatGPT with 116.36 and Gemini with 102.91. For performance in the second task, Taguette reached the highest mean rank with 141.17, followed by ChatGPT with 125.01 and Gemini with 112.89. For effort, Taguette again reached the highest mean rank with 155.00, followed by ChatGPT with 119.44 and Gemini with 104.53. Lastly, for frustration, Taguette reached the highest mean rank with 149.20, while ChatGPT and Gemini reached similar values with 115.29 and 115.01, respectively. The effect sizes observed in the first task indicate large effects for most of the observed subscales; a small effect size was observed only for the temporal demand subscale, while a medium effect size was observed for the performance subscale. For the second task, most subscales indicated a medium effect size, while small effect sizes were observed for the performance and frustration subscales.
As both the NASA-RTLX and the SEQ measure, at least in some form, the effort required to complete the task, the correlation between these instruments was further analyzed. A Kendall’s tau-b correlation was run to determine their relationship on the entire sample of answers (all tasks, with all tools). There was a strong, statistically significant positive correlation between SEQ and NASA-RTLX, both used for measuring perceived workload. Additionally, a point-biserial correlation was used to relate the errors recognized in the second task to the NASA-RTLX responses, which proved insignificant. The association between the error rates in the second task and SEQ was analyzed with a Pearson chi-square test, though again, no statistically significant association between them was observed.
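The correlation analyses just described can be scripted along the following lines. The sketch is illustrative only: the data frame, column names, and the binary error indicator are our own assumptions; only the choice of tests (Kendall's tau-b, point-biserial correlation, Pearson chi-square) follows the text.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-response records: SEQ rating, raw NASA-RTLX sum, and
# whether the participant made the Task 2 categorization error (0/1).
df = pd.DataFrame({
    "seq":   [2, 1, 3, 5, 2, 4, 1, 3],
    "rtlx":  [34, 28, 47, 71, 30, 62, 25, 49],
    "error": [0, 0, 1, 1, 0, 1, 0, 0],
})

# Kendall's tau-b between SEQ and NASA-RTLX (the tau-b variant handles ties).
tau, p_tau = stats.kendalltau(df["seq"], df["rtlx"])

# Point-biserial correlation between the binary error indicator and RTLX.
r_pb, p_pb = stats.pointbiserialr(df["error"], df["rtlx"])

# Pearson chi-square on an SEQ-by-error contingency table.
table = pd.crosstab(df["seq"], df["error"])
chi2, p_chi, dof, _ = stats.chi2_contingency(table)

print(tau, p_tau, r_pb, p_pb, chi2, p_chi)
```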
4.7. Task Difficulty
Task difficulty was determined with the Single Ease Question (SEQ), a seven-point rating scale (1—Very Difficult, 7—Very Easy) used to assess how difficult users found the task. The SEQ was administered after each task. For Task 1, Taguette scored a mean SEQ of 2.66 (Std = 1.48), ChatGPT had a mean of 1.38 (Std = 0.92), and Gemini had a mean of 1.71 (Std = 1.22). In Task 2, Taguette again reached the highest mean of 2.26 (Std = 1.51), ChatGPT had a mean of 2.05 (Std = 1.51), and Gemini scored a mean of 2.02 (Std = 1.37). Mean values and data distribution are presented in
Figure 11. The visualization indicates that Taguette consistently showed higher mean SEQ scores across both tasks, suggesting it was perceived as more challenging to use. These results highlight that the use of ChatGPT generally eased the complexity of the tasks conducted in this study. The overall mean SEQ scores for all three observed tools based on the reference values presented by [
85] are visualized in
Figure 12.
The difference between the SEQ results for the observed tools was further confirmed with a Kruskal–Wallis H test. It showed a statistically significant difference in the SEQ score between the different tools used for the evaluation of the first task, with mean rank SEQ scores of 98.05 for ChatGPT, 119.37 for Gemini, and 171.19 for Taguette. The effect size, measured with eta-squared, was large (η² = 0.1972). No statistical difference was observed in the SEQ values between the tools for the second task with the Kruskal–Wallis H test, and a negligible effect size was observed for this test. A further pairwise comparison, conducted with Dunn’s test, indicated a clear difference between the tools for the first task: the difference was statistically significant in the comparison of ChatGPT and Taguette, in the comparison of Gemini and Taguette, as well as in the comparison of ChatGPT and Gemini. The results of Dunn’s test are presented in
Table 10.
4.8. Emotional Affect Evaluation
Table 11 shows the descriptive statistics of the I-PANAS-SF items for both tasks and all three tools (
Figure 13). Regarding determination, the average scores varied from 3.37 to 3.73 across tools and tasks. ChatGPT in Task 1 had the highest average score, suggesting that users felt most determined when using this tool for their work. The scores for attentiveness varied slightly; Taguette had the highest mean value (3.62) for Task 1, indicating greater attentiveness with this tool and task. Using ChatGPT to solve Task 1 resulted in users feeling less active than when using Taguette (mean value of 3.56). The ChatGPT tool inspired users the most when they completed Task 1 (score of 2.89). With ratings ranging from 2.67 to 3.02, all three tools demonstrated a moderate level of alertness across tasks.
Users were the least nervous while completing Task 1 using ChatGPT (1.55) and the most nervous when they had to complete Task 1 using the Taguette tool (2.12). The mean scores for all tools and tasks were quite low, indicating that users rarely felt afraid. Task 1 with the Taguette tool received the highest score for feeling upset (1.94), whereas for the other tools and tasks the upset scores were lower. Hostility scores were modest, with slight variations among tasks and tools and the lowest score for ChatGPT when solving Task 1 (1.25). Overall, feelings of shame were slight across all tools and tasks, with Gemini scoring the lowest (1.21) in the case of Task 2.
Positive emotion scores were relatively high across all tools and tasks, where the highest positive emotional response (15.77) was in the case of ChatGPT and Task 1, and the lowest score (14.99) was in the case of ChatGPT and Task 2 (see
Table 12 and
Figure 14). In the case of negative emotions, the highest negative emotion score (8.59) was observed for the Taguette tool in Task 1, while ChatGPT in Task 1 produced the lowest negative emotional response (6.83), indicating that ChatGPT generally leads to lower negative emotional experiences.
In the case of the Taguette tool, the positive emotion scores were relatively consistent, with Task 1 slightly higher than Task 2, and relatively high compared to the other tools. In the case of ChatGPT, we noticed a slight drop in positive emotion from Task 1 to Task 2; however, the positive emotion scores for ChatGPT remained high overall, indicating that ChatGPT performed consistently well across both tasks. For Gemini, the positive emotion scores were also similar for both tasks, with Task 1 slightly outperforming Task 2. Regarding negative emotions, Taguette’s scores were moderately high but consistent between both tasks, participants reported the lowest negative emotion scores across both tasks with ChatGPT, and Gemini’s negative emotion scores were slightly higher than ChatGPT’s but similar to Taguette’s.
Comparing emotional affect between tasks and tools (see
Table 13) showed that while positive emotional experiences were comparable across tasks and tools, negative emotions varied more substantially, particularly between Taguette and the AI-based tools (ChatGPT and Gemini). For positive emotions, the statistical tests showed no significant differences between tasks or tools, meaning all tools and tasks generated similar levels of positive emotions. The Mann–Whitney U test estimates were small (e.g., −1.17, −0.33), and the p-values were higher than 0.05, indicating no significant differences between the compared groups. Positive emotion scores were also compared between all three tools with the Kruskal–Wallis H test, which resulted in nonsignificant values, indicating no statistically significant difference between the three tools.
When we compared negative emotion scores between all three tools, significant differences were found. The Kruskal–Wallis H test yielded a larger test statistic (27.06) with a significant p-value, indicating significant differences in the negative emotion scores across all three tools. Pairwise comparisons of negative emotion scores between tools provided larger test values (e.g., −4.36, −4.51) and significant p-values, indicating significant differences between tools. The Mann–Whitney U test revealed significant differences between Taguette and ChatGPT and between Taguette and Gemini, but no significant difference between ChatGPT and Gemini.
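The pairwise comparisons reported above can be reproduced with a short loop such as the one below. The data, column names, and the two-sided alternative are illustrative assumptions of ours; only the use of the Mann–Whitney U test for pairs of tools follows the text.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Hypothetical negative-affect sums (I-PANAS-SF) per participant and tool.
df = pd.DataFrame({
    "tool": ["Taguette"] * 4 + ["ChatGPT"] * 4 + ["Gemini"] * 4,
    "negative_affect": [9, 8, 10, 7, 6, 7, 7, 6, 8, 7, 9, 7],
})

for tool_a, tool_b in combinations(df["tool"].unique(), 2):
    a = df.loc[df["tool"] == tool_a, "negative_affect"]
    b = df.loc[df["tool"] == tool_b, "negative_affect"]
    u, p = stats.mannwhitneyu(a, b, alternative="two-sided")
    print(f"{tool_a} vs {tool_b}: U = {u:.1f}, p = {p:.3f}")
```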
To summarize, the ChatGPT tool generated higher positive and lower negative emotions, especially in the case of Task 1. Compared to ChatGPT, Taguette demonstrated good positive emotion scores, but the tool triggered slightly stronger negative feelings, particularly in the case of Task 1, in which users felt more scared, agitated, and hostile. Though it produced more adverse emotions than ChatGPT, Gemini had comparatively balanced emotion scores for both tasks. According to the estimated scores, ChatGPT appears to be the tool of choice.
5. Discussion
This study examined significant differences in usability, UX, trust, task difficulty, emotional affect, and mental workload between chatbots and non-AI tools in qualitative data analysis, revealing key insights into how these tools impact user perception.
Compared to existing usability evaluations of intelligent chatbots, a higher reliability of the results was achieved in this study (overall Cronbach’s α = 0.814) than that reported by [65]. The obtained results for ChatGPT’s usability (SUS = 79.03) were higher than in the existing literature, where a SUS value of 67.44 was measured [65]. Although the average SUS scores of Taguette (SUS = 74.95) and Gemini (SUS = 75.08) were slightly lower, all three tools observed in this paper reached acceptable usability scores. The obtained usability scores were also higher than those reported for the Aswaja chatbot [14] and the myHardware chatbot [61], which were likewise evaluated in the educational domain. The positive results of our study highlight the user acceptance of AI chatbots across various tasks.
The UX evaluation of the three observed tools (conducted with the UEQ-S, separating the hedonic and pragmatic scales) showed overall positive results, though pairwise comparison with Dunn’s test showed that Taguette obtained statistically significantly lower scores in separate comparisons with ChatGPT and Gemini. This could indicate that AI chatbots can offer a better UX than classic QDA tools for the observed qualitative analysis tasks. UEQ-S results were compared to the benchmark values derived from 468 related studies, provided by the instrument. For pragmatic quality, Taguette and ChatGPT reached the threshold for ‘Good’, while Gemini reached the threshold for ‘Excellent’. For hedonic quality, Taguette was evaluated as ‘Bad’ (in the range of the 25% lowest obtained results), while ChatGPT and Gemini were evaluated as ‘Excellent’ (in the range of the top 10% of related studies). All overall UEQ-S results were much lower compared to the study analyzing the UX of the Atrexa chatbot, where a higher overall UEQ score was observed [64]. Notably, the number of participants in their study was considerably lower (N = 17) than our study’s larger sample size of 82. Consequently, the difference in sample size and composition may be a factor worth considering when comparing UEQ-S result scores across studies.
The evaluation of trust within our study, as measured by the TOAST questionnaire, provides insightful perspectives on users’ trust in the QDA tools analyzed. The reliability of the metrics (Cronbach’s α between 0.735 and 0.853) remained robust across individual tools and subscales, suggesting that the questionnaire effectively captures the dimensions of trust across varying contexts. On the performance subscale, which reflects users’ trust in the system’s capability to assist them in completing their tasks, Taguette was the highest-rated tool with a mean score of 5.38 (Std = 1.00). ChatGPT and Gemini followed closely with mean scores of 5.17 (Std = 0.96) and 5.10 (Std = 1.05), respectively. These results imply that, despite rating Taguette as a tool with worse UX and usability, users still recognized its effectiveness in aiding their performance, highlighting its potential as a trusted tool in practical applications. On the other hand, on the understanding subscale, which gauges users’ confidence in their calibration of trust, ChatGPT scored the highest with a mean of 5.88 (Std = 0.80), followed by Gemini at 5.53 (Std = 0.89) and Taguette at 5.47 (Std = 1.03). The higher score for ChatGPT can be attributed to users’ prior experience and familiarity with the tool, which likely enhanced their ability to calibrate their trust effectively. As presented in the demographic section of this study, participants had used ChatGPT more frequently than Gemini (no data were gathered for Taguette use), thus reinforcing their confidence in its performance. The difference between the results of the performance and understanding subscales highlights an important nuance in how trust is developed and perceived among users. While Taguette was most trusted for task performance, users expressed greater confidence in calibrating their trust towards ChatGPT. In this case, trust may have been influenced by factors such as familiarity, prior experience, and user expectations. Enhancing users’ familiarity with a tool through effective onboarding or training could potentially change the obtained results and improve their understanding and calibration of trust, thereby increasing overall user satisfaction and performance. Future research could explore strategies to bridge the gap between performance trust and understanding trust, ultimately leading to a more cohesive UX across QDA tools.
The results from the Single Ease Question (SEQ) provide valuable insights into users’ perceptions of task difficulty across the evaluated tools. The average SEQ score in this study (M = 2.01) was lower than the average score obtained over 4000 tasks by [85], where a mean score of around 5.5 was observed. This comparison indicates that the tasks in this study were seen as quite complex by the participants. The mean values, compared to the average scores of previous studies, are presented in
Figure 12. The overall mean SEQ score of ChatGPT was 1.71, followed closely by Gemini with 1.86. Taguette reached higher mean values, indicating greater perceived task difficulty, with an overall mean SEQ score of 2.46. With these values, ChatGPT is placed in the highest tenth percentile by task difficulty, while Gemini and Taguette were placed within the top 25% of the most difficult tasks. The consistent trend of higher SEQ scores for Taguette (also visible in the separate observations per task) suggests that users perceived this tool as more complex and challenging to navigate, highlighting potential usability issues that could hinder effective task completion. In contrast, the AI-based chatbots were associated with lower task difficulty (while completing a very similar task), as evidenced by their significantly lower SEQ scores across both tasks. The Kruskal–Wallis H test confirmed a significant difference in SEQ scores for Task 1 among the three tools, with a large effect size (η² = 0.1972). The pairwise comparisons revealed that the differences in perceived difficulty between Taguette and the other tools were statistically significant, further highlighting the challenges users faced when interacting with Taguette. Interestingly, the lack of significant differences in SEQ scores for Task 2 suggests that users may have adapted to the tasks or that the second task was inherently less complex across all tools. This points to the potential for learning effects or task characteristics to influence perceived difficulty, underscoring the need for further investigation into how task design and user familiarity can impact users’ perceptions of the tool.
The mental workload evaluation, conducted with the NASA-TLX in this study, yielded results comparable to prior studies utilizing this instrument. Grier [86] reported an analysis of over 1000 uses of NASA-TLX scores from over 200 publications. The mean values reported in our study were in the range of 26–32 for the AI tools (ChatGPT, Gemini) and in the range of 49–57 for the non-AI tool (Taguette). This positions the results of the AI tools within the lowest 20% of scores observed in the literature (per [86]), while the non-AI tool is positioned between the 50th and 70th percentiles. Comparing the results of the mental workload for the first task (classification of the survey responses) with other classification tasks from the prior literature [86], the mental workload measured with the use of the AI tools would be positioned in the lowest-scoring 25% of the related studies, while the mental load of the task with the use of the non-AI tool would be positioned in the highest 25% of the related studies. Comparing the results of the second task (sentiment analysis of responses) with other results reported for cognitive tasks by Grier [86], the mental load measured with the use of the AI tools would again be positioned in the lowest-scoring 25% of the studies, while the mental load with the use of the non-AI tool would be positioned slightly above the highest 50% of the studies. The results of the NASA-RTLX indicate that the use of AI tools for UX evaluation tasks lowers the cognitive workload on the evaluators compared to the use of non-AI tools. Analysis of the NASA-TLX subscales indicated that the AI tools (ChatGPT and Gemini) generally reduce mental, physical, and temporal demands compared to the non-AI tool (Taguette). Users reported lower frustration and lower required effort with the AI tools. However, users’ subjective performance satisfaction can vary, with Taguette reaching the highest mean performance satisfaction in both analyzed tasks. The comparison of effort measurements obtained with the NASA-RTLX and the SEQ revealed a strong, positive correlation. The presented results are comparable to those reported by Sauro and Dumas [77], though the disparity could be attributed to the difference in task complexity (as reported in
Section 4.6, the NASA-RTLX results indicate that the tasks in our research were positioned in the highest 25% of tasks in similar research, based on the required mental load).
Analyzing emotional responses reveals distinct patterns in user experiences across tasks and tools. Positive emotion scores were consistently high, with ChatGPT performing best, particularly in Task 1 (15.77; Table 13), driven by high determination (Table 11). However, a slight decline in positive emotion for ChatGPT in Task 2 (14.99) may suggest task-specific challenges. Taguette showed relatively strong positive emotions but lacked consistency, while Gemini demonstrated balanced scores across both tasks. Negative emotions varied significantly, with ChatGPT producing the lowest scores (6.83 for Task 1), indicating a supportive user experience (Table 13). In contrast, Taguette triggered higher negative emotions (8.59 for Task 1), with users reporting increased nervousness and upset. Gemini’s scores were moderate, lying between Taguette and ChatGPT. Statistical tests confirmed these trends, showing significant differences in negative emotions across tools (H = 27.06; Table 13), while positive emotions showed no significant variation. Overall, ChatGPT’s combination of high positive and low negative emotions positions it as the most user-friendly tool, particularly for tasks requiring engagement and focus. These findings highlight the importance of designing tools that minimize negative emotional impacts while fostering positive user experiences.
The study’s findings provide several significant contextual observations and implications. The higher usability scores of ChatGPT and Gemini compared to Taguette can be attributed to their natural language interfaces, which simplify interaction by allowing users to engage conversationally. Unlike traditional tools like Taguette, which rely on manual tagging and coding workflows, chatbots eliminate the need for technical input by offering predictive and responsive assistance. This conversational approach most likely contributed to the lower cognitive load and improved user satisfaction reported with AI-powered solutions.
Interestingly, despite its lower UX and usability ratings, Taguette received the highest rating for performance trust, suggesting that users value its perceived reliability and precision for task-specific outcomes. This points to the role of familiarity and expectations in establishing user trust: participants most likely viewed Taguette as a specialized tool for qualitative analysis, which increased their trust in its performance accuracy. On the other hand, ChatGPT’s higher trust scores suggest that familiarity with AI chatbots plays a significant role in users’ ability to understand and evaluate a tool’s capabilities. This underscores the need for training and onboarding to bridge trust gaps between AI and non-AI technologies.
The findings also suggest that AI tools reduce emotional strain by fostering positive user experiences. ChatGPT and Gemini consistently produced lower negative emotion scores and higher positive emotional responses, reinforcing their potential to make qualitative analysis tasks more engaging and less frustrating. However, the emotional variability across tasks indicates that task complexity and tool suitability influence emotional responses. The slightly lower positive emotion scores for ChatGPT in Task 2 may be due to the nature of the task itself: Task 2 involved categorizing responses as either “positive” or “negative”, a more rigid and straightforward activity than the open-ended, creative tagging required in Task 1. This reduced flexibility and room for interpretation may have made Task 2 feel less engaging or rewarding for participants, which would explain the slightly lower positive emotional responses.
Potential solutions to address the identified challenges include extending traditional tools such as Taguette with predictive features or automated assistance to improve usability and reduce cognitive load. Similarly, providing comprehensive onboarding and training for AI-driven tools could further enhance trust and usability by familiarizing users with their functionalities and limitations. These measures could foster a more balanced user experience and narrow the differences between traditional and AI-based tools. Moreover, future studies could investigate how integrating features such as sentiment analysis or pattern recognition directly into traditional tools might improve their acceptance and performance in practical settings; one possible form of such an extension is sketched below.
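As a purely illustrative example of the last point, the sketch below shows how a lexicon-based sentiment suggestion could be attached to free-text responses using NLTK’s VADER analyzer. This is not a feature of Taguette and was not part of the present study; the function name and the threshold value are assumptions.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time download of the VADER lexicon
analyzer = SentimentIntensityAnalyzer()

def suggest_sentiment_tag(response: str, threshold: float = 0.05) -> str:
    # Suggest a "positive"/"negative"/"neutral" tag that a human coder can accept or override.
    compound = analyzer.polarity_scores(response)["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

print(suggest_sentiment_tag("The new interface is confusing and slow."))  # likely "negative"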
5.1. Limitations and Threats to Validity
This study has several limitations. The participant sample consisted primarily of Master’s students in informatics and data technologies, which limits the generalizability of the findings to other user groups, such as secondary school students, PhD candidates, or professional staff. These groups may have different levels of familiarity with tools like ChatGPT, Gemini, and Taguette, potentially affecting their experiences and evaluations. Additionally, the use of self-reported measures may not fully reflect actual behavior or cognitive processes. Future research should include a more diverse user base and incorporate objective data, such as task completion times, to provide a more comprehensive understanding of QDA tools.
The study’s findings are subject to several potential threats to validity. Using the short version of the UEQ (UEQ-S) allows for only a rough measurement of the higher-level meta-dimensions, which can limit the robustness and precision of the results. Consequently, comparisons with other studies that employed the full UEQ may be less accurate, and interpretation of the data should be approached cautiously. This study also did not establish a baseline measurement of task difficulty without the use of any tool, which could have provided critical context for evaluating the effectiveness of the tools. As described in Section 3.3, all dimensions analyzed in this study were measured as perceived by users through self-reported scales. Perceptions may not fully align with users’ actual behavior or cognitive processes, so the reported results reflect subjective user perception and should be interpreted within the context of these limitations. User responses were only partly validated, raising concerns about the reliability and accuracy of the feedback received. This study compared two AI-based chatbots and one non-AI tool; therefore, generalization from AI chatbots to other AI-enhanced tools is limited. More specialized tools for such evaluation tasks may exist that were not included in this study, which could further limit the generalizability and applicability of the findings. Finally, the study involved Master’s students of informatics and data technologies, which means the results could differ for less tech-savvy users or for design-oriented IT practitioners.
5.2. Future Research
Future research could explore extending this evaluation to other AI-enhanced tools for qualitative data analysis or incorporating more diverse user populations to assess the generalizability of these results. Additionally, examining the role of training and prior tool familiarity in shaping user experiences could offer deeper insights. Replication studies across different datasets or task types would be valuable for confirming these findings. Future studies could also explore establishing benchmarks or thresholds for emotional and cognitive responses to further standardize evaluations of QDA tools. Extending the scope of analysis to include additional data, such as task completion times and error rates, could complement self-reported measures and offer a fuller understanding of the user experience of QDA tools. Lastly, longitudinal studies tracking long-term changes in perception, usability, and adoption of these tools could yield insights into their sustained effectiveness and user satisfaction.
6. Conclusions
Evaluating the usability, user experience, emotional impacts, and workload of tools designed for qualitative data analysis is critical for understanding how these systems influence users’ performance and satisfaction. Tools like ChatGPT, Gemini, and Taguette each present unique strengths and weaknesses in these areas, highlighting the importance of tailoring tools to users’ needs.
The findings from this study underscore that AI-based tools, particularly ChatGPT, consistently foster more positive user experiences, reduce negative emotional impacts, and lower cognitive workload compared to traditional tools like Taguette. ChatGPT’s high usability score and strong hedonic quality distinguish it, whereas Gemini demonstrates balanced positive and negative emotions across tasks. Taguette, on the other hand, shows a higher mental workload and stronger negative emotions, suggesting areas for improvement, particularly for tasks requiring substantial cognitive effort.
It is important to note that this study involved a limited group of respondents, primarily Master’s students in informatics and data technologies, which may affect the generalizability of these findings. The conclusions presented here are therefore most applicable to this specific demographic and should not be extended to other user groups without further validation. For example, secondary school students, PhD candidates, industry representatives, or staff members may exhibit different usage patterns and perceptions when interacting with these tools.
Nevertheless, these insights provide valuable guidance for the design and implementation of effective QDA tools. Several factors influence the perceived effectiveness of these technologies, including usability, cognitive effort, and emotional impact. This study highlights the value of leveraging AI technologies to simplify complex qualitative tasks, improve user satisfaction, and inform the development of tools that better align with diverse user needs. Future research should aim to include broader and more representative user groups, as well as explore varying contexts of use, to enhance the robustness and generalizability of findings in this evolving field.