1. Introduction
In the digital era, the prominence of the e-commerce industry has been amplified remarkably, with business-to-consumer (B2C) websites being a crucial component. These platforms are designed to maximize benefits for both consumers and businesses [
2]. A significant challenge for many businesses is to develop a B2C website that not only attracts a substantial customer base but also evokes a positive emotional response [
2]. Emotions significantly dictate an individual’s information processing, behavioral tendencies, and, ultimately, their decisions on final purchases [
3]. Therefore, a well-designed B2C website that generates positive emotional experiences can effectively foster customer loyalty and recurring visits, thus enhancing the market competitiveness of the website [
4]. To increase the purchasing rate for a given website, it is paramount to design a website that elicits a positive emotional response from users. This necessitates a comprehensive analysis of user emotional experiences, empowering designers with insights into customers’ buying behaviors and refining their website to become a strategic competitive asset.
Most researchers concur that emotions represent transient mental states influenced by life events [
5,
6,
7]. Numerous models exist for quantifying emotion, with Russell’s circumplex model having gained significant traction in the field of human-computer interaction (HCI) [
8,
9,
10]. This model conceptualizes an individual’s perception of their emotions within a two-dimensional space spanned by valence and arousal, two orthogonal, independent dimensions. Valence, positioned on the horizontal axis, is an individual’s judgment of whether an emotion is good or bad, representing the positivity or negativity of the emotion. Conversely, arousal, placed on the vertical axis, represents the level of emotional activation [
11,
12]. These two dimensions allow us to distinguish four basic categories of emotion, as shown in
Figure 1. Other emotion models include Ekman’s discrete emotion model and Plutchik’s compound emotion model [
13,
14].
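The quadrant structure of the circumplex model can be sketched as a small helper function. The example emotion labels are illustrative, and the sign convention (ratings centered at zero, negative meaning low) is our assumption for the sketch, not part of the model itself.

```python
def circumplex_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) rating to one of the four basic
    emotion categories of Russell's circumplex model.

    Assumes both ratings are centered at 0 (negative = low side)."""
    if valence >= 0:
        if arousal >= 0:
            return "high-arousal positive (e.g., excited)"
        return "low-arousal positive (e.g., relaxed)"
    if arousal >= 0:
        return "high-arousal negative (e.g., tense)"
    return "low-arousal negative (e.g., bored)"
```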
The measurement of emotional responses predominantly revolves around three approaches: subjective feelings, physiological reactions, and motor expressions [
15]. In terms of motor expressions, numerous studies have been conducted on the recognition of facial expressions [
16,
17], utilizing a camera to capture visible-light images of users during emotional reactions. However, this approach's performance is susceptible to variations in ambient lighting conditions [
18]. Furthermore, a notable drawback of this approach is that individuals often tend to avoid changes in facial expressions when interacting with technological systems, resulting in reduced consistency between emotional experiences and facial expressions [
19,
20]. The Self-Assessment Manikin (SAM) is a subjective–affective report method based on the circumplex model that measures the degree of pleasure, arousal, and dominance of individuals in response to events through nonverbal pictorial assessment techniques, thus effectively mitigating the influence of individual differences in emotion cognition [
21]. Many emotion research studies propose that the physiological changes in emotions are intimately linked with emotional experiences [
22,
23,
24]. This theory has opened new avenues for analyzing emotional experiences on B2C websites. For instance, researchers have studied user emotional responses to two versions of a mobile phone interface using the physiological markers of electrodermal activity (EDA) and heart rate (HR) as indicators. The findings revealed higher EDA levels with the low-usability version, although the HR values showed no significant difference. Additionally, a correlation was observed between EDA, valence, and arousal [
25]. Furthermore, additional methodologies include a multimodal approach combining eye movement indicators with the average galvanic skin response (GSR), skin temperature (SKT), and respiration rate (RSP) to assess user emotional experiences during online shopping. The findings indicated no significant differences in the GSR, SKT, and RSP responses, whereas the eye movement indicators showed significant variance [
26]. In a separate investigation, researchers identified that distinct emotional states are associated with discernible variations in event-related potentials (ERPs), which means that users’ emotional experience while interacting with a website can be quantified by assessing the amplitude of the ERPs within relevant brain regions [
27]. These studies highlight the diversity and effectiveness of methods that are currently employed to study emotional responses in the realm of B2C website interactions.
Physiological signals have shown promising results in detecting emotional experiences, as illustrated by the studies detailed above. However, these methods, while effective, require skin contact or are invasive. There is also the question of delay and cost, given the complexity of these procedures. In contrast, infrared thermal images (IRTIs) have recently gained recognition as a non-contact, non-invasive solution to evaluate human autonomic nervous activity and psychophysiological states. The autonomic nervous system (ANS) serves as the foundation for the thermal observation of emotion. It plays a pivotal role in regulating various physiological signals in individuals, and encompasses unconscious functions such as breathing, heart rate, perspiration, etc. Two biological mechanisms enable the thermal observation of emotions, namely subcutaneous vasoconstriction and emotional sweating, both of which can be characterized and quantified by IRTIs [
28]. The advancement in IRTIs and the miniaturization of infrared detectors have incentivized numerous manufacturers to develop portable systems, specifically mobile and low-cost infrared thermal systems. This advancement has greatly facilitated experimental research [
19].
In recent years, IRTIs have been widely used in the field of emotion recognition. For instance, IRTIs have been used to study changes in nasal temperature induced by feelings of guilt in children [
29]. In the dimensions of valence and arousal, thermal images were used to mark physical changes during emotional tasks, revealing a link between nose temperature and emotions, particularly valence. Positive valence and arousal patterns led to an increase in nose temperature, while negative valence triggered a decrease [
30]. Machine learning has also been incorporated into thermographic emotional studies, demonstrating high accuracy. For example, using the Stroop test to provoke stress, researchers recorded thermal imaging, cardiac, electrodermal, and respiratory activity. A support vector machine (SVM) model was employed for classification, and it was found that stress identification through IRTIs alone achieved a success rate of 86.84% [
31]. Furthermore, the gray-level cooccurrence matrix (GLCM) features of thermal images have been explored for their potential use in emotion detection [
32]. Therefore, the thermal imaging method combined with classification models could improve the quality and efficiency of website emotion evaluation. However, it is worth noting that previous studies predominantly employed videos and images as experimental stimuli. In this study, we applied IRTI-based emotion classification to the field of HCI, using B2C websites as the experimental stimuli.
This paper aims to investigate the effectiveness of the noninvasive IRTI method in classifying user emotional experiences when using B2C websites. We prepared an experimental setup wherein the emotional experiences of users were induced by websites with adjusted usability and aesthetic elements. The participants completed the corresponding tasks and the SAM, which provided the ground truth of their emotional experiences and later served as labels for machine learning. This study principally focuses on establishing the potential of IRTIs in the context of HCI, particularly in their application to B2C websites.
The insights gained in the study will contribute to the understanding of user experience evaluation metrics, which are increasingly being employed as performance indicators for B2C websites. Additionally, they will facilitate the modeling of user emotional experiences from an HCI perspective. More pragmatically, this study serves to further comprehend the impact of website design elements on the emotional experiences of users, thereby enabling designers to optimize these elements for better user engagement. In achieving this, a cross-subject classification model that is promising for improved generalizability was developed. This model aimed to predict the emotional experiences of all participants rather than simply training a different model for each individual.
2. Methods
2.1. Design
This experiment was designed to demonstrate that IRTIs could be used to classify emotional experiences in HCI. It adopted a 2 × 2 two-factor within-subject experimental design, with two independent variables: interface usability (high or low) and interface aesthetics (high or low). The dependent variables were emotional experience (valence and arousal) and the participants' facial thermal responses. The emotional experiences were classified into positive and negative emotional experiences. We also measured baseline (without emotional stimuli) thermal images for comparison with the experimental conditions [
28].
2.2. Participants
This experiment was conducted with a group of 24 students (12 males and 12 females) from Southeast University, and the age range was 19–25 (M = 22.50 and SD = 2.02) years old. The participants who accepted the experimental conditions were informed of the start time of the experiment 5 days in advance. Based on the study guidelines, the participants were required to abide by the following rules: no drinking alcohol 24 h before, no drinking coffee or smoking 3 h before, no application of lotions, cosmetics, antiperspirants, or shaving cream on the day of the experiment, and no facial obstructions such as hair and glasses. The participants were informed of the purpose and process of the whole experiment, and those who accepted the guidelines signed a letter of informed consent [
33]. All of the experimental procedures of this study were approved by the clinical research Independent Ethics Committee of Zhongda Hospital affiliated with Southeast University (2022ZDSYLL128-P01).
2.3. Apparatus
The experiment was performed in a 5 × 5 m area in the Ergonomics Laboratory of Southeast University. To maintain a constant temperature, an air conditioner was used to keep the room temperature at 22 ± 2 °C and the relative humidity from 50 to 60%. In addition, the room was not directly ventilated or exposed to direct sunlight. The schedule was arranged between 9 a.m. and 3 p.m.
A FLIR ONE Pro (Teledyne FLIR LLC, Wilsonville, OR, USA) was used to obtain the thermograms; this device has a thermal sensitivity of 70 mK, a thermal pixel size of 12 µm, an infrared resolution of 160 × 120 pixels, and a spectral range between 8 and 13 µm. A OneFit™ connector, which can flexibly connect to a phone to directly display thermal images on the screen, was used. The acquired thermal images were grayscale images with pixel intensities ranging from 0 to 255; higher temperatures were associated with brighter pixels (white areas indicated the hottest areas), and lower temperatures with darker pixels (black areas indicated the coldest areas). The time interval for capturing a single frame of a thermographic image was 4 s. The FLIR ONE Pro was connected to a phone, and a tripod was used to secure the camera and phone at a distance of 1 m from the participant under study. Stimuli display and thermogram processing were performed on a MacBook Pro (13-inch, 2017, two Thunderbolt 3 ports) with a 2.3 GHz dual-core Intel Core i5 processor.
2.4. Stimuli
Usability and aesthetics were manipulated to evoke emotional experiences [
34]. For usability, information architectures (IAs) primarily concern the organization and simplification of information, as well as the design and construction of information spaces. They were proposed to assist individuals in gaining a better grasp of information and making optimal decisions [
35]. The relationship between the quality of IAs and usability has been well-researched [
36]. Therefore, this experiment constructed two different IAs to manipulate usability. Afterward, the established IAs were compared based on the applications of latent semantic analysis (LSA) provided by the University of Colorado at Boulder [
37]. We calculated the information scent of each navigation path toward targets and non-targets. The LSA results are shown in
Table 1, which illustrates that, in instances of good IAs, there is a high information scent associated with navigating toward targets and a low scent when navigating toward non-targets. Conversely, in cases of poor IAs, the information scent is low for both target and non-target navigation. This resulted in the generation of high-usability and low-usability websites.
For aesthetics, 4 professors with over a decade of cumulative experience in website design were invited to engage in website design. They were tasked with choosing the simplest and most popular template among 10 websites. Then, adjustments were made based on this template. Afterward, according to research [
38] and as shown in
Figure 2, 5 kinds of website background colors and 4 kinds of product display shapes were combined to generate 20 websites. In a preliminary online study, 148 users were invited to rate the attractiveness of the websites on a 9-point scale. A total of 146 valid questionnaires were collected. We compared the websites with the highest and lowest mean aesthetic scores using a paired t-test, t (145) = 14.86,
p < 0.01, Cohen’s d = 2.15. The yellow pentagram combination was selected as the website with the lowest aesthetic value (mean aesthetic score = 3.55, SD = 1.89), and the white square combination was selected as the website with the highest aesthetic value (mean aesthetic score = 7.35, SD = 1.63), as shown in
Figure 3.
To evaluate the emotional experiences of users, an online shopping platform was implemented. There were 5 buttons on the navigation bar at the top of the website. Clicking on the “home” button returned the user to the first page, while the other buttons could be clicked to allow the user to select products through a drop-down navigation bar. The first-level navigation bar on the first page (except the “home” button) could be clicked to display a second-level navigation bar with 4 categories. Each of these categories could be clicked to display a third-level navigation bar on the right side, also displaying 4 categories. Clicking on any of these categories displayed a product list of the current category arranged in a 4 × 5 matrix. The pictures, names, and prices of the products were displayed in turn. Clicking on the product picture would display more details of the product. This online shopping website had a total of 1280 products.
Based on the obtained high-usability and low-usability navigation and high and low aesthetics scores, four websites were generated: high usability and high aesthetics (U+A+), high usability and low aesthetics (U+A−), low usability and high aesthetics (U−A+), and low usability and low aesthetics (U−A−). The websites only differed in the IAs of the navigation bar, the background color, and the product display shape; other elements were not changed.
2.5. Procedure
The overall procedure of the experiment is shown in
Figure 4. A participant was invited into the room upon arrival and was seated in a comfortable seat. The height of the seat was adjusted to ensure that the face of the participant was centered on the phone screen without movement; a distance of 1 m between the participant and the phone was verified.
Before starting the study, the participant received an explanation of the whole experimental process and the meaning of the SAM, which is scored on a scale from 1 to 5 and consists of the valence (positive and negative) and arousal (intensity) dimensions. The SAM result was considered the ground truth of the emotional experiences.
A practice template, identical in interaction to the experimental stimuli but without any experimental elements, was provided to the participants before the formal experiment so that they could familiarize themselves with the use of the website. Afterward, the participants were allowed to relax for 15 min to adapt to the environment and stabilize their body temperature. The baseline thermal responses of the participants were then measured for 2 min. The baseline served as the foundation for defining the directionality of physiological changes during the emotional arousal process [
28]. During the baseline measurement process, participants were instructed to rest and empty their minds of all thoughts, feelings, and memories [
39].
After completing all of the steps described above, the participants completed 4 tasks on the B2C websites. For displayed product pictures, participants were asked to find the corresponding products and add them to the shopping cart. The participants searched for different products on the 4 websites to avoid operational memory interference. To eliminate effects related to the order in which the websites were presented, the stimulus order in this experiment was counterbalanced using a Latin square design [
40], and the participants were asked to complete the corresponding task on each website in 5 min. If a participant did not complete the task in 5 min, they were asked to stop immediately. After completing all of the tasks for a particular website, the participants were required to complete the SAM, followed by a 2 min break to allow their state to return to baseline.
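A cyclic Latin square for the four website conditions can be generated as follows. This is a generic construction for illustration, not necessarily the exact square used in the study; for even numbers of conditions, a Williams design is often preferred when first-order carryover balance is also required.

```python
def latin_square_orders(conditions):
    """Generate one presentation order per participant group using a
    cyclic Latin square: row r presents condition (r + c) mod n in
    serial position c, so every condition appears exactly once in
    every row and once in every serial position across rows."""
    n = len(conditions)
    return [[conditions[(r + c) % n] for c in range(n)] for r in range(n)]

orders = latin_square_orders(["U+A+", "U+A-", "U-A+", "U-A-"])
```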
2.6. Thermal Data Processing
2.6.1. Infrared Thermal Image Preprocessing
To avoid any interference from a head-fixation device, no such device was used in this experiment. Instead, image registration was applied to eliminate the deviation caused by head movement. The centroid of the eye region in the fixed images was used as the registration anchor [
41]; during registration, the image was only translated or rotated, and the gray matrix itself was not changed. Subsequently, median and Gaussian filters were used to eliminate noise in the registered images and obtain well-binarized images, i.e., images that display the facial contour of the participant. Then, a bounding box was used to frame the face, and the cropped original images were generated in batches for further statistical calculation. The forehead, left cheek, right cheek, nose, and maxillary region are the 5 regions of interest (ROIs) that are frequently used in emotion research with IRTIs and have yielded significant results [
42,
43,
The ROIs were located with a geometric model of the face: the face width was represented by D, and the regions to be studied were obtained from the geometric ratios defining each ROI's center [
28,
45]. The first frame of each group of thermal images was positioned manually. Afterward, the ROI selection box automatically located the ROIs of each frame. Finally, all ROIs of each frame were accurately located.
Figure 5 describes the entire process of thermal data processing.
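The geometric ROI localization step can be sketched as follows. The center and size fractions in `EXAMPLE_RATIOS` are illustrative placeholders, not the ratios of the geometric model actually used in the study.

```python
def locate_rois(face_box, ratios):
    """Given a face bounding box (x, y, w, h) from the binarized thermal
    image, return a pixel box (bx, by, bw, bh) for each region of
    interest.  `ratios` gives each ROI's center and size as fractions of
    the face width/height."""
    x, y, w, h = face_box
    rois = {}
    for name, (cx, cy, rw, rh) in ratios.items():
        bw, bh = int(rw * w), int(rh * h)
        bx = int(x + cx * w - bw / 2)   # box centered at the ratio point
        by = int(y + cy * h - bh / 2)
        rois[name] = (bx, by, bw, bh)
    return rois

# Illustrative fractions only -- not the study's geometric model.
EXAMPLE_RATIOS = {
    "forehead":    (0.50, 0.15, 0.40, 0.15),
    "nose":        (0.50, 0.55, 0.20, 0.20),
    "left_cheek":  (0.25, 0.60, 0.20, 0.20),
    "right_cheek": (0.75, 0.60, 0.20, 0.20),
    "maxillary":   (0.50, 0.80, 0.30, 0.15),
}
```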
2.6.2. Feature Extraction
The IRTI data of participants at baseline and while using the different websites were extracted. Afterward, MATLAB R2020b (MathWorks, Natick, MA, USA) was used to convert each thermal image into a gray matrix. Then, the statistical features and the texture features of the GLCM were calculated, as described below.
The statistical features were obtained from the original gray matrix of the extracted thermal imaging data. In the following equations, $X_k$ is an ROI described by a series of pixels $p_{ij}$ in the range of 0–255 (8-bit grayscale), $k$ is the currently processed ROI, and $K$ is the number of ROIs. The mean of all pixels is
$$\mu_k = \frac{1}{wh}\sum_{i=1}^{w}\sum_{j=1}^{h} p_{ij},$$
where $w$ and $h$ represent the rows and columns of the gray matrix, respectively. The variance of all pixels is
$$\sigma_k^2 = \frac{1}{wh}\sum_{i=1}^{w}\sum_{j=1}^{h}\left(p_{ij}-\mu_k\right)^2.$$
The mean of the row variances is
$$\bar{\sigma}_{\mathrm{row}}^2 = \frac{1}{w}\sum_{i=1}^{w}\left[\frac{1}{h}\sum_{j=1}^{h}\left(p_{ij}-\mu_i\right)^2\right],$$
where $\mu_i$ is the average value of row $i$. Analogously, the mean of the column variances is
$$\bar{\sigma}_{\mathrm{col}}^2 = \frac{1}{h}\sum_{j=1}^{h}\left[\frac{1}{w}\sum_{i=1}^{w}\left(p_{ij}-\mu_j\right)^2\right],$$
where $\mu_j$ is the average value of column $j$. In addition, $C_k$ represents the contrast of all pixels $p_{ij}$, $Md_k$ is the median value of the whole gray matrix, and $\overline{Md}_{\mathrm{row}}$ and $\overline{Md}_{\mathrm{col}}$ represent the means of the per-row and per-column median values, respectively.
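The per-ROI statistical features can be computed directly from an ROI's gray matrix, for example as below. The contrast definition used here (the max–min gray range) is our assumption for the sketch, as the text does not spell out its exact formula.

```python
import numpy as np

def statistical_features(roi: np.ndarray) -> dict:
    """Compute the 8 statistical features of one ROI from its 8-bit
    gray matrix (rows = image rows, columns = image columns)."""
    roi = roi.astype(float)
    return {
        "mean": roi.mean(),
        "variance": roi.var(),
        "row_var_mean": roi.var(axis=1).mean(),   # mean of per-row variances
        "col_var_mean": roi.var(axis=0).mean(),   # mean of per-column variances
        "contrast": float(roi.max() - roi.min()), # assumed definition
        "median": float(np.median(roi)),
        "row_median_mean": np.median(roi, axis=1).mean(),
        "col_median_mean": np.median(roi, axis=0).mean(),
    }
```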
The GLCM is a widely used image texture analysis method that relies on angle and distance. Five texture feature statistics of the GLCM were used, as described below.
Let an image have $N_x$ columns and $N_y$ rows, and let the gray level occurring in each pixel be quantized to $N_g$ levels. Let $L_x = \{1, 2, \ldots, N_x\}$ be the columns, $L_y = \{1, 2, \ldots, N_y\}$ be the rows, and $G = \{1, 2, \ldots, N_g\}$ be the set of quantized gray levels; then, $L_y \times L_x$ is the set of pixels. Image $I$ is a function that assigns some gray level in $G$ to every pixel, $I : L_y \times L_x \to G$. The texture-context information is specified by the matrix of relative frequencies $P(i, j)$ with which two neighboring pixels separated by distance $d$ occur on the image, one with gray level $i$ and the other with gray level $j$. Such matrices of gray-level co-occurrence frequencies are a function of the angular relationship and the distance between the neighboring pixels. Formally, for angles quantized to 45° intervals, the unnormalized frequencies are defined, e.g., for the 0° direction, by
$$P(i, j, d, 0°) = \#\left\{\left((k, l), (m, n)\right) \in (L_y \times L_x) \times (L_y \times L_x) : k - m = 0,\ \left|l - n\right| = d,\ I(k, l) = i,\ I(m, n) = j\right\},$$
where $\#$ denotes the number of elements in the set; the 45°, 90°, and 135° matrices are defined analogously.
Let $p(i, j)$ be the $(i, j)$th entry in the normalized GLCM, and let $\mu_x$, $\mu_y$, $\sigma_x$, and $\sigma_y$ be the means and standard deviations of its column and row marginal distributions. The five texture features are
$$\mathrm{ASM} = \sum_{i}\sum_{j} p(i, j)^2, \quad \mathrm{Contrast} = \sum_{i}\sum_{j} (i - j)^2\, p(i, j), \quad \mathrm{Correlation} = \frac{\sum_{i}\sum_{j} (i - \mu_x)(j - \mu_y)\, p(i, j)}{\sigma_x \sigma_y},$$
$$\mathrm{Homogeneity} = \sum_{i}\sum_{j} \frac{p(i, j)}{1 + \left|i - j\right|}, \quad \mathrm{Dissimilarity} = \sum_{i}\sum_{j} \left|i - j\right|\, p(i, j),$$
which represent the angular second moment (ASM), contrast, correlation, homogeneity, and dissimilarity of the GLCM features, respectively [46]. Finally, these 5 features were computed at every combination of the 4 angles and the 4 distances (in our study, $d$ = 2, 4, 8, and 16); therefore, 16 variants are derived from each of the above features for the current ROI $k$.
In summary, each ROI has 8 statistical features and 80 GLCM texture features. Therefore, each region has a total of 88 features, and the total number of features is 440.
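A minimal NumPy sketch of the GLCM construction and the five statistics follows. The quantization to a fixed number of levels and the symmetric counting of pixel pairs are implementation choices for the sketch, not necessarily those of the study.

```python
import numpy as np

def glcm(img, d=1, angle=(0, 1), levels=8):
    """Build a normalized gray-level co-occurrence matrix.  `angle` is a
    (row, col) step: (0, 1) is 0 deg, (-1, 1) is 45 deg, (-1, 0) is
    90 deg, (-1, -1) is 135 deg.  Pairs are counted symmetrically."""
    q = (img.astype(int) * levels) // 256          # quantize 0-255 to `levels`
    dr, dc = d * angle[0], d * angle[1]
    P = np.zeros((levels, levels))
    rows, cols = q.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[q[r, c], q[r2, c2]] += 1
                P[q[r2, c2], q[r, c]] += 1         # symmetric counting
    return P / P.sum()

def glcm_features(P):
    """The five Haralick-style statistics used in the text."""
    i, j = np.indices(P.shape)
    mu_x, mu_y = (i * P).sum(), (j * P).sum()
    sig_x = np.sqrt(((i - mu_x) ** 2 * P).sum())
    sig_y = np.sqrt(((j - mu_y) ** 2 * P).sum())
    return {
        "ASM": (P ** 2).sum(),
        "contrast": ((i - j) ** 2 * P).sum(),
        "correlation": ((i - mu_x) * (j - mu_y) * P).sum() / (sig_x * sig_y),
        "homogeneity": (P / (1 + np.abs(i - j))).sum(),
        "dissimilarity": (np.abs(i - j) * P).sum(),
    }
```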
2.6.3. Feature Selection
Relevant features that facilitated classification needed to be selected from the full feature set to avoid the dimensionality problem. Hence, the neighborhood component analysis (NCA) method was used [47,48], in which feature weights are learned to maximize the expected classification accuracy under regularization. In this process, fivefold cross-validation was used to tune the regularization parameter $\lambda$, and the minimum loss value was determined according to the loss function.
Let $S = \{(x_i, y_i)\}_{i=1}^{N}$ be the training set, where $x_i$ is a $d$-dimensional feature vector, $y_i$ is the corresponding training target, and $N$ is the number of samples. To select the optimal features, we need a weight vector $w$; the weighted distance between two samples $x_i$ and $x_j$ is
$$D_w(x_i, x_j) = \sum_{l=1}^{d} w_l^2 \left|x_{il} - x_{jl}\right|,$$
where $w_l$ is the weight associated with the $l$th feature. Since confirming the nearest neighbor as the reference point with the leave-one-out method yields a nondifferentiable function, an approximate probability distribution was used to determine the reference point. Therefore, the probability that $x_i$ chooses $x_j$ as its reference point is
$$p_{ij} = \frac{\kappa\left(D_w(x_i, x_j)\right)}{\sum_{k \neq i} \kappa\left(D_w(x_i, x_k)\right)}, \qquad p_{ii} = 0,$$
where $\kappa(z) = \exp(-z/\sigma)$ is a kernel function and $\sigma$ is the kernel width. If $\sigma \to 0$, only the nearest neighbor of the query sample is selected as its reference point; however, if $\sigma \to \infty$, all of the points apart from the query point have the same chance of being selected. Thus, the probability that query point $x_i$ is correctly classified is
$$p_i = \sum_{j} y_{ij}\, p_{ij}, \qquad y_{ij} = \begin{cases} 1, & y_i = y_j \\ 0, & \text{otherwise.} \end{cases}$$
Afterward, to avoid overfitting, a regularization term $\lambda$ is introduced, and the approximate leave-one-out classification accuracy is obtained as
$$F(w) = \frac{1}{N}\sum_{i=1}^{N} p_i - \lambda \sum_{l=1}^{d} w_l^2.$$
Finally, the function $F(w)$ is differentiable, and its derivative with respect to $w_l$ can be computed as
$$\frac{\partial F(w)}{\partial w_l} = 2 w_l \left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{\sigma}\left(p_i \sum_{j \neq i} p_{ij}\left|x_{il} - x_{jl}\right| - \sum_{j \neq i} y_{ij}\, p_{ij}\left|x_{il} - x_{jl}\right|\right) - \lambda\right].$$
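The NCA weighting scheme described above can be sketched as a toy gradient-ascent loop. A common implementation in practice is MATLAB's fscnca; this NumPy version is only a sketch, and the hyperparameter values are illustrative.

```python
import numpy as np

def nca_weights(X, y, lam=0.01, sigma=1.0, lr=0.2, iters=100):
    """Toy gradient-ascent NCA feature weighting.  Returns one
    non-negative weight per feature; a larger weight means the feature
    is more useful for nearest-neighbor classification."""
    N, d = X.shape
    w = np.ones(d)
    match = (y[:, None] == y[None, :]).astype(float)   # y_ij indicator
    diff = np.abs(X[:, None, :] - X[None, :, :])       # |x_il - x_jl|
    for _ in range(iters):
        D = (diff * w**2).sum(axis=2)                  # weighted distances
        K = np.exp(-D / sigma)
        np.fill_diagonal(K, 0)                         # p_ii = 0
        p = K / K.sum(axis=1, keepdims=True)           # p_ij
        p_i = (p * match).sum(axis=1)                  # prob. correctly classified
        grad = np.zeros(d)
        for l in range(d):                             # dF/dw_l as in the text
            t1 = (p * diff[:, :, l]).sum(axis=1) * p_i
            t2 = (p * match * diff[:, :, l]).sum(axis=1)
            grad[l] = 2 * w[l] * ((t1 - t2).mean() / sigma - lam)
        w = np.maximum(w + lr * grad, 0)               # gradient ascent step
    return w
```

On a toy dataset where only the first feature separates the classes, the learned weight for that feature ends up clearly larger than the weight for a noise feature.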
2.6.4. Emotional Classification
To eliminate the influence of the data units, the features were normalized using the z score:
$$z = \frac{x - \mu}{\sigma},$$
where $x$ is the original data, $z$ is the normalized data, and $\mu$ and $\sigma$ are the mean and standard deviation of the original data $x$, respectively.
The labels of the data used for machine learning were calibrated according to the valence reported by the participants on the SAM. We classified only two emotional experiences, positive and negative: when the valence was less than 3, the emotional experience was labeled negative; otherwise, it was labeled positive. The baseline served as an effective point of comparison for the emotional experiences. Consequently, three binary classification tasks were defined: positive emotional experiences versus baseline (P-Base), negative emotional experiences versus baseline (N-Base), and positive versus negative emotional experiences (P-N). Based on the results of the feature selection, we selected the top 15 features with the highest selection frequency in each task.
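The normalization and labeling steps can be sketched as below; the function names are ours, not from the study's code.

```python
import numpy as np

def zscore(X):
    """Column-wise z-score normalization: (x - mean) / std per feature."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def sam_label(valence):
    """Binary emotional-experience label from the 5-point SAM valence:
    ratings below the midpoint (3) are negative, 3 and above positive."""
    return "negative" if valence < 3 else "positive"
```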
A support vector machine (SVM), a supervised learning approach, was chosen to classify the emotional experiences; this classifier is well established for binary classification tasks in machine learning. The SVM identifies a high-dimensional discriminative hyperplane that separates the two categories of data while maximizing the margin between them. Its performance can be further improved by optimizing parameters and selecting different kernel functions. In this experiment, cross-validation and a grid search were used to train the SVM model and determine the optimal parameters and kernel function. We found that the SVM model performed best with a Gaussian kernel function and a kernel parameter of 2.15 in this research.
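A cross-validated grid search over SVM kernels and parameters can be sketched with scikit-learn as follows. The synthetic data and the grid values are illustrative stand-ins, not the study's features or search space.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 15 selected thermal features with binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 15))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

param_grid = {
    "svc__kernel": ["linear", "rbf"],   # "rbf" is the Gaussian kernel
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.01, 0.1],
}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),  # z-score inside each CV fold
    param_grid,
    cv=5,              # five-fold cross-validation
    scoring="f1",
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Normalizing inside the pipeline (rather than once on the full dataset) keeps the cross-validation folds free of information leakage.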
2.7. Statistical Analysis
The variation in the valence and arousal dimensions from the SAM was evaluated according to the means and standard deviations; the emotional experiences were then classified according to the results of this subjective evaluation. The mean grayscale value variations in each ROI between the baseline and the task period were compared. Student's t-tests ($\alpha$ = 0.05) with Bonferroni correction were used to verify the significance of emotional experience fluctuations. To mitigate the impact of an uneven data distribution, this study employed the F1 score, in addition to the accuracy, as an evaluation metric for assessing the classification of emotional experiences. The F1 score is the harmonic mean of precision and recall in machine learning and is given by
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN},$$
where $TP$ is the number of true positives, $FP$ is the number of false positives, and $FN$ is the number of false negatives [49].
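The F1 computation from the confusion counts is straightforward; as a quick worked example, TP = 8, FP = 2, FN = 2 gives 16/20 = 0.8.

```python
def f1_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN), the harmonic mean of
    precision TP/(TP+FP) and recall TP/(TP+FN)."""
    return 2 * tp / (2 * tp + fp + fn)
```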
5. Conclusions
In this paper, IRTIs were used in HCI research. In contrast to previous emotional stimulation research, in this study, B2C websites were used as experimental materials to explore the emotional changes in participants during a task. The results demonstrated that thermal imaging data can effectively reflect the changes in the emotional experience of users interacting with websites with different designs. The main conclusions of this study are as follows. We found that, when participants used different websites, they exhibited greater changes in valence than in arousal. Therefore, we used valence as a benchmark to divide user emotional experiences into positive or negative experiences. In the feature selection process, the left cheek, right cheek, and forehead were the three ROIs that contributed the most features, while the mean of the column medians, the mean of the row variances, and the mean among the statistical features, and the correlation and homogeneity among the GLCM texture features, were the most-selected features. In the feature classification process, we found that the SVM model demonstrated good classification performance between the baseline and emotional experiences. The results of this study demonstrate the effectiveness of applying IRTIs in HCI research and illuminate further research directions for their application.
There are also some limitations to this study. First, regarding the design of the experimental stimuli, only three design elements, IAs, website background color, and product display shape, were manipulated in this study. Many other design elements affect users' emotional experiences on websites, and future research could manipulate additional design elements to elicit users' emotions. Second, in this experiment, only the emotional changes in five ROIs were verified; increasing the number of ROIs to better measure and classify emotions should be considered. Third, the ROI extraction was semiautomated. Due to the inevitable head movements of participants during the experiment, real-time tracking and positioning of the facial ROIs could enhance the accuracy of data extraction.