5.1. Study 1: Sentiment Analysis and Latent Dirichlet Allocation (LDA) Topic Modeling Analysis Results
This research was mainly based on sentiment analysis of reviews of organic products on Taobao, which can help consumers learn about the reputation of these products. Sentiment is usually classified as positive, negative, or neutral [78]. A dictionary-based sentiment analysis uses a sentiment lexicon to assign each word a weight that reflects its emotional inclination. All sentiment words are then extracted from the review, the final sentiment score is calculated taking into account the negation words and degree adverbs in the review, and the emotional polarity of the review is judged from that score [36,79]. The dictionaries used were the Boson NLP (Natural Language Processing) dictionary (including positive sentiment words and negative sentiment words), a dictionary of negation words, and a degree adverb dictionary. The sentiment dictionary was downloaded from the Boson NLP data and is derived from social media text, which makes it suitable for sentiment analysis of social media content. This study divided reviews into positive and negative reviews according to their sentiment score: reviews scoring below 0 were classed as negative and reviews scoring above 0 as positive.
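The scoring rule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three toy lexicons, their weights, and the function names are hypothetical stand-ins for the Boson NLP, negation, and degree adverb dictionaries, and the treatment of a score of exactly 0 as neutral is an assumption, since the criterion above only covers scores below and above 0.

```python
# Toy lexicons standing in for the Boson NLP sentiment, negation, and
# degree-adverb dictionaries; every entry and weight here is hypothetical.
SENTIMENT = {"good": 2.0, "fresh": 1.5, "slow": -1.5, "bad": -2.0}
NEGATIONS = {"not", "never"}
DEGREE = {"very": 1.5, "slightly": 0.5}


def sentiment_score(tokens):
    """Sum sentiment weights, letting preceding modifiers adjust each word."""
    score = 0.0
    multiplier = 1.0  # accumulated effect of negations and degree adverbs
    for token in tokens:
        if token in NEGATIONS:
            multiplier *= -1.0
        elif token in DEGREE:
            multiplier *= DEGREE[token]
        elif token in SENTIMENT:
            score += multiplier * SENTIMENT[token]
            multiplier = 1.0  # modifiers apply only to the next sentiment word
    return score


def polarity(tokens):
    """Score > 0 is positive, score < 0 is negative (0 as neutral is assumed)."""
    score = sentiment_score(tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `polarity(["not", "fresh"])` flips the weight of "fresh" and classifies the review as negative, while `sentiment_score(["very", "good"])` scales the weight of "good" by the degree adverb.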
Latent Dirichlet Allocation (LDA) is the most commonly used method for topic modeling [80]. Topic modeling using LDA can discover topic words from large amounts of unstructured text data or big data [75]. In this study, LDA was mainly used to extract the keywords related to consumer satisfaction with the online purchase of organic products. The procedure in this study was as follows:
Read the collection of review documents and use Jieba for word segmentation.
Assign an ID to each word, which forms the corpus dictionary.
After the IDs are assigned, count the frequency of each word and form a sparse vector of “word ID: word frequency” pairs.
Use the LDA model of the Gensim library for training.
After the model finishes running, it outputs the probability that a review belongs to each topic, and the review is assigned to a topic based on these probabilities.
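The dictionary and sparse-vector steps above can be sketched in plain Python. This is an illustration under assumptions rather than the authors' code: the reviews are hypothetical and already segmented (the study segments with Jieba), and the helper names are invented here. In the actual pipeline, Gensim's `corpora.Dictionary` and its `doc2bow` method produce the same kind of (word ID, frequency) pairs before the LDA model is trained.

```python
from collections import Counter


def build_dictionary(tokenized_docs):
    """Assign an integer ID to each distinct word, in order of first appearance."""
    word_to_id = {}
    for doc in tokenized_docs:
        for word in doc:
            if word not in word_to_id:
                word_to_id[word] = len(word_to_id)
    return word_to_id


def to_sparse_vector(doc, word_to_id):
    """Return the sorted (word ID, word frequency) pairs for one document."""
    counts = Counter(word_to_id[word] for word in doc if word in word_to_id)
    return sorted(counts.items())


def dominant_topic(topic_probs):
    """Pick the topic ID with the highest probability from (topic ID, probability) pairs."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


# Hypothetical, already-segmented reviews.
reviews = [
    ["packaging", "beautiful", "packaging"],
    ["delivery", "slow", "packaging"],
]
word_to_id = build_dictionary(reviews)
corpus = [to_sparse_vector(doc, word_to_id) for doc in reviews]
```

The last step of the procedure corresponds to `dominant_topic`, which takes the per-topic probabilities that a trained model returns for one review and keeps the most probable topic.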
First, using the sentiment analysis method, the reviews crawled from Taobao were divided into positive and negative reviews. In total, 36,603 negative reviews and 431,567 positive reviews were collected.
Second, the positive and negative reviews were analyzed using LDA topic modeling to derive the keywords. The topic words were drawn from nouns. The LDA topic modeling results of this study were as follows. In Topic 1, words such as great, golden, color, picture, very good, appearance, bag, and gift were extracted, which indicated that the topic was related to packaging design. In Topic 2, words such as quality, products, nutrition, health, first-rate, product quality, good, and type were extracted. Thus, Topic 2 was related to nutritional information. In Topic 3, words such as quality, beautiful, perfect, great, loyal, fans, fresh, and crisp were extracted. This means that Topic 3 was related to food quality. In Topic 4, words such as time, too slow, hour, yuan tong, consumption, postage, nonsense, and late were extracted. Therefore, Topic 4 was related to delivery risk. In Topic 5, words such as organic, garbage, almost, pesticide, diarrhea, bad smell, hospital, and epidermis were extracted. Topic 5 was related to freshness. In Topic 6, words such as evaluation, customer service, online shopping, attitude, regular customer, psychology, merchants, and cautious were extracted, which indicated that Topic 6 was related to source risk. In total, LDA topic modeling yielded six keywords. The keywords for positive reviews were packaging design, nutritional information, and food quality, and the keywords for negative reviews were delivery risk, freshness, and source risk. The keywords of online organic products are shown in Table 4 below.
An online survey was conducted among 434 users who had purchased organic produce online to test the relationship between the six keywords above and satisfaction. A total of 24 items were evaluated on a 7-point Likert scale (1 “completely disagree” to 7 “completely agree”). The measurement scales were adapted from previous studies, as shown in Appendix A. The items were reviewed by Chinese and Korean experts.
The 434 respondents were Chinese users who had purchased organic agricultural products online, and the questionnaire was administered from October 29 to November 16, 2019.
Appendix B shows the demographic information of the participants. Among them, there were 160 males (51.95%) and 274 females (48.05%). Users aged 18–40 constituted the largest group, with 184 consumers (42.63%) aged 18–30 and 101 consumers (21.89%) aged 31–40. Regarding educational level, users with an undergraduate education or a master’s or higher degree formed the largest group: 227 (52.53%) had an undergraduate education and 107 (24.65%) had a master’s or higher degree. In terms of income, 255 (58.99%) consumers earned less than $710 and 140 (32.49%) consumers earned $710–1410; these income groups contained the largest number of consumers who purchased organic produce online. In terms of occupation, 149 (34.33%) of the consumers who purchased organic produce online were career students, followed by 72 (16.59%) consumers who were full-time workers (e.g., professor, nurse). Comparing online and offline purchases of organic products, consumers were more willing to buy organic products online: 279 (64.52%) consumers bought organic products online once a month and 6 (1.38%) bought them online 11 or more times, whereas 154 (35.71%) consumers purchased organic products offline once a month and 44 (10.37%) purchased them offline 11 or more times. The types of organic produce most often bought were organic vegetables, 253 (58.29%); organic fruits, 345 (79.49%); and organic foods, 178 (41.01%). Consumers purchased organic products mainly for health reasons (157, 55.67%).
This research used the data from the online questionnaire to test validity and to test the hypotheses on the relationship between the variables and satisfaction, and then discussed the results. First, in the factor analysis, the factor loading of each item on its principal component was greater than 0.5, indicating that the items fell well into the corresponding dimensions. The construct reliability (CR) and average variance extracted (AVE) were calculated from the loading values. The construct reliability of each variable was between 0.836 and 0.917, all above the standard of 0.6, and the average variance extracted was between 0.562 and 0.759, all above the standard of 0.5. The alpha coefficient is usually used to measure the reliability of a questionnaire: the larger the alpha coefficient, the higher the reliability and stability of the questionnaire. Generally, the alpha coefficient should be higher than 0.5, and the values in this analysis were all higher than 0.8, which showed that the data had good reliability and that this study passed the reliability test. The results are shown in Table 5.
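As an illustration of how CR and AVE follow from the loading values, the sketch below uses the commonly applied formulas based on standardized loadings. It assumes these are the formulas used in this study, which the text does not state explicitly, and the loadings in the example are hypothetical rather than taken from Table 5.

```python
def composite_reliability(loadings):
    """CR = (sum of loadings)^2 / ((sum of loadings)^2 + sum of error variances)."""
    total = sum(loadings)
    # Error variance of a standardized item is 1 minus its squared loading.
    error = sum(1.0 - loading ** 2 for loading in loadings)
    return total ** 2 / (total ** 2 + error)


def average_variance_extracted(loadings):
    """AVE = mean of the squared standardized loadings."""
    return sum(loading ** 2 for loading in loadings) / len(loadings)


# Hypothetical standardized loadings for one four-item construct.
loadings = [0.8, 0.8, 0.7, 0.7]
cr = composite_reliability(loadings)
ave = average_variance_extracted(loadings)
```

With these example loadings, both values clear the thresholds quoted above (0.6 for CR and 0.5 for AVE).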
Table 6 describes the regression analysis. Each independent variable had a corresponding regression coefficient and a significance test. β denotes the standardized regression coefficient, which expresses the relationship between each independent variable (predictor) and the dependent variable. The results showed that packaging design, nutritional information, and food quality were positively related to satisfaction, whereas delivery risk, freshness, and source risk were negatively related to satisfaction. Therefore, H1, H2, H3, and H4 were supported. However, H5 was rejected. Standardization places the independent and dependent variables on a common scale, which reduces errors caused by different units and makes the coefficients comparable. The t-value is the result of a t-test of each regression coefficient. The larger its absolute value, the smaller the sig, where sig denotes the significance of the t-test. Statistically, a sig below 0.05 is generally considered significant for the coefficient test, indicating that the independent variable can effectively predict the variation of the dependent variable. Our results were as follows: packaging design (β = 0.245, sig = 0.000), nutritional information (β = 0.240, sig = 0.000), food quality (β = 0.199, sig = 0.000), delivery risk (β = −0.104, sig = 0.009), freshness (β = −0.107, sig = 0.008), and source risk (β = −0.137, sig = 0.001). The six independent variables had significant standardized regression coefficients for user satisfaction.
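To make the role of standardization concrete, the following sketch fits an ordinary least squares regression on z-scored variables with NumPy and returns the standardized coefficients with their t-values. The data are synthetic and the function name is hypothetical; it does not reproduce the survey data or the statistical software used in this study, and converting a t-value to a sig value would additionally require the t distribution.

```python
import numpy as np


def standardized_ols(X, y):
    """Fit OLS on z-scored data; return standardized betas and their t-values."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    yz = (y - y.mean()) / y.std(ddof=1)
    n, k = Xz.shape
    # No intercept column is needed: every z-scored variable has mean zero.
    beta, *_ = np.linalg.lstsq(Xz, yz, rcond=None)
    residuals = yz - Xz @ beta
    # Residual variance with n - k - 1 degrees of freedom (k slopes + intercept).
    sigma2 = residuals @ residuals / (n - k - 1)
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xz.T @ Xz)))
    return beta, beta / std_err


# Synthetic data: one positive and one negative predictor of the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=200)
beta, t_values = standardized_ols(X, y)
```

Because the predictors are z-scored first, rescaling a predictor (for example, switching its units) leaves the standardized coefficients unchanged, which is why they can be compared across variables measured in different units.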
5.2. Study 2: Online Variables and Sales Volume Linear Regression
The main purpose of this part was to examine how six crawled variables (price, product fans, price discount, number of customer reviews, organic labeling, and free delivery) affect consumer purchases of organic agricultural products. Regression analysis was applied to the crawled data.
β represents the standardized regression coefficient. The results showed that product fans, the number of reviews, and price discount were positively related to the sales volume. The t-value is the result of a t-test of the regression coefficients. The larger its absolute value, the smaller the sig, where sig denotes the significance of the t-test. Statistically, a sig below 0.05 is generally considered significant for the coefficient test, indicating that the independent variable can effectively predict the variation of the dependent variable. The results of the regression model showed that the significance values of product fans, price discount, and number of customer reviews were all below 0.005, so all three variables passed the significance test. The results are shown in Table 7.
The relationship between each variable and the sales volume in the regression analysis is shown in Figure 4 below. The path coefficient for H8 was positive and significant (2.868, p < 0.01). Thus, H8 was supported, indicating that the price discount has a positive impact on the sales volume. The hypothesis for the relationship between product fans and sales volume (H11) was also supported, with a path coefficient of 14.174 (p < 0.01). The hypothesis regarding the number of reviews (H12) was also supported, with a significant path coefficient of 36.283. Thus, the regression analysis showed that the three variables of product fans, number of reviews, and price discount had a positive impact on the sales volume. However, hypotheses H7, H9, and H10 were not supported: price, free delivery, and organic labeling did not significantly affect the sales volume.
Recent neural network research has mainly focused on prediction for complex problems and is therefore suitable for research with a large amount of data [50]. Neural networks are one of the methods of machine learning. This research mainly used the BP (Back Propagation) neural network to predict the sales volume of organic products on Taobao. The network consisted of interconnected nodes arranged in three hierarchical layers (input, hidden, and output). The BP process was divided into two stages. The first stage was the forward propagation of the signal, from the input layer through the hidden layer to the output layer; the second stage was the backpropagation of the error, from the output layer through the hidden layer to the input layer. The BP model was trained with the Keras neural network framework to predict the sales volume of organic products on Taobao.
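The two stages can be seen in a minimal NumPy sketch of a network with one hidden layer. This is not the Keras model used in this study: the hidden-layer size, tanh activation, learning rate, number of epochs, and the synthetic data are all assumptions chosen only to keep the example small.

```python
import numpy as np


def train_bp(X, y, hidden=4, lr=0.05, epochs=500, seed=0):
    """Train a one-hidden-layer network with plain gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        # Stage 1: forward propagation, input -> hidden -> output.
        h = np.tanh(X @ W1 + b1)
        pred = (h @ W2 + b2).ravel()
        err = pred - y
        losses.append(float(np.mean(err ** 2)))
        # Stage 2: backpropagation of the error, output -> hidden -> input.
        grad_pred = (2.0 * err / len(y))[:, None]
        grad_W2 = h.T @ grad_pred
        grad_b2 = grad_pred.sum(axis=0)
        grad_h = grad_pred @ W2.T
        grad_z = grad_h * (1.0 - h ** 2)  # derivative of tanh
        grad_W1 = X.T @ grad_z
        grad_b1 = grad_z.sum(axis=0)
        # Gradient descent update of all weights and biases.
        W1 -= lr * grad_W1
        b1 -= lr * grad_b1
        W2 -= lr * grad_W2
        b2 -= lr * grad_b2
    return (W1, b1, W2, b2), losses


def predict(params, X):
    """Run only the forward pass with trained parameters."""
    W1, b1, W2, b2 = params
    return (np.tanh(X @ W1 + b1) @ W2 + b2).ravel()


# Synthetic inputs and target standing in for the crawled variables and sales volume.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X[:, 0] - 0.5 * X[:, 1]
params, losses = train_bp(X, y)
```

The recorded `losses` list makes it possible to check that the error shrinks as the backpropagated gradients update the weights.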
Artificial neural network analysis and modeling, one of the representative methods of predictive analysis, was used to check whether the three indicators obtained through regression analysis can predict sales volume. There were three layers: the input layer, the hidden layer, and the output layer. The input layer was the prices, discounted prices, free delivery, organic labeling, and the number of customer reviews, and the output layer was the sales volume. The hidden layer was set to 2. The crawled data set was divided into a training set and a test set, with “training” set to 50% and “testing” set to 50%. On the training set, the inputs of prices, product fans, price discount, number of customer reviews, organic labeling, and free delivery were used for modeling to obtain the output index of the sales volume. After the model was obtained, the test set was used with the same inputs to get the output index of the sales volume. To confirm the predictive power of the three statistically significant indicators, a total of 6 artificial neural network models were implemented, with lower loss and lower RMSE (Root Mean Square Error) indicating a better model. The artificial neural network analysis concluded that the three variables had an impact on the sales volume. The results are shown in Table 8 below. The artificial neural network model is described in detail in Appendix C.
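The 50/50 split and the RMSE criterion mentioned above can be sketched with the standard library alone. The helper names are hypothetical, and shuffling before the split is an assumption, since the text does not say how rows were allocated to the training and test sets.

```python
import math
import random


def split_half(rows, seed=0):
    """Shuffle the rows and split them 50/50 into training and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical row indices standing in for the crawled product records.
train_rows, test_rows = split_half(range(10))
```

A model is then fitted on `train_rows` only, and `rmse` is computed between the observed and predicted sales volume on `test_rows`; a lower value indicates a better fit on unseen data.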