### **4. Experiments**

In this section, we first examine the extraction performance of the pretrained models BERT, ALBERT, and RoBERTa. We then show the effectiveness of our proposed data enrichment methods by conducting an ablation study with the pretrained models.

Datasets: we obtained company reviews from an online company review platform for job seekers. We used the reviews of two companies (Google and Amazon) for evaluation; we chose these companies because expert articles about them were also available for comparison. We first split the reviews into sentences using the NLTK sentence tokenizer [11]. For Google, we used all 13,101 sentences from the reviews; for Amazon, we randomly sampled 10,000 sentences. We then asked four editors to label these sentences (1 or 0) on the basis of their salience. To check annotation quality, we randomly sampled 100 sentences (50 positive and 50 negative) and asked two editors to label them, obtaining a Cohen's kappa agreement of 0.9 between the editors. This agreement is higher than the agreement scores reported in previous studies related to our work (e.g., 0.81 from SemEval-2019 Task 9 [8] and 0.59 from TipRank [12]).
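For concreteness, the sentence-splitting step can be sketched as follows. This is a minimal illustration assuming the reviews are available as plain strings; the function name and toy review texts are ours, not samples from the dataset.

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # model required by the NLTK sentence tokenizer

def reviews_to_sentences(reviews):
    """Split raw review texts into individual sentences."""
    sentences = []
    for review in reviews:
        sentences.extend(sent_tokenize(review))
    return sentences

# Toy input for illustration only.
reviews = ["Great benefits. The on-site gym is open 24/7.",
           "Pay is competitive, but work-life balance varies by team."]
print(reviews_to_sentences(reviews))
```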

Hyperparameters: we split the labeled dataset into training and test sets at a ratio of 4:1. For training the pretrained models, we set the number of epochs to 5, the max sequence length to 128, and the batch size to 32. We used the F1 score of the positive class (i.e., salient) to measure model performance. Since a model may achieve its best F1 score in the middle of training, we evaluated each model 15 times during training and report the best F1 score among the 15 snapshots.
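The setup below sketches these settings with the Hugging Face `transformers` Trainer; the paper does not name a training framework, so the API choice and the `eval_steps` value are our assumptions. The metric function implements the positive-class F1 described above.

```python
import numpy as np
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # or "albert-base-v2" / "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def tokenize(batch):
    # max sequence length = 128, as reported above
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    # F1 of the positive (salient) class, the metric used in the paper
    return {"f1_pos": f1_score(labels, preds, pos_label=1)}

args = TrainingArguments(
    output_dir="checkpoints",
    num_train_epochs=5,              # epochs = 5
    per_device_train_batch_size=32,  # batch size = 32
    evaluation_strategy="steps",
    eval_steps=100,  # assumed value; pick so that ~15 evaluations occur per run
)
# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...,
#                   compute_metrics=compute_metrics)
```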

### *4.1. Effectiveness of Pretrained Models*

We first compared the performance of the pretrained models with that of other supervised learning algorithms, namely, logistic regression (LR), support vector machine (SVM), convolutional neural network (CNN), and recurrent neural network with long short-term memory (LSTM). We used the same configuration to train and evaluate all models. Unsurprisingly, all pretrained models consistently outperformed the other models on the two datasets (as shown in Table 4). BERT achieved the highest F1 scores, with absolute F1 gains as high as 0.16 and 0.14 on Google and Amazon, respectively. These results indicate that pretrained models are well suited to the salient fact extraction task.
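For reference, the non-pretrained baselines can be built with standard libraries. The paper does not state the feature representation used for LR and SVM, so the TF-IDF features below are an assumption; this is a sketch, not the exact baseline setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def train_bow_baseline(train_texts, train_labels, classifier):
    """Bag-of-words baseline: TF-IDF features + a linear classifier."""
    model = make_pipeline(TfidfVectorizer(), classifier)
    model.fit(train_texts, train_labels)
    return model

# Example usage with the two linear baselines:
# lr = train_bow_baseline(train_texts, train_labels, LogisticRegression(max_iter=1000))
# svm = train_bow_baseline(train_texts, train_labels, LinearSVC())
# f1 = f1_score(test_labels, lr.predict(test_texts), pos_label=1)  # positive-class F1
```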


**Table 4.** F1 scores of BERT, ALBERT (ALB.), RoBERTa (ROB.), LR, SVM, CNN, and LSTM on Google and Amazon datasets. The best score for each dataset is in bold.

### *4.2. Effectiveness of Representation Enrichment*

To investigate the effectiveness of representation enrichment, we curated two token lists, one for uncommon attribute descriptions and one for quantitative descriptions. We applied the two lists separately to each pretrained model and report the resulting F1 scores in Table 5, along with the F1 improvements over the unenriched models.
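The enrichment mechanics are not spelled out in this section, but one plausible reading is to append a marker token to a sentence whenever it mentions a term from a curated list, giving the pretrained model an explicit cue. The list entries and marker strings below are illustrative only, not the curated lists themselves.

```python
# Illustrative entries only; the real curated lists are larger and domain-specific.
UNCOMMON_ATTRIBUTES = {"sabbatical", "on-site childcare", "peer bonus"}
QUANTITATIVE_CUES = {"%", "$", "per year", "hours"}

def enrich(sentence, token_list, marker):
    """Append a marker token when the sentence contains a curated term."""
    if any(term in sentence.lower() for term in token_list):
        return f"{sentence} {marker}"
    return sentence

print(enrich("Employees get a paid sabbatical every five years.",
             UNCOMMON_ATTRIBUTES, "[UNCOMMON]"))
```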

**Table 5.** F1 scores of BERT, ALBERT (ALB.), and RoBERTa (ROB.) when using representation enrichment. F1 improvements compared with direct use of the pretrained models (see Table 4) are marked in orange. Best scores are marked in bold.


We first evaluated the effect of representation enrichment with the uncommon attribute token list (Uncommon). As shown in Table 5, Uncommon improved the F1 score of BERT, the best model in Table 4, from 0.33 to 0.38 on Google and from 0.27 to 0.29 on Amazon, improvements of 0.05 and 0.02, respectively. More importantly, Uncommon also improved the F1 scores of ALBERT and RoBERTa on both Google and Amazon. ALBERT achieved the greatest F1 improvements (0.13 on Google and 0.15 on Amazon) and outperformed BERT. RoBERTa achieved a 0.15 F1 improvement and outperformed BERT on Amazon. These results indicate that representation enrichment with an uncommon attribute token list is generic and can improve the extraction performance of various pretrained models.

We next evaluated the effect of representation enrichment with the quantitative description token list (Quantitative). As shown in Table 5, Quantitative consistently improved the F1 scores of all models. In particular, ALBERT achieved an F1 improvement of 0.10 on Google, while RoBERTa achieved an F1 improvement of 0.24 on Amazon. The final F1 score of RoBERTa on Amazon was 0.44, the highest extraction score on that dataset. These results further verify that representation enrichment, in particular with the quantitative description token list, is a general method that works with various pretrained models.

### *4.3. Effectiveness of Label Propagation*

Label propagation boosts the number of training samples by retrieving similar texts from unlabeled corpora. To evaluate its effect, we retrieved the three most similar texts for each salient fact and used them as positive examples for training. Since Google and Amazon had 62 and 66 salient facts, we retrieved 186 and 198 sentences, respectively. We report the F1 scores of BERT, ALBERT, and RoBERTa in Table 6, along with the F1 improvements before and after label propagation.
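A sketch of the retrieval step is shown below. The paper does not fix a particular text encoder or similarity function here, so the sentence-embedding model and cosine similarity are our assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; any sentence encoder works

def propagate_labels(salient_facts, unlabeled_sentences, k=3):
    """For each labeled salient fact, retrieve the k most similar unlabeled
    sentences and treat them as additional positive training examples."""
    fact_vecs = encoder.encode(salient_facts, normalize_embeddings=True)
    cand_vecs = encoder.encode(unlabeled_sentences, normalize_embeddings=True)
    new_positives = set()
    for vec in fact_vecs:
        sims = cand_vecs @ vec                # cosine similarity (unit vectors)
        for idx in np.argsort(-sims)[:k]:
            new_positives.add(unlabeled_sentences[idx])
    return sorted(new_positives)

# With 62 salient facts and k=3, this yields up to 186 new positives
# (fewer if the same sentence is retrieved for multiple facts).
```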


**Table 6.** F1 scores of BERT, ALBERT (ALB.), and RoBERTa (ROB.) when using label propagation. F1 improvements compared with direct use of the pretrained models (see Table 4) are marked in orange. Best scores are marked in bold.

The pretrained models achieved better F1 scores with label propagation. The F1 improvement ranged from 0.07 to 0.17 on Google and from 0.01 to 0.09 on Amazon. RoBERTa showed the largest improvement of 0.17 on Google, where its F1 score rose from 0.19 without label propagation (see Table 4) to 0.36. On Google, BERT achieved a 0.15 F1 improvement and a record-high F1 score of 0.48. These results suggest that label propagation can boost the performance of various pretrained models.

### **5. Extension**

In this section, we extend our method to a new domain and to similar tasks that deal with imbalanced datasets in order to verify the generality of our task and method.

### *5.1. New Domain*

We defined the concept of salient facts by analyzing company reviews. We then attempted to transfer the concept to a new domain, i.e., product reviews. First, we directly deployed a trained company model on product review sentences to predict their probability of being salient. Next, we sorted all sentences by saliency score in descending order and presented the top 100 to four human annotators. We asked the annotators to label every sentence as positive or negative, indicating salient or nonsalient, respectively. We also asked them to label 100 randomly sampled sentences for comparison.
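A minimal sketch of the scoring-and-ranking step follows, assuming the trained company model is a binary sequence classifier in the style of the previous sections; the function and variable names are ours.

```python
import torch

def rank_by_saliency(model, tokenizer, sentences, top_n=100):
    """Score each product-review sentence with the company-trained model
    and return the top_n sentences by predicted saliency probability."""
    model.eval()
    scored = []
    with torch.no_grad():
        for sent in sentences:
            inputs = tokenizer(sent, return_tensors="pt",
                               truncation=True, max_length=128)
            probs = torch.softmax(model(**inputs).logits, dim=-1)
            scored.append((sent, probs[0, 1].item()))  # P(salient)
    return sorted(scored, key=lambda pair: -pair[1])[:top_n]
```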

We report the average ratio of positive examples for four headset products, i.e., Plantronics, Jawbone, Motorola, and Samsung, in Table 7. According to the results, transferring consistently increased the positive label ratio by a large margin, from 3× to 7×, for all four products. These results suggest that the definition of salient facts is general enough to be applied to the product domain. For quick demonstration, we release all sentence samples in our public codebase.

**Table 7.** Ratio of sentences that human annotators considered salient before and after transferring the trained company model to product reviews.


### *5.2. Similar Public Task*

We extended the label propagation algorithm to similar tasks, since the algorithm was designed to be general. We conducted experiments to compare our method with state-of-the-art baselines on public tasks concerning minority comment extraction. We obtained four public datasets that contain binary labels for training extraction models. SUGG [8] comes from SemEval-2019 Task 9; a positive example contains customer suggestions for software improvement. HOTEL [13] was derived from the hotel domain; a positive example carries customer-to-customer suggestions for accommodation. SENT [14] contains sentence-level examples; a positive label means the sentence contains tips for PHP API design. PARA [14] comes from the same source as SENT but contains paragraph-level examples. The ratios of positive examples for SUGG, HOTEL, SENT, and PARA were 26%, 5%, 10%, and 17%, respectively. All four datasets were split into a training set and a test set at a 4:1 ratio.

We adopted UDA [10] as a strong baseline method. UDA uses BERT as its base model and augments every example in the training set using back translation from English to French and then back to English. The example and its back translation are fed into model training so as to minimize their KL divergence, projecting the two examples onto close vector representations. We ran UDA and BERT on the full training set, and our method on only 2000 training examples. Our F1 scores and those of BERT and UDA are shown in Table 8. The average F1 scores of BERT, UDA, and our method were 0.6687, 0.6980, and 0.6961, respectively. BERT performed the worst because it does not use any data augmentation, so it suffers the most from label imbalance. UDA and our method performed similarly across all the datasets, yet UDA used the full training examples while ours used only 23.52%, 33.33%, 21.97%, and 38.46% of the examples on SUGG, HOTEL, SENT, and PARA, respectively. UDA favors mild data augmentation because of its use of KL divergence, and back translation mostly changes only one or two word tokens in an example. However, this mild design choice is too conservative to efficiently augment minority examples in imbalanced datasets (thus requiring a higher volume of augmented data). Therefore, a more aggressive design choice such as ours, which can return entirely new sentences as augmented examples, is needed given the widespread existence of imbalanced datasets.
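For reference, the consistency term at the heart of UDA can be sketched as follows. This is a simplified illustration of the KL objective described above; the full UDA recipe includes additional details such as confidence masking and training-signal annealing, which are omitted here.

```python
import torch
import torch.nn.functional as F

def uda_consistency_loss(logits_original, logits_backtranslated):
    """KL-divergence consistency term: push the prediction on the
    back-translated example toward the prediction on the original."""
    target = F.softmax(logits_original.detach(), dim=-1)   # fixed target distribution
    log_pred = F.log_softmax(logits_backtranslated, dim=-1)
    return F.kl_div(log_pred, target, reduction="batchmean")

# Total loss = supervised cross-entropy on labeled examples
#            + lambda * uda_consistency_loss on (original, augmented) pairs.
```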

**Table 8.** F1 scores on four public tasks for minority comment extraction. All baselines used the full training examples; our method used only 2000 yet matched the performance of the baselines.


### *5.3. Statistical Significance*

We conducted experiments to evaluate the statistical significance, or randomness, of our results. Specifically, we used different random seeds to run BERT, UDA, and our method on SUGG, HOTEL, SENT, and PARA. The numbers of training examples for SUGG, HOTEL, SENT, and PARA were 8500, 6000, 9100, and 5200, respectively. For every dataset, we fed the full training examples to BERT and UDA, but only 2000 to our method. We repeated the same experiment three times and report the F1 scores. Statistical analysis was performed using GraphPad Prism 7, and statistical significance was determined using one-way ANOVA followed by Tukey's multiple-comparison test. We calculated the mean, SD, and *p* value with Student's *t*-test. Significance: not significant (n.s.) *p* > 0.05, \* *p* < 0.05, \*\* *p* < 0.01, \*\*\* *p* < 0.001.
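The same analysis can be reproduced outside GraphPad Prism. The sketch below runs a one-way ANOVA followed by Tukey's multiple-comparison test with SciPy and statsmodels; the F1 scores are placeholders, not the paper's actual numbers.

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder F1 scores from three random seeds (not the paper's results).
bert = [0.61, 0.63, 0.62]
uda = [0.69, 0.70, 0.68]
ours = [0.70, 0.69, 0.70]

print(f_oneway(bert, uda, ours))  # one-way ANOVA across the three methods

scores = np.concatenate([bert, uda, ours])
groups = ["BERT"] * 3 + ["UDA"] * 3 + ["ours"] * 3
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))  # Tukey's multiple comparison
```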

Comparison results of BERT, UDA, and ours (2000) on SUGG, HOTEL, SENT, and PARA are shown in Figure 2. When comparing BERT with ours (2000), BERT showed no significant difference on SUGG and SENT, and worse performance on HOTEL and PARA. These results suggest that ours (2000) can outperform BERT even with fewer training examples. When comparing UDA with ours (2000), the two methods showed no significant difference on SENT and PARA; on HOTEL, UDA performed better, but on SUGG it performed worse. These results suggest that ours (2000) can achieve performance equal to that of UDA with far fewer training examples.

**Figure 2.** Comparison of BERT and UDA with our method. BERT and UDA were trained with the full training examples, while our method was trained with only 2000 examples. The training datasets were SUGG, HOTEL, SENT, and PARA. Data are presented as *mean* ± *SD*. Significance: not significant (n.s.) *p* > 0.05, \* *p* < 0.05.

### **6. Related Work**

Informative reviews: extracting informative reviews drives broad applications in web mining, while the definition of informativeness varies across application domains. TipRank [12] extracts short and practical sentences from TripAdvisor reviews to prepare travellers for upcoming trips. AR-Miner [15] and DeepTip [14] highlight useful comments in software reviews to notify developers of potential artifact issues. AMPERE [16] extracts argumentative sentences from paper reviews to help authors improve their manuscripts. In addition to the above research, many works target other domains such as products [17–19], restaurants [20–22], and hotels [13,23,24]. These works share the goal of discovering helpful reviews to save readers' time. Unlike existing works, our paper targets the company domain, where understanding a company relies heavily on knowledge of uncommon attributes and quantitative information, as indicated by expert-written reviews. Therefore, our definition of salient facts serves as another dimension along which to analyze massive review collections, and our work complements existing efforts towards mining the most useful information from reviews.

Supervised learning: existing works mostly adopt supervised learning when developing automatic extractors because supervised models can automatically learn to differentiate positive and negative instances from human labels. There are three popular categories of supervised models, depending on the input sequence representation. Word occurrence models [12,13,15], such as logistic regression [25] and support vector machine [26], represent a text as a bag of words and thus suffer from a limited vocabulary when training data are scarce. Word vector models [8,13,14,17,27,28], such as convolutional neural networks [29] and long short-term memory [30], represent a text as a matrix of word embeddings and can thereby process unseen words through their embeddings. Recently, pretrained models [8,9], such as BERT [5], ALBERT [6], and RoBERTa [7], have emerged; they represent a text as a high-dimensional vector by aggregating word embeddings. Owing to the high dimensionality (e.g., 768 in BERT) and large-scale parameters (e.g., 110M in BERT) used for aggregation, pretrained models appear to be the most promising solutions for extraction. In fact, among all the models, pretrained models achieved the best F1 scores and are thus the base models for our work.

Label scarcity: the problem of salient fact extraction falls into the broad category of text classification. However, the unique challenge here is label sparsity. The ratios of salient facts in raw reviews are extremely low (<10%) because uncommon attributes and quantitative descriptions require solid domain-specific knowledge from crowd reviewers. As a result, collecting a large number of salient facts for model training is very difficult. We thus propose a label propagation method to expand the existing salient facts, with two benefits. First, the method expands the input tokens of an input sentence, instructing pretrained models about whether the input carries uncommon attributes or quantitative descriptions. Second, the method fetches more salient fact instances from the ample unlabeled corpus, enabling pretrained models to see more salient facts. The label propagation method was specifically designed to suit the nature of uncommon attributes and quantitative information, and is thus complementary to existing techniques such as data augmentation [31–33] and active learning [34–36]. A combination with existing techniques could further improve extraction quality; however, adapting them here is nontrivial due to the increased algorithmic complexity. Incorporating existing techniques is therefore a fascinating future direction for this work.
