**7. Extraction**

In this section, we present the extracted salient facts for qualitative analysis, using BERT as the representative pretrained model. We also present extractions produced by existing solutions.

### *7.1. Extraction Comparison*

We present salient facts extracted from reviews about Google in Table 9, together with facts extracted by the baseline algorithms TextRank, K-means, Longest, and Random. TextRank [37] formulates sentences and their similarity relations into a graph and extracts the sentences with the highest PageRank weights. K-means clusters sentences around a number of centroids and extracts the centroid sentences. Longest chooses the longest sentence in the corpus. Random selects sentences from the corpus uniformly at random. Together, these algorithms cover the main families of existing solutions for mining informative texts from a large corpus.
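As a rough illustration of the TextRank baseline described above, the sketch below scores sentences by running PageRank over a sentence-similarity graph. The word-overlap (Jaccard) similarity, damping factor, and toy corpus are our own illustrative assumptions, not the exact setup of [37].

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (illustrative choice)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def textrank(sentences, damping=0.85, iters=50):
    """Score sentences via PageRank on their similarity graph."""
    n = len(sentences)
    # Pairwise similarities; no self-loops.
    sim = [[jaccard(sentences[i], sentences[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    # Row sums turn similarities into transition weights.
    totals = [sum(row) or 1.0 for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * sim[j][i] / totals[j]
                                  for j in range(n))
                  for i in range(n)]
    return scores

# Hypothetical mini-corpus for demonstration only.
reviews = [
    "Free meals are provided by the company every day.",
    "The company provides free meals and snacks daily.",
    "I like my coworkers.",
]
scores = textrank(reviews)
best = reviews[max(range(len(reviews)), key=scores.__getitem__)]
```

Sentences that overlap with many others accumulate weight, so `best` here is one of the two mutually similar sentences about free meals, mirroring how TextRank surfaces centrally connected (but often generic) content.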

**Table 9.** Extractions of various methods on Google dataset with attributes and descriptions marked in red and blue, respectively. Our extractions revealed finer-grained attributes (see red) and distilled numeric knowledge (see blue).


Finer-grained attribute discovery: the extraction examples show that our method extracted salient facts containing finer-grained attributes than those extracted by the baseline methods. Representative attributes include laundry room, gyms, cars, and museum tickets. These attributes describe concrete properties of the company and are less common in the company domain. In contrast, extractions by the existing solutions tend to contain common attributes such as food, problem, work, salary, or people, which are popular and general topics about companies. The extractions by Longest did not reveal company attributes, since the method retrieves long yet fake reviews that are copies of external literature. These results suggest that salient facts are informative because they present specific or unique attributes of a company to readers.

Numeric knowledge distillation: our extractions distill more numeric knowledge than those of the existing solutions. Representative knowledge includes 90% paid health insurance, 12 weeks of paid parental leave, and free meals provided by the company. Such knowledge is objective, since it quantitatively describes attributes. In contrast, extractions from existing solutions mostly use subjective descriptions such as "lots of", "great", and "awesome", which are biased toward the reviewers' opinions. These results suggest that salient facts can provide unbiased and reliable descriptions to readers.

### *7.2. Expert Comment Recognition*

Online comments are written by different people. Some writers with better knowledge about entities tend to give comments that are more informative. We refer to such writers as experts, and their comments as expert comments. In order to show the most informative comments to readers, a salient fact extractor should rank expert comments higher than other comments.

To understand whether our trained model could rank expert comments higher, we curated a collection of comments from online company reviews and FutureFuel. Online comments are those that we labeled as nonsalient (some representatives are in Table 3) and were thereby treated as nonexpert reviews. FutureFuel comments are those that came from invited writers and were thereby treated as expert reviews. We then sorted the collection of nonexpert and expert comments by prediction scores in descending order. A higher prediction score indicated a higher probability of being an expert comment.

Ranking results for the Google and Amazon datasets are shown in Table 10. In the optimal case, all comments in the top-*k* list are expert comments. We report the number of expert comments found by our model and by a baseline that randomly shuffles all comments. As Table 10 shows, our model consistently achieved better results than the baseline on both the Google and the Amazon datasets. In the top-4 lists, all comments returned by our model were expert comments; in the top-10 lists, 9 comments were expert comments for both Google and Amazon. These results indicate that our model could identify expert comments with nearly 100% accuracy. In a collection of comments written by different people, our model effectively recognized those written by experts, and this could ensure that readers are shown the most informative content.
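The evaluation above can be sketched as a simple ranking check: sort a mixed pool of comments by the model's prediction score and count how many of the top *k* are expert comments. The scores and labels below are made-up stand-ins for illustration, not the actual model outputs.

```python
def experts_in_top_k(scored_comments, k):
    """scored_comments: list of (prediction_score, is_expert) pairs.

    Returns the number of expert comments among the k highest-scored ones.
    """
    ranked = sorted(scored_comments, key=lambda c: c[0], reverse=True)
    return sum(is_expert for _, is_expert in ranked[:k])

# Hypothetical pool: an ideal ranker pushes experts (True) to the top.
comments = [(0.97, True), (0.91, True), (0.88, True), (0.85, True),
            (0.80, False), (0.74, True), (0.31, False), (0.12, False)]

top4_hits = experts_in_top_k(comments, k=4)   # optimal case: all 4 are experts
```

With these toy scores, all four top-ranked comments are expert comments, matching the optimal-case behavior reported for the top-4 lists.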


**Table 10.** Number of expert comments in top list after sorting all comments by prediction scores. Baseline randomly shuffles all comments.

### **8. Conclusions**

In this paper, we proposed the task of extracting salient facts from online company reviews. In contrast to reviews written by experts, only a few online reviews contain useful and salient information about a particular company, which creates a situation where a solution can only rely on highly skewed and scarce training data. To address the data scarcity issue, we developed two data enrichment methods, (1) representation enrichment and (2) label propagation, to boost the performance of supervised learning models. Experimental results showed that our data enrichment methods could successfully help in training a high-quality salient fact extraction model with fewer human annotations.

**Author Contributions:** Conceptualization, J.L., N.B., A.W., C.-Y.H., E.H. and Y.S.; methodology, J.L., N.B., A.W., E.H. and Y.S.; software, J.L., C.-Y.H.; validation, J.L., N.B., E.H. and Y.S.; formal analysis, J.L., N.B., E.H. and Y.S.; investigation, J.L., N.B., E.H. and Y.S.; resources, Y.S.; data curation, J.L., N.B. and Y.S.; writing—original draft preparation, J.L., N.B. and Y.S.; writing—review and editing, J.L., N.B., C.-Y.H., E.H. and Y.S.; visualization, J.L., N.B.; supervision, E.H. and Y.S.; project administration, J.L. and E.H. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** https://github.com/rit-git/tagging/tree/master/data.

**Conflicts of Interest:** The authors declare no conflict of interest.
