**1. Introduction**

Online reviews are an essential source of information: more than 80% of people read online reviews before making decisions [1]. This trend also applies to job seekers. Before applying to open positions, job seekers often read employee reviews about the hiring experience and work environment on Indeed, LinkedIn, and other channels. However, the sheer volume of reviews can make them cumbersome to read; for example, Indeed hosts 63,400 reviews about Amazon alone. Furthermore, job seekers must sift through many subjective comments in the reviews to find concrete information about a company of interest.

Alternatively, job seekers can find such concrete information (e.g., Table 1) in expert articles about companies on websites such as Business Insider [2,3] and FutureFuel [4]. However, such expert articles are typically written only for very popular companies and do not cover the vast majority of companies. Online company reviews, on the other hand, are available for a vast number of companies, as current and former employees submit reviews to platforms such as Glassdoor. Therefore, we aim to automatically extract unique and distinctive information from online reviews.

**Citation:** Li, J.; Bhutani, N.; Whedon, A.; Huang, C.-Y.; Hruschka, E.; Suhara, Y. Extracting Salient Facts from Company Reviews with Scarce Labels. *CSFM* **2022**, *3*, 9. https://doi.org/10.3390/cmsf2022003009

Academic Editors: Kuan-Chuan Peng and Ziyan Wu

Published: 29 April 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

We refer to informative descriptions in online reviews as salient facts. To derive a formal definition of salient facts, we conducted an in-house study in which we asked three editors to inspect 43,000 reviews about Google, Amazon, Facebook, and Apple. The editors discussed salient and nonsalient sentences in the reviews and concluded that a salient fact mentions an uncommon attribute of a company and/or describes quantitative information about an attribute. Attributes of a company include employee perks, onsite services and amenities, the company culture, and the work environment. We further validated our definition against expert articles and confirmed that they were extensively composed of the same properties. For example, 4 of the 8 benefits mentioned in an article [2] about Google involved less-known attributes such as food variety, fitness facilities, and pet policy; the other 4 used numeric values, such as a 50% retirement pension match.

**Table 1.** Sample sentences from an online review and expert article about Google.


In this paper, we propose the novel task of salient fact extraction and formulate it as a text classification problem. With this formulation, we can automate the filtering of company reviews for salient information about a company. Pretrained models [5–7] are a natural choice for such tasks [8,9] since they generalize better when the training data for the task are extremely small. We, therefore, adopted BERT [5] for our extraction task. However, generating even a small amount of task-specific balanced training data is challenging for salient fact extraction because salient sentences are scarce in the reviews. Naively labeling more sentences to address the scarcity can be prohibitively expensive. As a result, even pretrained models that perform robustly in few-shot learning cannot achieve sufficient performance when used directly for this task.

In this work, we propose two data enrichment methods, representation enrichment and label propagation, to address the scarcity of salient facts in training data. Our representation enrichment method is based on the assumption that salient sentences tend to mention uncommon attributes and numerical values. We can, therefore, enrich training data using automatically identified uncommon attributes and numeric descriptions from review corpora. Specifically, we append special tags to sentences that mention uncommon attributes and numerical values to provide additional signals to the model. Our label propagation method is based on the idea that we can use a small set of seed salient sentences to fetch similar sentences from unlabeled reviews that are likely to be salient. This can help in improving the representation of salient sentences in the training data. Our methods are applicable to a wide variety of pretrained models [5–7].

We conducted extensive experiments to benchmark the extraction performance and demonstrate the effectiveness of our proposed methods. Our methods improved the F1 scores of pretrained models by up to 0.24 on salient fact extraction, which is 2.2 times higher than the original F1 scores. This is because our models could identify more uncommon attributes and more quantitative descriptions than directly using pretrained language models can. Our models could also better distinguish between expert-written and employee-written reviews.

To summarize, our contributions are the following: (1) We introduce a new review mining task called salient fact extraction and address it with pretrained language models and data augmentation in an end-to-end manner. The task faces an extremely low ratio (i.e., <10%) of salient facts in raw reviews. (2) The best-performing methods still require massive labels in the tens of thousands, because trained models and augmented examples tend to be biased towards majority examples. To alleviate this problem, we applied a series of improvements to ensure that model training and data augmentation work effectively for the desired minority examples. (3) An extension of our method demonstrates that it generalizes well and can reduce the labeling cost for adaptation to new domains (e.g., transferring to a product domain improves the label ratio from 5% to 43%) and for similar tasks that deal with minority review comment extraction (e.g., suggestion mining requires 75% fewer labels to match the performance of UDA semisupervised learning [10]). To facilitate future research, we have released our implementation and experimental scripts (https://github.com/megagonlabs/factmine, accessed on 22 August 2020). We did not release the company dataset due to copyright issues; however, we aim to release datasets for similar tasks to benchmark the performance of different methods. We also released a command-line interface that renders our results readily reproducible.

### **2. Characterization of Salient Facts**

The cornerstone of automatic extraction is understanding what renders a review (or a sentence in a review) salient. To this end, we first inspected raw online reviews to derive a definition of salient facts. We then analyzed expert articles to verify that the derived definition is valid (Figure 1).

**Figure 1.** Constitution of false instances.

### *2.1. Review Corpus Annotation and Analysis*

We produced in-house annotations to understand which review sentences human readers deem salient facts. We collected 43,000 company reviews about Google, Amazon, Facebook, and Apple. We split each review into sentences using NLTK [11]. Then, we inspected all the sentences and selected salient sentences according to our understanding of the corresponding companies. Table 2 shows example sentences that were labeled salient.
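The preprocessing step is simple sentence segmentation. The paper uses NLTK's `sent_tokenize`; the sketch below uses an equivalent regex-based splitter so it runs without downloading NLTK's punkt model, and the sample review is hypothetical.

```python
import re

def split_sentences(review):
    """Split a review into sentences on end-of-sentence punctuation.
    A lightweight stand-in for nltk.sent_tokenize, which the paper
    actually uses."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [p for p in parts if p]

review = "Great food. Dogs allowed in all the buildings! Free gym access."
print(split_sentences(review))
# Each sentence then becomes one candidate instance for salience labeling.
```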

Sentences labeled salient described more uncommon attributes than nonsalient sentences did. Uncommon attributes include real-world objects and services such as cafes, kitchens, and dog parks. They are typically not provided by all companies and can help job seekers differentiate between companies. Furthermore, salient sentences use quantitative descriptions (e.g., 25+ and 100 ft in Table 2). Quantities often represent objective information and vary across companies, even for the same attribute, thereby helping job seekers differentiate between companies.

These properties are not exhibited by nonsalient sentences. As shown in Table 3, most nonsalient sentences mention solely common attributes (e.g., place, salary and people), disclose purely personal sentiments (e.g., awesome, great, cool), or are noisy (e.g., invalid or incomplete texts). Different kinds of nonsalient sentences and their ratios are shown in Figure 1.

**Table 2.** Sample salient facts extracted from online reviews.

**Example 1.** Google also has 25+ cafes and microkitchens every 100 ft. (Google)

**Example 2.** Dogs allowed in all the buildings I've been to (including some dog parks in the buildings!) (Amazon)

**Table 3.** Example non-salient sentences and reasons.


### *2.2. Expert Article Analysis*

We analyzed expert-written reviews to investigate whether they exhibited the characteristics of salient facts, i.e., describing an uncommon attribute and/or using quantitative descriptions. First, we compared the frequencies of a set of attribute words across expert sentences and review sentences. The expert sentences used attribute words that were infrequent in the review sentences. For example, the frequencies of *death* and *family* (commonly mentioned in expert reviews for Google) in review sentences were 0.01% and 0.15%, respectively. In contrast, the frequencies of *place* and *pay* (commonly mentioned in review sentences for Google) were 3.44% and 1.28%, respectively. This observation supports our definition.
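The frequency comparison above amounts to counting, per corpus, the fraction of sentences that mention a given attribute word. A minimal sketch follows; the two toy corpora and the resulting numbers are illustrative and do not reproduce the paper's data.

```python
# Toy corpora (hypothetical sentences, for illustration only).
expert = [
    "death benefit pays 50% of salary",
    "family leave is 18 weeks",
]
reviews = [
    "great place to work",
    "the pay is good",
    "nice place and people",
]

def word_freq(sentences, word):
    """Fraction of sentences that mention the given word."""
    hits = sum(word in s.lower().split() for s in sentences)
    return hits / len(sentences)

# Words common in expert articles are rare in reviews, and vice versa.
for w in ["death", "family", "place", "pay"]:
    print(w, word_freq(expert, w), word_freq(reviews, w))
```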

Next, we inspected whether the expert sentences used more quantitative descriptions than randomly selected review sentences did. For example, 4 of the 7 expert sentences describing the benefits of Google used quantitative descriptions such as 10 years, USD 1000 per month, 18–22 weeks, and a 50% match. On the other hand, none of the 7 sentences randomly sampled from reviews mentioned any quantities. In fact, most of them used subjective descriptions such as nice, interesting, and great. This observation supports our characterization of salient facts.
