Article

Unsupervised Numerical Information Extraction via Exploiting Syntactic Structures

State Key Lab of Software Development Environment, Beihang University, Beijing 100191, China
*
Author to whom correspondence should be addressed.
Electronics 2023, 12(9), 1977; https://doi.org/10.3390/electronics12091977
Submission received: 20 March 2023 / Revised: 5 April 2023 / Accepted: 18 April 2023 / Published: 24 April 2023

Abstract

Numerical information plays an important role in various fields, such as the scientific, financial, social, statistical, and news domains. Most prior studies adopt unsupervised methods, designing complex handcrafted pattern-matching rules to extract numerical information, which can be difficult to scale to the open domain. Supervised methods, on the other hand, require extra time, cost, and knowledge to design, understand, and annotate the training data. To address these limitations, we propose QuantityIE, a novel approach to extracting numerical information as structured representations by exploiting syntactic features of both constituency parsing (CP) and dependency parsing (DP). The extraction results may also serve as distant supervision for zero-shot model training. Our approach outperforms existing methods from two perspectives: (1) the rules are simple yet effective, and (2) the results are more self-contained. We further propose a numerical information retrieval approach based on QuantityIE to answer analytical queries. Experimental results on information extraction and retrieval demonstrate the effectiveness of QuantityIE in extracting numerical information with high fidelity.

1. Introduction

Numerical information plays an important part in domains such as science, finance, business, medicine, and news. Numbers are more than just values and units; the relevant contextual information is also valuable. A deep understanding of numerical information is not only necessary for statistical or analytical purposes but has also proven valuable for many other tasks such as machine reading comprehension [1,2,3], information retrieval [4,5,6,7,8] and semantic parsing [9,10]. An example is DROP [11], a challenging English reading comprehension dataset whose questions require discrete reasoning (e.g., addition, sorting, or counting). Many of these reasoning types require a deep understanding of numerical information.
Numerical information extraction is not well studied in NLP; prior studies fall into three categories: open information extraction (OIE)-based, syntactic parsing (SP)-based, and semantic role labeling (SRL)-based approaches. OIE-based methods output relation triples (argument 1, relation, argument 2), where one of the arguments is a number or value-unit phrase. For example, Saha et al. [12] propose BONIE, which uses bootstrapping to learn dependency patterns for extracting numerical relation triples. However, not all quantities are expressed in the form of a “relation”. For example, BONIE fails to extract any numerical information from “What are the top 100 universities in the US?”, where we should know that “100” refers to “universities in the US”. Other attempts at OIE include designing numerical relation extractors with minimal supervision [13] and using bootstrapping to learn specific dependency patterns that express numerical relations in a sentence [12]. Recent open information extraction methods built on transformer-based [14] pre-trained language models [15,16] have achieved new state-of-the-art results [17,18,19,20,21]. Wang et al. [23] propose a triple relation extraction method, which is also related to our work.
SP-based methods extract numerical information from a syntactic structure such as a dependency parsing (DP) tree or constituency parsing (CP) tree. Pure DP-based methods [22] rely on dependency pattern matching, which is complex and might fail to handle numerical information with long-range dependencies. For example, it can be difficult to summarize a dependency pattern between “plants” and “four” from the sentence “These plants originated from the Amazon jungle and can grow up to four feet high”. Wang et al. [23], Alonso and Sellam [24], Ravichander et al. [25], and Sellam and Alonso [26] extract small pieces of text containing quantities from a CP tree. We argue that small pieces of text are not sufficient to fully understand the quantities; more details such as the unit, quantity value, date, and surrounding context are needed. Our approach also belongs to the SP-based category, but with more fine-grained handling of different semantic roles.
For the SRL-based approaches [27,28,29,30,31], numerical information is treated as different types of semantic roles [32] and can be extracted with an SRL model trained on data that is manually labeled [33] or collected using distant supervision [34]. SRL is effective for machine-learning-based methods if the roles are well defined, but incurs significant time and effort costs, as the annotation process requires extensive background knowledge.
To address the above-mentioned limitations, we propose QuantityIE, a numerical information extraction approach exploiting syntactic features of both CP and DP. Our approach outperforms existing methods in terms of:
  • Simple yet effective rules. The rules we design are based on the principle of finding the closest phrase under selected types (e.g., noun phrase or verb phrase) to a numerical value. The phrase types provided by CP can be easily summarized due to their simplicity. An example is shown in Figure 1. DP-based methods rely on dependency patterns, which require combinations of dependency types from one word to another. The number of dependency types is rather large, not to mention the number of potential combinations. According to Hundman and Mattmann [22], varied and often misused language results in incorrect dependency parsing, which eventually leads to incorrect extraction, making the dependency patterns even more difficult to generalize.
  • Self-contained results. Clauses or phrases are more likely to be extraction results than single words. CP breaks text into sub-phrases, whose spans are more likely to contain all the information we need. DP aims at finding dependency relationships between individual words, making it difficult to combine all relevant words into a whole. For example, as shown in Figure 1, “Express supply” extracted by QuantityIE is complete while “supply” is not.
Extracting precise information without introducing noise (irrelevant information) can be challenging. We adopt an “extract and filter” pipeline: we identify the possible candidates with CP and filter out irrelevant candidates with DP. Combining CP and DP keeps the “accurate” results and drops the mismatched ones. CP parses text into sub-phrases rather than only words, providing a wide range of phrase and word candidates for matching; DP focuses on the dependency relationships between words or phrases (sub-trees of a dependency tree). The relationship between a number and a candidate result is a key feature for determining whether the candidate is actually relevant to the number. We further design a numerical information retrieval approach based on QuantityIE to answer complex queries or queries with quantities by extracting and retrieving structured data from natural language text.
To conclude, our major contributions can be summarized as:
  • We propose a novel approach, QuantityIE, that combines CP and DP to extract numerical information. The extraction results can also serve as distant supervision for zero-shot model training.
  • The approach we propose is simple yet effective with no need to design complex dependency matching rules.
  • We build a numerical information retrieval approach based on QuantityIE to support analytical queries for search engines.
We evaluate QuantityIE on two tasks: (1) numerical information extraction and (2) numerical information retrieval. The dataset for the extraction task is manually constructed from DROP and BONIE, with the data labeled with the semantic roles we define. The retrieval task is based on retrieving COVID-19 data from news reports; we compare the extracted and retrieved COVID-19 data from the news with the actual data from a COVID-19 tracker. Experimental results show that QuantityIE is able to extract numerical information with high fidelity.

2. Methodology

In this section, we describe the design of QuantityIE in detail and introduce a numerical information retrieval approach based on QuantityIE.

2.1. Definitions

We introduce the following definitions first:
  • Quantity Mention A phrase or token which only contains a numeric value and an optional unit. For example, “four feet” in the sentence “The plant is four feet high”.
  • Quantified Candidate A phrase or token which is quantified by or contextually relevant to a quantity mention. For example, “The plant” and “high” in the sentence “The plant is four feet high”.
  • Quantity Fact A tuple with six different roles:
F = (agent, relation, value, unit, quantity, time)
where:
  • agent is the context that plays a direct role in influencing or motivating a quantity mention.
  • relation is the context containing predicates and prepositions. Predicates describe the influence and motivation, while prepositions describe the manner (relative or absolute) taken on a quantity mention. For example, both “increased by” and “increased to” express the influence “increased” but represent different manners: “by” represents a relative manner while “to” represents an absolute manner.
  • value is the numeric value of a quantity mention.
  • unit is the unit of a quantity mention, which is only available for percentage, currency, dimension, and temperature types of quantity mentions.
  • quantity is the context measured by or obliquely related to the quantity mention.
  • time is the time when the quantity mention took place.
To better understand the definition of a quantity fact, Figure 2 presents an example with labeled roles:
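For concreteness, a quantity fact maps naturally onto a simple record type. The following is a minimal Python sketch (illustrative only, not the authors' implementation); the field comments restate the definitions above, and the example instance follows the running sentence used in Figure 1.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class QuantityFact:
        """A quantity fact F = (agent, relation, value, unit, quantity, time)."""
        agent: Optional[str]     # context influencing/motivating the quantity mention
        relation: Optional[str]  # predicate plus preposition, e.g., "increased by"
        value: float             # normalized numeric value of the quantity mention
        unit: Optional[str]      # unit, if any (percentage, currency, dimension, temperature)
        quantity: Optional[str]  # context measured by the quantity mention
        time: Optional[str]      # time the quantity mention took place

    # "Express supply increased by 57 tonnes this year to 457 tonnes."
    fact = QuantityFact(agent="Express supply", relation="increased by",
                        value=57.0, unit=None, quantity="tonnes", time="this year")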

2.2. QuantityIE Overview

The input of QuantityIE is a sentence with at least one quantity mention, while the output is a set of quantity facts extracted from that sentence. QuantityIE is a pipeline of two stages (sketched in code after the list):
  • Extract: Recognize quantity mentions and time first, then extract the quantified candidates and relations matching each quantity mention by exploiting the constituency parse structures.
  • Filter and Recombine: Filter out quantified candidates that are less relevant or irrelevant to a quantity mention by employing dependency parsing, then recombine all remaining items into quantity facts.
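A minimal Python skeleton of this two-stage pipeline is sketched below; the five stage functions are hypothetical placeholders for the components detailed in Sections 2.3 and 2.4.

    from typing import Callable, List

    def quantity_ie(sentence: str, parse: Callable, recognize: Callable,
                    extract: Callable, relevant: Callable,
                    recombine: Callable) -> List["QuantityFact"]:
        """Two-stage pipeline: extract with CP, then filter and recombine with DP."""
        cp_tree, dp_tree = parse(sentence)                      # joint CP + DP parsing
        mentions, times = recognize(cp_tree)                    # quantity mentions and time
        candidates, relations = extract(cp_tree, mentions)      # Stage I: extract
        kept = [c for c in candidates if relevant(dp_tree, c)]  # Stage II: filter
        return recombine(mentions, times, relations, kept)      # Stage II: recombine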

2.3. Stage I: Extract

For the extraction stage, the input is a sentence, and the outputs are the time, quantity mentions, quantified candidates, and relations. We first build a constituency parse tree from the input sentence, then recognize quantity mentions and time from all nodes. Finally, we extract the quantified candidates and relations corresponding to each quantity mention. We explain the details of this stage with Figure 1.

2.3.1. Constituency Parsing

We employ the “Head-Driven Phrase Structure Grammar Parsing on Penn Treebank” model [35], one of the state-of-the-art syntactic parsers, to build a constituency parse tree for the input sentence. Additionally, this model is able to construct a CP and a DP tree jointly (which is required in Stage II). With CP, a sentence is broken into multi-level phrases or tokens recursively. A CP tree is shown in Figure 1. In a CP tree, non-terminals (non-leaf nodes) are phrases, while terminals (leaf nodes) are words. The headword of a phrase determines the constituent type of the phrase. For example, given the “VP” (verb phrase) node “increased by 57 tonnes this year to 457 tonnes” (non-terminal), its headword is the leaf child node “increased” (terminal) labeled as a verb, making the parent node a VP.
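As an illustration (the paper uses the HPSG joint parser [35]; this sketch substitutes the benepar library purely for demonstration), a constituency parse tree for the running sentence can be obtained as follows:

    import benepar, spacy

    # assumes the benepar model has been downloaded: benepar.download("benepar_en3")
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("benepar", config={"model": "benepar_en3"})

    doc = nlp("Express supply increased by 57 tonnes this year to 457 tonnes.")
    sent = list(doc.sents)[0]
    print(sent._.parse_string)          # bracketed CP tree, e.g., (S (NP ...) (VP ...))
    for phrase in sent._.constituents:  # iterate phrases (non-terminals)
        if "NP" in phrase._.labels or "VP" in phrase._.labels:
            print(phrase._.labels, phrase.text)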

2.3.2. Recognize Quantity Mentions and Time

Quantity mention and time can be difficult to recognize. The same number can be written in different formats, leading to identification and normalization issues. For example, “0.25” can also be written as “zero point twenty five”, “1 in 4”, “1/4” or “25%”. Some numbers include units, such as currency, dimension, and temperature, which require extra effort to recognize. For a time expression, it can be more than just a single “date” (e.g., “24 September 2020”), but also “time” (e.g., “9:45”), “date and time” (e.g., “2017-10-04 20:00:00”), “date range” (e.g., “between 2015 and 2019”) and “time range” (e.g., “from 3:00 to 5:00”). In the same way, time normalization is also necessary but difficult.
Named-entity recognition and part-of-speech tagging techniques are incapable of recognizing time and quantity mentions with complex and varied formats. Similar to MARVE [22], which employs Grobid Quantities to identify measurement units and values, we adopt Microsoft Recognizers Text, a library that provides robust recognition and normalization of entities such as numbers, units, and date/time, to recognize the time and quantity mentions in a sentence. We mark a node of the constituency parse tree containing a recognized time or quantity mention as a “time node” or “quantity node”. For example, in Figure 1, the node “this year” in level 3 is a time node while the nodes “57 tonnes” and “457 tonnes” in level 4 are quantity nodes.
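For example, the Python port of Microsoft Recognizers Text (the recognizers-text-suite package) can recognize and normalize the numbers and the time expression in the running sentence; the exact resolution payloads shown in the comments are indicative only.

    from recognizers_text import Culture
    from recognizers_number import recognize_number
    from recognizers_date_time import recognize_datetime

    sentence = "Express supply increased by 57 tonnes this year to 457 tonnes."
    for m in recognize_number(sentence, Culture.English):
        print(m.text, m.resolution)   # e.g., "57" -> {"value": "57"}
    for t in recognize_datetime(sentence, Culture.English):
        print(t.text, t.resolution)   # e.g., "this year" -> a normalized date range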
We further match time to a quantity mention by matching the first closest sibling or cousin time node to the corresponding quantity node, from the bottom level to the top level of the constituency parse tree. Based on this rule, the sibling time node “this year” will be matched to the quantity node “by 57 tonnes” in level 3, making “this year” the time of the quantity mention “57 tonnes”. This approach is motivated by an observation about text containing numbers: most of the numerical information is “close” to the number itself. Therefore, we leverage the distance feature of a CP tree by finding nodes “close” to a quantity node as target candidates.

2.3.3. Extract Quantified Candidates and Relations

With the constituency parse tree built and nodes thereof marked as “quantity node” or “time node”, we are able to extract quantified candidates and relations for each quantity node in a sentence.
For each quantity mention, we start by finding the corresponding quantity nodes in the constituency parse tree. For example, the corresponding quantity nodes of quantity “57” are node “57” in level 5, node “57 tonnes” in level 4, node “by 57 tonnes” in level 3, node “increased by 57 tonnes this year to 457 tonnes” in level 2 and node “Express supply increased by 57 tonnes this year to 457 tonnes” in level 1. We define three types of quantity nodes (a classification sketch follows the list):
  • Terminal single value (TSV): a terminal quantity node that only contains one quantity mention. For example, node “57” in level 5.
  • Non-terminal single value (NTSV): a non-terminal quantity node that only contains one quantity mention, parent of a TSV node. For example, node “57 tonnes” in level 4.
  • Non-terminal multi-value (NTMV): a quantity node which contains more than one quantity mention, parent or ancestor of an NTSV node. For example, node “by 57 tonnes” in level 3 and the rest of the quantity nodes containing “57” below level 3.
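A sketch of the node-type classification, following the examples above (ancestors of an NTSV node count as NTMV); the node object with is_leaf and text attributes is a hypothetical CP-tree interface:

    def node_type(qnode, mention_text: str) -> str:
        """Classify a quantity node relative to its quantity mention (sketch)."""
        if qnode.is_leaf:
            return "TSV"                # e.g., the leaf "57"
        if qnode.text == mention_text:
            return "NTSV"               # spans exactly the mention, e.g., "57 tonnes"
        return "NTMV"                   # e.g., "by 57 tonnes" and higher ancestors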
To extract quantified candidates, we started by analyzing general sentence structure as well as examining data examples from DROP, BONIE, MARVE, and ReVerb. We found that the noun phrases (NPs) or verb phrases (VPs) close to a quantity mention are relevant to it. Such relevance can be measured by the distance between a quantity node and its sibling (sharing the same parent) or cousin (at the same level but from a different parent) nodes in a constituency parse tree. The distance between two adjacent nodes is the shortest, which represents the highest relevance, and the relevance between sibling nodes is higher than between cousin nodes. Identifying quantified candidates only from sibling or cousin nodes of a quantity node is not enough; we need to exclude some cases. In most cases, two different quantity mentions in one sentence are less relevant to each other, such as “three” and “five” in the sentence “Tom has three cats and five dogs”. However, in other cases, like “It costs us $10 to ship $490 of supplies”, one quantity mention modifies the other. Therefore, we only allow certain types of quantity nodes to become quantified candidates.
Algorithm 1 concretely describes the quantified candidate extraction process. We traverse each quantity node q_n of the constituency parse tree T. We then apply rule R1 or R2, depending on the quantity node type, to extract quantified candidates in lines 3–9. We keep results extracted from both sides of the quantity node to preserve different information. Rules R3 and R4 in lines 10–12 are filter rules that determine whether a matched quantified candidate is valid to become part of the quantified candidate set C. The extraction rules can be summarized as:
  • Rule R1: find sibling noun tokens or NP nodes of q_n and return the merged text c_in from all found nodes. Based on this rule, the sibling node “tonnes” will be found in level 5.
  • Rule R2: find the closest sibling/cousin NP or VP node of q_n and return its text as c_ex. Based on this rule, the sibling node “Express supply” will be found in level 2.
  • Rule R3: returns true if c_ex is not a noun quantity node.
  • Rule R4: returns true if c_ex is not a noun date node.
Algorithm 1: extract quantified candidates.
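A hedged Python reconstruction of the extraction rules R1–R4, where siblings, closest_np_vp, is_quantity_node, and is_date_node are assumed helper functions over the CP tree and qn.mention is the node's quantity mention text:

    def extract_candidates(quantity_nodes):
        """Sketch of Algorithm 1: collect quantified candidates c_in / c_ex."""
        C = []
        for qn in quantity_nodes:
            if node_type(qn, qn.mention) in ("TSV", "NTSV"):
                # R1: merge sibling noun tokens / NP nodes, e.g., "tonnes" beside "57"
                c_in = " ".join(n.text for n in siblings(qn)
                                if n.label in ("NN", "NNS", "NP"))
                if c_in:
                    C.append(("c_in", c_in))
            else:  # NTMV: both sides of the quantity node are searched
                # R2: closest sibling/cousin NP or VP, e.g., "Express supply"
                c_ex = closest_np_vp(qn)
                # R3 / R4: keep c_ex only if it is not itself a quantity or date node
                if c_ex and not is_quantity_node(c_ex) and not is_date_node(c_ex):
                    C.append(("c_ex", c_ex.text))
        return C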
In addition, inspired by the work of Jiang and Diesner [36], we extract relations by merging headwords from VP and PP quantity nodes. Based on this rule, “increased” in level 3 and “by” in level 4 build the relation “increased by” for the quantity node “57 tonnes”. Note that Algorithm 1 is applied to each quantity mention.
The relation here is comparable to the relational phrase in OIE, which typically consists of a verb and, optionally, a preposition after it. The verb expresses how the value is altered, such as “increased”, or the action conducted with respect to the value, such as “donated”. Another key element is the preposition following the verb. For instance, the prepositions “to” and “by” in “increased to” and “increased by” help distinguish between absolute and relative change. In the constituency parse tree, the relation is typically located in the least verb phrase (VP) ancestor of the value mention [36]. The extraction rule is therefore straightforward: we take the verb, and the preposition following it, from the least VP ancestor of the quantity mention as the relation.
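A sketch of this relation rule, with least_vp_ancestor, headword, and following_preposition as assumed helpers:

    def extract_relation(mention_node):
        """Verb (+ trailing preposition) from the least VP ancestor of the mention."""
        vp = least_vp_ancestor(mention_node)     # e.g., "increased by 57 tonnes ..."
        if vp is None:
            return None                          # e.g., quantity-as-subject sentences
        verb = headword(vp)                      # e.g., "increased"
        prep = following_preposition(vp, verb)   # e.g., "by" (may be absent)
        return f"{verb} {prep}" if prep else verb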

2.4. Stage II: Filter and Recombine

In this stage, the inputs are the time, relation, quantity mention, and quantified candidates, and the output is a set of quantity facts. The quantified candidates extracted in Section 2.3 are only relevant to the quantity nodes but not necessarily related to the quantity mention itself; the quantity node can be considered a proxy of the quantity mention. There could be redundant information in a quantity node as it reaches a higher level, which may lead to the matched quantified candidate being less relevant to the quantity mention. To address this problem, we propose a filtering method based on dependency parsing to filter out quantified candidates that are less relevant to the quantity mention. Finally, we recombine all remaining candidates into the extracted quantity facts.

2.4.1. Dependency Parsing

We employ the same model [35] for dependency parsing (DP). In a DP tree, the headword points to its dependent (tail) word (a dependent word modifies its headword).

2.4.2. Filter

We introduce dependency parsing as a filtering method to filter out quantified candidates that are less relevant to a quantity mention. Different from MARVE [22] and other DP-based extraction methods, our method does not rely on complex dependency patterns. As the number of dependency types is large, and it becomes even larger with combinations, designing generalized matching patterns can be difficult. Instead of studying dependency types, we only exploit shallow features from the DP tree structure for two reasons: (1) the necessary information has already been extracted, and (2) a filtering method does not require high accuracy.
We explain the filtering mechanism based on the dependency parse tree structures shown in Figure 3. The following relationship types between two tokens in a DP tree are defined:
  • Direct dependency: A token modifies another token directly, e.g., “Express” modifies “supply” directly.
  • Indirect dependency: A token modifies another token indirectly through other tokens, e.g., “tonnes” modifies “increased” indirectly through “to”.
  • Same head: Both tokens modify the same head token, e.g., “to” and “supply” both modify “increased”.
Figure 3. A DP tree built from an example sentence.
Beyond single tokens, a phrase can be represented as a non-leaf sub-tree in a DP tree. For example, “to 457 tonnes” is a sub-tree. What a phrase depends on or modifies is determined by the root word of its sub-tree. As a result, the relationship types defined above can be scaled up to describe relationships between phrases. The model we employ performs CP and DP jointly, making it easy to align a phrase in the CP tree to the same phrase in the DP tree.
We assume that any two words or phrases standing in one of these relationship types are relevant, regardless of the dependency types between them. Hence, we examine the relationship type between each quantified candidate and the quantity mention and filter out the quantified candidates without a relationship-type match.
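The check itself is shallow enough to fit in a few lines. A sketch using spaCy dependency trees (an illustrative stand-in for the joint parser's DP output), where a and b are the root tokens of the two sub-trees being compared:

    import spacy

    def related_in_dp(a, b) -> bool:
        """True if a and b stand in one of the three relationship types above."""
        direct = (a.head == b) or (b.head == a)              # direct dependency
        indirect = (a in b.ancestors) or (b in a.ancestors)  # indirect dependency
        same_head = a.head == b.head                         # both modify the same head
        return direct or indirect or same_head

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("Express supply increased by 57 tonnes this year to 457 tonnes.")
    print(related_in_dp(doc[1], doc[4]))  # "supply" vs. "57" (result is parse-dependent)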

2.4.3. Recombine

With irrelevant quantified candidates filtered out and the time, relation, and quantity mention extracted, a complete quantity fact can be built. Algorithm 2 describes the overall recombine process. The relation and time are assigned to the quantity fact F directly. The value and unit can be acquired from the quantity mention q_m. Only the agent and quantity are undetermined. As mentioned in Section 2.3, each c of C_f is assigned a type of c_in or c_ex. c_in is a quantified candidate extracted from a TSV or NTSV quantity node; this type of quantity node contains only a single value, making c_in highly relevant to its value. Hence, we assign c_in to quantity with higher priority in line 6. c_ex is a quantified candidate extracted from a TSV or NTMV quantity node. The relevance between c_ex and v is lower than the relevance between c_in and v.
The agent plays a direct role in influencing or motivating v. In other words, agent is related to value through relation. We extract agent by finding the c_ex to the left of a VP quantity node in line 9. If such a c_ex cannot be found, we save the c_ex temporarily in c_q in line 12. If quantity is still undetermined after the iteration, we assign c_q, which is also a c_ex, to quantity with lower priority in line 17.
Algorithm 2 constructs one quantity fact F for each corresponding quantity mention. For n quantity mentions in a single sentence, the result is a set of quantity facts {F_1, F_2, …, F_n}.
Algorithm 2: build a quantity fact.
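A hedged Python reconstruction of the recombine step (left_of_vp is an assumed helper; QuantityFact is the record sketched in Section 2.1):

    def build_fact(qm, relation, time, candidates):
        """Sketch of Algorithm 2: recombine the remaining items into one fact."""
        agent = quantity = c_q = None
        for kind, c in candidates:          # kind is "c_in" or "c_ex"
            if kind == "c_in" and quantity is None:
                quantity = c.text           # c_in binds tightly to the value
            elif kind == "c_ex":
                if left_of_vp(c):           # c_ex left of a VP quantity node
                    agent = c.text
                else:
                    c_q = c.text            # keep as a lower-priority fallback
        if quantity is None:
            quantity = c_q                  # assign the fallback c_ex, if any
        return QuantityFact(agent, relation, qm.value, qm.unit, quantity, time)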

2.5. QuantityIE Summary

For each value, QuantityIE extracts each semantic role with simple rules on the constituency parse tree, starting from its lowest-level quantity mention and proceeding in bottom-up order. Benefiting from constituency parsing, QuantityIE can partially resolve ellipsis in text with multiple quantities. For example, for “The highest temperature is 35 °C in summer while it is 5 °C in winter”, QuantityIE extracts the correct agent “the highest temperature” for both “35 °C” and “5 °C”, while pure DP-based methods would need complex dependency patterns to achieve this, inevitably introducing errors. For sentences like “The ticket price is $5” and “The oil price dropped by 30%”, the agent (“the ticket price” and “the oil price”) can also be considered the quantity when the relation is in {“is”, “dropped by”, ⋯}. This commonality between agent and quantity may be a source of confusion, but we can differentiate them based on the relation. Note that for sentences like “The US force of 500 young soldiers was assembled in New York.”, where the quantity is the subject, QuantityIE only extracts the quantity (“young soldiers”) and no relation. Both the time and space complexity are O(n), as we only perform a breadth-first traversal of the CP tree.

2.6. Information Retrieval with QuantityIE

Early search engines treated queries as simple keywords and returned the top web pages as the search result; users needed to examine each web page to find the answer or valuable information. By leveraging information extraction, semantic parsing, and knowledge graphs, modern search engines understand online content and user queries more deeply and can return direct answers for some queries [37,38,39]. For example, the BING search engine recognizes the entity “Tesla Model 3” in the query “The price of a Tesla Model 3” and returns “$37,990–$54,990” directly as the answer. Likewise, Google can return a list of entities for queries with a type description like “2019 Nobel prize winners”.
However, this capability is limited to simple queries and answer structures (single data/data set). Complex queries, or queries that require a complicated data structure to answer (e.g., table/graph), are beyond the abilities of today’s search engines. Another limitation is the understanding of quantities. Search engines merely match tokens that denote numbers and units as if they were keywords. Consider example queries:
  • Companies with PE Ratio lower than 10.
  • NBA results of the last season.
  • Daily active users of Tiktok in the US last month.
  • The latest election polls of Biden vs. Trump.
These queries usually contain quantities, entities, and time expressions, and serve the purposes of data analysis and acquiring knowledge about cases or events. Major search engines fail to answer such queries and return web pages instead. The second query is an exception: both Google and BING are able to retrieve the NBA results and return a specialized table visualizing the scores of each team. This capability is based on structured data retrieved from a trusted data source, such as “STATS, LLC” for the NBA. If we change “NBA” to “CBA”, the search engine falls back to search mode again, as no data source is available or accessible. Only a small amount of structured data is directly and publicly available, and it is often manually collected and built with heavy human effort. For many cases or events, the structured data is “hidden” in unstructured text such as news, articles, and reports.
Based on information extraction with QuantityIE, we study the task of answering complex queries or queries with quantities by extracting and retrieving structured data from natural language text. This task is extremely valuable for data analysis purposes. The overall numerical information extraction and retrieval process is presented in Figure 4.
To retrieve numerical information, the main idea is to (1) extract numerical information from both the query and the search targets with QuantityIE, (2) perform entity linking to identify and disambiguate named entities in the quantity facts, (3) match the quantity facts of the search targets against the quantity fact of the query to retrieve candidates, and (4) rank the candidates from different targets to obtain the final result. The three components are described below, followed by a code sketch of the pipeline.
  • Query Parsing. Two different types of queries are supported: quantity-based queries and target-based queries. A quantity-based query, similar to the queries in QSearch [34], includes a semantic type of answers t*, a quantity condition q*, and search targets S* = {s_1, s_2, …, s_n}. An example is “Countries with cases more than 50,000.”, where t* = countries, S* = {cases} and q* = > 50,000. A target-based query contains specific entity targets E* = {e_1, e_2, …, e_n} and search targets S*. For example, “How many deaths and recoveries have Israel and Italy reported by April 21st” is a target-based query where E* = {Israel, Italy} and S* = {death, recovery}. S* can be obtained strictly from a dictionary of keywords defined for each s or loosely based on token/phrase tags. For quantity-based queries, the parser uses a dictionary of entity types to recognize t* and the same method described in Section 2.3 to extract the value and unit from the query as q*; for target-based queries, E* is recognized with the same entity linking method. Both types of queries support a time constraint d*, which can be obtained with QuantityIE. The two types of queries are shown in Figure 5.
  • Data Retrieval. First, based on the entity type information from the KB, the quantity facts with an entity e satisfying the target type t* or entity condition E* are retrieved from the repository. Second, we perform quantity matching for quantity-based queries by matching q* against the value and unit of each quantity fact. The units should relate to the same concept to be comparable (e.g., both dimensions or both amounts), and the values should match the comparison operator (after normalization). We filter out all quantity facts without a positive match of q*. We further extract the target set S from the agent and quantity of each quantity fact with the same keyword-target dictionary. Targets of the query and the quantity fact are considered matched if an overlap between S* and S exists. We filter out quantity facts without a positive match of targets. Finally, if d* exists, we filter out the quantity facts whose time does not satisfy d*. The remaining quantity facts become candidate facts.
  • Result Ranking. The candidate facts built from different sources may contain duplicated values for the same entity and target. We aggregate candidate facts with the same entity e and target s into different entity-target groups. For each entity-target group, we further aggregate candidate facts with the same value as value groups. The candidate facts from the same value group are from different sources (web pages), but the source text may be identical due to re-posts. We design a simple scoring method to rank the result of each entity-target group by counting the supporting candidate facts of each value group. The candidate facts from the value group with the highest score become the final results for each entity and target.
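A sketch of this retrieval pipeline under the definitions above; parse_query, comparable, normalize, targets, satisfies, and rank_by_support are hypothetical helpers, the entity field is attached by entity linking, and the query fields mirror t*, E*, q*, S*, and d*.

    def answer_query(query: str, fact_repo, kb):
        """Retrieve and rank quantity facts matching a parsed query (sketch)."""
        q = parse_query(query)               # quantity-based or target-based
        cands = [f for f in fact_repo        # entity type / entity condition
                 if kb.type_of(f.entity) == q.t_star or f.entity in q.e_star]
        if q.q_star is not None:             # quantity condition (quantity-based only)
            cands = [f for f in cands if comparable(f.unit, q.q_star.unit)
                     and q.q_star.matches(normalize(f.value, f.unit))]
        cands = [f for f in cands if targets(f) & q.s_star]   # target overlap
        if q.d_star is not None:             # optional time constraint
            cands = [f for f in cands if satisfies(f.time, q.d_star)]
        return rank_by_support(cands)        # majority support per (entity, target) group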

3. Experiments and Results

We conduct experiments on two different tasks: (1) the numerical information extraction task (numerical IE) and (2) the numerical information retrieval task (numerical IR). The settings and motivations of the two experiments are shown in Table 1.

3.1. Numerical Information Extraction

3.1.1. Experimental Settings

The numerical information extraction experiment is based on two datasets: the DROP dataset released by Dua et al. [11] and the test data of BONIE [12]. The DROP dataset is an English reading comprehension benchmark that requires discrete reasoning (such as addition, counting, or sorting) over the content of paragraphs. A total of 5565 paragraphs from National Football League (NFL) game summaries and history articles are included in the DROP dataset, among which 3845 paragraphs include quantities. We sampled 100 sentences (NFL and history 1:1) containing at least one valid number (not a date, time, duration, or document word like “Section”, “Table”, or “Figure”) and annotated each sentence with the semantic labels defined in Section 2. To construct a test set with diverse examples, we sampled at most two sentences from any single paragraph so that the sentences cover more paragraphs and sources. Similarly, we sampled and annotated 100 sentences from the test set of BONIE. BONIE targets numerical relation extraction, so no further processing was required. Across all the values, we labeled 199 agents, 210 relations, 253 quantities, and 102 time expressions.
Annotation of each label follows the principle of annotating only the minimal but complete (necessary) information. For example, the context “RAZr phones” in the sentence “Motorola has sold over 23 million RAZr phones” matches this principle while “phones” does not, because without “RAZr” describing “phones”, the meaning would be incomplete. Inspired by evaluation methods for named entity recognition (NER) tasks, we adopt three different matching standards (sketched in code after the list):
  • Exact match (EM): Two text pieces are considered matched if they are exactly the same.
  • Partial match (PM): Two text pieces are considered matched if one is a sub-text of the other.
  • Overlapped match (OM): Two text pieces are considered matched if they are overlapped.
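In code, the three standards reduce to simple string predicates; the token-overlap reading of OM below is our sketch's assumption, as the paper does not pin down the overlap granularity.

    def em(a: str, b: str) -> bool:   # exact match
        return a == b

    def pm(a: str, b: str) -> bool:   # partial match: one is a sub-text of the other
        return a in b or b in a

    def om(a: str, b: str) -> bool:   # overlapped match: here, any shared token
        return bool(set(a.split()) & set(b.split()))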
We compare QuantityIE with two DP-based baselines on quantity, agent, and relation extraction: MARVE [22] and BONIE  [12].
For a fair comparison between BONIE [12], MARVE [22], and QuantityIE, we make some adjustments to the output schemas. BONIE outputs a numerical triple <arg1, relation, arg2> with an extra additional field. One of arg1 and arg2 contains a numerical value while the other is the subject or object; relation is similar to the relational phrase in OIE, and additional is the additional information of the sentence. MARVE outputs <value, quantified, related>, where quantified is the object or concept measured by the value, and related is a collection of words or entities related to a measurement. For BONIE, we consider an extraction of agent or quantity correct if either arg1 or arg2 matches the labeled text. We match the relation phrase with relation. Some relation phrases from BONIE are normalized; we consider all these types of relations correct. For MARVE, we only evaluate quantity: if either quantified or related matches the labeled quantity, the extraction is considered correct.
If a sentence contains more than one quantity, the extraction results are grouped by quantity. The unit is not evaluated as it is extracted by Microsoft Recognizers Text. Based on the three matching standards defined above, we calculate the precision, recall, and F1 score using the following rules:
  • True Positive (TP): the extracted role matches the labeled role.
  • False Positive (FP): the extracted role exists but does not match any labeled roles, including cases when the labeled role does not exist.
  • False Negative (FN): the labeled role exists but does not match any extracted roles, including cases when the extracted role does not exist.
During the evaluation, multiple result groups may exist depending on the number of values in a sentence. We first align each result group by finding the same value in the labeled and extracted data. Then we evaluate each labeled role corresponding to the value. If a value cannot be matched, all labeled roles corresponding to this value are counted as FP or FN.
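From these counts, precision, recall, and F1 follow the standard definitions:

    def precision_recall_f1(tp: int, fp: int, fn: int):
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1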

3.1.2. Experimental Results

Table 2 shows the overall extraction results on quantity, agent, and relation. The individual results for quantity, agent, and relation in QuantityIE are close to the overall F1 and are not presented separately. We evaluate time separately; its F1 scores are 0.80, 0.84, and 0.88 under EM, PM, and OM, respectively. QuantityIE outperforms the baselines under all of the EM, PM, and OM metrics, and the recall of QuantityIE is higher. The results indicate that pure DP-based approaches have limitations in capturing varied linguistic phenomena. QuantityIE has significantly better results under the EM metric; the reason is that DP rules are token-level while the quantity facts are phrases. The results also demonstrate that although the extraction rules in QuantityIE are simple, they are quite effective.
We further evaluated the same dataset with exact matching after removing the DP filter from QuantityIE. The results show that precision decreases by 0.14 while recall slightly increases by 0.02. This indicates that DP is capable of removing candidates that are not “close” to the value mention and can boost performance.
The recall of BONIE is rather low, which is also acknowledged in their paper. For sentences with complex structures (e.g., DROP sentences), BONIE often fails, producing a large number of empty results. This also confirms that not all values are expressed in the form of a “relation”. Another reason for the low recall is that BONIE uses bootstrapping to learn specific dependency patterns, which yields empty results when no pattern is matched. The precision of MARVE is comparable to QuantityIE, but its F1 is still held back by the low recall. We analyzed the results extracted by MARVE and found many empty results. This is mainly because no DP patterns could be matched, supporting the point that DP patterns are complex and DP-based methods might fail to process numerical information with long-range dependencies.

3.1.3. Case Studies

Table 3 shows some cases of the extraction results of QuantityIE, BONIE, and MARVE. The output schemas of the baselines differ from QuantityIE’s but are still comparable, as types from different schemas share similar definitions. BONIE outputs a numerical triple <arg1, relation, arg2>, and MARVE outputs <value, quantified, related>. We align the output for comparable schemas, e.g., agent from QuantityIE and arg1 from BONIE. For each aligned output, uncolored and green cells are acceptable extractions, with green results preferable to orange ones; red cells are incorrect extractions, and results colored only in red represent errors common to all methods.
The settings are as follows: cases #1–#2 and #6–#7 are sentences with one numerical value, and cases #3–#5 and #8–#10 are sentences with multiple values. There are ellipsis (“2” for “2 kittens”) and coreference (“she” for “Joan”) in case #8. Cases #9–#10 contain grouped values ($5.2 billion and $6 billion, respectively) and scores between two sides (21–18).
The results show that only QuantityIE is able to extract at least one quantity fact for each case, which matches the conclusion that QuantityIE improves recall significantly.
The extraction output of case #1 shows that the phrase-level extraction results of QuantityIE are more self-contained than those of BONIE or MARVE. For sentences with one numerical value, BONIE produces results only for cases #1–#2, and MARVE only for cases #1 and #6. BONIE outputs <an army, has men of, over 80,000> while QuantityIE outputs <The prince of Ning, have an army of, 80,000, men>. Though the extraction result of BONIE is also correct, it is not sufficient to fully understand the numerical information. For sentences with multiple values (cases #3–#5), only QuantityIE extracts quantity facts for each numerical value correctly; the other baselines miss some of the values or output empty results. MARVE is able to extract numerical information for each value in case #5 but fails to extract all the quantified fields correctly. This indicates that DP-rule-matching-based methods can be unstable in handling sentences with different structures, introducing errors like “brothers and” for the value “two” in case #5.
Besides the cases where QuantityIE extracts correctly, we show some typical incorrect or partially correct results (cases #8–#10); most of these are common errors shared by all methods, including the baselines BONIE and MARVE, and we mark such common errors in red as well. For cases with ellipsis and coreference, such as case #8, all methods fail to output any extraction result; handling these cases is not the focus of numerical information extraction. Another typical error is shown in case #9, where specific rules are needed for numbers expressed with “respectively”: matching “$5.2 billion” to “Profits of Microsoft” and “$6 billion” to “Profits of Apple” separately requires further processing. For case #10, it would be better if the score 21–18 were treated as a whole.
It should be noted that for sentences such as case #9, the agent can also be considered the quantity. This commonality between agent and quantity may be a source of confusion. However, we can differentiate them based on the relation: if the relation phrase is a “be” verb, or a verb like “increase” or “drop”, the extracted agent can also be considered the quantity.

3.2. Numerical Information Retrieval

3.2.1. Experimental Settings

We evaluated numerical information retrieval on COVID-19 news reports. To this end, we collected news reports from BING News as the data source. The news articles were segmented into sentences with spaCy. We filtered out the sentences irrelevant to the COVID-19 topic using predefined keywords. Subsequently, we used Recognizers-Text to detect numerical values in the text. Sentences containing at least one numerical value and one named entity were considered valid; the rest were filtered out. The title, content, source link, and publication time were collected and saved for each news article. In total, we collected 53,604,309 sentences, containing 8,755,531 quantity facts. The evaluation of information retrieval was based on the accuracy of six different types of COVID-19 data: confirmed total, confirmed change, recovered total, recovered change, death total, and death change.

3.2.2. Experimental Results

Similar to Luo et al. [40], we adopted the BING COVID-19 tracker data, which is publicly available and updated daily, as the ground truth. Results for each COVID-19 data target from each region on each date were matched to the corresponding data of the BING tracker. The results are shown in Table 4. Loose accuracy means that a 10% difference was allowed between the retrieved values and the gold values. The results on information retrieval demonstrate the effectiveness of our approach, as both the query and the search targets were extracted using QuantityIE.

4. Conclusions

We propose a novel numerical information extraction approach, QuantityIE, to extract sentence-level numerical information. A two-stage process is employed: the first stage extracts quantity fact candidates, and the second stage filters and recombines them into a formal representation. Both the numerical information extraction and retrieval experiments, based on different data, demonstrate the effectiveness of QuantityIE. For future work, we will study paragraph- or document-level numerical information extraction with new challenges such as coreference and ellipsis, which are also weaknesses of the current approach.

Author Contributions

T.L.: Conceptualization, Formal analysis, Investigation, Methodology, Software, Writing—original draft. Z.W.: Formal analysis, Investigation, Writing—review & editing. Z.L.: Project administration, Conceptualization, Writing—review & editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found at https://github.com/Open-NRE/ONRE (accessed on 27 July 2018) and https://allenai.org/data/drop (accessed on 19 March 2023).

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (Grant Nos. 62276017, U1636211, 61672081), the 2022 Tencent Big Travel Rhino-Bird Special Research Program, and the Fund of the State Key Laboratory of Software Development Environment (Grant No. SKLSDE-2021ZX-18).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CP   Constituency Parsing
DP   Dependency Parsing

References

  1. Sugawara, S.; Inui, K.; Sekine, S.; Aizawa, A. What Makes Reading Comprehension Questions Easier? In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 4208–4219. [Google Scholar]
  2. Xu, Y.; Liu, X.; Shen, Y.; Liu, J.; Gao, J. Multi-task Learning with Sample Re-weighting for Machine Reading Comprehension. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2644–2655. [Google Scholar]
  3. Zhang, X.; Huang, H.; Chi, Z.; Mao, X.L. ET5: A Novel End-to-end Framework for Conversational Machine Reading Comprehension. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 570–579. [Google Scholar]
  4. Song, M.; Feng, Y.; Jing, L. Hyperbolic Relevance Matching for Neural Keyphrase Extraction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 5710–5720. [Google Scholar]
  5. Cao, Y.; Groves, W.; Saha, T.K.; Tetreault, J.; Jaimes, A.; Peng, H.; Yu, P. XLTime: A Cross-Lingual Knowledge Transfer Framework for Temporal Expression Extraction. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 1931–1942. [Google Scholar]
  6. Thai, K.; Chang, Y.; Krishna, K.; Iyyer, M. RELiC: Retrieving Evidence for Literary Claims. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 7500–7518. [Google Scholar]
  7. Jeong, S.; Baek, J.; Cho, S.; Hwang, S.J.; Park, J. Augmenting Document Representations for Dense Retrieval with Interpolation and Perturbation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Dublin, Ireland, 22–27 May 2022; pp. 442–452. [Google Scholar]
  8. Kim, J.; Kim, M.; Hwang, S.w. Collective Relevance Labeling for Passage Retrieval. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 4141–4147. [Google Scholar]
  9. Wang, B.; Shin, R.; Liu, X.; Polozov, O.; Richardson, M. RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 1–5 July 2020; pp. 7567–7578. [Google Scholar]
  10. Yu, T.; Zhang, R.; Yang, K.; Yasunaga, M.; Wang, D.; Li, Z.; Ma, J.; Li, I.; Yao, Q.; Roman, S.; et al. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 3911–3921. [Google Scholar]
  11. Dua, D.; Wang, Y.; Dasigi, P.; Stanovsky, G.; Singh, S.; Gardner, M. DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 2368–2378. [Google Scholar]
  12. Saha, S.; Pal, H.; Mausam. Bootstrapping for Numerical Open IE. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 317–323. [Google Scholar]
  13. Madaan, A.; Mittal, A.; Ramakrishnan, G.; Sarawagi, S. Numerical Relation Extraction with Minimal Supervision. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 2764–2771. [Google Scholar]
  14. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 6000–6010. [Google Scholar]
  15. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  16. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving language understanding by generative pre-training. OpenAI Blog. 2018. Available online: https://www.cs.ubc.ca/~amuham01/LING530/papers/radford2018improving.pdf (accessed on 11 June 2018).
  17. Lu, Y.; Liu, Q.; Dai, D.; Xiao, X.; Lin, H.; Han, X.; Sun, L.; Wu, H. Unified Structure Generation for Universal Information Extraction. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 5755–5772. [Google Scholar]
  18. Wang, C.; Liu, X.; Chen, Z.; Hong, H.; Tang, J.; Song, D. DeepStruct: Pretraining of Language Models for Structure Prediction. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, 22–27 May 2022; pp. 803–823. [Google Scholar]
  19. Fatahi Bayat, F.; Bhutani, N.; Jagadish, H. CompactIE: Compact Facts in Open Information Extraction. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Seattle, WA, USA, 10–15 July 2022; pp. 900–910. [Google Scholar]
  20. Vasilkovsky, M.; Alekseev, A.; Malykh, V.; Shenbin, I.; Tutubalina, E.; Salikhov, D.; Stepnov, M.; Chertok, A.; Nikolenko, S.I. DetIE: Multilingual Open Information Extraction Inspired by Object Detection. In Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, 22 February–1 March 2022; pp. 11412–11420. [Google Scholar]
  21. Ro, Y.; Lee, Y.; Kang, P. Multi2OIE: Multilingual Open Information Extraction Based on Multi-Head Attention with BERT. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 1107–1117. [Google Scholar]
  22. Hundman, K.; Mattmann, C.A. Measurement Context Extraction from Text: Discovering Opportunities and Gaps in Earth Science. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Data-Driven Discovery Workshop, Long Beach, CA, USA, 6–10 August 2017. [Google Scholar]
  23. Wang, Z.; Yang, L.; Yang, J.; Li, T.; He, L.; Li, Z. A Triple Relation Network for Joint Entity and Relation Extraction. Electronics 2022, 11, 1535. [Google Scholar] [CrossRef]
  24. Alonso, O.; Sellam, T. Quantitative Information Extraction From Social Data. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; pp. 1005–1008. [Google Scholar]
  25. Ravichander, A.; Naik, A.; Rose, C.; Hovy, E. EQUATE: A Benchmark Evaluation Framework for Quantitative Reasoning in Natural Language Inference. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), Hong Kong, China, 3–4 November 2019; pp. 349–361. [Google Scholar]
  26. Sellam, T.; Alonso, O. Raimond: Quantitative Data Extraction from Twitter to Describe Events. In Proceedings of the 15th International Conference on Engineering the Web in the Big Data Era, Rotterdam, The Netherlands, 23–26 June 2015; Volume 9114, pp. 251–268. [Google Scholar]
  27. Wang, N.; Li, J.; Meng, Y.; Sun, X.; Qiu, H.; Wang, Z.; Wang, G.; He, J. An MRC Framework for Semantic Role Labeling. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 2188–2198. [Google Scholar]
  28. Wu, H.; Xu, K.; Song, L. CSAGN: Conversational Structure Aware Graph Network for Conversational Semantic Role Labeling. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; pp. 2312–2317. [Google Scholar]
  29. Zhang, Y.; Xia, Q.; Zhou, S.; Jiang, Y.; Fu, G.; Zhang, M. Semantic Role Labeling as Dependency Parsing: Exploring Latent Tree Structures inside Arguments. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 4212–4227. [Google Scholar]
  30. Zhou, S.; Xia, Q.; Li, Z.; Zhang, Y.; Hong, Y.; Zhang, M. Fast and Accurate End-to-End Span-based Semantic Role Labeling as Word-based Graph Parsing. In Proceedings of the 29th International Conference on Computational Linguistics, Gyeongju, Republic of Korea, 12–17 October 2022; pp. 4160–4171. [Google Scholar]
  31. Wu, H.; Tan, H.; Xu, K.; Liu, S.; Wu, L.; Song, L. Zero-shot Cross-lingual Conversational Semantic Role Labeling. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, USA, 10–15 July 2022; pp. 269–281. [Google Scholar]
  32. Lamm, M.; Chaganty, A.; Jurafsky, D.; Manning, C.D.; Liang, P. QSRL: A Semantic Role-Labeling Schema for Quantitative Facts. In Proceedings of the First Financial Narrative Processing Workshop at LREC 2018, Miyazaki, Japan, 7–12 May 2018; pp. 44–51. [Google Scholar]
  33. Lamm, M.; Chaganty, A.; Manning, C.D.; Jurafsky, D.; Liang, P. Textual Analogy Parsing: What’s Shared and What’s Compared among Analogous Facts. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 82–92. [Google Scholar]
  34. Ho, V.T.; Ibrahim, Y.; Pal, K.; Berberich, K.; Weikum, G. Qsearch: Answering Quantity Queries from Text. In Proceedings of the 18th International Semantic Web Conference (ISWC), Auckland, New Zealand, 26–30 October 2019; pp. 237–257. [Google Scholar]
  35. Zhou, J.; Zhao, H. Head-Driven Phrase Structure Grammar Parsing on Penn Treebank. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 2396–2408. [Google Scholar]
  36. Jiang, M.; Diesner, J. A Constituency Parsing Tree based Method for Relation Extraction from Abstracts of Scholarly Publications. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), Hong Kong, China, 4 November 2019; pp. 186–191. [Google Scholar]
  37. Hasibi, F.; Balog, K.; Garigliotti, D.; Zhang, S. Nordlys: A Toolkit for Entity-Oriented and Semantic Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, Tokyo, Japan, 7–11 August 2017; pp. 1289–1292. [Google Scholar]
  38. Garigliotti, D. A Semantic Search Approach to Task-Completion Engines. In Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, Ann Arbor, MI, USA, 8–12 July 2018; p. 1457. [Google Scholar]
  39. Balog, K. Entity-Oriented Search; Springer Publishing Company, Incorporated: Cham, Switzerland, 2018. [Google Scholar]
  40. Luo, Y.; Li, W.; Zhao, T.; Yu, X.; Zhang, L.; Li, G.; Tang, N. DeepTrack: Monitoring and Exploring Spatio-Temporal Data: A Case of Tracking COVID-19. Proc. VLDB Endow. 2020, 13, 2841–2844. [Google Scholar] [CrossRef]
Figure 1. An overview of extracting numerical information from the CP structures.
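To make the idea behind Figure 1 concrete, below is a minimal sketch of CP-based quantity spotting. It assumes NLTK's Tree API and a hand-written parse tree (neither is prescribed by the paper), and it only finds noun phrases that directly contain a cardinal number (CD); QuantityIE's actual rules are richer than this.

```python
from nltk import Tree

# Hand-written constituency parse of "The squadron rescued or evacuated 466 men."
# (sentence #7 of Table 3); a real pipeline would obtain the tree from a parser.
parse = Tree.fromstring(
    "(S (NP (DT The) (NN squadron))"
    " (VP (VBD rescued) (CC or) (VBD evacuated)"
    " (NP (CD 466) (NNS men))))"
)

def quantity_nps(tree):
    """Yield (value, remaining_words) for each NP that directly contains a CD."""
    for np in tree.subtrees(lambda t: t.label() == "NP"):
        children = [c for c in np if isinstance(c, Tree)]
        values = [c[0] for c in children if c.label() == "CD"]
        if values:
            rest = [c[0] for c in children if c.label() != "CD"]
            yield values[0], rest

for value, words in quantity_nps(parse):
    print(value, words)  # -> 466 ['men']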
Figure 2. A sentence with labeled roles of a quantity fact.
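The roles in Figure 2 match the slots reported for QuantityIE in Table 3 (agent, relation, value, quantity, time). As a hedged illustration, such a fact could be carried in a record like the following; the QuantityFact class is a hypothetical container, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QuantityFact:
    # Field names mirror the role labels used in Table 3; the class itself
    # is a hypothetical container, not the paper's actual data structure.
    agent: Optional[str]     # who/what the quantity is about, e.g. "The squadron"
    relation: Optional[str]  # connecting predicate, e.g. "rescued or evacuated"
    value: Optional[str]     # the number itself, e.g. "466"
    quantity: Optional[str]  # what is counted, e.g. "men"
    time: Optional[str]      # temporal anchor, e.g. "1638", or None

fact = QuantityFact("The squadron", "rescued or evacuated", "466", "men", None)
```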
Figure 4. Extraction and retrieval.
Figure 5. Two types of queries for retrieval.
Table 1. Settings and motivations of different experiments.

|                     | Numerical IE                 | Numerical IR                       |
| Dataset             | BONIE & DROP combined        | COVID-19 news & Tracker Data       |
| Number of Sentences | 2005                         | 3,604,309                          |
| Number of Facts     | 2538                         | 755,531                            |
| Extraction Target   | Natural language text        | User query & news (search targets) |
| Task Purpose        | Direct evaluation            | Indirect evaluation                |
Table 2. Experiment results on numerical information extraction. EM, PM, and OM denote exact match, partial match, and overlapped match, respectively; see Section 3.1.1 for details. P denotes precision, and R denotes recall.

| Match | BONIE (P / R / F1) | MARVE (P / R / F1) | QuantityIE (P / R / F1) |
| EM    | 0.29 / 0.05 / 0.09 | 0.27 / 0.11 / 0.16 | 0.66 / 0.89 / 0.76      |
| PM    | 0.35 / 0.06 / 0.10 | 0.65 / 0.27 / 0.38 | 0.77 / 0.93 / 0.84      |
| OM    | 0.38 / 0.06 / 0.10 | 0.80 / 0.34 / 0.48 | 0.85 / 0.91 / 0.88      |
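For reference, each F1 cell is the harmonic mean of the adjacent P and R cells, e.g. 2 * 0.66 * 0.89 / (0.66 + 0.89) ≈ 0.76 for QuantityIE under EM. The sketch below checks that arithmetic and shows one plausible simplified reading of overlapped match; the authoritative EM/PM/OM definitions are in Section 3.1.1.

```python
def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Sanity check against the QuantityIE EM row of Table 2: F1(0.66, 0.89) ~ 0.76
assert round(f1(0.66, 0.89), 2) == 0.76

def overlapped_match(pred_tokens, gold_tokens):
    # Illustrative simplification of OM: any shared token counts as a hit;
    # Section 3.1.1 gives the paper's exact EM/PM/OM definitions.
    return bool(set(pred_tokens) & set(gold_tokens))
```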
Table 3. Case studies of QuantityIE, BONIE, and MARVE. Results with no color are acceptable. Results colored in green are preferable to the ones colored in orange. The red color represents an error. Extraction slots are in italic font.

Each QuantityIE row lists agent | relation | value | quantity | time; each BONIE row lists arg1 | relation | arg2 (additional); each MARVE row lists related | value | quantified.

#1 Sentence: The Prince of Ning was said to have an army of over 80,000 men.
    QuantityIE: The prince of Ning | have an army of | 80,000 | men | -
    BONIE: an army | has men of | over 80,000 (-)
    MARVE: army, 80,000 | 80,000 | men

#2 Sentence: If there is no nominee, those speeches become Clinton’s and Obama’s 8 mile moment.
    QuantityIE: those speeches | become Clinton’s and Obama’s | 8 | mile moment | -
    BONIE: those speeches | become | 8 mile moment (If there is no nominee)
    MARVE: - | - | -

#3 Sentence: In fact, it only costs us about $ 10 to ship $ 490 worth of supplies.
    QuantityIE: it | costs us | $ 10 | to ship $ 490 worth of supplies | -
    QuantityIE: - | - | $ 490 | worth of supplies | -
    BONIE: it | costs | about $ 10 (In fact)
    MARVE: - | - | -

#4 Sentence: The Lions scored first in the first quarter with a 23-yard field goal by Jason Hanson.
    QuantityIE: The Lions | scored | 23 | field goal by Jason Hanson | the first quarter
    QuantityIE: The Lions | scored | first | - | the first quarter
    BONIE: - | - | - (-)
    MARVE: - | 23 | field goal

#5 Sentence: He had at least five siblings: two brothers and three sisters.
    QuantityIE: He | had | five | siblings | -
    QuantityIE: He | had | two | brothers | -
    QuantityIE: He | had | three | sisters | -
    BONIE: - | - | - (-)
    MARVE: He, brothers, sisters | five | siblings
    MARVE: siblings, sisters | two | brothers and
    MARVE: siblings, brothers | three | sisters

#6 Sentence: The Portuguese fought back two Dutch attacks on Bahia in 1638.
    QuantityIE: The Portuguese | fought back | two | Dutch attacks on Bahia | 1638
    BONIE: - | - | - (-)
    MARVE: - | two | -
    MARVE: - | 1638 | -

#7 Sentence: The squadron rescued or evacuated 466 men.
    QuantityIE: The squadron | rescued or evacuated | 466 | men | -
    BONIE: - | - | - (-)
    MARVE: - | - | -

#8 Sentence: Joan had 8 kittens and she gave 2 to her friends.
    QuantityIE: Joan | had | 8 | kittens | -
    QuantityIE: she | gave | 2 | to her friends | -
    BONIE: Joan | had | 8 kittens (-)
    MARVE: Joan | 8 kittens | -
    MARVE: she | 2 | her friends

#9 Sentence: Profits of Microsoft and Apple were $ 5.2 billion and $ 6 billion, respectively.
    QuantityIE: Profits of Microsoft and Apple | were | $ 5.2 billion | - | -
    QuantityIE: Profits of Microsoft and Apple | were | $ 6 billion | - | -
    BONIE: Microsoft and Apple | had profits of | $ 5.2 billion (-)
    BONIE: Microsoft and Apple | had profits of | $ 6 billion (-)
    MARVE: - | - | -

#10 Sentence: The Browns dropped their 13th consecutive season-opening game with a 21-18 loss to the Steelers.
    QuantityIE: The Browns | dropped | 21 | loss to the Steelers | -
    QuantityIE: The Browns | dropped | 18 | loss to the Steelers | -
    QuantityIE: The Browns | dropped | 13th | consecutive season | -
    BONIE: - | - | - (-)
    MARVE: 21–28 | 21 | loss to
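Case #5 is notable: QuantityIE emits one fact per conjunct of the coordinated phrase "two brothers and three sisters". The sketch below shows one plausible way to split such a coordination on a constituency tree; it assumes NLTK and a hand-written parse fragment, and is an illustration of the general idea rather than QuantityIE's actual rule.

```python
from nltk import Tree

# Hand-written parse fragment for "two brothers and three sisters" (case #5);
# a real pipeline would obtain the tree from a constituency parser.
np = Tree.fromstring(
    "(NP (NP (CD two) (NNS brothers)) (CC and) (NP (CD three) (NNS sisters)))"
)

def split_coordinated_quantities(np_tree):
    """Yield (value, noun) for each conjunct NP pairing a number with a noun."""
    for conj in np_tree.subtrees(lambda t: t.label() == "NP" and t is not np_tree):
        tags = {c.label(): c[0] for c in conj if isinstance(c, Tree)}
        if "CD" in tags and "NNS" in tags:
            yield tags["CD"], tags["NNS"]

print(list(split_coordinated_quantities(np)))
# [('two', 'brothers'), ('three', 'sisters')]
```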
Table 4. Accuracy of retrieved results for different COVID-19 data types, based on the COVID-19 tracker data. Loose Accuracy allows a 10% difference between the retrieved values and the golden values.

| Type             | Accuracy | Loose Accuracy |
| overall          | 0.742479 | 0.817168       |
| confirmed        | 0.73725  | 0.81102        |
| death            | 0.764512 | 0.844327       |
| recovered        | 0.590278 | 0.625          |
| total            | 0.783755 | 0.865693       |
| change           | 0.632031 | 0.687321       |
| confirmed_total  | 0.794521 | 0.876027       |
| death_total      | 0.791908 | 0.879438       |
| recovered_total  | 0.595588 | 0.632353       |
| confirmed_change | 0.623641 | 0.682065       |
| death_change     | 0.655738 | 0.704918       |
| recovered_change | 0.5      | 0.5            |
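Under our reading of the caption, Loose Accuracy counts a retrieved value as correct when it is within 10% of the golden value. A minimal sketch of that criterion follows; the function names are our own, not the paper's.

```python
def loose_match(retrieved, golden, tolerance=0.10):
    """True when the retrieved value is within +/-10% of the golden value,
    mirroring the Loose Accuracy criterion of Table 4 (our reading of it)."""
    return abs(retrieved - golden) <= tolerance * abs(golden)

def loose_accuracy(pairs, tolerance=0.10):
    """Fraction of (retrieved, golden) pairs that loosely match."""
    return sum(loose_match(r, g, tolerance) for r, g in pairs) / len(pairs)

# e.g., a retrieved case count of 1050 against a golden count of 1000 passes:
print(loose_match(1050, 1000))                    # True
print(loose_accuracy([(1050, 1000), (1200, 1000)]))  # 0.5
```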