#### 3.3.1. Dependency Syntax

To improve the integrity of element recognition, dependency syntax relations are introduced, which can capture long-range collocation and modification relationships. These relations are combined with part-of-speech (POS) rules to filter out spurious rules that occur only occasionally. The POS syntactic rules are then organized to identify the complete structure of the elements. Dependency syntax analysis is performed with the LTP cloud platform (https://www.ltp-cloud.com/); each dependency consists of a core word and a modifier. The dependency relationship between two words is connected by a dependency arc, and the specific relationship between the collocated words is indicated by the tag on the arc. The tags are listed in Table 1.

**Table 1.** Tags of LTP-cloud Platform Dependency Syntax Analysis.


By reviewing 1050 financial reports, we found that attributes are often compound words, such as "利润增速水平" (profit growth rate level). Attribute values may even be clause structures, such as "增长接近100%左右" (growth close to approximately 100%). Most attribute values in the reports appear as objects, with ATT and COO dependencies among their constituent units. Similarly, fixed dependencies exist between the constituent units of attributes, as in the following forward-looking sentence: "预计公司的收入和毛利率将继续呈现高速增长的趋势" (It is expected that the company's revenue and gross profit margin will continue to show a trend of rapid growth). The results of the dependency syntactic analysis are shown in Figure 3.

**Figure 3.** Dependency analysis results.

The attributes of this forward-looking sentence are "收入" (revenue) and "毛利率" (gross profit margin); their syntactic role is to serve as the subject of the verb "呈现" (show), i.e., an SBV relationship. There is also a COO dependency between "收入" (revenue) and "毛利率" (gross profit margin). The attribute value element of the sentence is "呈现高速增长的趋势" (show a trend of rapid growth), which is part of the subordinate clause, where "呈现" (show) is the object of "预计" (expected), i.e., a VOB relationship. Within the attribute value, there is an ADV relationship between "高速" (rapid) and "增长" (growth), a VOB relationship between "呈现" (show) and "趋势" (trend), and an ATT dependency between "增长" (growth) and "趋势" (trend).
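The arcs just described can be represented as simple (head, dependent, relation) triples, from which the attributes are recovered as the subject of the main verb plus its coordinates. The following is a minimal sketch; the arc list is hand-written from the analysis above, not actual parser output, and the extraction heuristic is an illustrative assumption rather than the paper's full rule set.

```python
# Dependency arcs of the Figure 3 sentence, written by hand from the
# analysis in the text (tags: SBV, COO, VOB, ADV, ATT).
arcs = [
    ("呈现", "收入", "SBV"),    # 收入 is the subject of 呈现
    ("收入", "毛利率", "COO"),  # 毛利率 coordinates with 收入
    ("预计", "呈现", "VOB"),    # 呈现 is the object of 预计
    ("增长", "高速", "ADV"),    # 高速 modifies 增长 adverbially
    ("趋势", "增长", "ATT"),    # 增长 modifies 趋势 attributively
    ("呈现", "趋势", "VOB"),    # 趋势 is the object of 呈现
]

def extract_attributes(arcs):
    """Subjects (SBV) of the verb plus their COO coordinates form attributes."""
    subjects = [dep for head, dep, rel in arcs if rel == "SBV"]
    coords = [dep for head, dep, rel in arcs
              if rel == "COO" and head in subjects]
    return subjects + coords

print(extract_attributes(arcs))  # ['收入', '毛利率']
```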

#### 3.3.2. POS Syntactic Rules

As shown in Tables 2 and 3, certain POS syntactic rules are obtained from the context of attributes and attribute values via corpus analysis. Ruleid denotes the encoding of a rule; description gives the specific content of the rule; and output gives the recognized elements. Ds is the dependency relation between words W*i* and W*j* (W*k*), where Ds ∈ {VOB, ATT, ADV, COO, CMP, ...} and W*j* (W*k*) is among the first or last n words of W*i*; the dependency relation is thus based on the context of W*i*. POS denotes the part-of-speech information of W*i*. Con(W*i*~W*j*) represents the connection from W*i* to W*j*. Arrows represent dependency relations; for example, W*i*→Ds→W*j* denotes that W*j* depends on W*i* through the syntactic relation Ds, i.e., W*i* is the parent node of W*j*. R1i indicates that an attribute (Attr) or attribute value (Val) is identified using only one dependency relation, and R2i indicates that more than one type of dependency relation is used.
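An R1i-style rule, which uses a single dependency relation Ds plus a POS constraint, can be sketched as follows. The concrete rule shown (a noun in an SBV relation is labeled Attr) is an illustrative assumption, not a rule copied from Tables 2 and 3.

```python
# Sketch of applying one R1i-style POS syntactic rule: label every word
# whose POS matches `pos` and that stands in the dependency relation
# `relation` to some head word.
def apply_r1_rule(pos_tags, arcs, relation, pos, label):
    """Return {word: label} for each dependent matching the rule."""
    out = {}
    for head, dep, rel in arcs:
        if rel == relation and pos_tags.get(dep) == pos:
            out[dep] = label
    return out

pos_tags = {"收入": "n", "呈现": "v", "趋势": "n"}
arcs = [("呈现", "收入", "SBV"), ("呈现", "趋势", "VOB")]

print(apply_r1_rule(pos_tags, arcs, "SBV", "n", "Attr"))  # {'收入': 'Attr'}
```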


**Table 2.** POS syntactic rules based on the contextual rules of attributes.



#### 3.3.3. Algorithm

Although the POS syntactic rules can recover a more complete element structure, the number of rules remains limited. The LSTM-CRF model greatly improves element recall, but it does not guarantee integrity. Therefore, we combine the advantages of both and propose the LSTM-CRF model with the integrity algorithm. The method incorporates the advantages of the data-driven approach and dependency syntax to improve the integrity and accuracy of the elements without sacrificing recall.

The inputs of the LSTM-CRF model with the integrity algorithm include **LS**tag, **R**tag, and the following user-specified parameters:


The main idea of the algorithm is to use the tags generated by the POS syntactic rules to correct the tags generated by the LSTM-CRF model. The algorithm proceeds in three steps. First, **R**tag is traversed to find a B-Attr tag; the current position is taken as the start index **R**tag\_begin, and the end index of the element span is located from it, giving the interval [**R**tag\_begin: **R**tag\_end]. Second, **LS**tag[**R**tag\_begin: **R**tag\_end] is traversed to determine whether the interval contains only O tags and tags of the same element type; if so, **LS**tag is overwritten with **R**tag in the interval, and otherwise it is left unchanged. Third, the next group of annotations after **R**tag\_end is searched, until all of **R**tag has been traversed. By correcting the recognition in this way, the integrity of the elements is guaranteed as far as possible. The LSTM-CRF model with the integrity algorithm is shown in Algorithm 1.
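The three-step correction can be sketched as follows. A BIO tagging scheme is assumed, and the function name and variable names are our own; only the Attr case from the description is shown.

```python
# Sketch of the integrity correction: rule-based element spans (R_tag)
# overwrite the LSTM-CRF tags (LS_tag) only when the corresponding LS_tag
# span contains nothing but O tags and tags of the same element type.
def integrity_merge(ls_tag, r_tag):
    """Correct LSTM-CRF tags with rule-based element spans (BIO scheme)."""
    merged = list(ls_tag)
    i = 0
    while i < len(r_tag):
        if r_tag[i].startswith("B-"):          # step 1: find an element start
            elem = r_tag[i][2:]                # e.g. "Attr" or "Val"
            begin, end = i, i + 1
            while end < len(r_tag) and r_tag[end] == "I-" + elem:
                end += 1                       # extend to the span's end index
            span = ls_tag[begin:end]           # step 2: inspect the LS_tag span
            if all(t == "O" or t.endswith(elem) for t in span):
                merged[begin:end] = r_tag[begin:end]  # cover LS_tag with R_tag
            i = end                            # step 3: continue after the span
        else:
            i += 1
    return merged

ls = ["O", "B-Attr", "O", "O", "O"]
rt = ["O", "B-Attr", "I-Attr", "O", "O"]
print(integrity_merge(ls, rt))  # ['O', 'B-Attr', 'I-Attr', 'O', 'O']
```

Note that a span containing a conflicting label (e.g. a Val tag inside a rule-proposed Attr span) is left untouched, matching the "otherwise it is left unchanged" condition.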


#### **4. Experiment**

This section evaluates the performance of our proposed method through experiments and an analysis of the results. The first part describes the source of the data, the second part introduces the evaluation indicators, and the third part describes the training settings of the LSTM-CRF model. Finally, two sets of experiments show that the LSTM-CRF model with the integrity algorithm provides better results than the LSTM-CRF model and the POS syntactic rules alone, and that our proposed method has good domain independence.

#### *4.1. Data Description*

The aim of this article is to recognize elements in firm research reports at the sentence level. However, no tagged corpus of Chinese firm research reports is publicly available. Therefore, we constructed a forward-looking information corpus of Chinese firm research reports. Financial websites were crawled to obtain the experimental data, the source data were extracted from the HTML, and the crawled pages were denoised. The size of the corpus is shown in Table 4.

To standardize the data set, reports from different companies in different fields were used, and students with financial backgrounds were invited as annotators, with three students annotating the same data. Disagreements were resolved by majority vote.
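The majority vote over three annotators can be sketched as a one-liner; the function name and label values are illustrative assumptions.

```python
# Sketch: resolving three annotators' labels for one token by majority vote.
from collections import Counter

def majority_label(labels):
    """Return the label chosen by the largest number of annotators."""
    label, _count = Counter(labels).most_common(1)[0]
    return label

print(majority_label(["B-Attr", "B-Attr", "O"]))  # B-Attr
```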


**Table 4.** Corpus size statistics, where En, Attr, and Val represent entity, attribute, and attribute value, respectively.

#### *4.2. Evaluation Indicators*

Recognition effectiveness is evaluated using precision (P), recall (R), and F-score (F). To better assess the experimental results, both an accuracy evaluation and a coverage evaluation were conducted.
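The three indicators follow their standard definitions over element-level counts; a minimal sketch (function name is ours):

```python
# Precision, recall, and F-score from element-level counts of true
# positives (tp), false positives (fp), and false negatives (fn).
def prf(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0   # precision
    r = tp / (tp + fn) if tp + fn else 0.0   # recall
    f = 2 * p * r / (p + r) if p + r else 0.0  # F-score (harmonic mean)
    return p, r, f

print(prf(80, 20, 20))
```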

Accuracy evaluation: The recognition results are compared with the manual markings and are required to match them exactly.

Coverage evaluation: The recognition result is required to partially overlap with the tagged corpus. However, because the constituent elements of entities, attributes, and attribute values differ in length, the coverage evaluation cannot be unified.

In most cases, entities and attributes are formed by combining two nouns. The coverage-evaluation standard for entities and attributes is a partial overlap between the recognition results and the manual markers. For example, for the label "利润率增速" (profit rate growth), the recognition results "利润率" (profit rate) and "增速" (growth) are both considered matches.

The composition of an attribute value is longer than that of the other elements, and partial matching can lead to ambiguous results. Therefore, the coverage evaluation of attribute values requires the recognition result to contain the manual marker for a match to be counted as correct. For example, for the label "同比上升约2.5个百分点" (increased by about 2.5 percentage points year-on-year), "将同比上升约2.5个百分点" (will increase by about 2.5 percentage points year-on-year) is a correct match.
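The two coverage criteria can be sketched as substring checks; partial overlap is read here as one string containing the other, per the examples above, and the function names are our own.

```python
# Coverage-evaluation match criteria, per the text:
#   entities/attributes -> partial overlap (one string contains the other);
#   attribute values    -> the prediction must contain the manual marker.
def covers_attribute(predicted, gold):
    """Entity/attribute match: partial overlap of the two strings."""
    return predicted in gold or gold in predicted

def covers_value(predicted, gold):
    """Attribute-value match: the prediction must contain the marker."""
    return gold in predicted

print(covers_attribute("利润率", "利润率增速"))                    # True
print(covers_value("将同比上升约2.5个百分点", "同比上升约2.5个百分点"))  # True
```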
