1. Introduction
Abstract contextual embeddings, such as WordNet synsets and distributed representations [
1], have proved useful for a number of natural language tasks. By providing a mapping between WordNet synsets and formal ontology concepts, we can expect to extend traditional natural language tasks, such as sentiment analysis or topic classification, to the domains of a particular formal ontology.
This paper presents a new method for upper-ontology-based semantic parsing using FrameNet [
2], WordNet [
3] and PropBank [
4] parsers. These parsers are based on sentence context distributed representations, and a system that integrates them into a single framework is proposed in this paper. According to the approach based on the automatic labeling of semantic roles [
5], a semantic parsing can be represented as a task of labeling sentence constituents with abstract semantic roles, such as
Agent,
Speaker,
Message, etc. Much of the work on semantic parsing is related to the disambiguation of sentence words and labeling the roles of predicates. This paper focuses on both approaches. The proposed system is used to disambiguate frame targets and WordNet synsets, then to identify frame roles and headwords in sentence constituencies that define these roles, and finally, a process is implemented that computes the identification of upper ontology concepts. This process selects only a few of the most abstract concepts that relate to physics engines in 3D graphics systems.
In order to understand this approach by example, consider the book The Hound of the Baskervilles and two sentences from the beginning and one from the middle of Chapter 1.
“Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night, was seated at the breakfast table. I stood upon the hearth-rug and picked up the stick which our visitor had left behind him the night before.
…
He had risen and paced the room as he spoke.”
Natural language processing for text comprehension in context requires more than the classification of documents or word sense disambiguation. To understand these sentences, a semantic parsing system must include inference over a general ontology for analyzing successive utterances, as well as background knowledge for the identification of objects and pragmatic interpretation. In
Figure 1, we can see several output results from our deep semantic parsing system after it has parsed the whole book. It is worth noting that
Figure 1 shows only those predicate-role tuples that are relevant for the text-to-3D scene generation task [
6].
Creating 3D scenes using only natural language can be a challenging task. Animating 3D scenes using natural language adds an additional level of complexity. However, over the past 50 years, starting with the pioneering work on the SHRDLU system [
7], there have been many systems that attempt to manipulate computer graphics objects using natural language (see [
8] for a review of 26 systems). Many of these systems accept a few sentences as input and try to identify the physical objects that are relevant to the 3D scene. Systems such as SceneSeer [
9] leverage spatial knowledge priors to infer implicit constraints and resolve spatial relations.
This paper takes a rather different approach. It explicitly focuses on fiction books as the input to the system. There are two reasons for this emphasis. First, fiction presents a complex world and rich semantic information that can be understood only by analyzing long-distance relations between text phrases and by logical inference about salient objects in the scene. The sentences above serve as an example for our extended coreference resolution algorithm, which is able to identify that the word “I” in the sentence “I stood upon the hearth-rug…” refers to a physical object named “Dr. Watson” (more details are provided in the sections below).
The second reason for us to focus on fiction was the gamification of the annotation process. It is much more fun to interact with a program about the meaning of the text in your favorite book than annotating long boring documents drafted by some administrative office.
Figure 2 presents a conceptual model of the framework, and this paper focuses on a natural language processing component called SUMO SRL.
Early projects in text-based interfaces for 3D scene generation have addressed scenarios in which the input is a sentence and the output is a few geometric shapes. More recently, the use of machine learning has made it possible to generate more complex 3D scenes, but the input has remained close to a single sentence. The input in our system is more complex than a few sentences of text, but the system output is simplified even more than in earlier systems of text-to-3D generation.
Figure 2 shows that the 3D representation of the world in our framework is just a grid of cells. Each cell can hold a number of physical objects, the spatial properties of which are determined by the coordinates of the cell. This representation is not as simple as it looks; the design was inspired by several game development projects using the Unity 3D game engine [
10]. In Unity 3D, game objects can have collider components for the detection of physical collisions. The simplest colliders have primitive geometric types such as a box or a sphere. Thus, from the point of view of the physics engine, a visually very complex 3D world can be merely a collection of boxes. In the proposed natural language processing framework, the 3D world view is a set of collision boxes labeled with the name and type of the object.
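As a rough illustration, this grid-of-cells world view of labeled collision boxes might be modeled as follows; all names here are illustrative stand-ins, not the framework's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class CollisionBox:
    """A physical object as the physics engine sees it: a labeled box."""
    name: str                          # e.g. "table"
    obj_type: str                      # e.g. "Furniture" (a SUMO-style label)
    size: tuple = (1.0, 1.0, 1.0)      # dimensions of the wrapping box
    stationary: bool = True
    velocity: tuple = (0.0, 0.0, 0.0)  # only relevant if not stationary

@dataclass
class Cell:
    """One grid cell; its coordinates fix the spatial properties of its objects."""
    x: int
    y: int
    objects: list = field(default_factory=list)

class GridWorld:
    def __init__(self, width: int, height: int):
        self.cells = {(x, y): Cell(x, y)
                      for x in range(width) for y in range(height)}

    def place(self, x: int, y: int, box: CollisionBox) -> None:
        self.cells[(x, y)].objects.append(box)

world = GridWorld(4, 4)
world.place(1, 2, CollisionBox("table", "Furniture"))
world.place(1, 2, CollisionBox("Sherlock Holmes", "Human"))
```

From the physics engine's point of view, the visually complex scene reduces to these labeled boxes and their cell coordinates.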
Most natural language processing systems to date annotate sentences with semantic labels and do not consider the wider use of these annotations. The use of SUMO ontology concepts as semantic labels makes it possible to use the axioms of ontology for deeper semantic reasoning about context in the text.
Figure 2 shows that the role of the reasoning engine for the SUMO SRL label set is the Drools inference system [
11,
12]. Labels from the SUMO SRL natural language component are sent to the Drools inference system as JAVA objects. Additional inputs to Drools are the concepts and axioms of the SUMO upper ontology [
13], and the output is a set of actions telling the physics engine how to instantiate objects in the 3D world (more details about this framework’s implementation can be found on the project page). As mentioned above, this paper focuses on the SUMO SRL component, but in order to understand the meaning of the labels produced by this semantic parsing component, it is important to understand the whole text-to-3D processing framework.
The rest of this paper is organized as follows:
Section 2 describes the natural language processing framework that was used to identify the physical objects and processes described in the fiction books. The main novelty in this section is the proposed system architecture, which shows how to reuse existing off-the-shelf solutions and integrate them with the components developed during this research project.
Section 3 describes the various components of the tokenization of chapters, dialogues and paragraphs. Herein, we propose a simple probabilistic model for identifying the structural parts of a book.
Section 4 describes the various components of the FrameNet frame and role identification process. Here, we propose a novel approach to reuse PropBank annotations and propose an algorithm to augment those annotations with FrameNet and WordNet labels.
Section 5 presents an evaluation of the SUMO SRL method, and our final thoughts are given in the discussion section.
3. Tokenization of Chapters, Dialogues and Paragraphs
A natural language interface with a text-to-3D purpose must disambiguate object descriptions based on the scene layout and capture the semantics associated with the scene’s spatio-temporal constraints. Parsing the input textual description of a scene begins by identifying a scene template that contains a set of 3D objects and a set of constraints between these objects. Usually, the input for a text-to-3D parsing system is just a few sentences.
Here the system models a scene template in a way that is similar to the SceneSeer [
6,
9] approach, but the input to the system is an entire book that can contain many scenes. The identification of a scene change is a challenging task; at this stage of the research, the SUMO SRL system tries to identify scenes by analyzing the beginnings of chapters, the flow of dialogue and the starting positions of new paragraphs.
The main outcome of this section is simple algorithms for identifying chapters, paragraphs and dialogues. All of these algorithms use the maximum entropy approach and have the same structure. They take the text of a book as an input and learn to identify tokens that mark the beginning of a chapter, dialogue or paragraph in the text. These algorithms integrate techniques adapted from our previous work [
18] on text filtering and segment labeling in a three-step process.
First, the system tries to identify the text segments that mark possible chapter headings using regular expressions (see
Table 1). It employs a novel text segment scoring technique to efficiently find the best regular expression that gives the highest probability for all matched segments to be used as the chapter heading.
The formal definition of a chapter heading model begins with a set of variables C, Ri and Mi. Let i be the index of the regular expression Ri, and let Mi be the result of matching this regular expression against the text. The variable C is a boolean that marks the true segmentation of the text into headings and chapter bodies. The conditional probability

P(C|Ri, Mi, T) (1)

can be defined as the probability that Ri will be the true model for chapter headings in the book text T. There are several frameworks to obtain (1), but this paper uses the maximum entropy approach

P(C|Ri, Mi, T) = (1/Z) exp(Σj λj fj(C, Ri, Mi, T)), (2)

where Z is a normalizing constant, λj are learned weights and fj are the feature functions that model the relevant information about chapter heading segmentation. Then, the system makes a decision about Mi by choosing the match with the maximum probability value

Mi* = argmaxi P(C|Ri, Mi, T). (3)
The following list defines the set of feature functions used in equation 2 to segment chapter headings:
The scoring of a regular expression. As a scoring function, the system uses the a priori probability assigned to each Ri as a meta-parameter.
The number of regular expressions that give the same match result as the current one. This feature function aims to give a higher score when several regular expressions produce the same match.
The feature function returns 1 if there is a sequential numbering in chapter headings.
The feature function returns the value of the normal distribution of the length of the match result.
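The heading-model selection described by these feature functions might be sketched as follows; the patterns, priors, weights and length statistics are illustrative stand-ins for the real Table 1 regular expressions and the trained maximum entropy model:

```python
import math
import re

# Candidate heading patterns and a priori scores: illustrative stand-ins
# for the regular expressions of Table 1 and their meta-parameters.
PATTERNS = [r"^Chapter [IVXLC]+\.?$", r"^CHAPTER \d+\.?$", r"^\d+\.$"]
PRIORS = [0.5, 0.3, 0.2]

def features(i, text):
    """Feature functions f_j for regular expression R_i on book text T."""
    matches = re.findall(PATTERNS[i], text, re.MULTILINE)
    # How many patterns produce exactly the same match result.
    agree = sum(re.findall(p, text, re.MULTILINE) == matches for p in PATTERNS)
    mean_len = sum(map(len, matches)) / len(matches) if matches else 0.0
    # Gaussian score for heading length (mean 10, std 5: illustrative values).
    length_score = math.exp(-((mean_len - 10.0) ** 2) / (2 * 5.0 ** 2))
    return [PRIORS[i], float(agree), length_score]

def best_pattern(text, weights=(1.0, 0.5, 1.0)):
    """Maximum-entropy-style decision: normalized exp of weighted features."""
    scores = [math.exp(sum(w * f for w, f in zip(weights, features(i, text))))
              for i in range(len(PATTERNS))]
    z = sum(scores)                      # normalizer, so scores sum to one
    probs = [s / z for s in scores]
    return max(range(len(PATTERNS)), key=probs.__getitem__)

book = "CHAPTER 1.\nMr. Sherlock Holmes\n...\nCHAPTER 2.\nThe Curse\n..."
print(best_pattern(book))   # index of the CHAPTER-digit pattern
```

Here the second pattern wins because it matches both headings with a plausible heading length, while the other patterns match nothing.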
Then, the system uses a subset of regular expressions to find the best result for paragraph segmentation. The model is the same as for chapter segmentation; the only difference is in a different set of regular expressions.
After that, for each paragraph with the best score, the system labels each sentence as a dialogue or narrator. Again, it uses a set of regular expressions to identify if a sentence in a paragraph belongs to the dialogue or the narrator.
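The dialogue/narrator labeling step might be sketched as below; the single pattern is an illustrative stand-in for the actual set of regular expressions:

```python
import re

# Illustrative heuristic: a sentence that opens with a quotation mark
# (curly or straight) is treated as dialogue.
DIALOGUE_RE = re.compile(r'^\s*[“"]')

def label_sentence(sentence: str) -> str:
    """Label a sentence of a paragraph as 'dialogue' or 'narrator'."""
    return "dialogue" if DIALOGUE_RE.search(sentence) else "narrator"

print(label_sentence('“Well, Watson, what do you make of it?”'))   # dialogue
print(label_sentence("He had risen and paced the room as he spoke."))  # narrator
```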
This research project evaluated several different natural language processing frameworks and found that none of them has a subsystem for identifying chapters, dialogues or paragraphs. The simple probabilistic model proposed here can provide an important starting point for this topic.
4. Integration of PropBank, FrameNet and WordNet
PropBank and FrameNet are popular resources for semantic role labeling. The PropBank corpus has verbs annotated with sense frames, and it encodes semantic information about the verbs in the form of the possible semantic roles each frame can take. In this project, a rather simplified approach, based on the AllenNLP semantic role labeling system, is used, i.e., verbs are not sense-labeled at all, and all semantic information is conveyed in the role labels. Let us take the first sentence from
Figure 1 and look at the output of the AllenNLP semantic role labeling system for the two verbs “
was” and “
seated”.
[Mr. Sherlock Holmes](ARG1), [who](R-ARG1) <was> [usually](ARGM-TMP) [very late in the mornings](ARG2), [save upon those not infrequent occasions when he was up all night](ARGM-ADV), was seated at the breakfast table.
[Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night](ARG1), was <seated> [at the breakfast table](ARG2).
From the first set of PropBank-style annotations, the only useful information for our system is the fact that there is a physical object named “Sherlock Holmes” of type “PERSON”. This is because our main goal is to identify words that mark physical objects in natural language sentences and use them to interact with the physics engine in a virtual environment. The SUMO SRL parsing system is not interested in any concept outside the domain of the physics engine. Thus, the system only needs to know a few parameters for each concept that it identifies as belonging to the 3D world domain: (1) the size of the box that can be wrapped around the object; (2) the coordinates of the center of this box; (3) whether the box is stationary and, if not, its initial velocity.
From the second set of PropBank annotations, the system needs to know that there is a second physical object named “table”, and the physical object “Sherlock Holmes” is located next to the “table” object. In addition, it needs to know that the <seated> predicate implies that the object “Sherlock Holmes” is not moving. The last requirement that the system must be able to implement is the ability to infer that (1) the physical object “table” implies, by default, the existence of the physical object “room”; (2) the physical object “room” is our scene, on which the physics engine acts as one of the components of the virtual environment.
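The default-existence requirement in point (1) might be sketched as a simple fixed-point expansion; the rule table below is a hypothetical stand-in, since the actual framework delegates this kind of inference to Drools and the SUMO axioms:

```python
# Hypothetical default rules: each object maps to the objects whose
# existence it implies by default (e.g. a table implies a room).
DEFAULT_IMPLIES = {
    "table": ["room"],
    "room": [],   # the room itself is the scene
}

def expand_scene(objects):
    """Add default-implied objects until a fixed point is reached."""
    scene = set(objects)
    changed = True
    while changed:
        changed = False
        for obj in list(scene):
            for implied in DEFAULT_IMPLIES.get(obj, []):
                if implied not in scene:
                    scene.add(implied)
                    changed = True
    return scene

print(sorted(expand_scene(["Sherlock Holmes", "table"])))
# ['Sherlock Holmes', 'room', 'table']
```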
Obviously, it is not possible to implement all of these functional requirements from PropBank annotations, and additional annotations are required for this. During this research project, the research team tried to explore various linguistic resources to achieve this goal. The solution presented in this section is based on the use of FrameNet and WordNet systems.
The Stanford named entities recognition system can identify that there is an object “Sherlock Holmes” of the type “PERSON”. Using the Stanford dependencies parser it is possible to determine that “Holmes” is the headword in a phrase tagged with the PropBank tag (ARG1), and “table” is the headword in a phrase with the tag (ARG2). Stanford parsers are off-the-shelf components that can be integrated into an existing natural language processing pipeline simply by using the command line in Linux or Windows operating systems. However, the question remains, is the object “table” a physical object, and if so, what are the spatial relationships between it and the rest of the scene objects?
Figure 1 shows that the proposed NLP system can complement the PropBank tag (ARG1) with the FrameNet tag (Agent) and the PropBank tag (ARG2) with the FrameNet tag (Location). In addition,
Figure 1 shows that the verb “
seated” is tagged with the FrameNet tag (Posture). The system can get this rich semantic information if it can get a FrameNet parser with a low error rate. There are several FrameNet parsers available as open source projects, but the project team was able to compile, test and integrate with other components of our system only the SEMAFOR [
19] parser. However, it found that SEMAFOR did not meet the recall and precision requirements for our project. For example, SEMAFOR could not parse the word “was” (the verb “
be” is not defined in FrameNet) in all sentences of the book and could not identify “
Sherlock Holmes” as the “Agent” for the predicate “
seated”.
In most research projects, the problem of frame semantic parsing is modeled in two stages: frame identification and argument identification. In FrameNet, frame identification is simply the disambiguation of frame words. For example, the verb “
stood” in the second sentence in
Figure 1 has five frames (
Posture;
Placing;
Change_posture;
Being_located; Occupy_rank;), and the FrameNet parser must decide which one to choose. The second stage is more difficult than the first, because in the argument identification process one must take into account all possible phrases in the sentence and all possible role labels in the frame. The novelty of this paper lies in both stages: the use of BERT embeddings at the frame identification stage and a PropBank augmentation statistical model at the argument identification stage.
The following describes an algorithm that uses BERT contextual embeddings as inputs and gives a probability distribution over possible semantic frames. A contextual word embedding is a distributed representation of semantics in which each word is represented as a vector in R^1024. For example, consider our example sentence “
I stood upon the hearth-rug and…”.
Table 2 shows some statistics for the verb “
stood” in the FrameNet corpus.
It is possible to represent the context of “
stood” as a 1024-dimensional vector using a language representation model called Bidirectional Encoder Representations from Transformers (we use BERT
LARGE (
L = 24, H = 1024, A = 16)) [
20]. The first stage of our algorithm: for all 237,161 sentences in the FrameNet corpus, the system extracts a 1024-dimensional vector for each word that targets a frame, and these vectors are stored in a database for later use at the inference stage.
During the inference process (see
Figure 6), the new predicate verb is mapped to the same 1024-dimensional space using BERT
LARGE. The process then selects and loads all the vectors from the FrameNet sample database, where each sample sentence has the same verb as our new verb. If we continue with our example sentence “
I stood upon the hearth-rug and…”, then it means that the system loads all 10 + 2 + 8 + 14 + 85 = 119 vectors for five frames, as is shown in
Table 2. At the last stage, the system uses the k-nearest neighbors algorithm to classify our new verb. These 119 vectors serve as training examples for k-NN.
An important note concerns how the system chooses the parameter k in the k-NN algorithm. It does so using the following simple steps: (1) it selects all frames that are triggered by the new verb (five frames in
Table 2 in our example for the verb “
stood”); (2) then, from the selected frames it chooses the frame with the least number of samples, and marks this number of samples as m (this will be the frame
Change_posture with a sample size of two, i.e., m = 2); (3) our parameter k will be
k = m.
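Under these assumptions, the inference stage might be sketched as follows; toy two-dimensional vectors stand in for the stored 1024-dimensional BERT vectors, and the distance metric is an assumption:

```python
from collections import Counter
import numpy as np

def identify_frame(query_vec, db_vecs, db_frames):
    """k-NN frame identification with k = m, the smallest per-frame sample count.

    db_vecs: (N, d) array of stored vectors for sentences whose target verb
    matches the query verb; db_frames: their frame labels.
    """
    counts = Counter(db_frames)
    k = min(counts.values())                 # k = m, per the rule above
    dists = np.linalg.norm(db_vecs - query_vec, axis=1)
    nearest = [db_frames[i] for i in np.argsort(dists)[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Toy stand-ins: three "Posture" examples near the origin, two
# "Being_located" examples far away, so k = min(3, 2) = 2.
db_vecs = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
                    [9.9, 10.0], [10.0, 9.9]])
db_frames = ["Posture", "Posture", "Posture",
             "Being_located", "Being_located"]
print(identify_frame(np.array([0.05, 0.05]), db_vecs, db_frames))  # Posture
```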
The following describes the argument identification model that is used in frame-semantic parsing. It is assumed that the system already disambiguated the frame predicate (e.g., <Posture> in
Figure 6 for the word “
stood”). The novelty of this method lies in the fact that it is simple but at the same time provides state-of-the-art accuracy. The standard approach to identifying roles is to start by selecting a set of semantic roles from the frame lexicon and to supplement it with a null role. Then it is necessary to consider the set of spans that can potentially fill a semantic role.
The complexity of the task of assigning semantic roles can be estimated using the following argument. (1) If there is a sentence of n words, then there are 2^(n−1) possible segmentations; this is easy to see if you interpret the existence of a boundary between any two adjacent words as 1 and its non-existence as 0. (2) Take any segmentation and consider how many ways role labels can be assigned to its segments; to simplify the argument, an upper bound is the number of ways to assign m role labels to n words, which is m^n. (3) Then, there are O(2^(n−1) · m^n) possible role assignments.
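The bound on the number of role assignments can be made concrete with a short computation:

```python
def role_assignment_bound(n: int, m: int) -> int:
    """Upper bound O(2^(n-1) * m^n): segmentations times labelings for a
    sentence of n words and m candidate role labels (including null)."""
    return 2 ** (n - 1) * m ** n

# Even short sentences explode combinatorially:
print(role_assignment_bound(5, 3))    # 16 * 243 = 3888
print(role_assignment_bound(10, 5))   # 512 * 9765625 = 5000000000
```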
Many existing methods [
19] put hard constraints on the choice of possible spans to narrow down the set of possible role assignments, and then use ILP solvers to perform the final inference.
To understand the novelty of our approach, let us look at the data in
Table 3. This table shows the results after parsing the entire FrameNet corpus using the AllenNLP system, which assigns PropBank-style labels to sentence phrases. This table shows that the probability for a sentence phrase to have a FrameNet label “
Agent” given the AllenNLP tag
ARG1 is one (
P(Agent|ARG1) = 1). The same can be said for the argument
ARG2 (
P(Location |ARG2) = 1). In these cases, there is no need for any state space search algorithm, and the system can use a pattern-matching approach. In the case in which there is a need to disambiguate the AllenNLP labels, the system runs the following simple algorithm: (1) it obtains the embedding vector for the first word in the role phrase using the BERT
LARGE model; (2) then it uses the 1-NN algorithm to find the closest match between this vector and all the vectors from the FrameNet corpus that it parsed with BERT
LARGE; (3) a label from the closest vector is assigned to the new vector.
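A compressed sketch of this two-branch argument labeling, with a toy probability table in place of the Table 3 statistics and low-dimensional stand-ins for the BERT vectors:

```python
import numpy as np

def label_argument(pb_tag, first_word_vec, prob_table, db_vecs, db_labels):
    """FrameNet role for a phrase: pattern matching when the PropBank tag
    is unambiguous, otherwise 1-NN over stored first-word vectors."""
    for role, p in prob_table.get(pb_tag, {}).items():
        if p == 1.0:                 # unambiguous, e.g. P(Agent|ARG1) = 1
            return role
    # Ambiguous case: nearest stored vector decides the label.
    i = int(np.argmin(np.linalg.norm(db_vecs - first_word_vec, axis=1)))
    return db_labels[i]

# Toy statistics and vectors (illustrative, not the real Table 3 counts).
prob_table = {"ARG1": {"Agent": 1.0}, "ARG0": {"Agent": 0.7, "Speaker": 0.3}}
db_vecs = np.array([[0.0, 0.0], [5.0, 5.0]])
db_labels = ["Agent", "Speaker"]
print(label_argument("ARG1", None, prob_table, db_vecs, db_labels))
print(label_argument("ARG0", np.array([4.8, 5.1]), prob_table, db_vecs, db_labels))
```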
Figure 1 shows that there is one result in the first sentence that was not explained. The phrase “
at the breakfast table” is annotated by our PropBank-FrameNet parser as <
Location>, but the system needs to know what physical object can be used in the scene to represent this <
Location>. Using the Stanford dependency parser, the system can identify the headword “
table” and use it as the main word for an object in the scene. However, the question remains whether the word “
table” is a physical or abstract object and what properties can be used to describe the semantics for this object.
Word sense disambiguation (WSD) is an important task in the natural language processing pipeline; it assigns the correct sense to a word in a given context. Next, a method for generating upper ontology concept embeddings with full coverage of WordNet is presented. This method is very similar to the frame identification method proposed above. The method consists of five basic steps:
For the training dataset the system collected six standard WSD datasets: SemCor [
21], Senseval-2 [
22], Senseval-3 [
23], SemEval-07 [
24], SemEval-13 [
25] and SemEval-15 [
26].
The system extracts all sense descriptions from the WordNet database and adds them to the dataset that was created in step 1.
The SUMO ontology has a mapping to WordNet synsets. The system uses this mapping to annotate the WordNet corpus with SUMO ontology concept labels.
The system uses a mapping from FrameNet to the SUMO ontology to annotate the most frequent FrameNet frames and lexical units with SUMO concept labels.
The procedure for finding the embeddings of SUMO concepts is the same as for identifying frames.
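Taken together, the five steps amount to relabeling a WSD training corpus with SUMO concepts before computing the embeddings. A minimal sketch of the relabeling step (the mapping excerpt is illustrative, and “Entity” as the fallback concept is an assumption):

```python
# Tiny illustrative excerpt of the WordNet-sense -> SUMO-concept mapping;
# the two WordNet senses of "increase" collapse into one SUMO concept,
# and the concept names for "table" here are hypothetical.
WN_TO_SUMO = {
    "increase%2:30:00::": "Increasing",
    "increase%2:30:02::": "Increasing",
    "table%1:14:00::": "GroupOfPeople",
    "table%1:06:01::": "Table",
}

def relabel(corpus):
    """Replace WordNet sense labels with SUMO concept labels (step 3)."""
    return [(tokens, WN_TO_SUMO.get(sense, "Entity"))
            for tokens, sense in corpus]

corpus = [
    (["prices", "continued", "to", "increase"], "increase%2:30:00::"),
    (["taxes", "were", "increased"], "increase%2:30:02::"),
]
print({label for _, label in relabel(corpus)})   # {'Increasing'}
```

After relabeling, two WordNet senses that a classifier might confuse become a single class, which is exactly why the SUMO-labeled experiments below can only improve the F1 score.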
Figure 7 shows a small experiment that was conducted to demonstrate our approach for word sense disambiguation. Let us return to the first sentence of our example presented in
Figure 1 and analyze the word “
table”. The system selects from the training corpus all sentences with the word “
table” and calculates the embeddings for each instance of the word “
table”. It uses NLTK WordNet sense labels as the class label. Then it computes principal component analysis for 90% of the sentences from the training data and uses the rest as test data.
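The PCA projection used in this experiment can be sketched with a plain SVD; the synthetic clusters below stand in for the real 1024-dimensional BERT embeddings of “table” in context:

```python
import numpy as np

def pca_2d(vectors):
    """Project embedding vectors onto their first two principal components."""
    X = vectors - vectors.mean(axis=0)
    # SVD of the centered data: principal directions are the rows of vt.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:2].T

# Synthetic stand-ins for contextual embeddings of two senses of "table".
rng = np.random.default_rng(0)
furniture = rng.normal(0.0, 0.1, (20, 8))    # cluster for table.n.02
data_table = rng.normal(1.0, 0.1, (20, 8))   # cluster for table.n.01
proj = pca_2d(np.vstack([furniture, data_table]))
print(proj.shape)   # (40, 2)
```

With well-separated senses, the two clusters end up on opposite sides of the first principal component, which is the effect visible in Figure 7a.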
Figure 7a shows all embedded examples for the word “
table” projected on two principal components. It can be seen that there are two clusters of data points. One (brown), marked with the label “
table.n.01”, represents a sense with the description: “
a set of data arranged in rows and columns”. The second largest group of senses is labeled “table.n.02” with the description: “
a piece of furniture having a smooth flat top that is usually supported by one or more vertical legs”. These two senses can be perfectly separated, and what is more, the system does not need to store all 1024-dimensional embedding vectors in a database. It is enough to keep two principal components of the embedding space in order to have a perfect classification of these senses.
There is one more important conclusion that can be drawn from
Figure 7a using the WordNet dictionary—using hypernym relations, one can conclude that the meaning of “table.n.05” (red square) is an abstract thing. The description of this sense is as follows: “
a group of persons together in one place”, and the direct hypernym is “
social_group%1:14:00::”. Now let us observe which sentence is encoded with this red square: “
I felt the temblor begin and glanced at the table next to mine, smiled that guilty smile and we both mouthed the words, “Earth-quake!” together.” If the system wants to recreate this sentence as a 3D scene, then the word “
table” would represent a set of objects: people and a table as furniture. On the other hand, WordNet will suggest interpreting it as an abstract object and ignoring it for the scene generation task. There are two important lessons to be drawn from these examples. First, the system needs to use caution when using WordNet for a text-to-3D task, and second, the WSD error from the WordNet point of view does not mean an error in terms of a text-to-3D task.
Figure 7b shows another interesting point about the semantic meaning of word embeddings in context. It is shown here that the greatest separation between data points occurs when the system considers things to be divided into abstract and physical categories.
Figure 7c,d show embeddings of the word “
run”.
Figure 7c shows all senses from the WordNet corpus. It shows that in this case, the disambiguation is more difficult than for the word “
table”. On the other hand, if the system regroups senses for the word “
run” into two groups “
Motion” and “
NoMotion”, then the disambiguation of senses becomes more accurate.
5. Results
The harmonic mean between precision and recall (F1) is used in all experiments because it is the most commonly used metric in the WSD and SRL literature to measure classification results. The dataset consists of several well-known datasets plus the dataset that was created during this project. The following list shows all the experimental datasets:
Table 4 shows the results of our WordNet sense disambiguation system (BERT
1024 k-NN) for these datasets. The semantic analysis method presented in this paper parses only nouns and verbs, so the results are reported per part of speech. As benchmarks for this experiment, MFS [
27] and LMMS2348 (BERT) [
28] were used. LMMS2348 is similar to the method proposed in this paper, but it uses additional embedding vectors. A smaller vector size of the BERT embeddings was also used (BERT-Medium512 k-NN).
From this experiment, one can observe that the book dataset shows a higher F1 score than the Senseval dataset. This can be explained by the fact that the Senseval dataset is more diverse and was created to test word sense disambiguation on a large set of different words. The book dataset has a smaller vocabulary, which may explain some of the difference in the F1 score.
Another interesting conclusion from this experiment can be drawn by noting two facts: firstly, all embedding methods outperform the MFS approach by a significant margin, and secondly, the difference between embedding methods is within a few percentage points. This explains why the system chose BERT1024 k-NN as the final step implemented in the natural language processing pipeline. Even if the LMMS2348 (BERT) method performs slightly better, it takes up more than twice the memory space and more than quadruples the processing time in our implementation. WSD has a small part in this text-to-3D framework, and the system has to consider the performance of each component to obtain reasonable processing resources for the whole framework.
There is another reason why a difference of a few percentage points in the F1 score is not an important argument in favor of a more complex model. WordNet has over 100,000 synsets, and the system needs far fewer concepts to create a few simple objects in a 3D environment. An analysis of several word disambiguation results in which the system made a mistake showed that these results are not errors from the point of view of the upper ontology. To be more precise, the project team investigated the verb “
increase” when the system made a mistake and the MFS system disambiguated correctly. There are two senses of the verb “
increase” in the WordNet dictionary:
(1) “increase%2:30:00::”—(become bigger or greater in amount) (2) “increase%2:30:02::”—(make bigger or more). However, if one looks at the word “increase” in the SUMO ontology, one will find that both senses correspond to the concept of “increasing”. This suggests that for some natural language understanding tasks, upper ontologies may be more appropriate than fine-grained semantic dictionaries such as that of WordNet.
Table 5 shows the results of the experiment that was conducted to test this hypothesis.
The SUMO upper ontology is indexed with WordNet senses, and the system can use this index in all word sense disambiguation tasks by replacing WordNet senses with SUMO concepts. So, the project team took the dataset that was used in the experiment above (
Table 4) and replaced the WordNet senses with SUMO concept labels. The F1 score was expected to increase, and indeed
Table 5 shows a slight increase in the F1 score (columns “
Senseval All SUMO” and “
Book Dataset All SUMO”). The unexpected result of this experiment was that F1 increased by only a few percentage points, whereas a large margin was expected when the corpus is indexed using the concepts of the upper ontology. It was therefore a logical step to try the small subset of the most abstract SUMO concepts that is used to create objects in the text-to-3D task. The “
Senseval SUMO 3D” and “
Book Dataset SUMO 3D” columns show that even the MFS method can slightly improve the classification score because the system uses labels that combine most synsets into a few concepts.
A PropBank-style automatic and accurate shallow semantic parser can annotate text with a semantic argument structure, which can form the basis for additional semantic annotations. One of the novelties proposed in this paper is based on the observation that the AllenNLP semantic role labeling system, when presented with a sentence, is able to accurately identify each predicate in the sentence together with its semantic arguments, which our system then annotates with additional labels from the FrameNet system.
The following experiment demonstrates the usefulness of the PropBank role annotation approach for FrameNet-based SRL models. The hypothesis is that the PropBank semantic parser can considerably improve the structural detection of role spans, and the feature-based classifier can considerably improve the process of labeling the semantic roles of the FrameNet system.
Accordingly, this hypothesis is tested using the output of the SEMAFOR and SUMO SRL systems. The first system, SEMAFOR, is the baseline for the frame-semantic role labeling. Both systems receive as an input a random set of sentences from the book corpus and the FrameNet example corpus.
Table 6 summarizes our results with SUMO SRL and SEMAFOR using manually annotated frames and roles. The focus of this experiment is only on verbs as predicates. Compared to SEMAFOR, this table shows that SUMO SRL provides a gain of 12.5 points in F1 for the frame identification task and 5.1 points for the role identification task.
The last experiment that was conducted in this research project is related to the identification of physical objects in the scene.
So far, the results have been presented only for coreference resolution, which is treated as part of the process of identifying an object in a scene. Coreference resolution is the task of grouping mentions in the text that refer to the same underlying real-world object. Our baseline model is the end-to-end span-based neural model from the AllenNLP system, which implements the method described in [29].
Table 7 shows an improvement when the system uses the SUMO SRL heuristic method. This is not a big surprise, because SUMO SRL’s coreference resolution reuses the AllenNLP coreference method with different parameters. The parameters and heuristic rules are designed to resolve coreference errors in cases where the AllenNLP method fails. Nevertheless, this was an interesting experiment, because no other work has attempted to identify objects across an entire fiction book.
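As a simplified stand-in for such a heuristic layer (the actual rules and parameters differ), mentions can be merged into clusters by normalized string matching, so that “Mr. Sherlock Holmes” and “Sherlock Holmes” fall into one group:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CorefHeuristic {
    // Group mentions whose normalized strings match. This is a toy
    // illustration of a rule-based pass, not the system's actual rules,
    // which run on top of the AllenNLP neural model.
    public static Map<String, List<String>> cluster(List<String> mentions) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String m : mentions) {
            // Normalize: lowercase and strip a leading title or article.
            String key = m.toLowerCase().replaceAll("^(mr\\.|the)\\s+", "");
            clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(m);
        }
        return clusters;
    }
}
```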
6. Discussion
The following discusses the extent to which the SUMO SRL framework leverages semantic role labeling and the SUMO upper ontology for the semantic parsing of fiction. One of the important results of this research project is a dataset with annotations of frames and role assignments for the entire book The Hound of the Baskervilles. This dataset has been manually reviewed and edited. As far as we know, no previous attempt has been made to analyze an entire book and represent its content in a formal framework. We hope this dataset will encourage other researchers to focus more on this complex area of natural language processing.
The SUMO SRL system must understand the world around it in order to understand language. To do this, it uses common ontologies to explicitly express knowledge about the world. A writer cannot describe every detail of a scene, and our knowledge of general facts supplies these details for a better understanding of the language. Therefore, to develop useful systems that can interact with people about what is written in a book, one needs general ontologies to interpret language in context. The SUMO ontology is a collection of about 20,000 concepts linked into a logical theory with 70,000 axioms. The axioms are expressed in first-order logical form and impose constraints on the interpretation of the concepts. The SUMO upper ontology is one of the largest open-source ontologies, and experiments were performed in an attempt to answer the question of how useful upper ontologies are for semantic parsing.
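To illustrate the form of these axioms, the standard constraint linking the instance and subclass relations can be written schematically in first-order notation (simplified here from SUMO’s native SUO-KIF format) as:

```latex
\forall x \,\forall c_1 \,\forall c_2 :\;
\bigl( \mathit{instance}(x, c_1) \wedge \mathit{subclass}(c_1, c_2) \bigr)
\rightarrow \mathit{instance}(x, c_2)
```

Axioms of this kind let an inference engine conclude that an instance of a specific concept is also an instance of every more general concept above it in the hierarchy.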
The conclusion one can draw from our experiment using SUMO for semantic parsing is that an improvement of a few percentage points in accuracy is achievable when there is a requirement to identify physical objects in a scene. In addition, the experiment has shown that the SUMO axioms can be useful for identifying a scene from text. Another conclusion is that the number of axioms needed to better understand the content of the book must be much greater than what is currently available. These commonsense knowledge requirements can be addressed with a comprehensive upper ontology with a built-in inference engine. One well-known source of this type of knowledge is the Cyc system [30]. Cyc is a large knowledge base with a commonsense reasoning engine, but it is proprietary. SUMO is the best-known alternative to Cyc, and it is an open-source ontology. One of the goals of this project was to focus on the parsing of fiction and to introduce some gamification necessary for the further development of the SUMO ontology.
This paper introduced a new shallow semantic parser based on the PropBank semantic role annotation process. The idea was to test the hypothesis that it is sufficient to obtain role annotations from the PropBank parser and then annotate them with labels from the FrameNet, WordNet and SUMO systems in order to obtain the semantic information needed for the text-to-3D task. The research team developed a parser that augments PropBank labels with FrameNet, WordNet and SUMO labels. PropBank annotations are less domain-specific, and the label set is relatively small (in our corpus, over 90% of annotations are covered by only 12 labels). In contrast, the FrameNet label set exceeds 1000 labels and the WordNet label set exceeds 100,000. These labels allow us to refine PropBank annotations and prepare the parsing results for mapping natural language labels to a domain-specific ontology. The manually annotated corpus confirms that this is a sound approach.
It is important to notice that this approach allows one to filter out a significant part of the information that is not related to the text-to-3D task. If one takes the first sentence from the proposed example, the filtering process proceeds as follows: (1) the PropBank parser takes the verb phrase “was seated” and selects the word “seated” as the headword that marks the predicate under consideration; (2) the PropBank parser then labels the phrase “Mr. Sherlock Holmes, who was usually very late in the mornings, save upon those not infrequent occasions when he was up all night” as an argument of type <ARG1> and the phrase “at the breakfast table” as an argument of type <ARG2>, indicating that the system can compress two long phrases into two entities in a 3D scene.
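The compression step in this filtering process can be sketched as follows. The headword heuristic here (take the last token of the first capitalized run, else the last token of the span) is a naive stand-in for the system’s actual headword selection, shown only to make the span-to-entity reduction concrete:

```java
public class SceneCompressor {
    // Reduce a long PropBank argument span to a single entity name
    // for the 3D scene. Naive illustrative heuristic, not the
    // system's actual headword extraction.
    public static String toEntity(String argSpan) {
        String[] tokens = argSpan.replaceAll("[,.]", "").split("\\s+");
        String candidate = null;
        for (String t : tokens) {
            if (!t.isEmpty() && Character.isUpperCase(t.charAt(0))) {
                candidate = t; // extend the capitalized run: "Mr Sherlock Holmes"
            } else if (candidate != null) {
                return candidate; // run ended; return its last token
            }
        }
        return candidate != null ? candidate : tokens[tokens.length - 1];
    }
}
```

Applied to the example, the <ARG1> span collapses to “Holmes” and the <ARG2> span to “table”: two entities for the scene in place of two long phrases.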
The labeling process described so far gives us only shallow semantic information; that is, the system cannot make logical inferences using the axioms of the ontology or the logical relations expressed as its predicates. The SUMO ontology has a Sigma browser with built-in automated theorem-proving systems, in particular E and Vampire [31,32]. It is possible to use these theorem provers to reason about implicit knowledge in a scene, but in the proposed framework the project team decided to test the more programmer-friendly Drools system. The team transformed SUMO ontology concepts into Java classes and SUMO axioms into Drools rules.
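A fragment of this transformation might look as follows. The class hierarchy mirrors SUMO’s subclass relation (Entity, Physical and Artifact are real SUMO concepts; the intermediate classes between Physical and Artifact are omitted here), and the Drools rule shown in the comment is an illustrative sketch, not one of the system’s actual rules:

```java
public class SumoFacts {
    // SUMO concepts become Java fact classes; subclass becomes extends.
    public static class Entity { }
    public static class Physical extends Entity { }   // SUMO Physical
    public static class Artifact extends Physical { } // intermediate classes omitted

    /*
     * A SUMO axiom about artifacts could then become a Drools rule
     * over these facts (illustrative DRL; BoxShape is hypothetical):
     *
     * rule "box collision shape for artifacts"
     * when  $a : Artifact()
     * then  insert(new BoxShape($a));
     * end
     */
}
```

With this encoding, Java’s `instanceof` gives the subclass inference for free, and the Drools engine handles the rule-based part of the axioms.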
Many formal systems have been suggested to connect natural language with objects and movements in 3D scenes. The SHRDLU system was a starting point for this purpose. More recently, the Stanford Text2Scene system presented a text-to-3D scene generation solution that learns spatial knowledge from 3D scene data; its authors demonstrated that it is possible to infer unstated implicit constraints between objects in a room scene. This paper focuses on the rigid-body collision-detection subsystem of the 3D scene-modeling system’s physics engine. This subsystem simulates the motion of solid objects: it affects their position and orientation but does not deform them. The system uses box-like collision shapes, the simplest possible collision shape for an object, because our goal of building a natural language processing system that interacts with 3D modeling systems requires the physics engine to be simple.
The research team defined concept-to-physics generation as the task of taking semantic role labels that describe a scene in a book as an input, and generating a plausible 3D scene representation in Drools working memory in terms of Java objects as the output. More specifically, based on the labels from the NLP framework, the system instantiates objects in the Drools memory and then runs an inference process based on the Drools agenda-group workflow.
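Without the Drools runtime, the instantiation half of this task can be sketched in plain Java as follows; a list stands in for Drools working memory, the physical-object filter is a crude illustration, and the fixed box dimensions are placeholder values:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class ConceptToPhysics {
    // Minimal scene fact: a SUMO-labeled object with a box collision shape.
    static class SceneObject {
        final String sumoConcept;
        final double[] boxHalfExtents; // box-like collision shape only
        SceneObject(String sumoConcept, double[] halfExtents) {
            this.sumoConcept = sumoConcept;
            this.boxHalfExtents = halfExtents;
        }
    }

    // One fact per physical-object label; process labels are skipped.
    // In the real framework these facts would be inserted into Drools
    // working memory before firing the agenda groups.
    public static List<SceneObject> instantiate(Map<String, String> entityToConcept) {
        List<SceneObject> memory = new ArrayList<>();
        for (Map.Entry<String, String> e : entityToConcept.entrySet()) {
            if (!e.getValue().equals("Process")) { // crude physical-object filter
                memory.add(new SceneObject(e.getValue(), new double[]{0.5, 0.5, 0.5}));
            }
        }
        return memory;
    }
}
```

The inference pass that follows (positioning objects, resolving collisions) is what the Drools agenda-group workflow performs over these facts.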
The presented experiments showed that, for WordNet sense disambiguation, the system is on par with the prior state of the art for verbs and nouns. Moreover, they showed that replacing WordNet synsets with a small set of upper-ontology concepts improves the accuracy of predicate identification. It may also be possible to improve the performance of the word embeddings using the noisy text [33] approach; this could be an interesting direction for future research on word embeddings. The proposed method focuses on embeddings of the English language, but it allows the extension of semantic parsing techniques to a multilingual domain using methods such as the universal semantic dictionary [34]. Second, we presented a new approach to the problem of identifying FrameNet roles, showing that performance can be improved and the task simplified by transforming FrameNet role identification into PropBank role labeling. Finally, we completed the task of identifying the objects in a scene using the upper ontology. This task cannot be compared with existing benchmarks, as it is new, but this study has shown that high accuracy can be achieved when the output is compared to the manually labeled data created during this project. The reported results will serve as a benchmark for future research projects.