1. Introduction
One of the main research goals in computational theories of discourse structure has been to reveal whether there are specific discourse coherence patterns followed by speakers in single-author texts, monologues and dialogues that could be exploited for building NLP applications ([1,2,3,4,5,6]). Being able to answer this question also makes it possible to address another question of greater theoretical depth and importance, namely how coherence and cohesion are realized in the discourse. Exploring discourse relations remains a topic of growing interest, and several rather sophisticated approaches have recently been proposed [7,8,9,10,11,12].
The current trend in attempting to answer this and similar questions in NLP in general has been to train Transformer models ([13,14,15]) and construct so-called probing tasks to indirectly obtain a deeper understanding of discourse coherence and cohesion. However, it still does not seem feasible to extract interpretable semantic properties of the discourse out of these deep learning models, and one can still use some valuable lessons from the formal discourse semantic tradition in order to achieve this ([16,17,18,19,20,21,22,23,24,25]).
In this paper, we focus on three different types of recorded discourse-annotated data spanning two languages, English and Greek: (a) the single-author written texts of two daily newspapers with large circulation in Greece that were included in the corpus C58 ([26]) and (b) the multiparty dialogue texts of the STAC corpus, annotated in two stages, resulting in two discourse-annotated versions: (i) the first version includes chat logs (or chat moves) recorded in a virtual environment during an online game session ([27,28,29]), and (ii) the second version includes the same multiparty dialogues accompanied by the messages automatically generated by the game software. These messages describe the nonlinguistic events that take place during a game session and thus situate the linguistic utterances in the broader nonlinguistic context of usage.
The availability of discourse-annotated corpora allows us to use quantitative methods to pin down the essence of discourse coherence and cohesion but also to profit from the numerous tools provided by the mathematical field of network analysis ([
30,
31,
32]). Existing formal discourse theories have been providing the formal means to construct discourse representations that can be mapped to networks that reflect the information flow of discourse and reveal hidden properties of the discourse. Network edges play a crucial role in a discourse dialogue network. They represent the connections or relationships between different discourse units, such as utterances or events, within the network. The importance of network edges lies in their ability to capture the flow and dynamics of discourse interactions.
Most notably, network edges provide a pathway for the exchange of information and meaning between discourse units. They indicate how different units are connected and how information is transmitted within the discourse. By following the network edges, one can trace the progression of ideas, arguments, or conversations in the discourse dialogue.
The discovery of significant network patterns in the discourse representations of the corpora would suggest that the speakers engage in discourse by following implicit strategies, and this might mean that speech acts are related in predictable ways. Moreover, traceable differences and commonalities in the network patterns of single-author written texts and multiparty dialogue texts, as well as between different languages, would help formulate theoretically driven corollaries related to different discourse types.
Based on two corpora and the three datasets that accompany them, the network analysis that we provide gives important and counter-intuitive insights into speakers’ preferences in constructing and interpreting discourse structure.
Section 2 describes the principles behind the types of discourse units and discourse relations adopted in the two corpora, C58 and STAC.
Section 3 offers a brief overview of the compilation and annotation process of C58 and STAC.
Section 4 provides a detailed analysis of (a) the mapping between discourse representations and networks, (b) the profile of discourse representations through key network indices and (c) the presence of motifs and antimotifs for three-node subgraphs for the discourse networks of all three datasets. The last section,
Section 6, sums up the findings of this study and offers a series of theoretically driven reflections related to the types of restrictions on discourse inference and interpretation imposed by the discourse structure.
2. Discourse Relations and Discourse Units
There are numerous formal theories of discourse representation in the literature, and corpora have been annotated for discourse structure based on the principles of these theories (Segmented Discourse Representation Theory (SDRT; [21,23]), Rhetorical Structure Theory (RST; [33]), the Linguistic Discourse Model (LDM; [34]), the Discourse Graphbank model ([35]) and Discourse Lexicalized Tree Adjoining Grammar (DLTAG; [36])). Briefly, Segmented Discourse Representation Theory (SDRT) is a dynamic semantic theory of discourse interpretation that focuses on the interplay between discourse interpretation and discourse coherence. It is an extension of Discourse Representation Theory (DRT), which is a framework for exploring meaning under a formal semantics approach. In SDRT, a text is segmented into constituents related to each other by means of rhetorical relations, resulting in a structure known as a segmented discourse representation structure (SDRS).
Among the above approaches, SDRT offers a well-principled model of discourse inference and interpretation and has also been adopted as the annotation model for the STAC corpus as well as the C58 corpus. The availability of both corpora makes it possible to highlight the differences and similarities between the three different types of recorded discourse data: single-author written texts, nonsituated dialogues and situated dialogues. Additionally, SDRSs, the discourse representations that SDRT adopts and that are analyzed in the rest of the paper, can be mapped to non-tree-like graphs, contrary to the representations of the other above-mentioned discourse theories.
Although SDRT has been chosen as the basis of the current work, all discourse theories are more or less aligned as to how to proceed in annotating linguistic utterances. The basic premise is that discourse structure is hierarchically organized around discourse units, the
elementary discourse units or
EDUs in SDRT parlance. However, SDRT also includes
Complex Discourse Units or
CDUs to cope with semantic groupings of
EDUs in the discourse and, recently, refs. [
28,
29,
37] extended SDRT’s classic definition of a discourse graph with a third type of node, the
elementary event units or
EEUs, so that nonlinguistic events are considered as nodes during the construction of the discourse graph. Here is the definition of a discourse graph by [
28]:
Definition 1. A discourse graph is a tuple $(V, E_1, E_2, \ell, \mathit{Last})$, where $V$ is a set of nodes (EDUs, EEUs and CDUs); $E_1 \subseteq V \times V$, a set of edges representing discourse relations; $E_2 \subseteq V \times V$, a set of edges relating each Complex Discourse Unit (CDU) to its sub-units; $\ell$, a labelling function from elements of $E_1$ to discourse relation types; and $\mathit{Last}$, a label for the last unit in $V$ relative to textual order.
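The components of Definition 1 can be mirrored in a small data structure. The following Python sketch is only an illustration under our own assumptions: the class name, the field names and the node labels `pi1` and `pi2` are hypothetical and do not come from [28] or from the corpora.

```python
from dataclasses import dataclass, field


@dataclass
class DiscourseGraph:
    """Illustrative container for the components listed in Definition 1."""
    nodes: set            # the set of nodes: labels of EDUs, EEUs and CDUs
    relation_edges: set   # directed (source, target) pairs for discourse relations
    cdu_edges: set        # (cdu, sub_unit) pairs linking a CDU to its sub-units
    labels: dict = field(default_factory=dict)  # relation edge -> relation type
    last: str = ""        # label of the last unit in textual order

    def validate(self):
        """Return True when edges, labels and Last only mention known items."""
        ends_ok = all(u in self.nodes and v in self.nodes
                      for u, v in self.relation_edges | self.cdu_edges)
        labels_ok = set(self.labels) == self.relation_edges
        return ends_ok and labels_ok and self.last in self.nodes


# Hypothetical encoding of a two-unit discourse with one Explanation relation.
g = DiscourseGraph(
    nodes={"pi1", "pi2"},
    relation_edges={("pi1", "pi2")},
    cdu_edges=set(),
    labels={("pi1", "pi2"): "Explanation"},
    last="pi2",
)
```

Keeping the two edge sets apart makes it straightforward to analyze the discourse relations alone, leaving the CDU membership edges aside.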
2.1. Elementary Discourse Units
The EDUs are linguistic utterance nodes on the discourse graph and are considered the basic building blocks of a discourse representation. The intricate semantic and pragmatic manner in which the EDUs are related is expressed through a number of discourse relations, such as Narration, Elaboration, Explanation and Result.
All the above-mentioned discourse theories have a different theoretical standpoint regarding both the definition criteria of discourse relations and the set of discourse relations that characterize a coherent text. However, for the purposes of this paper, these heated theoretically driven debates can be safely set aside, since the main target is to unveil deep properties of the discourse structure viewed as a network and not to examine the type of discourse relations that create them.
Moreover, all discourse theories acknowledge the need of a discourse segmentation algorithm that tokenizes the discourse into EDUs, and they are more or less aligned regarding the segmentation criteria.
The above-mentioned theoretical differences aside, another important classification axis of discourse relations pertains to whether a discourse relation should be considered as coordinating or subordinating ([38]). Briefly, coordinating relations, such as Narration and Continuation, relate EDUs that share a general common topic and can be thought of as providing information on the same level of detail on that topic, while subordinating relations, such as Elaboration and Explanation, relate two EDUs asymmetrically, since one of the two plays a subordinate role relative to the other. Ref. [38] argues that it is not an easy task to classify discourse relations as either coordinating or subordinating and presents linguistic tests that help decide which discourse relations are subordinating and which are coordinating in a given context.
The relevant part of the distinction between coordinating and subordinating relations for the construction of the discourse representation is that the first type of relation is indicated through horizontal connections, as in the simple graph in Figure 1 for the coordinating relation Narration between the two EDUs of (1), and the second type through vertical ones, as in Figure 2 for the subordinating relation Explanation between the two EDUs of (2).
- 1.
Yesterday evening, John had a great meal and
won a dancing competition.
- 2.
John fell.
Max pushed him.
Subordinating and Coordinating Relations
Although the classification of discourse relations as subordinating or coordinating has important semantic effects in discourse interpretation, reflected in discourse anaphora and attachment availability ([
39]), it is highly context-dependent, as pointed out by [
38,
40], which suggests that the contexts that they appear in and the conditions that dictate their classification as either coordinating or subordinating should be further investigated. As [
38] mentions, a discourse relation does not share common underlying content with a homogeneous set of other discourse relations that can be assigned to either the coordinating or the subordinating type. Essentially, the two types of discourse relations reveal an important aspect of the core of discourse structure related to the way speakers interpret the information in the discourse; this aspect is often called information packaging in the SDRT tradition.
However, beyond the linguistic contexts that need to be studied further so that the linguistic classification tests suggested by [
38] can be confirmed and/or refined, this paper aims to abstract away from the distinction between subordinating and coordinating relations and to inspect deeper properties of discourse evolvement that would probably play a role in the classification of discourse relations. We claim that the study of network motifs and other key network indices is the formal means to reveal these properties and, then, the distinction between subordinating and coordinating relations may be examined further in light of the findings in this paper. The most important feature of discourse representations that can be exploited to identify network properties is the presence of directionality in the discourse edges, as is made clear in
Section 4.
SDRT’s formalism supports directed acyclic graphs (DAGs), since, following SDRT’s principled method of building graph representations of discourse, every EDU contributes to the information flow of discourse, and its advancement implies that there can be no loop constructed that could ultimately allow self-relational connections.
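Whether a given set of annotated discourse relations really forms a DAG can be checked mechanically. The sketch below uses Kahn's topological-sort algorithm; it is a generic illustration, not a procedure taken from SDRT or from the corpora, and the function name and node labels are hypothetical.

```python
from collections import deque


def is_acyclic(nodes, edges):
    """Kahn's algorithm: return True if the directed graph contains no cycle."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1
    # Start from the nodes that no edge points to.
    queue = deque(n for n in indegree if indegree[n] == 0)
    visited = 0
    while queue:
        n = queue.popleft()
        visited += 1
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Every node is removed exactly when no cycle (or self-loop) blocks it.
    return visited == len(indegree)
```

For instance, `is_acyclic(["a", "b", "c"], [("a", "b"), ("b", "c")])` holds, while adding the edge `("c", "a")` closes a loop and makes the check fail.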
2.2. Complex Discourse Units
The second type of node on a discourse graph is the
CDU. Following [
41], a
CDU represents a semantically coherent group of
EDUs that collectively serve as an argument to a discourse relation between this group and another
EDU. Accordingly, SDRT assigns semantic content and internal structure to
CDUs and, moreover, since
CDUs are first-class citizens on the discourse representation level, they can be thought of as complex speech acts that participate in discourse inference as arguments of discourse relations. Admitting a second kind of node that has subconstituent nodes inevitably results in acknowledging a second kind of relation apart from discourse relations, and this is reflected in the definition of
CDUs in [
41]:
—Undirected unlabeled edges connect a Complex Constituent to its subconstituents, introducing recursivity in the structure.
Hence, a Complex Discourse Unit or Complex Constituent is a node of the graph that has some subconstituents identified by the second kind of edge. We may write as shortcut for α is a subconstituent of π.
The classic example introduced in [21] and repeated in (3) may serve as the basis for distinguishing the two types of relations displayed in its discourse graph in Figure 3 (the top-most label, which stands for the content that is there at the beginning of the discourse, can be safely ignored). The directed edges indicate discourse relations, whereas the undirected ones represent the relation between a CDU and its subconstituents. The CDUs in the graph are the nodes with primed labels: one groups the content of the EDUs for "He had a great meal" and "He won a dancing competition", and the other has the EDUs for "He ate salmon" and "He devoured lots of cheese" as its subconstituents.
- 3.
John had a great evening last night.
He had a great meal.
He ate salmon.
He devoured lots of cheese.
He won a dancing competition.
For the purposes of this paper, we focus on the discourse relations of the graphs and not on the secondary subconstituency relationship between a CDU and its subconstituents. The way that groupings of EDUs in terms of CDUs may indirectly affect the discourse structure may, however, be the subject of a separate future study.
2.3. Elementary Event Units
The third kind of node admitted in a discourse representation graph with equal status to
EDUs is the
EEU.
EEUs represent nonlinguistic events in the discourse, especially in the context of situated dialogues, with little to no descriptive content. By nonlinguistic events, in the context of situated dialogues, we mean the different moves in the recorded game. For instance,
and
in ref. [
28] refer to two moves by the server and the UI that represent two nonlinguistic events. On the other hand,
EDUs represent linguistically uttered events, namely events that have been explicitly realized, e.g., through a verb or an event nominal.
EEUs can be related either to other
EEUs or to
EDUs, opening the way for relating linguistic with nonlinguistic events.
Although the use of nonlinguistic events in discourse is pervasive, for good reasons, pragmatically driven theories of discourse have focused mainly on the way that reference to nonlinguistic events is realized through the use of indexical and demonstrative expressions, ignoring the complex ways that nonlinguistic events affect discourse structure. Contrary to linguistic events, due to
EEUs’ lack of propositional content, it is challenging for the interpreters to individuate and conceptualize them. Different interpreters may assign different content to a nonlinguistic event. However, speakers engage in situated dialogues on a daily basis, and they fully integrate nonlinguistic events throughout discourse inference and interpretation. Allowing
EEUs to interact with
EDUs as well as with
CDUs opens up a way to enrich the discourse graph and study in depth the interaction between linguistic and nonlinguistic events. Refs. [28,37] offer a formal model of situated dialogue to capture these effects, and the study of STAC, the first discourse-annotated corpus that includes
EEUs, by [29] has already pointed out interesting ways that discourse is affected by the presence of
EEUs. For instance,
Figure 4, which displays the discourse graph of (4), taken from [
29], shows the direct relation between the
CDUs. The labels
,
and
represent three
EEUs generated during an online game session that was recorded during the compilation of the STAC corpus (for more on the description and the basic features of the STAC corpus, go to
Section 3.1).
is clearly related to
with commentary, and the graph encloses a rich repository of linguistic and nonlinguistic information that explains in a more complete fashion the current discourse.
Additionally,
and
are subconstituents of
. As [
29] points out,
CDUs that have
EEUs as their subconstituents, such as
, serve very often as arguments to discourse relations and are essential in a situated discourse, since speakers frequently refer back to grouped nonlinguistic events in various ways.
- 4.
william rolled a 6 and a 1 [Server].
william will move the robber [UI].
william stole a resource from GWFS [Server].
oucho [GWFS]
you can have it back for some ore. [william]
Lastly, according to [
28],
EEUs differ from
EDUs in terms of their semantics, due to the fact that they always denote actual events, and their order matches the order of the events they describe. However, although one would expect that
EEUs always describe a sequence of events, there is a wide range of relations between
EEUs and between
EEUs and
EDUs. Ref. [
29] describes in detail the frequency distribution of the discourse relations that
EEUs are assigned as their arguments, and they include
Result,
Elaboration,
Correction,
Background,
Sequence,
Continuation and
Question–Answer Pair.
3. The Datasets
As mentioned above, we aimed to examine as broad a spectrum of discourses as possible given the availability of the discourse-annotated corpora (single-author written texts, chat-only and situated multiparty dialogue texts) spanning two languages, English and Greek. The following two sections briefly describe the main features, the collection and the annotation process that have been followed during the compilation of the two corpora, the STAC corpus and the C58 corpus.
3.1. The STAC Corpus
There are several discourse-annotated corpora with different theoretical starting points but similar concerns and targets, including the Penn Discourse Treebank (PDTB; [42]), the RST Discourse Treebank (RST-DT; [43]), DISCOR [44], ANNODIS [45] and, more recently, the STAC corpus ([27]). The largest of them is RST-DT, which was compiled and annotated based on RST’s annotation principles. However, the STAC corpus ([29]) is the first corpus that provides discourse structures for multiparty dialogues situated in a virtual environment, which is the reason why we chose to use it for the comparison of different discourse types.
The corpus was annotated in two stages leading to two subcorpora: the first subcorpus includes the annotated chat moves of the players during a game session, and the second extends the annotated game sessions of the first subcorpus by adding the annotation of nonlinguistic events that were automatically generated by the same game session. The two subcorpora are offered for direct comparison between situated and nonsituated discourse analysis, as well as for exploring the interaction between the linguistic and the nonlinguistic events in the situated subcorpus. In this paper, we follow the convention of [
29], and we refer to the annotations of the first subcorpus as the chat-only annotations and the annotations of the second subcorpus as the situated annotations.
Each game is divided into dialogues, or subsections, of the game that encompass one or more bargaining sessions. As mentioned above, EEUs lack propositional content and, thus, it is already challenging for the interpreters that participate in the dialogue to individuate and conceptualize them, rendering it almost impossible for the annotators to construct and include them in the situated corpus. However, the messages that are automatically generated by the game software and shared between the players in a common virtual environment set the stage for including nonlinguistic events that occur in a controlled and common virtual environment.
As expected, the subcorpus that includes the situated annotations consists of many more EEUs (31,811) than the chat-only annotated subcorpus (12,588). (All the STAC-annotated data are stored in tabular format, converted to pickle files, and are available upon request from the corresponding author.)
3.2. The C58 Corpus
C58 is the first corpus annotated with discourse relations for Greek, based on SDRT’s theoretical framework for the implementation of its annotation scheme, and serves as one of the first resources for Greek discourse parsing (since one of the main goals of creating the C58 corpus was to quantitatively approach the interface between intra- and intersentential semantics, it has also been annotated for verbal aspect and thematic roles). C58 consists of 58 journalistic texts (more than 1000 annotated DUs), sampled from two popular Greek newspapers in northern and southern Greece, “Makedonia” and “Ta nea”, respectively, which belong to the Corpus of Modern Greek Texts, a reference corpus of Greek compiled by the Center of Greek Language.
The Corpus of Modern Greek Texts includes approximately 7000 and 4500 articles from the two newspapers, respectively. Ref. [26] decided to collect texts proportionally from both newspapers in order to neutralize any possible dialectal factors. The initial corpus of the Center of Greek Language is divided into 56 text genres, essentially covering the whole extent of the journalistic genre type, ranging from book review to economic news. The scope of C58 texts was narrowed down to 27 genres, based on the mean size of their texts, in an attempt to meet the balance and representativeness criteria in the sample.
Ref. [
26] describes in detail the compilation process, including the data collection criteria and the annotation cycle of the corpus undertaken by the team of five linguistics students, as well as the various challenges that they had to face throughout the process (all C58 texts have been annotated with the brat annotation tool, and the resulting brat-exported annotated data are available at
http://github.com/atantos/C58).
4. Networks and Connectivity
Networks or graphs are sets of nodes and edges ([
46,
47,
48,
49]). The nodes of the network are representations of objects, and the edges represent relations between these objects. In a social network, for example, nodes represent persons, and two nodes are connected if these persons know each other. In a biological protein interaction network, nodes are proteins that are connected if they interact with each other through some biochemical reaction ([
50]). In our case, a discourse representation can be thought of as a network, since it can be directly mapped to a graph with nodes and edges. The nodes in our networks are utterance labels, and these labels are connected if a discourse relation exists between them. This is markedly different from other linguistic networks studied in the past ([
51,
52,
53]), where usually the nodes represent words and the aim is to study relations between words in a sentence. Most importantly, as mentioned earlier, our networks are directed networks. Notice that while constructing and analyzing the discourse networks of the two discourse-annotated corpora, we did not include the relation between CDUs and their subconstituents, since these relations do not have the same impact and role as discourse relations throughout discourse inference and interpretation ([41]).
Networks have become an invaluable tool of applied mathematics due to their versatility in describing so many diverse physical systems using a common framework ([
54,
55,
56,
57,
58,
59,
60,
61]). A brief search of the Scopus bibliographic database returns more than 750 articles published in 2021 that contain the phrase “Complex Network” in their title.
4.1. Key Network Indices and Their Relevance to Discourse
Graphs can be either undirected or directed. A directed graph, also called a digraph, is a graph in which the edges have a direction. Pictorially, this is indicated with an arrow on the edge. If $v$ and $w$ are vertices, an edge is an unordered pair $\{v, w\}$, while a directed edge is an ordered pair, $(v, w)$ or $v \to w$. The directed edge $(v, w)$ is drawn as an arrow from $v$ to $w$. Of course, it is possible to treat a digraph as a nondirected network and still obtain valuable information. Our graphs are unweighted, i.e., all edges are treated as equal, and contain no multilinks or self-loops.
There are several quantities of interest in the study of networks. Some of them are meaningful for directed as well as for undirected networks, while others are defined for directed graphs only. Thus, typically, the study of a directed network starts by treating it as undirected. When the basic properties of the undirected graph are calculated, then one proceeds by examining the quantities of interest that are associated with the directionality of the graph links. This is the approach that we follow here. When we treat discourse networks as undirected graphs, we are interested in the following quantities:
- (a)
The number of nodes and the number of edges of the network. In terms of discourse networks, these two indices indicate the size of the network.
- (b)
The fraction of edges $f$, which is the ratio of the number of edges $E$ to the maximum possible number of edges $E_{\max}$, i.e.,
$$f = \frac{E}{E_{\max}} = \frac{2E}{N(N-1)},$$
where we use the fact that the maximum number of undirected edges in a network of $N$ nodes is $N(N-1)/2$. The fraction $f$ is a number between zero and one. In terms of discourse graphs, if $f$ is closer to 0, it can be interpreted as an indicator of low connectivity, while in the opposite case, the discourse graph can be considered to have a highly connected structure.
- (c)
The degree centrality of a node is a non-negative integer denoting the number of edges emanating from a node. In our case, utterance labels with a high node degree can be interpreted as having high significance for the discourse evolvement. Here, we study the maximum, minimum and average degree of a network’s nodes, $k_{\max}$, $k_{\min}$ and $\langle k \rangle$, respectively, as well as the standard deviation of the degree distribution, $\sigma_k$.
- (d)
The mean clustering coefficient $C$. The (local) clustering coefficient of a node $i$ in a network is defined as
$$C_i = \frac{T_i}{k_i(k_i-1)/2},$$
where $T_i$ is the number of triangles (loops of length 3) attached to this node and $k_i(k_i-1)/2$ is the maximum possible number of such loops for a node of degree $k_i$ ([49]). Here, we compute the average of the local clustering coefficients. It is a number in the range $[0, 1]$. Clustering coefficient values close to 1 indicate that two nodes connected to a third common node have a high probability of also being connected to each other. Social networks are typically networks with a high clustering coefficient, since it is probable that individuals that have a common acquaintance know each other as well ([48,59,62]).
- (e)
The degree assortativity coefficient $r$ ([63]). A network is assortative when nodes of high degree tend to connect to nodes with high degree. It is disassortative when nodes of high degree tend to connect to nodes with low degree. The assortativity coefficient $r$ lies in the range $[-1, 1]$, with $r = 1$ indicating perfect assortativity and $r = -1$ indicating perfect disassortativity. To be more formal, the assortativity coefficient $r$ is defined as
$$r = \frac{\sum_{jk} jk \, (e_{jk} - q_j q_k)}{\sigma_q^2}.$$
The term $q_k$ is the distribution of the remaining degree, i.e., the number of edges leaving the node, other than the one that connects the pair, and $\sigma_q^2$ is its variance. This distribution can be derived from the degree distribution $p_k$ as
$$q_k = \frac{(k+1) \, p_{k+1}}{\sum_{j} j \, p_j}$$
([63]). The quantity $e_{jk}$ represents the joint probability distribution of the remaining degrees of the two vertices. This quantity is symmetric on an undirected graph and follows the sum rules $\sum_{jk} e_{jk} = 1$ and $\sum_{j} e_{jk} = q_k$.
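The undirected indices (a) to (d) can be computed directly from an edge list. The following Python sketch is a minimal illustration and not the code used for the analysis in this paper; the function name and the dictionary keys are our own, nodes are only known through the edges they take part in (so isolated nodes are not counted), and at least one edge is assumed.

```python
from statistics import mean, pstdev


def undirected_indices(edges):
    """Size, edge fraction, degree statistics and mean clustering coefficient."""
    # Build a symmetric adjacency map, ignoring direction and self-loops.
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    degrees = [len(nb) for nb in adj.values()]

    def local_clustering(i):
        nb = list(adj[i])
        k = len(nb)
        if k < 2:
            return 0.0  # convention: no triangle is possible
        # Count pairs of neighbours of i that are themselves connected.
        triangles = sum(1 for a in range(k) for b in range(a + 1, k)
                        if nb[b] in adj[nb[a]])
        return triangles / (k * (k - 1) / 2)

    return {
        "nodes": n,
        "edges": m,
        "edge_fraction": m / (n * (n - 1) / 2) if n > 1 else 0.0,
        "k_max": max(degrees),
        "k_min": min(degrees),
        "k_mean": mean(degrees),
        "k_std": pstdev(degrees),  # population standard deviation (an assumption)
        "mean_clustering": mean([local_clustering(i) for i in adj]),
    }


# A triangle a-b-c with one extra node d attached to c.
stats = undirected_indices([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
```

Nodes of degree 0 or 1 are given a local clustering coefficient of 0 here; other conventions exclude them from the average, which changes the mean value.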
In Table A1 of Appendix A.1, we present basic statistics for a number of real-world networks, including numerical estimates for the important properties of clustering and assortativity. As one can see from Table A1, the clustering coefficient for social networks tends to be of higher value; the network of film actor collaborations and a network of collaborations between physicists are typical examples. Technological and biological networks, by contrast, tend to have somewhat lower values. The power grid network, for instance, has a clustering coefficient of only about 0.08. A high clustering coefficient for the study of discourse networks may serve as an important proxy for pinning down discourse coherence and the way that it is maximized, since it points to discourse structures with a higher quality of discourse connectivity. However, to establish such a conclusion would require cross-checking whether discourse networks with a high clustering coefficient are also ranked as highly coherent by speakers.
Concerning the assortativity property, one can observe that social networks are in general assortative networks ($r > 0$). This is quite well known to sociologists, as people have, it appears, a strong tendency to associate with others whom they perceive as being similar to themselves in some way. In sociology, this tendency is called homophily, or assortative mixing. More rarely, one also encounters disassortative mixing, the tendency for people to associate with others who are unlike them. Assortative (or disassortative) mixing is also seen in some nonsocial networks. Papers in a citation network, for instance, tend to cite other papers in the same field more than they do papers in different fields. Web pages written in a particular language tend to link to others in the same language. The assortativity coefficient $r$ that we study here is a measure of a particular type of homophily, i.e., it measures the degree of assortativity. One may consider the nodes of a graph as belonging to two groups: a group of highly connected nodes and a group of poorly connected ones. The assortativity coefficient $r$ measures the tendency of nodes to connect (or not) within their respective group.
In terms of discourse representation, a highly assortative discourse network may signal the connectedness between highly important nodes. In the case of a highly assortative discourse, one may also claim that a few or even a single common topic or line of argumentation is promoted that is supported by several influential utterances in the discourse. On the other hand, a disassortative discourse network may signal a lack of connectedness between highly important nodes that can be further interpreted as a type of mutual avoidance between nodes with a high degree centrality.
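For an undirected network, the degree assortativity coefficient of [63] is equivalent to the Pearson correlation between the degrees found at the two ends of each edge. The sketch below relies on that equivalence; it is an illustration rather than the implementation used in this paper, the function name is hypothetical, and the edge list is assumed to contain no duplicate edges or self-loops.

```python
def degree_assortativity(edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count every undirected edge in both orientations so the measure is symmetric.
    xs, ys = [], []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    n = len(xs)
    mean_x = sum(xs) / n  # equals the mean of ys by symmetry
    cov = sum((x - mean_x) * (y - mean_x) for x, y in zip(xs, ys)) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    if var == 0:
        raise ValueError("all edge ends have the same degree; r is undefined")
    return cov / var


# A star: one hub connected to three leaves, the extreme disassortative case.
r_star = degree_assortativity([("hub", "x"), ("hub", "y"), ("hub", "z")])
```

In the star, every edge joins the single high-degree node to a degree-one node, so the coefficient reaches its lower bound. Using degrees instead of remaining degrees does not change the result, because the two differ by a constant shift of one.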
4.2. Network Motifs and Discourse Structure
The most famous nontrivial quantity of interest characterizing directed networks is the existence or absence of network motifs ([
50,
64]). The basic premise behind network motifs is that networks are made of repeating occurrences of simple patterns. Thus, a network is scanned for specific n-node subgraphs (here
). We refer to these three-node subgraphs as patterns and denote them with the letter
S, followed by an index number; see
Figure 5. The number of occurrences of each pattern in a graph is compared with the number of occurrences found in appropriately randomized networks, and z-scores are calculated so that one can estimate how likely a particular subgraph is to appear by chance. Z-scores higher than 2 indicate that a particular pattern is present in the actual network far more frequently than in the randomized graphs, so its abundance is unlikely to be due to pure chance. Such a pattern is then termed a network motif. It is equally important to locate patterns that are missing from the original network while they appear in the randomized graphs. Such an absence would indicate that there is a mechanism prohibiting their appearance. We identify them when their associated z-score is less than $-2$. Such patterns are termed antimotifs, i.e., patterns that are rarer than at random, and may be equally important for revealing aspects of the discourse structure. Thus, we must emphasize the difference between a pattern and a motif. For a pattern to be termed a network motif or antimotif, it must appear much more often or much more rarely, respectively, than it would, on average, appear in random networks.
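The scanning step described above can be sketched as follows. This is a brute-force illustration that inspects every node triple and names the pattern induced on it; it is not the custom code used in this study. The function names, the descriptive pattern names used as dictionary keys and the toy edge list are all assumptions made for the example, and triples containing a bidirectional edge are simply skipped.

```python
from collections import Counter
from itertools import combinations


def classify_triple(edge_set, a, b, c):
    """Name the pattern induced on nodes a, b, c, or return None."""
    nodes = (a, b, c)
    edges = [(u, v) for u in nodes for v in nodes
             if u != v and (u, v) in edge_set]
    # Skip triples that contain a bidirectional edge.
    if any((v, u) in edge_set for u, v in edges):
        return None
    out_deg = sorted(sum(1 for u, _ in edges if u == n) for n in nodes)
    in_deg = sorted(sum(1 for _, v in edges if v == n) for n in nodes)
    if len(edges) == 2:
        if out_deg == [0, 0, 2]:
            return "common_cause"   # one node points to the other two
        if in_deg == [0, 0, 2]:
            return "dual_cause"     # two nodes point to the same node
        return "linear_chain"       # x -> y -> z
    if len(edges) == 3:
        # Three one-way edges form either a cycle or a transitive triangle.
        return "cycle" if out_deg == [1, 1, 1] else "feed_forward_loop"
    return None  # fewer than two edges: the triple is not connected


def count_patterns(edges):
    """Count induced three-node patterns in a directed edge list."""
    edge_set = {(u, v) for u, v in edges if u != v}
    nodes = list({n for edge in edge_set for n in edge})
    counts = Counter()
    for a, b, c in combinations(nodes, 3):
        name = classify_triple(edge_set, a, b, c)
        if name is not None:
            counts[name] += 1
    return counts


# Toy directed graph (hypothetical data).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(count_patterns(toy_edges))
```

Checking all triples costs on the order of the cube of the number of nodes, which is acceptable for small graphs but illustrates why larger networks call for more efficient or parallelized counting.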
Higher-order patterns (i.e., $n > 3$) can in principle be studied, but such a task becomes prohibitively difficult in practical cases. There are 199 possible 4-node subgraphs, 9364 5-node subgraphs and so on ([
65]). Moreover, to count the occurrences of a specific 4-node subgraph in a graph of 1600 nodes, all $\binom{1600}{4}$ combinations of nodes must be checked. This is a number on the order of $10^{11}$, i.e., the computational task quickly becomes infeasible. Thus, the vast majority of research considers three-node motifs, an approach that has proven rather fruitful, especially in biological networks ([
64,
66,
67,
68,
69,
70,
71]).
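The growth in the number of node combinations can be checked directly. The short sketch below only evaluates binomial coefficients for a graph of 1600 nodes; the variable names are chosen for this illustration.

```python
from math import comb

n_nodes = 1600

# Number of node subsets that a brute-force scan would have to inspect.
triples = comb(n_nodes, 3)
quadruples = comb(n_nodes, 4)

print(triples)               # 681387200
print(quadruples)            # 272043839600
print(quadruples / triples)  # about 399 times more work for 4-node patterns
```

Moving from three-node to four-node patterns therefore multiplies the number of candidate subsets by roughly 400 for a network of this size, before the larger number of distinct pattern types is even taken into account.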
Figure 5 shows the five three-node subgraphs (patterns) that are possible with two or three directed edges when no bidirectional edges are allowed. It was a celebrated result in biological networks that the feed-forward loop (see
Figure 5) is a network motif of the Transcriptional Regulatory Network (TRN) of the bacterium
E. coli ([
65]). These subgraphs cannot be directly mapped to the discourse graphs based on the annotation principles set by both C58 and STAC. Recall that discourse graphs admit two different types of directed edges depending on the type of discourse relation between the two related nodes, namely coordinating vs. subordinating ones, which are drawn horizontally or vertically, respectively. Moreover, as mentioned in
Section 2.1, our network analysis aims to offer deeper insight into the principles of discourse development and to contribute to the discussion as to what dictates the distinction between coordinating and subordinating relations.
Among the five possible three-node subgraphs in
Figure 5, the cyclic pattern is excluded from our analysis, since discourse structure graphs are not allowed to include loops that could relate a node back to itself. Naturally, discourse advancement forbids this type of subgraph (although it is not analyzed, we included it in the graphs in
Section 5 and in
Table A2 with the mean pattern values; see the discussion below). Before responding to the question of whether discourse graphs are characterized by one or more network motifs or antimotifs in
Section 5, it is important to identify the types of discourse subgraphs that are mapped to the different $S_i$ network subgraphs. Notice that
Figure 6,
Figure 7,
Figure 8 and
Figure 9 include only the corresponding discourse graphs that respect the Right Frontier Constraint (RFC) and not all possible three-node subgraphs that may satisfy the definition of the network subgraph patterns. However, for our purposes, the distinction between coordinating and subordinating discourse relations does not affect the results of our study (but see [
72] for certain types of RFC violations related to this distinction).
The linear chain pattern represents an uninterrupted sequence of related utterances, and this can be translated into three different discourse graphs.
Figure 6 displays the possible three-node subgraphs included in a discourse graph, with either subordinating, coordinating or both types of discourse relations, corresponding to the linear chain pattern. All three subgraphs are expected to be found frequently in single-author written texts, because coherence in these texts is expected to promote uninterrupted sequences of related utterances, while the opposite may be expected for the multiparty dialogue texts of the STAC corpus, where participants frequently interrupt the discourse and introduce new topics and/or lines of argumentation. All of these conjectures remain to be confirmed or rejected in
Section 5.
The feed-forward loop corresponds to fully connected discourse subgraphs of three nodes, whereby each node is related to the other two. In terms of subgraph connectivity, it refers to maximally connected discourse graphs, which in turn means that these three-node subgraphs achieve the maximum degree of discourse coherence, if one admits that network connectivity is a direct and strong indication of discourse coherence. Note that all discourse semantic theories either tacitly agree with the statement that the more edges or connections a discourse representation graph has, the more probable it is that the discourse is coherent, or they integrate it as a principle in their theory, e.g., the Maximize Discourse Coherence principle (MDC) in SDRT ([
73,
74,
75]).
The dual-cause pattern in
maps to discourse subgraphs in which
plays a central role and could be perceived as labeling an utterance that strengthens the quality of discourse coherence. Note that the central role of
we mention here is not identified with RST’s nucleus–satellite distinction by [
33] but is attributed to the fact that
is related to both of the following two nodes in the discourse subgraph. Although
of the discourse subgraphs that corresponds to
is related to the previous two
s, as in
, there is no indirect relation to
through
in which the distance
equals 2. The distance between two nodes A and B is the length of the shortest path between A and B. The absence of an edge between
and
does not mean that they are not related indirectly through an intermediate, though,
but that the number of intermediate nodes, as well as the type of discourse relations that intervenes between them is not specified.
Regarding the permissible discourse graph that corresponds to the common-cause pattern, the node with two outgoing edges plays a prominent role too, since it is directly related to both of the other two nodes, while those two nodes are only indirectly related to each other in the sense described above; namely, there is no intermediate node between them.
Notice that the edges of the discourse networks created for both corpora, C58 and STAC, are unweighted. In terms of the relevant distinction made by [
29] between quasi-tree-like and truly non-tree-like discourse units, the extracted three-node subgraphs are not classified based on whether they include quasi-tree-like discourse units or not, since our network analysis does not require distinguishing subgraphs that include one or more quasi-tree-like nodes from the truly non-tree-like nodes (i.e., subgraphs that have node(s) with a degree of 2 or higher). Essentially, what this means is that 3-node subgraphs that consist of entirely non-tree-like nodes include only one discourse relation between any two nodes.
Therefore, there is no weight assigned to three-node subgraphs that are more strongly connected in terms of the number of discourse relations that relate any pair of the three nodes or in terms of the degree of the nodes.
5. Interrogating the C58 and the STAC Networks
Figure 10 shows, in the form of histograms, the distribution of the basic properties of 55 discourse networks derived from Greek newspaper articles that were included in C58. The largest of these networks contains 69 nodes, while the smallest contains only 4 nodes (we left out 3 of the original 58 texts included in C58, since they consisted of fewer than 3 utterances). Similarly, concerning the edges, the largest graph contains 87 edges, while the smallest contains only 3. Most of the networks are sparse, as the mean value of the fraction of edges is equal to 0.12. There are, however, some exceptions, as the maximum value of the fraction of edges among the 55 networks equals 0.67. Most of the networks have a low clustering coefficient (mean value of
C is 0.22). Lastly, the vast majority of the networks are disassortative, since their assortativity coefficients
r are negative.
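The basic properties reported here can be illustrated with a small sketch. It rests on assumptions that the text does not confirm: the networks are treated as undirected simple graphs, the fraction of edges is taken to be the number of edges divided by the number of possible node pairs, and the clustering coefficient C is taken to be the average of the local clustering coefficients. The function name and the toy edge lists are hypothetical.

```python
from itertools import combinations


def basic_properties(edges):
    """Nodes, edges, fraction of edges and average clustering coefficient."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) // 2
    # Assumed definition: realized edges over possible undirected pairs.
    fraction = 2 * m / (n * (n - 1)) if n > 1 else 0.0

    local = []
    for nb in adj.values():
        k = len(nb)
        if k < 2:
            local.append(0.0)
            continue
        # Count how many pairs of neighbours are themselves connected.
        links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
        local.append(2 * links / (k * (k - 1)))
    clustering = sum(local) / n if n else 0.0

    return {"nodes": n, "edges": m,
            "fraction_of_edges": fraction, "clustering": clustering}


# Toy graphs (hypothetical): a triangle with a tail, and a small tree.
triangle_with_tail = [(0, 1), (1, 2), (2, 0), (2, 3)]
small_tree = [(0, 1), (0, 2), (2, 3)]

print(basic_properties(triangle_with_tail))
print(basic_properties(small_tree))
```

Any tree-like graph has a clustering coefficient of exactly zero under this definition, because no two neighbours of a node are ever connected to each other.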
Figure 11 shows the distribution of the z-scores for the 5 basic network patterns depicted in
Figure 5. To be specific, for each of the 55 C58 discourse networks, we count the occurrences of each $S_i$ pattern. Then, we create 100 random networks with the same number of nodes and edges as the actual network and count the occurrences of the same $S_i$ patterns in each of the random networks. For each pattern and for each actual network, we calculate a z-score as follows. Let $N_{\mathrm{real}}$ denote the number of occurrences of pattern $S_i$ in the actual network. Let $\langle N_{\mathrm{rand}} \rangle$ and $\sigma_{\mathrm{rand}}$ denote the mean number and standard deviation of occurrences of pattern $S_i$ in the randomized networks. Then, the z-score is calculated as $z = (N_{\mathrm{real}} - \langle N_{\mathrm{rand}} \rangle) / \sigma_{\mathrm{rand}}$.
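The procedure described above can be sketched in a few lines. The example is a simplified illustration rather than the implementation used in this study: the random networks are drawn uniformly among directed graphs with the same number of nodes and edges and without reciprocal edges, the population standard deviation is used, only the feed-forward loop is counted, and the threshold of 2 follows the convention stated in Section 4.2. All function names and the toy graph are assumptions.

```python
import random
from statistics import mean, pstdev


def random_digraph(n, m, rng):
    """Random directed graph with n nodes, m edges and no reciprocal edges."""
    if m > n * (n - 1) // 2:
        raise ValueError("too many edges for a graph without reciprocal edges")
    edges = set()
    while len(edges) < m:
        u, v = rng.sample(range(n), 2)  # two distinct nodes, so no self-loops
        if (v, u) not in edges:
            edges.add((u, v))
    return edges


def count_feed_forward_loops(edges):
    """Count triples a -> b, b -> c, a -> c (assumes no reciprocal edges)."""
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)
    return sum(1
               for a, targets in succ.items()
               for b in targets
               for c in succ.get(b, ())
               if c in targets)


def z_score(observed, random_counts):
    """z = (observed - mean) / std; NaN if the random counts do not vary."""
    sigma = pstdev(random_counts)
    if sigma == 0:
        return float("nan")
    return (observed - mean(random_counts)) / sigma


def pattern_z_score(n, edges, count_fn, n_random=100, seed=0):
    """Compare a pattern count with counts in n_random randomized networks."""
    rng = random.Random(seed)
    m = len(edges)
    random_counts = [count_fn(random_digraph(n, m, rng))
                     for _ in range(n_random)]
    return z_score(count_fn(edges), random_counts)


def classify(z, threshold=2.0):
    """Label a pattern from its z-score."""
    if z > threshold:
        return "motif"
    if z < -threshold:
        return "antimotif"
    return "neither"


# Toy graph (hypothetical): 8 nodes, each linked to its next two successors.
toy = {(i, i + 1) for i in range(7)} | {(i, i + 2) for i in range(6)}
z = pattern_z_score(8, toy, count_feed_forward_loops)
print(count_feed_forward_loops(toy), z, classify(z))
```

Other randomization schemes, for instance ones that preserve the degree of every node, are common in the motif literature and may lead to different z-scores; the choice made here is only the simplest one consistent with the description above.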
Figure 11 shows that the feed-forward loop is clearly a network motif for many of the 55 networks. There are several cases with very high z-scores (more than 10 in some cases). The fact that the feed-forward loop, which is the most famous motif in biological networks, appears in linguistic networks as well is certainly intriguing. It is also intriguing that the dual-cause pattern shown in
Figure 5 is clearly an antimotif, as suggested by the large negative z-scores of its distribution. Furthermore, the linear chain pattern, which represents a sequence of uninterrupted related utterances, is neither a motif nor an antimotif for the discourse-annotated single-author written texts in C58, since it appears neither more frequently nor more rarely than it would at random for the vast majority of the discourse graphs. In particular, we compared the observed frequency of the linear chain pattern with what would be expected by chance, given the size and composition of the texts in the dataset. Our findings suggest that the usage of this particular pattern is not a common strategy of the authors of the texts in the C58 dataset. Lastly, the common-cause pattern does not appear to be a network motif or antimotif.
Figure 12 shows, in the form of histograms, the distribution of the basic properties of 299 STAC (chat-only) networks derived from dialogues in the English language. These networks are quite heterogeneous, as the largest of them contains 322 nodes, while the smallest contains only 2 nodes. Similarly, concerning the edges, the largest graph contains 166 edges, while the smallest only 1. The STAC (chat-only) networks are much sparser than the C58 networks, as the mean value of the fraction of edges is equal to 0.02. It should be noted that there is a complete absence of clustering. All STAC (chat-only) networks have a clustering coefficient
$C = 0$. Moreover, similarly to the C58 networks, the STAC (chat-only) networks are also disassortative (i.e., the assortativity coefficients
r are negative for all the networks).
Next, we scanned the STAC (chat-only) networks for the 5 basic network patterns depicted in
Figure 5. As shown in
Table A2 in the
Appendix A.2, the actual counts of all the $S_i$ patterns are zero, except for a single pattern, which indeed has nonzero values. This is a rather intriguing fact and is in marked contrast to our observations for the C58 corpus networks. As mentioned above, the presence or absence of a specific pattern, although interesting in itself, does not suffice to label it a motif or an antimotif.
Thus,
Figure 13 shows, for the STAC (chat-only) networks, the z-scores’ distribution for the 5 basic network patterns depicted in
Figure 5. The plot shows that, unlike in C58, the feed-forward loop is not a network motif in the case of the STAC (chat-only) networks. The z-scores of its distribution are negative, but since all of them are small in absolute value (less than 0.6), we cannot claim that the pattern constitutes an antimotif.
It is, however, intriguing that the dual-cause pattern shown in
Figure 5 is still an antimotif, as suggested by the large negative z-scores of its distribution. Moreover, the z-scores’ distribution suggests that the linear chain and the common-cause patterns are clearly antimotifs of the STAC (chat-only) networks.
Figure 14 shows the distribution of the basic properties of 299 English discourse networks derived from the situated discourse corpus described above. While the text files of the corpus are the same as those presented in
Figure 12 and
Figure 13, additional information about the nonverbal communication of the participants is included and, thus, the resulting networks are much larger than before. The largest of them contains 1628 nodes, while the smallest contains 29 nodes. Similarly, concerning the edges, the largest graph contains 818 edges, while the smallest contains only 15. These networks are much sparser than before, i.e., the mean value of the fraction of edges is equal to 0.005. It should be noted that there is a complete absence of clustering. Again, all networks have a clustering coefficient $C = 0$. Additionally, similar to the previous cases, the networks are still disassortative (the assortativity coefficients
r remain negative for all the networks).
Finally, focusing on the situated STAC networks,
Figure 15 displays the z-scores’ distribution for the 5 basic network patterns depicted in
Figure 5. Despite the increased nonverbal information included in the situated dialogues, the results remain consistent with those of the nonsituated networks, namely that
and
are antimotifs, since they appear with considerably less frequency than expected by chance, while
is neither a motif nor an antimotif.
All numerical simulations and data analyses were performed on a workstation equipped with 2 Intel Xeon Gold 6140 Processors (72 CPU cores in total) provided by the MSc program “Computational Physics” of the Physics Department, Aristotle University of Thessaloniki. The estimation of network motifs was performed through custom code written in Python. Basic network properties were studied with the Python modules NetworkX [
76] and retworkx [
77]. The code was parallelized using the Python module dask [
78]. The parallelization step is essential to the project, since the search for network motifs is a computationally intensive task, even for networks with moderate size. For a network of
nodes, the process requires approximately 64 h on a Xeon Gold 6140 Processor. Thus, without parallelization, processing the 299 STAC networks would be practically prohibitive. Lastly, the data preprocessing was conducted in Python and R on a PC with an i9 Intel processor (8 cores in total).
6. Discussion
We applied network analysis to two discourse-annotated corpora, STAC and C58, to uncover deep properties of discourse representations. The corpora contain English and Greek texts from two discourse types: multiparty dialogues and single-author written texts. We also analyzed the chat-only and situated versions of the multiparty dialogues in STAC. This helps us understand and compare how discourse structure varies across discourse types and languages.
Initially, we presented key network indices for all discourse networks. C58 has a higher compared with both STAC versions, though not significantly high. represents network connectivity. The networks differ in size, measured by number of nodes () and edges (), but all are sufficiently large for reliable analyses. Despite being sparse, C58’s discourse network exhibits a significantly higher mean fraction of edges than the STAC networks, indicating greater connectivity in single-author written texts, as expected.
Furthermore, the clustering coefficients of C58 networks are noteworthy, comparable to those found in social networks (see
Figure 10 and
Table A1). However, they exhibit a tendency towards disassortativity, a characteristic commonly observed in technological or biological networks but rarely seen in social networks.
Our study revealed remarkable differences and similarities between two discourse types: single-author written texts and dialogue texts in their chat-only and situated versions. Specifically, we examined the presence of network motifs and antimotifs:
1. The linear chain pattern is not frequently observed as a motif in C58, despite the expectation that journalistic discourse would involve uninterrupted related sequences. In contrast, both versions of STAC systematically avoided the linear chain pattern, as evidenced by the absence of occurrences (see Table A2). These findings suggest that multiparty dialogue networks tend to avoid three uninterrupted related utterances, indicating that speakers do not participate in or contribute to a continuous line or chain of utterances. In C58, authors neither avoid nor prefer to establish connections between utterances in the form of a linear chain pattern.
2. The feed-forward pattern emerges as a motif in C58, indicating a statistically significant and strong preference for constructing fully connected three-node subgraphs in most single-author written texts. In the two STAC discourse networks, on the other hand, this pattern is neither a motif nor an antimotif, although it should be noted that no occurrences of the pattern were recorded in these networks.
3. Across all three discourse networks, the dual-cause pattern stands out as the only subgraph pattern that acts as an antimotif in each of them. In the three-node subgraphs of this pattern, there is a node with two incoming edges originating from two utterance labels, which are not directly related to each other but precede that node in the discourse.
4. The common-cause pattern serves as an antimotif for the STAC corpora, while it does not exhibit characteristics of either a motif or an antimotif in the C58 corpus. This suggests that a commonly observed discourse strategy does not favor the scattered presentation of various aspects of an event described by an utterance. More typically, we witness a sequence of utterances that play a subordinate role to an initial utterance. Circling back later in the discourse to add more aspects to that initial utterance is generally avoided.
The above findings are summarized in
Table 1. As the table illustrates, both versions of the STAC corpus exhibit similar behavior regarding the four three-node subgraph patterns. This similarity suggests that the additional structures introduced by annotating
EEUs (Embedded Event Units) and their interactions with other discourse units in the situated STAC corpus strongly avoid the same three subgraph patterns (linear chain, dual-cause and common-cause) that are also avoided in the discourse-annotated structures of the chat-only STAC corpus. This parallelism between the two STAC versions supports the argument made by [
28,
29,
37] that
EEUs introduce higher-level structures with their own complexity and idiosyncrasies. In other words, if the presence of
EEUs were random and lacked systematic structure, the distribution of the corresponding
$S_i$ subgraph patterns in the situated STAC corpus would resemble that of a randomly generated network, rendering it inconclusive for the existence of motifs or antimotifs (see
Table A1).
This study’s findings indicate that the presence of the antimotifs
and
in the STAC discourse networks suggests a strong restriction on establishing discourse relations between
and
when there is a distant discourse relation between them, and
does not serve as the bridge connecting
and
. This restriction imposed by the discourse structure may be influenced by the distance between the two utterance labels,
and
, as the distance in terms of utterances is expected to impact the development of discourse in multiparty dialogue texts. However, it is important to note that the baseline approach of attaching an utterance label
to the last available in the discourse, as noted by [
72], fails to capture 40% of attachments in the ANNODIS corpus. This empirical fact emphasizes that there are numerous attachment candidates for a given
that extend beyond its immediate vicinity, further supporting the argument for the strong restriction imposed by the discourse structure.
A general observation related to the dual-cause and common-cause patterns is that, although both are considered antimotifs for the two STAC discourse networks, there is a noticeable difference between them: occurrences of one pattern have been recorded in both corpora, while no occurrence has been observed for the other. However, as mentioned above, the presence or absence of a specific pattern, although interesting in itself, does not suffice to label it a motif or an antimotif. The dual-cause pattern is an antimotif for the C58 discourse networks, too, as mentioned above, but since the common-cause pattern is neither a motif nor an antimotif for these discourse networks, our network analysis suggests that the above-mentioned restriction holds for both corpora but only for the dual-cause pattern.