1. Introduction
We propose an approach, based on Longobardi’s parametric comparison method (PCM) and the theory of error-correcting codes, to a quantitative evaluation of the “complexity” of a language family. One associates to a collection of languages to be analyzed with the PCM a binary (or ternary) code with one code word for each language in the family and each word consisting of the binary values of the syntactic parameters of that language. The ternary case allows for an additional parameter state that takes into account certain phenomena of entailment of parameters. We then consider a different kind of parameters: the code parameters of the resulting code, which in coding theory account for the efficiency of the coding and decoding procedures. These can be compared with some classical bounds of coding theory: the asymptotic bound, the Gilbert–Varshamov (GV) bound, etc. The position of the code parameters with respect to some of these bounds provides quantitative information on the variability of syntactic parameters within and across historical-linguistic families. While computations carried out for languages belonging to the same historical family yield codes below the GV curve, comparisons across different historical families can give examples of isolated codes lying above the asymptotic bound.
1.1. Principles and Parameters
The generative approach to linguistics relies on the notion of a Universal Grammar (UG) and a related universal list of syntactic parameters. In the Principles and Parameters model, developed since [1], these are thought of as binary-valued parameters or “switches” that set the grammatical structure of a given language. Their universality makes it possible to obtain comparisons, at the syntactic level, between arbitrary pairs of natural languages.
The PCM was introduced in [2] as a quantitative method in historical linguistics, for comparison of languages within and across historical families at the syntactic instead of the lexical level. Evidence was given in [3, 4] that the PCM gives reliable information on the phylogenetic tree of the family of Indo-European languages.
The PCM relies essentially on constructing a metric on a family of languages based on the relative Hamming distance between the sets of parameters as a measure of relatedness. The phylogenetic tree is then constructed on the basis of this datum of relative distances, see [3].
More work on syntactic phylogenetic reconstructions, involving a larger set of languages and parameters, is ongoing [5]. Syntactic parameters of world languages have also been used recently for investigations on the topology and geometry of syntactic structures and for statistical physics models of language evolution [6, 7, 8].
Publicly available data of syntactic parameters of world languages can be obtained from databases such as Syntactic Structures of World Languages (SSWL) [9], TerraLing [10], or the World Atlas of Language Structures (WALS) [11]. The data of syntactic parameters used in the present paper are taken from Table A of [3].
1.2. Syntactic Parameters, Codes and Code Parameters
Our purpose in this paper is to connect the PCM approach to the mathematical theory of error-correcting codes. We associate a code to any group of languages one wishes to analyze via the PCM, which has one code word for each language. If one uses a number n of syntactic parameters, then the code C sits in the space $\mathbb{F}_2^n$, where the elements of $\mathbb{F}_2$ correspond to the two possible values $\pm$ of each parameter, and the code word of a language is the string of values of its n parameters. We also consider a version with codes on an alphabet $\mathbb{F}_3$ of three letters, which allows for the possibility that some of the parameters may be made irrelevant by entailment from other parameters. In this case we use the letter 0 for the irrelevant parameters and the nonzero values $\pm 1$ for the parameters that are set in the language.
In the theory of error-correcting codes, see [12], one assigns to a code $C \subset \mathbb{F}_q^n$ two code parameters: $R = \log_q(\# C)/n$, the transmission rate of the code, and $\delta = d/n$, the relative minimum distance of the code, where d is the minimum Hamming distance between pairs of distinct code words. It is well known in coding theory that “good codes” are those that maximize both parameters, compatibly with several constraints relating R and δ. Consider the function $f: \mathcal{C}_q \to [0,1]^2$ from the space $\mathcal{C}_q$ of q-ary codes to the unit square, that assigns to a code C its code parameters, $f(C) = (\delta(C), R(C))$. A point $(\delta, R)$ in the range of f has finite (respectively, infinite) multiplicity if the preimage $f^{-1}(\delta, R)$ is a finite set (respectively, an infinite set). It was proved in [13] that there is a curve $R = \alpha_q(\delta)$ in the space of code parameters, the asymptotic bound, that separates code points that fill a dense region and that have infinite multiplicity from isolated code points that only have finite multiplicity. These better but more elusive codes are typically obtained through algebro-geometric constructions, see [13, 14, 15]. The asymptotic bound was related to Kolmogorov complexity in [16].
1.3. Position with Respect to the Asymptotic Bound
Given a collection of languages one wants to compare through their syntactic parameters, one can ask natural questions about the position of the resulting code in the space of code parameters and with respect to the asymptotic bound. The theory of error-correcting codes tells us that codes above the asymptotic bound are very rare. Indeed, we considered various sets of languages, and for each choice of a set of languages we considered an associated code, with a code word for each language in the set, given by its list of syntactic parameters. When computing the code parameters of the resulting code, one finds that, in the range of cases we looked at, when the languages in the chosen set belong to the same historical-linguistic family the resulting code lies below the asymptotic bound (and in fact below the Gilbert–Varshamov curve). This provides a precise quantitative bound on the possible spread of syntactic parameters compared to the size of the family, in terms of the number of different languages belonging to the same historical-linguistic group.
However, we also show that, if one considers sets of languages that do not belong to the same historical-linguistic family, then one can obtain codes that lie above the asymptotic bound, a fact that reflects, in code theoretic terms, the much greater variability of syntactic parameters. The result is in itself not surprising, but the point we wish to make is that the theory of error-correcting codes provides a natural setting where quantitative statements of this sort can be made using methods already developed for the different purposes of coding theory. We conclude by listing some new linguistic questions that arise by considering the parametric comparison method under this coding theory perspective.
1.4. Complexity of Languages and Language Families
The study of natural languages from the point of view of complexity theory has been of significant interest to linguists in recent years. The approaches typically followed focus on assigning a reasonable measure of complexity to individual languages and comparing complexities across different languages. For example, a notion of morphological complexity was studied in [17]. An approach to defining Kolmogorov complexity of languages on the basis of syntactic parameters was developed in [18]. A notion of language complexity based on the production rules of a generative grammar was considered in [19], in the setting of (finite) formal languages. For a more general computational perspective on the complexity of natural languages, see [20]. The idea of distinguishing languages by complexity is not without controversy in Linguistics. A very interesting general discussion of the problem and its evolution in the field can be found in [21].
In the present paper, we argue in favor of a somewhat different perspective, where we assign an estimate of complexity not to individual languages but to groups of languages, and in particular (historical) language families. Our version of complexity measures how “spread out” the syntactic parameters can be across the languages that belong to the same family. As we outlined in the previous subsections, this is measured by assigning to the language family a code, whose code words record the syntactic parameters of the individual languages in the family, then computing its code parameters and evaluating the position of the resulting code points with respect to curves like the asymptotic bound or the Gilbert–Varshamov line. The reason why this position carries complexity information lies in the subtle relation between the asymptotic bound and Kolmogorov complexity, recently derived by Manin and the author in [16], which we will review briefly in this paper.
2. Language Families as Codes
The Principles and Parameters model of Linguistics assigns to every natural language L a set of binary-valued parameters that describe properties of the syntactic structure of the language.
Let F be a language family, by which we mean a finite collection of languages. This may coincide with a family in the historical sense, such as the Indo-European family, or a smaller subset of languages related by historic origin and development (e.g., the Indo-Iranian, or Balto–Slavic languages), or simply any collection of languages one is interested in comparing at the parametric level, even if they are spread across different historical families.
We denote by n the number of parameters used in the parametric comparison method. We do not fix, a priori, a value for n, and we consider it a variable of the model. We will discuss below how one views, in our perspective, the issue of the independence of parameters.
After fixing an enumeration of the parameters, that is, a bijection between the set of parameters and the set $\{1, \ldots, n\}$, we associate to a language family F a code $C = C(F)$ in $\mathbb{F}_2^n$, with one code word for each language $L \in F$, with the code word $w = w(L)$ given by the list of parameters $w = (x_1, \ldots, x_n)$, $x_i \in \mathbb{F}_2$, of the language. For simplicity of notation, we just write L for the word $w(L)$ in the following.
In this model, we only consider binary parameters with values $\pm 1$ (here identified with letters 0 or 1 in $\mathbb{F}_2$) and we ignore parameters in a neutralized state following implications across parameters, as in the datasets of [3, 4]. The entailment of parameters, that is, the phenomenon by which a particular value of one parameter (but not the complementary value) renders another parameter irrelevant, was addressed in greater detail in [22]. We first discuss a version of our coding theory model that does not incorporate entailment. We will then comment in Section 2.7 below on how the model can be modified to incorporate this phenomenon.
The idea that natural languages can be described, at the level of their core grammatical structures, in terms of a string of binary characters (code words) was already used extensively in [23].
2.1. Code Parameters
In the theory of error-correcting codes, one assigns two main parameters to a code C, the transmission rate and the relative minimum distance. More precisely, a binary code $C \subset \mathbb{F}_2^n$ is an $[n,k,d]_2$-code if the number of code words is $\# C = 2^k$, that is,
$$k = \log_2 \# C,$$
where k need not be an integer, and the minimal Hamming distance between code words is
$$d = \min_{L_1 \neq L_2 \in C} d_H(L_1, L_2),$$
where the Hamming distance is given by
$$d_H(L_1, L_2) = \sum_{i=1}^n |x_i - y_i|$$
for $L_1 = (x_i)_{i=1}^n$ and $L_2 = (y_i)_{i=1}^n$ in C. The transmission rate of the code C is given by
$$R = \frac{k}{n}.$$
One denotes by $\delta_H(L_1, L_2)$ the relative Hamming distance
$$\delta_H(L_1, L_2) = \frac{1}{n} \sum_{i=1}^n |x_i - y_i|,$$
and one defines the relative minimum distance of the code C as
$$\delta = \frac{d}{n} = \min_{L_1 \neq L_2 \in C} \delta_H(L_1, L_2).$$
In coding theory, one would like to construct codes that simultaneously optimize both parameters $(\delta, R)$: a larger value of R represents a faster transmission rate (better encoding), and a larger value of δ represents the fact that code words are sufficiently sparse in the ambient space $\mathbb{F}_2^n$ (better decoding, with better error-correcting capability). Constraints on this optimization problem are expressed in the form of bounds in the space of $(\delta, R)$ parameters, see [12, 13].
In our setting, the R parameter measures the ratio between the logarithm of the number of languages in the given family and the total number of parameters, or equivalently how dense the given language family is in the ambient configuration space $\mathbb{F}_2^n$ of parameter possibilities. The parameter δ is the minimum, over all pairs of languages in the given family, of the relative Hamming distance used in the PCM of [3, 4].
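The definitions of this subsection translate directly into a short computation. The following Python sketch is only an illustration: the function names and the toy family of four binary words are hypothetical and are not taken from the datasets of [3, 4]. Counting the positions where two words differ agrees with the sum of $|x_i - y_i|$ for binary words.

```python
from itertools import combinations
from math import log


def hamming_distance(u, v):
    """Number of positions at which two equal-length words differ."""
    if len(u) != len(v):
        raise ValueError("code words must have the same length")
    return sum(a != b for a, b in zip(u, v))


def code_parameters(words, q=2):
    """Return n, k, d, R and delta for a code given as a list of words."""
    code = sorted(set(tuple(w) for w in words))  # a code is a set of words
    if len(code) < 2:
        raise ValueError("need at least two distinct code words")
    n = len(code[0])
    k = log(len(code), q)  # k = log_q(#C), not necessarily an integer
    d = min(hamming_distance(u, v) for u, v in combinations(code, 2))
    return {"n": n, "k": k, "d": d, "R": k / n, "delta": d / n}


# Hypothetical toy family: four "languages" with four binary parameters each.
toy_family = ["0110", "1010", "1111", "0001"]
toy_params = code_parameters(toy_family)
print(toy_params)
```

For a real language family one would pass one word per language, each listing the values of the same n parameters in the same order.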
2.2. Parameter Spoiling
In the theory of error-correcting codes, one considers spoiling operations on the code parameters. Applied to an $[n,k,d]_2$-code C, these produce, respectively, new codes with the following description (see Section 1.1.1 of [24]):
A code $C_1 = C \star_i f$ in $\mathbb{F}_2^{n+1}$, for a map $f: C \to \mathbb{F}_2$, whose code words are of the form $(x_1, \ldots, x_{i-1}, f(x_1, \ldots, x_n), x_i, \ldots, x_n)$ for $w = (x_1, \ldots, x_n) \in C$. If f is a constant function, $C_1$ is an $[n+1,k,d]_2$-code. If all pairs $w, w' \in C$ with $d_H(w, w') = d$ have $f(w) \neq f(w')$, then $C_1$ is an $[n+1,k,d+1]_2$-code.
A code $C_2 = C \star_i$ in $\mathbb{F}_2^{n-1}$, whose code words are given by the projections $(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ of code words $(x_1, \ldots, x_n)$ in C. This is an $[n-1,k,d-1]_2$-code, except when all pairs $w, w' \in C$ with $d_H(w, w') = d$ have the same letter $x_i$, in which case it is an $[n-1,k,d]_2$-code.
A code $C_3 = C(a,i) \subset C \subset \mathbb{F}_2^n$, given by the level set $C(a,i) = \{ w = (x_1, \ldots, x_n) \in C \mid x_i = a \}$. Taking $C(a,i) \star_i$ gives an $[n-1,k',d']_2$-code with $k - 1 \le k' < k$, and $d' \ge d$.
The same spoiling operations hold for q-ary codes $C \subset \mathbb{F}_q^n$, for any fixed q.
In our setting, where C is the code obtained from a family of languages, according to the procedure described above, the first spoiling operation can be seen as the effect of considering one more syntactic parameter, which is dependent on the other parameters, hence describing a function $F: \mathbb{F}_2^n \to \mathbb{F}_2$, whose restriction to C gives the function $f: C \to \mathbb{F}_2$. In particular, the case where f is constant on C represents the situation in which the new parameter adds no useful comparison information for the selected family of languages. The second spoiling operation consists in forgetting one of the parameters, and the third corresponds to forming subfamilies of the given family of languages, by grouping together those languages with a set value of one of the syntactic parameters. Thus, all these spoiling operations have a clear meaning from the point of view of the linguistic PCM.
2.3. Examples
We consider the same list of 63 parameters used in [3] (see Section 5.3.1 and Table A). This choice of parameters follows the modularized global parameterization method of [2], for the Determiner Phrase module. They encompass parameters dealing with person, number, and gender (1–6 in their list), parameters of definiteness (7–16), of countability (17–24), genitive structure (25–31), adjectival and relative modification (32–41), position and movement of the head noun (42–50), demonstratives and other determiners (51–55 and 60–63), and possessive pronouns (56–59); see Section 5.3.1 and Section 5.3.2 of [3] for more details.
Our very simple examples here are just meant to clarify our notation: they consist of some collections of languages selected from the list of 28, mostly Indo-European, languages considered in [
3]. In each group we consider we eliminate the parameters that are entailed from others, and we focus on a shorter list, among the remaining parameters, that will suffice to illustrate our viewpoint.
Example 1. Consider a code $C = \{\ell_1, \ell_2, \ell_3\}$ formed out of the languages $\ell_1$ = Italian, $\ell_2$ = Spanish, and $\ell_3$ = French, and let us consider only the first six syntactic parameters of Table A of [3], so that $C \subset \mathbb{F}_2^n$ with $n = 6$. The code words for the three languages are
| $\ell_1$ | 1 | 1 | 1 | 0 | 1 | 1 |
| $\ell_2$ | 1 | 1 | 1 | 1 | 1 | 1 |
| $\ell_3$ | 1 | 1 | 1 | 0 | 1 | 0 |
This has code parameters $(R = \log_2(3)/6 \approx 0.2642,\ \delta = 1/6)$, which satisfy $R < 1 - H_2(\delta) \approx 0.3500$, hence they lie below the GV curve (see Equation (8) below). We use this code to illustrate the three spoiling operations mentioned above.
Throughout the entire set of 28 languages considered in [3], the first two parameters are set to the same value 1, hence for the purpose of comparative analysis within this family, we can regard a code like the above as a twice spoiled code $C = C' \star_1 f_1 = (C'' \star_1 f_2) \star_1 f_1$, where both $f_1$ and $f_2$ are constant equal to 1 and $C'' \subset \mathbb{F}_2^4$ is the code obtained from the above by canceling the first two letters in each code word.
Conversely, we have $C' = C \star_1$ and $C'' = C' \star_1$, in terms of the second spoiling operation described above.
To illustrate the third spoiling operation, one can see, for instance, that $C(0,4) = \{\ell_1, \ell_3\}$, while $C(1,6) = \{\ell_1, \ell_2\}$.
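The three spoiling operations can be checked mechanically on the code of Example 1. The sketch below is illustrative only: the helper names are hypothetical, positions are 1-indexed as in the text, and it is assumed that the three rows of the table correspond to Italian, Spanish, and French in the order in which the languages are listed.

```python
from itertools import combinations


def min_distance(code):
    """Minimum Hamming distance over pairs of distinct code words."""
    return min(sum(a != b for a, b in zip(u, v))
               for u, v in combinations(code, 2))


def insert_letter(code, i, f):
    """First spoiling operation: insert f(w) as the new i-th letter."""
    return {w[:i - 1] + (f(w),) + w[i - 1:] for w in code}


def delete_letter(code, i):
    """Second spoiling operation: forget the i-th letter of every word."""
    return {w[:i - 1] + w[i:] for w in code}


def level_set(code, a, i):
    """Third spoiling operation: keep the words whose i-th letter is a."""
    return {w for w in code if w[i - 1] == a}


# Rows of the table of Example 1 (row-to-language assignment is assumed).
italian = (1, 1, 1, 0, 1, 1)
spanish = (1, 1, 1, 1, 1, 1)
french = (1, 1, 1, 0, 1, 0)
C = {italian, spanish, french}

# Forgetting the first two (constant) parameters, one at a time.
C_prime = delete_letter(C, 1)
C_double_prime = delete_letter(C_prime, 1)

# Re-inserting two constant letters equal to 1 recovers the original code.
def constant_one(word):
    return 1

rebuilt = insert_letter(insert_letter(C_double_prime, 1, constant_one),
                        1, constant_one)

print(min_distance(C), min_distance(C_double_prime))
print(level_set(C, 0, 4) == {italian, french})
print(level_set(C, 1, 6) == {italian, spanish})
```

Since the two deleted letters are constant on C, the minimum distance is unchanged while the length drops from 6 to 4, which is the exceptional case of the second spoiling operation.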
2.4. The Asymptotic Bound
The spoiling operations on codes were used in [13] to prove the existence of an asymptotic bound in the space of code parameters $(\delta, R)$, see also [16, 24, 25] for more detailed properties of the asymptotic bound.
Let $V_q \subset [0,1]^2 \cap \mathbb{Q}^2$ denote the space of code parameters $(\delta, R)$ of codes $C \subset \mathbb{F}_q^n$ and let $U_q$ be the set of all limit points of $V_q$. The set $U_q$ is characterized in [13] as
$$U_q = \{ (\delta, R) \in [0,1]^2 \mid R \le \alpha_q(\delta) \}$$
for a continuous, monotonically decreasing function $\alpha_q(\delta)$ (the asymptotic bound). Moreover, code parameters lying in $U_q$ are realized with infinite multiplicity, while code points in $V_q \setminus (V_q \cap U_q)$ have finite multiplicity and correspond to the isolated codes, see [13, 16].
Codes lying above the asymptotic bound are codes which have extremely good transmission rate and relative minimum distance, hence are very desirable from the coding theory perspective. The fact that the corresponding code parameters are not limit points of other code parameters and only have finite multiplicity reflects the fact that such codes are very difficult to reach or approximate. Isolated codes are known to arise from algebro-geometric constructions [14, 15].
Relatively little is known about the asymptotic bound: the question of the computability of the function $\alpha_q(\delta)$ was recently addressed in [25] and the relation to Kolmogorov complexity was investigated in [16]. There are explicit upper and lower bounds for the function $\alpha_q(\delta)$, see [12], including the Plotkin bound
$$\alpha_q(\delta) = 0 \quad \text{for} \quad \delta \ge \frac{q-1}{q}; \tag{5}$$
the singleton bound, which implies that $R = \alpha_q(\delta)$ lies below the line $R + \delta = 1$; the Hamming bound
$$\alpha_q(\delta) \le 1 - H_q\left(\frac{\delta}{2}\right),$$
where $H_q(x)$ is the q-ary Shannon entropy
$$H_q(x) = x \log_q(q-1) - x \log_q x - (1-x) \log_q(1-x),$$
which is the usual Shannon entropy for $q = 2$,
$$H_2(x) = -x \log_2 x - (1-x) \log_2(1-x).$$
One also has a lower bound given by the Gilbert–Varshamov bound
$$\alpha_q(\delta) \ge 1 - H_q(\delta). \tag{8}$$
The Gilbert–Varshamov curve can be characterized in terms of the behavior of sufficiently random codes, in the sense of the Shannon Random Code Ensemble, see [26, 27], while the asymptotic bound can be characterized in terms of Kolmogorov complexity, see [16].
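To locate a code point relative to these curves in practice, the bounds can be evaluated numerically. The following Python sketch uses hypothetical function names; it evaluates the q-ary entropy, the Gilbert–Varshamov, Hamming, and singleton curves, and tests the Plotkin condition. The asymptotic bound itself is not computed, since no closed formula for it is given here.

```python
from math import log


def entropy_q(x, q=2):
    """q-ary Shannon entropy H_q(x) for 0 <= x <= 1."""
    if not 0 <= x <= 1:
        raise ValueError("x must lie in [0, 1]")
    if x == 0:
        return 0.0
    if x == 1:
        return log(q - 1, q)
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)


def gv_curve(delta, q=2):
    """Gilbert-Varshamov lower bound 1 - H_q(delta), set to 0 past (q-1)/q."""
    if delta >= (q - 1) / q:
        return 0.0
    return 1 - entropy_q(delta, q)


def hamming_curve(delta, q=2):
    """Hamming upper bound 1 - H_q(delta / 2)."""
    return 1 - entropy_q(delta / 2, q)


def singleton_curve(delta):
    """Singleton upper bound R = 1 - delta."""
    return 1 - delta


def violates_plotkin(delta, rate, q=2):
    """True if R > 0 while delta >= (q-1)/q, where the bound forces R = 0."""
    return rate > 0 and delta >= (q - 1) / q


def below_gv(delta, rate, q=2):
    """True if the code point (delta, R) lies strictly below the GV curve."""
    return rate < gv_curve(delta, q)


print(gv_curve(1 / 6), hamming_curve(1 / 6), singleton_curve(1 / 6))
```

A point that violates the Plotkin bound is necessarily above the asymptotic bound, while a point below the GV curve is necessarily below it; points in between cannot be classified by these two tests alone.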
2.5. Code Parameters of Language Families
From the coding theory viewpoint, it is natural to ask whether there are codes C, formed out of a choice of a collection of natural languages and their syntactic parameters, whose code parameters lie above the asymptotic bound curve $R = \alpha_2(\delta)$.
For instance, a code C whose code parameters violate the Plotkin bound (5) must be an isolated code above the asymptotic bound. This means constructing a code C with $\delta \ge 1/2$, that is, such that any pair of distinct code words $w \neq w' \in C$ differ in at least half of the parameters. A direct examination of the list of parameters in Table A of [3] and Figure 7 of [4] shows that it is very difficult to find, within the same historical-linguistic family (e.g., the Indo-European family), pairs of languages $L_1$, $L_2$ with $\delta_H(L_1, L_2) \ge 1/2$. For example, among the syntactic relative distances listed in Figure 7 of [4] one finds only the pair
with a relative distance of
. Other pairs come close to this value, for example Farsi and French have a relative distance of
, but French and Romanian only differ by
.
One has better chances of obtaining codes above the asymptotic bound if one compares languages that are not so closely related at the historical level.
Example 2. Consider the set $C = \{\ell_1, \ell_2, \ell_3\}$ with languages $\ell_1$ = Arabic, $\ell_2$ = Wolof, and $\ell_3$ = Basque. We exclude from the list of Table A of [3] all those parameters that are entailed and made irrelevant by some other parameter in at least one of these three chosen languages. This gives us a list of 25 remaining parameters, which are those numbered as 1–5, 7, 10, 20–21, 25, 27–29, 31–32, 34, 37, 42, 50–53, 55–57 in [3], and the following three code words:
| $\ell_1$ | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| $\ell_2$ | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| $\ell_3$ | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
This example, although very simple and quite artificial in the choice of languages, already suffices to produce a code C that lies above the asymptotic bound. In fact, we have $d_H(\ell_1, \ell_2) = 16$, and $d_H(\ell_2, \ell_3) = d_H(\ell_1, \ell_3) = 13$, so that $\delta = 13/25 = 0.52$. Since $R > 0$, the code point $(\delta, R)$ violates the Plotkin bound, hence it lies above the asymptotic bound.
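The distances in Example 2 can be recomputed directly from the three code words. In the Python sketch below, the words are transcribed from the table above, under the assumption that its rows are listed in the order Arabic, Wolof, Basque; the variable names are hypothetical.

```python
from itertools import combinations
from math import log2

# Code words of Example 2 (25 parameters), one string per table row.
WORDS = {
    "Arabic": "1111110101010111111010000",
    "Wolof": "1110011010100101100111111",
    "Basque": "1101001000111011011111100",
}


def hamming_distance(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


n = len(WORDS["Arabic"])
distances = {
    (a, b): hamming_distance(WORDS[a], WORDS[b])
    for a, b in combinations(WORDS, 2)
}
d = min(distances.values())
delta = d / n                 # relative minimum distance
rate = log2(len(WORDS)) / n   # transmission rate R = log_2(#C) / n

print(distances)
print(delta, rate)
print("violates Plotkin bound:", rate > 0 and delta >= 0.5)
```

The resulting rate is $R = \log_2(3)/25 \approx 0.063$, so the code point has $R > 0$ at $\delta = 0.52 \ge 1/2$.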
It would be more interesting to find a code C consisting of languages belonging to the same historical-linguistic family (outside of the Indo-European group), that lies above the asymptotic bound. Such examples would correspond to linguistic families that exhibit a very strong variability of the syntactic parameters, in a way that is quantifiable through the properties of C as a code.
If one considers the 22 Indo-European languages in [3] with their parameters, one obtains a code C that is below the Gilbert–Varshamov line, hence below the asymptotic bound by Equation (8). A few other examples, taken from other non-Indo-European historical-linguistic families, computed using those parameters reported in the SSWL database (for example the set of Malayo–Polynesian languages currently recorded in SSWL) also give codes whose code parameters lie below the Gilbert–Varshamov curve. One can conjecture that any code C constructed out of natural languages belonging to the same historical-linguistic family will be below the asymptotic bound (or perhaps below the GV bound), which would provide a quantitative bound on the possible spread of syntactic parameters within a historical family, given the size of the family. Examples like the simple one constructed above, using languages not belonging to the same historical family, show that, on the contrary, across different historical families one encounters a greater variability of syntactic parameters. To our knowledge, no systematic study of parameter variability from this coding theory perspective has been carried out so far.
Ongoing work of the author is considering a systematic analysis of language families, based on the SSWL database of syntactic parameters, using this coding theory technique. This will include an analysis of how much the conclusions about the spreading of syntactic parameters across language families obtained with this technique depend on data pre-processing like the removal of spoiling features, and what can be retained as an objective property of a set of languages. Moreover, a further purpose of this ongoing study is to combine the coding theory approach and the measures of complexity for groups of languages described in the present paper with the spin glass dynamical models of language change considered in [8], which were aimed at studying dynamically the spreading of syntactic parameters across groups of languages. The aim is to introduce complexity measures based on coding theory as part of the energy landscape of the spin glass model, following the suggestion of [28] on analogies between the roles of complexity in the theory of computation and energy in physical theories. These results, along with a more detailed analysis of the codes and code parameters of various language families, will appear in forthcoming work.
2.6. Comparison with Other Bounds
Another possible question one can consider in this setting is how the codes obtained from syntactic parameters of a given set of natural languages compare with other known families of error correcting codes and with other bounds in the space of code parameters.
For instance, it is known that an important improvement over the behavior of typical random codes can be obtained by considering codes determined by algebro-geometric curves defined over a finite field $\mathbb{F}_q$. Let $N_q(X) = \# X(\mathbb{F}_q)$ be the number of points over $\mathbb{F}_q$ of the curve X, and let $N_q(g) = \max N_q(X)$, with the maximum taken over all genus g curves X over $\mathbb{F}_q$. As shown in Theorem 2.3.22 of [12], asymptotically the $N_q(g)$ satisfy the Drinfeld–Vladut bound
$$A(q) := \limsup_{g \to \infty} \frac{N_q(g)}{g} \le \sqrt{q} - 1,$$
and as shown in Section 3.4.1 of [12], this determines an algebro-geometric bound
$$\alpha_q(\delta) \ge 1 - \delta - \frac{1}{A(q)}$$
and the asymptotic Tsfasman–Vladut–Zink bound
$$\alpha_q(\delta) \ge 1 - \delta - \frac{1}{\sqrt{q} - 1}.$$
The Tsfasman–Vladut–Zink line $R = 1 - \delta - (\sqrt{q} - 1)^{-1}$ lies entirely below the GV line for $q < 49$ (Theorem 3.4.4 of [12]).
A probabilistic argument given in Section 3.4.2 of [12] shows that highly non-random codes coming from algebraic curves can be asymptotically better than random codes (for sufficiently large q) as they cluster around the TVZ line. However, for $q = 2$ or $q = 3$, as in the case of codes from syntactic parameters of groups of languages that we consider here, the TVZ line lies below the GV line, hence any example that lies above the GV bound also behaves better than the algebro-geometric bound. Such examples, like the one given above, for the three languages Arabic, Wolof, Basque, are very rare among codes obtained from syntactic parameters of languages, as they require the choice of a group of languages that are all very far from each other syntactically, with very large relative Hamming distances between syntactic parameters.
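The relative position of the two lines for small and large alphabets can be checked numerically. The sketch below assumes the standard expressions $R = 1 - H_q(\delta)$ for the GV line and $R = 1 - \delta - (\sqrt{q} - 1)^{-1}$ for the TVZ line; the function names are hypothetical, and the TVZ expression is simply evaluated as a formula for each q, without any claim about the existence of the corresponding curves.

```python
from math import log, sqrt


def entropy_q(x, q):
    """q-ary Shannon entropy H_q(x) for 0 < x < 1."""
    return x * log(q - 1, q) - x * log(x, q) - (1 - x) * log(1 - x, q)


def gv_line(delta, q):
    """Gilbert-Varshamov line R = 1 - H_q(delta)."""
    return 1 - entropy_q(delta, q)


def tvz_line(delta, q):
    """Tsfasman-Vladut-Zink line R = 1 - delta - 1 / (sqrt(q) - 1)."""
    return 1 - delta - 1 / (sqrt(q) - 1)


def tvz_exceeds_gv_somewhere(q, steps=999):
    """Scan a grid of delta values in (0, (q-1)/q) for a crossing."""
    top = (q - 1) / q
    grid = [top * j / (steps + 1) for j in range(1, steps + 1)]
    return any(tvz_line(x, q) > gv_line(x, q) for x in grid)


for q in (2, 3, 49):
    print(q, tvz_exceeds_gv_somewhere(q))
```

For $q = 2$ and $q = 3$ the TVZ expression is negative for every δ, so it cannot exceed the GV line; for $q = 49$ the scan finds values of δ (for instance $\delta = 1/2$) where the TVZ line is the higher of the two.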
On the other hand, even for cases of groups of languages for which the resulting code parameters are below the GV line, it is still possible to get some additional information by comparing the position of the code parameters to other curves obtained from other bounds, such as the Blokh–Zyablov bound or the Katsman–Tsfasman–Vladut bound, see Appendix A.2.1 of [12] for a summary of all these different bounds.
For example, the first example given above, with the three languages Italian, Spanish, French and a string of six syntactic parameters, gives a code with code parameters that are below the GV line, but above both the Blokh–Zyablov and the Katsman–Tsfasman–Vladut bounds, according to the table of asymptotic bounds given in Appendix A.2.4 of [12].
2.7. Entailment and Dependency of Parameters
In the discussion above we did not incorporate in our model the fact that certain syntactic parameters can entail other parameters in such a way that one particular value of one of the parameters renders another parameter irrelevant or not defined, see the discussion in Section 5.3.2 of [3].
One possible way to alter the previous construction to account for these phenomena is to consider the codes C associated to families of languages as codes in $\mathbb{F}_3^n$, where n is the number of parameters, as before, and the set of values is now given by $\mathbb{F}_3 = \{-1, 0, +1\}$, with $\pm 1$ corresponding to the binary values of the parameters that are set for a given language and value 0 assigned to those parameters that are made irrelevant for the given language, by entailment from other parameters, or are not defined. This allows us to consider the full range of parameters used in [3, 4]. We revisit Example 2 considered above.
Example 3. Let C be the code obtained from the languages Arabic, Wolof, and Basque, as a code in $\mathbb{F}_3^n$ with $n = 63$, using the entire list of parameters in [3]. The code parameters $(\delta, R)$ of this code no longer violate the Plotkin bound. In fact, the parameters satisfy $R < 1 - H_3(\delta)$, so the code C now also lies below the GV bound.
Thus, the effect of including the entailed syntactic parameters in the comparison spoils the code parameters enough that they fall in the area below the GV bound.
Notice that what we propose here is different from the counting used in [3], where the relative distances $\delta_H(L_1, L_2)$ are normalized with respect to the number of non-zero parameters (which therefore varies with the choice of the pair $(L_1, L_2)$) rather than the total number n of parameters. While this has the desired effect of getting rid of insignificant parameters that spoil the code, it has the undesirable property of producing codes with code words of varying lengths, while counting only those parameters that have no zero values over the entire family of languages, as in Example 2, avoids this problem. Adapting the coding theory results about the asymptotic bound to codes with words of variable length may be desirable for other reasons as well, but it will require an investigation beyond the scope of the present paper.
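The difference between these ways of counting can be made concrete on a small ternary example. The Python sketch below uses three hypothetical words over $\{-1, 0, +1\}$ that are not taken from [3]; the function names are hypothetical as well, and the pairwise normalization is implemented under the assumption that a parameter is counted for a pair only when it is non-zero in both languages.

```python
def delta_total(u, v):
    """Relative distance normalized by the total number n of parameters."""
    return sum(a != b for a, b in zip(u, v)) / len(u)


def delta_pairwise(u, v):
    """Relative distance over parameters set (non-zero) in both languages."""
    both_set = [(a, b) for a, b in zip(u, v) if a != 0 and b != 0]
    if not both_set:
        raise ValueError("no parameter is set in both languages")
    return sum(a != b for a, b in both_set) / len(both_set)


def zero_free_columns(words):
    """Indices of parameters that are non-zero in every language."""
    n = len(words[0])
    return [i for i in range(n) if all(w[i] != 0 for w in words)]


def restrict(words, columns):
    """Keep only the given parameters, giving words of one fixed length."""
    return [tuple(w[i] for i in columns) for w in words]


# Hypothetical ternary words: 0 marks an entailed or undefined parameter.
a = (1, -1, 0, 1, 1, -1)
b = (1, 1, 0, -1, 0, -1)
c = (-1, 1, 1, -1, 1, 0)
family = [a, b, c]

columns = zero_free_columns(family)
binary_family = restrict(family, columns)

print(delta_total(a, c), delta_pairwise(a, c))
print(columns, binary_family)
```

In the pairwise normalization the effective word length depends on the pair, whereas restricting to the zero-free parameters yields words of a common length, as in Example 2.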
More generally, there are various kinds of dependencies among syntactic parameters. Some sets of hierarchical relations are discussed, for instance, in [29].
By the spoiling operations of codes described above, we know that if some of the syntactic parameters considered are functions of other parameters, the code parameters of the resulting code are worse than the parameters of the code C where only independent parameters were considered.
Part of the reason why code parameters of groups of languages in the family analyzed in [3] end up in the region below the asymptotic and the GV bound may be an artifact of the presence of dependencies among the chosen 63 syntactic parameters. From the coding theory perspective, the parametric comparison method works better on a smaller set of independent parameters than on a larger set that includes several dependencies.
Entailment relations between syntactic parameters play an important role in the dynamical models of language evolution constructed in [8], based on spin glass models in statistical physics.
Notice that the type of entailment relations we consider here are only of a rather special form, where a parameter is made undefined by effect of the value of another parameter (hence the use of the value 0 for the undetermined parameter). There are more general forms of entailment that we do not consider here, but which will be discussed in more detail in upcoming work. For example, one can have a situation with two languages in which a parameter is entailed by the values of two other parameters, but entailed to two different values in the two languages. In this case, the proposal above needs to be modified, because this entailed parameter should contribute to the Hamming distance between the two languages. In such a situation the entailed parameter should increase, rather than spoil, the efficiency of the code. Keeping entailed parameters can be used for error-correcting purposes, as contributing to error detection. The role of entailment of parameters was considered in [8], in the use of spin glass models for language change, where the entailment relations appear as couplings at the vertices (interaction terms) between different Ising/Potts models on the same underlying graph of language interactions. In upcoming work, now in preparation, we will discuss how treating different forms of entailment of parameters in the coding theory setting described here relates to the treatment of entailment relations in the spin glass model of [8].
3. Entropy and Complexity for Language Families
3.1. Why the Asymptotic Bound?
In the examples discussed above we compared the position of the code point associated to a given set of languages to certain curves in the space of code parameters. In particular, we focused on the asymptotic bound curve and the Gilbert–Varshamov curve. It should be pointed out that these two curves have a very different nature.
The asymptotic bound is the only curve that separates regions in the space of parameters that correspond to code points with entirely different behavior. As shown in [13, 24], code points in the area below the asymptotic bound are realized with infinite multiplicity and fill densely the region, while code points that lie above the asymptotic bound are isolated and realized with finite multiplicity.
The Gilbert–Varshamov curve, by contrast, is related to the statistical behavior of sufficiently random codes (as we recall in Section 3.2 below), but does not separate two regions with significantly different behavior in the space of code points. Thus, in this respect, the asymptotic bound is a more natural curve to consider than the Gilbert–Varshamov curve.
Thus, a heuristic interpretation of the position of codes obtained from groups of languages, with respect to the asymptotic bound can be understood as follows. The position of a code point above or below the asymptotic bound reflects a very different behavior of the corresponding code with respect to how easily “deformable” it is. The sporadic codes that lie above the asymptotic bound are rigid objects, in contrast to the deformable objects below the asymptotic bound. In terms of properties of the distribution of syntactic parameters within a set of languages, this different nature of the associated code can be seen as a measure of the degree of “deformability” of the parameter distribution: in languages that belong to the same historical linguistic families, the parameter distribution has evolved historically along with the development of the family’s phylogenetic tree, and one expects that correspondingly the code parameters will indicate a higher degree of “deformability” of the corresponding code. If a group of languages is chosen that belong to very different historical families, on the contrary, one expects that the distribution of syntactic parameters will not necessarily lead any longer to a code that has the same kind of deformability property: code points above the asymptotic bound may be realizable by this type of language groups.
There is no similar interpretation for the position of the code point with respect to the Gilbert–Varshamov line. An interpretation of that position can be sought in terms of Shannon entropy, as we discuss below. Summarizing: the main conceptual distinction between the Gilbert–Varshamov line and the asymptotic bound is that the GV line represents only a statistical phenomenon, as we review below, while the asymptotic bound represents a true separation between two classes of structurally different codes, in the sense explained above.
3.2. Entropy and Statistics of the Gilbert–Varshamov Line
The Gilbert–Varshamov line $R = 1 - H_q(\delta)$ can be characterized statistically. Such a statistical description can be obtained by considering the Shannon Random Code Ensemble (SRCE). These are random codes obtained by choosing code words as independent random variables with respect to a uniform Bernoulli measure, so that a code is described by a randomly chosen set of different words of length $n$ occurring with probability $q^{-n}$, see [
26,
27]. There is no a priori reason why the type of codes we consider here, with code words formed using the syntactic parameters of natural languages, would be linear. Thus, we consider the general setting of unstructured codes, as in Section V of [
27].
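The Hamming volume $\mathrm{Vol}_q(n,d)$, that is, the number of words of length $n$ at Hamming distance at most $d$ from a given one, can be estimated in terms of the $q$-ary Shannon entropy
$$ H_q(\delta) = \delta \log_q(q-1) - \delta \log_q \delta - (1-\delta) \log_q(1-\delta) $$
in the form
$$ q^{(H_q(\delta) - o(1))\, n} \le \mathrm{Vol}_q(n, d = n\delta) \le q^{H_q(\delta)\, n}. $$
The expectation value for the random variable counting the number of unordered pairs of distinct code words with Hamming distance at most $d$ is then estimated as
$$ E \sim \binom{q^k}{2}\, \mathrm{Vol}_q(n,d)\, q^{-n} = q^{n (H_q(\delta) - 1 + 2R) + o(n)}, $$
where $q^k$ is the number of code words and $R = k/n$.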
This estimate is then used (see [
26,
27]) to show that the probability to have codes in the SRCE with
is bounded by
. By a similar argument (see Section V of [
27] and Proposition 2.2 of [
16]) it is shown that the probability that
is bounded by
.
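The quantities entering these estimates can be checked numerically on small examples. The following is a minimal sketch, not part of the cited arguments: the function names are hypothetical, and the random code is assumed to consist of $q^k$ words drawn independently and uniformly.

```python
from math import comb, log


def entropy_q(delta: float, q: int = 2) -> float:
    """q-ary Shannon entropy H_q(delta), with the convention 0 * log 0 = 0."""
    if delta == 0.0:
        return 0.0
    if delta == 1.0:
        return log(q - 1, q)
    return (delta * log(q - 1, q)
            - delta * log(delta, q)
            - (1 - delta) * log(1 - delta, q))


def hamming_volume(n: int, d: int, q: int = 2) -> int:
    """Number of words of length n at Hamming distance at most d from a fixed word."""
    return sum(comb(n, j) * (q - 1) ** j for j in range(d + 1))


def expected_close_pairs(n: int, k: int, d: int, q: int = 2) -> float:
    """Expected number of unordered pairs of code words at distance <= d,
    assuming q**k words drawn independently and uniformly (an assumption
    of this sketch)."""
    m = q ** k
    return comb(m, 2) * hamming_volume(n, d, q) / q ** n


if __name__ == "__main__":
    n, d = 100, 20
    # Exponent of the Hamming volume compared with the entropy upper bound.
    print(log(hamming_volume(n, d), 2) / n, entropy_q(d / n))
    print(expected_close_pairs(n, k=10, d=d))
```

For $\delta \le (q-1)/q$ the first printed number should not exceed the second, in agreement with the upper estimate on the Hamming volume.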
While, by this type of argument, one can see the Gilbert–Varshamov line as representing the typical behavior of sufficiently random codes, the asymptotic bound does not have a similar statistical interpretation. It does have, however, a relation to Kolmogorov complexity, which is relevant to the point of view discussed in the present paper. The relation between the asymptotic bound of error-correcting codes and Kolmogorov complexity was described in [
16]. We recall it in the rest of this section, along with its implications for the linguistic applications we are considering.
3.3. Kolmogorov Complexity
We refer the reader to [
30] for an extensive treatment of Kolmogorov complexity and its properties. We recall here some basic facts we need in the following.
Let $T_U$ be a universal Turing machine, that is, a Turing machine that can simulate any other Turing machine by reading on tape both the input and the description of the machine it should simulate. A prefix Turing machine is a Turing machine with unidirectional input and output tapes and bidirectional work tapes. The set of programs $P$ on which a prefix Turing machine halts forms a prefix code.
Given a string $w$ in an alphabet $\mathfrak{A}$, the prefix Kolmogorov complexity is given by the minimal length of a program for which the universal prefix Turing machine $T_U$ outputs $w$,
$$ K_{T_U}(w) = \min_{P \,:\, T_U(P) = w} \ell(P). $$
There is a universality property. Namely, given any other prefix Turing machine $T$, one has
$$ K_{T_U}(w) \le K_T(w) + c_T, $$
where the shift is by a bounded constant, independent of $w$. The constant $c_T$ is the Kolmogorov complexity of the program needed to describe $T$ so that $T_U$ can simulate it.
A variant of the notion of Kolmogorov complexity described above is given by conditional Kolmogorov complexity,
$$ K_{T_U}(w \mid \ell(w)) = \min_{P \,:\, T_U(P, \ell(w)) = w} \ell(P), $$
where the length $\ell(w)$ is given, and made available to the machine $T_U$. One then has
$$ K_{T_U}(w \mid \ell(w)) \le \ell(w) + c, $$
because if $\ell(w)$ is known, then a possible program is just to write out $w$. This means that $\ell(w) + c$ is just the number of bits needed for the transmission of $w$ plus the print instructions.
An upper bound is given by
$$ K_{T_U}(w) \le K_{T_U}(w \mid \ell(w)) + 2 \log \ell(w) + c. $$
If one does not know $\ell(w)$ a priori, one needs to signal the end of the description of $w$. For this it suffices to have a “punctuation method”, and one can see that this has the effect of adding the term $2 \log \ell(w)$ in the above estimate. In particular, the length of any program that produces a description of $w$ is an upper bound on the Kolmogorov complexity $K_{T_U}(w)$.
One can think of Kolmogorov complexity in terms of data compression: the shortest description of $w$ is also its most compressed form. Upper bounds for Kolmogorov complexity are therefore provided easily by data compression algorithms. However, while providing upper bounds for complexity is straightforward, the situation with lower bounds is entirely different: constructing a lower bound runs into a fundamental obstacle caused by the fact that the halting problem is unsolvable. As a consequence, Kolmogorov complexity is not a computable function. Indeed, suppose one would list programs $P_k$ (with increasing lengths) and run them through the machine $T_U$. If the machine halts on $P_k$ with output $w$, then $\ell(P_k)$ is an approximation to $K_{T_U}(w)$. However, there may be an earlier $P_j$ in the list such that $T_U$ has not yet halted on $P_j$. If $T_U$ eventually halts also on $P_j$ and outputs $w$, then $\ell(P_j)$ will be a better approximation to $K_{T_U}(w)$. So one would be able to compute $K_{T_U}(w)$ if one could tell exactly on which programs $P_k$ the machine $T_U$ halts, but that is indeed the unsolvable halting problem.
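The remark that compression algorithms provide upper bounds can be illustrated with a short sketch. The choice of zlib as compressor and the helper name are assumptions made for illustration only. The compressed length bounds the complexity up to an additive constant accounting for a fixed decompression program.

```python
import os
import zlib


def compressed_length_bits(data: bytes) -> int:
    """Length in bits of the zlib-compressed form of `data`.

    Up to an additive constant (the size of a fixed decompressor), this is
    an upper bound on the Kolmogorov complexity of `data`. It says nothing
    about lower bounds, which are not computable.
    """
    return 8 * len(zlib.compress(data, 9))


if __name__ == "__main__":
    regular = b"01" * 5000      # highly regular string of 10000 bytes
    noisy = os.urandom(10000)   # random bytes, typically incompressible
    print(compressed_length_bits(regular), 8 * len(regular))
    print(compressed_length_bits(noisy), 8 * len(noisy))
```

The regular string admits a description much shorter than its length, while for the random bytes the compressor typically gives no improvement over writing the string out.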
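Kolmogorov complexity and Shannon entropy are related: one can view Shannon entropy as an averaged version of Kolmogorov complexity in the following sense (see Section 2.3 of [
31]). Suppose given independent random variables $X_k$, distributed according to a Bernoulli measure $P = \{p_a\}_{a \in \mathfrak{A}}$ with $p_a = P(X = a)$. The Shannon entropy is given by
$$ S(X) = -\sum_{a \in \mathfrak{A}} P(X = a) \log P(X = a). $$
There exists a $c > 0$, such that, for all $n \in \mathbb{N}$,
$$ S(X) \le \frac{1}{n} \sum_{w \in \mathfrak{A}^n} P(w)\, K(w \mid \ell(w) = n) \le S(X) + \frac{(\#\mathfrak{A} - 1) \log n}{n} + \frac{c}{n}. $$
The expectation value
$$ \lim_{n \to \infty} E\left( \frac{1}{n} K(X_1 \cdots X_n \mid n) \right) = S(X) $$
shows that the average expected Kolmogorov complexity for length $n$ descriptions approaches the Shannon entropy in the limit when $n \to \infty$.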
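The relation between average complexity and entropy can be illustrated empirically by replacing the (non-computable) Kolmogorov complexity with a computable upper bound. This is a rough sketch under stated assumptions: zlib stands in for an optimal description method, the alphabet is binary, and the function names are hypothetical. It shows only the qualitative trend, not the limit statement itself.

```python
import random
import zlib
from math import log2


def bernoulli_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a binary Bernoulli variable with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def compressed_bits_per_symbol(p: float, n: int, seed: int = 0) -> float:
    """Compressed length per symbol of n independent Bernoulli(p) bits.

    The bits are packed 8 per byte, so the uncompressed encoding costs one
    bit per symbol. zlib is only a stand-in for an optimal description.
    """
    rng = random.Random(seed)
    bits = [1 if rng.random() < p else 0 for _ in range(n)]
    packed = bytearray()
    for i in range(0, n, 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        packed.append(byte)
    return 8 * len(zlib.compress(bytes(packed), 9)) / n


if __name__ == "__main__":
    for p in (0.5, 0.25, 0.1):
        print(p, bernoulli_entropy(p), compressed_bits_per_symbol(p, 80000))
```

Sources with lower entropy should yield shorter compressed descriptions per symbol, although a general-purpose compressor will not reach the entropy rate exactly.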
3.4. Kolmogorov Complexity and the Asymptotic Bound
We recall here briefly the result of [
16] linking the asymptotic bound of error correcting codes to Kolmogorov complexity.
As we discussed above, only the asymptotic bound marks a significant change of behavior of codes across the curve (isolated code points with finite multiplicity versus accumulation points with infinite multiplicity). In this sense this curve is very different from all the other bounds in the space of code parameters. However, there is no explicit expression for the curve $R = \alpha_q(\delta)$ that gives the asymptotic bound. Indeed, even the question of the computability of the function $\alpha_q(\delta)$ is a priori unclear. This question was formulated explicitly in [
25].
It is proved in [
16] that the asymptotic bound $R = \alpha_q(\delta)$ becomes computable given an oracle that can list codes by increasing Kolmogorov complexity. Given such an oracle, one can provide an explicit iterative (algorithmic) procedure for constructing the asymptotic bound. This implies that the asymptotic bound is “at worst as non-computable as Kolmogorov complexity”.
Consider the set of (unstructured) q-ary codes and the set of code points and the computable function that assigns to a code its code parameters . Let and be, respectively, the subsets of the space of code points that correspond to code points realized with finite and with infinite multiplicity. The algorithm iteratively produces two sets and that approximate, respectively, and by and . The inductive construction starts by choosing an increasing sequence of positive integers and setting and taking to be the set of code points y with , where is a fixed enumeration of the set of rational points where code points belong.
General estimates on the behavior of (exponential) Kolmogorov complexity under composition of total recursive functions (see [
30], Section VI.9 of [
32]) show that, for a composition
of recursive functions the Kolmogorov complexity satisfies
for a fixed
and varying
,
.
Consider the total recursive function
with
where
is an enumeration of the space of codes. Consider the enumerable sets
and
, with
and
. For
, defined as
on
, applying the composition rule for exponential Kolmogorov complexity, it is shown in Proposition 3.1 of [
16] that, for
and
, one has
, hence
Using the same type of estimate of Kolmogorov complexity for composition of recursive functions, it is then shown in Proposition 3.2 of [
16] that, for
and
, and for a unique
, with
,
and
, one finds
To construct inductively
and
, given
and
, one takes
to consist of the elements in the list
Here one invokes the oracle, which ensures that, if such
x exists, then it must be contained in a finite list of points
with bounded complexity
One then takes
to consist of the remaining elements in the list
. We refer the reader to [
16] for a more detailed formulation.
More generally, the argument of [
16] recalled above shows that, for a recursive function
, determining which values have infinite multiplicities is computable given an oracle that enumerates integers in order of Kolmogorov complexity.
As discussed in [
16,
24], the asymptotic bound can also be seen as the phase transition curve for a quantum statistical mechanical system constructed out of the space of codes, where the partition function of the system weights codes according to their Kolmogorov complexity. This is as close to a “statistical description” of the asymptotic bound as one can achieve.
In comparison with the behavior of random codes (codes whose complexity is comparable to their size), which concentrate in the region bounded by the Gilbert–Varshamov line, when ordering codes by complexity, non-random codes of lower complexity populate the region above, with code points accumulating in the intermediate region bounded by the asymptotic bound. That intermediate region thus, in a sense, reflects the difference between Shannon entropy and complexity.
3.5. Entropy and Complexity Estimates for Language Families
On the basis of the considerations of the previous sections and of the results of [
16,
24] recalled above, we propose a way to assign a quantitative estimate of entropy and complexity to a given set of natural languages.
As before, let $C = C(\mathcal{F})$ be a binary (or ternary) code where the code words are the binary (ternary) strings of syntactic parameters of a set of languages $\mathcal{F}$. We define the entropy of the language family $\mathcal{F}$ as the $q$-ary Shannon entropy $H_q(\delta(C))$, where $q$ is either 2 or 3 for binary or ternary codes, and $\delta(C)$ is the relative minimum distance parameter of the code $C$. We also define the entropy gap of the language family $\mathcal{F}$ as the value of $H_q(\delta(C)) - 1 + R(C)$, which measures the distance of the code point from the Gilbert–Varshamov line, that is, from the behavior of a typical random code.
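The two quantities just defined can be computed directly from a table of parameter values. The sketch below assumes the standard code parameters $R = \log_q(\#C)/n$ and $\delta = d/n$, with $d$ the minimum Hamming distance. The function names and the toy parameter strings are hypothetical and do not correspond to actual languages. Identical parameter strings are counted as a single code word, which is an assumption of this sketch.

```python
from itertools import combinations
from math import log


def entropy_q(delta: float, q: int = 2) -> float:
    """q-ary Shannon entropy H_q(delta), with the convention 0 * log 0 = 0."""
    if delta == 0.0:
        return 0.0
    if delta == 1.0:
        return log(q - 1, q)
    return (delta * log(q - 1, q)
            - delta * log(delta, q)
            - (1 - delta) * log(1 - delta, q))


def code_parameters(words, q: int = 2):
    """Return (R, delta) for a code given as equal-length tuples of symbols."""
    distinct = set(words)  # assumption: repeated strings count as one code word
    if len(distinct) < 2:
        raise ValueError("need at least two distinct code words")
    n = len(next(iter(distinct)))
    d = min(sum(a != b for a, b in zip(u, v))
            for u, v in combinations(distinct, 2))
    return log(len(distinct), q) / n, d / n


def entropy_and_gap(words, q: int = 2):
    """Entropy H_q(delta(C)) and entropy gap H_q(delta(C)) - 1 + R(C)."""
    rate, delta = code_parameters(words, q)
    h = entropy_q(delta, q)
    return h, h - 1 + rate


if __name__ == "__main__":
    # Hypothetical parameter strings for three languages (not real data).
    toy_family = [
        (1, 1, 0, 0, 1, 0),
        (1, 0, 0, 1, 1, 0),
        (0, 1, 1, 0, 0, 1),
    ]
    print(code_parameters(toy_family))
    print(entropy_and_gap(toy_family))
```

A positive entropy gap means that the code point lies above the Gilbert–Varshamov line, a negative one that it lies below.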
As a source of estimates of complexity of a language family $\mathcal{F}$ one can consider any upper bound on the Kolmogorov complexity of the code $C$. A possible approach, which contains more linguistic input, would be to provide estimates of complexity for each individual language in the family and then compare these. Estimates of complexity for individual languages have been considered in the literature, some of them based on the description of languages in terms of their syntactic parameters. For instance, following [
18], for a syntactic parameter Π with possible values $v \in \{\pm 1\}$, the Kolmogorov complexity of Π set to value $v$ is given by
$$ K(\Pi = v) = \min_{\tau \text{ expressing } \Pi} K_{T_U}(\tau), $$
with the minimum taken over the complexities of all the parse trees $\tau$ that express the syntactic parameter Π and require $\Pi = v$ to be grammatical in the language. Notice that, in this approach, the syntactic parameters are not just regarded as binary or ternary values, but one needs to consider actual parse trees of sentences in the language that express the parameter. Thus, such an approach to complexity has the advantage that it is very rich in linguistic information. However, it is at the same time computationally very difficult to realize.
What we are proposing here is a much simpler way to obtain an estimate of complexity for a language family $\mathcal{F}$, which is not based on estimating the complexity of the individual languages in the family, but which is aimed at detecting how spread out and diversified the syntactic parameters are across the family, by estimating the position of the code point of the associated code $C$ with respect to the asymptotic bound $R = \alpha_q(\delta)$. This can be estimated in terms of the distance to other curves in the space of code parameters that constrain the asymptotic bound from above and below, such as the Plotkin bound, Hamming bound, and Gilbert–Varshamov bound, as in the examples discussed in the previous sections.
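Since the asymptotic bound itself has no explicit expression, a code point can only be located relative to explicit curves that bound it. The sketch below uses the standard asymptotic forms of the Gilbert–Varshamov bound (a lower bound for $\alpha_q$) and of the Plotkin and Hamming bounds (upper bounds for $\alpha_q$). The function names and the three-way classification are illustrative assumptions, not a procedure taken from the text.

```python
from math import log


def entropy_q(delta: float, q: int = 2) -> float:
    """q-ary Shannon entropy H_q(delta), with the convention 0 * log 0 = 0."""
    if delta == 0.0:
        return 0.0
    if delta == 1.0:
        return log(q - 1, q)
    return (delta * log(q - 1, q)
            - delta * log(delta, q)
            - (1 - delta) * log(1 - delta, q))


def gv_bound(delta: float, q: int = 2) -> float:
    """Gilbert-Varshamov curve R = 1 - H_q(delta), set to 0 beyond (q-1)/q."""
    if delta >= (q - 1) / q:
        return 0.0
    return 1 - entropy_q(delta, q)


def plotkin_bound(delta: float, q: int = 2) -> float:
    """Asymptotic Plotkin bound R = 1 - q*delta/(q-1), set to 0 beyond (q-1)/q."""
    return max(0.0, 1 - q * delta / (q - 1))


def hamming_bound(delta: float, q: int = 2) -> float:
    """Asymptotic Hamming bound R = 1 - H_q(delta/2)."""
    return 1 - entropy_q(delta / 2, q)


def locate(rate: float, delta: float, q: int = 2) -> str:
    """Locate a code point relative to the asymptotic bound.

    Returns "below" if the point is under the GV curve, "above" if it
    exceeds the Plotkin or Hamming bound, and "undetermined" otherwise,
    since the explicit curves do not decide the intermediate region.
    """
    if rate < gv_bound(delta, q):
        return "below"
    if rate > min(plotkin_bound(delta, q), hamming_bound(delta, q)):
        return "above"
    return "undetermined"


if __name__ == "__main__":
    # Hypothetical code points (R, delta), for illustration only.
    for point in [(0.05, 1 / 3), (0.2642, 1 / 3), (0.9, 0.5)]:
        print(point, locate(*point))
```

Only points beyond the explicit upper bounds are certified as lying above the asymptotic bound. Points between the Gilbert–Varshamov curve and the upper bounds cannot be decided with these curves alone.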