Article

The Entropy of Words—Learnability and Expressivity across More than 1000 Languages

by Christian Bentz 1,2,*, Dimitrios Alikaniotis 3, Michael Cysouw 4 and Ramon Ferrer-i-Cancho 5

1 DFG Center for Advanced Studies, University of Tübingen, Rümelinstraße 23, D-72070 Tübingen, Germany
2 Department of General Linguistics, University of Tübingen, Wilhelmstraße 19-23, D-72074 Tübingen, Germany
3 Department of Theoretical and Applied Linguistics, University of Cambridge, Cambridge CB3 9DP, UK
4 Forschungszentrum Deutscher Sprachatlas, Philipps-Universität Marburg, Pilgrimstein 16, D-35032 Marburg, Germany
5 Complexity and Quantitative Linguistics Lab, LARCA Research Group, Departament de Ciències de la Computació, Universitat Politècnica de Catalunya, 08034 Barcelona, Catalonia, Spain
* Author to whom correspondence should be addressed.
Entropy 2017, 19(6), 275; https://doi.org/10.3390/e19060275
Submission received: 26 April 2017 / Revised: 2 June 2017 / Accepted: 6 June 2017 / Published: 14 June 2017
(This article belongs to the Special Issue Complex Systems, Non-Equilibrium Dynamics and Self-Organisation)

Abstract

The choice associated with words is a fundamental property of natural languages. It lies at the heart of quantitative linguistics, computational linguistics and language sciences more generally. Information theory provides the tools to measure precisely the average amount of choice associated with words: the word entropy. Here, we use three parallel corpora, encompassing ca. 450 million words in 1916 texts and 1259 languages, to tackle some of the major conceptual and practical problems of word entropy estimation: dependence on text size, register, style and estimation method, as well as non-independence of words in co-text. We present two main findings: Firstly, word entropies display relatively narrow, unimodal distributions. There is no language in our sample with a unigram entropy of less than six bits/word. We argue that this is in line with information-theoretic models of communication. Languages are held in a narrow range by two fundamental pressures: word learnability and word expressivity, with a potential bias towards expressivity. Secondly, there is a strong linear relationship between unigram entropies and entropy rates. The entropy difference between words with and without co-textual information is narrowly distributed around ca. three bits/word. In other words, knowing the preceding text reduces the uncertainty of words by roughly the same amount across languages of the world.
Keywords: natural language entropy; entropy rate; unigram entropy; quantitative language typology
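
As a rough illustration of the two quantities contrasted in the abstract, the sketch below computes a naive plug-in (maximum likelihood) estimate of unigram word entropy and a bigram conditional entropy as a crude stand-in for the entropy rate. The paper itself compares several estimators and conditions on longer stretches of preceding text, so the toy sentence, the function names, and the bigram approximation here are illustrative assumptions, not the authors' method.

```python
from collections import Counter
from math import log2

def unigram_entropy(tokens):
    # Plug-in (maximum likelihood) estimate of unigram word entropy in bits/word.
    counts = Counter(tokens)
    n = len(tokens)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def bigram_conditional_entropy(tokens):
    # Entropy of a word given the immediately preceding word (bits/word):
    # a crude stand-in for the entropy rate, which conditions on all preceding text.
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    n = sum(bigrams.values())
    return -sum((c / n) * log2(c / contexts[w1]) for (w1, _), c in bigrams.items())

# Hypothetical toy input; the study uses large parallel corpora instead.
tokens = "the cat sat on the mat and the dog sat on the rug".split()
h_unigram = unigram_entropy(tokens)
h_bigram = bigram_conditional_entropy(tokens)
print(f"unigram entropy:              {h_unigram:.2f} bits/word")
print(f"conditional (bigram) entropy: {h_bigram:.2f} bits/word")
print(f"reduction from one word of co-text: {h_unigram - h_bigram:.2f} bits/word")
```

On realistic corpora, both plug-in estimates are biased for small samples, which is exactly the text-size and estimator dependence the paper addresses; the difference printed here only hints at the roughly three bits/word reduction the authors report when full co-text is taken into account.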