#### *2.3. Data Description*

We provide a broad characterization of the PG books in terms of their length, language and (when available) inferred date of publication in Figure 2. One of the main reasons for the popularity of books from PG is their long text length, which yields large coherent statistical samples without introducing confounding factors originating from, e.g., the mixing of different texts [39]. The length of most PG books exceeds *m* = 10<sup>4</sup> word tokens (Figure 2a), which is larger than typical documents from most web resources. In fact, the distribution shows a heavy tail for large values of *m*, so that a substantial fraction of books have more than 10<sup>5</sup> word tokens. Many recent applications in quantitative linguistics aim at tracing diachronic changes. While the metadata does not provide the year of first publication of each book, we approximate the number of PG books published in year *t* as the number of PG books whose author's year of birth *t*<sub>birth</sub> and year of death *t*<sub>death</sub> satisfy *t*<sub>birth</sub> + 20 < *t* < *t*<sub>death</sub> (Figure 2b). This reveals that the vast majority of books were first published around the year 1900, although a substantial number of books are spread between 1800 and 2000. Part of this is known to be a consequence of the Copyright Term Extension Act of 1998 which, sadly, has so far prevented books published in 1923 or later from entering the public domain. If no further copyright extension laws are passed, this situation will be gradually alleviated year after year, as books published in 1923 will enter the public domain on 1 January 2019, and so on.
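
The publication-year approximation above can be illustrated with a minimal sketch. It assumes each book is represented only by its author's birth and death years; the function name, the toy data, and the choice to skip books with a missing year are illustrative assumptions and not part of the SPGC code.

```python
from collections import Counter


def books_compatible_with_year(author_years, years, min_age=20):
    """Count, for each year t, the books compatible with publication in t.

    author_years: iterable of (birth_year, death_year) pairs, one per book.
    A book is counted for year t when birth_year + min_age < t < death_year.
    Books with a missing birth or death year (None) are skipped; this is an
    assumption of the sketch, not a documented rule of the corpus.
    """
    years = list(years)  # allow any iterable, including generators
    counts = Counter()
    for birth, death in author_years:
        if birth is None or death is None:
            continue
        for t in years:
            if birth + min_age < t < death:
                counts[t] += 1
    return counts


# Hypothetical toy metadata: (author birth year, author death year) per book.
toy_books = [(1850, 1910), (1870, 1930), (1800, None)]
counts = books_compatible_with_year(toy_books, range(1880, 1921, 10))
```

Note that a single book contributes to every year in its author's compatible interval, so the resulting curve is an upper bound on the number of books per year rather than a histogram of publication dates.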

While most contemporary textual datasets are in English, the SPGC provides a rich resource to study other languages. Using metadata provided by PG, we find that 81% of the books are tagged as written in English, followed by French (5%, 2864 books), Finnish (3.3%, 1903 books) and German (2.8%, 1644 books). In total, we find books written in 56 different languages, with three (13) languages besides English with more than 1000 (100) books each (Figure 2c). The size of the English corpus is 2.8 × 10<sup>9</sup> tokens, which is more than one order of magnitude larger than the British National Corpus (10<sup>8</sup> tokens). The second-largest language corpus is made up of French books with > 10<sup>8</sup> tokens. Notably, there are six other languages (Finnish, German, Dutch, Italian, Spanish, and Portuguese) that contain > 10<sup>7</sup> tokens and still another eight languages (Greek, Swedish, Hungarian, Esperanto, Latin, Danish, Tagalog, and Catalan) that contain > 10<sup>6</sup> tokens.

**Figure 2.** Basic summary statistics from the processed PG data. (**a**) Number of books with a text length larger than *m*; (**b**) Number of books which are compatible with being published in year *t*, i.e., the author's year of birth is more than 20 years before *t* and the author's year of death is after *t*; (**c**) Number of books (left axis) and number of tokens (right axis) which are assigned to a given language based on the metadata. en: English, fr: French, fi: Finnish, de: German, nl: Dutch, it: Italian, es: Spanish, pt: Portuguese, zh: Chinese, el: Greek, sv: Swedish, hu: Hungarian, eo: Esperanto, la: Latin, da: Danish, tl: Tagalog, ca: Catalan, pl: Polish, ja: Japanese, no: Norwegian, cy: Welsh, cs: Czech.

In addition to the "hard-facts" metadata (such as language and time of publication), the SPGC also contains manually annotated topical labels for individual books. These labels not only allow the study of topical variability, but are also of practical importance for assessing the quality of machine learning applications in Information Retrieval, such as text classification or topic modeling [52]. We consider two sets of topical labels: labels obtained from PG's metadata "subject" field, which we call *subject labels*; and labels obtained by parsing PG's website bookshelf pages, which we call *bookshelf labels*. Table 1 shows that there is some overlap in the most common labels between the two sets (e.g., Science Fiction or Historical Fiction), but a more detailed analysis of how labels are assigned to books reveals substantial differences (Figure 3). First, subject labels display a very uneven distribution of the number of books per label. That is, most of the subject labels are assigned to very few books (fewer than 10), with only a few subject labels assigned to many books. In comparison, bookshelf labels are more evenly distributed: most of them are assigned to between 10 and 100 books (Figure 3a,c). More importantly, the overlap in the assignment of labels to individual books is much smaller for the bookshelf labels (Figure 3b,d): while roughly 50% of the PG books are tagged with two or more subject labels, up to 85% of books are tagged with a unique bookshelf label. This indicates that the bookshelf labels are more informative because they constitute broader categories and provide a unique assignment of labels to books, and are thus better suited for practical applications such as text classification.
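
The two label statistics compared above (number of books per label, and fraction of books with a given number of labels) could be computed along the following lines. This is only a sketch: the dictionary layout, the book identifiers, and the decision to leave out unlabelled books are assumptions made for illustration.

```python
from collections import Counter


def label_statistics(book_labels):
    """Summarise how labels are distributed over books.

    book_labels: dict mapping a book id to a list of its labels.
    Returns (books_per_label, fraction_by_label_count), where
    books_per_label[label] is the number of books carrying that label and
    fraction_by_label_count[k] is the fraction of labelled books that
    carry exactly k distinct labels.
    """
    books_per_label = Counter()
    labels_per_book = Counter()
    for labels in book_labels.values():
        unique = set(labels)
        if not unique:
            continue  # assumption: unlabelled books are left out
        books_per_label.update(unique)
        labels_per_book[len(unique)] += 1
    total = sum(labels_per_book.values())
    fractions = {k: n / total for k, n in labels_per_book.items()}
    return books_per_label, fractions


# Hypothetical toy assignment of labels to books.
toy_labels = {
    "book-1": ["Science Fiction"],
    "book-2": ["Science Fiction", "Adventure"],
    "book-3": ["Historical Fiction"],
    "book-4": ["Science Fiction"],
}
books_per_label, fractions = label_statistics(toy_labels)
```

A histogram of the values in `books_per_label` corresponds to the kind of quantity shown in Figure 3a,c, and `fractions` to the kind shown in Figure 3b,d.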


**Table 1.** Examples for the names of labels and the number of assigned books from bookshelves (left) and subjects (right) metadata.

**Figure 3.** Comparison between bookshelf labels (top, green) and subject labels (bottom, red). (**a**,**c**) Number of labels with a given number of books; (**b**,**d**) Fraction of books with a given number of labels.

#### *2.4. Quantifying Variability in the Corpus*

In order to highlight the potential of the SPGC for quantitative analysis of language, we quantify the degree of variability in the statistics of word frequencies across labels, authors, and time. For this, we measure the distance between books *i* and *j* using the well-known Jensen–Shannon divergence [53], *D*<sub>*i*,*j*</sub>, with *D*<sub>*i*,*j*</sub> = 0 if the two books are exactly equal in terms of frequencies, and *D*<sub>*i*,*j*</sub> = 1 if they are maximally different, i.e., they do not have a single word in common (see Methods for details).
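
As an illustration of the two limiting cases, the following sketch computes the standard Jensen–Shannon divergence with base-2 logarithms between the word-frequency distributions of two token lists. It is an assumption that this standard form matches the definition given in Methods; the function name and the toy inputs are hypothetical.

```python
import math
from collections import Counter


def jensen_shannon_divergence(tokens_i, tokens_j):
    """Standard Jensen-Shannon divergence (base 2) between two token lists.

    Returns 0 when both lists have identical relative word frequencies and
    1 when they share no word at all. Assumption: this is the textbook
    form of the measure, which may differ from the paper's Methods.
    """
    p = Counter(tokens_i)
    q = Counter(tokens_j)
    n_p = sum(p.values())
    n_q = sum(q.values())
    if n_p == 0 or n_q == 0:
        raise ValueError("both token lists must be non-empty")
    divergence = 0.0
    for word in p.keys() | q.keys():
        p_w = p[word] / n_p
        q_w = q[word] / n_q
        m_w = 0.5 * (p_w + q_w)  # mixture distribution
        if p_w > 0:
            divergence += 0.5 * p_w * math.log2(p_w / m_w)
        if q_w > 0:
            divergence += 0.5 * q_w * math.log2(q_w / m_w)
    return divergence


d_same = jensen_shannon_divergence(["a", "b"], ["a", "b"])
d_disjoint = jensen_shannon_divergence(["a", "b"], ["c", "d"])
d_partial = jensen_shannon_divergence(["a", "b"], ["a", "c"])
```

With base-2 logarithms the divergence is bounded between 0 and 1, which is why no extra normalization appears in the sketch.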
