*2.2. Data Processing*

In this section, we briefly describe the steps we took to obtain the corpus from the raw data (Figure 1); for details, see Section 4. The processing (as of 18 July 2018) yields data for 55,905 books on four different levels of granularity: raw, text, tokens, and counts.


**Figure 1.** Sketch of the pre-processing pipeline of the Project Gutenberg (PG) data. The folder structure (**left**) organizes each PG book at four different levels of granularity (raw, text, tokens, and counts), illustrated with example books (**middle**). On the right, we show the basic Python commands used in the pre-processing.
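The last two granularity levels of the pipeline (tokens and counts) can be sketched in a few lines of Python. This is a minimal illustration, not the corpus's actual implementation: it assumes a simple lowercasing regex tokenizer, whereas the real pipeline may use a different tokenization scheme.

```python
import re
from collections import Counter

def to_tokens(text):
    """Token level: lowercase the text and split it into word tokens.
    A simple regex tokenizer is assumed here for illustration; the
    actual PG pipeline may tokenize differently."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def to_counts(tokens):
    """Counts level: word-frequency table built from the token list."""
    return Counter(tokens)

# Toy example standing in for the cleaned text of one PG book.
text = "The quick brown fox jumps over the lazy dog. The dog sleeps."
tokens = to_tokens(text)
counts = to_counts(tokens)
```

Going from `raw` to `text` (removing the PG license boilerplate surrounding each book) is a separate cleaning step not shown here; the sketch starts from already-cleaned text.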
