*3.1. Document Extraction*

The first module (Figure 8) performs the preprocessing step introduced in Section 2.4 (Figure 6). Since the source format (PDF) is not suited for further processing, plain text is extracted. While parsing the documents, the module also detects non-textual elements such as figures, charts, and tables.

**Figure 8.** Document extraction module.

In addition to detecting non-textual elements, the module segments the document into its sections (e.g., abstract, introduction), since, depending on the IE purpose, the relevant information may be concentrated in a particular section. For example, the introduction usually states the aim of the investigation, while the results section describes the outcome. Parsing is based on syntactical rules and pattern matching; for example, indentation, blank space, or font changes can serve as indicators in the detection process. Besides the content of the publication, metadata about the document (e.g., author, DOI, date, publishing information) is extracted. The last step aggregates the previously segmented elements into JSON (JavaScript Object Notation), a common, platform-independent data-sharing format. The document extraction module is implemented using the PyMuPDF library [65].
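
The steps described above (text extraction, rule-based segmentation, metadata extraction, and JSON aggregation) can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names `read_pdf`, `segment_lines`, and `to_json`, the heading pattern `HEADING_RE`, and the font-size threshold are hypothetical choices made for illustration, and only basic PyMuPDF calls (`fitz.open`, `Document.metadata`, `Page.get_text("dict")`) are assumed.

```python
import json
import re

# Hypothetical heading rule; the actual rules of the module are not listed here.
HEADING_RE = re.compile(
    r"^(?:\d+(?:\.\d+)*\.?\s+)?"
    r"(abstract|introduction|methods|results|discussion|conclusions?)\s*$",
    re.IGNORECASE,
)


def read_pdf(path):
    """Return (metadata, lines) for a PDF; each line is (text, largest font size)."""
    import fitz  # PyMuPDF, imported lazily so the helpers below run without it

    lines = []
    with fitz.open(path) as doc:
        metadata = dict(doc.metadata or {})
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                if block["type"] != 0:  # 0 = text block, 1 = image block
                    continue
                for line in block["lines"]:
                    spans = line["spans"]
                    text = "".join(s["text"] for s in spans).strip()
                    if text:
                        lines.append((text, max(s["size"] for s in spans)))
    return metadata, lines


def segment_lines(lines, body_size):
    """Group lines into sections.

    A line opens a new section if it matches HEADING_RE and is not set in a
    font smaller than the body text (assumed rule, to skip captions/footnotes).
    """
    sections = [{"heading": "front_matter", "lines": []}]
    for text, size in lines:
        match = HEADING_RE.match(text)
        if match and size >= body_size:
            sections.append({"heading": match.group(1).lower(), "lines": []})
        else:
            sections[-1]["lines"].append(text)
    # Join the collected lines of each section into one text field.
    return [
        {"heading": s["heading"], "text": " ".join(s["lines"])} for s in sections
    ]


def to_json(metadata, sections):
    """Aggregate metadata and segmented sections into a JSON string."""
    return json.dumps(
        {"metadata": metadata, "sections": sections}, ensure_ascii=False, indent=2
    )
```

Under these assumptions, a document would be processed with `metadata, lines = read_pdf("paper.pdf")` followed by `to_json(metadata, segment_lines(lines, body_size=10.0))`, where the file name and the body font size are placeholders. Keeping the segmentation separate from the PDF access makes the rules testable on plain `(text, size)` pairs.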
