#### 3.1.1. The Data Harvesting Submodule

The Data Harvesting submodule implements several distinct services that may be invoked by Hydria users to initiate automated data collection. At the heart of this submodule lies the web scraping service, which extends the basic components of the open-source web scraper Scrapy [74,75] to scrape social media content. Currently, the web scraping service supports two of the most popular social media platforms, Facebook and TripAdvisor, while support for more is under development. Setting up the web scraping service only involves providing the initial seed URLs from the aforementioned social media; the service automatically recognizes which social platform and what type of data (e.g., venues, PoIs, reviews) is targeted for data harvesting and launches the appropriate web scraper instance. To address user privacy issues that may arise during spidering, the web scraping service provides no spidering options for individual users (i.e., one cannot scrape user pages) and supports only the collection of aggregate values and fields that are general enough that no person may be identifiable by reasonable means. The overall process is controlled by the scraper execution engine, which is responsible for the data flow between all components of the Hydria system and is summarized below:
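The platform and data-type recognition step described above can be sketched as a simple URL-based dispatcher. The host table and path heuristics below are illustrative assumptions, not Hydria's actual rules:

```python
from urllib.parse import urlparse

# Hypothetical seed-URL classifier: map a seed URL to (platform, data type)
# so the appropriate scraper instance can be launched. The host names and
# path hints below are illustrative, not Hydria's actual configuration.
PLATFORM_HOSTS = {
    "facebook.com": "facebook",
    "tripadvisor.com": "tripadvisor",
}

TYPE_HINTS = {
    "tripadvisor": {"Attraction": "poi", "Restaurant": "venue",
                    "ShowUserReviews": "reviews"},
    "facebook": {"reviews": "reviews", "about": "venue"},
}

def classify_seed(url: str) -> tuple[str, str]:
    """Return (platform, data_type) for a seed URL, or 'unknown' markers."""
    parsed = urlparse(url)
    host = parsed.netloc.lower().removeprefix("www.")
    platform = PLATFORM_HOSTS.get(host, "unknown")
    for hint, dtype in TYPE_HINTS.get(platform, {}).items():
        if hint in parsed.path:
            return platform, dtype
    return platform, "unknown"
```

A dispatcher of this shape lets new platforms be added by extending the two tables without touching the scraper execution engine.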


This procedure can be applied straightforwardly to plain HTML pages received from the server; however, to enhance the user experience, practically all major social media sites employ JavaScript to make their content more interactive. This practice poses challenges to the scraping procedure; to overcome them, we have also developed a JavaScript handling service by adapting and integrating the Selenium library [76]. Selenium provides a useful tool for data harvesting and web scraping: the Selenium renderer uses a web browser engine to render a given URL and mimic human behavior on the web page. This allows the web scraping (and crawling) service to interact with JavaScript functions on the target website (e.g., infinite scrolling) and avoid unnecessary hold-ups in the spidering process. The JavaScript handling service uses the Google Chrome web driver for the Selenium renderer.
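The infinite-scrolling interaction mentioned above can be sketched as a driver-agnostic "scroll until the page stops growing" loop; the Selenium wiring shown in the comment is an assumed usage, not Hydria's actual code:

```python
# Illustrative sketch of infinite-scroll handling. The policy is kept
# driver-agnostic (plain callables) so it can be exercised without a browser.

def scroll_until_stable(get_height, scroll_once, max_rounds: int = 50) -> int:
    """Scroll until the page height stops growing; return rounds performed."""
    last = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_once()
        height = get_height()
        if height == last:        # no new content was loaded
            return rounds
        last = height
    return max_rounds

# Assumed Selenium wiring with the Chrome web driver (sketch only):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get(seed_url)
#   scroll_until_stable(
#       get_height=lambda: driver.execute_script(
#           "return document.body.scrollHeight"),
#       scroll_once=lambda: driver.execute_script(
#           "window.scrollTo(0, document.body.scrollHeight)"),
#   )
```

Separating the scrolling policy from the driver also makes it trivial to cap the interaction (`max_rounds`) and avoid the hold-ups mentioned above.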

Apart from spidering popular structured social media platforms, the Data Harvesting submodule also contains a *focused crawl* service that is designed to perform *thematic* crawls on the clear web with the purpose of discovering new resources that may contain cultural heritage data of interest. From the Hydria user side, setting up the focused crawling service only involves providing a task-relevant query (e.g., "cultural heritage" or "archeological museum") that will be used to produce the initial crawl seeds. The underlying crawling infrastructure is based on the ACHE crawler [77], one of the most popular focused crawlers [78] available, which prioritizes URLs in the crawl frontier and categorizes the crawled pages as relevant or irrelevant using machine learning-based techniques. To direct the crawl towards topically relevant websites (i.e., websites with content relevant to cultural heritage), we use an SVM classifier that is trained on an equal number of positive and negative website examples provided as input to the *model builder* component of ACHE. Subsequently, the *seed finder* [79] component is used to aid the process of locating initial seeds for the focused crawl on the clear web; this is achieved by combining the pre-built classification model with the topic-related, user-provided query discussed above. Since the crawled websites may be anything from blog posts to organizational web pages and do not have a predetermined structure (unlike the social media pages), the collected content is only parsed to remove HTML markup and is stored as raw text in the Hydria data lake.
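The frontier prioritization performed by a focused crawler can be sketched as a priority queue of URLs ordered by a relevance score. The keyword scorer below is a deliberate stand-in for the trained SVM model, and the whole sketch is illustrative rather than ACHE's implementation:

```python
import heapq

# Stand-in relevance scorer: fraction of topic keywords present. In the real
# pipeline this role is played by the SVM model built by ACHE's model builder.
def relevance_score(text: str) -> float:
    keywords = {"cultural", "heritage", "museum", "archeological"}
    words = set(text.lower().split())
    return len(keywords & words) / len(keywords)

class Frontier:
    """Max-priority crawl frontier; heapq is a min-heap, so scores are negated."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url: str, anchor_text: str) -> None:
        if url not in self._seen:           # avoid re-enqueueing known URLs
            self._seen.add(url)
            heapq.heappush(self._heap, (-relevance_score(anchor_text), url))

    def pop(self) -> str:
        """Return the most promising URL to crawl next."""
        return heapq.heappop(self._heap)[1]
```

The crawler then repeatedly pops the highest-scoring URL, fetches it, classifies the page as relevant or irrelevant, and pushes its outlinks back into the frontier.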

Finally, the Data Harvesting submodule contains two auxiliary services (namely data parsing and feature extraction) that are used to extract textual content and features from the harvested websites.
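The markup-stripping step performed by the data parsing service before content reaches the data lake can be sketched with the standard library's HTML parser; this is an illustrative, stdlib-only sketch, not Hydria's actual parser:

```python
from html.parser import HTMLParser

# Minimal HTML-to-raw-text extractor: keeps visible text, drops markup and
# the contents of <script>/<style> blocks (illustrative sketch).
class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return parser.text()
```

The resulting raw text is what a feature extraction step (e.g., tokenization, keyword counts) would then operate on.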

#### 3.1.2. The Structured Data Input Submodule

Apart from the crawling and web scraping functionality presented in the previous section, Hydria also supports structured data input via the relevant submodule. Structured data input comprises several services that provide users with the necessary functionality to *(i)* import data from a structured CSV/XML/JSON formatted dataset, *(ii)* create, in a stand-alone data pond, a questionnaire-style form that may be used in surveys, together with the associated user answers, and *(iii)* reuse all or part of these questionnaires via the creation and management of templates. Each of the aforementioned services is associated with a data pond that is created to allow users to manage the stored data (more details about data pond creation and management are given in Section 3.2). To achieve this functionality, the Structured Data Input submodule implements several distinct services as follows.

The *Dataset Import service* is used to automatically load/store a structured dataset into a Hydria data pond; the service automatically matches the columns of a CSV/XML/JSON tagged file with the pre-specified data pond fields, and for each data item (typically a row in the CSV file or an element under the root of the XML/JSON document) a new record is created and stored in Hydria under the corresponding data pond. Notice that this service may also be used as a separate stand-alone step for importing harvested web/social media content that has undergone processing outside of Hydria; i.e., when the harvesting task is over, the user may choose to process the downloaded content outside of Hydria and subsequently manually load the result of this intermediate processing into a separate data pond.
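The CSV path of the column-matching step can be sketched as follows; the in-memory record list standing in for the data pond, and the field names in the usage example, are illustrative assumptions:

```python
import csv
import io

# Sketch of the Dataset Import service's CSV path: match the file's columns
# against the pre-specified data pond fields and create one record per row.
def import_csv(csv_text: str, pond_fields: list[str]) -> list[dict]:
    reader = csv.DictReader(io.StringIO(csv_text))
    # Keep only the columns that correspond to declared data pond fields.
    matched = [f for f in pond_fields if f in (reader.fieldnames or [])]
    return [{f: row[f] for f in matched} for row in reader]
```

The XML/JSON paths would be analogous, iterating over the elements under the document root instead of CSV rows.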

The rest of the available services target the creation of questionnaire-style forms that allow Hydria users to create and store data collection tasks involving electronic input from end-users into structured forms (e.g., surveys, end-user evaluations, museum experience records, etc.). To facilitate data pond creation and the reuse of common parts between constructed questionnaires, Hydria supports (via an appropriate template management service) the creation and use of *templates* that can be shared or reused between different data ponds. This functionality is native to Hydria and is tightly coupled with (i) the creation, maintenance and analysis of data ponds (described in detail in Section 3.2) and (ii) the access control mechanism for the different data ponds (described in detail in Section 3.5).

#### *3.2. The Data Management Module*

The Data Management module supervises the creation, editing, organization and management of the data ponds, and performs all necessary storage and retrieval operations on the database back-end (i.e., it manages the stored data related to a specific data pond or a specific record). It supports a flexible, adaptive and intuitive way of designing and composing a data pond or a data pond template.

The Data Management module employs several different services that allow Hydria users to create and edit data ponds. Such activities are supported by an easy-to-use wizard mechanism that guides the Hydria user through the whole process. Building a new data pond/template involves defining a title and a description for it; subsequently, the user specifies the different fields of the data pond (i.e., the attributes to be stored) by providing, for each field, its textual description and its type. Depending on the selected attribute type, hidden fields or dialogs appear for inserting more specific information about the attribute (e.g., if the attribute is of type *multiple choice*, the user has to fill in a list of values or select one of the existing template lists). The data types that Hydria currently supports are as follows: title (this field is not fillable; it is used to separate data pond sections), text, integer, decimal, date, multiple choice, picture drawing, image file, and complex data types.
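The type-dependent checks performed during the wizard flow can be sketched as a small field validator; the dictionary layout of a field definition and the exact type tokens are assumptions for illustration:

```python
# Illustrative validator for a data pond field definition produced by the
# wizard; the field dict layout and type tokens are assumed, not Hydria's.
ALLOWED_TYPES = {
    "title", "text", "integer", "decimal", "date",
    "multiple choice", "picture drawing", "image file", "complex",
}

def validate_field(field: dict) -> None:
    """Raise ValueError if a field definition is incomplete or inconsistent."""
    if field.get("type") not in ALLOWED_TYPES:
        raise ValueError(f"unknown field type: {field.get('type')!r}")
    if not field.get("description"):
        raise ValueError("every field needs a textual description")
    # Type-specific extras, e.g. multiple choice needs its list of values
    # (or a reference to an existing template list).
    if field["type"] == "multiple choice" and not field.get("values"):
        raise ValueError("multiple choice fields need a list of values")
```

The same pattern extends naturally to the other type-specific dialogs (e.g., sub-field specifications for complex data types).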

*Complex data types* are a construct provided to allow for more efficient modelling of cases where a group of fields appears multiple times within a document, across different documents in the same data pond, or even across documents in different data ponds. Examples of such cases are interpretations of cultural items (with each interpretation having an author, a summary, an extended analysis and supporting documents, and each cultural item being potentially subject to multiple interpretations), or company addresses (each address consists of a street name, a number, a city, a zip code and a country, and a single company may have multiple addresses). To introduce a complex data type, a user needs to provide the specification of a recurring attribute with more than one field. Note that the complex data type definition may subsequently be modified to change the attribute order and/or edit or delete a specific attribute using the corresponding controls of the wizard; any changes are reflected in all data ponds that use the modified complex data type. The advantages of creating and using complex data types include better data modelling, increased flexibility in the design of data ponds with complex/recurring attributes, elevated knowledge capture capabilities and, consequently, the ability to formulate more expressive and semantically rich queries.
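The company-address example above can be mirrored with two small record types, where the complex data type is a recurring group of sub-fields attached to a parent record; the class and field names are illustrative:

```python
from dataclasses import dataclass, field

# Sketch of a complex data type as a recurring group of attributes
# (mirrors the company-address example; names are illustrative).
@dataclass
class Address:
    street: str
    number: str
    city: str
    zip_code: str
    country: str

@dataclass
class Company:
    name: str
    # The complex data type may occur any number of times per record.
    addresses: list[Address] = field(default_factory=list)
```

Because `Address` is defined once and reused, editing its specification (e.g., adding a field) propagates to every record that uses it, which is exactly the reuse benefit described above.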

To ensure data consistency across data ponds and to enhance data integrity and input validation, Hydria natively supports the following two features:


Finally, notice that Hydria has no direct policy on how one stores or uses the stored data. This is particularly important in the case of evolving data(sets) where the user has many different options to store, monitor, or analyze data evolution. In particular, she has the option to create different data ponds for different snapshots of the data, use different fields to model data evolution within the data pond, or store solely the differences between various data snapshots.
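The third option, storing only the differences between snapshots, can be sketched as a field-level diff between two versions of the same record; the record layout is an illustrative assumption:

```python
# Illustrative sketch of the "store only the differences" strategy for
# evolving records: compute a field-level diff between two snapshots.
def snapshot_diff(old: dict, new: dict) -> dict:
    """Return {field: (old_value, new_value)} for every field that changed
    (a missing field appears as None on the corresponding side)."""
    keys = old.keys() | new.keys()
    return {k: (old.get(k), new.get(k))
            for k in keys if old.get(k) != new.get(k)}
```

Persisting only such diffs keeps storage proportional to the amount of change, at the cost of reconstructing full snapshots by replaying the diffs.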
