## *3.4. The Publish/Subscribe Module*

The Publish/Subscribe module is essentially an efficient and easy-to-use collaboration tool that allows users to share their collected datasets (or parts of them), as well as to discover and access datasets of other users within the Hydria ecosystem. This tool is not intended for use by the Hydria system end-users (i.e., people participating in a Hydria survey, or museum visitors who provide feedback via a Hydria questionnaire), but rather targets other user categories such as curators/super-users (see Section 3.5 for a detailed description of the Hydria user roles). Using this functionality involves the following two-step process:


When both steps have been completed, the user is able to access the subscribed records for preview, incorporate them into her own data ponds, or create charts and visualizations using the data analysis and visualization tool. Moreover, the subscribed user will be notified whenever new records that match her subscription are incorporated into the subscribed data pond.

## *3.5. The User Management Module*

The User Management module is responsible for performing basic and advanced user management tasks, such as managing users in the Hydria community, assigning user privileges and permissions, and performing access control on data ponds and the data stored within them. It supports the following three types of users:


Figure 3 presents the UI for the creation of a new user from the administrator point of view in Hydria.


**Figure 3.** User creation form.

## *3.6. Implementation Aspects*

Hydria has been developed entirely with open-source software. The Data Acquisition module uses the Linux/Apache/MariaDB/PHP (LAMP) solution stack for temporary storage of the extracted data and was developed using Python tools [80]. The remaining modules were built using the Laravel Framework [81] and use the Linux/Apache/PostgreSQL/PHP (LAPP) solution stack as the back-end database infrastructure. In addition, many of Hydria's functionalities were developed using JavaScript/jQuery/AJAX.

#### **4. The TripMentor Case Study**

In the context of Alpha testing the Hydria platform, we made the system available to partners within the TripMentor project [45] that aims at creating a tourist guide for the region of Attica, Greece. The partners had varying levels of IT expertise, ranging from relatively experienced to very experienced, and were asked to use Hydria for collecting, storing, and managing data relevant to the project, by deploying the different services offered by Hydria and using the functionality provided. In the following, we report the results for this case study, as these were drawn from field observations as well as from the analysis of usage logs and collected data.

The TripMentor partners used Hydria to navigate within TripAdvisor [82] and Facebook [83] and detect content relevant to the tourism domain; the focus of interest was primarily points of interest (PoIs) located in the *Attica region in Greece*, due to the nature of the TripMentor project. The data retrieved was stored in the Hydria data lake, and the stakeholders involved used Hydria functionality to design the necessary data ponds, modify selected data records, mine and visualize information from the stored data, and export files for further analysis.

#### *4.1. Data Harvesting*

To crawl the web for content and data relevant to the TripMentor project, the integrated ACHE crawler was used within the Hydria environment. To set up the crawler, a number of relevant pages within the domain of interest were used as seed URLs; such URLs included tourist articles and blog posts related to cultural and tourist activities in the Attica region. To implement the page classification required by ACHE, different features of the page URL, title, and content were exploited; these features were determined after an examination of the patterns followed by the pages in different social networks. For instance, for the TripAdvisor spider the following features were used:


Similarly, for the Facebook crawlers, patterns from the seed URLs were exploited; for instance, to identify the activities available in the city of Athens, Greece (which is located in the Attica region), the seed URLs start with the string *Things-to-do-in-*, followed by the city name (Athens), which is in turn followed by the name of the country (Greece).

Notably, a multitude of additional operators are available for more complex classification tasks; these include matching against the page title (the title\_regex classifier type), combining regular-expression matches through the AND and OR operators, or using machine-learning-based text classifiers (e.g., SVM, Random Forest).
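As an illustration, the kind of regex-based page classification described above can be sketched in Python. The patterns below are hypothetical stand-ins modeled on the seed-URL conventions mentioned in the text; the exact regexes used in the TripMentor setup are not published:

```python
import re

# Hypothetical URL patterns, modeled on the seed-URL conventions described above.
URL_PATTERNS = [
    re.compile(r"Things-to-do-in-[A-Za-z_-]+"),  # activity-listing pages
    re.compile(r"Attractions-g\d+"),             # illustrative attraction-page pattern
]
# Hypothetical title patterns restricting the crawl to the region of interest.
TITLE_PATTERNS = [
    re.compile(r"(?i)\b(athens|attica|greece)\b"),
]

def is_relevant(url: str, title: str) -> bool:
    """AND-combine a URL match with a title match (OR within each group),
    mirroring the combination of regex classifiers described above."""
    url_ok = any(p.search(url) for p in URL_PATTERNS)
    title_ok = any(p.search(title) for p in TITLE_PATTERNS)
    return url_ok and title_ok
```

A page is accepted only when both its URL and its title match at least one pattern, which corresponds to AND-ing two OR-groups of regular expressions.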

Having collected the seed URLs, six distinct spiders were developed, in an effort to cover a wide spectrum of events and many different scopes; four of them targeted data harvesting from the Facebook platform, while the remaining two were deployed over TripAdvisor. Table 1 summarizes the outcome of the data harvesting process.

**Table 1.** TripMentor-related data ponds records after Hydria harvesting process.


#### 4.1.1. Facebook Spiders

In this section, we discuss the deployed Facebook spiders for the TripMentor case study and present some initial statistics and insights on how the Hydria social media spiders may be used within the cultural informatics context. Please note that the spiders and data ponds were set up and deployed by the TripMentor project partners; our statistics and observations are based on usage logs and the schema analyses of the data ponds used to store the collected data.

The first spider, which constitutes the initial setup within Hydria, was deployed over Facebook and extracted a total number of 10,405 different PoIs in the Attica region; a time period of approximately 48 h was needed to conclude the data harvesting operation, and the collected PoIs are categorized as shown in Table 2. For each of these venues, the spider retrieved and stored in a Hydria data pond the following fields from the related Facebook profile pages: the venue name, the venue unique ID as stored in the Facebook platform, the hours/days of the week that the venue is open to visitors, the venue website, the venue phone number, the registered email address, the physical address of the venue, the total number of check-ins for the venue (i.e., the visitor traffic), the average review score of the venue, the venue category, and the geographical coordinates (latitude and longitude) of the venue. Please note that all the collected data were about venues and cultural events, and no personal or user-specific data were harvested or stored.
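The venue record layout described above can be sketched as a simple data structure. The field names and types below are inferred from the field list in the text and are illustrative only; they are not the actual Hydria data pond schema:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative venue record, inferred from the field list above;
# not the actual Hydria data pond schema.
@dataclass
class VenueRecord:
    name: str
    facebook_id: str                   # venue's unique ID on the platform
    opening_hours: dict                # day-of-week -> opening-hours string
    website: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None
    address: Optional[str] = None
    checkins: int = 0                  # aggregate visitor traffic
    avg_review_score: Optional[float] = None
    category: Optional[str] = None
    latitude: Optional[float] = None
    longitude: Optional[float] = None
```

Note that only venue-level attributes appear here; consistent with the text, no user-specific fields are part of the record.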


**Table 2.** Different PoIs in the Attica region extracted by the first FB spider.

Having retrieved the PoIs' unique IDs from the previous process, a new spider that generated the venues' profile page URLs from the venue IDs was created and launched within Hydria for a subset of the collected PoIs (approximately 1.7 K PoIs). The spider was then deployed and collected around 140 K posts related to the targeted PoIs in a time frame of around 168 h (one week). All retrieved posts were stored in a separate data pond within Hydria, along with the following metadata: the post unique ID, the profile source of the post, the profile that shared the post, the upload date of the post, the post text, the total number of reactions and the number for each individual reaction type (e.g., likes), and the URL of the post. Notice that the posts and the collected post sources refer to venues and cultural events and are not related to persons or personal data, while the collected reactions correspond to aggregate numbers and cannot be traced back to individual users. In particular, regarding the information on the profile that shared the post, only the ID of the profile was collected, and it was transformed using a one-way function; hence, the data cannot be associated with the original profile, while the ability to determine whether two posts were posted by the same user profile is maintained.
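The one-way transformation of profile IDs can be sketched as follows. The text does not specify the concrete function used, so the keyed hash (HMAC-SHA256) below is an illustrative choice: it is irreversible without the key, yet equal inputs map to equal tokens, preserving the ability to tell whether two posts came from the same profile:

```python
import hashlib
import hmac

# Hypothetical deployment secret; kept outside the data pond so the
# pseudonyms cannot be re-linked to the original profile IDs.
SECRET_KEY = b"deployment-specific-secret"

def pseudonymize(profile_id: str) -> str:
    """One-way transformation of a profile ID (illustrative sketch):
    irreversible, but stable, so equality checks still work."""
    return hmac.new(SECRET_KEY, profile_id.encode("utf-8"), hashlib.sha256).hexdigest()
```

Two posts by the same profile yield the same pseudonym, while distinct profiles yield distinct (unlinkable) tokens.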

By using the generated venue profile page URLs, another spider was set up and executed within the Hydria environment; it employed 587 seed URLs and was able to retrieve around 240 K user comments within a time frame of 168 h (one week). The collected data was also stored in a separate Hydria data pond and contained fields such as: the date the comment was posted, the total number of reactions to each comment, the comment text, and the comment URL. Notice that the collected comments and the related metadata do not contain user information or personal data; no user IDs were collected, and our logs show that all user references in the comment text were deleted using Hydria's (regex-based) text cleaning functionality.
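A minimal sketch of such regex-based comment cleaning is shown below. The actual Hydria cleaning rules are not published, so the mention pattern is an illustrative stand-in:

```python
import re

# Illustrative pattern for platform-style @mentions; the actual
# Hydria cleaning regexes are not published.
MENTION = re.compile(r"@\w+")

def clean_comment(text: str) -> str:
    """Strip user references from a comment before storage,
    then collapse the leftover whitespace."""
    without_mentions = MENTION.sub("", text)
    return re.sub(r"\s{2,}", " ", without_mentions).strip()
```

Applying this step before insertion into the data pond ensures no user handles survive in the stored comment text.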

Next, profiles of venues that are known as major event organizers were used as initial seeds to bootstrap and launch an event harvesting spider that was able to extract a few hundred upcoming events and their relevant event cards. The seed URLs were identified by the TripMentor stakeholders by manually inspecting the collected venues and using tacit knowledge regarding the major event organizers in the region of Attica. Again, the harvested data were stored in a separate Hydria data pond, with the fields stored within this data pond including the event name and date(s), the physical address of the event, the number of people interested in visiting the event, the URL of the event, the unique identifier of the PoI where this event was found, the unique identifier of the event, and the text description of the event. Please note that only aggregate numbers of individuals interested in the event are harvested, and no personally identifiable information is either collected or stored within Hydria.

#### 4.1.2. TripAdvisor Spiders

In this section, we discuss the deployed TripAdvisor spiders for the TripMentor case study; this set of spiders is used to showcase the versatility and usefulness of the data harvesting component. As with the Facebook spiders, we had no control over the TripAdvisor spiders deployed or the data ponds created; all setup, deployment, and data manipulation was done by the TripMentor project partners. The statistics and metadata presented in this work were drawn from usage logs and the data pond schemas.

The first spider was set up and deployed over TripAdvisor, aiming to extract PoIs in the Attica region; after a runtime of around 48 h, it collected information about 7 K different PoIs belonging to a wide variety of categories (as identified by the respective TripAdvisor field), including monuments, museums, landmarks, nature reserves, parks and water parks, different types of restaurants, cafes, etc. For each one of the collected PoIs, the following fields were stored in the Hydria data pond: the venue name in different languages, the overall venue review score, the total number of venue reviews, the ranking of the PoI with respect to other PoIs of the same category in the same broader area (e.g., "4th out of 10 restaurants within the district"), the categories that this PoI appears in, the physical address of the PoI, the PoI phone number, and the TripAdvisor URL of the PoI.

Subsequently, a spider was created to collect the individual user reviews (without the associated user information) for the 7 K venues/PoIs that were previously harvested. After setup and deployment, the spider was able to extract around 300 K individual user reviews in a time frame of around 120 h (five days); the fields of each review that were detected and stored in the respective Hydria data pond are: the review title, the review text, the date of the review, and the review score (in the TripAdvisor bubble format). To better understand the user background, the spider also collected anonymized information about each user that posted a review. This data does not contain any personally identifiable information and was deliberately limited to the following general fields and aggregate metrics: the user country of origin, the total number of user votes (rounded to the nearest ten), the total number of TripAdvisor contributions (rounded to the nearest ten), general user tags (like "history lover"), and generalized age ranges of users. Notice that this information is common among a vast number of TripAdvisor users and cannot be used to personally identify an individual.
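The generalization steps mentioned above (rounding counts to the nearest ten, bucketing ages into ranges) can be sketched as follows; the bucket width is an assumption, since the exact ranges used in the case study are not stated:

```python
def round_to_ten(n: int) -> int:
    """Round a count (votes, contributions) to the nearest ten,
    as described for the anonymized user metrics."""
    return round(n / 10) * 10

def age_range(age: int, width: int = 10) -> str:
    """Generalize an exact age into a range; the 10-year bucket
    width is an illustrative assumption."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

After these transformations, many users share identical field values, which is what makes the stored attributes unsuitable for re-identifying any individual.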

The spider examples presented above show only a fraction of the functionality that is available within Hydria. Apart from focused crawlers to crawl the Web for relevant pages and Facebook or TripAdvisor spiders to harvest data from the respective social media sites, Hydria also provides Twitter monitors. These monitors use the Twitter [84] search or stream API to perform keyword-based filtering of published tweets; all retrieved tweets may be subsequently stored in appropriately configured Hydria data ponds for further processing.
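The core of the keyword-based filtering performed by the Twitter monitors can be sketched as below. The API client itself is omitted, and the tweet field names are assumptions; only the filtering step on already-fetched tweet dictionaries is shown:

```python
# Sketch of keyword-based tweet filtering; the "text" field name is an
# assumption about the shape of already-fetched tweet dictionaries.
def matches_keywords(tweet: dict, keywords: list[str]) -> bool:
    """Case-insensitive check for at least one tracked keyword."""
    text = tweet.get("text", "").lower()
    return any(kw.lower() in text for kw in keywords)

def filter_stream(tweets: list[dict], keywords: list[str]) -> list[dict]:
    """Yield only tweets that mention at least one tracked keyword,
    ready to be stored in an appropriately configured data pond."""
    return [t for t in tweets if matches_keywords(t, keywords)]
```

In a live deployment the same predicate would be applied to each tweet as it arrives from the search or stream API, before insertion into the data pond.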

#### *4.2. Importing Datasets and Adding/Modifying Records*

Besides automated data harvesting, Hydria also offers a file import service (as described in Section 3.1.2) that allows users to easily import their own datasets into a Hydria data pond. In our case study, we asked partners from the TripMentor project to use the file import tool to incorporate a new CSV dataset into Hydria. One of the project partners responded and reported that they used Hydria to store an in-house list of tourism stakeholders (who could be interested in the project results) in a Hydria data pond. The imported dataset consisted of several hundred individual records of companies and stakeholders operating in the tourism sector, alongside their contact information, and was shared with the rest of the TripMentor partners by defining the appropriate access rights.

Subsequently, other project partners were able to browse the created data pond with the tourism-related companies and add or modify records as needed by filling out the different data pond fields, tagging records with notes for the data pond curator, and saving any desired changes in the specific data pond. Figure 4 gives an overview of the aforementioned data pond; at the top of the figure, controls providing access to all available data pond functionality are presented to the user. Figure 5 shows the add record tool, where the user may insert individual records, providing data for a multitude of fields of different types (free text, number, drop-down lists, complex types, and an image field).


**Figure 4.** A data pond example.


**Figure 5.** Manual record creation.

In this figure, we can observe the use of the complex data types feature to design groups of input fields, which may also be recurring. For example, in the aforementioned dataset the curator may choose to use a complex data type to jointly represent longitude/latitude information (under the complex type named "coordinates"), or branch information (comprising the fields "branch name", "address", "phone" and "email"). Additionally, the curator may use the latter complex type (branch information) as a recurring input field, to model contact information about a company that has multiple branches. Recurring input fields effectively model the master-detail relationships between parent and child objects (one-to-many relationships). In the future, we plan to support more complex data types, such as voice and video recording, time-series and streaming data.
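The complex and recurring data types described above map naturally onto nested record structures. The sketch below mirrors the "coordinates" and branch examples from the text; the class layout is illustrative, not the actual Hydria storage model:

```python
from dataclasses import dataclass, field

# Illustrative model of the complex/recurring field types described above;
# not the actual Hydria storage model.
@dataclass
class Coordinates:
    longitude: float
    latitude: float

@dataclass
class Branch:
    branch_name: str
    address: str
    phone: str
    email: str

@dataclass
class Company:
    name: str
    coordinates: Coordinates               # complex type: grouped input fields
    branches: list[Branch] = field(default_factory=list)  # recurring input field
```

The `branches` list is the recurring input field: one parent `Company` record can own any number of child `Branch` records, which is exactly the master-detail (one-to-many) relationship the text describes.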
