#### *4.4. Visualizing Information*

Having populated the Hydria data ponds with data concerning different TripMentor needs (e.g., PoIs and related information from Facebook and TripAdvisor spiders, and tourism stakeholder data), the TripMentor partners were able to search, analyze, filter and visualize the stored data by using the powerful yet easy-to-use Hydria data analysis module. The analyzed and visualized information is presented in Figure 7 and is entirely produced using tools provided within the Hydria environment (i.e., no external visualization modules were used to create the graphs shown).

**Figure 7.** Graphs produced by the TripMentor data.

More specifically, the top left graph in Figure 7 presents a pie chart with the different PoIs in the Attica region grouped by venue category, while the graph in the center of Figure 7 provides a bar chart with the total number of Facebook reactions vs. the number of love reactions on posts created for the different venues in the Attica region. Both of these graphs visualize the data harvested in the context of TripMentor from Facebook spidering. The rest of the graphs presented in Figure 7 visualize the data extracted from the TripAdvisor social network: the bottom left graph is a donut chart showing the age range of the users that checked in at or commented on different cultural venues in the Attica region; the bottom right diagram presents a bar chart with the most popular (self-assigned) user tags of users who have visited at least one PoI in the Attica region in Greece; and the geo graph in the top-right corner of the figure presents the geographical location of origin (at country granularity) of the visitors of the PoIs within the respective data lake.

Notice that the visualizations in Figure 7 present only a meaningful sample (in the context of the examined case study) of the available data processing and visualization capabilities of Hydria; these capabilities extend to include numerous additional graph types (as enumerated in Section 3.3 above), export to different formats such as raw/CSV/TSV, and many more.

#### **5. Indicative Application Scenarios for Hydria**

In this section, we provide four distinct application scenarios that demonstrate the versatility of Hydria and highlight its usefulness in diverse cultural heritage setups.

#### *5.1. Hydria for Curators*

In this application scenario, let us consider Auguste, a curator at a small regional museum who wants to collect mentions of his museum from various online social networks and perform simple sentiment analysis on these mentions to understand what visitors like or dislike about the museum. Since the museum's resources are not sufficient for maintaining an IT department or acquiring the necessary computational resources, Auguste resorts to manually skimming through scattered visitor reviews on various social media (e.g., Facebook, TripAdvisor, Google reviews) and regularly searching Twitter to get an overall sense of visitor opinions about his museum. Of course, as this is a constantly evolving process, he has to personally search for new useful reviews and take into account recent unexpected events (e.g., a power outage that disappointed many first-time visitors) that may skew review scores.

Clearly, Auguste would greatly benefit from an online, free, and powerful information system that would allow him to create and launch automated social media harvesting tools able to monitor popular social platforms and store content (e.g., posts, tweets, reviews) transparently in an online, easily accessible data store. Such a system would allow him to access the stored data, perform the required opinion analysis, and easily visualize the results as different types of graphs to be used for press and media releases.
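To make the notion of "simple sentiment analysis" on harvested mentions concrete, the following is a minimal lexicon-based sketch. It is not Hydria's implementation: the word lists, the function names `sentiment_score` and `label`, and the sample mentions are all assumptions invented for illustration, and a realistic analysis would likely rely on a trained model or a much larger lexicon.

```python
import re

# Hypothetical, tiny sentiment lexicons (illustrative only).
POSITIVE = {"great", "lovely", "friendly", "wonderful"}
NEGATIVE = {"disappointing", "closed", "rude", "outage"}


def sentiment_score(text):
    """Count positive minus negative lexicon words in a mention."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def label(text):
    """Map the score to a coarse sentiment label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Invented sample mentions, standing in for harvested posts and reviews.
mentions = [
    "Lovely collection and friendly staff",
    "Power outage during our visit, very disappointing",
    "Visited on Tuesday",
]
labels = [label(m) for m in mentions]
print(labels)
```

Aggregating such labels per week would, under these assumptions, let a curator spot the effect of one-off events such as the power outage mentioned above.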

#### *5.2. Hydria for Researchers*

In another scenario, Nikki, a member of a humanities research group, wants to analyze how particular events in European history are perceived by citizens of different European countries and record the reflections of individuals on those events. This survey is part of a European effort and involves a large consortium of researchers from different disciplines; as Nikki's group is coordinating the survey, she is also responsible for setting up the platform for the collection and processing of survey data. Nikki cannot use any of the popular survey creation tools, as they do not support user management and access control over the data (each consortium partner should only have access to the data they collected, with the exception of the coordinator, who should be able to access all data). She finally decides to contact an IT company and explain the needs and specificities of the survey. Subsequently, her group would need to buy and maintain a costly on-site IT infrastructure to host the developed solution and, possibly, hire an IT professional or company to keep both the system and the infrastructure up to date.

Clearly, Nikki and her group would benefit from an online service that would allow her to create and deploy such a platform in a fast, free, and effortless way; this system would be a valuable tool *beyond anything currently supported*. After the platform is created, Nikki will be able to create and manage the users of the deployed platform and define access control policies. These users will then be able to log in and either input questionnaire data themselves, or directly provide the survey subjects with appropriate credentials that identify the owner of the input. When the data collection phase is completed, the survey coordinator (i.e., Nikki's group) may use the available searching and filtering techniques to issue appropriate queries, export data of interest for analysis (e.g., to tools like SPSS), and visualize the findings of the analysis in an easy and intuitive way. Since Nikki's group is coordinating the survey, it has access to all inserted data, while the access of other groups is restricted according to the enforced data policy.
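The access policy described in this scenario (each partner reads only the data it collected, while the coordinator reads everything) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not Hydria's access control mechanism: the `SurveyRecord` type, the `visible_records` function, the group names, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyRecord:
    owner_group: str  # partner group that collected the record (hypothetical field)
    country: str
    reflection: str


def visible_records(records, user_group, coordinator_group):
    """Return the records that a member of `user_group` may read."""
    if user_group == coordinator_group:
        # The coordinator is allowed to read all collected data.
        return list(records)
    # Every other partner only sees the records it collected itself.
    return [r for r in records if r.owner_group == user_group]


# Invented toy data, for illustration only.
records = [
    SurveyRecord("group_a", "GR", "reflection 1"),
    SurveyRecord("group_b", "IT", "reflection 2"),
    SurveyRecord("group_b", "IT", "reflection 3"),
]

partner_view = visible_records(records, "group_b", coordinator_group="group_a")
coordinator_view = visible_records(records, "group_a", coordinator_group="group_a")
print(len(partner_view), len(coordinator_view))
```

Here the partner `group_b` sees its own two records, while the coordinating group sees all three; a group with no records sees an empty list.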

#### *5.3. Hydria for Data Scientists*

Sophia is a computer scientist working in an organization that provides support in the construction and maintenance of a collaborative home-brewed database for cultural heritage applications. In this context, Sophia is responsible for the curation of the database and its enrichment/integration with existing Web resources like DBpedia [85] and various online thesauri (e.g., the Getty Art & Architecture Thesaurus [86]) using both manual/crowdsourced and automatic techniques [87,88]. As this knowledge base is continuously evolving, monitoring its quality over time becomes an essential task. Having access to an appropriate information system that supports publish/subscribe functionality for alerting on events or data inputs of interest would allow Sophia and other data scientists to subscribe (with appropriate textual and attribute constraints) and get notified about (i) spurious and/or unusual input in the collaborative database, (ii) the creation/evolution of different schemas used to represent various data facets, and (iii) the trending of specific terms or attributes in the database. Such functionality would be an invaluable tool that would simplify database moderation, as Sophia would, for example, be notified about (i) an unusual database update action that mistakenly records the "Benaki Museum" in Attica, Wyoming, instead of Attica, Greece, (ii) a newly published dataset on European History of Art containing appropriate metadata for exhibits located in museums that are already using the system for storing collection-specific information, or (iii) a new research topic.
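A subscription with textual and attribute constraints, as described above, can be illustrated with a minimal matching sketch. The `Subscription` class, the `notify` function, the event fields (`text`, `country`), and the sample event are assumptions made for this example only; they do not reflect the interface of Hydria's publish/subscribe module.

```python
from dataclasses import dataclass, field


@dataclass
class Subscription:
    subscriber: str
    keywords: frozenset = frozenset()  # textual constraint: all must occur in the text
    attributes: dict = field(default_factory=dict)  # attribute -> required value

    def matches(self, event):
        """Check both the textual and the attribute constraints against an event."""
        words = set(event.get("text", "").lower().split())
        if not self.keywords <= words:
            return False
        return all(event.get(k) == v for k, v in self.attributes.items())


def notify(subscriptions, event):
    """Return the subscribers whose subscriptions match the published event."""
    return [s.subscriber for s in subscriptions if s.matches(event)]


# Hypothetical subscriptions and a hypothetical update event.
subscriptions = [
    Subscription("sophia", frozenset({"benaki", "museum"}), {"country": "US"}),
    Subscription("other_user", frozenset({"getty"})),
]
event = {"text": "Benaki Museum location updated", "country": "US"}
notified = notify(subscriptions, event)
print(notified)
```

In this sketch, a subscription that pairs the keywords "benaki" and "museum" with an unexpected country value would flag an erroneous update like the Attica, Wyoming example.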

#### *5.4. Hydria for End Users*

Finally, Matteo is an undergraduate student from an Information Studies department writing his thesis on contextual reasoning for cultural heritage applications. Matteo is mainly interested in retrieving scientific publications on his topic of interest, following the work of prominent researchers in the area, and studying the evolution of the field over time. Due to the particularities of his research field (i.e., focused but interdisciplinary topic, heavy mathematical background), he regularly resorts to online resources, like the DBLP digital library [68] and the WikiCFP [70] portal, to search for new relevant areas and to study and map the evolution of the field in terms of scientific papers and related venues (like conferences and workshops). To do so, he has to periodically download relevant datasets from these sites (e.g., the raw DBLP data of all indexed papers), filter them to retain only relevant information, and store them for further processing (e.g., perform timeline analysis on the newly available data). Even though searching for interesting/related works this week turned up nothing, a search next week may return new information or even new datasets. Clearly, an information system that is able to (i) easily integrate several online digital sources, (ii) incorporate new datasets, (iii) analyze the data and visualize the results, and (iv) capture his long-term information need (using publish/subscribe functionality) would be a valuable tool that would allow Matteo to save both time and effort.
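The filter-then-analyze step in this scenario can be sketched as a keyword filter followed by a per-year count. The sketch below is only illustrative: the record layout (`title`, `year`), the function name `papers_per_year`, and the sample records are invented for the example and are not actual DBLP entries or a real DBLP schema.

```python
from collections import Counter


def papers_per_year(records, keywords):
    """Keep records whose title mentions any keyword, then count them per year."""
    keywords = [k.lower() for k in keywords]
    counts = Counter(
        rec["year"]
        for rec in records
        if any(k in rec["title"].lower() for k in keywords)
    )
    # Sort by year so the result reads as a timeline.
    return dict(sorted(counts.items()))


# Invented toy records standing in for a downloaded bibliographic dataset.
records = [
    {"title": "Contextual reasoning for museum guides", "year": 2018},
    {"title": "A survey of graph databases", "year": 2018},
    {"title": "Context-aware cultural heritage recommendation", "year": 2019},
    {"title": "Reasoning about context in heritage applications", "year": 2019},
]

timeline = papers_per_year(records, ["context"])
print(timeline)
```

Re-running such a filter whenever a new dataset snapshot is downloaded approximates the long-term information need that the publish/subscribe functionality is meant to capture automatically.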

#### **6. Conclusions and Outlook**

We have presented Hydria, the first online, free, zero-administration platform that offers both fundamental and advanced user and data/knowledge management functionality for big cultural data and targets users with little or no IT background. Hydria enables the direct incorporation of heterogeneous data that has been recorded in dispersed formats, while specialized processing engines ingest data without compromising the data structure, making it available for tasks such as visualization, mining, analytics and reporting. Thus, a system in the form of a data lake meant for acquiring, storing, organizing, analyzing and sharing multi-faceted cultural heritage data constitutes a valuable asset to several different cultural heritage applications such as museum curation, user study management, bibliographical analysis, dataset management, and data integration.

In this work, we discussed the architectural solutions behind the proposed system, outlined the individual module technologies, and provided details on the module orchestration. We also described several novel services that include automated data harvesting from the web and social media, integrated user input collection via standard and customizable data types, easy-to-perform data analysis and visualization, publish/subscribe functionality to facilitate sharing of different facets and data shards, and access control mechanisms. Finally, we advocated the appropriateness of our approach for the cultural heritage domain and showcased different scenarios that highlight Hydria's usefulness for cultural data management. We continuously develop new functionality to support more import/export formats and more sophisticated data types, perform user studies to improve usability and document additional user needs, incorporate more data analysis tools and simplify the data analysis procedure, and incorporate versatile data streams from sensors and IoT devices.

Currently, Hydria is running on a commodity server and mainly serves the needs of the TripMentor project; however, the long-term plan is to release it as a free service to any interested parties. This entails tackling several important issues that include scaling, platform viability, and impact measurements. Regarding scaling, we plan to reshape Hydria before releasing a free public version of the system; the intended reshaping includes modifying some system components to be cloud-native, so as to provide resource elasticity and exploit the benefits of cluster computing infrastructures. Ensuring platform viability and maintenance after the end of the project is a typical issue in applied research; we plan to actively pursue new funding that will allow us to continue the development and extension of Hydria. Moreover, the open source ecosystem of tools used to build Hydria allows us to release it as an open source project to the development community to further aid project maintainability. Finally, appropriate impact measurements are an important direction that will drive and affect both the large-scale deployment and the viability of the platform. As is common with research prototypes, Hydria is not directly involved in generating revenue, so impact measurements could involve KPIs such as user base, data quality, application versatility, release efficiency, and system reliability.

The work presented in this paper has several implications for both practitioners and researchers. At the practical level, it introduces a tool that empowers its users to access, interrelate, analyze, share and visualize multi-faceted data harvested from structured or semi-structured sources, through an intuitive graphical interface, without requiring any IT skills. While the implemented system targets the cultural information domain, it can be straightforwardly adapted to any domain where data sourced from social networks and the linked open data (LOD) cloud need to be harvested, managed, analyzed and shared, such as the marketing, political and social analysis domains. The Hydria data lake may also contribute the data pond contents to the LOD cloud, reciprocating from social networks and the LOD cloud through the provision of unified and integrated datasets. The sharing mechanisms of the Hydria system can be leveraged to provide persistent identifiers and automatically register data ponds (or parts of them) that are characterized as "public" to searchable directories, thus adhering to the FAIR data principles [89].

Regarding the research dimension, the architectural paradigm of Hydria, which is a key factor in its success, can be adopted in other classes of systems that offer services to non-IT experts, such as scientific data analysis systems or business data analysis systems. The proliferation of systems based on the architecture of Hydria will accelerate the development cycle of analysis and visualization algorithms that are suitable for non-IT experts, since the extension of the prospective user base will facilitate the gathering of relevant requirements and allow the collection of richer testing and evaluation feedback.

Another research direction for the Hydria platform is to extend cooperation between users beyond data sharing, to include expertise finding among the users of Hydria for advice seeking or joint execution of tasks requiring diverse areas of expertise (e.g., multidisciplinary tasks). To this end, algorithms for expert identification can be developed for the Hydria platform, or appropriate existing algorithms can be identified and tuned (e.g., [90–92]). Expert searches may also extend outside the scope of a Hydria installation (or a federated network of Hydria installations), by interfacing the Hydria platform with expert hiring and crowdworking platforms [93] and by adopting and customizing algorithms for the synchronization of these tasks [94]. In all cases, all types of modules developed for the Hydria system (e.g., analysis or visualization algorithms, or components supporting expert identification and cooperation between users), as well as knowledge about best practices, can be stored in shared and searchable repositories, providing a dynamic, evolving and self-sustained ecosystem for the Hydria platform.

**Author Contributions:** Conceptualization, K.D., P.R., C.T. and C.V.; Methodology, K.D., P.R., C.T. and C.V.; Software, K.D., P.R., C.T., N.P. and C.V.; Visualization, N.P. and K.D.; Validation, K.D., P.R., C.T. and C.V.; Writing—Original draft, K.D., P.R., C.T. and C.V.; Writing—Review & editing, K.D., P.R., C.T., N.P. and C.V. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research has been co-financed by the European Union and Greek national funds through the Operational Programme "Competitiveness, Entrepreneurship and Innovation", under the call RESEARCH—CREATE—INNOVATE (project code: T1EDK-03874).

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
