#### **1. Introduction**

In the last few years, Cultural Informatics (CI) has surfaced as a promising new domain that constitutes the socio-technological approach to understanding, representing, communicating and re-inventing cultures and cultural institutions [1]. CI may also be used in a disruptive fashion, aiming to change the way we understand and experience our cultural heritage [2], by enabling us, for example, to create personalized museum experiences [3,4], to discover facets and stories from new or existing cultural heritage data [5–7], or to create inter-linked cultural data repositories [8–11]. In performing these tasks, CI is creating an avalanche of data, produced by a vast number of related activities such as profiling of or feedback from museum and cultural venue visitors [12–15], social media activity (e.g., posts and comments) related to cultural events [16–23], papers and specialized conferences on the topic [24–27], and raw data on cultural objects such as artifact descriptions [28–31]. This data is typically fragmented and distributed among the different stakeholders, while the data management solutions that are involved vary greatly, ranging from simple spreadsheet files for the less tech savvy to typical data stores such as relational databases [32–34] or semantically richer knowledge bases [9,10,35,36].

From the above, we can conclude that data management is a key technological factor that drives the Cultural Heritage (CH) domain forward [35,37], but the data management solutions applied so far are fragmented, physically distributed, heterogeneous, non-aligned and require specialized IT knowledge to deploy and operate (e.g., [38–40]). Moreover, the asynchronous nature of the data acquisition process itself poses new challenges in the collection, organization, and processing of the relevant data [37]. Proposed solutions for data storage of cultural information (e.g., [9,10,35,41]) usually require significant computing infrastructure, which is not easy to obtain or maintain, as well as the constant support and active involvement of IT experts even for trivial tasks, like creating a graph from the given data, updating the produced statistics, or incorporating a new data source/set. Typically, reconfiguring an existing solution for reuse in another setup, or setting one up from scratch for a specific cultural data management problem, requires *(i)* time-consuming meetings between scientists of different disciplines trying to understand each other's needs and goals, and *(ii)* resource-consuming IT infrastructure that calls for outsourcing to IT specialists and regular maintenance/upgrades to keep up with technological requirements [42]. Due to these issues, a great number of stakeholders (such as small museums or humanities research groups) that lack the resources for infrastructure and/or computing expertise still rely on outdated approaches like *(i)* storing their data in spreadsheets or raw files, *(ii)* sharing their data with colleagues through email, cloud uploads of zip files, or even by mailing electronic copies on removable media, and *(iii)* analyzing the data via sub-standard tools and trial software. Such practices create, in turn, other concerns like data freshness/integrity issues due to versioning, issues with the significance of reported results due to data scarcity/fragmentation, and even ethical issues like unequal access to data and resources [43].

In this work, we present *Hydria*, a novel *free online data lake* meant for *acquiring, storing, organizing, analyzing* and *sharing* heterogeneous, multi-faceted cultural heritage data. Hydria and all the provided functionality (given in detail in the following sections) are fully developed by the authors by resorting to open source tools. The data lake architecture [44] adopted in the design of Hydria enables the direct incorporation of heterogeneous information that has been recorded in dispersed formats, while specialized processing engines ingest data without compromising the data structure, making it available for tasks such as visualization, mining, analytics and reporting. In this sense, the proposed system targets primarily the functional requirements posed by the cultural informatics domain, and enables researchers, curators and other stakeholders within the cultural informatics domain to easily acquire, manage and share data/knowledge within Hydria.

Thus, the Hydria system proposed in this paper is an innovative, integrated framework that enables users with *no prior IT knowledge* to *(i) set up and launch*, in an easy and transparent way, data acquisition services like topical focused crawlers, social media monitors, web scrapers and dataset imports, *(ii) collect* questionnaires and other types of user input data by resorting to several built-in and customizable data entry forms, *(iii) record, organize and manage* collected data by storing them in different data stores (called *data ponds* in the Hydria terminology), *(iv) share* whole data sets or horizontal/vertical data shards via a powerful publish/subscribe mechanism that notifies users when other data ponds store data of interest, *(v) search and analyze* data by using a powerful yet simple point-and-click mechanism that performs queries on the stored data and extracts the requested information in several formats/outputs such as histograms, pie charts, (heat) maps, (stacked) bars/columns, area charts and various file types (like CSV/TSV and raw), and *(vi)* perform basic and advanced *user management* tasks (such as managing users, assigning user privileges and permissions, and performing access control on data and data ponds). All services are designed for usage *by non-IT experts*: they are configured/executed by resorting to step-by-step wizards, contain in-context explanations for the different system functions, and provide online help with examples.

The contributions of this work are three-fold:


From the argumentation presented above, it becomes clear that a free online system in the form of a *data lake* that is meant for *acquiring, storing, organizing, analyzing* and *sharing* heterogeneous, multi-faceted cultural heritage data would be a valuable asset to several different cultural heritage applications such as museum curation, user study management, bibliographical analysis, dataset management, and data integration. Moreover, such a system would be an invaluable source to many cultural informatics projects that either lack the resources or lack the IT expertise to design and deploy their own software and/or hardware infrastructure.

The rest of the paper is organized as follows. Section 2 discusses related work. Subsequently, Section 3 presents the overall system architecture and outlines the different modules as well as the respective services, while Section 4 provides an indicative case study during the Alpha testing phase of Hydria within the TripMentor project [45]. Section 5 presents various application scenarios in the cultural heritage domain and discusses how different stakeholders may benefit from using Hydria. Finally, Section 6 concludes this article and provides future research directions.

#### **2. Related Work**

In this section, we review related research approaches that *(i)* are associated with data acquisition and knowledge extraction for the cultural heritage domain based on social media, *(ii)* present the most prominent solutions in information systems meant for cultural heritage, and *(iii)* include museum and/or user recommendation information systems intended to improve visitors' experience in cultural venues.

#### *2.1. Social Data Management in the Cultural Heritage Domain*

As an ever-growing number of social network users constantly post opinions about cultural venues (by publishing reviews, describing their perceived experiences, using check-ins, subscribing to upcoming events, etc.), high volumes of data/content of great interest to cultural heritage applications are generated within popular social media platforms. The work presented in [16,17] aims to bridge the social media and cultural heritage domains and shows a way to stimulate reflection on history by combining games, social networks, history, and culture. In [20], the authors introduce the notion of the *prosumer*; the term refers to people who, besides consuming information, also produce new content when visiting cultural sites. A prominent paradigm in this line is the HeritageGO system proposed in [18]. HeritageGO tries to convert raw cultural heritage data coming from countries with a vast number of cultural PoIs into meaningful digital information; to this end, the authors present social network users as the main data harvesting lever and use metric quality models to filter the acquired data.

The approaches presented in [19–23] focus on improving cultural tourism and enhancing the visitors' experience by enriching the information concerning historical sites, monuments and other cultural PoIs with social media content, which is uploaded by social network users and obtained using web mining techniques. To do so, identification methods that use geotagged multimedia data from social networks or location-aware services and sensors (e.g., GPS) attached to tourists' mobile devices have been implemented, while classification tools are used to rank the most relevant cultural heritage landmarks with respect to the user context (e.g., location) to enable smart interactions between tourists and their cultural surroundings. The work presented in [46] proposes a Twitter big data-centric solution, which defines a collection of Key Performance Indicators (KPIs) focusing on the quantitative evaluation of cultural heritage sensitivity as expressed by Twitter users, by combining natural language processing, semantic methodologies, location reports and temporal analysis.

Regarding the use of multimedia content in the cultural heritage domain, [47] is a prominent work; the authors describe the main aspects of multimedia social networks (MSNs), present the interactive system GIVAS, and highlight its importance to archeologists, cultural heritage researchers and tourists, as it constitutes a multimedia cooperative framework for managing, exploring, visualizing and sharing cultural heritage data. In [30], the author discusses how published multimedia data (especially videos) concerning cultural heritage venues and gathered from social media are of great importance to field consultants for generating 3D models (by using structure from motion methods).

PATCH [48] is a portable system able to harvest cultural heritage content from distributed and heterogeneous sources (such as social networks), to supply its users with useful and personalized information and services based on their interests and their surroundings, and to provide data management, retrieval and analysis services. This system is the work most similar to the Hydria data lake, both conceptually and functionally; however, PATCH was designed for the needs of a specific project and applied to a particular research study, while our work is an online, free, zero-administration data lake that offers both fundamental and advanced user and data/knowledge management functionality in the cultural heritage domain, can be customized for the requirements of any cultural heritage project, and addresses all users, without requiring any IT background/skills.

#### *2.2. Information Systems for Cultural Heritage*

Information management in the cultural heritage domain concerns a cycle of organizational activity: the acquisition of cultural content from one or more sources, the storage and distribution of this data to those who need to evaluate it, and its final disposition through archiving. Over the years, many solutions aiming at the management, sharing and analysis of cultural heritage information have been proposed, while other investigations have tried to classify the variety of software tools and systems associated with the vast amount of data in the cultural heritage domain. The authors in [35] perform an itemized categorization of software tools and systems used in the cultural heritage area, associated with both spatial and temporal data. The contribution in [42] aims at exploring and classifying knowledge organization systems that are used in the cultural heritage field, while it applies extensive qualitative evaluation to the most prominent ones.

The work presented in [38] introduces the notion of *smart space* as a software development approach that enables creating service-oriented information systems for emerging computing environments for the Internet of Things (IoT), and considers the different principles of semantic-driven design of service-oriented information systems. In a similar spirit, [39] presents the ExhiSTORY infrastructure and discusses how sensors and the IoT can be used in cultural heritage sites so that exhibits communicate with the visitors towards generating rich, personalized, coherent, and highly stimulating experiences. In [49], a number of separate streams and current systems functionalities are examined through the usage of the European EU-CHIC framework, in order to achieve optimal suggestions for enhancing the management of cultural heritage data. The CHIS project [36] aims at constructing an information system to assist operations that involve different user types in the cultural heritage domain, offering a scientific advancement that can improve personalized services in a business environment. The research in [50] focuses on mobile software development for cultural information educational purposes and presents how mobile device users can be well-informed about cultural heritage sites when they visit them.

On the basis of several studies carried out on cultural landscapes in a spatial-planning perspective, [51] discusses the potential and limits of Geographical Information Systems (GIS) for supporting the territorialization of multidisciplinary landscape analysis for the management of a site of the UNESCO world heritage list, and proposes an approach for a GIS responding to landscape-oriented studies. Two similar approaches that propose a 3D representation of cultural objects, in order to facilitate researchers in determining both the relationships between data and the spatial relationships between cultural information, are presented in [31,52].

Recommendation systems are very popular in many scientific domains; in the cultural heritage field, recommendation systems constitute powerful tools that may help users improve their experience in cultural venues/PoIs. The work in [53] proposes that guidelines and recommendations should be used in all cultural infrastructures in Poland, associated with technical perspectives of digitization (such as technical and structural metadata, rules series, parameters and formats). The work in [54] describes an info-mobility recommendation system, coined TAIS, that assists tourists while traveling. TAIS can interpret user actions, uncover their preferences, and suggest cultural sites with respect to the users' current locations (while also providing possible transportation means). The approach in [40] concerns a (big data) architecture that is able to host applications that retrieve cultural heritage data from distributed and heterogeneous repositories; the authors introduce an innovative user-focused recommendation method for proposing cultural items, to be applied on top of the data management infrastructure. Finally, the work in [41] introduces a novel ontology-based user method, aimed at improving personalized suggestions and users' visit experience by learning their background and interests.

#### *2.3. Information Systems for Museums*

In recent decades, a great number of cultural institutions (e.g., museums or national archives) have integrated information systems in order to catalogue and document their exhibits, disseminate cultural information from their web sites and/or deliver informal education to their audience. Moreover, many information systems applied in institutions use Virtual Reality (VR) and Augmented Reality (AR) technologies aiming at enhancing tourists' experiences. The Digital Diorama [55] is a Mixed Reality (MR) system applied to museums focusing on rendering more features than existing dioramas in museum exhibitions, by prefetching background information. The work in [56] aims to enrich visitors' experiences in museum exhibitions by introducing a multichannel information system. The work in [57] introduces a virtual informal education system for the well-known ancient illustration of "Qing-ming Festival by the Riverside", by using VR technology to generate a wide, captivating, and responsive virtual environment. The work in [58] elaborates on the installation and integration of information systems in museums, identifying four success factors for relevant projects, while stressing the fundamental differences between museums and commercial companies. The approach in [59] describes the formulation and the adaptation of an AR-based system tailored for museum supervision; this research aims to narrow the gap between man and machine by applying instinctive as well as user-friendly synergies in an omnipresent computing environment. In a similar direction, TOMS [60] is a collaborative semantic-based system developed to provide sharing services for a wide variety of cultural heritage multimedia content between national museums in Thailand.

Personalization systems emphasize tailoring a service or a product in order to accommodate particular individual preferences; presently, many museums and cultural institutions adopt personalization systems in order to offer custom-fit guidance and thus improve visitors' experiences. The work in [13] proposes a multimedia information system that is able to support multiple display devices, is built on top of an application server hosting plentiful digital content, and is presented to visitors with respect to their particular requirements. In [14], the authors demonstrate Future Worlds, a knowledgeable game-based environment for cooperative feasibility investigations in science museums that is able to dynamically identify and adjust to visitors' specific preferences while they tour the exhibition. The approach in [15] puts forward a web-based intelligent virtual assistant service for virtual museum explorations that can advance suggestions with respect to the museum exhibits, tailored to users' choices. In a similar spirit, [61] discusses experimental results obtained towards personalizing a museum visit based on gaming, using an approach relying on users' cognitive style, social networks, and recommendations. In [62], the authors, trying to connect cultural heritage, games and social networks, design social network games to be used for accomplishing user profiling and supporting museum visits; the games are also presented in a generic framework in cultural heritage. Finally, in [3], the authors investigate the use of indirect profiling methods through a visitor quiz, in order to provide the visitor with specific museum content, identify key profiling issues, and discuss guidelines towards a generalized approach for the profiling needs of cultural institutions.

Other works adopt different technological approaches, such as location-aware or spatial methods, in order to provide particular tour guidelines to their visitors. In [63], the authors investigate the practical usage of GIS as a tool, while inspecting how museums can adapt GIS technologies in separate operating zones. The work in [64,65] demonstrates a similar approach of a 3D information system, developed to manage cultural heritage information, which provides information layers that link with the exterior environment of the artifacts, following an approach similar to GIS solutions, in order to allow relationships between individual items.

To the best of our knowledge, Hydria is the first system that focuses on collecting, managing, analyzing, and sharing diverse, multi-faceted data in the cultural heritage domain and allows users without an IT background to deploy, populate, and manage their own data stores within minutes, alleviating the need to rely on expensive custom-made solutions that require IT infrastructure and skills to maintain.

#### **3. System Architecture**

The *Hydria data lake* allows users to *(i)* harvest and/or import data from structured and semi-structured data sources, *(ii)* collect user input data by resorting to several built-in and customizable data entry forms, *(iii)* store and manage collected data by organizing them in different big data management data stores (called *data ponds* in the Hydria terminology), *(iv)* share whole data sets or horizontal/vertical data shards via a powerful publish/subscribe mechanism that notifies users when other data ponds store data of interest, *(v)* search, filter and analyze data by using a powerful yet simple point-and-click mechanism that performs queries on the stored data and extracts the requested information in several visual representations and outputs, and *(vi)* perform basic and advanced user management tasks on the stored data. In Hydria, data ponds are custom-made database collections that are used to conceptually group data within a specific cultural heritage application. Figure 1 provides a high-level view of the system architecture, of the different services and functionalities implemented, and their conceptual organization within the Hydria data lake. In what follows, we present in detail the different services and modules that comprise the Hydria ecosystem and briefly outline the functionality and added value of each module.

**Figure 1.** Hydria data lake architecture.
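
To make the sharing of horizontal/vertical data shards through a publish/subscribe mechanism more concrete, the following is a minimal illustrative sketch in Python. It is not Hydria's implementation: the class names `DataPond` and `Subscription`, the in-memory lists, the example venue names and all field names are assumptions introduced purely for illustration. A horizontal shard is modelled as a row filter (which records are shared) and a vertical shard as a field projection (which attributes of each record are shared).

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

Record = Dict[str, Any]


@dataclass
class Subscription:
    """Hypothetical subscription: which records and which fields are of interest."""
    subscriber: str
    row_filter: Callable[[Record], bool]   # horizontal shard: selects records
    fields: Optional[List[str]] = None     # vertical shard: selects fields (None = all)
    inbox: List[Record] = field(default_factory=list)


@dataclass
class DataPond:
    """Hypothetical stand-in for a data pond: a named collection of records."""
    name: str
    records: List[Record] = field(default_factory=list)
    subscriptions: List[Subscription] = field(default_factory=list)

    def subscribe(self, sub: Subscription) -> None:
        self.subscriptions.append(sub)

    def publish(self, record: Record) -> None:
        """Store the record unchanged, then notify every matching subscription."""
        self.records.append(record)
        for sub in self.subscriptions:
            if not sub.row_filter(record):
                continue
            if sub.fields is None:
                shard = dict(record)
            else:
                # Records are heterogeneous, so a requested field may be absent.
                shard = {k: record[k] for k in sub.fields if k in record}
            sub.inbox.append(shard)


pond = DataPond("museum_reviews")
curator = Subscription(
    subscriber="curator",
    row_filter=lambda r: r.get("venue") == "Acropolis Museum",
    fields=["venue", "rating"],
)
pond.subscribe(curator)

pond.publish({"venue": "Acropolis Museum", "rating": 5, "author": "anon-1"})
pond.publish({"venue": "Benaki Museum", "rating": 4, "author": "anon-2"})
pond.publish({"venue": "Acropolis Museum", "source": "twitter"})  # no rating field

print(curator.inbox)
```

In this sketch the pond keeps every record exactly as it arrived, while the subscriber only receives the matching records restricted to the requested fields. A production system would additionally need persistence, access control and asynchronous notification, none of which are modelled here.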
