**1. Introduction**

Big data analytics is one of the buzzwords in today's digital world. It entails examining big data and uncovering the hidden patterns, correlations, etc. available in the data [1]. Big data analytics extracts and analyzes random data sets, forming them into meaningful information. According to statistics, the overall amount of data generated in the world in 2021 was approximately 79 zettabytes, and this is expected to double by 2025 [2]. This unprecedented amount of data was the result of a data explosion that occurred during the last decade, wherein data interactions increased by 5000% [3].

Big data deals with the volume, variety, and velocity of data to process and provides veracity (insightfulness) and value to data. These are known as the 5 Vs of big data [4]. An unprecedented amount of diverse data is acquired, stored, and processed with high data quality for various application domains. These include business transactions, real-time streaming, social media, video analytics, and text mining, creating a huge amount of semior unstructured data to be stored in different information silos [5]. The efficient integration and analysis of these multiple data across silos are required to divulge complete insight into the database. This is an open research topic of interest.

Big data and its related emerging technologies have been changing the way e-commerce and e-services operate and have been opening new frontiers in business analytics and related research [6]. Big data analytics systems play a big role in the modern enterprise managemen<sup>t</sup> domain, from product distribution to sales and marketing, as well as analyzing hidden trends, similarities, and other insights and allowing companies to analyze and optimize their data

**Citation:** Nambiar, A.; Mundra, D. An Overview of Data Warehouse and Data Lake in Modern Enterprise Data Management. *Big Data Cogn. Comput.* **2022**, *6*, 132. https://doi.org/ 10.3390/bdcc6040132

Academic Editors: Domenico Talia and Fabrizio Marozzo

Received: 28 September 2022 Accepted: 2 November 2022 Published: 7 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).

33

to find new opportunities [7]. Since organizations with better and more accurate data can make informed business decisions by looking at market trends and customer preferences, they can gain competitive advantages over others. Hence, organizations invest tremendously in artificial intelligence (AI) and big data technologies to strive toward digital transformation and data-driven decision making, which ultimately leads to advanced business intelligence [6]. As per reports, the worldwide big data analytics and business intelligence software applications markets seem as though they will increase by USD 68 billion and 17.6 billion by 2024–2025, respectively [8].

Big data repositories exist in many forms, as per the requirements of corporations [9]. An effective data repository needs to unify, regulate, evaluate, and deploy a huge amount of data resources to enhance the analytics and query performance. Based on the nature and the application scenario, there are many different types of data repositories other than traditional relational databases. Two of the popular data repositories among them are enterprise data warehouses and data lakes [10–12].

A data warehouse (DW) is a data repository which stores structured, filtered, and processed data that has been treated for a specific purpose, whereas a data lake (DL) is a vast pool of data for which the purpose is not defined [9]. In detail, data warehouses store large amounts of data collected by different sources, typically using predefined schemas. Typically, a DW is a purpose-built relational database running on specialized hardware either on the premises or in the cloud [13]. DWs have been used widely for storing enterprise data and fueling business intelligence and analytics applications [14–16].

Data lakes (DLs) have emerged as big data repositories that store raw data and provide a rich list of functionalities with the help of metadata descriptions [10]. Although the DL is also a form of enterprise data storage, it does not inherently include the same analytics features commonly associated with data warehouses. Instead, they are repositories storing raw data in their original formats and providing a common access interface. From the lake, data may flow downstream to a DW to ge<sup>t</sup> processed, packaged, and become ready for consumption. As a relatively new concept, there has been very limited research discussing various aspects of data lakes, especially in Internet articles or blogs.

Although data warehouses and data lakes share some overlapping features and use cases, there are fundamental differences in the data managemen<sup>t</sup> philosophies, design characteristics, and ideal use conditions for each of these technologies. In this context, we provide a detailed overview and differences between both the DW and DL data management schemes in this survey paper. Furthermore, we consolidate the concepts and give a detailed analysis of different design aspects, various tools and utilities, etc., along with recent developments that have come into existence.

The remainder of this paper is organized as follows. In Section 2, the terminology and basic definitions of big data analytics and the data managemen<sup>t</sup> schemes are analyzed. Furthermore, the related works in the field are also summarized in this section. In Section 3, the architectures of both the data warehouse and data lake are presented. Next, in Section 4, the key design aspects of the DW and DL models along with their practical aspects are presented at length. Section 5 summarizes the various popular tools and services available for enterprise data management. In Sections 6 and 7, the open challenges and promising directions are explained, respectively. In particular, the pros and cons of various methods are critically discussed, and the observations are presented. Finally, Section 8 concludes this survey paper.

#### **2. Definition: Big Data Analytics, Data Warehouses, and Data Lakes**

The definitions and fundamental notions of various data managemen<sup>t</sup> schemes are provided in this section. Furthermore, related works and review papers on this topic are also summarized.

#### *2.1. Big Data Analytics*

With significant advancements in technology, unprecedented usage of computer networks, multimedia, the Internet of Things, social media, and cloud computing has occured [17]. As a result, a huge amount of data, known as "big data", has been generated. It is required to collect, manage, and analyze these data efficiently via big data processing. The process of big data processing is aimed at data mining (i.e., extracting knowledge from large amounts of data), leveraging data management, machine learning, high-performance computing, statistics, pattern recognition, etc. The important characteristics of big data (known as the seven Vs of big data) (https://impact.com/marketing-intelligence/7-vs-big-data/, accessed on 25 September 2022) are as follows:


Typically, there are mainly three kinds of big data processing possible: batch processing, stream processing, and hybrid processing [18]. In batch processing, data stored in the non-volatile memory will be processed, and the probability and temporal characteristics of data conversion processes will be decided by the requirements of the problems. In stream processing, the collected data will be processed without storing them in non-volatile media, and the temporal characteristics of data conversion processes will mainly be determined by the incoming data rate. This is suitable for domains that require low response times. Another kind of big data processing, known as hybrid processing, combines both the batch and stream processing techniques to achieve high accuracy and a low processing time [19]. Some examples of hybrid big data processing are Lambda and Kappa Architecture [20]. The Lambda Architecture processes huge quantities of data, enabling the batch-processing and stream-processing methods with a hybrid approach. The Kappa Architecture is a simpler alternative to the Lambda Architecture, since it leverages the same technology stack to handle both real-time stream processing and historical batch processing. However, it avoids maintaining two different code bases for the batch and speed layers. The major notion is to facilitate real-time data processing using a single stream-processing engine, thus bypassing the multi-layered Lambda Architecture without compromising the standard quality of service.

## *2.2. Data Warehouses*

The concept of data warehouses (DWs) was introduced in the late 1980s by IBM researchers Barry Devlin and Paul Murphy with the aim to deliver an architectural model to solve the flow of data to decision support environments [21]. According to the definition by Inmon, "*a data warehouse is a subject-oriented, nonvolatile, integrated, time-variant collection of data in support of management decisions*" [22]. Formally, a data warehouse (DW) is a large data repository wherein data can be stored and integrated from various sources in a wellstructured manner and help in the decision-making process via proper data analytics [23]. The process of compiling information into a data warehouse is known as data warehousing.

In enterprise data management, data warehousing is referred to as a set of decisionmaking systems targeted toward empowering the information specialist (leader, administrator, or analyst) to improve decision making and make decisions quicker. Hence, DW systems act as an important tool of business intelligence, being used in enterprise data managemen<sup>t</sup> by most medium and large organizations [24,25]. The past decade has seen unprecedented development both in the number of products and services offered and in the wide-scale adoption of these advancements by the industry. According to a comprehensive research report by Market Research Future (MRFR) titled "*Data Warehouse as a Service Market*

*information by Usage, by Deployment, by Application and Organization Size—forecast to 2028*", the market size will reach USD 7.69 billion, growing at a compound annual growth rate of 24.5%, by 2028 [26].

In the data warehouse framework, data are periodically extracted from programs that aid in business operations and duplicated onto specialized processing units. They may then be approved, converted, reconstructed, and augmented with input from various options. The developed data warehouse then becomes a primary origin of data for the production, analysis, and presentation of reports via instantaneous reports, e-portals, and digital readouts. It employs "online analytical processing" (OLAP), whose utility and execution needs differ from those of the "online transaction processing" (OLTP) implementations typically backed up by functional databases [27,28]. OLTP programs often computerize the handling of administrative data processes, such as order entry and banking transactions, which are an organization's necessary activities. Data warehouses, on the other hand, are primarily concerned with decision assistance. As shown in Figure 1a, a data warehouse integrates data from various sources and helps with analysis, data mining, and reporting. A detailed description of a DW's architecture is presented in Section 3.1.

Data warehousing advancements have benefited various sectors, including production (for supply shipment and client assistance), business (for profiling of clients and stock governance), monetary administrations (for claims investigation, risk assessment, billing examination, and detecting fraud), logistics (for vehicle administration), broadcast communications (in order to analyze calls), utility companies (in order to analyze power use), and medical services [29]. The field of data warehousing has seen immense research and developments over the last two decades in various research categories such as data warehouse architecture, data warehouse design, and data warehouse evolution.

## *2.3. Data Lake*

By the beginning of the 21st century, new types of diverse data were emerging in ever-increasing volumes on the Internet and at its interface to the enterprise (e.g., webbased business transactions, real-time streaming, sensor data, and social media). With the huge amount of data around, the need to have better solutions for storing and analyzing large amounts of semi-structured and unstructured data to gain relevant information and valuable insight became apparent. Traditional schema-on-write approaches such as the extract, transform, and load (ETL) process are too inefficient for such data managemen<sup>t</sup> requirements. This gave rise to another popular modern enterprise data managemen<sup>t</sup> scheme known as data lakes [30–32].

Data lakes are centralized storage repositories that enable users to store raw, unprocessed data in their original format, including unstructured, semi-structured, or structured data, at scale. These help enterprises to make better business decisions via visualizations or

dashboards from big data analysis, machine learning, and real-time analytics. A pictorial representation of a data lake is given in Figure 1b.

According to Dixon, "*whilst a data warehouse seems to be a bottle of water cleaned and ready for consumption, then "Data Lake" is considered as a whole lake of data in a more natural state*" [33]. Another definition for the data lake is provided in [34], and it is as follows: "*a data lake stores disparate information while ignoring almost everything*". The explanation of data lakes from an architectural viewpoint is given in [35], and it is as follows: "*A data lake uses a flat architecture to store data in its raw format. Each data entity in the lake is associated with a unique, i.e.tifier and a set of extended metadata, and consumers can use purpose-built schemas to query relevant data, which will result in a smaller set of data that can be analyzed to help answer a consumer's question*". A data lake houses data in its original raw form. The data in data lakes can vary drastically in size and structure, and they lack any specific organizational structure. A data lake can accommodate either very small or huge amounts of data as required. All of these features provide flexibility and scalability to data lakes. At the same time, challenges related to its implementation and data analytics also arise.

Data lakes are becoming increasingly popular for organizations to store their data in a centralized manner. A data lake may contain unstructured or multi-structured data, where most of them may have unrealized value for the enterprise. This allows organizations to store their data from different sources without any overhead related to the transformation of the data [30]. This also allows ad hoc data analyses to be performed on this data, which can then be used by organizations to drive key insights and data-driven decision making. DLs replace the previous way of organizing and processing data from various sources with a centralized, efficient, and flexible repository that allows organizations to maximize their gains from a data-driven ecosystem. Data lakes also allow organizations to scale them to their needs. This is achieved by separating storage from the computational part. Complex transformation and preprocessing of data in the case of data warehouses is eliminated. The upfront financial overhead of data ingestion is also reduced. Once data are collated in the lake or hub, it is available for analysis for the organization.

#### *2.4. The Difference between Data Warehouses and Data Lakes*

Although data warehouses and data lakes are used as two interchangeable terms, they are not the same [21]. One of the major differences between them is the different structures (i.e., processed vs. raw data). A data warehouse stores data in processed and filtered form, whereas data lakes store raw or unprocessed data. Specifically, data are processed and organized into a single schema before being put into the warehouse, whereas raw and unstructured data are fed into a data lake. Analysis is performed on the cleansed data in the warehouse. On the contrary, in a data lake, data are selected and organized as and when needed.

As for storing processed data, a data warehouse is economic. On the contrary, data lakes have a comparatively larger capacity than the data warehouse and are ideal for raw and unprocessed data analysis and employing machine learning. Another key difference is the objective or purpose of use. Typically, processed data that flow into data warehouses are used for specific purposes, and hence the storage space will not be wasted, whereas the purpose of usage for the data lake is not defined and can ideally be used for any purpose. To use processed or filtered data, no specialized expertise is required, as merely familiarization with the presentation of data (e.g., charts, sheets, tables, and presentations) will do. Hence, DWs can be used by any business or individual. On the contrary, it is comparatively difficult to analyze DLs without familiarity with unprocessed data, hence requiring data scientists with appropriate skills or tools to comprehend them for specific business use. Accessibility or ease of use of data repositories is ye<sup>t</sup> another aspect that differentiates data warehouses and data lakes. Since the architecture of a data lake has no proper structure, it has flexibility of use. Instead, the structure of a DW makes sure that no foreign particles invade it, and it is very costly to manipulate. This feature makes it very secure, too. A detailed analysis of the differences between data warehouses and data lakes is given in Table 1.


**Table 1.** Differences between data warehouses and data lakes.

## *2.5. Literature Review*

A summary of various research works in the field of data warehouses and data lakes is presented here. A list of various survey articles on data warehouses and data lakes is depicted in Table 2. Mainly, data warehouse review works address architecture modeling and its comparisons [36,37], the evolution of the DW concept [38], real-time data warehousing and ETL [39], etc. Compared with the data warehouse literature reviews, data lake papers are relatively fewer in number. Data lake review works summarize recent approaches and the architecture of DLs [31,32] as well as the design and implementation aspects [30]. To the best of our knowledge, only one work on comparing data warehouses and data lakes was found in the literature [12]. In contrast to that article, our work provides a comprehensive analysis of both data managemen<sup>t</sup> schemes by addressing various aspects such as, definitions, architecture, practical design considerations, tools and services, challenges, and opportunities in detail. In addition to the survey papers, we also consolidate various works on data warehouses and data lakes in the reported literature and classify them in Table 3 based on their functions and utility.

**Table 2.** Summary of existing survey articles on data warehouses and data lakes.



**Table 2.** *Cont.*

**Table 3.** Related works: classification of data warehouse and data lake solutions.

