**3. Architecture**

This section describes the architectures of data warehouses and data lakes in detail. Furthermore, data warehouse and data lake solutions are classified by function, and the classification is summarized in a table.

#### *3.1. Data Warehouse Architecture*

The data warehouse architecture contains historical and cumulative data from multiple sources. Basically, there are three kinds of architectures [55]:


The architecture of a data warehouse is shown in Figure 2. It consists of a central information repository surrounded by several key DW components that make the entire environment functional, manageable, and accessible.

**Figure 2.** Data warehouse architecture.


Moving data from the operational sources into the central repository is carried out via extract, transform, and load (ETL) tools [59]. This ETL process helps the data warehouse achieve enhanced system performance and business intelligence, timely access to data, and a high return on investment.


ETL anonymizes confidential and sensitive information as per regulatory stipulations before loading it into the target data store [60]. ETL also prevents unwanted data in operational databases from being loaded into DWs. ETL tools amend the data arriving from different sources and calculate summaries and derived data. Such tools generate background jobs, COBOL programs, shell scripts, etc., that regularly refresh the data in the data warehouse, and they also help with maintaining the metadata.
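As a concrete illustration, the extract–anonymize–load flow described above can be sketched as a minimal Python pipeline. The source rows, field names, and the in-memory SQLite target are hypothetical stand-ins for an operational database and a warehouse fact table:

```python
import hashlib
import sqlite3

# Hypothetical rows standing in for an operational-database extract.
SOURCE_ROWS = [
    {"order_id": 1, "email": "alice@example.com", "amount": 120.0, "status": "OK"},
    {"order_id": 2, "email": "bob@example.com", "amount": 75.5, "status": "TEST"},
    {"order_id": 3, "email": "carol@example.com", "amount": 200.0, "status": "OK"},
]

def extract():
    """Extract: pull raw rows from the operational source."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transform: anonymize sensitive fields and filter unwanted records."""
    out = []
    for row in rows:
        if row["status"] != "OK":  # drop records the warehouse should not load
            continue
        out.append({
            "order_id": row["order_id"],
            # A one-way hash stands in for anonymizing confidential data.
            "customer_key": hashlib.sha256(row["email"].encode()).hexdigest()[:12],
            "amount": row["amount"],
        })
    return out

def load(rows, conn):
    """Load: write the cleaned rows into the target warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INT, customer_key TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO fact_orders VALUES (:order_id, :customer_key, :amount)", rows
    )
    return conn.execute("SELECT COUNT(*), SUM(amount) FROM fact_orders").fetchone()

conn = sqlite3.connect(":memory:")
count, total = load(transform(extract()), conn)
print(count, total)  # 2 320.0 -- the TEST record was filtered out
```

The derived column `customer_key` illustrates how summaries and anonymized fields are computed during the transform step before anything reaches the target store.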

	- – **Query and reporting tools**: Such tools help organizations generate regular operational reports and support high-volume batch jobs such as printing and calculating. Some popular reporting tools are Brio, Oracle, Powersoft, and SAS Institute. Similarly, query tools shield end users from the pitfalls of SQL and database structure by inserting a meta-layer between the users and the database.
	- – **Application development tools**: In addition to the built-in graphical and analytical tools, application development tools are leveraged to satisfy the analytical needs of an organization.
	- – **Data mining tools**: These tools help automate the process of discovering meaningful new correlations and structures by mining large amounts of data.
	- – **OLAP tools**: Online analytical processing (OLAP) tools exploit the concepts of a multidimensional database and help analyze the data using complex multidimensional views [28,62]. There are two types of OLAP tools: multidimensional OLAP (MOLAP) and relational OLAP (ROLAP) [63]:
		- \* **MOLAP**: In such an OLAP tool, a cube is aggregated from the relational data source. Based on the user's report request, the MOLAP tool returns results promptly, since all the data are already pre-aggregated within the cube [64].
		- \* **ROLAP**: The ROLAP engine acts as a smart SQL generator. It comes with a "designer" component, wherein the administrator specifies how the attributes and hierarchies of the multidimensional model map onto the underlying relational database tables [65].
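The MOLAP/ROLAP distinction above can be illustrated with a small sketch (the fact rows, dimensions, and the table name `fact_sales` are invented for the example): MOLAP answers queries by lookup in a cube whose roll-ups are materialized in advance, while a ROLAP engine generates SQL and lets the relational database do the aggregation:

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, sales).
FACTS = [
    ("EU", "laptop", 10), ("EU", "phone", 7),
    ("US", "laptop", 5), ("US", "phone", 12),
]

# MOLAP style: pre-aggregate the cube once, then answer queries by lookup.
cube = defaultdict(int)
for region, product, sales in FACTS:
    for r in (region, "ALL"):
        for p in (product, "ALL"):
            cube[(r, p)] += sales  # every roll-up combination is materialized

def molap_query(region="ALL", product="ALL"):
    """Answer instantly from the pre-aggregated cube."""
    return cube[(region, product)]

# ROLAP style: act as a smart SQL generator and push work to the RDBMS.
def rolap_query(region=None, product=None):
    """Emit the SQL that the underlying relational database would execute."""
    where = []
    if region:
        where.append(f"region = '{region}'")
    if product:
        where.append(f"product = '{product}'")
    clause = (" WHERE " + " AND ".join(where)) if where else ""
    return f"SELECT SUM(sales) FROM fact_sales{clause}"

print(molap_query("EU"))         # 17, looked up in the cube
print(rolap_query(region="EU"))  # SQL text for the underlying database
```

The trade-off shown here is the classic one: MOLAP pays storage and load time up front for fast reads, while ROLAP defers all aggregation to query time on the relational store.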

#### *3.2. Data Lake Architecture*

The architecture of a business data lake is depicted in Figure 3. Although it is treated as a single repository, in most cases it can be decomposed into separate layers.

**Figure 3.** Data lake building blocks.



Many data lake implementations were originally based on Apache Hadoop, a widely popular big data platform especially suitable for batch-processing workloads [66]. It uses HDFS as its core storage and MapReduce (MR) as its basic computing model. Novel computing models are constantly proposed to cope with the increasing demand for batch-processing performance (e.g., Tez, Spark, and Presto) [67,68]. The MR model has also been replaced with the directed acyclic graph (DAG) model, which generalizes MR and improves the concurrency of the computation. The second phase of data lake evolution arrived with the *Lambda Architecture* [69,70], driven by constant changes in data processing capabilities and processing demand. It introduces stream computing engines such as Storm, Spark Streaming, and Flink [71]. In such a framework, batch processing is combined with stream computing to meet the needs of many emerging applications. A further phase is the *Kappa Architecture* [72], in which the two models of batch processing and stream computing are unified by improving stream computing concurrency and enlarging the time window of streaming data; stream computing alone is used, featuring an inherently scalable distributed architecture.
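The Lambda idea of combining batch processing with stream computing can be sketched in a few lines: a batch layer periodically recomputes an exact view over the immutable master dataset, a speed layer incrementally folds in events that arrived since the last batch run, and a serving layer merges the two. The event stream and key names below are hypothetical:

```python
# Hypothetical immutable master dataset: (event, count) pairs seen so far.
MASTER = [("click", 1), ("view", 1), ("click", 1)]

def batch_view(events):
    """Batch layer: periodically recompute an exact view over all history."""
    view = {}
    for key, n in events:
        view[key] = view.get(key, 0) + n
    return view

class SpeedLayer:
    """Speed layer: incrementally absorb events newer than the last batch run."""
    def __init__(self):
        self.delta = {}

    def ingest(self, key, n=1):
        self.delta[key] = self.delta.get(key, 0) + n

def serve(batch, speed):
    """Serving layer: merge the batch view with the real-time delta."""
    merged = dict(batch)
    for key, n in speed.delta.items():
        merged[key] = merged.get(key, 0) + n
    return merged

batch = batch_view(MASTER)
speed = SpeedLayer()
speed.ingest("click")  # streamed events not yet seen by the batch layer
speed.ingest("view")
print(serve(batch, speed))  # {'click': 3, 'view': 2}
```

A Kappa-style system, by contrast, would drop the separate batch layer and recompute views by replaying the event log through the same streaming code path.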
