**6. Challenges**

This section addresses some of the key challenges in big data analytics. In addition, the implementation challenges encountered in the data warehouse and data lake paradigms are critically analyzed.

#### *6.1. Challenges in Big Data Analytics*

In the past few years, big data have accumulated in every walk of human life, including healthcare, retail, public administration, and research. Web-based applications frequently have to deal with big data, such as Internet text and documents (corpora), social network analysis, prediction markets, and Internet search indexing [86]. Although the potential and current advantages of big data are clear, there are also some inherent challenges that must be tackled to realize the full potential of big data analytics [87].

The first hurdle for big data analytics is **storage media and I/O speed** [88]. Storing big data imposes a financial overhead that is not affordable or profitable for many enterprises, and it also slows down processing [89]. For decades, analysts relied on hard disk drives, whose random I/O performance is much slower than their sequential I/O performance. To overcome this limitation, solid-state drives (SSDs) and phase-change memory were introduced. However, currently available storage technology still does not deliver the performance required to process big data and produce insights in a timely fashion. Companies therefore adopt various techniques to handle large data sets, such as compression (reducing the number of bits within the data), data tiering (storing data in several storage tiers), and deduplication (removing duplicate and unwanted data).
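
The following minimal Python sketch illustrates two of these techniques, deduplication and compression, on a toy batch of records; the record layout is a hypothetical example, and a production pipeline would of course operate on far larger batches.

```python
import hashlib
import json
import zlib

# Toy event records; duplicates are common in raw feeds (hypothetical data).
records = [
    {"user": "u1", "event": "click", "ts": "2022-09-25T10:00:00"},
    {"user": "u1", "event": "click", "ts": "2022-09-25T10:00:00"},  # duplicate
    {"user": "u2", "event": "view", "ts": "2022-09-25T10:01:00"},
]

# Deduplication: keep one copy of each record, identified by a content hash.
seen = set()
unique_records = []
for record in records:
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    if digest not in seen:
        seen.add(digest)
        unique_records.append(record)

# Compression: reduce the number of bits needed to store the cleaned batch.
raw = json.dumps(unique_records).encode()
compressed = zlib.compress(raw, level=9)

print(f"{len(records)} records -> {len(unique_records)} unique")
print(f"{len(raw)} bytes raw -> {len(compressed)} bytes compressed")
```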

Another challenge is **the lack of proper understanding of big data and the shortage of skilled professionals**. Due to insufficient understanding, organizations may fail in big data initiatives. This may be due to the absence of skilled data professionals, the lack of a transparent picture for employees, or improper usage of data repositories, among other reasons. Companies are strongly encouraged to conduct big data workshops and seminars so that every level of the organization acquires a basic understanding of data concepts. Furthermore, companies should invest in recruiting skilled professionals, providing training programs to staff, and purchasing data analytics solutions powered by advanced artificial intelligence or machine learning tools.

Yet another challenge in big data analytics is **confusion over suitable tool selection**. For instance, it is often unclear whether Hadoop or Spark is the better option for data analytics and storage. A wrong choice leads to poor decisions and inappropriate technology, wasting money, time, effort, and work hours. The best solution is to engage experienced professionals or data consultants to obtain recommendations for the tools that suit a company's particular scenario.

Data in a corporation come from various sources, such as customer logs, financial reports, social media platforms, e-mails, and reports created by employees. **Integrating data from such a wide spread of sources** is another challenging task [90]. This consolidation task, known as data integration, is crucial for business intelligence; hence, enterprises purchase dedicated tools for it. Talend Data Integration, IBM InfoSphere, Xplenty, Informatica PowerCenter, Microsoft SQL Server Integration Services, and QlikView are some of the popular data integration tools [91].
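
As a rough illustration of the consolidation step, the sketch below maps two hypothetical source layouts (a CRM export and a billing feed; all field names are assumptions) into one canonical schema before merging on a shared key.

```python
# Minimal data integration sketch: adapt heterogeneous source rows to one
# canonical schema, then merge them on a shared customer key.
crm_rows = [{"CustID": 7, "Name": "Acme Corp", "Spend": "1200.50"}]
billing_rows = [{"customer_id": 7, "customer_name": "ACME CORP", "total": 1200.5}]

def from_crm(row: dict) -> dict:
    return {"customer_id": int(row["CustID"]),
            "name": row["Name"].strip().title(),
            "spend": float(row["Spend"])}

def from_billing(row: dict) -> dict:
    return {"customer_id": int(row["customer_id"]),
            "name": row["customer_name"].strip().title(),
            "spend": float(row["total"])}

# Run each source through its adapter; the first record per key wins here,
# but real pipelines would apply survivorship rules instead.
unified: dict[int, dict] = {}
for adapter, rows in [(from_crm, crm_rows), (from_billing, billing_rows)]:
    for row in rows:
        rec = adapter(row)
        unified.setdefault(rec["customer_id"], rec)

print(list(unified.values()))
```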

**Security of huge sets of data**, especially ones that involve confidential details of customers, is one of the inevitable challenges in big data analytics [92,93]. Careless treatment of data repositories may invite malicious hackers, and a stolen record or a data breach can cost millions. The remedy is to build up a company's cybersecurity division and to implement various security measures such as data encryption, data segregation, identity and access control, endpoint security, real-time security monitoring, and big data security tools (e.g., IBM Guardium).
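
As a minimal sketch of field-level encryption, the snippet below uses the open-source Python `cryptography` package (Fernet symmetric encryption) to protect a confidential attribute before it is stored. The record layout is a hypothetical example, and real deployments would manage keys through a key management service rather than in code.

```python
from cryptography.fernet import Fernet  # pip install cryptography

# Key handling is deliberately simplified; production systems would fetch
# the key from a KMS or hardware security module.
key = Fernet.generate_key()
cipher = Fernet(key)

record = {"customer_id": 7, "ssn": "123-45-6789"}  # toy sensitive record

# Encrypt the sensitive field before it reaches the data repository.
record["ssn"] = cipher.encrypt(record["ssn"].encode())

# Only an authorized consumer holding the key can recover the value.
plaintext = cipher.decrypt(record["ssn"]).decode()
print(plaintext)
```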

#### *6.2. Data Warehouse Implementation Challenges*

Implementation of a data warehouse requires careful planning and methodical execution. Some of the major challenges that arise with data warehousing concern its design, construction, and implementation [94,95].

The efficiency and operation of a warehouse **depend on the data** that support it. With incorrect or redundant data, warehouse managers cannot accurately measure costs. A key solution is to automate data quality checks so that, for example, the sales team receives complete, correct, and consistent lead information. A closely related concern is the **quality control of data (i.e., the quality and consistency of data)** [96]. The business intelligence process can be fine-tuned by incorporating the flexibility to accept and integrate new analytics as well as to update the warehouse's schema as requirements evolve.
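
A minimal sketch of such an automated quality gate for incoming lead records is shown below; the required fields and validation rules are illustrative assumptions.

```python
import re

# Illustrative rules: which fields a lead must carry and what "correct" means.
REQUIRED_FIELDS = ("name", "email", "company")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def validate_lead(lead: dict) -> list[str]:
    """Return a list of quality issues; an empty list means the lead passes."""
    issues = [f"missing {f}" for f in REQUIRED_FIELDS if not lead.get(f)]
    if lead.get("email") and not EMAIL_RE.match(lead["email"]):
        issues.append("malformed email")
    return issues

leads = [
    {"name": "Ada", "email": "ada@example.com", "company": "Acme"},
    {"name": "", "email": "not-an-email", "company": "Acme"},
]

# Route clean leads to the sales team; quarantine the rest with their issues.
clean = [l for l in leads if not validate_lead(l)]
quarantined = [(l, validate_lead(l)) for l in leads if validate_lead(l)]
print(f"{len(clean)} clean, {len(quarantined)} quarantined")
```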

Another major challenge is **differences in naming, domain definitions, and identification numbers across heterogeneous sources**. The data warehouse has to be designed in such a way that it can accommodate the addition and removal of data sources and the evolution of the sources and source data without major redesign. Yet another challenge is **customizing the available source data to the data model of the warehouse**, because the capabilities of a DW may change over time with changes in technology [97]. Further, **broader skills** are required for the administration of data warehouses than for traditional database administration. Hence, managing the data warehouse in a large organization, designing the management function, and selecting the management team are important aspects of a data warehouse.

**Data security** is another critical requirement in DWs, given that business data are extremely sensitive and can be easily obtained [98]. Unfortunately, the typical security paradigm—based on tables, rows, and attributes—is ill suited to DWs. Instead, the security model should be tightly integrated with the multidimensional model and focused on its key notions, such as facts, dimensions, and measures. Furthermore, as is frequently advised in software engineering, **information security** should be considered at all stages of the development process, from requirements analysis to implementation and maintenance. In addition, **data warehouse governance** is yet another important consideration, which includes approval of the data modeling and metadata standards, the design of a data access policy, and a data backup strategy [99].
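
A toy sketch of what security anchored to multidimensional notions might look like is given below: access is granted per fact, measure, and dimension slice rather than per table. All role names, policies, and data are hypothetical.

```python
# Access policy expressed over multidimensional concepts (facts, measures,
# and a region dimension) instead of raw tables. Entirely illustrative.
ROLE_POLICY = {
    "analyst": {"facts": {"sales"}, "measures": {"units"},
                "regions": {"EU"}},
    "cfo":     {"facts": {"sales"}, "measures": {"units", "revenue"},
                "regions": {"EU", "US"}},
}

FACT_SALES = [
    {"region": "EU", "units": 10, "revenue": 1000.0},
    {"region": "US", "units": 7, "revenue": 900.0},
]

def query_fact(role: str, fact: str, measure: str) -> float:
    policy = ROLE_POLICY[role]
    if fact not in policy["facts"] or measure not in policy["measures"]:
        raise PermissionError(f"{role} may not read {fact}.{measure}")
    # Row-level restriction along the region dimension.
    return sum(r[measure] for r in FACT_SALES if r["region"] in policy["regions"])

print(query_fact("cfo", "sales", "revenue"))    # 1900.0 (both regions)
print(query_fact("analyst", "sales", "units"))  # 10 (EU slice only)
```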

#### *6.3. Data Lake Implementation Challenges*

The data lake is a relatively novel technology and has not yet matured. Hence, there are many challenges in its implementation, including many of the same challenges that early data warehouses confronted [75,100]. The first challenge is the **high cost of data lakes**: they are expensive to implement and maintain. Data lake platforms that exploit the cloud may be easier to deploy, but they may also come with high fees. Some platforms, such as Hadoop, are open source and hence free of cost; nevertheless, their implementation and management may take more time and more expert staff. **Management difficulty** is another issue [75]. Managing a DL involves various complex tasks, such as ensuring that the host infrastructure can cope with the growth of the DL and dealing with data redundancy and data security, which challenges even skilled engineers. Furthermore, setting up and managing data lakes requires domain experts and engineers with real expertise, and there is currently a shortage of both data scientists and data engineers in the field. This **lack of skills** is yet another challenge.

Another aspect for consideration is the **long time to value** (i.e., it takes years for a data lake to become full-fledged and well integrated with workflow and analytics tools so as to impart real value to the enterprise) [101]. As in the case of data warehouses, **data security** is also a major concern for DLs. Special security measures must be considered to enforce data governance rules and to secure the data in the DL, with the help of cybersecurity specialists and security tools. Another critical challenge concerns **computation resources and the growth of computing power**: data are growing faster than computing power, and existing computers are not well equipped to host and manage them at this scale. Open-source data platforms face many of the same core problems surrounding data lakes, and they are costly to manage, demanding massive computing power as well as scarce skills.

To build a better data lake, the way businesses build and manage data lakes must be modernized. One key takeaway is to take full **advantage of the cloud**, as opposed to building cumbersome data lakes on tailor-made infrastructure [102]. This helps to get rid of data silos and to build data lakes that are applicable to various use cases, rather than fitting them to only a narrow range of needs.

#### **7. Opportunities and Future Directions**

Based on our survey, this section discusses novel trends in modern enterprise data management and points out some promising directions for future research.

#### *7.1. Data Warehouses: Opportunities and Future Directions*

The business management landscape has witnessed a massive change with the emergence of the data warehouse. The **advancements in cloud technology, the Internet of Things, and big data analytics** have brought effective data solutions to modern data warehouses [77,103]. With the rapid evolution of technology, many enterprises have migrated their data to the cloud to expand their networks and markets. **Cloud data warehouses** help to avoid the huge costs of purchasing, infrastructure, installation, etc. [104]. Hence, in the coming years, more sophisticated cloud DW technology is envisaged, enabling powerful, easy-to-use, and economical data clouds. The long-term gains from the adoption of cloud warehousing are mainly data availability and scalability. The flexibility to store a variety of data formats—not just relational—combined with the intrinsic flexibility of cloud-based services enables a very broad distribution of cloud services.

Another massive change lies in the means of **data analytics**. In contrast to earlier times, when data analytics and business intelligence were carried out in two different divisions, which reduced the overall efficiency of the system, the modern data warehouse provides an advanced structure for storage and faster data flow, making data easily accessible to business users. Such an agility model is powered by data fragmentation, allowing access to and analysis of data across the enterprise in real time.

Another big advancement is in the **Internet of Things (IoT)** platforms for sharing and storing data. These have changed the face of data streaming by enabling users to store and access data across multiple devices. The concept of the IoT is increasingly pertinent to the real world due to the growing popularity of mobile devices, embedded and ubiquitous communication technologies, cloud computing, and data analytics. In a broader sense, as with the Internet, the IoT enables devices to exist in many places and facilitates applications ranging from the trivial to the most crucial. Several technologies, such as computational intelligence and big data, can be combined with the IoT to improve data management and knowledge discovery on a large scale; much research in this direction has been carried out by Mishra et al. [105].

In summary, the future of data warehouses comprises features that enable the following:

- cloud-based deployment that avoids heavy infrastructure costs while providing data availability and scalability;
- agile, real-time analytics whose results are directly accessible to business users; and
- large-scale data management and knowledge discovery over IoT data streams.

#### *7.2. Data Lakes: Opportunities and Future Directions*

One of the core capabilities of a data lake architecture is its ability to quickly and easily ingest multiple types of data (e.g., real-time streaming data from on-premises storage platforms, structured data generated and processed by mainframes and data warehouses, and unstructured or semi-structured data). The ingestion process demands a high degree of parallelism and low latency, since it interfaces with external data sources over limited bandwidth; hence, ingestion does not carry out any deep analysis of the downloaded data. However, there are possibilities for **applying shallow data sketches on the downloaded contents and their metadata** to maintain a basic organization of the ingested data sets.
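
As a rough illustration, the following Python sketch computes a cheap, shallow profile at ingestion time: per-column type guesses plus a crude approximate distinct count based on truncated hashes. The sampling scheme and column names are illustrative assumptions, not a prescribed design.

```python
import hashlib

def shallow_sketch(rows: list[dict], sample_bits: int = 12) -> dict:
    """Cheap per-column profile: observed types and a rough distinct count.

    Truncating each value hash to `sample_bits` bits keeps memory bounded;
    the resulting count underestimates very high cardinalities, which is
    acceptable for a shallow, ingestion-time sketch.
    """
    sketch: dict[str, dict] = {}
    for row in rows:
        for col, val in row.items():
            s = sketch.setdefault(col, {"types": set(), "hashes": set()})
            s["types"].add(type(val).__name__)
            h = int(hashlib.md5(str(val).encode()).hexdigest(), 16)
            s["hashes"].add(h & ((1 << sample_bits) - 1))
    return {
        col: {"types": sorted(s["types"]), "approx_distinct": len(s["hashes"])}
        for col, s in sketch.items()
    }

rows = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Oslo"}]
print(shallow_sketch(rows))
```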

In another phase of data lake management (i.e., **the data extraction stage**), the raw data are transformed into a predetermined data model. Although various studies have been conducted on this topic, there remains room for improvement. Rather than conducting extraction on one file at a time, one can take advantage of the knowledge gained from the history of extractions. Similarly, in the cleaning phase of the data lake, not much work has been performed in the literature beyond a few approaches such as CLAMS [49]. One opportunity in this regard is to make use of the lake's wisdom and perform collective data cleaning. In addition, it is important to investigate the possible sources of errors in the lake and to eliminate them efficiently to obtain a clean data lake.
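
One way to exploit the history of extractions, sketched below under simple assumptions, is to cache the mapping derived for a given schema "fingerprint" and replay it for later files with the same fingerprint; `derive_mapping` is a hypothetical placeholder for the expensive schema-matching step.

```python
# Cache of extraction mappings, keyed by a schema fingerprint.
extraction_cache: dict[tuple, dict] = {}

def fingerprint(header: list[str]) -> tuple:
    """Order- and case-insensitive signature of a file's column header."""
    return tuple(sorted(h.strip().lower() for h in header))

def derive_mapping(header: list[str]) -> dict:
    """Placeholder: map source columns to target-model attribute names."""
    return {h: h.strip().lower().replace(" ", "_") for h in header}

def get_mapping(header: list[str]) -> dict:
    fp = fingerprint(header)
    if fp not in extraction_cache:            # expensive step runs once...
        extraction_cache[fp] = derive_mapping(header)
    return extraction_cache[fp]               # ...and is reused afterwards

print(get_mapping(["Customer ID", "City"]))
print(get_mapping(["City", "Customer ID"]))   # cache hit: same fingerprint
```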

The common methods to retrieve data from a data lake are query-based retrieval (a user starts a search with a query describing the data of interest) and data-based retrieval (a user navigates the data lake as a linkage graph or a hierarchical structure to find data of interest) [75]. A new direction may be to incorporate **analysis-driven or context-driven** approaches (i.e., augmenting a data set with relevant data and some contextual information to facilitate learning tasks).
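
Data-based retrieval can be pictured as a walk over a linkage graph of related data sets. The sketch below shows a minimal breadth-first navigation; the graph, table names, and hop limit are hypothetical.

```python
from collections import deque

# Toy linkage graph: edges connect joinable or otherwise related data sets.
linkage = {
    "orders": ["customers", "products"],
    "customers": ["regions"],
    "products": [],
    "regions": [],
}

def discover(start: str, max_hops: int = 2) -> list[str]:
    """Breadth-first walk returning data sets within max_hops of start."""
    found: list[str] = []
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        found.append(node)
        if depth < max_hops:
            for nbr in linkage.get(node, []):
                if nbr not in seen:
                    seen.add(nbr)
                    frontier.append((nbr, depth + 1))
    return found

print(discover("orders"))  # ['orders', 'customers', 'products', 'regions']
```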

Another direction of research is the exploration of **machine learning in data lakes**. Specifically, many studies are underway focusing on applying ML to data set organization and discovery. The data set discovery task is often associated with finding "similar" attributes extracted from the data, metadata, etc., which can be further coupled with classification or clustering tasks. Some recent works have leveraged ML techniques such as the KNN classifier [106] and a logistic regression model for optimizing feature coefficients [107]. More advanced deep learning and similarly sophisticated ML techniques are envisaged to augment the data set discovery process in the coming years.
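
A minimal sketch of this idea, in the spirit of the KNN-based approach in [106], is shown below: each column is summarized by a small profile vector and classified by a k-nearest-neighbor model. The features, labels, and values are toy assumptions, not the features used in the cited work.

```python
from sklearn.neighbors import KNeighborsClassifier  # pip install scikit-learn

# Profile vector per column: (mean value length, numeric fraction, distinct ratio).
X = [
    [4.0, 1.0, 0.90],   # id-like column
    [3.9, 1.0, 0.95],   # id-like column
    [12.0, 0.0, 0.40],  # name-like column
    [11.5, 0.0, 0.35],  # name-like column
]
y = ["identifier", "identifier", "person_name", "person_name"]

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# An unseen column's profile is classified to suggest "similar" attributes.
print(clf.predict([[4.2, 1.0, 0.88]]))  # -> ['identifier']
```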

**Metadata management** is an important task in a data lake, since a DL does not come with descriptive data catalogs [75,77]. Due to the lack of such explicit metadata, especially during the discovery and cleaning of data, there is a chance for a data lake to become a data swamp. Hence, it is necessary to extract meaningful metadata from data sources and to support efficient storage and query answering over metadata. In this field, there remain many topics to explore, such as extracting knowledge from lake data and incorporating it into existing knowledge bases. Yet another key aspect is **data versioning**, wherein new versions of already existing files enter a dynamic data lake [77]. Since versioning-related operations can affect all stages of a data lake, this is a crucial aspect to address. There are some large-scale data set version control tools, such as DataHub (https://datahubproject.io, accessed on 25 September 2022), that provide a git-like interface to handle version creation, branching, and merging operations. Nevertheless, more research and development may be carried out to deal with schema evolution.
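
To make the versioning operations concrete, the sketch below implements a tiny git-like version store in which each commit records a content hash and its parent; this is an illustrative toy, not DataHub's actual interface.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Commit:
    content_hash: str   # hash of the data set contents at this version
    parent: str | None  # previous version on the same lineage
    message: str

class VersionStore:
    def __init__(self) -> None:
        self.commits: dict[str, Commit] = {}
        self.branches: dict[str, str | None] = {"main": None}

    def commit(self, branch: str, data: bytes, message: str) -> str:
        parent = self.branches[branch]
        commit_id = hashlib.sha1(data + (parent or "").encode()).hexdigest()
        self.commits[commit_id] = Commit(
            hashlib.sha1(data).hexdigest(), parent, message)
        self.branches[branch] = commit_id  # advance the branch head
        return commit_id

    def branch(self, name: str, from_branch: str) -> None:
        self.branches[name] = self.branches[from_branch]

store = VersionStore()
v1 = store.commit("main", b"id,city\n1,Oslo\n", "initial load")
store.branch("experiment", "main")  # fork from main's current head
v2 = store.commit("experiment", b"id,city\n1,Oslo\n2,Bergen\n", "add a row")
```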

As a final note, there is an emerging data management architecture trend called the *data lakehouse* that couples the flexibility of a data lake with the data management capabilities of a data warehouse. Specifically, it is considered a single data storage solution for all data—unstructured, semi-structured, and structured—while providing the data quality and data governance standards of a data warehouse [108]. Such a data lakehouse would be capable of imparting better data governance, reduced data movement and redundancy, more efficient use of time, etc., even with a simplified schema. The *data lakehouse* is envisaged to be an excellent research area of data management in the future.
