4.1.2. Apache Spark

Spark was created after Hadoop and provides the developer with an interface centered on a data structure known as the "Resilient Distributed Dataset" (RDD): a collection of objects distributed across a set of compute nodes that provides efficient handling of hardware failures [82]. In addition, Spark can perform computations in shared memory, where access is significantly faster than on disk. This makes it possible to implement iterative algorithms that access the data multiple times in each iteration without a penalty in computational time, since in-memory data access is faster and "closer" to the processors of the compute nodes. Beyond the way it handles big data, Apache Spark offers the following key extensions [83]:


This feature is very important because in Hadoop, new data cannot be processed while a job is already running; the entire dataset must be available when a MapReduce process is started. The Java, Scala, and Python programming languages are supported.
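The RDD model described above (data split across partitions, transformations that build new datasets, and in-memory caching for iterative reuse) can be illustrated with a small toy sketch in plain Python; the class and method names below are illustrative and do not reproduce the PySpark API:

```python
# Toy illustration of the RDD concept: data split into partitions
# ("compute nodes"), transformations that build new datasets, and a
# cache() marker for in-memory reuse by iterative algorithms.
# This is a teaching sketch, not the PySpark API.
from functools import reduce as fold

class ToyRDD:
    def __init__(self, partitions):
        # Each inner list models the slice of data held by one node.
        self._partitions = [list(p) for p in partitions]
        self.cached = False

    def map(self, fn):
        # Transformation: produces a new dataset, leaving the original intact.
        return ToyRDD([[fn(x) for x in part] for part in self._partitions])

    def cache(self):
        # In Spark, cache() keeps the dataset in memory so that repeated
        # passes avoid re-reading from disk; here it is only a flag.
        self.cached = True
        return self

    def reduce(self, fn):
        # Action: combine values within each partition, then across partitions.
        partials = [fold(fn, part) for part in self._partitions if part]
        return fold(fn, partials)

# An "iterative" computation touching the cached dataset twice:
readings = ToyRDD([[120, 130], [125, 118]]).cache()  # e.g. systolic pressures
total = readings.reduce(lambda a, b: a + b)                   # pass 1
count = readings.map(lambda x: 1).reduce(lambda a, b: a + b)  # pass 2
average = total / count
print(average)  # 123.25
```

In real Spark, the per-partition work in `map` and `reduce` runs in parallel on the cluster; the point of the sketch is only the programming model of transformations, actions, and cached reuse.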


Figure 8 shows the components of Apache Spark.

**Figure 8.** Apache Spark components.

Apache Spark can collect data from a variety of health data sources. It can handle large amounts of structured, semi-structured, and unstructured healthcare data, such as electronic health records (EHRs), diagnostic images, and genetic information, and it can preprocess, clean, and transform these data into a format suitable for analysis. Spark's streaming and machine learning components, Spark Streaming and MLlib, can be employed to analyze, in real time, healthcare data produced by wearable health devices [84]. These data consist of crucial health metrics such as weight, blood pressure, respiratory rate, ECG, and blood glucose levels. By utilizing machine learning algorithms, the analysis can detect potential critical health conditions before they manifest. For example, the authors in [85] created a real-time health status prediction system for breast cancer using Spark Streaming and machine learning; the system applies machine learning models to streaming data to predict health status. In the same context, the authors in [86] proposed a heart disease monitoring system that utilizes the Spark framework for continuous, real-time monitoring and employs the random forest algorithm in MLlib to build a prediction model for heart disease.
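The streaming pattern described above (readings arriving continuously, grouped into micro-batches and scored by a model) can be sketched in plain Python. The micro-batching stands in for Spark Streaming, and the simple threshold rule stands in for a trained MLlib classifier such as random forest; the field name "systolic" and the 140 mmHg limit are illustrative assumptions:

```python
# Conceptual sketch of the streaming-analysis pattern: readings arrive as
# a stream, are grouped into micro-batches (as Spark Streaming does), and
# each batch is scored. The threshold rule is a stand-in for a trained
# MLlib classifier; field name and limit are illustrative assumptions.
from statistics import mean

def micro_batches(stream, size):
    """Group an (in principle unbounded) stream into fixed-size batches."""
    batch = []
    for reading in stream:
        batch.append(reading)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush any partial final batch
        yield batch

def score_batch(batch, systolic_limit=140):
    # Flag a batch whose average systolic pressure exceeds the limit.
    avg = mean(r["systolic"] for r in batch)
    return {"avg_systolic": avg, "alert": avg > systolic_limit}

stream = [{"systolic": v} for v in [118, 122, 150, 155, 160, 121]]
results = [score_batch(b) for b in micro_batches(stream, 2)]
alerts = [r for r in results if r["alert"]]
print(len(alerts))  # prints 2 (batch averages: 120, 152.5, 140.5)
```

In a real deployment, the generator would be replaced by a Spark Streaming source (e.g., a message queue fed by the wearable devices) and the scoring function by a model trained offline on historical data.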

In summary, Apache Spark's ability to process, analyze, and integrate large amounts of healthcare data, combined with its machine learning and real-time capabilities, makes it a valuable tool for addressing big data healthcare problems.

#### *4.2. The Use of NoSQL Databases*

Relational databases, structured on the basis of the SQL language, have for many years been the most popular data management method among organizations and technology professionals. With the advent of big medical data, characterized by both large size and diverse structure, there is a need to process data on a large scale in order to draw consistent conclusions [87]. SQL-based systems cannot provide a stand-alone solution to the problem of managing these data. The problem can be addressed with NoSQL databases, which offer dynamic data management, flexibility, and scalability beyond what relational databases provide. These characteristics make them ideal for managing large, non-homogeneous data that are frequently updated and whose field formats, in addition to the data itself, change frequently [82]. The main NoSQL database options are MongoDB, Neo4j, Couchbase, DynamoDB, HBase, and Cassandra. For healthcare companies, the use of MongoDB has dominated over the others. MongoDB is provided by 10Gen and can be effectively combined with formats such as JSON (JavaScript Object Notation) and XML. According to the company itself, MongoDB is flexible, easy to use, and offers high performance, availability, and automatic scaling. Among other important features, it can perform text searches and connect to Hadoop. According to the official solution website, some indicative examples of solutions provided by MongoDB in the healthcare industry include:
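The schema flexibility that makes document stores attractive for non-homogeneous health records can be illustrated with a small sketch: two patient documents carry different fields yet live in the same collection and can be queried uniformly. The collection layout and field names below are illustrative, not a real MongoDB schema, and `find()` is a minimal stand-in for a MongoDB query filter:

```python
# Sketch of why a document store suits heterogeneous health records:
# documents with different fields coexist in one collection. The layout
# and field names are illustrative assumptions, not a real schema.
import json

collection = [
    {"_id": 1, "name": "Patient A",
     "ehr": {"blood_type": "O+"},
     "imaging": ["mri_2023.dcm"]},       # this record carries imaging data
    {"_id": 2, "name": "Patient B",
     "ehr": {"blood_type": "A-"},
     "genomics": {"panel": "BRCA"}},     # this one carries genomic data instead
]

def find(coll, predicate):
    # Minimal stand-in for MongoDB's find(): filter documents by predicate.
    return [doc for doc in coll if predicate(doc)]

with_imaging = find(collection, lambda d: "imaging" in d)
print(json.dumps(with_imaging, indent=2))  # only Patient A matches
```

In a relational schema, adding the `genomics` fields would require altering tables or adding nullable columns; in a document store, each record simply carries the fields it has.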


#### *4.3. Commercial Platforms for Healthcare Data Analytics*

The large number of databases available in the field of drug development creates the need to identify priorities and methods for selecting appropriate information from a vast universe of big data. In this context, the Open PHACTS initiative implements semantic-web-based searching of the research questions posed in pharmaceutical research [88]. The Open PHACTS program has a clear impact in several ways. The most important contribution is the use of the system in scientific research: several scientific publications have resulted from its extensive use, as it allows data analyses that were previously very difficult to achieve. Many pharmaceutical companies have integrated their internal data into Open PHACTS so that they can easily query all the information they hold, whether public or private. Another contribution comes from the realization that large amounts of diverse semantic pharmaceutical data can be analyzed efficiently, thus improving data quality. The success of the Open PHACTS project has demonstrated the practicality of using such data in biomedical research. Indeed, the fact that providers have chosen to offer their data reinforces the value of the initiative and helps sustain the Open PHACTS system.

Another very interesting project is the Artemis project. This project uses mining techniques, patented by McGregor, designed to extract non-trivial and potentially meaningful abstract information from huge datasets in which the digital data are generated by monitoring devices [89]. The analysis system applies abstraction techniques to the input data to identify recurrent patterns. It then evaluates whether individuals with various health conditions, including infections, respiratory distress syndrome, and different forms of sleep apnea, exhibit similar data patterns in their normal state. The Artemis project leverages three medical connectivity systems, provided to clinical centers by Capsule Tech, ExcelMedical, and True Process, to continuously feed real-time data into a cloud-based database and analytics platform powered by IBM's InfoSphere and the DB2 relational database [90].

IBM Watson is a complex computer system capable of answering questions posed in natural language. Medical personnel express in natural language the problem they are facing, describing symptoms and other relevant factors; Watson then performs an analysis and compiles a list of possible causes [91]. The big data sources that Watson draws on include physician and nurse notes and records, patients' electronic medical records, clinical trials and research, scientific articles, and information provided by the patients themselves. Although it was developed and advertised as a diagnostic and treatment consultant, in practice Watson was primarily used for patients who had already been diagnosed with a disease, suggesting ways to treat it [92].

Table 4 presents a comparative analysis of various big data technologies applied in the context of healthcare.

The right technology for healthcare data analytics is determined by several factors, including the complexity and volume of the data, the required speed and scalability of the system, the available resources and expert knowledge, and the defined targets and use cases. In general, open-source tools such as Hadoop and Spark provide a cost-effective and flexible solution for handling huge and varied healthcare datasets, and they support a wide range of machine learning algorithms and techniques; they may, however, require more technical skill and maintenance effort than commercial tools. Commercial tools such as IBM Watson, Artemis, and Open PHACTS, on the other hand, often come with pre-built models and features that can accelerate the development and deployment of healthcare analytics applications, as well as more user-friendly interfaces and support services; they may, however, be more expensive and offer fewer customization options. When selecting a technology for healthcare data analytics, healthcare professionals should carefully evaluate their specific needs and constraints, along with factors such as data security, regulatory compliance, interoperability, and ethical considerations. It is also important to remember that technology selection is a continuous process that may require ongoing evaluation and optimization in response to changing needs and advances in the field.


**Table 4.** Comparative analysis of various big data technologies in healthcare.

#### **5. Technical and Organizational Challenges in Healthcare Big Data**

The challenges that arise when using big data analytics technology are numerous, and addressing them is particularly important if the effort is to be effective. The challenges are heterogeneous and diverse. The key points that a healthcare provider must consider in this context are as follows:

(a) Data repositories

Although, as already noted, the available health data are growing exponentially, the majority of them sit in isolated repositories: a phenomenon that has been called "data silos" [93]. These are essentially data repositories kept within an organization, or even within individual parts of an organization, that are not accessible to the outside world. The lack of a common spirit of collaboration between organizations, and internally between different departments, inevitably hinders data sharing. It is therefore up to the institution concerned to ensure that this risk is avoided by fostering the right spirit among employees, which is usually not a standard procedure.
