Veracity

Veracity denotes the accuracy and reliability of the information, covering the correctness, quality, and governance of the data. This attribute depends entirely on the data source.

#### *1.4. Growth of BD*

There has recently been a huge shift in the volume and speed of data generation, beyond what humans can readily comprehend. In 2013, the total volume of data in the world was estimated at 4.4 Zettabytes. This volume experienced an enormous increase, up to 44 Zettabytes by 2020. Although technology has steadily advanced, it is still not easy to analyse data on this scale. The demand for analysis of large data sets has paved the way for the rise of BD over the past decade. Data analytics, data analysis, and BD originate from the long-standing database management domain, which relies on the extraction, storage, and optimisation methods usually applied to data stored in RDBMSs. Since the early 21st century, the internet has offered unique data analysis and data collection opportunities [17]. With the expansion of web traffic and online stores, companies such as Amazon, Yahoo, and eBay began analysing customer behaviour by investigating click rates and tracking customer locations through IP addresses, revealing a new world of possibilities. In addition, HTTP-based web traffic has increased the volume of unstructured and semi-structured data. To analyse these data efficiently, organisations require new approaches and new solutions to storage issues.
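The two volume estimates quoted above imply a steady growth rate, which can be made explicit with a short calculation. The sketch below is illustrative only: the function name is our own, and it assumes constant compound growth between 2013 and 2020, which the cited estimates do not claim.

```python
import math


def compound_annual_growth_rate(start: float, end: float, years: int) -> float:
    """Return the constant yearly rate that grows `start` into `end` over `years`."""
    if start <= 0 or end <= 0 or years <= 0:
        raise ValueError("start, end, and years must all be positive")
    return (end / start) ** (1 / years) - 1


# Figures quoted in the text: 4.4 ZB in 2013 and 44 ZB in 2020.
rate = compound_annual_growth_rate(4.4, 44.0, 2020 - 2013)

# Under the same constant-growth assumption, the time needed to double the volume.
doubling_time_years = math.log(2) / math.log(1 + rate)

print(f"Implied growth rate: {rate:.1%} per year")
print(f"Implied doubling time: {doubling_time_years:.1f} years")
```

Under this assumption, a tenfold increase over seven years corresponds to growth of roughly 39% per year, that is, a doubling of the global data volume about every two years.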

#### *1.5. Development of BD in the Medical Sector*

There has been a constant increase in the demand for efficient analytical tools, a trend that is also evident in the analysis of large volumes of data. Organisations and institutions are searching for approaches to make use of the power of BD to enhance their competitive advantage, decision making, or business performance. BD provides potential solutions for private and public organisations; however, regarding outcomes, the practical employment of BD in various kinds of organisations requires domain-level adoption and re-structuring. Specifically, the healthcare industry has started shifting from a disease-centred to a patient-centred model, which is applicable in value-based healthcare delivery systems. To meet the demand and provide efficient patient-centred care, it is important to address and investigate a large amount of data from the healthcare sector. Many issues arise when healthcare data are considered. Healthcare has always produced large amounts of data. In addition, the introduction of electronic medical records, the large amount of data collected by different sensors, and the data that patients generate through social media have created many data streams. The appropriate use of such data can enable healthcare organisations to support clinical decision-making, public health management, and disease surveillance [18].

Classification provides a significant approach for bringing intelligence to medical data. Due to its simplicity, the *k*NN classification algorithm has been widely employed in several sectors. However, when the sample size is large and the number of attributes is high, the effectiveness of the *k*NN algorithm is reduced. A study [19] has proposed a novel *k*NN algorithm and compared it with other existing *k*NN algorithms. In particular, the classification was made in the query instance neighbourhood of the existing *k*NN classifiers, and weights were allocated to each class. The recommended algorithm considered the class distribution around the query instance, in order to ensure that the assigned weights do not impact the outliers. The results of the considered study revealed that the recommended algorithm could improve the efficiency of *k*NN classification when processing large data sets while maintaining the classification accuracy of the *k*NN algorithm. However, the considered study only researched single-class classification while, in terms of application, multi-class classification is more popular and necessary. Additionally, healthcare data typically have a high missing rate, and these missing fields have been shown in existing works to greatly impact the classification results.
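To make the idea of weighting neighbours concrete, the following is a minimal sketch of a generic distance-weighted *k*NN classifier. It is not the algorithm proposed in [19], whose details are not given here; the function name, the inverse-distance weighting, and the toy data are all assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train, query, k=3):
    """Classify `query` by an inverse-distance-weighted vote of its k nearest neighbours.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Sort all training points by Euclidean distance and keep the k closest.
    neighbours = sorted(
        ((math.dist(features, query), label) for features, label in train),
        key=lambda pair: pair[0],
    )[:k]

    # Closer neighbours contribute larger weights to their class.
    votes = defaultdict(float)
    for distance, label in neighbours:
        votes[label] += 1.0 / (distance + 1e-9)  # small constant avoids division by zero

    return max(votes, key=votes.get)


# Hypothetical two-feature records with made-up labels.
train = [
    ((1.0, 1.0), "low_risk"),
    ((1.2, 0.8), "low_risk"),
    ((4.0, 4.2), "high_risk"),
    ((4.1, 3.9), "high_risk"),
    ((3.8, 4.0), "high_risk"),
]

print(weighted_knn_predict(train, (1.1, 1.0), k=3))
print(weighted_knn_predict(train, (3.9, 4.0), k=3))
```

This sketch scans every training point for each query, which illustrates why plain *k*NN becomes costly when both the sample size and the number of attributes are large.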

Similarly, another study [20] has developed a BD analytics-enabled transformation system based on the practice-based view. This revealed the causal relationships among BD capabilities, benefit dimensions, IT-enabled business values, and transformation practices. This model was then validated in a medical setting, offering a strategic view of BD analysis. Three vital paths for value chains were detected for medical organisations, through implementing a model that offers practical insights for managers. This study revealed the important elements and links for understanding the transformation of BD. One major limitation of the considered study was the data source. Additionally, better validation could have been performed by collecting and investigating primary data.

To date, the healthcare sector has not completely utilised the potential of BD. While the constantly developing academic research on the concepts of BD analytics has been technically oriented, there is an increasing demand for understanding the strategic implications of BD. To address this gap, the study [21] investigated the historical development, component functionalities, and architectural design of BD analytics. The authors identified five BD analytical capabilities from 26 BD implementations: unstructured data analytical capability, analytical capability for patterns of care, decision support capability, traceability, and predictive capability. The main limitation of the considered study was that IT adoption in healthcare usually lags behind other sectors, which is one of the main reasons why such cases are difficult to find. Although many cases were found from various sources, the majority came from vendors.

Wearable medical devices with sensors continuously generate a large amount of data, which can be considered as BD, in the form of unstructured and semi-structured data. Due to the complexity of the data, it is not easy to extract valuable information that could help in decision making. In addition, data security is another major requirement of BD in the healthcare sector. To address these issues, a previous study [22] has recommended a novel architecture for implementing IoT, in order to accumulate and process scalable sensor data for healthcare applications. The recommended architecture consists of two main frameworks: the grouping and choosing (GC) framework and the meta-fog redirection (MF-R) framework. The MF-R framework employs BD technologies such as Apache HBase and Apache Pig to collect and store the sensor data produced by various sensor devices.

On the other hand, various security frameworks have been studied in the attempt to build models that combine multi-variate and non-stationary data. The obtained models utilize a log-normal distribution for the margins with linear trends and peak series [23].

#### **2. Applications of BD**

This section briefly reviews the body of related work that is available and indexed by reliable databases such as SCOPUS and WoS. The keywords used were under the subject categories of 'BD', 'pharmacology', 'toxicology' and 'pharmaceutics'.

The employment of BD for safety management in various areas, such as traffic safety [24], public safety [25], food safety [26], and patient safety [27], has recently been extensively studied. In addition, the influence of BD on drug discovery and design has been explored, in terms of future developments of medicine [6]. The core points in the discussion were the challenges that arise while implementing BD technologies, preserving the quality and privacy of data sets, and how the industry should adapt to welcome the BD era. It was concluded that, while BD has a significant impact on the advancement of pharmaceutical science, there are still many challenges to overcome.

The perspective of BD analytics in adapted medicine, focusing on how it could improve patient care, has been discussed in [25]. It was emphasised that the advancements in information technology have made this possible, but challenges remain to be addressed. BD analytics provide the potential to improve patient care, but more research is required to make this a reality.

The author in [26] has noted that current in vitro toxicity data could be used to develop models and tools to help in chemical toxicity research. The core points of the discussion were that the data are rich in information that can be used to evaluate complex bioactivities, and that a BD approach is necessary for relevant processing. It was found that the data are valuable for chemical toxicity research, but more tools need to be developed to help researchers use it. The pharmaceutical industry is facing a challenge in terms of productivity, in light of which BD initiatives may provide the insights needed to turn the industry around [28].

BD and translational medicine have evolved, and disruptive technology is bringing them together. The evolution of BD and translational medicine has been discussed, as well as the hindrances in applying BD techniques to translational medicine and the future of translational medicine. The author concluded that the future of translational medicine is bright and that the "Complete Health Record" concept will revolutionise the way in which translational medicine is practised [29].

BD is essential in safety sciences [27], and can be used to find similar substances and clusters of properties. Moreover, the need for safety BD [30] has been rapidly growing with constant development, and integration with science and technology has added more life to safety science research [31].

The author in [27] has sought to better understand the interactions between BD and Dynamic Simulation Modelling (DSM), as well as how incorporating them could be useful to healthcare decision-makers. The core points in the discussion were the benefits of BD and DSM, and how they can be used together to improve healthcare delivery. Integrated BD and DSM offer complementary value in healthcare, in terms of addressing complex, systemic health economics and outcomes questions.

#### *2.1. BD in Toxicology*

The rate of data generation associated with toxicology continues to multiply, and the volume of data generated has been growing drastically. This is due to advancements in software solutions and cheminformatics methods, which increase the accessibility of open resources such as biological, chemical, and toxicology data. Thus, the need for BD analytics to store and access the data associated with the toxicology domain has surged. In this regard, a study [32] has proposed a machine learning method for raw HRMS-DIA (high-resolution mass spectrometry, data-independent acquisition) data. The authors evaluated the machine learning model by training, validating, and testing on sets of solvents and blood samples containing drugs commonly encountered in forensic toxicology, with the aim of categorical prediction using a feed-forward neural network framework. With this approach, the specificity and sensitivity obtained on the validation and test sets for the predicted sample classes were observed to be in a suitable range for routine use in the laboratory. The study clearly emphasised the efficacy of employing BD along with machine learning algorithms.
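Sensitivity and specificity, the two measures reported for the classifier above, follow directly from the counts of a binary confusion matrix. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not taken from [32].

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Return (sensitivity, specificity) from binary confusion-matrix counts.

    sensitivity = TP / (TP + FN): share of true positives that are detected.
    specificity = TN / (TN + FP): share of true negatives that are correctly rejected.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for a drug-positive versus drug-negative sample classifier.
sensitivity, specificity = sensitivity_specificity(tp=45, fn=5, tn=90, fp=10)
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```

For a classifier with more than two sample classes, these quantities are typically computed per class by treating each class in turn as the positive one.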

Probabilistic topic modelling has been used to analyse large-scale genomic data to uncover hidden patterns; in particular, this method was used to analyse a toxic genomic data set, and it was found that patterns related to the impact of doses and time points of treatment could be identified. The authors concluded that this method can reduce animal use in research [33].

A better understanding of how BD helps to delineate personalised approaches in severe mental illness and the provision of a quantitative synthesis of BD approaches for metabolomics in severe mental illness is necessary [34]. Notably, BD has the potential to improve our understanding of the developmental trajectories of mental disorders.

Existing research has reviewed the use of BD in clinical studies conducted from the perspectives of neurology, tumours, cardiovascular disease, psychiatric diseases, and other applications [35]. This research has emphasised the advantages of BD, in that it enables the study of diseases at the genetic level, thus offering more valuable treatments than traditional or usual treatments, as well as providing the ability to discover the evolution trajectory of humans. BD has a positive impact on medical studies, and its growth continues.

#### *2.2. BD in Pharmacology*

The implementation of BD in precision medicine has been welcomed. Pharmacogenomics, that is, the study of the effect of genes on a person's reaction to certain drugs, is within the realm of precision medicine. This new area combines pharmacology and genomics to develop effective and safe drugs and doses that are tailored to variations in individual genes. Precision medicine has a relatively limited role in daily care; however, researchers expect that this approach will encompass many healthcare sectors in the coming years. In addition, BD has the potential to facilitate personalised precision medicine [27].

Various analysis techniques [29] and tools are being implemented for genetic/genomic discovery in pharmacogenomics. However, the BD-related issues faced by pharmacogenomics need to be addressed, in order to maximise the potential in the field. Compared with applications in IT fields, such as social network analysis, the data sets used for drug discovery research are relatively small. However, with the development of combinatorial chemistry synthesis, HTS techniques, and genomics/genetics knowledge, the databases for drugs and drug candidates are growing rapidly.

New modelling approaches are needed to handle these larger data sets [36]. The existing research [37] has attempted to investigate the feasibility of BD analysis on 3290 approved drugs and formulations, for which 1,637,499 adverse events have been recorded in both human and animal species for approximately 70 years. A BD technique was utilized in this study, which is known to be a powerful analytic approach. However, although a combined text mining and statistical method proved feasible in principle, it also revealed numerous pitfalls, such as inadequate arrangement of pre-clinical ontologies and insufficiency of controlled vocabulary.

A study [27] has attempted to identify the factors associated with the success of hypertension drug treatment, using BD approaches along with machine learning methods. The results indicated that proton-pump inhibitors (PPIs) and hydroxymethylglutaryl coenzyme A (HMG-CoA) reductase inhibitors could significantly enhance the success rate of hypertension treatment. In addition, new machine learning methodologies with BD have helped in identifying promising anti-hypertension therapies by repurposing available medications for new indications.

In a previous study [38], the author has attempted to determine standard methods that would likely help to increase the usability of (publicly available or privately produced) biological data. It was identified that data integrity is significant during pre-clinical drug development, and that investigators should use consistent methods to exploit the functions of privately and publicly created biological data. The author also emphasised that the increasing interest in and the interpretation of cross-platform approaches is significant.

BD can be used in paediatric drug development [31]. It has already been applied to the design and efficiency of clinical trials and to the safety of trial data. The current opportunities and challenges of BD for future paediatric drug development therefore merit further exploration. Although BD has the potential to play a significant role in paediatric drug development, there are still many challenges that need to be addressed.

BD can be used in drug research to determine efficacy and safety signals [39]. The steps involve data acquisition, extraction, aggregation, analysis, modelling, and interpretation. BD can leverage and improve clinical decisions at the point of care, uncovering or validating drug efficacy and safety.

The study [38] has considered the steps of pharmacogenomics studies, from data collection to interpretation, and highlighted the bioinformatics aspects that can pose problems. The major challenges of data processing and analysis can lead to inaccurate results. Therefore, paying careful attention to these steps is important, in order to avoid mistakes and produce accurate pharmacogenomics studies.

The authors in [40] have discussed the discovery of novel bromodomain BRD4 binders. It was inferred that public databases are useful for predictive model building, and that machine learning can allow for the extraction of real knowledge, despite the noise present in structure-activity data. Therefore, public databases are key assets in drug discovery, and machine learning plays a significant role in mining real data.

BD has changed the field of drug development [41]. Novel methods for therapeutic drug discovery, inference of clinical toxicity, candidate drug prioritisation, and machine learning techniques for drug discovery are becoming familiar. Experts from various platforms should conduct closer collaborations to translate the analysis results for treatment and prognosis in medical practice [42].

BD is significant for medical use and requires re-thinking, regarding the data storage infrastructure, the analysis growth, and the associated tools to drive advancements in the considered field [43]. In addition, BD is undoubtedly important for clinical practice, and physicians are responsible for developing and using BD to enhance patient care.

Machine learning has been utilized to predict psychiatric outcomes [44] in humans, where these techniques are more powerful than traditional statistical approaches. The author also discussed ways to optimise machine-learning techniques in the context of psychiatric research. BD has transformed natural product research and helped researchers both ask and answer new questions [45]. The author also highlighted the limitations regarding our current engagement with large data sets.

#### *2.3. BD in Pharmaceutics*

Clinical trials are important in pharmaceutics and life science, as they are used to evaluate whether a particular treatment is effective and to check whether it is safe for human beings. However, clinical trials are costly and time-consuming, and many of them fail; recruiting the right patients is therefore crucial, and the entire trial process is difficult. With the assistance of BD analytics, pharmaceutical companies can recruit the right patients for clinical trials, employing data such as genetic information, the status of the disease, and personality traits to increase the drug's success rate. This also helps in precisely determining the appropriate medicine(s) for treatment and in diagnosing the considered disorders, using the most relevant data along with analysis of certain characteristics, such as behavioural patterns and genetic makeup. Using BD, pharmaceutical companies can design personalized medicines in line with a particular patient's genetics and lifestyle.

The construction of medical BD involves not just a simple application and collection of medical data but, instead, is a complex systematic model. An existing research study [46] has discussed China's experience in constructing a regional medical BD ecosystem. The construction of the medical BD includes several institutions and high-level management, and cooperation was observed to enhance innovation and effectiveness. Compared with the construction of infrastructure, it is more time-consuming and challenging to develop proper data standards, data mining tools, and data integration. Similarly, another traditional study [47] has attempted to construct a proof-of-concept illustrating that BD approaches possess the capability to enhance the safety of drug monitoring in hospitals and, as such, can highly aid pharmaco-vigilance professionals to determine adverse drug events through data-driven targeted analysis of Drug–Drug Interactions (DDI). They also designed an automatic DDI detection model based on the treatment of the data and the laboratory analysis from electronic health records accumulated in a clinical data warehouse. The research results revealed that the developed DDI model worked effectively and that the time required for computation was manageable. This developed model can be used for regular monitoring processes.
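As a rough illustration of data-driven DDI screening, the sketch below flags patients who were prescribed two drugs from a known interacting pair during overlapping periods. It is not the model developed in [47], which also draws on laboratory analyses from a clinical data warehouse; the record layout, the function name, and the single-entry lookup table are all hypothetical.

```python
from itertools import combinations

# Hypothetical lookup table of interacting drug pairs (illustrative, not a clinical reference).
INTERACTING_PAIRS = {frozenset({"warfarin", "aspirin"})}


def find_ddi_candidates(prescriptions, interacting_pairs=INTERACTING_PAIRS):
    """Return (patient_id, drug_a, drug_b) for interacting drugs given in overlapping periods.

    `prescriptions` is a list of (patient_id, drug, start_day, end_day) tuples.
    """
    # Group prescriptions by patient.
    by_patient = {}
    for patient_id, drug, start, end in prescriptions:
        by_patient.setdefault(patient_id, []).append((drug, start, end))

    alerts = []
    for patient_id, items in by_patient.items():
        # Compare every pair of prescriptions belonging to the same patient.
        for (drug_a, start_a, end_a), (drug_b, start_b, end_b) in combinations(items, 2):
            overlaps = start_a <= end_b and start_b <= end_a
            if overlaps and frozenset({drug_a, drug_b}) in interacting_pairs:
                alerts.append((patient_id, *sorted((drug_a, drug_b))))
    return alerts


# Made-up prescription records: only patient p1 receives both drugs at the same time.
prescriptions = [
    ("p1", "warfarin", 1, 10),
    ("p1", "aspirin", 5, 7),
    ("p2", "warfarin", 1, 3),
    ("p2", "aspirin", 8, 9),
    ("p3", "aspirin", 1, 5),
]

print(find_ddi_candidates(prescriptions))
```

Each alert is only a candidate for review by pharmacovigilance professionals; confirming an adverse drug event would require further clinical evidence.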

Likewise, another paper [48] has attempted to address data quality issues in electronic patient records using a computerized electronic patient report system built on the MapReduce and Apache Hive abstractions of BD technology. The study also used MapReduce to analyse which patients spend more money than others. The data were loaded into a Hadoop system through the functions of extract, transform, and load (ETL). The considered model was observed to resolve issues related to the use of conventional manual models. Security was also observed to be improved, as the system demands appropriate authentication for access. However, the developed model does not seem to send any alert regarding the expiration dates of drugs. In addition, factors such as assets and security were not included in the existing system.
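The MapReduce pattern referred to above can be sketched in plain Python to show how per-patient spending might be aggregated. This is a single-process illustration rather than Hadoop or Hive code, and it is not the system described in [48]; the record fields and cost values are hypothetical.

```python
from collections import defaultdict


def map_phase(records):
    """Emit one (patient_id, cost) pair per billing record."""
    for record in records:
        yield record["patient_id"], record["cost"]


def shuffle_phase(pairs):
    """Group the emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the costs collected for each patient."""
    return {key: sum(values) for key, values in grouped.items()}


# Made-up billing records.
records = [
    {"patient_id": "p1", "cost": 120},
    {"patient_id": "p2", "cost": 40},
    {"patient_id": "p1", "cost": 30},
    {"patient_id": "p3", "cost": 200},
    {"patient_id": "p2", "cost": 15},
]

totals = reduce_phase(shuffle_phase(map_phase(records)))

# Rank patients from highest to lowest total spending.
top_spenders = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(top_spenders)
```

In a real Hadoop deployment, the map and reduce steps run in parallel across many nodes and the grouping step is handled by the framework, which is what lets the same logic scale to very large record sets.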

The body of work presented in the literature survey indicates the growing importance and trend towards adapting BD in the fields of pharmacology, toxicology, and pharmaceutics. The existing literature has demonstrated that BD can solve various problems and, so, the application of BD in these fields needs to be reinforced.

#### **3. Benefits of Big Data**

The data associated with the healthcare sector are enormous. They are stored in and withdrawn from clinics, hospitals, and insurance companies, resulting in the under-use of resources, data redundancy, and inadequacy. However, stakeholders have increased their voices and requirements to improve the exploitation and exploration of traditional data. With the employment of BD:

• Healthcare organizations can construct networks to bring about extensive changes in the educational field of medicine, practice, and research;


With clinical semiology, computer science, advanced imaging, radiology, biochemistry, and genomics, BD has emerged as a promising tool that can assist in developing a wide range of technical devices, surgical approaches, pharmacological therapies, and others.
