1. Introduction
With the rapid development of “Internet plus”, data in almost all industries and businesses have grown explosively in recent years [1]. Big data is a common buzzword in the business and research communities, referring to the great mass of digital data collected from various sources [2]. Big data is characterized by the “5V” [3]:
Variety: the data come from a variety of sources, and the types and formats of data are becoming richer. Big data now extends beyond the traditional category of structured data to include semi-structured and unstructured data.
Volume: the volume of data is huge, including the amount of data that is collected, stored and calculated.
Velocity: different types of data must be processed quickly, with fast access to high-value information, which is fundamentally different from traditional data mining techniques.
Value: because a huge amount of data is generated very quickly and inevitably mixes valid and invalid data, the value density of the data is greatly reduced. However, the rational use of big data brings a very high value in return.
Variability: with the increasing use of social media, data loads become harder to manage and usually peak around certain events.
Big data has received wide attention from academia, the economic community and even governments [4]. In May 2011, the McKinsey Global Institute (MGI) issued the report Big data: The next frontier for innovation, competition, and productivity [5]. This study estimated that all companies together stored 7.4 EB of newly generated data in 2010. It was also the first time that a professional organization had introduced and examined big data. In January 2012, big data was one of the main themes of the World Economic Forum in Davos, Switzerland. The report Big Data, Big Impact stated that big data has become a new category of economic asset [6]. In March 2012, the Obama Administration announced the “Big Data Research and Development Initiative” [7], which proposed using big data to achieve technological breakthroughs in scientific research, environmental protection, biomedical research, education and national security.
Big data has attracted researchers in all fields, especially in medicine [8]. In 2009, Google analyzed billions of distinct digital models together with billions of search queries and developed Google Flu Trends. When influenza A (H1N1) broke out in the United States, the source of the outbreak was identified in time. Chawla and Davis [9] presented a big-data-driven approach to personalized healthcare. The American Heart Association, through the investigation of cardiovascular data, proposed a future digital ecosystem for cardiovascular disease and stroke [10].
Bibliometrics is the cross-disciplinary science of quantitatively analyzing all knowledge carriers by mathematical and statistical methods [11]. It is commonly used to trace the development of a given field [12,13]. Bibliometrics originated in the early twentieth century. In 1917, Cole and Eales studied the growth of the literature in comparative anatomy through bibliographical citations [14]. In 1969, the British scientist Alan Pritchard first proposed the term “bibliometrics” to replace “statistical bibliography”. The emergence of this term marks the formal birth of bibliometrics, and the field has received increasing attention since. Its most obvious advantage is that it allows scholars to study a specific research area by analyzing citations, co-citations, geographical distribution and word frequency, and to draw useful conclusions from them. To date, bibliometrics has been widely used in hotspot research [15], co-authorship analysis [16], co-citation analysis [17], and studies of the development of whole subject fields [18].
The concept of medical big data (MBD) is mentioned increasingly often and has been applied in many sectors. However, related work concentrates mainly on engineering applications, specifically data collection and storage, and there is little research on MBD from the perspectives of bibliometrics and visualization. Visualization not only uses data mining technology to extract useful information from data, but also presents that information to users intuitively. In addition, it is important to conduct a systematic literature review, especially in the initial phase of a study on MBD, to ensure good-quality results. It is therefore necessary to provide a comprehensive overview of this research direction and to identify some basic patterns of MBD-related research. Motivated by this idea, this paper applies bibliometric analysis and visualization to MBD to explore the characteristics of this area.
The rest of this paper is organized as follows. Section 2 introduces the data source and methods used in this study. Section 3 presents the results in detail, including the current status of MBD research, the analysis of research hotspots, the co-authorship analysis and the co-citation analysis. Section 4 discusses the significant results and summarizes the paper.
2. Data and Methods
The literature data used in this study were downloaded from the Science Citation Index Expanded (SCIE) and the Social Science Citation Index (SSCI) databases in Web of Science. SCIE and SSCI are the most frequently used databases in bibliometric analysis [19,20,21]. These two databases cover more scientific and authoritative publications than other databases. Moreover, SCIE and SSCI provide citation information, keywords and references. We used “medical big data” as the topic search term and set the time span to “all years” (the returned results show that the first MBD publication appeared in 1991). The document type was set to “all types”. In total, 988 documents met the selection criteria, spanning ten document types. The most frequent document type is article (807), accounting for 81.7% of all publications. Review ranks second (98), with a proportion of 9.9%. The other document types are editorial material (36), proceedings paper (25), meeting abstract (12), book chapter (5), letter (2), book review (1), correction (1) and news item (1). Table 1 lists the numbers and proportions of the document types. All documents were downloaded on 7 October 2017 in tab-delimited format.
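The proportions of the document types can be recomputed directly from the counts given in the text. The following is a minimal sketch, not the procedure used in the study; the names `doc_type_counts` and `shares` are illustrative assumptions.

```python
# Document-type counts as reported in the text (988 documents in total).
doc_type_counts = {
    "article": 807,
    "review": 98,
    "editorial material": 36,
    "proceedings paper": 25,
    "meeting abstract": 12,
    "book chapter": 5,
    "letter": 2,
    "book review": 1,
    "correction": 1,
    "news item": 1,
}


def shares(counts):
    """Return each document type's share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_documents = sum(doc_type_counts.values())
doc_type_shares = shares(doc_type_counts)

# Print a small table: type, count, percentage share.
for name, n in doc_type_counts.items():
    print(f"{name:20s} {n:4d} {doc_type_shares[name]:5.1f}%")
print(f"{'total':20s} {total_documents:4d}")
```

Recomputing the shares in this way is a quick consistency check between the counts and the percentages reported for each document type.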
Science mapping is an essential procedure in bibliometrics [22]. It can represent the state and development of a discipline [23]. Many software tools are available for bibliometric analysis. VOSviewer (Centre for Science and Technology Studies, Leiden University, Leiden, The Netherlands) and CiteSpace (Chaomei Chen, Drexel University, Philadelphia, PA, USA) were used for visualization mapping in this paper. CiteSpace [24] is effective for information visualization and is used to obtain quantitative and visual information about specific fields [25]. VOSviewer is a free software tool developed by van Eck and Waltman [26] that is powerful for co-occurrence and co-citation analysis. GraphPad Prism 5 (GraphPad Software Inc., San Diego, CA, USA) was used to draw histograms and line charts. Each bibliometric tool has advantages in one or several specific functions. For example, VOSviewer has a friendly graphical user interface that makes the generated maps easy to view, while CiteSpace can visualize networks using various layouts [27]. In this paper, we use VOSviewer to build the co-authorship and co-citation networks, and CiteSpace to produce the keyword timeline view.
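To make the idea of co-occurrence analysis concrete, the sketch below counts how often two keywords appear in the same record of a tab-delimited export. It is a simplified illustration and not the algorithm implemented in VOSviewer or CiteSpace. It assumes that author keywords are stored in a column tagged `DE` and separated by semicolons, as in Web of Science tab-delimited exports; the three sample records and the function name `keyword_cooccurrence` are hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical sample in a tab-delimited layout: "TI" is the title and
# "DE" holds author keywords separated by semicolons (assumed convention).
SAMPLE = (
    "TI\tDE\n"
    "Paper A\tbig data; precision medicine; data mining\n"
    "Paper B\tbig data; data mining\n"
    "Paper C\tBig Data; privacy\n"
)


def keyword_cooccurrence(tsv_text, field="DE"):
    """Count, for each unordered keyword pair, the records containing both."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    pairs = Counter()
    for row in reader:
        raw = row.get(field) or ""
        # Normalize case and whitespace, and drop duplicates within a record.
        keywords = sorted({k.strip().lower() for k in raw.split(";") if k.strip()})
        # Each sorted pair is counted once per record.
        pairs.update(combinations(keywords, 2))
    return pairs


cooccurrence = keyword_cooccurrence(SAMPLE)
for (first, second), n in cooccurrence.most_common():
    print(f"{first} + {second}: {n}")
```

The resulting pair counts are the edge weights of a keyword co-occurrence network; replacing the `DE` column with an author column would give co-authorship counts in the same way.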
4. Discussions and Conclusions
This study presented a bibliometric analysis and visualization of MBD-related publications. The main findings can be summarized as follows:
First, the number of MBD-related publications fluctuated at a low level during the 1990s and the first decade of the 21st century, but grew rapidly after 2010. In terms of institutes, Harvard University has the highest number of publications, and 8 of the top 10 institutes by number of MBD-related publications are in the USA. PLoS ONE ranks first among the journals publishing MBD-related work. The USA has the most publications, the highest citation frequency and the highest H-index, which implies that the USA is the bellwether in this field. China has a large number of publications, but Chinese scholars should pay more attention to the quality of their papers.
Second, the analysis of keywords shows that medical care is moving from a disease-centered model towards a patient-centered model, and personalized medicine is gaining momentum. At the same time, technical support for MBD research is the key obstacle that needs to be overcome.
Third, cooperation among multiple authors is widespread in the MBD domain. All of the top 10 most-cited publications have more than one author. However, international cooperation is not yet common.
Fourth, the most frequently cited work in the MBD area is Murdoch (2013), and JAMA-Journal of the American Medical Association is the most influential journal in the MBD domain.
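The H-index used in the country comparison above is the largest number h such that h publications have each been cited at least h times. The following is a minimal sketch of that definition; the citation counts in `example_citations` are hypothetical and are not data from this study.

```python
def h_index(citations):
    """Return the largest h such that h items have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank  # the top `rank` items all have at least `rank` citations
        else:
            break  # counts only decrease from here, so h cannot grow further
    return h


# Hypothetical citation counts for seven publications.
example_citations = [25, 8, 5, 3, 3, 1, 0]
print(h_index(example_citations))  # prints 3
```

Because the H-index combines output and citation impact in a single number, a country can have many publications and still have a low H-index if few of them are highly cited.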
We can conclude that the patient-centered model is an inevitable trend in future medical development. (1) Firstly, precision medicine has spread throughout the world, and many countries have begun to pursue the related concepts and industries. In 2008, Clayton Christensen, a professor at Harvard Business School, first proposed the concept of precision medicine. In 2011, the National Research Council formally introduced the definition of precision medicine in the research report Toward Precision Medicine: Building a Knowledge Network for Biomedical Research and a New Taxonomy of Disease [47]. At the end of January 2015, President Obama announced a new life science project in the USA, the Precision Medicine Initiative. This project targets diseases such as cancer and diabetes, with the aim of improving everyone’s health through personalized information [48]. Jameson and Longo [49] from the University of Pennsylvania systematically summarized the strengths, challenges and clinical practice of accelerated precision medicine. Wishart [50] explored applications of metabolomics to uncover the underlying causes of complex diseases and the potential impact of metabolomics on precision medicine. Aronson and Rehm [51], in a paper published in Nature, described a medical system with seamless cycling between clinical research and patient care to expedite the application of precision medicine. Molecular biology has advanced greatly, for example in tumor molecular pathology and gene detection. However, the mining, evaluation, integration and application of these data need to be strengthened. Precision medicine information technology systems include biological samples, bioinformatics, electronic medical records, and big data analysis techniques. Big data analysis is the key to precision medicine [52]. Making good use of MBD can improve the accuracy and scientific rigor of medical diagnosis and enable personalized medical care. By analyzing the factors that influence residents’ health, patients’ health information can be integrated to provide better evidence for the diagnosis and treatment of disease. The data mining framework of precision medicine is shown in Figure 12. (2) Secondly, with the continuous progress of modern society, people’s awareness of safeguarding their rights is gradually improving, and the protection of patient privacy has become particularly important. Protecting patient privacy is an all-around project [53]. It requires privacy laws and supporting legal agreements, as well as cooperation among all stakeholders, including patients, the institutes that hold patients’ health information, and the government agencies responsible for supervision and enforcement.
Furthermore, the technical challenges of utilizing and developing MBD cannot be ignored. (1) Firstly, the expanding body of medical information contains a large amount of unstructured data, and the data sources are becoming more and more diverse. The storage and transfer technologies required for MBD differ considerably from those of traditional data analysis, and current storage architectures cannot meet the needs of big data applications. (2) Secondly, the mining of MBD has become urgent. Raw clinical data are large and heterogeneous, coming mostly from electronic medical records, medical images, medical record parameters, laboratory results, and clinical observation and interpretation [54]. Such clinical information has its own particularities and complexities, such as diversity, privacy, redundancy, incompleteness, and a lack of mathematical properties, which makes medical data mining very different from conventional data mining. (3) Thirdly, from the perspective of data collection, large-scale data are collected from various sources such as the Internet, mobile phones, hospitals, and the scientific community [55]. (4) Fourthly, many other challenges exist in both data management and data analysis in the big data era, for example, processing highly distributed data sources, tracking data sources, coping with sampling bias and heterogeneity, and developing parallel and distributed algorithms.
Although we have obtained some interesting results through the bibliometric analysis and visualization of MBD-related publications, this study has some shortcomings. We downloaded the documents from the SSCI and SCIE databases via Web of Science, and more than 99% of the articles were written in English. This leads to an underestimation of the contributions of researchers who publish in other languages. In addition, we have not considered the technologies for handling MBD.