**Preface to "Big Data Analytics and Information Science for Business and Biomedical Applications"**

In today's data-centric world, there is a host of buzzwords appearing everywhere in digital and print media related to 'Big Data'. We encounter data in every walk of life, and the information it contains can be used to improve society, business, health, and medicine. This presents a substantial opportunity for humanity in general and a great challenge for statisticians and data scientists. Making sense of modern data structures and extracting meaningful information from them is not an easy task. The rapid growth in the size, scope, and complexity of data in a host of disciplines has created the need for innovative statistical strategies for analyzing and visualizing such data.

An enormous trove of digital data has been produced by biomedical research worldwide, including genetic variants genotyped or sequenced at genome-wide scales, gene expression measured under different experimental conditions, biomedical imaging data, including neuroimaging data, electronic medical records (EMR) of patients, and much more.

Analysis of such data will not only deepen our understanding of complex human traits and diseases, but will also shed light on disease prevention, diagnosis, and treatment. Undoubtedly, comprehensive analysis of Big Data in genomics and neuroimaging calls for statistically rigorous methods that can be applied in complex settings with data arising from multiple platforms. Various statistical methods have been developed to accommodate the features of genomic studies, as well as studies examining the function and structure of the brain.

Statistical theory has been developed alongside these methods. Beyond biomedical applications, there has been a tremendous increase in interest in the use of Big Data in business and financial applications. Financial time series analysis and prediction problems present many challenges for the development of statistical methodology and computational strategies for streaming data. The analysis of Big Data in biomedical, as well as business and financial, research has drawn much attention from researchers worldwide.

This book provides a platform for an in-depth discussion of powerful statistical methods developed for the analysis of Big Data in these areas. Both applied and theoretical contributions to these areas are showcased.

With a focus on statistical and machine learning, Wu et al. develop a novel approach for text feature learning motivated by trend analysis in cancer research examining 260,000 cancer studies and the corresponding low-dimensional text representations. Su et al. extend the lasso-penalized distance-weighted discrimination approach for binary classification to multicategory classification problems with an approach that considers all classes simultaneously. They establish uniqueness of the solution and obtain a non-asymptotic error bound in the case of group lasso penalization for ultra-high dimensional data.

With a focus on high-dimensional regression, Ahmed, Amiri, and Doksum develop efficient prediction methods for settings where the number of predictor variables is larger than the sample size. They investigate the performance of ensemble linear subspace methods that combine the results of linear models applied to smaller, randomly selected subsets of predictor variables and find settings where ensemble methods perform relatively well. Zhao and Chen consider regression analysis within the context of missing data and develop a conditional likelihood approach incorporating an instrumental variable that avoids the specification of the process generating the missing data. A data perturbation technique is also developed for inference in the high-dimensional case. Pet al. consider estimation and variable selection in ultra high-dimensional settings and develop an approach combining forward selection and backward elimination with stopping criteria that control false positives and false negatives. The authors obtain probability bounds on variable selection and show consistent estimation.

With a focus on biomedical data, Opoku et al. develop an algorithm based on ant colony system optimization for improving solutions to the ill-posed neuroelectromagnetic inverse problem from combined EEG and MEG data. Mei et al. use emulated clinical trial data and deep learning to compare endovascular aortic repair and emergent open aortic repair as procedures for abdominal aortic aneurysm and find that the former leads to improved expected survival. Cao and Lee consider the variable selection problem for fMRI data and develop a Bayesian approach based on non-local priors with posterior computation based on a combination of the Laplace approximation and stochastic search.

Finally, with a focus on financial modeling, Sun and Bozdogan consider the segmentation of high-dimensional time series data and develop an approach based on sparse principal component regression and mixture model cluster analysis. This approach is applied to find change-points in the adjusted closing price of the S&P 500 index over a span covering the years 1999 to 2019.

Börjesson and Singull consider deep learning for forecasting financial time series in a study comparing multi-channel convolutional neural networks represented as nonlinear autoregressive models and standard autoregressive time series modeling. A number of financial indices are considered including the S&P 500 covering the years 2010 to 2019.

In summary, this collection comprises a variety of contributions to the state-of-the-art on statistical methodology for Big Data and high-dimensional problems for biomedical and financial applications and beyond. We hope that it will in turn inspire new methods and applications.

> **Farouk Nathoo, S. Ejaz Ahmed** *Editors*

### *Article* **Cancer Research Trend Analysis Based on Fusion Feature Representation**

**Jingqiao Wu <sup>1</sup>, Xiaoyue Feng <sup>2</sup>, Renchu Guan <sup>1,2</sup> and Yanchun Liang <sup>1,2,</sup>\***


College of Computer Science and Technology, Jilin University, Changchun 130012, China; fengxy@jlu.edu.cn

**\*** Correspondence: ycliang@jlu.edu.cn; Tel.: +86-18686604031

**Abstract:** Machine learning models can automatically discover biomedical research trends and promote the dissemination of information and knowledge. Text feature representation is a critical and challenging task in natural language processing. Most methods of text feature representation are based on word representation. A good representation can capture semantic and structural information. In this paper, two fusion algorithms are proposed, namely, the Tr-W2v and Ti-W2v algorithms. They are based on the classical text feature representation model and consider the importance of words. The results show that the effectiveness of the two fusion text representation models is better than the classical text representation model, and the results based on the Tr-W2v algorithm are the best. Furthermore, based on the Tr-W2v algorithm, trend analyses of cancer research are conducted, including correlation analysis, keyword trend analysis, and improved keyword trend analysis. The discovery of the research trends and the evolution of hotspots for cancers can help doctors and biological researchers collect information and provide guidance for further research.

**Keywords:** feature representation; feature fusion; trend analysis; text mining

#### **1. Introduction**

Since the completion of the Human Genome Project and with the rapid development of high-throughput biotechnology, the amount of data in the fields of biology, medicine, genetics, and chemistry has grown exponentially. As of January 2021, the number of entries in PubMed (Biomedical Literature Retrieval System) has exceeded 30 million [1]. However, given the scale, rapid growth, and variety of formats of these data, people can do little with them unaided. It is a major challenge for clinicians or biological researchers to obtain cutting-edge information about research from tens of thousands of publications. Traditional methods, in which knowledge is manually acquired from literature and images, can no longer meet researchers' needs for understanding the current hotspots and trends of biomedical research [2,3]. It has become urgent to use intelligent algorithms to quickly and effectively acquire and discover biomedical knowledge.

Cancer research has attracted much attention in the medical field. Among all cancers, lung cancer poses the greatest threat to human health; it is characterized by its rapid spread and high probability of death. In recent years, according to statistical data around the world, the possibility of people suffering from lung cancer has greatly increased. Besides, gastric cancer, colorectal cancer, breast cancer, and liver cancer are also high-risk cancers that have been studied in the medical field in recent years. Traditional trend analysis can only be completed after reading and sorting out many documents published in the field in recent years by experts. This approach may hinder the dissemination of information and knowledge and cause omissions in the retrieval of papers by experts, which may affect the results of the extraction of research hotspots or trend analysis. The usage of machine learning models to automatically discover biomedical research trends can make up for this deficiency [4,5].

**Citation:** Wu, J.; Feng, X.; Guan, R.; Liang, Y. Cancer Research Trend Analysis Based on Fusion Feature Representation. *Entropy* **2021**, *23*, 338. https://doi.org/10.3390/e23030338

Academic Editor: Syed Ejaz Ahmed

Received: 26 January 2021 Accepted: 10 March 2021 Published: 12 March 2021

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Text feature learning is an important task in the field of natural language processing, and it is the basis of many downstream applications, such as text clustering and classification [6]. Most existing text feature representation learning is based on words, that is, word vector representation. It obtains word representations by mapping words from a one-dimensional space to a continuous vector space. Word representation methods include neural networks, word co-occurrence methods, methods that rely on probability, and interpretable knowledge base methods. A good low-dimensional mapping representation often improves the performance of downstream tasks [7]. Feature fusion is the integration of multiple different kinds of feature information to obtain more prominent feature information [8–11]. Multimodal features from text, audio, and vision can be fused with fusion techniques [12]. There are two types of fusion techniques: early fusion and late fusion. Early fusion concatenates the features first, whereas late fusion combines the results obtained from each feature set [13]. We adopt early fusion for text clustering.

Based on text representation methods, we propose a multi-view feature fusion strategy. Hotspot and trend analyses were conducted on 260,000 cancer studies using the proposed method. Our contributions are as follows. (1) Two fused, improved vector representation algorithms, Ti-W2v and Tr-W2v, are proposed. (2) A correlation analysis algorithm based on similarity is proposed to analyze the relationships among five cancer types. (3) A keyword trend analysis model and its improved model are proposed. Taking lung cancer as an example, the keyword analysis model analyzes the overall research hotspots. (4) Taking lung cancer as an example, the trend of lung cancer research is further analyzed from three perspectives: gene proteins, therapeutic drugs, and methods. The results can help guide literature summaries and the further work of relevant researchers.

The remainder of this paper is organized as follows: Section 2 lists the materials and methods. Section 3 describes the experimental details, presents the experimental results, and gives the error analysis. Section 4 discusses the results. Section 5 concludes our work.

#### **2. Materials and Methods**

#### *2.1. Background*

Traditional text representation models commonly include models based on word frequency, TF-IDF, TextRank, and word embedding. The text feature representation model based on word frequency is the simplest. It calculates the number of occurrences of each word in the text and obtains the text vector with the word frequency of each word [14]. The expression based on the word frequency algorithm is shown in Equation (1):

$$\mathrm{wordcount}(i, j) = n_{i,j} \tag{1}$$

where *n<sub>i,j</sub>* is the number of occurrences of word *t<sub>i</sub>* in document *d<sub>j</sub>*.

The text feature representation model based on TF-IDF considers both the frequency of each word in a training text and the number of other training texts containing the word, that is, the inverse document frequency [15,16]. The expression of the TF-IDF algorithm is shown in Equation (2):

$$\mathrm{tfidf}(i,j) = \mathrm{tf}(i,j) \times \mathrm{idf}(i) = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|} \tag{2}$$

where |*D*| represents the total number of documents in the corpus and |{*j* : *t<sub>i</sub>* ∈ *d<sub>j</sub>*}| represents the number of documents containing the term *t<sub>i</sub>*; we add 1 here to prevent the denominator from being 0. The TF-IDF text representation model is an algorithm based on word frequency, which pays more attention to the number of times the words appear in the document and does not consider the relative position between them.
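As a minimal sketch of Equation (2), assuming each document is already tokenized into a list of words, the weights can be computed as follows. The function name `tfidf` and the toy corpus are illustrative assumptions and are not taken from the original study.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, following Equation (2)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)          # n_{i,j} for every term in this document
        total = sum(counts.values())   # sum over k of n_{k,j}
        weights.append({
            term: (n / total) * math.log(n_docs / (1 + doc_freq[term]))
            for term, n in counts.items()
        })
    return weights


# Toy corpus (hypothetical data, for illustration only).
corpus = [
    ["lung", "cancer", "gene"],
    ["breast", "cancer", "therapy"],
    ["liver", "cancer", "gene"],
    ["gastric", "tumor", "surgery"],
]
scores = tfidf(corpus)
```

Because of the added 1 in the denominator, a term that occurs in all but one document (here `cancer`) receives a weight of 0, and a term that occurs in every document would receive a slightly negative weight.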

The TextRank-based text feature representation model is a graph-based ranking algorithm for text [17]. Its core idea is that a word is more important if it appears after many words. Besides, if a word is followed by another word with a high TextRank value, the TextRank value of this word is accordingly higher. Let G = (*V*, *E*) be a directed graph with the set of vertices *V* and set of edges *E*, where *E* is a subset of *V* × *V*. For a given vertex *V<sub>i</sub>*, let *In*(*V<sub>i</sub>*) be the set of vertices that point to it (predecessors), and let *Out*(*V<sub>i</sub>*) be the set of vertices that vertex *V<sub>i</sub>* points to (successors). The score of a vertex *V<sub>i</sub>* is defined by Equation (3):

$$WS(V_i) = (1 - d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j) \tag{3}$$

where *d* is a damping factor that can be set between 0 and 1; *w<sub>ji</sub>* is the weight between *V<sub>j</sub>* and *V<sub>i</sub>*. The TextRank model focuses more on the degree of co-occurrence between words in a fixed-length window. This considers the relative position of words to a certain extent, so when the number of documents is small, the TextRank algorithm can express text information more accurately, while the TF-IDF algorithm cannot do this.
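The iteration in Equation (3) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names `cooccurrence_graph` and `textrank` are hypothetical, edges are built symmetrically from co-occurrence counts inside a sliding window, the damping factor defaults to 0.85 (the value commonly used for TextRank; the study does not state its setting), and a fixed number of iterations replaces a convergence test.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=5):
    """Build symmetric edge weights from co-occurrence counts.

    Two words are linked when they fall inside the same window of
    `window` consecutive tokens (assumed construction).
    """
    weights = defaultdict(lambda: defaultdict(float))
    for pos, word in enumerate(tokens):
        for other in tokens[pos + 1: pos + window]:
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0
    return weights


def textrank(weights, d=0.85, iterations=50):
    """Iterate Equation (3); weights[j][i] is the weight of the edge j -> i."""
    nodes = list(weights)
    score = {v: 1.0 for v in nodes}
    # Denominator of Equation (3): total outgoing weight of each vertex.
    out_sum = {j: sum(weights[j].values()) for j in nodes}
    for _ in range(iterations):
        new_score = {}
        for i in nodes:
            incoming = sum(
                weights[j][i] / out_sum[j] * score[j]
                for j in nodes
                if i in weights[j]
            )
            new_score[i] = (1 - d) + d * incoming
        score = new_score
    return score


# Toy token sequence (hypothetical data, for illustration only).
tokens = "lung cancer cell study lung cancer patient tumor cell".split()
ranks = textrank(cooccurrence_graph(tokens, window=5))
top_words = sorted(ranks, key=ranks.get, reverse=True)[:3]
```

With this unnormalized form of the update, the scores of a graph in which every vertex has at least one outgoing edge always sum to the number of vertices, which gives a quick sanity check on an implementation.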

The text feature representation model based on word embedding maps words to another space through a certain mapping rule and generates expressions in a new space [18]. The word embedding text representation model is an algorithm based on a neural network. It emphasizes the hidden attributes of words in the text, such as the similarity between words and their parts of speech. As is characteristic of neural networks, the word embedding text representation model is difficult to interpret, but its final effect is better than that of the TF-IDF and TextRank algorithms. The obtained word vectors can measure semantic and other relevant features between words. Therefore, using word embedding methods to represent text features has been a hotspot in recent years. The most commonly used word embedding tool is Word2Vec [19–21], which contains two training modes: the CBOW training mode and the Skip-gram training mode.

#### *2.2. Method*

As shown in Figure 1, our proposed framework consists of two modules: a feature fusion module and a research trend analysis module. The feature fusion module contains two fusion strategies, and the research trend analysis module includes three trend analysis methods.

**Figure 1.** Framework of our work.

#### 2.2.1. Feature Fusion Representation Model

The word-based text representation method needs to obtain the representation of each word first. Then, the word vectors can be used to obtain a text representation. The classic method adds all word vectors and takes their average as the text vector. In this method, all words in the text are considered equally important. This is unrealistic because the importance of words in a text differs. The TF-IDF and TextRank representation algorithms represent a text by calculating the weights of the words in the text, but their points of analysis and calculation methods are quite different. The Word2Vec algorithm captures the semantic information of words but does not consider their importance. To retain the advantages of the above methods, we propose a multi-view fusion strategy, which combines Word2Vec with TF-IDF and TextRank. In this fusion strategy, Word2Vec is chosen as the representation method of words. The weights of words in the text are given by TF-IDF and TextRank. We name the two fusion methods Ti-W2v and Tr-W2v, and the details are given in the following sections.

Ti-W2v is an improved algorithm that combines TF-IDF and Word2Vec. TF-IDF is adopted to calculate the weight coefficient of each word in the text, and the embedding vector of the text is generated from the products of these weights and the Word2Vec embedding vectors of all words. The advantage is that different words in the text can be given different degrees of importance, which is closer to the actual situation than average embedding. Let the corpus be D = {*D*<sub>1</sub>, *D*<sub>2</sub>, . . . , *D<sub>k</sub>*}, where *D<sub>i</sub>* is the *i*th document and *V<sub>i</sub>* is the representation of *D<sub>i</sub>*. *w<sub>ij</sub>* is the *j*th word in *D<sub>i</sub>*, and *v<sub>ij</sub>* is the vector of *w<sub>ij</sub>* obtained by Word2Vec. *TI<sub>ij</sub>* is the weight of *w<sub>ij</sub>*, obtained with TF-IDF as in Equation (4):

$$TI_{ij} = \frac{n_{i,j}}{\sum_{k} n_{k,j}} \times \log \frac{|D|}{1 + |\{ j : t_i \in d_j \}|} \tag{4}$$

*V<sub>i</sub>* is defined using Equation (5), where the sum runs over the words of *D<sub>i</sub>*:

$$V_i = \sum_{j} TI_{ij} \times v_{ij} \tag{5}$$

The TF-IDF algorithm is based on word frequency. It measures the importance of words based on the word frequency within a text and the global inverse document frequency, so it is suitable for cases in which the number of documents is relatively large. In TextRank, by contrast, the importance of words is decided by their relative position. It does not depend on other documents and considers the co-occurrence of each word. Based on the fusion strategy, we propose the Tr-W2v algorithm, which combines TextRank and Word2Vec. First, TextRank is used to calculate the weight coefficients of different words in the text, and then the Word2Vec embedding vectors of the words are multiplied by these weights and added to obtain the text vector. *TR<sub>ij</sub>* is the weight of *w<sub>ij</sub>*, obtained with TextRank as in Equation (6):

$$TR_{ij} = (1 - d) + d \times \sum_{V_m \in In(V_j)} \frac{w_{mj}}{\sum_{V_k \in Out(V_m)} w_{mk}} WS(V_m) \tag{6}$$

As in Ti-W2v, *V<sub>i</sub>* is defined as the weighted sum of the word vectors, using Equation (7):

$$V_i = \sum_{j} TR_{ij} \times v_{ij} \tag{7}$$
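The weighting-and-adding step shared by Ti-W2v and Tr-W2v can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name `fuse`, the toy embeddings, and the toy weights are assumptions. In practice the word vectors would come from a trained Word2Vec model and the weights from TF-IDF (Ti-W2v) or TextRank (Tr-W2v).

```python
def fuse(word_vectors, word_weights):
    """Return the weighted sum of word vectors as the text vector.

    word_vectors: {word: list of floats}, e.g. Word2Vec embeddings.
    word_weights: {word: weight}, e.g. TF-IDF or TextRank scores
                  for the words of one document.
    """
    dim = len(next(iter(word_vectors.values())))
    text_vector = [0.0] * dim
    for word, weight in word_weights.items():
        vector = word_vectors.get(word)
        if vector is None:
            # Word is missing from the embedding vocabulary: skip it
            # (assumed handling; the source does not discuss this case).
            continue
        for k in range(dim):
            text_vector[k] += weight * vector[k]
    return text_vector


# Hypothetical 2-dimensional embeddings and weights, for illustration only.
embeddings = {"lung": [0.5, 0.25], "cancer": [0.25, 0.5]}
weights = {"lung": 2.0, "cancer": 1.0}
text_vector = fuse(embeddings, weights)
```

Setting every weight to 1 divided by the number of words recovers the classic average embedding, which makes the difference between the fused and unfused representations easy to inspect.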

#### 2.2.2. Cancer Research Trend Analysis Model

Based on the fusion-improved feature representation model proposed in the previous section, we propose three trend analysis models. We first propose a similarity trend analysis model based on the five high-incidence cancer datasets. A keyword trend analysis model and an improved keyword analysis model are then proposed based on the lung cancer dataset. Finally, lung cancer-related genes and proteins, treatment methods and drugs, and other hotspots related to lung cancer are analyzed.

#### Correlation Analysis Based on Similarity

We use the Tr-W2v algorithm to obtain the text vectors of abstracts on the five major cancers in the last five years. Then, for each type of cancer, the text vectors of each year are integrated into a single vector for that year (by addition and averaging). The cosine similarities between different cancers are calculated for each year, and the correlation of different cancers over the past five years is analyzed through cosine similarity. Figure 2 shows the flowchart of the algorithm.

**Figure 2.** Flowchart of similarity trend analysis.
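The averaging and cosine similarity steps of the correlation analysis can be sketched as below. This is a minimal sketch under stated assumptions: the function names, the dictionary layout keyed by `(cancer, year)`, and the 2-dimensional toy vectors are hypothetical, and the real inputs would be the Tr-W2v text vectors of the abstracts.

```python
import math


def mean_vector(vectors):
    """Element-wise average of equally sized vectors (addition and average)."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def yearly_similarity(text_vectors, cancer_a, cancer_b, years):
    """text_vectors maps (cancer, year) to the list of abstract vectors."""
    return {
        year: cosine_similarity(
            mean_vector(text_vectors[(cancer_a, year)]),
            mean_vector(text_vectors[(cancer_b, year)]),
        )
        for year in years
    }


# Hypothetical 2-dimensional text vectors, for illustration only.
toy_vectors = {
    ("lung", 2018): [[1.0, 0.0], [1.0, 2.0]],
    ("liver", 2018): [[0.0, 3.0], [0.0, 1.0]],
}
similarity = yearly_similarity(toy_vectors, "lung", "liver", [2018])
```

Repeating the call for every pair of cancers and every year yields the curves that a figure such as Figure 6 would plot.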

#### Keyword Trend Analysis Model

Taking the lung cancer dataset as an example, we use the TextRank algorithm to obtain the top 10% of keywords and their corresponding weights in each document. Then, all the keywords and corresponding weights are aggregated by year. The aggregation works as follows: keywords that have not yet appeared are added directly to the keywords of the year together with their weights, and for keywords that have already appeared, the weights are summed to give the new weight. Finally, the keywords are re-ranked and the top 50 are taken as the hotspots of the year. Figure 3 shows the flowchart of the algorithm.

**Figure 3.** Flowchart of keyword trend analysis.
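The keyword integration step can be sketched with a `Counter`, which adds new keywords and sums the weights of repeated ones. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the rounding rule used for the "top 10%" is an assumption, and the per-document weights would in practice come from TextRank.

```python
from collections import Counter


def top_fraction(weights, fraction=0.10):
    """Keep the highest-weighted fraction of one document's keywords.

    At least one keyword is always kept (assumed rounding rule).
    """
    keep = max(1, round(len(weights) * fraction))
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:keep])


def yearly_hotspots(doc_keyword_weights, top_n=50):
    """Merge per-document keyword weights for one year and rank them."""
    merged = Counter()
    for keywords in doc_keyword_weights:
        # New keywords are added; weights of repeated keywords are summed.
        merged.update(keywords)
    return merged.most_common(top_n)


# Hypothetical keyword weights for two documents of the same year.
docs_2018 = [
    {"egfr": 0.5, "nsclc": 0.25},
    {"egfr": 0.25, "tumor": 0.5},
]
hotspots = yearly_hotspots(docs_2018, top_n=50)
```

Here `egfr` appears in both documents, so its merged weight is the sum of the two individual weights and it ranks first.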

#### Improved Keyword Trend Analysis Model

The keyword analysis model proposed in the previous section can coarsely analyze the annual research hotspots of a single type of cancer (taking lung cancer as an example). For more detailed trend analysis, we propose an improved keyword analysis built on this basic model. As in the correlation analysis, we first use the Tr-W2v algorithm to obtain the text vectors of the lung cancer documents of each year. The k-means clustering algorithm is then applied to these vectors to generate *k* categories. The keyword integration operation of the previous section is used to obtain the hotspots of each cluster. Then, the top keyword of each category is extracted and integrated into the distribution of hotspots of that year. Figure 4 shows the flowchart of the algorithm.

**Figure 4.** Flowchart of improved keyword trend analysis.
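The per-cluster integration can be sketched as follows, assuming the cluster label of each document has already been produced by k-means on the Tr-W2v text vectors. The function name `cluster_hotspots` and the toy data are hypothetical, and the clustering step itself is not reproduced here.

```python
from collections import Counter, defaultdict


def cluster_hotspots(doc_keywords, labels):
    """Return the top keyword of every cluster for one year.

    doc_keywords: list of {keyword: weight} dicts, one per document.
    labels: cluster label of each document (e.g. from k-means).
    """
    per_cluster = defaultdict(Counter)
    for keywords, label in zip(doc_keywords, labels):
        # Same integration rule as before, applied inside each cluster.
        per_cluster[label].update(keywords)
    return {
        label: merged.most_common(1)[0][0]
        for label, merged in per_cluster.items()
        if merged  # skip clusters whose documents have no keywords
    }


# Hypothetical keyword weights and cluster labels, for illustration only.
doc_keywords = [
    {"egfr": 0.5, "nsclc": 0.25},
    {"egfr": 0.25},
    {"ct": 0.5, "image": 0.75},
]
labels = [0, 0, 1]
year_hotspots = cluster_hotspots(doc_keywords, labels)
```

Taking one keyword per cluster, rather than the globally highest weights, is what lets less frequent but topic-specific terms surface in the yearly hotspot list.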

#### **3. Results**

#### *3.1. Datasets*

For comparing the effect of representation methods, we use the second edition of the well-known public classification dataset 20 newsgroups [22]. In the analysis of cancer research trends, we retrieve PubMed articles using MeSH terms and obtain experimental datasets that include data from the past five years on the five most common cancers in China (lung cancer, breast cancer, gastric cancer, colorectal cancer, and liver cancer) [23,24]. Table 1 shows the distribution of the number of research papers for the five major cancers in the most recent five years.

**Table 1.** Number of research papers for the five cancers.


#### *3.2. Results*

In the cancer dataset, the most papers on lung cancer and breast cancer were published in 2018. The number of papers published in 2014 is the largest for gastric cancer. For colorectal and liver cancer, the number of papers published in 2015 is the largest. The number of papers on each cancer has not increased steadily over the years. It shows a stable trend, and in some years the number is slightly lower than in previous years; however, the total number of cancer research papers is still rising slightly.

#### 3.2.1. Comparison Results of Feature Fusion Methods

To compare the results of feature fusion methods, we conduct clustering experiments on five text representation algorithms: TF-IDF, Word2Vec, TextRank, Ti-W2v, and Tr-W2v. First, the five algorithms are used to vectorize the text of the dataset. Then, we use the classical k-means clustering algorithm to evaluate the effects of the five word-representation algorithms. We select ten categories from the 20 newsgroups dataset as the experimental dataset. The number of clusters in k-means is set to 10, the initialization method defaults to k-means++, and the maximum number of iterations is set to 300. We use the silhouette coefficient of the clustering as the evaluation measure [25]. The results for TF-IDF, Word2Vec, TextRank, Ti-W2v, and Tr-W2v are 0.402, 0.449, 0.433, 0.491, and 0.502, respectively. Figure 5 shows the two-dimensional clustering visualization of the data.

**Figure 5.** Two-dimensional clustering visualization results based on five word-representation algorithms.
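For readers unfamiliar with the measure, the silhouette coefficient can be sketched in plain Python as below. This is a minimal sketch under stated assumptions: it uses Euclidean distance, takes cluster labels as given rather than running k-means, requires at least two clusters, and scores singleton clusters as 0. The function name and toy points are hypothetical, and the study's own k-means configuration (10 clusters, k-means++ initialization, at most 300 iterations) is not reproduced here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other points of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i)
        a /= len(own) - 1
        # b: smallest mean distance to the points of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Two well-separated toy clusters (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])
bad = silhouette_score(points, [0, 1, 0, 1])
```

Values close to 1 indicate compact, well-separated clusters, while values below 0 (as for the deliberately wrong labeling `bad`) indicate that points sit closer to another cluster than to their own.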

In the clustering experiment, the Tr-W2v algorithm achieves the best silhouette coefficient. The score of the Word2Vec vectors is 11.7% higher than that of TF-IDF, and the score of the TextRank vectors is 7.7% higher than that of TF-IDF. Ti-W2v has a 9.4% improvement over Word2Vec, and Tr-W2v has a 2.2% improvement over Ti-W2v. The choice of word vectors plays a vital role in text representation, and switching to Word2Vec vectors gives the largest improvement. The word vector fusion method also has a certain effect. The text vectors fused with TextRank perform better than those fused with TF-IDF, and both are better than Word2Vec alone. Although this improvement is not as pronounced as that from replacing the word vectors, it is still noticeable. In general, the Tr-W2v fusion text vectors have the best clustering effect, which also indicates that they represent the text information better. We also evaluated the effect of the TextRank window size in the Tr-W2v algorithm. When the window size is 2, 4, 5, 6, and 10, the results are 0.469, 0.485, 0.502, 0.490, and 0.453, respectively. We therefore choose 5 as the window size.
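The percentages quoted above are relative gains in the silhouette coefficient and can be checked directly from the reported scores. The helper name `relative_gain` is an illustrative assumption; the scores are the ones reported in this section.

```python
# Silhouette coefficients reported for the five representations.
scores = {
    "TF-IDF": 0.402,
    "Word2Vec": 0.449,
    "TextRank": 0.433,
    "Ti-W2v": 0.491,
    "Tr-W2v": 0.502,
}


def relative_gain(new, base):
    """Relative improvement of `new` over `base`, in percent."""
    return (scores[new] - scores[base]) / scores[base] * 100


gains = {
    "Word2Vec vs TF-IDF": round(relative_gain("Word2Vec", "TF-IDF"), 1),
    "TextRank vs TF-IDF": round(relative_gain("TextRank", "TF-IDF"), 1),
    "Ti-W2v vs Word2Vec": round(relative_gain("Ti-W2v", "Word2Vec"), 1),
    "Tr-W2v vs Ti-W2v": round(relative_gain("Tr-W2v", "Ti-W2v"), 1),
}
```

For example, the first entry is (0.449 − 0.402) / 0.402 ≈ 11.7%.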

The results of the fusion feature experiment show that the fusion vector obtained by the Tr-W2v algorithm has the best result in the clustering experiments, and the Ti-W2v algorithm is slightly inferior to the Tr-W2v algorithm; however, both are better than the traditional text representation models. This may be due to the difference between the TF-IDF algorithm and the TextRank algorithm. The TF-IDF algorithm only considers word frequency information and does not consider the relationship between words. Compared with the TF-IDF algorithm, TextRank can capture important information such as the relative position of words within a single text, so the integration of TextRank and Word2Vec performs better. In addition, the clustering results of the TF-IDF and TextRank vectors look somewhat different from those of the other three vectors. The TF-IDF and TextRank algorithms represent vectors by word frequency and word co-occurrence position, respectively, whereas the other three algorithms are based on Word2Vec's low-dimensional dense vectors. Therefore, the clusters based on the TF-IDF and TextRank algorithms are more dispersed, while those of the other three algorithms are more uniform and regular.

#### 3.2.2. Cancer Trend Analysis Results

Next, the experimental results of the cancer research trend analysis model based on the fusion-improved feature representation model are listed below.

#### Correlation Analysis Results Based on Similarity

A correlation analysis is conducted based on similarity to determine the relationships among the five cancer types. Figure 6 shows the results for the most recent five years.

**Figure 6.** Correlation of the top five high-risk cancers. (**a**) Correlation between lung cancer and the other four cancers, (**b**) correlation between breast cancer and the other four cancers, (**c**) correlation between gastric cancer and the other four cancers, (**d**) correlation between colorectal cancer and the other four cancers, and (**e**) correlation between liver cancer and the other four cancers.

From Figure 6, we can conclude that colorectal cancer is most closely related to the other four cancers. Several findings in the literature are consistent with this. Smoking may cause lung cancer, and long-term smokers are also more likely to die from colorectal cancer than nonsmokers [26]. Many research directions for the treatment of breast cancer and colorectal cancer overlap [27,28]. The stomach and the colorectum are both organs of the digestive tract, and many studies address them simultaneously [29,30]. These studies indirectly support the close relationship between colorectal cancer and the other four cancers. Lung cancer, breast cancer, gastric cancer, and colorectal cancer have the lowest similarity with liver cancer. In addition, lung cancer has the highest similarity to colorectal cancer among all relations, and breast cancer has the lowest similarity to liver cancer. This also shows that among the top five high-incidence cancers, lung cancer is most closely linked to colorectal cancer, while breast cancer is relatively less linked to liver cancer.

#### Results of the Keyword Trend Analysis Model

Taking lung cancer as an example, Figure 7 shows the visualization results of the annual hotspots word cloud of lung cancer obtained by the keyword trend analysis model and the improved keyword trend analysis model.

**Figure 7.** Hotspots of lung cancer were obtained by keyword trend analysis model (**a**) and improved keyword trend analysis model (**b**).

The hotspots in the past five years focus on *patient*, *cancer*, *cell*, *lung*, *study*, *tumor*, *CI*, *nsclc*, and so on. Given the principles of the TextRank algorithm and the characteristics of the lung cancer research literature, these results are expected: the central idea of the TextRank algorithm is that the more often a word co-occurs with other important words within a window of a certain length, the more important the word is. The words above rank highly under TextRank's weighting mechanism because they belong to the core vocabulary of the lung cancer literature. As a result, the public hotspots obtained with the keyword trend analysis model are basically the same each year, and the results are too coarse to study the trend of lung cancer research in the past five years. The improved keyword trend analysis model provides more detail, and its results differ significantly from those of the keyword analysis method before improvement (see the right-hand side of Figure 7). In the improved method, hotspots in different categories are combined to generate the hotspots of one year. Then, the hotspots that differ between years are chosen to represent the public hotspots of each year. Therefore, the hotspots of each year are clearly distinguished, which makes it easier to analyze the research trends of lung cancer in the most recent five years.

#### Results of Analysis on Research Trend

Based on the improved keyword trend analysis results, the research trends on lung cancer in different areas are presented in Figures 8–10. Figure 8 shows the research trends of related gene proteins and invertase factors in lung cancer research in the past five years. Figure 9 shows the research trends of lung cancer-related treatment drugs and methods in the past five years, and Figure 10 shows the research trends of other hot topics in lung cancer in the past five years.

The number of genes and proteins in the human body is very large, and many studies each year are devoted to gene- and protein-related research on lung cancer. It can be seen from Figure 8 that the research hotspots for genes and proteins vary from year to year. The unique hot research terms in 2014 included the ATK1 gene, YAP gene, PKM2 gene, LSCC gene, and PDCD5 gene. The unique hot research terms in 2015 included PMS separation enhancer protein, BBP gene, Bsm gene, and THOR long noncoding RNA. The unique hot research terms in 2016 included SFTPD gene, p110α protein, DDX17 gene, Globo H glycoprotein, and LHX6 gene. The unique hot research terms in 2017 included siRNA, TGF-β transforming growth factor, luciferase, RDM1 protein, and SNHG15 long-chain noncoding RNA. The unique hot research terms in 2018 included miRNA-223 and LKB1 gene.

**Figure 8.** Research trends of lung cancer research related gene protein and invertase factor in the last five years.

**Figure 9.** Research trends of lung cancer research related therapeutic drugs and methods in the last five years.

With the development of science and technology, different lung cancer-related treatment drugs and treatment methods have emerged. It can be seen from Figure 9 that the research hotspots for different drugs and treatments vary by year. The unique research hotspot words in 2014 included metal platinum anticancer drugs CHIP, galactosylceramide, thymoquinone drugs, AR lung nodule intelligent diagnosis system treatment, NC treatment, and TP treatment. The unique research hotspot words in 2015 included enzalutamide drugs, tenofovir dipivoxil drugs, dibenzylthiocaprylic acid drugs, thalidomide drugs, PEGPH2O tumor effect drugs, EP regimen treatment, intraoperative radiotherapy, and robot-assisted thoracoscopic surgery. The unique research hotspot words in 2016 included linsitinib drugs, penicillin drugs, SKLB drugs, sitagliptin drugs, erlotinib drugs, statins, ARMS quantitative treatment, and PCR-clamp method detection. The unique research hotspot words in 2017 included cucurbitacin, human resistin, and dimethyl amiloride. The unique research hotspot words in 2018 included aspirin drugs, intraoperative radiotherapy for lung cancer, gamma knife treatment, leishmaniasis, and low-dose lung CT technology.

**Figure 10.** Research trends of other related hotspots in the last five years.

It can be seen from Figure 10 that the research hotspots for other factors related to lung cancer also differ from year to year. The unique hot research terms in 2014 included DOX, derivative, exosomes, and mesothelioma. The unique hot research terms in 2015 included DFI and bmplc. The unique hot research terms in 2016 included TMB tumor mutation load, monocyte leukemia CMML, and Co3O4 nanoparticles. The hot research terms in 2017 included adhesion, csc, and tam. The unique hot research terms in 2018 included image, examination, software, feature, algorithm, epithelioid, pneumonia, and bilateral. Regarding the hot words in 2018, it is noteworthy that, with the advancement of science and technology, computer software and artificial intelligence algorithms play an increasingly important role in lung cancer research, for example in artificial intelligence-based image diagnosis of the lung.
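The "unique" terms reported for each year follow from the two steps of the improved method: merging the hotspots of all categories within a year, then keeping the differential hotspots between years. The sketch below shows one possible reading of these steps. It assumes that "differential" means a set difference against all other years, which the text does not state explicitly; the function name `differential_hotspots`, the category labels, and the toy input are hypothetical.

```python
def differential_hotspots(hotspots_by_year):
    """Return, for each year, the keywords that occur in no other year.

    `hotspots_by_year` maps year -> {category: [keywords]}.
    Assumption: "differential hotspots" are computed as a set
    difference against the merged hotspots of all other years.
    """
    # Step 1: combine the hotspots of all categories into one set per year.
    merged = {
        year: {word for words in categories.values() for word in words}
        for year, categories in hotspots_by_year.items()
    }

    # Step 2: keep only the words that no other year shares.
    unique = {}
    for year, words in merged.items():
        seen_elsewhere = set()
        for other_year, other_words in merged.items():
            if other_year != year:
                seen_elsewhere |= other_words
        unique[year] = sorted(words - seen_elsewhere)
    return unique


if __name__ == "__main__":
    # Toy input for illustration only; the category names are hypothetical.
    toy = {
        2014: {"gene": ["patient", "YAP"], "other": ["cancer", "exosomes"]},
        2018: {"gene": ["patient", "LKB1"], "other": ["cancer", "algorithm"]},
    }
    print(differential_hotspots(toy))
```

Under this reading, general terms shared by every year (such as *patient* and *cancer* in the toy input) drop out, and only year-specific terms remain.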

#### **4. Limitations**

There are some limitations in our work. We took only five cancers as examples and discussed their relevance; data on more cancers should be added. Further, word2vec was chosen as the embedding method, whereas more advanced text representation methods such as BERT (Bidirectional Encoder Representations from Transformers) [31] and BioBERT [32] might be a better choice. In the experimental design, we applied three methods to analyze cancer research trends, and more diversified test methods could be used in future work.

#### **5. Conclusions**

Text feature representation models play an essential role in natural language processing. Improving these models helps machines better understand relevant text information and benefits downstream tasks. Considering the words' degree of importance, we combined TF-IDF and TextRank with word2vec. The results demonstrate the effectiveness of the fusion models. The combined model was also adopted to analyze the research trends of cancers. The proposed models can help researchers find research hotspots in biology, medicine, information retrieval, and natural language processing.

**Author Contributions:** Conceptualization, X.F.; methodology, R.G.; software, J.W.; validation, J.W.; formal analysis, X.F.; resources, J.W.; data curation, J.W.; writing—original draft preparation, J.W. and X.F.; writing—review and editing, Y.L.; visualization, J.W.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L., R.G., and X.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (grant number 61972174), the Science and Technology Planning Project of Guangdong Province (grant number 2020A0505100018), the Guangdong Key Project for Applied Fundamental Research (grant number 2018KZDXM076), the Guangdong Premier Key-Discipline Enhancement Scheme (grant number 2016GDYSZDXK036), and the Educational Commission of Jilin Province (grant number JJKH20200992KJ).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Publicly available datasets were analyzed in this study. These data can be found here: https://www.ncbi.nlm.nih.gov/pmc/ (accessed on 20 November 2020).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


