Article

Prediction of Antigenic Distance in Influenza A Using Attribute Network Embedding

1 School of Information Science and Engineering, Yunnan University, Kunming 650500, China
2 State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Yunnan University, Kunming 650500, China
* Author to whom correspondence should be addressed.
Viruses 2023, 15(7), 1478; https://doi.org/10.3390/v15071478
Submission received: 28 April 2023 / Revised: 23 June 2023 / Accepted: 28 June 2023 / Published: 29 June 2023

Abstract

Owing to the rapid changes in the antigenicity of influenza viruses, it is difficult for humans to obtain lasting immunity through antiviral therapy. Hence, tracking the dynamic changes in the antigenicity of influenza viruses can provide a basis for vaccines and drug treatments to cope with the spread of influenza viruses. In this paper, we developed a novel quantitative prediction method to predict the antigenic distance between virus strains using attribute network embedding techniques. An antigenic network is built to model and combine the genetic and antigenic characteristics of the influenza A virus H3N2, using the continuous distributed representation of the virus strain protein sequence (ProtVec) as a node attribute and the antigenic distance between virus strains as an edge weight. The results show a strong positive correlation between supplementing genetic features and antigenic distance prediction accuracy. Further analysis indicates that our prediction model can comprehensively and accurately track the differences in antigenic distances between vaccines and influenza virus strains, and it outperforms existing methods in predicting antigenic distances between strains.

1. Introduction

Influenza A virus is a negative-sense single-stranded RNA virus, and its viral membrane is mainly composed of two surface glycoproteins: hemagglutinin (HA) and neuraminidase (NA). Existing influenza A viruses are classified into 18 HA subtypes and 11 NA subtypes according to different combinations of HA and NA. Each subtype has distinct pathogenic characteristics and antigenicity. Currently, the H3N2 and H1N1 subtypes are the main seasonal influenza viruses circulating in humans. Among them, the H3N2 virus was first discovered during the 1968 Hong Kong influenza pandemic and has since been continuously circulating globally [1,2]. Seasonal influenza is expected to cause 290,000–640,000 respiratory-related fatalities worldwide yearly [3].
HA is the primary surface protein of influenza A virus and is crucial for the virus’s entry into host cells. The receptor-binding site on the HA protein can bind to sialic acid receptors on host cells, thereby mediating the entry of viral particles into cells and causing infection in humans [4]. Consequently, many vaccines have been developed targeting the receptor-binding site of the HA protein to protect us from future seasonal influenza virus infections [2,5]. The quaternary structure of the HA protein is a homotrimer, with each monomer composed of two subunits, HA1 and HA2. The HA1 subunit contains antigenic determinant regions and critical binding sites for the virus to attach to host cells. Regrettably, mutations in the HA1 subunit can influence the virus’s antigenicity and receptor binding affinity, and the mutation rate is typically higher than in other regions [6]. Antigenic drift caused by accumulated sequence mutations greatly impedes the progress in the development of drug treatments and vaccines against potential influenza viruses. Therefore, the rapid detection of antigenic variation and accurate quantification of antigenic variation are crucial for designing and screening effective vaccines.
Numerous research studies have been conducted using HA protein sequence or structure to generate theoretical models and infer antigenic similarity based on sequence similarity [7,8,9,10,11,12,13,14,15,16,17,18]. Liao et al. [8] used the difference in each non-conserved residue between each pair of HA1 protein sequences as a feature to predict the antigenic distance between HA sequences based on a multiple linear regression model. Ref. [18] used a fitting model to infer two fitness components of the strains that were prevalent in a given year. Then, based on the fitness and frequency of each strain, they predicted the frequency of their descendant strains in the following year. Additionally, the study by Christopher et al. [16] first showed a good correlation between the results of the experimental antigenicity measurements and the antigenic distance prediction based on sequences. Other studies [19,20,21] further predicted the antigenic variations between influenza vaccines and circulating strains by exploring amino acid sequence mutations that identify epitope regions and their association with seasonal influenza. In order to enhance the performance of mutation prediction for particular residue sites of the influenza A virus, Yin et al. [19] built a time series sample by simulating the evolutionary path of HA sequences. Ref. [22] encoded protein sequences into numerical matrices, and subsequently input these matrices into downstream machine learning models, which was shown to improve the accuracy of predicting influenza antigenicity.
The immune efficacy of influenza vaccines mainly depends on how closely the vaccine and circulating virus strains match one another antigenically, so antigenic difference analysis is crucial for selecting vaccine strains [10]. The hemagglutination inhibition (HI) assay is currently the standard technique for determining the antigenic distance between circulating influenza virus strains and reference vaccines [23]. The titers acquired from the HI assay are used to generate antigenic cartography, which visualizes the antigenic characteristics of different virus strains on a two-dimensional plane. In antigenic maps, the antigenic distances and similarities between viral strains can be observed directly, and virus transmission patterns and vaccine strategies can then be analyzed accordingly. Smith et al. [24] plotted antigenic characteristics on antigenic maps using the metric multidimensional scaling (MDS) method of Lapedes and Farber [25]. The Euclidean distance in antigenic cartography directly describes the antigenic distance, thus providing a reliable quantitative interpretation of antigenic differences.
In order to improve conventional cross-reactivity experiments, Cai et al. [26] argued that reducing temporal bias in the distribution of HI data through modeling is important in antigenic mapping. They used a low-rank matrix completion method to complete the HI titer matrix and then applied the improved MDS method (MC-MDS) to generate the antigenic map. Neher et al. [20] introduced virus potency to explain the systematic changes in HI titers of a virus in multiple sera and used it as the basis for implementing and validating the standardized log-transformed titers relative to the homologous titer. Based on this observation, Lee [27] and Qiu [28] suggested improved antigenic distance calculation methods to reduce the impact of variations in experimental conditions.
Some efforts have been devoted to exploring the relationship between the genetic and antigenic characterization of influenza viruses. According to Koel et al. [29], genetic changes in H3N2 viruses are relatively persistent, while changes in antigenic features recognizable by the human immune system occur intermittently. This suggests that influenza viruses evolve in a way that evades recognition by the human immune system, making the development of effective vaccines more challenging. To conduct a cross-study of genetic and antigenic characterization, ref. [30] mapped the antigenic distance from HI titer data onto the HA lineage. They used available antigenic and genomic sequence data to explain the antigenic novelty and virus transmission rates across the population, and then determined the antigenic changes between clades of high growth. Bedford et al. [31] combined antigenic maps with genetic information on the four human influenza virus subtypes and found that the H3N2 subtype’s antigenic phenotype evolves faster than the other three subtypes. Moreover, there is a strong correlation between the antigenic drift of each influenza strain and the number of new influenza cases each year.
Previous works have tended to analyze the genetic and antigenic patterns of viruses in isolation. This work aims to use a unified space composed of antigenic and genetic features to model and analyze the evolutionary dynamics of influenza viruses, with the main challenges being how to integrate genetic information represented by amino acid sequences into the space and how to predict antigenic distance quantitatively. To this end, we propose an effective framework to jointly model antigenic and genetic features through antigenic network representation learning, with the ProtVec of HA1 sequences as node attributes and the antigenic distance converted from HI titers as edge weights. Our model then learns effective virus representations by introducing node attribute affinity to predict antigenic distances. We applied our proposed model to the H3N2 dataset, performed antigenic distance prediction tasks using the workflow shown in Figure 1, and studied the antigenic evolutionary dynamics of the H3N2 virus. The model takes two inputs: the attribute matrix capturing node attributes, embedded using ProtVec [32], and the link weight matrix representing antigenic distances. By calculating the similarity between attribute matrices, it reveals relationships and patterns among node attributes. The link weight matrix quantifies distances between strains using an antigenic distance formula. As the output, the model provides embedding vector representations of strain nodes. These vectors encode the essential characteristics and relationships of strains in a low-dimensional latent space.
Compared with previous methods, we significantly reduced the root-mean-square error of the prediction results and the classification index of antigenic differences. Through vector analysis based on representation learning, we found that the Pearson correlation between genetic distance and antigenic distance (between antigenic clusters) was 0.8694 to 0.9573, while the correlation within antigenic clusters was only 0.7119 to 0.8556. This suggests high global correspondence and some local differences between influenza genetic and antigenic evolution. Eventually, we found that a historical genetic variation of 0.05 ± 0.004813 led to antigenic drift events of H3N2 influenza virus.

2. Materials and Methods

2.1. Datasets

We used Trevor Bedford’s benchmark dataset [31], which contains 402 H3N2 virus isolates dated from 1968 to 2011.

2.1.1. HI Titers

HI titer data were obtained from official documents published by the World Health Organization. Under the framework of the World Health Organization’s global influenza surveillance network, influenza centers in various countries collect viral isolates of seasonal influenza and use the HI assay to calculate titers for antigenic characterization. The range of HI titers typically spans from 10 to 10,240, as lower dilutions may be subject to potential non-specific inhibition, while higher titers are usually not used. Due to the lack of sensitivity of HI assays beyond a certain threshold, accurate data for experimental results on H3N2 strains mainly exist between strains separated by no more than 14 years. At the same time, some assay data only retain the threshold value, such as “<40”. Specifically, we arrange the antigen and antibody of the HI assay experiments in a matrix according to their chronological order. This results in three types of data: (I) conventional accurate HI titer data showing a band-shaped distribution close to the diagonal of the matrix; (II) data that are lower or higher than a certain threshold, which is typically slightly deviated from the diagonal; and (III) entries lacking experimental data, which are more likely to occur in positions that are significantly off the diagonal.
The HI titer dataset used in our experiments includes 402 viral isolates and 114 antiserum samples. Of these isolates, 187 are from Europe and 215 are from other parts of the world. We obtained a total of 8599 individual HI titer values in the dataset, of which 1110 (12.9%) are type (II) data, i.e., values lower than a certain threshold indicated by “$< t$”, where $t \in \{5, 10, 20, 40, 80, 160\}$. Considering experimental variations in each study, we arranged the experimental values $t_{ij}$ for each isolate pair in descending order, removed the top 10%, and calculated the average of the remaining values as the mean titer $\overline{t_{ij}}$ between isolate i and reference isolate j.
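The averaging step described above can be illustrated with a minimal Python sketch. The function name `mean_titer`, the `drop_fraction` parameter, and the choice to round the number of discarded values down are assumptions made for illustration; the source does not state how fractional counts are handled.

```python
import math


def mean_titer(values, drop_fraction=0.10):
    """Average repeated HI titers for one isolate pair after discarding
    the highest `drop_fraction` of the measurements."""
    if not values:
        raise ValueError("at least one titer value is required")
    ordered = sorted(values, reverse=True)  # descending order
    # Assumption: the number of discarded values is rounded down.
    n_drop = math.floor(len(ordered) * drop_fraction)
    kept = ordered[n_drop:]
    return sum(kept) / len(kept)
```

With fewer than ten repeated measurements, nothing is discarded under this rounding rule and the function reduces to a plain arithmetic mean.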
In actuality, there are three difficulties in interpreting the results of hemagglutination inhibition (HI) assays. Firstly, in order to obtain optimal and reliable results, measurements must be performed under specific experimental conditions such as incubation time, red blood cell concentration, and red blood cell type, which can lead to differences in the results obtained under different experimental conditions. Secondly, HI titers are influenced by the affinity of hemagglutinin for red blood cells, and, to some extent, reflect differences in affinity [33]. In addition, HI assays depend heavily on the antibodies that bind near the receptor-binding domain, and therefore tend to measure responses to specific epitopes. Finally, due to the impossibility of performing HI assays for all pairs of antigens and antisera reactions, the combination of multiple datasets often leads to an incomplete HI titer matrix.
Smith [24] proposed a way of measuring the antigenic distance between two viruses based on HI titer data, where the antigenic distance between strain i and j is defined as follows:
$$w_{ij} = b_j - \log_2(t_{ij}) \tag{1}$$
where $b_j$ represents the maximum titer value of the serum j, i.e., $b_j = \log_2 \max(t_j)$. When the HI titer value is of the (I) type, $t_{ij}$ represents the maximum dilution of serum j that is necessary to inhibit the cell agglutination caused by the virus strain i. When the HI titer data are of the “$< t$” type, $t_{ij} = t$.
To enhance the traditional cross-reactivity assay method and reduce the impact of differences in receptor binding affinities among different virus strains, we were inspired by Neher’s method [20] of calculating relative titers and used the following formula to calculate antigenic distance:
$$w_{ij} = \log_2 \max(t_{jj}) - \log_2(t_{ij}) \tag{2}$$
According to the above formula, the antigenic distance between strain i and antiserum j is the deviation between the homologous titer of antiserum j and the titer of strain i against antiserum j, where $t_{jj}$ and $t_{ij}$ denote the corresponding titer data. This method, as a supplement, can effectively enhance the accuracy and reproducibility of cross-reactivity experiments and has better application effects in analyzing and comparing the receptor binding affinity between different virus strains. Two virus strains with an antigenic distance greater than 4 are considered antigenically different, leading to immune escape; otherwise, they are regarded as antigenically similar. After processing and calculating according to the above method, we finally identified 1543 antigenic difference pairs and 2744 antigenic similarity pairs among 402 strains.
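As a hedged illustration of the relative-titer distance and the threshold of 4 described above, the following Python sketch computes the distance from raw titers. The function names and the representation of homologous titers as a list are assumptions; for threshold-type data of the form “< t”, the source sets the titer to t before applying the formula.

```python
import math


def antigenic_distance(t_ij, homologous_titers):
    """Relative-titer distance: log2(max(t_jj)) - log2(t_ij).

    t_ij: titer of strain i against antiserum j.
    homologous_titers: measured titers of antiserum j against its own strain.
    """
    return math.log2(max(homologous_titers)) - math.log2(t_ij)


def is_antigenically_different(w_ij, threshold=4.0):
    """Strain pairs with a distance above the threshold count as different."""
    return w_ij > threshold
```

For example, a homologous titer of 1280 and a heterologous titer of 40 differ by five two-fold dilutions, so the pair would be classed as antigenically different.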
The logarithmic linear correlation concentration ratio defined by the shape space theory of Lapedes and Farber [25] is currently the most widely used method for calculating antigenic distance.
$$w_{ij} = \log \frac{t_{ii}\, t_{jj}}{t_{ij}\, t_{ji}} \tag{3}$$
There are only 1403 entries for antigenic distance, as calculated by Formula (3). However, trying to make comparable predictions from the sparse and coarser-grained data in these 1403 entries is more difficult. We validated our proposed method’s significant consistency with the popular method by performing a correlation analysis between the standardized logarithmic transformed antigenic distance calculations and the calculation method described in Formula (3) (measured with the Pearson correlation coefficient (PCC), R = 0.9629, 95% confidence interval: 94.86% to 97.68%, see Figure 2a). Similarly, we also calculated a correlation of 0.9815 between our proposed method and the method used by Smith. Based on these findings, it can be concluded that the standardized logarithmic transformation method for antigenic distance calculations can expand the utilization of influenza monitoring data, and we can better predict major antigenic changes.
A number of studies [20,27,34] have made efforts to explore the relationship between point mutations and influenza outbreaks based on a small number of amino acid characteristics. However, these models only measure the contribution of the selected amino acids as individuals. They lack the context of composite effects, which arise because amino acid changes in HA interact spatially within its three-dimensional structure. A previous study [35] analyzed the contributions of individual mutations and their related combined effects through CNN models. In addition, different amino acid residues have a significant impact on the antigenicity of the H3N2 virus, and how to measure the contribution of different residues is one of the key issues in sequence processing.

2.1.2. HA Sequences

The HA sequences of 402 virus strains were collected from databases such as NCBI and GISAID. The HA1 sequences of influenza A/H3N2 are 329 amino acids long. Sequences containing missing or abnormal amino acids (i.e., “-”) were removed, both manually and automatically, and the remaining sequences were aligned using MEGA. From 1968 to 2011, there were 0 to 30 amino acid changes between these strains. We defined the genetic distance between two HA sequences as the sum of pairwise distances between their 329 amino acids. To construct the antigenic network and represent the attribute information of its nodes, we embedded the HA sequences into a node attribute information matrix based on ProtVec. ProtVec [32] applies Skip–Gram to learn distributed embedding vector representations of influenza virus amino acid sequences, representing each run of three contiguous amino acids (a 3-gram) as a 100-dimensional vector. By training only on protein sequences, the ProtVec feature extraction method is able to capture various meaningful physical and chemical properties, and can serve as an informative and dense representation of biological sequences in protein family classification. Specifically, each HA sequence is represented as a list of 327 contiguous 3-grams, and a 327 × 100 matrix is generated as the node attribute information in Figure 2b. To represent 3-grams containing “-” in our study, we utilize the “<unk>” vector from ProtVec.
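The construction of the 327 × 100 node attribute matrix can be sketched as follows. This is a minimal illustration, not the authors' code: `lookup` is a hypothetical dictionary mapping each 3-gram to its pretrained ProtVec vector, and the overlapping sliding window is inferred from the stated count of 327 3-grams for a 329-residue sequence.

```python
import numpy as np


def three_grams(sequence):
    """Return all overlapping 3-grams; a length-k sequence yields k - 2."""
    return [sequence[i:i + 3] for i in range(len(sequence) - 2)]


def attribute_matrix(sequence, lookup, unk_key="<unk>"):
    """Stack one embedding row per 3-gram.

    lookup: hypothetical dict from 3-gram string to a 1-D vector
            (100-dimensional for ProtVec); it must contain `unk_key`.
    3-grams missing from the lookup, such as those containing "-",
    fall back to the "<unk>" vector.
    """
    unk = lookup[unk_key]
    rows = [lookup.get(gram, unk) for gram in three_grams(sequence)]
    return np.vstack(rows)
```

Applied to an aligned 329-residue HA1 sequence with a 100-dimensional lookup, the result has shape (327, 100).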

2.2. Antigenic Network Representation Learning

In recent years, some interpretable attributed network embedding algorithms have emerged. For example, accelerated attributed network embedding (AANE) [36] decomposes the heterogeneous information similarity matrix and penalizes embedding differences between adjacent nodes to preserve the similarity of nodes in the original network space on the new continuous vector representation. Inspired by this observation, this paper discusses the possibility of integrating the node attribute information similarity matrix represented by the HA sequence of influenza virus H3N2 and the topological structure represented by antigenic distance into the network embedding representation learning, and whether this method can help to better learn node vectors in the antigenic network. In addition, the rapid mutation and seasonal epidemics require the scalability of antigenic network embedding representation. AANE embeds noisy network topology and node attributes to improve the model’s time complexity and convergence speed.
The graph-based attribute network embedding method demonstrates strong robustness in handling missing data. In cases where certain antigenic distances are missing, the embedding method can utilize the attribute information of the nodes to impute the missing values, thereby partially compensating for the loss of information. By using the ProtVec representation of the genetic sequences as node attributes, we were able to exploit the informative features of each strain. This not only enriches the available information within the network but also facilitates the extraction of meaningful patterns and relationships between strains. Meanwhile, graph networks inherently possess the ability to model non-linear interactions and dependencies, which may be crucial for capturing the complexity of virus strain evolution. This flexibility allows our model to better capture the intricate patterns and dynamics present in the antigenic distance data.
We define the embedding learning of the antigenic network as follows. Consider an antigenic network $G = (V, E, W)$, where $V = \{v_1, v_2, \ldots, v_n\}$ represents the set of n strain nodes. The attributes of each node are represented by the matrix $p_i \in \mathbb{R}^{(k-2) \times m}$, which is obtained by embedding the amino acid sequence of length k using ProtVec. There is also a set of edges between strain nodes in the network, $E = \{e_{ij}\}_{i,j=1}^{n}$, and each edge $(i, j) \in E$ is associated with a non-negative antigenic distance $w_{ij} \in W$; $e_{ij}$ is an unobserved edge when $w_{ij} = -1$. This study assumes that the antigenic distances between strains are non-negative and symmetric, and uses them as the edge weights between antigenic network nodes. We use AANE for embedding learning to obtain a d-dimensional vector $h_i \in H$ for each node $v_i$. This embedding method produces a representation of every node in the network space that combines genetic and antigenic properties, allowing H to maintain the adjacency of both the topological structure and node attributes simultaneously.
The strain vectors learned through network embedding representation learning should have the following four advantages: (1) low dimensionality, i.e., the embedding dimension d should be smaller than the length of the original sequence, in order to improve the efficiency of downstream tasks; (2) preserving the antigenic features of the original antigenic network structure, i.e., nodes with structural similarity should still be similar in the new space; (3) preserving the genetic features of the original antigenic network, i.e., the similarity of original HA1 sequence should be well captured to complement rather than degrade the expression of the network structure; and (4) the pairwise similarity of node embedding vectors should reflect the pairwise similarity of nodes in the original antigenic network. Compared with pure network embedding, network embedding representation H, which concurrently maintains the topological structure and node attribute information, performs better on the link weight prediction challenge.

2.2.1. Network Topological Structure Modeling

To maintain the proximity of nodes in the network while improving the performance of antigenic distance prediction, we first assume that strains with similar topological structures or connected by smaller weighted edges are more likely to have similar embedding vector representations. To accomplish this, we suggest the following loss function to reduce the differences in embedding between linked strain nodes:
$$L_G = \sum_{(i,j) \in E} \left( w_{ij} \, \lVert h_i - h_j \rVert_2 \right) \tag{4}$$
where $h_i$ and $h_j$ are the vector representations of node $v_i$ and node $v_j$, and $w_{ij}$ is the edge weight between them. The key idea of this loss function is to minimize $w_{ij} \lVert h_i - h_j \rVert_2$. For the antigenic network, a smaller antigenic distance $w_{ij}$ needs to force the difference between the vector representations $h_i$ and $h_j$ of two strains to be smaller. By using the 2-norm $\lVert h_i - h_j \rVert_2$ as the difference metric between vectors, it can not only characterize the distance between vectors but also alleviate the negative effects of outliers and missing data.
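The topology loss, a sum over observed edges of the edge weight times the 2-norm of the embedding difference, can be written as a short NumPy sketch. The edge-list representation `(i, j, w_ij)` and the function name are assumptions for illustration only.

```python
import numpy as np


def topology_loss(H, edges):
    """Sum of w_ij * ||h_i - h_j||_2 over the observed edges.

    H: (n, d) array of node embeddings, one row per strain.
    edges: iterable of (i, j, w_ij) tuples for observed edges only.
    """
    return float(sum(w * np.linalg.norm(H[i] - H[j]) for i, j, w in edges))
```

Because the 2-norm is not squared, a single distant pair contributes linearly rather than quadratically, which is the property the text credits with dampening the effect of outliers.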

2.2.2. Node Attribute Proximity Modeling

According to social science theories such as homophily and social influence [37,38], node attribute information is closely related to network topology. Thus, the similarity of nodes in the network space should be consistent with the similarity of nodes in the attribute space. Inspired by symmetric matrix factorization, the product of $H$ and $H^{T}$ approximates the node attribute similarity matrix S. The basic idea is to force the dot product of the embedding representations $h_i$ and $h_j$ to be similar to the corresponding entry of the attribute similarity matrix. Therefore, to maintain node attribute proximity, we define the loss function as follows:
$$L_S = \lVert S - H H^{T} \rVert_F^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \left( s_{ij} - h_i h_j^{T} \right)^2 \tag{5}$$
where the matrix $S \in \mathbb{R}^{n \times n}$ represents the 2-norm between the node attribute matrices, capturing the attribute similarity and the differences between different strains in the joint space. Specifically, given the node attribute information $a, b \in \mathbb{R}^{(k-2) \times m}$ based on ProtVec representation, the formula for calculating node attribute similarity is as follows:
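$$s_{ij} = \sqrt{\sum_{u=1}^{k-2} \sum_{v=1}^{m} \left( a_{uv} - b_{uv} \right)^2} \tag{6}$$
where a and b are the attribute matrices of strains i and j, and u and v index their rows and columns.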
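The attribute matrix S and the attribute loss can be sketched in NumPy as below. This follows the definition given in the text, where each entry of S is the 2-norm of the difference between two node attribute matrices; the function names and the Gram-matrix shortcut used to avoid materializing all pairwise differences are illustrative choices, not taken from the source.

```python
import numpy as np


def attribute_similarity(attrs):
    """Pairwise 2-norm between node attribute matrices.

    attrs: (n, k - 2, m) array, one ProtVec attribute matrix per strain.
    Returns an (n, n) matrix S with S[i, j] = ||attrs[i] - attrs[j]||.
    """
    flat = attrs.reshape(attrs.shape[0], -1)
    sq = np.sum(flat ** 2, axis=1)
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, clipped against round-off.
    d2 = np.clip(sq[:, None] + sq[None, :] - 2.0 * flat @ flat.T, 0.0, None)
    return np.sqrt(d2)


def attribute_loss(S, H):
    """Squared Frobenius norm of S - H H^T."""
    return float(np.sum((S - H @ H.T) ** 2))
```

For 402 strains with 327 × 100 attributes, the shortcut keeps memory at the size of the flattened attribute array plus one 402 × 402 matrix.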

2.2.3. Antigenic Network Embedding Representation Learning

According to Equations (4) and (5), we have implemented two loss functions, $L_G$ and $L_S$, to fit the similarity between network topology and node attributes. To make them complement each other and form a unified network representation space, our optimization objective is the following function:
$$J = \rho \sum_{(i,j) \in E} \left( w_{ij} \, \lVert h_i - h_j \rVert_2 \right) + \lVert S - H H^{T} \rVert_F^2 \tag{7}$$
Here, $\rho$ functions as both a regularization parameter that balances the number of clusters and a global parameter that defines the contribution of network structure and attribute information to the node representation. The intuitive explanation is that when $\rho$ approaches 0, the network topology cannot affect the final node representation H, and each strain will reflect the similarity of node attributes. When $\rho$ is sufficiently large, the vector representation of all nodes will tend to reflect the network’s topology fully. We need to use the Euclidean distance in the new network space to calculate the predicted antigenic distance between any two strains $v_i$ and $v_j$. To ensure the accuracy and reliability of our predictions, we will follow widely adopted methods for validating network embedding quality.
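To make the role of ρ concrete, the following sketch evaluates the joint objective for a given embedding and computes a predicted antigenic distance as the Euclidean distance between two embedding rows. It only evaluates the objective; the optimization procedure used by AANE is not reproduced here, and the function names and edge-list format are assumptions.

```python
import numpy as np


def joint_objective(H, S, edges, rho):
    """rho * sum_E w_ij * ||h_i - h_j||_2  +  ||S - H H^T||_F^2.

    H: (n, d) embeddings; S: (n, n) attribute matrix;
    edges: iterable of (i, j, w_ij) for observed edges.
    """
    topology = sum(w * np.linalg.norm(H[i] - H[j]) for i, j, w in edges)
    attribute = np.sum((S - H @ H.T) ** 2)
    return float(rho * topology + attribute)


def predicted_distance(H, i, j):
    """Predicted antigenic distance: Euclidean distance in embedding space."""
    return float(np.linalg.norm(H[i] - H[j]))
```

Setting `rho=0` leaves only the attribute term, which matches the limiting behaviour described in the text.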

3. Results

We will address the following three questions through experiments: (1) Does supplementing genetic information represented by HA1 sequences better predict antigenic distances based on antigenic network learning compared to using only antigenic features? (2) Can the AANE method used in antigenic network learning achieve more accurate antigenic distance prediction than other attribute network embedding learning methods? We will also further discuss the sensitivity of model parameters, including the regularization parameter ρ of the loss function and the dimension d of the node representation vector. (3) By modeling genetic and antigenic features in a combined space, can we gain new insights into the evolution of H3N2 viruses?

3.1. Baseline

To answer the aforementioned questions, we will use the following models as baselines and describe them in detail:
  • Node2vec [39] is a Skip–Gram-based algorithm that generates node sequences through biased random walks. Hyperparameters p and q are used to control the random walks, and we adjust them according to the original paper. When p and q are both 1, node2vec is equivalent to DeepWalk. Node2Vec only uses antigenic distance for an edge-weighted biased random walk.
  • LINE [40] constructs the network using only antigenic distance and generates context nodes through breadth-first search, where a node’s neighboring nodes are limited to those that are at most two hops away. LINE models both first-order and second-order similarities for each node and concatenates the two learned embedding vectors according to the actual scenario.
  • LINE1 constructs the network using only antigenic distance and models first-order similarity to learn node representation, which mainly constrains directly connected nodes.
  • LINE2 also constructs the network using only antigenic distances, but focuses on the neighborhood similarity of nodes and learns node representation by preserving second-order similarity.
  • Attri2vec [41] learns node representation by performing linear or non-linear mapping on node attributes. To preserve structural similarity, it uses the DeepWalk learning mechanism so that nodes with similar random walk contexts have similar dense representations in the subspace. This is achieved by maximizing the probability of the appearance of context nodes conditioned on the target representation.
  • GCN [42] is a graph-specific model that applies convolution on graph nodes to generate representations for each node.
  • By employing the masked self-attention layer, GAT [43] overcomes certain limitations present in existing methods. The key aspect of GAT lies in stacking multiple layers, where each layer can implicitly assign varying weights to neighboring nodes without the need for costly matrix operations or prior knowledge of the graph structure.
  • GraphSAGE [44] utilizes local neighborhood sampling to aggregate features and generate embeddings. Subsequently, a minibatch forward propagation algorithm is employed to train the data.
  • GALA [45] proposes a symmetric graph convolutional autoencoder for generating low-dimensional latent representations of graphs. Compared to existing graph autoencoders, it features a newly designed symmetric decoder that effectively utilizes the graph structure for reconstructing node features.
  • TADW [46] not only considers the structural information of nodes but also utilizes the text information of nodes. It implements the DeepWalk idea through matrix factorization and introduces node text information to improve the expression of embedding vectors.

3.2. Evaluation of Antigenic Distance Prediction Performance

The performance of antigenic distance prediction is evaluated using two metrics: the root mean square error (RMSE) and the Pearson correlation coefficient (PCC) between the predicted and actual antigenic distances. These metrics reflect different aspects of prediction performance: RMSE amplifies the differences between larger errors and quantifies the degree of proximity between the prediction and the average true value, while PCC measures the relative trend between the two. Given n true antigenic distances $w_{ij}$ and predicted antigenic distances $d_{ij}$, the corresponding metrics are defined as follows:
$$RMSE = \sqrt{\frac{1}{n} \sum_{(i,j) \in E} \left( w_{ij} - d_{ij} \right)^2} \tag{8}$$
$$PCC = \frac{\sum_{(i,j) \in E} \left( w_{ij} - \overline{w_{ij}} \right) \left( d_{ij} - \overline{d_{ij}} \right)}{\sqrt{\sum_{(i,j) \in E} \left( w_{ij} - \overline{w_{ij}} \right)^2} \, \sqrt{\sum_{(i,j) \in E} \left( d_{ij} - \overline{d_{ij}} \right)^2}} \tag{9}$$
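Both metrics are standard and can be computed directly with NumPy. The sketch below uses the usual definitions of RMSE and the Pearson correlation coefficient; the function names are illustrative.

```python
import numpy as np


def rmse(true_w, pred_d):
    """Root mean square error between true and predicted distances."""
    w = np.asarray(true_w, dtype=float)
    d = np.asarray(pred_d, dtype=float)
    return float(np.sqrt(np.mean((w - d) ** 2)))


def pcc(true_w, pred_d):
    """Pearson correlation coefficient between true and predicted distances."""
    w = np.asarray(true_w, dtype=float)
    d = np.asarray(pred_d, dtype=float)
    wc, dc = w - w.mean(), d - d.mean()
    return float(np.sum(wc * dc) / (np.sqrt(np.sum(wc ** 2)) * np.sqrt(np.sum(dc ** 2))))
```

A prediction that is a positive rescaling of the truth has a PCC of 1 but a non-zero RMSE, which is why the two metrics are reported together.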
Furthermore, we will evaluate the model’s ability to detect antigenic drift based on the prediction results. We will assess this capability using four evaluation metrics: accuracy, precision, recall, and F1 score.
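One plausible way to obtain these four metrics is to threshold both the true and the predicted antigenic distances and compare the resulting labels. The sketch below assumes the threshold of 4 used earlier to separate antigenically different pairs and treats "different" as the positive class; the source does not spell out this procedure, so both choices are assumptions.

```python
def drift_metrics(true_w, pred_d, threshold=4.0):
    """Accuracy, precision, recall and F1 for detecting antigenic difference.

    A pair is labelled positive when its distance exceeds `threshold`
    (assumed to be 4, the cutoff for antigenically different pairs).
    """
    tp = fp = fn = tn = 0
    for w, d in zip(true_w, pred_d):
        actual, predicted = w > threshold, d > threshold
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```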
With a default embedding dimension, d, of 50, we set the baseline method’s settings as recommended in the original paper. All experimental results are arithmetic averages of 10 tests. We divide the dataset into a training set and a test set (75% and 25%), and perform multiple rounds of training and evaluation through 3-fold cross-validation by dividing the training set into three subsets and using these subsets alternately as validation sets. Based on the preliminary findings from Figure 3, we found that the methods based on network embedding representation learning achieved significantly better performance than LINE1 in predicting antigenic distances on the H3N2 dataset. We attribute this phenomenon to the following two reasons: (1) the LINE model was applied to millions of data points, which is vastly different from the sample size used in this experiment; and (2) in contrast, LINE1 only models first-order proximity, which cannot capture enough information for link weight prediction tasks.
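The 75%/25% split followed by 3-fold cross-validation on the training portion can be sketched as an index-splitting routine. Whether the split is made over strain pairs (edges) or over strains is not stated in the text; the sketch assumes it is made over the observed antigenic distance entries, and the function name and seed handling are illustrative.

```python
import numpy as np


def split_edges(n_edges, test_fraction=0.25, n_folds=3, seed=0):
    """Shuffle edge indices, hold out a test set, and build CV folds.

    Returns (train_idx, test_idx, folds), where folds is a list of
    (fit_idx, val_idx) pairs drawn from the training indices.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_edges)
    n_test = int(round(n_edges * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    parts = np.array_split(train_idx, n_folds)
    folds = []
    for k in range(n_folds):
        val_idx = parts[k]
        fit_idx = np.concatenate([parts[m] for m in range(n_folds) if m != k])
        folds.append((fit_idx, val_idx))
    return train_idx, test_idx, folds
```

Each training subset serves once as the validation set while the other two are used for fitting; the held-out test indices are never seen during cross-validation.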
Next, we evaluate the impact of incorporating node attribute information. To make a fair comparison with models that utilize node attributes, we use Node2Vec, LINE1, LINE2, and LINE as controls. These methods consider only antigenic distance as the link weight, learn node vectors from network structure alone, and then predict antigenic distance. As shown in Figure 3a,b, on the dataset of all 402 nodes, the model based on attribute network representation learning outperforms these controls in terms of both RMSE and PCC. This supports our hypothesis that the proposed combination of genetic and antigenic features contributes to antigenic distance prediction. Comparing across methods, Figure 3a,b suggests that utilizing both network structure and node attribute information is beneficial for downstream link prediction tasks.
To further evaluate the performance gain in antigenic distance prediction on the antigenic network, and to verify the robustness of the AANE model, we reduced the number of nodes in the network by 10%, 20%, 30%, 40%, 50%, and 60%. Each experiment was repeated 5 times and the results were averaged to test the model’s robustness in predicting missing antigenic distances. For example, as the number of deleted nodes increased, LINE (grayish green line in Figure 3b) and Node2Vec (bluish violet line in Figure 3b) showed an overall decreasing trend in RMSE, with a final decrease in the predictive performance of 24.6% and 24.9%, respectively.
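The node-removal experiment can be sketched as below. This is a hedged illustration: the helper name `remove_nodes`, the uniform random sampling, and the edge-tuple layout are assumptions made for the example, since the text does not specify how nodes were chosen beyond being removed at fixed percentages.

```python
import random


def remove_nodes(edges, fraction, seed=0):
    """Randomly delete a fraction of nodes and drop every edge touching them.

    edges: list of (strain_u, strain_v, antigenic_distance) tuples.
    Returns (remaining_edges, removed_nodes).
    """
    nodes = sorted({u for u, _, _ in edges} | {v for _, v, _ in edges})
    rng = random.Random(seed)
    n_remove = int(len(nodes) * fraction)
    removed = set(rng.sample(nodes, n_remove))
    remaining = [
        (u, v, w) for u, v, w in edges if u not in removed and v not in removed
    ]
    return remaining, removed


# Toy complete graph on 10 strains (hypothetical distances).
toy_edges = [(u, v, float(v - u)) for u in range(10) for v in range(u + 1, 10)]
for fraction in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    # Five repetitions per removal level, each with a different seed.
    sizes = [len(remove_nodes(toy_edges, fraction, seed=s)[0]) for s in range(5)]
    print(fraction, sizes)
```

In a full experiment, each reduced edge list would be passed to the embedding model and the resulting metrics averaged over the repetitions.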
According to the experimental results in Figure 3a,b, AANE consistently outperformed Attri2Vec, GCN, GAT, GraphSAGE, GALA, and TADW. All of these approaches execute node embedding learning and represent the network using node attributes and link weights. This illustrates how effective AANE is. For example, on the dataset consisting of all nodes, AANE achieved a 38.1% and 13.1% improvement over TADW and GALA, respectively. Although TADW is effective in learning information-based node embeddings using rich node text features, its mechanism is not as straightforward, i.e., a clear objective is not provided for how the network structure and node attributes interact with each other.
We compared the predicted antigenic distances of all models with the actual antigenic distances by linear fitting (Figure 4). The results show that the attribute network embedding method had a greater advantage in reducing FN data and increasing TN data. Meanwhile, AANE’s predicted values had good robustness and a more uniform distribution when linearly fitted with the actual distances. The reason for Node2vec’s performance exceeding our expectations may be explained here. We believe that Node2vec does not reflect structural information well. Due to the limited sample size and walk length, it is almost impossible to include two structurally similar nodes in the same sequence through biased walks when the distance between them is very large. This is also related to hyperparameter selection. We set p as 1 and q as 2. The larger q is, the more the embedding tends to express homogeneity. As the number of nodes decreases, nodes with experimental data are often those that are close to their own antigenic distances. Therefore, when expressing network structure, it tends to embed peripheral and central nodes as similar vectors (TN- and FN-predicted values in Figure 4b,f occupy a considerable part). In fact, these predicted values, which fit the original data distribution more closely, can obtain better predictive ability. We followed the author’s recommendation and did not stack higher layers of GCN. Perhaps there will be better performance with more than three layers of GCN, but this is not within the scope of this paper. GCN has the same drawback as Attri2Vec—the over-smoothing problem (that is, after multiple layers of stacking, the node’s representation vectors tend to be consistent, and the nodes are difficult to distinguish). Due to the low-pass filter effect of GCN, the aggregated features continuously merge the node features, which tend to be the same after multiple iterations. GAT has more parameters than GCN and is trained in a full-batch manner. 
GAT only considers 1-hop neighbors and does not utilize higher-order neighbors; when higher-order neighbors are used, over-smoothing is prone to occur. Attri2Vec has a strong bias toward node attributes, so it is not practical to maximize the attribute information difference among strains that differ by no more than 30 amino acids, even after ProtVec representation.
We tested the ability of all models to successfully predict antigenic escape (as described in Section 2.1, where the antigenic distance between two strains is higher than the antigenic escape threshold ($d_{ij} = 4$)). The main evaluation metrics for these qualitative results are accuracy, precision, recall, and F1 score, as shown in Figure 5. Our results show that the AANE model predicted antigenic escape between two strains with an accuracy of 91.25%, and the other metrics also showed significant advantages.
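The qualitative evaluation can be sketched by thresholding both the true and predicted distances and counting the four confusion-matrix outcomes. The strict inequality (distance greater than 4 counts as escape) follows the wording "higher than" above and is an assumption about the exact boundary handling; the function name and toy values are hypothetical.

```python
def escape_metrics(true_d, pred_d, threshold=4.0):
    """Accuracy, precision, recall, and F1 for antigenic escape detection.

    A strain pair counts as an escape (positive) when its distance
    exceeds the threshold.
    """
    tp = fp = tn = fn = 0
    for w, d in zip(true_d, pred_d):
        actual, predicted = w > threshold, d > threshold
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy distances (hypothetical): 2 TP, 1 FN, 1 FP, 2 TN.
true_distances = [1.0, 5.0, 6.0, 2.0, 4.5, 3.0]
pred_distances = [1.2, 4.8, 3.5, 4.2, 5.0, 2.0]
print(escape_metrics(true_distances, pred_distances))
```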
Figure 3 and Figure 5 together present a comprehensive analysis, including quantitative evaluation and statistical measures, to validate the accuracy and reliability of our defined distances. We compared the predictive performance on all benchmark models using the antigenic distances obtained with the computational method defined by Smith. As for Equation (3), we chose to discard this set of comparisons because of the small number of entries obtained; we could not obtain more substantial results with these network embedding methods. On the antigenic dataset defined by Smith, we found erratic metric fluctuations as the number of nodes was sequentially reduced, a phenomenon that occurs in most of the benchmark methods. As we feared, the formula defined by Smith tended to yield certain fixed values when applied to the titer data we collected (also visible in Figure 2a), even though it allowed us to obtain more entries of antigenic data. This comparative analysis helps to confirm the validity of our proposed method and to give a clearer picture of the differences between the “actual” antigenic distances in our study and the established criteria.

3.3. Parameter Sensitivity Study

In this section, we examined the effects of two important parameters, $\rho$ and $d$. As described in Section 2.2.3, $\rho$ in AANE balances the contribution of network structure and node attributes. To study the effect of $\rho$, we varied it from $10^{3}$ to $10^{6}$. As there are up to 35 or more different combinations of hyperparameters $(a, b)$, we only give the optimal value of the other parameter $b$ under a specific parameter $a$. Table 1 shows the RMSE and PCC results of antigenic distance prediction under different $\rho$ values. Setting $\rho = 10^{3}$ almost ignores the influence of network structure information, and nodes tend to have the same embedding vector representation. As $\rho$ increases, AANE predicts the antigenic distance based on the topological structure, and the performance gradually improves. As the results show, when $\rho$ is close to $10^{5}$, the performance of antigenic distance prediction peaks. When $\rho$ continues to increase, the performance decreases, as larger $\rho$ values tend to make all nodes too dependent on sparse structural information. However, we cannot directly infer from this so-called optimal value that genetic information only contributes 0.001% to the distance prediction task, because the two terms are on intuitively very different scales. Nevertheless, the overall trend indicates to some extent that genetic information contributes to the advantage of using attribute network embedding for antigenic distance prediction, as shown by the improvement in the RMSE value.
Following the rules we established when building the network representation model (Section 2.2), dimension d should be less than 329. Specifically, we changed the embedding dimension from 20 to 150, and Table 1 shows the prediction performance on the dataset. From the results, we found that by increasing d, the performance of the method first increases and then remains stable. This indicates that low-dimensional representations perform well in capturing most of the meaningful information. In reality, determining an appropriate dimension is not easy, especially for antigenic distance prediction. Although a lower dimension has lower time and space complexity, it will undoubtedly lose a lot of information originally present in the network. Higher dimensions may improve reconstruction accuracy to some extent, but at the same time, the Euclidean distance between vectors of higher dimensions will likely become larger. Based on our experimental results, we can conclude that model performance is relatively stable within a small range of node embedding dimensions, and performance declines when the node embedding dimension is too small or too large.

3.4. Antigenic Evolution Dynamic Analysis

The effectiveness of the model has enabled us to explore the dynamics of influenza antigenic evolution based on vector representation in the joint space. As shown in Figure 6, we first performed preliminary clustering by year and calculated the average antigenic distance $D(i, i+1)$ between all pairs of strains from year $i$ and $i+1$ for the 44-year period, resulting in 41 data points (except for 1978 in the H3N2 dataset). Then, we merged adjacent clusters that had the smallest antigenic distance ($D(i, i+1) < 4$) without antigenic variation, and each new cluster was named after the earliest year in the cluster. For example, 1982 and 1983 strains had the smallest antigenic distance (0.045804) and were merged into a new cluster, followed by recalculating the average antigenic distance to 1981 and 1984 strains. All strains were finally grouped into seven significant antigenic drift episodes, as seen in Figure 6b. We calculated the antigenic distance between each pair of strains in each event. In an antigenic drift event $E$ including $n$ strains, the antigenic variation level of strain $i$ was defined as
$$C_i = \frac{\sum_{j \in E} d_{ij}}{n - 1}$$
where $d_{ij}$ represents the antigenic distance between strain $i$ and all other strains $j$ within the same event $E$. The strain with the smallest antigenic variation value within the current cluster is chosen as the dominant strain (which has the smallest average antigenic distance to all other strains in the cluster) and is used to name the event. Between 1968 and 2011, we discovered seven significant antigenic drift events: BI68, BI73, LY79, VI87, MA93, FU00, and ST09.
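The clustering procedure and the choice of dominant strain can be sketched as follows. This is a simplified reading of the steps described above, not the authors' code: the function names, the greedy "merge the closest adjacent pair until no gap is below the threshold" loop, and the one-dimensional toy coordinates standing in for learned embeddings are all assumptions. The sum in $C_i$ skips $j = i$, which is equivalent to the formula whenever $d_{ii} = 0$.

```python
def mean_between(cluster_a, cluster_b, dist):
    """Average antigenic distance over all cross-cluster strain pairs."""
    total = sum(dist(x, y) for x in cluster_a for y in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def merge_adjacent(clusters, dist, threshold=4.0):
    """Repeatedly merge the closest pair of adjacent (time-ordered) clusters.

    Stops when the smallest adjacent average distance reaches the threshold.
    """
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        gaps = [
            mean_between(clusters[k], clusters[k + 1], dist)
            for k in range(len(clusters) - 1)
        ]
        k = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[k] >= threshold:
            break
        clusters[k:k + 2] = [clusters[k] + clusters[k + 1]]
    return clusters


def variation_levels(strains, dist):
    """C_i: mean distance from strain i to the other n - 1 strains in the event."""
    n = len(strains)
    return {
        i: sum(dist(i, j) for j in strains if j != i) / (n - 1) for i in strains
    }


def dominant_strain(strains, dist):
    """Strain with the smallest antigenic variation level C_i."""
    if len(strains) == 1:
        return strains[0]
    levels = variation_levels(strains, dist)
    return min(levels, key=levels.get)


# Toy strains on a line (hypothetical positions stand in for embeddings).
positions = {"a1": 0.0, "a2": 1.0, "b1": 2.0, "c1": 9.0, "c2": 10.0}


def toy_distance(x, y):
    return abs(positions[x] - positions[y])


yearly = [["a1", "a2"], ["b1"], ["c1", "c2"]]
events = merge_adjacent(yearly, toy_distance)
print(events, [dominant_strain(e, toy_distance) for e in events])
```

In the toy data the first two yearly clusters merge because their average distance is below 4, while the third stays separate and therefore marks a distinct drift event.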
Based on the clustering results, we quantified the relationship between antigenic and genetic distances among strains. The genetic distance was calculated from the amino acid sequence differences between strains. The difference between the number of antigenic drift events identified in our study and the number reported by Smith et al. [24] could be attributed to various factors, including the datasets used, the definition of antigenic distance, and the specific criteria employed to define antigenic drift events. Furthermore, we examined the approximate evolutionary relationship between H3N2’s genetic and antigenic characteristics. First, on the global time scale, the Pearson correlation coefficient between genetic and antigenic distances was 0.8559, indicating a roughly linear relationship between genetic and antigenic differences during the inter-epochal evolution of influenza, which is consistent with the relationship observed by Smith et al. [24]. We randomly selected some strains within and between each cluster event to calculate their average genetic and antigenic distances. Surprisingly, the genetic–antigenic evolutionary relationship between clusters showed a stronger linear pattern than that within clusters (Figure 7). Moreover, the genetic and antigenic evolution between the seven adjacent antigenic drift events showed a linear correlation of 0.8694 to 0.9573, while the evolution within clusters was characterized by discontinuous development (see Figure 8). Furthermore, we calculated that an average of 0.05 ± 0.004813 units of genetic variation led to the occurrence of an antigenic drift event. However, the distribution of genetic distances between different antigenic drift events varied greatly, whereas the genetic distances within clusters were more concentrated and usually very small. This aligns with earlier studies [47,48], which suggested that strong selection and neutral antigenic evolution alternated during antigenic drift events.
As a result, the new vector space learned by the antigenic network representation learning method can explain the short-term and long-term patterns of the relationship between genetic and antigenic distances. Moreover, this method greatly improves the resolution and accuracy of antigenic differences.

4. Discussion

There have been some notable achievements in the field. For instance, ref. [49] proposed a novel approximation method for antigenic distance, which is based on deep learning in the feature space induced by hemagglutinin protein sequences and convolutional neural networks (CNNs), while ref. [50] evaluated the predictive capability of their method by conducting laboratory measurements. We recognize the importance of a thorough comparison and evaluation with other relevant methods. The present study is an initial exploration of the proposed method, and its experimental results provide a foundation for future investigations and advances.
Historical experience shows that antigenic data cannot significantly reflect genetic data. This article proposes a method based on the AANE model for network representation learning to integrate HA sequence data into the antigenic network structure. This method can quantitatively predict the antigenic differences between strains well. Since the HA sequence and titer data provide different sources of information, it is crucial to capture their key features to learn the comprehensive representation of strains in the antigenic network. For instance, specific epitope positions on the HA protein and the amino acids that make up the epitope may differ due to host species and genotype. The affinity matrix can map specific amino acid mutations between different nodes to the perspective of the entire sequence, which is another reason why we think it should be introduced. Its biological characterization distinguishes the differences in the three continuous amino acid sequences.
In this paper, the antigenic distance is inferred using both genetic and antigenic data, which differs from the positions inferred solely from the HI data. If the HI data are abundant, we expect smaller differences in prediction when genetic data are included (possibly only for H3N2). In contrast, if the HI data are limited, we expect genetic data to play a greater role in determining antigenic positions. As described in Section 3.3, genetic and antigenic features are not complementary, but rather genetic features complement antigenic features to explain quantitative differences in antigenicity. This is in line with our original intention of incorporating genetic information into the network structure to improve antigenic distance prediction performance. Furthermore, our validation of the contribution of hyperparameters of the loss function to the prediction task also supports the same viewpoint.
Little is known about the relationship between influenza virus antigenic phenotype changes and genetic sequence changes. A helpful framework for investigating influenza is provided by the work of Bedford et al. [31], which may be used to identify which alterations to virus genes lead to changes in antigenicity. Our model for antigenic network learning based on AANE optimizes the observed different patterns (such as homogeneity and structure) in the network, proposes a robust objective function, and makes some assumptions about the relationship between the underlying network structure and the prediction task. We can determine its effectiveness and scalability by observing how the learned vectors reflect the relationship between genetic evolution and antigenic evolution. Information about the virus’s “antigenic dynamics” can be reflected in the evolutionary perspective captured between antigenic drift events, especially in the long-term and short-term patterns of antigenic evolution.

5. Conclusions

The main contributions of this paper can be summarized as follows. We propose an effective attribute network embedding framework to learn low-dimensional representations of strains from both the node attribute affinity matrix and topological structure information. By expanding the utilization of influenza surveillance data and the representation of sequence biology significance, we validated the basis of joint spatial modeling, which supports the combination of genetic and antigenic features on real datasets. Through the learned low-dimensional representations, we can better predict the antigenic distance between any strains in the network and explore the new dynamics of antigenic evolution.

Author Contributions

Conceptualization, W.L. and Y.X.; methodology, W.L. and F.P.; software, F.P.; validation, F.P.; formal analysis, F.P. and W.L.; investigation, W.L., Y.X. and F.P.; resources, W.L.; data curation, W.L.; writing—original draft preparation, F.P.; writing—review and editing, W.L., F.P. and Y.X.; visualization, F.P.; supervision, W.L.; project administration, W.L.; funding acquisition, W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (Grant No. 32060151), and the Yunnan Provincial Foundation for Leaders of Disciplines in Science and Technology, China (Grant No. 202305AC160014).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The antigenic dataset of HI titers was compiled by Bedford et al. and can be obtained at https://datadryad.org/stash/dataset/doi:10.5061/dryad.rc515 (accessed on 28 April 2023); the code implementation and documents related to this paper can be obtained at https://github.com/john-darwin/Antigenic-Distance-Prediction (accessed on 29 June 2023).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Agor, J.K.; Özaltın, O.Y. Models for predicting the evolution of influenza to inform vaccine strain selection. Hum. Vaccines Immunother. 2018, 14, 678–683.
  2. Allen, J.D.; Ross, T.M. H3N2 influenza viruses in humans: Viral mechanisms, evolution, and evaluation. Hum. Vaccines Immunother. 2018, 14, 1840–1847.
  3. Iuliano, A.D.; Roguski, K.M.; Chang, H.H.; Muscatello, D.J.; Palekar, R.; Tempia, S.; Cohen, C.; Gran, J.M.; Schanzer, D.; Cowling, B.J.; et al. Estimates of global seasonal influenza-associated respiratory mortality: A modelling study. Lancet 2018, 391, 1285–1300.
  4. Kumlin, U.; Olofsson, S.; Dimock, K.; Arnberg, N. Sialic acid tissue distribution and influenza virus tropism. Influenza Other Respir. Viruses 2008, 2, 147–154.
  5. Neu, K.E.; Dunand, C.J.H.; Wilson, P.C. Heads, stalks and everything else: How can antibodies eradicate influenza as a human disease? Curr. Opin. Immunol. 2016, 42, 48–55.
  6. Nelson, M.I.; Holmes, E.C. The evolution of epidemic influenza. Nat. Rev. Genet. 2007, 8, 196–205.
  7. Skowronski, D.M.; Sabaiduc, S.; Leir, S.; Rose, C.; Zou, M.; Murti, M.; Dickinson, J.A.; Olsha, R.; Gubbay, J.B.; Croxen, M.A.; et al. Paradoxical clade-and age-specific vaccine effectiveness during the 2018/19 influenza A (H3N2) epidemic in Canada: Potential imprint-regulated effect of vaccine (I-REV). Eurosurveillance 2019, 24, 1900585.
  8. Liao, Y.C.; Lee, M.S.; Ko, C.Y.; Hsiung, C.A. Bioinformatics models for predicting antigenic variants of influenza A/H3N2 virus. Bioinformatics 2008, 24, 505–512.
  9. Qiu, J.; Qiu, T.; Yang, Y.; Wu, D.; Cao, Z. Incorporating structure context of HA protein to improve antigenicity calculation for influenza virus A/H3N2. Sci. Rep. 2016, 6, 31156.
  10. Qiu, T.; Yang, Y.; Qiu, J.; Huang, Y.; Xu, T.; Xiao, H.; Wu, D.; Zhang, Q.; Zhou, C.; Zhang, X.; et al. CE-BLAST makes it possible to compute antigenic similarity for newly emerging pathogens. Nat. Commun. 2018, 9, 1772.
  11. Gupta, V.; Earl, D.J.; Deem, M.W. Quantifying influenza vaccine efficacy and antigenic distance. Vaccine 2006, 24, 3881–3888.
  12. Sun, H.; Yang, J.; Zhang, T.; Long, L.P.; Jia, K.; Yang, G.; Webby, R.J.; Wan, X.F. Using sequence data to infer the antigenicity of influenza virus. mBio 2013, 4, e00230-13.
  13. Daly, J.M.; Elton, D. Potential of a sequence-based antigenic distance measure to indicate equine influenza vaccine strain efficacy. Vaccine 2013, 31, 6043–6045.
  14. Anderson, C.S.; DeDiego, M.L.; Thakar, J.L.; Topham, D.J. Novel sequence-based mapping of recently emerging H5NX influenza viruses reveals pandemic vaccine candidates. PLoS ONE 2016, 11, e0160510.
  15. Li, X.; Deem, M.W. Influenza evolution and H3N2 vaccine effectiveness, with application to the 2014/2015 season. Protein Eng. Des. Sel. 2016, 29, 309–315.
  16. Anderson, C.S.; McCall, P.R.; Stern, H.A.; Yang, H.; Topham, D.J. Antigenic cartography of H1N1 influenza viruses using sequence-based antigenic distance calculation. BMC Bioinform. 2018, 19, 51.
  17. Zhou, X.; Yin, R.; Kwoh, C.K.; Zheng, J. A context-free encoding scheme of protein sequences for predicting antigenicity of diverse influenza A viruses. BMC Genom. 2018, 19, 145–154.
  18. Łuksza, M.; Lässig, M. A predictive fitness model for influenza. Nature 2014, 507, 57–61.
  19. Yin, R.; Luusua, E.; Dabrowski, J.; Zhang, Y.; Kwoh, C.K. Tempel: Time-series mutation prediction of influenza A viruses via attention-based recurrent neural networks. Bioinformatics 2020, 36, 2697–2704.
  20. Neher, R.A.; Bedford, T.; Daniels, R.S.; Russell, C.A.; Shraiman, B.I. Prediction, dynamics, and visualization of antigenic phenotypes of seasonal influenza viruses. Proc. Natl. Acad. Sci. USA 2016, 113, E1701–E1709.
  21. Neher, R.A.; Russell, C.A.; Shraiman, B.I. Predicting evolution from the shape of genealogical trees. eLife 2014, 3, e03568.
  22. Yin, R.; Thwin, N.N.; Zhuang, P.; Lin, Z.; Kwoh, C.K. IAV-CNN: A 2D convolutional neural network model to predict antigenic variants of influenza A virus. IEEE/ACM Trans. Comput. Biol. Bioinform. 2021, 19, 3497–3506.
  23. Hirst, G.K. Studies of antigenic differences among strains of influenza A by means of red cell agglutination. J. Exp. Med. 1943, 78, 407–423.
  24. Smith, D.J.; Lapedes, A.S.; De Jong, J.C.; Bestebroer, T.M.; Rimmelzwaan, G.F.; Osterhaus, A.D.; Fouchier, R.A. Mapping the antigenic and genetic evolution of influenza virus. Science 2004, 305, 371–376.
  25. Lapedes, A.; Farber, R. The geometry of shape space: Application to influenza. J. Theor. Biol. 2001, 212, 57–69.
  26. Cai, Z.; Zhang, T.; Wan, X.F. A computational framework for influenza antigenic cartography. PLoS Comput. Biol. 2010, 6, e1000949.
  27. Lees, W.D.; Moss, D.S.; Shepherd, A.J. A computational analysis of the antigenic properties of haemagglutinin in influenza A H3N2. Bioinformatics 2010, 26, 1403–1408.
  28. Qiu, T.; Qiu, J.; Yang, Y.; Zhang, L.; Mao, T.; Zhang, X.; Xu, J.; Cao, Z. A benchmark dataset of protein antigens for antigenicity measurement. Sci. Data 2020, 7, 212.
  29. Koel, B.F.; Burke, D.F.; Bestebroer, T.M.; Van Der Vliet, S.; Zondag, G.C.; Vervaet, G.; Skepner, E.; Lewis, N.S.; Spronken, M.I.; Russell, C.A.; et al. Substitutions near the receptor binding site determine major antigenic change during influenza virus evolution. Science 2013, 342, 976–979.
  30. Steinbrück, L.; Klingen, T.R.; McHardy, A.C. Computational prediction of vaccine strains for human influenza A (H3N2) viruses. J. Virol. 2014, 88, 12123–12132.
  31. Bedford, T.; Suchard, M.A.; Lemey, P.; Dudas, G.; Gregory, V.; Hay, A.J.; McCauley, J.W.; Russell, C.A.; Smith, D.J.; Rambaut, A. Integrating influenza antigenic dynamics with molecular evolution. eLife 2014, 3, e01914.
  32. Asgari, E.; Mofrad, M.R. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 2015, 10, e0141287.
  33. Hensley, S.E.; Das, S.R.; Bailey, A.L.; Schmidt, L.M.; Hickman, H.D.; Jayaraman, A.; Viswanathan, K.; Raman, R.; Sasisekharan, R.; Bennink, J.R.; et al. Hemagglutinin receptor binding avidity drives influenza A virus antigenic drift. Science 2009, 326, 734–736.
  34. Veljkovic, V.; Paessler, S.; Glisic, S.; Prljic, J.; Perovic, V.R.; Veljkovic, N.; Scotch, M. Evolution of 2014/15 H3N2 influenza viruses circulating in US: Consequences for vaccine effectiveness and possible new pandemic. Front. Microbiol. 2015, 6, 1456.
  35. Lee, E.K.; Tian, H.; Nakaya, H.I. Antigenicity prediction and vaccine recommendation of human influenza virus A (H3N2) using convolutional neural networks. Hum. Vaccines Immunother. 2020, 16, 2690–2708.
  36. Huang, X.; Li, J.; Hu, X. Accelerated attributed network embedding. In Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, TX, USA, 27–29 April 2017; pp. 633–641.
  37. Pan, S.; Wu, J.; Zhu, X.; Zhang, C.; Wang, Y. Tri-party deep network representation. Network 2016, 11, 12.
  38. Liao, L.; He, X.; Zhang, H.; Chua, T.S. Attributed social network embedding. IEEE Trans. Knowl. Data Eng. 2018, 30, 2257–2270.
  39. Grover, A.; Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 855–864.
  40. Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; Mei, Q. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 1067–1077.
  41. Zhang, D.; Yin, J.; Zhu, X.; Zhang, C. Attributed network embedding via subspace discovery. Data Min. Knowl. Discov. 2019, 33, 1953–1980.
  42. Kipf, T.N.; Welling, M. Semi-supervised classification with graph convolutional networks. arXiv 2016, arXiv:1609.02907.
  43. Veličković, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; Bengio, Y. Graph attention networks. arXiv 2017, arXiv:1710.10903.
  44. Hamilton, W.; Ying, Z.; Leskovec, J. Inductive representation learning on large graphs. arXiv 2017, arXiv:1706.02216.
  45. Park, J.; Lee, M.; Chang, H.J.; Lee, K.; Choi, J.Y. Symmetric graph convolutional autoencoder for unsupervised graph representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6519–6528.
  46. Yang, C.; Liu, Z.; Zhao, D.; Sun, M.; Chang, E.Y. Network representation learning with rich text information. In Proceedings of the 24th International Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015; pp. 2111–2117.
  47. McHardy, A.C.; Adams, B. The role of genomics in tracking the evolution of influenza A virus. PLoS Pathog. 2009, 5, e1000566.
  48. Wolf, Y.I.; Viboud, C.; Holmes, E.C.; Koonin, E.V.; Lipman, D.J. Long intervals of stasis punctuated by bursts of positive selection in the seasonal evolution of influenza A virus. Biol. Direct 2006, 1, 34.
  49. Forghani, M.; Khachay, M. Convolutional neural network based approach to in silico non-anticipating prediction of antigenic distance for influenza virus. Viruses 2020, 12, 1019.
  50. Zeller, M.A.; Gauger, P.C.; Arendsee, Z.W.; Souza, C.K.; Vincent, A.L.; Anderson, T.K. Machine learning prediction and experimental validation of antigenic drift in H3 influenza A viruses in swine. mSphere 2021, 6, e00920-20.
Figure 1. Antigenic network representation learning framework based on AANE.
Figure 2. (a) Relationship between the antigenic distance determined by Formula (3) (x-axis) and the calculation method proposed in this paper (y-axis). The green circles represent consistent antigenicity comparisons (i.e., similar or dissimilar) between the two calculation methods, while the red circles represent inconsistent results. The black solid line is the best linear fit with zero intercept, and the correlation between the two antigenic distance calculation methods is 0.9629 with a 95% confidence interval of 94.86% to 97.68%. There is a correlation of 0.9815 between our proposed method and the method used by Smith. (b) ProtVec continuous distributed representation of H3N2 HA1 sequence data, used as node attribute information. To construct the antigenic diversity network and represent the node attribute information, we embedded the HA1 amino acid sequences, each 329 residues long, into a node attribute information matrix using ProtVec. ProtVec applies Skip-Gram to learn the distributed embedding vector representation of influenza virus amino acid sequences, representing consecutive amino acid triplets as 100-dimensional vectors. Through ProtVec representation, each node’s attribute information matrix can be obtained.
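The lookup step described in panel (b) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes overlapping triplets and stacks one vector per triplet, and `toy_embedding` is a hypothetical stand-in that returns deterministic pseudo-random vectors in place of trained Skip-Gram embeddings.

```python
import random


def triplets(sequence, k=3):
    """Return the overlapping k-mers of an amino acid sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def toy_embedding(kmer, dim=100):
    """Hypothetical placeholder for a trained Skip-Gram vector.

    Seeding the generator with the k-mer makes the vector deterministic,
    so the same triplet always maps to the same 100-dimensional vector.
    """
    rng = random.Random(kmer)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def attribute_matrix(sequence, k=3, dim=100):
    """Stack one embedding vector per triplet (assumed layout)."""
    return [toy_embedding(kmer, dim) for kmer in triplets(sequence, k)]


# A made-up 329-residue sequence, used only to check the shapes.
example_sequence = "QKLPGNDNSTATLCLGHHAVPNGTIVKTITNDQ" * 10
example_sequence = example_sequence[:329]
matrix = attribute_matrix(example_sequence)
print(len(matrix), len(matrix[0]))
```

Under these assumptions a sequence of 329 residues produces 327 triplets, so the sketch returns a 327 by 100 matrix. How the paper aggregates triplet vectors into the final node attribute matrix is not stated in the caption.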
Figure 3. Performance of the AANE-based antigenic network embedding learning model, measured by RMSE (top) and PCC (bottom), on antigenic distance prediction tasks using the H3N2 antigenic network dataset (1968–2011) with d = 50. AANE outperforms all the other models compared. The x-axis shows the percentage of nodes randomly removed from the network (from 0% to 60%), and the y-axis shows the corresponding evaluation metric. Panels (a) and (b) show prediction results for antigenic distances calculated with the formula defined by Smith and with the normalized logarithmic transformation formula proposed in this paper, respectively. In the left subplots of (a) and (b), models that use antigenic distance as the only link weight are compared with AANE (green line) in terms of RMSE and PCC. In the right subplots of (a) and (b), models that use antigenic distance as the link weight and the ProtVec matrix encoding HA as the node attribute for network embedding learning are compared with AANE (green line) in terms of RMSE and PCC.
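The two metrics plotted in Figure 3 follow their standard definitions. The sketch below computes them in plain Python; the function names and the toy values are illustrative assumptions, not data from the paper.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pcc(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Toy antigenic distances (made up for illustration).
actual_distances = [1.0, 2.5, 4.0, 6.0]
predicted_distances = [1.2, 2.0, 4.5, 5.5]
print(rmse(actual_distances, predicted_distances))
print(pcc(actual_distances, predicted_distances))
```

A lower RMSE and a PCC closer to 1 both indicate better agreement between predicted and measured antigenic distances.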
Figure 4. Linear regression (solid black line) of predicted values (y-axis) against actual values (x-axis) for different models. Green dots represent true-positive (TP) predictions, blue dots true-negative (TN) predictions, red dots false-positive (FP) predictions, and yellow dots false-negative (FN) predictions.
Figure 5. The antigenic distances predicted by the model were converted to antigenic differences (using D(i, i+1) = 4 as the threshold for binary classification) and evaluated on the H3N2 dataset with four classification metrics: accuracy, precision, recall, and F1 score. Panels (a) and (b) show the results of predictions using antigenic distance data calculated with the formula defined by Smith and with Equation (3), respectively.
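The conversion from predicted distances to binary antigenic differences can be sketched as below. The sketch assumes that a distance of 4 or more counts as an antigenic variant (consistent with D(i, i+1) < 4 meaning no variation), and the function names and toy values are illustrative rather than taken from the paper.

```python
def is_variant(distance, threshold=4.0):
    """Assumed rule: a distance at or above the threshold is a variant."""
    return distance >= threshold


def classification_metrics(actual, predicted, threshold=4.0):
    """Accuracy, precision, recall, and F1 after thresholding distances."""
    pairs = [(is_variant(a, threshold), is_variant(p, threshold))
             for a, p in zip(actual, predicted)]
    tp = sum(1 for a, p in pairs if a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy measured and predicted distances (made up for illustration).
measured = [1.0, 5.0, 6.0, 2.0, 4.5]
estimated = [1.5, 4.2, 3.0, 4.1, 5.0]
print(classification_metrics(measured, estimated))
```

With these toy values there are two true positives, one true negative, one false positive, and one false negative, which gives an accuracy of 0.6 and precision, recall, and F1 of 2/3 each.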
Figure 6. Antigenic clustering over the past four decades (1968–2011). (a) At each step of the clustering process, the pair of adjacent clusters with the smallest antigenic distance in the current collection of clusters is selected and merged into a new cluster without antigenic variation (D(i, i+1) < 4). (b) Each circle represents all strains in a given year, and the values between two circles are the average antigenic distances between clusters during the updating process. Adjacent clusters with similar antigenicity are merged into new clusters, and the strain with the smallest antigenic variation in each cluster is used to name the final cluster.
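The merging procedure in panel (a) can be sketched as a greedy loop over chronologically ordered clusters. This is an illustrative reading of the caption, not the authors' code: it assumes the inter-cluster distance is the mean of all pairwise strain distances, and it represents strains as hypothetical one-dimensional positions so that a distance function is available.

```python
from itertools import product


def average_distance(cluster_a, cluster_b, distance):
    """Mean pairwise distance between the strains of two clusters."""
    total = sum(distance(x, y) for x, y in product(cluster_a, cluster_b))
    return total / (len(cluster_a) * len(cluster_b))


def merge_adjacent_clusters(clusters, distance, threshold=4.0):
    """Repeatedly merge the closest adjacent pair while it is below threshold."""
    clusters = [list(cluster) for cluster in clusters]
    while len(clusters) > 1:
        gaps = [average_distance(clusters[i], clusters[i + 1], distance)
                for i in range(len(clusters) - 1)]
        closest = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[closest] >= threshold:
            break  # every remaining adjacent pair shows antigenic variation
        clusters[closest:closest + 2] = [clusters[closest] + clusters[closest + 1]]
    return clusters


# Toy data: one hypothetical strain position per year, in time order.
yearly_clusters = [[0.0], [1.0], [2.0], [8.0], [9.0]]
merged = merge_adjacent_clusters(yearly_clusters, lambda x, y: abs(x - y))
print(merged)
```

On the toy input the first three years merge into one cluster and the last two into another, because the average distance between the two groups (7.5) is above the threshold of 4.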
Figure 7. Comparison of the relationship between genetic distance and antigenic distance in the same cluster.
Figure 8. Comparison of the relationship between genetic distance and antigenic distance in adjacent clusters.
Table 1. Different combinations of the regularization parameter ρ and embedding dimension d affect the prediction results of antigenic distance.
| Parameters | RMSE | PCC |
|---|---|---|
| (ρ = 10^{-3}, d = 150) * | 2.1994 | 0.6956 |
| (ρ = 10^{-2}, d = 150) | 2.0167 | 0.7354 |
| (ρ = 10^{-1}, d = 150) | 1.9503 | 0.7462 |
| (ρ = 1, d = 150) | 1.8559 | 0.7662 |
| (ρ = 10, d = 120) | 1.7025 | 0.7897 |
| (ρ = 10^{2}, d = 120) | 1.6603 | 0.8018 |
| (ρ = 10^{3}, d = 120) | 1.2973 | 0.8660 |
| (ρ = 10^{4}, d = 100) | 0.8899 | 0.9336 |
| (ρ = 10^{5}, d = 50) | 0.8678 | 0.9373 |
| (ρ = 10^{6}, d = 20) | 1.5160 | 0.8311 |
| (d = 20, ρ = 10^{6}) | 1.6102 | 0.8086 |
| (d = 32, ρ = 10^{6}) | 0.9746 | 0.9197 |
| (d = 50, ρ = 10^{5}) | 0.8730 | 0.9362 |
| (d = 64, ρ = 10^{5}) | 0.9120 | 0.9303 |
| (d = 80, ρ = 10^{3}) | 1.1248 | 0.8965 |
| (d = 100, ρ = 10^{4}) | 0.8950 | 0.9326 |
| (d = 120, ρ = 10^{3}) | 1.3108 | 0.8652 |
| (d = 150, ρ = 10^{2}) | 1.7906 | 0.7753 |
* In each row, the first parameter is the one held fixed, and the second parameter is its optimal value given the first.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Peng, F.; Xia, Y.; Li, W. Prediction of Antigenic Distance in Influenza A Using Attribute Network Embedding. Viruses 2023, 15, 1478. https://doi.org/10.3390/v15071478