This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Determining the complex relationships between diseases, polymorphisms in human genes and environmental factors is challenging. Multifactor dimensionality reduction (MDR) has been shown to detect statistical patterns of epistasis effectively, but it relies on classification accuracy as its evaluation measure, and imbalanced datasets can seriously degrade classification accuracy. Moreover, MDR methods cannot quantitatively assess the disease risk of genotype combinations. Hence, we introduce a novel weighted risk score-based multifactor dimensionality reduction (WRSMDR) method that uses the Bayesian posterior probability of polymorphism combinations as a new quantitative measure of disease risk. First, we compared WRSMDR to the MDR method in simulated datasets. Our results showed that the WRSMDR method had reasonable power to identify high-order gene-gene interactions, and it was more effective than MDR at detecting four-locus models. Moreover, WRSMDR reveals more information regarding the effect of genotype combinations on disease risk, and its results were easier to determine and apply than those of MDR. Finally, we applied WRSMDR to a nasopharyngeal carcinoma (NPC) case-control study and identified a statistically significant high-order interaction among three polymorphisms: rs2860580, rs11865086 and rs2305806.
Complex interactions among genes and environmental factors are known to play a role in common human disease etiology. However, the identification and characterization of gene-gene interactions for common complex human diseases remain a challenge for human geneticists. Traditional statistical methods are not well suited for detecting such interactions, especially when the data are highly dimensional (having many attributes or independent variables) or when interactions occur between more than two polymorphisms [
Nasopharyngeal carcinoma (NPC) is a squamous cell carcinoma that arises in the epithelial lining of the nasopharynx [
In this paper, we introduce a novel weighted risk score-based multifactor dimensionality reduction (WRSMDR) method for detecting and characterizing high-order gene-gene interactions in case-control studies. The WRSMDR method uses the Bayesian posterior probability of each genotype combination as a quantitative measure of disease risk and uses the proportion of all samples carrying each genotype combination as its weight. WRSMDR exhaustively searches all possible combinations of polymorphisms to identify the one that best divides the samples into risk subgroups. We first evaluated WRSMDR using simulated multilocus data with epistatic effects and compared it to the original MDR method. Next, we applied the WRSMDR method to identify multiple single-nucleotide polymorphisms (SNPs) associated with nasopharyngeal carcinoma.
Specific detection rate = the proportion of simulated datasets in which the true model was detected as the overall best model [
Detection rate = the proportion of simulated datasets in which a multilocus model that includes the true model was detected as the overall best model [
Error rate = the proportion of simulated datasets in which the overall best model did not include the true model.
No detection rate = the proportion of simulated datasets in which the method did not detect any statistically significant model.
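The four indicators defined above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `detection_summary`, the representation of each best model as a set of SNP names, and the use of `None` for "no statistically significant model" are all assumptions introduced here. An exact match is counted toward both the specific detection rate and the detection rate, which agrees with the reported tables, where the detection, error and no detection rates sum to one.

```python
from typing import Dict, Iterable, Optional, Set


def detection_summary(best_models: Iterable[Optional[Set[str]]],
                      true_model: Set[str]) -> Dict[str, float]:
    """Summarize how often the true model was recovered across replicates.

    best_models holds one entry per simulated dataset: the SNP set of the
    overall best model, or None when no significant model was found
    (hypothetical encoding chosen for this sketch).
    """
    true_set = frozenset(true_model)
    counts = {"specific_detection": 0, "detection": 0,
              "error": 0, "no_detection": 0}
    n = 0
    for model in best_models:
        n += 1
        if model is None:
            counts["no_detection"] += 1
        elif frozenset(model) == true_set:
            # Exact recovery counts toward both detection indicators.
            counts["specific_detection"] += 1
            counts["detection"] += 1
        elif true_set <= frozenset(model):
            # Best model contains the true model plus extra loci.
            counts["detection"] += 1
        else:
            counts["error"] += 1
    if n == 0:
        raise ValueError("at least one simulated dataset is required")
    return {name: value / n for name, value in counts.items()}


# Hypothetical replicates for a true two-locus model {"snp1", "snp2"}.
replicates = [{"snp1", "snp2"}, {"snp1", "snp2", "snp7"}, {"snp3", "snp4"}, None]
rates = detection_summary(replicates, {"snp1", "snp2"})
print(rates)
```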
We applied the WRSMDR and MDR methods to balanced and imbalanced simulated datasets, and the results are shown in
Power comparison of the MDR and weighted risk score-based multifactor dimensionality reduction (WRSMDR) methods in balanced datasets.
Evaluation Indicator  Two-Locus WRSMDR  Two-Locus MDR  Three-Locus WRSMDR  Three-Locus MDR  Four-Locus WRSMDR  Four-Locus MDR
Specific Detection Rate  0.87  0.83  0.74  0.83  0.92  0.46 
Detection Rate  1  1  1  1  0.97  0.56 
Error Rate  0  0  0  0  0.01  0.44 
No Detection Rate  0  0  0  0  0.02  0 
Power comparison of the MDR and WRSMDR methods in imbalanced datasets.
Evaluation Indicator  Two-Locus WRSMDR  Two-Locus MDR  Three-Locus WRSMDR  Three-Locus MDR  Four-Locus WRSMDR  Four-Locus MDR
Specific Detection Rate  0.96  0.61  0.57  0.66  0.94  0.68 
Detection Rate  1  0.81  0.85  0.85  0.98  0.79 
Error Rate  0  0.19  0.03  0.15  0.01  0.21 
No Detection Rate  0  0  0.12  0  0.01  0 
We also applied the MDR method to explore the NPC data in a further step. The result is shown in
Summary of the results for applying the WRSMDR method to the nasopharyngeal carcinoma (NPC) dataset.
Number of Loci  SNPs  Weighted Risk Score  Consistency  p-Value
2  rs2860580-rs11865086  1.324  10  <0.001
3  rs2860580-rs11865086-rs2305806 *  1.332  10  <0.001
4  rs2860580-rs11865086-rs836475-rs4976028  1.266  4  <0.001
5  rs2860580-rs11865086-rs836475-rs4976028-rs6488297  1.236  7  <0.001
* The three-locus combination was selected as the best model by the WRSMDR method.
Summary of the disease probability estimated using Bayes’ posterior probability.
Genotype Combination of the Three SNPs ^{a}  Disease Probability ^{b}  Fold Increase in Risk ^{c}  Weight of Genotype ^{d} 

GG-CC-AG  0.00077  3.07  0.03
GG-CC-AA  0.00045  1.78  0.03
GG-AC-AA  0.00038  1.51  0.09
GG-AC-AG  0.00037  1.49  0.11
AG-CC-AG  0.00036  1.43  0.03
GG-AA-AA  0.00034  1.36  0.08
AG-AC-AA  0.00032  1.29  0.09
GG-AA-AG  0.00031  1.24  0.08
GG-AA-GG  0.00031  1.23  0.02
GG-AC-GG  0.00027  1.07  0.03
AG-CC-AA  0.00026  1.03  0.02
AG-AC-GG  0.00019  0.77  0.03
AG-AC-AG  0.00019  0.77  0.09
AG-AA-AG  0.00017  0.67  0.07
AG-AA-GG  0.00016  0.66  0.02
AG-AA-AA  0.00016  0.62  0.08
AA-AC-AG  0.00013  0.52  0.02
AA-AC-AA  0.00010  0.39  0.02
AA-AA-AG  0.00008  0.33  0.01
AA-AA-AA  0.00006  0.26  0.01
^{a} The three SNPs = rs2860580-rs11865086-rs2305806; ^{b} the disease probability is calculated by Bayes’ posterior probability formula and represents the disease probability of an individual who carries a specific multilocus genotype combination; ^{c} the fold increase in risk compared to the cumulative risk of NPC; ^{d} the weight of the genotype is the proportion of samples with the specific genotype combination.
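The three quantities in this table can be illustrated with a short sketch. It is not the authors' code: it assumes that the likelihoods in Bayes' formula are estimated as the genotype-combination frequencies among cases and among controls, and that the prior is the cumulative risk of the disease. The function names and the example frequencies are hypothetical.

```python
def posterior_disease_probability(freq_in_cases: float,
                                  freq_in_controls: float,
                                  prevalence: float) -> float:
    """Bayes' posterior P(disease | genotype combination).

    freq_in_cases    ~ P(genotype | disease), assumed estimated from cases
    freq_in_controls ~ P(genotype | no disease), assumed estimated from controls
    prevalence       ~ P(disease), e.g. the cumulative risk of the disease
    """
    numerator = freq_in_cases * prevalence
    denominator = numerator + freq_in_controls * (1.0 - prevalence)
    if denominator == 0.0:
        raise ValueError("genotype combination absent from cases and controls")
    return numerator / denominator


def fold_increase_in_risk(posterior: float, prevalence: float) -> float:
    """Posterior disease probability relative to the cumulative risk."""
    return posterior / prevalence


def genotype_weight(n_carriers: int, n_samples: int) -> float:
    """Proportion of all samples carrying the genotype combination."""
    return n_carriers / n_samples


# Hypothetical genotype combination: 6% of cases and 2% of controls carry it.
prevalence = 0.00025
posterior = posterior_disease_probability(0.06, 0.02, prevalence)
fold = fold_increase_in_risk(posterior, prevalence)
weight = genotype_weight(100, 2640)
print(f"posterior={posterior:.5f} fold={fold:.2f} weight={weight:.3f}")
```

As a consistency check on the table itself, dividing the rounded disease probability 0.00077 by the cumulative risk 0.00025 gives about 3.08, close to the reported 3.07; the small gap is presumably due to rounding of the tabulated probability.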
The three SNPs are located in the HLA-A, MAPK3 and VAV1 genes, which play important roles in the NK cell pathway. NK cells are lymphocytes distinct from B and T cells that induce the perforin-mediated lysis of tumor cells and virus-infected cells, and the NK cell pathway regulates the role of NK cells in the immune response. The highest NPC risk among the three-locus genotype combinations was three times greater than the cumulative risk of the disease, which indicates that this pathway may be associated with NPC. To the best of our knowledge, this is the first report describing a three-locus interaction associated with NPC, and the results of this study therefore provide new insights into the pathogenesis of NPC.
Summary of results applying the MDR method to the NPC dataset.
Number of Loci  SNPs  Prediction Error (%)  Cross-Validation Consistency  p-Value
2  rs2860580-rs11865086  41.65  9/10  <0.001
3  rs2860580-rs11865086-rs2305806 *  40.48  10/10  <0.001
4  rs2860580-rs11865086-rs2305806-rs2115485  41.31  8/10  <0.001
5  rs2860580-rs11865086-rs2305806  45.35  5/10  <0.022
* The three-locus combination was selected as the best model by MDR.
The WRSMDR method provides several advantages. First, similar to the original MDR method, the WRSMDR method is a nonparametric approach and assumes no particular genetic model. Second, the WRSMDR method provides a more robust quantitative measure of disease risk and reveals more information regarding the effect of certain genotype combinations on the disease risk; this is an important difference from the MDR method, which only discretizes the risk into high and low. Our results showed that the WRSMDR method had more power than the MDR method in detecting four-locus gene-gene interactions in the simulated datasets. For the balanced four-locus datasets, the specific detection rates of the WRSMDR and MDR methods were 92% and 46%, respectively. For the imbalanced four-locus datasets, the specific detection rates of the WRSMDR and MDR methods were 94% and 68%, respectively. The reason for this difference may be that the MDR method is vulnerable to false-positive and false-negative errors when the sample size is small or when the number of simultaneously detected loci is large. In this scenario, the number of cases and controls with a certain genotype combination is very small, and a small change in the frequency can flip the classification to the opposite result. With the WRSMDR method, the quantitative measure of the disease risk is affected less than binary classification values in this scenario. Third, the WRSMDR method uses a weighted risk score rather than classification accuracy as the evaluation measure of the multilocus interaction. The goal of MDR is to search for a locus combination with maximum classification accuracy. For imbalanced datasets, classifiers that seek maximum accuracy are not suitable, because an imbalanced dataset can seriously degrade classification accuracy [
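The sensitivity of a binary high/low labeling to sparse genotype cells can be illustrated with a minimal sketch. It assumes the usual MDR rule of labeling a genotype cell as high risk when its case:control ratio meets a threshold (1.0 for balanced data). The function name `mdr_label`, the handling of empty cells and the cell counts are hypothetical and are not taken from this study.

```python
def mdr_label(n_cases: int, n_controls: int, threshold: float = 1.0) -> str:
    """Label one genotype cell the way MDR discretizes risk (assumed rule)."""
    if n_cases == 0 and n_controls == 0:
        return "empty"  # assumption: cells with no samples are left unlabeled
    # Compare without dividing so that cells with zero controls are handled.
    return "high" if n_cases >= threshold * n_controls else "low"


# Hypothetical sparse cell: reclassifying a single sample flips the label.
sparse_before = mdr_label(3, 2)
sparse_after = mdr_label(2, 3)

# Hypothetical well-populated cell: the same one-sample change has no effect.
dense_before = mdr_label(60, 40)
dense_after = mdr_label(59, 41)

print(sparse_before, sparse_after, dense_before, dense_after)
```

In the sparse cell the label changes from high to low after a one-sample change, whereas a quantitative risk measure would shift gradually rather than switch category.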
However, similar to MDR, WRSMDR has the limitation of being computationally intensive. A genome scan with hundreds or thousands of polymorphisms requires robust machine learning algorithms, as all of the possible multilocus combinations cannot be exhaustively searched. This requirement, however, is a limitation of any multilocus method that does not first condition on a particular locus showing an independent main effect (e.g., stepwise logistic regression) [
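The growth of the exhaustive search space can be made concrete with a short calculation. The sketch below counts the candidate k-locus models as the binomial coefficient C(n, k); the helper name `n_candidate_models` is introduced here for illustration, and the SNP counts are examples rather than a description of any particular genome scan.

```python
from math import comb
from typing import Dict


def n_candidate_models(n_snps: int, min_order: int = 2,
                       max_order: int = 5) -> Dict[int, int]:
    """Number of distinct k-locus SNP combinations for each order k."""
    return {k: comb(n_snps, k) for k in range(min_order, max_order + 1)}


# Ten SNPs per dataset, as in the simulation settings of this study.
small = n_candidate_models(10)

# One thousand SNPs, an illustrative genome-scan-sized panel.
large = n_candidate_models(1000)

for order in sorted(small):
    print(f"{order}-locus: {small[order]:>4} models for 10 SNPs, "
          f"{large[order]:,} models for 1000 SNPs")
```

With ten SNPs the two- to five-locus search covers only a few hundred models in total, whereas with 1000 SNPs the four-locus models alone exceed 41 billion, which is why an exhaustive search becomes impractical at that scale.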
With the WRSMDR method, the disease probability of an individual carrying a multilocus genotype combination was used to assess the susceptibility of the genotype combination, which was denoted as
In this equation,
In this equation,
Suppose we want to investigate
The extent of increased or decreased risk can be defined as follows:
The weight of genotype
Based on the “common disease-common variant” hypothesis, we omit the genotype combinations with weights less than 0.01, and the weighted risk score is defined as follows:
Instead of performing a direct search of the multilocus combinations with the maximum score among the SNP set, which is liable to result in some false-positive loci in the detected multilocus combinations, we used a random sampling method to suppress noise in the identification of the susceptibility of multilocus combinations. This procedure included three steps. In Step 1, 90% of the samples were randomly selected. The weighted risk scores of
To evaluate the WRSMDR method, we simulated six sets of 100 replicates using three different multilocus genetic models. Three sets were balanced, with each simulated dataset composed of 400 cases and 400 controls. The other three sets were imbalanced, with each simulated dataset composed of 1200 cases and 400 controls. All the genetic models and datasets were generated using the Genetic Architecture Model Emulator for Testing and Evaluating Software (GAMETES) [
The procedure for the WRSMDR and MDR methods.
The parameter settings of the three models.
Parameters  Two-Locus Model  Three-Locus Model  Four-Locus Model

Number of predictive SNPs  2  3  4 
Number of nonpredictive SNPs  8  7  6 
Heritability  0.05  0.05  0.05 
MAF of predictive SNPs  0.2  0.2  0.2 
MAF of nonpredictive SNPs  (0.01~0.5)  (0.01~0.5)  (0.01~0.5) 
MAF = minor allele frequency.
The NPC data were based on a large GWAS of NPC performed on Southern Chinese individuals, in which 620,901 SNPs were genotyped in 1615 cases and 1025 controls of Han Chinese descent from Guangdong and an additional 1008 Singapore Chinese controls, who share the same ancestral origin with Han Chinese individuals in Southern China [
Prior to applying WRSMDR to the NPC dataset, the method was evaluated using the simulated multilocus datasets. For every 100 replicates generated by each of the three multilocus epistasis models, we applied the WRSMDR algorithm as described in the subsection “WRSMDR”. An exhaustive search of all possible two- to five-locus models was performed. Then, the WRSMDR method was applied to the NPC dataset with the cumulative risk of the disease equal to 0.00025 [
NK cell pathway SNPs involved in this study.
SNP  Chr.  Locus  MA  Chi-Square Value
rs2860580  6  HLA-A  A  89.95
rs11865086  16  MAPK3  C  14.96 
rs4976028  5  PIK3R1  G  9.98 
rs11150675  16  LAT  A  7.47 
rs6488297  12  KLRC1  A  7.05 
rs941831  10  ITGB1  G  5.88 
rs836475  7  RAC1  A  4.80 
rs2733840  12  KLRC4  G  3.02 
rs2733840  12  KLRC3  G  3.02 
rs10109834  8  PTK2B  C  2.71 
rs2115485  9  SYK  A  2.68 
rs2305806  19  VAV1  G  2.57 
rs7166547  15  MAP2K1  A  2.35 
rs744167  12  PTPN6  A  1.97 
rs7301582  12  KLRC2  A  1.45 
rs3019238  11  PAK1  G  1.23 
rs7645550  3  PIK3CA  A  0.76 
rs11214093  11  IL18  G  0.70 
rs12310310  12  KLRD1  A  0.58 
rs4780  15  B2M  G  0.23 
Chr., chromosome; MA, minor allele.
In this study, we introduced WRSMDR as a method for detecting gene-gene interactions in case-control studies. We compared the WRSMDR and MDR methods in simulated datasets. Our results showed that the WRSMDR method had reasonable power to identify high-order interactions in simulated datasets. In particular, for the four-locus datasets, the detection rate and specific detection rate of the WRSMDR method were higher than those of the MDR method, whereas the error rate of the WRSMDR method was lower than that of the MDR method; these differences were statistically significant. The WRSMDR method was therefore more effective than the MDR method at detecting four-locus models in the simulated datasets. Moreover, the WRSMDR method reveals more information regarding the effect of genotype combinations on the disease risk. We then applied WRSMDR to identify gene-gene interaction effects in the NK cell pathway related to the risk of NPC, and we found a statistically significant, high-order interaction among three polymorphisms. For ease of use, the source code and binaries are freely available for download at [
The authors thank the anonymous reviewers for their helpful comments. The work was supported by the National Natural Science Foundation of China (81325018, 81220108022) and the National Basic Research Program of China (2011CB504303).
Weihua Jia conceived of the study. Futian Luo aided in study design and statistical method. Chaofeng Li performed the simulations and data analysis and wrote the paper. Weihua Jia and Yixin Zeng performed the clinical data collection, genotyping and interpretation of study findings. All authors have contributed their efforts to this work.
The authors declare no conflict of interest.