Assessment of SNP-SNP interactions by using square contingency table analysis

Abstract

The evaluation of SNP-SNP interactions has become an active field in genetic epidemiology. Most studies that analyze the relationship between genetic factors and a disease of interest focus on single-SNP associations. However, for quantitative traits influenced by the interplay of environmental factors and more than one genetic factor, interactions among the multiple factors should be taken into consideration. In this study, symmetry models for square contingency tables are applied to cross-classified SNP-SNP interaction data. Results from a genome-wide association analysis of blood pressure are used as prior evidence for the interacting SNPs.

Key words: complete symmetry; marginal homogeneity; quasi-symmetry; SNP-SNP interaction analysis; square contingency tables

Introduction

A single nucleotide polymorphism (SNP) is defined as a variation at a single position in a DNA sequence among individuals. A DNA sequence is formed from a chain of four nucleotide bases, namely A, C, G and T. For example, substituting a T for an A in the nucleotide sequence GGAATCG yields the sequence GGATTCG, and this variation is classified as a SNP.

SNP-SNP interaction is generally defined as the interaction between different loci, and statistical interaction analysis is expected to help explain the etiology of many complex human traits such as diabetes, hypertension and asthma. In recent years, despite the growing number of genome-wide association analyses, interaction analyses of genome-wide data remain few. A major drawback of genome-wide association analysis is that the main focus is usually on single-SNP associations. However, the effect of one locus may be masked by effects at another locus, or the joint effect of two SNPs may be significant even though neither is effective on its own. Thus, interactions play an important role in explaining the missing heritability of complex diseases.

A common and simple way of detecting the interaction between two SNPs is to use regression techniques, in which interaction terms are straightforward to implement (Hartwing 2013). However, when there are more than two SNPs, classical statistical approaches may lack power because of the high dimensionality. Vaidyanathan et al. (2017) followed a clustering approach that pairs SNPs by similarity of genetic identity and then conducted a standard case-control association test, the Cochran-Mantel-Haenszel test, to analyze the SNP-disease association.

On the other hand, the case-control study is a common design for testing association by comparing the frequency of SNP alleles in cases, who have been diagnosed with the disease under study, and in controls, who are known to be unaffected (Lewis 2002, Clarke et al. 2011). Contingency table analysis methods allow alternative models of genetic relationship by summarizing the counts in different ways. For example, Haber (1982) performed intraclass contingency table analysis to test independence under the assumptions of marginal homogeneity, quasi-independence and symmetry by intercrossing maternally and paternally inherited genes. Song et al. (2014) applied a comparative chi-square analysis to screen large gene expression data for conserved and differential gene interactions.

The goal of this paper is to provide an alternative approach for analyzing cross-classified SNP-SNP interactions with symmetry models. The symmetry models described in the literature are the complete symmetry, quasi-symmetry and marginal homogeneity models. Using these models, we explore the most suitable symmetry structure among the SNP-SNP pairs that were found to be related in the preliminary study. The symmetry models are applied to 24 SNP-SNP pairs and the most relevant SNP-SNP pair is identified.

The remainder of the paper describes genetic association and its theoretical properties, presents the symmetry models with their theoretical background, applies them to real data and closes with some concluding remarks.

GENETIC ASSOCIATION

Genetic association studies are used for testing the relationship between the phenotype of interest and a genetic variant. A phenotype can be defined as the observable properties of an organism that are produced by the interaction of genetic and environmental factors. In genetic epidemiology, the phenotype of interest is usually a disease status or a continuous indicator. In most association studies, SNPs are taken as the genetic variants. As mentioned above, a SNP is a variation at a single position in a DNA sequence among individuals and is usually coded as a genotype, that is, a combination of alleles. Consider a SNP consisting of a single bi-allelic locus with alleles a and A. The SNP can then be characterized by three possible categories: aa, aA and AA.

Genetic association is tested with different statistical methods depending on the structure of the phenotype. For continuously measured phenotypes such as blood pressure, linear models are useful tools. When the phenotype is binary (0/1), as in a case-control design, a logistic regression model can be used to detect the relationship between the trait and the genetic variation.

In genetic epidemiology, case-control studies are widely used designs since the categorical structure of the genotype allows contingency table analysis (Slager & Schaid 2001, Velez et al. 2016). Under the null hypothesis of no association with the disease, the genotype frequencies are expected to be the same in cases and controls. A contingency table can be analyzed with standard test statistics that measure the divergence of the observed frequencies from those expected under the null hypothesis of no association.

For a single bi-allelic SNP with alleles a and A tested in a case-control study, the data consist of six counts: the numbers of each genotype (aa, aA, AA) in cases and in controls. When the interaction of two SNPs is considered, the structure of the table changes from 2x3 (Table I.(a)) to 3x3 (Table I.(b)), separately for cases and controls.


(a)
SNP-V     aa       aA       AA       Total
Case      n.1      n.2      n.3      n
Control   m.1      m.2      m.3      m
Total     n^B.1    n^B.2    n^B.3    N

(b)
Controls
U \ V     aa       aA       AA       Total
aa        m11      m12      m13      m1.
aA        m21      m22      m23      m2.
AA        m31      m32      m33      m3.
Total     m.1      m.2      m.3      m

In Table I, n is the number of cases, m is the number of controls and N = n + m is the total number of individuals.
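The cross-classification behind a table such as Table I.(b) can be sketched in a few lines. This is only an illustration: the function name `cross_classify`, the level order (1 = aa, 2 = aA, 3 = AA) and the six genotypes below are assumptions made for the example and are not taken from the study data.

```python
from collections import Counter

# Assumed level order: 1 = aa, 2 = aA, 3 = AA.
GENOTYPES = ("aa", "aA", "AA")

def cross_classify(snp_u, snp_v, levels=GENOTYPES):
    """Return the square table of counts; rows index SNP-U, columns index SNP-V."""
    if len(snp_u) != len(snp_v):
        raise ValueError("both SNPs must be genotyped on the same individuals")
    counts = Counter(zip(snp_u, snp_v))  # missing pairs count as 0
    return [[counts[(r, c)] for c in levels] for r in levels]

# Hypothetical genotypes of six individuals (illustrative only).
u = ["aa", "aA", "aA", "AA", "aa", "AA"]
v = ["aa", "aa", "aA", "AA", "aA", "AA"]
table = cross_classify(u, v)
for label, row in zip(GENOTYPES, table):
    print(label, row)
```

Each individual contributes to exactly one cell, so the cell counts sum to the number of individuals in the group being tabulated.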

For the interaction analysis of SNPs, a square contingency table, which is a special case of a contingency table, can be obtained as in Table I.(b). Square contingency tables arise in dependent samples where the row and column variables have the same levels. Specific models are needed to analyze these kinds of tables; most of them are symmetry-type models that represent the symmetric structure of the table.

In this study, three symmetry models are considered: complete symmetry, quasi-symmetry and marginal homogeneity.

SYMMETRY MODELS

Let $n_{ij}$ be the observed frequency in cell $(i,j)$ and let $p_{ij}$ denote the probability of the corresponding cell. Then the complete symmetry (S) model is defined by

$$p_{ij} = p_{ji}, \quad i,j = 1,2,\ldots,R \quad (i \neq j),$$

and the S model has $R(R-1)/2$ degrees of freedom (df), where $R$ is the dimension of the square table (Goodman 1985, Bishop et al. 1975).

This model states that the probability that an observation falls in cell $(i,j)$ equals the probability that it falls in the symmetric cell $(j,i)$. As an extension of the S model, the quasi-symmetry (QS) model is defined by

$$p_{ij}\,p_{jk}\,p_{ki} = p_{ji}\,p_{kj}\,p_{ik}, \quad 1 \le i < j < k \le R.$$

The QS model has $(R-1)(R-2)/2$ df (Yamaguchi 1990). It states that the odds ratios on one side of the main diagonal equal the corresponding odds ratios on the other side.

Another extension of the S model, the marginal homogeneity (MH) model, is defined by

$$p_{i\cdot} = p_{\cdot i}, \quad i = 1,2,\ldots,R,$$

where $p_{i\cdot} = \sum_{t=1}^{R} p_{it}$ and $p_{\cdot i} = \sum_{s=1}^{R} p_{si}$. The model has $(R-1)$ df (Stuart 1955, Tahata et al. 2008). It states that the row marginal distribution is identical to the column marginal distribution.
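As a quick check, the three df formulas stated for the S, QS and MH models can be evaluated for the $3 \times 3$ genotype tables used in this paper ($R = 3$). This is only worked arithmetic from those formulas:

$$
\begin{aligned}
\text{S:}\quad & R(R-1)/2 = 3 \cdot 2 / 2 = 3,\\
\text{QS:}\quad & (R-1)(R-2)/2 = 2 \cdot 1 / 2 = 1,\\
\text{MH:}\quad & R-1 = 2.
\end{aligned}
$$

The df of the S model equals the sum of the df of the QS and MH models ($3 = 1 + 2$).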

In genetics, symmetry models can be used to test the interaction between two separate bi-allelic SNPs, and a fitted symmetry model can be interpreted as a similar genotype distribution occurring in SNP-1 and SNP-2.

Let $p_{ij} = n_{ij}/n$ denote the probability that an individual has the $i$th genotype level for SNP-1 and the $j$th genotype level for SNP-2, $i,j = 1,2,3$. Under the S model, the null hypothesis states that $p_{21} = p_{12}$ for genotypes “aa” and “aA”, $p_{31} = p_{13}$ for genotypes “aa” and “AA”, and $p_{32} = p_{23}$ for genotypes “aA” and “AA”. For the QS model, the null hypothesis is $p_{12}\,p_{23}\,p_{31} = p_{21}\,p_{32}\,p_{13}$.

Let $p_{i\cdot} = n_{i\cdot}/n$ denote the probability of the $i$th genotype level for SNP-1 and $p_{\cdot i} = n_{\cdot i}/n$ the probability of the $i$th genotype level for SNP-2. The null hypothesis of the MH model states that $p_{\cdot 1} = p_{1\cdot}$ for genotype “aa”, $p_{\cdot 2} = p_{2\cdot}$ for genotype “aA” and $p_{\cdot 3} = p_{3\cdot}$ for genotype “AA”.
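The marginal proportions compared under the MH hypothesis can be computed directly from a square table. The sketch below is illustrative only: the helper name `marginal_proportions` and the counts are hypothetical, with rows taken as SNP-1 and columns as SNP-2.

```python
def marginal_proportions(table):
    """Return (row proportions p_i., column proportions p_.i) of a square table."""
    size = len(table)
    total = sum(sum(row) for row in table)
    rows = [sum(table[i]) / total for i in range(size)]
    cols = [sum(table[i][j] for i in range(size)) / total for j in range(size)]
    return rows, cols

# Hypothetical 3 x 3 counts (levels aa, aA, AA); not data from the study.
counts = [[10, 4, 2],
          [6, 20, 5],
          [3, 7, 30]]
row_p, col_p = marginal_proportions(counts)
for label, r, c in zip(("aa", "aA", "AA"), row_p, col_p):
    print(f"{label}: p_i. = {r:.3f}, p_.i = {c:.3f}")
```

Under MH the two printed columns would agree, up to sampling variation, for every genotype level.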

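The maximum likelihood estimates of the expected values $e_{ij}$ under the S model are

$$\hat{e}_{ij} = \begin{cases} \dfrac{n_{ij} + n_{ji}}{2}, & i \neq j, \\ n_{ii}, & i = j. \end{cases}$$

The likelihood equations for the QS model are

$$\hat{e}_{i\cdot} = n_{i\cdot} \ \text{and} \ \hat{e}_{\cdot i} = n_{\cdot i}, \quad i = 1,2,3; \qquad \hat{e}_{ij} + \hat{e}_{ji} = n_{ij} + n_{ji}, \quad i \neq j.$$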
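The two fits can be sketched as follows. The S estimates follow the closed form directly. The QS likelihood equations have no closed form, so the sketch solves them by iterative proportional fitting, rescaling in turn to the row totals, the column totals and the symmetric pair sums. The choice of this algorithm, the function names and the example counts are assumptions of this illustration; the paper does not state how the QS fit was computed.

```python
def expected_symmetry(n):
    """Closed-form ML estimates under complete symmetry (S)."""
    size = len(n)
    return [[n[i][j] if i == j else (n[i][j] + n[j][i]) / 2
             for j in range(size)] for i in range(size)]

def _scale(target, current):
    # Scaling factor that moves a fitted total onto its observed total.
    return target / current if current > 0 else 0.0

def expected_quasi_symmetry(n, tol=1e-10, max_iter=10000):
    """Solve the QS likelihood equations by iterative proportional fitting."""
    size = len(n)
    e = [[1.0] * size for _ in range(size)]  # starting table satisfies QS
    row_n = [sum(n[i]) for i in range(size)]
    col_n = [sum(n[i][j] for i in range(size)) for j in range(size)]
    for _ in range(max_iter):
        prev = [row[:] for row in e]
        for i in range(size):  # match row totals n_i.
            f = _scale(row_n[i], sum(e[i]))
            e[i] = [x * f for x in e[i]]
        for j in range(size):  # match column totals n_.j
            f = _scale(col_n[j], sum(e[i][j] for i in range(size)))
            for i in range(size):
                e[i][j] *= f
        for i in range(size):  # match pair sums n_ij + n_ji (diagonal: n_ii)
            for j in range(i, size):
                f = _scale(n[i][j] + n[j][i], e[i][j] + e[j][i])
                e[i][j] *= f
                if i != j:
                    e[j][i] *= f
        change = max(abs(e[i][j] - prev[i][j])
                     for i in range(size) for j in range(size))
        if change < tol:
            break
    return e

# Hypothetical 3 x 3 counts; not data from the study.
counts = [[10, 4, 2],
          [6, 20, 5],
          [3, 7, 30]]
e_s = expected_symmetry(counts)
e_qs = expected_quasi_symmetry(counts)
```

Every rescaling step multiplies the table by factors that depend only on the row, the column or the unordered pair of indices, so the fitted table stays inside the QS family throughout the iterations.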
Note that marginal homogeneity is not equivalent to a log-linear model (Agresti 2002). The MH model assumes that the marginal totals are symmetric even though the structure of the table itself is non-symmetric. When $\alpha = (p_{12}\,p_{23}\,p_{31})/(p_{21}\,p_{32}\,p_{13})$ equals 1, QS is equivalent to S. Caussinus (1965) showed that S is equivalent to QS and MH holding simultaneously. Thus, the distribution of a SNP-SNP interaction that satisfies both QS and MH also satisfies S, and the converse also holds: $S \Leftrightarrow QS \cap MH$.

The cell structure of the parameters under the symmetry models can be represented in matrix form. Let $S_{ij}$ denote the element in row $i$ and column $j$ of a design matrix $S$:

$$S_{ij} = \begin{cases} (k+1) - (i+1)\left(\tfrac{1}{2}i + 1\right) + (R+3)(i+1) - 3 - 2R, & i \le j, \\ (k+1) - (j+1)\left(\tfrac{1}{2}j + 1\right) + (R+3)(j+1) - 3 - 2R, & i > j, \end{cases}$$

where $k = |i-j|$ (Lawal & Sundheim 2002, Efendioğlu 2015).

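Thus, for a bi-allelic SNP-SNP interaction, the $S$ matrix and the corresponding $R$ and $C$ matrices are

$$S = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 4 & 5 \\ 3 & 5 & 6 \end{bmatrix}, \quad R = \begin{bmatrix} 1 & 1 & 1 \\ 2 & 2 & 2 \\ 3 & 3 & 3 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 2 & 3 \\ 1 & 2 & 3 \end{bmatrix},$$

where $R$ is the row matrix and $C$ is the column matrix.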
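The factor matrices can be generated programmatically. The sketch below assumes the symmetry index is $(k+1) - (m+1)(m/2+1) + (R+3)(m+1) - 3 - 2R$ with $k = |i-j|$ and $m = \min(i,j)$, which reproduces the $3 \times 3$ matrix $S$ given above. The function names are hypothetical, and `size` stands for the table dimension to avoid a clash with the row matrix.

```python
def symmetry_index(i, j, size):
    """Label shared by cells (i, j) and (j, i); indices are 1-based."""
    k = abs(i - j)
    m = min(i, j)
    # (m + 1) * (m / 2 + 1) is rewritten as (m + 1) * (m + 2) // 2 to stay in integers.
    return (k + 1) - (m + 1) * (m + 2) // 2 + (size + 3) * (m + 1) - 3 - 2 * size

def design_matrices(size):
    """Return the symmetry, row and column factor matrices of a size x size table."""
    idx = range(1, size + 1)
    sym = [[symmetry_index(i, j, size) for j in idx] for i in idx]
    row = [[i for _ in idx] for i in idx]
    col = [[j for j in idx] for _ in idx]
    return sym, row, col

S, row_factor, col_factor = design_matrices(3)
for line in S:
    print(line)
```

Cells that share a label in `S` are constrained to have equal expected counts under the S model, which is why the matrix is symmetric about the main diagonal.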

Under the null hypothesis that no interaction exists between the SNPs, the test statistic follows a chi-square distribution with the associated df. The likelihood ratio test statistic is

$$G^2 = 2 \sum_{i=1}^{R} \sum_{j=1}^{R} n_{ij} \log\left(\frac{n_{ij}}{\hat{e}_{ij}}\right).$$
Several models may fit the data in a square contingency table. In such cases, model selection refers to choosing the best-fitting model among the candidates. Ranking the models by an information criterion is a common approach. A well-known criterion that can be used for model selection is Akaike’s Information Criterion (AIC):

$$AIC = G^2 - 2\,df.$$

The model with the smallest AIC value is the best-fitting model (Akaike 1974).
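The two quantities can be computed as in the sketch below. It assumes the conventional likelihood ratio statistic, which carries a factor of 2 in front of the double sum, and treats cells with zero observed count as contributing zero. The counts are hypothetical and the expected values are those of the S model, so df = 3 for a $3 \times 3$ table.

```python
import math

def g_squared(observed, expected):
    """Likelihood ratio statistic; cells with n_ij = 0 contribute nothing."""
    total = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for n_ij, e_ij in zip(obs_row, exp_row):
            if n_ij > 0:
                total += n_ij * math.log(n_ij / e_ij)
    return 2.0 * total

def aic(g2, df):
    """AIC as defined in the text: G^2 minus twice the degrees of freedom."""
    return g2 - 2 * df

# Hypothetical 3 x 3 counts; not data from the study.
counts = [[10, 4, 2],
          [6, 20, 5],
          [3, 7, 30]]
# Expected counts under S: off-diagonal cells are averaged with their mirror cell.
expected_s = [[counts[i][j] if i == j else (counts[i][j] + counts[j][i]) / 2
               for j in range(3)] for i in range(3)]
g2_s = g_squared(counts, expected_s)
aic_s = aic(g2_s, df=3)
print(f"G2 = {g2_s:.3f}, AIC = {aic_s:.3f}")
```

Repeating the calculation with the expected counts of the QS and MH models, and their respective df, gives the AIC values that are ranked for model selection.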

REAL DATA APPLICATION

Hypertension and, relatedly, abnormal blood pressure levels are cardiovascular risk factors. In this paper, to evaluate the performance of the symmetry models in testing SNP-SNP interaction, the results of a genome-wide analysis are used as prior knowledge (Karadağ & Aktaş 2018). The 24 top associated SNPs were detected by a multilevel latent class modelling approach that accounts for familial and serial correlations.

Table II gives the genetic position of each variant as a chromosome number (Chr) and a chromosomal position. The numbers of individuals with recoded genotype levels 1, 2 and 3 are n1, n2 and n3, respectively. The association results are also summarized in Table II, where b denotes the effect of the variant and se(b) is its standard error. More detailed information about the association analysis can be found in Karadağ & Aktaş (2018).

Table II
Results for top associated 25 variants (listed according to chromosomal position).

For the interaction analysis of SNPs, 3x3 square tables are generated from the numbers of individuals n1, n2 and n3 given in Table II. The number of possible two-way interactions among 24 variants is 276. Table III summarizes the likelihood ratio test statistic $G^2$ and the p-values, for significant pairs only, under each of the three symmetry models.

Table III
Results under symmetry models.

According to the results in Table III, as an example of a significant interaction, the pair 1-24 fits all three symmetry structures, with p-values of 0.970, 0.900 and 0.892, respectively (p>0.05). The observed counts are given in Table IV.(a), and the expected counts under S, QS and MH are given in Table IV.(b), Table IV.(c) and Table IV.(d), respectively.


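Table IV
(a) Observed counts for the pair 1-24; (b) expected counts under S; (c) expected counts under QS; (d) expected counts under MH.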

The three symmetry models are compared for the significant interactions by their AIC values. The $G^2$, df and AIC values for each model are summarized in Table V. The smallest values of $G^2$ and AIC indicate the SNP pair that best fits the model.

Table V
Model comparison results.

SNP-SNP interactions are represented in 3x3 square contingency tables for cases and controls. The row and column variables are labelled “aa”, “aA” and “AA”, indicating the SNP characteristics.

The S, QS and MH models are applied to the 24 top associated SNP pairs, which are structured as square contingency tables. Although the S model rarely fits data very well, for all the SNP pairs except 14-15 and 23-24 the relation $S \Leftrightarrow QS \cap MH$ is satisfied. This means that the goodness-of-fit test statistic of the S model is asymptotically equivalent to the sum of the test statistics of the QS and MH models. The $G^2$ values for the pairs 14-15 and 23-24 are zero because all off-diagonal elements of their contingency tables are zero.

For the interactions, data fitted by the S model indicate that $p_{ij} = p_{ji}$ holds for $i,j = 1,2,3$. For instance, $p_{aa,Aa} = p_{Aa,aa}$, $p_{aa,AA} = p_{AA,aa}$ and $p_{Aa,AA} = p_{AA,Aa}$ over alleles a and A. As a consequence, SNPs including at least one allele “A” might be at risk for disease. The pairs fitted by the MH model show that $p_{i\cdot} = p_{\cdot i}$ holds for $i = 1,2,3$. For instance, for the SNP pair 1-24 the marginal genotype distributions of the two SNPs are equal. Thus, the marginal probabilities can be written as $P(\mathrm{SNP}_{1} = aa) = P(\mathrm{SNP}_{24} = aa)$, $P(\mathrm{SNP}_{1} = aA) = P(\mathrm{SNP}_{24} = aA)$ and $P(\mathrm{SNP}_{1} = AA) = P(\mathrm{SNP}_{24} = AA)$.

The QS model holds for the pair 1-24. The ratio of expected frequencies under the QS model is $(\hat{e}_{12}\,\hat{e}_{23}\,\hat{e}_{31})/(\hat{e}_{21}\,\hat{e}_{32}\,\hat{e}_{13}) = 0.999$, which is approximately equal to 1; hence the QS model can be regarded as equivalent to the S model for this pair.

CONCLUSIONS

In this study, the complete symmetry, quasi-symmetry and marginal homogeneity models, which represent the symmetry structure in square contingency tables, were applied to 24 cross-classified SNP pairs. Our results show that SNP-SNP interactions can be evaluated with symmetry models when the study design is a case-control study. We hope that the results given in this paper will be helpful to researchers studying gene interaction and association analysis, as well as other kinds of interaction analysis in various sciences.

REFERENCES

  • AGRESTI A. 2002. Categorical Data Analysis. Second Edition, John Wiley & Sons, Inc., New Jersey.
  • AKAIKE H. 1974. A new look at the statistical model identification. IEEE Trans Autom Contr 19: 716-723.
  • BISHOP YM, FIENBERG S & HOLLAND PW. 1975. Discrete Multivariate Analysis. Theory and Practice, MIT Press, London.
  • CAUSSINUS H. 1965. Contributions a l’analyse Statistique des Tableaux de Correlation. Ann Fac Sci Univ Toulouse 29: 77-182.
  • CLARKE MC, ANDERSON CA, PETTERSSON FH, CARDON LR, MORRIS AP & ZONDERVAN KT. 2011. Basic statistical analysis in genetic case-control studies. Nature Protocols 6: 121-133.
  • EFENDİOĞLU G. 2015. Karesel Olumsallık Tablolarında Simetrik Olmayan Modeller. Yüksek Lisans Tezi, Hacettepe Üniversitesi Fen Bilimleri Enstitüsü, Ankara.
  • GOODMAN LA. 1985. The Analysis of Cross-Classified Data Ordered and/or Unordered Categories: Association Models, Correlation Models, and Asymmetry Models for Contingency Tables with or without Missing Entries. Ann Statistics 13: 10-69.
  • HABER M. 1982. Testing for Independence in Intraclass Contingency Tables. Biometrics 38: 93-103.
  • HARTWING FP. 2013. SNP-SNP Interactions: Focusing on Variable Coding for Complex Models of Epistasis. J Genet Syndr Gene Ther 4: 189.
  • KARADAĞ Ö & AKTAŞ S. 2018. A generalized, multi-stage adjusted, latent class linear mixed model for testing genetic association. Comm Statist Simulation Comput 48: 2301-2312.
  • LAWAL H & SUNDHEIM R. 2002. Generating Factor Variables for Asymmetry, Non-independence and Skew-symmetry Models in Square Contingency Tables using SAS. Journal Statist Soft 7: 1-23.
  • LEWIS CM. 2002. Genetic Association Studies: Design, analysis and interpretation. Briefings in Bioinformatics 3: 146-153.
  • SLAGER SL & SCHAID DJ. 2001. Evaluation of Candidate Genes in Case-Control Studies: A Statistical Method to Account for Related Subjects. Am J Human Genet 68: 1457-1462.
  • SONG M, ZHANG Y, KATZAROFF AJ, EDGAR BA & BUTTITTA L. 2014. Hunting complex differential gene interaction patterns across molecular contexts. Nucleic Acids Research 42.
  • STUART A. 1955. A Test for Homogeneity of the Marginal Distributions in a Two-way Classification. Biometrika 42: 412-441.
  • TAHATA K, IWASHITA T & TOMIZAWA S. 2008. Measure of Departure from Conditional Marginal Homogeneity for Square Contingency Tables with Ordered Categories. Statistics 42: 453-466.
  • VAIDYANATHAN V, NAIDU V, KARUNASINGHE N, JABED A, PALLATI R, MARLOW G & FERGUSON RL. 2017. SNP-SNP interactions as risk factors for aggressive prostate cancer. F1000Research 6: 621.
  • VELEZ JI, MARMOLEJO-RAMOS F & CORREA JC. 2016. A Graphical Diagnostic Test for Two-Way Contingency Tables. Rev Colomb Estadístic 9: 97-108.
  • YAMAGUCHI K. 1990. Some models for the analysis of asymmetric association in square contingency tables with ordered categories. Sociological Methodology, p. 181-212.

Publication Dates

  • Publication in this collection
    27 Nov 2020
  • Date of issue
    2020

History

  • Received
    19 Apr 2019
  • Accepted
    22 Oct 2019
Academia Brasileira de Ciências Rua Anfilófio de Carvalho, 29, 3º andar, 20030-060 Rio de Janeiro RJ Brasil, Tel: +55 21 3907-8100 - Rio de Janeiro - RJ - Brazil
E-mail: aabc@abc.org.br