Genetic Analysis Package

Jing Hua Zhao, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK, https://jinghuazhao.github.io/.

2022-05-10

Introduction

This package was initiated to integrate some C/Fortran/SAS programs I have written or used over the years. As such, it would rather be a long-term project, but an immediate benefit would be something complementary to other packages currently available from CRAN, e.g. genetics, hwde, etc. I hope eventually this will be part of a bigger effort to fulfill most of the requirements foreseen by many,1 within the portable environment of R for data management, analysis, graphics and object-oriented programming. My view has been outlined more formally2,3 in relation to other package systems and also on package kinship.4,5

The number of functions are quite limited and experimental, but I already feel the enormous advantage by shifting to R and would like sooner rather than later to share my work with others. I will not claim this work as exclusively done by me, but would like to invite others to join me and enlarge the collections and improve them.

With my recent work on genomewide association studies (GWASs) especially protein GWASs, I have added many functions such as METAL_forestplot which handles data from software METAL and sentinels which extracts sentinels from GWAS summary statistics in a way that is very appealing to us compared to some other established software. At the meantime, the size of the package surpasses the limit as imposed by CRAN, thus the old good feature of S as with R that value both code and data alike has to suffer slightly in that gap.datasets and gap.examples are spun off as two separate packages; they do deserve a glimpse however for some general ideas.

Several other updates are worth mentioning:

  1. use of markdown from the noweb/Sweave format which allows for Citation Style Language (CSL) (https://citationstyles.org/) to format the bibliography.

  2. multiple figures generated from a code section to be embedded.

  3. Experimental implementation of a shiny App (runshinygap()).

  4. roxygen2 style documentation in companion with R code to allow for easier generation of .Rd files through devtools::document().

It is likely these experiences will be shared with/useful for other colleagues in converting their work into R.

Implementation

The following list shows the data and functions currently available.

Name Description
ANALYSIS  
AE3 AE model using nuclear family trios
bt Bradley-Terry model for contingency table
ccsize Power and sample size for case-cohort design
cs Credibel set
fbsize Sample size for family-based linkage and association design
gc.em Gene counting for haplotype analysis
gcontrol genomic control
gcontrol2 genomic control based on p values
gcp Permutation tests using GENECOUNTING
gc.lambda Estionmation of the genomic control inflation statistic (lambda)
genecounting Gene counting for haplotype analysis
gif Kinship coefficient and genetic index of familiality
grmMCMC Mixed modeling with genetic relationship matrices
gsmr Mendelian randomization analysis
hap Haplotype reconstruction
hap.em Gene counting for haplotype analysis
hap.score Score statistics for association of traits with haplotypes
htr Haplotype trend regression
hwe Hardy-Weinberg equilibrium test for a multiallelic marker
hwe.cc A likelihood ratio test of population Hardy-Weinberg equilibrium
hwe.hardy Hardy-Weinberg equilibrium test using MCMC
invnormal Inverse normal transformation
kin.morgan kinship matrix for simple pedigree
LD22 LD statistics for two diallelic markers
LDkl LD statistics for two multiallelic markers
lambda1000 A standardized estimate of the genomic inflation scaling to
  a study of 1,000 cases and 1,000 controls
log10p log10(p) for a standard normal deviate
log10pvalue log10(p) for a P value including its scientific format
logp log(p) for a normal deviate
masize Sample size calculation for mediation analysis
mia multiple imputation analysis for hap
mtdt Transmission/disequilibrium test of a multiallelic marker
mtdt2 Transmission/disequilibrium test of a multiallelic marker
  by Bradley-Terry model
mvmeta Multivariate meta-analysis based on generalized least squares
pbsize Power for population-based association design
pbsize2 Power for case-control association design
pfc Probability of familial clustering of disease
pfc.sim Probability of familial clustering of disease
pgc Preparing weight for GENECOUNTING
print.hap.score Print a hap.score object
s2k Statistics for 2 by K table
sentinels Sentinel identification from GWAS summary statistics
tscc Power calculation for two-stage case-control design
   
GRAPHICS  
asplot Regional association plot
ESplot Effect-size plot
circos.cnvplot circos plot of CNVs
circos.cis.vs.trans.plot circos plot of cis/trans classification
circos.mhtplot circos Manhattan plot with gene annotation
cnvplot genomewide plot of CNVs
METAL_forestplot forest plot as R/meta’s forest for METAL outputs
makeRLEplot make relative log expression plot
mhtplot Manhattan plot
mhtplot2 Manhattan plot with annotations
pqtl2dplot 2D pQTL plot
pqtl2dplotly 2D pQTL plotly
pqtl3dplotly 3D pQTL plotly
mhtplot.trunc truncated Manhattan plot
miamiplot Miami plot
pedtodot Converting pedigree(s) to dot file(s)
plot.hap.score Plot haplotype frequencies versus haplotype score statistics
qqfun Quantile-comparison plots
qqunif Q-Q plot for uniformly distributed random variable
   
UTILITIES  
SNP Functions for single nucleotide polymorphisms (SNPs)
BFDP Bayesian false-discovery probability
FPRP False-positive report probability
ab Test/Power calculation for mediating effect
b2r Obtain correlation coefficients and their variance-covariances
chow.test Chow’s test for heterogeneity in two regressions
chr_pos_a1_a2 Form SNPID from chromosome, posistion and alleles
cis.vs.trans.classification a cis/trans classifier
comp.score score statistics for testing genetic linkage of quantitative trait
GRM functions ReadGRM, ReadGRMBin, ReadGRMPLINK,
  ReadGRMPCA, WriteGRM,
  WriteGRMBin, WriteGRMSAS
  handle genomic relationship matrix involving other software
get_b_se Get b and se from AF, n, and z
get_pve_se Get pve and its standard error from n, z
get_sdy Get sd(y) from AF, n, b, se
h2G A utility function for heritability
h2GE A utility function for heritability involving gene-environment interaction
h2l A utility function for converting observed heritability to its counterpart
  under liability threshold model
h2_mzdz Heritability estimation according to twin correlations
klem Haplotype frequency estimation based on a genotype table
  of two multiallelic markers
makeped A function to prepare pedigrees in post-MAKEPED format
metap Meta-analysis of p values
metareg Fixed and random effects model for meta-analysis
muvar Means and variances under 1- and 2- locus (diallelic) QTL model
pvalue P value for a normal deviate
read.ms.output A utility function to read ms output
revStrand Allele on the reverse strand
runshinygap Start shinygap
snptest_sample A utility to generate SNPTEST sample file
twinan90 Classic twin models
whscore Whittemore-Halpern scores for allele-sharing
weighted.median Weighted median with interpolation

Assuming proper installation, you will be able to obtain the list by typing library(help=gap) or view the list within a web browser via help.start(). See Appendix on how to obtain a full list of functions.

This file can be viewed with command vignette("gap", package="gap").

You can cut and paste examples at end of each function’s documentation.

Both genecounting and hap are able to handle SNPs and multiallelic markers, with the former be flexible enough to include features such as X-linked data and the later being able to handle large number of SNPs. But they are unable to recode allele labels automatically, so functions gc.em and hap.em are in haplo.em format and used by a modified function hap.score in association testing.

It is notable that multilocus data are handled differently from that in hwde and elegant definitions of basic genetic data can be found in the genetics package.

Incidentally, I found my C mixed-radixed sorting routine6 is much faster than R’s internal function.

With exceptions such as function pfc which is very computer-intensive, most functions in the package can easily be adapted for analysis of large datasets involving either SNPs or multiallelic markers. Some are utility functions, e.g. muvar and whscore, which will be part of the other analysis routines in the future.

The benefit with R compared to standalone programs is that for users, all functions have unified format. For developers, it is able to incorporate their C/C++ programs more easily and avoid repetitive work such as preparing own routines for matrix algebra and linear models. Further advantage can be taken from packages in Bioconductor, which are designed and written to deal with large number of genes.

Independent programs

To facilitate comparisons and individual preferences, The source codes for 2LD, EHPLUS, GENECOUNTING, HAP, now hosted at GitHub, have enjoyed great popularity ahead of the genomewide association studies (GWAS) therefore are likely to be more familiar than their R couunterparts in gap. However, you need to follow their instructions to compile for a particular computer system.

I have included ms code (which is required by read.ms.output and .xls files to accompany read.ms.output and FPRP and BFDP functions as with a classic twin example for ACE model in OpenMx. The package is now available from CRAN.

For these models it is actually simpler to use facilities as in package mets, which is now in suggests.

A final category is twinan90, which is now dropped from the package function list due to difficulty to keep up with the requirements by the R environment but nevertheless you will still be able to compile and use otherwise.

Demos

This has been a template for self-contained examples:

library(gap)
demo(gap)

See examples of haplotype analysis there – additional examples are given below.

Examples

I would like to highlight pedtodot, pbsize, fbsize and ccsize functions used for pedigree drawing and power/sample calculations in a genome-wide asssociatoin study as reported.7

Pedigree drawing

I have included the original file for the R News as well as put examples in separate vignettes. They can be accessed via vignette("rnews",package="gap.examples") and vignette("pedtodot", package="gap.examples"), respectively.

Kinship calculation

Next, I will provide an example for kinship calculation using kin.morgan. It is recommended that individuals in a pedigree are ordered so that parents always precede their children. In this regard, package pedigree can be used, and package kinship2 can be used to produce pedigree diagram as with kinship matrix.

Pedigree diagram

The pedigree diagram is as follows,

library(gap)
#> Loading required package: gap.datasets
#> gap version 1.2.3-6
# pedigree diagram
data(lukas, package="gap.datasets")
library(kinship2)
#> Loading required package: Matrix
#> Loading required package: quadprog
ped <- with(lukas,pedigree(id,father,mother,sex))
plot(ped,cex=0.4)
A pedigree diagram

A pedigree diagram

Kinship calculation

We then turn to the kinship calculation.

# unordered individuals
library(gap)
gk1 <- kin.morgan(lukas)
write.table(gk1$kin.matrix,"gap_1.txt",quote=FALSE)

library(kinship2)
kk1 <- kinship(lukas[,1],lukas[,2],lukas[,3])
write.table(kk1,"kinship_1.txt",quote=FALSE)

d <- gk1$kin.matrix-kk1
sum(abs(d))
#> [1] 2.443634

# order individuals so that parents precede their children
library(pedigree)
#> Loading required package: HaploSim
#> Loading required package: reshape
#> 
#> Attaching package: 'reshape'
#> The following object is masked from 'package:Matrix':
#> 
#>     expand
op <- orderPed(lukas)
olukas <- lukas[order(op),]
gk2 <- kin.morgan(olukas)

write.table(olukas,"olukas.csv",quote=FALSE)
write.table(gk2$kin.matrix,"gap_2.txt",quote=FALSE)

kk2 <- kinship(olukas[,1],olukas[,2],olukas[,3])
write.table(kk2,"kinship_2.txt",quote=FALSE)

z <- gk2$kin.matrix-kk2
sum(abs(z))
#> [1] 0

We see that in the second case, the result agrees with kinship2.

Study design

Family-based design

It now has an experimental work via Shiny from inst/shinygap.

The example involving family-based design is as follows,

options(width=150)
library(gap)
models <- matrix(c(
         4.0, 0.01,
         4.0, 0.10,
         4.0, 0.50, 
         4.0, 0.80,
         2.0, 0.01,
         2.0, 0.10,
         2.0, 0.50,
         2.0, 0.80,
         1.5, 0.01,    
         1.5, 0.10,
         1.5, 0.50,
         1.5, 0.80), ncol=2, byrow=TRUE)
outfile <- "fbsize.txt"
cat("gamma","p","Y","N_asp","P_A","H1","N_tdt","H2","N_asp/tdt",
    "L_o","L_s\n",file=outfile,sep="\t")
for(i in 1:12) {
    g <- models[i,1]
    p <- models[i,2]
    z <- fbsize(g,p)
    cat(z$gamma,z$p,z$y,z$n1,z$pA,z$h1,z$n2,z$h2,z$n3,
        z$lambdao,z$lambdas,file=outfile,append=TRUE,sep="\t")
    cat("\n",file=outfile,append=TRUE)
}
table1 <- read.table(outfile,header=TRUE,sep="\t")
nc <- c(4,7,9)
table1[,nc] <- ceiling(table1[,nc])
dc <- c(3,5,6,8,10,11)
table1[,dc] <- round(table1[,dc],2)
unlink(outfile)
# APOE-4, Scott WK, Pericak-Vance, MA & Haines JL
# Genetic analysis of complex diseases 1327
g <- 4.5
p <- 0.15
cat("\nAlzheimer's:\n\n")
#> 
#> Alzheimer's:
data.frame(fbsize(g,p))
#>   gamma    p         y       n1        pA        h1      n2        h2       n3  lambdao  lambdas
#> 1   4.5 0.15 0.6256916 162.6246 0.8181818 0.4598361 108.994 0.6207625 39.97688 1.671594 1.784353
table1
#>    gamma    p    Y   N_asp  P_A   H1 N_tdt   H2 N_asp.tdt  L_o  L_s
#> 1    4.0 0.01 0.52    6402 0.80 0.05  1201 0.11       257 1.08 1.09
#> 2    4.0 0.10 0.60     277 0.80 0.35   165 0.54        53 1.48 1.54
#> 3    4.0 0.50 0.58     446 0.80 0.50   113 0.42        67 1.36 1.39
#> 4    4.0 0.80 0.53    3024 0.80 0.24   244 0.16       177 1.12 1.13
#> 5    2.0 0.01 0.50  445964 0.67 0.03  6371 0.04      2155 1.01 1.01
#> 6    2.0 0.10 0.52    8087 0.67 0.25   761 0.32       290 1.07 1.08
#> 7    2.0 0.50 0.53    3753 0.67 0.50   373 0.47       197 1.11 1.11
#> 8    2.0 0.80 0.51   17909 0.67 0.27   701 0.22       431 1.05 1.05
#> 9    1.5 0.01 0.50 6944779 0.60 0.02 21138 0.03      8508 1.00 1.00
#> 10   1.5 0.10 0.51  101926 0.60 0.21  2427 0.25      1030 1.02 1.02
#> 11   1.5 0.50 0.51   27048 0.60 0.50  1039 0.49       530 1.04 1.04
#> 12   1.5 0.80 0.51  101926 0.60 0.29  1820 0.25      1030 1.02 1.02

Population-based design

The example involving population-based design is as follows,

library(gap)
kp <- c(0.01,0.05,0.10,0.2)
models <- matrix(c(
          4.0, 0.01,
          4.0, 0.10,
          4.0, 0.50, 
          4.0, 0.80,
          2.0, 0.01,
          2.0, 0.10,
          2.0, 0.50,
          2.0, 0.80,
          1.5, 0.01,    
          1.5, 0.10,
          1.5, 0.50,
          1.5, 0.80), ncol=2, byrow=TRUE)
outfile <- "pbsize.txt"
cat("gamma","p","p1","p5","p10","p20\n",sep="\t",file=outfile)
for(i in 1:dim(models)[1])
{
   g <- models[i,1]
   p <- models[i,2]
   n <- vector()
   for(k in kp) n <- c(n,ceiling(pbsize(k,g,p)))
   cat(models[i,1:2],n,sep="\t",file=outfile,append=TRUE)
   cat("\n",file=outfile,append=TRUE)
} 
table5 <- read.table(outfile,header=TRUE,sep="\t")
table5
#>    gamma    p      p1     p5    p10   p20
#> 1    4.0 0.01   46681   8959   4244  1887
#> 2    4.0 0.10    8180   1570    744   331
#> 3    4.0 0.50   10891   2091    991   441
#> 4    4.0 0.80   31473   6041   2862  1272
#> 5    2.0 0.01  403970  77530  36725 16323
#> 6    2.0 0.10   52709  10116   4792  2130
#> 7    2.0 0.50   35285   6772   3208  1426
#> 8    2.0 0.80   79391  15237   7218  3208
#> 9    1.5 0.01 1599920 307056 145448 64644
#> 10   1.5 0.10  192105  36869  17465  7762
#> 11   1.5 0.50   98013  18811   8911  3961
#> 12   1.5 0.80  192105  36869  17465  7762

Case-cohort design

For case-cohort design, we obtain results for ARIC and EPIC studies.

library(gap)
# ARIC study
outfile <- "aric.txt"
n <- 15792
pD <- 0.03
p1 <- 0.25
alpha <- 0.05
theta <- c(1.35,1.40,1.45)
beta <- 0.2
s_nb <- c(1463,722,468)
cat("n","pD","p1","hr","q","power","ssize\n",file=outfile,sep="\t")
for(i in 1:3)
{
  q <- s_nb[i]/n
  power <- ccsize(n,q,pD,p1,log(theta[i]),alpha,beta,TRUE)
  ssize <- ccsize(n,q,pD,p1,log(theta[i]),alpha,beta)
  cat(n,"\t",pD,"\t",p1,"\t",theta[i],"\t",q,"\t",
      signif(power,3),"\t",ssize,"\n",
      file=outfile,append=TRUE)
}
read.table(outfile,header=TRUE,sep="\t")
#>       n   pD   p1   hr          q power ssize
#> 1 15792 0.03 0.25 1.35 0.09264184   0.8  1463
#> 2 15792 0.03 0.25 1.40 0.04571935   0.8   722
#> 3 15792 0.03 0.25 1.45 0.02963526   0.8   468
unlink(outfile)
# EPIC study
outfile <- "epic.txt"
n <- 25000
alpha <- 0.00000005
beta <- 0.2
s_pD <- c(0.3,0.2,0.1,0.05)
s_p1 <- seq(0.1,0.5,by=0.1)
s_hr <- seq(1.1,1.4,by=0.1)
cat("n","pD","p1","hr","alpha","ssize\n",file=outfile,sep="\t")
# direct calculation
for(pD in s_pD)
{
   for(p1 in s_p1)
   {
      for(hr in s_hr)
      {
         ssize <- ccsize(n,q,pD,p1,log(hr),alpha,beta)
         if (ssize>0) cat(n,"\t",pD,"\t",p1,"\t",hr,"\t",alpha,"\t",
                          ssize,"\n",
                          file=outfile,append=TRUE)
      }
   }
}
read.table(outfile,header=TRUE,sep="\t")
#>        n  pD  p1  hr alpha ssize
#> 1  25000 0.3 0.1 1.3 5e-08 14391
#> 2  25000 0.3 0.1 1.4 5e-08  5732
#> 3  25000 0.3 0.2 1.2 5e-08 21529
#> 4  25000 0.3 0.2 1.3 5e-08  5099
#> 5  25000 0.3 0.2 1.4 5e-08  2613
#> 6  25000 0.3 0.3 1.2 5e-08 11095
#> 7  25000 0.3 0.3 1.3 5e-08  3490
#> 8  25000 0.3 0.3 1.4 5e-08  1882
#> 9  25000 0.3 0.4 1.2 5e-08  8596
#> 10 25000 0.3 0.4 1.3 5e-08  2934
#> 11 25000 0.3 0.4 1.4 5e-08  1611
#> 12 25000 0.3 0.5 1.2 5e-08  7995
#> 13 25000 0.3 0.5 1.3 5e-08  2786
#> 14 25000 0.3 0.5 1.4 5e-08  1538
#> 15 25000 0.2 0.1 1.4 5e-08  9277
#> 16 25000 0.2 0.2 1.3 5e-08  7725
#> 17 25000 0.2 0.2 1.4 5e-08  3164
#> 18 25000 0.2 0.3 1.3 5e-08  4548
#> 19 25000 0.2 0.3 1.4 5e-08  2152
#> 20 25000 0.2 0.4 1.2 5e-08 20131
#> 21 25000 0.2 0.4 1.3 5e-08  3648
#> 22 25000 0.2 0.4 1.4 5e-08  1805
#> 23 25000 0.2 0.5 1.2 5e-08 17120
#> 24 25000 0.2 0.5 1.3 5e-08  3422
#> 25 25000 0.2 0.5 1.4 5e-08  1713
#> 26 25000 0.1 0.2 1.4 5e-08  8615
#> 27 25000 0.1 0.3 1.4 5e-08  3776
#> 28 25000 0.1 0.4 1.3 5e-08 13479
#> 29 25000 0.1 0.4 1.4 5e-08  2824
#> 30 25000 0.1 0.5 1.3 5e-08 10837
#> 31 25000 0.1 0.5 1.4 5e-08  2606
unlink(outfile)

Graphics

I now include some figures from the documentation that may be of interest.

Genome-wide association

The following code is used to obtain a Q-Q plot via qqunif function,

library(gap)
u_obs <- runif(1000)
r <- qqunif(u_obs,pch=21,bg="blue",bty="n")
u_exp <- r$y
hits <- u_exp >= 2.30103
points(r$x[hits],u_exp[hits],pch=21,bg="green")
legend("topleft",sprintf("GC.lambda=%.4f",gc.lambda(u_obs)))
A Q-Q plot

A Q-Q plot

Based on a chicken genome scan data, the code below generates a Manhattan plot, demonstrating the use of gaps to separate chromosomes.

ord <- with(w4,order(chr,pos))
w4 <- w4[ord,]
oldpar <- par()
par(cex=0.6)
colors <- c(rep(c("blue","red"),15),"red")
mhtplot(w4,control=mht.control(colors=colors,gap=1000),pch=19,srt=0)
axis(2,cex.axis=2)
suggestiveline <- -log10(3.60036E-05)
genomewideline <- -log10(1.8E-06)
abline(h=suggestiveline, col="blue")
abline(h=genomewideline, col="green")
abline(h=0)
A genome-wide association study on chickens

A genome-wide association study on chickens

The code below obtains a Manhattan plot with gene annotation,

data <- with(mhtdata,cbind(chr,pos,p))
glist <- c("IRS1","SPRY2","FTO","GRIK3","SNED1","HTR1A","MARCH3","WISP3",
           "PPP1R3B","RP1L1","FDFT1","SLC39A14","GFRA1","MC4R")
hdata <- subset(mhtdata,gene%in%glist)[c("chr","pos","p","gene")]
color <- rep(c("lightgray","gray"),11)
glen <- length(glist)
hcolor <- rep("red",glen)  
par(las=2, xpd=TRUE, cex.axis=1.8, cex=0.4)
ops <- mht.control(colors=color,yline=1.5,xline=3)
hops <- hmht.control(data=hdata,colors=hcolor)
mhtplot(data,ops,hops,pch=19)
axis(2,pos=2,at=1:16,cex.axis=0.5)
A Manhattan plot with gene annotation

A Manhattan plot with gene annotation

All these look familiar, so revised form of the function called mhtplot2 was created for additional features such as centering the chromosome ticks, allowing for more sophisticated coloring schemes, using prespecified fonts, etc. Please refer to the function’s documentation example.

We could also go further with a circos Manhattan plot,

circos.mhtplot(mhtdata, glist)
#> Note: 11 points are out of plotting region in sector 'chr16', track '3'.