In this problem, you will simulate a binary outcome Y and genotype X for 100 individuals (nsamp) in 1000 scenarios (nsim). Set the minor allele frequency of X to be 0.30 (maf).

Work on an R markdown document and turn in the html generated by knitting with R

  1. (10 pts.) For the first 100 scenarios, sample Y from a Bernoulli distribution with probability that varies with the genotype X as follows:

    • P( Y=1 | X=0 ) = p0_alt = 0.20
    • P( Y=1 | X=1 ) = p1_alt = 0.30
    • P( Y=1 | X=2 ) = p2_alt = 0.44
  2. (10 pts.) For the remaining 900 scenarios, Y sample from a Bernoulli distribution with probability p0_null = p1_null = p2_null = 0.184.

  3. (10 pts.) For each scenario fit a logistic regression (fit = glm(Y ~ X, family = "binomial")) and save the p-values of the association in a vector (pvec = fit)

    • Plot the histogram of the p-values
    • Plot the qqplot of the p-values (you can use this qqunif function or write your own)
  4. (10 pts.) Assuming that you reject the null hypothesis when the p-value is > 0.05

    • What are the values of \(m\), \(m_0\), \(S\) as defined in slide 8 of the multiple testing document
    • what’s the false positive rate?
    • what is the false discovery rate?
  5. (10 pts.) Calculate the \(m\), \(m_0\), \(S\), FDR, and FPR when the threshold for rejection is 0.25.

Comparison of logistic and linear regression

  1. (10 pts.) For each scenario fit a linear regression (fit = glm(Y ~ X, family = "binomial")), save the p-values and plot the histogram
    • Plot the logistic regression p-values vs the linear regression p-values
  2. (Optional) Explain why calling significant the points above the red oblique line in the qqplot (slide 23 of the multiple testing document) yields a FPR of 5%?

For each figure, comment or relate it to the class discussion on multiple testing.

If you don’t know where to start, you can use the following structure

define relevant variables

## number of individuals
nsamp = 100
## number of scenarios
nsim = 1000
## minor allele frequency
maf = 0.30

simulate genotype data

## simulate long vector genotype data with maf
simx = rbinom(nsamp * nsim, 2, maf) 
## arrange the simulated genotype data in a matrix that has nsamp rows and nsim colums
Xmat = matrix(simx, nsamp, nsim)
## create an empty matrix Ymat with the same dimensions as Xmat and fill them with NA for now
Ymat = matrix(NA, nsamp, nsim)

Note 1

simulate Y from the alternative

## fill the first 100 scenarios with Y simulated from the alternative
for(s in 1:10)
{
  ## to set Ymat[,s] = rbinom(nsamp, 2, pX) as a function of the genotype
  ## define ind0 = Xmat[,s] == 0, ind1 = Xmat[,s] == 1, and ind2 = Xmat[,s] == 2
  ## set Ymat[ind0,s] = rbinom(sum(ind0), 2, p0_alt ), and the case status for the other genotypes accordingly
}

simulate Y from the null

## if you undertand how to simulate under the alternative, this should be straightforward

  1. Columns of the Ymat and Xmat correspond to different scenarios.↩︎

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The source code is licensed under MIT.

Suggest changes

If you find any mistakes (including typos) or want to suggest changes, please feel free to edit the source file of this page on Github and create a pull request.

Citation

For attribution, please cite this work as

Haky Im (2022). Homework 2 - 2022. HGEN 471 Class Notes. /post/2022/01/19/homework-2-2022/

BibTeX citation

@misc{
  title = "Homework 2 - 2022",
  author = "Haky Im",
  year = "2022",
  journal = "HGEN 471 Class Notes",
  note = "/post/2022/01/19/homework-2-2022/"
}