Homework 2 - 2022

Haky Im

2022-01-19

Categories: homework

In this problem, you will simulate a binary outcome Y and genotype X for 100 individuals (nsamp) in 1000 scenarios (nsim). Set the minor allele frequency of X to be 0.30 (maf).

Work on an R markdown document and turn in the html generated by knitting with R

(10 pts.) For the first 100 scenarios, sample Y from a Bernoulli distribution with probability that varies with the genotype X as follows:
- P( Y=1 | X=0 ) = p0_alt = 0.20
- P( Y=1 | X=1 ) = p1_alt = 0.30
- P( Y=1 | X=2 ) = p2_alt = 0.44
(10 pts.) For the remaining 900 scenarios, Y sample from a Bernoulli distribution with probability p0_null = p1_null = p2_null = 0.184.
(10 pts.) For each scenario fit a logistic regression (fit = glm(Y ~ X, family = "binomial")) and save the p-values of the association in a vector (pvec = fit)
- Plot the histogram of the p-values
- Plot the qqplot of the p-values (you can use this qqunif function or write your own)
(10 pts.) Assuming that you reject the null hypothesis when the p-value is > 0.05
- What are the values of \(m\), \(m_0\), \(S\) as defined in slide 8 of the multiple testing document
- what’s the false positive rate?
- what is the false discovery rate?
(10 pts.) Calculate the \(m\), \(m_0\), \(S\), FDR, and FPR when the threshold for rejection is 0.25.

Comparison of logistic and linear regression

(10 pts.) For each scenario fit a linear regression (fit = glm(Y ~ X, family = "binomial")), save the p-values and plot the histogram
- Plot the logistic regression p-values vs the linear regression p-values
(Optional) Explain why calling significant the points above the red oblique line in the qqplot (slide 23 of the multiple testing document) yields a FPR of 5%?

For each figure, comment or relate it to the class discussion on multiple testing.

If you don’t know where to start, you can use the following structure

define relevant variables

## number of individuals
nsamp = 100
## number of scenarios
nsim = 1000
## minor allele frequency
maf = 0.30

simulate genotype data

## simulate long vector genotype data with maf
simx = rbinom(nsamp * nsim, 2, maf) 
## arrange the simulated genotype data in a matrix that has nsamp rows and nsim colums
Xmat = matrix(simx, nsamp, nsim)
## create an empty matrix Ymat with the same dimensions as Xmat and fill them with NA for now
Ymat = matrix(NA, nsamp, nsim)

Note ¹

simulate Y from the alternative

## fill the first 100 scenarios with Y simulated from the alternative
for(s in 1:10)
{
  ## to set Ymat[,s] = rbinom(nsamp, 2, pX) as a function of the genotype
  ## define ind0 = Xmat[,s] == 0, ind1 = Xmat[,s] == 1, and ind2 = Xmat[,s] == 2
  ## set Ymat[ind0,s] = rbinom(sum(ind0), 2, p0_alt ), and the case status for the other genotypes accordingly
}

simulate Y from the null

## if you undertand how to simulate under the alternative, this should be straightforward

Columns of the Ymat and Xmat correspond to different scenarios.↩︎