In this problem, you will simulate a binary outcome Y and genotype X for 100 individuals (nsamp) in 1000 scenarios (nsim). Set the minor allele frequency of X to be 0.30 (maf).
Work on an R markdown document and turn in the html generated by knitting with R
(10 pts.) For the first 100 scenarios, sample Y from a Bernoulli distribution with probability that varies with the genotype X as follows:
- P( Y=1 | X=0 ) = p0_alt = 0.20
- P( Y=1 | X=1 ) = p1_alt = 0.30
- P( Y=1 | X=2 ) = p2_alt = 0.44
(10 pts.) For the remaining 900 scenarios, Y sample from a Bernoulli distribution with probability p0_null = p1_null = p2_null = 0.184.
(10 pts.) For each scenario fit a logistic regression (
fit = glm(Y ~ X, family = "binomial")
) and save the p-values of the association in a vector (pvec = fit
)- Plot the histogram of the p-values
- Plot the qqplot of the p-values (you can use this qqunif function or write your own)
(10 pts.) Assuming that you reject the null hypothesis when the p-value is > 0.05
- What are the values of \(m\), \(m_0\), \(S\) as defined in slide 8 of the multiple testing document
- what’s the false positive rate?
- what is the false discovery rate?
(10 pts.) Calculate the \(m\), \(m_0\), \(S\), FDR, and FPR when the threshold for rejection is 0.25.
Comparison of logistic and linear regression
- (10 pts.) For each scenario fit a linear regression (
fit = glm(Y ~ X, family = "binomial")
), save the p-values and plot the histogram- Plot the logistic regression p-values vs the linear regression p-values
- (Optional) Explain why calling significant the points above the red oblique line in the qqplot (slide 23 of the multiple testing document) yields a FPR of 5%?
For each figure, comment or relate it to the class discussion on multiple testing.
If you don’t know where to start, you can use the following structure
define relevant variables
## number of individuals
nsamp = 100
## number of scenarios
nsim = 1000
## minor allele frequency
maf = 0.30
simulate genotype data
## simulate long vector genotype data with maf
simx = rbinom(nsamp * nsim, 2, maf)
## arrange the simulated genotype data in a matrix that has nsamp rows and nsim colums
Xmat = matrix(simx, nsamp, nsim)
## create an empty matrix Ymat with the same dimensions as Xmat and fill them with NA for now
Ymat = matrix(NA, nsamp, nsim)
Note 1
simulate Y from the alternative
## fill the first 100 scenarios with Y simulated from the alternative
for(s in 1:10)
{
## to set Ymat[,s] = rbinom(nsamp, 2, pX) as a function of the genotype
## define ind0 = Xmat[,s] == 0, ind1 = Xmat[,s] == 1, and ind2 = Xmat[,s] == 2
## set Ymat[ind0,s] = rbinom(sum(ind0), 2, p0_alt ), and the case status for the other genotypes accordingly
}
simulate Y from the null
## if you undertand how to simulate under the alternative, this should be straightforward
Columns of the
Ymat
andXmat
correspond to different scenarios.↩︎