Answers to Exercises 1) Read in the above two files pathogen = read.table("http://www.public.iastate.edu/~gdancik/summer2007/files/pathogen.txt") design = read.table("http://www.public.iastate.edu/~gdancik/summer2007/files/design.txt",header=TRUE) 2) Construct a graph of pathogen load versus time for all simulations on a single graph (hint: use the plotLines function that we have created earlier; the pathogen table must be in matrix form when it is passed into this function) plotLines = function(m, ...) { # this is a comment lower = min(m) upper = max(m) for (i in 1:dim(m)[1]) { plot(m[i,], ylim = c(lower, upper), type = "l", ...) par(new=TRUE) } } plotLines(as.matrix(pathogen), main = "pathogen load vs. time", ylab = "pathogen", xlab = "days") 3)What is the relationship between each of the parameters and parasite load at day 6. (note: day 6 corresponds to the 6th column of the pathogen.txt file) a. Look at this relationship visually for each parameter by plotting the parameter value on the x-axis and parasite load at day 6 on the y-axis. par(mfrow = c(3,2)) for (i in 1:5) { plot(design[,i], pathogen[,6], xlab = colnames(design)[i], ylab = "pathogen load") } b.Fit a (multiple) linear model using parasite load at day 6 as the response variable and the six computer model parameters as the input. (Note: you made need to convert the design table to a matrix using as.matrix) fitp6 = lm(pathogen[,6] ~ as.matrix(design)) Is this form of a multiple linear regression model the best? The parameter aI appears to have the strongest effect, however, based on the graph from the plot in (a), the relationship between aI and pathogen load at day 6 is not linear. 4)Permutation Test a) Read in the workspace above using the command: load(url('http://www.public.iastate.edu/~gdancik/stat430/files/lifespan.RData')) b) Use the function ‘t.test’ to test the alternative hypothesis above (type ‘?t.test’ to see how to test with a one-tailed alternative hypothesis) ans1 = t.test(lifespan$program1,lifespan$program2, alternative = "less") c) A lot of people would report the results above and be happy. However, visually verify that it is not clear whether the assumptions of the t-test have been met, namely the assumption that both populations are normally distributed (since n < 30). This actually should have been done prior to (a). Use the functions ‘hist’, and if you are familiar with normal quantiles plots, ‘qqnorm’, and ‘qqline’. par(mfrow = c(1,2)) hist(lifespan$program1, main = "program1") hist(lifespan$program2, main = "program2") par(mfrow = c(1,2)) qqnorm(lifespan$program1, main = "program1") qqline(lifespan$program1) qqnorm(lifespan$program2, main = "program2") qqline(lifespan$program2) From these plots the assumption of normality should be questioned, although the sample sizes are probably not large enough for formal rejection of a null hypothesis that both data sets come from normal populations. d) In this situation, a permutation test is more reliable (and more conservative) than the t-test. Construct an approximate randomization test to test H1 against H0. What is your conclusion? Might it differ from a conclusion using the p-value obtained in (b)? t.perm <- function(x1, x2, K = 5000) { all = matrix(c(x1,x2)) T = matrix(0,K) t.star = mean(x1) - mean(x2) for (i in 1:K) { all = sample(all) xbar1 = mean(all[1:length(x1)]) xbar2 = mean(all[(length(x1)+1):length(x2)]) T[i] = xbar1 - xbar2 ## we don't take absolute value since H1 is one-sided } p = sum(T <= t.star) / length(T) return (p) } ans2 = t.perm(lifespan$program1, lifespan$program2) ans1 is less than .1, a common cutoff (and closer to 0.05), whereas ans2 is much larger, suggesting there is not evidence to reject the null hypothesis. 7. Maximum Likelihood Estimation k = c(5, 11, 8, 15, 15, 10) geolike <- function(p, k) { k = as.vector(k) ## in case k is passed in as a matrix ans = matrix(0,length(k)) for (i in 1:length(k)) { ans[i] = (1-p)**(k[i]-1) * p } return (prod(ans)) } ans1 = optimize(geolike, interval = c(0,1), maximum = TRUE, k = k) #the true optimum is ans2 = length(k) / sum(k) Note that the answers are essentially identical (although this will not always be true). 8. Monte Carlo Estimation k = c(5, 11, 8, 15, 15, 10) t.star = mean(k) K = 5000 T = matrix(0,K) for (i in 1:K) { s = rgeom(length(k), prob = .11) s[ s >= 15] = 15 ## since we only observe censored data T[i] = mean(s) } p = sum(T >= t.star) / K