A branch of systems biology deals with the design and analysis of mathematical and computer models of disease in order to estimate immunological and pathogen-related parameters, compare alternative models of the immune response, and to explore treatment and vaccination strategies in silico. We will be looking at output from a computer model of Leishmania major infection. Leishmania are protozoan parasites that infect macrophages, are transmitted through the bites of infected sandflies, and are endemic in 88 countries (though not the United States). Disease can either be cutaneous, where skin ulcers occur on exposed surfaces of the body, or visceral (where the infection spreads to organs such as the liver or the spleen), which has near-certain mortality if left untreated. Today, we will analyze computer model output in order to understand the effect that computer model parameters have on simulation output. Important Files 1. http://www.public.iastate.edu/~gdancik/summer2007/files/design.txt This file contains a 100 x 5 matrix of the computer model inputs. Each column corresponds to a particular computer model parameter, and each row corresponds to the parameter setting for a particular computer model run. All parameter values are scaled to be integers between 0 and 49. Note that the column names are given in this file. The computer model parameters are as follows: aI - the growth rate of the parasite kI - the amount of intracellular parasite that an infected macrophage can hold mrecr - the recruitment rate of macrophages (the host cell) trecr - the recruitment rate of t cells (which eliminate the pathogen) tdelay - t cells are not recruited until a threshold level of total pathogen is reached 2. http://www.public.iastate.edu/~gdancik/summer2007/files/pathogen.txt This is a 100 x 35 table of computer model output corresponding to pathogen load over time. The columns correspond to days 1 through 35, and each row corresponds to output from a particular simulation. (e.g. pathogen[3,] is the pathogen load over time from simulation 3). Exercises / Questions: 1) Read in the above two files 2) Construct a graph of pathogen load versus time for all simulations on a single graph (hint: use the plotLines function that we have created earlier; the pathogen table must be in matrix form when it is passed into this function) 3) What is the relationship between each of the parameters and parasite load at day 6. (note: day 6 corresponds to the 6th column of the pathogen.txt file): a. Look at this relationship visually for all parameters by plotting the parameter value on the x-axis and parasite load at day 6 on the y-axis. b. Fit a (multiple) linear regression model using parasite load at day 6 as the response variable and the six computer model parameters as the input. (Note: you may need to convert the design table to a matrix using as.matrix) Is this form of a multiple linear regression model the best? 4) Hypotheses Testing (and Permutation Tests). You have recently developed an algorithm for predicting someone's lifespan based on information obtained during childhood. You are interested in whether or not your algorithm does a better job than a competitor's. Each program has been tested against 10 known (but different) cases, and for each case, a score is recorded that indicates a measure of difference between the individual's true lifespan and the predicted lifespan. These results can be found in the workspace http://www.public.iastate.edu/~gdancik/stat430/files/lifespan.RData which contains the object lifespan that has components 'program1' and 'program2'. You wish to test whether your algorithm is better (i.e., has a lower mean score) H0: µ1 = µ2 H1: µ1 < µ2 a) Read in the workspace above using the command: load(url('http://www.public.iastate.edu/~gdancik/stat430/files/lifespan.RData')) b) Use the function 't.test' to test the alternative hypothesis above (type '?t.test' to see how to test with a one-tailed alternative hypothesis) c) A lot of people would report the results above and be happy. However, visually verify that it is not clear whether the assumptions of the t-test have been met, namely the assumption that both populations are normally distributed (since n < 30). This actually should have been done prior to (a). Use the functions 'hist', and if you are familiar with normal quantiles plots, 'qqnorm', and 'qqline'. d) In this situation, a permutation test is more reliable (and more conservative) than the t-test. Construct an approximate randomization test to test H1 against H0. What is your conclusion? Might it differ from a conclusion using the p-value obtained in (b)? 7) Maximum Likelihood Estimation Recall from example 0.2.2 (pg. 5 of the course notes) that for X1, ..., Xn, Xi ~ independent Geometric (p), with joint probability , a maximum likelihood estimate (mle) of p is In this example, the maximum likelihood estimate of p exists in closed form, but this is not always the case. When this happens, numerical search methods are often used. In this exercise we will use R to demonstrate the use of numerical methods for finding the mle of p using the likelihood function above. Let k = the number of trials before successfully connecting to a server. k = c(5, 11, 8, 15, 15, 10) Note that the likelihood of 'k' can be written in R as geolike <- function(p, k) { ## k is a vector / matrix of observations k = as.vector(k) ## in case k is passed in as a matrix ans = matrix(0,length(k)) for (i in 1:length(k)) { ans[i] = (1-p)**(k[i]-1) * p } return (prod(ans)) } Use the 'geolike' function and the function 'optimize' to find the maximum likelihood estimate of p for the vector k above. Type '?optimize' for information on how to use this function, and note that we want to maximize the function, not minimize it. Using R, find the true mle of p using the formula on the top of this page. How does this result compare to the result obtained using 'optimize'? It may also be useful to know that R supports multidimensional minimization (or maximization) using the function 'optim'. 8) Monte Carlo Estimation (of p-values) Suppose that the vector k = c(5, 11, 8, 15, 15, 10) is actually censored. That is, there is a vector k* that contains the true but unknown observations, and ki = min(ki*, 15) for all i. However, we believe that the distribution of k is geometric with p = 0.11, and we wish to test the hypothesis: H0: p = 0.11 H1: p ? 0.11 In what follows, we will use the test statistic s = the sample mean. We will use a Monte Carlo Sampling approach (Cohen, pg. 150) to find the sampling distribution of t under the null hypothesis. For i = 1, ..., M a. Draw a sample of size n = 6 under H0 using 'rgeom' b. Calculate the test statistic from the sample in (a), taking into account that some of the data might be censored. Use the distribution of s to calculate a p-value, and make a conclusion.