Homework #2: Due Friday, 16 Dec, 5pm.
You are encouraged to work together on these problems. However, you must write up your own answers. You may share plots but Do not copy another person's words.
Although I expect you to do the computing, this is not
an evaluation of your ability to use R. Try to write
appropriate code, but ask if you have any questions or need
help correcting errors. If you can't fix a problem in 10 min,
e-mail me and work on something else.
Please send me:
the line of code that's creating the problem, the error message from R, and the output produced by str(XXX), where XXX is the name of the object used in your line of code.
For example, if R complains about
lipo.d <- vegdist(lipo.m,'canb')
send me the line of code, the error message, and the output from str(lipo.m).
Or, if you are someone who collects all your code in a script file, paste the script file into the body of an e-mail message. Include commands from reading the data file, through the command that produces the error. Also copy / paste the error message from R.
1) We will continue with the analysis of the Echinacea lipophyllic metabolite data set. I repeat the description from HW 1.
The data in lipo.csv on the datasets part of the class web site is from a study of metabolites in plant species in the genus Echinacea. Echinacea is one the most commonly used herbal botanical products. The metabolites measured here are only those related to the putative active compounds. The rows are samples, each from one species and geographic location. Samples are labelled by a code for the species (e.g. Tenn) and sometimes for the species and subspecies (e.g. ParaPara) and a number that indicates the geographic location. For our purposes, we will not make any distinction between species and subspecies, so we will treat Tenn, ParaNeg, and ParaPara as three separate groups that will be called species. Columns represent different metabolites. Some could be identified; others, labelled unknown_X, could not. Each column represents a different metabolit. The numbers in each column are the abundance of that metabolite in each sample. There are 40 samples and 43 metabolites in the data set.
The investigators who collected the data want to know about the differences between species and the metabolites responsible for those differences. These metabolomic data can be treated like species composition data, except that distance between samples should be quantified by the Canberra distance computed from the raw abundance (vegdist( XXX , method='canb') with no standardization). (If you do metabolomics, there are better distance measures, but the Canberra is ok and widely available). The formula for Canberra distance is on the sheet of distance measures handed out in week 1. It is also in the help file for vegdist. One thing that may help to know about Canberra distance is that when a metabolite is present in one sample and absent in the other, that metabolite contributes the maximum possible value of 1 to the sum.
We will focus on differences between species. We can add a spp variable
to the lipo data frame by:
lipo$spp <- substring(lipo$access,1,3)
lipo$spp <- as.factor(lipo$spp)
The hyptest.r code shows how to compute the 'between/within' distance. That is 0 if the pair of samples are in the same group (species) and 1 if the pair is in different groups.
We can calculate the mean distance for each group either by subsetting or using tapply() to apply a function to a 'table'. I shift to the Park Grass data to illustrate this. park.bray is the BC distance; park.DN is the between/within distance for Nitrogen groups.
or tapply(park.bray, park.DN, mean)
You can draw parallel boxplots by:
boxplot(split(park.bray, park.DN), names=c('Within','Between'))
a) Calculate the average dissimilarity between pairs of samples from the same species and the average dissimilarity between pairs of samples from different species.
b) Draw a boxplot comparing the between-species and within-species distances.
c) Based on what you see in the boxplot, would you expect a statistical test to reject the null hypothesis of no difference in metabolic profile among species? Briefly explain why or why not.
d) A t-test comparing the two groups of distances gives a p-value that is much less than 0.00001. Is this a reasonable approach to test the null hypothesis in question c? Briefly explain why or why not.
e) Use mantel() to test the null hypothesis in question c. Report the F statistic, your p-value, and your conclusion.
f) Use adonis() to test the null hypothesis in question c. Report the F statistic, your p-value, and your conclusion.
g) Plot the 2 dimensional nMDS computed from the Canberra distance. Overlay
convex hulls for each species. Which species seem to differ from the others?
It would be nice to have the ability to each compare pairs of species based on adonis. Even if it's possible statistically, it's not been implemented.
h) My plot for question g suggests that the species differ in the within species variability in metabolomic composition. Test the null hypothesis that all species have the same variability. Report what test or vegan function you used and your p-value.
2) We also continue with the Missouri river fish data analysis. Some have asked what 'main channel' and 'chute' mean. The photo illustrates the two. The main channel is on the left; the narrow shallow channel on the right is chute habitat. The key question is whether the species composition differs between the two habitats. Our investigation on HW 1 found big differences between gear types and little to no difference when we looked only at the large hoop net (LH) data.
Because R uses sequential tests, the most appropriate (in my opinion) evaluation of a factor is when that factor is put last (right-hand-most) in the model. The sequential test for the last factor is the same as the partial (Type III) test. If you want to evaluate more than one factor, you have to run a new analysis for each factor, each time putting the factor of interest last.
The data in fish.csv are Missouri river fish samples. There are 106 samples with 65 different species of fish. Sampling was done at three sites (L, C, and O) along the river, separated by at least 10km. At each site, data were collected in two habitats, called main channel (m) and chute (c) by the biologists. At each habitat, fish were collected by five (usually) different types of gear: electrofishing (EF), fyke nets (FN), large hoop nets (LH), small hoop nets (SH), and seines (SN). These are different ways to catch fish and some may be more effective for some species. Fyke nets could not be used in the chute habitat at the O site. At all other sites and habitats, there are between 2 and 4 samples per gear, habitat, and site. The site, habitat and gear information is encoded in the sample name (SiteHabitatGearNumber). It is also contained in fishenv.csv.
a) Use adonis() to fit a model with additive effects of gear, habitat, and site (i.e. the right hand side of the formula will be gear + habitat + site). This model uses all the data to evaluate differences among sites, after adjusting for differences in gear and habitat. If you use all the data, is the species composition the same at all sites? Report the F statistic, your p-value, and your conclusion.
b) Is the species composition the same in all habitats? Report the F statistic, your p-value, and your conclusion.
c) Rerun the analysis for question b using 500 permutations. Do you get the same F statistic? The same p-value? Explain why you do (or do not) get the same p-value.
That's all! It's been fun working with you all.
Have a great break.