Data sets used in the text are all on the CD enclosed with the book. See the README file for instructions on copying and uncompressing the files.

Data sets that have been discussed in class or are used in homework are:

  • Data discussed in Aug 27 lecture. Each column is one set of data. All four sets have the same mean and standard deviation but differ in other characteristics.
  • Data for HW 1, problem 26. Zinc in rats. Variables are: Group (A or B) and zinc level (mg/ml).
  • Tomato fertilizer data discussed in class and lab, Aug 30. Each row is data for a single plant. Variables are fertilizer (a: standard, b: improved) and yield (lbs).
  • Transgenic mice weight data. Data for HW 2, problem 1. Weights of transgenic and non-transgenic mice. Each row is a single mouse. Variables are tg (0: non-transgenic, 1: transgenic) and weight (in grams).
  • Radon concentrations in Ramsey Co, MN. Data for HW 2, problem 2. Each row is data for a single house. The value is the radon concentration (units are picoCuries/liter of air, pCi/l).
  • Microsatellite counts for mutagen study. Data for the problem on HW 3. The two columns are the dose of mutagen (0 or 80) and the count of microsatellite nuclei in 100 cells.
  • Bumpus weight data. Each line of data has two variables: weight (grams) and a code that indicates whether that sparrow lived or died. Sparrows that survived have a code of 1; sparrows that perished have a code of 2.
  • Bumpus sparrow humerus length data. The Bumpus sparrow data are from Case Study 2.1 (p 28). Each row is a sparrow. The first value is the humerus length (in thousandths of an inch); the second value is a code: 1 = perished, 2 = survived.
  • Change in blood pressure for men who received 4 weeks of a fish oil diet
  • Schizophrenia data (Case Study 2.2). Each line of data corresponds to a pair of monozygotic twins. The first number is the hippocampus volume for the unaffected individual. The second number is the hippocampus volume for the individual affected by schizophrenia.
  • Bioremediation data. This is a subset of data from an experiment evaluating ways to grow crops in areas contaminated by radioactivity (e.g. Chernobyl fallout). The two treatments here are HiK: a fertilizer with large amounts of potassium, and Bio: a plastic barrier to stop roots penetrating too deeply into the soil. Each line corresponds to an experimental plot. The first variable is the treatment; the second is the level of contamination (pCi/gm) in collards grown in that plot.
  • Darwin cross/self fertilization data. Each row of the data set represents a pair of plants. The first number is some extra information not needed for this problem. The second is the height (in inches) of the cross-fertilized plant. The third is the height of the self-fertilized plant.
  • Bumpus sparrow humerus length data. The first column is the humerus length in thousandths of an inch. The second column indicates whether the sparrow survived: 1 = died, 2 = survived.
  • Bee data for chapter 3, problem 28. First column is proportion of pollen removed, 2nd column is duration of visit (in seconds) and 3rd column is type of bee: 1 = bumblebee, 2 = honeybee workers.
  • Iron supplementation data for chapter 3, problem 31. First column is the percent retention of iron. The second column is the form of the iron: 1 = Fe3, 2 = Fe4.
  • Rainfall data from Case study 3.1. Used in transf.sas (lab on 27 Sept). The first column is the rainfall, the second is a treatment code: 1 = unseeded day, 2 = seeded day.
  • Diet and longevity data set (Case study 5.1), for diet.sas in lab on 11 Oct 2004.
  • Tyrannosaurus data. Oxygen isotope data on bones from a single T. rex for problem 5:23. Column 1 is the oxygen isotope value; column 2 is the bone number.
  • Fatty acid data for problem 5:18. The first column is the protein level, the second column is the treatment (numbered 1 to 6, where 1 is CPFA 50, 2 is CPFA 150, and so on through 6, which is Control), the third column is the day (1 to 5) and the last column is 'group'. Group has 10 unique values: each of the first five treatments is a single group, and each day of the control is its own group.
  • Handicap data set (Case study 6.1) for HW 7. The two columns are the score of perceived qualifications and a code for the handicap. The codes are: 1 = None, 2 = Amputee, 3 = Crutches, 4 = Hearing, and 5 = Wheelchair.
  • Peanut and aflatoxin concentration data. The first column is the percent clean peanuts. The second is the aflatoxin concentration (ppb).
  • Meat pH data set (Case study 6.2), for meat.sas in lab on 25 Oct 2004.
  • Planet data for HW 8. Problem 7.14, data from display 1.15. Three columns: planet name, order from sun, and distance from sun.
  • Pollen data, queens only, for HW 8, problem 7.17. This is the subset of bee.txt with only queen data in it. First column is proportion of pollen removed, 2nd column is duration of visit (in seconds) and 3rd column is type of bee: 1 = bumblebee queen, 2 = honeybee workers.
  • Music / brain activity data for HW 8. First column is the number of years the subject has played a string instrument. The second is the neuronal activity index, a measure of brain activity.
  • Eruption.txt. Wait time between eruptions of the Old Faithful geyser in Yellowstone. Columns are date (ignore for HW 9 problem), interval between eruptions (in minutes), and duration of the interval (in minutes).
  • Full meat pH data set. Meat.txt from case study 6.2 with 2 additional observations at 24 hr.
  • wine.txt. Data on wine consumption (liters/person/yr) and ischemic heart disease deaths (deaths/1000 people) for 18 industrialized countries. First column is the country, second is the wine consumption and third is the mortality.
  • Anscombe data sets used in handout that illustrated the need for regression diagnostics.
  • Brain weight data set (case study 9.2), for brain.sas in lab on 15 Nov 2004.
  • Flowering time data (Case study 9.1). Uses E or L to mark early or late groups.
  • Flowering time data (Case study 9.1). Uses 1 or 2 to mark early or late groups.
  • SAT score data (Case study 12.1). Note: Alaska is omitted.
  • Modified Anscombe data sets used to illustrate Cook's distance. A small amount of random jitter is added to all X values. Used in anscombe2.sas.
  • Corn data set. Used in corn.sas to illustrate polynomial regression.
  • Collard data set. Used in collard.sas.
  • Pygmalion experiment from textbook. Used in pygmalion.sas.
  • Zinc and Copper data for homework problem 13.16
  • Iridium data for HW problem 13.17.
  • Donner party data from textbook (Case study 20.1). Used in donner.sas.
  • Excel worksheet with Donner party data from textbook (Case study 20.1). Used in readexcel.sas.
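
Many of the files above share one layout: a measurement column followed by a numeric group code. The labs use SAS, but for anyone checking a file outside SAS, the following Python sketch shows one way to read such a file and attach labels to the codes. It assumes whitespace-delimited rows with no header line, which is not confirmed for every file. The function name read_coded_columns and the sample rows are made up for illustration; only the code-to-label mapping comes from the Handicap description in the list.

```python
from collections import defaultdict
from statistics import mean

# Code-to-label mapping taken from the Handicap data description in the list.
HANDICAP_LABELS = {1: "None", 2: "Amputee", 3: "Crutches", 4: "Hearing", 5: "Wheelchair"}


def read_coded_columns(lines, labels):
    """Parse 'value code' rows into {label: [values]}.

    Assumption: rows are whitespace-delimited with no header line.
    Blank lines are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        value, code = float(fields[0]), int(fields[1])
        groups[labels[code]].append(value)
    return dict(groups)


# Made-up rows for illustration only; these are not values from the real file.
sample = """\
1.5 1
2.5 1
4.0 3
"""

groups = read_coded_columns(sample.splitlines(), HANDICAP_LABELS)
group_means = {label: mean(values) for label, values in groups.items()}
print(group_means)
```

To use it on a real file, pass an open file handle in place of sample.splitlines(). Files whose code comes first (for example the Bioremediation data, where the treatment is the first variable) would need the two field positions swapped.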