Statistics 501 Datasets

Spring 2012


Statistics 501 Home Page

Dataset
Some Information
olive oils
Composition of eight chemicals (columns 2-9) from nine regions of Italy (column 1)
lizards
Mass, snout-vent length and hind limb span of lizards (Table 1.3)
bears
Time series data on heights and weights of female bears (Table 1.4)
Paper Quality
Density and machine direction and cross-direction strengths of different paper specimens (Table 1.2)
Public Utilities
Characteristics of 22 public utilities (Table 12.4)
Los Angeles air pollution data
Wind, Solar radiation, carbon monoxide, nitric oxide, nitrogen dioxide, ozone and hydrocarbon content at noon on 25 days in Los Angeles (Table 1.5)
National Track Records for Women
Country, and records in the women's segment (in seconds) for the 100 m, 200 m, 400 m, and in minutes, 800 m, 1500 m, 3000 m, and the marathon. (Table 1.9)
Data on Bulls
Breed (1/5/8), Sale Price, Yearling height at shoulder (in.), Fat Free Body (lbs.), Percent Fat-free body, Frame -- scale from 1 (small) to 8 (large), Back fat (in.), sale height at shoulder (in.) and sale weight (lbs.) of three breeds of bulls. (Table 1.10)
Radiation Data on Microwaves with doors closed
Oven no. and Radiation. (Table 4.1)
Stiffness of Boards
Stiffness of boards as measured while sending down a shockwave, while vibrating the board and (the last two measurements) statically (Table 4.3)
MRI dataset
Anatomic 128x128 Magnetic Resonance Image of the brain (one slice)
Sweat Data
Perspiration from 20 healthy females in terms of (a) sweat rate, (b) sodium and (c) potassium content. (Table 5.1)
Effluent Data
Biochemical oxygen demand (BOD) and suspended solids (SS) measurements on effluent samples from a commercial lab and a state lab. (Table 6.1)
Bird Data
Tail and wing lengths (in millimeters) for 45 female hook-billed kites. (Table 5.12)
Milk Transportation Costs Data
Fuel, repair and capital costs associated with transporting milk from farms to dairy plants for gasoline trucks. (Table 6.10)
Number Parity Data
Data on Word Parity for the 2x2 word/Arabic format by different/same parity combinations. (Table 6.8)
Oxygen Consumption Data
Data on resting volume O_2 (L/min), resting volume O_2 (mL/kg/min), maximum volume O_2 (L/min), maximum volume O_2 (mL/kg/min) for 25 males and females while running on a treadmill until exhaustion. (Table 6.12)
Dog anesthetics data
Data on (1) high CO_2 pressure, (2) low CO_2 pressure, (3) high pressure + halothane, (4) low pressure + halothane
Steel data
Data on (1) different rolling temperatures (1 and 2) and their effect on (2) yield point and (3) ultimate strength of steel
Image training data
Data on a random selection of images from a database of 7 outdoor images from the Machine Learning Repository. This is the training data and contains measurements on 19 attributes (provided in the header of the file) on 30 images from each type (first column of the file).
Image test data
Data on a random selection of images from a database of 7 outdoor images from the Machine Learning Repository. This is the test data and contains measurements on 19 attributes (provided in the header of the file) on 300 images from each type (first column of the file).
Student scores data
Data on student scores in aptitude, mathematics, language and general knowledge for students in technical disciplines (group 1), architecture (group 2) and medical technology students (group 3) as indicated in column 1.
Peanuts data
Data on yield, weight in grams of sound mature kernels - 250 grams, seed size measured as weight in grams of 100 seeds (columns 3-5) on peanuts of three varieties (column 1) and two locations (column 2).
National Track Records for Men
Country, and records in the Men's segment (in seconds) for the 100 m, 200 m, 400 m, and in minutes, 800 m, 1500 m, 3000 m, and the marathon. (Table 8.6).
Anaconda data
Data on snout vent length, weight and gender of anaconda snakes (Table 6.19).
Spectral reflectance data
Percent spectral reflectance at wavelengths 560 nm (green, column 1) and 720 nm (near infrared) for three species (sitka spruce -- SS, Japanese larch -- JL, lodgepole pine -- LP) of 1-year-old seedlings (in column 3) at three different times (Julian day 150, 235, and 320) during the growing season, all seeds grown with the optimal level of nutrient (Table 6.18).
Wireless services breakdown data
Breakdown data from a designed experiment on cellphone relay towers. Measurements on severity level of the problem (low/high, column 1), level of complexity of the problem (simple/complex, column 2), experience level of engineer (novice/guru, column 3), and time to assess problem (column 4), implement solution (column 5) and total resolution (column 6) (Table 6.20).
Amitriptyline data
Data on total TCAD plasma level (column 1), amount of amitriptyline present in TCAD plasma level (column 2), gender (1 = female/0 = male, column 3), amount of antidepressants over time (column 4), PR and QRS wave measurements (columns 5 and 7), and diastolic BP (column 6) of 7 patients who overdosed on amitriptyline and had to be admitted to hospital for treatment (Table 7.6).
Pulp and Paper Properties data
Breaking length (BL), elastic modulus (EM), stress at failure (SF) and burst strength (BS) of pulp fibers and the paper made from them, with characteristics of arithmetic fiber length (AFL), long fiber fraction (LFF), fine fiber fraction (FFF) and zero span tensile (ZST). The 62 observations have the variables measured in that order (with the response variables in the first four columns and the predictor variables in the last four) (Table 7.7).
Stocks-Prices Data
Weekly rates of return, for each week (first column), for stocks of J. P. Morgan (column 2), Citibank (column 3), Wells Fargo (column 4), Royal Dutch Shell (column 5), and Exxon Mobil (column 6) as per the New York Stock Exchange (Table 8.4).
Mali Family Farm Data
Data from a survey of 76 farmers in the Sikasso region of Mali (West Africa) on (from columns 1 through 9) family size (Family), distance in km to nearest passable road (DistRd), hectares of cotton, maize, sorghum, millet planted in the year 2000 (Cotton, Maize, Sorg, Millet), total number of bullocks or draft animals (Bull), total cattle (Cattle) and total goats (Goats) (Table 8.7).
Painted Turtles Data
Data from 24 female and 24 male painted turtles (column 1 -- male = 2, female = 1) on logarithms of carapace length, breadth and height.
100k road race times
This dataset is from Everitt (1994). There are 80 runners with times for each of the successive 10k segments.
Zipcode Training Data
This dataset is from Stuetzle and Nugent (JCGS, 2010). Intensity values of 16x16 images of a sample of 2000 digits (200 of each of the 10 digits). To be read in conjunction with the Zipcode Digits dataset.
Zipcode Digits Data
This dataset is from Stuetzle and Nugent (JCGS, 2010). The true digits for each specimen in the Zipcode Training dataset.
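As a minimal sketch of how one record of the Zipcode Training data might be turned back into a 16x16 image, the helper below reshapes 256 intensity values into 16 rows of 16. The function name to_image and the row-by-row (row-major) pixel order are assumptions made for illustration; the file itself does not state the pixel order, so check a plotted digit before relying on it.

```python
from typing import List, Sequence


def to_image(pixels: Sequence[float], side: int = 16) -> List[List[float]]:
    """Reshape a flat record of side*side intensities into a square image.

    Assumes (hypothetically) that pixels are stored row by row.
    """
    if len(pixels) != side * side:
        raise ValueError(f"expected {side * side} values, got {len(pixels)}")
    # Slice the flat record into consecutive rows of length `side`.
    return [list(pixels[r * side:(r + 1) * side]) for r in range(side)]
```

If a plotted digit appears transposed, the pixels are stored column by column instead, and the result should be transposed.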
Multiple Sclerosis Data
Data on responses of subjects of different ages (column 1) with and without multiple sclerosis (last column indicator) on responses to stimuli in the left (L) and right (R) eye. Columns 2 and 4 denote the response of both eyes to stimulus 1 and 2 respectively, while columns 3 and 5 represent the differences in responses between the left and the right eyes to stimulus 1 and 2 respectively.
Bankruptcy Data
Annual financial data of financially sound firms (last column, indicator 1) and for the last two years before firms went bankrupt (last column, indicator 0). Data are in the form of ratios of cash flow to total debt (column 1), net income to total assets (column 2), current assets to total liabilities (column 3) and current assets to net sales (column 4).
Crude oil Data
Chemical analysis of crude oil samples from three zones of sandstone (last column): Wilhelm, Sub-Mulinia and Upper. Reported measures are percent ash of vanadium (column 1), iron (column 2), beryllium (column 3), and percent area of saturated (column 4) and aromatic (column 5) hydrocarbons.
Stocks Data
Weekly rates of return for five stocks (Allied Chemical, du Pont, Union Carbide, Exxon, Texaco) listed on the New York Stock Exchange for the period January 1975-December 1976.
Wisconsin Diagnostic Breast Cancer Data
Measurements on 30 features (columns 3-32) of cell nuclei from 212 malignant (column 2, "M") and 357 benign (column 2, "B") breast tissue samples. The 30 features were obtained from a digitized image of a fine needle aspirate (FNA) of a breast mass and describe characteristics of the cell nuclei present in the image. They are derived from the following ten real-valued features computed for each cell nucleus: (a) radius (mean of distances from center to points on the perimeter) (b) texture (standard deviation of gray-scale values) (c) perimeter (d) area (e) smoothness (local variation in radius lengths) (f) compactness (squared perimeter / area - 1.0) (g) concavity (severity of concave portions of the contour) (h) concave points (number of concave portions of the contour) (i) symmetry (j) fractal dimension ("coastline approximation" - 1). The mean, standard error, and "worst" or largest (mean of the three largest values) of these ten features were computed for each image, resulting in 30 features. For instance, field (column) 3 is the mean radius, field 13 is the radius standard error, and field 23 is the worst radius. Likewise, field 4 is the mean texture, field 14 is the texture standard error, field 24 is the worst texture, and so on. The data are made available courtesy of the University of California, Irvine (UCI) Machine Learning Repository.
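The column layout described above (ten means, then ten standard errors, then ten "worst" values) can be sketched as a list of column names. The names themselves, such as radius_mean, are hypothetical labels chosen for illustration, and column 1 is assumed to hold a sample identifier, since the description only covers columns 2-32.

```python
# Ten base features, in the order (a)-(j) given in the description.
FEATURES = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]
# Each feature is summarised three ways; the blocks appear in this order.
STATS = ["mean", "se", "worst"]


def wdbc_column_names():
    """Return hypothetical names for the 32 columns of the file."""
    names = ["id", "diagnosis"]  # column 1 assumed to be an ID; column 2 is M/B
    for stat in STATS:
        for feature in FEATURES:
            names.append(f"{feature}_{stat}")
    return names
```

With 1-based field numbers, wdbc_column_names()[field - 1] gives the label for a field, so fields 3, 13 and 23 map to the three radius summaries.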
Vowels Training Data
Speaker-independent recognition of the eleven steady-state vowels (column 2) of British English using a specified training set of LPC (linear predictive analysis) derived log area ratios. The objective is the recognition of vowel sounds from multiple speakers. Deterding (1998) recorded examples of the eleven steady-state vowels of English spoken by fifteen speakers (eight training and seven test) for a speaker normalisation study. Each speaker yielded six frames of speech for each of the eleven vowels. This gave 528 frames from the eight speakers used to train the classifier.
Vowels Test Data
Vowels test data from the seven test speakers, resulting in 462 frames for the test dataset.
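The frame counts quoted for the two vowel datasets follow from speakers x vowels x frames per vowel. A small sketch of that arithmetic is given below; the function name n_frames is hypothetical, and it assumes every speaker contributes exactly six frames for every one of the eleven vowels.

```python
N_VOWELS = 11          # steady-state vowels recorded per speaker
FRAMES_PER_VOWEL = 6   # frames of speech per vowel per speaker (assumed uniform)


def n_frames(n_speakers: int) -> int:
    """Number of rows expected in a file built from n_speakers speakers."""
    return n_speakers * N_VOWELS * FRAMES_PER_VOWEL


train_rows = n_frames(8)  # eight training speakers
test_rows = n_frames(7)   # seven test speakers
```

Comparing these values with the number of rows actually read from each file is a quick check that the data were loaded completely.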
Egyptian Skulls Data
Measurements in mm on maximum breadth (column 1), basibregmatic height (column 2), basialveolar length (column 3) and nasal height (column 4) of male Egyptian skulls from (column 5) period 1 (4000 BC), period 2 (3300 BC) and period 3 (1850 BC).
Breakfast Cereals Data
Data on Calories (column 3), Protein (column 4), Fat (column 5), Sodium (column 6), Fiber (column 7), Carbohydrates (column 8), Sugar (column 9), Potassium (column 10) of breakfast cereals (column 1) from one of three manufacturers (column 2). Column 11 contains group information whose meaning is not defined.
Loans Data
Data on eight financial variables for 68 farmers who have borrowed money from a bank. These farmers are classified as "good" or "bad" customers (column 2), with the eight financial variables in columns 3-10 being the ratio between assets and liabilities, the proportion of working capital to total assets, the proportion of current liabilities to total assets, the current liabilities to total assets, the total liabilities to total assets, the change in net worth to total assets, the farm land value to total assets, and the total liabilities to the net worth.
GMAT Data
Data on GMAT (column 3) and undergraduate GPA (column 4) scores used to decide on an applicant's admission to a graduate program. The three classes are (1) admit, (2) not admit, and (3) borderline, represented in column 2.
Diabetes Data
Stanford Heart Diabetes Dataset.
Senators voting data
Voting records of Senators of the 109th US Congress (2005-07), as culled from the Congressional Database by Banerjee et al. (2008). Here, "1" denotes a vote in favor, "-1" denotes a vote against, and "0" means a vote not recorded. The resolutions voted on are in the rows and the senators' votes are in the columns. Also, the first row of the file contains the senators' names, states and affiliations.
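The vote coding above can be illustrated with a short tally over one sequence of codes, for example the codes of one senator or of one resolution. This is only a sketch: the function name tally and the label strings are hypothetical, and reading the file into such a sequence is left out because the exact file layout is not described here.

```python
from collections import Counter
from typing import Dict, Iterable

# Meaning of each code, as given in the dataset description.
LABELS = {1: "for", -1: "against", 0: "not recorded"}


def tally(votes: Iterable[int]) -> Dict[str, int]:
    """Count votes for, against and not recorded in a sequence of codes.

    Raises KeyError if a code other than 1, -1 or 0 is encountered.
    """
    counts = Counter(LABELS[v] for v in votes)
    # Report every category, including those with zero votes.
    return {label: counts.get(label, 0) for label in LABELS.values()}
```

For instance, tally([1, 1, -1, 0]) counts two votes in favor, one against and one not recorded.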
Europe Protein consumption
Protein consumption in twenty-five European countries for nine food groups. The variable names and their brief descriptions are: 1. Economy: whether a member of the EEC ("E") or the COMECON or Non-Aligned ("C"), input by me. 2. Country: Country name 3. RdMeat: Red meat 4. WhMeat: White meat 5. Eggs: Eggs 6. Milk: Milk 7. Fish: Fish 8. Cereal: Cereals 9. Starch: Starchy foods 10. Nuts: Pulses, nuts, and oil-seeds 11. Fr&Veg: Fruits and vegetables
Wine Dataset
Results of a chemical analysis of 13 constituents (columns 2-14) of wines grown in the same region in Italy but derived from three different cultivars (given by column 1). The constituent attributes measured in the chemical analysis are: 1) Alcohol 2) Malic acid 3) Ash 4) Alkalinity of ash 5) Magnesium 6) Total phenols 7) Flavanoids 8) Nonflavanoid phenols 9) Proanthocyanins 10) Color intensity 11) Hue 12) OD280/OD315 of diluted wines 13) Proline.