Statistics 503 - Exploratory Methods and Data Mining

Instructor: Dr Di Cook
325 Snedecor Hall
515-294-8865
dicook@iastate.edu
www.public.iastate.edu/~dicook

Meeting Times: MWF 2:10-3:00 Room TBA
Office hours: TBA

Most course material is posted on WebCT.

Textbooks: Venables, W.N. and Ripley, B.D. (2003) "Modern Applied Statistics with S (4th ed)", Springer
Chatfield, C. (1995) "Problem Solving: A Statistician's Guide", Chapman and Hall/CRC
Recommended Reading: Bishop, C. (2006) "Pattern Recognition and Machine Learning", Springer
Hastie, T., Tibshirani, R., and Friedman, J. (2001) "The Elements of Statistical Learning", Springer
Ripley, B.D. "Pattern Recognition and Neural Networks"

Prerequisites: Statistics 401, 447 or 341, or permission of the instructor
Description: Approaches to finding the unexpected in data: data mining, pattern recognition and gaining understanding. Emphasis is on data-centered, non-inferential statistics for large or high-dimensional data and topical problems. Topics include simple graphical methods, classical and computer-intensive methods applied in an exploratory manner, and presentation graphics.
Objectives: Information in our age is exploding in amount and complexity. New disciplines, such as data mining, are emerging to address the needs in this area. This course is designed to provide students with the essentials for approaching new, complex data, and arriving at preliminary descriptive statements.
Approach: This will be a data-centered course, with segments of the semester focusing on particular data sets, each getting more complex as the semester progresses.

Tentative Schedule

Dates | Topic | Notes | Data | Assignment | Homework | Code
Jan 8, 10, 12 | Introduction to R | Intro to R | Data | . | . | .
Jan 17, 19 | What is data mining? | Outline, Background | . | . | Hwk 1 | Code
end of Jan | Case study 1: Tipping Data | Notes | Data | Assignment 1, Women's tennis statistics, Men's tennis statistics | . | Code
Feb | Case study 2: Olive Oils Data | Notes | Data | . | Hwk 2, Hwk 3, Music data | Code
Mar | Case study 3: Clustering music clips | Notes | Data | . | . | Code
. | Classification of music clips | Notes | Data | . | . | .
Apr | Case study 4: Hurricanes | Notes | Data | . | . | Code
Mar 12 | Exam 1 | 2005 Exam 1, Solution, Guide
Apr 30 | Exam 2 | 2005 Exam 2, Solution, Guide
Apr 11 | Project due
Apr 23, 25, 27 | Project presentations | . | . | .

Methods to be Covered:
• Data cleaning: re-formulating variables to extract different types of information, handling missing values, fixing errors in data, transformations
• Interactive and dynamic graphics
• Classical procedures: regression, GLM, PCA, hierarchical and k-means clustering, model-based clustering
• Computationally intensive procedures: neural networks, CART, forests, support vector machines, smoothing, bootstrap, boosting, projection pursuit
• Presentation graphics: trellis/lattice/ggplot

Copyright for the material on this page belongs to the course instructor.

Dianne Cook, Dept of Statistics, ISU, 325 Snedecor Hall, Ames, IA 50011-1210
Tel: (515) 294 8865, Fax: (515) 294 4040
email: dicook@iastate.edu
http://www.public.iastate.edu/~dicook/