R Tutorial (http://www.public.iastate.edu/~gdancik/stat430) R is a freely available software package used for statistical analysis, data visualization, and algebraic (matrix) computation that can run on Unix, Windows, and Mac operating systems. R is a command-based (functional) language with many objects and functions built-in. Users can also define their own objects and functions, and many specialized packages are also available (http://cran.r-project.org/src/contrib/PACKAGES.html) For more background, downloads, and a more thorough user-manual: http://cran.r-project.org/ Note: On certain platforms, R will not recognize the opening and closing quotation marks (" and ") found throughout this file, but will recognize the generic quotation marks ('). If any of the commands gives an error when copied and pasted into R, try typing in the quotation marks manually into R, or using the text version of this file. R can be used like a calculator 5 + 9 4 / 7 + (100-2) / 5 sqrt(16) exp(8) The assignment operator is the '=' sign; '<-' can also be used a = 3 x = 4 x**a or x^a returns xa The workspace is defined as all objects and user-defined functions in the current environment. The command ls() returns a list of all elements in the environment The command rm(a, x) can be used to remove the two elements from the workspace that we created above Getting Help (?): ?ls ?matrix Comments: The # sign is used to denote a comment (the same is in perl) Data types: vectors - these are 1 dimensional (1 row of numbers, characters, etc.) v= 1:15 # type the name of the object, in this case v, to view it v[2] #returns the 2nd element of the vector length(v) #returns the number of elements in the vector v = c('a', 'b', 'c') v = c(1,2,5) v = seq(1,10,by=2) v = rep(10,6) matrices - these are two-dimensional, with all elements of the same type v = 1:15 m = matrix(v, nrow = 3,ncol = 5,byrow = T) #creates a 3 (row) x 5 (column) matrix m = matrix(v, 3, byrow = T) # does the same thing ## can you create the matrix below: 1 1 1 2 2 2 3 3 3 4 4 4 dim(m) # returns the number of rows and columns of matrix m dim(m)[1] #the number of rows dim(m)[2] #the number of columns we can access elements of the matrix m using m[rows, columns], where rows and columns are the rows and columns of interest m[1:2,2:3] returns rows 1 and 2 and columns 2 and 3 m[rows, ] returns the specified rows (and all columns) m[, columns] returns the specified columns (and all rows) Note: if only 1 row or column is specified, then a vector will be returned Can you change the element in the 3rd column and the 4th row to 0? Matrix arithmetic m + 3 # adds 3 to each element of m m * 5 # multiplies all elements in m by 5 # for 2 matrices m1 and m2 of equal dimension, add corresponding elements m1 + m2 Suppose we want to evaluate a function on all rows or columns of a matrix. We can easily do this with the apply function The function mean() returns the mean value of the elements in the object passed into it. apply(m,1,mean) #returns the mean of each row apply(m,2,mean) #returns the mean of each column lists - can store elements of various types; in this respect, are similar to classes in object oriented programming languages student = list(name = 'Bob', age = 20) Elements of a list can be accessed according to their names... student$name # returns 'Bob' student$age # returns 20 We can easily add an element to the list: student$grade = 'A' And delete an element from the list: student$age = NULL ... We can also access elements according to their index using double brackets: student[[1]] returns 'Bob' Important generic functions (which may or not be useful depending on the type of object): summary() mean() plot() #data coercion: #check the type of an object: as.vector() is.vector() as.matrix() is.matrix() Similar functions are defined for all data types (e.g., as.list(), as.character(), etc. ) The command typeof(object) returns the type of the object (obviously) Note: some objects may be more than one type, and only one type will be returned. ex: v = 1:1000 typeof(v) returns 'integer' is.vector(v) and is.integer(v) BOTH return TRUE is.matrix(v) returns FALSE General statistical functions for any object x (usually a vector or matrix) mean(x) min(x) var(x) max(x) sd(x) quantile(x) summary(x) # returns the five number summary, as well as the mean Data input A list of commands in a file can be read using source(file.name) source('http://www.public.iastate.edu/~gdancik/summer2007/files/setx.txt') Reading in a file data = read.table('http://www.public.iastate.edu/~gdancik/summer2007/files/BigClass.txt', sep = ',', header = T) data.frames Data frames are objects that combine features (particularly element access methods) of matrices and lists The columns of 'data' are 'name', 'age', 'sex', 'height', and 'weight' This can be determined using colnames(data) data$name data$age summary(data) Suppose we want to change the heading of 'sex' to 'gender' We can rename all of the columns using colnames(data) = new.names - Can you create a vector of the column names we want? - Can you change 'sex' to 'gender'? - Can you rename the column names? Alternately, we could have used colnames(data)[3] = 'gender' Another data type is the logical data type (TRUE or FALSE; or alternatively T and F) 5 > 3 5 > 9 Logical operators (e.g. to compare two numbers): >, <, >=, <=, ==, != v = 1:10 index = v > 5 # for each element of v, check if that element > 5 v[index] # returns the elements of v that are > 5 - In the big class data set, retrieve a list of students greater than 15 years old o index = data$age > 15 o data[index,] - note that we need to include the ',' after 'index'. Why is this? o The previous two steps may also be combined: data[data$age > 15,] - Other examples: o data[data$gender == 'M',] o data[data$age == 12,] o data[data$age == 12]$height ## the heights of all 12 year olds o data[data$age == 12, 4] Relationship between two variables: To reduce typing, let x = data$height and y = data$weight plot(x,y, xlab = 'height', ylab = 'weight', main = 'scatterplot of height and weight') cor(x,y) returns the correlation between x and y Linear models: A linear model (for one input variable) has the form: y = b0 + b1x, where y is referred to as the response variable and x is an input variable. fit1 = lm(y~x) # fits a linear model of the above form summary(fit1) Estimates of b0 and b1 are the first and second elements of fit1$coeff, respectively. These can also be found through summary(fit1) or simply by printing the fit1 object If we plug in our estimates of b0 and b1 into the original equation, we can predict a person's weight (y) from their known height (x). Doing this for our known weights, we get a list of fitted values, fit1$fitted. plot(x,y, xlab = 'height', ylab = 'weight', main = 'scatterplot of height and weight') lines(x, fit1$fitted, col = 'red') Now let us consider the model y = b0 + b1x1 + b2x2, where y = data$weight x1 = data$age x2 = data$height fit2 = lm(y~ x1 + x2) An alternate way to do this is to create a matrix with the 1st column = x1 and the 2nd column = x2. X = data[,c(2,4)] #returns a data.frame containing age and height (this returns a # data.frame since the original object is a data.frame) X = as.matrix(X) # X must be a matrix for the lm function to work! fit2 = lm(y~X) How does fit1 compare to fit2? Comparing two variables / populations (box plots t-tests, and permutation tests) hm = data[data$gender == 'M',]$height hf = data[data$gender == 'F',]$height boxplot(hm, hf, names = c('male', 'female'), main = 'height') For independent samples from 2 populations, the t-test is used to test a hypothesis about the population means (µM and µF in this example). By default, the t-test considers: H0: µM - µF = 0 H1: µM - µF ? 0 t = t.test(hm, hf) t Permutation (Randomization) Tests (pg. 165, Cohen): H0: µM - µF = 0 H1: µM - µF ? 0 Note that under the null hypothesis, the category of individual (male, female) does not matter, so the means should not be different if we randomly switched genders of individuals. In other words, from the 40 individuals, randomly select 22 as male and 18 as female. Theoretically, we can do this for all possible permutations of the 40 individuals (there are n = 113,380,261,800 of them!), calculate a statistic Ti for each permuation, and then calculate a p-value. We will use the test statistic which will be large if H0 is false. We calculate a p-value using where T* is the observed test statistic. This approach uses the classical definition of probability (theoreom 1.3.4 of notes), where A = Ti = T*, O = {all possible ways of assigning 22 individuals to be male and 18 individuals to be female from 40 individuals}, where Most of the time, it is not practical to calculate Ti for all permutations of the data. However, we can make use the empirical definition of probability, that is The text refers to the use of empirical probabilities to carry out permutation tests as an approximate randomization. Let's do this in R: K = 5000 ## the number of samples to obtain heights = rbind(matrix(hm), matrix(hf)) T = matrix(0,K) ## matrix of test statistics t.star = abs(mean(hm) - mean(hf)) for (i in 1:K) { heights = sample(heights) ## randomly permute the order of heights xm = mean(heights[1:22]) xf = mean(heights[23:40]) T[i] = abs(xm - xf) } hist(T, main = "distribution of test statistic") abline(v = t.star, col = "red") p = sum(T >= t.star) / length(T) Writing your own functions (and loops) f = function(x1, x2 = 0) { return (x1 + x2) } The return statement above is optional. We may want to have a function that simply 'does something'. For example, suppose we have a matrix m, and we want to plot a line that corresponds to each row of the matrix. m = matrix(1:15,ncol=5,byrow=TRUE) plotLines = function(m, ...) { # this is a comment lower = min(m) upper = max(m) for (i in 1:dim(m)[1]) { plot(m[i,], ylim = c(lower, upper), type = 'l', ...) par(new=TRUE) } } There are many options for graphical parameters, see '?plot' and '?par', for examples. R also allows while loops: i =0 while (i < 10) { print(i) i = i + 1 } Within a loop you may use break or next statements, similar to Perl. Conditional statements is5 = function(x) { if (x == 5) { print ('x is equal to 5') } else { print ('x is not equal to 5') } } # Note: There is no if else expression in R - you must used nested if...else statements. Saving and Loading R objects First let's check our current working directory. This is the directory in which files will be saved or read from if only a file name (no path) is specified. getwd() setwd("name of directory") # this function can be used to change the working directory # save the current workspace in the current working directory save.image(file = 'file.RData') save(x,y,z, file = 'xyz.RData') # can be used to save a subset of objects in the workspace # will load in the specified workspace or R object (note: objects in the workspace you are # loading in will overwrite any objects currently defined load('file.RData') # save the matrix m as a text file write(t(m), ncolumns = ncol(m), file = 'm.txt') # this can later be read in using the # read.table command. Probability distributions R can handle all common probability distributions, including the normal and (continuous) uniform distribution. For the normal distribution (standard normal by default), 'dnorm' gives the density, 'pnorm' gives the distribution function, 'qnorm' gives the quantile function, and 'rnorm' generates random deviates. Other probability functions work similarly (e.g., dunif, punif, etc. for the uniform distribution) #Let's visualize the standard normal density x = seq(-5,5, by=0.1) plot(x, dnorm(x), type = 'l') #We can generate 1000 observations from the standard normal distribution z = rnorm(1000) hist(z) #Let Z ~ N(0,1). Then pnorm(1.645) # returns P(Z < 1.645) qnorm(.95) # returns the value z*, for which (P(Z < z*) = 0.95 # Simulate flipping a coin 100 times, where P(H) = P(T) = 1/2 flips = runif(100) flips[ flips < 0.5 ] = 'H' flips[ flips != 'H'] = 'T' flips = as.factor(flips) summary(flips) ## suppose we did not know about the summary function, and had to write our ## own function to count the number of heads / tails ## a C-style function ## countHT.C = function(x) { numH = 0 numT = 0 x = toupper(x) ## convert to all uppercase for (i in 1:length(x)) { if (x[i] == 'H') numH = numH + 1 if (x[i] == 'T') numT = numT+ 1 } print(paste("Heads: ", numH)) print(paste("Tails: ", numT)) } ## an R style function ## countHT.R = function(x) { x = toupper(x) numH = length( x[ x == 'H'] ) numT = length( x[ x == 'T'] ) print(paste("Heads: ", numH)) print(paste("Tails: ", numT)) } Recall the Central Limit Theoreom: If Xi is a random variable with mean µ and variance s2, the sample mean of a random sample of size n has the approximate distribution This is exactly true if X is normally distributed, and is approximately true for all other distributions of X, with the approximation being more accurate as n increases. Let's visualize this using R: ## the dots (...) are used as additional arguments to dsn CLT <- function(r = 5000, dsn = rnorm, ...) { ## divide the window into 4 panels ## par(mfrow = c(2,2)) hist(dsn(r, ...), xlab = "x", main = "original dsn") n.sizes = c(10, 30, 500) for (n in n.sizes) { m = matrix(0, r) for (i in 1:r) { m[i] = mean(dsn(n, ...)) } title = paste("n = ", n) hist(m, main = title, xlab = "xbar") } }