HW4: You have to use R to complete this HW – copy paste your R commands and plots-and add your notes if necessary for each problem:

 

  1. For the cereals data (story behind the data), make a histogram of sodium contents of all types of cereals measured (do it with number of bins = 20 and  prob=T). Also plot a smoothed curve on the histogram showing what true density may this random variable have (do it with bw=10) . Calculate mean, standard deviation, 80th percentile for this variable.

 

Solution:

> cerealdata<-read.table("cereals.txt",header=TRUE)

> sod<-cerealdata[,7]

> sod

 [1] 130  15 260 140 200 180 125 210 200 210 220 290 210 140 180 280 290  90 180

[20] 140  80 220 140 190 125 200   0 160 240 135  45 280 140 170  75 220 250 180

[39] 170 170 260 150 180   0  95 150 150 220 190 220 170 170 200 320   0   0 135

[58]   0 210 140   0 240 290   0   0   0  70 230  15 200 190 200 250 140 230 200

[77] 200

> hist(sod,n=20,prob=T)

 

lines(density(sod,bw=10))

> mean(sod)

[1] 159.6753

> sd(sod)

[1] 83.8323

> quantile(sod,0.8,type=1)

80%

220

 

 

  1. For the cereals data (story behind the data) compare the calorie contents in the cereals manufactured by Kellogs and Nabisco, by giving their summary statistics(min, max, quartiles), and a single boxplot for both (you can do two boxplots in the same diagram – useful for comparison) for both these measurements. How do these two measurements compare?

 

Solution:

> cal_kel=cerealdata[,4][cerealdata[,2]=="K"]

> cal_nab=cerealdata[,4][cerealdata[,2]=="N"]

> summary(cal_kel)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

   50.0   100.0   110.0   108.7   115.0   160.0

> summary(cal_nab)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.

  70.00   82.50   90.00   86.67   90.00  100.00

> boxplot(cal_kel,cal_nab)

Overall calorie content for Nabisco cereals is much less than that of the Kellog brand.  

 

 

  1. Find the 67th percentile for the t distribution with d.f=7 (call this number b). Generate 1000 samples from t(df=7) distribution. Estimate probability of getting a value smaller than b, using these 1000 samples.   

 

Solution:

> b=qt(0.67,7)

> b

[1] 0.4592146

> y<-rt(1000,7)

> prob=length(y[y<=b])/length(y)

> prob

[1] 0.663

 

  1. For the CEO data (story behind data), plot ages and salaries in a scatter plot. Does it show strong association between variables (justify by calculating the correlation coefficient). Do a normal q-q  plot on the ages of the CEOs – does it look like it is normally distributed? Find a 95% confidence interval for the age of CEOs.

 

Solution:

> ceodata<-read.table("ceo.txt",header=TRUE)

> plot(ceodata[,2],ceodata[,1])

 

The plot does not suggest strong association.

cor(ceodata[,2],ceodata[,1])

[1] 0.1275554

The correlation coefficient also supports the guess.

 

> qqnorm(ceodata[,1])

> qqline(ceodata[,1])

Yes, it looks like it is normally distributed.

 

> x<-ceodata[,1]

> confidenceint=c(mean(x)-qnorm(0.975,0,1)*sd(x)/sqrt(length(x)),(mean(x)+qnorm(0.975,0,1)*sd(x)/sqrt(length(x))))

> confidenceint

[1] 49.25111 53.83364

> length(x)

[1] 59

 

This confidence interval is calculated using the Normal probabilities (instead of t, since sigma unknown), since data size n=59 > 30.

 

  1.  Darwin’s Data: Charles Darwin measured heights (in inches) for 15 pairs of plants. Each pair consists of two plants of the same age but one grown from a seed of cross-fertilized flower and the other from a self-fertilized flower. Test the hypothesis that the growth in cross-fertilized flower is more.

 

Solution:

> tree<-read.table("darwin.txt",header=TRUE)

> tree

    cross   self

1  23.500 17.375

2  12.000 20.375

3  21.000 20.000

4  22.000 20.000

5  19.125 18.375

6  21.500 18.625

7  22.125 18.625

8  20.375 15.250

9  18.250 16.500

10 21.625 18.000

11 23.250 16.250

12 21.000 18.000

13 22.125 12.750

14 23.000 15.500

15 12.000 18.000

> dim(tree)

[1] 15  2

> z<-tree[,1]-tree[,2]

> z

 [1]  6.125 -8.375  1.000  2.000  0.750  2.875  3.500  5.125  1.750  3.625

[11]  7.000  3.000  9.375  7.500 -6.000

> t=mean(z)/(sd(z)/sqrt(length(z)))

> t

[1] 2.147987

> t>=qt(0.95,length(z)-1)

[1] TRUE

> qt(0.95,length(z)-1)

[1] 1.76131

> pval=1-pt(t,length(z)-1)

> pval

[1] 0.02485147

The paired-t test gives the t-statistic value 2.14789 and at 5% level the cut off is 1.76. So we reject H0(that mean difference = 0) and accept H1(that mean difference is >0). So at 5% level, the claim that growth in cross-fert plants is more is accepted ( a p-value of [1] 0.02485147 strongly supports this claim). At 1% level however, we will accept H0.