Interpreting Principal Components

It is common to give each new principal component variable a descriptive interpretation. For example, in the women's track data, the first principal component can be described as overall athletic prowess. The second principal component can be described as prowess at short-distance events combined with a lack of prowess at long-distance events. That is, a country with a high value on principal component 2 is better at short-distance events than at long-distance events, relative to its own overall level.

Interpreting principal components can be a very subjective activity; there is no mathematical basis for it. The usual approach is to look at what a high value and a low value of the principal component represent. In the women's track data, the first principal component contains roughly an equal amount of each variable, so high values on all variables give a high value for the first principal component, and low values on all variables give a low value. Hence the interpretation that the first principal component represents overall athletic prowess. The second principal component has positive coefficients for the first three (short-distance) events and negative coefficients for all the other events. High values for these three short-distance events and low values for the other events give a high value for the second principal component, and the reverse pattern gives a low value.
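
The kind of quick hand calculation this describes can be sketched in Python (used here for illustration; the notes themselves use SAS). The coefficients are assumptions chosen to mimic the pattern described, not the values SAS reported for the track data: `pc1` puts equal weight on all seven events, and `pc2` is positive on the three short events and negative on the other four. The profiles are hypothetical standardized values.

```python
import math

# Illustrative coefficients only (assumed, NOT taken from the SAS output).
pc1 = [1 / math.sqrt(7)] * 7                        # equal weight on all seven events
pc2 = [0.5, 0.5, 0.5, -0.25, -0.25, -0.25, -0.25]   # short events +, other events -

def score(coefs, values):
    """Principal component score: sum of coefficient * standardized value."""
    return sum(c * v for c, v in zip(coefs, values))

# Hypothetical standardized profiles (+1 = one standard deviation above the mean).
all_high = [1.0] * 7
all_low = [-1.0] * 7
short_high_rest_low = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0, -1.0]
short_low_rest_high = [-v for v in short_high_rest_low]

print("PC1:", score(pc1, all_high), score(pc1, all_low))
print("PC2:", score(pc2, short_high_rest_low), score(pc2, short_low_rest_high))
```

Under these assumed coefficients, the all-high profile gives a large positive PC1 score and the all-low profile a large negative one, while the two contrasting profiles sit at opposite ends of PC2.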

NOTE: Always look at the coefficients to interpret the principal components and do some quick hand calculations. Different software may give different orientations of the principal components. For example, the algorithm used in XGobi gave the first principal component coefficients to be

  a1 = (-0.368356,-0.365364,-0.381610,-0.384559,-0.389104,-0.388866,-0.367004)'
which are the same values as those given by SAS but with the opposite sign. Hence the data appear reversed in the plot.
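
A minimal Python sketch of why the orientation is arbitrary: flipping the sign of every coefficient leaves the length of the vector unchanged and simply negates every score, so the plot is mirrored but carries the same information. The vector `a1` is the XGobi result quoted above; the `country` values are hypothetical standardized data, invented for illustration.

```python
import math

# First principal component coefficients reported by XGobi (from the notes).
a1 = [-0.368356, -0.365364, -0.381610, -0.384559,
      -0.389104, -0.388866, -0.367004]
a1_flipped = [-c for c in a1]   # the opposite orientation

def score(coefs, values):
    """Principal component score: sum of coefficient * value."""
    return sum(c * v for c, v in zip(coefs, values))

# Hypothetical standardized values for one country (assumed for illustration).
country = [0.8, 1.1, 0.9, 1.2, 1.0, 0.7, 1.3]

length = math.sqrt(sum(c * c for c in a1))
print("length of a1:", round(length, 4))   # identical for both orientations
print("score with a1:        ", score(a1, country))
print("score with a1_flipped:", score(a1_flipped, country))
```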

The plots of the principal components were generated in XGobi. With the tour mode paused, turn all variables on, then turn on ``PC Basis'' and ``PC Axes''. Click ``Reinit'' to get the first two principal components. ``Identify'' can then be used to label individual countries. To look at further principal components, the manipulation controls need to be used.

Another Example

Can the dimension of the particle physics data be reduced?

SAS code


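/* Limit the listing output to 80 columns */
OPTIONS LS=80;
/* Read the seven particle physics variables */
DATA PRIM7;
   INFILE "/home/dicook/data/prim7";
   INPUT V1-V7;
/* PCA on the covariance matrix (COV) rather than the correlation matrix */
PROC PRINCOMP COV;
  VAR V1-V7;
RUN;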

SAS output


                          Principal Component Analysis

     500 Observations
       7 Variables


                               Simple Statistics

                      V1                V2                V3                V4

  Mean       5.767659220       1.553932460       9.612386940       1.839435320
  StD        4.652606168       2.186015813       7.228750604       2.652585530


                               V5                V6                V7

           Mean       8.859937520      -2.365427500      -2.533923260
           StD        7.085819787       3.961944315       4.064969770


                               Covariance Matrix

                     V1                V2                V3                V4

   V1       21.64674416        5.27598594       -5.38415394        6.92669900
   V2        5.27598594        4.77866513       -6.39241935       -0.49264811
   V3       -5.38415394       -6.39241935       52.25483529        1.91936755
   V4        6.92669900       -0.49264811        1.91936755        7.03620999
   V5       -5.84822874        1.65150051      -28.11684285       -7.72202503
   V6       -9.84654643       -5.05491245       12.93879946        0.74077194
   V7      -10.41130772        0.67392973      -13.00628778       -6.58859088


                              V5                V6                V7

            V1       -5.84822874       -9.84654643      -10.41130772
            V2        1.65150051       -5.05491245        0.67392973
            V3      -28.11684285       12.93879946      -13.00628778
            V4       -7.72202503        0.74077194       -6.58859088
            V5       50.20884206      -12.20545118       13.35839830
            V6      -12.20545118       15.69700276       -2.16813588
            V7       13.35839830       -2.16813588       16.52397923

                         Total Variance = 168.14627862

                      Eigenvalues of the Covariance Matrix

                Eigenvalue      Difference      Proportion      Cumulative

     PRIN1         90.1370         50.1756        0.536063         0.53606
     PRIN2         39.9614         19.4193        0.237659         0.77372
     PRIN3         20.5421          9.9815        0.122168         0.89589
     PRIN4         10.5606          7.1873        0.062806         0.95870
     PRIN5          3.3733          1.1151        0.020062         0.97876
     PRIN6          2.2583          0.9446        0.013430         0.99219
     PRIN7          1.3136           .            0.007812         1.00000

                                  Eigenvectors

        PRIN1      PRIN2      PRIN3      PRIN4      PRIN5      PRIN6      PRIN7

V1   0.012040   -.701020   0.245869   -.039038   0.001502   0.399965   -.535234
V2   -.078112   -.187176   -.001083   -.306234   -.481096   0.488213   0.628711
V3   0.665573   0.333878   0.591679   -.233833   0.092700   0.179393   0.000846
V4   0.099379   -.254174   -.004705   0.368757   0.710679   0.231019   0.480709
V5   -.649369   0.296660   0.616215   0.276121   -.001350   0.185322   -.001745
V6   0.234248   0.314842   -.328932   0.587369   -.261544   0.531564   -.204545
V7   -.253648   0.332153   -.318627   -.541102   0.431833   0.446340   -.212892
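
As a sanity check on the listing, the following Python sketch (an illustration, not part of the SAS run) confirms two relationships in the printed numbers: the total variance is the sum of the diagonal of the covariance matrix, and the Rayleigh quotient v'Sv / v'v for the first eigenvector approximately reproduces the first eigenvalue. All numbers are copied from the output above, so the agreement is limited by the printed rounding.

```python
# Covariance matrix of V1-V7, copied from the SAS output above.
S = [
    [ 21.64674416,  5.27598594,  -5.38415394,  6.92669900,  -5.84822874,  -9.84654643, -10.41130772],
    [  5.27598594,  4.77866513,  -6.39241935, -0.49264811,   1.65150051,  -5.05491245,   0.67392973],
    [ -5.38415394, -6.39241935,  52.25483529,  1.91936755, -28.11684285,  12.93879946, -13.00628778],
    [  6.92669900, -0.49264811,   1.91936755,  7.03620999,  -7.72202503,   0.74077194,  -6.58859088],
    [ -5.84822874,  1.65150051, -28.11684285, -7.72202503,  50.20884206, -12.20545118,  13.35839830],
    [ -9.84654643, -5.05491245,  12.93879946,  0.74077194, -12.20545118,  15.69700276,  -2.16813588],
    [-10.41130772,  0.67392973, -13.00628778, -6.58859088,  13.35839830,  -2.16813588,  16.52397923],
]

# First eigenvector (column PRIN1) and its reported eigenvalue.
v1 = [0.012040, -0.078112, 0.665573, 0.099379, -0.649369, 0.234248, -0.253648]
eig1 = 90.1370

# Total variance is the trace of the covariance matrix.
trace = sum(S[i][i] for i in range(7))

# Rayleigh quotient v'Sv / v'v; for an eigenvector it equals the eigenvalue.
Sv = [sum(s * x for s, x in zip(row, v1)) for row in S]
rayleigh = sum(a * b for a, b in zip(v1, Sv)) / sum(x * x for x in v1)

print("trace of S:       ", trace)      # compare with Total Variance
print("Rayleigh quotient:", rayleigh)   # compare with the PRIN1 eigenvalue
```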

Based on the proportion of variation explained, it may be possible to use just the first 3 principal components, because they account for about 90% of the variation in the data. Based on the scree plot, it is not clear where the cut should be made.
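
The proportion and cumulative columns can be reproduced directly from the eigenvalues. The Python sketch below (an illustration, separate from the SAS run) does this and adds a small helper, `components_needed`, which is a hypothetical name introduced here for picking the smallest number of components that reaches a chosen cumulative proportion.

```python
# Eigenvalues of the covariance matrix, copied from the SAS output above.
eigenvalues = [90.1370, 39.9614, 20.5421, 10.5606, 3.3733, 2.2583, 1.3136]

total = sum(eigenvalues)
proportion = [ev / total for ev in eigenvalues]

# Running sum of the proportions.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

def components_needed(threshold):
    """Smallest number of leading components whose cumulative proportion reaches threshold."""
    for k, c in enumerate(cumulative, start=1):
        if c >= threshold:
            return k
    return len(cumulative)

for k, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PRIN{k}: proportion={p:.4f} cumulative={c:.4f}")
print("components for 85%:", components_needed(0.85))
print("components for 90%:", components_needed(0.90))
```

With these eigenvalues the first three components reach a cumulative proportion of about 0.896, so a strict 0.90 cutoff would ask for four components; the 90% quoted in the text is a rounded figure.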

While it may be possible to reduce the dimensionality, it is not desirable, because the last principal components contain important information about the particle combinations. (Note the plot of PC5 vs PC7.)

Exercise

These data contain carapace measurements, in millimeters, on painted turtles. They were collected to study the size and shape of the animals (from Jolicoeur and Mosimann, ``Size and Shape Variation in the Painted Turtle: A Principal Component Analysis'', Growth, 24 (1960), 339-354).

  1. Look at the data using xgobi. (The data contains both male and female turtles, hence the two colors in the xgobi views.)

    
    % xgobi cdata/turtles &
  2. From examining the Length, Width and Height variables in a tour, guess how many dimension(s) need to be used to describe most of the variation in the data.

  3. Conduct PCA, using only the first three variables (Length, Width and Height). (SAS code is below.)

    
    OPTIONS LS=80;
    DATA TURTLES;
       INFILE "/home/dicook/data/johnson.and.wichern/turtles.dat";
       INPUT LENGTH WIDTH HEIGHT SEX;
    /* PCA on the covariance matrix; OUT= saves the component scores */
    PROC PRINCOMP COV OUT=PCA1;
      VAR LENGTH WIDTH HEIGHT;
    /* Pairwise plots of the principal component scores */
    PROC PLOT DATA=PCA1;
      PLOT PRIN2*PRIN1;
      PLOT PRIN3*PRIN1;
      PLOT PRIN3*PRIN2;
    RUN;

  4. Report the eigenvalues and eigenvectors from the analysis.
  5. Plot the principal components against each other. Note any interesting features.
  6. Why was it appropriate to use the covariance matrix rather than the correlation matrix?
  7. How many principal components are needed to adequately describe the data? (Use both a scree plot and the proportion of variation explained to answer this.)
  8. Interpret the first and second principal components. (Draw a picture explaining your interpretation.)

Summary of Principal Component Analysis

  1. PCA is primarily used for dimension reduction. In this case it is necessary to decide how many principal components are needed to adequately summarize the data.
  2. Using the standardized data rather than the raw data eliminates differences in units of measurement between variables, so that PCA is not simply detecting unit differences. (This is equivalent to using the correlation matrix rather than the variance-covariance matrix.)
  3. The last few principal components are useful for detecting non-linear relationships between the variables.
  4. When using a few principal components instead of all the variables, it is customary to interpret what the new variables represent; for example, the first component might be "animal size" and the second "animal shape".
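
The equivalence in point 2 can be illustrated with a short Python sketch: after standardizing, every variable has variance 1 and the covariance between two standardized variables equals the correlation between the raw variables, which does not depend on the units. The measurements below are hypothetical values invented for illustration; they are not the turtle data.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Sample covariance with divisor n - 1."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def standardize(xs):
    """Subtract the mean and divide by the sample standard deviation."""
    m, sd = mean(xs), math.sqrt(covariance(xs, xs))
    return [(x - m) / sd for x in xs]

# Hypothetical measurements recorded in very different units (assumed data).
length_mm = [98.0, 103.0, 109.0, 123.0, 133.0, 147.0]
height_cm = [3.8, 3.9, 4.0, 4.4, 4.8, 5.1]

correlation = covariance(length_mm, height_cm) / math.sqrt(
    covariance(length_mm, length_mm) * covariance(height_cm, height_cm))

z_len, z_hgt = standardize(length_mm), standardize(height_cm)

print("raw variances:         ", covariance(length_mm, length_mm), covariance(height_cm, height_cm))
print("standardized variances:", covariance(z_len, z_len), covariance(z_hgt, z_hgt))
print("standardized covariance:", covariance(z_len, z_hgt))
print("raw correlation:        ", correlation)
```

On the raw scale the length variable dominates the total variance purely because of its units; after standardizing, both variables contribute equally.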


dicook@iastate.edu
Class notes 9/30/96