It is common to attach a descriptive label to each new principal component variable. For example, in the women's track data the first principal component can be described as overall athletic prowess. The second principal component can be described as prowess at short-distance events combined with a lack of prowess at long-distance events: a country with a high value on principal component 2 is better at short-distance events than at long-distance events, relative to its own overall level.
Interpreting principal components can be a very subjective activity; there is no mathematical basis for it. The usual approach is to look at what a high value and a low value of the principal component represent. In the women's track data the first principal component weights each variable roughly equally, so high values on all variables give a high value for the first principal component, and low values on all variables give a low value. Hence the interpretation that the first principal component represents overall athletic prowess. The second principal component has positive coefficients for the first three short-distance events and negative coefficients for all the other events. High values on these three short-distance events and low values on the other events give a high value for the second principal component, and the reverse pattern gives a low value.
NOTE: Always look at the coefficients when interpreting the principal components, and do some quick hand calculations. Different software may give different orientations (signs) of the principal components. For example, the algorithm used in XGobi gave the first principal component coefficients as
a1 = (-0.368356, -0.365364, -0.381610, -0.384559, -0.389104, -0.388866, -0.367004)'
which are just the negatives of those given by SAS. Hence the data appear reversed in the plot.
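The kind of quick hand calculation suggested in the note can be sketched in a few lines of Python. This is only an illustration: the vector `a1` is the XGobi coefficient vector quoted above, while the observation `x_high` is a made-up, centered data point (not taken from the track data) that is above average on every variable. The helper name `score` is also just a label chosen for this sketch.

```python
# First principal component coefficients as reported above for XGobi.
a1 = [-0.368356, -0.365364, -0.381610, -0.384559,
      -0.389104, -0.388866, -0.367004]

def score(coefs, x):
    """Principal component score: the linear combination sum_j coefs[j] * x[j]."""
    return sum(c * v for c, v in zip(coefs, x))

# Hypothetical centered observation (made-up values, NOT from the track data):
# above average on all seven variables.
x_high = [1.0, 0.8, 1.2, 0.9, 1.1, 1.0, 0.7]

# The opposite orientation of the same component (every sign flipped).
flipped = [-c for c in a1]

# Both orientations have the same (unit) length, so both are valid
# coefficient vectors for the same principal component.
squared_length = sum(c * c for c in a1)

print(round(squared_length, 4))
print(score(a1, x_high))       # negative under this orientation
print(score(flipped, x_high))  # same magnitude, opposite sign
```

Under one orientation an "all high" observation gets a large negative score, under the other a large positive score, which is why the same data can look mirrored in plots from two different programs.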


The plots of the principal components were generated in XGobi. With tour mode paused, turn all variables on, then turn on "PC Basis" and "PC Axes". Click "Reinit" to get the first two principal components. "Identify" can then be used to label individual countries. To look at further principal components, use the manipulation controls.
Can the dimension of the particle physics data be reduced?
SAS code
OPTIONS LS=80;
DATA PRIM7;
  INFILE "/home/dicook/data/prim7";
  INPUT V1-V7;
PROC PRINCOMP COV;
  VAR V1-V7;
RUN;
SAS output
Principal Component Analysis
500 Observations
7 Variables
Simple Statistics
                 V1             V2             V3             V4
Mean    5.767659220    1.553932460    9.612386940    1.839435320
StD     4.652606168    2.186015813    7.228750604    2.652585530
                 V5             V6             V7
Mean    8.859937520   -2.365427500   -2.533923260
StD     7.085819787    3.961944315    4.064969770
Covariance Matrix
               V1             V2             V3             V4
V1    21.64674416     5.27598594    -5.38415394     6.92669900
V2     5.27598594     4.77866513    -6.39241935    -0.49264811
V3    -5.38415394    -6.39241935    52.25483529     1.91936755
V4     6.92669900    -0.49264811     1.91936755     7.03620999
V5    -5.84822874     1.65150051   -28.11684285    -7.72202503
V6    -9.84654643    -5.05491245    12.93879946     0.74077194
V7   -10.41130772     0.67392973   -13.00628778    -6.58859088
               V5             V6             V7
V1    -5.84822874    -9.84654643   -10.41130772
V2     1.65150051    -5.05491245     0.67392973
V3   -28.11684285    12.93879946   -13.00628778
V4    -7.72202503     0.74077194    -6.58859088
V5    50.20884206   -12.20545118    13.35839830
V6   -12.20545118    15.69700276    -2.16813588
V7    13.35839830    -2.16813588    16.52397923
Total Variance = 168.14627862
Eigenvalues of the Covariance Matrix
        Eigenvalue   Difference   Proportion   Cumulative
PRIN1      90.1370      50.1756     0.536063      0.53606
PRIN2      39.9614      19.4193     0.237659      0.77372
PRIN3      20.5421       9.9815     0.122168      0.89589
PRIN4      10.5606       7.1873     0.062806      0.95870
PRIN5       3.3733       1.1151     0.020062      0.97876
PRIN6       2.2583       0.9446     0.013430      0.99219
PRIN7       1.3136        .         0.007812      1.00000
Eigenvectors
        PRIN1      PRIN2      PRIN3      PRIN4      PRIN5      PRIN6      PRIN7
V1   0.012040   -.701020   0.245869   -.039038   0.001502   0.399965   -.535234
V2   -.078112   -.187176   -.001083   -.306234   -.481096   0.488213   0.628711
V3   0.665573   0.333878   0.591679   -.233833   0.092700   0.179393   0.000846
V4   0.099379   -.254174   -.004705   0.368757   0.710679   0.231019   0.480709
V5   -.649369   0.296660   0.616215   0.276121   -.001350   0.185322   -.001745
V6   0.234248   0.314842   -.328932   0.587369   -.261544   0.531564   -.204545
V7   -.253648   0.332153   -.318627   -.541102   0.431833   0.446340   -.212892
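As a cross-check on the listing above, the leading eigenvalue and eigenvector can be recovered from the printed covariance matrix alone. The sketch below uses plain power iteration in Python, which is not necessarily the algorithm SAS uses; the function names `mat_vec` and `power_iteration`, the all-ones starting vector, and the iteration count are choices made for this illustration. The result should agree with the PRIN1 column up to rounding and up to an overall sign.

```python
import math

# Covariance matrix of V1-V7, copied from the SAS output above.
cov = [
    [21.64674416, 5.27598594, -5.38415394, 6.92669900, -5.84822874, -9.84654643, -10.41130772],
    [5.27598594, 4.77866513, -6.39241935, -0.49264811, 1.65150051, -5.05491245, 0.67392973],
    [-5.38415394, -6.39241935, 52.25483529, 1.91936755, -28.11684285, 12.93879946, -13.00628778],
    [6.92669900, -0.49264811, 1.91936755, 7.03620999, -7.72202503, 0.74077194, -6.58859088],
    [-5.84822874, 1.65150051, -28.11684285, -7.72202503, 50.20884206, -12.20545118, 13.35839830],
    [-9.84654643, -5.05491245, 12.93879946, 0.74077194, -12.20545118, 15.69700276, -2.16813588],
    [-10.41130772, 0.67392973, -13.00628778, -6.58859088, 13.35839830, -2.16813588, 16.52397923],
]

def mat_vec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def power_iteration(m, iterations=200):
    """Approximate the largest eigenvalue and its unit eigenvector."""
    v = [1.0] * len(m)  # arbitrary starting vector (an assumption of this sketch)
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v' m v gives the eigenvalue for a unit vector v.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v

lam1, vec1 = power_iteration(cov)
print(round(lam1, 4))
print([round(x, 6) for x in vec1])  # compare with the PRIN1 column, up to sign
```

The largest coefficients in absolute value fall on V3 and V5 with opposite signs, so the first principal component is mainly a contrast between those two variables.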
Based on the proportion of variation explained, it may be possible to use just the first 3 principal components, because together they account for about 90% (89.6%) of the variation in the data. Based on the scree plot, it is not clear where the cut should be made.
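The proportions quoted from the SAS output can be reproduced by hand from the eigenvalues. The following Python sketch copies the eigenvalues and the diagonal of the covariance matrix from the listing; the variable names are illustrative only. It checks that the eigenvalues sum to the total variance (the trace of the covariance matrix) and then accumulates the proportion explained.

```python
# Eigenvalues of the covariance matrix, copied from the SAS output.
eigenvalues = [90.1370, 39.9614, 20.5421, 10.5606, 3.3733, 2.2583, 1.3136]

# Diagonal of the covariance matrix (the variances of V1-V7).
variances = [21.64674416, 4.77866513, 52.25483529, 7.03620999,
             50.20884206, 15.69700276, 16.52397923]

# Total variance can be computed either way; the two agree up to the
# rounding of the printed eigenvalues.
total_from_eigen = sum(eigenvalues)
total_from_trace = sum(variances)

# Cumulative proportion of variance explained by the first k components.
cumulative = []
running = 0.0
for ev in eigenvalues:
    running += ev
    cumulative.append(running / total_from_eigen)

print(round(total_from_trace, 4), round(total_from_eigen, 4))
for k, c in enumerate(cumulative, start=1):
    print(k, round(c, 5))
```

The third entry of `cumulative` is the figure behind the "about 90%" statement; a strict 90% cutoff would in fact require a fourth component, which is one reason such thresholds are best treated as rough guides.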


While it may be possible to reduce the dimensionality, it is not desirable, because the last principal components contain important information about the particle combinations. (Note the plot of PC5 vs PC7.)
These data contain carapace measurements, in millimeters, on painted turtles. They were collected to study the size and shape of the animals (from Jolicoeur and Mosimann, "Size and Shape Variation in the Painted Turtle: A Principal Component Analysis", Growth, 24 (1960), 339-354).
% xgobi cdata/turtles &
From examining the Length, Width and Height variables in a tour, guess how many dimensions are needed to describe most of the variation in the data.
Conduct PCA, using only the first three variables (Length, Width and Height). (SAS code is below.)
OPTIONS LS=80;
DATA TURTLES;
  INFILE "/home/dicook/data/johnson.and.wichern/turtles.dat";
  INPUT LENGTH WIDTH HEIGHT SEX;
PROC PRINCOMP COV OUT=PCA1;
  VAR LENGTH WIDTH HEIGHT;
PROC PLOT DATA=PCA1;
  PLOT PRIN2*PRIN1;
  PLOT PRIN3*PRIN1;
  PLOT PRIN3*PRIN2;
RUN;