Classification Society of North America Newsletter

July 1997, Issue #50
Peter Bryant, President
F.R. McMorris, Newsletter Editor

::::::: President's Corner :::::::

Peter Bryant
College of Business
University of Colorado at Denver
Denver, CO 80217-3364
pbryant@castle.cudenver.edu
303-556-5833

CSNA 97 was held in June at American University in Washington, D.C. More than 100 people attended, including attendees at our usual introductory short course and our new "advanced" course on Multivariate Nonparametric Regression. The conference chair, Olga Cordero-Braña of American University, and the program chair, David Banks of NIST, provided ample facilities and a stimulating variety of talks and presentations. We are in their debt.

Every CSNA conference has its own particular flavor, and for me this one offered more emphasis on medical and public health oriented issues. Perhaps this reflected the governmental agency "presence." We get increasing numbers of papers on neural nets, and this must surely be a good trend: the more we can communicate and avoid (what strikes me as a lot of) reinvention of the wheel, the better off we must be. I confess I wish we had better connections to the remote sensing folks. We did have a presentation on applications in meteorology, and I hope we'll have more. Those who HAVE connections to such areas might wish to encourage workers in that area to attend CSNA meetings.

The complete program is available for those who wish to peruse it.

Finally, I note that the Board approved next year's dues with no increase! This reflects in part the substantial savings from electronic distribution of our newsletter and diskette publication of the bibliographic search, and in part the admirable discipline of the business managers.

::::::: From the Secretary/Treasurer :::::::

Stanley L. Sclove
Department of Information and Decision Science
College of Business Administration
University of Illinois at Chicago
601 S. Morgan Street
Chicago, IL 60607-7124
slsclove@uic.edu
www.uic.edu/~slsclove

Though membership applications are accepted all year long, each year at the annual meeting those applying during the year are officially voted into membership. Forty-nine new members (27 USA, 22 International) were voted in at the June meeting:

USA:
Adams, Dean C.
Ball, R. Martin Jr.
Brownstein, Steven
Cellini, Richard J.
Coleman, S. C.
Collier, Geoffrey L.
Crouch, Brad
Dalezios, Isidoros
Darrow, Ross M.
Flaherty, Brian P.
Heady, Ronald B.
Johnson, Timothy R.
Kartashov, Alex
Kumbasar, Ece
Larsen, Michael D.
Li, Jinlu
Lyons-Weiler, James
Machuga, Joseph
Myers, Wayne
Obrosky, David Scott
Rao, J. Sunil
Robinson, Wendell
Shannon, William
Smith, Bradley G.
Street, Steven R.
Sutherland, Michael
Xu, Feng

International:
Almeida, Susano Nasimento
Beaulne, Albert
Bergman, Lars
Braaten, Oivina
Brueggemann, R.
Constantin, Julien
Deerwester, Scott
Fernandez, J. Antoni Martin
Flokos, Ioannis
Ishihara, Shigekazu
Jost, Oliver
Lozano, Jose A.
Marktonderzoek, M. Voor
Odegard, Kristin
Reineke, Torsten
Sato, Yoshiharu
Shaw, Craig Douglas
Smets, Erik
Thorsten, Reineke
Weihs, Erich
Yadohisa, Hiroshi
Yum, Bong-Jin

We are delighted to accept them into membership. We are also happy to note that thirteen persons submitted membership applications at the meeting.

If this is your first newsletter since paying your CSNA membership, thank you and welcome to our ranks. It is your financial support that enables CSNA to bring you the journal, our bibliographic research service, and our meetings. We depend upon your participation to enable us to continue to provide you with classification-related publications, information and activities.

::::::: From the Newsletter Editor :::::::

F.R. McMorris
Department of Mathematics
University of Louisville
Louisville, KY 40292
frmcmo01@homer.louisville.edu
(502)852-6826

In this issue, David Banks gives his last Forum, at least until we can convince him to resume at a later date. It is based on the presentation he gave at a recent workshop at DIMACS, Rutgers University.

Thanks for the provocative and informative columns, David!

::::::::::: Forum :::::::::::

THE ANALYSIS OF SUPERLARGE DATASETS

David Banks, Statistical Engineering Division, NIST

Datasets have grown large and multivariate. Automated process monitors in the semiconductor industry typically produce records with hundreds of thousands of observations on dozens of variables. Similarly, a satellite can transmit hundreds of images each day, the IRS must process millions of complex tax forms, and supermarket scanners record nearly all grocery purchases in most large cities.

This colossal scale poses serious obstacles to statistical analysis. In particular, it creates four new problem areas:

1) preanalysis of superlarge datasets
2) compression and summarization
3) triage to determine which datasets repay the cost of analysis
4) index creation.

This article reviews the issues in the first, third, and fourth of these areas, and sometimes makes suggestions for solution strategies.

Preanalysis

Banks and Parmigiani (1992) define preanalysis as all the things that must be done before the data can be submitted to the scrutiny that the researcher originally planned. In actuality, there is no sharp division between late preanalysis and early conventional analysis, and much of EDA might fall under the preanalysis umbrella.

In superlarge datasets, the preanalysis must be automated or semiautomated. This is because no human eye can scan such datasets to make a sanity check. Although good preanalysis ultimately depends upon the specific data one has in hand, nonetheless, a common preanalysis strategy applies to many situations.

Banks and Parmigiani (1992) suggest a twelve-step program for the preanalysis of multivariate time series data. Some of the steps are:

1. Put all data into common format.
2. Create a time stamp for each set of observations.
3. Classify missing data (e.g., intentionally missing, missing for a known cause, missing for an unknown cause, etc.).
4. Check the sample sizes against the values that should be present; this can discover missing data that were missed in the previous step.
5. Look for impossible values or values inconsistent with other values.
6. Synchronize the data, so that all measurements pertain to the same product (e.g., in plate glass manufacture, the features of the product made at noon today depend upon the tank temperature 24 hours earlier; thus the tank temperature needs to be lagged forward 24 hours, to correspond to the current glass).
7. Create a missing value chart, to show patterns of missing data that may be present.
8. Use imputation or some other approach to "fill in" the data that are missing (note: this will tend to cause one to underestimate the uncertainty in the analysis). I recommend local linear interpolation over more clever imputation methods for this task.
9. Create an extreme value chart, showing data that are peculiar (say, three standard deviations away from the average value).
10. Outlier detection determines which of the extreme values will be deleted and replaced by an imputation. One does this for fear that the outliers might make the analysis unrobust. If possible, look also for data that are outliers in a multivariate sense (e.g., large Mahalanobis distance).
11. Descriptive statistics enable one to review summaries of the data that will guide new analysis. For example, one might look at the maximum and minimum values of each variable, or use Q-Q plots to assess normality.
12. Begin elementary EDA. Make boxplots, scatterplots, and so forth to determine what kinds of more sophisticated analyses will be warranted.
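
Three of the steps above (synchronizing by a lag, filling gaps by linear interpolation, and flagging extreme values) are mechanical enough to sketch in a few lines. What follows is only an illustration in Python with NumPy, not the procedure of Banks and Parmigiani (1992): the function names are invented here, the series is assumed to be regularly sampled, missing readings are assumed to be coded as NaN, and the lag is counted in samples (so a 24-hour lag on hourly data would be lag=24).

```python
import numpy as np

def lag_forward(values, lag):
    """Step 6: shift a series forward by `lag` samples, so that entry t
    holds the reading taken at time t - lag (NaN where none exists)."""
    if lag < 0:
        raise ValueError("lag must be non-negative")
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape, np.nan)
    if lag < len(values):
        out[lag:] = values[:len(values) - lag]
    return out

def fill_linear(values):
    """Step 8: fill NaN gaps by linear interpolation between the nearest
    observed neighbours. Gaps at either end are held at the nearest
    observed value. Assumes at least one reading is present.
    Returns the filled series and a mask marking the imputed cells."""
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    idx = np.arange(len(values))
    filled = values.copy()
    filled[missing] = np.interp(idx[missing], idx[~missing], values[~missing])
    return filled, missing

def flag_extremes(values, n_sd=3.0):
    """Step 9: flag readings more than n_sd standard deviations from the
    mean (NaN readings are ignored and never flagged)."""
    values = np.asarray(values, dtype=float)
    mean, sd = np.nanmean(values), np.nanstd(values)
    return np.abs(values - mean) > n_sd * sd

# Hypothetical tank temperatures with one missing reading.
temps = [1500.0, 1502.0, np.nan, 1506.0, 1504.0]
filled, was_missing = fill_linear(temps)
print(filled, was_missing)
```

Keeping the mask of imputed cells alongside the filled series is one way to respect the warning in step 8: later analyses can then tell which values were observed and which were filled in.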

Finally, there should always be a thirteenth rule: Get an area expert to review all that's been done, to ensure that no damage has been done to the data by these various tinkerings.
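
Returning to step 10 of the list above, the Mahalanobis distance mentioned there can be computed directly from the sample mean and covariance. The sketch below is a hedged illustration in Python with NumPy; the function name is invented, and using the pseudo-inverse of the covariance is a choice made here so that nearly collinear variables do not cause a failure.

```python
import numpy as np

def mahalanobis_distances(X):
    """Distance of each row of X from the sample mean, scaled by the
    sample covariance. X has shape (n_observations, n_variables)."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.atleast_2d(np.cov(X, rowvar=False))
    # Assumption: the pseudo-inverse is used so that a singular covariance
    # (e.g. duplicated variables) does not raise an error.
    cov_inv = np.linalg.pinv(cov)
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return np.sqrt(np.maximum(d2, 0.0))

# Two strongly correlated variables, plus one point that is unremarkable
# in each variable separately but breaks the joint pattern.
t = np.linspace(-1.0, 1.0, 41)
wiggle = 0.05 * (-1.0) ** np.arange(41)
X = np.vstack([np.column_stack([t, t + wiggle]), [0.5, -0.5]])
d = mahalanobis_distances(X)
print(int(np.argmax(d)))  # index of the most outlying row
```

Because the mean and covariance are themselves estimated from data that may contain outliers, a cautious analyst would treat large distances as candidates for review rather than as automatic deletions.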

Compression

Compression may be necessary because the dataset is too large, and would swamp the computer analysis one wanted to perform. Also, compression is useful when there is too much data to permanently store, so one seeks a summary. Barnsley (1988) is a point of entry to the image compression literature.

Triage

The last 15 years have seen an enormous number of new statistical techniques proposed for multivariate nonparametric analysis. Each of these techniques performs well in some cases, but none is dominant. One reason for this is that each of the new techniques is tuned to notice some special kind of locally low-dimensional structure.

In order to use these methods, one should first check whether, locally, one's data have simple structure. For example, suppose one drew points on a piece of paper, crumpled it up, and then handed it to Persi Diaconis, who made the paper disappear, leaving only the points visible. If one looked at the points casually, the crumpling would have made them seem a three-dimensional blob. But if one looked more microscopically, one would notice that in small regions, the points lie almost exactly upon a two-dimensional surface.

An approach to the problem of assessing average local dimensionality is to take a hypersphere of radius r (for small r) and place it at random in the data cloud. Then one does a principal components analysis of the data that fall inside the hypersphere, and counts how many axes are needed to account for, say, 80% of the total variation. This number is p_1. Then one finds a new random location for the hypersphere and repeats the process, m times in all, and ultimately averages the p_1, ..., p_m to estimate the local dimensionality. See Banks and Olszewski (1997).
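
The hypersphere procedure just described can be sketched as follows. This is a minimal illustration in Python with NumPy under stated assumptions, not the estimator of Banks and Olszewski (1997): the function name and parameters are invented, each sphere is centred on a randomly chosen data point (one simple way of placing it "at random in the data cloud"), and spheres that capture too few points for a covariance estimate are skipped.

```python
import numpy as np

def local_dimension(X, r, m=50, var_frac=0.80, min_points=None, seed=0):
    """Average, over up to m random spheres of radius r, of the number of
    principal axes needed to reach var_frac of the local variation."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if min_points is None:
        min_points = p + 1
    rng = np.random.default_rng(seed)
    counts = []
    for _ in range(m):
        # Assumption: the sphere is centred on a randomly chosen observation.
        center = X[rng.integers(n)]
        inside = X[np.linalg.norm(X - center, axis=1) <= r]
        if len(inside) < min_points:
            continue  # too few points to estimate a covariance matrix
        # Eigenvalues of the local covariance, largest first.
        eigvals = np.linalg.eigvalsh(np.cov(inside, rowvar=False))[::-1]
        eigvals = np.clip(eigvals, 0.0, None)
        total = eigvals.sum()
        if total <= 0.0:
            continue
        cumulative = np.cumsum(eigvals) / total
        # Smallest number of axes whose cumulative share reaches var_frac.
        counts.append(min(p, int(np.searchsorted(cumulative, var_frac)) + 1))
    return float(np.mean(counts)) if counts else float("nan")

# A sheet of points rolled into a cylinder in three dimensions.
theta = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
height = np.linspace(0.0, 2.0, 15)
T, H = np.meshgrid(theta, height)
sheet = np.column_stack([np.cos(T).ravel(), np.sin(T).ravel(), H.ravel()])
print(local_dimension(sheet, r=0.4), local_dimension(sheet, r=10.0))
```

For the rolled sheet one would expect an estimate close to 2 with the small radius, and 3 once the sphere is large enough to swallow the whole dataset, which is the crumpled-paper effect in miniature. The choice of r matters: too small and the spheres are nearly empty, too large and the local structure is averaged away.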

If the average local dimensionality is relatively small, even though the apparent dimensionality may be large, then there is a chance that one of the new analytical tools will be pertinent. But when the average local dimensionality is not small, then it is hard to imagine that any statistical analysis will have much success in uncovering a complex model with highly multivariate interactions.

Indexing

When one is faced with too much data, then a useful thing to do is to find some way of organizing the data to reflect its variation. For example, if one were given all of the IRS returns for 1997 and asked to make some kind of statistical sense of the data, then it would be enormously useful to begin by getting a sense of the possible range of variation in the data. In particular, one might want to look at 20 returns that are widely spaced (with respect to a user-defined metric) in the space of all returns---presumably one would see the family with 25 children, the single mother, the two-paycheck household, and so forth.
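
One simple way to pick a handful of widely spaced observations, offered here only as an illustration and not as an established answer to the indexing problem, is a greedy "farthest-first" selection: start from any observation, then repeatedly add the observation whose distance to the nearest already-chosen one is largest. In the Python sketch below the function name, the default Euclidean metric, and the toy data are all assumptions; the `metric` argument stands in for the user-defined metric.

```python
import numpy as np

def spread_out_index(X, k, metric=None, start=0):
    """Greedy farthest-first selection of k row indices of X."""
    X = np.asarray(X, dtype=float)
    if metric is None:
        # Assumption: plain Euclidean distance unless the user supplies one.
        metric = lambda a, b: float(np.linalg.norm(a - b))
    chosen = [start]
    # Distance from every observation to its nearest chosen observation.
    nearest = np.array([metric(x, X[start]) for x in X])
    while len(chosen) < min(k, len(X)):
        nxt = int(np.argmax(nearest))
        chosen.append(nxt)
        dist_new = np.array([metric(x, X[nxt]) for x in X])
        nearest = np.minimum(nearest, dist_new)
    return chosen

# Toy one-variable "returns": most are similar, a few are unusual.
returns = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [50.0]])
print(spread_out_index(returns, k=3))
```

Each added observation costs one pass over the data, so for a truly superlarge dataset one would presumably run such a selection on a sample rather than on every record. A greedy rule of this kind also favours oddities, which suits the goal of seeing the range of variation but says nothing about what is typical.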

To make index creation work well, the metric must be chosen with an eye to a human's sense of distance. Ideally, one would have an area expert go through a preliminary sample, declaring rough distances based on experience, and then develop a mathematical program to find the metric that best accords with the expert's judgments.
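
As a rough sketch of what fitting a metric to an expert's judgments might look like, suppose the metric is restricted to a weighted Euclidean form, d(x, y)^2 = sum over k of w_k (x_k - y_k)^2. The squared distance is then linear in the weights, so they can be fitted by least squares to the expert's squared distances on a set of judged pairs. Everything in the Python code below is an assumption made for illustration: the weighted Euclidean form, the function names, and the clipping of negative weights to zero (a shortcut, not a proper constrained fit).

```python
import numpy as np

def fit_axis_weights(X, pairs, expert_dist):
    """Least-squares weights w for d(x, y)^2 = sum_k w_k (x_k - y_k)^2,
    fitted to expert distances on the index pairs given in `pairs`."""
    X = np.asarray(X, dtype=float)
    i, j = np.asarray(pairs).T
    sq_diffs = (X[i] - X[j]) ** 2  # one row per judged pair
    target = np.asarray(expert_dist, dtype=float) ** 2
    w, *_ = np.linalg.lstsq(sq_diffs, target, rcond=None)
    # Assumption: negative weights are simply clipped to zero; this is
    # not the same as a true non-negative least-squares solution.
    return np.clip(w, 0.0, None)

def weighted_distance(a, b, w):
    """Distance between two observations under the fitted weights."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))

# Hypothetical judgments: the expert treats the first variable as far
# more important than the second.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pairs = [(0, 1), (0, 2), (0, 3), (1, 2)]
judged = [2.0, 0.5, np.sqrt(4.25), np.sqrt(4.25)]
w = fit_axis_weights(X, pairs, judged)
print(w, weighted_distance(X[0], X[3], w))
```

A fitted metric of this kind could then be handed to whatever procedure selects the index observations. Real expert judgments would be rough and inconsistent, so the fit would only ever be approximate.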

In terms of getting a rapid understanding of complex data, it is enormously useful to have a list of observations whose ambit includes virtually all kinds of behaviour found in the superlarge dataset. The construction of such an index has not been, to my knowledge, seriously addressed.

References

Banks, D. L. and Olszewski, R. (1997). "Estimating Local Dimensionality," to appear in _Proceedings of the Statistical Computing Section of the American Statistical Association_.

Banks, D. L. and Parmigiani, G. (1992). "Preanalysis of Superlarge Datasets," _Journal of Quality Technology_, 24, 115-129.

Barnsley, M. (1988). _Fractals Everywhere_. Academic Press, NY.

::::::: 1997 General Business Meeting :::::::

The 1997 general business meeting of CSNA was called to order at 11:55 a.m., 13-Jun-1997, at American University, Washington, DC, in conjunction with the society's annual meeting, by CSNA President Peter Bryant.

Announcements:
Persons interested in being slated for office should contact the Nominating Committee (Pascale Rousseau, Chair; Stephen Hirtle; Glenn Milligan).
Persons interested in hosting CSNA99 should contact President Peter Bryant, President-Elect Stephen Hirtle, or another Board member.
Next year's meeting (1998) will be joint with the Psychometric Society, in Urbana, Illinois. The short courses will be offered on Wed., 17-June, and the meeting itself will run from Thursday, 18-June, to Sunday, 21-June.

Doug Carroll commented that he had recently been in touch with Fionn Murtagh, and that Murtagh is not planning to host an IFCS meeting in the year 2000, but may do so at a later date. IFCS/2000 has been tentatively set for a location in Belgium, and would be jointly sponsored by SFC (Societe Francophone de Classification) and VOC, the Dutch/Flemish Classification Society.

The meeting adjourned at 12:15 p.m.

s. sclove

:::::::::::::::::::::::::::::::: CSNA-98 PRELIMINARY ANNOUNCEMENT ::::::::::::::::::::::::::::::::

The 1998 annual meetings of the Classification Society of North America (CSNA) and the Psychometric Society (PS) will be held jointly at the University of Illinois in Urbana, Illinois from Wednesday June 17th until Sunday June 21st, 1998, at the Levis Faculty Center on the University of Illinois campus. The meeting is supported by the Department of Statistics and the Department of Psychology, University of Illinois.

Short courses are planned for Wednesday June 17th, with regular sessions beginning on the morning of Thursday June 18th. There will be a reception, business meetings of both societies, and a banquet during the meeting. CSNA and PS meetings are traditionally informal and very interdisciplinary. Abstracts of papers are distributed, but no formal proceedings are produced. Speakers are encouraged to discuss work in progress, of either applied or methodological nature.

The following contributed paper sessions are currently planned:
Applications
Applied Statistical Methods
Bayesian Statistical Methods
Categorical Data Analysis
Classical Test Theory
Classification
Cluster Analysis
Correspondence and Homogeneity Analysis, and Optimal Scaling
Covariance Structure and Factor Analysis
Exploratory Data Analysis
Graphical Models
Item Response Theory
Linear Models
Longitudinal Data Analysis
Multidimensional Scaling
Multivariate Statistical Methods
Networks and Graph Theory

As is traditional, the meeting will have presentations by the Presidents of the societies, and several invited lectures. There will be a special session for graduate student presentations. Current information about the meeting can be found at the WWW sites http://www.pitt.edu/~csna/ and http://www.conted.ceps.uiuc.edu/fmpro/psychometric_society.form.html

The program committee is Ulf Bockenholt, David Budescu, David Dubin, Stephen Hirtle, Jacqueline Meulman, with co-chairs Ivo Molenaar, Carolyn Anderson, and Stanley Wasserman. The committee is open to suggestions for topics, symposia, panel discussions, and other contributions. Please direct your suggestions to Stanley Wasserman, University of Illinois, 603 East Daniel Street, Champaign, IL 61820 USA; telephone 1-217-333-3325; fax 1-217-244-5876; email pscsna98@s.psych.uiuc.edu. A formal announcement of the meeting will be issued late in 1997. Abstracts will be due in March, 1998.

::::::::::::::::::::: OTHER CONFERENCE NEWS :::::::::::::::::::::

* JULY 6-10, 1998: 14th Australian Statistical Congress, Jupiter's Casino, Gold Coast, Queensland, Australia. Programme Chair: K. Basford, Local Organization: W. Robb. Address: ASC14, School of Mathematical Sciences, Queensland University of Technology, GPO Box 2434, Brisbane QLD 4001, Australia. email: asc14@qut.edu.au, fax: +61 7 38642310. web: http://www.math.fsc.qut.edu.au/asc14.html

* JULY 21-24, 1998: 6th Conference of the International Federation of Classification Societies, Rome, Italy. Put it on your calendars; more information will appear later.


The WWW version of the CSNA Newsletter is made available as a service of the Classification Society of North America. For further information on becoming a member of CSNA, please contact the CSNA Business Manager.
Stephen Hirtle, hirtle+@pitt.edu, CSNA Webmaster.