As improbable as this may seem now, I was at one
time in college a statistics major. After
taking all the undergraduate courses in statistics, I enrolled in a
graduate course in mathematical
statistics at Columbia with the eminent Harold Hotelling, one of the
founders of modern
mathematical economics. After listening to several lectures of
Hotelling, I experienced an
epiphany: the sudden realization that the entire "science" of
statistical inference rests on one
crucial assumption, and that that assumption is utterly groundless. I
walked out of the Hotelling
course, and out of the world of statistics, never to return.
Statistics, of course, is far more than the mere
collection of data. Statistical inference consists of
the conclusions one can draw from those data. In particular,
since--apart from the decennial US
census of population--we never know all the data, our conclusions must
rest on very small
samples drawn from the population. After taking our sample or samples,
we have to find a way to
make statements about the population as a whole. For example, suppose
we wish to conclude
something about the average height of the American male population.
Since there is no way that
we can mobilize every male American and measure everyone's height, we
take samples of a
small number, say 500 people, selected in various ways, from which we
presume to say what the
average American's height may be.
In the science of statistics, the way we move from
our known samples to the unknown
population is to make one crucial assumption: that the samples will, in
any and all cases, whether
we are dealing with height or unemployment or who is going to vote for
this or that candidate, be
distributed around the population figure according to the so-called
"normal curve."
The normal curve is a symmetrical, bell-shaped
curve familiar to all statistics textbooks.
Because all samples are assumed to fall around the population figure
according to this curve, the
statistician feels justified in asserting, from his one or more limited
samples, that the
average height of the American population, or the unemployment rate, or
whatever, is definitely XYZ
within a "confidence level" of 90 or 95%. In short, if, for example, the
average male height in a
sample is 5 feet 9 inches, 90 or 95 out of every 100 such samples
will be within a certain
definite range of 5 feet 9 inches. These precise figures are arrived at
simply by assuming that all
samples are distributed around the population according to this normal
curve.
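To make the procedure described above concrete, here is a minimal sketch of
a normal-curve confidence interval for an average height. The 500 heights
are synthetic values generated purely for illustration, and the function
name normal_ci is hypothetical; none of this comes from the original text.

```python
import math
import random
import statistics

def normal_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for a population mean."""
    n = len(sample)
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    # Critical value taken from the standard normal curve.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err

# ASSUMPTION: made-up heights in inches, not real survey data.
rng = random.Random(0)
heights = [rng.gauss(69.0, 3.0) for _ in range(500)]

low, high = normal_ci(heights, confidence=0.95)
print(f"sample mean: {statistics.fmean(heights):.2f} in")
print(f"95% interval: {low:.2f} to {high:.2f} in")
```

The width of the interval depends entirely on the critical value z, which
is read off the normal curve. That is the step the argument here calls
into question.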
It is because of the properties of the normal
curve, for example, that the election pollsters
could assert, with overwhelming confidence, that Bush was favored by a
certain percentage of
voters, and Dukakis by another percentage, all within "three percentage
points" or "five
percentage points" of "error." It is the normal curve that permits
statisticians to claim
knowledge of population figures, not with absolute precision, but
within
a few percentage points.
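As a rough illustration of where such "percentage points" come from, the
sketch below computes the textbook normal-curve margin of error for a
polled proportion. The poll sizes of 1,000 and 400 respondents are
assumptions chosen for the example, not figures from any actual poll, and
margin_of_error is a hypothetical helper name.

```python
import math
import statistics

def margin_of_error(n, p=0.5, confidence=0.95):
    """Normal-approximation margin of error for a sample proportion.

    n: number of respondents; p: assumed true proportion (0.5 is the
    worst case); confidence: the desired confidence level.
    """
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)

# ASSUMPTION: hypothetical poll sizes, for illustration only.
for n in (1000, 400):
    print(f"n = {n}: +/- {100 * margin_of_error(n):.1f} percentage points")
```

Under these assumptions the results are roughly 3.1 and 4.9 percentage
points, which is the order of magnitude the pollsters quote.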
Well, what is the evidence for this vital
assumption of distribution around a normal
curve? None whatever. It is a purely mystical act of faith. In my old
statistics text, the only
"evidence" for the universal truth of the normal curve was the
statement that if good riflemen
shoot to hit a bullseye, the shots will tend to be distributed around
the target in something like a
normal curve. On this incredibly flimsy basis rests an assumption vital
to the validity of all
statistical inference.
Unfortunately, the social sciences tend to follow
the same law that the late Dr. Robert
Mendelsohn has shown is adopted in medicine: never drop any procedure,
no matter how faulty,
until a better one is offered in its place. And now it seems that the
entire fallacious structure of
inference built on the normal curve has been rendered obsolete by
high-tech.
Ten years ago, Stanford statistician Bradley Efron
used high-speed computers to generate
"artificial data sets" based on an original sample, and to make the
millions of numerical
calculations necessary to arrive at a population estimate without using
the normal curve, or any
other arbitrary, mathematical assumption of how samples are distributed
about the unknown
population figure. After a decade of discussion and tinkering,
statisticians have agreed on
methods of practical use of this "bootstrap" method, and it is now
beginning to take over the
profession.
Stanford statistician Jerome H. Friedman, one of the pioneers of the
new
method, calls it "the most important new idea in statistics in the last
20 years, and probably the
last 50."
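The idea of generating "artificial data sets" from an original sample can
be sketched as follows. This is a minimal percentile bootstrap, one simple
variant of the approach and not necessarily the exact procedure Efron or
Friedman used. The data are synthetic, and the function name bootstrap_ci
and its parameters are hypothetical.

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.fmean, n_resamples=2000,
                 confidence=0.95, seed=0):
    """Percentile-bootstrap interval for a statistic of the sample."""
    rng = random.Random(seed)
    n = len(sample)
    # Each artificial data set is drawn from the original sample
    # with replacement and has the same size as the original.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Read the interval off the sorted estimates; no normal curve involved.
    alpha = (1 - confidence) / 2
    lower = estimates[int(alpha * n_resamples)]
    upper = estimates[int((1 - alpha) * n_resamples) - 1]
    return lower, upper

# ASSUMPTION: made-up, deliberately skewed (non-bell-shaped) data.
data_rng = random.Random(1)
sample = [data_rng.expovariate(1 / 30.0) for _ in range(200)]

low, high = bootstrap_ci(sample)
print(f"sample mean: {statistics.fmean(sample):.2f}")
print(f"95% bootstrap interval: {low:.2f} to {high:.2f}")
```

Passing a different function as stat, for example statistics.median,
yields an interval for that statistic by the same resampling.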
At this point, statisticians are finally willing to
let the cat out of the bag. Friedman now
concedes that "data don't always follow bell-shaped curves, and when
they don't, you make a
mistake" with the standard methods. In fact, he added that "the data
frequently are distributed
quite differently than in bell-shaped curves." So that's it; now we
find that the normal curve
Emperor has no clothes after all. The old mystical faith can now be
abandoned; the Normal
Curve god is dead at long last.