s-news

Re: automate writing formulas

To: Alan Hochberg <alan.hochberg@prosanos.com>
Subject: Re: automate writing formulas
From: Frank E Harrell Jr <f.harrell@vanderbilt.edu>
Date: Mon, 31 Jul 2006 11:08:10 -0500
Cc: "'Kamil Toth'" <kamiltoth@yahoo.com>, s-news@lists.biostat.wustl.edu
In-reply-to: <20060731132729.5DAE7100B08C@mailgate.biostat.wustl.edu>
References: <20060731132729.5DAE7100B08C@mailgate.biostat.wustl.edu>
User-agent: Thunderbird 1.5.0.5 (X11/20060728)
Alan Hochberg wrote:
You could, if you wish, use names() to get the variable names in your data
frame, use cat() with sink() to write them in the form of a formula into a
little text file, and then copy/paste them into R/S-plus.  However I
STRONGLY suggest that you, as my children would say, "Don't go there."

Below is a quote from a book called "Numerical Methods that Work", written
in 1970 by the professor who taught me numerical methods, Dr. Forman Acton,
specifically in a chapter called, "Interlude: What Not to Compute".
Apparently the book has been reissued in the UK.  There are references to
"desk machines" and "valuable computer time" that are now dated, but the
principles still hold.  I also think it's one of the most cogent and
funniest bits of writing in the field, which is why it has stuck with me for
30+ years:

LARGE SETS OF LINEAR EQUATIONS

'What fools these mortals be'   -- Seneca

Whenever a person eagerly inquires if my computer can solve a set of 300
equations in 300 unknowns, I must quickly suppress the temptation to retort,
'Yes, but why bother?'  There are, indeed, legitimate sets of equations that
large.  They arise from replacing a partial differential equation on a set
of grid points, and the person who knows enough to tackle this type of
problem also usually knows what kind of computer he needs.  The odds are all
too high that our inquiring friend is suffering from quite a different
problem: he [sic] probably has collected a set of experimental data and is
now attempting to fit a 300-parameter model to it--by Least Squares!...

It does no good to point out that several parameters are nearly certain to
be competing to "explain" the same variations in the data and hence the
equation system will be nearly indeterminate.  It does no good to point out
that ALL large least-squares matrices are striving mightily to be proper
subsets of the Hilbert matrix--which is virtually indeterminate and
uninvertible--and so even if all 300 parameters were beautifully
independent, the fitting equations would still be violently unstable.  All
of this, I repeat, does no good--and you end up getting angry and throwing
the guy out of your office.
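
[Illustration added to this archive copy; not part of Acton's text or of the original message.] The remark about the Hilbert matrix can be checked numerically. The sketch below is in Python rather than S, uses only the standard library, and makes some assumptions of its own: the helper names (hilbert, invert, inf_norm, condition_number) are hypothetical, and the infinity norm is chosen purely for convenience. It builds the n-by-n Hilbert matrix in exact rational arithmetic, inverts it, and reports the condition number, which indicates roughly how much relative error in the data can be amplified in the solution.

```python
from fractions import Fraction


def hilbert(n):
    """n-by-n Hilbert matrix, H[i][j] = 1/(i + j + 1) with 0-based i, j."""
    return [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]


def invert(a):
    """Exact inverse of a square matrix of Fractions (Gauss-Jordan)."""
    n = len(a)
    # Augment each row with the matching row of the identity matrix.
    aug = [row[:] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        # Swap a row with a nonzero pivot into position.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1.
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def inf_norm(a):
    """Infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in a)


def condition_number(n):
    """Infinity-norm condition number of the n-by-n Hilbert matrix."""
    h = hilbert(n)
    return inf_norm(h) * inf_norm(invert(h))


for n in (2, 3, 5, 10):
    print(n, float(condition_number(n)))
```

The condition number is 27 for n = 2 and 748 for n = 3, and it keeps growing rapidly with n; by n = 10 it exceeds 10^12, so double-precision arithmetic has only a few trustworthy digits left. Exact fractions are used here only so that the reported number is not itself distorted by rounding.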

...They should merely fit a five-parameter model, then a six-parameter one.
If all goes well and there is a statistically-valid reduction of the
residual variability, then a somewhat more elaborate model may be tried.

Alan - these are wonderful quotes until this part.  The belief that the data
can tell you which variables to use, in the absence of penalization for
discarded predictors, is illusory.  Because of extreme model uncertainty,
this approach solves absolutely no part of the instability problem.  Put
another way, Ye's approach (see below) will show that the effective degrees
of freedom remain about 300.  Otherwise thanks very much for the excellent
quotes.

Frank Harrell

@ARTICLE{ye98mea,
  author = {Ye, Jianming},
  year = 1998,
  title = {On measuring and correcting the effects of data mining and model
          selection},
  journal = JASA,
  volume = 93,
  pages = {120-131},
  annote = {generalized degrees of freedom;GDF;effective degrees of
           freedom;data mining;model selection;model
           uncertainty;overfitting;nonparametric regression;CART;simulation
           setup}
}




Somewhere along the line--and it will be much closer to 15 parameters than
to 300--the significant improvement will cease and the fitting operation is
over.  There is no system of 300 equations, no 300 parameters, and no
glamour.  But a person...has to have a clear idea about the mechanisms by
which variability has entered his data, and he has to know the intended use
for his fitted formula... [End of quote: Some of the funniest writing has
been removed in the name of netiquette.]

I recommend both the book and the approach.

Alan

Alan Hochberg
Vice President, Research
ProSanos Corp.
225 Market Street
Suite 502
Harrisburg, PA 17101
Tel. 717-635-2124
Fax 717-635-2575
alan.hochberg@prosanos.com
www.prosanos.com








--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
