Re: model validation - data scrambling versus hold out sets?

To: jmp-l@lists.biostat.wustl.edu
Subject: Re: model validation - data scrambling versus hold out sets?
From: paige.miller@kodak.com
Date: Tue, 14 Jun 2005 08:22:19 -0400
Spectroscopy is one application where you might have 5000 Xs and only n=15.
Correlations of >0.99 between the columns are common.

Cross-validation is the most commonly used approach to determine whether
the predictive ability of the model is better than random chance. This is
the "hold-out method" mentioned by Dr. Metz. One can find a zillion (well,
maybe not a zillion, but lots) of articles in which this type of
cross-validation is performed even on small samples such as n=15. One
reference: Martens, H. and Martens, M. (2001) "Multivariate Analysis of
Quality", John Wiley and Sons, Ltd (although this textbook does not have
examples with 5000 Xs, the principles remain the same when looking at 49 Xs
or 5000 Xs).
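
To make the hold-out idea concrete for small n, here is a minimal Python
sketch of leave-one-out cross-validation. It is only an illustration under
stated assumptions: the model (pick the single X column most correlated with
Y, then fit a straight line) is a stand-in chosen for brevity, not the
multivariate calibration methods discussed in the reference, and the names
fit_best_single_predictor and loo_q2 are hypothetical. The point it shows is
that the column selection is redone inside every fold, so the held-out row
never influences the choice.

```python
import random


def fit_best_single_predictor(columns, y, rows):
    """Using only `rows`, pick the column that best explains y and fit y = a + b*x.

    `columns` is a list of X columns, each a list with one value per observation.
    Returns (column index, intercept, slope).
    """
    ys = [y[i] for i in rows]
    mean_y = sum(ys) / len(ys)
    best_score = -1.0
    best = (0, mean_y, 0.0)  # fallback: predict the training mean
    for j, col in enumerate(columns):
        xs = [col[i] for i in rows]
        mean_x = sum(xs) / len(xs)
        sxx = sum((u - mean_x) ** 2 for u in xs)
        if sxx == 0:
            continue  # a constant column cannot predict anything
        sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(xs, ys))
        score = sxy * sxy / sxx  # proportional to r^2 within this training set
        if score > best_score:
            slope = sxy / sxx
            best_score = score
            best = (j, mean_y - slope * mean_x, slope)
    return best


def loo_q2(columns, y):
    """Leave-one-out Q^2 = 1 - PRESS / total sum of squares."""
    n = len(y)
    press = 0.0
    for held_out in range(n):
        train = [i for i in range(n) if i != held_out]
        # Selection and fitting see only the training rows.
        j, intercept, slope = fit_best_single_predictor(columns, y, train)
        prediction = intercept + slope * columns[j][held_out]
        press += (y[held_out] - prediction) ** 2
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    return 1.0 - press / ss_total


# Demo on made-up data: 15 observations, 300 pure-noise X columns.
rng = random.Random(0)
noise_columns = [[rng.gauss(0, 1) for _ in range(15)] for _ in range(300)]
noise_y = [rng.gauss(0, 1) for _ in range(15)]
print("Q^2 when Y is unrelated to every X:", loo_q2(noise_columns, noise_y))

# Same Xs, but now Y really does depend on the first column.
signal_y = [3.0 + 2.0 * x for x in noise_columns[0]]
print("Q^2 when Y depends on column 0:", loo_q2(noise_columns, signal_y))
```

A Q^2 near 1 suggests genuine predictive ability, while a value near or
below zero suggests the apparent fit does not carry over to held-out rows.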

I have often thought that randomly permuting the Ys will also give
useful information. Suppose you randomly permute the Ys 1000 times (leaving
the Xs unchanged) and each time compute a measure of fit. You get a
distribution of this measure of fit that you can compare to the actual
measure of fit. If the actual measure of fit is better than the 95th
percentile of the distribution, then perhaps you have uncovered a
statistically significant effect. This method guards against there being so
many patterns in the 5000 X variables that you can fit almost any pattern
of Ys.
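
The permutation procedure described above can be sketched in a few lines of
Python. This is a rough illustration, not JMP's column shuffle: the measure
of fit used here (the largest absolute Pearson correlation between Y and any
single X column) is an assumption picked for simplicity, the function names
are hypothetical, and the demo uses fewer columns and permutations than the
5000 Xs and 1000 shuffles mentioned so that it runs quickly.

```python
import random
from math import sqrt


def pearson_r(a, b):
    """Pearson correlation of two equal-length lists (assumes neither is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def max_abs_r(columns, y):
    """Illustrative measure of fit: best |r| between y and any one X column."""
    return max(abs(pearson_r(col, y)) for col in columns)


def permutation_test(columns, y, fit=max_abs_r, n_perm=1000, seed=0):
    """Shuffle Y n_perm times with the Xs unchanged and compare fits.

    Returns (actual fit, 95th percentile of the permuted fits, p-value).
    """
    rng = random.Random(seed)
    actual = fit(columns, y)
    y_perm = list(y)
    null = []
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # breaks any real X-Y link, keeps the Y values
        null.append(fit(columns, y_perm))
    null.sort()
    p95 = null[(95 * n_perm + 99) // 100 - 1]  # value at the 95th percentile
    # Share of permuted fits at least as good as the actual fit; the +1 counts
    # the observed arrangement itself so the p-value is never exactly zero.
    p_value = (1 + sum(s >= actual for s in null)) / (n_perm + 1)
    return actual, p95, p_value


# Demo on made-up data: 15 observations, 100 noise columns, Y unrelated to X.
rng = random.Random(1)
columns = [[rng.gauss(0, 1) for _ in range(15)] for _ in range(100)]
y = [rng.gauss(0, 1) for _ in range(15)]
actual, p95, p_value = permutation_test(columns, y, n_perm=200)
print("actual fit:", actual)
print("95th percentile under permutation:", p95)
print("permutation p-value:", p_value)
```

With many columns and few rows, the actual fit on pure noise tends to look
impressive on its own; the comparison against the permuted fits is what
shows whether it is any better than chance.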

I know of no study in which the two methods are compared. In fact, I know
of no published work in which the permutation method is used or espoused.
But I think both methods have their place and should be considered
complementary rather than mutually exclusive.

--
Paige Miller
Eastman Kodak Company

paige.miller@kodak.com
(585) 477-2946
http://www.kodak.com

"It's nothing until I call it!" -- Bill Klem, NL Umpire
"When you get the choice to sit it out or dance, I hope you dance" -- Lee
Ann Womack


"David Ikle" <david.ikle@wilm.ppdi.com>
Sent by: jmp-l-owner@lists.biostat.wustl.edu
06/13/2005 09:40 PM
Please respond to jmp-l

To:      jmp-l@lists.biostat.wustl.edu
cc:      James T Metz <james.metz@abbott.com>
Subject: Re: [jmp-l] model validation - data scrambling versus hold out sets?

My first question is what kind of a problem generates >5000 Xs
on each of 15 observations?  David

James T Metz wrote:
>
> JMP Users,
>
> I have a general question concerning model validation.  Does anyone
> have any thoughts or comments concerning (Y data) (multiple)
> scrambling (using the column shuffle option in JMP) versus hold-out
> data sets (using excluded rows) as a means to "validate" models?  Is
> one method generally preferred over the other?  Is one method
> generally better for regression while another method is better for
> partition models, etc?  Is the number of observations important?
>
> Case-in-point - I have a data set of about 15 observables (Y values).
> I can obtain > 5000 X values (descriptors or columns) for each of the
> rows.  Obviously, there is a great, and highly likely, danger of
> chance correlation.  I could use either method mentioned above to
> "validate" generated models.  However, my intuition says that the
> hold-out method is not appropriate in this case, since my data set is
> so small.  Do others agree?
>
>         I welcome thoughts, comments, literature references, etc.
>
>         Regards,
>         Jim Metz
>
> James T. Metz, Ph.D.
> Research Investigator Chemist
>
> GPRD R46Y AP10-2
> Abbott Laboratories
> 100 Abbott Park Road
> Abbott Park, IL  60064-6100
> U.S.A.
>
> Office (847) 936 - 0441
> FAX    (847) 935 - 0548
>
> james.metz@abbott.com