Re: model validation - data scrambling versus hold out sets?

To: <jmp-l@lists.biostat.wustl.edu>
Subject: Re: model validation - data scrambling versus hold out sets?
From: <Gunter.Hartel@csl.com.au>
Date: Tue, 14 Jun 2005 11:24:17 +1000
Hi Jim,
The Y-shuffle approach is really a permutation method.  The problem is that 
permuting only the Y breaks up its association with all of the X's at once.  
So you can use this approach to test Y=f(x) vs Y=mu, but not Y=f1(x)+f2(x) vs 
Y=f1(x), that is, the important case of comparing a reduced model with a full 
model.  The other approach, using 'hold-out' data sets, is like 
cross-validation.  You can do a k-fold cross-validation, where you hold out 
1/k of the data, fit your model to the rest, test it on the held-out part, and 
do that k times.  If k = N you have a leave-one-out approach.  Another 
approach you should consider is bootstrapping.  
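
For anyone who wants to try the two ideas outside JMP, here is a minimal sketch in Python with NumPy.  It is an illustration only, not JMP's implementation: the ordinary least-squares model and the function names r_squared, y_shuffle_p_value and k_fold_mse are assumptions made for the example.  The first function pair tests Y=f(x) against Y=mu by shuffling Y; the last one does k-fold cross-validation, and setting k = N gives leave-one-out.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_shuffle_p_value(X, y, n_perm=999, seed=0):
    """Permutation p-value for Y=f(x) vs Y=mu: shuffle y, refit, compare R^2."""
    rng = np.random.default_rng(seed)
    observed = r_squared(X, y)
    hits = 0
    for _ in range(n_perm):
        # Shuffling y breaks its link to every column of X at once.
        if r_squared(X, rng.permutation(y)) >= observed:
            hits += 1
    # The +1 counts the observed arrangement as one of the permutations.
    return (hits + 1) / (n_perm + 1)

def k_fold_mse(X, y, k, seed=0):
    """Mean squared prediction error from k-fold cross-validation."""
    n = len(y)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    sq_err = 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Fit on the rows that were not held out.
        A_train = np.column_stack([np.ones(train.sum()), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        # Predict the held-out rows and accumulate the squared errors.
        A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
        sq_err += np.sum((y[test_idx] - A_test @ coef) ** 2)
    return sq_err / n
```

Here X is an (N, p) array and y a length-N array.  The fit needs more rows than columns to be meaningful, so a wide descriptor table would first have to be reduced to a few candidate X's, and that selection step would have to be repeated inside every shuffle or fold.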
 
Look for books by Philip Good and ones by Fortunato Pesarin.  
Cheers
Gunter
 

-----Original Message-----
From: jmp-l-owner@lists.biostat.wustl.edu 
[mailto:jmp-l-owner@lists.biostat.wustl.edu]On Behalf Of James T Metz
Sent: Tuesday, 14 June 2005 9:23 AM
To: jmp-l@lists.biostat.wustl.edu
Cc: James T Metz
Subject: [jmp-l] model validation - data scrambling versus hold out sets?



JMP Users, 

        I have a general question concerning model validation.  Does anyone 
have any thoughts or comments concerning (multiple) scrambling of the Y data 
(using the column shuffle option in JMP) versus hold-out data sets (using 
excluded rows) as a means to "validate" models?  Is one method generally 
preferred over the other?  Is one method generally better for regression 
while another method is better for partition models, etc.?  Is the number of 
observations important? 

        Case in point: I have a data set of about 15 observables (Y values).  
I can obtain > 5000 X values (descriptors or columns) for each of the rows.  
Obviously, there is a great, and highly likely, danger of chance correlation.  
I could use either method mentioned above to "validate" generated models.  
However, my intuition says that the hold-out method is not appropriate in 
this case, since my data set is so small.  Do others agree? 
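
The size of the chance-correlation risk with about 15 rows and 5000 columns can be illustrated with a small simulation.  This is a sketch under assumed conditions, not an analysis of the actual data: both Y and the descriptors are drawn as independent standard normal noise, and the function name max_abs_chance_correlation is made up for the example.

```python
import numpy as np

def max_abs_chance_correlation(n_rows=15, n_cols=5000, seed=0):
    """Largest |Pearson r| between a random y and n_cols unrelated random columns."""
    rng = np.random.default_rng(seed)
    y = rng.normal(size=n_rows)            # pure-noise response
    X = rng.normal(size=(n_rows, n_cols))  # pure-noise descriptors
    # Standardize, then r for each column is the mean product of z-scores.
    y_z = (y - y.mean()) / y.std()
    X_z = (X - X.mean(axis=0)) / X.std(axis=0)
    r = (X_z.T @ y_z) / n_rows
    return np.max(np.abs(r))

print(max_abs_chance_correlation())
```

Even though no column is related to y by construction, the best of the 5000 columns will usually show a large correlation, which is why picking descriptors first and validating afterwards can mislead.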

        I welcome thoughts, comments, literature references, etc. 

        Regards, 
        Jim Metz 


James T. Metz, Ph.D.
Research Investigator Chemist

GPRD R46Y AP10-2
Abbott Laboratories
100 Abbott Park Road
Abbott Park, IL  60064-6100
U.S.A.

Office (847) 936 - 0441
FAX    (847) 935 - 0548

james.metz@abbott.com




CSL Limited A.C.N. 051 588 348
45 Poplar Road Parkville Victoria 3052 Australia
Phone: +61 3 9389 1911  Fax: +61 3 9389 1434

