Thanks to Elizabeth Atkinson, Terry Therneau, Andreas Krause and Volker
Bahn for their insights.
I should have been more specific about what I am trying to achieve.
I have a large data set, around 1,750,000 rows, with roughly 80 very highly
correlated continuous variables. All of them have missing values, some a
great many. Excluding every row with a missing value is not an option, since
doing so would leave me with no data at all!
What I was hoping to achieve with trees, built with "tree" and/or "rpart",
was to get help in categorizing those continuous variables as optimally as
possible with respect to some continuous response variable, and to sort out,
in some way, the correlated variables. Once these tasks are accomplished, the
remaining categorized variables will be used in a glm.
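
As an illustration of the kind of categorization meant here, the following is
a minimal sketch written in Python rather than S+, purely so that it is
self-contained. It mimics a one-variable regression tree: it searches for the
cut points on a single continuous predictor that most reduce the residual sum
of squares of the response, then bins the predictor at those cuts. The
function names (best_split, tree_cuts, categorize), the toy data, and the
decision to keep missing values as a level of their own are all assumptions
made for the example, not a description of what "tree" or "rpart" do
internally.

```python
# Illustrative sketch only: a hand-rolled one-variable regression tree.
# All names and the toy data are hypothetical.

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(pairs, min_bucket):
    """Return (gain, cut) for the best binary split of sorted (x, y) pairs.

    Returns None when no split leaves at least min_bucket rows on each side.
    """
    ys = [y for _, y in pairs]
    total = sse(ys)
    best = None
    for i in range(min_bucket, len(pairs) - min_bucket + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot cut between tied x values
        gain = total - sse(ys[:i]) - sse(ys[i:])
        if best is None or gain > best[0]:
            best = (gain, (pairs[i - 1][0] + pairs[i][0]) / 2.0)
    return best


def tree_cuts(x, y, max_depth=2, min_bucket=2):
    """Recursively split on x and return the sorted list of cut points.

    Assumption: rows with a missing predictor (None) are left out of the
    search and handled later as a separate category.
    """
    pairs = sorted((xi, yi) for xi, yi in zip(x, y) if xi is not None)

    def grow(part, depth):
        if depth == 0:
            return []
        found = best_split(part, min_bucket)
        if found is None or found[0] <= 0:
            return []
        cut = found[1]
        left = [p for p in part if p[0] < cut]
        right = [p for p in part if p[0] >= cut]
        return grow(left, depth - 1) + [cut] + grow(right, depth - 1)

    return grow(pairs, max_depth)


def categorize(value, cuts):
    """Map a value to a bin label; missing values get their own level."""
    if value is None:
        return "missing"
    return "bin%d" % sum(value >= c for c in cuts)


# Toy data: one predictor with missing values, one continuous response.
x = [1, 2, 3, None, 10, 11, 12, None]
y = [5.0, 5.5, 4.5, 9.0, 20.0, 21.0, 19.0, 8.0]

cuts = tree_cuts(x, y)
labels = [categorize(v, cuts) for v in x]
print(cuts)
print(labels)
```

The exhaustive search above is quadratic in the number of rows and is only
meant to show the idea; on 1,750,000 rows one would want the tree routines
themselves to do this work. The resulting labels are the sort of categorized
variable that would then go into the glm.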
Any suggestions, preferably within the realm of S+, would be most welcome.
Thanks to all,
Gérald Jean
Senior Statistical Advisor, Actuarial
telephone: (418) 835-4900 ext. 7639
fax: (418) 835-6657
e-mail: gerald.jean@dgag.ca
"In God we trust, all others must bring data" W. Edwards Deming