The increasing error is indeed a bad sign. Usually the cross-validated
error (xerror) decreases up to a certain number of splits (= model
complexity) and then starts to increase again. One pruning strategy is
to take the model with the lowest xerror, add one standard error (its
xstd) to that value, and then choose the simplest model whose xerror is
smaller than this sum (the 1 standard error rule, Breiman et al. 1984).
In your case, however, every additional split only increases the
xerror: the minimum is already at the unsplit root (nsplit = 0), so the
rule would keep no splits at all. This indicates that none of the
variables in your set can robustly explain the variability in your
dependent variable by the method of regression trees.
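
To make the rule concrete, here is a minimal sketch of the arithmetic in
plain Python (not rpart itself), applied to the numbers from the printcp
output quoted below. The names cptable and one_se_rule are my own, chosen
only for illustration, and the strict "smaller than" comparison follows
the wording above; treat it as a sketch, not as what rpart does
internally.

```python
# Rows copied from the quoted printcp output:
# (CP, nsplit, rel_error, xerror, xstd)
cptable = [
    (0.0064231, 0, 1.00000, 1.0006, 0.40944),
    (0.0035024, 2, 0.98715, 1.0156, 0.41187),
    (0.0027420, 3, 0.98365, 1.0193, 0.41225),
    (0.0017144, 6, 0.97543, 1.0251, 0.41228),
    (0.0012832, 12, 0.96500, 1.0331, 0.41248),
    (0.0010689, 13, 0.96371, 1.0342, 0.41245),
    (0.0010000, 14, 0.96264, 1.0354, 0.41245),
]


def one_se_rule(table):
    """Apply the 1 standard error rule to a cptable-like list of rows.

    Returns the chosen row and the threshold (lowest xerror + its xstd).
    """
    # Row with the lowest cross-validated error.
    best = min(table, key=lambda row: row[3])
    threshold = best[3] + best[4]
    # All models whose xerror is smaller than the threshold ...
    candidates = [row for row in table if row[3] < threshold]
    # ... and among those, the simplest one (fewest splits).
    chosen = min(candidates, key=lambda row: row[1])
    return chosen, threshold


chosen, threshold = one_se_rule(cptable)
print("threshold (min xerror + xstd):", round(threshold, 5))
print("chosen nsplit:", chosen[1], "CP:", chosen[0])
```

With an xstd of about 0.41 on an xerror of about 1.0, every row falls
under the threshold, so the choice is driven entirely by simplicity.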
HTH
Volker
Stephanie A Mather wrote:
Hi all -
I am doing tree regression using S-Plus 7.0 and rpart. I came up with a model
and am trying to figure out what the rel error, xerror, and xstd rows stand for
when using the printcp command. Here is my output when using printcp:
printcp(trpmilav.rp)
Regression tree:
rpart(formula = trpmilav ~ hhsize + kids18 + adults18 + hhtype +
numveh + numwrker + numdrver + income + poppersq + urban +
cbdg25 + cbdg100, data = HHTable.11.01Final, control =
rpart.control(minbucket = 30, cp = 0.001, xval = 10))
Variables actually used in tree construction:
[1] cbdg100 cbdg25 income numwrker poppersq
Root node error: 5.8745e6/4072 = 1442.6
n=4072 (271 observations deleted due to missing values)
         CP nsplit rel error xerror    xstd
1 0.0064231      0   1.00000 1.0006 0.40944
2 0.0035024      2   0.98715 1.0156 0.41187
3 0.0027420      3   0.98365 1.0193 0.41225
4 0.0017144      6   0.97543 1.0251 0.41228
5 0.0012832     12   0.96500 1.0331 0.41248
6 0.0010689     13   0.96371 1.0342 0.41245
7 0.0010000     14   0.96264 1.0354 0.41245
In all the examples in textbooks I've seen, the xerror column decreases as CP
increases - why does mine go up? And what's the best CP value to prune at?
Any advice would be greatly appreciated - thanks in advance!
-Stephanie
Stephanie Mather
Graduate Research Assistant
University of Connecticut
Dept of Civil & Enviro. Engineering
261 Glenbrook Rd, Unit 2037
Storrs, CT 06269-2037