A better subject for this would have been "Using
by() with a custom function"
With help from Kirsten Smith at TIBCO support,
this is a revised recap of the problem and solution.
# data constructed to have breakpoint in a piecewise
linear regression = 5
a <- list( group=as.character(
rep("A",10) ),
xcol=as.numeric( c(1,2,3,4,5,6,7,8,9,10) ),
ycol=as.numeric( c(9,9,9,9,9,8,7,6,5,4) )
)
# data constructed to have breakpoint = 4
b <- list( group=as.character(
rep("B",10) ),
xcol=as.numeric( c(1,2,3,4,5,6,7,8,9,10) ),
ycol=as.numeric( c(7,7,7,7,6,5,4,3,2,1) )
)
# Combine a and b in a single data frame
ab <- rbind(data.frame(a),data.frame(b))
# Plot the data
printer(height=35, width=80)
plot(ab$xcol,ab$ycol, xlab='X', ylab='Y',
pch=as.vector(ab$group) )
title("Two datasets A and B with
breakpoints")
show.printer()
dev.off()
# Bill Venables' function to use with optimize() to
find the breakpoint threshold, th
# The round() function is not required - I just use
it to get breakpoints of exactly 5 and 4
optFun <- function( xvar, yvar )
{ round( optimize( f=function( th, xvar, yvar )
{ X <- cbind(xvar, pmax(0, xvar - th))
sum(lsfit(X, yvar)$resid^2)
}
, interval=range(xvar)
, xvar=xvar
, yvar=yvar
)$minimum
, 1
)
}
# This is where I try to invoke (without success)
optFun() for each group in ab
by( x = ab
, INDICES = ab$group
, FUN = optFun
, xvar = x$xcol
, yvar = x$ycol
)
# This is the version from TIBCO support that works.
# When using by() with a custom function, the
calling syntax is a little different.
# The variable in the FUN= argument is the dataset
specified in the x= argument, and then you can call the actual arguments for
optFun().
by( x = ab
, INDICES = ab$group
, FUN = function(x)
optFun(x$xcol, x$ycol)
)
Douglas Hersh
Environmental Quality Department
Massachusetts Water Resource
Authority
100
First Ave. Boston MA 02129
phone: (617) 788-4945
fax: (617) 788-4888
email: douglas.hersh@mwra.state.ma.us
http://www.mwra.state.ma.us/harbor/html/bhrecov.htm
From: Hersh, Douglas
Sent: Friday, May 22, 2009 4:52 PM
To: s-news@lists.biostat.wustl.edu
Subject: Minimizing a function for
grouped data
I am trying to apply a technique of Bill Venables
(http://www.biostat.wustl.edu/archives/html/s-news/2000-04/msg00206.html)
for finding break points in a broken line
regression.
I need to do this for each group in a data set with
multiple groups. I have
tried using the by() function with no success.
Would someone please have a look at my example and
see if you can create a
single statement that would produce a data frame
such the following?
group threshold
----- ---------
A 5
B 4
For example:
> # data constructed to have breakpoint = 5
> a <- list( group=as.character(
rep("A",10) ),
+
xcol=as.numeric( c(1,2,3,4,5,6,7,8,9,10) ),
+
ycol=as.numeric( c(9,9,9,9,9,8,7,6,5,4) )
+
)
>
> # data constructed to have breakpoint = 4
> b <- list( group=as.character(
rep("B",10) ),
+
xcol=as.numeric( c(1,2,3,4,5,6,7,8,9,10) ),
+
ycol=as.numeric( c(7,7,7,7,6,5,4,3,2,1) )
+
)
>
> # Combine a and b in a single data frame
> ab <- rbind(data.frame(a),data.frame(b))
>
> # Show the data in an ASCII graph
> printer(height=35, width=80)
> plot(ab$xcol,ab$ycol, xlab='X', ylab='Y',
pch=as.vector(ab$group) )
> title("Two datasets A and B with
breakpoints")
> show.printer()
Two datasets A and B with breakpoints
......................................................................
.
A A
A A A
.
.
8..
A
.
.
.
B B
B
B
A
.
.
6..
B
A
.
Y
.
B
A
.
.
4..
B
A
.
.
.
B
.
.
2..
B
.
.
.
B
......................................................................
2
4
6
8 10
X
> dev.off()
graphsheet
2
>
> # Bill Venables' function to use with
optimize() to find the breakpoint threshold, th
> fn <- function( th, xvar, yvar )
+ {
+ X
<- cbind(xvar, pmax(0, xvar - th))
+
sum(lsfit(X, yvar)$resid^2)
+ }
>
> # breakpoint for a should be 5
> round( optimize( f=fn, interval=range(a$xcol),
xvar=a$xcol, yvar=a$ycol )$minimum, 1 )
[1] 5
> # breakpoint for b should be 4
> round( optimize( f=fn, interval=range(b$xcol),
xvar=b$xcol, yvar=b$ycol )$minimum, 1 )
[1] 4
>
> # This is where I try to invoke (without
success) optimize(fn ... for each ab$group
> by( x = ab
+ , INDICES = ab$group
+ , FUN = fn
# I forgot to include the optimize() function here. See improved version in
follow-up above.
+ , xvar = ab$xcol
+ , yvar = ab$ycol
+ )
Warning messages:
Number of rows of result is not a multiple of
vector length (arg 1) in: cbind(..1, ..2, deparse.level = deparse.level)
Problem in specials.x | specials.y: length of longer
operand (30) should be a multiple of length of shorter (20)
Use traceback() to see the call stack
>
Thanks you for any help you can offer.
- Doug
Douglas Hersh
Environmental Quality Department
Massachusetts
Water Resource Authority
100
First Ave. Boston
MA 02129
phone: (617) 788-4945
fax: (617) 788-4888
email: douglas.hersh@mwra.state.ma.us
http://www.mwra.state.ma.us/harbor/html/bhrecov.htm