[S] Summary of Responses to my Data Frame Subsetting Question

To: "'s-news@wubios.wustl.edu'" <s-news@wubios.wustl.edu>
Subject: [S] Summary of Responses to my Data Frame Subsetting Question
From: "Humbolt, Allen" <HumboltA@kochind.com>
Date: Fri, 29 May 1998 15:11:11 -0500
Sender: owner-s-news@wubios.wustl.edu
My original question is restated below with one change: the first level of
variable B is now 17.50 rather than 17.43, since ties are indeed possible in
my goal of finding the records where A is closest to B within each Class.  I
thank Patrick Connolly for pointing out that it matters how I wish to handle
ties.  In my case, it isn't critical which record I get as long as A is
reasonably close to B.  My actual application is identifying which options
data are "at the money".  The "strike" of the option is what I call A below,
and the current price level is what I call B.


START ORIGINAL QUESTION
I have a data frame called "mydata" with data like the following.
Class  A   B     Index
  1   16  17.50    1
  1   17  17.50    2
  1   18  17.50    3
  1   19  17.50    4
  2   17  18.02    5
  2   18  18.02    6
  2   19  18.02    7
My goal is to select the subset of this data frame where A is closest to B
within each class.  My desired result for the above data would be the
following.
Class  A   B     Index
  1   17  17.50    2
  2   18  18.02    6
END ORIGINAL QUESTION
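
As a point of reference, the goal stated above can be sketched in Python
(not S) as a single pass that remembers the best row seen so far for each
Class.  The tuple layout and the name closest_per_class are assumptions made
for illustration only; ties are resolved by keeping the earlier row, which is
consistent with the desired result shown above.

```python
# Rows of the example data frame as (Class, A, B, Index) tuples.
mydata = [
    (1, 16, 17.50, 1),
    (1, 17, 17.50, 2),
    (1, 18, 17.50, 3),
    (1, 19, 17.50, 4),
    (2, 17, 18.02, 5),
    (2, 18, 18.02, 6),
    (2, 19, 18.02, 7),
]


def closest_per_class(rows):
    """Return, for each Class, the first row that minimises |A - B|."""
    best = {}  # Class -> row with the smallest |A - B| seen so far
    for row in rows:
        cls, a, b, _ = row
        # Strict "<" keeps the earlier row when two rows tie.
        if cls not in best or abs(a - b) < abs(best[cls][1] - best[cls][2]):
            best[cls] = row
    # Report one row per Class, in increasing Class order.
    return [best[cls] for cls in sorted(best)]


result = closest_per_class(mydata)
for row in result:
    print(row)
```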


I received several solutions that worked on my example.  I wish to share
the two that were simple and fast not only on my small example but also on
a real and rather large data frame.

From Bill Venables
>  newdat <- mydata[order(mydata$Class, abs(mydata$A - mydata$B)),]
>  newdat <- newdat[c(1, diff(as.numeric(newdat$Class))) > 0.1,]
Note:  Bill had "==1" where I use ">0.1".  The ">0.1" test suits my actual
data better: Class is a numeric representation of dates, so any change
greater than zero (1 for a daily change, 3 across a weekend) marks a new
date and therefore a new Class.
On a real data set this reduced 18,185 rows to 1967 rows in 1.6 seconds on
my PC.
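
The sort-then-keep-first idea above can be sketched in Python (not S) as
follows.  This is only an illustration under assumed names and an assumed
tuple layout, not a translation of the S code.  The rows are deliberately
given out of Class order to show that the change in Class is detected on the
sorted rows.

```python
# Example rows as (Class, A, B, Index) tuples, deliberately not sorted by Class.
mydata = [
    (2, 17, 18.02, 5),
    (1, 16, 17.50, 1),
    (2, 18, 18.02, 6),
    (1, 17, 17.50, 2),
    (2, 19, 18.02, 7),
    (1, 18, 17.50, 3),
    (1, 19, 17.50, 4),
]

# Sort by Class, then by distance |A - B| within each Class.
# Python's sort is stable, so tied rows keep their input order.
ordered = sorted(mydata, key=lambda r: (r[0], abs(r[1] - r[2])))

# A row starts a new Class if it is the first row overall or its Class
# differs from the Class of the row just before it in the sorted list.
starts_class = [
    i == 0 or ordered[i][0] != ordered[i - 1][0]
    for i in range(len(ordered))
]

# Keeping only those rows leaves the closest row of each Class.
newdat = [row for row, keep in zip(ordered, starts_class) if keep]
print(newdat)
```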

From Nicole DePriest Demers
> ordered.dif <- order(abs(mydata$A-mydata$B))
> rank.dif <- order(ordered.dif)  # rank of each row by distance
> newdat <- mydata[ordered.dif[tapply(rank.dif, list(mydata$Class), min)],]
> newdat <- newdat[order(newdat$Class),]
On a real data set this reduced 18,185 rows to 1967 rows in 1.1 seconds on
my PC.
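
The selection can also be phrased as a per-Class minimum over distance
ranks: order all rows by |A - B| once, note where each row falls in that
ordering, and take the smallest such position within each Class.  A minimal
Python sketch (not S) of that idea follows; the names dist, rank and
best_rank and the tuple layout are assumptions for illustration.

```python
# Example rows as (Class, A, B, Index) tuples.
mydata = [
    (1, 16, 17.50, 1),
    (1, 17, 17.50, 2),
    (1, 18, 17.50, 3),
    (1, 19, 17.50, 4),
    (2, 17, 18.02, 5),
    (2, 18, 18.02, 6),
    (2, 19, 18.02, 7),
]

# Distance |A - B| for every row.
dist = [abs(a - b) for _, a, b, _ in mydata]

# ordered[k] is the position of the row with the k-th smallest distance.
ordered = sorted(range(len(mydata)), key=lambda i: dist[i])

# rank[i] is where row i falls in that ordering (the inverse permutation).
rank = [0] * len(mydata)
for k, i in enumerate(ordered):
    rank[i] = k

# Smallest rank found within each Class.
best_rank = {}
for i, row in enumerate(mydata):
    cls = row[0]
    if cls not in best_rank or rank[i] < best_rank[cls]:
        best_rank[cls] = rank[i]

# Map each winning rank back to its row, one row per Class in Class order.
newdat = [mydata[ordered[best_rank[cls]]] for cls in sorted(best_rank)]
print(newdat)
```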

I received several other solutions that worked nicely on my small example
data frame but were too slow on my real and much larger data, an issue I
did not raise when asking my question.  There were also a few solutions that
I didn't investigate because I didn't understand the solution or the output.
I thank Douglas Bates, Charles Berry, Charles Pollak,
buttrey@sun10or.or.nps.navy.mil, Don MacQueen, Patrick Connolly, Jens
Oehlschlaegel, Jan Schelling, james.holtman@cbis.com, Bill Venables, and
Nicole DePriest Demers for taking the time to respond to my question.

Allen Humbolt
Quantitative Analyst
Koch Industries, Inc.
HumboltA@kochind.com


