To: Gary Sabot <gary@sabot.com>
From: Tim Hesterberg x319 <timh@insightful.com>
Date: Wed, 5 Sep 2001 14:01:12 -0700
Subject: Re: [S] dimnames in Sparc Splus 6.0
Gary Sabot just posted a version of [.data.frame for which
X[,aSingleColumn] # where X is a data frame
adds the row names of the data frame as names to the vector
that is returned.
Unfortunately, that version of [.data.frame makes assumptions about
the type of data included in data frames that are often unjustified.
This causes it to mess up with some kinds of data, including:
* factors (e.g. try fuel.frame[,"Type"] with and without that version)
* objects with new-style classes. This makes it incompatible with
library("missing") (a library in S-PLUS 6.0 for handling
missing data using multiple imputations).
Tim was able to fix my function to deal with factors and
library(missing) correctly. However, he suspects that there may be
other objects for which it still may have a problem, such as the new
time-series objects in S+.
In any case, I'm posting the revised code, but won't post any further
revisions unless there seems to be some interest, so email me if you
are going to use this and want to see any additional fixes.
There are also some comments on the pros and cons of this after the code:
----------------
#first make sure that if you run this twice, you still get the real original
data.frame.original.fcn <- get("[.data.frame", where="splus")
#make sure that dataframe[,col.id] keeps its dimnames
"[.data.frame" <- function(x,..., drop=T)
{
    result <- data.frame.original.fcn(x,..., drop=F)
    if (drop && ncol(result)==1) {
        save.names <- dimnames(result)[[1]]
        #this approach works for factors too
        result <- result[[1]]
        names(result) <- save.names
        result
        #unfortunately still broken for objects with new style classes,
        #since it does not distinguish among methods that have or do
        #not have a getnames method.
        #library(missing) is an example: The multiple imputations on
        #an object get lost if subscripted with this function.
    } else {
        if (!missing(drop) && drop && nrow(result)==1) {
            #replicate documented behavior of [.data.frame: drop=T acts
            #differently than a missing drop arg for this case!
            as.list(result)
        } else {
            result
        }
    }
}
#additional tests
#
# > data.frame(a=letters[1:3], b=2:4)[,1]
# [1] a b c
# > dput(data.frame(a=letters[1:3], b=2:4)[,1])
# structure(.Data = c(1, 2, 3)
# , levels = c("a", "b", "c")
# , class = "factor"
# , names = c("1", "2", "3")
# )
# library("missing")
# #compare:
# data.frame.original.fcn(cholesterolImpExample,,3)
# cholesterolImpExample[,3]
----------------
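The three cases the wrapper above distinguishes can be checked against the stock method. What follows is only a sketch, written so that it should also run in R, on the assumption that R's [.data.frame follows the same documented rules as the S-PLUS one discussed here. The data frame d and its contents are made up for illustration.

```r
# Hypothetical toy data frame; names and values are illustrative only.
d <- data.frame(price = c(10.5, 10.7, 10.6),
                volume = c(100, 120, 90),
                row.names = c("2001-09-03", "2001-09-04", "2001-09-05"))

# Case 1: one column, drop left at its default.
# The stock method returns a bare vector without the row names.
col.default <- d[, "price"]

# Case 2: one row, drop not supplied.
# The result is still a data frame.
row.default <- d[1, ]

# Case 3: one row, drop=TRUE supplied explicitly.
# The result is a plain list, one element per column.
row.dropped <- d[1, , drop = TRUE]

print(names(col.default))
print(is.data.frame(row.default))
print(is.data.frame(row.dropped))
```

Case 1 is the one the replacement function changes; cases 2 and 3 are the ones it tries to leave alone.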
More generally, adding names to vectors is often undesirable; it may
* substantially increase the size of the resulting object,
My change only affects the case of taking a single column out of a
much larger object that has names. So you start out with an n x m
array, and I am pulling off an n x 1 column and making it have n names.
So although it is true that it doubles the size of the n x 1 column by
adding n names alongside the n values, this is not big compared to
the whole n x m dataframe, and seems a reasonable price for the
convenience of regularizing the behavior of dimnames on dataframes to
match matrices.
* slow down some S-PLUS computations, and
I find it confusing that dimnames are preserved through many
operations, but not through all, and that dimnames on matrices and
dataframes behave differently. dimnames are a high level, helpful
feature, and their presence in the language certainly has a cost. Why
object to paying it just in this one small context? If they make
performance unacceptable in an inner loop, I would prefer to remove
them manually on occasion if performance tuning leads me in that
direction, rather than have to be concerned about putting dimnames
back in place as I write code and find that they have gone missing
because of some seemingly innocuous subscripting. This
convenience/performance tradeoff is of course one on which people can
reasonably disagree.
* cause bugs for code that expects data without names; also
That is a concern, but I am having trouble imagining why code would
want to rely on the names not being present.
* there is no way to determine which variables in a data frame had
names originally. So names may be added to variables that should
not have them, or which are incorrect for the variable.
I think of dataframes as being like arrays, rather than the more
general and more correct variable/observation viewpoint, and I agree
this may be leading me into trouble. My code is financial, so the
rows are invariably dates, and the columns are various measurements,
like stock price, factor levels, etc. So in my very narrow use and
view of dataframes, the dates are always meaningful for any column in
the dataframe. Many Splus users in my industry seem to use dataframes
in that same fashion, and encounter the same problems (though some
solve them with a convention of always stripping dimnames and then
putting them back on in every single function, which is burdensome in
its own way.) Novice users may also have this view, since read.table
is a convenient way to read in a table of data, and it always returns
a dataframe even if the data is all numeric and a matrix might be the
best data structure for it.
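The matrix versus data frame difference and the size cost argued over above can be seen in a small sketch. It is written so that it should also run in R, and it assumes that object.size reports storage in a comparable way in both systems. The dates, column names and vector length are invented for illustration.

```r
# Hypothetical date-keyed data held two ways: as a matrix and as a data frame.
dates <- c("2001-09-03", "2001-09-04", "2001-09-05")
m <- matrix(c(10.5, 10.7, 10.6, 100, 120, 90), nrow = 3,
            dimnames = list(dates, c("price", "volume")))
d <- data.frame(m)

# Matrix subscripting keeps the row dimnames as names on the result...
from.matrix <- m[, "price"]
# ...while the stock data frame method returns a bare vector.
from.frame <- d[, "price"]

print(names(from.matrix))
print(names(from.frame))

# Cost of names: compare a long numeric vector with and without them.
n <- 1000
bare <- as.numeric(1:n)
named <- bare
names(named) <- as.character(1:n)
size.ratio <- as.numeric(object.size(named)) / as.numeric(object.size(bare))
print(size.ratio)
```

The ratio depends on how long the name strings are, so it is printed rather than assumed. It applies only to the extracted column, not to the data frame it came from.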
It is safer not to mask the default version of [.data.frame,
but rather to use a function like the following to extract a single
variable from a data frame and add names to it:
extractVariableAddNames <- function(df, j){
    # df is a data frame, j a column to subscript.
    # Return df[,j], but with df's row names added.
    x <- df[,j]
    if(is.data.frame(x))
        stop("Subscripted more than one column")
    names(x) <- row.names(df)
    x
}
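As a rough usage sketch, the function can be tried on a numeric column and a factor column. The function is repeated so the snippet runs on its own, and it should also run in R. The data frame prices and its contents are hypothetical.

```r
extractVariableAddNames <- function(df, j){
    # df is a data frame, j a column to subscript.
    # Return df[,j], but with df's row names added.
    x <- df[,j]
    if(is.data.frame(x))
        stop("Subscripted more than one column")
    names(x) <- row.names(df)
    x
}

# Hypothetical toy data: rows keyed by date strings.
prices <- data.frame(price = c(10.5, 10.7, 10.6),
                     sector = factor(c("tech", "tech", "bank")),
                     row.names = c("2001-09-03", "2001-09-04", "2001-09-05"))

# Numeric column: values come back with the dates as names.
p <- extractVariableAddNames(prices, "price")
# Factor column: stays a factor and gains the same names.
s <- extractVariableAddNames(prices, "sector")

print(p)
print(names(s))
```

Because [.data.frame itself is left untouched, code that expects unnamed vectors from ordinary subscripting is not affected.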
========================================================
| Tim Hesterberg Research Scientist |
| timh@insightful.com Insightful Corp. |
| (206)283-8802x319 1700 Westlake Ave. N, Suite 500 |
| (206)283-6310 (fax) Seattle, WA 98109-3044, U.S.A. |
========================================================
Formerly known as MathSoft, Insightful Corporation provides analytical
solutions leveraging S-PLUS, StatServer, and Consulting services
I agree that that is a safe approach that could be used for writing
new code if one remembered to use it rather than using conventional
subscripting. But my immediate goal is to port my old S-engine code
that unfortunately relies throughout on the dimnames being passed
through many of its subscripting calls.
--gary