First, thanks to all of you who replied to my question. I am summarizing the
responses here, as several people were interested in the answer. The
original question is included below for reference.
Second, I apologize for the confusion my example created. I gave price and
return as examples of measurements. I did not intend to indicate that I
only had price and/or return as measurements. I have about 50 numeric
variables derived from the CRSP and Compustat databases for each asset.
As for the issue of arrays vs. data frames, the answer really depends on how
regular the data are, i.e., whether we have roughly equal amounts of data for
each asset. Let's assume for the moment that for each asset we have roughly
the same amount of data (e.g., in a 5-year period most of the assets under
investigation have about 5 years of data).
In this case, the array will be more efficient (in general), as Patrick
Burns and Scott Chasalow pointed out. This is mostly due to the way arrays
are organized in memory, if I understand things correctly. Scott Chasalow
also pointed out the advantage of being able to subscript more directly: if
the dimnames of the array contain the dates, asset names, and measurement
names, respectively, it is generally faster to do things like
A[ ,"XYZ", ]
than
B[is.element(B[, "Asset"], "XYZ"), ]
not only because of the memory organization, but also because the vector we
match against is shorter in the array case (number of assets vs. number of
assets * number of dates, for example).
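As a small sketch of the two layouts and the two subsetting calls (written in
R syntax, which I would expect to carry over to S-PLUS with little change; the
dates, asset names, measurement names, and values below are made up purely for
illustration):

```r
# Hypothetical toy data: 3 dates x 2 assets x 2 measurements.
dates    <- c("01/99", "02/99", "03/99")
assets   <- c("ABC", "XYZ")
measures <- c("Price", "Return")

# Three-dimensional array with dimnames (date, asset, measurement).
A <- array(1:12, dim = c(3, 2, 2),
           dimnames = list(dates, assets, measures))

# Stacked data frame with one row per (date, asset) pair,
# built from the same values so the two layouts can be compared.
B <- data.frame(Date   = rep(dates, times = length(assets)),
                Asset  = rep(assets, each = length(dates)),
                Price  = as.vector(A[, , "Price"]),
                Return = as.vector(A[, , "Return"]))

# Array: "XYZ" is matched against the 2 asset names only.
a.xyz <- A[, "XYZ", ]

# Data frame: "XYZ" is matched against all 6 entries of the Asset column.
b.xyz <- B[is.element(B[, "Asset"], "XYZ"), ]
```

Both calls pull out the same three dates for XYZ; the difference is only in
how much has to be searched to find them.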
Finally, let's discuss the case of data with unequal histories. With the
array, we will have to do a lot of padding with NA's because of the
structure of an array. With the data frame, we don't need all those NA's.
So, if the amount of padding needed in the array is fairly large, the data
frame will be faster in general simply because it is the smaller object.
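To give a rough feel for how quickly the padding adds up, here is a
back-of-the-envelope count in R syntax. The numbers (60 months, 100 assets,
50 measurements, and only 10 assets with a full history) are assumptions
chosen for illustration, not figures from my data:

```r
# Hypothetical setup: 60 months, 100 assets, 50 measurements.
# Only 10 assets have all 60 months; the other 90 have 6 months each.
n.dates  <- 60
n.assets <- 100
n.meas   <- 50
n.obs    <- c(rep(60, 10), rep(6, 90))   # months of data per asset

# The array must hold every (date, asset, measurement) cell, NA or not.
cells.array <- n.dates * n.assets * n.meas

# The stacked data frame holds only the rows that actually exist
# (ignoring the extra Date and Asset columns).
cells.frame <- sum(n.obs) * n.meas

cells.array   # 300000
cells.frame   # 57000
```

In this made-up case more than 80% of the array would be NA padding, so the
data frame is the much smaller object.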
I hope my summary is accurate and helpful to others. If I have misspoken,
please correct me.
Once again, thanks for all your help.
Chris Green
Graduate Student, Statistics
University of Washington
%-----Original Message-----
%From: s-news-owner@lists.biostat.wustl.edu
%[mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of
%Christopher Green
%Sent: Sunday, October 17, 2004 6:24 PM
%To: s-news@lists.biostat.wustl.edu
%Subject: [S] Efficiency of arrays vs. data frames for large data sets
%
%Hello all,
%
%I am working with large financial data sets (several GB's) for
%a research project. The data have three natural indices: time
%(e.g., month/year), asset/company (e.g., IBM), and measurement
%(e.g., price or return).
%
%Let's assume for simplicity that the data are all numeric.
%Since the data are naturally three-dimensional in the above
%sense, we might think about storing the data in a
%three-dimensional array in S-PLUS. We might also think about
%"stacking" the data, say by asset, into a two-dimensional data
%frame, with an added (factor) variable to indicate the asset.
%(Example:
%
%Date Price Asset
%01/99 2.45 abc
%02/99 2.56 abc
%01/99 10.45 xyz
%02/99 10.47 xyz
%
%)
%
%My question: for large amounts of data, which of these setups
%is generally more efficient? The primary concern here is the
%speed of subsetting the data by measurement value, date, and/or asset.
%
%My intuition is that the array should be faster in general, as
%it is atomic.
%But I have built both in S-PLUS, and they seem to be pretty
%comparable in terms of memory use and speed (as measured with
%object.size and proc.time).
%
%I haven't been able to find a clear explanation of this aspect
%anywhere (on-line docs, books, searching the web)...just the
%general principle that the atomic structures should be faster.
%From what I've heard the data frame is implemented as an array
%of pointers...presumably the atomics are implemented as
%contiguous blocks of memory (say, something returned by a
%malloc or something of that ilk)?
%
%Are there any strong reasons to prefer one representation over
%the other?
%
%(If it matters, I am using S-PLUS 6.2 Build 6713 on a Windows
%2000 SP4 machine, Pentium IV 1.5 Ghz processor with 1GB RAM.)
%
%Thanks,
%
%Chris Green
%Graduate Student, Statistics
%University of Washington