Hi,
I have been following the discussion about dataset sizes and import
methods with interest, as I have recently been grappling with these issues
in my work. I wanted to find the upper limit on dataset size for my
computing platform, and to separate that question from the import method
and from whether the dataset is read in a single block or in many blocks.
I have always thought that the best way to import data is to pre-dimension
space in the data structure and then fill it with data, as Gérald Jean
describes. This eliminates any wastage from "growing" data structures.
I wrote a simple function that lets me build a data frame from
arguments such as the column names, their modes and the number of rows. I
have attached the code (code.txt) for the function below. I reasoned that if
I can't dimension a data frame of a certain size then I have no hope of
creating it with one or more import commands.
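To make the idea concrete, here is a rough sketch of pre-dimensioning.
It is written in Python rather than S-PLUS and is not the attached
code.txt; the name build_frame, the TYPECODES table and the dict-of-columns
layout are made up for illustration only.

```python
from array import array

# Hypothetical mapping from a column "mode" to an array type code.
TYPECODES = {"numeric": "d", "integer": "l"}


def build_frame(col_names, modes, n_rows):
    """Return a dict of fixed-length, zero-filled columns.

    The dict is only a stand-in for a data frame.
    """
    if len(col_names) != len(modes):
        raise ValueError("need exactly one mode per column")
    frame = {}
    for name, mode in zip(col_names, modes):
        code = TYPECODES[mode]
        n_bytes = array(code).itemsize * n_rows
        # Allocate the whole column in one step instead of appending
        # row by row; bytes(n_bytes) is a block of zero bytes.
        frame[name] = array(code, bytes(n_bytes))
    return frame


frame = build_frame(["a", "b", "c", "d", "e"], ["numeric"] * 5, 1000)
frame["a"][0] = 3.14  # fill in place; the column never has to grow
print(len(frame), len(frame["a"]))
```

If the allocation itself fails for a given number of rows, no import
strategy that ends up with the same object in memory can succeed either.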
I used the function to build data frames with five numeric columns and
varying numbers of rows, measuring the object size, the amount of
memory invoked (using mem.tally.report) and the time each operation
took. I have attached the results in the file timings.txt. The results
surprised me. First let me say that the code was run under S+6.2 on XP with
a 3GHz Pentium 4 processor and 2GB of RAM.
The maximum dataset I could create was 388MB in size, which was far lower
than I expected but tallies with other users' experiences (Roger Bos and Tony
Plate). On a PC with 1.5GB the maximum I created was 320MB, so for practical
purposes the extra 500MB of memory only got me a dataset about 70MB bigger.
The ratio of the amount of memory invoked to the data set size was
approximately 2.5 for small and large datasets alike.
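As a back-of-envelope check on those figures, the arithmetic can be
sketched as follows (in Python, not S-PLUS). It assumes each numeric value
takes 8 bytes, that "MB" means 2**20 bytes, and that row names and other
per-object overhead are negligible; none of these assumptions come from
the measurements themselves.

```python
BYTES_PER_DOUBLE = 8   # assumption: numerics stored as 8-byte doubles
N_COLS = 5             # five numeric columns, as in the test
MB = 2 ** 20           # assumption: "MB" means 2**20 bytes
OVERHEAD_RATIO = 2.5   # observed ratio of memory invoked to object size


def rows_for(size_mb):
    """Approximate number of rows in a data frame of the given size."""
    return size_mb * MB // (N_COLS * BYTES_PER_DOUBLE)


def memory_invoked_mb(size_mb):
    """Memory invoked while building an object of the given size."""
    return OVERHEAD_RATIO * size_mb


for size_mb in (320, 388):
    print(size_mb, rows_for(size_mb), memory_invoked_mb(size_mb))
```

Under these assumptions the 388MB limit corresponds to roughly 10 million
rows and about 970MB of memory invoked, which is still well short of the
2GB installed.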
To compare the timings I converted them into seconds per million rows.
The largest datasets took almost 20 minutes to generate. One would expect
datasets that are created entirely in RAM to have execution times roughly
linear in the dataset size (that is, a constant time per million rows).
For all but the tiniest datasets this is not what is observed. The time to
create datasets bigger than 100MB grows far faster than linearly.
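The normalisation is simple enough to sketch (again in Python, not
S-PLUS). The run values below are placeholders invented for the example
and are NOT the figures in timings.txt; the 10% tolerance is likewise an
arbitrary choice.

```python
def secs_per_million_rows(elapsed_secs, n_rows):
    """Normalise an elapsed time to seconds per million rows."""
    return elapsed_secs / (n_rows / 1_000_000)


def looks_linear(rates, tolerance=0.10):
    """True if all normalised rates agree to within the tolerance."""
    return max(rates) <= (1 + tolerance) * min(rates)


# Placeholder (rows, elapsed seconds) pairs, NOT the measured values.
placeholder_runs = [(1_000_000, 2.0), (2_000_000, 4.0), (4_000_000, 20.0)]

rates = [secs_per_million_rows(t, n) for n, t in placeholder_runs]
is_linear = looks_linear(rates)
print(rates, is_linear)
```

If creation time were linear in size, the normalised rates would stay
constant; a rate that climbs with the number of rows is the signature of
the slowdown described above.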
All in all, I'm not impressed: 2GB of RAM to let me build a 390MB dataset
is not a very good return. Under Windows it would appear that there are
diminishing returns from adding more memory. I suspect part of the problem
is operating-system dependent, as Gérald Jean has reported working with
2GB datasets under UNIX. I would be very interested to know (from
Insightful) why S+ is not able to take advantage of large amounts of RAM. I
would also be interested to hear from S+ users who have crossed
over from Windows to UNIX whether they found big differences in
performance. I am at a point where I spend more time fooling around with
memory issues of one sort or another than working with the data.
Regards,
Glenn
code.txt
Description: Text document
Timings.txt
Description: Text document