Hello,
Last week I posted a request regarding the reading of large data sets. I
had one or two replies suggesting preprocessing the data with Perl or a
similar language. Unfortunately, I don't know Perl and can't afford the
time to learn it right now.
Insightful's tech. support also replied to my request. In the first couple
of exchanges concerning the problem they basically suggested two things:
1) Try a format other than SAS transport files; ASCII format was
suggested.
2) Read the data by blocks using the openData and readNextDataRows
functions.
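
As a rough illustration of suggestion 2, here is the general "read by
blocks" pattern. This is a Python sketch, not S-PLUS code: read_blocks and
the sample data are hypothetical, and it does not reproduce the actual
arguments of openData or readNextDataRows.

```python
import csv
import io
from itertools import islice


def read_blocks(handle, block_size):
    """Yield (header, rows) pairs holding at most block_size rows each."""
    reader = csv.reader(handle)
    header = next(reader, None)  # first line holds the column names
    if header is None:           # empty input: nothing to yield
        return
    while True:
        block = list(islice(reader, block_size))
        if not block:
            break
        yield header, block


# Tiny in-memory stand-in for a large file (hypothetical data).
sample = io.StringIO("id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n")
block_lengths = [len(rows) for _, rows in read_blocks(sample, block_size=2)]
print(block_lengths)
```

Only one block is held in memory at a time, which is the point of the
approach when the whole data set does not fit comfortably.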
Since I had quite a bit of preprocessing to do before being ready for more
interesting things, such as modeling, I decided to write my large data set
back to an external file, by blocks, in ASCII format. This went smoothly and
took about 30 minutes. Then I wrote the code to process the data and tried
the "by blocks" processing, using the newly created ASCII file as input and
writing back to an ASCII file. Surprise! It took forever just to open the
data set. So back to Insightful's tech. support, and what do I find out?
Quoting tech. support's reply:
----------------------------------------------------------------------------------
I discussed this problem with one of our developers and he mentioned:
-----------------------------------------------------------------------------------------
I believe the slow call to openData is the same as a problem we have
reported in our bug database:
Problem Description:
openData for large text files is over 600 times slower in S-PLUS 6.x
compared to 5.1. It appears that openData now tries to read through the
entire file.
    openData(file= "/dept/devel/data/text/kdd98.1e5.txt")
takes 0.44 secs in 5.1 and 259.97 secs in 6.1.
I think that we are reading the entire file to get the maximum width for
any character columns.
I don't know of any way around this except possibly storing the data in some
format other than ASCII (I am not sure if fixed format has the same
problem; I would think not, but I have not tested it). A call to openData
for a SAS or SPSS file is not slow, as far as we have found.
-----------------------------------------------------------------------------------------
This is something our development team is currently investigating to try to
fix for the S-PLUS 6.2 release.
End quote.
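
If the developer's guess is right, the cost is easy to picture: the widest
value in a character column can sit on any line, so finding it means
touching every row. A minimal Python sketch of that idea follows. It is not
S-PLUS code, and max_column_widths and the sample data are hypothetical.

```python
import csv
import io


def max_column_widths(handle):
    """Return the widest field seen in each column; needs a full pass."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return {}
    widths = [0] * len(header)
    for row in reader:  # every row must be visited
        for i, field in enumerate(row[:len(header)]):
            widths[i] = max(widths[i], len(field))
    return dict(zip(header, widths))


# Hypothetical sample: the widest name only appears on the last line.
sample = io.StringIO("name,city\nAnn,Quebec\nBartholomew,Levis\n")
widths = max_column_widths(sample)
print(widths)
```

Formats that store column widths in their own header would not need such a
scan, which would fit the remark that SAS and SPSS files open quickly.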
Hence using an ASCII format didn't turn out to be such a great idea! I
ended up reading and preprocessing the initial SAS transport file by
blocks and writing it back in SAS format. This step went pretty smoothly:
it took roughly one hour to read, process fairly heavily, and write back
over 4M records by 58 variables. After that I imported the new SAS data
set into S+ by blocks. This was again pretty heavy on resources: it took
7 hours versus 21.5 hours for the all-at-once approach. That is roughly a
threefold gain, but still a pretty long time.
Tech. support supplied an example function as guidance on how to process
the data by blocks (reading, writing, whatever); this was pretty helpful,
thank you. I didn't find any reference to an optimal block size for
reading or writing data by blocks, from or to an external file, in the
help files of the related functions; this would also be useful.
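
Lacking documented guidance, the block size can be chosen by timing a few
candidates on a sample of the data. The Python sketch below only shows the
shape of such an experiment. It is not S-PLUS code: time_block_size, the
generated data and the candidate sizes are all hypothetical.

```python
import csv
import io
import time
from itertools import islice


def time_block_size(text, block_size):
    """Read text in blocks of block_size rows; return (rows, seconds)."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header line
    rows = 0
    start = time.perf_counter()
    while True:
        block = list(islice(reader, block_size))
        if not block:
            break
        rows += len(block)  # a real run would process the block here
    return rows, time.perf_counter() - start


# Hypothetical sample: 10,000 generated rows standing in for real data.
text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10000))
results = {size: time_block_size(text, size) for size in (100, 1000, 5000)}
for size, (rows, seconds) in results.items():
    print(f"block size {size}: {rows} rows in {seconds:.4f} s")
```

The timings depend on the machine, the file format and the work done per
block, so the same comparison would have to be repeated on the real data.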
Good weekend to everyone,
Gérald Jean
Consulting Analyst (Statistics), Actuarial
telephone: (418) 835-4900 ext. 7639
fax: (418) 835-6657
e-mail: gerald.jean@spgdag.ca
"In God we trust all others must bring data" W. Edwards Deming