The read.table() function uses scan() to actually read the data (after
getting the field headings, if any, from the first line). Note the following
comment from the help file for scan():

    As it reads more and more records, scan allocates more space to
    accommodate the growing vectors. If you supply a what argument that is
    identical in size to the result you expect, S-PLUS uses that space and
    does not have to perform memory allocations. This may produce
    significant memory savings when dealing with large files of data.

read.table() doesn't know how big the dataset is until it has read it.
If you use scan() directly you can use the what= argument to allocate the
correct amount of space from the beginning and thereby save some memory
and time. It would not be hard to modify read.table() to include an
argument for the number of cases that will be read, and then create a
what= argument for scan() that makes use of this information. It could
also speed things up to know in advance which columns are numeric and to
specify that in the what= argument to scan().
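
As a rough illustration of the idea, here is a minimal sketch of such a wrapper. It is not a modified read.table(); the helper name read.sized.table and its arguments n and types are made up for this example, and the sample file is invented. The sketch was written against R's scan(), so arguments such as nlines and quiet may need adjusting in S-PLUS. Whether the pre-sized vectors actually save memory depends on the implementation: the help text quoted above describes the S-PLUS behaviour, and other implementations may use the template only for the column types.

```r
# Hypothetical helper (not part of S-PLUS or R): read a delimited file that
# has a header line, given the number of cases and the column types.
read.sized.table <- function(file, n, types, sep = ",") {
  # Get the field headings from the first line, as read.table() does.
  headings <- scan(file, what = "", sep = sep, nlines = 1, quiet = TRUE)

  # Build the what= template: one vector of length n per column,
  # e.g. numeric(n) for "numeric" and character(n) for "character".
  what <- lapply(types, function(type) vector(type, n))
  names(what) <- headings

  # Read the data lines (skipping the header) into the template.
  fields <- scan(file, what = what, sep = sep, skip = 1, quiet = TRUE)
  data.frame(fields, stringsAsFactors = FALSE)
}

# Invented sample data: three cases, three columns.
tmp <- tempfile()
writeLines(c("id,name,value", "1,a,10.5", "2,b,20.25", "3,c,30"), tmp)

dat <- read.sized.table(tmp, n = 3,
                        types = c("numeric", "character", "numeric"))
```

Passing the column types explicitly also covers the second point above: the numeric columns are read as numbers directly rather than being classified after the fact.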