Re: really large files - import vs. use

To: s-news@lists.biostat.wustl.edu
Subject: Re: really large files - import vs. use
From: Alan Zaslavsky <zaslavsk@hcp.med.harvard.edu>
Date: Thu, 16 Dec 2004 15:10:46 -0500 (EST)

The read.table() function uses scan() to actually read the data (after
getting the field headings, if any, from the first line).  Note the following
comment from the help file for scan():

   As it reads more and more records, scan allocates more space to  
   accommodate the growing vectors. If you supply a what argument that is
   identical in size to the result you expect, S-PLUS uses that space and
   does not have to perform memory allocations. This may produce
   significant memory savings when dealing with large files of data.

read.table() doesn't know how big the dataset is until it has read it, so
it cannot preallocate.  If you call scan() directly, you can pass a what=
argument that is already the full size of the result, so the space is
allocated once at the start; that saves both memory and time.  It would not
be hard to modify read.table() to accept an argument giving the number of
cases to be read, and to build from it a "what" argument for scan().  It
could also speed things up to know in advance which columns are numeric and
to specify that in the "what" argument to scan().
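
For what it's worth, here is a rough sketch of the idea, not a tested
replacement for read.table().  The helper make.what(), the file name
"bigfile.txt", the column names and the case count are all made up for
illustration; I am also assuming a comma-separated file with one header
line, and that you know the number of cases beforehand (e.g. from a line
count).

```r
# Hypothetical helper: build a full-size "what" list for scan().
# numeric.cols is a logical vector, T where the column is numeric.
make.what <- function(col.names, numeric.cols, n.cases)
{
    what <- vector("list", length(col.names))
    names(what) <- col.names
    for(i in seq(along = col.names)) {
        # Each component is already n.cases long, so scan() can fill it
        # in place rather than growing it as records arrive.
        what[[i]] <- if(numeric.cols[i]) numeric(n.cases)
                     else character(n.cases)
    }
    what
}

# Assumed layout: three columns, 100000 cases, one header line to skip.
what <- make.what(c("id", "age", "income"), c(F, T, T), n.cases = 100000)
dat <- scan("bigfile.txt", what = what, sep = ",", skip = 1)

# scan() returns a list with one component per column.
big <- data.frame(dat)
```

If n.cases does not match the actual number of records, you lose some or
all of the benefit, so it is worth counting the lines first.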

