Don and all,
For the portion of your post dealing with importData(), I have had very
similar experiences, and coincidentally, was planning to write to the list
on issues related to this topic. In my case, I have, for example, 2.5 million
observations and 5 columns in a csv file. I only tried using S-PLUS 6 for
Linux and S-PLUS 6 for Windows Beta 2. The machines these reside on are
similar, with dual processors around 800 MHz and 1.5 GB of RAM.
(What versions and platforms were you using?)
read.table() takes just under 8 minutes to read in a file like the above,
which is acceptable for me. importData() was taking so long on the same
file that I killed the process after nearly an hour. My recollection may be
faulty, but it seems that importData() would not be so bad on other formats,
such as a SAS data set. Also, I wonder if the Sv4 read enhancements come
into play here with read.table().
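
A minimal sketch of how such a timing comparison could be set up. It is
written for R, the open-source S dialect, which is an assumption: S-PLUS
timing functions and some argument names may differ, and importData() is
left out because it is specific to S-PLUS. The file and its column names
(id, x, y) are made up for illustration and are far smaller than the files
discussed here.

```r
# Build a small CSV so the sketch is self-contained (hypothetical data).
tmp <- tempfile()
n <- 1000
dat <- data.frame(id = 1:n, x = rnorm(n), y = rnorm(n))
write.table(dat, file = tmp, sep = ",", row.names = FALSE, quote = FALSE)

# Time read.table() on the file.
t.read <- system.time(
  df1 <- read.table(tmp, header = TRUE, sep = ",")
)

# Time scan() followed by conversion to a data frame.
t.scan <- system.time({
  lst <- scan(tmp, what = list(id = 0, x = 0, y = 0), sep = ",", skip = 1)
  df2 <- data.frame(lst)
})

print(t.read)
print(t.scan)
```

Timings on a file this small say little; the point is only the shape of the
comparison, to be repeated on the real file.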
Also interestingly, exportData() was much faster in producing a CSV file
than write.table() after processing such a data set -- a couple of seconds
versus over 2 minutes, respectively, to output a file of approximately
12,000 observations and 30 columns.
I'd certainly like to hear about others' experiences and opinions on these
types of issues.
Thanks,
Bill
______________________________
Bill Pikounis, Ph.D.
E-mail: v_bill_pikounis@merck.com
Phone: 732.594.3913
Fax: 732.594.1565
Merck Research Laboratories
P.O. Box 2000, MailDrop RY70-38
126 E. Lincoln Ave
Rahway, NJ 07065-0900
> -----Original Message-----
> From: Don MacQueen [mailto:macq@llnl.gov]
> Sent: Thursday, March 22, 2001 8:43 PM
> To: s-news@wubios.wustl.edu
> Subject: [S] compare scan(), read.table(), and importData()
>
>
> I just spent the last hour or so figuring out why read.table() and
> importData() produced dataframes with different numbers of rows from
> the same ascii data file.
>
> It turns out that read.table() ignores blank lines in the data file;
> importData() gives a row of all NAs.
>
> This is not documented in the help() pages for either of these. In
> fact, help(read.table) says explicitly,
> "creates a data frame with the same number of rows
> as there are lines in the file"
> and therefore does not do what it says it does. The
> documentation is wrong.
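
One way to check whether blank lines explain a row-count mismatch is to count
them directly. The sketch below is written for R (an assumption: S-PLUS may
need a different way to read raw lines), where read.table() skips blank lines
by default through its blank.lines.skip argument. The five-line file is made
up for illustration.

```r
tmp <- tempfile()
# Five lines in the file: header, two data rows, one blank line, one data row.
writeLines(c("a b", "1 2", "3 4", "", "5 6"), tmp)

# Count all lines, and those that are empty or contain only spaces or tabs.
lines <- readLines(tmp)
n.lines <- length(lines)
n.blank <- sum(nchar(gsub("[ \t]", "", lines)) == 0)

df <- read.table(tmp, header = TRUE)

# With blank lines skipped: rows = lines - header line - blank lines.
print(c(lines = n.lines, blank = n.blank, rows = nrow(df)))
```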
>
> A performance comparison of
> (1) scan() using the n= argument followed by conversion to
> a data frame
> (2) read.table()
> (3) importData()
> on a file with 116577 lines and 16 fields had the following elapsed
> times in seconds (two repetitions each)
> (1) 20, 23
> (2) 29, 30
> (3) 69, 71
>
> So
> - scan + conversion is fastest, but not all that much compared with
>   read.table
> - importData is much slower
>
> However, I don't know of a convenient way to disable the coercion of
> characters to factors when converting a list, as created by scan(), to
> a dataframe. read.table() has the as.is=T option, and importData()
> has the stringsAsFactors=F option. The latter is perhaps somewhat
> better, since it gives character variables the class "AsIs", which
> prevents subsequent conversion to factors, at least in some types of
> operations.
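
One possible workaround is to wrap the character components of the scan()
list in I() before conversion, which gives them the same "AsIs" class. This
is a sketch written for R, and it is an assumption that S-PLUS 6 behaves the
same way. The file and the component names (site, value) are hypothetical.

```r
tmp <- tempfile()
writeLines(c("alpha 1.5", "beta 2.5", "alpha 3.5"), tmp)

# scan() returns a list with one component per field.
lst <- scan(tmp, what = list(site = "", value = 0))

# Mark character components as "AsIs" so data.frame() leaves them alone.
is.chr <- sapply(lst, is.character)
lst[is.chr] <- lapply(lst[is.chr], I)

df <- data.frame(lst)
print(class(df$site))
```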
>
> My bottom line: read.table() is the best choice.
>
> -Don
> --
> --------------------------------------
> Don MacQueen
> Environmental Protection Department
> Lawrence Livermore National Laboratory
> Livermore, CA, USA
> --------------------------------------