Oops, I forgot to include the version:
Version 6.0 Release 1 for Sun SPARC, SunOS 5.6 : 2000
I've used importData() with SAS datasets on the order of 50000
records and not noticed a problem.
-Don
At 8:33 AM -0500 3/23/01, Pikounis, V. Bill wrote:
Don and all,
For the portion of your post dealing with importData(), I have had very
similar experiences, and coincidentally, was planning to write to the list
on issues related to this topic. In my case, I have for example 2.5 million
observations and 5 columns in a csv file. I only tried using S-PLUS 6 for
Linux and S-PLUS 6 for Windows Beta 2. The machines these reside on are
similar, with dual processors around 800 MHz and 1.5 GB of RAM.
(What versions and platforms were you using?)
read.table() takes just under 8 minutes to read in a file like the above,
which is acceptable for me. importData() was taking so long on the same
file that I killed the process after nearly an hour. My recollection may be
faulty, but it seems that importData() would not be so bad on other formats,
such as a SAS data set. Also, I wonder if the Sv4 read enhancements come
into play here with read.table().
Also interestingly, exportData() was much faster in producing a CSV file
than write.table() after processing such a data set -- a couple of seconds
versus over 2 minutes, respectively, to output a file of approximately
12,000 observations and 30 columns.
I'd certainly like to hear others' experiences and opinions on these
types of issues.
Thanks,
Bill
______________________________
Bill Pikounis, Ph.D.
E-mail: v_bill_pikounis@merck.com
Phone: 732.594.3913
Fax: 732.594.1565
Merck Research Laboratories
P.O. Box 2000, MailDrop RY70-38
126 E. Lincoln Ave
Rahway, NJ 07065-0900
-----Original Message-----
From: Don MacQueen [mailto:macq@llnl.gov]
Sent: Thursday, March 22, 2001 8:43 PM
To: s-news@wubios.wustl.edu
Subject: [S] compare scan(), read.table(), and importData()
I just spent the last hour or so figuring out why read.table() and
importData() produced dataframes with different numbers of rows from
the same ascii data file.
It turns out that read.table() ignores blank lines in the data file;
importData() gives a row of all NAs.
This is not documented in the help() pages for either of these. In
fact, help(read.table) says explicitly,
"creates a data frame with the same number of rows
as there are lines in the file"
and therefore does not do what it says it does. The
documentation is wrong.
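To make the blank-line behavior concrete, here is a minimal sketch that is not part of the original post. The temporary file and its contents are made up for illustration, and the importData() call is left commented out because its argument names here are an assumption.

```r
# Hypothetical test file: a header, two data lines, and one blank line between them.
tmp <- tempfile()
cat("x y\n1 2\n\n3 4\n", file = tmp)

# read.table() is reported above to skip the blank line.
d1 <- read.table(tmp, header = TRUE)
n1 <- nrow(d1)

# importData() is reported above to return an all-NA row for the blank line.
# The call below is an assumption about the argument names; check help(importData).
# d2 <- importData(tmp, type = "ASCII")
# n2 <- nrow(d2)

# The file has 3 lines after the header, one of them blank. Comparing n1
# (and n2) with 3 shows whether the blank line was dropped or kept.
```

Comparing the row counts against the known number of lines is a quick way to spot this kind of mismatch before looking for differences in the data itself.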
A performance comparison of
(1) scan() using the n= argument followed by conversion to
a data frame
(2) read.table()
(3) importData()
on a file with 116577 lines and 16 fields had the following elapsed
times in seconds (two repetitions each)
(1) 20, 23
(2) 29, 30
(3) 69, 71
So
- scan + conversion is fastest, but not by all that much compared
  with read.table
- importData is much slower
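A comparison along these lines could be set up roughly as follows. This is only a sketch under stated assumptions: the small numeric file stands in for the real 116577-line, 16-field file, the helper names (elapsed, via.scan, via.read.table) are hypothetical, and the importData() timing is omitted because it is specific to S-PLUS.

```r
# Stand-in data file (hypothetical): 1000 lines, 4 numeric fields.
tmp <- tempfile()
n.lines <- 1000
n.fields <- 4
m <- matrix(seq(length = n.lines * n.fields), ncol = n.fields)
write(t(m), file = tmp, ncolumns = n.fields)

# Elapsed wall-clock seconds taken by calling f(); the third element
# of proc.time() is the elapsed time.
elapsed <- function(f) {
  start <- proc.time()[3]
  f()
  as.numeric(proc.time()[3] - start)
}

# (1) scan() with the n= argument, followed by conversion to a data frame.
via.scan <- function() {
  x <- scan(tmp, what = rep(list(0), n.fields), n = n.lines * n.fields)
  names(x) <- paste("V", 1:n.fields, sep = "")
  data.frame(x)
}

# (2) read.table() on the same file.
via.read.table <- function() read.table(tmp)

t.scan <- elapsed(via.scan)
t.read.table <- elapsed(via.read.table)
```

On a file this small the timings will be close to zero, so the sketch only shows the shape of the comparison. Repeating each call, as was done above, helps separate real differences from noise.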
However, I don't know of a convenient way to disable the coercion of
characters to factors when converting a list, as created by scan(), to
a dataframe. read.table() has the as.is=T option, and importData()
has the stringsAsFactors=F option. The latter is perhaps somewhat
better, since it gives character variables the class "AsIs", which
prevents subsequent conversion to factors, at least in some types of
operations.
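One possible workaround for the scan() case, offered as a sketch and not as something from the original post, is to wrap the character components in I() before building the data frame. That gives them the same "AsIs" class mentioned above. The file contents and the component names id and value are made up for illustration.

```r
# Hypothetical two-column file: a character field and a numeric field.
tmp <- tempfile()
cat("a 1\nb 2\nc 3\n", file = tmp)

# scan() returns a list with one component per field.
x <- scan(tmp, what = list(id = "", value = 0))

# Mark the character component "AsIs" so that data.frame() leaves it
# as character data and does not convert it to a factor.
x$id <- I(x$id)
d <- data.frame(x)
```

With many character fields the same step can be applied to each character component of the list before the conversion.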
My bottom line: read.table() is the best choice.
-Don
--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
--------------------------------------
---------------------------------------------------------------------
This message was distributed by s-news@lists.biostat.wustl.edu. To
unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
the BODY of the message: unsubscribe s-news