s-news

Re: value of enterprise edition / big data?

To: "Dave Cacela" <DCacela@stratusconsulting.com>, <s-news@wubios.wustl.edu>
Subject: Re: value of enterprise edition / big data?
From: "Michael Camilleri" <MichaelCamilleri@branz.co.nz>
Date: Mon, 19 Mar 2007 09:59:39 +1200
In-reply-to: <88D649D1E09B684E8CBF77A5ADB3D306010847BB@tron.stratus.local>
References: <88D649D1E09B684E8CBF77A5ADB3D306010847BB@tron.stratus.local>
Thread-index: Acdn8PCzVsxPPTljTP2ivauebDQ8CQBssexw
Thread-topic: value of enterprise edition / big data?
I don't have the Big Data version, so I don't know how it would work.
However, I have a lot of experience using S+ with large data sets. Here
are some tips that might let you do the job without new software.

I presume you have set up S+ to use as much memory as it can. For
starters, specify an unlimited object size with
options(object.size=Inf); otherwise objects can only grow to the
currently specified size.

As a general rule of thumb you need at least 100% RAM overhead in S+,
so manipulating a 100 MB object needs at least 200 MB of available RAM.
This limits how large an object you can work with directly in S+. Work
out the actual physical storage size of your objects and multiply by
2-3; that is roughly the amount of memory you will need. To find out
how large an existing object is, use the object.size command. To
estimate how large a full data set would be, import a few thousand
rows, calculate the object size, then scale it up by the ratio of total
rows to imported rows.
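
As a rough sketch of that estimate (the sample data frame, the 5000-row
sample and the 2 million row total are all made-up numbers for
illustration, not figures from your data):

```r
# Hypothetical sample standing in for "a few thousand imported rows"
sample.df <- data.frame(id = 1:5000, value = rnorm(5000))

# Size of the sample in bytes
sample.bytes <- as.numeric(object.size(sample.df))

# Assumed row count of the full data set
full.rows <- 2000000

# Scale the sample size up to the full data set, in MB
est.mb <- sample.bytes * full.rows / nrow(sample.df) / 1024^2

# Apply the 2-3x rule of thumb to get a RAM range in MB
ram.needed.mb <- est.mb * c(2, 3)
```

The estimate is only as good as the sample: if later rows have longer
strings or more factor levels, the real object will be larger.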

Get more RAM as 1 GB is simply not enough. Max out your machine to 2 or
4 GB. It doesn't cost much. With Windows running you already lose a few
hundred megabytes for the OS (my S+ with Windows XP uses ~350 MB), so
with only 600 MB of RAM left you would be able to go up to maybe a 200
MB object maximum. Get 2 GB of RAM and you might be OK up to a 500 MB
object.

Calling rbind repeatedly is an inefficient way of adding rows to a large
data frame or merging data frames. If you want to merge a number of data
frames with identical columns, use new.df <- do.call('rbind',
list(df1, df2, df3, ...)). This can be up to 100 times faster than
repeated rbind calls, and is much better for large data objects. Just be
careful with factor levels - they may need to be identical in all the
data frames.
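
A minimal sketch of this, with three small made-up data frames (df1,
df2, df3 and the column names are hypothetical). The factor levels are
aligned first, on the assumption that your version needs them to match:

```r
# Three hypothetical data frames with identical columns
df1 <- data.frame(site = factor(c("a", "b")), x = c(1, 2))
df2 <- data.frame(site = factor(c("b", "c")), x = c(3, 4))
df3 <- data.frame(site = factor(c("c", "d")), x = c(5, 6))

# Collect every level that appears in any of the data frames
all.levels <- sort(unique(c(levels(df1$site), levels(df2$site),
                            levels(df3$site))))

# Re-create each factor with the common set of levels
df1$site <- factor(as.character(df1$site), levels = all.levels)
df2$site <- factor(as.character(df2$site), levels = all.levels)
df3$site <- factor(as.character(df3$site), levels = all.levels)

# Bind everything in a single call instead of one rbind per data frame
new.df <- do.call("rbind", list(df1, df2, df3))
```

With many data frames it is easier to keep them in a list from the
start and pass that list straight to do.call.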

If you can pre-process your data to make it more compact, or import only
what you need, that would help. This could be done in a database if
S+ can't handle the size.

You need to import character data as factors, as otherwise the memory
size may blow out - this is the default S+ behaviour, but check your
system. There are functions now in S+ for importing data from big files
in chunks to make them more manageable. You can also make the objects a
bit smaller by storing integer data as integers rather than floating
point. Either do this in the import procedure (some let you specify
import formats and storage modes), or convert after importing, e.g.
x[,1] <- as.integer(x[,1]).
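
A small sketch of the conversion and of checking the saving (the data
frame x and the vector v are made up for illustration):

```r
# Hypothetical data frame whose first column holds whole numbers
# stored as floating point
x <- data.frame(count = c(10, 20, 30, 40), temp = c(1.5, 2.5, 3.5, 4.5))

# Convert the first column to integer storage after importing
x[, 1] <- as.integer(x[, 1])

# Compare storage for the same values held as double and as integer
v <- rep(c(1, 2, 3), 100000)
size.double <- object.size(v)
size.integer <- object.size(as.integer(v))
```

Only convert columns that really are whole numbers, since as.integer
drops any fractional part.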

Occasionally a bit of bad data will force a numeric column to be
imported as character data, which blows out the object size. Numeric
columns must have no odd data in them. You can filter out the bad
values, or specify which values should be interpreted as NA, depending
on what import function you use.
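
If a column has already come in as character, it can be repaired
afterwards along these lines (the vector raw and the marker "n/a" are
made-up examples of bad data):

```r
# Hypothetical column that was imported as character because of "n/a"
raw <- c("1.5", "2.0", "n/a", "3.25")

# Replace the known bad marker with NA before converting
raw[raw == "n/a"] <- NA

# Convert the cleaned column back to numeric storage
clean <- as.numeric(raw)
```

It is worth counting the NAs afterwards with sum(is.na(clean)) to make
sure only the values you expected were dropped.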
 
In my experience large time series objects are more difficult to handle
than large data frames. If you don't really need to store them as time
series, store them as data frames, and convert to time series as needed.

A couple of million rows by a couple of dozen fields should give an
object that is about 100-200 MB. That size object should be easily
manageable if you get more RAM. If you were dealing with 10 million rows
by 500 columns then you really would need the Big Data library, but I
don't think you need it. 

Michael

MICHAEL CAMILLERI BSc, MSc, PhD 
BUILDING PHYSICIST      
T +64 4 237 1170
DDI +64 4 237 1174
PRIVATE BAG 50908       
PORIRUA CITY 5240       
WWW.BRANZ.CO.NZ 

