Hello everyone,
Many people requested the function, so I am posting it to the list.
Please note that I am not the original author; Stephen Weller of
Insightful is. I am a firm believer in giving credit where credit is due!
Here it is; feel free to adapt it to your own situation.
import.data.in.blocks <- function(file.to.import, rows.at.once, nobs,
                                  dfname,
                                  filetype = "SAS7",
                                  delimiter = ',', stringsAsFactors = F)
{
  ### Author : Gérald Jean, modeled after the function of the same name by
  ###          Stephen Weller, of Insightful.
  ### Date   : November 2002
  ### Purpose: Read a data file in blocks of rows. This considerably cuts
  ###          down the memory required for the import.
  ###
  ### Arguments:
  ###   file.to.import  : A character string containing the full path and
  ###                     name of the input data file.
  ###   rows.at.once    : Number of rows to process in a single call of the
  ###                     'readNextDataRows()' function, i.e. the block size.
  ###   nobs            : Total number of observations to read, used to
  ###                     allocate the output data.frame.
  ###   dfname          : Name of the output data.frame (not referenced in
  ###                     the body as posted; the data.frame is returned).
  ###   filetype        : Any file type supported by S-PLUS.
  ###   delimiter       : Required for the ASCII file type.
  ###   stringsAsFactors: Should character variables be imported as factors
  ###                     or kept as strings?
  ### ----------------------------------------------------------------------
  "ElapsedTime" <- function (start, finish)
  {
    # Author  : Gérald Jean
    # Date    : September 1999
    # Purpose : Calculates elapsed time, in hours, minutes and seconds.
    #           Note that cpu time = user time + system time.
    # Upgrades: 01/11/2001 GJ -- Updated for S+6, uses 3 "times": user,
    #           system and elapsed; all these times are given by the
    #           builtin function "proc.time".
    # Parameters:
    #   start : results of the builtin function "proc.time" called at the
    #           start of a process.
    #   finish: results of the builtin function "proc.time" called at the
    #           end of a process.
    all.seconds <- finish - start
    all.seconds <- c(all.seconds[1:2], sum(all.seconds[1:2]), all.seconds[3])
    hours <- all.seconds %/% 3600
    all.seconds <- all.seconds %% 3600
    minutes <- all.seconds %/% 60
    seconds <- all.seconds %% 60
    cat(paste(c(' User time = ', ' System time = ', ' CPU time = ',
                ' Elapsed time = '), format(hours), 'h. ', format(minutes),
              'min. ', format(seconds), 's.', collapse = '\n'), sep = '\n')
    invisible(list(hours = hours, minutes = minutes, seconds = seconds))
  }
  ### ----------------------------------------------------------------------
  ### "import.data.in.blocks" begins.
  ### Open input data set.
  ttt.time <- proc.time()[1:3]
  cat('\n\tOpening input file, this may take a little while!', sep = '\n')
  if (filetype != 'ASCII') delimiter <- ""
  in.handle <- NULL
  in.handle <- openData(file = file.to.import, type = filetype,
                        delimiter = delimiter,
                        openType = 'read',
                        rowsToRead = rows.at.once,
                        stringsAsFactors = stringsAsFactors)
  on.exit({closeData(in.handle)})
  cat('\n\t Input file opened.', sep = '\n')
  ### First read one block of data and then create an empty data.frame whose
  ### columns have the same storage modes as the columns of this first
  ### block. We will then replace the initial entries of the data.frame by
  ### blocks of 'rows.at.once' rows.
  tmpdf <- readNextDataRows(in.handle)
  cat('\n\t Creating output data.frame, this may take a longer while!!!',
      sep = '\n\n')
  store.mode <- lapply(tmpdf, FUN = function(x) storage.mode(x))
  modes.paste <- paste(unlist(store.mode), '(', nobs, ')', sep = '')
  names.paste <- paste(names(store.mode), ' = ', modes.paste, sep = '',
                       collapse = ', ')
  how.to.build.frame <- paste('data.frame(', names.paste,
                              ', dup.row.names = T)', collapse = '',
                              sep = '')
  resultdf <- eval(parse(text = how.to.build.frame))
  ReadObs <- nrow(tmpdf)
  loopCount <- 1
  resultdf[1:ReadObs, ] <- tmpdf
  ElapsedTime(ttt.time, proc.time()[1:3])
  cat("\nIteration number: ", loopCount, ", Initial ", ReadObs,
      " rows processed.", "\n", sep = "")
  if (!is.null(in.handle)){
    cat('\n\tStarting to process remaining rows.', sep = '\n\n')
    while (ReadObs < nobs){
      tmpdf <- readNextDataRows(in.handle)
      if (!length(tmpdf))   ### Break out of the while-loop if all rows
        break               ### have been read from the external file.
      tmp.nrows <- nrow(tmpdf)
      loopCount <- loopCount + 1
      start <- ReadObs + 1
      ReadObs <- ReadObs + tmp.nrows
      resultdf[start:ReadObs, ] <- tmpdf
      cat("Iteration number: ", loopCount, ", Total rows processed = ",
          ReadObs, "\n", sep = "")
    }
    cat("\n\tPermanently assigning output data.frame", "\n", sep = "")
  }
  ElapsedTime(ttt.time, proc.time()[1:3])
  resultdf
}
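
To illustrate the preallocate-then-fill idea used in the function above, here is a minimal sketch in R-style syntax. It is only an illustration under stated assumptions: it has not been tried under S-PLUS, the two small blocks (first.block, second.block) and their column names are made-up data standing in for what 'readNextDataRows()' would return, and it builds the empty data.frame with as.data.frame(lapply(...)) rather than the eval(parse(...)) construction used in the function.

```r
# Made-up blocks standing in for two successive reads of the input file.
first.block  <- data.frame(id = 1:3, amount = c(10.5, 20, 7.25),
                           region = c("A", "B", "A"),
                           stringsAsFactors = FALSE)
second.block <- data.frame(id = 4:5, amount = c(3, 8.5),
                           region = c("C", "B"),
                           stringsAsFactors = FALSE)
nobs <- 5   # total number of rows expected (assumed known in advance)

# Storage mode of each column in the first block.
store.mode <- lapply(first.block, storage.mode)

# Preallocate one vector of length nobs per column, with matching modes.
resultdf <- as.data.frame(lapply(store.mode,
                                 function(m) vector(m, nobs)),
                          stringsAsFactors = FALSE)

# Fill the preallocated rows block by block.
ReadObs <- nrow(first.block)
resultdf[1:ReadObs, ] <- first.block

start   <- ReadObs + 1
ReadObs <- ReadObs + nrow(second.block)
resultdf[start:ReadObs, ] <- second.block
```

Because every column is allocated once at its full length, each block assignment overwrites existing rows instead of growing the data.frame, which is where the memory saving comes from.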
Cheers,
Gérald Jean
Consulting Analyst (Statistics), Actuarial Department
telephone: (418) 835-4900 ext. 7639
fax: (418) 835-6657
e-mail: gerald.jean@dgag.ca
"In God we trust; all others must bring data." W. Edwards Deming