s-news

Re: Réf. : really large files

To: s-news@wubios.wustl.edu
Subject: Re: Réf. : really large files
From: gerald.jean@dgag.ca
Date: Thu, 16 Dec 2004 11:51:23 -0500
Hello everyone,

Many people requested the function, so I am posting it to the list.
Please note that I am not the original author; Stephen Weller of
Insightful is. I am a firm believer in giving credit where credit is due!
Here it is; feel free to adapt it to your own situation.

import.data.in.blocks <- function(file.to.import, rows.at.once, nobs, dfname,
                                  filetype = "SAS7",
                                  delimiter = ',', stringsAsFactors = F)
{
### Author : Gérald Jean, modeled after the function of the same name by
###          Stephen Weller of Insightful.
### Date   : November 2002
### Purpose: Reading a data file by blocks of rows.  This procedure will
###          considerably cut down memory requirements for the import.
###
### Arguments:
###  file.to.import  : A character string containing the full path and name
###                    of the input data file.
###  rows.at.once    : Number of rows to process in a single call of the
###                    'readNextDataRows()' function, i.e. the block size.
###  nobs            : Total number of observations to read, used to
###                    allocate the output data.frame.
###  dfname          : Name of the output data.frame (not referenced in the
###                    body below; the data.frame is returned instead).
###  filetype        : Any file type allowable by Splus.
###  delimiter       : Required for ASCII file type.
###  stringsAsFactors: Should character variables be imported as factors or
###                    kept as strings.

### ---------------------------------------------------------------------------
"ElapsedTime" <- function (start, finish)
{
# Author  : Gérald Jean
# Date    : September 1999
# Purpose : Will calculate elapsed time, in hours, minutes and seconds.
#           Note that cpu time = user time + system time.
# Upgrades: 01/11/2001 GJ -- Updated for S+6, uses 3 "times": user, system
#           and elapsed; all these times are given by the builtin function
#           "proc.time".
# Parameters:
#  start : results of the builtin function "proc.time" called at the start
#          of a process.
#  finish: results of the builtin function "proc.time" called at the end of
#          a process.

  all.seconds <- finish - start
  all.seconds <- c(all.seconds[1:2], sum(all.seconds[1:2]), all.seconds[3])
  hours       <- all.seconds %/% 3600
  all.seconds <- all.seconds %% 3600
  minutes     <- all.seconds %/% 60
  seconds     <- all.seconds %% 60
  cat(paste(c(' User time    = ', ' System time  = ', ' CPU time     = ',
              ' Elapsed time = '), format(hours), 'h. ', format(minutes),
              'min. ', format(seconds), 's.', collapse = '\n'), sep = '\n')
  invisible(list(hours = hours, minutes = minutes, seconds = seconds))
}
### ---------------------------------------------------------------------------

### "import.data.in.blocks" begins.
### Open input data set.

  ttt.time <- proc.time()[1:3]
  cat('\n\tOpening input file, this may take a little while!', sep = '\n')
  if (filetype != 'ASCII') delimiter <- ""

  in.handle <- NULL
  in.handle <- openData(file = file.to.import, type = filetype,
                        delimiter = delimiter,
                        openType = 'read',
                        rowsToRead = rows.at.once,
                        stringsAsFactors = stringsAsFactors)
  on.exit({closeData(in.handle)})
  cat('\n\t Input file opened.', sep = '\n')

### First read one block of data, then create an empty data.frame whose
### columns have the same storage modes as the columns of this first block.
### We will then replace the initial entries with blocks of 'rows.at.once'
### rows read from the file.

  tmpdf <- readNextDataRows(in.handle)
  cat('\n\t Creating output data.frame, this may take a longer while!!!',
      sep = '\n\n')
  store.mode <- lapply(tmpdf, FUN = function(x) storage.mode(x))
  modes.paste <- paste(unlist(store.mode), '(', nobs, ')', sep = '')
  names.paste <- paste(names(store.mode), ' = ', modes.paste, sep = '',
                       collapse = ', ')
  how.to.build.frame <- paste('data.frame(', names.paste,
                              ', dup.row.names = T)', collapse = '',
                              sep = '')

  resultdf <- eval(parse(text = how.to.build.frame))
  ReadObs <- nrow(tmpdf)
  loopCount <- 1
  resultdf[1:ReadObs, ] <- tmpdf
  ElapsedTime(ttt.time, proc.time()[1:3])
  cat("\nIteration number: ", loopCount, ", Initial ", ReadObs,
      " rows processed.", "\n", sep = "")

  if (!is.null(in.handle)){
    cat('\n\tStarting to process remaining rows.', sep = '\n\n')
    while (ReadObs < nobs){
      tmpdf <- readNextDataRows(in.handle)
      if(!length(tmpdf))     ### Break out of the while-loop if all rows
        break                ### have been read from the external file.
      tmp.nrows <- nrow(tmpdf)
      loopCount <- loopCount + 1
      start <- ReadObs + 1
      ReadObs <- ReadObs + tmp.nrows
      resultdf[start:ReadObs, ] <- tmpdf
      cat("Iteration number: ", loopCount, ", Total rows processed = ",
          ReadObs, "\n", sep = "")
    }

    cat("\n\tPermanently assigning output data.frame", "\n", sep = "")
  }
  ElapsedTime(ttt.time, proc.time()[1:3])
  resultdf
}
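
For anyone adapting this outside S-PLUS: openData(), readNextDataRows() and
closeData() are S-PLUS functions, so the code above will not run as is
elsewhere. Below is a minimal sketch of the same pattern (read a first block,
preallocate the output from its storage modes, then fill it block by block)
using only base R. The name read.in.blocks is hypothetical, and the sketch
assumes a comma-separated file with a header row. It also builds the empty
data.frame directly instead of going through eval(parse()), and it trims the
result if the file holds fewer than nobs rows.

```r
# Sketch only: base R stand-in for the S-PLUS openData()/readNextDataRows()
# calls. read.in.blocks is a hypothetical name, not part of the original post.
# Assumes a comma-separated text file whose first line is a header.
read.in.blocks <- function(path, rows.at.once, nobs) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # First block: also gives the column names and storage modes.
  block <- read.csv(con, header = TRUE, nrows = rows.at.once,
                    stringsAsFactors = FALSE)
  col.names <- names(block)

  # Preallocate nobs rows, one vector per column, matching the first block.
  resultdf <- as.data.frame(
    lapply(block, function(x) vector(storage.mode(x), nobs)),
    stringsAsFactors = FALSE)

  read.obs <- nrow(block)
  resultdf[seq_len(read.obs), ] <- block

  while (read.obs < nobs) {
    # read.csv signals an error once no lines are left on the connection;
    # this sketch treats any error here as end of file.
    block <- tryCatch(
      read.csv(con, header = FALSE, nrows = rows.at.once,
               col.names = col.names, stringsAsFactors = FALSE),
      error = function(e) NULL)
    if (is.null(block) || nrow(block) == 0) break
    start <- read.obs + 1
    read.obs <- read.obs + nrow(block)
    resultdf[start:read.obs, ] <- block
  }

  # Drop unused preallocated rows if the file was shorter than nobs.
  resultdf[seq_len(read.obs), , drop = FALSE]
}
```

A call would look like read.in.blocks("mydata.csv", rows.at.once = 10000,
nobs = 500000), where the file name and sizes are placeholders.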

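The arithmetic inside ElapsedTime() may be easier to follow on a single
number. The sketch below isolates it in a small helper; seconds.to.hms is a
hypothetical name used only for illustration, and it works on vectors just
as the original does on the output of proc.time().

```r
# Hypothetical helper mirroring the arithmetic inside ElapsedTime().
seconds.to.hms <- function(all.seconds) {
  hours <- all.seconds %/% 3600      # whole hours
  rest  <- all.seconds %% 3600       # seconds left after removing the hours
  list(hours = hours,
       minutes = rest %/% 60,        # whole minutes in the remainder
       seconds = rest %% 60)         # leftover seconds
}

# 3725 s = 1 * 3600 + 2 * 60 + 5, i.e. 1 h, 2 min, 5 s.
seconds.to.hms(3725)
```
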
Cheers,

Gérald Jean
Consulting Analyst (Statistics), Actuarial
telephone: (418) 835-4900 ext. (7639)
fax      : (418) 835-6657
e-mail   : gerald.jean@dgag.ca

"In God we trust; all others must bring data."  W. Edwards Deming


