s-news
[Top] [All Lists]

extracting data vector from multiple Excel files

To: s-news@wubios.wustl.edu
Subject: extracting data vector from multiple Excel files
From: P.J.Wells@open.ac.uk
Date: Thu, 29 Aug 2002 17:11:25 +0100
My dataset of accounting data on 110,000 companies has its primary existence
in the form of Excel spreadsheets.

Each year's data  is stored in a directory containing 13 spreadsheet files
(with columns for each of 105 variables) corresponding to the one-digit SIC
code applicable to the company, with supplementary files where the number of
firms under a given one-digit code exceeds the maximum number of rows
allowed in Excel.

Thus, obtaining an S+ vector for a given variable for a given year involves
extracting the relevant column from each spreadsheet and concatenating the
results.

A function on the following lines is intended to do this:

make.my.data <- function(date, data=c("foo", "bar", ...)){
        
        # where "foo", "bar" are variable names (I only need approx. 20
variables from the total available)

        data <- match.arg(data)

        switch(data,
                foo = {data.col <- 53},
                bar = {data.col <- 48}.
                ...)#

        # in order to set TargetStartCol, EndCol = data.col
        
        switch(data,
                foo = {data.mode <- "numeric"},
                bar = {data.mode <- "character"},
                ...)#
        
        # sets data.mode [= "numeric" for most]
        

        import.data(DataFrame = "temp",
                FileName=
                        paste("C:\\  ... [drills down through possibly
over-complex directory structure ] ... \\19", date, "\\0-", date, ".xls",
sep=""),
                FileType="EXCEL",
                TargetStartCol=data.col, EndCol=data.col,
                NameRow=1)
        guiClose("data.frame", "temp")

        temp.0 <- as.vector(temp, mode = data.mode)

        # ...

        import.data(DataFrame = "temp",
                FileName=
                        paste("C:\\Data\\Data sets\\From
Roberto\\Data97xl\\Excel files\\19", date, "\\9-", date, ".xls", sep=""),
                FileType="EXCEL",
                TargetStartCol=data.col, EndCol=data.col,
                NameRow=1)
        guiClose("data.frame", "year")

        temp.9 <- as.vector(temp)

        rm(temp)

        output <- rbind(temp.0, temp.1, temp.2, temp.3, temp.4, temp.5,
temp.6a, temp.6b, temp.7, temp.8a, temp.8b, temp.8c, temp.9)
}

This seems to work OK if one confines oneself to extracting only a few of
the SIC code sections, but trying this out (for a numeric variable) on the
full set crashes S+ -- apparently after about five sections.

(S+ 2k Professional, on a Wintel box -- Pentium Pro with 265MB of RAM; speed
escapes me).

FWIW, the resulting error message is:

SPLUS caused an invalid page fault in
module SAPIOBJ.DLL at 0137:00e88471.
Registers:
EAX=00000000 CS=0137 EIP=00e88471 EFLGS=00010286
EBX=00000000 SS=013f ESP=0110bf5c EBP=0110bf60
ECX=ffffffff DS=013f ESI=029a4980 FS=0cff
EDX=ffffffff ES=013f EDI=027cbbfc GS=0000
Bytes at CS:EIP:
0f be 08 85 c9 74 2f 8b 55 fc d1 e2 89 55 fc 8b 
Stack dump:
00000000 0110bf78 00e88403 00000000 00000000 ffffffff 00000000 0110bfb4
00e87efa 00000000 0110bfac 0110bfa8 0284b96c 00002a9d 00002a9c 00000004 

A version which rbinds the dataframes before coercing them to vectors
suffers a similar fate.

Can anyone explain this behaviour and/or suggest a better approach?

Julian Wells
OU Business School
The Open University
Walton Hall
Milton Keynes
MK7 6AA
United Kingdom
+44 1908 654658


<Prev in Thread] Current Thread [Next in Thread>
  • extracting data vector from multiple Excel files, P . J . Wells <=