My dataset of accounting data on 110,000 companies has its primary existence
in the form of Excel spreadsheets.
Each year's data is stored in a directory containing 13 spreadsheet files
(with columns for each of 105 variables) corresponding to the one-digit SIC
code applicable to the company, with supplementary files where the number of
firms under a given one-digit code exceeds the maximum number of rows
allowed in Excel.
Thus, obtaining an S+ vector for a given variable for a given year involves
extracting the relevant column from each spreadsheet and concatenating the
results.
A function on the following lines is intended to do this:
make.my.data <- function(date, data=c("foo", "bar", ...)){
# where "foo", "bar" are variable names (I only need approx. 20
variables from the total available)
data <- match.arg(data)
switch(data,
foo = {data.col <- 53},
bar = {data.col <- 48}.
...)#
# in order to set TargetStartCol, EndCol = data.col
switch(data,
foo = {data.mode <- "numeric"},
bar = {data.mode <- "character"},
...)#
# sets data.mode [= "numeric" for most]
import.data(DataFrame = "temp",
FileName=
paste("C:\\ ... [drills down through possibly
over-complex directory structure ] ... \\19", date, "\\0-", date, ".xls",
sep=""),
FileType="EXCEL",
TargetStartCol=data.col, EndCol=data.col,
NameRow=1)
guiClose("data.frame", "temp")
temp.0 <- as.vector(temp, mode = data.mode)
# ...
import.data(DataFrame = "temp",
FileName=
paste("C:\\Data\\Data sets\\From
Roberto\\Data97xl\\Excel files\\19", date, "\\9-", date, ".xls", sep=""),
FileType="EXCEL",
TargetStartCol=data.col, EndCol=data.col,
NameRow=1)
guiClose("data.frame", "year")
temp.9 <- as.vector(temp)
rm(temp)
output <- rbind(temp.0, temp.1, temp.2, temp.3, temp.4, temp.5,
temp.6a, temp.6b, temp.7, temp.8a, temp.8b, temp.8c, temp.9)
}
This seems to work OK if one confines oneself to extracting only a few of
the SIC code sections, but trying this out (for a numeric variable) on the
full set crashes S+ -- apparently after about five sections.
(S+ 2k Professional, on a Wintel box -- Pentium Pro with 265MB of RAM; speed
escapes me).
FWIW, the resulting error message is:
SPLUS caused an invalid page fault in
module SAPIOBJ.DLL at 0137:00e88471.
Registers:
EAX=00000000 CS=0137 EIP=00e88471 EFLGS=00010286
EBX=00000000 SS=013f ESP=0110bf5c EBP=0110bf60
ECX=ffffffff DS=013f ESI=029a4980 FS=0cff
EDX=ffffffff ES=013f EDI=027cbbfc GS=0000
Bytes at CS:EIP:
0f be 08 85 c9 74 2f 8b 55 fc d1 e2 89 55 fc 8b
Stack dump:
00000000 0110bf78 00e88403 00000000 00000000 ffffffff 00000000 0110bfb4
00e87efa 00000000 0110bfac 0110bfa8 0284b96c 00002a9d 00002a9c 00000004
A version which rbinds the dataframes before coercing them to vectors
suffers a similar fate.
Can anyone explain this behaviour and/or suggest a better approach?
Julian Wells
OU Business School
The Open University
Walton Hall
Milton Keynes
MK7 6AA
United Kingdom
+44 1908 654658
|