My dataset of accounting data on 110,000 companies has its primary existence
in the form of Excel spreadsheets.
Each year's data is stored in a directory containing 13 spreadsheet files
(with columns for each of 105 variables) corresponding to the one-digit SIC
code applicable to the company, with supplementary files where the number of
firms under a given one-digit code exceeds the maximum number of rows
allowed in Excel.
Thus, obtaining an S+ vector for a given variable for a given year involves
extracting the relevant column from each spreadsheet and concatenating the
results.
Since the structure of the spreadsheets is slightly different for different
years it is necessary to vary the column chosen for a given variable
according to year for which it is sought, and also to switch between scripts
for reading the spreadsheets and concatenating the results.
A function intended to do this (essentially a wrapper for import.data() ) is
outlined at the end of this message: my problem is that the working
version's results suffer from the following errors:
1) length of resulting vectors does not equal the total number of firms.
I suspect this may be related to the fact that NAs (in the form of empty
spreadsheet cells) are prevalent in the data; to try to accommodate this I
have set explicit StartRow and EndRow arguments to import.data() as
appropriate, and have also experimented with different inputs to the
Delimiters argument, but without improvement.
2) vectors of 1990 data turn out identical to those for 1992 (i.e.
all.equal(1990, 1992) returns T) !
I can't *see* anything wrong with the syntax of the relevant if() ... else
... lines (and, yes, I have checked the spreadsheets concerned to see that
the data really is different...) and moreover the very similar section
selecting the data to be extracted works OK.
I'm just about at my wits' end on this, and I *really* don't want to get
into writing an Excel macro to fiddle the data into a more S+ friendly
form...
Although 131 messages turned up in response to a search on "import" AND
"Excel" in the S-news archive none of them appear to be relevant to my
problems.
Any comments or suggestions gratefully received.
Julian Wells
School of Management
The Open University
Walton Hall
Milton Keynes
MK7 6AA
United Kingdom
(01908) 654658
+44 (1908) 654658
===========================
Outline function:
make.data <- function(date, data){
if (date==90) switch(data,
"bank" = {data.col <- 47},
... # etc.#
else if (date==91 | date==93 | date==95) switch(data,
"bank" = {data.col <- 95},
... # etc.#
else if (date==92 | date==94 | date==96) switch(data,
"bank" = {data.col <- 94},
... # etc.#
# sets StartCol, EndCol = data.col, as needed
switch(data,
"bank" = {data.mode <- "numeric"},
# ... # etc.#
# sets data.mode [="numeric" for most]
file.string <- "C:\\..."
if (date==90) {import.data(DataFrame = "temp0",
FileName=paste(file.string, "0w.xls", sep=""),
FileType="Excel",
StartCol=data.col, EndCol=data.col,
StartRow=3, EndRow=1920)
guiClose("data.frame", "temp0")
# ...
import.data(DataFrame = "temp9",
FileName=paste(file.string, "9w.xls", sep=""),
FileType="Excel",
StartCol=data.col, EndCol=data.col,
StartRow=3, EndRow=7792)
guiClose("data.frame", "temp9")
temp.0 <- as.vector(temp0$V1, mode=data.mode)
# ...
temp.9 <- as.vector(temp9$V1, mode=data.mode)
output <- c(temp.0, ... , temp.9)
}
else if (date==91 | date==92){
# as before, for different set of *.xls files
}
else if (date==93 | date==94){
# as before, for a further set of *.xls files
}
else{
# as before, for yet another set of *.xls files
}
}
|