Perhaps this isn't the most elegant answer available, but the 10 minute
answer is to import the 13 spreadsheets into a single Access database,
ensure that each company has a common identifier (number or string),
create a query based on the 13 tables, and re-export the data for
analysis.
The fact that your spreadsheets are not identical in shape is irrelevant
using this approach.
I'd be happy to walk you through a query if you are unfamiliar with
Access. It really isn't difficult, especially for just pulling data
together from separate tables.
Cheers,
Michael
> -----Original Message-----
> From: s-news-owner@lists.biostat.wustl.edu [mailto:s-news-
> owner@lists.biostat.wustl.edu] On Behalf Of P.J.Wells@open.ac.uk
> Sent: Wednesday, January 22, 2003 2:47 PM
> To: s-news@wubios.wustl.edu
> Subject: [S] Importing from Excel
>
> My dataset of accounting data on 110,000 companies has its primary
> existence
> in the form of Excel spreadsheets.
>
> Each year's data is stored in a directory containing 13 spreadsheet
files
> (with columns for each of 105 variables) corresponding to the
one-digit
> SIC
> code applicable to the company, with supplementary files where the
number
> of
> firms under a given one-digit code exceeds the maximum number of rows
> allowed in Excel.
>
> Thus, obtaining an S+ vector for a given variable for a given year
> involves
> extracting the relevant column from each spreadsheet and concatenating
the
> results.
>
> Since the structure of the spreadsheets is slightly different for
> different
> years it is necessary to vary the column chosen for a given variable
> according to year for which it is sought, and also to switch between
> scripts
> for reading the spreadsheets and concatenating the results.
>
> A function intended to do this (essentially a wrapper for
import.data() )
> is
> outlined at the end of this message: my problem is that the working
> version's results suffer from the following errors:
>
> 1) length of resulting vectors does not equal the total number of
firms.
>
> I suspect this may be related to the fact that NAs (in the form of
empty
> spreadsheet cells) are prevalent in the data; to try to accommodate
this I
> have set explicit StartRow and EndRow arguments to import.data() as
> appropriate, and have also experimented with different inputs to the
> Delimiters argument, but without improvement.
>
> 2) vectors of 1990 data turn out identical to those for 1992 (i.e.
> all.equal(1990, 1992) returns T) !
>
> I can't *see* anything wrong with the syntax of the relevant if() ...
else
> ... lines (and, yes, I have checked the spreadsheets concerned to see
that
> the data really is different...) and moreover the very similar section
> selecting the data to be extracted works OK.
>
> I'm just about at my wits' end on this, and I *really* don't want to
get
> into writing an Excel macro to fiddle the data into a more S+ friendly
> form...
>
> Although 131 messages turned up in response to a search on "import"
AND
> "Excel" in the S-news archive none of them appear to be relevant to my
> problems.
>
> Any comments or suggestions gratefully received.
>
> Julian Wells
>
> School of Management
> The Open University
> Walton Hall
> Milton Keynes
> MK7 6AA
> United Kingdom
>
> (01908) 654658
> +44 (1908) 654658
>
> ===========================
>
> Outline function:
>
> make.data <- function(date, data){
>
> if (date==90) switch(data,
> "bank" = {data.col <- 47},
> ... # etc.#
>
> else if (date==91 | date==93 | date==95) switch(data,
> "bank" = {data.col <- 95},
> ... # etc.#
>
> else if (date==92 | date==94 | date==96) switch(data,
> "bank" = {data.col <- 94},
> ... # etc.#
>
> # sets StartCol, EndCol = data.col, as needed
>
>
> switch(data,
> "bank" = {data.mode <- "numeric"},
> # ... # etc.#
>
> # sets data.mode [="numeric" for most]
>
>
> file.string <- "C:\\..."
>
> if (date==90) {import.data(DataFrame = "temp0",
> FileName=paste(file.string, "0w.xls", sep=""),
> FileType="Excel",
> StartCol=data.col, EndCol=data.col,
> StartRow=3, EndRow=1920)
> guiClose("data.frame", "temp0")
>
> # ...
>
> import.data(DataFrame = "temp9",
> FileName=paste(file.string, "9w.xls", sep=""),
> FileType="Excel",
> StartCol=data.col, EndCol=data.col,
> StartRow=3, EndRow=7792)
> guiClose("data.frame", "temp9")
>
> temp.0 <- as.vector(temp0$V1, mode=data.mode)
> # ...
> temp.9 <- as.vector(temp9$V1, mode=data.mode)
>
> output <- c(temp.0, ... , temp.9)
> }
>
> else if (date==91 | date==92){
> # as before, for different set of *.xls files
> }
>
> else if (date==93 | date==94){
> # as before, for a further set of *.xls files
> }
>
> else{
> # as before, for yet another set of *.xls files
> }
> }
>
> --------------------------------------------------------------------
> This message was distributed by s-news@lists.biostat.wustl.edu. To
> unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
> the BODY of the message: unsubscribe s-news
|