I received several very useful suggestions in response to my question
on translating multiple, uniquely identified rows from multiple files
into one numeric Splus file. I thank Stephen Weller, Andy Liaw, and
James Holtman for their assistance. I have posted the replies below,
with the exception of Stephen Weller's message, because he sent it to
the list. At the end of the replies, I have posted the original message
for those interested in re-reading it.
#----James Holtman---------------
Here is one way of doing it. Read the file with "scan(..., what
='character', sep='\n')" so that each line becomes one element of a
character vector. Then copy the lines that you are interested in into a
new object (in the example below, lines that start with "C"). Then use
'textConnection' so that read.table can read this object.
> x.in <- scan('tempxx', what='character', sep='\n')
Read 20 items
> x.in
[1] "C 0.575115921808201 0.0912244697313813"
[2] "X 0.51203689130769 0.351417834486677"
[3] "C 0.336942071639221 0.324503383441945"
[4] "X 0.228913358931642 0.395039733824096"
[5] "C 0.850863703026172 0.224212603462909"
[6] "X 0.594487485148592 0.576069293444992"
[7] "I 0.900631305272838 0.6758116068495"
[8] "I 0.337903537866171 0.846088978891747"
[9] "X 0.986568547316493 0.282605149383332"
[10] "X 0.216593418320779 0.649524857208488"
[11] "C 0.422773490991158 0.296612558722639"
[12] "X 0.136980706857746 0.38444003704573"
[13] "C 0.183686651099400 0.157469714795581"
[14] "I 0.73923632356786 0.187183499379825"
[15] "X 0.59747994099685 0.930189237680796"
[16] "X 0.401679007895682 0.774042770446754"
[17] "X 0.762639131807405 0.735627770827996"
[18] "C 0.0531840282616168 0.80239231670331"
[19] "I 0.326288505998973 0.727728606371146"
[20] "X 0.226769466005910 0.613449229768815"
> x.C <- x.in[grep("^C", x.in)] # select lines that start with "C"
> x.C
[1] "C 0.575115921808201 0.0912244697313813"
[2] "C 0.336942071639221 0.324503383441945"
[3] "C 0.850863703026172 0.224212603462909"
[4] "C 0.422773490991158 0.296612558722639"
[5] "C 0.183686651099400 0.157469714795581"
[6] "C 0.0531840282616168 0.80239231670331"
> x.df <- read.table(textConnection(x.C))
> x.df
  V1         V2         V3
1  C 0.57511592 0.09122447
2  C 0.33694207 0.32450338
3  C 0.85086370 0.22421260
4  C 0.42277349 0.29661256
5  C 0.18368665 0.15746971
6  C 0.05318403 0.80239232
#--------Andy Liaw--------------------------------------------
The easiest way (I can think of, anyway) is to use awk to extract the
data before reading into Splus. You didn't say what OS you're using,
which happens to Windoze users the most (they don't seem to realize
there *are* other OSs). There are several versions of awk under
Windoze, e.g., cygwin.
Assuming the files are delimited by white space, you can do something
like:
awk '$1=="I" || $1=="S"' original.data > filtered.data
at the command prompt. This extracts all rows where the first column
is either "I" or "S" and writes them out to the file "filtered.data".
You can even give awk all the input files at the same time, if you don't
need the output in separate files.
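As a rough sketch of the multiple-file case: the file names run1.data
and run2.data and their contents below are made up for illustration
(the real files have 8 numeric columns per row), and prefixing each row
with awk's built-in FILENAME variable is an addition of mine, not part
of the suggestion above. It may help to keep track of which file each
row came from.

```shell
# Work in a scratch directory so nothing existing is overwritten.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# Two small, made-up input files (hypothetical names and values).
printf 'I 1 2 3\nX 9 9 9\nS 4 5 6\n' > run1.data
printf 'X 0 0 0\nI 7 8 9\n' > run2.data

# Keep rows whose first field is "I" or "S", from all files at once.
# FILENAME is the name of the file awk is currently reading.
awk '$1 == "I" || $1 == "S" { print FILENAME, $0 }' run1.data run2.data > filtered.data

cat filtered.data
```

The resulting file has one extra leading column (the file name), so in
Splus the numbers would start in the third column rather than the
second. Drop "FILENAME," from the print statement if that column is not
wanted.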
#--------original message-------------------------------------------
> I have multiple data files [ASCII format]. Each file is sequentially
> numbered and contains a combination of text and numbers. My interest
> in the text is only to identify rows. I want the rows that begin with
> 'I' or 'S'. There are multiple occurrences of the 'I' and 'S'
> rows in each
> data file. I want to extract the numbers in those rows that
> begin with
> 'I' or 'S' for further analysis. Each row of interest contains 8
> columns of numbers. I would like to put those columns of numbers in 8
> separate columns of a data frame. Furthermore, I would like
> to put in a
> row of "NA" if the row does not exist in the data file.
Respectfully,
Frank R. Lawrence