I received one response, from Don MacQueen, to my question about using
unpaste to extract information from a column with a variable number of
delimited fields. The solution is to pad the column so that all rows have
the same number of fields; unpaste then extracts whichever element is
desired without difficulty.

For the situation previously described, the following steps suffice:

# count the number of fields in each row of the column
nfields <- count.fields.str(d.df$rate, sep="~")
# calculate the number of delimiters needed to pad each row
npad <- max(nfields) - nfields
# create a vector of length(d.df$rate) with the appropriate pad characters
# (this form handles at most two missing fields, i.e. up to three fields per row)
padchars <- ifelse(npad==0, "", ifelse(npad==1, "~", "~~"))
# pad both columns with the appropriate number of field delimiters
# (rate and rateUnit have the same number of fields within a row)
d.df$rate <- paste(d.df$rate, padchars, sep="")
d.df$rateUnit <- paste(d.df$rateUnit, padchars, sep="")
# continue extraction of desired fields
d.df$okRate <- numeric(length(d.df$rate))
d.df$okRtun <- as.character(numeric(length(d.df$rate)))
for (i in min(d.df$posKey):max(d.df$posKey)) {
  # rows whose desired field is field number i
  sel <- d.df$posKey == i
  d.df$okRate[sel] <- as.numeric(unpaste(d.df$rate[sel], sep="~")[[i]])
  d.df$okRtun[sel] <- as.character(unpaste(d.df$rateUnit[sel], sep="~")[[i]])
}

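For anyone without unpaste or count.fields.str at hand, the same
pad-then-extract idea can be sketched with strsplit. This is only an
illustration under stated assumptions: the toy values, the helper names
count_fields and make_pad, and the "END" sentinel are all hypothetical and
not part of the original data or of Don's suggestion. The padding step is
kept to mirror the approach above, with a pad built for any number of
missing fields.

```r
# Toy data modelled on the rows shown in the original message (hypothetical values).
d <- data.frame(
  rate   = c("35", "~560", "280~", "~140~.25"),
  posKey = c(1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Count "~"-delimited fields: number of delimiters plus one.
count_fields <- function(x, sep = "~") {
  nchar(x) - nchar(gsub(sep, "", x, fixed = TRUE)) + 1
}

# Build a pad of n delimiters for each row (works for any n, including 0).
make_pad <- function(n, sep = "~") {
  sapply(n, function(k) paste(rep(sep, k), collapse = ""))
}

nfields <- count_fields(d$rate)
padded  <- paste(d$rate, make_pad(max(nfields) - nfields), sep = "")

# Split every padded string, then pick field posKey[i] from row i.
# The trailing sentinel keeps strsplit from dropping empty fields at the end.
parts  <- strsplit(paste(padded, "~END", sep = ""), "~", fixed = TRUE)
okRate <- as.numeric(mapply(function(p, k) p[k], parts, d$posKey))
```

Picking the field row by row with mapply avoids the loop over posKey, at
the cost of splitting every string once.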
Feel free to comment on better approaches. This approach works, but
there is likely room for improvement. Thanks to Don MacQueen for his
helpful suggestion.

Peter Scherer
Dow AgroSciences
-----Original Message-----
From: Scherer, Peter [mailto:pscherer@dowagro.com]
Sent: Wednesday, February 09, 2000 12:10 PM
To: 's-news@wubios.wustl.edu'
Subject: [S] : Data extraction using unpaste
Hello all,
I have a data extraction problem when using unpaste to crop out the field
information desired from the rate and rateUnit columns. Below is a portion
of the data frame in question.
> d.df[c(1,1000,2000,3000,4000),c("trtName","rate","rateUnit","nfName","posKey")]
     trtName                  rate      rateUnit                 nfName posKey
1    2,4-DB                   35        GM AI/HA                 1      1
1123 UNTREATED~2,4-D AMINE 4  ~560      ~GM AE/HA                2      2
4598 ~2,4-D AMINE 4           280~      GM AI/HA~GM AI/HA        2      2
5863 2,4-D AMINE 4            420.2964  GM AE/HA                 1      1
7917 ~2,4-D ACID~X-77         ~140~.25  GM AE/HA~GM AI/HA~% V/V  3      2

Here trtName, rate, and rateUnit each hold a variable number of fields
delimited by "~" (the number of fields is identical within a row),
nfName gives the total number of fields for the given row,
and posKey identifies which field is to be extracted.
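
To make the layout concrete, here is a small sketch in R-style code that
recomputes the field count and checks the two properties stated above. The
three rows and the helper name n_fields are hypothetical, chosen only to
resemble the listing.

```r
# Hypothetical rows in the shape described above.
d <- data.frame(
  rate     = c("35", "~560", "~140~.25"),
  rateUnit = c("GM AI/HA", "~GM AE/HA", "GM AE/HA~GM AI/HA~% V/V"),
  posKey   = c(1, 2, 2),
  stringsAsFactors = FALSE
)

# Fields per string = number of "~" delimiters + 1.
n_fields <- function(x) {
  nchar(x) - nchar(gsub("~", "", x, fixed = TRUE)) + 1
}

nfName <- n_fields(d$rate)

# rate and rateUnit should agree row by row,
# and posKey must not point past the last field.
same_count <- nfName == n_fields(d$rateUnit)
key_ok     <- d$posKey <= nfName
```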
I attempt to extract the pertinent rate and rateUnit information (denoted by
posKey) using the following construct

# extract associated rate and rate units using posKey
d.df$okRate <- numeric(length(d.df$rate))
d.df$okRtun <- as.character(numeric(length(d.df$rate)))
for (i in min(d.df$posKey):max(d.df$posKey)) {
  d.df$okRate[d.df$posKey==i] <-
    as.numeric(unpaste(d.df$rate[d.df$posKey==i], sep="~")[[i]])
  d.df$okRtun[d.df$posKey==i] <-
    as.character(unpaste(d.df$rateUnit[d.df$posKey==i], sep="~")[[i]])
}

For those cases where posKey == nfName (note that nfName is not used in the
loop, but is the result of a separate calculation), the results are
as desired. However, where posKey != nfName, the result is NA, with the
following warnings.

Warning messages:
1: 4783 entries set to NA due to wrong number of fields in:
unpaste(d.df$rate\
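
One way to locate the offending rows is to cross-tabulate posKey against the
field count. The sketch below is in R-style code and assumes the warning
arises when the strings passed to unpaste in one call do not all split into
the same number of fields; the five values and the posKey assignments are
hypothetical.

```r
# Hypothetical rate strings and the field each row wants.
rate   <- c("35", "~560", "280~", "420.2964", "~140~.25")
posKey <- c(1, 2, 2, 1, 2)

# Fields per string = number of "~" delimiters + 1.
nf <- nchar(rate) - nchar(gsub("~", "", rate, fixed = TRUE)) + 1

# Rows of the table are posKey groups, columns are field counts.
tab <- table(posKey, nf)

# A posKey group is ragged if its strings have more than one field count.
ragged <- rowSums(tab > 0) > 1
```

Here the posKey == 2 group mixes two-field and three-field strings, so it is
the group flagged as ragged.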