Comparing data from two dataframes

To: "'s-news@lists.biostat.wustl.edu'" <s-news@lists.biostat.wustl.edu>
Subject: Comparing data from two dataframes
From: "Austin, Matt" <maustin@amgen.com>
Date: Fri, 15 Nov 2002 10:23:02 -0800
I have two dataframes, df.1 and df.2.  Each dataframe has the variables
subjectID and studyday.  I want to create a flag in df.1 that indicates
whether the studyday in df.1 falls in a 28-day period starting at any
studyday in df.2 for the same subject.

I have included some simple code to create the data structures and also
attached two manually annotated dataframes that mark which observations
should be flagged.  

Two possible approaches I have used:
1.  Nested "for" loops, where the outer loop subsets df.1 by subject and
the inner loop compares the individual records to df.2.

2.  by() to loop through the subjects in df.1 and, within each subject,
apply() to cycle through the individual observations and compare them
against df.2.

Both are prohibitively slow, and I know there must be a better method.  My
data is usually of moderate size (df.1 ~ 30000 observations, df.2 ~ 5000
observations).

df.1 <- data.frame( subjectID = rep( 1:20, each = 10 ),
                    studyday  = unlist( lapply( 1:20, function( x )
                        c( 1, sort( sample( 2:100, 9 ) ) ) ) ) )

df.2 <- data.frame( subjectID = sort( sample( unique( df.1$subjectID ), 20,
                                              replace = T ) ),
                    studyday  = sample( 1:100, 20, replace = T ) )

> df.1[1:20,  ]
   subjectID studyday
 1         1        1
 2         1       26*    in 28-day period starting at day 22 in df.2
 3         1       36*    in 28-day period starting at day 22 in df.2
 4         1       47*    in 28-day period starting at day 22 in df.2
 5         1       62
 6         1       68
 7         1       76
 8         1       79
 9         1       87
10         1       92
11         2        1
12         2       45*    in 28-day period starting at day 36 in df.2
13         2       46*    in 28-day period starting at day 36 in df.2
14         2       47*    in 28-day period starting at day 47 in df.2
15         2       53*    in 28-day period starting at day 47 in df.2
16         2       69*    in 28-day period starting at day 47 in df.2
17         2       76
18         2       84
19         2       89
20         2       93
> df.2[1:10,  ]
   subjectID studyday 
 1         1       22
 2         2       36
 3         2       47
 4         5       17
 5         5       59
 6         7       23
 7         8       17
 8         8       34
 9         9       42
10         9       36
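
For concreteness, approach 1 amounts to roughly the sketch below.  The
function name flag.window and the width argument are made up for
illustration, and I am assuming the 28-day period covers days d through
d + 27 inclusive for a df.2 studyday d.  The small check at the end uses
only the annotated rows for subjects 1 and 2 shown above.

```r
# Sketch of approach 1 (nested loops). flag.window is a hypothetical name.
# Assumption: the window for a df.2 studyday d is d, d + 1, ..., d + width - 1.
flag.window <- function(df.1, df.2, width = 28) {
  flag <- rep(FALSE, nrow(df.1))
  # Outer loop: one pass per subject in df.1
  for (id in unique(df.1$subjectID)) {
    starts <- df.2$studyday[df.2$subjectID == id]
    if (length(starts) == 0) next   # subject has no records in df.2
    # Inner loop: compare each df.1 record for this subject to df.2
    for (i in which(df.1$subjectID == id)) {
      day <- df.1$studyday[i]
      flag[i] <- any(day >= starts & day < starts + width)
    }
  }
  flag
}

# Annotated rows for subjects 1 and 2, copied from the listings above
ex.1 <- data.frame(subjectID = rep(1:2, each = 10),
                   studyday  = c(1, 26, 36, 47, 62, 68, 76, 79, 87, 92,
                                 1, 45, 46, 47, 53, 69, 76, 84, 89, 93))
ex.2 <- data.frame(subjectID = c(1, 2, 2), studyday = c(22, 36, 47))
ex.1$flag <- flag.window(ex.1, ex.2)
```

On these rows the flag should be TRUE exactly where the listing above has
an asterisk.  The inner loop over every record is what makes this too
slow at 30000 observations.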

Thanks for any advice,

--Matt



