
Re: Reducing a data set

To: "Eric yang" <yang_eric9@yahoo.com>
Subject: Re: Reducing a data set
From: "Douglas Bates" <bates@stat.wisc.edu>
Date: Tue, 19 Sep 2006 21:39:10 -0500
Cc: S-News <s-news@lists.biostat.wustl.edu>
On 9/19/06, Eric yang <yang_eric9@yahoo.com> wrote:


Dear all,

I would like to know a fast way of reducing a data set based on certain
conditions. For example, suppose I have the following data set

my.data <- data.frame(ID=c(rep("101-1", 10), rep("102-12", 14),
rep("103-10", 3), rep("104-2", 8)), score=round(100*runif(35)))

> my.data
       ID score
1   101-1    85
2   101-1    32
3   101-1    22
4   101-1    74
5   101-1    48
6   101-1    47
7   101-1    46
8   101-1     6
9   101-1    58
10  101-1    37
11 102-12    16
12 102-12    78
13 102-12    15
14 102-12    45
15 102-12    99
16 102-12     4
17 102-12    99
18 102-12    35
19 102-12    78
20 102-12    16
21 102-12    91
22 102-12    34
23 102-12    10
24 102-12    20
25 103-10    43
26 103-10    12
27 103-10    57
28  104-2    86
29  104-2    45
30  104-2    85
31  104-2    81
32  104-2     9
33  104-2    40
34  104-2    47
35  104-2    74

I would like to reduce the data set such that I have at most the top 5
scores for each ID number. Thus, I would end up with the following data set:

1   101-1    85
4   101-1    74
5   101-1    48
6   101-1    47
9   101-1    58
12 102-12    78
15 102-12    99
17 102-12    99
19 102-12    78
21 102-12    91
25 103-10    43
26 103-10    12
27 103-10    57
28  104-2    86
30  104-2    85
31  104-2    81
34  104-2    47
35  104-2    74

Thanks for any help in advance.
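
The scores in the example are drawn with runif(), so they differ on every run, which is why the data printed in the reply do not match the data shown in the question. A minimal sketch, assuming R, of fixing the random seed so that the same data set can be regenerated; the seed value 1 is an arbitrary choice for illustration and is not from the thread:

```r
# Fix the random number generator state before simulating the scores.
set.seed(1)  # arbitrary seed, chosen only for illustration

# Same construction as in the question: 35 rows spread over four IDs.
my.data <- data.frame(ID = c(rep("101-1", 10), rep("102-12", 14),
                             rep("103-10", 3), rep("104-2", 8)),
                      score = round(100 * runif(35)))
```

Re-running both statements with the same seed should reproduce the same scores.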

> my.data
      ID score
1   101-1    79
2   101-1    53
3   101-1     1
4   101-1    49
5   101-1    84
6   101-1    14
7   101-1    17
8   101-1    90
9   101-1    99
10  101-1    85
11 102-12    16
12 102-12    79
13 102-12     9
14 102-12    86
15 102-12    43
16 102-12    89
17 102-12    55
18 102-12    33
19 102-12    93
20 102-12    93
21 102-12    61
22 102-12    80
23 102-12    40
24 102-12    36
25 103-10    35
26 103-10    85
27 103-10    98
28  104-2    22
29  104-2    96
30  104-2    44
31  104-2    57
32  104-2     1
33  104-2    76
34  104-2    24
35  104-2    11
> do.call("rbind", lapply(split(my.data, my.data$ID), function(fr)
+     fr[rev(order(fr$score)), ][1:min(5, nrow(fr)), ]))
             ID score
101-1.9    101-1    99
101-1.8    101-1    90
101-1.10   101-1    85
101-1.5    101-1    84
101-1.1    101-1    79
102-12.20 102-12    93
102-12.19 102-12    93
102-12.16 102-12    89
102-12.14 102-12    86
102-12.22 102-12    80
103-10.27 103-10    98
103-10.26 103-10    85
103-10.25 103-10    35
104-2.29   104-2    96
104-2.33   104-2    76
104-2.31   104-2    57
104-2.30   104-2    44
104-2.34   104-2    24
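
Note that the one-liner above returns each group sorted by descending score, whereas the output requested in the question keeps the selected rows in their original order. A minimal sketch of an order-preserving variant, assuming R (ave() and the ties.method argument of rank() may not be available in S-PLUS). The helper name top.n.in.order is hypothetical, and the scores are copied from the my.data printout above:

```r
# Data as printed in the reply above (scores copied from that output).
my.data <- data.frame(
  ID = rep(c("101-1", "102-12", "103-10", "104-2"), times = c(10, 14, 3, 8)),
  score = c(79, 53, 1, 49, 84, 14, 17, 90, 99, 85,
            16, 79, 9, 86, 43, 89, 55, 33, 93, 93, 61, 80, 40, 36,
            35, 85, 98,
            22, 96, 44, 57, 1, 76, 24, 11))

# Hypothetical helper (not from the thread): keep at most n top-scoring
# rows per ID, leaving the surviving rows in their original order.
top.n.in.order <- function(data, n = 5) {
  # Rank the scores within each ID so that rank 1 is the highest score.
  # ties.method = "first" breaks ties by row position, so each ID keeps
  # exactly min(n, group size) rows.
  rnk <- ave(-data$score, data$ID,
             FUN = function(x) rank(x, ties.method = "first"))
  data[rnk <= n, ]
}

res <- top.n.in.order(my.data, n = 5)
```

Because the rows are selected with a logical index rather than re-sorted, the original row names are kept, and no ID contributes more than n rows even when scores tie at the cut-off.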
