Re: group by

To: "'Wensui Liu'" <liuwensui@gmail.com>, "'Neung-Hwan Oh'" <ultisol@gmail.com>
Subject: Re: group by
From: <Rich@Mango-Solutions.com>
Date: Thu, 15 Mar 2007 09:38:13 -0000
Cc: <s-news@wubios.wustl.edu>
In-reply-to: <1115a2b00703141607u451fb30fg116edd4ea83d1164@mail.gmail.com>
I think of the "tapply" function as returning a (simplified) list structure.
When the function returns a single value per group, the result is a
single-mode structure (a vector, matrix or array, depending on the number of
"by" variables).  When the function returns several values per group, the
result is a list (e.g. try
"tapply(fuel.frame$Mileage, fuel.frame$Type, range)").

> data
        group          x 
1 treatment_1 -0.8514052
2 treatment_2 -0.2327822
3 treatment_1  0.3106438
4 treatment_2 -0.5681113
5 treatment_1 -1.2298195
6 treatment_2 -0.1028086
7 treatment_1  0.8201559
8 treatment_2 -0.9596621

> tapply(data$x, data$group, mean) # Returns a vector
 treatment_1 treatment_2 
  -0.2376172  -0.4658410

> tapply(data$x, data$group, range) # Returns a list
$"treatment_1":
[1] -1.2298195  0.8201559

$"treatment_2":
[1] -0.9596621 -0.1028086

The aggregate function returns a data frame.  However, the function you
pass to aggregate can only return a single value per group:

> aggregate(data$x, data$group, mean) 
                  Group          x 
treatment_1 treatment_1 -0.2376172
treatment_2 treatment_2 -0.4658410

So, whether you use tapply or aggregate on your data, you'll still need to
do some manipulation of the results.
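
As a rough sketch of that manipulation, written in R syntax (S-PLUS may
differ in the details), the list returned by tapply can be bound into a
matrix and then turned into a data frame.  The names rng, rng.mat and rng.df
are made up for illustration; the data are the eight values printed above.

```r
# Data as printed earlier in this message
data <- data.frame(
  group = factor(rep(c("treatment_1", "treatment_2"), 4)),
  x = c(-0.8514052, -0.2327822, 0.3106438, -0.5681113,
        -1.2298195, -0.1028086, 0.8201559, -0.9596621)
)

# A multi-value function makes tapply return a list, one element per group
rng <- tapply(data$x, data$group, range)

# Stack the list elements as rows of a matrix (row names are the groups)
rng.mat <- do.call(rbind, rng)
colnames(rng.mat) <- c("min", "max")

# Turn the matrix into a data frame with the group as an ordinary column
rng.df <- data.frame(group = rownames(rng.mat), rng.mat, row.names = NULL)
```

The same pattern should work for any function that returns a fixed number
of values per group.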

Note: There is a small bug in aggregate (fixed in S+7), where the ellipsis
("...") arguments are not passed through to the function call.
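
One possible way around this, sketched here in R syntax, is to wrap the
extra arguments in an anonymous function so that nothing has to travel
through "...".  Note that R's aggregate expects the "by" argument as a list;
the data values and the name res are hypothetical.

```r
# Hypothetical data with one missing value
data <- data.frame(
  group = factor(rep(c("treatment_1", "treatment_2"), 4)),
  x = c(1, 2, NA, 4, 5, 6, 7, 8)
)

# Rather than aggregate(..., mean, na.rm = TRUE), fix the extra
# argument inside a wrapper function
res <- aggregate(data$x, list(Group = data$group),
                 function(v) mean(v, na.rm = TRUE))
```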

Cheers,
Rich.

mangosolutions

-----Original Message-----
From: Wensui Liu [mailto:liuwensui@gmail.com] 
Sent: 14 March 2007 23:07
To: Neung-Hwan Oh
Cc: Rich@mango-solutions.com; s-news@wubios.wustl.edu
Subject: Re: [S] group by

The following is an example copied from my blog; HTH.

CALCULATE GROUP SUMMARY IN R

##################################################
# HOW TO CALCULATE GROUP SUMMARY IN R            #
##################################################
# EQUIVALENT SAS CODE:                           #
#                                                #
# DATA DATA;                                     #
#   DO I = 1 TO 2;                               #
#     DO J = 1 TO 4;                             #
#       GROUP = 'TREATMENT_'||PUT(I, 1.);        #
#       X = RANNOR(1);                           #
#       OUTPUT;                                  #
#     END;                                       #
#   END;                                         #
#   KEEP GROUP X;                                #
# RUN;                                           #
#                                                #
# PROC SQL;                                      #
#   CREATE TABLE COMBINE AS                      #
#   SELECT *, MEAN(X) AS MEAN_X, SUM(X) AS SUM_X #
#   FROM DATA                                    #
#   GROUP BY GROUP;                              #
# QUIT;                                          #
##################################################

# GENERATE A TREATMENT GROUP #
group <- as.factor(paste("treatment", rep(1:2, 4), sep = "_"))

# CREATE A SERIES OF RANDOM VALUES #
x <- rnorm(length(group))

# CREATE A DATA FRAME TO COMBINE THE ABOVE TWO #
data <- data.frame(group, x)

# CALCULATE SUMMARY FOR X #
x.mean <- tapply(data$x, data$group, mean, na.rm = TRUE)
x.sum <- tapply(data$x, data$group, sum, na.rm = TRUE)

# CREATE A DATA FRAME TO COMBINE SUMMARIES #
summ <- data.frame(x.mean, x.sum, group = names(x.mean))

# COMBINE DATA AND SUMMARIES TOGETHER #
combine <- merge(data, summ, by = "group")
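
As an alternative sketch in R, ave() can attach the group summaries to each
row directly, without building a separate summary table and merging.  The
name combine2 and the seed are made up for illustration and are not part of
the original example.

```r
set.seed(1)  # only to make this sketch reproducible
group <- as.factor(paste("treatment", rep(1:2, 4), sep = "_"))
x <- rnorm(length(group))
data <- data.frame(group, x)

# ave() returns a vector as long as x, holding each row's group statistic
combine2 <- data
combine2$x.mean <- ave(data$x, data$group, FUN = mean)
combine2$x.sum <- ave(data$x, data$group, FUN = sum)
```

Unlike merge(), which by default sorts the result by the key column, this
keeps the rows in their original order.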

On 3/14/07, Neung-Hwan Oh <ultisol@gmail.com> wrote:
> Thanks a lot for the quick replies.
> But, doesn't "tapply" or "aggregate" provide a matrix format rather
> than data frame format?
>
> On 3/14/07, Rich@mango-solutions.com <Rich@mango-solutions.com> wrote:
> > Sorry ...
> >
> > I meant "aggregate(df$value, df[,1:3], mean)"
> >
> > Rich.
> > mangosolutions
> >
> > -----Original Message-----
> > From: Rich@Mango-Solutions.com [mailto:Rich@Mango-Solutions.com]
> > Sent: 14 March 2007 21:14
> > To: 'Wensui Liu'; 'Neung-Hwan Oh'
> > Cc: s-news@wubios.wustl.edu
> > Subject: Re: [S] group by
> >
> > Yes ... you can use tapply in S.  For the structure you're using, I'd
> > probably recommend "aggregate" though ...
> >
> > Something like "aggregate(df[,1:3], df$value, mean)"
> >
> > Rich.
> > mangosolutions
> >
> > -----Original Message-----
> > From: Wensui Liu [mailto:liuwensui@gmail.com]
> > Sent: 14 March 2007 21:02
> > To: Neung-Hwan Oh
> > Cc: s-news@wubios.wustl.edu
> > Subject: Re: [S] group by
> >
> > I am not sure whether S-PLUS has a nice function like tapply() in R
> > or not.
> >
> > On 3/14/07, Neung-Hwan Oh <ultisol@gmail.com> wrote:
> > > Hello,
> > >
> > > How can you calculate the following example in S-PLUS?  In Access,
> > > it is relatively easy with "Group By" and I am wondering whether
> > > there is a similar function that I missed in S-PLUS.
> > >
> > >
> > >
> > > "From this table"
> > >
> > > site.no  date        time   value
> > > 1        1989/04/27  12:00  1.0
> > > 2        1975/10/01  19:00  2.0
> > > 2        1975/10/01  20:00  4.0
> > > 3        1993/04/10  09:00  3.0
> > > 3        1993/04/10  12:00  6.0
> > > 3        1993/04/10  15:00  9.0
> > >
> > > "To this (averages per date per site)" + (count column?)
> > >
> > > 1        1989/04/27  12:00  1.0  (1.0)
> > > 2        1975/10/01  19:30  3.0  (2.0)
> > > 3        1993/04/10  12:00  6.0  (3.0)
> > >
> > >
> > >
> > > Many thanks!
> > >
> > > NH
> >
> >
> > --
> > WenSui Liu
> > A lousy statistician who happens to know a little programming
> > (http://spaces.msn.com/statcompute/blog)
> > --------------------------------------------------------------------
> > This message was distributed by s-news@lists.biostat.wustl.edu.  To
> > unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
> > the BODY of the message:  unsubscribe s-news
> >
>


-- 
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)