> From: s-news-owner@lists.biostat.wustl.edu
> [mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Data
Analytics Corp.
> Sent: Tuesday, November 11, 2008 5:00 AM
> To: s-news@lists.biostat.wustl.edu
> Subject: [S] Selecting data from a data frame
>
> Hi,
>
> I have a data frame of countries that has the following three
> variables:
> the country's isocode (a 3-letter code for the country), year and
> population size in that year. Let's say there are 3 countries and
> anywhere from 1 - 6 years of population data for each country.
> Basically, it's a panel dataset. An example would be:
>
> isocode year pop
> usa 1990 10
> usa 1991 12
> usa 1992 15
> usa 1993 13
> usa 1994 16
> usa 1995 17
> can 1992 5
> can 1993 6
> gbr 1997 15
>
> I want to create another data frame that has just 3 rows, one for each
> country, the most recent year for that country (i.e., the max year),
> and the population size for that year for that country. How do I do
> this in the easiest way possible? In SQL, I would use
>
> select isocode, max(year), pop
> from isocode
> group by isocode
> having year = max(year)
> order by isocode;
>
> which gives the answer
>
> isocode year pop
> usa 1995 17
> can 1993 6
> gbr 1997 15
You could use the following:
> i<-unlist(lapply(X=split(seq(len=nrow(data)), data$iso),
FUN=function(iso,data)iso[which(data$year[iso]==max(data$year[iso]))],
data=data))
> data[i,]
isocode year pop
8 can 1993 6
9 gbr 1997 15
6 usa 1995 17
Note that the isocodes are in alphabetical order. If you
want to use the order in which the first instance of each
isocode was found in the original data.frame, replace the
'data$iso' in the call to split() with
factor(data$iso, levels=unique(data$iso))
If there is a chance that there will be more than one entry
for a given isocode/year combination, add [1] after the
which() to select the first entry.
Bill Dunlap
TIBCO Spotfire Inc
wdunlap tibco.com
|