s-news
[Top] [All Lists]

removing duplicated line in a df and crosstabulating

To: S-Plus List <s-news@wubios.wustl.edu>
Subject: removing duplicated line in a df and crosstabulating
From: Antoine Guisan <Antoine.Guisan@cscf.unine.ch>
Date: Thu, 07 Dec 2000 08:50:14 +0100
Organization: CSCF
Hi S+ers,

I did not see this message appear on the list. Hence, I resend it. Sorry
if crossposting occurred.

I have a dataframe with the columns "cluster", "period" and "sp" (for
species), and a last column called "r1" filled with a series of 1, which
I use to aggregate the data (see below; not very elegant, but it works).

As a first step, I would like to remove those rows which have exactly
the same values for the three columns (there are many of them and the
dataframe has ca. 25'000 rows), e.g.

>come1
    cluster period   sp r1 
206      17      1 casp  1
207      17      1 casp  1
208      16      2 casp  1
276      12      1 casp  1
277      12      2 casp  1
278      11      2 casp  1
493       3      1 casp  1
494       3      1 casp  1
525      10      1 casp  1
784       6      2 cavi  1
...

To do this, I used the following command:

>aggregate.data.frame(come1,list(cluster=come1$cluster,period=come1$period,sp=come1$sp),FUN=mean)[,c(1,2,3,7)]
    cluster period   sp r1 
  cluster period   sp r1 
1       3      1 casp  1
2      10      1 casp  1
3      17      1 casp  1
4      11      2 casp  1
5      12      1 casp  1
6      12      2 casp  1
7      16      2 casp  1
8       6      2 cavi  1
...

but there is probably a much simpler way to reach the same result,
especially since some columns appeared to be duplicated in the resulting
dataframe (that why I have to specify a subset of columns in the final
brackets). Thanks for any better solution on this particular point !!

Second, I would then like to get a table with every single combination
of "cluster" and "period" defining the rows, and "sp" defining the
columns.  When I do this in Excel or Access, I usually use a column of
zero (see "r1" in the df above) to specify as the values to aggregate. 
The result should be a table with 0 or 1 as values for the species'
columns (I will have about 80 of them), of the type:

cluster  period  casp  cavi ...
3        1       1     0
6        2       0     1
10       1       1     0
11       2       1     0       
12       1       1     0       
12       2       1     0
16       2       1     0
17       1       1     0
...

I was thinking to use either the 'table' or 'crosstabs' functions, but
it does not really give me what I want. I need to be able to use the
resulting table as a dataframe for further analyses (i.e. being able to
read the "casp", "cavi", ... columns). 'Table' could perhaps work if I
could define the first argument as a factor defined from the combination
of "cluster" AND "period" (every single combination would be a level -->
see e.g. for cluster 12 in the table above). Entering all three columns
as arguments in 'table' results in a three-way array which is not
exactly what I want.

Thanks a lot in anticipation for your help.

I will summarize,

Cheers
Antoine
+----------------------------------------------------------------+
|                                                                |
|   Antoine Guisan, Dr.                                          |
|                                                                |
|   Swiss Center for Faunal Cartography (CSCF)                   |
|   Terreaux 14                   tel +41 32 724 92 97           |
|   CH-2000 Neuchâtel             fax +41 32 717 79 69           |
|   Suisse                        Antoine.Guisan@cscf.unine.ch   |
|                                 http://www.cscf.ch             |
|                                                                |
|   See also  http://www.cscf.ch/projets/odo/index_fr.htm        |
              http://www.cscf.ch/projets/chs/index.htm
|                                                                |
|   You can also send me a fax at my home office                 |
|                                 fax +41 21 683 30 67           |
|                                                                |
+----------------------------------------------------------------+

<Prev in Thread] Current Thread [Next in Thread>
  • removing duplicated line in a df and crosstabulating, Antoine Guisan <=