Dear all,
My original question was this:
> Suppose I have a set of n observations, and I want to
> find out all possible unique ways in which I can split
> these observations into r groups of equal size n/r.
> For example if I have n=4 observations {1, 2, 3, 4} and I
> split these into r=2 groups I can have
> {1, 2} {3, 4}
> {1, 3} {2, 4}
> {1, 4} {2, 3}
> Can anyone suggest an efficient way I can do this in
> Splus? I need something which can cope with any r and n
> such that n/r is an integer. Every solution I have come
> up with involves endless loops, which I realise is far
> from ideal.
Thanks very much to Stephen Smith, Brian Ripley, Yuelin Li,
Ian Clough, Renaud Lancelot and John Maindonald. The
solution suggested by Professor Ripley concerning
splitting the observations up into groups sequentially is
probably the most relevant for my own purposes (since I am
interested in all distinct *sets* of subsets, not simply
all distinct *subsets* - I don't think I was clear enough
about that in my original message; for example for the
numbers 1,2,3,4 into two groups of size two, I would get
three distinct possibilities, namely ({1,2},{3,4}),
({1,3},{2,4}) and ({1,4},{2,3})). All the responses
received have been useful along the way, so thanks to all.
I've had a few requests from people to forward any
solutions to the list, so they are included below:
-------------------------------------------------------------
See page 149 of Venables and Ripley (2nd ed) for the following function
subsets<-function(r,n,v=1:n)
{
if(r<=0) NULL else
if(r>=n) v[1:n] else
rbind(cbind(v[1],Recall(r-1,n-1,v[-1])), Recall(r, n-1,v[-1]))
}
e.g.,
> subsets(2,4)
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
>
(from Stephen Smith)
----------------------------------------------------------
Start with 2 groups, one of size k= n/r and one of size n-k, then
split off another k and so on.
For two groups, Bill Venables' subsets function does the trick.
subsets <- function(n, r, s = 1:n) {
if(mode(n) != "numeric" || length(n) != 1
|| n < 1 || (n %% 1) != 0) stop("bad value of n")
if(mode(r) != "numeric" || length(r) != 1
|| r < 1 || (r %% 1) != 0) stop("bad value of r")
if(!is.atomic(s) || length(s) < n)
stop("s is either non-atomic or too short")
fun <- function(n, r, s)
if(r <= 0) vector(mode(s), 0) else if(r >= n) s[1:n] else
rbind(cbind(s[1], Recall(n - 1, r - 1, s[-1])),
Recall(n - 1, r, s[-1]))
fun(n, r, s)
}
Here n is n, r is k and s is the set of labels for the current group.
subsets(4, 2)
> subsets(4, 2)
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since (I presume) you do not care about the group ordering you can always
put the minimum element in the first group, and thereby simplify this a bit.
(from Brian Ripley)
----------------------------------------------------------------
Hi, I have an example for the unique combinations of n choose 3.
Perhaps it might help you in solving the n choose r problem?
> x <- 1:9
> y <- cbind(rep(x, length(x)^2), rep(rep(x, each=length(x)), length(x)),
rep(x, each=length(x)^2))
> y2 <- y[c(y[,1] != y[,2] & y[,2]!=y[,3] & y[,1]!=y[,3] & y[,1] < y[,2]
& c(y[, 2] < y[, 3])), ]
> dim(y2)
(from Yuelin Li)
-----------------------------------------------------------------
An easy solution is to write a funtion like this found in VR (*) 2nd ed.
p. 149 or in the S-news archives:
subsets <- function(n, k, set = 1:n)
if(k <= 0) NULL else
if(k >= n) set else
rbind(cbind(set[1], Recall(n - 1, k - 1, set[-1])),
Recall(n - 1, k, set[-1]))
then:
> subsets(4, 2)
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Hope this helps,
Renaud
(*) Venables, W. N. and B. D. Ripley (1998). Modern applied statistics
with S-Plus. New York (USA), Springer-Verlag.
(from Renaud Lancelot)
---------------------------------------------------------------
I had the same dilemma myself with loops
and generating permutations. Below are 2 emails, one from
Professor Ripley and the other from Dr. Venables. They
should help you decide whether S-plus will do the job ( i.e
n,r aren't too large ) or whether like me you have to move
to a C routine called from within S-plus. See Chapter 16 of
the S-Plus 2000 Programmer's Guide for details on creating
the DLL file. Credit should be given to messrs Ripley and
Venables if this works for you ( and the S-Plus routine
from Dr. Venables is the fastest I know of for an S-plus
solution)
Hope this helps,
Ian Clough
cloughi@email.uc.edu
Dr. Venables
Doing things in a loop can be rather
costly in S, too, but if you need to generate all possible
subsets sequentially here is *one*
way to do it.
(Code below; figuring out how it works, and improving it,
remains an exercise for the casual onlooker.)
It returns the next subset at each call, and NULL when the
sequence is finished. Here is an example of how you might
use it:
> while(!is.null(m <- next.subset(5, 3)))
{
# `m' holds the index of the subset
print(m)
}
[1] 1 2 3
[1] 1 2 4
[1] 1 2 5
[1] 1 3 4
[1] 1 3 5
[1] 1 4 5
[1] 2 3 4
[1] 2 3 5
[1] 2 4 5
[1] 3 4 5
>
(Instead of "print(m)" you may wish to do something more
interesting...)
---(cut here)---
next.subset <- function(n, r) {
if(exists(NR <- paste(n, r), frame = 0)) {
nr <- get(NR, frame = 0)
if(any(w <- (nr < (n-r+1):n))) {
m <- max((1:r)[w])
k <- m:r
nr[k] <- k + nr[m] - m + 1
}
else {
remove(NR, frame = 0)
return(NULL)
}
}
else
nr <- 1:r
assign(NR, nr, frame = 0)
nr
}
cancel <- function(n, r)
invisible(if(exists(nr <- paste(n, r), frame = 0))
remove(nr, frame = 0))
---(cut here)---
If you wish to re-start the sequence you should use the
function
> cancel(5, 3)
to go back to the beginning.
The function is just a way of presenting an algorithm. If
you really are committed to using a very long loop in S
code, you probably should build in the next.subset
computation as part of the loop rather than calling a
function every time, but there may not be very much in it,
especially if each step is going to be a pretty large
computation, anyway. This is the sort of territory where
dynamically loaded code starts to become unavoidable.
Professor Ripley
...there is C code to do precisely that in the lqs function
in the V&R3 MASS
library, as in
/*
Find all subsets of size k in order: this gets a new one
each call */ void next_set(Sint * x, int n, int k) { int
i, j, tmp;
j = k - 1;
tmp = x[j]++;
while (j > 0 && x[j] >= n - (k - 1 - j)) tmp =
++x[--j]; for (i = j + 1; i < k; i++) x[i] = ++tmp;
}
and you will need to look at the code to see how it is
initialized (x is 1:k). (This is often called a Gray code
algorithm.)
And although I originally wrote this as a loop in S, doing
100000 subsets meant that I needed to do it all in C.
(from Ian Clough)
-----------------------------------------------------------
Helen -
You seem to be asking for a special case of the obtaining
of all nCm combinations. I cannot now remember where I found
the following, probably on statlib:
> combn
function(x, m, fun = NULL, simplify = T, ...)
{
# DATE WRITTEN: 14 April 1994 LAST REVISED: 10 July 1995
# AUTHOR: Scott Chasalow
#
# DESCRIPTION:
# Generate all combinations of the elements of x taken m at a
time.
# If x is a positive integer, returns all combinations
# of the elements of seq(x) taken m at a time.
# If argument "fun" is not null, applies a function given
# by the argument to each point. If simplify is FALSE, returns
# a list; else returns a vector or an array. "..." are passed
# unchanged to function given by argument fun, if any.
# REFERENCE:
# Nijenhuis, A. and Wilf, H.S. (1978) Combinatorial Algorithms
for
# Computers and Calculators. NY: Academic Press.
# EXAMPLES:
# > combn(letters[1:4], 2)
# > combn(10, 5, min) # minimum value in each combination
# Different way of encoding points:
# > combn(c(1,1,1,1,2,2,2,3,3,4), 3, tabulate, nbins = 4)
# Compute support points and (scaled) probabilities for a
# Multivariate-Hypergeometric(n = 3, N = c(4,3,2,1)) p.f.:
# > table.mat(t(combn(c(1,1,1,1,2,2,2,3,3,4), 3,
tabulate,nbins=4)))
#
if(length(m) > 1) {
warning(paste("Argument m has", length(m), "elements: only the
first used")
)
m <- m[1]
}
if(m < 0)
stop("m < 0")
if(m == 0)
return(if(simplify) vector(mode(x), 0) else list())
if(is.numeric(x) && length(x) == 1 && x > 0 && trunc(x) == x)
x <- seq(x)
n <- length(x)
if(n < m)
stop("n < m")
e <- 0
h <- m
a <- 1:m
nofun <- is.null(fun)
count <- nCm(n, m, 0.1)
out <- vector("list", count)
out[[1]] <- if(nofun) x[a] else fun(x[a], ...)
if(simplify) {
dim.use <- NULL
if(nofun) {
if(count > 1)
dim.use <- c(m, count)
}
else {
out1 <- out[[1]]
d <- dim(out1)
if(count > 1) {
if(length(d) > 1)
dim.use <- c(d, count)
else if(length(out1) > 1)
dim.use <- c(length(out1), count)
}
else if(length(d) > 1)
dim.use <- d
}
}
i <- 2
nmmp1 <- n - m + 1
mp1 <- m + 1
while(a[1] != nmmp1) {
if(e < n - h) {
h <- 1
e <- a[m]
j <- 1
}
else {
h <- h + 1
e <- a[mp1 - h]
j <- 1:h
}
a[m - h + j] <- e + j
out[[i]] <- if(nofun) x[a] else fun(x[a], ...)
i <- i + 1
}
if(simplify) {
if(is.null(dim.use))
out <- unlist(out)
else out <- array(unlist(out), dim.use)
}
out
}
(from John Maindonald)
-------------------------------------------------------------
Dr. Helen Clough,
Postdoctoral Research Assistant,
Department of Veterinary Clinical Sciences,
Leahurst,
Neston,
South Wirral
CH64 7TE
e-mail: wilsonh@liverpool.ac.uk
-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news
|