List:       r-devel
Subject:    Re: [Rd] [patch] add is.set parameter to sample()
From:       Andrew Clausen <clausen () econ ! upenn ! edu>
Date:       2010-03-25 13:39:33
Message-ID: 57a6d2111003250639t209713a2n2e2d8d8ea51285eb () mail ! gmail ! com

Hi Martin,

I re-attached the patch with a filename that will hopefully get
through the filters this time.

I agree that the case where you want to specify an integer is already
well handled by sample.int().  I disagree that the resample() code
for the set case given in the example is trivial.  The user has to
copy the code into their program, which is annoying for such basic
functionality.  Moreover, the example code doesn't work for sampling
with replacement, and is poorly documented.  Finally, it isn't obvious
to new users of R what to do with resample().  (They would probably
try calling resample() without first cutting & pasting its definition
into their program.  And why is it called resample()?  It's a
mysterious name that suggests some technical concept, like resampling
digital audio from one sampling rate to another.)
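
To make the replacement point concrete, here is a minimal sketch of a
set-only helper that also works with replace=TRUE.  The name sample_set
is hypothetical (it is not in base R), and the sketch assumes
sample.int() is available, as it is in R-devel.

```r
# Hypothetical helper, not part of base R: always treat x as a set of
# items, regardless of its length or type.
sample_set <- function(x, size = length(x), replace = FALSE, prob = NULL) {
    x[sample.int(length(x), size, replace, prob)]
}

x <- 1:10
length(sample(x[x > 9]))                  # 10: sample(10) permutes 1:10
sample_set(x[x > 9])                      # 10
sample_set(x[x > 9], 5, replace = TRUE)   # 10 10 10 10 10
```

Indexing x by positions drawn with sample.int() sidesteps the length-1
heuristic entirely, which is why size and replace behave the same way
for every length of x.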

So, the upside of my patch is that sample() becomes more convenient,
and the documentation becomes simpler.  What's the downside?  The patch
is backwards compatible.
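
For readers who want to try the proposed semantics without rebuilding
R, the logic of the patch can be approximated by a standalone wrapper.
This is only a sketch: sample_is_set is a hypothetical name, and it
calls sample.int() where the patch calls .Internal(sample(...)).

```r
# Hypothetical wrapper mirroring the patched sample(); the real patch
# modifies sample() itself rather than adding a new function.
sample_is_set <- function(x, size, replace = FALSE, prob = NULL,
                          is.set = NULL) {
    is.natural <- function(x) length(x) == 1L && is.numeric(x) && x >= 1
    # is.set unspecified: fall back to the existing heuristic.
    if (is.null(is.set)) is.set <- !is.natural(x)
    if (!is.set) {
        # x is the size of the set, so sample from 1:x.
        stopifnot(is.natural(x))
        if (missing(size)) size <- x
        sample.int(x, size, replace, prob)
    } else {
        # x is the set itself, even if it has length 1.
        stopifnot(length(x) >= 1)
        if (missing(size)) size <- length(x)
        x[sample.int(length(x), size, replace, prob)]
    }
}

sample_is_set(6:6, 1, is.set = TRUE)    # always 6
sample_is_set(6, 1, is.set = FALSE)     # one draw from 1:6
sample_is_set(6:6, 1)                   # old behaviour: one draw from 1:6
```

Calls that never mention is.set take the same branch they take today,
which is the sense in which the change is backwards compatible.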

sample() is one of the most important functions in R... I teach it to
my undergraduate economics students in the first 20 minutes of their
first R lesson.  It is the first probability/statistics function they
learn.  It is important that it is easy and convenient to use.

The first R problem set I assigned my students was a Monte Carlo
simulation of the Monty Hall problem.  sample()'s surprise really
bites here because Monty has either one or two doors that he can
open.  It's bad enough that there is a surprise, but even worse
that there is no workaround that my students can easily understand.
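
A minimal sketch of such a simulation shows where the problem arises.
The helper pick() and the function monty_hall() are hypothetical names
for illustration; pick() assumes sample.int() is available.

```r
# Hypothetical helper: draw from x as a set, even when length(x) == 1.
pick <- function(x, size = 1) x[sample.int(length(x), size)]

monty_hall <- function(switch_door) {
    doors <- 1:3
    car   <- pick(doors)
    guess <- pick(doors)
    # Monty opens a door that hides neither the car nor the guess:
    # two candidates if guess == car, otherwise exactly one.
    opened <- pick(setdiff(doors, c(car, guess)))
    if (switch_door) guess <- setdiff(doors, c(guess, opened))
    guess == car
}

set.seed(1)
mean(replicate(10000, monty_hall(TRUE)))    # close to 2/3
mean(replicate(10000, monty_hall(FALSE)))   # close to 1/3
```

Writing sample(setdiff(doors, c(car, guess)), 1) instead of pick(...)
goes wrong whenever only one door is left: if that door is 3, for
example, sample(3, 1) draws from 1:3, so Monty may open the car's door
or the contestant's own door.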

Cheers,
Andrew

On 25 March 2010 06:53, Martin Maechler <maechler@stat.math.ethz.ch> wrote:
>>>>>> "AndrewC" == Andrew Clausen <clausen@econ.upenn.edu>
>>>>>>       on Tue, 23 Mar 2010 08:04:12 -0400 writes:
>
>      AndrewC> Hi all,
>      AndrewC> I forgot to test my patch!   I fixed a few bugs.
>
> and this time, you even forgot to attach it (in a way to pass
> through the list filters).
>
> Note however, that all this seems unnecessary,
> as we have   sample.int()
> and a trivial definition of resample()
> at least in R-devel, which will be released as R 2.11.0 on
> April 22.
>
> Thank you anyway, for your efforts!
> Martin
>
> Martin Maechler, ETH Zurich
>
>      AndrewC> On 22 March 2010 22:53, Andrew Clausen <clausen@econ.upenn.edu> wrote:
>      >> Hi all,
>      >>
>      >> sample() has some well-documented undesirable behaviour.
>      >>
>      >> sample(1:6, 1)
>      >> sample(2:6, 1)
>      >> ...
>      >> sample(5:6, 1)
>      >>
>      >> do what you expect, but
>      >>
>      >> sample(6:6, 1)
>      >> sample(1:6, 1)
>      >>
>      >> do the same thing.
>      >>
>      >> This behaviour is documented:
>      >>
>      >>       If 'x' has length 1, is numeric (in the sense of 'is.numeric') and
>      >>       'x >= 1', sampling _via_ 'sample' takes place from '1:x'.   _Note_
>      >>       that this convenience feature may lead to undesired behaviour when
>      >>       'x' is of varying length 'sample(x)'.   See the 'resample()'
>      >>       example below.
>      >>
>      >> My proposal is to add an extra parameter is.set to sample() to control
>      >> this behaviour.   If the parameter is unspecified, then we keep the old
>      >> behaviour for compatibility.   If it is TRUE, then we treat the first
>      >> parameter x as a set.   If it is FALSE, then we treat it as a set size.
>      >>   This means that
>      >>
>      >> sample(6:6, 1, is.set=TRUE)
>      >>
>      >> would return 6 with probability 1.
>      >>
>      >> I have attached a patch to implement this new option.
>      >>
>      >> Cheers,
>      >> Andrew
>      >>
>      AndrewC> ______________________________________________
>      AndrewC> R-devel@r-project.org mailing list
>      AndrewC> https://stat.ethz.ch/mailman/listinfo/r-devel
>

["sample.diff" (text/x-patch)]

diff --git a/src/library/base/R/sample.R b/src/library/base/R/sample.R
index 8d22469..01498c0 100644
--- a/src/library/base/R/sample.R
+++ b/src/library/base/R/sample.R
@@ -14,13 +14,17 @@
 #  A copy of the GNU General Public License is available at
 #  http://www.r-project.org/Licenses/
 
-sample <- function(x, size, replace=FALSE, prob=NULL)
+sample <- function(x, size, replace=FALSE, prob=NULL, is.set=NULL)
 {
-    if(length(x) == 1L && is.numeric(x) && x >= 1) {
+    is.natural <- function(x) length(x) == 1L && is.numeric(x) && x >= 1
+    if(is.null(is.set)) is.set <- !is.natural(x)
+    if(!is.set) {
+	stopifnot(is.natural(x))
 	if(missing(size)) size <- x
 	.Internal(sample(x, size, replace, prob))
     }
     else {
+	stopifnot(length(x) >= 1)
 	if(missing(size)) size <- length(x)
 	x[.Internal(sample(length(x), size, replace, prob))]
     }
diff --git a/src/library/base/man/sample.Rd b/src/library/base/man/sample.Rd
index 3929ff2..811fed2 100644
--- a/src/library/base/man/sample.Rd
+++ b/src/library/base/man/sample.Rd
@@ -12,26 +12,31 @@
   of \code{x} using either with or without replacement.
 }
 \usage{
-sample(x, size, replace = FALSE, prob = NULL)
+sample(x, size, replace = FALSE, prob = NULL, is.set=NULL)
 
 sample.int(n, size, replace = FALSE, prob = NULL)
 }
 \arguments{
   \item{x}{Either a (numeric, complex, character or logical) vector of
-    more than one element from which to choose, or a positive integer.}
+    elements from which to choose, or a positive integer.  The interpretation
+    depends on is.set, or heuristics described below.}
   \item{n}{a non-negative integer, the number of items to choose from.}
   \item{size}{positive integer giving the number of items to choose.}
   \item{replace}{Should sampling be with replacement?}
   \item{prob}{A vector of probability weights for obtaining the elements
     of the vector being sampled.}
+  \item{is.set}{Logical or \code{NULL}: should \code{x} be treated as a set
+    of items to sample from?  See the details below.}
 }
 \details{
-  If \code{x} has length 1, is numeric (in the sense of
-  \code{\link{is.numeric}}) and \code{x >= 1}, sampling \emph{via}
-  \code{sample} takes place from
-  \code{1:x}.  \emph{Note} that this convenience feature may lead to
-  undesired behaviour when \code{x} is of varying length
-  \code{sample(x)}.  See the \code{resample()} example below.
+  The \code{is.set} parameter controls whether \code{x} is interpreted as a
+  set of items to sample from (when \code{is.set} is \code{TRUE}), or as the
+  size of that set (when \code{is.set} is \code{FALSE}), in which case
+  the sample set is \code{1:x}.  If \code{is.set} is unspecified, then
+  \code{is.set} is set to \code{FALSE} when \code{x} has length 1, is numeric
+  (in the sense of \code{\link{is.numeric}}) and \code{x >= 1}, and to
+  \code{TRUE} otherwise.  \emph{Note} that when \code{x} is of varying length,
+  leaving \code{is.set} unspecified can lead to undesirable behaviour.
 
   By default \code{size} is equal to \code{length(x)}
   so that \code{sample(x)} generates a random permutation
@@ -93,13 +98,9 @@ x <- 1:10
     sample(x[x >  9]) # oops -- length 10!
 try(sample(x[x > 10]))# error!
 
-## This is safer, but only for sampling without replacement
-resample <- function(x, size, ...)
-  if(length(x) <= 1) { if(!missing(size) && size == 0) x[FALSE] else x
-  } else sample(x, size, ...)
-
-resample(x[x >  8])# length 2
-resample(x[x >  9])# length 1
-resample(x[x > 10])# length 0
+## This is safer
+sample(x[x >  8], is.set=TRUE)# length 2
+sample(x[x >  9], is.set=TRUE)# length 1
+sample(x[x > 10], is.set=TRUE)# length 0
 }
 \keyword{distribution}
