[prev in list] [next in list] [prev in thread] [next in thread]
List: r-devel
Subject: Re: [Rd] internal string comparison (Scollate)
From: <romain () r-enthusiasts ! com>
Date: 2014-03-31 14:44:54
Message-ID: 6773B8A8-4F2F-453C-9412-89F102CB1CC1 () r-enthusiasts ! com
[Download RAW message or body]
Hello,
The use case I have might involve sorting many small such STRSXP vectors.
If I have Scollate, I don’t need to materialize the vectors and I can use the sorting \
algorithm I choose.
Here is some made up data:
df <- data.frame(
x = sample( 1:10, 1000, replace = TRUE),
y = sample( 1:100, 100, replace = TRUE),
z = replicate( 10000, paste( sample(letters, sample(1:100, size = 1), replace = \
TRUE ), collapse = "" ) ), stringsAsFactors = FALSE
)
For which I’d like something like what order( df$x, df$y, df$z ) gives me.
For example:
> system.time( res1 <- order( df$x, df$y, df$z) )
utilisateur système écoulé
0.017 0.000 0.017
> system.time( res2 <- dplyr::order_( df$x, df$y, df$z ) )
utilisateur système écoulé
0.005 0.000 0.005
> identical( res1, res2 )
[1] TRUE
The way dplyr::order_ is implemented I don’t need to materialize 500 STRSXP vectors \
and call order or sort on them ( 492 == nrow( unique( df[, c("x", "y" ) ] ) ) )
I just need to be able to compare two scalars together (either two ints, two doubles, \
or two CHARSXP SEXP). We already have special code to handle what it means to compare \
int, double etc in the R world with NA and NaN, etc ...
Scollate would give a way to compare two CHARSXP SEXP, the way R would. Of course one \
has to be careful how it is called, I have read the source.
Materialising temporary values into an R vector may be the R way of doing things, but \
sometimes it is a waste of both memory and time. Yes, this is about performance. We \
are often asked to choose between performance and correctness when in fact we can \
have both.
Romain
Le 27 mars 2014 à 22:12, Duncan Murdoch <murdoch.duncan@gmail.com> a écrit :
> On 14-03-27 3:01 PM, Kevin Ushey wrote:
> > I too think it would be useful if R exported some version of its
> > string sorting routines, since sorting strings while respecting
> > locale, and doing so in a portable fashion while respecting the user's
> > environment, is not trivial. R holds a fast, portable, well-tested
> > solution, and I think package developers would be very appreciative if
> > some portion of this was exposed at the C level.
>
> It does. You can put your strings in an R STRSXP vector, and call the R sort \
> function on it.
> The usual objection to constructing an R expression and evaluating it is that it is \
> slow, but if you are talking about sorting, the time spent in the sort is likely to \
> dominate the time spent in the setup.
> >
> > If not `Scollate`, then perhaps other candidates could be the more
> > generic `sortVector`, or the more string-specific (and NA-respecting)
> > `scmp`.
>
> Evaluating an R expression gives you sortVector.
>
> I can see an argument for Scollate being useful (sorting isn't the only reason to \
> compare strings), but I can see arguments against exposing it too. Take a look at \
> the source: it needs to be used carefully. In particular, it can return a 0 for \
> unequal strings, and users are likely to get messed up by that, or to submit bogus \
> bug reports. And it's not impossible to work around: if you can collect the \
> universe of strings to compare in advance, then just use order() to convert them to \
> integer values, and compare those.
> Duncan Murdoch
>
> >
> > I understand that the volunteers at R Core have limited time and
> > resources, and exposing an API imposes additional maintenance burdens
> > on an already thinly stretched team, but this is a situation where the
> > R users and package authors alike could benefit. Or, if there are
> > other reasons why exporting such routines is not possible nor
> > recommended, it would be very informative to know why.
> >
> > Thanks,
> > Kevin
> >
> > On Thu, Mar 27, 2014 at 11:08 AM, Dirk Eddelbuettel <edd@debian.org> wrote:
> > >
> > > On 26 March 2014 at 19:09, Romain François wrote:
> > > > That's one part of the problem. Indeed I'd rather use something rather than
> > > > copy and paste it and run the risk of being outdated. The answer to that is
> > >
> > > We all would. But "they" won't let us by refusing to create more API access \
> > > points.
> > > > testing though. I can develop a test suite that can let me know I'm out of
> > >
> > > Correct.
> > >
> > > > date and I need to copy and paste some new code, etc ... Done that before, \
> > > > this
> > > > is tedious, but so what.
> > > >
> > > > The other part of the problem (the real part of the problem actually) is \
> > > > that,
> > > > at least when R is built with ICU support, Scollate will depend on a the
> > > > collator pointer in util.c
> > > > https://github.com/wch/r-source/blob/trunk/src/main/util.c#L1777
> > > >
> > > > And this can be controlled by the base::icuSetCollate function. Of course the
> > > > collator pointer is not public.
> > >
> > > So the next (and even less pleasant) answer is to build a new package which
> > > links to, (or worse yet, embeds) libicu.
> > >
> > > As you want ICU behaviour, you will need ICU code.
> > >
> > > Dirk
> > >
> > > --
> > > Dirk Eddelbuettel | edd@debian.org | http://dirk.eddelbuettel.com
> > >
> > > ______________________________________________
> > > R-devel@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > ______________________________________________
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic