List:       r-devel
Subject:    Re: [Rd] internal string comparison (Scollate)
From:       <romain () r-enthusiasts ! com>
Date:       2014-03-31 14:44:54
Message-ID: 6773B8A8-4F2F-453C-9412-89F102CB1CC1 () r-enthusiasts ! com

Hello, 

The use case I have might involve sorting many such small STRSXP vectors.

If I have Scollate, I don’t need to materialize the vectors and I can use the
sorting algorithm I choose.

Here is some made up data: 

df <- data.frame(
  x = sample( 1:10, 1000, replace = TRUE),
  y = sample( 1:100, 100, replace = TRUE),
  z = replicate( 10000, paste( sample(letters, sample(1:100, size = 1), replace = TRUE ), collapse = "" ) ),
  stringsAsFactors = FALSE
)

For which I’d like something like what order( df$x, df$y, df$z ) gives me. 

For example: 

> system.time( res1 <- order( df$x, df$y, df$z) )
   user  system elapsed
  0.017   0.000   0.017
> system.time( res2 <- dplyr::order_( df$x, df$y, df$z ) )
   user  system elapsed
  0.005   0.000   0.005
> identical( res1, res2 )
[1] TRUE

The way dplyr::order_ is implemented, I don’t need to materialize about 500
STRSXP vectors (one per distinct (x, y) pair) and call order or sort on them
( 492 == nrow( unique( df[, c("x", "y" ) ] ) ) ).

I just need to be able to compare two scalars together (either two ints, two
doubles, or two CHARSXP SEXP). We already have special code to handle what it
means to compare int, double, etc. in the R world, with NA, NaN, and so on.
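
To make the idea concrete, here is a rough R-level sketch of ordering rows with
nothing but a scalar comparator. It is only an illustration under stated
assumptions, not how dplyr::order_ is implemented: the names compare_scalar,
compare_rows and order_by_comparator are made up for this example, the sort is
a plain insertion sort chosen for brevity, and `<` on strings stands in for
what a C-level comparison would do.

```r
# Three-way comparison of two scalars of the same type. NA sorts last,
# mirroring the default of order(). For strings, `<` follows the collating
# sequence of the current locale (see ?Comparison).
compare_scalar <- function(a, b) {
  if (is.na(a) || is.na(b)) {
    return(as.integer(is.na(a)) - as.integer(is.na(b)))
  }
  if (a < b) -1L else if (a > b) 1L else 0L
}

# Compare rows i and j column by column; the first non-tie decides.
compare_rows <- function(cols, i, j) {
  for (col in cols) {
    cmp <- compare_scalar(col[[i]], col[[j]])
    if (cmp != 0L) return(cmp)
  }
  0L
}

# Stable insertion sort of the row indices, driven only by the comparator.
# No per-group character vector is ever built.
order_by_comparator <- function(cols) {
  n <- length(cols[[1L]])
  idx <- seq_len(n)
  for (k in seq_len(n)[-1L]) {
    cur <- idx[k]
    pos <- k - 1L
    while (pos >= 1L && compare_rows(cols, idx[pos], cur) > 0L) {
      idx[pos + 1L] <- idx[pos]
      pos <- pos - 1L
    }
    idx[pos + 1L] <- cur
  }
  idx
}

# Small made-up data set (hypothetical, much smaller than df above).
set.seed(1)
small <- data.frame(
  x = sample(1:3, 50, replace = TRUE),
  y = sample(1:5, 50, replace = TRUE),
  z = replicate(50, paste(sample(letters, 3, replace = TRUE), collapse = "")),
  stringsAsFactors = FALSE
)
res <- order_by_comparator(small)
```

On this data the result should agree with order( small$x, small$y, small$z ),
since both are stable and use the same string comparison. The point is only
that the sorting algorithm is free to be anything once a scalar comparison is
available.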

Scollate would give a way to compare two CHARSXP SEXP, the way R would. Of
course one has to be careful how it is called; I have read the source.
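
As a small illustration of what "the way R would" means in practice: at the R
level, `<` and sort() on character vectors follow the collating sequence of
the current locale, so the same two strings can compare differently depending
on LC_COLLATE. The sketch below only shows the C locale, where the outcome is
fixed. The variable names are made up for this example.

```r
# Remember the current collation so it can be restored afterwards.
old <- Sys.getlocale("LC_COLLATE")

# In the C locale strings are compared byte by byte, so upper-case ASCII
# letters sort before lower-case ones.
invisible(Sys.setlocale("LC_COLLATE", "C"))
c_less  <- "B" < "a"
c_order <- sort(c("a", "B"))

# Put the previous collation back.
invisible(Sys.setlocale("LC_COLLATE", old))
```

In many other locales (en_US, for example) "a" typically sorts before "B", which is
why a hand-rolled strcmp is not a substitute for R's own comparison.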

Materialising temporary values into an R vector may be the R way of doing
things, but sometimes it is a waste of both memory and time. Yes, this is about
performance. We are often asked to choose between performance and correctness
when in fact we can have both.

Romain

On 27 March 2014 at 22:12, Duncan Murdoch <murdoch.duncan@gmail.com> wrote:

> On 14-03-27 3:01 PM, Kevin Ushey wrote:
> > I too think it would be useful if R exported some version of its
> > string sorting routines, since sorting strings while respecting
> > locale, and doing so in a portable fashion while respecting the user's
> > environment, is not trivial. R holds a fast, portable, well-tested
> > solution, and I think package developers would be very appreciative if
> > some portion of this was exposed at the C level.
> 
> It does.  You can put your strings in an R STRSXP vector, and call the R
> sort function on it.
> 
> The usual objection to constructing an R expression and evaluating it is
> that it is slow, but if you are talking about sorting, the time spent in the
> sort is likely to dominate the time spent in the setup.
> > 
> > If not `Scollate`, then perhaps other candidates could be the more
> > generic `sortVector`, or the more string-specific (and NA-respecting)
> > `scmp`.
> 
> Evaluating an R expression gives you sortVector.
> 
> I can see an argument for Scollate being useful (sorting isn't the only
> reason to compare strings), but I can see arguments against exposing it too.
> Take a look at the source:  it needs to be used carefully.  In particular, it
> can return a 0 for unequal strings, and users are likely to get messed up by
> that, or to submit bogus bug reports.  And it's not impossible to work
> around: if you can collect the universe of strings to compare in advance,
> then just use order() to convert them to integer values, and compare those.
> Duncan Murdoch
> 
> > 
> > I understand that the volunteers at R Core have limited time and
> > resources, and exposing an API imposes additional maintenance burdens
> > on an already thinly stretched team, but this is a situation where the
> > R users and package authors alike could benefit. Or, if there are
> > other reasons why exporting such routines is neither possible nor
> > recommended, it would be very informative to know why.
> > 
> > Thanks,
> > Kevin
> > 
> > On Thu, Mar 27, 2014 at 11:08 AM, Dirk Eddelbuettel <edd@debian.org> wrote:
> > > 
> > > On 26 March 2014 at 19:09, Romain François wrote:
> > > > That's one part of the problem. Indeed I'd rather use something rather than
> > > > copy and paste it and run the risk of being outdated. The answer to that is
> > > 
> > > We all would. But "they" won't let us by refusing to create more API
> > > access points.
> > > > testing though. I can develop a test suite that can let me know I'm out of
> > > 
> > > Correct.
> > > 
> > > > date and I need to copy and paste some new code, etc ... Done that
> > > > before, this is tedious, but so what.
> > > > 
> > > > The other part of the problem (the real part of the problem actually)
> > > > is that, at least when R is built with ICU support, Scollate will
> > > > depend on the collator pointer in util.c
> > > > https://github.com/wch/r-source/blob/trunk/src/main/util.c#L1777
> > > > 
> > > > And this can be controlled by the base::icuSetCollate function. Of course the
> > > > collator pointer is not public.
> > > 
> > > So the next (and even less pleasant) answer is to build a new package which
> > > links to, (or worse yet, embeds) libicu.
> > > 
> > > As you want ICU behaviour, you will need ICU code.
> > > 
> > > Dirk
> > > 
> > > --
> > > Dirk Eddelbuettel | edd@debian.org | http://dirk.eddelbuettel.com
> > > 
> > > ______________________________________________
> > > R-devel@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel