'[Rd] order(..., na.last = NA) performance hit'


List:       r-devel
Subject:    [Rd] order(..., na.last = NA) performance hit
From:       Murat Tasan <mmuurr () gmail ! com>
Date:       2015-01-19 20:20:31
Message-ID: CA+YV+HyRjMbB+A4wRxCm1woZSHmxCQerShLm10ynYgVeSqXHdg () mail ! gmail ! com

I recently noticed that calling order with na.last = NA incurs a large
performance hit.
Much of order(...) (the R wrapper, not the internal calls) appears to be
written as generally as possible in order to handle the large number of
input types, but the canonical case of ordering a single numeric vector
suffers greatly under the current implementation.
Below is a single trivial example (roughly 6x slower there); overall I've
been seeing slowdowns on the order of 10x when using na.last = NA.
Would it be worth (i) attempting a re-write of the wrapping order(...)
function, or (ii) at least mentioning the performance implications in
the help page for order(...)?

Here's an example of the performance hit:

x <- runif(1e6)
x[runif(1e6) > 0.9] <- NA ## add some (~10%) NA values
## workaround: order with NAs placed last, then drop the indices that
## point at NA values
order2 <- function(x) {
    iix <- order(x, na.last = TRUE)
    iix[!is.na(x[iix])]
}

system.time(y1 <- order(x, na.last = TRUE))
##    user  system elapsed
##    0.48    0.00    0.48

system.time(y2 <- order(x, na.last = NA))
##    user  system elapsed
##   3.060   0.056   3.118

system.time(y3 <- order2(x))
##    user  system elapsed
##   0.520   0.004   0.520

all(y2 == y3)
## [1] TRUE
identical(y2, y3)
## [1] TRUE
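
A slightly different sketch of the same workaround, in case it is useful:
since na.last = TRUE places every NA (and NaN) entry at the end of the
result, the trailing sum(is.na(x)) indices can simply be cut off rather than
re-indexing x. The name order_drop_na is made up here for illustration (it
is not part of base R), and the sketch assumes a single atomic vector as
input, not the multi-key case that order(...) also supports.

```r
## Hypothetical helper (not in base R): intended to mimic
## order(x, na.last = NA) for a single atomic vector.
order_drop_na <- function(x, decreasing = FALSE) {
    ## NAs are placed last regardless of `decreasing`
    iix <- order(x, na.last = TRUE, decreasing = decreasing)
    ## number of non-missing entries; these occupy the front of iix
    n_ok <- length(x) - sum(is.na(x))
    iix[seq_len(n_ok)]
}

## quick consistency check against the built-in behaviour
set.seed(1)
x <- runif(1e5)
x[runif(1e5) > 0.9] <- NA
stopifnot(identical(order_drop_na(x), order(x, na.last = NA)))
```

I have not timed this variant, so it should be benchmarked with system.time
before relying on it.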

Cheers,

-murat

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel