
List:       r-help
Subject:    Re: [R] jitter-bug? problematic behaviour of the jitter function
From:       Duncan Murdoch <murdoch.duncan () gmail ! com>
Date:       2020-09-24 17:24:20
Message-ID: e8c2cd29-5e2e-ecbb-afc8-f044822f17c3 () gmail ! com

Those seem like useful properties if jitter() is used in plotting (as it 
was originally intended), but that use isn't even mentioned in the help 
page.  Martin wanted to "add a small amount of noise to a numeric 
vector" "in order to break ties" (quoting from that help page).

For Martin's use, it sounds as though quantreg::dither might be a better 
solution (though I think it won't work when numerical error splits ties, 
so that some differences are extremely small, or if the scale of the values 
varies too much, but I'd guess that's a fairly rare circumstance).
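
A minimal sketch of the kind of hand-coded tie-breaker Martin mentions, 
in base R only. The name break_ties and its fraction argument are 
hypothetical (they are not part of base R or quantreg). It assumes that 
exact ties are the only problem: near-ties split by numerical error would 
make the smallest gap, and therefore the noise, extremely small.

```r
# Hypothetical helper: add uniform noise whose half-width is a fraction of
# the smallest gap between distinct finite values. With fraction < 0.5 the
# ordering of values that were already distinct is preserved.
break_ties <- function(x, fraction = 0.25) {
  stopifnot(is.numeric(x), fraction > 0, fraction < 0.5)
  finite_x <- x[is.finite(x)]
  gaps <- diff(sort(unique(finite_x)))
  # If all finite values are equal there is no gap to scale by, so fall
  # back to an arbitrary small scale (an assumption of this sketch).
  d <- if (length(gaps)) min(gaps) else max(abs(finite_x), 1) * 1e-6
  x + runif(length(x), -fraction * d, fraction * d)
}

set.seed(1)
x <- c(1, 2, 2, 1e5)
y <- break_ties(x)
```

Here the noise is drawn from runif(n, -0.25, 0.25) because the smallest 
gap is 1, regardless of the outlier at 1e5. Uniqueness of the result is 
only almost sure, not guaranteed, so a check such as anyDuplicated(y) 
afterwards is still worthwhile.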

Duncan Murdoch

On 24/09/2020 1:03 p.m., Bert Gunter wrote:
> Folks: Please note:
> 
> There is *no* way to "jitter" the 3 values 1,2, and 1e5 so that:
> 
> a) the jittered values differ from the original ones by a fraction of 
> their original value;
> b) the plotting symbols for the jittered values will be distinguishable 
> on a linear scale holding all 3 values.
> 
> Cheers,
> Bert
> 
> Bert Gunter
> 
> "The trouble with having an open mind is that people keep coming along 
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> 
> 
> On Thu, Sep 24, 2020 at 8:39 AM Martin Keller-Ressel 
> <martin.keller-ressel@tu-dresden.de> wrote:
> 
>     Dear Duncan, Dear Rui,
> 
>     thanks for the responses and for pointing out that it is the ‚fuzz‘
>     part that is causing the problem. I agree that this is not a bug,
>     but could be undesirable/surprising behaviour, since it causes a
>     large ‚discontinuity‘ in the jitter function's output depending on
>     the input data.
> 
>     I was (ab?)using the jitter function to break ties, where the
>     desired behaviour would be to add noise just small enough to make
>     all values unique. (Such a function can easily be hand coded of course.)
> 
>     best regards,
>     Martin
> 
>     On 23.09.2020 at 22:25, Duncan Murdoch
>     <murdoch.duncan@gmail.com> wrote:
> 
>     On 23/09/2020 4:03 p.m., Rui Barradas wrote:
>     Hello,
>     I believe that though Duncan's explanation is right it is also not
>     explaining the value of the digits argument. round makes the first 2
>     numbers 0 but why?
> 
>     If there had been rounding in their computation, you might see a
>     difference like 1e-15.  You wouldn't want to use that for the scale
>     of jittering, so some rounding is needed.
> 
>     I think the documentation for the function is poor, but the
>     intention was probably to use the function in graphics (as the
>     references did), and in that case, any values too close together
>     should be treated as equal and jittering should separate them.  The
>     particular computation used says that if the range is in [1, 10),
>     values equal to 3 decimal places will be too close and need separation.
> 
>     So I don't think this is a bug, but it might be a valid wishlist
>     item: document what "apart from fuzz" means, and perhaps allow it to
>     be controlled by the user.
> 
>     Duncan Murdoch
> 
> 
> 
>     The function below prints the digits argument and
>     then outputs d. The code is taken from jitter.
>     f <- function(x){
>         z <- diff(r <- range(x[is.finite(x)]))
>         cat("digits:", 3 - floor(log10(z)), "\n")
>         diff(xx <- unique(sort.int(round(x, 3 - floor(log10(z))))))
>     }
>     Now see what cat outputs for 'digits'.
>     f(c(1,2,10^4))  # desired behaviour
>     #digits: 0
>     #[1]    1 9998
>     f(c(0,1,10^4))  # bad behaviour
>     #digits: -1
>     #[1] 10000
>     f(c(-1,0,10^4))  # bad behaviour
>     #digits: -1
>     #[1] 10000
>     f(c(1,2,10^5))  # bad behaviour
>     #digits: -1
>     #[1] 1e+05
>     And according to the documentation of ?round, negative digits are
>     allowed:
>     Rounding to a negative number of digits means rounding to a power of
>     ten, so for example round(x, digits = -2) rounds to the nearest hundred.
>     But in this case two of the numbers are closer to 0 than they are to 10.
>     And unique keeps only 0 and the largest, then diff is big.
>     round(c(1,2,10^4),0)  # desired behaviour
>     #[1]     1     2 10000
>     round(c(0,1,10^4),-1)  # bad behaviour
>     #[1]     0     0 10000
>     round(c(-1,0,10^4),-1)  # bad behaviour
>     #[1]     0     0 10000
>     round(c(1,2,10^5),-1)  # bad behaviour
>     #[1] 0e+00 0e+00 1e+05
>     Isn't it still a bug?
>     Rui Barradas
>     At 15:57 on 23/09/20, Duncan Murdoch wrote:
>     On 23/09/2020 6:32 a.m., Martin Keller-Ressel wrote:
>     Dear all,
> 
>     I have noticed some strange behaviour in the „jitter“ function in R.
>     On the help page for jitter it is stated that
> 
>     "The result, say r, is r <- x + runif(n, -a, a) where n <- length(x)
>     and a is the amount argument (if specified)."
> 
>     and
> 
>     "If amount is NULL (default), we set a <- factor * d/5 where d is the
>     smallest difference between adjacent unique (apart from fuzz) x values."
> 
>     This works fine as long as there is no (very) large outlier
> 
>     jitter(c(1,2,10^4))  # desired behaviour
>     [1]    1.083243    1.851571 9999.942716
> 
>     But for very large outliers the added noise suddenly ‚jumps‘ to a much
>     larger scale:
> 
>     jitter(c(1,2,10^5)) # bad behaviour
>     [1] -19535.649   9578.702 115693.854
>     # Noise should be of order (2-1)/5  = 0.2 but is of much larger order.
> 
>     This probably does not matter much when jitter is used for plotting,
>     but it can cause problems when jitter is used to break ties.
> 
>     I think this is kind of documented:  "apart from fuzz" is what counts.
>     If you look at the code for jitter, you'll see this important line:
> 
>        d <- diff(xx <- unique(sort.int(round(x, 3 - floor(log10(z))))))
> 
>     By the time you get here, z is the length of the range of the data, so
>     it's 99999 in your example.  The rounding changes your values to
>     0,0,1e5, so the smallest difference is 1e5.
> 
>     Duncan Murdoch
> 
>     ______________________________________________
>     R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>     https://stat.ethz.ch/mailman/listinfo/r-help
>     PLEASE do read the posting guide
>     http://www.R-project.org/posting-guide.html
>     and provide commented, minimal, self-contained, reproducible code.
> 
> 
> 
>
