'Re: Request for review: 8008546: G1: Wrong G1CONFIDENCEPERCENT results in GUARANTEE(VARIANCE() > -1.'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       openjdk-hotspot-gc-dev
Subject:    Re: Request for review: 8008546: G1: Wrong G1CONFIDENCEPERCENT results in GUARANTEE(VARIANCE() > -1.
From:       John Cuthbertson <john.cuthbertson () oracle ! com>
Date:       2013-02-22 17:42:49
Message-ID: 5127AE19.60307 () oracle ! com
[Download RAW message or body]

Hi Bengt,

I think it's a very good idea. I said as much. I'll create the bug and 
attach the email thread and the entire instrumented log from my run. 
When you look at the entire log you can see how quickly the values 
diverge and grow in size until we start getting infinities.

As for what the right strategy would be - my instinct says to not let 
the target marking step time become negative.

Regardless of whether we have a negative value or zero for the target 
step time, the first regular clock call will result in the marking task 
aborting due to timing out (elapsed time will be > target time). A zero 
value will, however, constrain the times being recorded in the number seq.

JohnC

On 2/21/2013 11:16 PM, Bengt Rutisson wrote:
>
> John, Vladimir,
>
> Thanks for the very detailed analysis of the issue! :)
>
> Vladimir, I agree with John. Your change is fine, so consider it 
> reviewed and go ahead and push it.
>
> John, I really like how well you described what is going on here. 
> Unfortunately I think I still don't have enough knowledge to answer 
> the question of whether we should just make sure to never get a 
> negative target step time or if we should avoid adding large step 
> times all together. For me it is not too important to sort this out 
> right now, but it might be good to create a bug and add your excellent 
> analysis. What do you think?
>
> Thanks,
> Bengt
>
> On 2/21/13 10:43 PM, John Cuthbertson wrote:
>> Hi Bengt, Vladimir,
>>
>> Bumping up the value of G1ConfidencePercent eventually caused the 
>> guarantee to fail. I believe I know what's going on here now.
>>
>> The following is tracing and instrumentation from 
>> CMTask::do_marking_step(), from the marking cycle where it starts to 
>> go wrong...
>>
>> 3.788: [GC concurrent-mark-start]
>> [0] >>>>>>>>>> START, call = 1,
>>      _num = 5, _sum =   1.180, _sum_of_squares =   0.432
>>      _davg =   0.275, _dvariance =   0.023, _alpha =   0.700
>>          _length = 10, _next = 5
>>
>>         [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 
>> [4]=  0.382
>>         [5]=  0.000    [6]=  0.000    [7]=  0.000    [8]=  0.000 
>> [9]=  0.000
>> davg: 0.2747654, sig: 3.0000000, dsd: 0.1517040, first: 0.7298773
>> davg: 0.2747654, num: 5, cf: 1.0000000, second: 0.2747654
>> Result: 0.7298773
>> [0] >>>>>>>>>> START, call = 1, target = 9.27ms >>>>>>>>>>
>> [0] Adding 0.1885053 to counter
>> [0] <<<<<<<<<< ABORTING, target = 9.27ms, elapsed = 9.46ms <<<<<<<<<<
>>
>> [0] >>>>>>>>>> START, call = 2,
>>      _num = 6, _sum =   1.369, _sum_of_squares =   0.467
>>      _davg =   0.249, _dvariance =   0.017, _alpha =   0.700
>>          _length = 10, _next = 6
>>
>>         [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 
>> [4]=  0.382
>>         [5]=  0.189    [6]=  0.000    [7]=  0.000    [8]=  0.000 
>> [9]=  0.000
>> davg: 0.2488873, sig: 3.0000000, dsd: 0.1311627, first: 0.6423756
>> davg: 0.2488873, num: 6, cf: 1.0000000, second: 0.2488873
>> Result: 0.6423756
>> [0] >>>>>>>>>> START, call = 2, target = 9.36ms >>>>>>>>>>
>> [0] Adding 0.2263276 to counter
>> [0] <<<<<<<<<< ABORTING, target = 9.36ms, elapsed = 9.58ms <<<<<<<<<<
>>
>> [0] >>>>>>>>>> START, call = 3,
>>      _num = 7, _sum =   1.595, _sum_of_squares =   0.519
>>      _davg =   0.242, _dvariance =   0.012, _alpha =   0.700
>>          _length = 10, _next = 7
>>
>>         [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 
>> [4]=  0.382
>>         [5]=  0.189    [6]=  0.226    [7]=  0.000    [8]=  0.000 
>> [9]=  0.000
>> davg: 0.2421194, sig: 3.0000000, dsd: 0.1100790, first: 0.5723563
>> davg: 0.2421194, num: 7, cf: 1.0000000, second: 0.2421194
>> Result: 0.5723563
>> [0] >>>>>>>>>> START, call = 3, target = 9.43ms >>>>>>>>>>
>> [0] Adding 16.8998093 to counter
>> [0] <<<<<<<<<< ABORTING, target = 9.43ms, elapsed = 26.33ms <<<<<<<<<<
>>
>> The step time difference prediction we obtain for call #1 was 0.73ms, 
>> resulting in a target step time of 9.27ms. The marking step aborts 
>> after 9.46ms and adds 0.189ms to the number sequence.
>>
>> This causes the step time difference prediction we obtain for call #2 
>> to be calculated as 0.642ms; target step time of 9.36ms. The marking 
>> step aborts after an elapsed 9.58ms. The difference (9.58 - 9.36) 
>> 0.226ms is added to the number sequence.
>>
>> This causes the step time difference prediction we obtain for call #3 
>> to be calculated 0.572ms; the target step time of 9.43. The marking 
>> step aborts after an elapsed time of 26.33ms. We add the difference 
>> (26.33 - 9.43) 16.90 to the number sequence. This is where it gets 
>> interesting. :)
>>
>> In call 4:
>>
>> [0] >>>>>>>>>> START, call = 4,
>>      _num = 8, _sum =  18.495, _sum_of_squares = 286.122
>>      _davg =   5.239, _dvariance =  40.798, _alpha =   0.700
>>          _length = 10, _next = 8
>>
>>         [0]=  0.500    [1]=  0.137    [2]=  0.126    [3]=  0.036 
>> [4]=  0.382
>>         [5]=  0.189    [6]=  0.226    [7]= 16.900    [8]=  0.000 
>> [9]=  0.000
>> davg: 5.2394264, sig: 3.0000000, dsd: 6.3873188, first: 24.4013827
>> davg: 5.2394264, num: 8, cf: 1.0000000, second: 5.2394264
>> Result: 24.4013827
>> [0] >>>>>>>>>> START, call = 4, target = -14.40ms >>>>>>>>>>
>> [0] Adding 14.5933617 to counter
>> [0] <<<<<<<<<< ABORTING, target = -14.40ms, elapsed = 0.19ms <<<<<<<<<<
>>
>> the predicted step time difference (because of the large value added 
>> to the number sequence the previous call) is calculated as 24.40ms. 
>> This gives a NEGATIVE target step time of -14.40ms. The marking step 
>> aborts after an extremely short 0.19ms. We add the difference (0.19 - 
>> (-14.40)) 14.59 to the number sequence.
>>
>> From this point on the predicted step time difference gets larger and 
>> larger until we start seeing infinities for dsd, sum of squares, etc. 
>> This gives a negative infinity as the time step target and we attempt 
>> to add infinity into the number sequence which results in a variance 
>> value of infinity.
>>
>> Why was the elapsed time for call #3 much longer than the target step 
>> time? My guess is that the marking step claimed a large enough object 
>> that it just took that amount of time to scan it. (Note this was with 
>> debug binary).
>>
>> Why does a higher G1ConfidencePercent value cause the problem? The 
>> confidence percent is used to calculate the value of 
>> G1CollectorPolicy::sigma() which is used in the expression that 
>> predicts a new value from a given number sequence:
>>
>>     return MAX2(seq->davg() + sigma() * seq->dsd(),
>>                 seq->davg() * confidence_factor(seq->num()));
>>
>> it's a multiplier of the decaying standard deviation. So a value 
>> larger than 100 will result in larger and larger values in the LHS of 
>> the MAX2.
>>
>> Vladimir: your fix is good and a valid fix for this issue but I 
>> wonder if we should also be preventing the target step time from 
>> becoming negative, e.g.:
>>
>> [0] >>>>>>>>>> START, call = 17,
>>      _num = 10, _sum =   1.816, _sum_of_squares =   0.554
>>      _davg =   0.177, _dvariance =   0.009, _alpha =   0.700
>>          _length = 10, _next = 7
>>
>>         [0]=  0.057    [1]=  0.541    [2]=  0.245    [3]=  0.084 
>> [4]=  0.248
>>         [5]=  0.231    [6]=  0.085    [7]=  0.030    [8]=  0.035 
>> [9]=  0.258
>> davg: 0.1774224, sig: 3.0000000, dsd: 0.0971153, first: 0.4687682
>> davg: 0.1774224, num: 10, cf: 1.0000000, second: 0.1774224
>> Result: 0.4687682
>> [0] >>>>>>>>>> START, call = 17, target = 9.53ms >>>>>>>>>>
>> [0] Adding 20.1697202 to counter
>> [0] <<<<<<<<<< ABORTING, target = 9.53ms, elapsed = 29.70ms <<<<<<<<<<
>>
>> [0] >>>>>>>>>> START, call = 18,
>>      _num = 10, _sum =  21.955, _sum_of_squares = 407.371
>>      _davg =   6.175, _dvariance =  58.761, _alpha =   0.700
>>          _length = 10, _next = 8
>>
>>         [0]=  0.057    [1]=  0.541    [2]=  0.245    [3]=  0.084 
>> [4]=  0.248
>>         [5]=  0.231    [6]=  0.085    [7]= 20.170    [8]=  0.035 
>> [9]=  0.258
>> davg: 6.1751117, sig: 3.0000000, dsd: 7.6655934, first: 29.1718919
>> davg: 6.1751117, num: 10, cf: 1.0000000, second: 6.1751117
>> Result: 29.1718919
>> [0] >>>>>>>>>> START, call = 18, target = 0.00ms >>>>>>>>>>
>> [0] Adding 0.1926280 to counter
>> [0] <<<<<<<<<< ABORTING, target = 0.00ms, elapsed = 0.19ms <<<<<<<<<<
>>
>> This should prevent the values in the sequence diverging and reading 
>> Inf. Or perhaps we shouldn't even add large step time differences to 
>> the number seq. Whatever we decide to do (if anything) should be done 
>> as separate CR.
>>
>> Regards,
>>
>> JohnC
>>
>> On 2/21/2013 10:50 AM, John Cuthbertson wrote:
>>> Hi Bengt,
>>>
>>> Thanks for the stack trace. It helps. The counter is obviously 
>>> _marking_step_diffs_ms and the confidence value feeds into this via 
>>> G1CollectorPolicy::get_new_prediction(). I ran jbb2005 with the 
>>> flags you suggest and still no failure. I'm still a little confused 
>>> though. If we look at TruncatedSeq:add() and AbsSeq::variance():
>>>
>>> void TruncatedSeq::add(double val) {
>>>   AbsSeq::add(val);
>>>
>>>   // get the oldest value in the sequence...
>>>   double old_val = _sequence[_next];
>>>   // ...remove it from the sum and sum of squares
>>>   _sum -= old_val;
>>>   _sum_of_squares -= old_val * old_val;
>>>
>>>   // ...and update them with the new value
>>>   _sum += val;
>>>   _sum_of_squares += val * val;
>>>
>>>   // now replace the old value with the new one
>>>   _sequence[_next] = val;
>>>   _next = (_next + 1) % _length;
>>>
>>>   // only increase it if the buffer is not full
>>>   if (_num < _length)
>>>     ++_num;
>>>
>>>   guarantee( variance() > -1.0, "variance should be >= 0" );
>>> }
>>>
>>> the guarantee can only trip if the result of variance() is -1.0 or 
>>> less. Correct? What value is being returned by variance()?
>>>
>>> Now look at AbsSeq::variance():
>>>
>>> double AbsSeq::variance() const {
>>>   if (_num <= 1)
>>>     return 0.0;
>>>
>>>   double x_bar = avg();
>>>   double result = _sum_of_squares / total() - x_bar * x_bar;
>>>   if (result < 0.0) {
>>>     // due to loss-of-precision errors, the variance might be negative
>>>     // by a small bit
>>>
>>>     //    guarantee(-0.1 < result && result < 0.0,
>>>     //        "if variance is negative, it should be very small");
>>>     result = 0.0;
>>>   }
>>>   return result;
>>> }
>>>
>>> How can this return a result -1.0 or less? If result is negative 
>>> then the return value should be 0.0 or a small magnitude negative; 
>>> not -1.0 or less. If the value of result is close enough to 0.0 to 
>>> make the "result < 0.0" evaluate to false, it shouldn't be able to 
>>> result in "variance() > -1.0" evaluate to false.
>>>
>>> Vladimir: I still think your change is good but I'm not sure it's a 
>>> true fix for the problem.
>>>
>>> Regards,
>>>
>>> JohnC
>>>
>>> On 2/21/2013 1:56 AM, Bengt Rutisson wrote:
>>>>
>>>> John,
>>>>
>>>> The bug report lists a guarantee in nuberSeq.cpp. You need a 
>>>> concurrent cycle for the issue to reproduce.
>>>>
>>>> I can repro it with this command line using SPECjbb2005:
>>>>
>>>> java -XX:+UseG1GC -XX:G1ConfidencePercent=300 -Xmx256m 
>>>> -XX:InitiatingHeapOccupancyPercent=1 -XX:+PrintGC -cp 
>>>> jbb.jar:check.jar spec.jbb.JBBmain
>>>>
>>>> Here is the stack trace from the hs_err file i get:
>>>>
>>>> #
>>>> # A fatal error has been detected by the Java Runtime Environment:
>>>> #
>>>> #  Internal Error 
>>>> (/Users/brutisso/repos/hs-gc-test-deprecated/src/share/vm/utilities/numberSeq.cpp:166), 
>>>> pid=34072, tid=16131
>>>> #  guarantee(variance() > -1.0) failed: variance should be >= 0
>>>> #
>>>> # JRE version: Java(TM) SE Runtime Environment (8.0-b68) (build 
>>>> 1.8.0-ea-b68)
>>>> # Java VM: Java HotSpot(TM) 64-Bit Server VM 
>>>> (25.0-b15-internal-jvmg mixed mode bsd-amd64 compressed oops)
>>>> # Failed to write core dump. Core dumps have been disabled. To 
>>>> enable core dumping, try "ulimit -c unlimited" before starting Java 
>>>> again
>>>> #
>>>> # If you would like to submit a bug report, please visit:
>>>> #   http://bugreport.sun.com/bugreport/crash.jsp
>>>> #
>>>>
>>>> ---------------  T H R E A D  ---------------
>>>>
>>>> Current thread (0x00007feda9038000):  ConcurrentGCThread [stack: 
>>>> 0x0000000129587000,0x0000000129687000] [id=16131]
>>>>
>>>> Stack: [0x0000000129587000,0x0000000129687000], 
>>>> sp=0x0000000129686190,  free space=1020k
>>>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, 
>>>> C=native code)
>>>> V  [libjvm.dylib+0xd48e80] VMError::report(outputStream*)+0x1262
>>>> V  [libjvm.dylib+0xd4a57b]  VMError::report_and_die()+0x88f
>>>> V  [libjvm.dylib+0x5328a7]  report_vm_error(char const*, int, char 
>>>> const*, char const*)+0xc7
>>>> V  [libjvm.dylib+0xac6dd2]  TruncatedSeq::add(double)+0x170
>>>> V  [libjvm.dylib+0x4d389a]  CMTask::do_marking_step(double, bool, 
>>>> bool)+0x15ba
>>>> V  [libjvm.dylib+0x4d9d5a] CMConcurrentMarkingTask::work(unsigned 
>>>> int)+0x192
>>>> V  [libjvm.dylib+0xda8c8e]  GangWorker::loop()+0x666
>>>> V  [libjvm.dylib+0xda7fc0]  GangWorker::run()+0x3a
>>>> V  [libjvm.dylib+0xafef34]  java_start(Thread*)+0x1d4
>>>> C  [libsystem_c.dylib+0x14742]  _pthread_start+0x147
>>>> C  [libsystem_c.dylib+0x1181]  thread_start+0xd
>>>>
>>>>
>>>> Bengt
>>>>
>>>>
>>>> On 2/20/13 9:13 PM, John Cuthbertson wrote:
>>>>> Hi Vladimir,
>>>>>
>>>>> The change looks good to me. But I have a question: which counter 
>>>>> was tripping the guarantee? I just ran jvm98 with a confidence 
>>>>> percent value of 200 and 300 and the guarantee didn't fire for me.
>>>>>
>>>>> I kind of agree with Bengt that moving this kind of error checking 
>>>>> earlier would be better. But that can be done as a separate CR.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> JohnC
>>>>>
>>>>> On 2/20/2013 7:27 AM, vladimir kempik wrote:
>>>>>> Hi Bengt,
>>>>>>
>>>>>> Thanks for looking at this!
>>>>>>
>>>>>> Here is an updated webrev based on your feedback:
>>>>>>
>>>>>> http://cr.openjdk.java.net/~mcherkas/vladimir/8008546/webrev.01/
>>>>>>
>>>>>> I applied what you suggested.
>>>>>>
>>>>>> Thanks,
>>>>>> Vladimir.
>>>>>>
>>>>>> On 20.02.2013 17:54, Bengt Rutisson wrote:
>>>>>>>
>>>>>>> Hi Vladimir,
>>>>>>>
>>>>>>> This looks very similar to how we treat G1ReservePercent, so I 
>>>>>>> think it looks good. An alternative would have been to check 
>>>>>>> this earlier in the initialization phase and update the flag 
>>>>>>> G1ConfidencePercent so that PrintFlagsFinal would have printed 
>>>>>>> the actual value. But for consistency I think it looks good this 
>>>>>>> way.
>>>>>>>
>>>>>>> I think you can change G1ConfidencePercent to be an uintx 
>>>>>>> instead of intx (in g1_globals.hpp). In that case you don't need 
>>>>>>> the second if statment since it can't be less than 0. It is also 
>>>>>>> more consistent with G1ReservePercent which is an uintx.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Bengt
>>>>>>>
>>>>>>> On 2/20/13 2:31 PM, vladimir kempik wrote:
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Could I have a couple of reviews for this change?
>>>>>>>>
>>>>>>>> http://cr.openjdk.java.net/~mcherkas/vladimir/8008546/webrev.00/
>>>>>>>>
>>>>>>>> Input value for G1CONFIDENCEPERCENT wasn't checked before 
>>>>>>>> using. This results in crash sometimes if -XX:+UseG1GC 
>>>>>>>> -XX:G1ConfidencePercent=200 flags are used. Now checking the 
>>>>>>>> value same way as it was done for G1ReservePercent. Increase to 
>>>>>>>> 0 if negative, decrease to 100 if more than 100.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Vladimir
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic