[prev in list] [next in list] [prev in thread] [next in thread] 

List:       openjdk-hotspot-runtime-dev
Subject:    Re: RFR (S) 8144940: Broken hash in string table entry in closed/runtime/7158800/BadUtf8.java
From:       Ioi Lam <ioi.lam () oracle ! com>
Date:       2016-03-23 17:32:59
Message-ID: 56F2D34B.4030007 () oracle ! com
[Download RAW message or body]



On 3/23/16 9:38 AM, Coleen Phillimore wrote:
>
> Hi, I realized that I didn't describe this well in either the bug or 
> the RFR, so I put a description of what the problem is and why this 
> fixes it in the bug.
>
> On 3/22/16 10:05 PM, Ioi Lam wrote:
>> I've done a little investigation on Latin1 encoding of 
>> java.lang.String and how that relates to this bug. Here's what I found:
>>
>>  * Latin1 is is a 8-bit encoding of characters.
>>  * The first 256 Unicode characters are exactly the same as the Latin1
>>    encoding.
>>  * So if all the jchars in a java.lang.String are <= 0xff, it can be
>>    stored in (logically) an unsigned byte array with the upper 16 bits
>>    truncated.
>>      o I am not sure if it's REQUIRED for such strings to be stored in
>>        a byte array. It might be possible to create an equivalent
>>        String with jchar[] storage. You can certainly do that with
>>        Unsafe + reflection.
>>
>> This function, which you removed, should return the exact same 
>> hashcode regardless whether the String was stored as a byte[] or jchar[]
>>
>>    unsigned int java_lang_String::hash_string(oop java_string) {
>>       int          length = java_lang_String::length(java_string);
>>       // Zero length string doesn't necessarily hash to zero.
>>       if (length == 0) {
>>         return StringTable::hash_string((jchar*) NULL, 0);
>>       }
>>
>>       typeArrayOop value  = java_lang_String::value(java_string);
>>       bool      is_latin1 = java_lang_String::is_latin1(java_string);
>>       if (is_latin1) {
>>         return StringTable::hash_string(value->byte_at_addr(0), length);
>>       } else {
>>         return StringTable::hash_string(value->char_at_addr(0), length);
>>       }
>>    }
>>
>> That's because these two functions should produce the exact same 
>> result (if all the unsigned arithmetics are correct ...)
>>
>>       static unsigned int hash_code(const jchar* s, int len) {
>>         unsigned int h = 0;
>>         while (len-- > 0) {
>>           h = 31*h + (unsigned int) *s;
>>           s++;
>>         }
>>         return h;
>>       }
>>
>>       static unsigned int hash_code(const jbyte* s, int len) {
>>         unsigned int h = 0;
>>         while (len-- > 0) {
>>           h = 31*h + (((unsigned int) *s) & 0xFF);
>>           s++;
>>         }
>>         return h;
>>       }
>>
>> For shared interned strings, we actually use the <jbyte> version 
>> during dump time (writing into shared strings table), and the <jchar> 
>> version at run time (look up from shared string table). I wrote a 
>> test and validated that the the two hashcodes are identical.
>
> For the alternate hashcode, the two hashcodes were different, which 
> caused this bug.  I don't think you should rely on the jbyte and jchar 
> versions returning the same thing.  I think you should always use the 
> jchar version of the hash code for shared intern strings.

Sorry, what I said above was wrong. I was confused by all the different
hashing functions that had similar names (hash_code vs hash_string)
but do different things.

[1] java_lang_String::hash_code(const jchar* s, int len) vs
     java_lang_String::hash_code(const jbyte* s, int len)

     They generate the same hash code, when you give an equivalent input
         in UTF16  to the first  function
         in Latin1 to the second function

[2] StringTable::hash_string<jchar>(const jchar* s, int len) vs
     StringTable::hash_string<jbyte>(const jbyte* s, int len)

     These two function generate different hash codes
     (when StringTable::use_alternate_hashcode() == true)

In summary, JDK-8144940 was caused by [2], and is unrelated to [1].

Thanks
- Ioi


>>
>> So, I believe your fix for 8144940 works not because you force the 
>> body to be converted to jchar. Rather, it's because 
>> java_lang_String::hash_string does not consider 
>> StringTable::use_alternate_hashcode().
>>
>> Also, I am glad that you removed the <jbyte> version of this template:
>>
>>    template<typename T>
>>    unsigned int StringTable::hash_string(const T* s, int len) {
>>       return use_alternate_hashcode() ? AltHashing::murmur3_32(seed(),
>>    s, len) :
>> java_lang_String::hash_code(s,
>>    len);
>>    }
>>
>>    // Explicit instantiation for all supported types.
>>    template unsigned int StringTable::hash_string<jchar>(const jchar*
>>    s, int len);
>>    template unsigned int StringTable::hash_string<jbyte>(const jbyte*
>>    s, int len);
>>
>> Otherwise, someone is likely to call it with a UTF8 string and get 
>> unexpected results. I am not even sure if it would give identical 
>> results as the <jchar> version if the input was encoded in Latin1.
>
> I think it doesn't get the same results.
>
> Hopefully, I've put enough information in the bug and yeah, removing 
> java_lang_String::hash_string() removed some confusion on my part at 
> least, since there's also a StringTable::hash_string, which probably 
> should be something like hash_string_for_StringTable but not for 
> String.hashCode().
>
> Thanks!
> Coleen
>
>>
>> Thanks
>> - Ioi
>>
>> On 3/22/16 10:39 AM, Coleen Phillimore wrote:
>>>
>>> Thank you, Jiangli.
>>> Coleen
>>>
>>> On 3/22/16 1:35 PM, Jiangli Zhou wrote:
>>>> Hi Coleen,
>>>>
>>>> Looks good to me. I had same question as Tobias yesterday. Your 
>>>> answer cleared it.
>>>>
>>>> Thanks,
>>>> Jiangli
>>>>
>>>>> On Mar 22, 2016, at 10:07 AM, Coleen Phillimore 
>>>>> <coleen.phillimore@oracle.com> wrote:
>>>>>
>>>>>
>>>>> Here's another webrev with the changes pointed out by Tobias and 
>>>>> verified with -XX:+VerifyStringTableAtExit.
>>>>>
>>>>> open webrev at http://cr.openjdk.java.net/~coleenp/8144940.02/webrev
>>>>>
>>>>> Thanks!
>>>>> Coleen
>>>>>
>>>>> On 3/22/16 12:21 PM, Tobias Hartmann wrote:
>>>>>> Hi Coleen,
>>>>>>
>>>>>> On 22.03.2016 13:40, Coleen Phillimore wrote:
>>>>>>> On 3/22/16 4:04 AM, Tobias Hartmann wrote:
>>>>>>>> Hi Coleen,
>>>>>>>>
>>>>>>>> On 21.03.2016 22:11, Coleen Phillimore wrote:
>>>>>>>>> Summary: Fix code broken with compact Strings.
>>>>>>>>>
>>>>>>>>> One of the failure modes of an intermittent bug (but this 
>>>>>>>>> failure is not intermittent).
>>>>>>>>>
>>>>>>>>> Tested with the failing test cases that exercise this code. 
>>>>>>>>> Also, testing in order to find linked bugs.
>>>>>>>>>
>>>>>>>>> open webrev at 
>>>>>>>>> http://cr.openjdk.java.net/~coleenp/8144940.01/webrev
>>>>>>>>> bug link https://bugs.openjdk.java.net/browse/JDK-8144940
>>>>>>>> I wonder why the result is different if you first convert the 
>>>>>>>> latin1 String to Unicode and then use the jchar hash_string() 
>>>>>>>> version compared to just using the jbyte hash_string() version? 
>>>>>>>> Is it because the jbyte version of AltHashing::murmur3_32() is 
>>>>>>>> used?
>>>>>>> Yes, I believe it is.
>>>>>> Okay, thanks for checking.
>>>>>>
>>>>>>>> Now we don't need the StringTable::hash_string<jbyte> version 
>>>>>>>> anymore, right?
>>>>>>> This one is used by Symbol* which are jbyte.
>>>>>> I only see jchar uses of StringTable::hash_string() (after your 
>>>>>> fix). Are you confusing it with java_lang_String::hash_code() 
>>>>>> which also has a jbyte and jchar version? This one is indeed used 
>>>>>> by the SymbolTable.
>>>>>>
>>>>>>>> Just noticed that there is an unused "latin1_hash_code" in 
>>>>>>>> javaClasses.hpp which can be removed as well.
>>>>>>> Thank you, I'll remove it.
>>>>>> Thanks!
>>>>>>
>>>>>> Best regards,
>>>>>> Tobias
>>>>>>
>>>>>>>> Thanks for fixing this!
>>>>>>> Thanks for reviewing it!
>>>>>>> Coleen
>>>>>>>> Best regards,
>>>>>>>> Tobias
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Coleen
>>>
>>
>

[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic