'Re: RFR: 8311906: Improve robustness of String constructors with mutable array inputs'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       openjdk-core-libs-dev
Subject:    Re: RFR: 8311906: Improve robustness of String constructors with mutable array inputs
From:       Raffaello Giulietti <rgiulietti () openjdk ! org>
Date:       2023-11-08 18:40:07
Message-ID: bMsxzVmUo8Gr8c1ZOEhpZeBbqSuj-5wOErQxPCEAN0k=.347614ca-c219-4a62-b860-73c2f97a6fb6 () github ! com
[Download RAW message or body]

On Mon, 30 Oct 2023 18:34:44 GMT, Roger Riggs <rriggs@openjdk.org> wrote:

> Strings, after construction, are immutable but may be constructed from mutable \
> arrays of bytes, characters, or integers. The string constructors should guard \
> against the effects of mutating the arrays during construction that might \
> invalidate internal invariants for the correct behavior of operations on the \
> resulting strings. In particular, a number of operations have optimizations for \
> operations on pairs of latin1 strings and pairs of non-latin1 strings, while \
> operations between latin1 and non-latin1 strings use a more general implementation. \
>  
> The changes include:
> 
> - Adding a warning to each constructor with an array as an argument to indicate \
> that the results are indeterminate  if the input array is modified before the \
> constructor returns.  The resulting string may contain any combination of \
> characters sampled from the input array. 
> - Ensure that strings that are represented as non-latin1 contain at least one \
> non-latin1 character. For latin1 inputs, whether the arrays contain ASCII, \
> ISO-8859-1, UTF8, or another encoding decoded to latin1 the scanning and \
> compression is unchanged. If a non-latin1 character is found, the string is \
> represented as non-latin1 with the added verification that a non-latin1 character \
> is present at the same index. If that character is found to be latin1, then the \
> input array has been modified and the result of the scan may be incorrect. Though a \
> ConcurrentModificationException could be thrown, the risk to an existing \
> application of an unexpected exception should be avoided. Instead, the non-latin1 \
> copy of the input is re-scanned and compressed; that scan determines whether the \
> latin1 or the non-latin1 representation is returned. 
> - The methods that scan for non-latin1 characters and their intrinsic \
> implementations are updated to return the index of the non-latin1 character. 
> - String construction from StringBuilder and CharSequence must also be guarded as \
> their contents may be modified during construction.

First take ;-)
More will follow.

src/java.base/share/classes/java/lang/String.java line 602:

> 600:                 }
> 601:                 this.value = utf16;
> 602:                 this.coder = (utf16.length == dp) ? LATIN1 : UTF16;

Is it possible to have `utf16.length == dp` here? I think the `coder` can only be \
`UTF16`.

src/java.base/share/classes/java/lang/StringLatin1.java line 37:

> 35: import jdk.internal.util.ArraysSupport;
> 36: import jdk.internal.util.DecimalDigits;
> 37: import jdk.internal.vm.annotation.ForceInline;

This annotation does not seem to be used.

src/java.base/share/classes/java/lang/StringUTF16.java line 158:

> 156:      * {@return an encoded byte[] for the UTF16 characters in char[]}
> 157:      * **Only** use this if it is known that at least one character is UTF16.
> 158:      * Otherwise, an untrusted char array may have racy contents and really be \
> latin1.

While this is a good advice, it turns out that the `compress()` method below invokes \
this method _without_ knowing for sure if the provided `value` contains at least one \
non-latin1 `char` when this method is invoked or while it runs: it only _assumes_ so, \
and indeed prudentially checks afterwards.

It should be suggested to check the result after invoking this method if `value` is \
untrusted.

src/java.base/share/classes/java/lang/StringUTF16.java line 198:

> 196:      * @param val   a char array
> 197:      * @param off   starting offset
> 198:      * @param count length of chars to be compressed, length > 0

Suggestion:

     * @param count count of chars to be compressed, {@code count} > 0

src/java.base/share/classes/java/lang/StringUTF16.java line 214:

> 212:             }
> 213:         }
> 214:         return latin1;     // latin1 success

The original version of this `public` method can return `null` to signal failure to \
compress. Does this change impact callers that might expect `null`?

src/java.base/share/classes/java/lang/StringUTF16.java line 226:

> 224:      * @param val   a byte array with UTF16 coding
> 225:      * @param off   starting offset
> 226:      * @param count length of chars to be compressed, length > 0

Suggestion:

     * @param count count of chars to be compressed, {@code count} > 0

src/java.base/share/classes/java/lang/StringUTF16.java line 232:

> 230:         int ndx = compress(val, off, latin1, 0, count);
> 231:         if (ndx != count) {// Switch to UTF16
> 232:             byte[] utf16 = Arrays.copyOfRange(val, off << 1, (off + count) << \
> 1);

Not sure if the left shifts do not overflow on this `public` method. If that happens, \
the outcomes could be non-negative, so the copy would succeed but be kind of \
corrupted.

src/java.base/share/classes/java/lang/StringUTF16.java line 240:

> 238:             }
> 239:         }
> 240:         return latin1;     // latin1 success

See the `compress()` above for a remark on `null` as a return value.

src/java.base/share/classes/java/lang/StringUTF16.java line 319:

> 317:             int codePoint = val[off];           // read each codepoint from \
>                 val only once
> 318:             int dstLimit = dstOff
> 319:                     + (Character.isBmpCodePoint(codePoint) ? 1 : 2)

Suggestion:

                    + Character.charCount(codePoint)

This method was introduced in 1.5, so should be safe to use even for backports.

src/java.base/share/classes/java/lang/StringUTF16.java line 411:

> 409:             return 2;
> 410:         } else
> 411:             throw new IllegalArgumentException(Integer.toString(codePoint));

Maybe `Character.charCount()` can be used here, although it returns 2 even for \
invalid codepoints.

-------------

PR Review: https://git.openjdk.org/jdk/pull/16425#pullrequestreview-1720680695
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386992886
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386829351
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386885841
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905655
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386923763
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905964
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386915873
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386927514
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386951908
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386997067


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic