[prev in list] [next in list] [prev in thread] [next in thread]
List: openjdk-core-libs-dev
Subject: Re: RFR: 8311906: Improve robustness of String constructors with mutable array inputs
From: Raffaello Giulietti <rgiulietti () openjdk ! org>
Date: 2023-11-08 18:40:07
Message-ID: bMsxzVmUo8Gr8c1ZOEhpZeBbqSuj-5wOErQxPCEAN0k=.347614ca-c219-4a62-b860-73c2f97a6fb6 () github ! com
[Download RAW message or body]
On Mon, 30 Oct 2023 18:34:44 GMT, Roger Riggs <rriggs@openjdk.org> wrote:
> Strings, after construction, are immutable but may be constructed from mutable \
> arrays of bytes, characters, or integers. The string constructors should guard \
> against the effects of mutating the arrays during construction that might \
> invalidate internal invariants for the correct behavior of operations on the \
> resulting strings. In particular, a number of operations have optimizations for \
> operations on pairs of latin1 strings and pairs of non-latin1 strings, while \
> operations between latin1 and non-latin1 strings use a more general implementation. \
>
> The changes include:
>
> - Adding a warning to each constructor with an array as an argument to indicate \
> that the results are indeterminate if the input array is modified before the \
> constructor returns. The resulting string may contain any combination of \
> characters sampled from the input array.
> - Ensure that strings that are represented as non-latin1 contain at least one \
> non-latin1 character. For latin1 inputs, whether the arrays contain ASCII, \
> ISO-8859-1, UTF8, or another encoding decoded to latin1 the scanning and \
> compression is unchanged. If a non-latin1 character is found, the string is \
> represented as non-latin1 with the added verification that a non-latin1 character \
> is present at the same index. If that character is found to be latin1, then the \
> input array has been modified and the result of the scan may be incorrect. Though a \
> ConcurrentModificationException could be thrown, the risk to an existing \
> application of an unexpected exception should be avoided. Instead, the non-latin1 \
> copy of the input is re-scanned and compressed; that scan determines whether the \
> latin1 or the non-latin1 representation is returned.
> - The methods that scan for non-latin1 characters and their intrinsic \
> implementations are updated to return the index of the non-latin1 character.
> - String construction from StringBuilder and CharSequence must also be guarded as \
> their contents may be modified during construction.
First take ;-)
More will follow.
src/java.base/share/classes/java/lang/String.java line 602:
> 600: }
> 601: this.value = utf16;
> 602: this.coder = (utf16.length == dp) ? LATIN1 : UTF16;
Is it possible to have `utf16.length == dp` here? I think the `coder` can only be \
`UTF16`.
src/java.base/share/classes/java/lang/StringLatin1.java line 37:
> 35: import jdk.internal.util.ArraysSupport;
> 36: import jdk.internal.util.DecimalDigits;
> 37: import jdk.internal.vm.annotation.ForceInline;
This annotation does not seem to be used.
src/java.base/share/classes/java/lang/StringUTF16.java line 158:
> 156: * {@return an encoded byte[] for the UTF16 characters in char[]}
> 157: * **Only** use this if it is known that at least one character is UTF16.
> 158: * Otherwise, an untrusted char array may have racy contents and really be \
> latin1.
While this is a good advice, it turns out that the `compress()` method below invokes \
this method _without_ knowing for sure if the provided `value` contains at least one \
non-latin1 `char` when this method is invoked or while it runs: it only _assumes_ so, \
and indeed prudentially checks afterwards.
It should be suggested to check the result after invoking this method if `value` is \
untrusted.
src/java.base/share/classes/java/lang/StringUTF16.java line 198:
> 196: * @param val a char array
> 197: * @param off starting offset
> 198: * @param count length of chars to be compressed, length > 0
Suggestion:
* @param count count of chars to be compressed, {@code count} > 0
src/java.base/share/classes/java/lang/StringUTF16.java line 214:
> 212: }
> 213: }
> 214: return latin1; // latin1 success
The original version of this `public` method can return `null` to signal failure to \
compress. Does this change impact callers that might expect `null`?
src/java.base/share/classes/java/lang/StringUTF16.java line 226:
> 224: * @param val a byte array with UTF16 coding
> 225: * @param off starting offset
> 226: * @param count length of chars to be compressed, length > 0
Suggestion:
* @param count count of chars to be compressed, {@code count} > 0
src/java.base/share/classes/java/lang/StringUTF16.java line 232:
> 230: int ndx = compress(val, off, latin1, 0, count);
> 231: if (ndx != count) {// Switch to UTF16
> 232: byte[] utf16 = Arrays.copyOfRange(val, off << 1, (off + count) << \
> 1);
Not sure if the left shifts do not overflow on this `public` method. If that happens, \
the outcomes could be non-negative, so the copy would succeed but be kind of \
corrupted.
src/java.base/share/classes/java/lang/StringUTF16.java line 240:
> 238: }
> 239: }
> 240: return latin1; // latin1 success
See the `compress()` above for a remark on `null` as a return value.
src/java.base/share/classes/java/lang/StringUTF16.java line 319:
> 317: int codePoint = val[off]; // read each codepoint from \
> val only once
> 318: int dstLimit = dstOff
> 319: + (Character.isBmpCodePoint(codePoint) ? 1 : 2)
Suggestion:
+ Character.charCount(codePoint)
This method was introduced in 1.5, so should be safe to use even for backports.
src/java.base/share/classes/java/lang/StringUTF16.java line 411:
> 409: return 2;
> 410: } else
> 411: throw new IllegalArgumentException(Integer.toString(codePoint));
Maybe `Character.charCount()` can be used here, although it returns 2 even for \
invalid codepoints.
-------------
PR Review: https://git.openjdk.org/jdk/pull/16425#pullrequestreview-1720680695
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386992886
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386829351
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386885841
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905655
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386923763
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386905964
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386915873
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386927514
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386951908
PR Review Comment: https://git.openjdk.org/jdk/pull/16425#discussion_r1386997067
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic