
List:       hadoop-dev
Subject:    [jira] [Created] (HADOOP-17096) ZStandardCompressor throws java.lang.InternalError: Error (generic)
From:       "Stephen Jung (Stripe) (Jira)" <jira@apache.org>
Date:       2020-06-26 20:23:00
Message-ID: <JIRA.13313687.1593202969000.347803.1593202980106@Atlassian.JIRA>

Stephen Jung (Stripe) created HADOOP-17096:
----------------------------------------------

             Summary: ZStandardCompressor throws java.lang.InternalError: Error (generic)
                 Key: HADOOP-17096
                 URL: https://issues.apache.org/jira/browse/HADOOP-17096
             Project: Hadoop Common
          Issue Type: Bug
          Components: io
    Affects Versions: 3.2.1
         Environment: Our repro is on Ubuntu Xenial LTS, with Hadoop 3.2.1
linking to libzstd 1.3.1. The bug is difficult to reproduce in an
end-to-end environment (e.g. running an actual Hadoop job with zstd
compression) because it is very sensitive to the exact input and output
characteristics. I reproduced the bug by turning one of the existing unit
tests into a crude fuzzer, but I'm not sure upstream will accept that
patch, so I've attached it separately on this ticket.

Note that the existing unit test testCompressingWithOneByteOutputBuffer
fails to reproduce this bug. This is because it uses the license file as
input, and that file is too small. libzstd has internal buffering (in our
environment it seems to be 128 kilobytes), and the license file is only
10 kilobytes. Thus libzstd is able to consume all the input and compress
it in a single call, then return pieces of its internal buffer one byte
at a time. Since all the input is consumed in a single call,
uncompressedDirectBufOff and uncompressedDirectBufLen are both set to
zero and thus the bug does not reproduce.
            Reporter: Stephen Jung (Stripe)


A bug in index handling causes ZStandardCompressor.c to pass a malformed
ZSTD_inBuffer to libzstd. libzstd then returns an "Error (generic)" that
gets thrown. The crux of the issue is two variables,
uncompressedDirectBufLen and uncompressedDirectBufOff. The Hadoop code
counts uncompressedDirectBufOff from the start of uncompressedDirectBuf,
then counts uncompressedDirectBufLen from uncompressedDirectBufOff.
However, libzstd considers pos and size to both be counted from the start
of the buffer. As a result, this line:
https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L228
causes a malformed buffer, where pos>size, to be passed to libzstd. Here's
a longer description of the bug in case this abstract explanation is
unclear:

----

Suppose we initialize uncompressedDirectBuf (via setInputFromSavedData)
with five bytes of input. This results in uncompressedDirectBufOff=0 and
uncompressedDirectBufLen=5
(https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.java#L140-L146).


Then we call compress(), which initializes a ZSTD_inBuffer
(https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L195-L196).
The definition of those libzstd structs is here:
https://github.com/facebook/zstd/blob/v1.3.1/lib/zstd.h#L251-L261 - note
that we set size=uncompressedDirectBufLen and pos=uncompressedDirectBufOff.
The ZSTD_inBuffer gets passed to libzstd, compression happens, etc. When
libzstd returns from the compression function, it updates the
ZSTD_inBuffer struct to indicate how many bytes were consumed
(https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3919-L3920).
Note that pos is advanced, but size is unchanged.
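
For reference, here is the struct definition from the linked zstd.h,
followed by a paraphrase (not a verbatim quote - the C file uses its own
identifiers, I'm using the Java-side field names) of how the Hadoop
native code fills it in:

    typedef struct ZSTD_inBuffer_s {
        const void* src;    /* start of input buffer */
        size_t size;        /* size of input buffer */
        size_t pos;         /* position where reading stopped; libzstd
                               requires 0 <= pos <= size */
    } ZSTD_inBuffer;

    /* Paraphrase of ZStandardCompressor.c#L195-L196: the two Hadoop
       counters are copied straight into the struct even though they
       follow different conventions. */
    ZSTD_inBuffer input;
    input.src  = uncompressedDirectBuf;     /* start of the input buffer  */
    input.size = uncompressedDirectBufLen;  /* counted from the offset... */
    input.pos  = uncompressedDirectBufOff;  /* ...but pos from the start  */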

Now, libzstd does not guarantee that the entire input will be compressed
in a single call of the compression function. (Some of the compression
libraries used by Hadoop, such as snappy, _do_ provide this guarantee,
but libzstd is not one of them.) So the Hadoop native code updates
uncompressedDirectBufOff and uncompressedDirectBufLen using the updated
ZSTD_inBuffer:
https://github.com/apache/hadoop/blob/rel/release-3.2.1/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/io/compress/zstd/ZStandardCompressor.c#L227-L228
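
Paraphrasing those two lines (again with the Java-side names rather than
the exact C identifiers):

    /* Sketch of ZStandardCompressor.c#L227-L228: off is taken from pos,
       which libzstd counts from the start of the buffer, while len is
       recomputed as the bytes remaining, counted from the offset. */
    uncompressedDirectBufOff = input.pos;               /* from buffer start */
    uncompressedDirectBufLen = input.size - input.pos;  /* bytes remaining   */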


Now, returning to our example, we started with 5 bytes of uncompressed
input. Suppose libzstd compressed 4 of those bytes, leaving one unread.
This would result in a ZSTD_inBuffer struct with size=5 (unchanged) and
pos=4 (four bytes were consumed). The Hadoop native code would then set
uncompressedDirectBufOff=4, but it would also set
uncompressedDirectBufLen=1 (five minus four equals one), as traced below.
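
To make the arithmetic concrete, here is the state after each step of the
example (a trace of the values, not code from any patch):

    /*  setInputFromSavedData:      off = 0, len = 5
     *  first ZSTD_inBuffer:        { size = 5, pos = 0 }
     *  libzstd consumes 4 bytes:   { size = 5, pos = 4 }
     *  Hadoop update:              off = 4, len = 5 - 4 = 1
     *  second ZSTD_inBuffer:       { size = 1, pos = 4 }  <- pos > size
     */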

Since some of the input was not consumed, we will eventually call
compress() again. Then we instantiate another ZSTD_inBuffer struct, this
time with size=1 and pos=4. This is a bug - libzstd expects size and pos
to both be counted from the start of the buffer, so pos>size is malformed.
It therefore returns an error
(https://github.com/facebook/zstd/blob/v1.3.1/lib/compress/zstd_compress.c#L3932),
which gets escalated as a java.lang.InternalError.
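
Here is a minimal standalone sketch (my own repro, not Hadoop code; it
assumes libzstd and its headers are installed, built with e.g.
"cc repro.c -lzstd") that hands libzstd a ZSTD_inBuffer in the same
malformed state and prints the resulting error:

    #include <stdio.h>
    #include <zstd.h>

    int main(void) {
        ZSTD_CStream* cs = ZSTD_createCStream();
        ZSTD_initCStream(cs, 1);            /* compression level 1 */

        const char src[] = "abcde";         /* five bytes of input */
        char dst[256];
        ZSTD_outBuffer out = { dst, sizeof(dst), 0 };

        /* The state after the buggy update: one byte "remaining"
         * (size=1) but pos=4 counted from the buffer start, so
         * pos > size. */
        ZSTD_inBuffer in = { src, 1, 4 };

        size_t ret = ZSTD_compressStream(cs, &out, &in);
        if (ZSTD_isError(ret)) {
            /* prints "Error (generic)" with libzstd 1.3.1 */
            printf("libzstd error: %s\n", ZSTD_getErrorName(ret));
        }
        ZSTD_freeCStream(cs);
        return 0;
    }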

I will be submitting a pull request on GitHub with a fix for this bug.
The key is that the Hadoop code should handle offsets the same way
libzstd does, i.e. uncompressedDirectBufLen should be counted from the
start of uncompressedDirectBuf, not from uncompressedDirectBufOff.
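
A sketch of the direction the fix will take (the actual pull request may
differ in detail): keep both counters in libzstd's convention, so the
post-call update and the next ZSTD_inBuffer become

    /* Hypothetical fix sketch: both fields counted from the start of
       uncompressedDirectBuf, matching libzstd. */
    uncompressedDirectBufOff = input.pos;   /* bytes consumed so far     */
    uncompressedDirectBufLen = input.size;  /* total bytes in the buffer */

    /* The next ZSTD_inBuffer is then well-formed (pos <= size), e.g.
       { size = 5, pos = 4 } in the example above. */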



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org


