
List:       hadoop-dev
Subject:    [jira] [Resolved] (HADOOP-16221) S3Guard: fail write that doesn't update metadata store
From:       "Steve Loughran (JIRA)" <jira@apache.org>
Date:       2019-04-30 10:56:00
Message-ID: JIRA.13224827.1553867669000.173984.1556621760479@Atlassian.JIRA


     [ https://issues.apache.org/jira/browse/HADOOP-16221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-16221.
-------------------------------------
       Resolution: Fixed
    Fix Version/s: 3.3.0

+1, PR#666 committed. Thanks!

> S3Guard: fail write that doesn't update metadata store
> ------------------------------------------------------
> 
> Key: HADOOP-16221
> URL: https://issues.apache.org/jira/browse/HADOOP-16221
> Project: Hadoop Common
> Issue Type: Sub-task
> Components: fs/s3
> Affects Versions: 3.2.0
> Reporter: Ben Roling
> Assignee: Ben Roling
> Priority: Major
> Fix For: 3.3.0
> 
> 
> Right now, a failure to write to the S3Guard metadata store (e.g. DynamoDB) is [merely logged|https://github.com/apache/hadoop/blob/rel/release-3.1.2/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L2708-L2712]. It does not fail the S3AFileSystem write operation itself, so the writer has no idea that anything went wrong. The implication is that S3Guard does not always provide the consistency it advertises. For example, [this article|https://blog.cloudera.com/blog/2017/08/introducing-s3guard-s3-consistency-for-apache-hadoop/] states:
> {quote}If a Hadoop S3A client creates or moves a file, and then a client lists its directory, that file is now guaranteed to be included in the listing.{quote}
> Unfortunately, this is not strictly true, and the gap can cause exactly the sort of problem S3Guard is supposed to avoid:
> {quote}Missing data that is silently dropped. Multi-step Hadoop jobs that depend on output of previous jobs may silently omit some data. This omission happens when a job chooses which files to consume based on a directory listing, which may not include recently-written items.{quote}
> Imagine the typical multi-job Hadoop processing pipeline. Job 1 runs and succeeds, but one (or more) of its S3Guard metadata writes failed under the covers. Job 2 picks up the output directory from Job 1 and runs its processing, potentially seeing an inconsistent listing and silently missing some of the Job 1 output files. S3Guard should at least provide a configuration option to fail the write if the metadata write fails; arguably that should even be the default.
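The behaviour the issue requests can be sketched as follows. This is a minimal, hypothetical illustration, not the actual S3AFileSystem code or the committed fix: the names MetadataStore, failOnMetadataWriteError, and finishedWrite are illustrative assumptions standing in for the real internals.

```java
import java.io.IOException;
import java.util.logging.Logger;

// Hypothetical sketch of the fail-fast option requested in HADOOP-16221.
// All names here are illustrative, not the real S3AFileSystem internals.
public class MetadataUpdateSketch {
    private static final Logger LOG = Logger.getLogger("s3guard-sketch");

    /** Stand-in for the S3Guard metadata store (e.g. backed by DynamoDB). */
    interface MetadataStore {
        void put(String path) throws IOException;
    }

    private final MetadataStore store;
    private final boolean failOnMetadataWriteError; // the proposed config option

    MetadataUpdateSketch(MetadataStore store, boolean failOnMetadataWriteError) {
        this.store = store;
        this.failOnMetadataWriteError = failOnMetadataWriteError;
    }

    /** Record a newly written file in the metadata store. */
    void finishedWrite(String path) throws IOException {
        try {
            store.put(path);
        } catch (IOException e) {
            if (failOnMetadataWriteError) {
                // Requested behaviour: surface the failure to the writer.
                throw e;
            }
            // Pre-fix behaviour: merely log and continue, leaving the
            // listing potentially inconsistent.
            LOG.warning("S3Guard: failed to update metadata for " + path + ": " + e);
        }
    }

    public static void main(String[] args) throws IOException {
        // A store whose writes always fail, simulating a DynamoDB outage.
        MetadataStore failing = p -> { throw new IOException("DynamoDB unavailable"); };

        // Lenient mode: the error is swallowed and only logged.
        new MetadataUpdateSketch(failing, false).finishedWrite("s3a://bucket/out/part-0");
        System.out.println("lenient: write succeeded despite metadata failure");

        // Strict mode: the write operation itself fails.
        try {
            new MetadataUpdateSketch(failing, true).finishedWrite("s3a://bucket/out/part-0");
        } catch (IOException e) {
            System.out.println("strict: write failed: " + e.getMessage());
        }
    }
}
```

In strict mode a job such as "Job 1" above would fail loudly instead of handing "Job 2" an incomplete listing, which is the trade-off the issue argues for.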



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org


