'[jira] [Resolved] (HADOOP-19047) Support InMemory Tracking Of S3A Magic Commits'


List:       hadoop-dev
Subject:    [jira] [Resolved] (HADOOP-19047) Support InMemory Tracking Of S3A Magic Commits
From:       "Steve Loughran (Jira)" <jira () apache ! org>
Date:       2024-03-26 17:32:00
Message-ID: JIRA.13565408.1705659520000.90288.1711474320034 () Atlassian ! JIRA


     [ https://issues.apache.org/jira/browse/HADOOP-19047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran resolved HADOOP-19047.
-------------------------------------
    Fix Version/s: 3.5.0
                   3.4.1
       Resolution: Fixed

> Support InMemory Tracking Of S3A Magic Commits
> ----------------------------------------------
> 
> Key: HADOOP-19047
> URL: https://issues.apache.org/jira/browse/HADOOP-19047
> Project: Hadoop Common
> Issue Type: Improvement
> Components: fs/s3
> Reporter: Syed Shameerur Rahman
> Assignee: Syed Shameerur Rahman
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.5.0, 3.4.1
> 
> 
> A task that uses the S3A Magic Committer performs the following operations.
>
> *While closing the stream*
> 1. A 0-byte file with the same name as the original file is uploaded to S3
> with a PUT operation. Refer
> [here|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicCommitTracker.java#L152]
> for more information. This is done so that downstream applications such as
> Spark can get the size of the file being written.
> 2. The MultiPartUpload (MPU) metadata is uploaded to S3. Refer
> [here|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicCommitTracker.java#L176]
> for more information.
>
> *During TaskCommit*
> 1. All the MPU metadata files that the task wrote to S3 (there will be 'x'
> metadata files in S3 if a single task writes 'x' files) are read and
> rewritten to S3 as a single metadata file. Refer
> [here|https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/commit/magic/MagicS3GuardCommitter.java#L201]
> for more information.
>
> Since these operations happen within the task JVM, we could optimize them
> and save cost by keeping this information in memory when task memory usage
> is not a constraint. Hence the proposal is to introduce a new magic commit
> tracker called "InMemoryMagicCommitTracker", which will:
> 1. Store the MPU metadata in memory until the task is committed.
> 2. Store the size of the file, which downstream applications can use to get
> the file size before the file is committed/visible in the output path.
>
> For a task that writes only one file, this optimization saves 2 S3 PUT
> calls, 1 S3 LIST call, and 1 S3 GET call.
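
The proposal in the description can be illustrated with a minimal Java sketch. It is not the code merged for HADOOP-19047: the class name `InMemoryCommitTrackerSketch`, the `PendingUpload` fields, and every method name are hypothetical and chosen only for illustration. The sketch assumes that the stream and the task commit run in the same JVM, so MPU metadata and file lengths can be held in static concurrent maps instead of being written to S3.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical sketch of in-memory commit tracking; not the Hadoop class. */
final class InMemoryCommitTrackerSketch {

  /** Metadata of one pending multipart upload (illustrative fields only). */
  static final class PendingUpload {
    final String destPath;
    final String uploadId;
    final List<String> partEtags;
    final long length;

    PendingUpload(String destPath, String uploadId,
        List<String> partEtags, long length) {
      this.destPath = destPath;
      this.uploadId = uploadId;
      this.partEtags = Collections.unmodifiableList(new ArrayList<>(partEtags));
      this.length = length;
    }
  }

  // Task attempt ID -> uploads completed by that attempt but not yet committed.
  private static final Map<String, List<PendingUpload>> UPLOADS_BY_TASK =
      new ConcurrentHashMap<>();

  // Destination path -> bytes written; stands in for the 0-byte marker file.
  private static final Map<String, Long> LENGTH_BY_PATH =
      new ConcurrentHashMap<>();

  /** Called on stream close, in place of the two PUT requests. */
  static void recordUpload(String taskAttemptId, PendingUpload upload) {
    UPLOADS_BY_TASK
        .computeIfAbsent(taskAttemptId, k -> new CopyOnWriteArrayList<>())
        .add(upload);
    LENGTH_BY_PATH.put(upload.destPath, upload.length);
  }

  /** Length of a not-yet-committed file, or -1 if the path is unknown. */
  static long fileLength(String destPath) {
    Long length = LENGTH_BY_PATH.get(destPath);
    return length == null ? -1L : length;
  }

  /** Called at task commit, in place of the LIST and GET requests. */
  static List<PendingUpload> drainForTaskCommit(String taskAttemptId) {
    List<PendingUpload> uploads = UPLOADS_BY_TASK.remove(taskAttemptId);
    if (uploads == null) {
      return Collections.emptyList();
    }
    for (PendingUpload upload : uploads) {
      LENGTH_BY_PATH.remove(upload.destPath);
    }
    return new ArrayList<>(uploads);
  }

  public static void main(String[] args) {
    String attempt = "attempt_0001";
    recordUpload(attempt, new PendingUpload(
        "s3a://bucket/out/part-0000", "upload-1", Arrays.asList("etag-1"), 1024L));
    System.out.println("length before commit: "
        + fileLength("s3a://bucket/out/part-0000"));
    System.out.println("uploads to commit: "
        + drainForTaskCommit(attempt).size());
  }
}
```

In this sketch the trade-off named in the description is explicit: the pending metadata lives on the task's heap until `drainForTaskCommit` is called, so it only suits tasks where memory usage is not a constraint, and the state is lost if the JVM exits before task commit.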



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-dev-help@hadoop.apache.org

