List: hadoop-user
Subject: Yarn resourcemanager stop allocating container when cluster resource is sufficient
From: "Chang.Wu" <583424568 () qq ! com>
Date: 2017-11-13 2:21:28
Message-ID: tencent_1F4F7A897760E07581131940EA438D4DBC06 () qq ! com
Hadoop Version: 2.7.2
My YARN cluster has (1100TB, 368 vCores) in total, with 15 NodeManagers.
My cluster uses the fair scheduler, and I have 4 queues for different kinds of jobs:
<allocations>
  <queue name="queue1">
    <minResources>100000 mb, 30 vcores</minResources>
    <maxResources>422280 mb, 132 vcores</maxResources>
    <maxAMShare>0.5f</maxAMShare>
    <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
    <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
    <maxRunningApps>50</maxRunningApps>
  </queue>
  <queue name="queue2">
    <minResources>25000 mb, 20 vcores</minResources>
    <maxResources>600280 mb, 150 vcores</maxResources>
    <maxAMShare>0.6f</maxAMShare>
    <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
    <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
    <maxRunningApps>50</maxRunningApps>
  </queue>
  <queue name="queue3">
    <minResources>100000 mb, 30 vcores</minResources>
    <maxResources>647280 mb, 132 vcores</maxResources>
    <maxAMShare>0.8f</maxAMShare>
    <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
    <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
    <maxRunningApps>50</maxRunningApps>
  </queue>
  <queue name="queue4">
    <minResources>80000 mb, 20 vcores</minResources>
    <maxResources>120000 mb, 30 vcores</maxResources>
    <maxAMShare>0.5f</maxAMShare>
    <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
    <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
    <maxRunningApps>50</maxRunningApps>
  </queue>
</allocations>
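For anyone who wants to sanity-check these limits against the cluster totals, here is a minimal Python sketch that sums the minResources and maxResources values from the allocation file above. The helper names (parse_resource, queue_limits, total) are made up for illustration, and the XML is trimmed to the two elements being summed; it is not part of any Hadoop tooling.

```python
import re
import xml.etree.ElementTree as ET

# Trimmed copy of the allocation file above: only the min/max elements.
ALLOCATIONS = """
<allocations>
  <queue name="queue1">
    <minResources>100000 mb, 30 vcores</minResources>
    <maxResources>422280 mb, 132 vcores</maxResources>
  </queue>
  <queue name="queue2">
    <minResources>25000 mb, 20 vcores</minResources>
    <maxResources>600280 mb, 150 vcores</maxResources>
  </queue>
  <queue name="queue3">
    <minResources>100000 mb, 30 vcores</minResources>
    <maxResources>647280 mb, 132 vcores</maxResources>
  </queue>
  <queue name="queue4">
    <minResources>80000 mb, 20 vcores</minResources>
    <maxResources>120000 mb, 30 vcores</maxResources>
  </queue>
</allocations>
"""

# Matches the "X mb, Y vcores" form used in the file above.
RESOURCE_RE = re.compile(r"(\d+)\s*mb\s*,\s*(\d+)\s*vcores", re.IGNORECASE)


def parse_resource(text):
    """Return (memory_mb, vcores) from a string like '100000 mb, 30 vcores'."""
    match = RESOURCE_RE.search(text)
    if match is None:
        raise ValueError("unrecognized resource string: %r" % text)
    return int(match.group(1)), int(match.group(2))


def queue_limits(xml_text):
    """Map each queue name to its parsed min and max resources."""
    root = ET.fromstring(xml_text.strip())
    limits = {}
    for queue in root.findall("queue"):
        limits[queue.get("name")] = {
            "min": parse_resource(queue.findtext("minResources")),
            "max": parse_resource(queue.findtext("maxResources")),
        }
    return limits


def total(limits, key):
    """Sum memory and vcores across all queues for 'min' or 'max'."""
    memory_mb = sum(entry[key][0] for entry in limits.values())
    vcores = sum(entry[key][1] for entry in limits.values())
    return memory_mb, vcores


LIMITS = queue_limits(ALLOCATIONS)
MIN_TOTAL = total(LIMITS, "min")
MAX_TOTAL = total(LIMITS, "max")
print("sum of minResources (mb, vcores):", MIN_TOTAL)
print("sum of maxResources (mb, vcores):", MAX_TOTAL)
```

The summed maxResources come to 444 vcores, which is above the 368 vCores reported for the cluster, so the per-queue maximums alone do not cap total usage below capacity.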
Then all newly submitted applications got stuck for nearly 5 hours, even though cluster resource usage was only about (600GB, 120 vCores), which means the cluster still had sufficient resources.
Since my cluster is not large, I have ruled out the possibility described in YARN-4618 (https://issues.apache.org/jira/browse/YARN-4618).
Besides that, none of the running applications ever seemed to finish, and the YARN RM appeared frozen: the RM log showed no further state changes for running applications, only more and more applications being submitted and becoming ACCEPTED, but never moving from ACCEPTED to RUNNING.
The resource usage of the whole YARN cluster AND of every single queue stayed unchanged for 5 hours, which is really strange.
The cluster seemed like a zombie.
I checked the ApplicationMaster log of some applications that were running but stuck:
2017-11-11 09:04:55,896 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for MAP job_1507795051888_183385. Report-size will be 4
2017-11-11 09:04:55,957 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for REDUCE job_1507795051888_183385. Report-size will be 0
2017-11-11 09:04:56,037 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1507795051888_183385: ask=6 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15
2017-11-11 13:58:56,736 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job job_1507795051888_183385 received from appuser (auth:SIMPLE) at 10.120.207.11
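To spot the same pattern in other ApplicationMaster logs, a small Python sketch like the one below could pull the counters out of the RMContainerRequestor getResources() line. The regular expression is written against the exact line quoted above; the function names (parse_get_resources, is_starved) and the "starved" rule are my own assumptions for illustration, not anything defined by Hadoop.

```python
import re

# The getResources() line from the ApplicationMaster log quoted above.
LINE = (
    "2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator] "
    "org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for "
    "application_1507795051888_183385: ask=6 release= 0 newContainers=0 "
    "finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15"
)

# Pattern derived from that one sample line; other Hadoop versions may differ.
PATTERN = re.compile(
    r"getResources\(\) for (?P<app>application_\d+_\d+): "
    r"ask=(?P<ask>\d+) release=\s*(?P<release>\d+) "
    r"newContainers=(?P<new>\d+) finishedContainers=(?P<finished>\d+) "
    r"resourcelimit=<memory:(?P<mem>\d+), vCores:(?P<vcores>\d+)> "
    r"knownNMs=(?P<nms>\d+)"
)


def parse_get_resources(line):
    """Return the counters from a getResources() log line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Everything except the application id is a plain integer.
    return {k: (v if k == "app" else int(v)) for k, v in fields.items()}


def is_starved(report):
    """Hypothetical rule: containers were asked for, headroom exists, none arrived."""
    return report["ask"] > 0 and report["new"] == 0 and report["mem"] > 0


REPORT = parse_get_resources(LINE)
print(REPORT)
print("starved:", is_starved(REPORT))
```

For the line above this reports ask=6 with newContainers=0 while resourcelimit still shows 109760 MB and 25 vCores of headroom, which matches the symptom described in this message.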
You can see that at 2017-11-11 09:04:56,061 it sent a resource request to the ResourceManager, but the RM allocated zero containers. Then there were no more log entries for 5 hours. At 13:58, I had to kill it manually.
After 5 hours, I killed some pending applications and then everything recovered: the remaining cluster resources could be allocated again, and the ResourceManager seemed to be alive again.
I have ruled out the maxRunningApps and maxAMShare settings as the cause, because they only affect a single queue, whereas my problem is that applications across the whole YARN cluster got stuck.
I have also ruled out a ResourceManager full GC problem: I checked with gcutil and no full GC happened, and ResourceManager memory is OK.
So, could anyone give me some suggestions?