List:       hadoop-user
Subject:    Yarn resourcemanager stop allocating container when cluster resource is sufficient
From:       Chang.Wu <583424568 () qq ! com>
Date:       2017-11-13 2:21:28
Message-ID: tencent_1F4F7A897760E07581131940EA438D4DBC06 () qq ! com

Hadoop Version: 2.7.2

My YARN cluster has (1100TB, 368 vCores) in total across 15 NodeManagers.

My cluster uses the Fair Scheduler, and I have 4 queues for different kinds of jobs:
<allocations>
    <queue name="queue1">
       <minResources>100000 mb, 30 vcores</minResources>
       <maxResources>422280 mb, 132 vcores</maxResources>
       <maxAMShare>0.5f</maxAMShare>
       <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
       <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
       <maxRunningApps>50</maxRunningApps>
    </queue>
    <queue name="queue2">
       <minResources>25000 mb, 20 vcores</minResources>
       <maxResources>600280 mb, 150 vcores</maxResources>
       <maxAMShare>0.6f</maxAMShare>
       <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
       <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
       <maxRunningApps>50</maxRunningApps>
    </queue>
    <queue name="queue3">
       <minResources>100000 mb, 30 vcores</minResources>
       <maxResources>647280 mb, 132 vcores</maxResources>
       <maxAMShare>0.8f</maxAMShare>
       <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
       <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
       <maxRunningApps>50</maxRunningApps>
    </queue>
  
    <queue name="queue4">
       <minResources>80000 mb, 20 vcores</minResources>
       <maxResources>120000 mb, 30 vcores</maxResources>
       <maxAMShare>0.5f</maxAMShare>
       <fairSharePreemptionTimeout>9000000000</fairSharePreemptionTimeout>
       <minSharePreemptionTimeout>9000000000</minSharePreemptionTimeout>
       <maxRunningApps>50</maxRunningApps>
    </queue>
</allocations>
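
A minimal sketch, assuming the allocation file has the shape shown above, of how the per-queue minResources and maxResources could be totaled and compared against the cluster capacity. The function names are hypothetical and not part of Hadoop; only queue4 from the file above is used as sample input.

```python
import re
import xml.etree.ElementTree as ET


def parse_resources(text):
    """Parse a resource string such as '100000 mb, 30 vcores' into (mb, vcores)."""
    mb = int(re.search(r"(\d+)\s*mb", text, re.I).group(1))
    vcores = int(re.search(r"(\d+)\s*vcores", text, re.I).group(1))
    return mb, vcores


def queue_limits(allocations_xml):
    """Return {queue name: {'min': (mb, vcores), 'max': (mb, vcores)}}."""
    root = ET.fromstring(allocations_xml)
    limits = {}
    for queue in root.findall("queue"):
        limits[queue.get("name")] = {
            "min": parse_resources(queue.findtext("minResources")),
            "max": parse_resources(queue.findtext("maxResources")),
        }
    return limits


def total(limits, key):
    """Sum the 'min' or 'max' limits over all queues."""
    mb = sum(q[key][0] for q in limits.values())
    vcores = sum(q[key][1] for q in limits.values())
    return mb, vcores


# Sample input: queue4 copied from the allocation file above.
SAMPLE = """
<allocations>
    <queue name="queue4">
       <minResources>80000 mb, 20 vcores</minResources>
       <maxResources>120000 mb, 30 vcores</maxResources>
    </queue>
</allocations>
"""

if __name__ == "__main__":
    limits = queue_limits(SAMPLE)
    print(limits)
    print("total min:", total(limits, "min"))
    print("total max:", total(limits, "max"))
```

Running the same functions over the full file would show how the summed limits relate to the (memory, vCores) the cluster actually offers.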


Then all newly submitted applications were stuck for nearly 5 hours, even
though the cluster resource usage was only about (600GB, 120 vCores), which
means the cluster still had sufficient resources.

Since my cluster is not large, I have ruled out the possibility described in
[YARN-4618].

Besides that, none of the running applications ever seemed to finish, and the
YARN RM appeared frozen: the RM log showed no further state changes for running
applications. The only new entries were for more and more applications being
submitted and becoming ACCEPTED, but none ever went from ACCEPTED to RUNNING.

The resource usage of the whole YARN cluster AND of each single queue *stayed
unchanged* for 5 hours, which is really strange.

The cluster seems like a zombie.

I checked the ApplicationMaster log of a running but stuck application:

2017-11-11 09:04:55,896 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for MAP job_1507795051888_183385. Report-size will be 4
2017-11-11 09:04:55,957 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Getting task report for REDUCE job_1507795051888_183385. Report-size will be 0
2017-11-11 09:04:56,037 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerAllocator: Before Scheduling: PendingReds:0 ScheduledMaps:4 ScheduledReds:0 AssignedMaps:0 AssignedReds:0 CompletedMaps:0 CompletedReds:0 ContAlloc:0 ContRel:0 HostLocal:0 RackLocal:0
2017-11-11 09:04:56,061 INFO [RMCommunicator Allocator] org.apache.hadoop.mapreduce.v2.app.rm.RMContainerRequestor: getResources() for application_1507795051888_183385: ask=6 release= 0 newContainers=0 finishedContainers=0 resourcelimit=<memory:109760, vCores:25> knownNMs=15
2017-11-11 13:58:56,736 INFO [IPC Server handler 0 on 42899] org.apache.hadoop.mapreduce.v2.app.client.MRClientService: Kill job job_1507795051888_183385 received from appuser (auth:SIMPLE) at 10.120.207.11


You can see that at **2017-11-11 09:04:56,061** it sent a resource request to
the ResourceManager, but the RM allocated zero containers. After that there
were no more log entries for 5 hours. At 13:58 I had to kill it manually.
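
A minimal sketch of how the `getResources()` line in the ApplicationMaster log could be checked mechanically. The regular expression is written against the single log line quoted in this message, so it is an assumption that other Hadoop versions format the line the same way; the function names are hypothetical.

```python
import re

# Pattern modeled on the RMContainerRequestor line quoted in this message.
GET_RESOURCES = re.compile(
    r"getResources\(\) for (?P<app>application_\d+_\d+): "
    r"ask=(?P<ask>\d+) release=\s*(?P<release>\d+) "
    r"newContainers=(?P<new>\d+) finishedContainers=(?P<finished>\d+) "
    r"resourcelimit=<memory:(?P<memory>\d+), vCores:(?P<vcores>\d+)> "
    r"knownNMs=(?P<nms>\d+)"
)


def parse_get_resources(line):
    """Return the numeric fields of a getResources() log line, or None if it does not match."""
    match = GET_RESOURCES.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    app = fields.pop("app")
    record = {name: int(value) for name, value in fields.items()}
    record["app"] = app
    return record


def is_starved(record):
    """True when containers were requested, none were granted, and headroom was still reported."""
    return record["ask"] > 0 and record["new"] == 0 and record["memory"] > 0
```

Applied to the 09:04:56,061 entry, this kind of check would flag a request with ask=6 and newContainers=0 while the reported resourcelimit was still non-zero.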

After 5 hours, I killed some pending applications and then everything
recovered: the remaining cluster resources could be allocated again, and the
ResourceManager seemed to be alive again.

I have ruled out the maxRunningApps and maxAMShare settings as the cause,
because they only restrict a single queue, whereas my problem is that
applications across the whole YARN cluster got stuck.
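
A minimal sketch of how the claim that every queue is affected could be checked, by counting ACCEPTED applications per queue through the ResourceManager REST API (`/ws/v1/cluster/apps?states=ACCEPTED`). The RM address passed in is a placeholder, and the function names are hypothetical.

```python
import json
from collections import Counter
from urllib.request import urlopen


def accepted_apps_by_queue(payload):
    """Count ACCEPTED applications per queue in a parsed /ws/v1/cluster/apps response."""
    # The RM returns {"apps": null} when no application matches the filter.
    apps = (payload.get("apps") or {}).get("app") or []
    return Counter(app["queue"] for app in apps if app.get("state") == "ACCEPTED")


def fetch_accepted_apps(rm_address):
    """Fetch ACCEPTED applications from the RM web service; rm_address is 'host:port'."""
    url = "http://%s/ws/v1/cluster/apps?states=ACCEPTED" % rm_address
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)
```

If only one queue showed a growing count, a per-queue limit would be the more likely suspect; a count growing in every queue would point at the scheduler as a whole.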

I have also ruled out a ResourceManager full GC problem: I checked with gcutil,
no full GC happened, and the ResourceManager memory was OK.
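
A minimal sketch of that check, assuming "gcutil" refers to `jstat -gcutil` and that its output has a header row containing an `FGC` (full GC count) column followed by one row per sample. The function names are hypothetical.

```python
def full_gc_counts(jstat_output):
    """Return the FGC value from each sample row of captured `jstat -gcutil` output."""
    rows = [line.split() for line in jstat_output.strip().splitlines() if line.strip()]
    header, samples = rows[0], rows[1:]
    fgc_column = header.index("FGC")
    return [int(sample[fgc_column]) for sample in samples]


def full_gc_happened(jstat_output):
    """True if the full GC count increased between the first and the last sample."""
    counts = full_gc_counts(jstat_output)
    return len(counts) >= 2 and counts[-1] > counts[0]
```

The text passed in would be the captured output of something like `jstat -gcutil <rm-pid> 1000 10`, taken while the cluster is in the stuck state.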

So, could anyone give me some suggestions?

