
List:       mesos-issues
Subject:    [jira] [Commented] (MESOS-10221) A large number of TASK_LOST causes the task to be unable to run
From:       "Charles Natali (Jira)" <jira () apache ! org>
Date:       2021-05-30 18:50:00
Message-ID: JIRA.13381085.1622273528000.526411.1622400600080 () Atlassian ! JIRA


    [ https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17354096#comment-17354096 ]

Charles Natali commented on MESOS-10221:
----------------------------------------

> In addition, according to the framework's log, the accept is sent immediately
> after the offer is received, but in the master log the accept appears long
> after the offer was sent. So is it that the accept was not processed
> immediately, or do I misunderstand the time at which the offer was sent?

  

Yeah, that looks suspicious. It'd be good to have the full logs of the master and
framework so we can compare the timestamps of:
 * the offer being sent by the master
 * the offer being received by the framework
 * the accept being sent by the framework
 * the accept being received by the master
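
For the master side, the comparison could be scripted along these lines. This is only a rough sketch: it assumes the glog line format and the message wording seen in the master log excerpts quoted below, the function names are made up for illustration, and the year is passed in because glog timestamps do not include one. The framework log would need its own parser since its format isn't shown here.

```python
import re
from datetime import datetime

# glog prefix: severity letter, MMDD, HH:MM:SS.micros (no year in the log).
GLOG_PREFIX = re.compile(r"^[IWEF](\d{2})(\d{2}) (\d{2}):(\d{2}):(\d{2})\.(\d{6})")
# Offer IDs in the quoted logs look like "<uuid>-O<number>".
OFFER_ID = re.compile(r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}-O\d+")

# Message fragments taken from the quoted master log, mapped to event names.
EVENTS = {
    "Sending offers": "sent",
    "Processing ACCEPT call": "accepted",
    "Ignoring accept of offer": "accept_ignored",
    "Processing DECLINE call": "declined",
    "Removing offer": "removed",
}


def offer_timeline(lines, year=2021):
    """Map each offer ID to the first timestamp seen for each event."""
    timeline = {}
    for line in lines:
        m = GLOG_PREFIX.match(line)
        if not m:
            continue  # skip wrapped or non-glog lines
        month, day, hour, minute, second, micros = map(int, m.groups())
        ts = datetime(year, month, day, hour, minute, second, micros)
        for marker, event in EVENTS.items():
            if marker in line:
                for offer_id in OFFER_ID.findall(line):
                    timeline.setdefault(offer_id, {}).setdefault(event, ts)
                break
    return timeline


def delay_seconds(timeline, offer_id, start, end):
    """Seconds between two events of one offer, or None if either is missing."""
    events = timeline.get(offer_id, {})
    if start not in events or end not in events:
        return None
    return (events[end] - events[start]).total_seconds()
```

Feeding it the master log with one entry per line and asking for the "sent" to "removed" and "sent" to "accept_ignored" delays of an offer would show whether the accept reached the master before or after the offer timed out.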

  

> A large number of TASK_LOST causes the task to be unable to run
> ---------------------------------------------------------------
> 
> Key: MESOS-10221
> URL: https://issues.apache.org/jira/browse/MESOS-10221
> Project: Mesos
> Issue Type: Bug
> Components: master
> Affects Versions: 1.9.0, 1.11.0
> Environment: Ubuntu 16.04
> Reporter: clancyhuang
> Priority: Major
> 
> Recently, we found that the Mesos master frequently reports TASK_LOST after
> task submission. Retrying shortly afterwards does not help, and the problem
> is becoming more and more frequent. We selected two abnormal logs:
> {code:java}
> I0528 15:09:55.367336   964 master.cpp:9579] Sending offers [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:25.369561   969 master.cpp:11878] Removing offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237
> I0528 15:10:43.383028   959 http.cpp:1436] HTTP POST for /master/api/v1/scheduler from 10.118.28.66:50484 with User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:10:43.383656   959 master.cpp:5434] Processing DECLINE call for offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 seconds filter
> I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:33.386322   972 master.cpp:11878] Removing offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
> I0528 15:10:57.181581   967 http.cpp:1436] HTTP POST for /master/api/v1/scheduler from 10.118.28.66:50484 with User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> W0528 15:10:57.183194   967 master.cpp:3959] Ignoring accept of offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid
> W0528 15:10:57.183265   967 master.cpp:3964] ACCEPT call used invalid offers '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
> I0528 15:10:57.184392   967 master.cpp:8212] Sending status update TASK_LOST for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid'
> {code}
> The following is a log of normal execution
> {code:java}
> I0528 15:17:03.690855   959 master.cpp:9579] Sending offers [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529, 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13530 ] to framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.742848   970 http.cpp:1436] HTTP POST for /master/api/v1/scheduler from 10.118.28.66:50484 with User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:17:03.745221   970 master.cpp:4356] Processing ACCEPT call for offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 ] on agent cbe540a8-c894-4655-a899-cec7463d00c9-S2 at slave(1)@ip:5053 (ip) for framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.745889   970 master.cpp:11878] Removing offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529
> {code}
> We found that, when the exception occurred, the offer was removed before the
> accept was processed, and the interval between sending and removing the offer
> is exactly the configured offer-timeout. Our framework communicates with Mesos
> over HTTP, and I am sure that it sends the accept message immediately after
> receiving the offer and that the request succeeds. The question is why the
> master sometimes processes the accept message after the offer has timed out.
> In addition, we tried increasing the offer-timeout, but the problem was not
> resolved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

