List:       vdsm-devel
Subject:    [ovirt-devel] Re: OST HE: Engine VM went down due to cpu load
From:       Yedidyah Bar David <didi@redhat.com>
Date:       2021-08-10 10:34:40
Message-ID: CAHRwYXvJZw-h+7XnirMXwMnsCJafw9bpTL05Jke7qhZD5wAHBg@mail.gmail.com

On Tue, Aug 10, 2021 at 12:05 PM Milan Zamazal <mzamazal@redhat.com> wrote:
> 
> Yedidyah Bar David <didi@redhat.com> writes:
> 
> > On Tue, Aug 3, 2021 at 10:27 PM Michal Skrivanek <michal.skrivanek@redhat.com> wrote:
> > 
> > > 
> > > 
> > > On 3. 8. 2021, at 11:43, Yedidyah Bar David <didi@redhat.com> wrote:
> > > 
> > > On Tue, Aug 3, 2021 at 10:05 AM Yedidyah Bar David <didi@redhat.com>
> > > wrote:
> > > 
> > > 
> > > On Tue, Aug 3, 2021 at 7:50 AM <jenkins@jenkins.phx.ovirt.org> wrote:
> > > 
> > > 
> > > Project:
> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/
> > > Build:
> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/
> > > Build Number: 2126
> > > Build Status:  Failure
> > > Triggered By: Started by timer
> > > 
> > > -------------------------------------
> > > Changes Since Last Success:
> > > -------------------------------------
> > > Changes for Build #2126
> > > [Michal Skrivanek] basic: skipping just the VNC console part of
> > > test_virtual_machines
> > > 
> > > 
> > > 
> > > 
> > > -----------------
> > > Failed Tests:
> > > -----------------
> > > 2 tests failed.
> > > FAILED:
> > > he-basic-suite-master.test-scenarios.test_012_local_maintenance_sdk.test_local_maintenance
> > >  
> > > Error Message:
> > > ovirtsdk4.Error: Failed to read response: [(<pycurl.Curl object at
> > > 0x5555faf11228>, 7, 'Failed to connect to 192.168.200.99 port 443:
> > > Connection refused')]
> > > 
> > > 
> > > This looks very similar to the issue we have with dns/dig failures
> > > that cause the engine VM to go down, but it's not quite the same.
> > > 
> > > dig didn't fail (it now uses TCP), but something else caused the agent
> > > to stop the engine VM - a combination of high cpu load and low free
> > > memory, after restarting the engine VM as part of test_008.
> > > 
> > > 
> > > https://jenkins.ovirt.org/job/ovirt-system-tests_he-basic-suite-master/2126/artifact/exported-artifacts/test_logs/ost-he-basic-suite-master-host-0/var/log/ovirt-hosted-engine-ha/agent.log :
> > > 
> > > 
> > > =====================================================================================
> > > MainThread::INFO::2021-08-03 06:46:55,068::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
> > > MainThread::INFO::2021-08-03 06:47:04,089::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:47:04,169::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-GlobalMaintenance) sent? ignored
> > > MainThread::INFO::2021-08-03 06:47:05,249::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 3400)
> > > MainThread::INFO::2021-08-03 06:47:14,439::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:47:25,526::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 814 due to cpu load
> > > MainThread::INFO::2021-08-03 06:47:25,527::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
> > > MainThread::INFO::2021-08-03 06:47:25,537::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:47:26,029::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
> > > MainThread::INFO::2021-08-03 06:47:35,050::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:47:35,576::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
> > > MainThread::INFO::2021-08-03 06:47:45,597::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:47:46,521::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
> > > MainThread::INFO::2021-08-03 06:47:55,577::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2586)
> > > MainThread::INFO::2021-08-03 06:47:56,559::hosted_engine::525::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Best remote host ost-he-basic-suite-master-host-1 (id: 2, score: 3400)
> > > MainThread::INFO::2021-08-03 06:48:05,633::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:48:06,188::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 820 due to cpu load
> > > MainThread::INFO::2021-08-03 06:48:06,188::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2580)
> > > MainThread::INFO::2021-08-03 06:48:16,256::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:48:16,950::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 831 due to cpu load
> > > MainThread::INFO::2021-08-03 06:48:16,951::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2569)
> > > MainThread::INFO::2021-08-03 06:48:26,053::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:48:26,999::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 839 due to cpu load
> > > MainThread::INFO::2021-08-03 06:48:26,999::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2561)
> > > MainThread::INFO::2021-08-03 06:48:36,026::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:48:36,802::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 844 due to cpu load
> > > MainThread::INFO::2021-08-03 06:48:36,802::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2556)
> > > MainThread::INFO::2021-08-03 06:48:45,827::state_decorators::51::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Global maintenance detected
> > > MainThread::INFO::2021-08-03 06:48:46,401::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 849 due to cpu load
> > > MainThread::INFO::2021-08-03 06:48:46,401::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state GlobalMaintenance (score: 2551)
> > > MainThread::INFO::2021-08-03 06:48:56,588::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (GlobalMaintenance-ReinitializeFSM) sent? ignored
> > > MainThread::INFO::2021-08-03 06:48:58,685::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state ReinitializeFSM (score: 0)
> > > MainThread::INFO::2021-08-03 06:49:05,729::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? ignored
> > > MainThread::INFO::2021-08-03 06:49:06,146::states::176::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score) Penalizing score by 830 due to cpu load
> > > MainThread::INFO::2021-08-03 06:49:06,146::states::72::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_penalize_memory) Penalizing score by 400 due to free memory 1782 being lower than required 3171
> > > MainThread::INFO::2021-08-03 06:49:06,146::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineStarting (score: 2170)
> > > MainThread::INFO::2021-08-03 06:49:06,150::state_decorators::95::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(check) Timeout cleared while transitioning <class 'ovirt_hosted_engine_ha.agent.states.EngineStarting'> -> <class 'ovirt_hosted_engine_ha.agent.states.EngineUp'>
> > > MainThread::INFO::2021-08-03 06:49:06,172::brokerlink::73::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (EngineStarting-EngineUp) sent? ignored
> > > MainThread::INFO::2021-08-03 06:49:06,178::hosted_engine::517::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Current state EngineUp (score: 2570)
> > > MainThread::ERROR::2021-08-03 06:49:16,197::states::398::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host ost-he-basic-suite-master-host-1 (id 2) score is significantly better than local score, shutting down VM on this host
> > > =====================================================================================
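
(To make the score arithmetic in the log above easier to follow: the
agent starts from a base score and subtracts penalties. A minimal
sketch of that arithmetic, grounded only in the numbers logged above,
not the actual code in ovirt_hosted_engine_ha/agent/states.py:)

    BASE_SCORE = 3400  # the score logged when no penalty applies

    def score(cpu_penalty, free_mem_mb, required_mem_mb):
        # "Penalizing score by 830 due to cpu load"
        s = BASE_SCORE - cpu_penalty
        # "Penalizing score by 400 due to free memory 1782 being lower
        # than required 3171" - a flat 400 in this log; whether it is
        # always flat is an assumption here.
        if free_mem_mb < required_mem_mb:
            s -= 400
        return s

    # Matches the 06:49:06 entries: 3400 - 830 - 400 == 2170
    assert score(830, 1782, 3171) == 2170

The shutdown decision in the final ERROR entry then follows from
host-1's 3400 being sufficiently above the local 2570.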
> > >  
> > > I looked a bit at /var/log/messages on the host, and while there is
> > > quite some noise there, I can't tell specifically what might have
> > > caused the high cpu load.
> > > 
> > > I also looked at logs of previous runs, and there is indeed a similar
> > > pattern, where the cpu load on the host causes the agent to penalize
> > > the score; but according to the agent log the load goes down faster,
> > > to the point of being quite low by the time the engine is up, and
> > > only then do we exit global maintenance, allowing the agent to take
> > > actions.
> > > 
> > > I now ran it again, but otherwise I think I'll ignore this for now,
> > > unless we see more similar failures. If we do, we might want to
> > > check/monitor/log the cpu load on the hosts, and/or change test_008 to
> > > wait, after the engine is up, until the cpu load on the host goes down
> > > a bit. Before the port to pytest we waited a hard-coded 5 minutes; I
> > > then changed that to only wait until the engine VM is not migrating,
> > > and this worked more or less OK. We might need to refine that,
> > > although I'd rather not introduce another arbitrarily long delay, but
> > > instead wait for some condition.
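
Something like the following is what I have in mind, if we go the
wait-for-a-condition route (just a sketch; host_cpu_load() is a
hypothetical helper, not an existing OST function):

    import time

    def wait_for(condition, timeout=600, interval=10):
        # Poll a predicate instead of sleeping a hard-coded 5 minutes.
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if condition():
                return
            time.sleep(interval)
        raise TimeoutError('condition not met within %s seconds' % timeout)

    # Hypothetical usage in test_008, once the engine VM is up:
    # wait_for(lambda: host_cpu_load() < 0.4)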
> > > 
> > > I also noticed that this run was started at "03:04 AM" (UTC), by
> > > timer, whereas all previous timer-based runs started at 01:55 AM,
> > > somewhat earlier - perhaps there are other things running at that
> > > time that cause load.
> > > 
> > > 
> > > Checked some more and found another difference: the total cpu use was
> > > not very different between the failed and the last successful runs,
> > > but the load caused by the engine vm was different between them - in
> > > the failed run, it stayed at '0.00' for a longer time than in the
> > > successful run. Since the agent tries to ignore the load caused by
> > > the engine vm, the final result is significantly different. The
> > > load-by-engine-vm is calculated based on the result of a VDSM
> > > getStats call on the engine VM. Indeed, in vdsm.log, cpuUser and
> > > cpuSys both start at '0.00' in both runs, but the failed one stays at
> > > '0.00' for longer - 45 seconds between VM.create and the first
> > > non-zero cpuUser in the successful run, vs. 93 seconds in the failed
> > > one. Does this make sense (from VDSM's POV)? If so, perhaps there is
> > > a way to get from VDSM the answer to this question: "Is this 0.00
> > > CPU usage a result of still-initial, non-accurate, non-complete
> > > monitoring? Or is it really not using the CPU at all?" and use it
> > > in the agent somehow.
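
(The 45/93-second numbers above came from eyeballing vdsm.log; a rough
sketch of extracting them mechanically - the timestamp format and the
patterns are assumptions about the log layout, not guaranteed to match
every vdsm version:)

    import re
    from datetime import datetime

    TS = r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})'

    def first_match(path, pattern):
        # Return the timestamp of the first line matching the pattern.
        rx = re.compile(TS + r'.*' + pattern)
        with open(path) as f:
            for line in f:
                m = rx.search(line)
                if m:
                    return datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')
        return None

    created = first_match('vdsm.log', r'VM\.create')
    nonzero = first_match('vdsm.log', r"'cpuUser': '(?!0\.00')")
    if created and nonzero:
        print('delay:', (nonzero - created).total_seconds(), 'seconds')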
> > > 
> > > 
> > > I think it's futile to try to understand that on jenkins.ovirt.org
> > > runs. It's too unpredictable, since it runs in a shared environment
> > > and you can't know what's going on on the same physical host.
> > > 
> > 
> > I understand that I can't get/expect stable performance/load there. That's
> > not my question. I asked about VDSM. I don't know the code there well. I
> > assume that when it starts a VM, it initializes something somewhere with
> > 'cpuUser: 0.00',
> 
> Yes.
> 
> > and once it gets _some_ data, whatever that might be, it starts
> > reporting (based on) this data.
> 
> Yes.  It retrieves the data at 15-second intervals by default, and it
> needs 2 retrievals to compute the value (the difference between the 2
> samples).  It gets the data from libvirt for running VMs.  And then it
> waits until Engine calls getStats, which I believe is also every 15
> seconds.  So under normal circumstances it should take about 15-45
> seconds after the VM starts running in libvirt before the first real
> data is reported to Engine.
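
(Side note, to make the mechanism concrete: with counters sampled like
this, the first report can only be a placeholder. An illustrative
sketch with made-up field names - the real logic lives in vdsm, around
lib/vdsm/virt/vmstats.py:)

    def cpu_user_pct(prev, cur, ncpus):
        # prev/cur: two consecutive samples, each a dict with a
        # 'cpu_user_ns' counter and a 'ts' in seconds (made-up names,
        # standing in for what comes back from libvirt bulk stats).
        if prev is None:
            # Only one sample so far: nothing to diff yet, hence the
            # initial '0.00' values discussed in this thread.
            return 0.0
        elapsed_ns = (cur['ts'] - prev['ts']) * 1e9
        return (100.0 * (cur['cpu_user_ns'] - prev['cpu_user_ns'])
                / (elapsed_ns * ncpus))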
> 
> I can think of the following circumstances in which it could be delayed:
> 
> - The VM is detected as non-responsive in libvirt.  Then the VM is not
> queried for its data this time, nor for the following 40 seconds.
> 
> - The executor gets overfilled.
> 
> - Engine doesn't ask for the data for some reason.
> 
> - There is something very wrong in the system.
> 
> None of these should happen if everything is all right, but sometimes
> not everything is all right, especially when there is something stuck
> in QEMU monitoring (e.g. because of temporary storage problems).
> 
> > My question was whether there is a reliable way to know if '0.00' is
> > real data or initial, not-yet-real data.
> 
> From this data itself, no, but from the accompanying data, probably yes.
> I'd say that if memory stats are present then it's a real value.

That's indeed what I see in the referenced vdsm.log as well. Now pushed:

https://gerrit.ovirt.org/c/ovirt-hosted-engine-ha/+/116145
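
For reviewers, the idea per Milan's heuristic above is roughly this
sketch (the key names are assumptions about the getStats response
shape, and this is not the actual diff - see the gerrit link for the
real change):

    def cpu_value_is_real(vm_stats):
        # A '0.00' cpuUser is trusted as a real measurement only once
        # memory stats accompany it; before that, treat it as the
        # initial placeholder. ('memUsage' as the key is an assumption.)
        if vm_stats.get('cpuUser') != '0.00':
            return True
        return 'memUsage' in vm_stats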

Please review. Thanks!
-- 
Didi
_______________________________________________
Devel mailing list -- devel@ovirt.org
To unsubscribe send an email to devel-leave@ovirt.org
Privacy Statement: https://www.ovirt.org/privacy-policy.html
oVirt Code of Conduct: https://www.ovirt.org/community/about/community-guidelines/
List Archives: https://lists.ovirt.org/archives/list/devel@ovirt.org/message/LP4OFM4H3VYFY3OVHIHSWLI7MT6P4UTY/


