List:       qemu-discuss
Subject:    Re: How to tell if an emulated aarch64 CPU has stopped doing work?
From:       Alex_Bennée <alex.bennee () linaro ! org>
Date:       2020-06-12 18:46:09
Message-ID: 87tuzgm8b2.fsf () linaro ! org

Dave Bort <dbort-PgRGKqEAcmkAvxtiuMwx3w@public.gmane.org> writes:

> We use qemu (4.0.0, about to flip the switch to 5.0.0) to test our
> aarch64 images, running in linux containers on x86_64 alongside other
> workloads.
>
> We've recently run into issues where it looks like an emulated CPU
> (out of four) sometimes stops making progress for ten or more seconds,
> and we're trying to characterize the problem. When this happens, the
> other emulated CPUs run just fine, though sometimes two will stall out
> at the same time.
>
> Any suggestions for how to tell if an emulated CPU stopped doing work?
> 
> Based on our experiments, the guest-visible clocks and cycle counters
> continue to run when a qemu CPU thread is suspended, so it's hard to
> tell whether the emulation paused, or if our code is spinning with
> interrupts disabled (though evidence is mounting that that's not the
> case). We're adding a bunch more instrumentation to our code, but maybe
> qemu has some features that will help us out.
> 
> I tried to find a way to count the number of TBs executed by an
> emulated core over time, but I didn't see a cheap way to do that with
> the plugin APIs.

It should be pretty cheap to do. You just need to extend the example bb
plugin to take cpu_index into account and do the proper locking to
update the instruction counter in vcpu_tb_exec.
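
A minimal sketch of that idea, not a tested plugin. The counters are
indexed by cpu_index and updated with C11 atomics in place of a lock.
MAX_VCPUS, count_tb, tbs_on, insns_on and the WITH_QEMU_PLUGIN_API build
switch are names made up for this example; only the calls inside the
#ifdef come from qemu-plugin.h.

```c
#include <stdatomic.h>
#include <stdint.h>

#define MAX_VCPUS 64   /* assumed upper bound, hypothetical */

/* One pair of counters per vCPU, zero-initialised. */
static _Atomic uint64_t tb_count[MAX_VCPUS];
static _Atomic uint64_t insn_count[MAX_VCPUS];

/* Record one executed translation block holding n_insns instructions. */
void count_tb(unsigned int cpu_index, uint64_t n_insns)
{
    if (cpu_index >= MAX_VCPUS) {
        return;   /* ignore vCPUs beyond the assumed bound */
    }
    atomic_fetch_add_explicit(&tb_count[cpu_index], 1,
                              memory_order_relaxed);
    atomic_fetch_add_explicit(&insn_count[cpu_index], n_insns,
                              memory_order_relaxed);
}

uint64_t tbs_on(unsigned int cpu_index)
{
    return cpu_index < MAX_VCPUS ? atomic_load(&tb_count[cpu_index]) : 0;
}

uint64_t insns_on(unsigned int cpu_index)
{
    return cpu_index < MAX_VCPUS ? atomic_load(&insn_count[cpu_index]) : 0;
}

#ifdef WITH_QEMU_PLUGIN_API   /* hypothetical switch: build as a plugin */
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

/* Runs each time the block executes; udata carries its size. */
static void vcpu_tb_exec(unsigned int cpu_index, void *udata)
{
    count_tb(cpu_index, (uint64_t)(uintptr_t)udata);
}

/* Runs once per translation; attaches the exec callback to the block. */
static void vcpu_tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
{
    size_t n = qemu_plugin_tb_n_insns(tb);
    qemu_plugin_register_vcpu_tb_exec_cb(tb, vcpu_tb_exec,
                                         QEMU_PLUGIN_CB_NO_REGS,
                                         (void *)(uintptr_t)n);
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_tb_trans_cb(id, vcpu_tb_trans);
    return 0;
}
#endif
```

Sampling tbs_on() and insns_on() for each vCPU at intervals should then
show whether a given core's counts stop advancing while the others keep
going.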

The qemu_plugin_register_vcpu_idle_cb and
qemu_plugin_register_vcpu_resume_cb functions allow you to register
callbacks for every time we exit the main run loop and sleep for
whatever reason. You could even dump the total instruction counts there.
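
As a rough illustration of the idle/resume hooks, the sketch below
records how long each vCPU slept between the two callbacks. note_idle,
note_resume, longest_idle, MAX_VCPUS, WITH_QEMU_PLUGIN_API and the one
second reporting threshold are all made up for this example, and it
assumes both callbacks for a given vCPU run on that vCPU's own thread,
so the per-index slots need no locking.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_VCPUS 64   /* assumed upper bound, hypothetical */

static bool    is_idle[MAX_VCPUS];
static int64_t idle_start_ns[MAX_VCPUS];
static int64_t longest_idle_ns[MAX_VCPUS];

/* Remember when a vCPU left the run loop to sleep. */
void note_idle(unsigned int cpu_index, int64_t now_ns)
{
    if (cpu_index >= MAX_VCPUS) {
        return;
    }
    is_idle[cpu_index] = true;
    idle_start_ns[cpu_index] = now_ns;
}

/* Return the length of the sleep that just ended, or 0 if none. */
int64_t note_resume(unsigned int cpu_index, int64_t now_ns)
{
    if (cpu_index >= MAX_VCPUS || !is_idle[cpu_index]) {
        return 0;
    }
    int64_t slept = now_ns - idle_start_ns[cpu_index];
    is_idle[cpu_index] = false;
    if (slept > longest_idle_ns[cpu_index]) {
        longest_idle_ns[cpu_index] = slept;
    }
    return slept;
}

int64_t longest_idle(unsigned int cpu_index)
{
    return cpu_index < MAX_VCPUS ? longest_idle_ns[cpu_index] : 0;
}

#ifdef WITH_QEMU_PLUGIN_API   /* hypothetical switch: build as a plugin */
#include <inttypes.h>
#include <stdio.h>
#include <time.h>
#include <qemu-plugin.h>

QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

/* Host monotonic clock, since guest-visible clocks keep running. */
static int64_t host_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

static void vcpu_idle(qemu_plugin_id_t id, unsigned int cpu_index)
{
    note_idle(cpu_index, host_now_ns());
}

static void vcpu_resume(qemu_plugin_id_t id, unsigned int cpu_index)
{
    int64_t slept = note_resume(cpu_index, host_now_ns());
    if (slept > 1000000000LL) {   /* arbitrary 1 s threshold */
        char buf[96];
        snprintf(buf, sizeof(buf), "vcpu %u slept %" PRId64 " ms\n",
                 cpu_index, slept / 1000000);
        qemu_plugin_outs(buf);
    }
}

QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                           const qemu_info_t *info,
                                           int argc, char **argv)
{
    qemu_plugin_register_vcpu_idle_cb(id, vcpu_idle);
    qemu_plugin_register_vcpu_resume_cb(id, vcpu_resume);
    return 0;
}
#endif
```

As far as I understand it, these callbacks fire when the vCPU itself
goes to sleep, not when the host deschedules the vCPU thread. A stall
with no matching idle period would therefore point more towards the
host than the guest.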

> 
> We could maybe turn on instruction tracing, but this problem happens
> pretty rarely (<1%), we don't have a repro case yet, and we can't
> really afford the cost of slowing down every test run. There's a decent
> chance that this is caused by an overloaded host, but our host-side
> investigations haven't turned up anything concrete either.
>
> Any advice?
> 
> --dbort
> 

-- 
Alex Bennée

