List: qemu-discuss
Subject: Re: How to tell if an emulated aarch64 CPU has stopped doing work?
From: Alex Bennée <alex.bennee () linaro ! org>
Date: 2020-06-12 18:46:09
Message-ID: 87tuzgm8b2.fsf () linaro ! org
Dave Bort <dbort-PgRGKqEAcmkAvxtiuMwx3w@public.gmane.org> writes:
> We use qemu (4.0.0, about to flip the switch to 5.0.0) to test our aarch64 images,
> running in linux containers on x86_64 alongside other workloads.
> We've recently run into issues where it looks like an emulated CPU (out of four)
> sometimes stops making progress for ten or more seconds, and we're trying to
> characterize the problem. When this happens, the other emulated CPUs run just fine,
> though sometimes two will stall out at the same time.
> Any suggestions for how to tell if an emulated CPU stopped doing work?
>
> Based on our experiments, the guest-visible clocks and cycle counters continue to
> run when a qemu CPU thread is suspended, so it's hard to tell whether the emulation
> paused, or if our code is spinning with interrupts disabled (though evidence is
> mounting that that's not the case). We're adding a bunch more instrumentation to
> our code, but maybe qemu has some features that will help us out.
>
> I tried to find a way to count the number of TBs executed by an
> emulated core over time, but I didn't see a cheap way to do that with
> the plugin APIs.
It should be pretty cheap to do. You just need to extend the example bb
plugin to take cpu_index into account and do the proper locking to
update the instruction counter in vcpu_tb_exec.
The qemu_plugin_register_vcpu_idle_cb and
qemu_plugin_register_vcpu_resume_cb functions allow you to register
callbacks for every time a vCPU exits the main run loop to sleep (for
whatever reason) and for when it resumes. You could even dump the total
instruction counts there.
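Per-vCPU instruction counts plus the idle/resume hooks are enough to separate the cases in the original question. If a vCPU's instruction count keeps rising, guest code is executing (possibly spinning with interrupts disabled); if it is flat and the vCPU is idle, it is asleep in the run loop; if it is flat and the vCPU is not idle, the vCPU thread itself is not making progress (for example, descheduled on an overloaded host, or blocked inside QEMU). The standalone C sketch below models that classification. on_tb_exec, on_vcpu_idle, on_vcpu_resume, sample_vcpu and MAX_VCPUS are hypothetical names for code that would sit inside the per-TB exec callback and the callbacks registered with qemu_plugin_register_vcpu_idle_cb and qemu_plugin_register_vcpu_resume_cb.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_VCPUS 8 /* hypothetical fixed bound for this sketch */

typedef struct {
    uint64_t insns; /* instructions executed so far */
    bool idle;      /* true between the idle and resume callbacks */
} VcpuState;

typedef enum { VCPU_RUNNING, VCPU_IDLE, VCPU_STALLED } VcpuStatus;

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static VcpuState vcpus[MAX_VCPUS];

/* Would run from the per-TB exec callback. */
static void on_tb_exec(unsigned int cpu, uint64_t n_insns)
{
    assert(cpu < MAX_VCPUS);
    pthread_mutex_lock(&state_lock);
    vcpus[cpu].insns += n_insns;
    pthread_mutex_unlock(&state_lock);
}

/* Would run from the callback registered with
 * qemu_plugin_register_vcpu_idle_cb; also dumps the running total
 * (a real plugin would print with qemu_plugin_outs instead). */
static void on_vcpu_idle(unsigned int cpu)
{
    assert(cpu < MAX_VCPUS);
    pthread_mutex_lock(&state_lock);
    vcpus[cpu].idle = true;
    uint64_t total = vcpus[cpu].insns;
    pthread_mutex_unlock(&state_lock);
    fprintf(stderr, "vcpu %u idle after %llu insns\n", cpu,
            (unsigned long long)total);
}

/* Would run from the callback registered with
 * qemu_plugin_register_vcpu_resume_cb. */
static void on_vcpu_resume(unsigned int cpu)
{
    assert(cpu < MAX_VCPUS);
    pthread_mutex_lock(&state_lock);
    vcpus[cpu].idle = false;
    pthread_mutex_unlock(&state_lock);
}

/* Classify what a vCPU did since the previous sample. *prev_insns holds
 * the count seen last time and is updated in place. */
static VcpuStatus sample_vcpu(unsigned int cpu, uint64_t *prev_insns)
{
    assert(cpu < MAX_VCPUS);
    pthread_mutex_lock(&state_lock);
    VcpuState now = vcpus[cpu];
    pthread_mutex_unlock(&state_lock);

    VcpuStatus status;
    if (now.insns != *prev_insns) {
        status = VCPU_RUNNING; /* guest code ran during the interval */
    } else if (now.idle) {
        status = VCPU_IDLE;    /* asleep in the run loop, e.g. after WFI */
    } else {
        status = VCPU_STALLED; /* not idle, yet no instructions executed */
    }
    *prev_insns = now.insns;
    return status;
}
```

Who calls sample_vcpu is left open here; one option (an assumption, not something the plugin API provides for you) is a helper thread started by the plugin that samples every vCPU about once a second, well inside the ten-second stalls described above, keeping one prev_insns value per vCPU.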
>
> We could maybe turn on instruction tracing, but this problem happens pretty rarely
> (<1%), we don't have a repro case yet, and we can't really afford the cost of
> slowing down every test run. There's a decent chance that this is caused by an
> overloaded host, but our host-side investigations haven't turned up anything
> concrete either.
> Any advice?
>
> --dbort
>
--
Alex Bennée