
List:       intel-wired-lan
Subject:    Re: [Intel-wired-lan] Counter spikes in /proc/net/dev for E810-CQDA2 interfaces (ice driver) on kern
From:       Przemek Kitszel <przemyslaw.kitszel () intel ! com>
Date:       2024-01-30 13:25:54
Message-ID: 0ffd1e6a-35ff-4868-a15d-d0f12c5c9720 () intel ! com

On 1/30/24 09:30, Christian Rohmann wrote:
> Hello again Przemek,
> 
> On 11.01.24 16:07, Przemek Kitszel wrote:
>> I plan (and my manager agrees :)) to work on this ~now
>>
>> So far I have found a few bad smells to fix in the related area; I will
>> work with Ubuntu as the main test setup for that too, to increase the
>> chance of a repro.
>>
>> Also, just from the code there is no obvious bug (even though there is
>> about one patch around stats in the 6.1 ... 6.2 range).
>>
>> I would also check the exact Ubuntu kernel sources (not just "upstream"). 
> 
> Were you able to find anything in this regard yet? See my further 
> findings below.
> 

[Ben changes]
Looking at the commit range, this time also for iavf, the only obvious
candidates to look deeper into are still two commits by Ben:
2fd5e433cd26 ("ice: Accumulate HW and Netdev statistics over reset")
288ecf491b16 ("ice: Accumulate ring statistics over reset")
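
For reference, the mechanism those commits touch is the usual
snapshot-and-delta accumulation over narrow HW counters. Below is a
minimal sketch of that pattern (illustrative names and widths, not the
driver's actual code), mainly to show how a stale or zeroed "previous"
snapshot around a reset would surface as one giant bogus delta, i.e.
exactly the kind of multi-TBit/s spike reported here:

#include <stdbool.h>
#include <stdint.h>

#define HW_CNT_WIDTH	40	/* illustrative register width */
#define HW_CNT_MAX	(1ULL << HW_CNT_WIDTH)

struct sw_stat {
	uint64_t prev_raw;	/* last raw HW register value */
	uint64_t total;		/* accumulated 64-bit counter */
	bool prev_loaded;	/* false right after probe/reset */
};

static void stat_update(struct sw_stat *s, uint64_t raw)
{
	if (!s->prev_loaded) {
		/* First read after (re)init: only take a baseline.
		 * If this step is skipped, or prev_raw is zeroed while
		 * the HW register keeps counting, the next delta below
		 * becomes huge.
		 */
		s->prev_raw = raw;
		s->prev_loaded = true;
		return;
	}

	if (raw >= s->prev_raw)
		s->total += raw - s->prev_raw;
	else	/* narrow HW counter wrapped around */
		s->total += (raw + HW_CNT_MAX) - s->prev_raw;

	s->prev_raw = raw;
}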

[split]
After them we went through a major refactor:
6624e780a577 ("ice: split ice_vsi_setup into smaller functions")
After that there are more than 200 commits (for the current Ubuntu 6.2 hwe
kernel), including more than 5 fixes for this split (and many more just
touching code touched by the split), so reverts for the purpose of just
testing are infeasible.

I would check with the code set to prior-to-split, and then to prior-to-Ben-changes.

> 
> 
> 
> 
> On 16.01.24 15:40, Christian Rohmann wrote:
>> One observation that I can contribute to maybe narrow down the issue:
>>
>> Looking at traffic graphs of three different machines (attached to 
>> this email; I can provide them in better resolution, but ML only 
>> allows 90kB),
>> there seems to be a correlation to the number / existence of KVM 
>> virtual machines:
>>
>>  * comp-20 has 29 VMs
>>  * comp-21 has 96 VMs
>>  * comp-24 has 0 VMs (<< !)
>> [...]
> 
> 
>> [...]
>> I have now moved a few VMs to comp-24 to see if the issue starts 
>> occurring on that machine as well.
>> This should only cause some of the mentioned L2 components to now 
>> exist on this machine. The issue did not appear immediately though, 
>> but I keep observing this and may start increasing the VM count and 
>> networking load.
>>
>> Maybe the counter spikes are due to some offloading feature such as 
>> VXLAN?
> 
> 
> 1) With only a few VMs and no churn there were no spikes in the counters 
> over a long period of time.
> 2) I then moved some more VMs to this machine yesterday and soon the 
> spikes to multiple TBit/s happened. See the attached screenshot.
> 
> Some observations:
> 
>   * the spikes happened during or right after live migration of instances

We are still going to propose a full solution for live migration on our
HW (however, I'm not sure that is the problem here); it was already posted
to IWL, and now Jake has picked it up.

>   * the spikes then did not appear for > 12 hours
>   * I believe this relates to either
>   ** the number of linux bridges, tap interfaces or vxlan interfaces
>   ** their churn (creation / deletion) when VMs are spawned / deleted or 
> migrated away
> 
> 
> 
> Please let me know if there is any more input I could provide to help 
> resolve this issue.
> Regards

I did multiple rounds of review of the [Ben changes] mentioned above;
still, there could be some omission (IOW the code looks fine, but I don't
know whether some path is missing). Then there was the huge refactor
[split], which touched that part rather "mechanically".

Testing the scenario with code from prior to the split, and then from
prior to the stats changes, would certainly help; if you could do that it
would be really great. I will do my simpler tests with that too.
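
If it helps to catch the moment it happens, one simple host-side test
could be to sample the rx/tx byte counters of the affected interface
from /proc/net/dev once per second and flag deltas far above line rate.
A rough standalone sketch of that (the interface name and the threshold
below are made-up example values, and this is not driver code):

#include <inttypes.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int read_counters(const char *ifname, uint64_t *rx, uint64_t *tx)
{
	char line[512];
	FILE *f = fopen("/proc/net/dev", "r");

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		char name[64];
		uint64_t v[9];

		/* "  eth0: rx_bytes packets errs drop fifo frame
		 * compressed multicast tx_bytes ..." */
		if (sscanf(line,
			   " %63[^:]: %" SCNu64 " %" SCNu64 " %" SCNu64
			   " %" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64
			   " %" SCNu64 " %" SCNu64,
			   name, &v[0], &v[1], &v[2], &v[3], &v[4], &v[5],
			   &v[6], &v[7], &v[8]) < 10)
			continue;
		if (!strcmp(name, ifname)) {
			*rx = v[0];	/* receive bytes */
			*tx = v[8];	/* transmit bytes */
			fclose(f);
			return 0;
		}
	}
	fclose(f);
	return -1;
}

int main(int argc, char **argv)
{
	const char *ifname = argc > 1 ? argv[1] : "enp1s0f0";
	uint64_t prev_rx = 0, prev_tx = 0, rx, tx;
	/* 125 GB/s is ~1 Tbit/s, far above what the port can do */
	const uint64_t limit = 125ULL * 1000 * 1000 * 1000;
	int have_prev = 0;

	for (;;) {
		if (!read_counters(ifname, &rx, &tx)) {
			if (have_prev &&
			    (rx - prev_rx > limit || tx - prev_tx > limit))
				printf("spike: drx=%" PRIu64 " dtx=%" PRIu64 "\n",
				       rx - prev_rx, tx - prev_tx);
			prev_rx = rx;
			prev_tx = tx;
			have_prev = 1;
		}
		sleep(1);
	}
	return 0;
}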

If there is any BPF program re/init during that migration, it also resets
some stats (and there could be some other bug that manifests like that).

Finally, I'm still unhappy with our resource array reallocation;
thankfully it's not as bad as the OOT flavour of the driver :~|

> 
> 
> Christian

Thank you!
