List:       openbsd-misc
Subject:    Re: pf state-table-induced instability
From:       Gabor LENCSE <lencse () hit ! bme ! hu>
Date:       2023-08-31 14:10:06
Message-ID: 367e2c79-291c-5f94-6cd0-15cb7f13b1eb () hit ! bme ! hu

Dear David,

Thank you very much for all the new information!

I am keeping only those parts that I want to react to.

>> It is not a fundamental issue, but it seems to me that during my tests not
>> only four but five CPU cores were used by IP packet forwarding:
> the packet processing is done in kernel threads (task queues are built
> on threads), and those threads could be scheduled on any cpu. the
> pf purge processing runs in yet another thread.
>
> iirc, the schedule scans down the list of cpus looking for an idle
> one when it needs to run stuff, except to avoid cpu0 if possible.
> this is why you see most of the system time on cpus 1 to 5.

Yes, I can confirm that whenever I looked, CPU00 was not used by the 
system tasks.

However, I remembered that PF was disabled during my stateless tests, so 
I do not think its purge could be what used CPU05. I have now repeated 
the experiment, first disabling PF as follows:

dut# pfctl -d
pf disabled
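
Just as a sanity check (not part of the original run, only how I would 
verify it), the status can also be read back; the "Status:" line of the 
info output should then report "Disabled":

dut# pfctl -si | grep Status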

And I can still see FIVE CPU cores used by system tasks:

load averages:  0.69,  0.29,  0.13                     dut.cntrg 14:41:06
36 processes: 35 idle, 1 on processor                 up 0 days 00:03:46
CPU00 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  8.1% intr, 91.7% idle
CPU01 states:  0.0% user,  0.0% nice, 61.1% sys,  9.5% spin,  9.5% intr, 19.8% idle
CPU02 states:  0.0% user,  0.0% nice, 62.8% sys, 10.9% spin,  8.5% intr, 17.8% idle
CPU03 states:  0.0% user,  0.0% nice, 54.7% sys,  9.1% spin, 10.1% intr, 26.0% idle
CPU04 states:  0.0% user,  0.0% nice, 62.7% sys, 10.2% spin,  9.8% intr, 17.4% idle
CPU05 states:  0.0% user,  0.0% nice, 51.7% sys,  9.1% spin,  7.6% intr, 31.6% idle
CPU06 states:  0.2% user,  0.0% nice,  2.8% sys,  0.8% spin, 10.0% intr, 86.1% idle
CPU07 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.2% intr, 92.6% idle
CPU08 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  8.4% intr, 91.6% idle
CPU09 states:  0.0% user,  0.0% nice,  0.0% sys,  0.0% spin,  9.2% intr, 90.8% idle
CPU10 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin, 10.8% intr, 89.0% idle
CPU11 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  9.2% intr, 90.6% idle
CPU12 states:  0.0% user,  0.0% nice,  0.2% sys,  0.8% spin,  9.2% intr, 89.8% idle
CPU13 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.2% intr, 92.6% idle
CPU14 states:  0.0% user,  0.0% nice,  0.0% sys,  0.8% spin,  9.8% intr, 89.4% idle
CPU15 states:  0.0% user,  0.0% nice,  0.0% sys,  0.2% spin,  7.8% intr, 92.0% idle
Memory: Real: 34M/1546M act/tot Free: 122G Cache: 807M Swap: 0K/256M

I suspect that top shows an average over a time window of a few seconds, 
and that one of the cores from CPU01 to CPU04 is sometimes skipped (e.g. 
because it was used by the "top" command itself?); that would explain why 
I can see system load on CPU05. (There is even a small amount of system 
load on CPU06.)
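
A possible way to reduce this averaging effect -- just a suggestion, I 
have not re-run the experiment this way -- would be to shorten top's 
update interval and include the system processes:

dut# top -S -s 1

Here -S shows system processes and -s 1 sets a 1-second refresh.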


>> *Is there any way to completely delete its entire content?*
> hrm.
>
> so i just read the code again. "pfctl -F states" goes through the whole
> state table and unlinks the states from the red-black trees used for
> packet processing, and then marks them as unlinked so the purge process
> can immediately claim them as soon as they're scanned. this means that
> in terms of packet processing the tree is empty. the memory (which is
> what the state limit applies to) won't be reclaimed until the purge
> processing takes them.
>
> if you just wait 10 or so seconds after "pfctl -F states" then both the
> tree and state limits should be back to 0. you can watch pfctl -si,
> "systat pf", or the pfstate row in "systat pool" to confirm.
>
> you can change the scan interval with "set timeout interval" in pf.conf
> from 10s. no one fiddles with that though, so i'd put it back between
> runs to be representative of real world performance.
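
If I follow you correctly, the between-runs reset you suggest is roughly 
the following (only a sketch on my part; it assumes the default 10 s scan 
interval is restored in /etc/pf.conf with "set timeout interval 10"):

dut# pfctl -f /etc/pf.conf   # reload the ruleset with the default interval
dut# pfctl -F states         # unlink all states from the trees
dut# sleep 15                # give the purge scan time to reclaim the memory
dut# pfctl -si               # "current entries" under "State Table" should be 0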

I usually wait 10 s between the consecutive steps of the binary search of 
my measurements to give the system a chance to relax (trying to ensure 
that the steps are independent measurements). However, the timeout 
interval of PF was set to 1 hour (using "set timeout interval 3600"). 
You may ask why.

To have some well-defined performance metrics, and to define repeatable 
and reproducible measurements, we use the following tests:
- maximum connection establishment rate (during this test every test 
frame results in a new connection)
- throughput with bidirectional traffic as required by RFC 2544 (during 
this test no test frame results in a new connection, nor does any 
connection time out -- a sufficiently high timeout guarantees this)
- connection tear-down performance (first loading N connections, then 
deleting all of them in a single step and measuring the execution time of 
the deletion: connection tear-down rate = N / deletion time of N 
connections; see the illustrative example below)
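
(As a purely illustrative example with made-up numbers: if loading 
N = 4,000,000 connections and then deleting all of them takes 8 seconds, 
the connection tear-down rate is 4,000,000 / 8 s = 500,000 connections 
per second.)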

It is a good question how well the above performance metrics can 
represent the "real world" performance of a stateful NAT64 implementation!

If you are interested (and have time), I would be happy to work together 
with you in this area. We could publish a joint paper, etc. Please let 
me know if you are open to that.

The focus of my current measurements is only to test and demonstrate 
whether using multiple IP addresses in stateless (RFC 2544) and stateful 
benchmarking measurements makes a difference. And it definitely seems to 
be so. :-)

(Now I will try to carry out the missing measurements and finish my paper 
about the extension of siitperf with pseudorandom IP addresses ASAP.)

Best regards,

Gábor


