
List:       freebsd-net
Subject:    Re: Request for Testing: TCP RACK
From:       tuexen@freebsd.org
Date:       2024-03-28 15:53:35
Message-ID: <5C9863F7-0F1C-4D02-9F6D-9DDC5FBEB368@freebsd.org>

> On 28. Mar 2024, at 15:00, Nuno Teixeira <eduardo@freebsd.org> wrote:
> 
> Hello all!
> 
> Running rack @b7b78c1c169 "Optimize HPTS..." and very happy with it on my laptop (amd64)!
> 
> Thanks all!
Thanks for the feedback!

Best regards
Michael
> 
> Drew Gallatin <gallatin@freebsd.org> wrote (Thursday, 21/03/2024 at 12:58):
> The entire point is to *NOT* go through the overhead of scheduling something
> asynchronously, but to take advantage of the fact that a user/kernel
> transition is going to trash the cache anyway.
> 
> In the common case of a system which has fewer than the threshold number of
> connections, we access the tcp_hpts_softclock function pointer, make one
> function call, and access hpts_that_need_softclock, and then return. So
> that's two variables and a function call.
> 
> I think it would be preferable to avoid that call, and to move the
> declarations of tcp_hpts_softclock and hpts_that_need_softclock so that they
> are in the same cacheline. Then we'd be hitting just a single line in the
> common case. (I've made comments on the review to that effect.)
> 
> Also, I wonder if the threshold could be raised by default, so that hpts is
> never called in this context unless we're to the point where we're scheduling
> thousands of runs of the hpts thread (and taking all those clock interrupts).
> 
> Drew
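
A minimal sketch of the single-cacheline layout suggested above, with invented
names (struct hpts_userret_state and hpts_userret_hook() are illustrative, not
the patch under review). One caveat, grounded in the reply further down: the
posted patch counts with counter(9), and counter_u64_fetch() walks every CPU's
private counter, so the "single line" property only holds with a plain shared
integer as sketched here:

#include <sys/param.h>
#include <sys/systm.h>
#include <machine/atomic.h>

/* Hypothetical: keep both hot variables on one cacheline. */
struct hpts_userret_state {
	void	(*softclock)(void);	/* stands in for tcp_hpts_softclock */
	u_int	need_softclock;		/* hpts threads over the threshold */
} __aligned(CACHE_LINE_SIZE);

static struct hpts_userret_state hus;

static __inline void
hpts_userret_hook(void)
{
	/* Common case: one cacheline touched, no function call made. */
	if (__predict_true(atomic_load_int(&hus.need_softclock) == 0))
		return;
	if (hus.softclock != NULL)
		(*hus.softclock)();
}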
> 
> On Wed, Mar 20, 2024, at 8:17 PM, Konstantin Belousov wrote:
> > On Tue, Mar 19, 2024 at 06:19:52AM -0400, rrs wrote:
> > > Ok, I have created
> > > 
> > > https://reviews.freebsd.org/D44420
> > > 
> > > to address the issue. I also attach a short version of the patch that
> > > Nuno can try and validate it works. Drew, you may want to try this and
> > > validate the optimization does kick in, since I can only now test that
> > > it does not on my local box :)
> > The patch still causes access to all CPUs' cachelines on each userret.
> > It would be much better to inc/check the threshold and only schedule the
> > call when exceeded.  Then the call can occur in some dedicated context,
> > like a per-CPU thread, instead of userret.
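
A hedged sketch of that scheme, with invented names (hpts_userret_cnt,
hpts_userret_thresh, hpts_deferred) and a stock taskqueue(9) thread standing
in for the "dedicated context": userret only bumps this CPU's counter, and the
real work is handed off once the threshold is crossed:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/pcpu.h>
#include <sys/proc.h>
#include <sys/taskqueue.h>

extern void (*tcp_hpts_softclock)(void);	/* the existing hook */

DPCPU_DEFINE_STATIC(u_int, hpts_userret_cnt);	/* per-CPU, never shared */
DPCPU_DEFINE_STATIC(struct task, hpts_task);	/* TASK_INIT'ed at mod load */
static u_int hpts_userret_thresh = 1000;	/* hypothetical tunable */

static void
hpts_deferred(void *arg __unused, int pending __unused)
{
	/* Runs in a taskqueue thread, not on the userret path. */
	if (tcp_hpts_softclock != NULL)
		tcp_hpts_softclock();
}

static __inline void
hpts_userret_hook(void)
{
	u_int *cnt;

	critical_enter();		/* stay on this CPU's counter */
	cnt = DPCPU_PTR(hpts_userret_cnt);
	if (++*cnt >= hpts_userret_thresh) {
		*cnt = 0;
		taskqueue_enqueue(taskqueue_thread, DPCPU_PTR(hpts_task));
	}
	critical_exit();
}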
> > 
> > > 
> > > 
> > > R
> > > 
> > > 
> > > 
> > > On 3/18/24 3:42 PM, Drew Gallatin wrote:
> > > > No.  The goal is to run on every return to userspace for every thread.
> > > > 
> > > > Drew
> > > > 
> > > > On Mon, Mar 18, 2024, at 3:41 PM, Konstantin Belousov wrote:
> > > > > On Mon, Mar 18, 2024 at 03:13:11PM -0400, Drew Gallatin wrote:
> > > > > > I got the idea from
> > > > > > https://people.mpi-sws.org/~druschel/publications/soft-timers-tocs.pdf
> > > > > > The gist is that the TCP pacing stuff needs to run frequently, and
> > > > > > rather than run it out of a clock interrupt, it's more efficient to run
> > > > > > it out of a system call context at just the point where we return to
> > > > > > userspace and the cache is trashed anyway. The current implementation
> > > > > > is fine for our workload, but probably not ideal for a generic system,
> > > > > > especially one where something is banging on system calls.
> > > > > > 
> > > > > > ASTs could be the right tool for this, but I'm super unfamiliar with
> > > > > > them, and I can't find any docs on them.
> > > > > > 
> > > > > > Would ast_register(0, ASTR_UNCOND, 0, func) be roughly equivalent to
> > > > > > what's happening here?
> > > > > This call would need some AST number added, and then it registers the
> > > > > ast to run on next return to userspace, for the current thread.
> > > > > 
> > > > > Is it enough?
> > > > > > 
> > > > > > Drew
> > > > > 
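
For reference, a hedged sketch of the AST route, built around the
four-argument ast_register() quoted above; TDA_TCPHPTS is hypothetical,
standing in for the AST number that, as noted, would need to be added:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/proc.h>

extern void (*tcp_hpts_softclock)(void);	/* the existing hook */

/* Hypothetical AST slot; a real one would be added in sys/proc.h. */
#define	TDA_TCPHPTS	TDA_MAX

static void
hpts_ast(struct thread *td __unused, int tda __unused)
{
	/* Executes on return to userspace, in the returning thread. */
	if (tcp_hpts_softclock != NULL)
		tcp_hpts_softclock();
}

static void
hpts_ast_init(void *dummy __unused)
{
	/*
	 * ASTR_UNCOND makes the handler run on every userret of every
	 * thread, matching the stated goal; with flags 0 it would run
	 * only for threads explicitly marked via ast_sched().
	 */
	ast_register(TDA_TCPHPTS, ASTR_UNCOND, 0, hpts_ast);
}
SYSINIT(hpts_ast, SI_SUB_INTRINSIC, SI_ORDER_ANY, hpts_ast_init, NULL);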
> > > > > > 
> > > > > > On Mon, Mar 18, 2024, at 2:33 PM, Konstantin Belousov wrote:
> > > > > > > On Mon, Mar 18, 2024 at 07:26:10AM -0500, Mike Karels wrote:
> > > > > > > > On 18 Mar 2024, at 7:04, tuexen@freebsd.org wrote:
> > > > > > > > 
> > > > > > > > > > On 18. Mar 2024, at 12:42, Nuno Teixeira <eduardo@freebsd.org> wrote:
> > > > > > > > > > 
> > > > > > > > > > Hello all!
> > > > > > > > > > 
> > > > > > > > > > It works just fine!
> > > > > > > > > > System performance is OK.
> > > > > > > > > > Using patch on main-n268841-b0aaf8beb126(-dirty).
> > > > > > > > > > 
> > > > > > > > > > ---
> > > > > > > > > > net.inet.tcp.functions_available:
> > > > > > > > > > Stack                           D Alias                            PCB count
> > > > > > > > > > freebsd                           freebsd                          0
> > > > > > > > > > rack                            * rack                             38
> > > > > > > > > > ---
> > > > > > > > > > 
> > > > > > > > > > It would be so nice if we could have a sysctl tunable for this patch
> > > > > > > > > > so we could do more tests without recompiling the kernel.
> > > > > > > > > Thanks for testing!
> > > > > > > > > 
> > > > > > > > > @gallatin: can you come up with a patch that is acceptable for Netflix
> > > > > > > > > and allows mitigating the performance regression?
> > > > > > > > 
> > > > > > > > Ideally, tcphpts could enable this automatically when it starts to be
> > > > > > > > used (enough?), but a sysctl could select auto/on/off.
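
A hedged sketch of such a knob (the sysctl name, default, and helper are
invented, and it assumes it sits in tcp_hpts.c where struct tcp_hpts_entry and
conn_cnt_thresh are visible): 0 = off, 1 = on, 2 = auto, with auto
piggybacking on the existing connection-count threshold:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/sysctl.h>

static int hpts_userret_mode = 2;	/* 0=off, 1=on, 2=auto */
SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, userret_mode, CTLFLAG_RW,
    &hpts_userret_mode, 0,
    "Run hpts from userret: 0=off, 1=on, 2=auto (on once hpts is busy)");

static __inline bool
hpts_userret_enabled(const struct tcp_hpts_entry *hpts)
{
	switch (hpts_userret_mode) {
	case 0:
		return (false);
	case 1:
		return (true);
	default:
		/* auto: piggyback on the existing queue-count threshold */
		return (hpts->p_on_queue_cnt > conn_cnt_thresh);
	}
}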
> > > > > > > There is already a well-known mechanism to request execution of a
> > > > > > > specific function on return to userspace, namely AST.  The difference
> > > > > > > with the current hack is that the execution is requested for one
> > > > > > > callback in the context of a specific thread.
> > > > > > > 
> > > > > > > Still, it might be worth a try to use it; what is the reason to hit a
> > > > > > > thread that does not do networking with TCP processing?
> > > > > > > 
> > > > > > > > 
> > > > > > > > Mike
> > > > > > > > 
> > > > > > > > > Best regards
> > > > > > > > > Michael
> > > > > > > > > > 
> > > > > > > > > > Thanks all!
> > > > > > > > > > Really happy here :)
> > > > > > > > > > 
> > > > > > > > > > Cheers,
> > > > > > > > > > 
> > > > > > > > > > Nuno Teixeira <eduardo@freebsd.org> wrote (Sunday, 17/03/2024 at 20:26):
> > > > > > > > > > > 
> > > > > > > > > > > Hello,
> > > > > > > > > > > 
> > > > > > > > > > > > I don't have the full context, but it seems like the complaint is a
> > > > > > > > > > > > performance regression in bonnie++ and perhaps other things when
> > > > > > > > > > > > tcp_hpts is loaded, even when it is not used.  Is that correct?
> > > > > > > > > > > > 
> > > > > > > > > > > > If so, I suspect it's because we drive the tcp_hpts_softclock()
> > > > > > > > > > > > routine from userret(), in order to avoid tons of timer interrupts
> > > > > > > > > > > > and context switches.  To test this theory, you could apply a patch
> > > > > > > > > > > > like:
> > > > > > > > > > > 
> > > > > > > > > > > It's affecting overall system performance; bonnie was just a way to
> > > > > > > > > > > get some numbers to compare.
> > > > > > > > > > > 
> > > > > > > > > > > Tomorrow I will test patch.
> > > > > > > > > > > 
> > > > > > > > > > > Thanks!
> > > > > > > > > > > 
> > > > > > > > > > > --
> > > > > > > > > > > Nuno Teixeira
> > > > > > > > > > > FreeBSD Committer (ports)
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > --
> > > > > > > > > > Nuno Teixeira
> > > > > > > > > > FreeBSD Committer (ports)
> > > > > > > > 
> > > > > > > 
> > > > > 
> > > > 
> > 
> > > diff --git a/sys/netinet/tcp_hpts.c b/sys/netinet/tcp_hpts.c
> > > index 8c4d2d41a3eb..eadbee19f69c 100644
> > > --- a/sys/netinet/tcp_hpts.c
> > > +++ b/sys/netinet/tcp_hpts.c
> > > @@ -216,6 +216,7 @@ struct tcp_hpts_entry {
> > > void *ie_cookie;
> > > uint16_t p_num; /* The hpts number one per cpu */
> > > uint16_t p_cpu; /* The hpts CPU */
> > > + uint8_t hit_callout_thresh;
> > > /* There is extra space in here */
> > > /* Cache line 0x100 */
> > > struct callout co __aligned(CACHE_LINE_SIZE);
> > > @@ -269,6 +270,11 @@ static struct hpts_domain_info {
> > > int cpu[MAXCPU];
> > > } hpts_domains[MAXMEMDOM];
> > > 
> > > +counter_u64_t hpts_that_need_softclock;
> > > +SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, needsoftclock, CTLFLAG_RD,
> > > +    &hpts_that_need_softclock,
> > > +    "Number of hpts threads that need softclock");
> > > +
> > > counter_u64_t hpts_hopelessly_behind;
> > > 
> > > SYSCTL_COUNTER_U64(_net_inet_tcp_hpts_stats, OID_AUTO, hopeless, CTLFLAG_RD,
> > > @@ -334,7 +340,7 @@ SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, precision, CTLFLAG_RW,
> > > &tcp_hpts_precision, 120,
> > > "Value for PRE() precision of callout");
> > > SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, cnt_thresh, CTLFLAG_RW,
> > > -    &conn_cnt_thresh, 0,
> > > +    &conn_cnt_thresh, DEFAULT_CONNECTION_THESHOLD,
> > > "How many connections (below) make us use the callout based mechanism");
> > > SYSCTL_INT(_net_inet_tcp_hpts, OID_AUTO, logging, CTLFLAG_RW,
> > > &hpts_does_tp_logging, 0,
> > > @@ -1548,6 +1554,9 @@ __tcp_run_hpts(void)
> > > struct tcp_hpts_entry *hpts;
> > > int ticks_ran;
> > > 
> > > + if (counter_u64_fetch(hpts_that_need_softclock) == 0)
> > > + return;
> > > +
> > > hpts = tcp_choose_hpts_to_run();
> > > 
> > > if (hpts->p_hpts_active) {
> > > @@ -1683,6 +1692,13 @@ tcp_hpts_thread(void *ctx)
> > > ticks_ran = tcp_hptsi(hpts, 1);
> > > tv.tv_sec = 0;
> > > tv.tv_usec = hpts->p_hpts_sleep_time * HPTS_TICKS_PER_SLOT;
> > > + if ((hpts->p_on_queue_cnt > conn_cnt_thresh) && (hpts->hit_callout_thresh == 0)) {
> > > + hpts->hit_callout_thresh = 1;
> > > + counter_u64_add(hpts_that_need_softclock, 1);
> > > + } else if ((hpts->p_on_queue_cnt <= conn_cnt_thresh) && (hpts->hit_callout_thresh == 1)) {
> > > + hpts->hit_callout_thresh = 0;
> > > + counter_u64_add(hpts_that_need_softclock, -1);
> > > + }
> > > if (hpts->p_on_queue_cnt >= conn_cnt_thresh) {
> > > if(hpts->p_direct_wake == 0) {
> > > /*
> > > @@ -1818,6 +1834,7 @@ tcp_hpts_mod_load(void)
> > > cpu_top = NULL;
> > > #endif
> > > tcp_pace.rp_num_hptss = ncpus;
> > > + hpts_that_need_softclock = counter_u64_alloc(M_WAITOK);
> > > hpts_hopelessly_behind = counter_u64_alloc(M_WAITOK);
> > > hpts_loops = counter_u64_alloc(M_WAITOK);
> > > back_tosleep = counter_u64_alloc(M_WAITOK);
> > > @@ -2042,6 +2059,7 @@ tcp_hpts_mod_unload(void)
> > > free(tcp_pace.grps, M_TCPHPTS);
> > > #endif
> > > 
> > > + counter_u64_free(hpts_that_need_softclock);
> > > counter_u64_free(hpts_hopelessly_behind);
> > > counter_u64_free(hpts_loops);
> > > counter_u64_free(back_tosleep);
> > 
> > 
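
For anyone testing: given the _net_inet_tcp_hpts_stats node the patch hangs
off, the new counter should surface as net.inet.tcp.hpts.stats.needsoftclock,
next to the existing net.inet.tcp.hpts.cnt_thresh threshold it tracks, e.g.:

    sysctl net.inet.tcp.hpts.stats.needsoftclock net.inet.tcp.hpts.cnt_thresh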
> 
> 
> 
> -- 
> Nuno Teixeira
> FreeBSD Committer (ports)

