List:       e1000-devel
Subject:    Re: [E1000-devel] performance anomaly with two e1000 PT Quad PCI-E
From:       "Brandon Heller" <brandon.heller () gmail ! com>
Date:       2007-06-25 17:33:10
Message-ID: 9230b3630706251033j1935076ame4b68d0a6d21301d@mail.gmail.com

I haven't tried pktgen yet, and using netpipe would require an equally fast
machine to send and receive, which I don't have.  Bandwidth is my main
consideration, not RTT.

I've posted new graphs of default Linux forwarding testing, using a fresh
2.6.21 kernel, to this address:
http://cec.wustl.edu/~bdh4/e1000_tests.xls

To summarize, I spent most of last week trying to make sense of a graph of
max forwarding bandwidth vs. input packet size, for 4 cores.  Forwarding
bandwidth increased linearly (purely CPU-limited) until about 200B, then
alternately rose and fell until about 800B, in a way that is hard to
describe in words.  Looking at the packets per second graph, it is easy to
see the effect of increased cache misses around the 128B and 192B boundaries
- slightly decreased performance.  The Xeon processors on the test system
have a 64B L2 line size.
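
(Back-of-the-envelope: with 64B lines, touching a packet costs
ceil(size/64) line fills, so a 128B packet fits in exactly 2 lines while
a 129B packet spills into a 3rd - a 50% jump in lines touched per packet
right at the boundary, which is consistent with the small dips in the
PPS graph there.)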

The first hunch was that the heuristic used in the e1000 driver to adjust
interrupt rate was causing some of the oddities.  Nope - forcing the
interrupt throttle rate to 70K interrupts/sec/card had no effect, and
looking at the code confirmed that this was to be expected.
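
Roughly, the adaptive heuristic has this shape (a simplified sketch from
my read of the code - the traffic classes are real, but the thresholds
and rates here are illustrative, not the driver's exact constants):

    /* Sketch of an adaptive interrupt-throttle heuristic: classify the
     * traffic seen since the last interrupt, then pick a target rate.
     * Illustrative thresholds only - not the driver's exact code. */
    enum itr_class { LOWEST_LATENCY, LOW_LATENCY, BULK_LATENCY };

    static unsigned int update_itr(unsigned int packets, unsigned int bytes)
    {
        enum itr_class class;

        if (packets == 0)
            return 8000;            /* idle: moderate default rate */
        if (bytes / packets > 1200)
            class = BULK_LATENCY;   /* big packets: favor throughput */
        else if (packets < 5)
            class = LOWEST_LATENCY; /* a trickle: favor latency */
        else
            class = LOW_LATENCY;

        switch (class) {
        case LOWEST_LATENCY:
            return 70000;           /* interrupts/sec */
        case LOW_LATENCY:
            return 20000;
        default:
            return 4000;
        }
    }

As I understand it, a fixed InterruptThrottleRate setting bypasses this
classification entirely, which would explain why forcing 70K changed
nothing.
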
I forgot to reinstall the 7.5.5.1 driver after recompiling the kernel for
NAPI, so the graphs are a mix of 7.3 and 7.5 drivers, but for the important
4-core case the driver version didn't affect the performance.

The second hunch was that a module load-time parameter, copybreak, was
affecting performance.  If I understand it right, for packets smaller than
copybreak, the driver copies the packet into a new, much smaller socket
buffer and immediately recycles the underused full-size (2KB) buffer.  At
smaller packet sizes buffers are consumed more quickly, so this parameter
is a tweak to return buffers to the ring faster and reduce the risk of
running out of them.  The copybreak default is 256, and reducing it to 128
seemed to stabilize the 4-core performance around 256B - but it didn't
help much at larger sizes.  It's not clear to me if the cause was running
out of buffers or simply improved cache performance.
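
The relevant RX-path logic looks roughly like this (paraphrased from my
read of the driver, so details may be off):

    /* Paraphrased from the e1000 RX cleanup path - not the literal
     * driver code.  length is the received packet's size; buffer_info
     * tracks the full-size RX buffer behind this descriptor. */
    if (length < copybreak) {
        struct sk_buff *new_skb =
            netdev_alloc_skb(netdev, length + NET_IP_ALIGN);
        if (new_skb) {
            skb_reserve(new_skb, NET_IP_ALIGN);
            memcpy(new_skb->data, skb->data, length);
            buffer_info->skb = skb; /* recycle the 2KB buffer */
            skb = new_skb;          /* pass the small copy up the stack */
        }
    }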

Then I looked at 2-core performance, with interrupts pinned such that one
core on each socket was used, with the full 4MB cache available to it.  I
expected equally screwy performance with respect to packet size - yet the
graph was perfectly linear!  Nice.  Repeating the test with one CPU showed
the same linear trend, but with about half the performance, as expected.
I also repeated the 2-core test, but with both cores on one socket sharing one
4 MB cache, and the results were linear, but just a tiny bit slower than the
2-core, 2-socket case.  Good - this result implies that front-side bus
saturation is not a bottleneck, and that cache capacity misses are not a big
deal.
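
(For reference, the pinning itself is just a hex CPU mask written to
/proc/irq/<n>/smp_affinity.  A minimal helper, run as root - the IRQ
numbers and masks below are hypothetical, so check /proc/interrupts for
the ones your ports actually landed on:)

    /* pin_irqs.c - write CPU affinity masks for a few IRQs.
     * Hypothetical IRQ numbers; see /proc/interrupts for real ones. */
    #include <stdio.h>

    static int pin_irq(int irq, unsigned int cpu_mask)
    {
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irq);
        f = fopen(path, "w");
        if (!f)
            return -1;
        fprintf(f, "%x\n", cpu_mask);   /* hex bitmask of CPUs */
        return fclose(f);
    }

    int main(void)
    {
        pin_irq(50, 0x1);       /* e.g. eth1 -> core 0 (socket 0) */
        pin_irq(58, 0x4);       /* e.g. eth5 -> core 2 (socket 1) */
        return 0;
    }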

The next test used 3 cores, where 1 core on one socket had 4 ports pinned to
it, and the other socket had 2 cores each with 2 ports pinned.  The trend
was linear until 400B packets, and then tapered off until 700B packets where
line rate was achieved.  The processor load drops from 100% starting at
400B, which may be a clue.

There seem to be two unanswered questions:
(1) What causes the forwarding performance issues with 4 cores?  Where is
the bottleneck, and why is performance non-monotonic?  The fact that for
many packet sizes 2 cores outperform 4 seems like a big issue, and it's not
clear yet if the bottleneck is architectural, in the driver, or in the OS.
(2) Why does CPU load go down after 400B with 3 cores?

For (1), FSB contention, PCI-E bus utilization, DMA engine bandwidth, DRAM
latency, and northbridge buffering all seem like reasonable bottlenecks, but
I'm not sure how they could cause the discontinuous performance graphs.
For (2), I don't have any ideas.

My question to you: what tests/profiling should I do to find answers to (1)
and (2)?

Thanks,
Brandon


On 6/21/07, Ronciak, John <john.ronciak@intel.com> wrote:
>
> Do other tests like pktgen or netpipe show the same kind of results?
> Netpipe is meant for exactly this type of issue, measuring RTT as the
> metric.  It seems odd, but the driver doesn't know one packet size from
> another; the buffers used by the system, as well as the memory
> subsystem, do care.  Other testing might be able to narrow this down for
> you.
>
> Also, did you try that latest 7.5.5.1 driver?
>
> Cheers,
> John
> -----------------------------------------------------------
> "Those who would give up essential Liberty, to purchase a little
> temporary Safety, deserve neither Liberty nor Safety.", Benjamin
> Franklin 1755
>
>
> >-----Original Message-----
> >From: e1000-devel-bounces@lists.sourceforge.net
> >[mailto:e1000-devel-bounces@lists.sourceforge.net] On Behalf
> >Of Brandon Heller
> >Sent: Friday, June 15, 2007 9:13 PM
> >To: e1000-devel@lists.sourceforge.net
> >Subject: [E1000-devel] performance anomaly with two e1000 PT
> >Quad PCI-E cards
> >
> >Hi,
> >
> >I'm trying to characterize the routing performance of a recent Dell
> >PowerEdge 5130 system with the following specs:
> >-two Pro/1000 PT Quad Port PCI-Express Ethernet cards
> >-Intel 5000V motherboard, two sockets, each with a Xeon 5130
> >(4 cores total)
> >-2 GB RAM
> >-2.6.19 SMP kernel
> >-e1000 version 7.2.9-k4
> >-irqbalance daemon disabled
> >-balanced, pinned interrupts (eth1 & eth5 to core0, eth2 &
> >eth6 to core1,
> >etc)
> >
> >The 8 GbE ports are connected to copper ports from custom
> >packet generators
> >built from Radisys ENP2611 IXP2400 boards, two with 3 ports
> >and one with 2
> >ports.  These generators have been verified to send at the rate they're
> >told, and are synchronized to within 20 us by the test system using
> >UDP packets.
> >All tests have packets sent equally through all 8 ports for 5 seconds.
> >
> >The first tests went fine, with a patched kernel, where packets were
> >stolen before they entered the IP stack and were sent to a kernel
> >module that counted and dropped them.  This configuration yielded:
> >-2.4 Gbps for 64B packets
> >-4 Gbps for 128B packets
> >-7 Gbps+ for 256B packets and above.
> >
> >Not bad.  Next up was Linux forwarding with static routes,
> >filled-in ARP
> >entries, and packets leaving the same port they came in on.  The rates
> >come from the TX output counters, and I'm assuming that packets counted
> >as TXed were actually sent out.  Results are:
> >-1.2 Gbps for 64B packets
> >-2.2 Gbps for 128B packets
> >...but then the results get weird.  256B packet forwarding bandwidth
> >rose with increasing input traffic to 3 Gbps, then dropped suddenly to
> >2.4 Gbps.  I zoomed in on this discontinuity: it happens with >= 239B
> >packets, but not with 236B packets.  236B packet bandwidth increases
> >until about 4 Gbps, then flattens off.
> >
> >I've posted a spreadsheet with all the data and graphs to
> >http://cec.wustl.edu/~bdh4/e1000_tests.xls .  The output bandwidth, CPU
> >load, interrupt totals, and PPS are shown vs input bandwidth
> >(0-7 Gbps).  At
> >the discontinuity, the CPU load hits 100% for all cores and
> >the number of
> >interrupts plummets.  The interrupt count falls to just above the
> >1/sec I see in NAPI polling mode, so I have to wonder if there's an
> >overhead to switching between polling and interrupt-driven modes that
> >is triggered by these packet sizes, with no hysteresis.
> >
> >Another weirdness is that for certain packet sizes, one card gets more
> >processing time than the other, consistently.  Is this to be expected?
> >
> >Any ideas?
> >
> >Thanks!
> >
> >-Brandon
> >
>



_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel

