
List:       openbsd-tech
Subject:    Re: vmd(8): add MTU feature support to vionet device
From:       Dave Voutila <dave () sisu ! io>
Date:       2021-05-29 20:10:39
Message-ID: 874keljo74.fsf () kogelvis ! sisu ! home


Mike Larkin <mlarkin@nested.page> writes:

> On Mon, May 24, 2021 at 08:25:04AM +0200, Claudio Jeker wrote:
>> On Sun, May 23, 2021 at 10:25:38PM -0400, Dave Voutila wrote:
>> > The following diff adds in virtio 1.1's VIRTIO_NET_F_MTU feature support
>> > to vmd(8)'s virtio networking device. This allows for communicating an MTU
>> > to the guest driver and then enforcing it in the emulated device.
>> >
>> > When the feature is offered, per Virtio v1.1, 5.1.4.1 [1]:
>> >
>> > "The device MUST NOT pass received packets that exceed mtu (plus low
>> > level ethernet header length) size with gso_type NONE or ECN after
>> > VIRTIO_NET_F_MTU has been successfully negotiated."
>> >
>> > (GSO is not supported or negotiated, so it's always NONE. This is
>> > primarily because the vmd vionet device also doesn't support or negotiate
>> > checksum offloading.)
>> >
>> > The prior logic in place simply checked that the packet was of an allowable
>> > size, which meant the largest IP packet (65535) plus an ethernet header.
<snip>
>> >
>> > [1] https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-2000004
<snip>
>>
>> Is it possible to get proper defines for these two options?
>> This VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI + 11 is ugly.
>>
<snip>
>
> Since this is a pci config space access, I've seen Linux use 1, 2, and 4 byte
> accesses. But, yes, we could improve the actual name.
>
> Once dv@ gets this in I'll go back and redo the other devices (since we do a
> similar thing for those as well).
>
> -ml
>

Adjusted to something far less hideous; it also leaves space to define
and migrate other config address offsets.
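
For context: the mtu is a 16-bit little-endian field at device config
space offset 10 (hence VIONET_CONFIG_MTU below), and a guest may fetch
it with one 2-byte access or two 1-byte accesses, which is why the
device code handles both sizes. A hypothetical guest-side read, with
cfg_read1()/cfg_read2() standing in for a driver's real config space
accessors:

	uint16_t mtu;

	/* a single 16-bit access... */
	mtu = cfg_read2(VIONET_CONFIG_MTU);

	/* ...or two 8-bit accesses, reassembled little-endian */
	mtu = cfg_read1(VIONET_CONFIG_MTU) |
	    (cfg_read1(VIONET_CONFIG_MTU + 1) << 8);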

>> From a network perspective, enforcing an MTU of 1500 is not ideal these
>> days. In general devices should support jumbo frames (around 9000 bytes) or
>> at least mini-jumbo frames (something around 1536 bytes). With a hard limit
>> of 1500 bytes, additional encapsulations like vlan(4) won't work properly.
>>
>> Now since these network interfaces depend on tap(4) an option would be to
>> use TUNMRU (16384) as the max limit.
>>
>> --
>> :wq Claudio
>>

Redesigned the diff for mtu feature support after chatting with
Claudio. There are 3 limits defined in the new diff:

- VIONET_HARD_MTU: set to TUNMRU (16384)
- VIONET_MIN_TXLEN: set to ETHER_HDR_LEN
- VIONET_MAX_TXLEN: set to the hard MTU plus space for the Ethernet
  header, CRC, and a vlan encapsulation
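
Plugging in the constants (TUNMRU from <net/if_tun.h>, the Ethernet
lengths from <netinet/if_ether.h>), these work out to (my arithmetic,
not part of the diff):

	VIONET_HARD_MTU  = 16384		/* TUNMRU */
	VIONET_MIN_TXLEN = 14			/* ETHER_HDR_LEN */
	VIONET_MAX_TXLEN = 16384 + 14 + 4 + 4	/* = 16406 bytes */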

This diff also redesigns the receive (RX) side of the virtio net
device: it not only adds the MTU feature, but also adds support for
(optional) mergeable RX buffers on the driver side. vionet_enq_rx can
now work either with longer descriptor chains provided by the driver
or, when mergeable RX buffers are negotiated, with multiple individual
buffers.
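
To illustrate the buffer accounting, here's a toy model (not vmd code;
BUF_SZ, HDR_SZ, and enqueue() are invented for this sketch) of how a
packet spreads across fixed-size buffers and how num_buffers falls out:

	#include <stdio.h>
	#include <string.h>

	#define BUF_SZ	2048	/* pretend each guest rx buffer is a cluster */
	#define HDR_SZ	12	/* virtio_net_hdr incl. num_buffers */

	static int
	enqueue(const char *pkt, size_t sz, char bufs[][BUF_SZ], int nmax)
	{
		size_t written = 0, off, chunk;
		int nbufs = 0;

		while (written < sz) {
			if (nbufs == nmax)
				return (-1);	/* no buffers left: drop */
			/* only the first buffer reserves header room */
			off = (nbufs == 0) ? HDR_SZ : 0;
			chunk = sz - written;
			if (chunk > BUF_SZ - off)
				chunk = BUF_SZ - off;
			memcpy(bufs[nbufs] + off, pkt + written, chunk);
			written += chunk;
			nbufs++;
		}
		/* the header is written last, once the count is known */
		bufs[0][10] = nbufs & 0xff;	/* num_buffers (LE) */
		bufs[0][11] = (nbufs >> 8) & 0xff;
		return (nbufs);
	}

	int
	main(void)
	{
		static char bufs[8][BUF_SZ], pkt[5000];

		/* 5000 bytes -> 2036 + 2048 + 916, so num_buffers = 3 */
		printf("num_buffers = %d\n",
		    enqueue(pkt, sizeof(pkt), bufs, 8));
		return (0);
	}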

?? Why add MRG_RXBUF feature support ??

It allows vmd guests to use a larger mtu *without* code changes to
vio(4). Without mergeable rx buffers, vio(4) limits itself to an mtu of
2024 (MCLBYTES minus some header space) using a 2-descriptor chain per
packet.

The result is that existing OpenBSD guests on a patched vmd host can
bump their mtu to 16000 (non-OpenBSD guests can hit the actual hard
limit of 16384), giving a more efficient way to move data locally even
with the current vio implementation. (Adding MTU feature support to
vio(4) so the mtu can be bumped to 16384 is a separate diff.)
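
For testing, bumping the mtu in an OpenBSD guest is just, e.g.:

	# ifconfig vio0 mtu 16000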

Some *very* unscientific tcpbench output, with the host running
`tcpbench -s` and the guest running `tcpbench -t 60 100.64.1.2`:

No diff and default mtu of 1500:
  3062392368 bytes sent over 61.007 seconds
  bandwidth min/avg/max/std-dev = 131.072/400.376/559.758/126.344 Mbps

No diff and mtu 2024:
  4156106860 bytes sent over 61.026 seconds
  bandwidth min/avg/max/std-dev = 174.244/547.617/744.719/162.611 Mbps

With diff and default mtu 1500:
  3214576872 bytes sent over 61.006 seconds
  bandwidth min/avg/max/std-dev = 128.895/420.861/553.280/128.898 Mbps

With diff and mtu 16000:
  26196124612 bytes sent over 61.001 seconds
  bandwidth min/avg/max/std-dev = 981.601/3427.996/4156.555/870.063 Mbps

The speedups are actually larger on the TX side.

For guest-to-guest testing with an OpenBSD-current guest and a Debian
guest, using iperf3 over an attached virtual switch backed by veb(4):

With both mtus set to 1500:
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-60.01  sec  5.21 GBytes   745 Mbits/sec  sender
[  5]   0.00-60.12  sec  5.20 GBytes   743 Mbits/sec  receiver

With both mtus set to 16000:
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-60.00  sec  12.0 GBytes  1.72 Gbits/sec  sender
[  5]   0.00-60.34  sec  12.0 GBytes  1.71 Gbits/sec  receiver

Note: You don't need to adjust the host-side tap mtu.

I'm not asking for OKs until I get more testing, so if you use vmd
please try out the diff below and see if you notice any regressions,
*especially WITHOUT changing any mtu settings*. If users have to change
the mtu to get performance back, something is definitely wrong with my
design.

-dv

Index: usr.sbin/vmd/virtio.c
===================================================================
RCS file: /cvs/src/usr.sbin/vmd/virtio.c,v
retrieving revision 1.87
diff -u -p -r1.87 virtio.c
--- usr.sbin/vmd/virtio.c	18 May 2021 11:06:43 -0000	1.87
+++ usr.sbin/vmd/virtio.c	29 May 2021 17:39:11 -0000
@@ -60,7 +60,9 @@ int nr_vioblk;

 #define MAXPHYS	(64 * 1024)	/* max raw I/O transfer size */

+#define VIRTIO_NET_F_MTU	(1<<3)
 #define VIRTIO_NET_F_MAC	(1<<5)
+#define VIRTIO_NET_F_MRG_RXBUF	(1<<15)

 #define VMMCI_F_TIMESYNC	(1<<0)
 #define VMMCI_F_ACK		(1<<1)
@@ -1037,14 +1031,34 @@ virtio_net_io(int dir, uint16_t reg, uin
 		}
 	} else {
 		switch (reg) {
-		case VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI:
-		case VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI + 1:
-		case VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI + 2:
-		case VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI + 3:
-		case VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI + 4:
-		case VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI + 5:
-			*data = dev->mac[reg -
-			    VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI];
+		case VIONET_CONFIG_MAC:
+		case VIONET_CONFIG_MAC + 1:
+		case VIONET_CONFIG_MAC + 2:
+		case VIONET_CONFIG_MAC + 3:
+		case VIONET_CONFIG_MAC + 4:
+		case VIONET_CONFIG_MAC + 5:
+			*data = dev->mac[reg - VIONET_CONFIG_MAC];
+			break;
+		case VIONET_CONFIG_MTU:
+			if (sz == 2) {
+				*data = VIONET_HARD_MTU;
+			} else if (sz == 1) {
+				*data &= 0xFF00;
+				*data |= (uint32_t)(VIONET_HARD_MTU) & 0xFF;
+			} else {
+				log_warnx("%s: illegal read of vionet_mtu",
+					__progname);
+			}
+			break;
+		case VIONET_CONFIG_MTU + 1:
+			if (sz == 1) {
+				*data &= 0xFF00;
+				*data |= (uint32_t)(VIONET_HARD_MTU >> 8) &
+				    0xFF;
+			} else {
+				log_warnx("%s: illegal read of vionet_mtu",
+					__progname);
+			}
 			break;
 		case VIRTIO_CONFIG_DEVICE_FEATURES:
 			*data = dev->cfg.device_feature;
@@ -1110,6 +1124,13 @@ vionet_update_qs(struct vionet_dev *dev)
 }

 /*
+ * vionet_enq_rx
+ *
+ * Take a given packet from the host-side tap and copy it into the guest's
+ * buffers utilizing the rx virtio ring. If the packet length is invalid
+ * (too small or too large) or if there are not enough buffers available,
+ * the packet is dropped.
+ *
  * Must be called with dev->mutex acquired.
  */
 int
@@ -1117,24 +1138,26 @@ vionet_enq_rx(struct vionet_dev *dev, ch
 {
 	uint64_t q_gpa;
 	uint32_t vr_sz;
-	uint16_t idx, pkt_desc_idx, hdr_desc_idx;
-	ptrdiff_t off;
-	int ret;
-	char *vr;
-	size_t rem;
+	uint16_t dxx, idx, hdr_desc_idx, chain_hdr_idx;
+	int nbufs = 0, ret = 0;
+	char *vr = NULL;
+	size_t bufsz = 0, off = 0, pkt_offset = 0, chunk_size = 0;
+	size_t chain_len = 0;
 	struct vring_desc *desc, *pkt_desc, *hdr_desc;
 	struct vring_avail *avail;
 	struct vring_used *used;
 	struct vring_used_elem *ue;
 	struct virtio_net_hdr hdr;
+	size_t hdr_sz;

-	ret = 0;
-
-	if (sz < 1 || sz > IP_MAXPACKET + ETHER_HDR_LEN) {
+	if (sz < VIONET_MIN_TXLEN || sz > VIONET_HARD_MTU) {
 		log_warn("%s: invalid packet size", __func__);
 		return (0);
 	}

+	hdr_sz = dev->cfg.guest_feature & VIRTIO_NET_F_MRG_RXBUF ?
+	    sizeof(hdr) : sizeof(hdr) - 2;
+
 	if (!(dev->cfg.device_status & VIRTIO_CONFIG_DEVICE_STATUS_DRIVER_OK))
 		return ret;

@@ -1168,81 +1191,118 @@ vionet_enq_rx(struct vionet_dev *dev, ch
 	hdr_desc_idx = avail->ring[idx] & VIONET_QUEUE_MASK;
 	hdr_desc = &desc[hdr_desc_idx];

-	pkt_desc_idx = hdr_desc->next & VIONET_QUEUE_MASK;
-	pkt_desc = &desc[pkt_desc_idx];
-
-	/* Set up the virtio header (written first, before the packet data) */
-	memset(&hdr, 0, sizeof(struct virtio_net_hdr));
-	hdr.hdr_len = sizeof(struct virtio_net_hdr);
-
-	/* Check size of header descriptor */
-	if (hdr_desc->len < sizeof(struct virtio_net_hdr)) {
-		log_warnx("%s: invalid header descriptor (too small)",
-		    __func__);
-		goto out;
-	}
-
-	/* Write out virtio header */
-	if (write_mem(hdr_desc->addr, &hdr, sizeof(struct virtio_net_hdr))) {
-	    log_warnx("vionet: rx enq header write_mem error @ "
-		    "0x%llx", hdr_desc->addr);
-		goto out;
-	}
-
-	/*
-	 * Compute remaining space in the first (header) descriptor, and
-	 * copy the packet data after if space is available. Otherwise,
-	 * copy to the pkt_desc descriptor.
-	 */
-	rem = hdr_desc->len - sizeof(struct virtio_net_hdr);
+	while (bufsz < sz) {
+		nbufs++;

-	if (rem >= sz) {
-		if (write_mem(hdr_desc->addr + sizeof(struct virtio_net_hdr),
-			pkt, sz)) {
-			log_warnx("vionet: rx enq packet write_mem error @ "
-			    "0x%llx", pkt_desc->addr);
+		/* Take an available buffer index from the ring. */
+		idx = dev->vq[RXQ].last_avail & VIONET_QUEUE_MASK;
+		if ((dev->vq[RXQ].notified_avail & VIONET_QUEUE_MASK) == idx) {
+			log_debug("%s: insufficient available buffer capacity, "
+			    "dropping packet.", __func__);
 			goto out;
 		}
-	} else {
-		/* Fallback to pkt_desc descriptor */
-		if (pkt_desc->len >= sz) {
-			/* Must be not readable */
-			if ((pkt_desc->flags & VRING_DESC_F_WRITE) == 0) {
-				log_warnx("unexpected readable rx desc %d",
-				    pkt_desc_idx);
-				goto out;
+		dxx = avail->ring[idx] & VIONET_QUEUE_MASK;
+		chain_hdr_idx = dxx;
+		chain_len = 0;
+
+		DPRINTF("Working buf %d (bufsz=%lu, sz=%lu)...\n", nbufs, bufsz,
+			sz);
+		/* Process the descriptor and walk any potential chain. */
+		do {
+			off = 0;
+			pkt_desc = &desc[dxx];
+			DPRINTF("...chain entry in buf %d: dxx=%d, len %d\n",
+			    nbufs, dxx, pkt_desc->len);
+			if (!(pkt_desc->flags & VRING_DESC_F_WRITE)) {
+				log_warnx("%s: invalid descriptor, not writable",
+				    __func__);
+				goto out;
 			}

-			/* Write packet to descriptor ring */
-			if (write_mem(pkt_desc->addr, pkt, sz)) {
-				log_warnx("vionet: rx enq packet write_mem "
-				    "error @ 0x%llx", pkt_desc->addr);
+			/* How much data do we get to write? */
+			if (sz - bufsz > pkt_desc->len)
+				chunk_size = pkt_desc->len;
+			else
+				chunk_size = sz - bufsz;
+
+			if (nbufs == 1 && chain_len == 0) {
+				/*
+				 * Special case: the first buffer in a chain
+				 * might be sized by the driver for *just* the
+				 * virtio header.
+				 */
+				off = hdr_sz;
+				if (chunk_size == pkt_desc->len)
+					chunk_size -= off;
+			}
+
+			/* Write a chunk of data if we need to */
+			if (chunk_size && write_mem(pkt_desc->addr + off,
+				pkt + pkt_offset, chunk_size)) {
+				log_warnx("%s: failed to write to buffer "
+				    "0x%llx", __func__, pkt_desc->addr);
 				goto out;
 			}
-		} else {
-			log_warnx("%s: descriptor too small for packet data",
-			    __func__);
+
+			chain_len += chunk_size + off;
+			bufsz += chunk_size;
+			pkt_offset += chunk_size;
+
+			DPRINTF("(nbufs=%d, chain_len=%lu) writing %lu bytes to"
+			    " 0x%llx + off %lu\n", nbufs, chain_len, chunk_size,
+			    pkt_desc->addr, off);
+
+			dxx = pkt_desc->next & VIONET_QUEUE_MASK;
+		} while (bufsz < sz && pkt_desc->flags & VRING_DESC_F_NEXT);
+
+		/* Update the list of used buffers. */
+		ue = &used->ring[(used->idx) & VIONET_QUEUE_MASK];
+		ue->id = chain_hdr_idx;
+		ue->len = chain_len;
+		off = ((char *)ue - vr);
+		if (write_mem(q_gpa + off, ue, sizeof(*ue))) {
+			log_warnx("%s: error updating rx used ring", __func__);
 			goto out;
 		}
+
+		DPRINTF("offered buffer: ue {id=%d, len=%d }, used idx=%d "
+		    " last=%d, notified=%d (bufsz=%lu, sz=%lu)\n", ue->id,
+		    ue->len, used->idx, dev->vq[RXQ].last_avail,
+		    dev->vq[RXQ].notified_avail, bufsz, sz);
+
+		/* Move our markers in the rings...*/
+		used->idx++;
+		dev->vq[RXQ].last_avail = (dev->vq[RXQ].last_avail + 1) &
+		    VIONET_QUEUE_MASK;
 	}

-	ret = 1;
+	/* Now we can configure the virtio netheader. */
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.hdr_len = hdr_sz;
+	hdr.num_buffers = nbufs;
+
+	/* Prepend the virtio net header in the first buffer. */
+	if (write_mem(hdr_desc->addr, &hdr, hdr_sz)) {
+		log_warnx("vionet: rx enq header write_mem error @ 0x%llx",
+		    hdr_desc->addr);
+		goto out;
+	}
+	DPRINTF("...wrote virtio net hdr @ 0x%llx, hdr_sz=%lu\n",
+	    hdr_desc->addr, hdr_sz);
+
+	/* Update the index field in the used ring */
 	dev->cfg.isr_status = 1;
-	ue = &used->ring[used->idx & VIONET_QUEUE_MASK];
-	ue->id = hdr_desc_idx;
-	ue->len = sz + sizeof(struct virtio_net_hdr);
-	used->idx++;
-	dev->vq[RXQ].last_avail++;
-	*spc = dev->vq[RXQ].notified_avail - dev->vq[RXQ].last_avail;
+	off = (char *)&used->idx - vr;
+	*spc = (dev->vq[RXQ].notified_avail - dev->vq[RXQ].last_avail) &
+	    VIONET_QUEUE_MASK;

-	off = (char *)ue - vr;
-	if (write_mem(q_gpa + off, ue, sizeof *ue))
+	if (write_mem(q_gpa + off, &used->idx, sizeof(used->idx)))
 		log_warnx("vionet: error writing vio ring");
-	else {
-		off = (char *)&used->idx - vr;
-		if (write_mem(q_gpa + off, &used->idx, sizeof used->idx))
-			log_warnx("vionet: error writing vio ring");
-	}
+
+	DPRINTF("%s: finished packet, used->idx = %d, spc=%d\n", __func__,
+	    used->idx, *spc);
+	ret = 1;
+
 out:
 	free(vr);
 	return (ret);
@@ -1259,13 +1319,13 @@ out:
 static int
 vionet_rx(struct vionet_dev *dev)
 {
-	char buf[PAGE_SIZE];
+	char buf[VIONET_MAX_TXLEN];
 	int hasdata, num_enq = 0, spc = 0;
 	struct ether_header *eh;
 	ssize_t sz;

 	do {
-		sz = read(dev->fd, buf, sizeof buf);
+		sz = read(dev->fd, buf, sizeof(buf));
 		if (sz == -1) {
 			/*
 			 * If we get EAGAIN, No data is currently available.
@@ -1274,21 +1334,25 @@ vionet_rx(struct vionet_dev *dev)
 			if (errno != EAGAIN)
 				log_warn("unexpected read error on vionet "
 				    "device");
-		} else if (sz > 0) {
+		} else if (sz > VIONET_MIN_TXLEN) {
 			eh = (struct ether_header *)buf;
-			if (!dev->lockedmac || sz < ETHER_HDR_LEN ||
+			if (!dev->lockedmac ||
 			    ETHER_IS_MULTICAST(eh->ether_dhost) ||
 			    memcmp(eh->ether_dhost, dev->mac,
-			    sizeof(eh->ether_dhost)) == 0)
+				sizeof(eh->ether_dhost)) == 0)
 				num_enq += vionet_enq_rx(dev, buf, sz, &spc);
 		} else if (sz == 0) {
-			log_debug("process_rx: no data");
+			log_debug("%s: no data", __func__);
+			hasdata = 0;
+			break;
+		} else {
+			log_debug("%s: invalid data read from tap", __func__);
 			hasdata = 0;
 			break;
 		}

 		hasdata = fd_hasdata(dev->fd);
-	} while (spc && hasdata);
+	} while (spc > 0 && hasdata);

 	dev->rx_pending = hasdata;
 	return (num_enq);
@@ -1505,11 +1569,11 @@ vionet_notify_tx(struct vionet_dev *dev)
 		/* Remove virtio header descriptor len */
 		pktsz -= hdr_desc->len;

-		/* Only allow buffer len < max IP packet + Ethernet header */
-		if (pktsz > IP_MAXPACKET + ETHER_HDR_LEN) {
+		/* Drop packets violating device MTU-based limits */
+		if (pktsz < VIONET_MIN_TXLEN || pktsz > VIONET_MAX_TXLEN) {
 			log_warnx("%s: invalid packet size %lu", __func__,
 			    pktsz);
-			goto out;
+			goto drop_packet;
 		}
 		pkt = malloc(pktsz);
 		if (pkt == NULL) {
@@ -1590,6 +1654,7 @@ vionet_notify_tx(struct vionet_dev *dev)
 			goto out;
 		}

+	drop_packet:
 		ret = 1;
 		dev->cfg.isr_status = 1;
 		used->ring[used->idx & VIONET_QUEUE_MASK].id = hdr_desc_idx;
@@ -1931,6 +1996,8 @@ virtio_init(struct vmd_vm *vm, int child
 			    sizeof(struct vring_desc) * VIONET_QUEUE_SIZE
 			    + sizeof(uint16_t) * (2 + VIONET_QUEUE_SIZE));
 			vionet[i].vq[RXQ].last_avail = 0;
+			vionet[i].vq[RXQ].notified_avail = 0;
+
 			vionet[i].vq[TXQ].qs = VIONET_QUEUE_SIZE;
 			vionet[i].vq[TXQ].vq_availoffset =
 			    sizeof(struct vring_desc) * VIONET_QUEUE_SIZE;
@@ -1965,6 +2033,12 @@ virtio_init(struct vmd_vm *vm, int child
 				vionet[i].pxeboot = 1;
 			vionet[i].idx = i;
 			vionet[i].pci_id = id;
+
+			/* Allow drivers to identify the MTU */
+			vionet[i].cfg.device_feature |= VIRTIO_NET_F_MTU;
+
+			/* Support driver-mergeable RX buffers */
+			vionet[i].cfg.device_feature |= VIRTIO_NET_F_MRG_RXBUF;

 			log_debug("%s: vm \"%s\" vio%u lladdr %s%s%s%s",
 			    __func__, vcp->vcp_name, i,
Index: usr.sbin/vmd/virtio.h
===================================================================
RCS file: /cvs/src/usr.sbin/vmd/virtio.h,v
retrieving revision 1.38
diff -u -p -r1.38 virtio.h
--- usr.sbin/vmd/virtio.h	21 Apr 2021 18:27:36 -0000	1.38
+++ usr.sbin/vmd/virtio.h	29 May 2021 17:39:11 -0000
@@ -17,6 +17,7 @@
  */

 #include <dev/pv/virtioreg.h>
+#include <net/if_tun.h>

 #define VIRTQUEUE_ALIGN(n)	(((n)+(VIRTIO_PAGE_SIZE-1))&    \
 				    ~(VIRTIO_PAGE_SIZE-1))
@@ -36,6 +37,16 @@
 #define VIONET_QUEUE_SIZE	256
 #define VIONET_QUEUE_MASK	(VIONET_QUEUE_SIZE - 1)

+/* Virtio network device is backed by tap(4), so inherit limits */
+#define VIONET_HARD_MTU		TUNMRU
+#define VIONET_MIN_TXLEN	ETHER_HDR_LEN
+#define VIONET_MAX_TXLEN	(VIONET_HARD_MTU + ETHER_HDR_LEN + \
+				    ETHER_CRC_LEN + ETHER_VLAN_ENCAP_LEN)
+
+/* Virtio PCI config space offsets */
+#define VIONET_CONFIG_MAC	VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI
+#define VIONET_CONFIG_MTU	(VIRTIO_CONFIG_DEVICE_CONFIG_NOMSI + 10)
+
 /* VMM Control Interface shutdown timeout (in seconds) */
 #define VMMCI_TIMEOUT		3
 #define VMMCI_SHUTDOWN_TIMEOUT	120
@@ -226,12 +237,8 @@ struct virtio_net_hdr {
 	uint16_t csum_start;
 	uint16_t csum_offset;

-	/*
-	 * num_buffers is only used if VIRTIO_NET_F_MRG_RXBUF is negotiated.
-	 * vmd(8) doesn't negotiate that, but the field is listed here
-	 * for completeness sake.
-	 */
-/*	uint16_t num_buffers; */
+	/* Only used if VIRTIO_NET_F_MRG_RXBUF is negotiated. */
+	uint16_t num_buffers;
 };

 enum vmmci_cmd {
