
List:       linux-s390
Subject:    Re: SMC-R throughput drops for specific message sizes
From:       "Goerlitz Andreas (SO/PAF1-Mb)" <Andreas.Goerlitz () de ! bosch ! com>
Date:       2024-03-28 12:18:39
Message-ID: GV2PR10MB8037B9F99338C2A59F26336FBB3B2 () GV2PR10MB8037 ! EURPRD10 ! PROD ! OUTLOOK ! COM

Hello Wen Gu and community,

Our group performed more experiments with SMC-R. The results discussed below
were obtained on two PCs with Mellanox ConnectX-5 (mlx5) adapters, using the
following configuration:

Kernel 6.5.0-25-generic
MTU 9000
net.smc.wmem = $((256*1024))
net.smc.rmem = $((256*1024))
net.smc.autocorking_size = 65536
net.smc.smcr_buf_type = 1
Bandwidth ~ 3.2GB/s (25.0 Gbit/s)

We modified your server.c (consumer) and client.c (producer) to estimate the
throughput and observed that the consumer's "msgsize" appears to be the main
cause of the throughput drops, as shown below.

Good cases (server/consumer msgsize <= RMBE/2):
-----------------------------------------------
server:  smc_run ./server -p 12345 -m $((128*1024))
client:  smc_run ./client -i 192.168.0.2 -p 12345 -m $((128*1024)) -c 1000
         Sent 261881856 bytes in 82224.819000 us [3.184939 GB/s]

server:  smc_run ./server -p 12345 -m $((128*1024))
client:  smc_run ./client -i 192.168.0.2 -p 12345 -m $((256*1024)) -c 1000
         Sent 261881856 bytes in 82097.127000 us [3.189892 GB/s]

 

Bad cases (server/consumer msgsize > RMBE/2):
-----------------------------------------------
server:   smc_run ./server -p 12345 -m $((256*1024))
client:   smc_run ./client -i 192.168.0.2 -p 12345 -m $((128*1024)) -c 1000
          Sent 261881856 bytes in 130970.306000 us [1.999545 GB/s]

server:   smc_run ./server -p 12345 -m $((256*1024))
client:   smc_run ./client -i 192.168.0.2 -p 12345 -m $((256*1024)) -c 1000
          Sent 130940928 bytes in 88172.887000 us [1.485037 GB/s]
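
For reference, the GB/s figures above follow from the byte count and the
elapsed time in microseconds. The helper below is only a sketch of that
conversion; the name throughput_gb_s is hypothetical and does not appear in
the attached sources. Note that client.c divides num_bytes by 1000 as an
integer first, so its printed values can differ from this floating-point
version in the last digits.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of client.c): convert a byte count and
 * an elapsed time in microseconds into GB/s, with 1 GB = 1e9 bytes.
 */
static double throughput_gb_s(uint64_t num_bytes, double elapsed_us)
{
	assert(elapsed_us > 0.0);
	/* bytes per microsecond equals MB/s * 1e0, i.e. 1e-3 GB/s */
	return ((double)num_bytes / elapsed_us) / 1000.0;
}
```

Applied to the first good case (261881856 bytes in 82224.819 us), this gives
roughly 3.18 GB/s.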


Our explanation is that in the "bad cases" the producer and consumer end up
operating in lockstep. The producer sends messages (e.g., msgsize = RMBE on
the producer side) until the consumer's RMBE is full. It is then blocked until
the consumer has processed part of its RMBE and replied with a CDC message. If
the consumer drains the entire RMBE in one call (i.e., msgsize = RMBE on the
consumer side), it is blocked in turn, because nothing is left to process and
it must wait for the producer. We suspect that this unintended synchronization
causes the throughput drops.

To make the consumer process smaller messages, reply to the producer sooner
(via CDC), and still have data left to process (i.e., avoid being blocked),
we cap the value of len at RMBE/2 in smc_rx_recvmsg:

--- a/net/smc/smc_rx.c  2024-03-25 12:31:32.264614422 +0100
+++ b/net/smc/smc_rx.c  2024-03-25 12:22:31.989913322 +0100
@@ -344,7 +344,7 @@
 int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg,
                    struct pipe_inode_info *pipe, size_t len, int flags)
 {
-       size_t copylen, read_done = 0, read_remaining = len;
+       size_t copylen, read_remaining, read_done = 0;
         size_t chunk_len, chunk_off, chunk_len_sum;
         struct smc_connection *conn = &smc->conn;
         int (*func)(struct smc_connection *conn);
@@ -363,6 +363,10 @@
         sk = &smc->sk;
         if (sk->sk_state == SMC_LISTEN)
                 return -ENOTCONN;
+
+       len = min_t(size_t, len, conn->rmb_desc->len / 2);
+       read_remaining = len;
+
         if (flags & MSG_OOB)
                 return smc_rx_recv_urg(smc, msg, len, flags);
         timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
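
The clamp in the patch can be illustrated in user space as follows. This is
only a sketch: cap_recv_len and rmbe_size are hypothetical names, and the
kernel change itself uses min_t() on conn->rmb_desc->len as shown in the diff
above.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical user-space mirror of the clamp in the patch:
 * never request more than half of the receive buffer (RMBE) per call.
 */
static size_t cap_recv_len(size_t len, size_t rmbe_size)
{
	size_t cap = rmbe_size / 2;

	assert(rmbe_size >= 2);
	return len < cap ? len : cap;
}
```

With a 256 KiB RMBE, a 256 KiB request is reduced to 128 KiB, which matches
the consumer msgsize used in the good cases, while smaller requests pass
through unchanged.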

We ran qperf experiments (as before) on the standard SMC-R module [std]
(kernel 6.5.0-25-generic), Wen Gu's proposal [wengu] (i.e., setting
force = true), and our proposal [our] (i.e., capping len at RMBE/2). The
measured throughput is shown in subplots (a) of the attached figures.
Additionally, we traced the following tracepoints:

tracepoint:smc:smc_tx_sendmsg{
   @tx_ret = lhist(args->len,0,262144,16384);
}

tracepoint:smc:smc_rx_recvmsg{
   @rx_ret = lhist(args->len,0,262144,16384);
} 

and calculated the percentage of rx_ret and tx_ret values greater than RMBE/2,
shown in subplots (b) and (c), respectively.
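
One way such a percentage could be derived from the linear histogram buckets
is sketched below. This is an assumption about the post-processing, not the
exact script we used: pct_above is a hypothetical name, bucket i is taken to
cover [i*step, (i+1)*step), and only buckets whose lower bound lies strictly
above the threshold are counted. Values between the threshold and the next
bucket boundary are therefore not counted, so the result is a lower bound.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: percentage of samples that fall into buckets
 * whose lower bound is strictly greater than `threshold`.
 * counts[i] holds the number of samples in [i*step, (i+1)*step).
 */
static double pct_above(const uint64_t *counts, size_t nbuckets,
			uint64_t step, uint64_t threshold)
{
	uint64_t total = 0, above = 0;

	assert(step > 0);
	for (size_t i = 0; i < nbuckets; i++) {
		total += counts[i];
		if ((uint64_t)i * step > threshold)
			above += counts[i];
	}
	/* An empty histogram yields 0 rather than dividing by zero. */
	return total ? 100.0 * (double)above / (double)total : 0.0;
}
```

For the traces above, step would be 16384 and threshold 131072 (RMBE/2 for a
256 KiB RMBE).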

As can be observed, a drop in throughput appears to correlate with rx_ret
being greater than RMBE/2. Our proposal avoids this and achieves full
throughput.

We hope that our analysis and interpretation help to resolve the throughput
drops in SMC-R.

P.S. I would like to acknowledge everyone on our team who contributed to the
analysis of SMC-R (sorted by last name):

Soumyadeep Debnath
Andreas Görlitz
Costin Iordache
Alexandros Nikolaou
Maik Riestock
Ievgen Tatolov



Best regards

Andreas Goerlitz (SO/PAF1-Mb)

Bosch Service Solutions Magdeburg GmbH | Otto-von-Guericke-Str. 13 |
39104 Magdeburg | GERMANY | www.boschservicesolutions.com
Andreas.Goerlitz@de.bosch.com


Registered office: Magdeburg, Registration court: Amtsgericht Stendal, HRB 24039

Management: Robert Mulatz, Georg Wessels


["client.c" (text/x-csrc)]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <errno.h>
#include <netinet/tcp.h>
#include <time.h>

#ifndef AF_SMC
#define AF_SMC          43
#endif
#define NET_PROTOCAL    AF_INET
#define SERV_IP         "11.213.5.33"
#define SERV_PORT       10012

#define BUF_SIZE	(5 * 128 * 1024)

int stream_send(int fd, char *buf, int msgsize)
{
	int n = msgsize;

	while (n) {
		int i = write(fd, buf, n);

		if (i < 0)
			return i;
		buf += i;
		n   -= i;
		if (i == 0)
			break;
	}

	return msgsize-n;
}

int net_clnt(char *ip, int port, int msgsize, int msgcnt)
{
	struct timespec start, end;
	double elapsed_us = 0.0;
	double elapsed_us_sum = 0.0;
	int sent;
	long num_bytes = 0;
	double gb_s;

	if (!ip)
		ip = SERV_IP;
	if (!port)
		port = SERV_PORT;
	int sock = socket(NET_PROTOCAL, SOCK_STREAM, 0);

	struct sockaddr_in s_addr;
	memset(&s_addr, 0, sizeof(s_addr));
	s_addr.sin_family = NET_PROTOCAL;
	s_addr.sin_addr.s_addr = inet_addr(ip);
	s_addr.sin_port = htons(port);
	if (connect(sock, (struct sockaddr*)&s_addr, sizeof(s_addr))){
		printf("connect fail\n");
		return 0;
	}

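	char *buf = (char *)malloc(sizeof(char) * BUF_SIZE);
	if (!buf) {
		printf("malloc fail\n");
		close(sock);
		return 0;
	}
	/* pre-decrement: the loop sends msgcnt - 1 messages */
	while (--msgcnt) {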
		if (msgsize > BUF_SIZE)
			break;
		printf("Send msgsize: %d\n", msgsize);

		clock_gettime(CLOCK_MONOTONIC, &start);
		sent = stream_send(sock, buf, msgsize);
		clock_gettime(CLOCK_MONOTONIC, &end);

		if (sent <= 0) {	/* check the return value, not the send() symbol */
			printf("Error send %d\n", sent);
			break;
		}

		elapsed_us = (end.tv_sec - start.tv_sec)*1000000.0;
		elapsed_us += (end.tv_nsec - start.tv_nsec) / 1000.0;
		elapsed_us_sum += elapsed_us;
		num_bytes += sent;
	}
	close(sock);

	/* (bytes / 1000) / us == GB/s; num_bytes/1000 is an integer division */
	gb_s  = (num_bytes/1000) / elapsed_us_sum;
	printf("Sent %ld bytes in %f us [%f GB/s]\n", num_bytes, elapsed_us_sum, gb_s);

	return 0;
}

int main(int argc, char **argv){
	int msgsize = BUF_SIZE, msgcnt = 10;
	char *ip = NULL;
	bool wrong_param = false;
	int port = 0;
	int c;

	while(!wrong_param &&
	      (-1 != (c = getopt(argc, argv, "i:p:m:c:")))) {
		switch (c) {
			case 'i':
				ip = optarg;
				break;
			case 'p':
				port = atoi(optarg);
				break;
			case 'm':
				msgsize = atoi(optarg);
				break;
			case 'c':
				msgcnt = atoi(optarg);
				break;
			case '?':
				printf("usage: ./client -i <ip> -p <port> -m <msgsize> -c <cnt>\n");
				wrong_param = true;
				break;
		}
	}
	if (!wrong_param)
		net_clnt(ip, port, msgsize, msgcnt);
	return 0;
}


["results_our.png" (image/png)]
["results_std.png" (image/png)]
["results_wengu.png" (image/png)]
["server.c" (text/x-csrc)]

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <errno.h>
#include <stdbool.h>
#include <netinet/tcp.h>
#include <pthread.h>

#ifndef AF_SMC
#define AF_SMC		43
#endif
#define NET_PROTOCAL	AF_INET
#define SERV_IP		INADDR_ANY
#define SERV_PORT	10012

#define BUF_SIZE	(5 * 128 * 1024)

int stream_recv(int fd, char *buf, int msgsize)
{
	int n = msgsize;

	while (n) {
		int i = read(fd, buf, n);

		if (i < 0)
			return i;
		buf += i;
		n   -= i;
		if (i == 0)
			break;
		//printf("Successfully recv %d B message\n", i);
	}
	return msgsize-n;
}
int net_serv(int port, int msgsize)
{
	int recv;

	if (!port)
		port = SERV_PORT;

	int l_sock = socket(NET_PROTOCAL, SOCK_STREAM, 0);

	struct sockaddr_in s_addr;
	memset(&s_addr, 0, sizeof(struct sockaddr_in));
	s_addr.sin_family = NET_PROTOCAL;
	s_addr.sin_addr.s_addr = SERV_IP;
	s_addr.sin_port = htons(port);
	// bind listen socket
	if (bind(l_sock, (struct sockaddr*)&s_addr, sizeof(s_addr))) {
		printf("bind listen socket error %d\n", errno);
		return 0;
	}
	// listen
	if (listen(l_sock, 20)) {
		printf("listen error\n");
		return 0;
	}

	struct sockaddr_in c_addr;
	socklen_t c_addr_len = sizeof(c_addr);
	int s_sock = accept(l_sock, (struct sockaddr*)&c_addr,
			    &c_addr_len);
	if (s_sock < 0) {
		printf("accept fail\n");
		return 0;
	} else {
		char ip[16] = { 0 };
		inet_ntop(NET_PROTOCAL, &(c_addr.sin_addr), ip, INET_ADDRSTRLEN);
		/* sin_port is in network byte order; convert before printing */
		printf("accept connection: ip %s port %d\n",
			ip, ntohs(c_addr.sin_port));
	}

	char *buf = (char *)malloc(sizeof(char) * BUF_SIZE);
	while (1) {
		if (msgsize > BUF_SIZE)
			break;
		//printf("Recv msgsize: %d\n", msgsize);
		recv = stream_recv(s_sock, buf, msgsize);
		if (recv <= 0) {
			if (recv)
				printf("Error recv %d\n", recv);
			break;
		}
	}
	printf("done\n");

	free(buf);
	close(s_sock);
	close(l_sock);
	return 0;
}

int main(int argc, char **argv)
{
	bool wrong_param = false;
	int msgsize = BUF_SIZE;
	int port = 0;
	int c;
	while(!wrong_param &&
	      (-1 != (c = getopt(argc, argv, "p:m:")))) {
		switch (c) {
			case 'p':
				port = atoi(optarg);
				break;
			case 'm':
				msgsize = atoi(optarg);
				break;
			case '?':
				printf("usage: ./server -p <port> -m <msgsize>\n");
				wrong_param = true;
				break;
		}
	}
	if (!wrong_param)
		net_serv(port, msgsize);
	return 0;
}


