[prev in list] [next in list] [prev in thread] [next in thread]
List: mesos-issues
Subject: [jira] [Created] (MESOS-9867) Libevent fd cleanup failure may cause hangs in subsequent tests
From: "Benno Evers (JIRA)" <jira () apache ! org>
Date: 2019-06-26 23:54:00
Message-ID: JIRA.13241812.1561593225000.548087.1561593240058 () Atlassian ! JIRA
[Download RAW message or body]
Benno Evers created MESOS-9867:
----------------------------------
Summary: Libevent fd cleanup failure may cause hangs in subsequent tests
Key: MESOS-9867
URL: https://issues.apache.org/jira/browse/MESOS-9867
Project: Mesos
Issue Type: Bug
Reporter: Benno Evers
A listening LibeventSSLSocket will check cryptographic certificate validity during \
the OpenSSL handshake and afterwards call the `openssl::verify()` function to perform \
hostname validation and other checks on the client certificate. If these checks fail, \
the bufferevent is deleted and the connection closed: {noformat}
// libevent_ssl_socket.cpp, accept_SSL_callback()
if (verify.isError()) {
VLOG(1) << "Failed accept, verification error: " << verify.error();
request->promise.fail(verify.error());
SSL_free(ssl);
bufferevent_free(bev);
// TODO(jmlvanre): Clean up for readability. Consider RAII
// or constructing the impl earlier.
CHECK(request->socket >= 0);
Try<Nothing> close = os::close(request->socket);
if (close.isError()) {
LOG(FATAL)
<< "Failed to close socket " << stringify(request->socket)
<< ": " << close.error();
}
delete request;
return;
}
{noformat}
However, when we close the socket fd in the above code, libevent had already \
registered that file descriptor with epoll() to watch for read and write events on \
that socket. Since the socket is closed, attempts to remove the corresponding fd from \
the epoll() structs will fail: (See also: \
https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/) {noformat}
[warn] Epoll MOD(4) on fd 9 failed. Old events were 6; read change was 2 (del); \
write change was 0 (none): Bad file descriptor [warn] Epoll MOD(1) on fd 9 failed. \
Old events were 6; read change was 0 (none); write change was 2 (del): Bad file \
descriptor {noformat}
However, that in itself is harmless since the kernel will remove the kernel object \
that was associated with fd 9 from the data structure associated with that epoll \
instance in the kernel. So while we get an error attempting to remove fd 9, there is \
actually nothing left to remove. However, in a case of epoll failure, libprocess does \
not adjust the number of readers and writers on that file descriptor: {noformat}
// evmap.c, evmap_io_del()
[...]
if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
return (-1);
[...]
ctx->nread = nread;
ctx->nwrite = nwrite;
{noformat}
In the above, ctx is part of an array collecting information for each file \
descriptor. That still wouldn't be so bad, however libevent also only adds file \
descriptors to `epoll()` struct the *first* time we attempt to create a read or write \
event on that file descriptor:
{noformat}
// evmap.c, evmap_io_del()
if (ev->ev_events & EV_READ) {
if (++nread == 1)
res |= EV_READ;
}
if (ev->ev_events & EV_WRITE) {
if (++nwrite == 1)
res |= EV_WRITE;
}
[...]
if (res) {
[...]
if (evsel->add(base, ev->ev_fd,
old, (ev->ev_events & EV_ET) | res, extra) == -1)
return (-1);
[...]
}
{noformat}
So when the same file descriptor is attempted to be used again by libevent for \
epoll() polling, the process will hang because reads or writes to that file \
descriptor are never noticed.
This can be reproduced for example by running a test where the `verify()`-callback \
fails on the server side twice in a row: (note, the LIBPROCESS_IP below is set in \
order to induce a test failure, result may vary on your local network and ssl \
configuration) {noformat}
LIBPROCESS_IP=127.0.1.1 ./libprocess-tests --gtest_filter="*VerifyCertificate*" \
--gtest_repeat=2 {noformat}
There is a chance that the issue described here is the same as the ominous "issues" \
described in MESOS-3008,
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic