[prev in list] [next in list] [prev in thread] [next in thread] 

List:       mesos-issues
Subject:    [jira] [Created] (MESOS-9867) Libevent fd cleanup failure may cause hangs in subsequent tests
From:       "Benno Evers (JIRA)" <jira () apache ! org>
Date:       2019-06-26 23:54:00
Message-ID: JIRA.13241812.1561593225000.548087.1561593240058 () Atlassian ! JIRA
[Download RAW message or body]

Benno Evers created MESOS-9867:
----------------------------------

             Summary: Libevent fd cleanup failure may cause hangs in subsequent tests
                 Key: MESOS-9867
                 URL: https://issues.apache.org/jira/browse/MESOS-9867
             Project: Mesos
          Issue Type: Bug
            Reporter: Benno Evers


A listening LibeventSSLSocket will check cryptographic certificate validity during \
the OpenSSL handshake and afterwards call the `openssl::verify()` function to perform \
hostname validation and other checks on the client certificate. If these checks fail, \
the bufferevent is deleted and the connection closed: {noformat}
// libevent_ssl_socket.cpp, accept_SSL_callback()
          if (verify.isError()) {
            VLOG(1) << "Failed accept, verification error: " << verify.error();
            request->promise.fail(verify.error());
            SSL_free(ssl);
            bufferevent_free(bev);
            // TODO(jmlvanre): Clean up for readability. Consider RAII
            // or constructing the impl earlier.
            CHECK(request->socket >= 0);
            Try<Nothing> close = os::close(request->socket);
            if (close.isError()) {
              LOG(FATAL)
                << "Failed to close socket " << stringify(request->socket)
                << ": " << close.error();
            }
            delete request;
            return;
          }
{noformat}

However, when we close the socket fd in the above code, libevent had already \
registered that file descriptor with epoll() to watch for read and write events on \
that socket. Since the socket is closed, attempts to remove the corresponding fd from \
the epoll() structs will fail: (See also: \
https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/) {noformat}
[warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 (del); \
write change was 0 (none): Bad file descriptor [warn] Epoll MOD(1) on fd 9 failed.  \
Old events were 6; read change was 0 (none); write change was 2 (del): Bad file \
descriptor {noformat}

However, that in itself is harmless since the kernel will remove the kernel object \
that was associated with fd 9 from the data structure associated with that epoll \
instance in the kernel. So while we get an error attempting to remove fd 9, there is \
actually nothing left to remove. However, in a case of epoll failure, libprocess does \
not adjust the number of readers and writers on that file descriptor: {noformat}
        // evmap.c, evmap_io_del()
        [...]
        if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
               return (-1);

        [...]
        ctx->nread = nread;
        ctx->nwrite = nwrite;
{noformat}

In the above, ctx is part of an array collecting information for each file \
descriptor. That still wouldn't be so bad, however libevent also only adds file \
descriptors to `epoll()` struct the *first* time we attempt to create a read or write \
event on that file descriptor:

{noformat}
        // evmap.c, evmap_io_del()
        if (ev->ev_events & EV_READ) {
                if (++nread == 1)
                        res |= EV_READ;
        }
        if (ev->ev_events & EV_WRITE) {
                if (++nwrite == 1)
                        res |= EV_WRITE;
        }

        [...]

        if (res) {
                [...]
                if (evsel->add(base, ev->ev_fd,
                        old, (ev->ev_events & EV_ET) | res, extra) == -1)
                        return (-1);
                [...]
        }
{noformat}

So when the same file descriptor is attempted to be used again by libevent for \
epoll() polling, the process will hang because reads or writes to that file \
descriptor are never noticed.

This can be reproduced for example by running a test where the `verify()`-callback \
fails on the server side twice in a row: (note, the LIBPROCESS_IP below is set in \
order to induce a test failure, result may vary on your local network and ssl \
configuration) {noformat}
LIBPROCESS_IP=127.0.1.1 ./libprocess-tests --gtest_filter="*VerifyCertificate*" \
--gtest_repeat=2 {noformat}

There is a chance that the issue described here is the same as the ominous "issues" \
described in MESOS-3008, 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic