[prev in list] [next in list] [prev in thread] [next in thread] 

List:       mesos-issues
Subject:    [jira] [Commented] (MESOS-9867) Libevent fd cleanup failure may cause hangs in subsequent tests
From:       "Benno Evers (JIRA)" <jira () apache ! org>
Date:       2019-06-27 8:24:00
Message-ID: JIRA.13241812.1561593225000.550054.1561623840068 () Atlassian ! JIRA
[Download RAW message or body]


    [ https://issues.apache.org/jira/browse/MESOS-9867?page=com.atlassian.jira.plugin. \
system.issuetabpanels:comment-tabpanel&focusedCommentId=16873908#comment-16873908 ] 

Benno Evers commented on MESOS-9867:
------------------------------------

This was fixed upstream with libevent-2.1.10: \
https://github.com/libevent/libevent/commit/9b5a527f5bf898250a797dde59cadb4f64e8967a

> Libevent fd cleanup failure may cause hangs in subsequent tests
> ---------------------------------------------------------------
> 
> Key: MESOS-9867
> URL: https://issues.apache.org/jira/browse/MESOS-9867
> Project: Mesos
> Issue Type: Bug
> Reporter: Benno Evers
> Priority: Major
> Labels: libevent, libprocess, ssl, tls
> 
> A listening LibeventSSLSocket will check cryptographic certificate validity during \
> the OpenSSL handshake and afterwards call the `openssl::verify()` function to \
> perform hostname validation and other checks on the client certificate. If these \
> checks fail, the bufferevent is deleted and the connection closed: {noformat}
> // libevent_ssl_socket.cpp, accept_SSL_callback()
> if (verify.isError()) {
> VLOG(1) << "Failed accept, verification error: " << verify.error();
> request->promise.fail(verify.error());
> SSL_free(ssl);
> bufferevent_free(bev);
> // TODO(jmlvanre): Clean up for readability. Consider RAII
> // or constructing the impl earlier.
> CHECK(request->socket >= 0);
> Try<Nothing> close = os::close(request->socket);
> if (close.isError()) {
> LOG(FATAL)
> << "Failed to close socket " << stringify(request->socket)
> << ": " << close.error();
> }
> delete request;
> return;
> }
> {noformat}
> However, when we close the socket fd in the above code, libevent had already \
> registered that file descriptor with epoll() to watch for read and write events on \
> that socket. Since the socket is closed, attempts to remove the corresponding fd \
> from the epoll() structs will fail: (See also: \
> https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/) {noformat}
> [warn] Epoll MOD(4) on fd 9 failed.  Old events were 6; read change was 2 (del); \
> write change was 0 (none): Bad file descriptor [warn] Epoll MOD(1) on fd 9 failed.  \
> Old events were 6; read change was 0 (none); write change was 2 (del): Bad file \
> descriptor {noformat}
> However, that in itself is harmless since the kernel will remove the kernel object \
> that was associated with fd 9 from the data structure associated with that epoll \
> instance in the kernel. So while we get an error attempting to remove fd 9, there \
> is actually nothing left to remove. However, in a case of epoll failure, libprocess \
> does not adjust the number of readers and writers on that file descriptor: \
> {noformat} // evmap.c, evmap_io_del()
> [...]
> if (evsel->del(base, ev->ev_fd, old, res, extra) == -1)
> return (-1);
> [...]
> ctx->nread = nread;
> ctx->nwrite = nwrite;
> {noformat}
> In the above, ctx is part of an array collecting information for each file \
> descriptor. That still wouldn't be so bad, however libevent also only adds file \
> descriptors to `epoll()` struct the *first* time we attempt to create a read or \
> write event on that file descriptor: {noformat}
> // evmap.c, evmap_io_del()
> if (ev->ev_events & EV_READ) {
> if (++nread == 1)
> res |= EV_READ;
> }
> if (ev->ev_events & EV_WRITE) {
> if (++nwrite == 1)
> res |= EV_WRITE;
> }
> [...]
> if (res) {
> [...]
> if (evsel->add(base, ev->ev_fd,
> old, (ev->ev_events & EV_ET) | res, extra) == -1)
> return (-1);
> [...]
> }
> {noformat}
> So when the same file descriptor is attempted to be used again by libevent for \
> epoll() polling, the process will hang because reads or writes to that file \
> descriptor are never noticed. This can be reproduced for example by running a test \
> where the `verify()`-callback fails on the server side twice in a row: (note, the \
> LIBPROCESS_IP below is set in order to induce a test failure, result may vary on \
> your local network and ssl configuration) {noformat}
> LIBPROCESS_IP=127.0.1.1 ./libprocess-tests --gtest_filter="*VerifyCertificate*" \
> --gtest_repeat=2 {noformat}
> There is a chance that the issue described here is the same as the ominous "issues" \
> described in MESOS-3008, 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic