List:       openldap-devel
Subject:    LMDB and fsync failures
From:       Howard Chu <hyc () symas ! com>
Date:       2024-02-09 10:09:04
Message-ID: 7cce887c-c1a1-ecf7-7508-c5abf4eec2a8 () symas ! com

Some of you may remember fsync-gate (https://danluu.com/fsyncgate/), which
exposed a lot of vulnerabilities in other popular DBMSs. Some further research
on the topic was published as well:
https://www.usenix.org/conference/atc20/presentation/rebello

I originally discussed this on Twitter back in 2020 but wanted to summarize
again here.

As usual with these types of reports, there are a lot of flaws in their test
methodology, which invalidates some of their conclusions.

In particular, I question the validity of the failure scenarios their CuttleFS
simulator produces. Specifically, they claim that multiple systems exhibit
False Failures, where fsync reports a failure but the write actually
(partially) succeeded. In the case of LMDB, where a 1-page synchronous write
is involved, this is just an invalid test.

They assume that the relevant sector that LMDB cares about is successfully
written, but an I/O error occurs on some other sector in the page. And so
while LMDB invalidates the commit in memory, a cache flush and subsequent
page-in will read the updated sector. But in the real world, if there are hard
I/O errors on these other sectors, they will most likely also be unreadable,
and a subsequent page-in will also fail. So at least for LMDB, there would be
no false failure.
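
To make the two competing assumptions concrete, here is a toy model in Python.
It is only a sketch of the argument above, not CuttleFS and not LMDB code: the
Disk class, the 512-byte sector size, the 4096-byte page size, and the choice
of sector 5 as the failing one are all made up for illustration.

```python
SECTOR = 512                      # assumed legacy sector size
PAGE = 4096                       # assumed page size
SECTORS_PER_PAGE = PAGE // SECTOR

class Disk:
    """Toy disk holding a single page; some sectors may be bad."""

    def __init__(self, bad_sectors, bad_sectors_unreadable):
        self.sectors = [b"old"] * SECTORS_PER_PAGE
        self.bad = set(bad_sectors)
        # True  = a sector that fails on write also fails on read (hard error)
        # False = the simulator-style assumption: reads always succeed
        self.bad_unreadable = bad_sectors_unreadable

    def write_page(self, data):
        """Write every sector; return False (a failed sync) if any sector errors."""
        ok = True
        for i in range(SECTORS_PER_PAGE):
            if i in self.bad:
                ok = False
            else:
                self.sectors[i] = data
        return ok

    def read_page(self):
        """Page-in: fails if a bad sector is also unreadable."""
        if self.bad_unreadable and self.bad:
            raise IOError("hard I/O error on page-in")
        return list(self.sectors)

def outcome(bad_sectors_unreadable):
    # Sector 0 stands in for the sector the database cares about;
    # sector 5 is the one that hits the I/O error.
    disk = Disk(bad_sectors={5}, bad_sectors_unreadable=bad_sectors_unreadable)
    committed = disk.write_page(b"new")
    assert not committed          # the application treats the commit as failed
    try:
        page = disk.read_page()
    except IOError:
        return "page-in fails"
    return "false failure" if page[0] == b"new" else "consistent"
```

Under the simulator-style assumption, outcome(False) reports a false failure,
because the new data is visible after a commit that was reported as failed.
Under the hard-error assumption, outcome(True) never gets that far: the
page-in itself fails.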

The failure modes they're modeling don't reflect reality.

Leaving that issue aside, there's also the point that modern storage devices
are now using 4KB sectors, and still guarantee atomic sector writes, so the
partial success scenario they describe can't even happen. This is a bunch of
academic speculation, with a total absence of real world modeling to validate
the failure scenarios they presented.
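
The arithmetic behind that point, as a small Python sketch. The function names
are hypothetical, and it assumes a sector-aligned write plus the atomic
per-sector write guarantee mentioned above:

```python
def sectors_per_write(write_size, sector_size):
    """Number of device sectors that one aligned write of write_size bytes spans."""
    return -(-write_size // sector_size)   # ceiling division

def can_partially_succeed(write_size, sector_size):
    """A write can only be torn if it spans more than one atomic sector."""
    return sectors_per_write(write_size, sector_size) > 1
```

A 4096-byte page on 512-byte sectors spans 8 sectors, so a partial write is at
least conceivable. The same page on a 4096-byte sector spans exactly one, so
it is either written or it isn't.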

The other failures they report, on ext4fs with journaled data, are certainly
disturbing. But we always recommend turning that journaling off with LMDB;
it's redundant with LMDB's own COW strategy and harms perf for no benefit.

Of course, you don't even need to trust the filesystem, you can just use LMDB
on a raw block device.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

