List:       openldap-devel
Subject:    LMDB and fsync failures
From:       Howard Chu <hyc () symas ! com>
Date:       2024-02-09 10:09:04
Message-ID: 7cce887c-c1a1-ecf7-7508-c5abf4eec2a8 () symas ! com

Some of you may remember fsync-gate (https://danluu.com/fsyncgate/), which
exposed a lot of vulnerabilities in other popular DBMSs. Further research
on the topic was published as well:
https://www.usenix.org/conference/atc20/presentation/rebello

I originally discussed this on twitter back in 2020 but wanted to summarize
again here.

As usual with these types of reports, there are a lot of flaws in their
test methodology, which invalidate some of their conclusions.

In particular, I question the validity of the failure scenarios their
CuttleFS simulator produces. Specifically, they claim that multiple systems
exhibit False Failures, where fsync reports a failure but the write
actually (partially) succeeded. In the case of LMDB, where a 1-page
synchronous write is involved, this is just an invalid test.

They assume that the relevant sector that LMDB cares about is successfully
written, but an I/O error occurs on some other sector in the page. And so
while LMDB invalidates the commit in memory, a cache flush and subsequent
page-in will read the updated sector. But in the real world, if there are
hard I/O errors on these other sectors, they will most likely also be
unreadable, and a subsequent page-in will also fail. So at least for LMDB,
there would be no false failure.
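
To make the disagreement concrete, here is a minimal sketch of the two
failure models. It is not CuttleFS or LMDB code, and everything in it is
hypothetical: the names SimulatedPage, page_in and
commit_visible_after_failure, the 512-byte sector and 4096-byte page sizes,
the choice of sector 0 as the sector the database cares about, and sector 3
as the one that fails. The only thing that differs between the two runs is
whether a sector that failed on write can still be read back.

```python
from dataclasses import dataclass, field

SECTOR_SIZE = 512                      # assumed legacy sector size
PAGE_SIZE = 4096                       # assumed page size
SECTORS_PER_PAGE = PAGE_SIZE // SECTOR_SIZE


class ReadError(Exception):
    """Raised when a page-in touches an unreadable sector."""


@dataclass
class SimulatedPage:
    # On-disk contents, one entry per sector.
    sectors: list = field(default_factory=lambda: ["old"] * SECTORS_PER_PAGE)
    # Sectors that hit a hard I/O error during a write.
    bad: set = field(default_factory=set)

    def write(self, new_value, failing_sectors):
        """Write every sector; return False (fsync failure) if any failed."""
        for i in range(SECTORS_PER_PAGE):
            if i in failing_sectors:
                self.bad.add(i)
            else:
                self.sectors[i] = new_value
        return not failing_sectors

    def page_in(self, bad_sectors_readable):
        """Re-read the whole page from disk after a cache flush."""
        if self.bad and not bad_sectors_readable:
            raise ReadError(f"unreadable sectors: {sorted(self.bad)}")
        return list(self.sectors)


def commit_visible_after_failure(bad_sectors_readable):
    """Return True if a commit discarded after fsync failure reappears."""
    page = SimulatedPage()
    ok = page.write("new", failing_sectors={3})
    assert not ok  # fsync reports failure, so the writer discards the commit
    try:
        data = page.page_in(bad_sectors_readable)
    except ReadError:
        return False  # the page-in fails too, nothing new becomes visible
    return data[0] == "new"  # sector 0 stands in for the relevant sector
```

With bad_sectors_readable=True the discarded commit reappears after the
page-in, which is the false failure the paper describes. With
bad_sectors_readable=False the page-in itself fails, so the discarded
commit is never observed.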

The failure modes they're modeling don't reflect reality.

Leaving that issue aside, there's also the point that modern storage
devices are now using 4KB sectors, and still guarantee atomic sector
writes, so the partial success scenario they describe can't even happen.
This is a bunch of academic speculation, with a total absence of real world
modeling to validate the failure scenarios they presented.
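
The arithmetic behind that point can be sketched as follows. The helper
names sectors_touched and partial_outcomes are hypothetical, and the sketch
assumes a sector-aligned write in which each sector independently either
lands completely or not at all.

```python
def sectors_touched(write_size, sector_size, offset=0):
    """Number of device sectors covered by a write of write_size bytes."""
    first = offset // sector_size
    last = (offset + write_size - 1) // sector_size
    return last - first + 1


def partial_outcomes(n_sectors):
    """Outcomes where some, but not all, sectors were written.

    Each sector is assumed to be atomic, so there are 2**n combinations in
    total; subtract the all-written and none-written cases.
    """
    return 2 ** n_sectors - 2


# A 4096-byte page on 512-byte sectors spans 8 sectors, so a partially
# written page is possible in this model.
legacy = sectors_touched(4096, 512)

# The same page on 4096-byte sectors is a single atomic sector write,
# leaving no room for partial success.
modern = sectors_touched(4096, 4096)
```

Under these assumptions partial_outcomes(legacy) is 254 and
partial_outcomes(modern) is 0.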

The other failures they report, on ext4fs with journaled data, are
certainly disturbing. But we always recommend turning that journaling off
with LMDB; it's redundant with LMDB's own COW strategy and harms perf for
no benefit.
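
As a rough way to check for this, the sketch below inspects one line in the
format of /proc/mounts and reports whether an ext4 filesystem is mounted
with data journaling. It reads "journaled data" as ext4's data=journal
mount option, and it assumes that option shows up in the options field when
it is in effect. The function name uses_data_journaling and the sample
device and mount point are hypothetical.

```python
def uses_data_journaling(mounts_line):
    """Return True if an ext4 mount line carries the data=journal option.

    Expects the /proc/mounts field order:
    device mountpoint fstype options dump pass
    """
    fields = mounts_line.split()
    if len(fields) < 4 or fields[2] != "ext4":
        return False
    return "data=journal" in fields[3].split(",")


# Hypothetical sample lines, not taken from a real system.
journaled = "/dev/sdb1 /var/lib/db ext4 rw,relatime,data=journal 0 0"
ordered = "/dev/sdb1 /var/lib/db ext4 rw,relatime 0 0"
```

On a live system the lines would come from reading /proc/mounts. Any mount
flagged by this check is one where the data journaling described above is
still switched on.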

Of course, you don't even need to trust the filesystem, you can just use
LMDB on a raw block device.

-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/

