
List:       openldap-technical
Subject:    Re: Storage overhead of LMDB vs SQLite3 for file hierarchy indexing scenario
From:       Howard Chu <hyc () symas ! com>
Date:       2024-05-13 15:59:19
Message-ID: 64f57310-ccc4-34a3-c1da-c20475f49b44 () symas ! com

Jérôme Carretero wrote:
> Hi,
> 
> 
> I wanted to share some thoughts and findings and perhaps collect some
> comments about a small test I did: I used LMDB vs. SQLite3 to perform
> indexing of a file tree.
> 
> In this scenario, we're capturing file paths, along with some metadata
> (size, mtime, and hashes, let's use only sha1 for this example),
> and we want to be able to efficiently walk the tree, as well as resolve
> a file's info by id or by path.
> 
> I'm showing the SQL schema first, as it helps formalize the "NoSQL"
> case:
> 
> CREATE TABLE IF NOT EXISTS files (
>  id INTEGER PRIMARY KEY AUTOINCREMENT,
>  sha1 BLOB,
>  size INTEGER,
>  mtime INTEGER
> );
> 
> CREATE TABLE IF NOT EXISTS pathsegments (
>  id INTEGER PRIMARY KEY AUTOINCREMENT,
>  parent_id INTEGER,
>  file_id INTEGER,
>  segment_name TEXT,
>  FOREIGN KEY(parent_id) REFERENCES pathsegments(id),
>  FOREIGN KEY(file_id) REFERENCES files(id)
> );
> 
> CREATE INDEX IF NOT EXISTS idx_sha1 ON files (sha1);
> 
> CREATE INDEX IF NOT EXISTS idx_fileid2segment ON pathsegments
> (file_id);
> 
> CREATE INDEX IF NOT EXISTS idx_parent_id2segment ON pathsegments
> (parent_id);
> 
> 
> Capturing the full path of each entry as a string was not something I
> wanted, and it also proved wasteful in the few datasets I tested it on.
> So we store path segments, and a full path can be reconstructed: when
> walking the tree, from root to leaves; or, given a file ID whose path
> we want, from leaf to root, using the fileid2segment index and the
> parent information.
> And if files get moved around, we can update the database without
> rewriting many paths.

Yes, that's the correct approach. This is how back-hdb, back-mdb, and back-ndb are structured as well.
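
To make the leaf-to-root resolution concrete, here is a toy in-memory sketch of it
(plain dicts with hypothetical IDs and names, not the actual back-mdb layout):

```python
# Toy model of the pathsegments table; IDs and names are hypothetical.
# segment_id -> (parent_id, name); parent_id None marks the root.
segments = {
    1: (None, "pub"),
    2: (1, "modules"),
    3: (2, "song.mod"),
}
# file_id -> segment_id (what the idx_fileid2segment index provides)
fileid2segment = {42: 3}

def path_of(file_id):
    """Walk leaf-to-root via parent links, then join root-to-leaf."""
    parts = []
    sid = fileid2segment[file_id]
    while sid is not None:
        parent, name = segments[sid]
        parts.append(name)
        sid = parent
    return "/" + "/".join(reversed(parts))
```

A rename or move only touches one segment record; every path below it is
reconstructed correctly on the next walk.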

> With lmdb, as I was conscious that each database would "intrinsically"
> build an index, I packed file_id->{mtime,size,hash} as a fid2attrs
> database and segment_id->{parent_id,name} as sid2dirname, and we still
> have a parent_id->kids DUPSORT sid2kids database for doing an
> asciibetical readdir like the SQL idx_parent_id2segment.
> 
> So we have our 2 tables with 2+3 indexes in SQL, and 4 DBs in lmdb
> (fid2attrs, sid2kids, sid2dirname, hash2fid).

In OpenLDAP, fileID and segmentID are the same thing. And sid2kids / sid2dirname are a single table.
You can read about how that's done in the back-hdb paper https://openldap.org/conf/odd-sfo-2003/proceedings.html
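
The single-table idea can be sketched like this: one DUPSORT-style mapping
from parent ID to sorted (name, childID) pairs serves both the readdir scan
and the name lookup (a toy in-memory sketch with hypothetical IDs, standing
in for LMDB's sorted-duplicate cursors):

```python
import bisect
from collections import defaultdict

# One "children" table replacing sid2kids + sid2dirname:
# parent_id -> sorted list of (name, child_id) pairs.
children = defaultdict(list)

def add_child(parent_id, name, child_id):
    # bisect.insort keeps duplicates sorted, like MDB_DUPSORT does
    bisect.insort(children[parent_id], (name, child_id))

add_child(0, "pub", 1)
add_child(1, "modules", 2)
add_child(1, "incoming", 3)

def readdir(parent_id):
    # asciibetical listing, like a DUPSORT cursor scan over one key
    return [name for name, _ in children[parent_id]]
```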

> I implemented this (in Python) with sqlite and lmdb (using COBS for
> serialization, for simplicity and because it has the nice property of
> keeping serialized varint unsigned integers with preserved ordering),
> and went on to build dbs from a dataset of 474782 small files and 45014
> directories (which I've coincidentally uploaded at
> https://archive.org/download/ftp.modland.com_pub_modules).
> 
> Well, I was surprised that the SQLite3 file weighed 67,149,824
> bytes, while the lmdb one was larger at 82,718,720 bytes.

That shouldn't be surprising, since SQLite3 is an update-in-place design and LMDB
is copy-on-write.

> In particular, looking at database statistics, it seems that the hash
> to id index/DB had unexpected (for me) overhead of ~ 112% to payload
> size with lmdb (and it was using less payload), with a size of 24932352
> bytes for lmdb's DB vs. 15298560 for SQLite's index;

Any normal LMDB record has 8 bytes of overhead per record: a 4-byte value length,
a 2-byte key length, and 2 bytes of flags. Most SQLite3 records have very little
per-record overhead because they're fixed-width columns.
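
Checked against the LMDB stats you quote below, the 8-byte node header accounts
for only part of the reported overhead; the remainder is page fill slack:

```python
# Back-of-the-envelope check of LMDB's fixed per-record cost:
# 4-byte value length + 2-byte key length + 2-byte flags = 8 bytes/record.
entries = 474782                  # entries in the hash2fid DB
header_bytes = entries * 8        # unavoidable node headers: 3,798,256
total_overhead = 13175631         # "Overhead" from the LMDB stats
page_slack = total_overhead - header_bytes  # rest is page fill slack
```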

Since your fid2attrs and hash2id records are fixed size, you can use MDB_DUPFIXED to eliminate
all of the per-record overhead of those tables. (Don't use varints for these, that would prevent using
DUPFIXED and would completely defeat the point.)
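
MDB_DUPFIXED requires every duplicate data item to be exactly the same size, so
the records have to be packed fixed-width. A sketch of that packing (the struct
layout here is illustrative; big-endian keeps memcmp ordering without varints):

```python
import struct

# Fixed-width record packing, as MDB_DUPFIXED needs equal-sized items.
# Big-endian integers preserve byte-wise sort order, so no varints needed.
def pack_attrs(mtime, size, sha1):
    # 8-byte mtime + 8-byte size + 20-byte sha1 = fixed 36 bytes
    return struct.pack(">QQ20s", mtime, size, sha1)

def unpack_attrs(buf):
    return struct.unpack(">QQ20s", buf)

rec = pack_attrs(1715615959, 4096, b"\x00" * 20)
# With py-lmdb such records would go in a DB opened with dupsort=True,
# dupfixed=True (parameters of env.open_db; check your binding's docs).
```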
> 
> SQLite statistics:
> 
>  Percentage of total database......................  22.8%
>  Number of entries................................. 474782
>  Bytes of storage consumed......................... 15298560
>  Bytes of payload.................................. 12311437    80.5%
>  Bytes of metadata................................. 1469162      9.6%
>  B-tree depth...................................... 3
>  Average payload per entry......................... 25.93
>  Average unused bytes per entry.................... 3.20
>  Average metadata per entry........................ 3.09
>  Average fanout.................................... 109.00
>  Non-sequential pages.............................. 2986        80.0%
>  Maximum payload per entry......................... 26
>  Entries that use overflow......................... 0            0.0%
>  Index pages used.................................. 34
>  Primary pages used................................ 3701
>  Overflow pages used............................... 0
>  Total pages used.................................. 3735
>  Unused bytes on index pages....................... 16998       12.2%
>  Unused bytes on primary pages..................... 1500963      9.9%
>  Unused bytes on overflow pages.................... 0
>  Unused bytes on all pages......................... 1517961      9.9%

Looks like SQLite3 uses a much higher page fill factor than textbook B+trees. Normally
for a random-insert case, pages are split when they're full, leaving you two pages
at about 50% utilization each. So on average a B+tree's pages would be 75% full, with
only 25% unused bytes. This is what LMDB does for random inserts.
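
A toy simulation of the split-at-full policy illustrates this; with random keys
the steady-state leaf utilization comes out in the ~65-75% band, well below
SQLite's ~90% fill above (leaf pages only, capacities are arbitrary):

```python
import bisect
import random

random.seed(0)
CAP = 50                       # entries per leaf page (arbitrary)
pages = [[]]                   # ordered list of sorted leaf pages
for _ in range(20000):
    k = random.random()
    # find the page whose key range covers k (last page catches the tail)
    i = next((i for i, p in enumerate(pages) if not p or k <= p[-1]),
             len(pages) - 1)
    bisect.insort(pages[i], k)
    if len(pages[i]) > CAP:    # split at full: two ~50%-full pages
        p = pages[i]
        mid = len(p) // 2
        pages[i:i+1] = [p[:mid], p[mid:]]

fill = sum(len(p) for p in pages) / (CAP * len(pages))
```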

> LMDB: depth=3 branch_pages=66 leaf_pages=6021 overflow_pages=0
>  Keys    size:      9495640
>  Values  size:      2261081
>  Payload size:     11756721
>  Pages   size:     24932352
>  Average key   size:       20.000 (20-20)
>  Average value size:        4.762 (1-5)
>  Average entry size:       24.762
>  Entries number: 474782
>  Overhead    :     13175631
>  Overhead/payload ratio: 1.121
>  Overhead/entry avg:       27.751
> 
> I checked the overhead for in-order insertions and it's still around
> 50% for these key/value sizes, much more than the ~21% SQLite incurs
> for random insertions.
> 
> But even if we were to remove this index, SQLite's file would still be
> tighter...
> 
> 
> That's it!
> 
> 
> Best regards,
> 


-- 
  -- Howard Chu
  CTO, Symas Corp.           http://www.symas.com
  Director, Highland Sun     http://highlandsun.com/hyc/
  Chief Architect, OpenLDAP  http://www.openldap.org/project/