List:       freebsd-hackers
Subject:    Re: ZFS ARC and mmap/page cache coherency question
From:       Paul Koch <paul.koch137 () gmail ! com>
Date:       2016-08-15 5:15:01
Message-ID: 20160815151501.5f5b4a86 () splash ! akips ! com

Just a follow-up to my original post about VM/ZFS ARC coherency. We've
made a few simple changes to our application to get around the
coherency issues, and observed some odd things.

Description of our app:
Very large scale ping/SNMP poller/database (made up of 8 underlying databases:
configuration, event, time-series, common strings, etc).  Each database
contains mmap'ed files of various sizes, ranging from 512 bytes to many
gigabytes.  All of these files are mapped with the MAP_NOSYNC flag.  The
poller updates every page of the mmap'ed data every minute.  We fsync the
mmap'ed data every 10 minutes when the system is mostly idle.  Everything
works fine while the mmap'ed data is in both the VM and ZFS caches.  Every 80
minutes we process a very large amount of cached poller data, which pushes
the mmap'ed data out of the ARC.  The performance of the next 10 minute fsync
then falls off a cliff, causing lots of read/write contention. This is due to
the lack of VM/ZFS ARC coherency.
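
A minimal sketch of the mapping pattern described above, assuming a
hypothetical helper named map_db_file() (not taken from the application).
MAP_NOSYNC is FreeBSD-specific, so the sketch defines it to 0 on other
systems purely so that it compiles there:

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* MAP_NOSYNC is FreeBSD-specific; fall back to no extra flag elsewhere. */
#ifndef MAP_NOSYNC
#define MAP_NOSYNC 0
#endif

/*
 * Hypothetical helper: map an existing, non-empty file read/write and
 * shared.  On success stores the descriptor in *fdp and the length in
 * *lenp and returns the mapping; returns MAP_FAILED on error.
 */
static void *
map_db_file(const char *path, int *fdp, size_t *lenp)
{
	struct stat st;
	void *map;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return (MAP_FAILED);
	if (fstat(fd, &st) != 0 || st.st_size == 0) {
		close(fd);
		return (MAP_FAILED);
	}
	map = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED | MAP_NOSYNC, fd, 0);
	if (map == MAP_FAILED) {
		close(fd);
		return (MAP_FAILED);
	}
	*fdp = fd;
	*lenp = (size_t)st.st_size;
	return (map);
}
```

The periodic flush is then an fsync(fd) on the descriptor returned through
fdp; the mapping itself stays in place between flushes.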

We've changed our sync algorithm to something like:
1. Exclusive lock on the entire database
2. fsync() all the small 512 byte mmap'ed files
3. Write out new copies of all the other mmap'ed files 
    - mprotect
    - write
    - rename
    - mprotect (restore the original protection)
4. Release exclusive lock
5. Signal all database processes so they reopen the database.
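
Step 3 might look roughly like the following for a single file.  This is
a sketch under assumptions, not the application's code: the helper name
snapshot_mapping(), the ".new" temporary suffix, and the choice of
PROT_READ as the protection during the copy are all illustrative, and the
caller is assumed to already hold the exclusive lock from step 1.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes of a mapped region to a new file
 * and rename it over `path`.  Returns 0 on success, -1 on failure.
 * Assumes `map` is page-aligned (as returned by mmap).
 */
static int
snapshot_mapping(void *map, size_t len, const char *path)
{
	char tmp[1024];
	const char *p = map;
	size_t off = 0;
	int fd, n, ok = 0;

	n = snprintf(tmp, sizeof(tmp), "%s.new", path);
	if (n < 0 || n >= (int)sizeof(tmp))
		return (-1);

	/* Make the region read-only while it is copied out. */
	if (mprotect(map, len, PROT_READ) != 0)
		return (-1);

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd >= 0) {
		while (off < len) {
			ssize_t w = write(fd, p + off, len - off);
			if (w <= 0)
				break;
			off += (size_t)w;
		}
		ok = (off == len);
		if (close(fd) != 0)
			ok = 0;
		/* Replace the old file only after a complete copy. */
		if (ok && rename(tmp, path) != 0)
			ok = 0;
		if (!ok)
			unlink(tmp);
	}

	/* Restore write access whether or not the copy succeeded. */
	if (mprotect(map, len, PROT_READ | PROT_WRITE) != 0)
		return (-1);
	return (ok ? 0 : -1);
}
```

After the rename, an existing mapping still refers to the old file, which
is presumably why step 5 has every process reopen the database.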

Our sync now completes in a very predictable manner and is significantly faster.

But we observed some odd things:

1. The rename in step 3 above can be painfully slow for large files. Not sure
what is going on, but we noticed that deleting the same files using
unlink(2) or rm(1) was also painfully slow.  It is much faster to
truncate(2) the large files to zero bytes before calling rename(2) or
unlink(2).  Why is that?

2. We are using both fsync(2) and write(2) in the above sync.  We observed
that order was very important.  If we write/rename the large mmap'ed files
first and then fsync the small 512 byte files, the fsync sits in zio for some
time.  Doing the fsync calls first and then the large write/renames is much
faster.  Not sure what is going on there.
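
For reference, the truncate-first workaround from item 1 can be sketched
as below for the unlink(2) case.  The helper name truncate_then_unlink()
is hypothetical, and the sketch only reproduces the workaround; it does
not explain why it is faster.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: shrink a large file to zero bytes with truncate(2)
 * before removing its directory entry with unlink(2).
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
static int
truncate_then_unlink(const char *path)
{
	if (truncate(path, 0) != 0)
		return (-1);
	return (unlink(path));
}
```

Applying the same idea before rename(2) means truncating the file that is
about to be replaced, so anything opening it between the two calls would
see an empty file; that is only safe while the exclusive lock is held.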

	Paul.
-- 
Paul Koch | Founder | CEO
AKIPS Network Monitor | akips.com
Brisbane, Australia