'Re: fsync(2) manual and hdd write caching'


List:       freebsd-hackers
Subject:    Re: fsync(2) manual and hdd write caching
From:       David Schultz <das () freebsd ! org>
Date:       2010-11-30 22:22:02
Message-ID: 20101130222202.GA79001 () zim ! MIT ! EDU

On Thu, Oct 28, 2010, perryh@pluto.rain.com wrote:
> Ivan Voras <ivoras@freebsd.org> wrote:
> 
> > ... The problem is actually pretty hard - since AFAIK SoftUpdates
> > doesn't have "checkpoints" in the sense that it groups writes and
> > all data "before" can be guaranteed to be on-disk, the problem is
> > *when* to issue BIO_FLUSH requests.
> 
> Seems to me the originally-stated problem -- making fsync(2)
> do what it claims to do -- is not hard at all.  Just issue a
> BIO_FLUSH request as the final step in handling fsync(2).

Yes, for correctness, fsync(2) needs to flush the relevant parts
of the disk's volatile write cache before returning.  If it
doesn't, applications like databases can fail if there is a power
loss.
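
For concreteness, the pattern such applications rely on looks roughly
like the sketch below. The helper name write_and_sync() is made up for
illustration; only write(2) and fsync(2) are real interfaces. A zero
return means the kernel has handed the data to the device, which, as
discussed in this thread, may still leave it in the drive's volatile
cache.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes to fd, then call fsync(2).
 * Returns 0 on success, -1 on failure with errno set.
 */
int
write_and_sync(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return (-1);
		}
		p += n;			/* handle short writes */
		len -= (size_t)n;
	}

	/*
	 * Flushes the kernel's buffers for this file. Whether the
	 * drive's own write cache is flushed too is the open question.
	 */
	if (fsync(fd) != 0)
		return (-1);
	return (0);
}
```

A database would call something like this after appending a commit
record and would only acknowledge the transaction once it returns 0.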

Unfortunately, this isn't really practical.  First, performance is
poor: you generally can't flush a particular sector without
flushing the entire write cache, and many disks (including all ATA
disks) don't differentiate between volatile and non-volatile
caches.  Second, many disks ignore the command.

So the status quo for all the major Unix variants is apparently to
favor performance over correctness.  However, FlushFileBuffers()
in Windows does the right thing and flushes the disk write cache,
and I've heard that ZFS and ext4 also do the right thing (subject
to the correctness of the disk controller, of course).

So FreeBSD isn't any worse than most of the world here.  FreeBSD
used to turn off disk write caches by default, but many people
complained about FreeBSD being slow.  Far fewer people complain
about corruptions due to power failure.  Usually people who
require stronger reliability guarantees invest in replicated
storage and battery backups anyway.

Note that the "broken" behavior is still protective against kernel
and application crashes -- just not power failures and certain
types of disk faults.

An informative article on the topic is here:

   http://www.postgresql.org/docs/9.0/static/wal-reliability.html

> While we're at it, perhaps do the same in close(2).
> I _hope_ we are already doing it in unmount(2).

close(2) is a different beast; flushes would be too expensive, and
they aren't needed except for NFS.  Apps are expected to use
fsync(2) if they require it.