
List:       zfs-discuss
Subject:    Re: [zfs-discuss] repair [was:  about btrfs and zfs]
From:       Garrett D'Amore <Garrett.DAmore@nexenta.com>
Date:       2011-10-19 12:27:50
Message-ID: CBDA3BDE-1242-4DCB-A5A0-D6D6D78DF45C@nexenta.com


On Oct 19, 2011, at 1:52 PM, Richard Elling wrote:

> On Oct 18, 2011, at 5:21 PM, Edward Ned Harvey wrote:
> 
> > > From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-
> > > bounces@opensolaris.org] On Behalf Of Tim Cook
> > > 
> > > I had and have redundant storage, and it has *NEVER* automatically fixed
> > > it.  You're the first person I've heard of who has had it automatically
> > > fix it.
> > 
> > That's probably just because it's normal and expected behavior to
> > automatically fix it - I always have redundancy, and every cksum error I
> > ever find is always automatically fixed.  I never tell anyone here because
> > it's normal and expected.
> 
> Yes, and in fact the automated tests for ZFS developers intentionally corrupt data
> so that the repair code can be tested. Also, the same checksum code is used to
> calculate the checksum when writing and reading.
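
A minimal sketch of that kind of corruption-injection test, assuming a
Solaris/illumos host and a throwaway file-backed pool; the pool name, paths,
and offsets below are illustrative, not anything from this thread:

	# Build a scratch mirror on two file-backed vdevs.
	mkfile 128m /var/tmp/d0 /var/tmp/d1
	zpool create testpool mirror /var/tmp/d0 /var/tmp/d1
	cp -r /etc /testpool/etc-copy

	# Scribble over part of one side, behind ZFS's back (offset chosen to
	# stay clear of the front vdev labels; you may need a larger count to
	# actually hit allocated blocks).
	dd if=/dev/urandom of=/var/tmp/d1 bs=1024 seek=4096 count=256 conv=notrunc

	# Any read of the damaged blocks, or a scrub, should detect the
	# checksum mismatches and rewrite the bad copy from the good side.
	zpool scrub testpool
	zpool status -v testpool    # CKSUM column counts errors found and repaired
	zpool clear testpool        # reset the counters when done

If the dd happened to land on allocated blocks, the scrub flags the mismatches
on the corrupted vdev, repairs them from the intact mirror copy, and the data
reads back clean.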
> 
> > If you have redundancy, and cksum errors, and it's not automatically fixed,
> > then you should report the bug.
> 
> For modern Solaris-based implementations, each checksum mismatch that is
> repaired reports the bitmap of the corrupted vs expected data. Obviously, if the
> data cannot be repaired, you cannot know the expected data, so the error is 
> reported without identification of the broken bits.
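
(A hedged aside, not from the original message: on illumos-derived systems
those per-repair reports surface as FMA ereports, so a rough way to inspect
them is

	fmdump -eV | grep -A 40 ereport.fs.zfs.checksum

which dumps the ereport.fs.zfs.checksum payload, including the bad-bits
detail, with fields along the lines of bad_set_bits and bad_cleared_bits;
the exact payload names vary by release, so treat that as an approximation
rather than a spec.)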
> 
> In the archives, you can find reports of recoverable and unrecoverable errors 
> attributed to:
> 	1. ZFS software (rare, but a bug a few years ago mishandled a raidz case)
> 	2. SAN switch firmware
> 	3. "Hardware" RAID array firmware
> 	4. Power supplies
> 	5. RAM
> 	6. HBA
> 	7. PCI-X bus
> 	8. BIOS settings
> 	9. CPU and chipset errata
> 
> Personally, I've seen all of the above except #7, because PCI-X hardware is
> hard to find now.

I've seen #7.  I have some PCI-X hardware that is flaky in my home lab. ;-)

There was a case of #1 not very long ago, but it was a difficult-to-trigger race and
is fixed in illumos and, I presume, other derivatives (including NexentaStor).

	- Garrett
> 
> If you consistently see unrecoverable data from a system that has protected data, then
> there may be an issue with a part of the system that is a single point of failure.
> Very, very, very few x86 systems are designed with no SPOF.
> -- richard
> 
> -- 
> 
> ZFS and performance consulting
> http://www.RichardElling.com
> VMworld Copenhagen, October 17-20
> OpenStorage Summit, San Jose, CA, October 24-27
> LISA '11, Boston, MA, December 4-9 

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

