List: zfs-discuss
Subject: Re: [zfs-discuss] repair [was: about btrfs and zfs]
From: Garrett D'Amore <Garrett.DAmore () nexenta ! com>
Date: 2011-10-19 12:27:50
Message-ID: CBDA3BDE-1242-4DCB-A5A0-D6D6D78DF45C () nexenta ! com
On Oct 19, 2011, at 1:52 PM, Richard Elling wrote:
> On Oct 18, 2011, at 5:21 PM, Edward Ned Harvey wrote:
>
> > > From: zfs-discuss-bounces@opensolaris.org [mailto:zfs-discuss-
> > > bounces@opensolaris.org] On Behalf Of Tim Cook
> > >
> > > I had and still have redundant storage; it has *NEVER* automatically
> > > fixed it. You're the first person I've heard of who has had it
> > > automatically fix it.
> >
> > That's probably just because it's normal and expected behavior to
> > automatically fix it - I always have redundancy, and every cksum error I
> > ever find is always automatically fixed. I never tell anyone here because
> > it's normal and expected.
>
> Yes, and in fact the automated tests for ZFS developers intentionally corrupt data
> so that the repair code can be tested. Also, the same checksum code is used to
> calculate the checksum when writing and when reading.
>
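To make the mechanism concrete, here is a toy Python sketch (not ZFS code, and not its on-disk format) of the idea described above: the checksum recorded at write time is recomputed at read time with the same function, and a copy that fails verification is repaired from a copy that passes.

```python
import hashlib

def cksum(data: bytes) -> str:
    """Same checksum function is used at write time and at read time."""
    return hashlib.sha256(data).hexdigest()

class Mirror:
    """Toy 2-way mirror with self-healing reads (illustration only)."""

    def __init__(self, data: bytes):
        self.copies = [bytearray(data), bytearray(data)]
        self.expected = cksum(data)  # checksum recorded at write time

    def read(self) -> bytes:
        for i, copy in enumerate(self.copies):
            if cksum(bytes(copy)) == self.expected:
                # Found a good copy: repair any sibling that fails verification.
                for j in range(len(self.copies)):
                    if j != i and cksum(bytes(self.copies[j])) != self.expected:
                        self.copies[j] = bytearray(copy)
                return bytes(copy)
        # No copy matches the checksum: the error is unrecoverable.
        raise IOError("unrecoverable checksum error")

m = Mirror(b"important data")
m.copies[0][0] ^= 0xFF            # silently corrupt one copy
assert m.read() == b"important data"       # read still succeeds
assert m.copies[0] == m.copies[1]          # and the bad copy was repaired
```

With redundancy, a read that hits a bad copy is transparently satisfied from the good one and the bad copy is rewritten, which is why a scrub on a healthy redundant pool normally reports repaired (not unrecoverable) checksum errors.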
> > If you have redundancy, and cksum errors, and it's not automatically fixed,
> > then you should report the bug.
>
> For modern Solaris-based implementations, each checksum mismatch that is
> repaired reports the bitmap of the corrupted vs expected data. Obviously, if the
> data cannot be repaired, you cannot know the expected data, so the error is
> reported without identification of the broken bits.
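The point about the bitmap can be illustrated with a small sketch (my own illustration, not the actual ZFS/FMA report format): once a repair succeeds, the expected data is known, so the broken bits are simply the XOR of the corrupted and expected blocks; with no good copy, there is nothing to XOR against.

```python
# Hypothetical 2-byte block, for illustration only.
expected  = bytes([0b10110100, 0b00001111])   # data recovered from a good copy
corrupted = bytes([0b10110110, 0b00001111])   # data read from the bad copy

# XOR yields a bitmap with a 1 in every flipped bit position.
bitmap = bytes(c ^ e for c, e in zip(corrupted, expected))
flipped = [(i, f"{b:08b}") for i, b in enumerate(bitmap) if b]
print(flipped)   # byte 0 differs in exactly one bit
```

If the data cannot be repaired there is no `expected` block, so the error can only be reported without identifying which bits are broken.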
>
> In the archives, you can find reports of recoverable and unrecoverable errors
> attributed to:
> 1. ZFS software (rare, but a bug a few years ago mishandled a raidz case)
> 2. SAN switch firmware
> 3. "Hardware" RAID array firmware
> 4. Power supplies
> 5. RAM
> 6. HBA
> 7. PCI-X bus
> 8. BIOS settings
> 9. CPU and chipset errata
>
> Personally, I've seen all of the above except #7, because PCI-X hardware is
> hard to find now.
I've seen #7. I have some PCI-X hardware that is flaky in my home lab. ;-)
There was a case of #1 not very long ago, but it was a difficult-to-trigger race, and
it is fixed in illumos and, I presume, other derivatives (including NexentaStor).
- Garrett
>
> If you consistently see unrecoverable data from a system that has protected data, then
> there may be an issue with a part of the system that is a single point of failure.
> Very, very, very few x86 systems are designed with no SPOF.
> -- richard
>
> --
>
> ZFS and performance consulting
> http://www.RichardElling.com
> VMworld Copenhagen, October 17-20
> OpenStorage Summit, San Jose, CA, October 24-27
> LISA '11, Boston, MA, December 4-9
>
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss