List: freebsd-current
Subject: Re: still: Re: gbde data corruption?
From: "Poul-Henning Kamp" <phk () phk ! freebsd ! dk>
Date: 2003-04-30 13:51:19
In message <20030430151514.X27116@daneel.foundation.hs>, Heiko Schaefer writes:
>Hi Poul,
>the broken version of the file contains lots of 0-bytes (instead of high
>entropy values in the original file). seems by the output of cmp that
>every damaged value is replaced by 0.
Zero bytes are the absolute last thing I would expect...
How long are the sequences of zero bytes, and do they start at
sector boundaries?
Do you also see this on the client? (I.e., could it be that data is
still cached on the client and not flushed?)
What is the approximate error rate? 1 file in 10? 1 file in 100?
How long are the files?
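The first of these questions can be answered mechanically. What follows is a minimal sketch, not a tool from this thread: the function names `zero_runs` and `describe` and the 16-byte minimum run length are assumptions made for illustration. The sector sizes 512 and 4096 are the ones diskinfo reports for the plain and the .bde device.

```python
import re

def zero_runs(data, min_len=16):
    """Return (offset, length) for each run of at least min_len zero bytes."""
    pattern = re.compile(rb"\x00{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(data)]

def describe(data, sector_sizes=(512, 4096), min_len=16):
    """For each zero run, note whether it starts on a sector boundary."""
    rows = []
    for offset, length in zero_runs(data, min_len):
        aligned = {size: offset % size == 0 for size in sector_sizes}
        rows.append((offset, length, aligned))
    return rows
```

To use it, read the damaged copy in binary mode and pass its contents to `describe`. Since the original file is high-entropy, long zero runs in the copy are almost certainly damage, but comparing against the original with cmp remains the authoritative check.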
>zoidberg# diskinfo /dev/ad0s1e
>/dev/ad0s1e 512 29051207680 56740640 56290 16 63
>zoidberg# diskinfo /dev/ad0s1e.bde
>/dev/ad0s1e.bde 4096 28937551872 7064832
This looks ok.
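One consistency check that can be made on the two diskinfo lines is that sector size times sector count equals the media size. The sketch below assumes the first three numeric fields are sector size, media size in bytes, and media size in sectors; `check_diskinfo` is a hypothetical helper name.

```python
def check_diskinfo(line):
    """Return (device, True/False) for sector_size * sectors == media_bytes."""
    fields = line.split()
    name = fields[0]
    sector_size, media_bytes, media_sectors = (int(x) for x in fields[1:4])
    return name, sector_size * media_sectors == media_bytes

# The two lines quoted above, copied verbatim.
lines = [
    "/dev/ad0s1e 512 29051207680 56740640 56290 16 63",
    "/dev/ad0s1e.bde 4096 28937551872 7064832",
]
for line in lines:
    print(check_diskinfo(line))
```

Both lines pass this check.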
>another thing i just notice: /var/log/messages contains lots of
>
>[...]
>Apr 30 15:24:55 zoidberg kernel: ENOMEM 0xc4c62100 on 0xc45c6c80(ad2s1e.bde)
>Apr 30 15:25:19 zoidberg kernel: ENOMEM 0xc3fa5000 on 0xc45c6c80(ad2s1e.bde)
>Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4b46100 on 0xc45c6c80(ad2s1e.bde)
>Apr 30 15:25:57 zoidberg kernel: ENOMEM 0xc4364500 on 0xc45c6c80(ad2s1e.bde)
>[...]
This means that the kernel ran out of RAM and the operation was retried.
It should not result in data corruption, but it may reorder bio requests
significantly. I must admit that I have not bashed NFS to see whether it
copes.
>i feel that the issue i see is outside the realm of 'should' - so i try to
>give any information i can think of. even useless information :)
Ohh, you're _WAY_ out of "should", you're with your feet deep into
"should certainly NOT", right next to "NEVER EVER!" :-)
>also, i have the unpleasant feeling that i might be making some stupid
>mistake, and waste your time by looking entirely in the wrong direction.
>
>...for all i know the hardware i use on the server-side (or the drivers
>for it ... for some reason the sis-based onboard nic comes to my mind,
>just now) could be subtly broken :/
>if you have no other things i could report or try, i might just throw away
>the gbde volumes and try the same copying with non-gbde partitions, just
>to be sure.
That would be a good first step, but we need to do it in a controlled
way to make sure we know what we prove, so please try it this way:

add
	option MALLOC_MAKE_FAILURES
to your kernel.

Build filesystem without GBDE, run test, check for corruption.

If no corruption, run:
	sysctl debug.malloc.failure_rate=9013
and then rebuild filesystem without GBDE, run test, check
for corruption.

If you get no corruption in either case, GBDE is clearly to
blame, and I get to lose more hair while I chase that bug.
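
The "check for corruption" step is not spelled out in this message. One hypothetical way to do it is to compare checksums of every source file against its copy; the sketch below assumes both trees are reachable as local paths, and the function names are made up for illustration.

```python
import hashlib
import os

def file_digest(path, chunk=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def corrupted_files(src_root, dst_root):
    """Return relative paths that are missing or differ under dst_root."""
    bad = []
    for dirpath, _dirs, files in os.walk(src_root):
        for name in files:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, src_root)
            dst = os.path.join(dst_root, rel)
            if not os.path.isfile(dst) or file_digest(src) != file_digest(dst):
                bad.append(rel)
    return sorted(bad)
```

The length of the returned list divided by the number of files copied gives the approximate error rate asked about earlier. Reading the copy back immediately may be served from cache rather than from disk, so the comparison is more meaningful after a remount or when run on the server.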
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.