
List:       linux1394-devel
Subject:    Re: [Bug 16577] New: Firewire drive throughput slow, hangs,
From:       Stefan Richter <stefanr@s5r6.in-berlin.de>
Date:       2010-08-13 13:03:34
Message-ID: 4C6542A6.7090209@s5r6.in-berlin.de

(Cc: linux1394-devel@lists.sourceforge.net)

> bugzilla-daemon@bugzilla.kernel.org wrote:
> > Created an attachment (id=27424)
> > --> (https://bugzilla.kernel.org/attachment.cgi?id=27424)
> > dmesg output  ("Device offlined" when I removed expresscard)

Quoting the dmesg from initialization of the firewire drivers till
removal of the card:

> [ 6478.536741] firewire_ohci 0000:06:00.0: enabling device (0000 -> 0002)
> [ 6478.536757] firewire_ohci 0000:06:00.0: PCI INT A -> GSI 19 (level, low) -> IRQ 19
> [ 6478.636077] firewire_ohci: Added fw-ohci device 0000:06:00.0, OHCI v1.10, 4 IR + 8 IT contexts, quirks 0x2
> [ 6479.128337] firewire_core: created device fw0: GUID 7f82000000400000, S800
> [ 6479.157967] scsi27 : SBP-2 IEEE-1394
> [ 6479.158077] firewire_core: created device fw1: GUID 00d04b881a053e71, S800
> [ 6479.158086] firewire_core: phy config: card 2, new root=ffc0, gap_count=5
> [ 6479.928094] firewire_sbp2: fw1.0: logged in to LUN 0000 (0 retries)
> [ 6479.934630] scsi 27:0:0:0: Direct-Access     LaCie    Rugged FW/USB         PQ: 0 ANSI: 4
> [ 6479.934923] sd 27:0:0:0: Attached scsi generic sg1 type 0
> [ 6479.939850] sd 27:0:0:0: [sdb] 976773168 512-byte logical blocks: (500 GB/465 GiB)
> [ 6479.943081] sd 27:0:0:0: [sdb] Write Protect is off
> [ 6479.943086] sd 27:0:0:0: [sdb] Mode Sense: 10 00 00 00
> [ 6479.945138] sd 27:0:0:0: [sdb] Cache data unavailable
> [ 6479.945142] sd 27:0:0:0: [sdb] Assuming drive cache: write through
> [ 6479.954428] sd 27:0:0:0: [sdb] Cache data unavailable
> [ 6479.954433] sd 27:0:0:0: [sdb] Assuming drive cache: write through
> [ 6479.954440]  sdb: sdb1
> [ 6479.989692] sd 27:0:0:0: [sdb] Cache data unavailable
> [ 6479.989697] sd 27:0:0:0: [sdb] Assuming drive cache: write through
> [ 6479.989702] sd 27:0:0:0: [sdb] Attached SCSI disk
> [ 6480.532800] EXT4-fs (sdb1): mounted filesystem with ordered data mode

So far so good.

> [ 7710.529501] firewire_sbp2: fw1.0: sbp2_scsi_abort

The error handler of Linux' SCSI core detected a time-out of a SCSI
transaction.  That is, the disk received the request but did not return
status before the (typically generous) completion timeout allowed by the
SCSI core ran out.  The SCSI core's error handler then called the abort
handler of firewire-sbp2.  Since there is no further error message
shortly after the one above, the SCSI core (or a higher layer?) was able
to recover from this failed SCSI transaction.
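
For reference, the hook involved here is the SCSI host template's abort
handler.  Below is a minimal sketch of how an SBP-2 style low-level
driver wires it up; this is an illustration only, not the actual
drivers/firewire/sbp2.c code, and the function body is a placeholder.

/*
 * Minimal sketch (not the real firewire-sbp2 code): how a SCSI low-level
 * driver exposes an abort handler to the SCSI core's error handler.
 */
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_device.h>
#include <scsi/scsi_host.h>

static int sbp2_scsi_abort(struct scsi_cmnd *cmd)
{
	/*
	 * Called by the SCSI core when a command timed out.  A real driver
	 * would cancel the outstanding ORB(s) and reset the target's fetch
	 * agent here before reporting back.
	 */
	scmd_printk(KERN_NOTICE, cmd, "sbp2_scsi_abort\n");
	return SUCCESS;		/* abort succeeded, command may be retried */
}

static struct scsi_host_template scsi_driver_template = {
	.name			= "SBP-2 IEEE-1394",
	.eh_abort_handler	= sbp2_scsi_abort,	/* invoked on time-out */
	/* .queuecommand, .slave_alloc etc. omitted in this sketch */
};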

BTW, for the duration of the SCSI transaction time-out, the process
which issued the request is (typically or always?) waiting in
uninterruptible state, e.g. seen as "D" state in the output of ps.

> [ 7758.502832] firewire_sbp2: fw1.0: sbp2_scsi_abort

Ditto.

I may be wrong but the period of 48 seconds between this and the former
sbp2_scsi_abort indicates that this is about a different SCSI
transaction, not a retry of the one that failed at 7710.529501.

> [ 8950.977907] firewire_sbp2: fw1.0: sbp2_scsi_abort
> [ 8950.977937] sd 27:0:0:0: Device offlined - not ready after error recovery
> [ 8950.978038] sd 27:0:0:0: [sdb] Unhandled error code
> [ 8950.978042] sd 27:0:0:0: [sdb] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK
> [ 8950.978048] sd 27:0:0:0: [sdb] CDB: Write(10): 2a 00 01 da ec 3f 00 04 00 00
> [ 8950.978061] end_request: I/O error, dev sdb, sector 31124543

Now we have a SCSI transaction time-out from which the SCSI core or
higher layers were unable to recover.
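
For anyone decoding the "Result:" line quoted above: the SCSI core packs
the outcome of a command into a single result word, of which hostbyte
and driverbyte are two fields.  A rough, hypothetical helper using the
standard kernel macros (not code from firewire-sbp2) could look like:

/*
 * Rough illustration of what "Result: hostbyte=DID_BUS_BUSY
 * driverbyte=DRIVER_OK" encodes.  host_byte(), driver_byte() and the
 * DID_ and DRIVER_ constants are from include/scsi/scsi.h; this helper
 * is hypothetical and only demonstrates the decoding.
 */
#include <linux/kernel.h>
#include <scsi/scsi.h>

static void explain_scsi_result(int result)
{
	if (host_byte(result) == DID_BUS_BUSY)
		/* the transport could not complete the command, e.g. the
		 * target stopped responding or went away */
		pr_info("host byte: DID_BUS_BUSY\n");

	if (driver_byte(result) == DRIVER_OK)
		/* no additional error reported in the driver byte */
		pr_info("driver byte: DRIVER_OK\n");
}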

> [ 8950.978067] quiet_error: 246 callbacks suppressed
> [ 8950.978072] Buffer I/O error on device sdb1, logical block 3890560
> [ 8950.978075] lost page write due to I/O error on sdb1
> [ 8950.978085] Buffer I/O error on device sdb1, logical block 3890561
> [ 8950.978088] lost page write due to I/O error on sdb1
> [ 8950.978092] Buffer I/O error on device sdb1, logical block 3890562
> [ 8950.978096] lost page write due to I/O error on sdb1
> [ 8950.978100] Buffer I/O error on device sdb1, logical block 3890563
> [ 8950.978103] lost page write due to I/O error on sdb1
> [ 8950.978108] Buffer I/O error on device sdb1, logical block 3890564
> [ 8950.978111] lost page write due to I/O error on sdb1
> [ 8950.978116] Buffer I/O error on device sdb1, logical block 3890565
> [ 8950.978119] lost page write due to I/O error on sdb1
> [ 8950.978124] Buffer I/O error on device sdb1, logical block 3890566
> [ 8950.978127] lost page write due to I/O error on sdb1
> [ 8950.978131] Buffer I/O error on device sdb1, logical block 3890567
> [ 8950.978135] lost page write due to I/O error on sdb1
> [ 8950.978139] Buffer I/O error on device sdb1, logical block 3890568
> [ 8950.978142] lost page write due to I/O error on sdb1
> [ 8950.978147] Buffer I/O error on device sdb1, logical block 3890569
> [ 8950.978150] lost page write due to I/O error on sdb1
> [ 8950.988975] sd 27:0:0:0: [sdb] Unhandled error code
> [ 8950.988980] sd 27:0:0:0: [sdb] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK
> [ 8950.988983] sd 27:0:0:0: [sdb] CDB: Write(10): 2a 00 01 da f0 3f 00 04 00 00
> [ 8950.988989] end_request: I/O error, dev sdb, sector 31125567

And here we have a SCSI transaction failure which differs from the ones
above in that "sbp2_scsi_abort" is not logged.  As far as I can tell,
this means that from now on even the small IEEE 1394 transaction by
which firewire-sbp2 signals the availability of a new SCSI request to
the target fails.  (As I wrote, in the two earlier failures the target
seemingly accepted the SCSI request but did not complete it in time.
Now firewire-sbp2 is apparently unable to get in touch with the target
at all.)
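
To make that "small IEEE 1394 transaction" concrete: firewire-sbp2
announces a new command by writing the ORB's bus address to a register
of the target's command block agent.  A heavily simplified sketch
follows; the register offset, the helper name and the use of a
synchronous transaction are placeholders, not the real driver code.

/*
 * Heavily simplified sketch of announcing a new ORB to an SBP-2 target
 * with a block write transaction.  Offsets and names are placeholders;
 * the real driver queues asynchronous transactions instead.
 */
#include <asm/byteorder.h>
#include <linux/firewire.h>
#include <linux/firewire-constants.h>
#include <linux/types.h>

#define AGENT_ORB_POINTER	0x08	/* offset within the agent's register space (assumed) */

static int announce_orb(struct fw_card *card, int node_id, int generation,
			int speed, u64 agent_address, u64 orb_bus_address)
{
	__be32 pointer[2];

	pointer[0] = cpu_to_be32(orb_bus_address >> 32);
	pointer[1] = cpu_to_be32(orb_bus_address);

	/*
	 * If this write fails (for example because the node is gone),
	 * the command never reaches the target, which matches the
	 * failures above where sbp2_scsi_abort is not logged.
	 */
	return fw_run_transaction(card, TCODE_WRITE_BLOCK_REQUEST, node_id,
				  generation, speed,
				  agent_address + AGENT_ORB_POINTER,
				  pointer, sizeof(pointer));
}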

> [ 8950.989094] end_request: I/O error, dev sdb, sector 486801991
> [ 8950.989109] Aborting journal on device sdb1-8.
> [ 8950.989598] EXT4-fs error (device sdb1) in ext4_da_writepages: IO failure
> [ 8950.999021] JBD2: I/O error detected when updating journal superblock for sdb1-8.
> [ 8951.000143] EXT4-fs error (device sdb1): ext4_journal_start_sb: Detected aborted journal
> [ 8951.000148] EXT4-fs (sdb1): Remounting filesystem read-only
> [ 8951.000151] EXT4-fs (sdb1): ext4_da_writepages: jbd2_start: 2048 pages, ino 5636100; err -30
> [ 8951.000155] 
> [ 8951.059965] JBD2: Detected IO errors while flushing file data on sdb1-8
> [ 8951.059978] journal commit I/O error
> [ 8951.059980] journal commit I/O error
> [ 8951.386489] EXT4-fs error (device sdb1): ext4_find_entry: reading directory #2 offset 0
> [ 8951.386565] EXT4-fs error (device sdb1): ext4_find_entry: reading directory #2 offset 0
> [ 8951.386616] EXT4-fs error (device sdb1): ext4_find_entry: reading directory #2 offset 0
> [ 8951.418399] sd 27:0:0:0: [sdb] Stopping disk
> [ 8951.418478] sd 27:0:0:0: [sdb] START_STOP FAILED
> [ 8951.418482] sd 27:0:0:0: [sdb] Result: hostbyte=DID_BUS_BUSY driverbyte=DRIVER_OK
> [ 8951.418577] firewire_sbp2: released fw1.0, target 27:0:0

The last four lines occur when the target node was switched off (or
switched itself off), was unplugged, or the FireWire controller card
was ejected.

Since all these error messages are logged right after the abort at
8950.977907, I presume that the target went away (i.e. was switched
off, switched itself off, was unplugged, or the card was ejected)
shortly before this abort.

> [ 8951.480505] EXT4-fs error (device sdb1): ext4_put_super: Couldn't clean up the journal
> [ 8961.502993] firewire_ohci 0000:06:00.0: PCI INT A disabled
> [ 8961.503000] firewire_ohci: Removed fw-ohci device.

The last two lines are from when the FireWire controller card was
ejected.  This is 10 seconds after the previous error messages.  Either
it took the driver stack 10 seconds to release all remaining references
to the card (in which case all messages from 8950.977907 onwards are a
result of the card being ejected), or the target node went away 10
seconds before the card was ejected.

My conclusion here is that the unresponsiveness of the disk (at first
recoverable, though according to your description causing a
considerable slow-down; later unrecoverable, or possibly caused by the
card being ejected) is either a hardware problem, a firmware problem,
or a combination of the two.

LaCie's product page for the Rugged disk lists, under the "support"
tab, a LaCie Update Tool from Jun 23 2008 for Mac OS X and MS Windows.
If you happen to have access to a FireWire-equipped PC with one of
these OSs, check whether that tool offers newer firmware for this disk.

A driver bug cannot be ruled out but is unlikely.  The code in question
is very well tested and known to be robust (also in conjunction with
Texas Instruments controllers like yours and with FireWire 800-to-SATA
bridge chips as used in LaCie drives, which are to my knowledge bridges
from Oxford Semiconductor).
-- 
Stefan Richter
-=====-==-=- =--- -==-=
http://arcgraph.de/sr/


