List:       linux1394-user
Subject:    Re: sbp2: sbp2util_node_write_no_wait failed
From:       Stefan Richter <stefanr () s5r6 ! in-berlin ! de>
Date:       2005-11-05 12:57:20
Message-ID: 436CAC30.4070804 () s5r6 ! in-berlin ! de

Michael Brade wrote:
> On Friday 04 November 2005 23:13, Stefan Richter wrote:
>>Michael Brade wrote:
>>>Heh, good news, debugging finished! :-)
>>Let's say, _nearly_ finished.
> Ok, fair enough, however, you didn't tell me yet what to do next...

I hope my lack of insight didn't come across as lack of manners. :-)

>>There are 64 tlabels, therefore what you have seen is that sbp2 added
>>ORBs and rung the doorbell 64 times while the target did not finish
>>these 64 small transactions to the doorbell register with a response.
> 
> Yep, but what I don't quite understand yet is when exactly it happens. It 
> seems that I can copy one huge file almost without errors, maybe one or two, 
> rarely more. But when I write a lot of small files and try to do some reading 
> inbetween the error happens every 10 seconds or worse. With the odd exception 
> to the rule.

I/O of huge files involves comparatively few SCSI commands, each with a 
big accompanying data transfer, so the ratio of "protocol traffic" (ORB 
and doorbell writes) to "data traffic" is much lower in that case. This 
explains why the frequency of the problem correlates with the number of 
files rather than with file size.

It is mysterious though why it happens only with writes, not with reads.

BTW, when you replaced the Prolific-based enclosure with an Oxford 
911-based one, did you check what's actually on the bridge board? Some 
Prolific firmwares show strings in their config ROM that look similar 
to what an OXFW911 firmware would show.

(Anyway, while the exhaustion of tlabels is probably a deficiency of the 
SBP-2 bridge, sbp2's inability to get the doorbell ring out to the 
device long before the SCSI layer times out is of course a bug in sbp2.)
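
For those following along: each outstanding asynchronous transaction to 
a node is identified by a 6-bit transaction label, hence at most 64 may 
be in flight per node at any time, and a "no wait" write like the 
doorbell ring has to fail immediately when no label is free. A 
simplified, self-contained sketch of such an allocator (this is not the 
actual ieee1394 core code; all names here are invented):

#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/bitops.h>
#include <linux/errno.h>

#define TLABELS_PER_NODE 64	/* tlabel is a 6 bit field */

struct node_tlabel_pool {
	spinlock_t lock;
	DECLARE_BITMAP(in_use, TLABELS_PER_NODE); /* one bit per tlabel */
};

/*
 * Non-blocking allocation, as a "no wait" register write needs it.
 * Returns a free tlabel, or -EAGAIN when all 64 are still outstanding,
 * i.e. the target has not yet completed the earlier transactions with
 * a response.
 */
static int tlabel_get_no_wait(struct node_tlabel_pool *pool)
{
	unsigned long flags;
	int t;

	spin_lock_irqsave(&pool->lock, flags);
	t = find_first_zero_bit(pool->in_use, TLABELS_PER_NODE);
	if (t < TLABELS_PER_NODE)
		__set_bit(t, pool->in_use);
	spin_unlock_irqrestore(&pool->lock, flags);

	return t < TLABELS_PER_NODE ? t : -EAGAIN;
}

/* called from response (or timeout) handling when a transaction ends */
static void tlabel_put(struct node_tlabel_pool *pool, int t)
{
	clear_bit(t, pool->in_use);
}

Once the doorbell write is issued from a context which may sleep, the 
-EAGAIN case could simply become a wait for a free label instead of an 
immediate failure.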

>>Maybe we could move the initiator's part of the protocol (in particular,
>>writes to command block agent register and writes to doorbell register)
>>out of atomic context, e.g. into an additional kthread or into a
>>workqueue. That would let sbp2 wait for availability of a tlabel.
> 
> That sounds about good to me ;-)

I think one workqueue per node_entry would be suitable.

SBP-2 login and SBP-2 reconnect could perhaps be moved into this 
workqueue too. They already happen in non-atomic context, but moving 
them would let nodemgr probe and update several nodes on the same bus 
more quickly.
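
Roughly what I have in mind, as a sketch only (the struct and function 
names below are invented for illustration, and it assumes the current 
2.6 workqueue interface where the work function receives a void * 
argument):

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/errno.h>

/* one deferred fetch agent access, queued on the node's workqueue */
struct sbp2_agent_request {
	struct work_struct work;
	struct scsi_id_instance_data *scsi_id;
	u64 reg_offset;		/* doorbell or ORB_POINTER register */
	u32 data;
};

static void sbp2_agent_worker(void *p)
{
	struct sbp2_agent_request *req = p;

	/*
	 * Process context: we may block here until a tlabel becomes
	 * available and then do an ordinary, waiting quadlet write to
	 * the target's fetch agent register.
	 */
	sbp2_write_agent_reg(req->scsi_id, req->reg_offset, req->data);
	kfree(req);
}

/*
 * What sbp2 would call from the command queueing path in place of
 * sbp2util_node_write_no_wait().  wq is the workqueue created per
 * node_entry, e.g. with create_singlethread_workqueue().
 */
static int sbp2_agent_write_deferred(struct scsi_id_instance_data *scsi_id,
				     struct workqueue_struct *wq,
				     u64 reg_offset, u32 data)
{
	struct sbp2_agent_request *req;

	req = kmalloc(sizeof(*req), GFP_ATOMIC); /* caller is in atomic context */
	if (!req)
		return -ENOMEM;

	req->scsi_id = scsi_id;
	req->reg_offset = reg_offset;
	req->data = data;
	INIT_WORK(&req->work, sbp2_agent_worker, req);

	queue_work(wq, &req->work);
	return 0;
}

A single-threaded queue would also keep one node's agent accesses in 
the order in which they were queued.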

>>Moreover, as I said in an earlier reply, there are many reports about
>>mysterious "aborting sbp2 command" mishaps but only few reports about
>>"sbp2util_node_write_no_wait failed" along with command abortions.
> 
> Hm, I have no idea about how many "aborting sbp2 command" reports there are 
> but I found quite some reports on Google with the sbp2util_node_write_no_wait 
> failed, none of them with a good answer though.

Google shows 158 hits for "aborting sbp2 command" + 
"sbp2util_node_write_no_wait failed" but 1370 hits for "aborting sbp2 
command" alone.

>>Therefore, IMO, we should implement changes of such kind only after we
>>understood the more common problem better and have an idea how such
>>changes may affect it, ideally cure it.
> 
> Ok, is there anything I do to help there? I mean, I have a system where the 
> error happens with 100% certainty every time, so as a test bed it's quite 
> perfect, eh?

We don't know yet what problems these 1370 hits above actually reflect. 
The only sure thing is that they comprise several different problems.

At least the particular problem of tlabels being exhausted while sbp2 
tries to ring the doorbell is now well understood, thanks to your 
really good debugging work. You are right, we should do our best to 
solve this problem right now, given that you and your system are 
available for testing. What I said above was because I am afraid of the 
consequences of my idea of a solution. Maybe a more relaxed attitude 
would be more productive. :-)

I will work on the workqueue idea during the 2nd half of next week. (At 
least I hope to be able to do so.) Of course, if you or anybody else 
has something to contribute or an idea for a better approach, all the 
better.

>>Idea 2:
>>
>>[...] It won't improve
>>performance relative to what you are seeing now, but it would lower the
>>risk of data loss.
> 
> Well, as far as I can see I haven't lost any data yet. Do you reckon that 
> could happen or is even likely to happen? Cause then I'd do some backups 
> pretty soon.

AFAIU data loss becomes imminent when the SCSI layer has had to abort 
the same task too often. Eventually it gives up on any further traffic 
and takes the device offline. Moreover, I am not sure how well the 
firmware handles abortion of a task and requeueing of the rest of the 
current task set, which may happen many times before the SCSI layer 
gives up.

But backups are really good to have anyway.
-- 
Stefan Richter
-=====-=-=-= =-== --=-=
http://arcgraph.de/sr/


