'Re: sbp2: sbp2util_node_write_no_wait failed'


List:       linux1394-devel
Subject:    Re: sbp2: sbp2util_node_write_no_wait failed
From:       Michael Brade <brade () informatik ! uni-muenchen ! de>
Date:       2005-11-05 0:36:32
Message-ID: 200511050136.36245.brade () informatik ! uni-muenchen ! de

On Friday 04 November 2005 23:13, Stefan Richter wrote:
> Michael Brade wrote:
> > Heh, good news, debugging finished! :-)
>
> Let's say, _nearly_ finished.
OK, fair enough. However, you haven't told me yet what to do next...

> There are 64 tlabels, therefore what you have seen is that sbp2 added
> ORBs and rang the doorbell 64 times while the target did not finish
> these 64 small transactions to the doorbell register with a response.
Yep, but what I don't quite understand yet is when exactly it happens. It 
seems that I can copy one huge file almost without errors, maybe one or two, 
rarely more. But when I write a lot of small files and try to do some reading 
in between, the error happens every 10 seconds or more often, with the odd 
exception to the rule.
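
To make the tlabel exhaustion described above concrete, here is a toy model in Python. It is only a sketch: `TlabelPool` and `ring_doorbell` are hypothetical names, not functions from the ieee1394 or sbp2 code. It assumes a fixed pool of 64 labels and a caller that cannot wait for a free one.

```python
class TlabelPool:
    """Toy model of a fixed pool of transaction labels (not kernel code)."""

    def __init__(self, size=64):
        self.free = set(range(size))

    def try_acquire(self):
        """Non-blocking acquire: return a label, or None if the pool is empty."""
        return self.free.pop() if self.free else None

    def release(self, label):
        """Return a label to the pool, as a response from the target would."""
        self.free.add(label)


def ring_doorbell(pool, times):
    """Attempt `times` doorbell writes while no response ever arrives.

    Returns the list of labels still outstanding and the number of
    writes that failed because no label was available.
    """
    outstanding, failed = [], 0
    for _ in range(times):
        label = pool.try_acquire()
        if label is None:
            failed += 1  # corresponds to the "write_no_wait failed" case
        else:
            outstanding.append(label)
    return outstanding, failed
```

With 64 labels, the 65th write in a row without a response is the first one to fail; as soon as one label is released, the next write succeeds again.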

> > Any idea what to fix?
>
> I am unsure.
>
> Idea 1:
>
> Maybe we could move the initiator's part of the protocol (in particular,
> writes to command block agent register and writes to doorbell register)
> out of atomic context, e.g. into an additional kthread or into a
> workqueue. That would let sbp2 wait for availability of a tlabel.
That sounds pretty good to me ;-)
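
The shape of Idea 1 can be sketched in user space as follows. This is an illustration under assumptions, not a proposed patch: `run_worker` and `demo` are hypothetical names, a `threading.Semaphore` stands in for the tlabel pool, and a Python thread stands in for the kthread or workqueue. The submitter only enqueues and never blocks; the worker is the only place that sleeps until a label is free.

```python
import queue
import threading


def run_worker(requests, labels, sent):
    """Drain `requests`, sleeping until a label is free before each send."""
    while True:
        item = requests.get()
        if item is None:  # sentinel: no more work
            break
        labels.acquire()  # may sleep; fine here, not allowed in atomic context
        sent.append(item)


def demo(num_labels=2, num_requests=5):
    """Submit more requests than there are labels and return the send order."""
    requests = queue.Queue()  # unbounded, so put() never blocks the submitter
    labels = threading.Semaphore(num_labels)
    sent = []
    worker = threading.Thread(target=run_worker, args=(requests, labels, sent))
    worker.start()
    for i in range(num_requests):
        requests.put(i)
    requests.put(None)
    # Simulate responses from the target: each one frees a label.
    for _ in range(num_requests - num_labels):
        labels.release()
    worker.join()
    return sent
```

Nothing is dropped when the labels run out: the requests beyond the first `num_labels` simply wait in the queue until a response frees a label.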

> Moreover, as I said in an earlier reply, there are many reports about
> mysterious "aborting sbp2 command" mishaps but only few reports about
> "sbp2util_node_write_no_wait failed" along with command abortions.
Hm, I have no idea how many "aborting sbp2 command" reports there are, 
but I found quite a few reports on Google with the 
"sbp2util_node_write_no_wait failed" message, none of them with a good 
answer though.

> Therefore, IMO, we should implement changes of such kind only after we
> understood the more common problem better and have an idea how such
> changes may affect it, ideally cure it.
OK, is there anything I can do to help there? I mean, I have a system where 
the error happens with 100% certainty every time, so as a test bed it's quite 
perfect, eh?

> Idea 2:
>
> [...] It won't improve
> performance relative to what you are seeing now, but it would lower the
> risk of data loss.
Well, as far as I can see I haven't lost any data yet. Do you reckon that 
could happen, or is even likely to happen? 'Cause then I'd do some backups 
pretty soon.

> > It doesn't seem justified to lock up for 30 seconds
> > since a new label could be available much earlier. But that's just my
> > guess.
>
> These pauses aren't spent locked-up in sbp2. It is the period that the
> SCSI subsystem waits for completion of a task.
I know... do you think I would put something at risk if I lowered the timeout 
to, say, 5 seconds? 30 seconds is *really* annoying.
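
For reference, a small sketch of how the per-device SCSI command timeout could be inspected and changed from user space. It assumes the disk shows up as a SCSI disk (e.g. `sdb`) and that the kernel exposes the `timeout` attribute under `/sys/block/<disk>/device/`; the helper names are hypothetical. Whether a lower value is safe for this device is exactly the open question above, and the sketch does not answer it.

```python
from pathlib import Path


def timeout_attr(disk, sysfs_root="/sys"):
    """Path of the SCSI command timeout attribute for `disk` (e.g. "sdb")."""
    return Path(sysfs_root) / "block" / disk / "device" / "timeout"


def get_timeout(disk, sysfs_root="/sys"):
    """Read the current timeout in seconds."""
    return int(timeout_attr(disk, sysfs_root).read_text().strip())


def set_timeout(disk, seconds, sysfs_root="/sys"):
    """Write a new timeout in seconds (needs root on a real system)."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    timeout_attr(disk, sysfs_root).write_text(f"{seconds}\n")
```

The `sysfs_root` parameter exists only so the helpers can be tried against a scratch directory before touching a real device.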

Cheers,
-- 
Michael Brade;                 KDE Developer, Student of Computer Science
  |-mail: echo brade !#|tr -d "c oh"|s\e\d 's/e/\@/2;s/$/.org/;s/bra/k/2'
  °--web: http://www.kde.org/people/michaelb.html

KDE 3: The Next Generation in Desktop Experience
