List: linux1394-user
Subject: Re: sbp2: sbp2util_node_write_no_wait failed
From: Stefan Richter <stefanr () s5r6 ! in-berlin ! de>
Date: 2005-11-04 22:13:38
Message-ID: 436BDD12.8070805 () s5r6 ! in-berlin ! de
Michael Brade wrote:
> Heh, good news, debugging finished! :-)
Let's say, _nearly_ finished.
> On Tuesday 01 November 2005 20:55, Stefan Richter wrote:
>
>>sbp2.c::sbp2util_node_write_no_wait calls
>>ieee1394_core.c::hpsb_send_packet and ohci1394.c::ohci_transmit. You
>>could add printks to the failure paths of the latter two functions to
>>find out what prevented sbp2 from sending.
>
> Alright, none of those was a problem, in my case the sbp2 can't get a
> transaction label for the packet. ieee1394_transactions.c::hpsb_get_tlabel()
> returns non-zero. It only seems to happen when sbp2util_node_write_no_wait()
> is called from sbp2.c line 2020, I haven't seen it happen in line 2056. Then
> sbp2util_node_write_no_wait() calls hpsb_make_writepacket(), which fails and
> returns a NULL packet due to the tlabel thing.
(Line 2020 is when sbp2 announces a new list of ORBs by writing to the
ORB pointer register. Line 2056 is when it announces an appended ORB by
writing to the doorbell register, which obviously happens at much higher
frequency.)
Any asynchronous transaction needs a transaction label. A transaction
consists of a request from node A to node B (plus an ack from B to A) and
a response from B to A (plus an ack from A to B; if the upper layers of B
are fast enough, the first ack may already provide the response, saving
bus bandwidth). As long as a transaction from A to B has not been
finished by a response, its tlabel cannot be used in another transaction
from A to B. This makes it possible to find out which request an
incoming response belongs to.
There are 64 tlabels. What you have seen, therefore, is that sbp2 added
ORBs and rang the doorbell 64 times while the target did not finish
these 64 small transactions to the doorbell register with a response.
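To illustrate the tlabel exhaustion described above, here is a minimal
userspace sketch of a 64-entry label pool. It is not the code in
ieee1394_transactions.c; the names tlabel_pool, tlabel_get and tlabel_put
are made up for this example, and unlike hpsb_get_tlabel() the getter
returns the label itself, or -1 on failure.

```c
#include <assert.h>
#include <stdint.h>

#define TLABEL_COUNT 64  /* one label per bit of the 64-bit map below */

/* Hypothetical pool for one destination node: bit n set = tlabel n in flight. */
struct tlabel_pool {
    uint64_t in_use;
};

/* Returns a free tlabel in 0..63, or -1 if all 64 are still outstanding. */
int tlabel_get(struct tlabel_pool *pool)
{
    for (int t = 0; t < TLABEL_COUNT; t++) {
        uint64_t bit = (uint64_t)1 << t;

        if (!(pool->in_use & bit)) {
            pool->in_use |= bit;
            return t;
        }
    }
    return -1;  /* the situation in which no write packet can be built */
}

/* Releases a tlabel once the response (or a time-out) ended the transaction. */
void tlabel_put(struct tlabel_pool *pool, int tlabel)
{
    assert(tlabel >= 0 && tlabel < TLABEL_COUNT);
    pool->in_use &= ~((uint64_t)1 << tlabel);
}
```

After 64 calls to tlabel_get() without an intervening tlabel_put(), the
65th call fails. A caller in atomic context cannot sleep until a label
comes back, so all it can do at that point is give up.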
> Any idea what to fix?
I am unsure.
Idea 1:
Maybe we could move the initiator's part of the protocol (in particular,
writes to command block agent register and writes to doorbell register)
out of atomic context, e.g. into an additional kthread or into a
workqueue. That would let sbp2 wait for availability of a tlabel.
(It would also allow us to implement a rate limit for ringing the
doorbell. The doorbell doesn't need to be rung every time when ORBs are
added in fast succession. It is sufficient to ring it after the last new
ORB, thereby using fewer tlabels and less bus bandwidth, as long as we
don't have gap count optimization. OTOH the doorbell should not be rung
too late, else the target's fetch agent may have become unnecessarily idle.)
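The doorbell rate limit mentioned above could look roughly like the
following sketch. Everything in it is an assumption made for
illustration: doorbell_state, orb_appended and doorbell_flush do not
exist in sbp2, locking between the two paths is omitted, and the
counter increment merely stands in for the actual write to the doorbell
register.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for one logical unit. */
struct doorbell_state {
    unsigned int pending_orbs;  /* ORBs appended since the last ring */
    unsigned int rings;         /* doorbell writes issued so far */
};

/* Queueing path (atomic context): only note that an ORB was appended. */
void orb_appended(struct doorbell_state *db)
{
    db->pending_orbs++;
}

/*
 * Worker path (process context, e.g. a kthread or workqueue item):
 * ring once for the whole batch. Returns true if a ring was issued.
 */
bool doorbell_flush(struct doorbell_state *db)
{
    if (db->pending_orbs == 0)
        return false;

    db->pending_orbs = 0;
    db->rings++;  /* placeholder for the write to the doorbell register */
    return true;
}
```

With this scheme, five ORBs appended in quick succession cost one
transaction and one tlabel instead of five. How soon the worker has to
run so that the fetch agent does not go idle is left open here.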
However, I am very uncertain for now about the impact of such a change
on code and performance, which is why I added linux1394-devel to the
Cc list in the hope of hearing others' thoughts.
Moreover, as I said in an earlier reply, there are many reports about
mysterious "aborting sbp2 command" mishaps but only few reports about
"sbp2util_node_write_no_wait failed" along with command abortions.
Therefore, IMO, we should implement changes of this kind only after we
understand the more common problem better and have an idea how such
changes may affect it, ideally cure it.
Idea 2:
We could add an eh_timed_out handler which simply rings the doorbell
again, or at least tries to do so. A drawback is that a full time-out of
a SCSI command has to occur before eh_timed_out is entered. Also,
eh_timed_out is called from atomic context, i.e. sbp2 can't do much
there either if sbp2util_node_write_no_wait fails again. Therefore this
idea is a lame workaround rather than a proper fix. It won't improve
performance relative to what you are seeing now, but it would lower the
risk of data loss.
> It doesn't seem justified to lock up for 30 seconds
> since a new label could be available much earlier. But that's just my guess.
These pauses aren't spent locked up in sbp2. They are the period that
the SCSI subsystem waits for completion of a task. After this time-out
the SCSI error handler springs into action (or, before that, the
eh_timed_out handler if there is one). The error handler orders the
target not to do any further work on the timed-out task and revokes the
task's data structure.
--
Stefan Richter
-=====-=-=-= =-== --=--
http://arcgraph.de/sr/