[prev in list] [next in list] [prev in thread] [next in thread] 

List:       freebsd-hackers
Subject:    Trying to understand CAM (and the cciss driver)
From:       Peter Eriksson <pen () lysator ! liu ! se>
Date:       2024-05-09 17:58:32
Message-ID: 087AB4C4-9CE0-4024-8E1B-4C636014C8CC () lysator ! liu ! se
[Download RAW message or body]

Hi,

I'm trying to fix a bug in the cciss driver that has been there "forever" when using \
it with an HP H241 SAS HBA card.  The driver works fine when all (SAS, spinning rust) \
drives are behaving well, but when some of them are starting to go bad it often goes \
into spin and either hangs the kernel or panics. I've been trying to add \
instrumentation to it in order to pin-point the problem and have been attempting some \
workarounds (like clearing cr_complete since without that hack sometime the driver \
get many many non-busy repeated requests with the same "tag" and then It panics with: \


> ciss1: WARNING: completing non-busy request
> ciss1: WARNING: completing non-busy request
> ciss1: WARNING: completing non-busy request
> panic: cam_periph_done_panic: already done with ccb 0xfffff88abefc4000
> cpuid = 0
> time = 1714852260
> KDB: stack backtrace:
> #0 0xffffffff80c541b5 at kdb_backtrace+0x65
> #1 0xffffffff80c06b31 at vpanic+0x151
> #2 0xffffffff80c069d3 at panic+0x43
> #3 0xffffffff8039017c at cam_periph_done_panic+0x1c
> #4 0xffffffff80397193 at xpt_done_process+0x313
> #5 0xffffffff803990a5 at xpt_done_td+0xf5
> #6 0xffffffff80bc333e at fork_exit+0x7e
> #7 0xffffffff8108a44e at fork_trampoline+0xe
 
but it's still there… 

I think I could use some input/suggestions from someone more knowledgeable with CAM \
and device drivers in general…

An example (this is with extra instrumentation added so not quite a stock kernel \
driver):

ciss1: *** Drive failure imminent, Port=1E Box=1 Bay=17 reason=0x37 [ts=4460, \
class=4, subclass=0, detail=1, tag=0x2] (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 \
00 00 00 04 7f 17 51 98 00 00 00 10 00 00  (da122:ciss1:32:78:0): CAM status: CCB \
request completed with an error (da122:ciss1:32:78:0): Retrying command, 3 more tries \
remain (da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 7f 17 52 18 00 00 00 \
10 00 00  (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 65 c8 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 63 c8 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 61 c8 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f c8 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f a8 00 00 00 20 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5f 38 00 00 00 70 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5d 38 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 5b 38 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 59 38 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 58 48 00 00 00 f0 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 7f 17 56 48 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 48 d0 00 00 00 10 00 \
00  (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): WRITE(16). CDB: 8a 00 00 00 00 04 88 d8 46 e0 00 00 01 f0 00 \
00  (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8f d0 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
(da122:ciss1:32:78:0): READ(16). CDB: 88 00 00 00 00 04 88 d8 8d d0 00 00 02 00 00 00 \
 (da122:ciss1:32:78:0): CAM status: CCB request completed with an error
(da122:ciss1:32:78:0): Retrying command, 3 more tries remain
ciss1: WARNING: completing non-busy request #1 [onq=0, length=262144, \
ccphys=1791048448, tag=72, flags=0x50 (----I-C), sg_tag=0x3, \
                cr_complete=0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
ciss1: WARNING: completing non-busy request #1 [onq=0, length=262144, \
ccphys=1791318368, tag=313, flags=0x50 (----I-C), sg_tag=0x3, \
                cr_complete=0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
ciss1: WARNING: completing non-busy request #1 [onq=0, length=8192, \
ccphys=1791226528, tag=231, flags=0x50 (----I-C), sg_tag=0x3, \
                cr_complete=0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
ciss1: WARNING: completing non-busy request #1 [onq=0, length=262144, \
ccphys=1791313888, tag=309, flags=0x50 (----I-C), sg_tag=0x5, \
                cr_complete=0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
ciss1: WARNING: completing non-busy request #1 [onq=0, length=8192, \
ccphys=1791206368, tag=213, flags=0x50 (----I-C), sg_tag=0x3, \
                cr_complete=0xffffffff806059a0]
ciss1: WARNING: clearing cr_complete due to ciss_cam_complete & nonbusy
panic: camq_remove: Attempt to remove out-of-bounds index -1 from queue \
0xfffff80107fa9838 of size 1 cpuid = 7
time = 1715266713
KDB: stack backtrace:
#0 0xffffffff80c542d5 at kdb_backtrace+0x65
#1 0xffffffff80c06c51 at vpanic+0x151
#2 0xffffffff80c06af3 at panic+0x43
#3 0xffffffff80390676 at camq_remove+0xf6
#4 0xffffffff803936c8 at xpt_run_devq+0x148
#5 0xffffffff80392b0c at xpt_action_default+0x35c
#6 0xffffffff803c3544 at dastart+0x314
#7 0xffffffff80394f73 at xpt_run_allocq+0x83
#8 0xffffffff803c5152 at dastrategy+0x82
#9 0xffffffff80b438d4 at g_disk_start+0x314
#10 0xffffffff80b46b55 at g_io_request+0x1d5
#11 0xffffffff80b46b55 at g_io_request+0x1d5
#12 0xffffffff821a34dc at vdev_geom_io_start+0x23c
#13 0xffffffff82347fb4 at zio_vdev_io_start+0x204
#14 0xffffffff8234204f at zio_nowait+0x12f
#15 0xffffffff822a0ce7 at vdev_mirror_io_start+0x177
#16 0xffffffff82347fb4 at zio_vdev_io_start+0x204
#17 0xffffffff8234204f at zio_nowait+0x12f
Uptime: 10h16m26s


The drive it complains about in the first line is one that has SMART errors:

> connector 1E box  1 bay 17                 HP      MB010000JWAYK                    \
> 7PH8LU3G     HPD5 S.M.A.R.T. predictive failure.

It is _not_ da122 though… The drive with a SMART error (da85) is not in use so \
basically is just sitting there.



The code where I'm trying to figure out what's going on is this part in \
sys/dev/ciss/ciss.c:

static void
ciss_complete(struct ciss_softc *sc, cr_qhead_t *qh)
{
    struct ciss_request *cr;

    debug_called(2);

    /*                                                                                \
                
     * Loop taking requests off the completed queue and performing                    \
                
     * completion processing on them.                                                 \
                
     */
    for (;;) {
      int got_nonbusy = 0;

        if ((cr = ciss_dequeue_complete(sc, qh)) == NULL)
            break;
        ciss_unmap_request(cr);

        if ((cr->cr_flags & CISS_REQ_BUSY) == 0) {
          cr->nonbusy_request_counter++;
          ciss_printf(sc,
                      "WARNING: completing non-busy request #%d [onq=%d, length=%u, \
ccphys=%u, tag=%d, flags=0x%x (%s), sg_tag=0x%x, cr_complete=%p]\n",  \
                cr->nonbusy_request_counter,
                      cr->cr_onq, cr->cr_length, cr->cr_ccphys, cr->cr_tag,
                      cr->cr_flags, cr_flags2str(cr->cr_flags),
                      cr->cr_sg_tag, cr->cr_complete
                      );
          got_nonbusy = 1;
        }

        cr->cr_flags &= ~CISS_REQ_BUSY;

        /*                                                                            \
                
         * If the request has a callback, invoke it.                                  \
                
         */
        if (cr->cr_complete != NULL) {
          cr->cr_complete(cr);

	/* pen: Attempt to make sure ciss_cam_complete() doesn't get called multiple times \
*/

          if (cr->cr_complete == ciss_cam_complete && got_nonbusy) {
            ciss_printf(sc, "WARNING: clearing cr_complete due to ciss_cam_complete & \
nonbusy\n");  cr->cr_complete = NULL;
          }
          continue;
        }      

        /*                                                                            \
                
         * If someone is sleeping on this request, wake them up.                      \
                
         */
        if (cr->cr_flags & CISS_REQ_SLEEP) {
          cr->cr_flags &= ~CISS_REQ_SLEEP;
          wakeup(cr);
          continue;
        }

        /*                                                                            \
                
         * If someone is polling this request for completion, signal.                 \
                
         */
        if (cr->cr_flags & CISS_REQ_POLL) {
          cr->cr_flags &= ~CISS_REQ_POLL;
          continue;
        }

        /*                                                                            \
                
         * Give up and throw the request back on the free queue.  This                \
                
         * should never happen; resources will probably be lost.                      \
                
         */
        ciss_printf(sc, "WARNING: completed command with no submitter\n");
        ciss_enqueue_free(cr);
    }
}


- Any suggestions? (Probably not but I thought I'd ask anyway :-)

- Peter


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic