List:       openbsd-tech
Subject:    ahci ncq error recovery
From:       Jonathan Matthew <jonathan () d14n ! org>
Date:       2017-02-27 23:16:34
Message-ID: 20170227231634.GA1218 () mild ! embarrassm ! net

Various people have reported kernel diagnostic assertion
"ccb->ccb_xa.state == ATA_S_ONCHIP" panics with ahci.  In short, this happens
when a queued command fails, we ask the device which command failed, and it
gives us the wrong answer.  The ccb_xa.state assertion then fires because the
command in the slot the device named was not active.

For non-queued commands, we already handle this by failing all active commands
(since r1.157 in 2010).  Every other driver I've looked at does the same for
both queued and non-queued commands, so I think it makes sense to handle
queued command errors the same way.
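
To make the idea concrete, here is a rough standalone model of the decision,
not the driver code itself.  The names toy_port, fail_all and
handle_ncq_error are made up for this sketch, and the slot bookkeeping is
much simpler than what ahci.c really does: if the slot the device reports is
not on-chip, stop trusting the answer and fail everything that was active.

```c
#include <stdint.h>

#define NSLOTS 32

/* Simplified stand-ins for the ATA_S_* command states (assumption). */
enum slot_state { SLOT_IDLE, SLOT_ONCHIP, SLOT_ERROR };

/* Hypothetical per-port state: one entry per command slot. */
struct toy_port {
	enum slot_state state[NSLOTS];
};

/* Fail every on-chip slot whose bit is set in `active`; return the count. */
static int
fail_all(struct toy_port *p, uint32_t active)
{
	int i, n = 0;

	for (i = 0; i < NSLOTS; i++) {
		if ((active & (1U << i)) && p->state[i] == SLOT_ONCHIP) {
			p->state[i] = SLOT_ERROR;
			n++;
		}
	}
	return n;
}

/*
 * The device claims err_slot failed.  If that slot is out of range or
 * not on-chip, fall back to failing all active commands instead of
 * asserting.  Returns the number of commands failed.
 */
static int
handle_ncq_error(struct toy_port *p, int err_slot, uint32_t active)
{
	if (err_slot < 0 || err_slot >= NSLOTS ||
	    p->state[err_slot] != SLOT_ONCHIP)
		return fail_all(p, active);

	p->state[err_slot] = SLOT_ERROR;
	return 1;
}
```

With slots 1 and 3 on-chip, a report naming slot 5 fails both commands, while
a report naming slot 3 fails only that one.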

This came out of the most recent thread about this on bugs@, where it seems
to have made a slight improvement, and has been in snaps for over a week,
so I think it should go in.

ok?


Index: ahci.c
===================================================================
RCS file: /cvs/src/sys/dev/ic/ahci.c,v
retrieving revision 1.28
diff -u -p -u -p -r1.28 ahci.c
--- ahci.c	2 Oct 2016 18:56:05 -0000	1.28
+++ ahci.c	27 Feb 2017 07:10:40 -0000
@@ -2158,6 +2158,12 @@ ahci_port_intr(struct ahci_port *ap, u_i
 				PORTNAME(ap), err_slot);
 
 			ccb = &ap->ap_ccbs[err_slot];
+			if (ccb->ccb_xa.state != ATA_S_ONCHIP) {
+				printf("%s: NCQ errored slot %d is idle"
+				    " (%08x active)\n", PORTNAME(ap), err_slot,
+				    ci_saved);
+				goto failall;
+			}
 		} else {
 			/* Didn't reset, could gather extended info from log. */
 		}
