[prev in list] [next in list] [prev in thread] [next in thread] 

List:       grub-devel
Subject:    Re: [PATCH] Better ahci error handling
From:       Michel Hermier <michel.hermier () gmail ! com>
Date:       2018-01-02 10:22:48
Message-ID: CAAZ5spAm2KEEjZJRDrPGH+GGdcw14HE8H4WreU3zKvXxOeyDqQ () mail ! gmail ! com
[Download RAW message or body]

[Attachment #2 (multipart/alternative)]


Hi,
IS_FATAL_ERROR sounds like a utility macro, I would add a parameter to test
the flags and make it return a Bool compatible value.
My 2 cents comments.
Cheers

Le 2 janv. 2018 11:05, "Stefan Fritsch" <sf@sfritsch.de> a écrit :

Hi,

on a Fujitsu E744 laptop we have a problem that sometimes there is a
very long delay (up to several minutes) when booting from hard disk.
It seems accessing the DVD drive (which has no disk inserted)
sometimes fails with some errors that are not correctly handled by
grub, which leads to each access being stalled until the 20s timeout
triggers. This seems to happen when grub is trying to read
filesystem/partition data.

Example debug output:

disk/ahci.c:1020: AHCI command scheduled
disk/ahci.c:1022: AHCI tfd = 50
disk/ahci.c:1024: AHCI inten = 7dc000ff
disk/ahci.c:1026: AHCI intstatus = 3
disk/ahci.c:1031: AHCI inten = 7dc000ff
disk/ahci.c:1033: AHCI tfd = 50
disk/ahci.c:1036: AHCI sig = eb140101
disk/ahci.c:1038: AHCI tfd = 2051
disk/ahci.c:1049: AHCI status <1 1 40000001 2051>
disk/ahci.c:926: AHCI tfd = 2051
disk/ahci.c:932: AHCI tfd = 2051
disk/ahci.c:936: AHCI tfd = 2051
disk/ahci.c:942: grub_ahci_read (size=0, cmdsize = 0)
disk/ahci.c:957: AHCI tfd = 2051, CL=0xda455000
disk/ahci.c:968: AHCI tfd = 2051
disk/ahci.c:980: AHCI tfd = 2051
disk/ahci.c:987: AHCI tfd = 2051
disk/ahci.c:998: cfis: 27 80 08 00 00 00 00 00
disk/ahci.c:1003: cfis: 00 00 00 00 00 00 00 00
disk/ahci.c:1015: PRDT = da453000, 0, ffffffff (128)

The problem is that the command_issue bit that is checked in the loop
is only reset if the "HBA receives a FIS which clears the BSY, DRQ, and
ERR bits for the command", but the ERR bit is never cleared.

The patch below seems to fix the issue by checking the error bits in
the interrupt status register. According to the AHCI 1.2 spec,
"Interrupt sources that are disabled (‘0') are still reflected in the
status registers.", so this should work even though grub uses polling.
The relevant bit in our case is the Task File Error Status (TFES),
which is equivalent to the ERR bit 0 in tfd. But the patch below also
checks the other error bits except for the "Interface non-fatal error
status" bit.

Cheers,
Stefan

diff --git a/grub-core/disk/ahci.c b/grub-core/disk/ahci.c
--- a/grub-core/disk/ahci.c
+++ b/grub-core/disk/ahci.c
@@ -82,6 +82,19 @@ enum grub_ahci_hba_port_command
     GRUB_AHCI_HBA_PORT_CMD_FR = 0x4000,
   };

+enum grub_ahci_hba_port_int_status
+  {
+         GRUB_AHCI_HBA_PORT_IS_IFS  = (1UL << 27),
+         GRUB_AHCI_HBA_PORT_IS_HBDS = (1UL << 28),
+         GRUB_AHCI_HBA_PORT_IS_HBFS = (1UL << 29),
+         GRUB_AHCI_HBA_PORT_IS_TFES = (1UL << 30),
+  };
+#define IS_FATAL_ERROR (\
+       GRUB_AHCI_HBA_PORT_IS_IFS| \
+       GRUB_AHCI_HBA_PORT_IS_HBDS|\
+       GRUB_AHCI_HBA_PORT_IS_HBFS|\
+       GRUB_AHCI_HBA_PORT_IS_TFES)
+
 struct grub_ahci_hba
 {
   grub_uint32_t cap;
@@ -1026,7 +1039,8 @@ grub_ahci_readwrite_real (struct grub_ahci_device
*dev,

   endtime = grub_get_time_ms () + (spinup ? 20000 : 20000);
   while ((dev->hba->ports[dev->port].command_issue & 1))
-    if (grub_get_time_ms () > endtime)
+    if (grub_get_time_ms () > endtime ||
+       (dev->hba->ports[dev->port].intstatus & IS_FATAL_ERROR))
       {
        grub_dprintf ("ahci", "AHCI status <%x %x %x %x>\n",
                      dev->hba->ports[dev->port].command_issue,
@@ -1034,7 +1048,10 @@ grub_ahci_readwrite_real (struct grub_ahci_device
*dev,
                      dev->hba->ports[dev->port].intstatus,
                      dev->hba->ports[dev->port].task_file_data);
        dev->hba->ports[dev->port].command_issue = 0;
-       err = grub_error (GRUB_ERR_IO, "AHCI transfer timed out");
+       if (dev->hba->ports[dev->port].intstatus & IS_FATAL_ERROR)
+         err = grub_error (GRUB_ERR_IO, "AHCI transfer error");
+       else
+         err = grub_error (GRUB_ERR_IO, "AHCI transfer timed out");
        if (!reset)
          grub_ahci_reset_port (dev, 1);
        break;
_______________________________________________
Grub-devel mailing list
Grub-devel@gnu.org
https://lists.gnu.org/mailman/listinfo/grub-devel

[Attachment #5 (text/html)]

<div dir="auto">Hi,<div dir="auto">IS_FATAL_ERROR sounds like a utility macro, I \
would add a parameter to test the flags and make it return a Bool compatible \
value.</div><div dir="auto">My 2 cents comments.</div><div \
dir="auto">Cheers</div></div><div class="gmail_extra"><br><div class="gmail_quote">Le \
2 janv. 2018 11:05, &quot;Stefan Fritsch&quot; &lt;<a \
href="mailto:sf@sfritsch.de">sf@sfritsch.de</a>&gt; a écrit  :<br \
type="attribution"><blockquote class="quote" style="margin:0 0 0 .8ex;border-left:1px \
#ccc solid;padding-left:1ex">Hi,<br> <br>
on a Fujitsu E744 laptop we have a problem that sometimes there is a<br>
very long delay (up to several minutes) when booting from hard disk.<br>
It seems accessing the DVD drive (which has no disk inserted)<br>
sometimes fails with some errors that are not correctly handled by<br>
grub, which leads to each access being stalled until the 20s timeout<br>
triggers. This seems to happen when grub is trying to read<br>
filesystem/partition data.<br>
<br>
Example debug output:<br>
<br>
disk/ahci.c:1020: AHCI command scheduled<br>
disk/ahci.c:1022: AHCI tfd = 50<br>
disk/ahci.c:1024: AHCI inten = 7dc000ff<br>
disk/ahci.c:1026: AHCI intstatus = 3<br>
disk/ahci.c:1031: AHCI inten = 7dc000ff<br>
disk/ahci.c:1033: AHCI tfd = 50<br>
disk/ahci.c:1036: AHCI sig = eb140101<br>
disk/ahci.c:1038: AHCI tfd = 2051<br>
disk/ahci.c:1049: AHCI status &lt;1 1 40000001 2051&gt;<br>
disk/ahci.c:926: AHCI tfd = 2051<br>
disk/ahci.c:932: AHCI tfd = 2051<br>
disk/ahci.c:936: AHCI tfd = 2051<br>
disk/ahci.c:942: grub_ahci_read (size=0, cmdsize = 0)<br>
disk/ahci.c:957: AHCI tfd = 2051, CL=0xda455000<br>
disk/ahci.c:968: AHCI tfd = 2051<br>
disk/ahci.c:980: AHCI tfd = 2051<br>
disk/ahci.c:987: AHCI tfd = 2051<br>
disk/ahci.c:998: cfis: 27 80 08 00 00 00 00 00<br>
disk/ahci.c:1003: cfis: 00 00 00 00 00 00 00 00<br>
disk/ahci.c:1015: PRDT = da453000, 0, ffffffff (128)<br>
<br>
The problem is that the command_issue bit that is checked in the loop<br>
is only reset if the &quot;HBA receives a FIS which clears the BSY, DRQ, and<br>
ERR bits for the command&quot;, but the ERR bit is never cleared.<br>
<br>
The patch below seems to fix the issue by checking the error bits in<br>
the interrupt status register. According to the AHCI 1.2 spec,<br>
&quot;Interrupt sources that are disabled (‘0') are still reflected in the<br>
status registers.&quot;, so this should work even though grub uses polling.<br>
The relevant bit in our case is the Task File Error Status (TFES),<br>
which is equivalent to the ERR bit 0 in tfd. But the patch below also<br>
checks the other error bits except for the &quot;Interface non-fatal error<br>
status&quot; bit.<br>
<br>
Cheers,<br>
Stefan<br>
<br>
diff --git a/grub-core/disk/ahci.c b/grub-core/disk/ahci.c<br>
--- a/grub-core/disk/ahci.c<br>
+++ b/grub-core/disk/ahci.c<br>
@@ -82,6 +82,19 @@ enum grub_ahci_hba_port_command<br>
        GRUB_AHCI_HBA_PORT_CMD_FR = 0x4000,<br>
     };<br>
<br>
+enum grub_ahci_hba_port_int_status<br>
+   {<br>
+              GRUB_AHCI_HBA_PORT_IS_IFS   = (1UL &lt;&lt; 27),<br>
+              GRUB_AHCI_HBA_PORT_IS_HBDS = (1UL &lt;&lt; 28),<br>
+              GRUB_AHCI_HBA_PORT_IS_HBFS = (1UL &lt;&lt; 29),<br>
+              GRUB_AHCI_HBA_PORT_IS_TFES = (1UL &lt;&lt; 30),<br>
+   };<br>
+#define IS_FATAL_ERROR (\<br>
+           GRUB_AHCI_HBA_PORT_IS_IFS| \<br>
+           GRUB_AHCI_HBA_PORT_IS_HBDS|\<br>
+           GRUB_AHCI_HBA_PORT_IS_HBFS|\<br>
+           GRUB_AHCI_HBA_PORT_IS_TFES)<br>
+<br>
  struct grub_ahci_hba<br>
  {<br>
     grub_uint32_t cap;<br>
@@ -1026,7 +1039,8 @@ grub_ahci_readwrite_real (struct grub_ahci_device *dev,<br>
<br>
     endtime = grub_get_time_ms () + (spinup ? 20000 : 20000);<br>
     while ((dev-&gt;hba-&gt;ports[dev-&gt;port].<wbr>command_issue &amp; 1))<br>
-      if (grub_get_time_ms () &gt; endtime)<br>
+      if (grub_get_time_ms () &gt; endtime ||<br>
+           (dev-&gt;hba-&gt;ports[dev-&gt;port].<wbr>intstatus &amp; \
IS_FATAL_ERROR))<br>  {<br>
            grub_dprintf (&quot;ahci&quot;, &quot;AHCI status &lt;%x %x %x \
                %x&gt;\n&quot;,<br>
                                 \
dev-&gt;hba-&gt;ports[dev-&gt;port].<wbr>command_issue,<br> @@ -1034,7 +1048,10 @@ \
                grub_ahci_readwrite_real (struct grub_ahci_device *dev,<br>
                                 \
                dev-&gt;hba-&gt;ports[dev-&gt;port].<wbr>intstatus,<br>
                                 \
                dev-&gt;hba-&gt;ports[dev-&gt;port].<wbr>task_file_data);<br>
            dev-&gt;hba-&gt;ports[dev-&gt;port].<wbr>command_issue = 0;<br>
-           err = grub_error (GRUB_ERR_IO, &quot;AHCI transfer timed out&quot;);<br>
+           if (dev-&gt;hba-&gt;ports[dev-&gt;port].<wbr>intstatus &amp; \
IS_FATAL_ERROR)<br> +              err = grub_error (GRUB_ERR_IO, &quot;AHCI transfer \
error&quot;);<br> +           else<br>
+              err = grub_error (GRUB_ERR_IO, &quot;AHCI transfer timed \
out&quot;);<br>  if (!reset)<br>
               grub_ahci_reset_port (dev, 1);<br>
            break;<br>______________________________<wbr>_________________<br>
Grub-devel mailing list<br>
<a href="mailto:Grub-devel@gnu.org">Grub-devel@gnu.org</a><br>
<a href="https://lists.gnu.org/mailman/listinfo/grub-devel" rel="noreferrer" \
target="_blank">https://lists.gnu.org/mailman/<wbr>listinfo/grub-devel</a><br> \
<br></blockquote></div><br></div>



_______________________________________________
Grub-devel mailing list
Grub-devel@gnu.org
https://lists.gnu.org/mailman/listinfo/grub-devel


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic