[prev in list] [next in list] [prev in thread] [next in thread] 

List:       llvm-bugs
Subject:    [llvm-bugs] [Bug 28780] New: "edit distance" algorithm is wrong
From:       via llvm-bugs <llvm-bugs () lists ! llvm ! org>
Date:       2016-07-30 16:02:57
Message-ID: bug-28780-206 () http ! llvm ! org/bugs/
[Download RAW message or body]

--1469894578.8F46d61.17334
Date: Sat, 30 Jul 2016 11:02:58 -0500
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

https://llvm.org/bugs/show_bug.cgi?id=28780

            Bug ID: 28780
           Summary: "edit distance" algorithm is wrong
           Product: compiler-rt
           Version: unspecified
          Hardware: All
                OS: All
            Status: NEW
          Severity: normal
          Priority: P
         Component: compiler-rt
          Assignee: unassignedbugs@nondot.org
          Reporter: jiangg@mail.ustc.edu.cn
                CC: llvm-bugs@lists.llvm.org, nicolasweber@gmx.de
    Classification: Unclassified

In "llvm/include/llvm/ADT/edit_distance.h:ComputeEditDistance()" function, if
two strings(or arrays) are both empty, the result would be a random value, but
expected zero.
The current version on 2016.07.29 is:

    template<typename T>
    unsigned ComputeEditDistance(ArrayRef<T> FromArray, ArrayRef<T> ToArray,
                                 bool AllowReplacements = true,
                                 unsigned MaxEditDistance = 0) {
      // The algorithm implemented below is the "classic"
      // dynamic-programming algorithm for computing the Levenshtein
      // distance, which is described here:
      //
      //   http://en.wikipedia.org/wiki/Levenshtein_distance
      //
      // Although the algorithm is typically described using an m x n
      // array, only one row plus one element are used at a time, so this
      // implementation just keeps one vector for the row.  To update one
entry,
      // only the entries to the left, top, and top-left are needed.  The left
      // entry is in Row[x-1], the top entry is what's in Row[x] from the last
      // iteration, and the top-left entry is stored in Previous.
      typename ArrayRef<T>::size_type m = FromArray.size();
      typename ArrayRef<T>::size_type n = ToArray.size();

      const unsigned SmallBufferSize = 64;
      unsigned SmallBuffer[SmallBufferSize];
      std::unique_ptr<unsigned[]> Allocated;
      unsigned *Row = SmallBuffer;
      if (n + 1 > SmallBufferSize) {
        Row = new unsigned[n + 1];
        Allocated.reset(Row);
      }

      for (unsigned i = 1; i <= n; ++i)
        Row[i] = i;

      for (typename ArrayRef<T>::size_type y = 1; y <= m; ++y) {
        Row[0] = y;
        unsigned BestThisRow = Row[0];

        unsigned Previous = y - 1;
        for (typename ArrayRef<T>::size_type x = 1; x <= n; ++x) {
          int OldRow = Row[x];
          if (AllowReplacements) {
            Row[x] = std::min(
                Previous + (FromArray[y-1] == ToArray[x-1] ? 0u : 1u),
                std::min(Row[x-1], Row[x])+1);
          }
          else {
            if (FromArray[y-1] == ToArray[x-1]) Row[x] = Previous;
            else Row[x] = std::min(Row[x-1], Row[x]) + 1;
          }
          Previous = OldRow;
          BestThisRow = std::min(BestThisRow, Row[x]);
        }

        if (MaxEditDistance && BestThisRow > MaxEditDistance)
          return MaxEditDistance + 1;
      }

      unsigned Result = Row[n];
      return Result;
    }

The above algorithm was first introduced on 2015.07.13: git-svn-id:
https://llvm.org/svn/llvm-project/llvm/trunk@242069
91177308-0d34-0410-b5e6-96231b3b80d8. And the last right version is git-svn-id:
https://llvm.org/svn/llvm-project/llvm/trunk@240390
91177308-0d34-0410-b5e6-96231b3b80d8.

The above `ComputeEditDistance` function is used in
"llvm/lib/Support/StringRef.cpp:StringRef::edit_distance()". So you can
reproduce the bug via calling `StringRef::edit_distance()`. NOTE that the
returned random value may be zero.

How to fix it? The simple solution is to initialize the stack-based array
explictly: `unsigned SmallBuffer[SmallBufferSize]{};`. The safer solution is to
use RAII-style container, however, at the cost for allocating memory dynamicly
even if small number of elements.

Accordingly, unit test can be strengthened in
"llvm/unittests/ADT/StringRefTest.cpp":

    diff --git a/unittests/ADT/StringRefTest.cpp
b/unittests/ADT/StringRefTest.cpp
    index 66e5944..306dc23 100644
    --- a/unittests/ADT/StringRefTest.cpp
    +++ b/unittests/ADT/StringRefTest.cpp
    @@ -390,6 +390,11 @@ TEST(StringRefTest, Count) {
     }

     TEST(StringRefTest, EditDistance) {
    +  StringRef StrNull;
    +  StringRef StrEmpty("");
    +  EXPECT_EQ(0U, StrNull.edit_distance(""));
    +  EXPECT_EQ(0U, StrEmpty.edit_distance(""));
    +
       StringRef Str("hello");
       EXPECT_EQ(2U, Str.edit_distance("hill"));
     }

That's all.

Gang JIANG
jiangg@mail.ustc.edu.cn
http://justme0.com/
University of Science and Technology of China

-- 
You are receiving this mail because:
You are on the CC list for the bug.

--1469894578.8F46d61.17334
Date: Sat, 30 Jul 2016 11:02:58 -0500
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"

<html>
    <head>
      <base href="https://llvm.org/bugs/" />
    </head>
    <body><table border="1" cellspacing="0" cellpadding="8">
        <tr>
          <th>Bug ID</th>
          <td><a class="bz_bug_link 
          bz_status_NEW "
   title="NEW --- - &quot;edit distance&quot; algorithm is wrong"
   href="https://llvm.org/bugs/show_bug.cgi?id=28780">28780</a>
          </td>
        </tr>

        <tr>
          <th>Summary</th>
          <td>&quot;edit distance&quot; algorithm is wrong
          </td>
        </tr>

        <tr>
          <th>Product</th>
          <td>compiler-rt
          </td>
        </tr>

        <tr>
          <th>Version</th>
          <td>unspecified
          </td>
        </tr>

        <tr>
          <th>Hardware</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>OS</th>
          <td>All
          </td>
        </tr>

        <tr>
          <th>Status</th>
          <td>NEW
          </td>
        </tr>

        <tr>
          <th>Severity</th>
          <td>normal
          </td>
        </tr>

        <tr>
          <th>Priority</th>
          <td>P
          </td>
        </tr>

        <tr>
          <th>Component</th>
          <td>compiler-rt
          </td>
        </tr>

        <tr>
          <th>Assignee</th>
          <td>unassignedbugs&#64;nondot.org
          </td>
        </tr>

        <tr>
          <th>Reporter</th>
          <td>jiangg&#64;mail.ustc.edu.cn
          </td>
        </tr>

        <tr>
          <th>CC</th>
          <td>llvm-bugs&#64;lists.llvm.org, nicolasweber&#64;gmx.de
          </td>
        </tr>

        <tr>
          <th>Classification</th>
          <td>Unclassified
          </td>
        </tr></table>
      <p>
        <div>
        <pre>In &quot;llvm/include/llvm/ADT/edit_distance.h:ComputeEditDistance()&quot; \
function, if two strings(or arrays) are both empty, the result would be a random \
value, but expected zero.
The current version on 2016.07.29 is:

    template&lt;typename T&gt;
    unsigned ComputeEditDistance(ArrayRef&lt;T&gt; FromArray, ArrayRef&lt;T&gt; \
ToArray,  bool AllowReplacements = true,
                                 unsigned MaxEditDistance = 0) {
      // The algorithm implemented below is the &quot;classic&quot;
      // dynamic-programming algorithm for computing the Levenshtein
      // distance, which is described here:
      //
      //   <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">http://en.wikipedia.org/wiki/Levenshtein_distance</a>
  //
      // Although the algorithm is typically described using an m x n
      // array, only one row plus one element are used at a time, so this
      // implementation just keeps one vector for the row.  To update one
entry,
      // only the entries to the left, top, and top-left are needed.  The left
      // entry is in Row[x-1], the top entry is what's in Row[x] from the last
      // iteration, and the top-left entry is stored in Previous.
      typename ArrayRef&lt;T&gt;::size_type m = FromArray.size();
      typename ArrayRef&lt;T&gt;::size_type n = ToArray.size();

      const unsigned SmallBufferSize = 64;
      unsigned SmallBuffer[SmallBufferSize];
      std::unique_ptr&lt;unsigned[]&gt; Allocated;
      unsigned *Row = SmallBuffer;
      if (n + 1 &gt; SmallBufferSize) {
        Row = new unsigned[n + 1];
        Allocated.reset(Row);
      }

      for (unsigned i = 1; i &lt;= n; ++i)
        Row[i] = i;

      for (typename ArrayRef&lt;T&gt;::size_type y = 1; y &lt;= m; ++y) {
        Row[0] = y;
        unsigned BestThisRow = Row[0];

        unsigned Previous = y - 1;
        for (typename ArrayRef&lt;T&gt;::size_type x = 1; x &lt;= n; ++x) {
          int OldRow = Row[x];
          if (AllowReplacements) {
            Row[x] = std::min(
                Previous + (FromArray[y-1] == ToArray[x-1] ? 0u : 1u),
                std::min(Row[x-1], Row[x])+1);
          }
          else {
            if (FromArray[y-1] == ToArray[x-1]) Row[x] = Previous;
            else Row[x] = std::min(Row[x-1], Row[x]) + 1;
          }
          Previous = OldRow;
          BestThisRow = std::min(BestThisRow, Row[x]);
        }

        if (MaxEditDistance &amp;&amp; BestThisRow &gt; MaxEditDistance)
          return MaxEditDistance + 1;
      }

      unsigned Result = Row[n];
      return Result;
    }

The above algorithm was first introduced on 2015.07.13: git-svn-id:
<a href="https://llvm.org/svn/llvm-project/llvm/trunk&#64;242069">https://llvm.org/svn/llvm-project/llvm/trunk&#64;242069</a>
 91177308-0d34-0410-b5e6-96231b3b80d8. And the last right version is git-svn-id:
<a href="https://llvm.org/svn/llvm-project/llvm/trunk&#64;240390">https://llvm.org/svn/llvm-project/llvm/trunk&#64;240390</a>
 91177308-0d34-0410-b5e6-96231b3b80d8.

The above `ComputeEditDistance` function is used in
&quot;llvm/lib/Support/StringRef.cpp:StringRef::edit_distance()&quot;. So you can
reproduce the bug via calling `StringRef::edit_distance()`. NOTE that the
returned random value may be zero.

How to fix it? The simple solution is to initialize the stack-based array
explictly: `unsigned SmallBuffer[SmallBufferSize]{};`. The safer solution is to
use RAII-style container, however, at the cost for allocating memory dynamicly
even if small number of elements.

Accordingly, unit test can be strengthened in
&quot;llvm/unittests/ADT/StringRefTest.cpp&quot;:

    diff --git a/unittests/ADT/StringRefTest.cpp
b/unittests/ADT/StringRefTest.cpp
    index 66e5944..306dc23 100644
    --- a/unittests/ADT/StringRefTest.cpp
    +++ b/unittests/ADT/StringRefTest.cpp
    &#64;&#64; -390,6 +390,11 &#64;&#64; TEST(StringRefTest, Count) {
     }

     TEST(StringRefTest, EditDistance) {
    +  StringRef StrNull;
    +  StringRef StrEmpty(&quot;&quot;);
    +  EXPECT_EQ(0U, StrNull.edit_distance(&quot;&quot;));
    +  EXPECT_EQ(0U, StrEmpty.edit_distance(&quot;&quot;));
    +
       StringRef Str(&quot;hello&quot;);
       EXPECT_EQ(2U, Str.edit_distance(&quot;hill&quot;));
     }

That's all.

Gang JIANG
<a href="mailto:jiangg&#64;mail.ustc.edu.cn">jiangg&#64;mail.ustc.edu.cn</a>
<a href="http://justme0.com/">http://justme0.com/</a>
University of Science and Technology of China</pre>
        </div>
      </p>
      <hr>
      <span>You are receiving this mail because:</span>
      
      <ul>
          <li>You are on the CC list for the bug.</li>
      </ul>
    </body>
</html>

--1469894578.8F46d61.17334--


[Attachment #3 (text/plain)]

_______________________________________________
llvm-bugs mailing list
llvm-bugs@lists.llvm.org
http://lists.llvm.org/cgi-bin/mailman/listinfo/llvm-bugs


[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic