'[kbibtex/kbibtex/0.7] src/io: Avoid duplicate hits when extracting URLs from entries'

List:       kde-commits
Subject:    [kbibtex/kbibtex/0.7] src/io: Avoid duplicate hits when extracting URLs from entries
From:       Thomas Fischer <null () kde ! org>
Date:       2017-03-31 21:56:05
Message-ID: E1cu4Wb-0004F6-EE () code ! kde ! org

Git commit 11b6e20bd2e6486406ba56e1717e5c5cd01f4cba by Thomas Fischer.
Committed on 31/03/2017 at 21:47.
Pushed by thomasfischer into branch 'kbibtex/0.7'.

Avoid duplicate hits when extracting URLs from entries

When extracting URLs from an entry, KBibTeX first looks for
text patterns that look like DOIs, which are then prefixed
with 'http://dx.doi.org' (automatic online resolver).
To avoid multiple hits, the old code removed the extracted
DOI pattern from the input text and continued to look for
URLs in the remaining text.

However, if the input text was something like
  'http://doi.example.org/10.1051/a946-000'
the remainder after removal of the DOI would have been
  'http://doi.example.org/'
which a later step would then also have recognized as a
separate URL.

The new code introduced with this commit checks whether the
text directly preceding a found DOI looks like a URL.
If that is the case, the URL-like part is removed from the
input text together with the DOI before the search for more
URLs continues. Extracted DOIs will still be prefixed with
'http://dx.doi.org'.
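
The check described above can be sketched outside of Qt roughly as
follows. This is only an illustration, not the code in fileinfo.cpp:
it assumes std::regex in place of QRegExp, the helper name
removeDoiAndUrlPrefix is hypothetical, and the DOI position and
length are assumed to come from an earlier DOI search. The backward
loop approximates what QRegExp::lastIndexIn does in the actual patch.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not part of KBibTeX): removes the DOI located at
// [doiPos, doiPos + doiLen) from 'text'. If a URL-like prefix runs directly
// into the DOI, that prefix is removed as well.
// Assumes doiPos + doiLen <= text.size().
std::string removeDoiAndUrlPrefix(const std::string &text, std::size_t doiPos, std::size_t doiLen)
{
    // Same pattern as in the patch, written for std::regex (ECMAScript syntax)
    static const std::regex urlPrefix("https?://[a-z0-9./]+/");

    std::size_t cutFrom = doiPos; // default: remove only the DOI itself

    // Approximate QRegExp::lastIndexIn: try start positions from doiPos
    // backwards and stop at the first position where the pattern matches.
    for (std::size_t start = doiPos + 1; start-- > 0;) {
        std::smatch m;
        if (std::regex_search(text.begin() + start, text.end(), m, urlPrefix,
                              std::regex_constants::match_continuous)) {
            // The greedy match may run into the DOI (e.g. up to '10.1051/').
            // Only if it reaches past doiPos does the URL directly precede the DOI.
            if (static_cast<std::size_t>(m.length(0)) > doiPos - start)
                cutFrom = start;
            break;
        }
    }

    return text.substr(0, cutFrom) + text.substr(doiPos + doiLen);
}
```

With the example from this message, calling the helper on
'http://doi.example.org/10.1051/a946-000' with the DOI at offset 23
and length 16 leaves an empty string, so no stray
'http://doi.example.org/' remains. A URL that ends before the DOI
starts, as in 'http://example.org/ and 10.1000/38-abc', is left in
place.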

Manual backport of commit ec9c7f840f from branch 'master'.

M  +13   -1    src/io/fileinfo.cpp

https://commits.kde.org/kbibtex/11b6e20bd2e6486406ba56e1717e5c5cd01f4cba

diff --git a/src/io/fileinfo.cpp b/src/io/fileinfo.cpp
index 0b11c11b..13ed028d 100644
--- a/src/io/fileinfo.cpp
+++ b/src/io/fileinfo.cpp
@@ -99,7 +99,19 @@ void FileInfo::urlsInText(const QString &text, TestExistence testExistence, cons
         if (url.isValid() && !result.contains(url))
             result << url;
         /// remove match from internal text to avoid duplicates
-        internalText = internalText.left(pos) + internalText.mid(pos + match.length());
+
+        /// Cut away any URL that may be right before found DOI number:
+        /// For example, if DOI '10.1000/38-abc' was found in
+        ///   'Lore ipsum http://doi.example.org/10.1000/38-abc Lore ipsum'
+        /// also remove 'http://doi.example.org/' from the text, keeping only
+        ///   'Lore ipsum  Lore ipsum'
+        static const QRegExp genericDoiUrlPrefix(QLatin1String("http[s]?://[a-z0-9./]+/")); ///< looks like an URL
+        const int urlStartPos = genericDoiUrlPrefix.lastIndexIn(internalText, pos);
+        if (urlStartPos >= 0 && genericDoiUrlPrefix.cap(0).length() > pos - urlStartPos)
+            /// genericDoiUrlPrefix.cap(0) may contain (parts of) DOI
+            internalText = internalText.left(urlStartPos) + internalText.mid(pos + match.length());
+        else
+            internalText = internalText.left(pos) + internalText.mid(pos + match.length());
     }
 
     const QStringList fileList = internalText.split(KBibTeX::fileListSeparatorRegExp, QString::SkipEmptyParts);

