Git commit cfb4bff158bf5472cc3d3d98a176d6927a3d7aee by Thomas Fischer.
Committed on 31/03/2017 at 21:41.
Pushed by thomasfischer into branch 'master'.

Avoid duplicate hits when extracting URLs from entries

When extracting URLs from an entry, KBibTeX first looks for text
patterns that look like DOIs, which are then prefixed with
'http://dx.doi.org' (automatic online resolver).
To avoid multiple hits, the old code removed the extracted DOI pattern
from the input text and continued to look for URLs in the remaining
text. However, for input text like
'http://doi.example.org/10.1051/a946-000', the remainder after removing
the DOI was 'http://doi.example.org/', which a later step then also
recognized as a separate URL.

The new code introduced with this commit checks whether the text
directly preceding a found DOI looks like a URL. If so, the URL-like
part is removed from the input text as well before the search for more
URLs continues. Extracted DOIs will still be prefixed with 'dx.doi.org'.

M  +13   -1    src/io/fileinfo.cpp

https://commits.kde.org/kbibtex/cfb4bff158bf5472cc3d3d98a176d6927a3d7aee

diff --git a/src/io/fileinfo.cpp b/src/io/fileinfo.cpp
index c8e567fb..4710310d 100644
--- a/src/io/fileinfo.cpp
+++ b/src/io/fileinfo.cpp
@@ -107,7 +107,19 @@ void FileInfo::urlsInText(const QString &text, TestExistence testExistence, cons
         if (url.isValid() && !result.contains(url))
             result << url;
         /// remove match from internal text to avoid duplicates
-        internalText = internalText.left(pos) + internalText.mid(pos + match.length());
+
+        /// Cut away any URL that may be right before found DOI number:
+        /// For example, if DOI '10.1000/38-abc' was found in
+        ///   'Lore ipsum http://doi.example.org/10.1000/38-abc Lore ipsum'
+        /// also remove 'http://doi.example.org/' from the text, keeping only
+        ///   'Lore ipsum Lore ipsum'
+        static const QRegExp genericDoiUrlPrefix(QStringLiteral("http[s]?://[a-z0-9./]+/")); ///< looks like an URL
+        const int urlStartPos = genericDoiUrlPrefix.lastIndexIn(internalText, pos);
+        if (urlStartPos >= 0 && genericDoiUrlPrefix.cap(0).length() > pos - urlStartPos)
+            /// genericDoiUrlPrefix.cap(0) may contain (parts of) DOI
+            internalText = internalText.left(urlStartPos) + internalText.mid(pos + match.length());
+        else
+            internalText = internalText.left(pos) + internalText.mid(pos + match.length());
     }
 
     const QStringList fileList = internalText.split(KBibTeX::fileListSeparatorRegExp, QString::SkipEmptyParts);
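
For readers without a Qt build at hand, the behaviour described in the commit message can be approximated with a small standalone sketch. This is not the KBibTeX code: it uses std::regex instead of QRegExp, the names Extraction and extractDoiUrls are hypothetical, and the DOI pattern is a simplified stand-in for the one KBibTeX actually uses. Instead of QRegExp::lastIndexIn, the sketch anchors a URL-prefix pattern at the end of the text preceding the DOI, which is a different technique aimed at the same effect.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: resolver URLs built from DOIs, plus the text
// that is left over for a later, generic URL search.
struct Extraction {
    std::vector<std::string> doiUrls;
    std::string remainder;
};

// Sketch of the post-commit behaviour (assumed names, simplified patterns).
Extraction extractDoiUrls(std::string text) {
    // Assumption: simplified DOI pattern, not KBibTeX's real one.
    static const std::regex doiPattern(R"(10\.[0-9]{4,}/[^\s]+)");
    // URL-like prefix that ends exactly where the DOI starts
    // (lower-case only, like the character class in the commit).
    static const std::regex urlPrefixAtEnd(R"(https?://[a-z0-9./]+/$)");

    Extraction out;
    std::smatch doi;
    while (std::regex_search(text, doi, doiPattern)) {
        const std::size_t pos = static_cast<std::size_t>(doi.position(0));
        const std::size_t len = static_cast<std::size_t>(doi.length(0));

        // DOIs are still reported through the dx.doi.org resolver.
        const std::string url = "http://dx.doi.org/" + doi.str(0);
        if (std::find(out.doiUrls.begin(), out.doiUrls.end(), url) == out.doiUrls.end())
            out.doiUrls.push_back(url);

        // By default cut only the DOI; if a URL-like prefix sits directly
        // in front of it, start the cut at the beginning of that prefix.
        std::size_t cutFrom = pos;
        const std::string before = text.substr(0, pos);
        std::smatch prefix;
        if (std::regex_search(before, prefix, urlPrefixAtEnd))
            cutFrom = static_cast<std::size_t>(prefix.position(0));
        assert(cutFrom <= pos);

        text = text.substr(0, cutFrom) + text.substr(pos + len);
    }
    out.remainder = text;
    return out;
}
```

Calling extractDoiUrls on the example from the code comment, 'Lore ipsum http://doi.example.org/10.1000/38-abc Lore ipsum', yields the single resolver URL 'http://dx.doi.org/10.1000/38-abc' and a remainder that no longer contains 'http://doi.example.org/', so a subsequent generic URL search has nothing left to report twice.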