Git commit 11b6e20bd2e6486406ba56e1717e5c5cd01f4cba by Thomas Fischer.
Committed on 31/03/2017 at 21:47.
Pushed by thomasfischer into branch 'kbibtex/0.7'.

Avoid duplicate hits when extracting URLs from entries

When extracting URLs from an entry, KBibTeX first looks for text
patterns that look like DOIs, which are then prefixed with
'http://dx.doi.org' (automatic online resolver). To avoid multiple hits,
the old code removed the extracted DOI pattern from the input text and
continued to look for URLs in the remaining text.
However, in scenarios where the input text was like
'http://doi.example.org/10.1051/a946-000', the remainder after removal of
the DOI would have been 'http://doi.example.org/', which a later step
would then have recognized as a separate URL.

The new code introduced with this commit checks whether the text directly
preceding a found DOI looks like a URL. If that is the case, the
URL-like part is removed from the input text as well before the search
for more URLs continues. Extracted DOIs will still be prefixed with
'dx.doi.org'.

Manual backport of commit ec9c7f840f from branch 'master'.

M  +13   -1    src/io/fileinfo.cpp

https://commits.kde.org/kbibtex/11b6e20bd2e6486406ba56e1717e5c5cd01f4cba

diff --git a/src/io/fileinfo.cpp b/src/io/fileinfo.cpp
index 0b11c11b..13ed028d 100644
--- a/src/io/fileinfo.cpp
+++ b/src/io/fileinfo.cpp
@@ -99,7 +99,19 @@ void FileInfo::urlsInText(const QString &text, TestExistence testExistence, cons
         if (url.isValid() && !result.contains(url))
             result << url;
         /// remove match from internal text to avoid duplicates
-        internalText = internalText.left(pos) + internalText.mid(pos + match.length());
+
+        /// Cut away any URL that may be right before found DOI number:
+        /// For example, if DOI '10.1000/38-abc' was found in
+        ///   'Lore ipsum http://doi.example.org/10.1000/38-abc Lore ipsum'
+        /// also remove 'http://doi.example.org/' from the text, keeping only
+        ///   'Lore ipsum  Lore ipsum'
+        static const QRegExp genericDoiUrlPrefix(QLatin1String("http[s]?://[a-z0-9./]+/")); ///< looks like an URL
+        const int urlStartPos = genericDoiUrlPrefix.lastIndexIn(internalText, pos);
+        if (urlStartPos >= 0 && genericDoiUrlPrefix.cap(0).length() > pos - urlStartPos)
+            /// genericDoiUrlPrefix.cap(0) may contain (parts of) DOI
+            internalText = internalText.left(urlStartPos) + internalText.mid(pos + match.length());
+        else
+            internalText = internalText.left(pos) + internalText.mid(pos + match.length());
     }
 
     const QStringList fileList = internalText.split(KBibTeX::fileListSeparatorRegExp, QString::SkipEmptyParts);
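
The prefix check in the diff above can be tried outside Qt with a small standalone sketch. It is an illustration only, not KBibTeX code: it uses std::regex instead of QRegExp, the DOI pattern is a simplified assumption (the real DOI expression is not part of this diff), and the names lastIndexIn and extractDois are hypothetical helpers introduced here. Only the URL prefix pattern is taken from the commit.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// ASSUMPTION: simplified DOI pattern for illustration; KBibTeX's own
// DOI regular expression is not shown in the diff.
static const std::regex doiPattern(R"(10\.[0-9]{4,}/\S+)");
// Same pattern as genericDoiUrlPrefix in the diff.
static const std::regex urlPrefixPattern(R"(http[s]?://[a-z0-9./]+/)");

// Hypothetical stand-in for QRegExp::lastIndexIn(text, pos): tries to match
// at start positions pos, pos-1, ..., 0 and returns the first start position
// that matches (or -1). matchedLength receives the length of that match.
static int lastIndexIn(const std::regex &re, const std::string &text,
                       int pos, int &matchedLength)
{
    assert(pos >= 0 && pos <= static_cast<int>(text.size()));
    for (int start = pos; start >= 0; --start) {
        std::smatch m;
        if (std::regex_search(text.cbegin() + start, text.cend(), m, re,
                              std::regex_constants::match_continuous)) {
            matchedLength = static_cast<int>(m.length(0));
            return start;
        }
    }
    matchedLength = 0;
    return -1;
}

// Hypothetical helper: collects DOIs found in 'text' and removes each DOI
// from 'text'. If a URL-like prefix runs directly into the DOI, the prefix
// is removed as well, so it cannot be picked up later as a separate URL.
std::vector<std::string> extractDois(std::string &text)
{
    std::vector<std::string> dois;
    std::smatch m;
    while (std::regex_search(text, m, doiPattern)) {
        const int pos = static_cast<int>(m.position(0));
        const int len = static_cast<int>(m.length(0));
        dois.push_back(m.str(0));

        int prefixLength = 0;
        const int urlStartPos = lastIndexIn(urlPrefixPattern, text, pos, prefixLength);
        // The prefix counts only if its match extends past the DOI's start,
        // i.e. the URL-like text is contiguous with the DOI.
        const bool prefixTouchesDoi = urlStartPos >= 0 && prefixLength > pos - urlStartPos;
        const int cutFrom = prefixTouchesDoi ? urlStartPos : pos;
        text = text.substr(0, cutFrom) + text.substr(pos + len);
    }
    return dois;
}
```

Called on the example from the code comment, 'Lore ipsum http://doi.example.org/10.1000/38-abc Lore ipsum', the sketch should return the single DOI '10.1000/38-abc' and leave no 'http' fragment in the text. A URL that is separated from the DOI by other text, as in 'http://a.b/ foo 10.1000/38-abc', is left in place, because the prefix match does not reach the DOI.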