List: nepomuk
Subject: [Nepomuk] Re: Virtuoso eating up CPUs
From: Sebastian_Trüg <trueg () kde ! org>
Date: 2011-01-08 10:27:20
Message-ID: 4D283C08.3050304 () kde ! org
In the meantime I have a different patch for kdelibs which should at
least stop KRunner from producing queries that make Virtuoso go
berserk.
Please test.
Cheers,
Sebastian
On 01/03/2011 05:15 PM, Sebastian Trüg wrote:
> I have a heavy patch indeed but it also requires an updated (still
> unreleased) Virtuoso which fixes a bug.
> I will let you know more soon.
>
> Cheers,
> Sebastian
>
> On 01/03/2011 03:58 PM, Will Stephenson wrote:
>> On Monday 03 January 2011 10:48:57 Sebastian Trüg wrote:
>>> I am on that one. These come from krunner....
>>
>> Glad to hear it. I'm catching up after an offline Xmas but if you need any
>> more info or patch testing just let me know.
>>
>> Will
>> _______________________________________________
>> Nepomuk mailing list
>> Nepomuk@kde.org
>> https://mail.kde.org/mailman/listinfo/nepomuk
>>
["1.diff" (text/plain)]
commit ea85b495f1a99aa604ebb3bc17912e7406d0387d
Author: Sebastian Trueg <trueg@kde.org>
Date: Fri Jan 7 21:16:40 2011 +0100
* Fixed the handling of quotes and keywords such as "AND", "OR", and "NOT" in LiteralTerm.
  Now correct bif:contains or regex filters are created for values.
* Made the query parser merge LiteralTerms into a single one to improve query performance.

While merging two LiteralTerms into one does not yield the exact same query (when merged
both literal tokens need to appear in the same property value while with separate LiteralTerms
the tokens can appear in different properties) it should cover close to all typical use cases
while increasing the performance significantly and getting rid of the nasty
"Virtuoso goes crazy when I use KRunner" bug.
diff --git a/nepomuk/query/literalterm.cpp b/nepomuk/query/literalterm.cpp
index 4623e82..f8173ac 100644
--- a/nepomuk/query/literalterm.cpp
+++ b/nepomuk/query/literalterm.cpp
@@ -75,41 +75,6 @@ QString Nepomuk::Query::LiteralTermPrivate::toSparqlGraphPattern( const QString&
namespace {
-QString prepareQueryText( const QString& text )
-{
- //
- // we try to be a little smart about creating the query text
- // by following a few simple rules:
- //
- // 1. enclose everything in quotes to be safe
- // 2. quotes in search terms are not handled. replace them with spaces
- // 3. replace double quotes with single quotes
- // [4. wildcards can only be used if they are preceeded by at least 4 chars]
- //
-
- QString s = text.simplified();
- if( s.isEmpty() )
- return s;
-
- // strip quotes
- if( s[0] == '"' || s[0] == '\'' ) {
- s = s.mid(1);
- }
- if( !s.isEmpty() &&
- ( s[s.length()-1] == '"' || s[s.length()-1] == '\'' ) ) {
- s.truncate(s.length()-1);
- }
-
- // replace quotes with spaces
- s.replace( '"', ' ' );
- s.replace( '\'', ' ' );
-
- // add quotes
- s = '\'' + s + '\'';
-
- return s;
-}
-
QString prepareRegexText( const QString& text )
{
QString filterRxStr = QRegExp::escape( text );
@@ -123,29 +88,107 @@ QString prepareRegexText( const QString& text )
QString Nepomuk::Query::LiteralTermPrivate::createContainsPattern( const QString& varName, const QString& text, Nepomuk::Query::QueryBuilderData* qbd )
{
- const int i = text.indexOf( QRegExp(QLatin1String("[\\?\\*]")) );
+ // each token with a negation flag
+ QList<QPair<QString, bool> > containsTokens;
+ QList<QPair<QString, bool> > regexTokens;
+
+ // we only support AND xor OR, not both at the same time
+ bool isUnion = false;
+
+ // gather all the tokens
+ bool inQuotes = false;
+ QString currentToken;
+ bool nextIsNegated = false;
+ int i = 0;
+ while( i < text.length() ) {
+ const QChar& c = text[i];
+ bool tokenEnd = false;
+
+ if( c == QChar('"') || c == QChar('\'') ) {
+ inQuotes = !inQuotes;
+ tokenEnd = !inQuotes;
+ }
+ else if( c.isSpace() && !inQuotes ) {
+ tokenEnd = true;
+ }
+ else {
+ currentToken.append(c);
+ }
- //
- // Virtuoso needs four leading chars when using wildcards. Thus, if there is less (this includes 0) we fall back to the slower regex filter
- //
- if( i < 0 || i > 3 ) {
- const QString finalText = prepareQueryText( text );
+ if( i == text.count()-1 ) {
+ tokenEnd = true;
+ }
- QString scoringPattern;
- if( qbd->query()->m_fullTextScoringEnabled ) {
- scoringPattern = QString::fromLatin1("OPTION (score %1) ").arg(qbd->createScoringVariable());
+ if( tokenEnd && !currentToken.isEmpty() ) {
+ //
+ // Handle the three special tokens supported in Virtuoso's full text search engine we support (there is also "near" which we do not handle yet)
+ //
+ if( currentToken.toLower() == QLatin1String("and") ) {
+ isUnion = false;
+ }
+ else if( currentToken.toLower() == QLatin1String("or") ) {
+ isUnion = true;
+ }
+ else if( currentToken.toLower() == QLatin1String("not") ) {
+ nextIsNegated = true;
+ }
+ else {
+ QPair<QString, bool> currentTokenPair = qMakePair( currentToken, nextIsNegated );
+
+ //
+ // Virtuoso needs four leading chars when using wildcards. Thus, if there is less (this includes 0) we fall back to the slower regex filter
+ //
+ const QStringList subTokens = currentToken.split( QLatin1Char(' '), QString::SkipEmptyParts );
+ bool needsRegex = false;
+ Q_FOREACH( const QString& subToken, subTokens ) {
+ const int i = subToken.indexOf( QRegExp(QLatin1String("[\\?\\*]")) );
+ if( i >= 0 && i < 4 ) {
+ needsRegex = true;
+ break;
+ }
+ }
+ if( !needsRegex ) {
+ containsTokens << currentTokenPair;
+ }
+ else {
+ regexTokens << currentTokenPair;
+ }
+ // reset the negation flag only once a real token has consumed it, so that "NOT foo" negates foo
+ nextIsNegated = false;
+ }
+
+ currentToken.clear();
}
- qbd->addFullTextSearchTerm( varName, finalText );
- return QString::fromLatin1( "%1 bif:contains \"%2\" %3. " )
- .arg( varName,
- finalText,
- scoringPattern );
+ ++i;
}
- else {
- return QString::fromLatin1( "FILTER(REGEX(%1, \"%2\")) . " )
- .arg( varName, prepareRegexText(text) );
+
+ // convert the tokens into SPARQL filters
+ QStringList filters;
+ QStringList containsFilterTokens;
+ for( int i = 0; i < containsTokens.count(); ++i ) {
+ QString containsFilterToken;
+ if( containsTokens[i].second )
+ containsFilterToken += QLatin1String("NOT ");
+ containsFilterToken += QString::fromLatin1("'%1'").arg(containsTokens[i].first);
+ containsFilterTokens << containsFilterToken;
}
+ if( !containsFilterTokens.isEmpty() ) {
+ filters << QString::fromLatin1("bif:contains(%1, \"%2\")")
+ .arg( varName,
+ containsFilterTokens.join( isUnion ? QLatin1String(" OR ") : QLatin1String(" AND ")) );
+ }
+ QStringList regexFilters;
+ for( int i = 0; i < regexTokens.count(); ++i ) {
+ QString regexFilter;
+ if( regexTokens[i].second )
+ regexFilter += QLatin1Char('!');
+ regexFilter += QString::fromLatin1( "REGEX(%1, \"%2\")" )
+ .arg( varName,
+ prepareRegexText(regexTokens[i].first) );
+ filters << regexFilter;
+ }
+
+ return QString( QLatin1String("FILTER(") + filters.join( isUnion ? QLatin1String(" || ") : QLatin1String(" && ") ) + QLatin1String(") . ") );
}
diff --git a/nepomuk/query/queryparser.cpp b/nepomuk/query/queryparser.cpp
index 3b793d4..656714d 100644
--- a/nepomuk/query/queryparser.cpp
+++ b/nepomuk/query/queryparser.cpp
@@ -130,9 +130,14 @@ namespace {
}
}
- Soprano::LiteralValue createLiteral( const QString& s_, bool globbing ) {
- bool hadQuotes = false;
- QString s = stripQuotes( s_, &hadQuotes );
+ Soprano::LiteralValue createLiteral( const QString& s, bool globbing ) {
+ // no globbing if we have quotes or if there already is a wildcard
+ if ( s[0] == QLatin1Char('\'') ||
+ s[0] == QLatin1Char('\"') ) {
+ return s;
+ }
+
+ // at this point we should have a string without spaces in it
bool b = false;
int i = s.toInt( &b );
if ( b )
@@ -144,7 +149,7 @@ namespace {
//
// we can only do query term globbing for strings longer than 3 chars
//
- if( !hadQuotes && globbing && s.length() > 3 && !s.endsWith('*') && !s.endsWith('?') )
+ if( globbing && s.length() > 3 && !s.endsWith('*') && !s.endsWith('?') )
return QString(s + '*');
else
return s;
@@ -250,6 +255,36 @@ namespace {
Nepomuk::Query::ComparisonTerm::Regexp );
}
+ /**
+ * Merging literal terms is an optimization which is based on the assumption that most
+ * users want to search for the full text terms they enter in the value of the same
+ * property.
+ * Since merging two literals "foo" and "bar" into one term "foo AND bar" effectively
+ * changes the result set (the former allows that "foo" occurs in a property value
+ * different from "bar" while the latter forces them to occur in the same.)
+ * But the resulting query is much faster.
+ */
+ Nepomuk::Query::Term mergeLiteralTerms( const Nepomuk::Query::Term& term )
+ {
+ if( term.isAndTerm() ) {
+ AndTerm mergedTerm;
+ QStringList fullTextTerms;
+ Q_FOREACH( const Term& st, term.toAndTerm().subTerms() ) {
+ if( st.isLiteralTerm() ) {
+ fullTextTerms << st.toLiteralTerm().value().toString();
+ }
+ else {
+ mergedTerm.addSubTerm( st );
+ }
+ }
+ mergedTerm.addSubTerm( LiteralTerm( QString( QLatin1String("'") + fullTextTerms.join( QString::fromLatin1("' AND '") ) + QLatin1String("'") ) ) );
+ return mergedTerm.optimized();
+ }
+ else {
+ return term;
+ }
+ }
+
#ifndef Q_CC_MSVC
#warning Make the parser handle different data, time, and datetime encodings as well as suffixes like MB or GB
#endif
@@ -612,7 +647,7 @@ Nepomuk::Query::Query Nepomuk::Query::QueryParser::parse( const QString& query,
final.setTerm( t );
}
- final.setTerm( resolveFields( final.term(), this ) );
+ final.setTerm( mergeLiteralTerms( resolveFields( final.term(), this ) ) );
return final;
}
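
For readers who want to try the token handling without a kdelibs build, here
is a rough standalone sketch of the idea behind the patch. It uses only the C++
standard library instead of Qt, so it is not the kdelibs code. The names Token,
ParsedQuery, mergeLiterals, needsRegex, tokenize and buildFilter are invented
for this illustration. Two simplifications are assumed: the sketch clears the
negation flag once a token has been stored, and it passes regex tokens through
unchanged, whereas the real code runs them through prepareRegexText() first.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One search token plus a flag telling whether NOT preceded it.
struct Token { std::string text; bool negated; };

struct ParsedQuery {
    std::vector<Token> containsTokens;  // usable inside bif:contains
    std::vector<Token> regexTokens;     // need the slower REGEX filter
    bool isUnion = false;               // true: OR, false: AND
};

// Parser side: fold several literals into one quoted AND expression.
std::string mergeLiterals(const std::vector<std::string>& literals) {
    std::string merged;
    for (std::size_t i = 0; i < literals.size(); ++i) {
        if (i > 0) merged += " AND ";
        merged += "'" + literals[i] + "'";
    }
    return merged;
}

// True if any space-separated word has a wildcard within its first four chars.
bool needsRegex(const std::string& token) {
    std::size_t posInWord = 0;
    for (char c : token) {
        if (c == ' ') { posInWord = 0; continue; }
        if ((c == '?' || c == '*') && posInWord < 4) return true;
        ++posInWord;
    }
    return false;
}

// Split the text into tokens, honouring quotes and the AND/OR/NOT keywords.
ParsedQuery tokenize(const std::string& text) {
    ParsedQuery result;
    std::string current;
    bool inQuotes = false;
    bool nextIsNegated = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        bool tokenEnd = false;
        if (c == '"' || c == '\'') {
            inQuotes = !inQuotes;
            tokenEnd = !inQuotes;  // a closing quote ends the token
        } else if (std::isspace(static_cast<unsigned char>(c)) && !inQuotes) {
            tokenEnd = true;
        } else {
            current += c;
        }
        if (i + 1 == text.size()) tokenEnd = true;
        if (!tokenEnd || current.empty()) continue;

        std::string lower = current;
        for (char& ch : lower) ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        if (lower == "and") result.isUnion = false;
        else if (lower == "or") result.isUnion = true;
        else if (lower == "not") nextIsNegated = true;
        else {
            auto& bucket = needsRegex(current) ? result.regexTokens : result.containsTokens;
            bucket.push_back(Token{current, nextIsNegated});
            nextIsNegated = false;  // the flag applies to one token only
        }
        current.clear();
    }
    return result;
}

// Assemble a single FILTER from the collected tokens.
std::string buildFilter(const std::string& var, const ParsedQuery& q) {
    std::vector<std::string> filters;
    std::string contains;
    for (const Token& t : q.containsTokens) {
        if (!contains.empty()) contains += q.isUnion ? " OR " : " AND ";
        if (t.negated) contains += "NOT ";
        contains += "'" + t.text + "'";
    }
    if (!contains.empty())
        filters.push_back("bif:contains(" + var + ", \"" + contains + "\")");
    for (const Token& t : q.regexTokens)
        filters.push_back(std::string(t.negated ? "!" : "") + "REGEX(" + var + ", \"" + t.text + "\")");

    if (filters.empty()) return std::string();  // nothing to filter on
    std::string joined;
    for (std::size_t i = 0; i < filters.size(); ++i) {
        if (i > 0) joined += q.isUnion ? " || " : " && ";
        joined += filters[i];
    }
    return "FILTER(" + joined + ") . ";
}
```

With these helpers, buildFilter("?v", tokenize(mergeLiterals({"foo", "bar"})))
yields one bif:contains call holding both tokens, which is the single-pattern
shape the merge in queryparser.cpp is aiming for. A query such as "foo OR ba*"
ends up with one bif:contains part and one REGEX part joined by "||", because
"ba*" has fewer than four characters before its wildcard.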