'[Nepomuk] Re: Virtuoso eating up CPUs'


List:       nepomuk
Subject:    [Nepomuk] Re: Virtuoso eating up CPUs
From:       Sebastian_Trüg <trueg () kde ! org>
Date:       2011-01-08 10:27:20
Message-ID: 4D283C08.3050304 () kde ! org

In the meantime I have a different patch for kdelibs which should at
least solve the issue of KRunner producing queries that make Virtuoso
go berserk.
Please test.

Cheers,
Sebastian

On 01/03/2011 05:15 PM, Sebastian Trüg wrote:
> I have a heavy patch indeed but it also requires an updated (still
> unreleased) Virtuoso which fixes a bug.
> I will let you know more soon.
> 
> Cheers,
> Sebastian
> 
> On 01/03/2011 03:58 PM, Will Stephenson wrote:
>> On Monday 03 January 2011 10:48:57 Sebastian Trüg wrote:
>>> I am on that one. These come from krunner....
>>
>> Glad to hear it.  I'm catching up after an offline Xmas but if you need any 
>> more info or patch testing just let me know.
>>
>> Will
>> _______________________________________________
>> Nepomuk mailing list
>> Nepomuk@kde.org
>> https://mail.kde.org/mailman/listinfo/nepomuk
>>

["1.diff" (text/plain)]

commit ea85b495f1a99aa604ebb3bc17912e7406d0387d
Author: Sebastian Trueg <trueg@kde.org>
Date:   Fri Jan 7 21:16:40 2011 +0100

    * Fixed the handling of quotes and keywords such as "AND", "OR", and "NOT" in LiteralTerm.
      Now correct bif:contains or regex filters are created for values.
    * Made the query parser merge LiteralTerms into a single one to improve query performance.
      While merging two LiteralTerms into one does not yield the exact same query (when merged
      both literal tokens need to appear in the same property value while with separate LiteralTerms
      the tokens can appear in different properties) it should cover close to all typical use cases
      while increasing the performance significantly and getting rid of the nasty
      "Virtuoso goes crazy when I use KRunner" bug.
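
A note for readers of the commit message: the merge step can be illustrated without Qt. The following is only a minimal sketch in plain C++, assuming the search words have already been extracted as strings; `mergeLiterals` is a hypothetical name and is not part of kdelibs. It builds the same kind of expression the patch passes to a single LiteralTerm.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the merge described in the commit message:
// join plain search words into one quoted full-text expression, so that
// one full-text filter is generated instead of one per word.
// Unlike the patch, this sketch returns an empty string for empty input.
std::string mergeLiterals(const std::vector<std::string>& words)
{
    std::string merged;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i > 0)
            merged += " AND ";
        merged += "'" + words[i] + "'";
    }
    return merged;
}
```

With the words `foo` and `bar` this yields `'foo' AND 'bar'`. As the commit message says, both words then have to occur in the same property value, which is the trade-off accepted for the faster query.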

diff --git a/nepomuk/query/literalterm.cpp b/nepomuk/query/literalterm.cpp
index 4623e82..f8173ac 100644
--- a/nepomuk/query/literalterm.cpp
+++ b/nepomuk/query/literalterm.cpp
@@ -75,41 +75,6 @@ QString Nepomuk::Query::LiteralTermPrivate::toSparqlGraphPattern( const QString&
 
 
 namespace {
-QString prepareQueryText( const QString& text )
-{
-    //
-    // we try to be a little smart about creating the query text
-    // by following a few simple rules:
-    //
-    // 1. enclose everything in quotes to be safe
-    // 2. quotes in search terms are not handled. replace them with spaces
-    // 3. replace double quotes with single quotes
-    // [4. wildcards can only be used if they are preceeded by at least 4 chars]
-    //
-
-    QString s = text.simplified();
-    if( s.isEmpty() )
-        return s;
-
-    // strip quotes
-    if( s[0] == '"' || s[0] == '\'' ) {
-        s = s.mid(1);
-    }
-    if( !s.isEmpty() &&
-        ( s[s.length()-1] == '"' || s[s.length()-1] == '\'' ) ) {
-        s.truncate(s.length()-1);
-    }
-
-    // replace quotes with spaces
-    s.replace( '"', ' ' );
-    s.replace( '\'', ' ' );
-
-    // add quotes
-    s = '\'' + s + '\'';
-
-    return s;
-}
-
 QString prepareRegexText( const QString& text )
 {
     QString filterRxStr = QRegExp::escape( text );
@@ -123,29 +88,107 @@ QString prepareRegexText( const QString& text )
 
 QString Nepomuk::Query::LiteralTermPrivate::createContainsPattern( const QString& varName, const QString& text, Nepomuk::Query::QueryBuilderData* qbd )
 {
-    const int i = text.indexOf( QRegExp(QLatin1String("[\\?\\*]")) );
+    // each token with a negation flag
+    QList<QPair<QString, bool> > containsTokens;
+    QList<QPair<QString, bool> > regexTokens;
+
+    // we only support AND xor OR, not both at the same time
+    bool isUnion = false;
+
+    // gather all the tokens
+    bool inQuotes = false;
+    QString currentToken;
+    bool nextIsNegated = false;
+    int i = 0;
+    while( i < text.length() ) {
+        const QChar& c = text[i];
+        bool tokenEnd = false;
+
+        if( c == QChar('"') || c == QChar('\'') ) {
+            inQuotes = !inQuotes;
+            tokenEnd = !inQuotes;
+        }
+        else if( c.isSpace() && !inQuotes ) {
+            tokenEnd = true;
+        }
+        else {
+            currentToken.append(c);
+        }
 
-    //
-    // Virtuoso needs four leading chars when using wildcards. Thus, if there is less (this includes 0) we fall back to the slower regex filter
-    //
-    if( i < 0 || i > 3 ) {
-        const QString finalText = prepareQueryText( text );
+        if( i == text.count()-1 ) {
+            tokenEnd = true;
+        }
 
-        QString scoringPattern;
-        if( qbd->query()->m_fullTextScoringEnabled ) {
-            scoringPattern = QString::fromLatin1("OPTION (score %1) ").arg(qbd->createScoringVariable());
+        if( tokenEnd && !currentToken.isEmpty() ) {
+            //
+            // Handle the three special tokens supported in Virtuoso's full text search engine we support (there is also "near" which we do not handle yet)
+            //
+            if( currentToken.toLower() == QLatin1String("and") ) {
+                isUnion = false;
+            }
+            else if( currentToken.toLower() == QLatin1String("or") ) {
+                isUnion = true;
+            }
+            else if( currentToken.toLower() == QLatin1String("not") ) {
+                nextIsNegated = true;
+            }
+            else {
+                QPair<QString, bool> currentTokenPair = qMakePair( currentToken, nextIsNegated );
+
+                //
+                // Virtuoso needs four leading chars when using wildcards. Thus, if there is less (this includes 0) we fall back to the slower regex filter
+                //
+                const QStringList subTokens = currentToken.split( QLatin1Char(' '), QString::SkipEmptyParts );
+                bool needsRegex = false;
+                Q_FOREACH( const QString& subToken, subTokens ) {
+                    const int i = subToken.indexOf( QRegExp(QLatin1String("[\\?\\*]")) );
+                    if( i >= 0 && i < 4 ) {
+                        needsRegex = true;
+                        break;
+                    }
+                }
+                if( !needsRegex ) {
+                    containsTokens << currentTokenPair;
+                }
+                else {
+                    regexTokens << currentTokenPair;
+                }
+            }
+
+            nextIsNegated = false;
+            currentToken.clear();
         }
-        qbd->addFullTextSearchTerm( varName, finalText );
 
-        return QString::fromLatin1( "%1 bif:contains \"%2\" %3. " )
-                .arg( varName,
-                     finalText,
-                     scoringPattern );
+        ++i;
     }
-    else {
-        return QString::fromLatin1( "FILTER(REGEX(%1, \"%2\")) . " )
-                .arg( varName, prepareRegexText(text) );
+
+    // convert the tokens into SPARQL filters
+    QStringList filters;
+    QStringList containsFilterTokens;
+    for( int i = 0; i < containsTokens.count(); ++i ) {
+        QString containsFilterToken;
+        if( containsTokens[i].second )
+            containsFilterToken += QLatin1String("NOT ");
+        containsFilterToken += QString::fromLatin1("'%1'").arg(containsTokens[i].first);
+        containsFilterTokens << containsFilterToken;
     }
+    if( !containsFilterTokens.isEmpty() ) {
+        filters << QString::fromLatin1("bif:contains(%1, \"%2\")")
+                   .arg( varName,
+                         containsFilterTokens.join( isUnion ? QLatin1String(" OR ") : QLatin1String(" AND ")) );
+    }
+    QStringList regexFilters;
+    for( int i = 0; i < regexTokens.count(); ++i ) {
+        QString regexFilter;
+        if( regexTokens[i].second )
+            regexFilter += QLatin1Char('!');
+        regexFilter += QString::fromLatin1( "REGEX(%1, \"%2\")" )
+                .arg( varName,
+                      prepareRegexText(regexTokens[i].first) );
+        filters << regexFilter;
+    }
+
+    return QString( QLatin1String("FILTER(") + filters.join( isUnion ? QLatin1String(" || ") : QLatin1String(" && ") ) + QLatin1String(") . ") );
 }
 
 
diff --git a/nepomuk/query/queryparser.cpp b/nepomuk/query/queryparser.cpp
index 3b793d4..656714d 100644
--- a/nepomuk/query/queryparser.cpp
+++ b/nepomuk/query/queryparser.cpp
@@ -130,9 +130,14 @@ namespace {
         }
     }
 
-    Soprano::LiteralValue createLiteral( const QString& s_, bool globbing ) {
-        bool hadQuotes = false;
-        QString s = stripQuotes( s_, &hadQuotes );
+    Soprano::LiteralValue createLiteral( const QString& s, bool globbing ) {
+        // no globbing if we have quotes or if there already is a wildcard
+        if ( s[0] == QLatin1Char('\'') ||
+             s[0] == QLatin1Char('\"') ) {
+            return s;
+        }
+
+        // at this point we should have a string without spaces in it
         bool b = false;
         int i = s.toInt( &b );
         if ( b )
@@ -144,7 +149,7 @@ namespace {
         //
         // we can only do query term globbing for strings longer than 3 chars
         //
-        if( !hadQuotes && globbing && s.length() > 3 && !s.endsWith('*') && !s.endsWith('?') )
+        if( globbing && s.length() > 3 && !s.endsWith('*') && !s.endsWith('?') )
             return QString(s + '*');
         else
             return s;
@@ -250,6 +255,36 @@ namespace {
                                               Nepomuk::Query::ComparisonTerm::Regexp );
     }
 
+    /**
+     * Merging literal terms is an optimization which is based on the assumption that most
+     * users want to search for the full text terms they enter in the value of the same
+     * property.
+     * Since merging two literals "foo" and "bar" into one term "foo AND bar" effectively
+     * changes the result set (the former allows that "foo" occurs in a property value
+     * different from "bar" while the latter forces them to occur in the same.)
+     * But the resulting query is much faster.
+     */
+    Nepomuk::Query::Term mergeLiteralTerms( const Nepomuk::Query::Term& term )
+    {
+        if( term.isAndTerm() ) {
+            AndTerm mergedTerm;
+            QStringList fullTextTerms;
+            Q_FOREACH( const Term& st, term.toAndTerm().subTerms() ) {
+                if( st.isLiteralTerm() ) {
+                    fullTextTerms << st.toLiteralTerm().value().toString();
+                }
+                else {
+                    mergedTerm.addSubTerm( st );
+                }
+            }
+            mergedTerm.addSubTerm( LiteralTerm( QString( QLatin1String("'") + fullTextTerms.join( QString::fromLatin1("' AND '") ) + QLatin1String("'") ) ) );
+            return mergedTerm.optimized();
+        }
+        else {
+            return term;
+        }
+    }
+
 #ifndef Q_CC_MSVC
 #warning Make the parser handle different data, time, and datetime encodings as well as suffixes like MB or GB
 #endif
@@ -612,7 +647,7 @@ Nepomuk::Query::Query Nepomuk::Query::QueryParser::parse( const QString& query,
         final.setTerm( t );
     }
 
-    final.setTerm( resolveFields( final.term(), this ) );
+    final.setTerm( mergeLiteralTerms( resolveFields( final.term(), this ) ) );
     return final;
 }
 
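
For anyone reviewing the tokenizing loop in createContainsPattern, here is a minimal sketch of the same idea in plain C++ (no Qt), so it can be tried in isolation. `Token`, `ParsedSearch`, `wildcardTooEarly` and `tokenize` are hypothetical names, not part of the patch. One deliberate difference, stated as an assumption about the intent: the sketch keeps the negation flag alive until the next real token, whereas the diff as posted appears to reset `nextIsNegated` right after seeing "not".

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct Token {
    std::string text;
    bool negated;     // preceded by the keyword "not"
    bool needsRegex;  // wildcard within the first four characters of a word
};

struct ParsedSearch {
    std::vector<Token> tokens;
    bool isUnion = false;  // true if the last and/or keyword seen was "or"
};

static std::string toLower(std::string s)
{
    for (char& c : s)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    return s;
}

// True if any space-separated word has '?' or '*' before its fifth character.
static bool wildcardTooEarly(const std::string& token)
{
    std::size_t wordStart = 0;
    for (std::size_t i = 0; i < token.size(); ++i) {
        if (token[i] == ' ') {
            wordStart = i + 1;
            continue;
        }
        if ((token[i] == '?' || token[i] == '*') && i - wordStart < 4)
            return true;
    }
    return false;
}

ParsedSearch tokenize(const std::string& text)
{
    ParsedSearch result;
    std::string current;
    bool inQuotes = false;
    bool nextIsNegated = false;

    for (std::size_t i = 0; i < text.size(); ++i) {
        const char c = text[i];
        bool tokenEnd = false;

        if (c == '"' || c == '\'') {
            inQuotes = !inQuotes;
            tokenEnd = !inQuotes;  // a closing quote ends the phrase
        } else if (std::isspace(static_cast<unsigned char>(c)) && !inQuotes) {
            tokenEnd = true;
        } else {
            current += c;
        }
        if (i + 1 == text.size())
            tokenEnd = true;

        if (tokenEnd && !current.empty()) {
            const std::string lower = toLower(current);
            if (lower == "and") {
                result.isUnion = false;
            } else if (lower == "or") {
                result.isUnion = true;
            } else if (lower == "not") {
                nextIsNegated = true;  // applies to the following token
                current.clear();
                continue;
            } else {
                result.tokens.push_back(
                    {current, nextIsNegated, wildcardTooEarly(current)});
            }
            nextIsNegated = false;
            current.clear();
        }
    }
    return result;
}
```

Tokens with `needsRegex` set would go into a REGEX filter, the others into the combined bif:contains expression, joined by OR or AND depending on `isUnion`.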


