
List:       nepomuk
Subject:    [Nepomuk] [RFC] Better Full text search
From:       Vishesh Handa <me () vhanda ! in>
Date:       2013-05-04 13:31:05
Message-ID: CAOPTMKA+MTGh=qy4LaDGXAD9-nv-nAJfAhSegvT1iXZkzuECLg () mail ! gmail ! com

Hey guys

I've been thinking about this for a couple of weeks now. We basically do
not handle text-based searches well, especially when the data is spread
across multiple resources.

For example, a music file has its artist and album stored in separate
resources, so a search for "title artist album" is very hard to do.

# Each term gets its own variables so the three groups join only on ?r.
select ?r where {
  {
    { ?r ?p1 ?t1 .
      bif:contains(?t1, "title") .
    }
    UNION {
      ?r ?p1 ?x1 .
      ?x1 ?q1 ?t1 .
      bif:contains(?t1, "title") .
    }
  }
  {
    { ?r ?p2 ?t2 .
      bif:contains(?t2, "artist") .
    }
    UNION {
      ?r ?p2 ?x2 .
      ?x2 ?q2 ?t2 .
      bif:contains(?t2, "artist") .
    }
  }
  {
    { ?r ?p3 ?t3 .
      bif:contains(?t3, "album") .
    }
    UNION {
      ?r ?p3 ?x3 .
      ?x3 ?q3 ?t3 .
      bif:contains(?t3, "album") .
    }
  }
}


This query is a monster and takes quite some time to execute: about 26
seconds on my system. Even a simple search for one word still looks
something like this -

select distinct ?r where {
    { ?r ?p ?o .
      bif:contains(?o, "word") .
    }
    UNION {
      ?r ?p ?o1 .
      ?o1 ?p2 ?o .
      bif:contains(?o, "word") .
    }
}

which is again kind of slow because we aren't using any of the statement
indexes.
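
To make the growth concrete, here is a rough Python sketch that generates a query of the same shape for any number of terms. The function names (term_group, build_query) and the numbered variable names are made up for illustration, and the bif:contains(...) form simply mirrors the notation used in this mail rather than exact Virtuoso syntax.

```python
def term_group(term, i):
    """Build one { direct UNION one-hop } group for a single search term.

    Variables are suffixed with the term index so that separate groups
    join only on ?r. Terms are inserted verbatim (no quote escaping).
    """
    return (
        "  {\n"
        f"    {{ ?r ?p{i} ?o{i} .\n"
        f'      bif:contains(?o{i}, "{term}") . }}\n'
        "    UNION\n"
        f"    {{ ?r ?q{i} ?x{i} .\n"
        f"      ?x{i} ?p{i} ?o{i} .\n"
        f'      bif:contains(?o{i}, "{term}") . }}\n'
        "  }"
    )


def build_query(terms):
    """Join one group per term; every extra term adds another UNION."""
    groups = "\n".join(term_group(t, i) for i, t in enumerate(terms))
    return "select distinct ?r where {\n" + groups + "\n}"
```

Calling build_query(["title", "artist", "album"]) yields three UNION groups and six bif:contains patterns, none of which constrain the predicate.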

I was thinking of moving all the plain text related to a file into the
nie:plainTextContent of the resource. So in the case of music we would have
-

<res> nie:plainTextContent "title artist album whateverElse" .

For files, we would append the file name and any other plain text that we
want searched to the nie:plainTextContent. A search for any combination of
text would then only have to look through the plain text content.
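
A minimal Python sketch of the flattening idea, using an in-memory list of triples instead of the real store. Everything here is hypothetical and only for illustration: the ex: property names, the res: identifiers, the sample data, and the helper names are not Nepomuk API. Only literals on the resource itself and on resources one hop away are collected, matching the two UNION branches in the queries above.

```python
# Hypothetical sample data; real statements would live in the Nepomuk store.
TRIPLES = [
    ("res:song", "ex:title", "Yellow Submarine"),
    ("res:song", "ex:artist", "res:artist"),
    ("res:song", "ex:album", "res:album"),
    ("res:artist", "ex:name", "The Beatles"),
    ("res:album", "ex:title", "Revolver"),
]


def is_resource(value):
    """In this sketch, anything prefixed with 'res:' is a resource, not a literal."""
    return value.startswith("res:")


def flatten_text(subject, triples):
    """Collect literals on `subject` and on resources one hop away from it."""
    parts = []
    for s, _, o in triples:
        if s != subject:
            continue
        if is_resource(o):
            # One hop: pull in the literals of the linked resource.
            parts.extend(
                o2 for s2, _, o2 in triples if s2 == o and not is_resource(o2)
            )
        else:
            parts.append(o)
    return " ".join(parts)


def matches(text, query):
    """True if every whitespace-separated query term is a word in `text`."""
    words = text.lower().split()
    return all(term in words for term in query.lower().split())
```

With this data, flatten_text("res:song", TRIPLES) produces one string that would be stored as the nie:plainTextContent, and matches(...) on that string answers a multi-word search without following any links at query time.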

Opinions?

We can easily do this for the 4.11 release because everyone already needs
to re-index everything for the migration.

-- 
Vishesh Handa



