
List:       nepomuk
Subject:    Re: [Nepomuk] [RFC] Better Full text search
From:       Vishesh Handa <me () vhanda ! in>
Date:       2013-05-09 16:28:01
Message-ID: CAOPTMKAPNuMJMZ0qxg09tOFWL4fWeiJrxmSZVP4-tMdNBqav=g () mail ! gmail ! com

On Thu, May 9, 2013 at 5:16 PM, Ignacio Serantes <kde@aynoa.net> wrote:

> On Thu, May 9, 2013 at 3:21 PM, Ignacio Serantes <kde@aynoa.net> wrote:
>
>>
>> Yeah. I was just thinking about the file indexer. If you incorporate the
>> web-miner, then it would be the web-miner's responsibility to update the
>> full text index (or nie:plainTextIndex) for the resource to which it
>> attaches resources.
>>
>
> You can't expect this. Nepomuk is the middleware, so this is Nepomuk's
> job, and you can't trust your clients to follow the right logic because
> that logic could change. For example, the Nepomuk web-miner could be
> abandoned, or some users could still be using Bangarang, which seems
> abandoned. Even worse, some distros could be packaging an old version
> without your required changes; for example, openSUSE still packages the
> web-miner, but the old version.
>

I know this is going to sound really bad, but if I can fix the user
experience for most users (more than 90%) while breaking compatibility
with old packages and maybe breaking some things, then I'm okay with that.

I would of course rather not do that if there is an alternative.

KDE SC does not ship the web-miner by default and the web-miner has never
had an official release.

>
> You can use clients to process data (sorting, filtering, etc...) but all
> business logic must be written in the middleware and never in the clients.
>

Maybe I could incorporate this logic into StoreResources. It's a little
hard to know which resource is the main one, but it is still feasible. I'll
look into it. But one could still break this in other ways.


>
>>> I read some comments saying that Nepomuk is not a data store, and they
>>> concern me. I use Nepomuk extensively as a data store for tags,
>>> comments, ratings and other things via Nepoogle, because without that
>>> Nepomuk is much less useful for me and, honestly, I can't understand it
>>> without this functionality. Some time ago some effort was spent
>>> explaining that Nepomuk is not a file search, so don't transform it
>>> into a resource search tool.
>>>
>>
>> I was hoping it would be more of a full text + structured data storage
>> tool. It is not a place to store plain text such as the file's contents and
>> expect to get them back exactly as they were.
>>
>
> No, but you want to store any kind of metadata and retrieve information.
>
>
>> I'm worried that if we do not put all the plain text in one place we
>> cannot reliably solve the searching problem. Users mostly just provide
>> plain text when searching. They just provide words "blah blah blah". Most
>> users would not know about higher semantics such as "hasTag:" or
>> "performer:". That's only for more technical users.
>>
>
> The main problem here is the lack of a proper search interface. There was
> one in KDE 4.5, which is where I learned about hasTag and other things,
> but it was removed. Maybe with this year's GSoC we will have one.
>

I have to start working on a dedicated interface for searching, but the
main aim there will be to search as fast as possible. Support for 'hasTag'
and similar features is secondary, because I cannot expect users to know
or care about them.


>
>
>> Doing a search for x words leads to 2x unions, which is very slow. In
>> the case I highlighted in the first email, it takes a good 26 seconds
>> on my system. That's just too slow. The user expects feedback within a
>> second at most. Generally, even less than a second.
>>
>
> Yes, I know, but where are you doing this search? In Dolphin or KRunner?
> Then maybe something like mail:"my search text" could be useful to
> optimize this kind of search. Consider that most of the time you can
> optimize queries if you guide the user but, it's true, full text search
> will always be slow :)
>

Dedicated full text engines are very fast. Have a look at the recoll
project. It uses the Xapian full text engine and it is really fast. The
bottleneck is clearly not in the full text search part. It's in the
numerous union operations.
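
To illustrate why a dedicated full text index avoids the union problem,
here is a minimal toy sketch in Python. It is not Xapian's API and not
Nepomuk code; the resource ids and sample texts are made up. It only
assumes that all plain text for a resource is available in one place, so
that a multi-word search becomes an intersection of per-word result sets.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the set of resource ids containing it."""
    index = defaultdict(set)
    for res_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(res_id)
    return index


def search(index, query):
    """Return the ids of resources that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    # One lookup per word, then a single set intersection.
    postings = [index.get(word, set()) for word in words]
    return set.intersection(*postings)


# Hypothetical sample data: one blob of plain text per resource.
docs = {
    "res:song1": "Blue Train John Coltrane jazz",
    "res:mail1": "train tickets for the jazz festival",
    "res:doc1": "quarterly report",
}
index = build_index(docs)
result = search(index, "jazz train")
```

Each extra search word adds one dictionary lookup and narrows the
intersection, instead of adding more branches to the query.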


> In applications like KMail, you can always add the right filters to
> optimize the search.
>

Come on. Is that how an average user searches? Do you remember all the
keywords supported by Gmail or Google? Most users prefer a simple search
field where they can type anything. Anything more than that is for power
users.

Also, for KMail, one of the most common use cases is to search within a
date range, and considering that Virtuoso has no special indexes for
date-time values in the object column, it's very slow. Couple that with
plain text search and you get a Virtuoso that gets mad, and Nepomuk gets a
horrible reputation.
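
As a rough illustration of why an ordered index matters for date range
searches, here is a toy Python sketch. It says nothing about how Virtuoso
stores data internally; the DateIndex class and the mail ids are
hypothetical. Without an index every row has to be tested, while a sorted
index needs only two binary searches and a slice.

```python
import bisect
from datetime import date


def scan(rows, start, end):
    """No index: test every (date, id) row against the range."""
    return [rid for d, rid in rows if start <= d <= end]


class DateIndex:
    """Toy ordered index over (date, id) rows."""

    def __init__(self, rows):
        self._rows = sorted(rows)                 # sorted once, up front
        self._keys = [d for d, _ in self._rows]   # dates only, for bisect

    def between(self, start, end):
        """Return ids whose date lies in [start, end], in date order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [rid for _, rid in self._rows[lo:hi]]


# Hypothetical sample data.
mails = [
    (date(2013, 5, 9), "mail:3"),
    (date(2013, 1, 15), "mail:1"),
    (date(2013, 3, 2), "mail:2"),
    (date(2013, 8, 14), "mail:4"),
]
date_index = DateIndex(mails)
in_range = date_index.between(date(2013, 3, 1), date(2013, 5, 31))
```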


>> Do you have any suggestions on how to fix that?
>>
>
> Sadly I don't know how to optimize this in triplestore databases. With
> relational databases you use views and indexes, but you can't use them
> here. Maybe the Virtuoso people could help you.
>
>>
>> Additionally, with the query I showed above you also have a problem of
>> stuff like this -
>>
>> res a nmm:MusicPiece .
>> nmm:MusicPiece rdfs:comment "Used to assign music-specific properties
>> such a BPM to video and audio"
>>
>> Searching for 'assign music' can then return music pieces which have
>> nothing to do with the search terms. I'm not sure how to solve this.
>>
>
> You can't; a full text search is a full text search :).
>

I need to fix this. We cannot have a broken full text search experience,
especially since Nepomuk is *the desktop search* solution for KDE.
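
One possible direction, sketched in Python purely as an illustration and
not something decided in this thread: when collecting the plain text for a
resource, follow links to attached resources but never the rdf:type link,
so that ontology comments do not leak into the searchable text. The tuple
layout, the function names and the sample title are all hypothetical.

```python
import re

RDF_TYPE = "rdf:type"

# Hypothetical triples: (subject, predicate, object, object_is_literal).
triples = [
    ("res:song1", RDF_TYPE, "nmm:MusicPiece", False),
    ("res:song1", "nie:title", "Blue Train", True),
    ("nmm:MusicPiece", "rdfs:comment",
     "Used to assign music-specific properties such a BPM to video and audio",
     True),
]


def literals_of(subject, triples):
    """All literal values attached directly to the subject."""
    return [o for s, _, o, is_literal in triples if s == subject and is_literal]


def collect_text(resource, triples, follow_type):
    """Join the resource's own literals with those of linked resources."""
    texts = literals_of(resource, triples)
    for s, p, o, is_literal in triples:
        if s != resource or is_literal:
            continue
        if p == RDF_TYPE and not follow_type:
            continue  # skip the class description
        texts.extend(literals_of(o, triples))
    return " ".join(texts)


def matches(text, query):
    """True if every word of the query occurs as a word in the text."""
    tokenize = lambda t: set(re.findall(r"\w+", t.lower()))
    return tokenize(query) <= tokenize(text)


naive = collect_text("res:song1", triples, follow_type=True)
scoped = collect_text("res:song1", triples, follow_type=False)
```

With the naive text, 'assign music' matches the song through the class
comment; with the scoped text it does not, while 'blue train' still does.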


>
>> If we want to change how plain text is stored, now is the time to do
>> it, because with the 4.11 release most users will already have to
>> re-index all their files and PIM data.
>>
>
> When will 4.11 be released? This is not a trivial change and must be
> tested.
>

We have the freeze in about a month. It will be released in August -
http://techbase.kde.org/Schedules/KDE4/4.11_Release_Schedule


-- 
Vishesh Handa


_______________________________________________
Nepomuk mailing list
Nepomuk@kde.org
https://mail.kde.org/mailman/listinfo/nepomuk

