
List:       nepomuk
Subject:    Re: [Nepomuk] [RFC] Better Full text search
From:       Vishesh Handa <me () vhanda ! in>
Date:       2013-05-04 14:53:34
Message-ID: CAOPTMKDAcrUjWsVLPSAdphGnFQuFsL_LjNGyv4UsVBjNKGN8uA () mail ! gmail ! com



On Sat, May 4, 2013 at 7:10 PM, Christian Mollekopf <chrigi_1@fastmail.fm> wrote:

> On Saturday 04 May 2013 18.49:05 Vishesh Handa wrote:
> > Hey guys
> >
>
> > I was thinking of moving all the plain text related to a file into the
> > nie:plainTextContent of the resource. So in the case of music we would
> > have -
> >
> > <res> nie:plainTextContent "title artist album whateverElse" .
> >
> > for the case of files, we would append the file name, and any other plain
> > text that we want searched just in the nie:plainTextContent. So a search
> > for any combination of text will just have to search through the plain
> > text content.
> >
> > Opinions?
>
> Hey Vishesh,
>
> I think that's a good idea. We're also already using it that way to be
> able to search through emails with markup in the email feeder, and I see
> no reason why we can't extend that to other resource types (after all the
> property is exactly for this purpose).
> So that means, in the future all feeders should push all information which
> should be matched by full text searching to nie:plainTextContent, right?
>
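
To make the quoted proposal concrete, here is a rough sketch in Python of folding a resource's searchable fields into one plain-text literal and matching a multi-term query against it. The helper names and the sample music metadata are made up for illustration; this is not Nepomuk API.

```python
def build_plain_text_content(*fields):
    """Join the non-empty text fields into one whitespace-separated string."""
    parts = []
    for field in fields:
        if field:
            parts.extend(str(field).split())
    return " ".join(parts)


def matches_all_terms(plain_text, query):
    """True if every query term occurs as a whole word (case-insensitive)."""
    words = set(plain_text.lower().split())
    return all(term in words for term in query.lower().split())


# Hypothetical music file: title, artist, album and file name in one literal.
content = build_plain_text_content(
    "Paranoid Android", "Radiohead", "OK Computer", "02-paranoid_android.mp3"
)
print(content)
print(matches_all_terms(content, "radiohead computer"))
```

The point is only that a query mixing terms from different fields (artist and album here) needs a single lookup against one string.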

I was actually thinking of adding a separate API through which the text is
streamed, instead of the current approach of loading everything into memory
and then pushing it. The File Indexers already have a function like that.
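
As a rough illustration of the difference, the Python sketch below pushes text to a receiver in fixed-size chunks instead of reading the whole document first. `CountingSink`, `append_text` and the chunk size are assumptions made up for this example; they do not correspond to any existing Nepomuk or File Indexer function.

```python
import io


def stream_text(source, chunk_size=4096):
    """Yield text from a file-like object in chunks of at most chunk_size characters."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        yield chunk


class CountingSink:
    """Hypothetical receiver standing in for whatever stores the text."""

    def __init__(self):
        self.chunks = 0
        self.characters = 0

    def append_text(self, chunk):
        self.chunks += 1
        self.characters += len(chunk)


def index_streamed(source, sink, chunk_size=4096):
    """Push the source to the sink chunk by chunk; only one chunk is held at a time."""
    for chunk in stream_text(source, chunk_size):
        sink.append_text(chunk)
    return sink


sink = index_streamed(io.StringIO("word " * 10000), CountingSink(), chunk_size=4096)
print(sink.chunks, sink.characters)
```

One thing a real version would have to handle is that a chunk boundary can fall in the middle of a word, so the receiving side needs to buffer the trailing partial word before tokenizing.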


>
> The alternative would of course be to use a separate dedicated fulltext
> index, which may have better performance and some more features
> (tokenizer, stemming etc.), but would obviously complicate the setup again
> (fulltext query => i.e. filter by type in nepomuk => retrieve akonadi
> item). So not necessarily the way to go, but I wanted to bring it to the
> table anyway as it's IMO not conflicting with what nepomuk provides (the
> semantic analysis), and could give better results (performance and
> feature wise) than letting virtuoso do all the work.
>

I have been thinking about the same thing - we have no support for stemming
or any of the other advanced features we want. I'll talk more about this
later. I have an idea which might be very controversial.
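
For anyone unsure what tokenizing and stemming would buy us, here is a toy inverted index in Python. The suffix list is deliberately naive (it is not the Porter algorithm or what any real index library does), and the document ids and texts are invented for the example.

```python
import re

SUFFIXES = ("ing", "ed", "es", "s")  # deliberately naive; not the Porter algorithm


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(documents):
    """Map each stemmed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index.setdefault(stem(token), set()).add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every stemmed query term."""
    terms = [stem(t) for t in tokenize(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    "mail-1": "Indexing emails with markup",
    "file-1": "The indexed file names",
}
index = build_index(docs)
print(sorted(search(index, "index email")))
```

With stemming, "index" finds both "Indexing" and "indexed", and "email" finds "emails", which exact word matching on the stored text would miss.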


>
> >
> > We can easily do this for the 4.11 release because we already need
> > everyone to re-index everything because of the migration.
>
> Cool.
>
> Cheers,
> Christian
>



-- 
Vishesh Handa




_______________________________________________
Nepomuk mailing list
Nepomuk@kde.org
https://mail.kde.org/mailman/listinfo/nepomuk

