List: nepomuk
Subject: Re: [Nepomuk] [RFC] Better Full text search
From: Vishesh Handa <me () vhanda ! in>
Date: 2013-05-04 14:53:34
Message-ID: CAOPTMKDAcrUjWsVLPSAdphGnFQuFsL_LjNGyv4UsVBjNKGN8uA () mail ! gmail ! com
On Sat, May 4, 2013 at 7:10 PM, Christian Mollekopf <chrigi_1@fastmail.fm> wrote:
> On Saturday 04 May 2013 18.49:05 Vishesh Handa wrote:
> > Hey guys
> >
>
> > I was thinking of moving all the plain text related to a file into the
> > nie:plainTextContent of the resource. So in the case of music we would
> > have -
> >
> > <res> nie:plainTextContent "title artist album whateverElse" .
> >
> > For the case of files, we would append the file name, and any other plain
> > text that we want searched, to the nie:plainTextContent. So a search
> > for any combination of text will just have to search through the plain
> > text content.
> >
> > Opinions?
>
> Hey Vishesh,
>
> I think that's a good idea. We're also already using it that way to be
> able to search through emails with markup in the email feeder, and I see
> no reason why we can't extend that to other resource types (after all the
> property is exactly for this purpose).
> So that means, in the future all feeders should push all information which
> should be matched by full text searching to nie:plainTextContent, right?
>
I was actually thinking of adding a separate API through which the text is
streamed, instead of the current approach of loading everything into memory
and pushing it. The File Indexers already have a function like that.
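
To illustrate the difference, here is a rough sketch in Python. It is only an illustration: `iter_plain_text` and `push_streamed` are hypothetical names, not the existing Nepomuk or File Indexer API. The metadata fields and the body are handed over in bounded chunks instead of as one big string:

```python
import io
from typing import Callable, Iterable, Iterator, TextIO


def iter_plain_text(fields: Iterable[str], body: TextIO,
                    chunk_size: int = 4096) -> Iterator[str]:
    """Yield searchable text piece by piece (hypothetical helper).

    Metadata fields (title, artist, file name, ...) come first, each
    followed by a space, then the body is read in fixed-size chunks.
    """
    for field in fields:
        if field:  # skip empty metadata values
            yield field + " "
    while True:
        chunk = body.read(chunk_size)
        if not chunk:
            break
        yield chunk


def push_streamed(chunks: Iterable[str], sink: Callable[[str], None]) -> int:
    """Forward each chunk to `sink` as it arrives; return characters pushed.

    `sink` stands in for whatever would append to nie:plainTextContent.
    """
    total = 0
    for chunk in chunks:
        sink(chunk)
        total += len(chunk)
    return total


if __name__ == "__main__":
    received = []
    body = io.StringIO("some long document text " * 3)
    count = push_streamed(
        iter_plain_text(["title", "artist", "album"], body, chunk_size=16),
        received.append,
    )
    print(count, len(received))
```

At no point does the caller hold more than one chunk in memory, and that is the whole point of a streamed API.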
>
> The alternative would of course be to use a separate dedicated fulltext
> index, which may have better performance and some more features (tokenizer,
> stemming etc.), but would obviously complicate the setup again (fulltext
> query => i.e. filter by type in nepomuk => retrieve akonadi item). So not
> necessarily the way to go, but I wanted to bring it to the table anyway, as
> it's IMO not conflicting with what nepomuk provides (the semantic
> analysis), and could give better results (performance and feature wise)
> than letting virtuoso do all the work.
>
I have been thinking about the same thing - we have no support for stemming
or any of the other advanced features we want. I'll talk more about this
later. I have an idea which might be very controversial.
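
For anyone unsure what stemming would buy us, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the suffix list is a crude stand-in for a real stemmer, and `build_index` / `search` are hypothetical helpers, not anything Nepomuk or Virtuoso provides. Query and document terms are reduced to the same stem before matching:

```python
import re
from collections import defaultdict

# Crude suffix list, for illustration only; a real stemmer is far more careful.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def build_index(docs):
    """Map each stem to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every stemmed query term."""
    terms = [stem(t) for t in tokenize(query)]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


if __name__ == "__main__":
    docs = {
        "a": "Indexing emails",
        "b": "indexed email feeder",
        "c": "music album",
    }
    index = build_index(docs)
    print(sorted(search(index, "index email")))
```

Without the stemming step, a query for "index email" would match neither "Indexing emails" nor "indexed email feeder".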
>
> >
> > We can easily do this for the 4.11 release because we already need everyone
> > to re-index everything because of the migration.
>
> Cool.
>
> Cheers,
> Christian
>
--
Vishesh Handa