'Re: [Nepomuk] [RFC] Simplify Nepomuk Graph handling'

List: nepomuk
Subject: Re: [Nepomuk] [RFC] Simplify Nepomuk Graph handling
From: Vishesh Handa <me () vhanda ! in>
Date: 2013-01-03 9:33:49
Message-ID: CAOPTMKC0q_pQUcKScMV41NbS7LJRrDen62tMiiVp0XwC_oZUVQ () mail ! gmail ! com

Ping?

On Sun, Dec 16, 2012 at 3:03 AM, Vishesh Handa <me@vhanda.in> wrote:

> Hey everyone
>
> This is another one of those big changes that I have been thinking about
> for quite some time. This email has a number of different proposals, all of
> which add up to create this really simple system, with the same
> functionality.
>
> Graph Introduction
> ---------------------------
>
> For those of you who don't know about graphs in Nepomuk, please read [1].
> It serves as a decent introduction to where graphs are used. Currently, we
> create a new graph for each data-management command.
>
> What does this provide?
> ----------------------------------
>
> We currently use graphs for 2 features -
>
> 1. Remove Data By Application
> 2. Backup
>
> What information do we store?
> ------------------------------------------------
>
> 1. Creation date of each graph
> 2. Modification date of each graph ( Always the same as creation date )
> 3. Type of the graph - Normal or Discardable
> 4. Maintained by which application
>
> (1) and (2) currently serve us no purpose. They never have. They are just
> things that are nice to have. I cannot even name a single use case for
> them, except that they let us see when a statement was added.
>
> (3) is what powers Nepomuk Backup. We do not back up everything, only the
> data that is not discardable. So, stuff like indexing information is not
> saved. Currently this system is slightly broken, as one cannot simply
> filter for non-discardable data: that also matches stuff like the
> ontologies. So the queries get quite complicated. Plus, one still needs to
> save certain information from the discardable data, such as the rdf:type,
> nao:created, and nao:lastModified. Hence, the query becomes even more
> complex. On my machine, with some 10 million triples, creating a backup
> takes a sizeable amount of time (over 5 minutes) and a lot of CPU.
>
> Current query -
>
> select distinct ?r ?p ?o ?g where {
> graph ?g { ?r ?p ?o. }
> ?g a nrl:InstanceBase .
> FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
> FILTER NOT EXISTS { ?g a nrl:DiscardableInstanceBase . }
> } ORDER BY ?r ?p
>
> + Requires additional queries to back up the type, nao:lastModified, and
> nao:created.
>
> Maybe it would be simpler if we did not make this distinction? Instead we
> back up everything (really fast) and, during restoration, just discard the
> data for files that no longer exist? It would save users the trouble of
> re-indexing their files as well. More importantly, it (might) save them the
> trouble of re-indexing their email, which is a very slow process.
>
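
A minimal sketch of the "back up everything, filter on restore" idea, modelling statements as plain (subject, predicate, object, graph) tuples. The `restore` function, the `file_exists` callback, and the use of `nie:url` to find a resource's file are assumptions made for illustration, not existing Nepomuk code.

```python
from typing import Callable, Iterable, List, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, graph)

NIE_URL = "nie:url"  # assumed predicate linking a resource to its file URL


def restore(backup: Iterable[Quad], file_exists: Callable[[str], bool]) -> List[Quad]:
    """Return the quads worth restoring from a full, unfiltered backup."""
    quads = list(backup)
    # Resources whose file URL points at something that is gone.
    stale = {
        s for (s, p, o, _g) in quads
        if p == NIE_URL and o.startswith("file://") and not file_exists(o)
    }
    # Drop every statement about a stale resource, keep the rest untouched.
    return [q for q in quads if q[0] not in stale]


# Hypothetical backup contents, purely for illustration.
backup = [
    ("nepomuk:/res/a", NIE_URL, "file:///home/user/kept.txt", "G1"),
    ("nepomuk:/res/a", "nao:numericRating", "8", "G2"),
    ("nepomuk:/res/b", NIE_URL, "file:///home/user/deleted.txt", "G1"),
    ("nepomuk:/res/b", "nie:plainTextContent", "old text", "G1"),
    ("nepomuk:/res/c", "rdf:type", "nco:Contact", "G3"),
]
existing = {"file:///home/user/kept.txt"}
restored = restore(backup, lambda url: url in existing)
```

Here the backup step needs no graph-type filtering at all; the only per-resource decision happens once, at restore time, and index data for files that still exist survives.
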
> Also, right now one can only set the graph via StoreResources, and not via
> any other Data Management command.
>
> ----
>
> (4) is the most important reason for graphs. It allows us to know which
> application added the data. Stuff starts to get a little messy when two
> applications add the same data. In that case those statements need to be
> split out of their existing graph, and a new graph needs to be created
> which will be maintained by both applications. This is expensive.
>
> I'm proposing that instead of splitting the statement out of the existing
> graph, we just create a duplicate of the statement in a new graph
> maintained by the other application.
>
> Eg -
>
> Before -
>
> graph <G1> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G1> nao:maintainedBy <App2> .
>
> After -
>
> graph <G1> { <resA> a nco:Contact . }
> graph <G2> { <resA> a nco:Contact . }
> <G1> nao:maintainedBy <App1> .
> <G2> nao:maintainedBy <App2> .
>
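
The duplicate-statement approach can be sketched with a toy in-memory quad store. `Store`, `add`, `has`, and `remove_data_by_application` are made-up names for illustration, not the DataManagementModel API, and the sketch assumes each application writes into a single graph of its own.

```python
from typing import Dict, Set, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, graph)


class Store:
    """Toy quad store: every application writes only into its own graph."""

    def __init__(self) -> None:
        self.quads: Set[Quad] = set()
        self.graph_of: Dict[str, str] = {}  # application -> its graph

    def add(self, app: str, s: str, p: str, o: str) -> None:
        # No lookup of other graphs and no splitting: the statement is
        # simply written (possibly as a duplicate) into the caller's graph.
        graph = self.graph_of.setdefault(app, f"G{len(self.graph_of) + 1}")
        self.quads.add((s, p, o, graph))

    def has(self, s: str, p: str, o: str) -> bool:
        # A statement exists if any graph contains it.
        return any(q[:3] == (s, p, o) for q in self.quads)

    def remove_data_by_application(self, app: str) -> None:
        # Removing an application's data means dropping its graph's quads.
        graph = self.graph_of.get(app)
        self.quads = {q for q in self.quads if q[3] != graph}


store = Store()
store.add("App1", "resA", "rdf:type", "nco:Contact")
store.add("App2", "resA", "rdf:type", "nco:Contact")
copies_after_both_added = len(store.quads)

store.remove_data_by_application("App1")
still_there = store.has("resA", "rdf:type", "nco:Contact")

store.remove_data_by_application("App2")
gone = not store.has("resA", "rdf:type", "nco:Contact")
```

The statement stays visible as long as at least one application still holds a copy, and neither `add` nor the removal ever has to move a statement between graphs.
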
> The advantage of this approach is that it would simplify some of the
> extremely complex queries in the DataManagementModel. That would result in
> a direct performance upgrade. It would also solve some of the ugly
> transaction problems we have when 2 commands access the same statement
> and one command removes the data in order to move it to another graph. This
> has happened to me a couple of times.
>
> ---
>
> My third proposal: considering that the modification and creation dates
> of a graph do not serve any purpose, perhaps we shouldn't store them at
> all? Unless there is a proper use case, why go through the added effort?
> Normally, storing a couple of extra properties isn't a big deal, but if we
> do not store them, then we can effectively kill the need to create a new
> graph for each data management command.
>
> With this one would just need 1 graph per application, in which all of its
> data would reside. We wouldn't need to check for empty graphs or anything.
> It would also reduce the number of triples in a database, which can get
> alarmingly high.
>
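
A back-of-the-envelope sketch of the saving in graph metadata. It counts only the four per-graph items listed earlier in this mail (creation date, modification date, type, maintaining application); the command and application counts are illustrative inputs, not measurements.

```python
def graph_metadata_triples(graphs: int, properties_per_graph: int) -> int:
    """Number of triples spent purely on describing graphs."""
    return graphs * properties_per_graph


commands = 100_000  # illustrative: data-management commands run so far
apps = 10           # illustrative: applications that store data

# Today: one graph per command, four metadata properties each.
current = graph_metadata_triples(commands, 4)

# Proposed: one graph per application, dates dropped (type + maintainedBy).
proposed = graph_metadata_triples(apps, 2)
```

With these inputs the metadata overhead stops growing with the number of commands and depends only on the number of applications.
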
> This seems like a pretty good system to me, which provides all the
> benefits and none of the losses.
>
> What do you guys think?
>
> [1] http://techbase.kde.org/Projects/Nepomuk/GraphConcepts
>
> --
> Vishesh Handa
>
>

-- 
Vishesh Handa

_______________________________________________
Nepomuk mailing list
Nepomuk@kde.org
https://mail.kde.org/mailman/listinfo/nepomuk
