I totally agree. In fact, have a look at
DataManagementModel::Private::m_ignoreCreationDate.

As for the duplicate statements: that is also a good point. It greatly
simplifies the removal of data per app.

On 01/03/2013 10:21 AM, Vishesh Handa wrote:
> Ping?
>
>
> On Sun, Dec 16, 2012 at 3:03 AM, Vishesh Handa
> wrote:
>
> Hey everyone
>
> This is another one of those big changes that I have been thinking
> about for quite some time. This email has a number of different
> proposals, all of which add up to create this really simple system,
> with the same functionality.
>
> Graph Introduction
> ---------------------------
>
> For those of you who don't know about graphs in Nepomuk, please read
> [1]. It serves as a decent introduction to where graphs are used.
> Currently, we create a new graph for each data-management command.
>
> What does this provide?
> ----------------------------------
>
> We currently use graphs for 2 features -
>
> 1. Remove Data By Application
> 2. Backup
>
> What all information do we store?
> ------------------------------------------------
>
> 1. Creation date of each graph
> 2. Modification date of each graph ( Always the same as creation date )
> 3. Type of the graph - Normal or Discardable
> 4. Maintained by which application
>
> (1) and (2) currently serve us no purpose. They never have. They are
> just things that are nice to have. I cannot even name a single use
> case for them, except that they let us see when a statement was added.
>
> (3) is what powers Nepomuk Backup. We do not back up everything, but
> only the data that is not discardable. So, stuff like indexing
> information is not saved. Currently this system is slightly broken,
> as one cannot just filter on the basis of non-Discardable Data,
> because that includes stuff like the Ontologies. So the queries get
> quite complicated. Plus, one still needs to save certain information
> from the Discardable Data, such as the rdf:type, nao:created, and
> nao:lastModified.
> Hence, the query becomes even more complex. On my machine, with some
> 10 million triples, creating a backup takes a sizeable amount of
> time ( Over 5 minutes ), with a lot of CPU usage.
>
> Current query -
>
> select distinct ?r ?p ?o ?g where {
>     graph ?g { ?r ?p ?o. }
>     ?g a nrl:InstanceBase .
>     FILTER( REGEX(STR(?r), '^nepomuk:/(res/|me)') ) .
>     FILTER NOT EXISTS { ?g a nrl:DiscardableInstanceBase . }
> } ORDER BY ?r ?p
>
> + Requires additional queries to back up the type, nao:lastModified,
> and nao:created.
>
> Maybe it would be simpler if we did not make this distinction?
> Instead, we back up everything (really fast), and during restoration
> just discard the data for files that no longer exist? It would
> save users the trouble of re-indexing their files as well. More
> importantly, it (might) save them the trouble of re-indexing their
> email, which is a very slow process.
>
> Also, right now one can only set the graph via StoreResources, and
> not via any other Data Management command.
>
> ----
>
> (4) is the most important reason for graphs. It allows us to know
> which application added the data. Stuff starts to get a little
> messy when two applications add the same data. In that case those
> statements need to be split out of their existing graph, and a new
> graph needs to be created which will be maintained by both
> applications. This is expensive.
>
> I'm proposing that instead of splitting the statement out of the
> existing graph, we just create a duplicate of the statement in a
> new graph, maintained by the other application.
>
> Eg -
>
> Before -
>
> graph { a nco:Contact . }
> nao:maintainedBy .
> nao:maintainedBy .
>
> After -
>
> graph { a nco:Contact . }
> graph { a nco:Contact . }
> nao:maintainedBy
> nao:maintainedBy .
>
> The advantage of this approach is that it would simplify some of the
> extremely complex queries in the DataManagementModel. That would
> result in a direct performance upgrade.
> It would also solve some of the ugly transaction problems we have
> when 2 commands are accessing the same statement, and one command
> removes the data in order to move it to another graph. This has
> happened to me a couple of times.
>
> ---
>
> My third proposal: considering that the modification and creation
> dates of a graph do not serve any benefit, perhaps we shouldn't
> store them at all? Unless there is a proper use case, why go through
> the added effort? Normally, storing a couple of extra properties
> isn't a big deal, but if we do not store them, then we can
> effectively kill the need to create a new graph for each data
> management command.
>
> With this, one would just need 1 graph per application, in which all
> of its data would reside. We wouldn't need to check for empty graphs
> or anything. It would also reduce the number of triples in a
> database, which can get alarmingly high.
>
> This seems like a pretty good system to me, which provides all the
> benefits and none of the losses.
>
> What do you guys think?
>
> [1] http://techbase.kde.org/Projects/Nepomuk/GraphConcepts
>
> --
> Vishesh Handa
>
>
> --
> Vishesh Handa
>
>
> _______________________________________________
> Nepomuk mailing list
> Nepomuk@kde.org
> https://mail.kde.org/mailman/listinfo/nepomuk
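
To illustrate why duplicating statements per application simplifies removal of data per app, here is a minimal sketch in Python. It is only a toy in-memory model, not Nepomuk code: the QuadStore class, its method names, and the "app1"/"app2"/"res:a" identifiers are all hypothetical, and it assumes one graph per application as proposed in the quoted mail.

```python
from collections import defaultdict


class QuadStore:
    """Toy model (hypothetical): one named graph per application."""

    def __init__(self):
        # application name -> set of (subject, predicate, object) triples
        self.graphs = defaultdict(set)

    def add(self, app, s, p, o):
        # Duplicate-statement approach: every application keeps its own
        # copy of the statement in its own graph, nothing is ever split.
        self.graphs[app].add((s, p, o))

    def remove_data_by_application(self, app):
        # Removing an application's data is just dropping its graph.
        self.graphs.pop(app, None)

    def triples(self):
        # Distinct statements visible across all remaining graphs.
        return set().union(*self.graphs.values())


store = QuadStore()
store.add("app1", "res:a", "rdf:type", "nco:Contact")
store.add("app2", "res:a", "rdf:type", "nco:Contact")  # same statement, second app
store.add("app1", "res:a", "nco:fullName", "Alice")

store.remove_data_by_application("app1")
remaining = store.triples()
print(remaining)
```

After removing app1, the rdf:type statement survives because app2 still holds its own copy, while the nco:fullName statement that only app1 added is gone. No statement has to be moved between graphs at any point.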