
List:       nepomuk
Subject:    [Nepomuk] Why store file urls?
From:       Vishesh Handa <me () vhanda ! in>
Date:       2012-11-23 8:58:22
Message-ID: CAOPTMKC6ayYJmLevjEdrvcSGOhEAo5ovONDJ9nutsmtGMXOohA () mail ! gmail ! com



Hey everyone

Last week I was somewhat shut out from the world, so I had some time to
think about a lot of different things in Nepomuk. This is one of the many
emails to come out of it.

For those of you who don't know, in Nepomuk we always store the complete
url of the file with the property nie:url.

Example -

<nepomuk:/res/23161f9c-8839-4de3-bba0-affdd6d654ef>
        rdf:type          nmm:MusicPiece
        rdf:type          nfo:FileDataObject
        rdf:type          nfo:Audio
        rdf:type          nie:InformationElement
        nie:url           file:///home/vishesh/Music/where_does_the_good_go.mp3

Storing this URL makes accessing file resources quite convenient. But I
fear it has been a terrible design decision. By storing the url we face the
following problems -

1. Changing the url of a directory is very expensive. This doesn't need to
be done very frequently, but occasionally the user might move/rename a
directory which contains a large number of files. The url of every one of
these files needs to be adjusted. Since changes in Nepomuk are not that
cheap, this results in virtuoso + nepomukstorage + nepomukfilewatch
consuming large amounts of cpu for quite some time.

This is *very* *very* noticeable when renaming a directory with over 1000
files.
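
To make the cost concrete, here is a minimal Python sketch. It is not
Nepomuk code: a plain dict stands in for the store, and the function name
rename_directory is made up for illustration. The point is only that every
resource whose nie:url lies under the renamed directory has to be rewritten.

```python
def rename_directory(urls, old_dir, new_dir):
    """Rewrite every stored URL under old_dir; return how many changed.

    `urls` maps a resource id to its stored nie:url. The dict is a
    stand-in for the real store (an assumption of this sketch), and only
    the files below the directory are handled, not the directory itself.
    """
    old_prefix = old_dir.rstrip("/") + "/"
    new_prefix = new_dir.rstrip("/") + "/"
    changed = 0
    for res, url in urls.items():
        if url.startswith(old_prefix):
            # One write per file: this is the part that gets expensive.
            urls[res] = new_prefix + url[len(old_prefix):]
            changed += 1
    return changed


# 1000 hypothetical tracks in one directory, plus one unrelated file.
store = {
    f"res{i}": f"file:///home/vishesh/Music/track{i}.mp3" for i in range(1000)
}
store["other"] = "file:///home/vishesh/Documents/notes.txt"

updates = rename_directory(
    store, "file:///home/vishesh/Music", "file:///home/vishesh/Songs"
)
print(updates)
```

In the real system each of those writes goes through the storage service
and virtuoso, so the number of updates grows with the size of the directory.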

2. Removable Media Handling - We have very sad support for removable media
handling. Currently we store urls whose mount point is not fixed under a
"filex" scheme. Example -

<nepomuk:/res/7017a499-786b-4e97-a9f8-e9ee2506c322>
        rdf:type          nfo:FileDataObject
        nao:created       2012-11-02T17:52:16.022Z
        nao:lastModified  2012-11-02T17:52:16.088Z
        nie:url           <filex://72acd848acd8090d/Lost>

The "72acd848acd8090d" is the UUID of the device.

When any results containing "filex" are being returned, Solid is consulted
to check if that particular device is mounted, and accordingly the filex is
translated to "file:/mountpoint/". This way one can mount a removable
device under different locations and still not lose the data.
Theoretically.

The problem with this approach is that every single url which is passed
through Nepomuk needs to be checked for the "filex" scheme and then
translated. Since we do not have a sparql parser this is done by employing
regular expressions to check for patterns with file:/mount/point and filex.

Valgrind logs show that for small queries a sizable amount (up to 40%) of
time is spent in just this regular expression based parsing. Additionally,
since queries can return any kind of data, all of the data passed from
virtuoso to Nepomuk has to go through these checks.
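
A rough Python sketch of this kind of translation, to show why it has to
touch every string. This is not the actual Nepomuk implementation: the
regular expression, the function name translate_filex, the mounts dict
(standing in for the Solid lookup) and the mount point /media/usb0 are all
assumptions made for the example.

```python
import re

# Matches filex://<device-uuid>/<path>. The exact URL grammar is an
# assumption of this sketch.
FILEX_RE = re.compile(r"filex://([0-9A-Za-z-]+)(/[^\s<>\"]*)?")


def translate_filex(text, mounts):
    """Replace filex URLs in `text` whose device is currently mounted.

    `mounts` maps device UUID -> mount point and stands in for the
    lookup that Solid performs. URLs of unmounted devices are left as-is.
    """
    def repl(match):
        uuid, path = match.group(1), match.group(2) or ""
        mount = mounts.get(uuid)
        if mount is None:
            return match.group(0)
        return "file://" + mount.rstrip("/") + path

    return FILEX_RE.sub(repl, text)


mounts = {"72acd848acd8090d": "/media/usb0"}  # hypothetical mount point
print(translate_filex("<filex://72acd848acd8090d/Lost>", mounts))
print(translate_filex("<filex://deadbeef/Other>", mounts))
```

Every result string has to be scanned like this, whether or not it contains
a filex url at all.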

3. Database consistency -

Since we operate on an RDF based database which does not provide us any
kind of checking (primary key, types, etc), we need to do all of these
checks on our own. We currently have 3 properties which need to be given
special privileges when dealing with files - nie:url, nfo:filename, and
nie:isPartOf.

When a file is moved (and renamed) from one directory to another, all 3 of
these properties need to be updated. We currently have code in the storage
service to explicitly check if the url is being changed and accordingly
update the filename as well. These are special cases that we need to check
for each time, which costs extra cpu cycles.

Additionally we have special handling for nie:url which seems to complicate
the code like crazy. In fact even I try to stay away from some of the
"core" code related to this stuff cause it is so insanely complicated.

Proposed Solution
---------------------------

We only store urls for non-file related stuff. Otherwise we rely on the
nfo:filename and nie:isPartOf relation to traverse the file system tree.
That way (1) can very easily be addressed. (2) can be stored as a
nfo:RemoveableMediaDevice with the appropriate mount point, and maybe we
can even give different treatment to RemoveableDevices and NetworkStorage.

(3) is complicated, cause the code is so complicated. I think this
solution would result in slightly messier and slower code in some places,
but the main code should get simplified.

Problems
--------------

Accessing a file's metadata is going to get trickier and slower. One will
have to load the nfo:filename for every entry in the chain up to the root.
However, I think this is not something the users of our libraries
should/will notice. This can be done transparently.
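
A small Python sketch of the idea, under the assumption that each file
resource only keeps its own name and a link to its parent. The FileNode
class and resolve_path function are hypothetical names for illustration,
not existing Nepomuk API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileNode:
    """One file or directory: only its own name and its parent are stored."""
    filename: str                          # plays the role of nfo:filename
    part_of: Optional["FileNode"] = None   # plays the role of nie:isPartOf


def resolve_path(node):
    """Walk up the isPartOf chain and join the names into a path."""
    parts = []
    while node is not None:
        parts.append(node.filename)
        node = node.part_of
    return "/" + "/".join(reversed(parts))


home = FileNode("home")
user = FileNode("vishesh", home)
music = FileNode("Music", user)
song = FileNode("where_does_the_good_go.mp3", music)

print(resolve_path(song))
music.filename = "Songs"   # renaming the directory touches a single node
print(resolve_path(song))
```

The lookup costs one step per directory level, which is the slowdown
mentioned above, but the rename no longer depends on how many files the
directory contains.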

What do you guys think?

-- 
Vishesh Handa



