List:       kde-edu
Subject:    The next file format
From:       Inge Wallin <inge () lysator ! liu ! se>
Date:       2014-08-17 10:46:39
Message-ID: 2944164.bS2fG83C6Q () linux-yik5 ! site

Hey there,

I talked a little with Andreas Xavier the other day about the new file format, and now with 
4.14 tagged we thought it would be a good time to start discussing that.

With this mail I will try to establish a common base that I think we can all agree on. With 
that out of the way, we can start to argue about the details. I got a suggestion from 
Andreas with a very ambitious xsl definition, but I think that most of what he suggested is 
for the next level of discussions.

KVTML
---------

First a short recapitulation of kvtml, our current file format. It is XML-based and has a 
number of sections represented by the following tags:
 - <information>: general info such as author, title, etc.
 - <identifiers>: specification of the languages, including tenses, articles, word classes, etc.
 - <entries>: a list of entries, where each entry is a list of translations. A translation is 
normally a word, possibly with extra data such as an attached image, sound, etc.
 - <lessons>: this is what the user normally sees. Each lesson is more or less a list of 
translations with a title.
 - <wordtypes>: a list of what are normally called word classes in linguistics.

Each identifier (language), entry and translation (= word inside an entry) has an id. The 
translations refer to the identifiers (languages) by id, and the lessons refer to the 
words by the id of their entries.
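
To make the id references concrete, here is a rough Python sketch that resolves them. The 
sample document is a simplified, hand-written fragment: the child tags (<name>, <text>, 
<container>) and the function name load_lessons are assumptions for illustration, and real 
kvtml files contain much more.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written fragment for illustration only. The child tags
# used here are assumptions; real kvtml files carry many more elements.
SAMPLE = """
<kvtml version="2.0">
  <identifiers>
    <identifier id="0"><name>English</name></identifier>
    <identifier id="1"><name>German</name></identifier>
  </identifiers>
  <entries>
    <entry id="0">
      <translation id="0"><text>dog</text></translation>
      <translation id="1"><text>Hund</text></translation>
    </entry>
    <entry id="1">
      <translation id="0"><text>cat</text></translation>
      <translation id="1"><text>Katze</text></translation>
    </entry>
  </entries>
  <lessons>
    <container>
      <name>Animals</name>
      <entry id="0"/>
      <entry id="1"/>
    </container>
  </lessons>
</kvtml>
"""


def load_lessons(xml_text):
    """Return {lesson title: [{language name: word}, ...]}."""
    root = ET.fromstring(xml_text)

    # identifier id -> language name
    languages = {i.get("id"): i.findtext("name") for i in root.find("identifiers")}

    # entry id -> {language name: word}; a translation's id names its language
    entries = {}
    for entry in root.find("entries"):
        entries[entry.get("id")] = {
            languages[t.get("id")]: t.findtext("text") for t in entry
        }

    # lessons point at entries by entry id
    lessons = {}
    for lesson in root.find("lessons"):
        lessons[lesson.findtext("name")] = [
            entries[ref.get("id")] for ref in lesson.findall("entry")
        ]
    return lessons


lessons = load_lessons(SAMPLE)
```

Note that the two id domains overlap (both languages and entries start at 0), so an id is 
only meaningful together with the kind of element it belongs to.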

Note that this is the file format itself. Applications such as Parley add an extra dimension to 
it by letting the user select languages to practice but that is not reflected in the file format.

One other notable thing is that each translation (word) has a confidence level (known as 
"grade" in the file) attached to it. This is a numerical value between 1 and 7 expressing the 
confidence that the student has reached in recognizing that particular word. This means 
that every word can have only one confidence level attached to it, which is one of the big 
problems with kvtml. More about that below.

New file format
----------------------

The new format needs to address a number of shortcomings in kvtml:
 - Pictures and audio are not contained inside the file but are referenced as outside files. 
This makes it difficult to store lessons on a server, e.g. GHNS, and also to download them.
 - Training data is stored together with the word and lesson data (not a very big problem, I 
think).
 - There can only be one confidence level for each word. This makes it impossible to have 
separate values for e.g. spoken and written translations of the same word. Both of these 
are important when learning languages but are not the same.
 - Languages are underspecified in the file format. Here we need to be careful because it 
is easy to overdesign a format like this.

We have discussed this on IRC a number of times and here is what I think we agree on:

1. It should be a container format that can contain every aspect of a collection inside it. 
The container itself should be ZIP.
2. Words and lessons should be separated from the training data inside the file.
3. We should still base the files inside the container on XML - except the multimedia 
attachments.

If you don't agree this far, please protest as soon as possible.

Now, here are some suggestions that I don't think are very controversial. If we can get past 
this quickly, we can start in on the details as soon as possible.

1. The new format should copy some of the details from the OpenDocument Format (ODF). This 
is a good format that works well and for which there are some nice tools already. The 
ebook format EPUB also uses the same conventions to a large degree. Specifically:
1.1 The first file inside it should be called 'mimetype' and contain the MIME type of the file.
1.2 There should be a manifest file which lists the type and name of all the files inside the 
container. ODF uses META-INF/manifest.xml, which works for me.
1.3 Multimedia files (pictures, video, audio, ...) are put in the container and referred to 
using XLink attributes (xlink:href), as in ODF. There *could* also be links to external 
files but that should be avoided.
1.3.1 There is no mandatory place to put the attachments, but Pictures/, Video/ and Audio/ 
are preferred paths.
1.4 There is a file for metadata called meta.xml.
1.5 There is a file for user settings called settings.xml (is this necessary?).
1.6 There is a thumbnail file called Thumbnails/thumbnail.png, which can be shown in e.g. a 
file browser (is this necessary?).

2. I suggest that we name the main file collection.xml and the training status training.xml.

3. Everything inside the collection.xml file should have an id property, which is a number, 
and the ids should form a consecutive series. These numbers are only unique within their 
domain (e.g. words and identifiers both use ids 0 and up). This means that attachments 
for a word, e.g. a picture, also have an id, which is not the case now.

4. Confidence levels inside the training.xml file always refer to *pairs* of items. Examples: 
translation from a word to another word, translation from an audio file to a written word. 
These entities can be uniquely identified by the tree of ids (e.g. entry 4, translation 2, 
attachment 2 for the audio file for the 2nd translation of the 4th entry). See below for a 
question about training types.
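
As a rough illustration of suggestions 1.1, 1.2 and 2, the container could be written with 
Python's zipfile module along these lines. This is only a sketch: the MIME type is a 
placeholder (the real one is still an open question), and the manifest element and 
attribute names are invented here rather than taken from ODF.

```python
import io
import zipfile

# Placeholder only: the real MIME type has not been decided.
MIMETYPE = "application/x-example-vocabulary"


def manifest_xml(files):
    """Build a tiny manifest listing each file's path and media type.

    The element and attribute names are invented for this sketch, and
    values are not XML-escaped.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<manifest>"]
    for path, media_type, _data in files:
        lines.append(f'  <file-entry path="{path}" media-type="{media_type}"/>')
    lines.append("</manifest>")
    return "\n".join(lines).encode("utf-8")


def write_container(target, files):
    """Write (path, media type, bytes) triples into a ZIP container."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        # As in ODF and EPUB: 'mimetype' goes first and is stored
        # uncompressed, so tools can identify the file from its first bytes.
        zf.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/manifest.xml", manifest_xml(files))
        for path, _media_type, data in files:
            zf.writestr(path, data)


# Example usage with dummy content.
files = [
    ("collection.xml", "text/xml", b"<collection/>"),
    ("training.xml", "text/xml", b"<training/>"),
    ("Pictures/dog.png", "image/png", b"placeholder bytes"),
]
buf = io.BytesIO()
write_container(buf, files)
```

Keeping training.xml as its own member means a collection can be shared without the 
training data simply by leaving that one file out.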

I will stop here for now. If we can agree on this, then we can dive into the details next, such 
as the actual tags. :)


Open questions
----------------------

1. What should be the mimetype of the new format?
2. Should we move metadata from collection.xml to the global meta.xml file?
3. Some have suggested basing the file format on OPC, the Open Packaging Conventions, 
which is used for lots of file formats, mostly on Windows. This format is mostly like ODF but 
has an advanced way of linking together different files inside the container. I don't know 
what this would bring us, but it is perhaps worth discussing.
4. Should we also use the type of training in the training data? For instance, just because I 
know that the spoken translation of DOG into German is HUND (as found by flashcard 
training) does not mean that I know how to spell HUND, which can be trained separately.
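
One possible way to combine suggestion 4 (confidence per *pair* of items) with open 
question 4 (training type) is sketched below in Python. All names here (ItemRef, 
TrainingData, the "flashcard" and "written" type strings) are hypothetical, and the 1 to 7 
range is simply carried over from the kvtml grade described above.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ItemRef:
    """Path of ids down the tree: entry, translation, optional attachment."""
    entry: int
    translation: int
    attachment: Optional[int] = None  # None means the written word itself


class TrainingData:
    """Confidence levels keyed by (source item, target item, training type)."""

    def __init__(self):
        self._levels: Dict[Tuple[ItemRef, ItemRef, str], int] = {}

    def set_level(self, source, target, level, training_type="flashcard"):
        # Assumption: same 1..7 range as the kvtml grade.
        if not 1 <= level <= 7:
            raise ValueError("confidence level must be between 1 and 7")
        self._levels[(source, target, training_type)] = level

    def level(self, source, target, training_type="flashcard"):
        """Return the stored level, or None if this pair was never trained."""
        return self._levels.get((source, target, training_type))


# Example: entry 4 has an English word (translation 1) and a German word
# (translation 2) whose attachment 2 is an audio file.
english = ItemRef(entry=4, translation=1)
german = ItemRef(entry=4, translation=2)
german_audio = ItemRef(entry=4, translation=2, attachment=2)

data = TrainingData()
data.set_level(english, german, 5, "flashcard")
data.set_level(german_audio, german, 2, "written")
```

With this keying, knowing the English to German direction says nothing about the reverse 
direction or about spelling, which is exactly the distinction the question is after.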


Conclusions
-----------------

These suggestions should not be too controversial. I am fine with other solutions but why 
reinvent the wheel when it already works well elsewhere?





_______________________________________________
kde-edu mailing list
kde-edu@mail.kde.org
https://mail.kde.org/mailman/listinfo/kde-edu

