On Sat, Oct 25, 2008 at 7:40 PM, Isaac Clerencia <isaac@warp.es> wrote:
On Sat, Oct 25, 2008 at 12:25 PM, Albert Cervera i Areny
<albert@nan-tic.com> wrote:
> A Dimarts 21 Octubre 2008, Jordi Polo va escriure:
>>
>> Any opinion about these ideas is very much welcomed.
>
> I think an interesting application would be to process text documents (PDF,
> ODF, etc) and extract tags and information out of it to feed nepomuk. Let me
> put some examples:
>
> Say you're a lawyer with lots of contracts of different types. You'd expect
> strigi+your application + nepomuk to recognize those contracts, put the
> appropriate tag. It also sets tags saying what kind of contract it is and even
> extracts information about the people involved. Maybe even what the contract
> is about.

One of the easiest things to do (maybe it's even done already, I
haven't been following development closely lately) would be to detect
the language of a document in the absence of metadata language
information. Have a look at this post for an example[0].

There is an open source library somewhere that implements that method. It would be a matter of wrapping it or reimplementing it with Qt.
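To make the trigram idea concrete, here is a rough Python sketch (this is not the library I mentioned, and the tiny per-language samples are placeholders I made up; a real detector would be trained on much larger texts):

# Rough sketch of character-trigram language detection.
from collections import Counter

def trigrams(text):
    """Frequency of character trigrams in the text."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Toy training data; in practice you would use a big corpus per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "es": "el rapido zorro marron salta sobre el perro perezoso y luego huye",
    "ca": "la guineu marro rapida salta sobre el gos mandros i despres fuig",
}
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}

def detect_language(text):
    """Pick the language whose trigram profile overlaps most with the text."""
    doc = trigrams(text)
    def overlap(profile):
        return sum(min(count, profile[tri]) for tri, count in doc.items())
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang]))

print(detect_language("the dog runs over the fox"))      # -> en
print(detect_language("el perro salta sobre el zorro"))  # -> es

The classic formulation (Cavnar & Trenkle) compares ranked trigram profiles instead of the raw overlap used above, but the principle is the same, and porting something like this to C++/Qt would be straightforward.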
 

The main problem with the other proposals (invoices, contracts,
whatever, ...) is that it would require a huge effort to have them
available in the different languages.
 
Yes, we would probably need trained models or corpora for every language... Not a very appealing outlook, but I guess it cannot be helped.
 

Apertium[1], an open-source machine translation engine, can be helpful
because it can do part-of-speech tagging, making it easier to try to
extract meaning from a document afterwards.

That software (like any other I know of) works for a limited number of languages. POS tagging depends on external data that may not be available for many of the 50+ languages KDE is translated into.
Once we go multilingual, everything becomes difficult. Even spellchecking, which is old and stable by now, still has no fully satisfying solution.
 

Another idea might be helping with collocations[2], mainly in KWord or
other programs where spell checking makes sense, so when you write a
noun you can somehow ask for the adjectives that are usually used with
that noun; read this blog post[3] for more insight about it.
 
The approach in that blog post is the first step in statistical language processing: look at word frequencies and extract patterns from them. See http://en.wikipedia.org/wiki/N-gram
Finding collocations with that method reduces to finding 2-grams or 3-grams, and you don't even need external corpora; the text of the same document is enough. For instance, you are writing some homework and you happen to write "Computer Science" a couple of times; the next time you type "Computer", "Science" can appear as a suggestion. It is a cool feature to have, and a great, computationally cheap idea (a rough sketch follows below).
The problem is that it looks like a one-year challenge.
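A minimal sketch of that in-document bigram idea (the function names and the simple tokenizer are just my assumptions for illustration, nothing that exists in KWord):

# Count word bigrams in the text already written, then suggest the most
# frequent follower of the word just typed.
import re
from collections import Counter, defaultdict

def build_followers(text):
    """Map each word to a Counter of the words that follow it in this text."""
    words = re.findall(r"\w+", text.lower())
    followers = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        followers[first][second] += 1
    return followers

def suggest(followers, word, n=3):
    """Up to n most common words seen right after 'word' in the document."""
    return [w for w, _ in followers[word.lower()].most_common(n)]

document = ("I study Computer Science at the university. "
            "Computer Science is a broad field, and computer programs are fun.")
followers = build_followers(document)
print(suggest(followers, "Computer"))  # -> ['science', 'programs']

The same counting could also be trained once on a much bigger corpus (for example a language's Wikipedia dump) instead of only the open document, which is basically what the blog demo does.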


That thing was written in Python in a weekend and could easily be ported
to C++. Although the demo only uses English texts, it is language
independent, so it just needs to be trained with a corpus for a given
language (for example, that language's Wikipedia) to be able to work
with it.

[0] http://richmarr.wordpress.com/2008/09/30/language-detection-using-trigrams/
[1] http://www.apertium.org/
[2] http://en.wikipedia.org/wiki/Collocation
[3] http://people.warp.es/~isaac/blog/index.php/sketching-words-79
--
Isaac Clerencia at Warp Networks, http://www.warp.es
Blog: http://people.warp.es/~isaac/blog/
Work: <isaac@warp.es>   | Debian: <isaac@debian.org>



--
Jordi Polo Carres
NLP laboratory - NAIST
http://www.bahasara.org