
On Sat, Oct 25, 2008 at 7:40 PM, Isaac Clerencia <isaac@warp.es> wrote:
> On Sat, Oct 25, 2008 at 12:25 PM, Albert Cervera i Areny
> <albert@nan-tic.com> wrote:
> > On Tuesday 21 October 2008, Jordi Polo wrote:
> >>
> >> Any opinion about these ideas is very much welcomed.
> >
> > I think an interesting application would be to process text documents (PDF,
> > ODF, etc.) and extract tags and information out of them to feed Nepomuk. Let me
> > give some examples:
> >
> > Say you're a lawyer with lots of contracts of different types. You'd expect
> > Strigi + your application + Nepomuk to recognize those contracts and apply the
> > appropriate tag. It would also set tags saying what kind of contract it is, and even
> > extract information about the people involved. Maybe even what the contract
> > is about.
>
> One of the easiest things to do (maybe it's even done already, I
> haven't been following development closely lately) would be to detect
> the language of the document in the absence of metadata language
> information. Have a look at this post for example [0].

There is an open-source library somewhere that implements that method. It would be a matter of wrapping it or reimplementing it with Qt.
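
Just to make the idea concrete, here is a rough sketch of the trigram approach from [0] in plain C++ (no Qt yet). The function names, the toy language profiles and the sample sentences are mine, made up for illustration; a real implementation would train the profiles on proper corpora and normalise the text first.

// Sketch of character-trigram language guessing, loosely following [0].
// Toy profiles only; real ones would come from large corpora.
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

// Count overlapping character trigrams in a piece of text.
std::map<std::string, int> trigramCounts(const std::string &text)
{
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i + 3 <= text.size(); ++i)
        ++counts[text.substr(i, 3)];
    return counts;
}

// Score the document against every language profile; the profile whose
// frequent trigrams show up most often in the document wins.
std::string guessLanguage(const std::string &document,
                          const std::map<std::string, std::map<std::string, int>> &profiles)
{
    const std::map<std::string, int> docTrigrams = trigramCounts(document);
    std::string best;
    long bestScore = -1;
    for (const auto &profile : profiles) {
        long score = 0;
        for (const auto &entry : profile.second) {
            const auto it = docTrigrams.find(entry.first);
            if (it != docTrigrams.end())
                score += static_cast<long>(it->second) * entry.second;
        }
        if (score > bestScore) {
            bestScore = score;
            best = profile.first;
        }
    }
    return best;
}

int main()
{
    // Toy profiles built from single sentences; a real tool would train them
    // on big corpora (each language's Wikipedia, for instance).
    std::map<std::string, std::map<std::string, int>> profiles;
    profiles["en"] = trigramCounts("the quick brown fox jumps over the lazy dog");
    profiles["es"] = trigramCounts("el veloz murcielago comia feliz cardillo y kiwi");

    std::cout << guessLanguage("the dog was rather lazy", profiles) << std::endl; // should print "en"
    return 0;
}

Wrapping the existing library would still be preferable to reimplementing this, but the core really is that small.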
 

> The main problem with the other proposals (invoices, contracts,
> whatever, ...) is that it will require a huge effort to have them
> available in the different languages.
 
Yes, it would probably require trained models or corpora for every language... not a very appealing outlook, but I guess it cannot be helped.
 

> Apertium [1], an open-source machine translation engine, can be helpful
> because it can do part-of-speech tagging, making it easier to try to
> extract meaning from a document afterwards.

That software (like any other I know of) works only for a limited number of languages. POS tagging depends on external data that may not be available for many of the 50+ languages KDE is translated to.
Once we go multilingual everything becomes difficult. Even spellchecking, which is old and supposedly stable, still has no fully satisfying solution.
 

> Another idea might be helping with collocations [2], mainly in KWord or
> other programs where spell checking makes sense, so when you write a
> noun you can somehow ask for the adjectives that are usually used with
> that noun; read this blog post [3] for more insight about it.
 
The approach in that blog post is the first step in statistical language processing: look at word frequencies and extract patterns. See http://en.wikipedia.org/wiki/N-gram
With that method the collocation problem reduces to finding 2-grams or 3-grams, and you don't even need a corpus: you can simply use the text of the document itself. For instance, if you are writing some homework and happen to write "Computer Science" a couple of times, then the next time you type "Computer", "Science" can appear as a suggestion. It is a cool feature to have, and a great, computationally cheap idea.
The problem is that the whole proposal still looks like a one-year challenge to me.
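
To show just how cheap that core part is, here is a minimal toy sketch (my own code, not an existing KWord or Sonnet API): it builds word-bigram counts from the document being edited and suggests the most frequent follower of the word just typed.

// Toy sketch of the "Computer -> Science" suggestion using 2-grams taken
// from the document itself; names and sample text are made up.
#include <algorithm>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, int> Followers;
typedef std::map<std::string, Followers> BigramTable;

// Split the text into whitespace-separated words and count word bigrams.
BigramTable bigramCounts(const std::string &text)
{
    BigramTable counts;
    std::istringstream in(text);
    std::string previous, word;
    while (in >> word) {
        if (!previous.empty())
            ++counts[previous][word];
        previous = word;
    }
    return counts;
}

// Suggest the word that most often followed `current` earlier in the document,
// or an empty string if `current` has never been seen before.
std::string suggestNext(const std::string &current, const BigramTable &counts)
{
    const BigramTable::const_iterator it = counts.find(current);
    if (it == counts.end() || it->second.empty())
        return std::string();
    return std::max_element(it->second.begin(), it->second.end(),
                            [](const Followers::value_type &a, const Followers::value_type &b) {
                                return a.second < b.second;
                            })->first;
}

int main()
{
    const std::string homework =
        "Computer Science is fun and Computer Science is also useful";
    const BigramTable counts = bigramCounts(homework);
    std::cout << suggestNext("Computer", counts) << std::endl; // should print "Science"
    return 0;
}

Punctuation handling, case folding and proper ranking would of course need more work, which is where the one-year estimate comes from.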


> That thing was written in Python in a weekend and can easily be ported
> to C++ and although the demo only uses English texts it's language
> independent so it just needs to be trained with a corpus from a given
> language (for example, that language's wikipedia) to be able to work
> with it.

> [0] http://richmarr.wordpress.com/2008/09/30/language-detection-using-trigrams/
> [1] http://www.apertium.org/
> [2] http://en.wikipedia.org/wiki/Collocation
> [3] http://people.warp.es/~isaac/blog/index.php/sketching-words-79
> --
> Isaac Clerencia at Warp Networks, http://www.warp.es
> Blog: http://people.warp.es/~isaac/blog/
> Work: <isaac@warp.es>   | Debian: <isaac@debian.org>



--
Jordi Polo Carres
NLP laboratory - NAIST
http://www.bahasara.org
