'Re: Natural language processing tech for the desktop!'


List:       kde-devel
Subject:    Re: Natural language processing tech for the desktop!
From:       "Jordi Polo" <mumismo () gmail ! com>
Date:       2008-10-25 12:25:14
Message-ID: a4162420810250525r55737122x48336f3ea95a0882 () mail ! gmail ! com

On Sat, Oct 25, 2008 at 7:40 PM, Isaac Clerencia <isaac@warp.es> wrote:

> On Sat, Oct 25, 2008 at 12:25 PM, Albert Cervera i Areny
> <albert@nan-tic.com> wrote:
> > A Dimarts 21 Octubre 2008, Jordi Polo va escriure:
> > > 
> > > Any opinion about these ideas is very much welcomed.
> > 
> > I think an interesting application would be to process text documents
> > (PDF, ODF, etc.) and extract tags and information from them to feed
> > nepomuk. Let me give some examples:
> >
> > Say you're a lawyer with lots of contracts of different types. You'd
> > expect strigi + your application + nepomuk to recognize those contracts
> > and apply the appropriate tag. It would also set tags saying what kind
> > of contract each one is, and even extract information about the people
> > involved. Maybe even what the contract is about.
> 
> One of the easiest things to do (maybe it's even done already; I
> haven't been following development closely lately) would be to detect
> the language of the document in the absence of language metadata.
> Have a look at this post for example[0].
> 

There is an open source library somewhere that implements that method. It
would be a matter of wrapping it or reimplementing it with Qt.
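
To make the idea concrete, here is a minimal sketch of trigram-based language
detection. It is not the code from [0] nor the library mentioned above: the
function names, the two tiny training sentences and the use of cosine
similarity to compare trigram profiles are all assumptions made for
illustration. A real detector would build its profiles from a large corpus
per language.

```python
from collections import Counter
import math


def trigram_profile(text):
    """Count overlapping character trigrams in lower-cased, space-normalized text."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy training data: one sentence per language.
SAMPLES = {
    "en": "the dog and the cat eat in the house of the neighbours "
          "because the table is large and the chairs are new",
    "es": "el perro y el gato comen en la casa de los vecinos "
          "porque la mesa es grande y las sillas son nuevas",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def detect_language(text):
    """Return the language whose profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

Calling detect_language() on a short sentence returns the key of the closest
profile. Nothing here depends on the language itself, only on having sample
text for it, which is why the approach is cheap to extend.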

> 
> The main problem with the other proposals (invoices, contracts,
> whatever, ...) is that they will require a huge effort to make them
> available in the different languages.

Yes, it would probably mean trained models or corpora for every language.
Not a very appealing outlook, but I guess it cannot be helped.

> 
> Apertium[1], an open-source machine translation engine, can be helpful
> because it can do part-of-speech tagging, making it easier to try to
> extract meaning from a document afterwards.
> 

That software (like any other I know of) works only for a limited number of
languages. POS tagging depends on external data that may not be available
for many of the 50+ languages KDE is translated into.
Once we go multilingual everything becomes difficult. Even spellchecking,
which is old and stable, has no satisfying solution.

> Another idea might be helping with collocations[2], mainly in KWord or
> other programs where spell checking makes sense, so that when you write a
> noun you can somehow ask for the adjectives that are usually used with
> that noun. Read this blog post[3] for more insight about it.
> 

The solution in that blog post is the first step in statistical language
processing: count word frequencies and extract some patterns. See
http://en.wikipedia.org/wiki/N-gram
With that method, the collocation problem can be reduced to finding
2-grams or 3-grams, and you don't even need corpora: just use the text of
the same document. For instance, if you are writing some homework and you
happen to have written "Computer Science" a couple of times, then the next
time you write "Computer", "Science" can appear as a suggestion. It is a
cool feature to have, and a great, computationally cheap idea.
The problem is that it should look like a 1-year challenge.
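
The in-document 2-gram suggestion described above can be sketched roughly as
follows. This is only an illustration under my own assumptions: the function
names and the min_count threshold are hypothetical, words are lower-cased,
and sentence boundaries are ignored, so a pair that spans two sentences is
counted as well.

```python
from collections import Counter, defaultdict
import re


def build_bigram_table(text):
    """Map each word to a Counter of the words that follow it in the text."""
    words = re.findall(r"\w+", text.lower())
    table = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table


def suggest_next(table, word, min_count=2):
    """Return the most frequent follower of `word`, or None if it was seen
    fewer than `min_count` times."""
    followers = table.get(word.lower())
    if not followers:
        return None
    candidate, count = followers.most_common(1)[0]
    return candidate if count >= min_count else None


document = ("I study Computer Science. Computer Science is fun. "
            "My computer is old.")
table = build_bigram_table(document)
suggestion = suggest_next(table, "Computer")  # follower seen twice: "science"
```

The table only needs one pass over the document and can be updated as the
user types, so no external corpus or per-language data is involved.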

> That thing was written in Python in a weekend and can easily be ported
> to C++. Although the demo only uses English texts, it's language
> independent, so it just needs to be trained with a corpus from a given
> language (for example, that language's wikipedia) to be able to work
> with it.
> 
> [0]
> http://richmarr.wordpress.com/2008/09/30/language-detection-using-trigrams/
> [1] http://www.apertium.org/
> [2] http://en.wikipedia.org/wiki/Collocation
> [3] http://people.warp.es/~isaac/blog/index.php/sketching-words-79
>                 
> --
> Isaac Clerencia at Warp Networks, http://www.warp.es
> Blog: http://people.warp.es/~isaac/blog/
> Work: <isaac@warp.es>   | Debian: <isaac@debian.org>
> 
> >> Visit http://mail.kde.org/mailman/listinfo/kde-devel#unsub to unsubscribe <<
> 

-- 
Jordi Polo Carres
NLP laboratory - NAIST
http://www.bahasara.org

