
On Sat, Oct 25, 2008 at 7:40 PM, Isaac Clerencia <isaac@warp.es> wrote:
> On Sat, Oct 25, 2008 at 12:25 PM, Albert Cervera i Areny
> <albert@nan-tic.com> wrote:
> > On Tuesday 21 October 2008, Jordi Polo wrote:
> >>
> >> Any opinion about these ideas is very much welcomed.
> >
> > I think an interesting application would be to process text documents (PDF,
> > ODF, etc.) and extract tags and information out of them to feed Nepomuk. Let me
> > give some examples:
> >
> > Say you're a lawyer with lots of contracts of different types. You'd expect
> > Strigi + your application + Nepomuk to recognize those contracts and apply the
> > appropriate tag. It would also set tags saying what kind of contract it is, and even
> > extract information about the people involved. Maybe even what the contract
> > is about.
>
> One of the easiest things to do (maybe it's even done already, I
> haven't been following development closely lately) would be to detect
> the language of the document in the absence of metadata language
> information. Have a look at this post for example [0].

There is an open-source library somewhere that implements that method. It would be a matter of wrapping it or reimplementing it with Qt.
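
Just to make the idea concrete, here is a rough sketch of the trigram approach from [0] in plain C++ (no Qt yet). The function names, the toy language profiles and the sample sentences are mine, made up for illustration; a real implementation would train the profiles on proper corpora and normalise the text first.

// Sketch of character-trigram language guessing, loosely following [0].
// Toy profiles only; real ones would come from large corpora.
#include <cstddef>
#include <iostream>
#include <map>
#include <string>

// Count overlapping character trigrams in a piece of text.
std::map<std::string, int> trigramCounts(const std::string &text)
{
    std::map<std::string, int> counts;
    for (std::size_t i = 0; i + 3 <= text.size(); ++i)
        ++counts[text.substr(i, 3)];
    return counts;
}

// Score the document against every language profile; the profile whose
// frequent trigrams show up most often in the document wins.
std::string guessLanguage(const std::string &document,
                          const std::map<std::string, std::map<std::string, int>> &profiles)
{
    const std::map<std::string, int> docTrigrams = trigramCounts(document);
    std::string best;
    long bestScore = -1;
    for (const auto &profile : profiles) {
        long score = 0;
        for (const auto &entry : profile.second) {
            const auto it = docTrigrams.find(entry.first);
            if (it != docTrigrams.end())
                score += static_cast<long>(it->second) * entry.second;
        }
        if (score > bestScore) {
            bestScore = score;
            best = profile.first;
        }
    }
    return best;
}

int main()
{
    // Toy profiles built from single sentences; a real tool would train them
    // on big corpora (each language's Wikipedia, for instance).
    std::map<std::string, std::map<std::string, int>> profiles;
    profiles["en"] = trigramCounts("the quick brown fox jumps over the lazy dog");
    profiles["es"] = trigramCounts("el veloz murcielago comia feliz cardillo y kiwi");

    std::cout << guessLanguage("the dog was rather lazy", profiles) << std::endl; // should print "en"
    return 0;
}

Wrapping the existing library would still be preferable to reimplementing this, but the core really is that small.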
 

> The main problem with the other proposals (invoices, contracts,
> whatever, ...) is that it will require a huge effort to have them
> available in the different languages.
 
Yes, it would probably require trained models or corpora for every language... not a very appealing outlook, but I guess it cannot be helped.
 

> Apertium [1], an open-source machine translation engine, can be helpful
> because it can do part-of-speech tagging, making it easier to try to
> extract meaning from a document afterwards.

That software (like any other I know of) works only for a limited number of languages. POS tagging depends on external data that may not be available for many of the 50+ languages KDE is translated to.
Once we go multilingual everything becomes difficult. Even spellchecking, which is old and supposedly stable, still has no fully satisfying solution.
 

> Another idea might be helping with collocations [2], mainly in KWord or
> other programs where spell checking makes sense, so when you write a
> noun you can somehow ask for the adjectives that are usually used with
> that noun; read this blog post [3] for more insight about it.
 
The approach in that blog post is the first step in statistical language processing: look at word frequencies and extract patterns. See http://en.wikipedia.org/wiki/N-gram
With that method the collocation problem reduces to finding 2-grams or 3-grams, and you don't even need a corpus: you can simply use the text of the document itself. For instance, if you are writing some homework and happen to write "Computer Science" a couple of times, then the next time you type "Computer", "Science" can appear as a suggestion. It is a cool feature to have, and a great, computationally cheap idea.
The problem is that the whole proposal still looks like a one-year challenge to me.
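
To show just how cheap that core part is, here is a minimal toy sketch (my own code, not an existing KWord or Sonnet API): it builds word-bigram counts from the document being edited and suggests the most frequent follower of the word just typed.

// Toy sketch of the "Computer -> Science" suggestion using 2-grams taken
// from the document itself; names and sample text are made up.
#include <algorithm>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

typedef std::map<std::string, int> Followers;
typedef std::map<std::string, Followers> BigramTable;

// Split the text into whitespace-separated words and count word bigrams.
BigramTable bigramCounts(const std::string &text)
{
    BigramTable counts;
    std::istringstream in(text);
    std::string previous, word;
    while (in >> word) {
        if (!previous.empty())
            ++counts[previous][word];
        previous = word;
    }
    return counts;
}

// Suggest the word that most often followed `current` earlier in the document,
// or an empty string if `current` has never been seen before.
std::string suggestNext(const std::string &current, const BigramTable &counts)
{
    const BigramTable::const_iterator it = counts.find(current);
    if (it == counts.end() || it->second.empty())
        return std::string();
    return std::max_element(it->second.begin(), it->second.end(),
                            [](const Followers::value_type &a, const Followers::value_type &b) {
                                return a.second < b.second;
                            })->first;
}

int main()
{
    const std::string homework =
        "Computer Science is fun and Computer Science is also useful";
    const BigramTable counts = bigramCounts(homework);
    std::cout << suggestNext("Computer", counts) << std::endl; // should print "Science"
    return 0;
}

Punctuation handling, case folding and proper ranking would of course need more work, which is where the one-year estimate comes from.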


> That thing was written in Python in a weekend and can easily be ported
> to C++ and although the demo only uses English texts it's language
> independent so it just needs to be trained with a corpus from a given
> language (for example, that language's wikipedia) to be able to work
> with it.

> [0] http://richmarr.wordpress.com/2008/09/30/language-detection-using-trigrams/
> [1] http://www.apertium.org/
> [2] http://en.wikipedia.org/wiki/Collocation
> [3] http://people.warp.es/~isaac/blog/index.php/sketching-words-79
> --
> Isaac Clerencia at Warp Networks, http://www.warp.es
> Blog: http://people.warp.es/~isaac/blog/
> Work: <isaac@warp.es>   | Debian: <isaac@debian.org>



--
Jordi Polo Carres
NLP laboratory - NAIST
http://www.bahasara.org
