On Sat, Oct 25, 2008 at 12:25 PM, Albert Cervera i Areny
<albert@nan-tic.com> wrote:
> On Tuesday 21 October 2008, Jordi Polo wrote:
>>
>> Any opinion about these ideas is very much welcomed.
>
> I think an interesting application would be to process text documents (PDF,
> ODF, etc) and extract tags and information out of it to feed nepomuk. Let me
> put some examples:
>
> Say you're a lawyer with lots of contracts of different types. You'd expect
> strigi+your application + nepomuk to recognize those contracts, put the
> appropriate tag. It also sets tags saying what kind of contract it is and even
> extracts information about the people involved. Maybe even what the contract
> is about.

One of the easiest things to do (maybe it's even done already, I
haven't been following development closely lately) would be to detect
the language of a document when its metadata carries no language
information. Have a look at this post for an example[0].
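
To make the trigram idea concrete, here is a minimal sketch in Python.
It is not the code from the linked post: the function names, the cosine
similarity scoring and the tiny training samples are all assumptions made
for illustration. A real detector would build each language profile from
a much larger corpus.

```python
import math
from collections import Counter


def trigram_counts(text):
    """Count character trigrams in text (lowercased, whitespace collapsed)."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "  # pad so word boundaries form trigrams too
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_language(text, profiles):
    """Return the key of the profile whose trigrams best match text."""
    doc = trigram_counts(text)
    return max(profiles, key=lambda lang: cosine(doc, profiles[lang]))


if __name__ == "__main__":
    # Hypothetical one-sentence "corpora", only to show the call sequence.
    profiles = {
        "en": trigram_counts("the contract was signed by both parties"),
        "es": trigram_counts("el contrato fue firmado por las dos partes"),
    }
    print(detect_language("both parties signed", profiles))
```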

The main problem with the other proposals (invoices, contracts,
whatever, ...) is that it would take a huge effort to make them
available in different languages.

Apertium[1], an open-source machine translation engine, can be helpful
because it can do part-of-speech tagging, which makes it easier to
extract meaning from a document afterwards.

Another idea might be helping with collocations[2], mainly in KWord or
other programs where spell checking makes sense: when you write a
noun, you could ask for the adjectives that are usually used with
that noun. Read this blog post[3] for more insight.
The demo described there was written in Python in a weekend and could
easily be ported to C++. Although it only uses English texts, it's
language independent, so it just needs to be trained with a corpus
from a given language (for example, that language's Wikipedia) to work
with it.
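
As a rough illustration of the collocation idea (not the code behind the
blog post), the sketch below counts which words directly precede a given
word in a training corpus. The names `tokenize`, `collocates` and the
`ignore` parameter are made up for this example. Restricting the results
to adjectives would need part-of-speech tags, which this sketch replaces
with a simple list of words to ignore.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def collocates(corpus, word, ignore=frozenset(), top=5):
    """Return the words most often found right before `word` in corpus."""
    tokens = tokenize(corpus)
    target = word.lower()
    counts = Counter(
        prev
        for prev, cur in zip(tokens, tokens[1:])
        if cur == target and prev not in ignore
    )
    return counts.most_common(top)


if __name__ == "__main__":
    # Toy corpus; a real one could be a Wikipedia dump for the language.
    corpus = "Strong tea and strong coffee. She likes strong tea, not weak tea."
    print(collocates(corpus, "tea", ignore={"the", "a"}))
```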

[0] http://richmarr.wordpress.com/2008/09/30/language-detection-using-trigrams/
[1] http://www.apertium.org/
[2] http://en.wikipedia.org/wiki/Collocation
[3] http://people.warp.es/~isaac/blog/index.php/sketching-words-79
--
Isaac Clerencia at Warp Networks, http://www.warp.es
Blog: http://people.warp.es/~isaac/blog/
Work: <isaac@warp.es> | Debian: <isaac@debian.org>