List:       kde-i18n-doc
Subject:    Re: Automatic translation; was "Re: KBabel Maintainer"
From:       Adriaan de Groot <groot () kde ! org>
Date:       2007-02-09 11:38:24
Message-ID: 200702091238.29206.groot () kde ! org

On Friday 09 February 2007 02:22, Kevin Scannell wrote:
>   One idea I've had for a while would be to keep a central database of
> linguistically preprocessed English POT files that could be used as
> input to (hopefully naive) machine translation programs for many
> languages.  The point is that good MT requires good parsing of the
> source language (some might disagree with that), and fortunately there
> are several robust open source parsers that we could use to take care
> of this once and for all.  Collins parser, Link Grammar parser, and
> the "Stanford" parser come to mind, but it's been a while since I
> surveyed what's available.

[Since I'm down the hall from Prof. Koster, and he's the only parsing and 
translation person I know, I'm going to be referring to him a lot.] There is 
also parsing technology available from some research groups; it can take a 
little work to get it under a good license, but Kees' work (I gather he does 
a lot of semantic-search work with word-stem pairs) will be made available 
under the GPL if we need it.

>    None of them will give 100% accuracy, but a human could maintain
> these POT files in almost exactly the same way that language groups
> maintain their PO files.  When the unprocessed POT files change,
> scripty could attempt a parse and fuzzy the processed POT files.  The
> maintainer can check these and make any necessary fixes to the parse
> trees or the part-of-speech tags.
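
That scripty step could be quite naive and still be useful. A minimal 
sketch, again assuming polib for PO handling, with difflib standing in for 
real fuzzy matching:

    import difflib
    import polib

    def refresh(analyses, new_pot_path):
        """analyses: msgid -> hand-checked parse from the previous run."""
        kept, needs_review = {}, {}
        for entry in polib.pofile(new_pot_path):
            if entry.msgid in analyses:
                kept[entry.msgid] = analyses[entry.msgid]
                continue
            # New or changed msgid: offer the closest old parse as a
            # fuzzy starting point, much as msgmerge does for msgstrs.
            close = difflib.get_close_matches(entry.msgid, list(analyses), n=1)
            needs_review[entry.msgid] = analyses[close[0]] if close else None
        return kept, needs_review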

Part of the project would certainly be to establish this corpus based on the 
text of Free software that is actually *used* by governments in the EU. A 
rough sketch would be:

- inventory what is used within the EU (with limits)
- gather translation statistics
- build a corpus of English strings *across products* (this is sort of what 
Rosetta and other tools do, no? They handle translations of far more than 
just KDE)
- build a corpus of existing translations
- produce sound semantic analyses (machine-built and hand-tuned; this has a 
pretty good effect on accuracy, I'm told) of the English
- use these to suggest normalizations of English strings across projects (this 
is scary: suggesting string changes in the English source so that multiple 
independent Free software projects use consistent terminology; a toy sketch 
follows this list)
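
For that last point, even naive near-duplicate detection over the combined 
corpus would turn up candidates. A toy sketch (difflib over raw strings; 
real work would compare the semantic analyses instead):

    import difflib
    import polib

    def collect(pot_paths):
        """Map each English string to the set of products using it."""
        corpus = {}
        for path in pot_paths:
            for entry in polib.pofile(path):
                corpus.setdefault(entry.msgid, set()).add(path)
        return corpus

    def normalization_candidates(corpus, cutoff=0.9):
        # Quadratic and naive, but fine for a first survey.
        strings = sorted(corpus)
        for i, s in enumerate(strings):
            for t in difflib.get_close_matches(s, strings[i + 1:], n=3,
                                               cutoff=cutoff):
                yield s, t, corpus[s] | corpus[t]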


>    This is, I think, all doable with existing tools.  The hard part is
> then generating the target strings from the parsed English, and there
> is no denying that is very hard to do, and requires big bilingual
> lexicons and the like.   

We're in a very constrained part of English, though. Isn't there existing 
work on using controlled natural language for this sort of thing?

> There are also statistical methods one could 
> try, but they're more reliable when you can parse the target language
> too, and have existing translations for training.

Fortunately we have both of those, in varying amounts.
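
Extracting the training material is the easy part: every translated PO file 
is a small parallel corpus. A sketch, with the usual caveat that polib is 
just my assumption for PO handling:

    import polib

    def parallel_pairs(po_path):
        """Yield (English, target) pairs usable as MT training data."""
        for entry in polib.pofile(po_path):
            # Skip untranslated and fuzzy entries; they would poison
            # the training data.
            if entry.translated() and 'fuzzy' not in entry.flags:
                yield entry.msgid, entry.msgstr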

Given that this will be leveraging existing tools to create multiple 
dictionaries and databases, we *will* need to figure out where the original 
research comes in (perhaps from exploiting the multiple parallel translations 
we have).

-- 
These are your friends - Adem
    GPG: FEA2 A3FE Adriaan de Groot
