List: kde-i18n-doc
Subject: Re: Automatic translation; was "Re: KBabel Maintainer"
From: Adriaan de Groot <groot () kde ! org>
Date: 2007-02-09 11:38:24
Message-ID: 200702091238.29206.groot () kde ! org
On Friday 09 February 2007 02:22, Kevin Scannell wrote:
> One idea I've had for a while would be to keep a central database of
> linguistically preprocessed English POT files that could be used as
> input to (hopefully naive) machine translation programs for many
> languages. The point is that good MT requires good parsing of the
> source language (some might disagree with that), and fortunately there
> are several robust open source parsers that we could use to take care
> of this once and for all. Collins parser, Link Grammar parser, and
> the "Stanford" parser come to mind, but it's been a while since I
> surveyed what's available.
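One way to picture the "linguistically preprocessed POT" idea is an entry format where each msgid carries its parse alongside it, e.g. in an extracted `#.` comment. The sketch below is purely illustrative: the comment convention, function name, and the toy POS tags are invented here, and real output from the Collins, Link Grammar, or Stanford parsers would be a full parse tree, not a flat tag list.

```python
# Hypothetical sketch: storing a parse next to a POT entry as a
# "#. parse:" extracted comment. The tag set and layout are invented
# for illustration; a real pipeline would emit parser-native trees.

def annotate_entry(msgid, tags):
    """Render a POT entry whose '#. parse:' comment carries word/POS pairs."""
    tagged = " ".join(f"{word}/{pos}" for word, pos in tags)
    return (f"#. parse: {tagged}\n"
            f'msgid "{msgid}"\n'
            f'msgstr ""\n')

entry = annotate_entry("Open the selected file",
                       [("Open", "VB"), ("the", "DT"),
                        ("selected", "JJ"), ("file", "NN")])
print(entry)
```

A human maintainer could then correct the `#. parse:` line by hand, exactly as translators correct msgstr lines today.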
[Since I'm down the hall from Prof. Koster, and he's the only parsing and
translation person I know, I'm going to be referring to him a lot.] There is
also parsing technology available from some research groups; it can take a
little work to get it released under a good license, but Kees' work (I gather
he does a lot of semantic search work with word-stem pairs) will be made
available under the GPL if we need it.
> None of them will give 100% accuracy, but a human could maintain
> these POT files in almost exactly the same way that language groups
> maintain their PO files. When the unprocessed POT files change,
> scripty could attempt a parse and fuzzy the processed POT files. The
> maintainer can check these and make any necessary fixes to the parse
> trees or the part-of-speech tags.

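The fuzzying step Kevin describes could work much like scripty's existing merge: when the plain POT changes, any processed entry whose msgid no longer matches the fresh extraction gets flagged for review. A minimal sketch, with a deliberately simplified entry model (dict of msgid to annotation, no real PO parsing):

```python
# Hedged sketch of the re-fuzzy step: compare the msgids of the
# annotated catalog against the freshly extracted POT, and report
# which annotations survive and which need maintainer attention.

def refuzzy(processed, current_msgids):
    """processed: dict msgid -> parse annotation.
    current_msgids: set of msgids from the fresh POT extraction.
    Returns (kept, fuzzied) lists of msgids."""
    kept, fuzzied = [], []
    for msgid in processed:
        (kept if msgid in current_msgids else fuzzied).append(msgid)
    return kept, fuzzied

kept, fuzzied = refuzzy(
    {"Open file": "...", "Save as...": "..."},
    {"Open file", "Save As..."})
print(kept, fuzzied)  # ['Open file'] ['Save as...']
```

A real implementation would additionally try near-match heuristics (as msgmerge does) before discarding an annotation outright.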
Part of the project would certainly be to establish this corpus based on the
text of Free software that is actually *used* by governments in the EU. A
rough sketch would be:
- inventory what is used within the EU (with limits)
- gather translation statistics
- build corpus of English strings *across products* (this is sort of what
Rosetta and other tools do, no? they handle translations of far more than
just KDE)
- build corpus of existing translations
- produce sound semantic analyses (machine-built and hand-tuned; this has a
pretty good effect on the accuracy, I'm told) in English
- use these to suggest normalizations of English strings across projects (this
is scary: suggesting string changes in the English source so that multiple
independent Free software projects use consistent terminology)
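The cross-product corpus step above can be sketched very simply: tally how often each English msgid recurs across a pile of catalogs, so shared strings surface as normalization candidates. The PO "parsing" below is deliberately minimal (single-line msgids only, regex-based), and the sample catalog contents are invented:

```python
# Rough sketch of the "corpus of English strings across products" step:
# count msgid occurrences across many PO/POT file contents. Real PO
# files need a proper parser; this handles one-line msgids only.
import re
from collections import Counter

MSGID = re.compile(r'^msgid "(.+)"$', re.M)

def corpus_counts(catalog_texts):
    """Tally msgid occurrences across many PO/POT file contents."""
    counts = Counter()
    for text in catalog_texts:
        counts.update(MSGID.findall(text))
    return counts

kde_po = 'msgid "Open"\nmsgstr "Openen"\n'
gnome_po = 'msgid "Open"\nmsgstr ""\nmsgid "Quit"\nmsgstr ""\n'
print(corpus_counts([kde_po, gnome_po]).most_common())
# [('Open', 2), ('Quit', 1)]
```

Strings with high counts but divergent phrasing ("Quit" vs "Exit", say) would be exactly the ones worth suggesting a shared form for.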
> This is, I think, all doable with existing tools. The hard part is
> then generating the target strings from the parsed English, and there
> is no denying that is very hard to do, and requires big bilingual
> lexicons and the like.
We're in a very constrained part of English, though. Isn't there work on using
restricted natural language for this?
> There are also statistical methods one could
> try, but they're more reliable when you can parse the target language
> too, and have existing translations for training.
Fortunately we have those both, in varying amounts.
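Harvesting those existing translations as training material is mechanically straightforward: every non-empty msgstr paired with its msgid is one sentence pair of a parallel corpus. A minimal sketch, again assuming simplified one-line entries rather than real PO syntax:

```python
# Sketch of turning existing PO translations into a parallel corpus
# for statistical methods: pair each msgid with its non-empty msgstr.
# Simplified entry format; real catalogs need a full PO parser.
import re

PAIR = re.compile(r'^msgid "(.+)"\nmsgstr "(.+)"$', re.M)

def parallel_pairs(po_text):
    """Extract (english, translated) pairs; untranslated entries are skipped."""
    return [(en, tr) for en, tr in PAIR.findall(po_text)]

po = ('msgid "Open"\nmsgstr "Openen"\n\n'
      'msgid "Quit"\nmsgstr ""\n')
print(parallel_pairs(po))  # [('Open', 'Openen')]
```

With catalogs for dozens of target languages over the same source strings, this also yields the multiple *parallel* translations mentioned below as a possible site of original research.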
Given that this will be leveraging existing tools to create multiple
dictionaries and databases, we *will* need to figure out where the original
research comes in (perhaps from exploiting the multiple parallel translations
we have).
--
These are your friends - Adem
GPG: FEA2 A3FE Adriaan de Groot