From kde-i18n-doc Fri Feb 09 11:38:24 2007
From: Adriaan de Groot
Date: Fri, 09 Feb 2007 11:38:24 +0000
To: kde-i18n-doc
Subject: Re: Automatic translation; was "Re: KBabel Maintainer"
Message-Id: <200702091238.29206.groot () kde ! org>
X-MARC-Message: https://marc.info/?l=kde-i18n-doc&m=117102139805125

On Friday 09 February 2007 02:22, Kevin Scannell wrote:
> One idea I've had for a while would be to keep a central database of
> linguistically preprocessed English POT files that could be used as
> input to (hopefully naive) machine translation programs for many
> languages. The point is that good MT requires good parsing of the
> source language (some might disagree with that), and fortunately there
> are several robust open source parsers that we could use to take care
> of this once and for all. Collins parser, Link Grammar parser, and
> the "Stanford" parser come to mind, but it's been a while since I
> surveyed what's available.

[Since I'm down the hall from Prof. Koster and he's the only parsing and
translation person I know, I'm going to be referring to him a lot.] There is
also parsing technology available from some research groups; it can take a
little work to get it under a good license, but Kees' work (I gather he does
a lot of semantic search work with word-stem pairs) will be made available
under the GPL if we need it.

> None of them will give 100% accuracy, but a human could maintain
> these POT files in almost exactly the same way that language groups
> maintain their PO files. When the unprocessed POT files change,
> scripty could attempt a parse and fuzzy the processed POT files. The
> maintainer can check these and make any necessary fixes to the parse
> trees or the part-of-speech tags.

Part of the project would certainly be to establish this corpus based on the
text of Free software that is actually *used* by governments in the EU. A
rough sketch would be:

- inventory what is used within the EU (with limits)
- gather translation statistics
- build a corpus of English strings *across products* (this is sort of what
  Rosetta and other tools do, no? they handle translations of far more than
  just KDE)
- build a corpus of existing translations
- produce sound semantic analyses (machine-built and hand-tuned; this has a
  pretty good effect on the accuracy, I'm told) in English
- use these to suggest normalizations of English strings across projects
  (this is scary: suggesting string changes in the English source so that
  multiple independent Free software projects use consistent terminology)

> This is, I think, all doable with existing tools. The hard part is
> then generating the target strings from the parsed English, and there
> is no denying that is very hard to do, and requires big bilingual
> lexicons and the like.

We're in a very constrained part of English, though. Isn't there work on
using restricted natural language for this?

> There are also statistical methods one could
> try, but they're more reliable when you can parse the target language
> too, and have existing translations for training.

Fortunately we have those both, in varying amounts.
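
To make the "existing translations for training" part a bit more concrete,
here is a rough, untested sketch of how the PO files we already have could be
harvested into an aligned multilingual corpus. It assumes a layout like
l10n/<lang>/messages/<package>/<catalog>.po (the exact paths are hypothetical)
and uses the third-party polib Python module for the parsing; whatever scripty
already uses would do just as well.

# Sketch: harvest an aligned multilingual corpus from existing PO files.
# Assumes l10n/<lang>/messages/.../<catalog>.po (hypothetical layout) and
# the "polib" module for PO parsing.

import os
import sys
from collections import defaultdict

import polib


def harvest(l10n_root, languages):
    """Map (catalog, msgid) -> {language: msgstr} over finished translations."""
    corpus = defaultdict(dict)
    for lang in languages:
        top = os.path.join(l10n_root, lang, "messages")
        for dirpath, _dirs, files in os.walk(top):
            for name in files:
                if not name.endswith(".po"):
                    continue
                po = polib.pofile(os.path.join(dirpath, name))
                for entry in po.translated_entries():
                    # Only trust finished, non-fuzzy, non-obsolete messages.
                    if entry.obsolete or "fuzzy" in entry.flags:
                        continue
                    corpus[(name, entry.msgid)][lang] = entry.msgstr
    return corpus


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "l10n"
    langs = sys.argv[2:] or ["de", "fr", "nl"]
    corpus = harvest(root, langs)
    # Messages translated in every requested language are the aligned
    # tuples a statistical system could train on or be evaluated against.
    complete = [k for k, v in corpus.items() if len(v) == len(langs)]
    print("%d messages aligned across %s" % (len(complete), ", ".join(langs)))

Nothing clever happens there; the point is only that the raw material for the
statistical side is already sitting in SVN, one tuple per message per language.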
Given that this will be leveraging existing tools to create multiple
dictionaries and databases, we *will* need to figure out where the original
research comes in (perhaps from exploiting the multiple parallel translations
we have).

--
These are your friends - Adem
GPG: FEA2 A3FE
Adriaan de Groot