From kde-i18n-doc Fri Feb 09 11:38:24 2007
From: Adriaan de Groot
Date: Fri, 09 Feb 2007 11:38:24 +0000
To: kde-i18n-doc
Subject: Re: Automatic translation; was "Re: KBabel Maintainer"
Message-Id: <200702091238.29206.groot () kde ! org>
X-MARC-Message: https://marc.info/?l=kde-i18n-doc&m=117102139805125

On Friday 09 February 2007 02:22, Kevin Scannell wrote:
> One idea I've had for a while would be to keep a central database of
> linguistically preprocessed English POT files that could be used as
> input to (hopefully naive) machine translation programs for many
> languages. The point is that good MT requires good parsing of the
> source language (some might disagree with that), and fortunately there
> are several robust open source parsers that we could use to take care
> of this once and for all. Collins parser, Link Grammar parser, and
> the "Stanford" parser come to mind, but it's been a while since I
> surveyed what's available.

[Since I'm down the hall from Prof. Koster and he's the only parsing and
translation person I know, I'm going to be referring to him a lot.] There is
also parsing technology available from some research groups; it can take a
little work to get it under a good license, but Kees' work (I gather he does
a lot of semantic search work with word-stem pairs) will be made available
under the GPL if we need it.

> None of them will give 100% accuracy, but a human could maintain
> these POT files in almost exactly the same way that language groups
> maintain their PO files. When the unprocessed POT files change,
> scripty could attempt a parse and fuzzy the processed POT files. The
> maintainer can check these and make any necessary fixes to the parse
> trees or the part-of-speech tags.

Part of the project would certainly be to establish this corpus based on the
text of Free software that is actually *used* by governments in the EU. A
rough sketch would be:

- inventory what is used within the EU (with limits)
- gather translation statistics
- build a corpus of English strings *across products* (this is sort of what
  Rosetta and other tools do, no? they handle translations of far more than
  just KDE)
- build a corpus of existing translations
- produce sound semantic analyses (machine-built and hand-tuned; this has a
  pretty good effect on the accuracy, I'm told) in English
- use these to suggest normalizations of English strings across projects
  (this is scary: suggesting string changes in the English source so that
  multiple independent Free software projects use consistent terminology)

> This is, I think, all doable with existing tools. The hard part is
> then generating the target strings from the parsed English, and there
> is no denying that is very hard to do, and requires big bilingual
> lexicons and the like.

We're in a very constrained part of English, though. Isn't there work on
using restricted natural language for this?

> There are also statistical methods one could
> try, but they're more reliable when you can parse the target language
> too, and have existing translations for training.

Fortunately we have those both, in varying amounts.
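
To make the "existing translations for training" part a bit more concrete,
here is a rough, untested sketch of how the PO files we already have could be
harvested into an aligned multilingual corpus. It assumes a layout like
l10n/<lang>/messages/<package>/<catalog>.po (the exact paths are hypothetical)
and uses the third-party polib Python module for the parsing; whatever scripty
already uses would do just as well.

# Sketch: harvest an aligned multilingual corpus from existing PO files.
# Assumes l10n/<lang>/messages/.../<catalog>.po (hypothetical layout) and
# the "polib" module for PO parsing.

import os
import sys
from collections import defaultdict

import polib


def harvest(l10n_root, languages):
    """Map (catalog, msgid) -> {language: msgstr} over finished translations."""
    corpus = defaultdict(dict)
    for lang in languages:
        top = os.path.join(l10n_root, lang, "messages")
        for dirpath, _dirs, files in os.walk(top):
            for name in files:
                if not name.endswith(".po"):
                    continue
                po = polib.pofile(os.path.join(dirpath, name))
                for entry in po.translated_entries():
                    # Only trust finished, non-fuzzy, non-obsolete messages.
                    if entry.obsolete or "fuzzy" in entry.flags:
                        continue
                    corpus[(name, entry.msgid)][lang] = entry.msgstr
    return corpus


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "l10n"
    langs = sys.argv[2:] or ["de", "fr", "nl"]
    corpus = harvest(root, langs)
    # Messages translated in every requested language are the aligned
    # tuples a statistical system could train on or be evaluated against.
    complete = [k for k, v in corpus.items() if len(v) == len(langs)]
    print("%d messages aligned across %s" % (len(complete), ", ".join(langs)))

Nothing clever happens there; the point is only that the raw material for the
statistical side is already sitting in SVN, one tuple per message per language.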
Given that this will be leveraging existing tools to create multiple
dictionaries and databases, we *will* need to figure out where the original
research comes in (perhaps from exploiting the multiple parallel translations
we have).

--
These are your friends - Adem
GPG: FEA2 A3FE
Adriaan de Groot