List:       kde-i18n-doc
Subject:    Automatic translation; was "Re: KBabel Maintainer"
From:       "Kevin Scannell" <kscanne () gmail ! com>
Date:       2007-02-09 1:22:46
Message-ID: a27d00500702081722s256e40a2pe6fab6dd3a1ae241 () mail ! gmail ! com

  Hi everyone,

  Kevin Donnelly just pointed me to the interesting thread about
machine translation, etc. for KDE.  Machine translation is one of my
main research interests, but as Kevin points out, I've been spending
most of my time over the past 3-4 years developing simpler,
foundational tools for minority languages that we hope will eventually
lead to robust MT.

  One idea I've had for a while would be to keep a central database of
linguistically preprocessed English POT files that could be used as
input to (hopefully naive) machine translation programs for many
languages.  The point is that good MT requires good parsing of the
source language (some might disagree with that), and fortunately there
are several robust open source parsers that we could use to take care
of this once and for all.  The Collins parser, the Link Grammar parser,
and the Stanford parser come to mind, but it's been a while since I
surveyed what's available.

   None of them will give 100% accuracy, but a human could maintain
these POT files in almost exactly the same way that language groups
maintain their PO files.  When the unprocessed POT files change,
scripty could attempt a parse of the new or changed strings and mark
the corresponding entries in the processed POT files as fuzzy.  The
maintainer can then check these and make any necessary fixes to the
parse trees or the part-of-speech tags.
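
   To make that workflow concrete, here is a rough sketch in Python.
Everything in it is hypothetical: the ProcessedEntry record, the
refresh() function, and the toy_parse() stand-in are made up for
illustration and are not part of scripty or gettext.  It only shows
the idea that reviewed parses are kept, while new or changed strings
get an automatic parse that is flagged fuzzy until a human checks it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class ProcessedEntry:
    """One source string plus its linguistic annotation (hypothetical format)."""
    msgid: str
    parse: str           # e.g. a bracketed parse tree
    fuzzy: bool = False  # True means "auto-generated, needs human review"


def refresh(processed: Dict[str, ProcessedEntry],
            current_msgids: Iterable[str],
            auto_parse: Callable[[str], str]) -> Dict[str, ProcessedEntry]:
    """Bring the processed catalog in line with the current POT strings.

    Entries whose msgid is unchanged are kept as they are.  Strings that
    are new (or changed, which looks the same here) get an automatic
    parse and are marked fuzzy.  Strings no longer present are dropped.
    """
    refreshed: Dict[str, ProcessedEntry] = {}
    for msgid in current_msgids:
        if msgid in processed:
            refreshed[msgid] = processed[msgid]
        else:
            refreshed[msgid] = ProcessedEntry(msgid, auto_parse(msgid), fuzzy=True)
    return refreshed


def toy_parse(msgid: str) -> str:
    """Stand-in for a real parser; it just wraps the words in one bracket."""
    return "(S " + " ".join(msgid.split()) + ")"


if __name__ == "__main__":
    old = {
        "Open file": ProcessedEntry("Open file", "(VP (VB Open) (NP (NN file)))"),
        "Removed string": ProcessedEntry("Removed string", "(NP Removed string)"),
    }
    new = refresh(old, ["Open file", "Close file"], toy_parse)
    for entry in new.values():
        print(entry)
```

   In practice auto_parse would call one of the real parsers, and the
catalog would live in PO-style files rather than in a dictionary.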

   It would also be possible to add semantic markup when there are
ambiguous words in the source strings like "directory" (contact info)
vs. "directory" (for files).  "Player" is another good example (games
vs. media players).  This part can be done automatically and more or
less reliably with simple Bayesian techniques.
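
   As an illustration of what a simple Bayesian technique could look
like here, the following is a minimal naive Bayes sketch with add-one
smoothing for the "directory" case.  The sense labels ("files",
"contacts") and the six training sentences are invented for the
example; a real system would train on many labelled strings taken
from the POT files themselves.

```python
import math
from collections import Counter, defaultdict

# Invented training data: (sense label, context sentence).
EXAMPLES = [
    ("files", "create a new directory for your files"),
    ("files", "the directory could not be opened"),
    ("files", "select the directory to save files in"),
    ("contacts", "look up a person in the directory"),
    ("contacts", "search the directory for an email address"),
    ("contacts", "add this contact to the directory"),
]


def train(examples):
    """Count how often each sense occurs and which words appear with it."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sense, text in examples:
        sense_counts[sense] += 1
        for word in text.lower().split():
            word_counts[sense][word] += 1
            vocab.add(word)
    return sense_counts, word_counts, vocab


def classify(model, text):
    """Return the sense with the highest log posterior for the given context."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    scores = {}
    for sense, count in sense_counts.items():
        score = math.log(count / total)  # log prior
        denom = sum(word_counts[sense].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denom)
        scores[sense] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(EXAMPLES)
    print(classify(model, "save the files in this directory"))
    print(classify(model, "search the directory for a person email address"))
```

   The predicted sense could then be written into the processed POT
entry as semantic markup, again subject to review by the maintainer.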

   This is, I think, all doable with existing tools.  The hard part is
then generating the target strings from the parsed English, and there
is no denying that this is very hard to do and requires big bilingual
lexicons and the like.  There are also statistical methods one could
try, but they're more reliable when you can parse the target language
too and have existing translations for training.

 Kevin