'A soundslike problem with combined English+Russian dictionary'


List:       aspell-user
Subject:    A soundslike problem with combined English+Russian dictionary
From:       Maxim Nikulin <manikulin () gmail ! com>
Date:       2021-06-22 16:56:25
Message-ID: sat4nq$tb6$1 () ciao ! gmane ! io

Hi,

I am aware that Aspell does not support multi-lingual dictionaries, 
but I think that in some particular cases it is still possible to combine a 
couple of dictionaries and get a result of reasonable quality. I have 
almost achieved what I expected for merged English and Russian word 
lists, and I am quite satisfied even with the current result. Maybe I just 
have not yet discovered a detrimental effect of the missing affix table 
for English or of the combined special characters ("-" and "'").

I was hooked by the description of the metaphone algorithm, which should 
improve the suggested corrections for misspelled words. Since I am not a 
native English speaker, I would not mind having such a feature if it helps 
me recall a word. For Russian, plain edit distance should be enough, 
so I tried to use a copy of en_phonet.dat with an added line (and an exact 
copy as well)

     remove_accents 0

that is referenced in the .dat file

     soundslike rue_phonet

To my surprise, with such a configuration the whole English alphabet is 
suggested as a replacement for a misspelled Russian word. In the following 
example the word "funetik" is taken from the manual to check that the 
phonetic rules are taken into account (another example, taff -> tough, does 
not work with the default suggestion mode).

> echo "funetik програма" | aspell -d ./rue.rws -a
> @(#) International Ispell Version 3.1.20 (but really Aspell 0.60.8)
> & funetik 26 0: fanatic, funk, fungi, Fuentes, functor, frenetic, genetic, kinetic, \
> finite, fount, fungoid, funky, lunatic, phonetic, fountain, funked, Fundy, fined, \
> founts, funded, font, fund, frantic, funkier, fount's, Fuentes's & програма \
> 100 8: программа, программ, A, B, C, D, E, F, G, H, I, J, K, L, M, \
> N, O, P, Q, R, S, T, U, V, X, Z, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, \
> r, s, t, u, v, x, z, AA, AI, AR, Ar, Au, BA, BB, BO, Ba, Be, Bi, CA, CO, Ca, Ce, \
> Ch, Ci, Co, Cu, DA, DD, DE, DI, Di, Du, Dy, ER, EU, Er, Eu, FY, Fe, GA, GE, GI, GU, \
> Ga, Ge, HI, Ha, He, Ho, IA, IE, Ia, Io, Ir, Jo, KO, KY

That is why

      soundslike generic

is my current configuration. It gives more reasonable variants for the 
Russian test word:

> & програма 13 8: программа, программ, программе, \
> программу, программы, программах, программам, \
> программка, параграмма, программою, \
> проиграна, параграмм, погрома
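
To make the difference between the two outputs easier to compare, here is a 
minimal sketch of a post-processing filter. It is not an Aspell feature and 
not part of my configuration: the helper names (is_cyrillic, 
same_script_suggestions) are hypothetical, and the sample line is a shortened 
version of the output above, so its count field no longer matches the number 
of suggestions. It only assumes the usual "& word count offset: ..." line 
format of the -a mode.

```python
import re
import unicodedata

# Matches an ispell-protocol "miss" line: "& word count offset: s1, s2, ..."
MISS_LINE = re.compile(r"^& (\S+) (\d+) (\d+): (.*)$")


def is_cyrillic(word):
    """Return True if the word contains at least one Cyrillic letter."""
    return any("CYRILLIC" in unicodedata.name(ch, "") for ch in word)


def same_script_suggestions(line):
    """Parse one '&' line and keep suggestions written in the word's script.

    Returns (word, suggestions), or None if the line is not a miss line.
    """
    match = MISS_LINE.match(line.strip())
    if match is None:
        return None
    word, _count, _offset, raw = match.groups()
    suggestions = [s.strip() for s in raw.split(",") if s.strip()]
    want_cyrillic = is_cyrillic(word)
    kept = [s for s in suggestions if is_cyrillic(s) == want_cyrillic]
    return word, kept


# Shortened sample taken from the "soundslike rue_phonet" output above.
sample = "& програма 100 8: программа, программ, A, B, C, a, b, AA, AI"
print(same_script_suggestions(sample))
```

Such a filter only hides the Latin noise in the list. It does not restore 
the Russian variants that the phonetic rules pushed out of the ranking.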

Have I done something wrong? Is it expected behavior that English 
phonetic rules have such a detrimental effect on suggestions for Russian 
words? I am unsure whether the observed result is a bug. (Actually the 
question is: "How many bugs have I run into?", with zero as a possible 
answer.)

More details on my configuration follow.

The goal is to see misspelled words highlighted in my mixed-language 
notes. Suggested corrections are appreciated as well. This has worked in 
Vim for years:

     set spelllang=en,ru spell

and I would like to have a comparable feature in Emacs

     M-x flyspell-mode RET M-x ispell-change-dictionary RET rue

without special configuration of a custom dictionary in Emacs. Side note: 
I am certainly against the idea, which I have seen once, of binding the 
ispell dictionary to the input method.

There is a feature request for multi-lingual dictionary support
https://github.com/GNUAspell/aspell/issues/448
(and a number of similar threads in the archive of this mailing list).
People are still trying to combine dictionaries:
https://unix.stackexchange.com/questions/341714/use-multi-language-dictionary-with-aspell
https://wiki.archlinux.org/title/User:Georgek
There is no section in the manual that explains the possible problems of 
this approach.

I hope that in my particular case of English and Russian it can be 
done somewhat more accurately.

- I rarely use letters with accents, so the alphabets are disjoint sets of 
characters. US-ASCII is a subset of the KOI8-R encoding.
- The cost of discarding affix data for Russian is ~30M of disk space 
(and almost certainly RAM as well). I am unsure whether I lose anything by 
ignoring the affix table for English.
- The combined "special" setting is a kind of compromise; it should be 
per-language. However, I have no example of imperfect behavior yet.
- As I said above, I would prefer phonetic rules for English, but I have 
to use the generic ones.
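
The encoding claim in the first item can be checked with a few lines of 
Python. This is only a sketch: it assumes that Python's built-in "koi8_r" 
codec matches the KOI8-R table Aspell uses, and it checks only the letters 
of the Russian alphabet, not every character of the encoding.

```python
# Bytes 0x00-0x7F must decode to the same code points in both encodings.
ascii_is_subset = all(
    bytes([b]).decode("koi8_r") == bytes([b]).decode("ascii")
    for b in range(0x80)
)

# The 66 letters of the Russian alphabet: U+0410..U+044F plus Ё and ё.
russian = "".join(chr(c) for c in range(0x410, 0x450)) + "Ёё"
russian_bytes = russian.encode("koi8_r")

# Every Russian letter must land in the upper half, away from ASCII.
disjoint = all(b >= 0x80 for b in russian_bytes)

print(ascii_is_subset, disjoint, len(russian_bytes))
```

If both flags are true, no byte of an English word can be mistaken for a 
Russian letter in the combined word list, and vice versa.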

--->8--- rue.dat begin --->8---

# Combined dictionary for English and Russian languages
#
# An attempt to create a dictionary suitable for spell checking
# of mixed-language texts.
#
# Something distinct from just "ru" and "en". Do not use a name longer
# than 3 characters, otherwise it will not appear in "aspell dump dicts"
# and thus will be ignored by other applications. Digits, e.g. "ru2",
# make the language identifier invalid as well.
name		rue
# ISO8859-1 used for "en" dictionaries is a subset of KOI8-R
# modulo accents.
# The Russian dictionary from the Ubuntu system package uses exactly KOI8-R.
charset		koi8-r
# Combine values from "ru" and "en"
special		- -*- ' -*-
# With
#
#     soundslike rue
#
# and a copy of en_phonet.dat aspell suggests
# e.g. "phonetic" for "funetik" input.
# Unfortunately it ruins scoring of corrections for Russian.
# Even with "remove_accents 0" inside "rue_phonet.dat", abundant
# single- and two-letters variants appear as alternatives.
# However a couple of top rated suggestions are still reasonable.
# A segfault may happen on an attempt to generate the master dictionary
# when "rue_phonet.dat" is missing from the current directory.
# As a compromise, prefer better quality of correction variants
# for Russian.
soundslike	generic
# Affix compression is not enabled for "en" system dictionaries.
# At the same time it saves a lot of space for the "ru" dictionary:
# the compressed dictionary is 3Mb, the expanded one consumes 30Mb
# of disk space.
#
#     aspell --lang=ru --encoding=koi8-r dump master \
#         | aspell --lang=ru --encoding=koi8-r expand \
#         | aspell --lang=ru create master ./ru.rws
#
#     aspell --lang=ru --encoding=koi8-r dump master \
#         | aspell --lang=ru --encoding=koi8-r expand \
#         | tr ' ' '\n' \
#         | aspell --lang=ru create master ./ru-expand.rws
#
affix-compress	true
# Actually this value is ignored and "rue_affix.dat" is used instead
# (a copy or symlink is required).
affix		ru

# Noticed differences:
#
#     echo "programm funetic" | aspell --lang en -a
#     & programm 5 0: program, programs, programmer, programmed, program's
#     & funetic 14 9: fanatic, frenetic, genetic, kinetic, lunatic, phonetic, frantic, fungi, Fuentes, antic, functor, fanatics, fungoid, fanatic's
#     # ------------------------------------------------------------^^^^^^^^
#
#     echo "programm funetic" | aspell --lang rue -a
#     & programm 6 0: program, programs, programmed, programmer, program's, pogrom
#     # --------------------------------------------------------------------^^^^^^
#     & funetic 5 9: fanatic, genetic, kinetic, lunatic, Fuentes
#
# The absence of "phonetic" is caused by "soundslike generic". "Pogrom" is
# present in the original "en" word list.

---8<--- rue.dat end   ---8<---

--->8--- rue.multi begin --->8---

# Combined dictionary for English and Russian languages
#
# It is not possible to just add ru.multi and en.multi
# because the languages
# inside the dictionaries differ. I am unsure whether it is safe to generate
# a dictionary for English using a modified ru.dat
# with "special ' -*-".
# Let's generate dictionaries with "rue" as a language identifier.
#
# System-wide .rws files are created on Ubuntu in postinst scripts by
# /usr/sbin/update-dictcommon-aspell and /usr/sbin/aspell-autobuildhash
# utilities. Source word lists are provided
# in /usr/share/aspell directory.
# Example of command to unpack:
#
#     zcat /usr/share/aspell/en-wo_accents-only.cwl.gz | precat
#
# E.g. en_US dictionary is combination of en-common
# (shared with e.g. en_GB)
# and en-wo_accents-only. Unsure if I need this degree of word list
# granularity, so let's try a naive approach to create word lists.
#
# "rue_affix.dat" is required despite "affix ru" line in rue.dat
#
#     ln -s /usr/lib/aspell/ru_affix.dat rue_affix.dat
#
#     aspell --lang=ru --encoding=koi8-r dump master \
#          | aspell --lang=rue create master ./rue-ru.rws
#
# Despite warnings like
#
#     # Warning: Removing inapplicable affix 'H' from word Адель.
#
# expanded word list is the same as the original one.
add rue-ru.rws
# Specify encoding to avoid UTF-8 if some accents
# will appear accidentally.
#
#     aspell --lang=en_US --encoding=iso8859-1 dump master \
#          | aspell --lang=rue create master ./rue-en_US.rws
add rue-en_US.rws

---8<--- rue.multi end   ---8<---

The commands to generate the word lists are in the comments in rue.multi. 
Finally I can run

     aspell --lang rue -a

Does such a configuration have any apparent problems? Is it possible to use 
en_phonet.dat instead of "generic" for soundslike?

