
List:       aspell-devel
Subject:    [aspell-devel] Checking of word-marginal specials
From:       Ciarán Ó Duibhín <ciaran () oduibhin ! freeserve ! co ! uk>
Date:       2013-06-20 17:08:40
Message-ID: 79AD57E6731D442087CE4B0DAE6B3B82 () InneallChiarin

This is the second part (change #2) of my consideration of apostrophes
and hyphens in aspell.  The first part (change #1) was "Tokenization of
word-initial specials", dated June 14 2013.

Currently, when *.dat marks apostrophe as valid initially, the
dictionary form well validates the token 'well (in addition to the
token well).  And, when *.dat marks apostrophe as valid finally, the
dictionary form well also validates the token well'.  However, neither
'well nor well' should ever be validated by the form well; each should
be approved only if that exact form is present in the dictionary.

There are two cases.  First, when the apostrophe is encountered in a
token in a position, initial or final, where it IS NOT valid in *.dat
(and note that this applies to en.dat), it is immediately dropped from
the token, and only the token without the apostrophe is checked against
the dictionary.  (Before change #1, even a valid initial apostrophe was
dropped from the token, but not a valid final apostrophe.)  So if the
purpose of "trying the token without the special" is to accept an
English token which has picked up a neighbouring quotation mark, that
situation never arises, and removing the code will have no effect on it.
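
As an illustration only, the dropping of invalid marginal specials
described above could be sketched as follows.  This is not aspell's
tokenizer: SpecialPos, SpecialTable and trim_marginal_specials are
hypothetical names, and the table contents are assumptions chosen to
mirror "apostrophe not valid initially or finally" (as stated for
en.dat) versus "apostrophe valid finally".

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical record of where a special character may occur in a word,
// in the spirit of the per-position flags in *.dat (an assumption, not
// aspell's own data structure).
struct SpecialPos {
    bool begin;
    bool middle;
    bool end;
};

using SpecialTable = std::map<char, SpecialPos>;

// Drop leading and trailing specials that are NOT valid in that position.
// Characters absent from the table are treated as ordinary letters.
std::string trim_marginal_specials(const std::string& raw,
                                   const SpecialTable& specials) {
    std::size_t first = 0;
    std::size_t last = raw.size();
    while (first < last) {
        auto it = specials.find(raw[first]);
        if (it == specials.end() || it->second.begin) break;  // keep it
        ++first;                                              // drop it
    }
    while (last > first) {
        auto it = specials.find(raw[last - 1]);
        if (it == specials.end() || it->second.end) break;    // keep it
        --last;                                               // drop it
    }
    return raw.substr(first, last - first);
}
```

With a table in which apostrophe is valid only medially, 'well' is
reduced to well before any dictionary lookup, so the comparison code
never sees a token with an invalid marginal apostrophe.  With a table
in which apostrophe is also valid finally, anch' survives intact.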

Second, when the apostrophe is encountered in a token in a position,
initial or final, where it IS valid in *.dat, the token should be
accepted only if the dictionary contains the word including the
apostrophe.  The current practice of accepting the token merely because
the corresponding form without the apostrophe is in the dictionary
amounts to accepting an invalid word, possibly one resulting from a
mistaken use of the apostrophe (ASCII hex 27) as a quotation mark.
(Remember that languages which accept valid word-marginal apostrophes
in *.dat do not use ASCII hex 27 as a quotation mark.)

The code for "trying the token with and without any initial or final
special" is found in procedure SensitiveCompare in
modules/speller/default/language.cpp at around line 428.  The suggested
change #2 is to remove the code which, when the token begins or ends
with a valid special and has failed to match the dictionary, compares
the token minus the special to the dictionary.  (Note again that a
token will never be found to begin or end with an INVALID special, as
that special will have been dropped during tokenization.)
Specifically, I suggest removal of the four separate lines which use
the special() function.  Having no previous experience of C++
programming, I cannot say that everything has been done which ought to
be done, but the concept has been tried and shown to work.  I do not at
present see any reason to make it conditional, i.e. I cannot see any
situation where the present behaviour is preferable.
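
The difference between the present behaviour and the suggested one can
be sketched roughly as below.  This is a simplified model and not the
code of SensitiveCompare: it ignores case handling entirely, and
Dictionary, is_special, accept_with_fallback and accept_exact_only are
hypothetical names introduced for illustration.  The set of specials
(apostrophe and hyphen) is likewise an assumption.

```cpp
#include <cstddef>
#include <set>
#include <string>

using Dictionary = std::set<std::string>;

// Hypothetical stand-in for a "valid marginal special" test.
bool is_special(char c) { return c == '\'' || c == '-'; }

// Present behaviour as described in this message: if the exact token is
// not in the dictionary, retry without a leading and/or trailing special.
bool accept_with_fallback(const std::string& token, const Dictionary& dict) {
    if (dict.count(token) > 0) return true;
    const std::size_t n = token.size();
    const bool lead = n > 1 && is_special(token.front());
    const bool trail = n > 1 && is_special(token.back());
    if (lead && dict.count(token.substr(1)) > 0) return true;
    if (trail && dict.count(token.substr(0, n - 1)) > 0) return true;
    if (lead && trail && n > 2 && dict.count(token.substr(1, n - 2)) > 0)
        return true;
    return false;
}

// Suggested behaviour (change #2): only the exact token is looked up.
bool accept_exact_only(const std::string& token, const Dictionary& dict) {
    return dict.count(token) > 0;
}
```

With a dictionary containing well, anch' and 'tis, the fallback version
accepts 'well and well' on the strength of well, while the exact-only
version rejects both and still accepts anch' and 'tis.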

This suggestion will enable a language like Italian, for example, to
have a new it.dat in which word-final apostrophe is allowed, and
non-words like anch may be replaced in the dictionary by anch'.  Even
for English, a new en.dat allowing marginal apostrophes and a new
dictionary (with, for example, 'twas and 'twill in place of twas and
twill, and adding 'tis and 'twould) could produce an improvement, but
only with English texts in which an encoding distinction has been made
between apostrophe and quotation mark.  The main beneficiaries of the
suggestion will be among languages other than English.

As before, my experiments have been conducted using the Hatier port of
aspell for Windows at
http://www.niversoft.com/downloads/aspell-0.60.5-msvc.tar.bz2 .

Third and final part to follow.

Ciarán Ó Duibhín

