'[ICU4J-discussion] Re: Line breaking aaa(aaa: ICU 1.7'


List:       icu4j-discussion
Subject:    [ICU4J-discussion] Re: Line breaking aaa(aaa: ICU 1.7
From:       "Mark Davis" <mark.davis () us ! ibm ! com>
Date:       2001-02-21 23:28:27

That text is messy, I agree. However, it says (somewhere) that the regular
expression version is the reference; the pair implementation is only an
approximation to it.

Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014



"Edward J. Batutis" <ejbatutis@yahoo.com>@dwoss.lotus.com on 02-15-2001
19:40:23

Sent by:  owner-icu@dwoss.lotus.com


To:   Eric Mader/Cupertino/IBM@IBMUS
cc:   Alan Liu <alan@finwhale.com>, icu@dwoss.lotus.com,
      icu4j-discussion@dwoss.lotus.com
Subject:  Re: Line breaking aaa(aaa: ICU 1.7




--- Eric Mader <ermader@us.ibm.com> wrote:
>
> Ed,
>
> In general, the pair table approach doesn't work
> quite as well as regular
> expressions because there are cases where you need
> more context than the
> two surrounding characters. (cf. the last paragraph
> on page 125 - section
> 5.15 right before "Example Specifications.")
>

I've re-implemented the line breaking rules based on
the line breaking properties file on unicode.org and
based partially on UTR 14. My new line??.brk files
implement line breaking that is closer to UTR 14.
After some additional testing, I hope to contribute
them to ICU/ICU4J.

In any case, after struggling with it for several
days, I'm not too happy with UTR 14.

UTR 14 attempts to describe proper line breaking using
both regular expressions and pairs, but it is clear
that the author had a pair implementation in mind. He
tries to break some of the regular expressions down
into pairs, but admits that this is only approximate.
On the other hand, although the regular expressions can
be implemented using a regular expression engine, the
pairs cannot (at least not with the ICU engine). The
result is a description of line breaking that isn't
entirely satisfactory for either a pair table
implementation or a regular expression implementation.
I would rather see a spec that aims clearly at one
target - or both targets separately - and hits it
directly. Line breaking is inherently a bit messy, and
ideally it would vary with, among other things, the
content of the text being processed. Even so, a
clearer and more implementable description of line
breaking for general text should be possible. I've
attempted to contact the author and will try to
forward my comments on to him directly.
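
To illustrate the point quoted above, that a pair table
sometimes needs more context than the two surrounding
characters, here is a minimal sketch in Python. The
class names, the toy pair table and the single context
rule are assumptions made up for this example (loosely
modeled on "do not break after an opening punctuation,
even after spaces"); they are not the actual UTR 14
tables or the ICU rule files.

```python
import re

# Toy character classes (hypothetical; not the real UTR 14 property data).
def char_class(ch):
    if ch == "(":
        return "OP"   # opening punctuation
    if ch == ")":
        return "CL"   # closing punctuation
    if ch == " ":
        return "SP"   # space
    return "AL"       # everything else is treated as alphabetic

# Toy pair table: True means a break is allowed between the two classes.
# Any pair that is not listed is treated as "no break".
PAIR_BREAK = {
    ("SP", "AL"): True,
    ("SP", "OP"): True,
}

def pair_breaks(text):
    """Break offsets decided from only the two characters around each position."""
    return [
        i for i in range(1, len(text))
        if PAIR_BREAK.get((char_class(text[i - 1]), char_class(text[i])), False)
    ]

# Context rule written as a regular expression over the text before the
# candidate break: spaces that directly follow "(" do not allow a break.
NO_BREAK_AFTER_OPEN = re.compile(r"\( +$")

def context_breaks(text):
    """Pair-table breaks, minus those forbidden by the wider context rule."""
    return [i for i in pair_breaks(text) if not NO_BREAK_AFTER_OPEN.search(text[:i])]

if __name__ == "__main__":
    sample = "aaa ( aaa"
    print(pair_breaks(sample))     # offsets from the pair table alone
    print(context_breaks(sample))  # offsets after applying the context rule
```

With the sample "aaa ( aaa", the pair lookup sees only
(SP, AL) at the second space and allows a break there,
while the context rule also looks back to the "(" and
suppresses it.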

=Ed





