'Re: RFC: space vs. time vs. functionality in \N{name} loose matching'


List:       perl5-porters
Subject:    Re: RFC: space vs. time vs. functionality in \N{name} loose matching
From:       Jim Cromie <jim.cromie () gmail ! com>
Date:       2010-07-31 5:33:10
Message-ID: AANLkTikOPuBpfQV=oKRpdoCXJjng7eGvH+tYck0sN2Ag () mail ! gmail ! com

On Fri, Jul 30, 2010 at 2:17 PM, karl williamson
<public@khwilliamson.com> wrote:
> John Imrie wrote:
>>
>> Aren't we overcomplicating this, or have I misunderstood something?
>>

I'm also second-guessing, but on a different basis.

If you pass hv_store() a precomputed hash rather than 0, you could compute
that hash from the compressed string (approximately s/\b[- _]\b//g) but store
the canonical name.
You'd have to use the compressed hash for lookup too,
but on a casual reading of the thread, I think it covers your needs.
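
To make that concrete, here is a rough pure-Perl analogue, not the XS
version: Perl-level code can't hand hv_store() a precomputed hash, so this
sketch just keys the hash on the squeezed name and keeps the canonical name
in the value. squeeze(), loose_lookup(), %by_loose and the two sample entries
are all made up for illustration, and squeeze() is deliberately simplified
(it drops every hyphen, medial or not).

```perl
use strict;
use warnings;

# Hypothetical helper: fold case and drop spaces, underscores and hyphens.
# Simplified: it does not distinguish medial from non-medial hyphens.
sub squeeze {
    my ($name) = @_;
    (my $key = uc $name) =~ s/[- _]//g;
    return $key;
}

# Key on the squeezed form, but store the canonical name and code point.
my %by_loose;
for my $entry ( [ 0x200B, 'ZERO WIDTH SPACE' ], [ 0x0031, 'DIGIT ONE' ] ) {
    my ($cp, $canonical) = @$entry;
    $by_loose{ squeeze($canonical) } = { name => $canonical, code => $cp };
}

# Lookup squeezes the input the same way before probing the hash.
sub loose_lookup {
    my ($input) = @_;
    return $by_loose{ squeeze($input) };
}

my $hit = loose_lookup('zero-width space');
printf "U+%04X %s\n", $hit->{code}, $hit->{name} if $hit;
```

The canonical spelling survives in the value, so a viacode()-style reverse
lookup could still report the official name.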

If it does, then perhaps the XS bits can be buried under/behind some
hv-related magic (wild speculation here).

      hv_store
               Stores an SV in a hash.  The hash key is specified as "key"
               and "klen" is the length of the key.  The "hash" parameter is
               the precomputed hash value; if it is zero then Perl will
               compute it.  The return value will be NULL if the operation
               failed or if the value did not need to be actually stored
               within the hash (as in the case of tied hashes).  Otherwise
               it can be dereferenced to get the original "SV*".  Note that
               the caller is responsible for suitably incrementing the
               reference count of "val" before the call, and decrementing it
               if the function returned NULL.  Effectively a successful
               hv_store takes ownership of one reference to "val".  This is
               usually what you want; a newly created SV has a reference
               count of one, so if all your code does is create SVs then
               store them in a hash, hv_store will own the only reference to
               the new SV, and your code doesn't need to do anything further
               to tidy up.  hv_store is not implemented as a call to
               hv_store_ent, and does not create a temporary SV for the key,
               so if your key data is not already in SV form then use
               hv_store in preference to hv_store_ent.

(no other comments follow)


>> From http://www.unicode.org/reports/tr44/#Matching_Rules
>>
>>
>>        Character Names
>>
>> Unicode character names constitute a special case. Formally, they are
>> values of the Name property. While each Unicode character name for an
>> assigned character is guaranteed to be unique, names are assigned in such a
>> way that the presence or absence of spaces cannot be used to distinguish
>> them. Furthermore, implementations sometimes create identifiers from Unicode
>> character names by inserting underscores for spaces. For best results in
>> comparing Unicode character names, use loose matching rule UAX44-LM2.
>>
>> /*UAX44-LM2.*/ Ignore case, whitespace, underscore ('_'), and all medial
>> hyphens except the hyphen in U+1180 HANGUL JUNGSEONG O-E.
>>
>>    * "zero-width space" is equivalent to "ZERO WIDTH SPACE" or
>>      "zerowidthspace"
>>    * "character -a" is /not/ equivalent to "character a"
>>
>> So the code in mktables needs to create names that have had the spaces,
>> underscores, and medial hyphens removed (except as noted), with the result
>> then uppercased.
>>
>> When processing the \N{ whatever } all we have to do is follow the above
>> rules to generate a normalized name.
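
A minimal sketch of that normalization in Perl, for reference. The function
name lm2_normalize() is invented here, and "medial hyphen" is read as "a
hyphen with a letter or digit on both sides", which is an assumption about
how to interpret the rule rather than anything mktables does today.

```perl
use strict;
use warnings;

# Hypothetical normalizer following the UAX44-LM2 wording quoted above.
sub lm2_normalize {
    my ($name) = @_;
    my $n = uc $name;

    # The one exception: HANGUL JUNGSEONG O-E keeps its hyphen so that it
    # stays distinct from HANGUL JUNGSEONG OE.
    unless ( $n =~ /\A\s*HANGUL[\s_]*JUNGSEONG[\s_]*O-E\s*\z/ ) {
        # Drop medial hyphens only (alphanumeric on both sides).
        $n =~ s/(?<=[A-Z0-9])-(?=[A-Z0-9])//g;
    }

    # Whitespace and underscores are always ignored.
    $n =~ s/[\s_]+//g;
    return $n;
}

print lm2_normalize('zero-width space'), "\n";
print lm2_normalize('character -a'), "\n";
```

With this reading, "character -a" keeps its hyphen because a space precedes
it, so it does not collapse onto "character a".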
>>
>> I don't know where in the perl C code \N{} is processed but I hope it's
>> not too difficult to process this; certainly it could be written in Perl
>> very easily.
>>
>> John
>
> The problem is that we have to look-up in both directions.  viacode() takes
> a code point number and returns the official Unicode name.  We want that
> official name to have the correct spaces and hyphens.  We don't want it to
> be "ZEROWIDTHSPACE", for example.  The only reasonable way to do this is to
> have the official name stored correctly.  That means we have to have a table
> with all the correct official names. There's no getting around that.
>
> What is done currently, is that that same table is used for look ups in the
> other direction, for vianame() or \N{}, which take a name and find its code
> point.  This allows the table to be dual-purposed, but doesn't lend itself
> to loose matching.  Hence there is no loose matching currently.
>
> Retaining that, i.e., do nothing, is my option 1) in the proposal.
>
> What you suggest is my option 4).  And that is to create a second table.  It
> would have white space and medial hyphens squeezed out.  Then it becomes a
> simple matter of squeezing the input name similarly and looking for an exact
> match.
>
> The downside of this option is that it requires a second huge table.  I
> would change mktables to generate both.  Recall that we have to have a table
> with the correct names in it for the viacode case.  Things could be
> structured so that the corresponding table is loaded only if its function is
> called.  That means that only programs that do look ups in both directions
> would be penalized.  And the lookup-by-name table is 6% smaller than the
> non-squeezed one, so programs that look up only by name would gain.
>
> Further, once the tables were decoupled, we could do things that would speed
> up performance even more.  The tables could be stored as hashes, with some
> overhead; or as sorted arrays for a binary search, I presume with less
> overhead than the hash case.
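
For the sorted-array variant, something like the following is what I picture.
It is only a sketch: the three-entry @table, its [squeezed name, code point]
layout and find_code_point() are assumptions made for the example, not the
format mktables would actually emit.

```perl
use strict;
use warnings;

# Toy table of [squeezed name, code point] pairs, sorted by squeezed name.
my @table = sort { $a->[0] cmp $b->[0] } (
    [ 'ZEROWIDTHSPACE',    0x200B ],
    [ 'DIGITONE',          0x0031 ],
    [ 'LATINSMALLLETTERA', 0x0061 ],
);

# Plain binary search on the squeezed name; returns undef when not found.
sub find_code_point {
    my ($squeezed) = @_;
    my ( $lo, $hi ) = ( 0, $#table );
    while ( $lo <= $hi ) {
        my $mid = int( ( $lo + $hi ) / 2 );
        my $cmp = $squeezed cmp $table[$mid][0];
        return $table[$mid][1] if $cmp == 0;
        if   ( $cmp < 0 ) { $hi = $mid - 1 }
        else              { $lo = $mid + 1 }
    }
    return;
}

my $cp = find_code_point('ZEROWIDTHSPACE');
printf "U+%04X\n", $cp if defined $cp;
```

The caller would squeeze the user's input first and then search for an exact
match, as described for option 4.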
>
> The way I am able to do loose matching with just a single table is the
> following.  The table is actually a giant string, and the way things are
> structured is the input is a pattern, and the code looks like
>  $huge_table =~ /$input/;
> Then $-[0] and $+[0] are used to find where it matched.
> What I can do to get loose matching is to change $input to not be just a
> straight string, but to be a real pattern.  So either 'DIGIT ONE' or
> 'd i-gito- -ne' input both would be transformed into
>  $input = 'D[ -]?I[ -]?G[ -]?I[ -]?T[ -]?O[ -]?N[ -]?E[ -]?'
> That is what I've implemented, and it slows lookups down by a factor of 2-3.
> This is option 2) in my proposal.  But, remember that the results are
> cached in a hash, so the same lookup later would avoid all this.
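
Here is how I read option 2, as a small runnable sketch. The toy $huge_table
with one "NAME<tab>HEXCODE" record per line is an assumption about the table
layout, and loose_pattern(), loose_vianame() and %cache are names invented
for the example, not what the real implementation uses.

```perl
use strict;
use warnings;

# Toy stand-in for the real table: "NAME\tHEXCODE\n" records in one string.
my $huge_table = join '', map { "$_->[0]\t$_->[1]\n" } (
    [ 'DIGIT ONE',        '0031' ],
    [ 'ZERO WIDTH SPACE', '200B' ],
    [ 'HYPHEN-MINUS',     '002D' ],
);

my %cache;

# Squeeze the input, then allow an optional space or hyphen after each char.
sub loose_pattern {
    my ($input) = @_;
    ( my $squeezed = uc $input ) =~ s/[- _]//g;
    return join '', map { quotemeta($_) . '[ -]?' } split //, $squeezed;
}

# Returns the code point, or undef if nothing matched; results are cached.
sub loose_vianame {
    my ($input) = @_;
    my $pattern = loose_pattern($input);
    return $cache{$pattern} if exists $cache{$pattern};

    my $code;
    if ( $huge_table =~ /^$pattern\t/m ) {
        # $+[0] is the offset just past the match (here, past the tab).
        my $end      = $+[0];
        my $line_end = index( $huge_table, "\n", $end );
        $code = hex substr( $huge_table, $end, $line_end - $end );
    }
    return $cache{$pattern} = $code;
}

print loose_pattern('d i-gito- -ne'), "\n";
printf "U+%04X\n", loose_vianame('digit-one');
```

Anchoring the pattern at the start of a line and at the tab keeps a short
name from matching in the middle of a longer one.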
>
> And finally option 3) is option 2) but only if the user included the
> ":loose" parameter in the pragma call.
>
