List:       perl-unicode
Subject:    Re: Word boundaries
From:       Zbigniew_Łukasiak <zzbbyy () gmail ! com>
Date:       2012-03-27 12:21:50
Message-ID: CAGL_UUs88EgeenesgwpCbwainTnaVFY=_YHo2F_8U-Z7EJpfdQ () mail ! gmail ! com

On Mon, Mar 26, 2012 at 12:57 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <daxim@cpan.org> wrote:
> Let the regex engine help you advance the character counter.
>
>      $ cat langs
>      ΕλληνικάEnglish한국어日本語Русскийไทย
>
> ----
>
>      $ cat langs.pl
>      use 5.010;
>      use strictures;
>      use Unicode::UCD qw(charinfo);
>
>      sub script {
>            return charinfo(ord substr($_[0], 0, 1))->{script}
>      };
>
>      # necessary because pos() magic is tracked on the scalar.
>      my $copy = $_;
>      while (/(\X)/g) {
>            my $script = script $1;
>            my ($part) = $copy =~ /(\p{$script}+)/;
>            say $part;
>            pos($_) = pos($_) + length($part);
>      }

Thanks a lot!

Here is the first version of my tokenizer based on this idea:


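use strict;
use warnings;
use Unicode::UCD qw(charinfo);
use Lingua::ZH::MMSEG;    # provides mmseg()

sub tokenize {
    my $text = shift;
    my @tokens;
    # \X matches one extended grapheme cluster; it starts a new run.
    while ( $text =~ /(\X)/g ) {
        my $part = $1;
        # ord() looks only at the first code point of the cluster.
        my $script = charinfo( ord $part )->{script};
        # Continue from pos($text) and consume the rest of the run in
        # the same script. \G anchors at the current position; the
        # match may be empty, so it always succeeds.
        $text =~ /\G(\p{$script}*)/g;
        # Whitespace, digits and punctuation are 'Common': skip the run.
        next if $script eq 'Common';
        $part .= $1;
        if ( $script eq 'Han' ) {
            # Segment Chinese text into words.
            push @tokens, mmseg($part);
        }
        else {
            push @tokens, $part;
        }
    }
    return @tokens;
}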

And the surprise: this works even without further splitting on
whitespace, because spaces and punctuation such as dots all have the
'Common' script and so are not matched by \p{Latin}.
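
To check the point about 'Common', here is a small sketch that uses only
the core Unicode::UCD module. The helper name script_of and the sample
string are made up for illustration; they are not part of the tokenizer
above. It prints the script that charinfo() reports for a few characters
and then shows that \p{Latin}+ stops at the comma, the space and the dot.

```perl
use 5.010;
use strict;
use warnings;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: return the Unicode script name of the first
# code point of $char, or undef if charinfo() knows nothing about it.
sub script_of {
    my ($char) = @_;
    my $info = charinfo( ord $char );
    return $info ? $info->{script} : undef;
}

# A Latin letter, a space, a dot, a Cyrillic letter and a Han ideograph.
for my $char ( 'a', ' ', '.', "\x{0416}", "\x{65E5}" ) {
    printf "U+%04X => %s\n", ord $char, script_of($char) // 'unknown';
}

# Space and punctuation are not \p{Latin}, so they act as separators.
my @runs = 'Hello, world.' =~ /(\p{Latin}+)/g;
say join '|', @runs;    # prints "Hello|world"
```

The same lookup explains why the tokenizer can simply skip any run whose
script is 'Common' instead of splitting on whitespace first.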

-- 
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/