List: perl-unicode
Subject: Re: Word boundaries
From: Zbigniew_Łukasiak <zzbbyy () gmail ! com>
Date: 2012-03-27 12:21:50
Message-ID: CAGL_UUs88EgeenesgwpCbwainTnaVFY=_YHo2F_8U-Z7EJpfdQ () mail ! gmail ! com
On Mon, Mar 26, 2012 at 12:57 PM, Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯 <daxim@cpan.org> wrote:
> Let the regex engine help you advance the character counter.
>
> $ cat langs
> ΕλληνικάEnglish한국어日本語Русскийไทย
>
> ----
>
> $ cat langs.pl
> use 5.010;
> use strictures;
> use Unicode::UCD qw(charinfo);
>
> sub script {
>     return charinfo(ord substr($_[0], 0, 1))->{script}
> };
>
> # necessary because pos() magic is tracked on the scalar.
> my $copy = $_;
> while (/(\X)/g) {
>     my $script = script $1;
>     my ($part) = $copy =~ /(\p{$script}+)/;
>     say $part;
>     pos($_) = pos($_) + length($part);
> }
Thanks a lot!
Here is the first version of my tokenizer based on this idea:
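use Unicode::UCD qw(charinfo);
use Lingua::ZH::MMSEG;

sub tokenize {
    my $text = shift;
    my @tokens;
    while ( $text =~ /(\X)/g ) {
        my $part   = $1;
        # Script of the first code point of the grapheme cluster.
        my $script = charinfo( ord $part )->{script};
        # Consume the rest of the same-script run. The * lets this match
        # always succeed, so pos($text) is never reset by a failure.
        $text =~ /\G(\p{$script}*)/g;
        # Spaces, punctuation and digits are 'Common': skip them.
        next if $script eq 'Common';
        $part .= $1;
        if ( $script eq 'Han' ) {
            # Han runs have no spaces, so let MMSEG segment them.
            push @tokens, mmseg($part);
        }
        else {
            push @tokens, $part;
        }
    }
    return @tokens;
}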
And a pleasant surprise: this works without any further splitting,
because spaces and punctuation are all assigned the 'Common' script
and so are not matched by \p{Latin}.
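
To make the 'Common' behaviour easy to check without installing
Lingua::ZH::MMSEG, here is a minimal sketch of the same run-splitting
idea using only core modules. The names script_of and script_runs are
made up for this illustration, it does no word segmentation inside a
run, and it assumes precomposed input (a decomposed combining mark has
script 'Inherited' and would end up as its own run).

```perl
use strict;
use warnings;
use utf8;
use Unicode::UCD qw(charinfo);

# Hypothetical helper: script of the first code point of a string.
sub script_of {
    my ($char) = @_;
    return charinfo( ord $char )->{script};
}

# Hypothetical helper: split text into runs of a single script,
# dropping 'Common' characters (spaces, punctuation, digits).
sub script_runs {
    my ($text) = @_;
    my @runs;
    # /c keeps pos($text) in place when a match fails, so the two
    # regexes can take turns advancing through the same scalar.
    while ( $text =~ /\G(\X)/gc ) {
        my $run    = $1;
        my $script = script_of($run);
        next if $script eq 'Common';
        # Extend the run while the following characters share the script.
        $run .= $1 if $text =~ /\G(\p{Script=$script}+)/gc;
        push @runs, $run;
    }
    return @runs;
}

binmode STDOUT, ':encoding(UTF-8)';
print join( '|', script_runs("Hello, world. Привет мир") ), "\n";
```

This should print Hello|world|Привет|мир: the comma, the full stop and
the spaces are all 'Common', so they separate the runs and are dropped.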
--
Zbigniew Lukasiak
http://brudnopis.blogspot.com/
http://perlalchemy.blogspot.com/