'[xsl] Re: Tokenize question: tokenize on words, spaces and punctuation'


List:       xsl-list
Subject:    [xsl] Re: Tokenize question: tokenize on words, spaces and punctuation
From:       Martin Holmes <mholmes () uvic ! ca>
Date:       2011-03-17 4:27:31
Message-ID: ils2ji$vt4$2 () dough ! gmane ! org

This looks perfect. I'm actually dealing with relatively modern French, 
so I think the Unicode character categories should work fine.

Thanks indeed,
Martin
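
For anyone who wants to sanity-check the category logic outside an XSLT processor, here is a minimal Python sketch of the same idea as the regex quoted below. It is an illustration under stated assumptions, not Brandon's stylesheet: it walks the string one character at a time with `unicodedata` rather than using a regex (Python's built-in `re` module has no `\p{...}` support), and the names `char_class` and `tokens` are made up for this example. Results may also differ slightly between Unicode versions.

```python
import unicodedata


def char_class(ch: str) -> str:
    """Classify a character by the first letter of its Unicode general category."""
    major = unicodedata.category(ch)[0]
    if major == "Z":
        return "space"   # separators, like \p{Z}
    if major in ("P", "C"):
        return "punct"   # punctuation and "other", like [\p{P}\p{C}]
    return "word"        # everything else, which is how XSD regex defines \w


def tokens(text: str) -> list[str]:
    """Split text into runs of word, punctuation/other, and separator characters."""
    out: list[str] = []
    prev = None  # class of the token currently being built
    for ch in text:
        cls = char_class(ch)
        # A hyphen directly after word characters stays inside the word,
        # mirroring the [-\w]* part of the quoted regex.
        if ch == "-" and prev == "word":
            cls = "word"
        if cls == prev:
            out[-1] += ch
        else:
            out.append(ch)
            prev = cls
    return out


print(tokens("Oh, what a fun-filled day!"))
# ['Oh', ',', ' ', 'what', ' ', 'a', ' ', 'fun-filled', ' ', 'day', '!']
```

Because the three classes together cover every character, joining the tokens back together reproduces the input. Note that an apostrophe falls in the punctuation class, so French elisions such as "l'été" come out as three tokens ('l', "'", 'été'), which may or may not be what is wanted.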

On 11-03-16 09:19 PM, Brandon Ibach wrote:
> The main trick here seems to be simply constructing an appropriate
> character class for each type of token and then matching sequences of
> one or more of each.
>
> The following does just that, though it also tosses in a twist to
> handle words with embedded dashes, so that the dash won't break the
> word into three separate tokens.  Further adjustments along those
> lines may be needed, depending on your requirements.
>
> The use of Unicode character categories for the character classes
> should ensure that this works for most languages, I think, though
> non-English languages aren't my strong suit, so I make no guarantees.
> :)
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                  version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"
>                  xmlns:f="urn:stylesheet-func" exclude-result-prefixes="xs f">
>      <xsl:output method="text"/>
>      <xsl:param name="s" select="'Oh, what a fun-filled day!'"/>
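>      <xsl:function name="f:tokens" as="xs:string*">
>          <xsl:param name="string"/>
>          <!-- Three alternatives: a word (\w, then any mix of '-' and \w),
>               a run of punctuation/other characters (\p{P}, \p{C}), or a
>               run of separators (\p{Z}). The outer {'...'} is an attribute
>               value template holding a string literal, so the braces in
>               the regex need no doubling. -->
>          <xsl:analyze-string select="$string"
>                              regex="{'\w[-\w]*|[\p{P}\p{C}]+|\p{Z}+'}">
>              <xsl:matching-substring>
>                  <xsl:sequence select="."/>
>              </xsl:matching-substring>
>          </xsl:analyze-string>
>      </xsl:function>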
>      <xsl:template match="/">
>          <xsl:text>('</xsl:text>
>          <xsl:value-of select="f:tokens($s)" separator="', '"/>
>          <xsl:text>')</xsl:text>
>      </xsl:template>
> </xsl:stylesheet>
>
> -Brandon :)
>
>
> On Wed, Mar 16, 2011 at 8:33 PM, Martin Holmes <mholmes@uvic.ca> wrote:
>> Hi there,
>>
>> This is really a question for XPath regex gurus:
>>
>> I need to tokenize a string of text such that words, punctuation and spaces
>> are split. So from this:
>>
>> Oh, what a great day!
>>
>> I need to get:
>>
>> ('Oh', ',', ' ', 'what', ' ', 'a', ' ', 'great', ' ', 'day', '!')
>>
>> I've been hacking away at this for a while, but regexps aren't my strong
>> suit. Can anyone help?
>>
>> Cheers,
>> Martin
>>
>>
>> --~------------------------------------------------------------------
>> XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
>> To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
>> or e-mail: <mailto:xsl-list-unsubscribe@lists.mulberrytech.com>
>> --~--
>>
>>
>
>
>



