List:       xml4lib
Subject:    [XML4Lib] regex matching problem
From:       Dana Pearson <dbpearsonmlis () GMAIL ! COM>
Date:       2014-07-17 1:16:52
Message-ID: CA+g3ULutk-KKgW2SFyJJv7kizYcJ_-nYCjmMyZYeH0FR=YjooQ () mail ! gmail ! com

I am stumped on a regular expression to capture the dimensions of 2
photographs, the original and the 'access image'.

The source has a string whose variation is represented by the following 3
examples.

<udf21>Scanned on Epson 10000 XL, with Adobe Photoshop.  1100 dpi.
5235x4152 pixels, access image 3000 x1322 pixels.</udf21>

<udf21>Scanned on Epson 10000 XL, with Adobe Photoshop.  3300 dpi.
1795x5100 pixels, access image 982x3000 pixels.</udf21>

<udf21>Scanned on Epson 10000 XL, with Adobe Photoshop.  5070 x 3344
pixels, access image 3000x1979 pixels.</udf21>

The strings are relatively uniform, except that some do not have a number
followed by 'dpi', and there are sometimes spaces before and/or after the
'x' in the string I'm trying to capture with analyze-string (XSL 2.0).

All dimensions have 4 digits except one (2nd example, access image 982x3000
pixels).

I did not anticipate this being a difficult regex problem, but I cannot
find the solution.

The following regex is as close as I can come to matching all 94 instances
of element udf21.

My regex (curly brackets doubled for XPath):

(.*)(\d{{4}}\s?x\s?\d{{4}})(.*)(\d{{3,4}}\s?x\s?\d{{4}})(.*)

This is perfect for the second example, but for all the others group 4 keeps
only the 2nd, 3rd, and 4th of the 4 digits in the first number:

000 x1322
982x3000
000x1979

<xsl:analyze-string
 select="."
 regex="(.*)(\d{{4}}\s?x\s?\d{{4}})(.*)(\d{{3,4}}\s?x\s?\d{{4}})(.*)">
<xsl:matching-substring>
<subfield code="d"><xsl:value-of select="regex-group(4)"/></subfield>
</xsl:matching-substring>
</xsl:analyze-string>

I have also tried:
(.*)(\d{4}\s?x\s?\d{4})(.*)(\d\d\d\d?\s?x\s?\d{4})(.*)

The result is the same.

I think I have run up against a subtlety of regular expressions beyond my
understanding.
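
One likely explanation, offered as a sketch rather than a confirmed fix: the
third group, (.*), is greedy, so it takes as much text as it can and then gives
back only the minimum that \d{{3,4}} needs, which is 3 digits. The leading
digit of a 4-digit width therefore ends up in group 3 instead of group 4.
Making that one group reluctant, (.*?), should let group 4 start at the first
digit. The version below assumes an XSLT 2.0 processor (whose regex dialect
allows reluctant quantifiers). The stylesheet wrapper, the match="udf21"
template, and the normalize-space() call are additions for illustration, not
part of the original transform.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical template; the real stylesheet may select udf21 differently. -->
  <xsl:template match="udf21">
    <!-- normalize-space() is an assumption: it collapses line breaks so that
         "." (which does not match a newline by default) can span the value. -->
    <xsl:analyze-string
        select="normalize-space(.)"
        regex="(.*)(\d{{4}}\s?x\s?\d{{4}})(.*?)(\d{{3,4}}\s?x\s?\d{{4}})(.*)">
      <xsl:matching-substring>
        <!-- Group 3 is now reluctant, so group 4 begins at the first digit
             of the access image width. -->
        <subfield code="d"><xsl:value-of select="regex-group(4)"/></subfield>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Traced by hand against the three examples above, group 4 should come out as
"3000 x1322", "982x3000", and "3000x1979". The first (.*) can stay greedy,
because the later groups still need a second dimension pair after it and so
force it back to the original image's dimensions.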

thanks,
dana


-- 
Dana Pearson
dbpearsonmlis.com
Metadata and Bibliographic Services for Libraries

================================

To unsubscribe: http://bit.ly/xml4lib

XML4Lib Web Site: http://xml4lib.org/

2014-07-16
