'Re: Lua pattern to match an url'


List:       lua-l
Subject:    Re: Lua pattern to match an url
From:       Philippe Verdy <verdy_p () wanadoo ! fr>
Date:       2019-12-25 23:13:43
Message-ID: CAGa7JC35i2idVFVf_FUggscg=iGzobf2h5DXuB-yRXsJaRraqA () mail ! gmail ! com

Wrong: there are typically quotation marks around attribute values, and the
link anchor tag is then followed by link text, which can be "MP3", so you
would match too much.
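
To illustrate the over-match, here is a minimal sketch. The sample markup and
the example.com URL are made up for illustration, and the second pattern is
only one possible tightening, not a general URL matcher:

```lua
-- Hypothetical sample: a quoted href followed by link text ending in ".mp3".
local html = '<a href="http://example.com/a.mp3">a.mp3</a>'

-- %S+ is greedy and happily runs across the closing quote and the tag end.
local greedy = html:match("http://%S+%.mp3")

-- Excluding whitespace, quotes and angle brackets keeps the match inside
-- the attribute value (still a heuristic, not an RFC 3986 parser).
local bounded = html:match("http://[^%s\"'<>]+%.mp3")

print(greedy)   -- http://example.com/a.mp3">a.mp3
print(bounded)  -- http://example.com/a.mp3
```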
The URL can use various extensions (including with variable letter case).
Also note that the dot pattern does match spaces. And the .mp3 links can
also use non-ASCII characters (not necessarily URL-encoded), and you cannot
safely guess which text encoding is used in the path or query string of the
URL, as it is not necessarily the same as the encoding of the HTML page
itself (including when it is URL-encoded to become ASCII). URLs are designed
to be opaque for most purposes, except that HTTP(S) URLs are designed to be
hierarchical and give special meaning only to "/" and to ".." relative
references; "." alone is supported by the target filesystem on which the
web server is installed, but it is not needed for HTTP(S), which defines its
own filesystem-like space with web semantics, not the local-OS semantics of
the server. Besides that, path elements in HTTP are opaque binary: they do
not allow any control or whitespace characters that have not been
URL-encoded in %xx hexadecimal form, and they carry no semantics of their
own; semantics are conveyed separately in MIME type headers.
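
As a rough sketch of the two points above (opaque bytes and variable letter
case), something like the following could be used. The function names
percent_decode and has_mp3_extension are made up here, and the result of
decoding is deliberately treated as raw bytes with no assumed text encoding:

```lua
-- Turn %xx escapes into raw bytes. The result is an opaque byte string;
-- nothing here assumes UTF-8 or any other text encoding.
local function percent_decode(s)
  return (s:gsub("%%(%x%x)", function(hex)
    return string.char(tonumber(hex, 16))
  end))
end

-- Case-insensitive test on the end of a path (".mp3", ".MP3", ".Mp3", ...).
-- This only inspects the name; it says nothing about the actual content.
local function has_mp3_extension(path)
  return path:lower():match("%.mp3$") ~= nil
end

print(percent_decode("a%20b"))                       -- a b
print(has_mp3_extension("/music/Track%2001.Mp3"))    -- true
print(has_mp3_extension("/music/track.mp3.html"))    -- false
```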
Once again, you need to stick to the URI RFC. Don't reinvent the wheel:
there are already tons of URL parsers. Also, many MP3s on the web are never
accessible through a URL ending with ".mp3" in its path or query string,
and a URL ending in ".mp3" may just as well return no valid MP3 at all, but
plain HTML, plain text, or another file format (as properly indicated by
the MIME type header and the HTTP status code).
So please read the RFC: https://tools.ietf.org/html/rfc3986
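
For orientation only, here is a sketch of the generic component split that
RFC 3986 describes (scheme, authority, path, query, fragment), written with
plain Lua patterns. The function name split_uri is hypothetical, it does no
validation, and an existing URL parser remains the better choice in real
code:

```lua
-- Split a URI reference into its five generic components.
-- Absent components are left as nil; the path is always a string.
local function split_uri(uri)
  local parts, rest = {}, uri

  -- The fragment starts at the first "#".
  local head, fragment = rest:match("^([^#]*)#(.*)$")
  if head then rest, parts.fragment = head, fragment end

  -- The query starts at the first "?" of what remains.
  local before, query = rest:match("^([^?]*)%?(.*)$")
  if before then rest, parts.query = before, query end

  -- The scheme is everything before the first ":", if no "/" precedes it.
  local scheme, after = rest:match("^([^:/?#]+):(.*)$")
  if scheme then parts.scheme, rest = scheme, after end

  -- The authority follows "//" and runs up to the next "/".
  local authority, path = rest:match("^//([^/]*)(.*)$")
  if authority then parts.authority, rest = authority, path end

  parts.path = rest
  return parts
end

local p = split_uri("http://example.com/dl?file=track.mp3#t=10")
print(p.scheme, p.authority, p.path, p.query, p.fragment)
```

In this made-up example the ".mp3" sits in the query, not in the path, which
is one of the cases a pattern anchored on the end of the URL would miss.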
Then look at the HTML references to see how these URLs are further
re-encoded by another layer in HTML, which applies additional escaping when
needed (character references such as "&name;", "&#numericDecimal;" or
"&#xnumericHexadecimal;"), using Unicode code points independently of the
encoding in the URI itself. Three distinct encoding layers are applied to
encapsulate the actual resource names, two of them being standard, but one
of them being resolved only on the server side.
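
A minimal sketch of undoing the HTML layer on an attribute value before the
URL is interpreted. The function name decode_refs is made up, only the five
predefined named references are handled, and numeric references above
code point 127 are left untouched to avoid assuming an output encoding:

```lua
-- The small set of named references handled by this sketch.
local named = { amp = "&", lt = "<", gt = ">", quot = '"', apos = "'" }

-- Decode character references in a single pass, so that text produced by
-- one replacement is never decoded a second time.
local function decode_refs(s)
  return (s:gsub("&(#?%w+);", function(ref)
    if ref:sub(1, 1) == "#" then
      local hex = ref:match("^#[xX](%x+)$")
      local dec = ref:match("^#(%d+)$")
      local code = hex and tonumber(hex, 16) or dec and tonumber(dec)
      if code and code < 128 then
        return string.char(code)
      end
      return nil  -- returning nil keeps the original text
    end
    return named[ref]  -- nil for unknown names keeps the original text
  end))
end

print(decode_refs("http://example.com/get?id=1&amp;fmt=mp3"))
-- http://example.com/get?id=1&fmt=mp3
```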



On Wed, 25 Dec 2019 at 23:22, nobody <nobody+lua-list@afra-berlin.de>
wrote:

>
> > On 25. Dec 2019, at 22:44, Philippe Verdy <verdy_p@wanadoo.fr> wrote:
> >
> > It matches too many things […]
>
> When dealing with _valid HTML,_ heuristically, any string that starts with
> 'http://', ends with '.mp3' and doesn't contain spaces is almost
> certainly exactly a URL pointing at (something that claims to be) an MP3.
> (The other pattern works, too.) [So a somewhat better pattern than what I
> initially suggested would be "http://%S+%.mp3" – also excluding line
> breaks.]
>
> When you're not dealing with random / adversarial strings, that is good
> enough and you don't have to care about all those intricacies. From what I
> gathered, the goal is one-off semi-manual extraction of links from HTML
> generated by some other party, so even potential errors don't really
> matter… (The human in the loop can notice / fix things.)
>
> -- nobody
>
>



