List: ruby-talk
Subject: Re: HTML cleanup task
From: "Victor \"Zverok\" Shepelev" <vshepelev () imho ! com ! ua>
Date: 2006-11-30 21:39:47
Message-ID: 20061130213946.D334B3C22973B () carbon ! ruby-lang ! org
From: Paul Lutus [mailto:nospam@nosite.zzz]
Sent: Thursday, November 30, 2006 11:00 PM
> Victor "Zverok" Shepelev wrote:
>
> > It is a task definition.
> >
> > The task may vary for different dictionaries. For ex., with some
> > dictionaries tables must not be deleted, but "normalized":
> > "<td>text1<td>text2" => "<table><tr><td>text1<td>text2</table>"
>
> Both the before and after forms show big syntax errors. I hope you
> understand HTML syntax, if not, this may be more difficult than I thought.
I understand HTML syntax, and I see no problem in the above.
The closing tags for <tr> and <td> are both optional in the W3C HTML 4.01 spec.
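To make the "normalization" quoted above concrete, here is a rough sketch. The method name normalize_orphan_cells is made up for illustration, and it assumes the fragment is a single row of cells that has lost its enclosing table; real dictionary markup would likely need more cases.

```ruby
# Hypothetical helper (not from any library): wrap cells that have lost
# their enclosing table. It relies on the </td> and </tr> end tags being
# optional in HTML 4.01, so only <table><tr> and </table> are added.
def normalize_orphan_cells(fragment)
  return fragment if fragment =~ /<table\b/i   # already inside a table
  return fragment unless fragment =~ /<td\b/i  # no cells to rescue
  "<table><tr>#{fragment}</table>"
end

puts normalize_orphan_cells("<td>text1<td>text2")
```

Fragments without any <td>, or ones that already contain a <table>, are returned unchanged.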
> Perhaps you could post what you consider to be the desired end result for a
> particular entry from the "dictionary" site of your choice.
OK. Here it is:
Source page: http://en.wikipedia.org/wiki/Ukraine
Start pattern: <!-- start content -->
End pattern: <h2>
Elements to exclude: tables, images.
Desired output (with text in middle of paragraph skipped):
---------------------------
<p><b>Ukraine</b> (<a href="/wiki/Ukrainian_language" title="Ukrainian language">Ukrainian</a>:
<span lang="uk" xml:lang="uk">Україна</span>, <i>Ukraina</i>,
<span title="Pronunciation in IPA" class="IPA">/ukraˈjina/</span>)
is a <a href="/wiki/Country" title="Country">country</a> in
<a href="/wiki/Eastern_Europe" title="Eastern Europe">Eastern Europe</a>.
....
It became independent again after the
<a href="/wiki/History_of_the_Soviet_Union_%281985-1991%29" title="History of the Soviet Union (1985-1991)">Soviet Union's collapse</a> in 1991.</p>
---------------------------
That's all.
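For illustration only, the spec above (start pattern, end pattern, elements to exclude) could be sketched with plain regular expressions roughly as follows. The method name extract_entry and its parameters are hypothetical, not an existing library API. The sketch assumes the patterns are literal strings and that excluded elements are not nested; nested tables would need a real parser.

```ruby
# Hypothetical helper, not an existing library call.
# html      - full page source as a String
# start_pat - literal marker; output begins right after it
# end_pat   - literal marker; output ends right before it
# exclude   - tag names whose elements should be dropped
def extract_entry(html, start_pat, end_pat, exclude = [])
  from = html.index(start_pat)
  return nil unless from                 # start marker not found
  from += start_pat.length
  to = html.index(end_pat, from) || html.length
  fragment = html[from...to]

  exclude.each do |tag|
    # Paired elements such as <table>...</table>
    # (non-greedy match, so nested elements are NOT handled correctly).
    fragment = fragment.gsub(%r{<#{tag}\b[^>]*>.*?</#{tag}\s*>}mi, "")
    # Leftover or unpaired tags such as <img ...>.
    fragment = fragment.gsub(%r{</?#{tag}\b[^>]*>}i, "")
  end
  fragment.strip
end

page = 'junk<!-- start content --><table><tr><td>x</table>' \
       '<p>Hello <img src="a.png">world</p><h2>Next</h2>'
puts extract_entry(page, "<!-- start content -->", "<h2>", %w[table img])
```

With the sample string above this prints <p>Hello world</p>. Whether a regex approach is robust enough for arbitrary dictionary pages is a separate question, which is why an existing library would still be preferable.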
> By the way (my boilerplate remark about page scraping), if this is for any
> purpose other than your own personal use, it represents a copyright
> problem.
My application would be a kind of browser (a nano-browser); I don't want to "grab"
dictionaries.
> I want to emphasize this is not difficult at all, once there is a clear
> statement of purpose. It can be done in a few (maybe a few dozen) lines of
> Ruby code.
I know. I'm not a nuby (the poor wording in my emails comes from natural-language
difficulties, not from a lack of knowledge). I was just asking about existing libraries.
>
> --
> Paul Lutus
> http://www.arachnoid.com
V.