'Re: HTML cleanup task'


List:       ruby-talk
Subject:    Re: HTML cleanup task
From:       "Victor \"Zverok\" Shepelev" <vshepelev () imho ! com ! ua>
Date:       2006-11-30 21:39:47
Message-ID: 20061130213946.D334B3C22973B () carbon ! ruby-lang ! org

From: Paul Lutus [mailto:nospam@nosite.zzz]
Sent: Thursday, November 30, 2006 11:00 PM
> Victor "Zverok" Shepelev wrote:
> 
> > It is a task definition.
> > 
> > The task may vary for different dictionaries. For ex., with some
> > dictionaries tables must not be deleted, but "normalized":
> > "<td>text1<td>text2" => "<table><tr><td>text1<td>text2</table>"
> 
> Both the before and after forms show big syntax errors. I hope you
> understand HTML syntax, if not, this may be more difficult than I thought.

I do understand HTML syntax, and I see no problem in the above:
closing tags for <tr> and <td> are both optional in the W3C HTML 4.01 spec.
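
To make the "normalize" rule concrete, here is a minimal sketch in plain
Ruby. The method name normalize_orphan_cells is hypothetical (it is not
from any existing library), and it assumes the fragment is either a bare
run of cells or already wrapped in a table:

```ruby
# Hypothetical helper, for illustration only.
# Wraps a bare run of <td> cells in <table><tr> ... </table>.
# The closing </tr> and </td> tags are left out on purpose,
# since HTML 4.01 makes them optional.
def normalize_orphan_cells(fragment)
  return fragment if fragment =~ /<table\b/i   # already inside a table
  return fragment unless fragment =~ /<td\b/i  # no cells, nothing to do
  "<table><tr>#{fragment}</table>"
end
```

Fragments that mix text and cells, or hold several separate runs of
cells, would need a real parser rather than this check.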

> Perhaps you could post what you consider to be the desired end result for a
> particular entry from the "dictionary" site of your choice.

OK. Here it is:
Source page: http://en.wikipedia.org/wiki/Ukraine
Start pattern: <!-- start content -->
End pattern: <h2>
Elements to exclude: tables, images.

Desired output (with the text in the middle of the paragraph skipped):
---------------------------
<p><b>Ukraine</b> (<a href="/wiki/Ukrainian_language"
title="Ukrainian language">Ukrainian</a>: <span lang="uk"
xml:lang="uk">Україна</span>, <i>Ukraina</i>, <span
title="Pronunciation in IPA" class="IPA">/ukraˈjina/</span>) is a <a
href="/wiki/Country" title="Country">country</a> in <a
href="/wiki/Eastern_Europe" title="Eastern Europe">Eastern Europe</a>.
....
It became independent again after the <a
href="/wiki/History_of_the_Soviet_Union_%281985-1991%29"
title="History of the Soviet Union (1985-1991)">Soviet Union's
collapse</a> in 1991.</p>
---------------------------

That's all.
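
For illustration, the task above could be sketched in plain Ruby roughly
like this. It is only a sketch under assumptions: extract_fragment and
its parameters are hypothetical names, not an existing library, and the
regexps handle only simple, non-nested elements:

```ruby
# Hypothetical helper, for illustration only (not a library API).
# Returns the HTML between start_pattern and the first end_pattern
# after it, with the listed elements removed.
def extract_fragment(html, start_pattern, end_pattern, exclude: [])
  start_idx = html.index(start_pattern)
  return nil unless start_idx

  from = start_idx + start_pattern.length
  # If the end pattern is missing, keep everything up to the end.
  end_idx = html.index(end_pattern, from) || html.length
  fragment = html[from...end_idx]

  exclude.each do |tag|
    # Paired elements such as <table>...</table> (non-nested only).
    fragment = fragment.gsub(%r{<#{tag}\b[^>]*>.*?</#{tag}\s*>}mi, '')
    # Leftover tags without a closing tag, such as <img ...>.
    fragment = fragment.gsub(/<#{tag}\b[^>]*>/i, '')
  end
  fragment.strip
end
```

With the example above it would be called as
extract_fragment(page, '<!-- start content -->', '<h2>', exclude: %w[table img]),
where page holds the downloaded source. Nested tables would defeat the
non-greedy match, which is one reason an existing library is preferable.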

> By the way (my boilerplate remark about page scraping), if this is for any
> purpose other than your own personal use, it represents a copyright
> problem.

My application would be a kind of browser (a nano-browser); I don't want to
"grab" dictionaries.

> I want to emphasize this is not difficult at all, once there is a clear
> statement of purpose. It can be done in a few (maybe a few dozen) lines of
> Ruby code.

I know. I'm not a nuby (my poor language in these mails is due to
natural-language problems, not to a lack of knowledge). I was just asking
about existing libraries.

> 
> --
> Paul Lutus
> http://www.arachnoid.com

V.

