'Re: HTML parsing'


List:       ruby-talk
Subject:    Re: HTML parsing
From:       Gavin Sinclair <gsinclair () soyabean ! com ! au>
Date:       2004-02-02 14:03:12
Message-ID: 1581818660825.20040203010218 () soyabean ! com ! au

On Monday, February 2, 2004, 11:48:00 PM, Emmanuel wrote:

> Gavin Sinclair wrote:

>>Hi folks,
>>
>>I need to parse some HTML.  I've dug around the archives and so on and
>>found the best solution to be Ned Konz's 'ruby-htmltools', which
>>relies on 'html-parser'.  Both of these projects are not really
>>maintained, so I'm wondering what other people currently use.
>>  
>>
> i was using a home-made solution, but i just decided (this WE) to 
> convert it to REXML: I would use HTML tidy (which is already needed for
> ~60% of the pages i'm parsing now), and ask tidy to spit out XHTML. i
> think that's the best (with my home made solution, besides the 
> duplication of work of parsing HTML, i needed a list of tags that you
> don't need to close etc. in XHTML all is done for me.. and then i get
> the familiar API of REXML [even though i never used REXML yet :O) ]).
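
For anyone following along, a minimal sketch of the Tidy-then-REXML approach described above might look like this. The XHTML string is a made-up stand-in for what Tidy would emit (real Tidy output would also carry a DOCTYPE and an xmlns declaration, left out here to keep the XPath simple), and the tag names and URL are assumptions for illustration only.

```ruby
require 'rexml/document'

# Hypothetical XHTML, standing in for HTML Tidy output.
# Every tag is closed, so an XML parser can read it directly.
xhtml = <<~XHTML
  <html>
    <body>
      <p class="intro">Hello</p>
      <p>See <a href="http://example.com/">this link</a>.</p>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# Collect the href of every <a> element in the document.
links = REXML::XPath.match(doc, '//a').map { |a| a.attributes['href'] }

# Grab the text of the first paragraph marked class="intro".
intro = REXML::XPath.first(doc, '//p[@class="intro"]').text
```

Once the page is well-formed XHTML, the list of "tags you don't need to close" is no longer your problem; the XML parser only ever sees balanced elements.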

The library I mentioned gives you a REXML::Document as well, so I'm
using REXML for the first time.  It's very good, but I'm struggling to
really get a grip.

The single most useful improvement to REXML for a beginner, IMO, is
this: (more) reasonable implementations of #to_s and/or #inspect on
Element and Attribute objects.

As it is, I believe every element contains a link to its document,
which in my case is large, and #inspect spits out thousands of lines
of rubbish when all I want to see is the element I'm looking at.  This
makes it hard to use in 'irb'.
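
In the meantime, a small helper along these lines can work around it in irb. This is only a sketch: `brief` is a hypothetical name, not part of REXML, and it prints just the tag name, the attributes, and a count of child elements rather than the whole tree. Depending on the REXML version, #inspect may already behave more sensibly than what I am seeing.

```ruby
require 'rexml/document'

# Hypothetical helper (not part of REXML): a one-line summary of an
# element that never walks into its children or its owning document.
def brief(element)
  attrs = +''
  element.attributes.each_attribute do |attr|
    attrs << " #{attr.expanded_name}=\"#{attr.value}\""
  end
  "<#{element.expanded_name}#{attrs}> (#{element.elements.size} child elements)"
end

# Made-up document for illustration.
doc  = REXML::Document.new('<root><item id="1" kind="x"><a/><b/></item></root>')
item = doc.root.elements['item']
summary = brief(item)
```

Calling `puts brief(el)` in irb keeps the output to a single line, however large the surrounding document is.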

(I know, I should start with a small document, but I'm trying to get
my task done :)

Cheers,
Gavin

