List:       xom-interest
Subject:    Re: [XOM-interest] [Spam:5.0]  Get all urls
From:       adamc () unc ! edu
Date:       2006-08-12 18:29:45
Message-ID: 20060812142945.ea360hwoowcsk04s () webmail5 ! isis ! unc ! edu

Quoting Aaron Green <subnetrx@gmail.com>:

> I'm performing a query on a document to return all anchors in the document.
> This returns a list of nodes.  What I don't know is how to get attributes,
> such as href from this list of nodes.  This may not even be the correct way
> to do this.  I just want to get a list of all href attributes in a
> document.   I'm working on page scraping a company intranet to be put into a
> cms and need to get pages that are actively linked to, run them through
> tagsoup, write content to a file, and go to the next url.

A lot depends on your query and your source document. If all you
really want are the values of the href attributes on anchor elements,
the XPath //a/@href selects exactly those attributes.  That query
returns a nu.xom.Nodes object whose members are all nu.xom.Attribute
objects, and you can call getValue() on each one.

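One thing to check is whether the elements in your source document are
in the XHTML namespace, which matters because you're using an XML
processor on it.  If they are, the unprefixed //a/@href will match
nothing, since XPath 1.0 treats unprefixed names as belonging to no
namespace.  In that case, the following snippet will get you what
you're looking for as an array of Strings: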

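import nu.xom.Nodes;
import nu.xom.XPathContext;

// Bind the "xhtml" prefix to the XHTML namespace URI for the query.
XPathContext context =
    new XPathContext("xhtml", "http://www.w3.org/1999/xhtml");
Nodes links = doc.query("//xhtml:a/@href", context);

// Each result node is an Attribute; getValue() returns its value.
String[] hrefValues = new String[links.size()];
for (int i = 0, n = hrefValues.length; i < n; i++)
{
    hrefValues[i] = links.get(i).getValue();
}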
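
If you want to see the namespace effect for yourself before pointing
this at real pages, here is a minimal sketch that runs both queries
against a tiny XHTML document.  Note that it deliberately uses the
JDK's built-in javax.xml.xpath API rather than XOM, so it runs with no
extra jars; the class name HrefNamespaceDemo, the hrefs() helper, and
the SAMPLE page are all made up for illustration.  XPath matching on
namespaces should behave the same way in XOM, but treat this as a
comparison, not as XOM code.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class HrefNamespaceDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical sample page: two anchors with href, one without.
    static final String SAMPLE =
        "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
        + "<a href=\"/one.html\">one</a>"
        + "<a href=\"two.html\">two</a>"
        + "<a name=\"top\">no href here</a>"
        + "</body></html>";

    // Evaluates an XPath expression that selects attributes and returns
    // their values in document order. The "xhtml" prefix is bound to XHTML_NS.
    static List<String> hrefs(String xml, String expression) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace matching
            Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                @Override
                public String getNamespaceURI(String prefix) {
                    return "xhtml".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
                }

                @Override
                public String getPrefix(String uri) {
                    return XHTML_NS.equals(uri) ? "xhtml" : null;
                }

                @Override
                public Iterator<String> getPrefixes(String uri) {
                    return XHTML_NS.equals(uri)
                        ? Collections.singletonList("xhtml").iterator()
                        : Collections.<String>emptyIterator();
                }
            });

            NodeList nodes =
                (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed name: looks for <a> in no namespace, finds nothing.
        System.out.println(hrefs(SAMPLE, "//a/@href"));       // prints []
        // Prefixed name: matches the XHTML anchors.
        System.out.println(hrefs(SAMPLE, "//xhtml:a/@href")); // prints [/one.html, two.html]
    }
}
```

The anchor without an href simply contributes nothing to the result,
so there is no need to filter it out by hand.  Also note that the
values come back exactly as written in the page, so relative links
like two.html still need resolving against the page's URL before you
fetch them.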

HTH,

AC

_______________________________________________
XOM-interest mailing list
XOM-interest@lists.ibiblio.org
http://lists.ibiblio.org/mailman/listinfo/xom-interest