
List:       lucene-dev
Subject:    Re: Proposal for Lucene / new component
From:       Dmitry Serebrennikov <dmitrys () earthlink ! net>
Date:       2002-02-26 20:36:24

Andrew C. Oliver wrote:

> > I think of a more generic generation and transformation framework (The
> > cocoon framework is doing that (in a modified way) with the generation
> > and transformation of html/xml via sax events.)
> 
Well, I was just reading the XSLT book and racking my brains about how 
it might be used for content extraction from HTML (via SAX events). 
Didn't realize that this project I'm working on would also be applicable 
to Lucene, but here I come back from being away for a few days and I see 
almost the same thing being discussed on the list! Let me put some of 
the thoughts I had over the weekend into the discussion and let's see if
they have any bearing on what direction the crawler / content-handler 
might be taking.

First, I have to admit that I'm not familiar with Cocoon or Avalon 
(yet), so can't speak to their relative merits and don't know if what I 
am thinking of has been well known for years, so please excuse and 
educate me if this is the case. Second, this is all just an untested 
idea, so please feel free to shoot it down.

Here's the thread of reasoning:
- XSLT is a generic transformation language that can transform data in 
one XML format into another.
- XML format does not mean XML text file. It could be as generic as some 
in-memory data structures that issue and process SAX-style events, but 
otherwise have nothing in common with XML at all!
- HTML can be converted to XHTML (an XNI-based parser was mentioned 
just recently, and I think there are other ways of doing this).
- Once in XHTML, a style sheet can be written to convert any web page 
into a series of fields that the indexer can be made aware of. This can 
be done generically, but application-specific extensions can also be 
added. For example, lastModified, title, keywords, summary can all be 
extracted from any web page. But applications could define additional 
"site profiles" that would do some amount of screen-scraping to extract 
other fields from web pages they index. For example, one could extract 
price and description from web pages of some on-line catalog. Since the 
page formats don't tend to change too much once they stabilize, this 
might be a useful and viable solution for some applications.
- Some sites might offer XML output in addition to HTML. This could happen 
in b-2-b space or because they expect browsers to apply stylesheets for 
their presentation needs. In this case, the xslt transformation from the
site's XML to the input of the indexer would be even more interesting.
- Other data formats could be parsed into SAX events so that they can 
then be transformed into the XML (or just the SAX events) understood by 
the indexer (PDF, word, excel, proprietary databases).
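
To make the "virtual XML" idea from the last two points a bit more 
concrete, here is a rough, untested sketch of a parser for some non-XML 
source (a hypothetical key/value record) that simply issues SAX events as 
though it were reading a real XML file. All the class and element names 
are made up:

    import org.xml.sax.ContentHandler;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.AttributesImpl;

    import java.util.Iterator;
    import java.util.Map;

    public class RecordSaxEmitter {

        /** Emit <record><field name="...">value</field>...</record>
            to the handler, as if a real XML file were being parsed. */
        public void emit(Map record, ContentHandler handler)
                throws SAXException {
            handler.startDocument();
            handler.startElement("", "record", "record", new AttributesImpl());

            for (Iterator it = record.entrySet().iterator(); it.hasNext();) {
                Map.Entry entry = (Map.Entry) it.next();
                AttributesImpl atts = new AttributesImpl();
                atts.addAttribute("", "name", "name", "CDATA",
                                  (String) entry.getKey());

                handler.startElement("", "field", "field", atts);
                char[] value = ((String) entry.getValue()).toCharArray();
                handler.characters(value, 0, value.length);
                handler.endElement("", "field", "field");
            }

            handler.endElement("", "record", "record");
            handler.endDocument();
        }
    }

The same shape should work for a PDF or spreadsheet parser - only the 
traversal code changes, the SAX-event "view" stays the same.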


So the pipeline that emerges is like this:
1) Obtain a source document from a "source adapter". An example of an 
adapter is an HttpClient (that handles URLs, connections, GET/POST, 
cookies, authentication, etc.) or a JDBCClient or some other 
application-specific adapter.
2) Find a parser based on the MIME type of the document
3) Use the parser to generate SAX events for the source document. The 
SAX events would correspond to a DTD defined for this type. The source 
could be XML or it could be PDF or whatever and the parser would take 
care of making it into a "virtual" XML. It will provide an XML view of 
the document by issuing SAX events as though it was reading the actual 
XML file. Well, you get the idea. The parser here is specific to the 
document type but not to the application, so it should be reusable. 
Also, the parser could be used in entirely different projects that also 
need to read the same file format.
-----
    Actually, 1, 2, and 3 could all be combined into a single step, and we 
can define the source adapters to output SAX events. Some helper classes 
can then be provided to help in defining new adapters in a uniform way 
(a rough sketch of this appears after step 7 below).
-----
4) Select a stylesheet for content extraction. A generic 
"markup-stripping" style sheet can probably be created easily as a 
baseline. MIME-specific or source DTD-specific stylesheets can then be 
created as a step up from that. Finally, application-specific style 
sheets can be created that would be selected based on the source of the 
document, its URL, or maybe the content of the document itself.
5) The selected stylesheet is applied to transform the source XML 
(coming in as a series of SAX events) into the intermediate form XML, 
something that is understood internally by the indexer. This needs to 
include the following things:
    - Lucene Fields
        - none are required at this time, but if Lucene becomes aware of 
          persistent document IDs or modification dates, these could be 
          included here
    - Crawler/Content Handler Fields
        - URIs for links to follow for further crawling
        - document source
        - document id (unique for a given source)
    - Application Fields
        - just fields with datatype and content
6) Create a Lucene Document with all of these fields (without the links)
7) Process the links found in the intermediate representation. Decide if 
they should be followed further. If so,
request the source adapter to obtain the documents for these links.
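
If it helps, here is a very rough, untested sketch of how steps 1 through 
6 might hang together using the standard TrAX classes (javax.xml.transform) 
and Lucene's Field.Text() factory. The SourceAdapter interface, the 
<field name="..."> element of the intermediate form, and the other names 
are all invented for illustration, and link extraction (step 7) is left out:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    import org.xml.sax.Attributes;
    import org.xml.sax.ContentHandler;
    import org.xml.sax.helpers.DefaultHandler;

    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXResult;
    import javax.xml.transform.sax.SAXTransformerFactory;
    import javax.xml.transform.sax.TransformerHandler;
    import javax.xml.transform.stream.StreamSource;

    /** Steps 1-3: anything that can present a document as SAX events. */
    interface SourceAdapter {
        void feed(String documentId, ContentHandler out) throws Exception;
    }

    /** Steps 5-6: turns <field name="...">text</field> elements of the
        intermediate form into Lucene fields. */
    class FieldCollector extends DefaultHandler {
        private final Document doc = new Document();
        private final StringBuffer text = new StringBuffer();
        private String currentField;

        public Document getDocument() { return doc; }

        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            if ("field".equals(local) || "field".equals(qName)) {
                currentField = atts.getValue("name");
                text.setLength(0);
            }
        }

        public void characters(char[] ch, int start, int len) {
            if (currentField != null) text.append(ch, start, len);
        }

        public void endElement(String uri, String local, String qName) {
            if (currentField != null
                    && ("field".equals(local) || "field".equals(qName))) {
                doc.add(Field.Text(currentField, text.toString()));
                currentField = null;
            }
        }
    }

    class ContentExtractor {
        /** Steps 4-6: run the adapter's SAX output through the selected
            stylesheet and collect the result into a Lucene Document. */
        public Document extract(SourceAdapter source, String documentId,
                                String stylesheetPath) throws Exception {
            SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
            // A TransformerHandler is itself a ContentHandler, so the
            // adapter can push its events straight into the transformation.
            TransformerHandler transform =
                factory.newTransformerHandler(new StreamSource(stylesheetPath));
            FieldCollector collector = new FieldCollector();
            transform.setResult(new SAXResult(collector));

            source.feed(documentId, transform);   // steps 1-3
            return collector.getDocument();       // step 6 (links left out)
        }
    }

The push model (TransformerHandler being a ContentHandler) is what makes 
the "adapters output SAX events" idea from the aside above fit naturally.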


How does this sound? The basic idea is to use xslt as a basis for 
content extraction, with the hope of the following benefits:
a) make the pieces more reusable
b) handle any source document format and yet not drop down to the lowest 
common denominator
c) support applications that need very simple content extraction as well 
as very fine-grained screen scraping
d) use xml-based files to describe content extraction and field 
definition (this can later lead to tools that help users create these)
e) leverage a standard language and the knowledge that people may 
already have of xslt
f) leverage xslt processors that are already out there and that will 
continue to develop further


A couple of questions remain:
- How's performance of the xslt engines these days? Can they be used in 
this type of solution without compromising speed significantly?
- When SAX events are used for input and output, do parsers really keep 
the data out of memory or do they simply create DOM trees behind the scenes?
- Does xslt allow regex-based content extraction? This would be 
important for screen-scraping applications. From my reading it does not 
seem to be there. I've seen a few substring functions, but what I'm 
looking for is a function that would match some text from the source 
XML against a regular expression and then extract fields from the match 
the way Perl regular expressions do. Does anyone know how hard it would 
be to add new functions to XSLT? (A sketch of one possible workaround 
follows this list.)
- Is xslt overkill for this application? Assembler can be used to 
describe any computer program, but it's not the best choice for most of 
them. In the same way, even if xslt can be used to describe a content 
extraction strategy, does it have the right balance of power and 
specialization, or should a more specialized language be used?
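
On the regex question: one workaround that comes to mind is that Xalan 
(and, I believe, some other processors) can call static Java methods from 
a stylesheet through an extension namespace, so a small helper like the 
one below could do Perl-style group extraction. Whether that is an 
acceptable answer is a separate question (it is obviously not portable 
XSLT), and the class and method names here are mine; only java.util.regex 
(JDK 1.4) is assumed:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /** Hypothetical helper that a stylesheet could call through the
        processor's Java extension mechanism, e.g. in Xalan something
        along the lines of
          xmlns:re="http://xml.apache.org/xalan/java/RegexExtractor"
          select="re:group(string(.), '\$([0-9.]+)', 1)"
        to pull a price out of a page. */
    public class RegexExtractor {

        /** Return the given capturing group of the first match,
            or "" if there is no match. */
        public static String group(String text, String pattern,
                                   int groupIndex) {
            Matcher m = Pattern.compile(pattern).matcher(text);
            if (m.find() && groupIndex <= m.groupCount()) {
                return m.group(groupIndex);
            }
            return "";
        }
    }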

Any feedback would be greatly appreciated.
Thanks.
Dmitry.




--
To unsubscribe, e-mail:   <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>

