List: lucene-dev
Subject: Re: Proposal for Lucene / new component
From: Dmitry Serebrennikov <dmitrys () earthlink ! net>
Date: 2002-02-26 20:36:24
Andrew C. Oliver wrote:
> > I think of a more generic generation and transformation framework (the
> > Cocoon framework is doing that (in a modified way) with the generation
> > and transformation of HTML/XML via SAX events.)
>
Well, I was just reading the XSLT book and racking my brains about how
it might be used for content extraction from HTML (via SAX events).
Didn't realize that this project I'm working on would also be applicable
to Lucene, but here I come back from being away for a few days and I see
almost the same thing being discussed on the list! Let me put some of the
thoughts I had over the weekend into the discussion and let's see if
they have any bearing on what direction the crawler / content-handler
might be taking.
First, I have to admit that I'm not familiar with Cocoon or Avalon
(yet), so can't speak to their relative merits and don't know if what I
am thinking of has been well known for years, so please excuse and
educate me if this is the case. Second, this is all just an untested
idea, so please feel free to shoot it down.
Here's the thread of reasoning:
- XSLT is a generic transformation language that can transform data in
one XML format into another.
- XML format does not mean XML text file. It could be as generic as some
in-memory data structures that issue and process SAX-style events, but
otherwise have nothing in common with XML at all!
- HTML can be converted to XHTML (the Xerces XNI-based parser has been
mentioned just recently, and I think there are other ways of doing this).
- Once in XHTML, a style sheet can be written to convert any web page
into a series of fields that the indexer can be made aware of. This can
be done generically, but application-specific extensions can also be
added. For example, lastModified, title, keywords, summary can all be
extracted from any web page. But applications could define additional
"site profiles" that would do some amount of screen-scraping to extract
other fields from web pages they index. For example, one could extract
price and description from web pages of some on-line catalog. Since the
page formats don't tend to change too much once they stabilize, this
might be a useful and viable solution for some applications.
- Some sites might offer XML output in addition to HTML. This could happen
in B2B space or because they expect browsers to apply stylesheets for
the presentation needs. In this case, the xslt transformation from the
site's XML to the input of the indexer would be even more interesting.
- Other data formats could be parsed into SAX events so that they can
then be transformed into the XML (or just the SAX events) understood by
the indexer (PDF, word, excel, proprietary databases).
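The stylesheet-per-page idea above can be sketched with the JAXP/TrAX machinery that ships with the JDK. This is a minimal, untested-in-anger illustration: the stylesheet, the `<field>` element names, and the sample page are all invented for the example, and a real XHTML page would need its namespace handled in the XPath expressions.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FieldExtractor {
    // A deliberately tiny stylesheet: pull the <title> and the first <h1>
    // out of a (namespace-less) XHTML page and emit them as <field> elements.
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<document>"
      + "<field name='title'><xsl:value-of select='//title'/></field>"
      + "<field name='heading'><xsl:value-of select='//h1'/></field>"
      + "</document>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    public static String extract(String xhtml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xhtml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><head><title>Widget 42</title></head>"
                    + "<body><h1>Cheap Widgets</h1><p>Only $9.99</p></body></html>";
        System.out.println(extract(page));
    }
}
```

A "site profile" in this scheme would just be a different stylesheet selected for that site, with the generic markup-stripping sheet as the fallback.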
So the pipeline that emerges is like this:
1) Obtain a source document from a "source adapter". An example of an
adapter is an HttpClient (that handles URLs, connections, GET/POST,
cookies, authentication, etc.) or a JDBCClient or some other
application-specific adapter.
2) Find a parser based on the MIME type of the document.
3) Use the parser to generate SAX events for the source document. The
SAX events would correspond to a DTD defined for this type. The source
could be XML or it could be PDF or whatever and the parser would take
care of making it into a "virtual" XML. It will provide an XML view of
the document by issuing SAX events as though it was reading the actual
XML file. Well, you get the idea. The parser here is specific to the
document type but not to the application, so it should be reusable.
Also, the parser could be used in entirely different projects that also
need to read the same file format.
-----
Actually, 1, 2, and 3 could all be combined into a single step and
we can define the source adapters to output
SAX events. Some helper classes can then be provided to help in
defining new adapters in a uniform way.
-----
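The combined adapter idea (steps 1-3 folded into one interface that emits SAX events) might look like this in Java. The interface and class names are invented for illustration; a real `HttpSourceAdapter` would wrap an HTTP client and a MIME-type-to-parser lookup behind the same `emit` method.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** A source adapter presents any document as a stream of SAX events. */
interface SourceAdapter {
    void emit(ContentHandler handler) throws Exception;
}

/** Trivial adapter: the "document" is an XML string parsed with JAXP. */
class StringSourceAdapter implements SourceAdapter {
    private final String xml;
    StringSourceAdapter(String xml) { this.xml = xml; }
    public void emit(ContentHandler handler) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance()
            .newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
    }
}

public class AdapterDemo {
    /** Collect the element names seen in the adapter's event stream. */
    static String elementNames(String xml) throws Exception {
        final StringBuilder seen = new StringBuilder();
        new StringSourceAdapter(xml).emit(new DefaultHandler() {
            public void startElement(String uri, String local,
                                     String qName, Attributes atts) {
                if (seen.length() > 0) seen.append(' ');
                seen.append(qName);
            }
        });
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(elementNames("<doc><title>hi</title></doc>"));
    }
}
```

The consumer never learns whether the events came from a real XML file, an HTML tidy-up pass, or a PDF parser, which is the point of the "virtual XML" view.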
4) Select a stylesheet for content extraction. A generic
"markup-stripping" style sheet can probably be easily created as a
baseline. Further, MIME-specific or maybe source DTD-specific
stylesheets can also be created to be a step better. Further,
application-specific style sheets can be created that would be selected
based on the source of the document, its URL, and maybe based on the
content of the document itself.
5) The selected stylesheet is applied to transform the source XML
(coming in as a series of SAX events) into the intermediate form XML,
something that is understood internally by the indexer. This needs to
include the following things:
- Lucene Fields
- none are required at this time, but if Lucene becomes aware of
persistent document IDs or modification
dates, these could be included here
- Crawler/Content Handler Fields
- URI's for links to follow for further crawling
- document source
- document id (unique for a given source)
- Application Fields
- just fields with datatype and content
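As a sketch, the intermediate form for one document might serialize to something like the following. Every element name here is invented purely to make the three groups of fields concrete:

```xml
<!-- One possible shape for the intermediate representation (names are illustrative) -->
<document>
  <crawler>
    <source>http-crawler</source>
    <id>http://example.com/catalog/widget-42.html</id>
    <link>http://example.com/catalog/widget-43.html</link>
  </crawler>
  <application>
    <field name="title" type="text">Cheap Widgets</field>
    <field name="price" type="number">9.99</field>
  </application>
</document>
```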
6) Create a Lucene Document with all of these fields (without the links)
7) Process the links found in the intermediate representation. Decide if
they should be followed further. If so,
request the source adapter to obtain the documents for these links.
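The link-selection part of step 7 can be sketched with plain JDK classes. The policy here (stay on the seed host, never revisit a URL) is just one illustrative choice; the class name and method are invented:

```java
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

/**
 * Decide whether a link found in the intermediate representation
 * should be handed back to the source adapter for crawling.
 */
public class LinkFilter {
    private final String host;
    private final Set<String> visited = new HashSet<String>();

    public LinkFilter(String seedHost) { this.host = seedHost; }

    public boolean shouldFollow(String link) {
        try {
            URL url = new URL(link);
            if (!host.equalsIgnoreCase(url.getHost())) return false;
            // Set.add returns false if the URL was already queued.
            return visited.add(url.toExternalForm());
        } catch (Exception e) {
            return false; // malformed links are simply skipped
        }
    }

    public static void main(String[] args) {
        LinkFilter f = new LinkFilter("example.com");
        System.out.println(f.shouldFollow("http://example.com/a.html")); // true
        System.out.println(f.shouldFollow("http://example.com/a.html")); // false, already queued
        System.out.println(f.shouldFollow("http://other.org/b.html"));   // false, off-site
    }
}
```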
How does this sound? The basic idea is to use XSLT as a basis for the
content extraction with the hope of the following benefits:
a) make the pieces more reusable
b) handle any source document format and yet not drop down to the lowest
common denominator
c) support applications that need very simple content extraction as well
as very fine-grained screen scraping
d) use XML-based files to describe content extraction and field
definition (this can later lead to tools that help users create these)
e) leverage a standard language and the knowledge that people may
already have of XSLT
f) leverage XSLT processors that are already out there and that will
continue to develop further
A couple of questions remain:
- How's the performance of XSLT engines these days? Can they be used in
this type of solution without compromising speed significantly?
- When SAX events are used for input and output, do parsers really keep
the data out of memory or do they simply create DOM trees behind the scenes?
- Does XSLT allow regex-based content extraction? This would be
important for screen-scraping applications. From my reading, it's not
there. I've seen a few substring functions, but what I'm looking for is
a function that would match some text from the source XML against a
regular expression and then extract fields from the match the way Perl
regular expressions do. Does anyone know how hard it would be to add new
functions to XSLT?
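For what it's worth, the helper that such an extension function could delegate to is easy to write against `java.util.regex`. This says nothing about the processor-specific wiring (Xalan does support Java extension functions, but the hookup details vary); the class and method names below are invented:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * The kind of helper an XSLT extension function might call: match a
 * regular expression against text pulled from the source tree and
 * return a capture group, Perl-style.
 */
public class RegexExtract {
    /** Returns capture group `group` of the first match, or "" if none. */
    public static String extract(String text, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(group) : "";
    }

    public static void main(String[] args) {
        String scraped = "Price: $19.95 (free shipping)";
        System.out.println(extract(scraped, "\\$(\\d+\\.\\d\\d)", 1)); // 19.95
    }
}
```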
- Is XSLT overkill for this application? Assembler can be used to
describe any computer program, but it's not the best choice for most of
them. In the same way, even if XSLT can be used to describe a content
extraction strategy, does it have the right balance of power and
specialization, or does a more specialized language need to be used?
Any feedback would be greatly appreciated.
Thanks.
Dmitry.
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>