List: slide-dev
Subject: Re: cvs commit: jakarta-slide/src/share/org/apache/slide/extractor
From: "Ryan Rhodes" <ryanshaerhodes () hotmail ! com>
Date: 2004-04-28 14:57:02
Message-ID: BAY15-F193w2a9fFaiP000350e8 () hotmail ! com
For starters, I wrote a couple of extractors that pull out the text content,
because we are mainly interested in full-text search.
Am I correct that your extractor only pulls out properties, or am I
confused?
Also, it looks like you used the low-level API, so will this work with any
office document?
-Ryan
>From: Daniel Florey <dflorey@c1-fse.de>
>Reply-To: "Slide Developers Mailing List" <slide-dev@jakarta.apache.org>
>To: Slide Developers Mailing List <slide-dev@jakarta.apache.org>
>Subject: Re: cvs commit: jakarta-slide/src/share/org/apache/slide/extractor
>OfficeExtractor.java
>Date: Wed, 28 Apr 2004 16:02:44 +0200
>
>Hi,
>sorry for that ;-)
>This extractor is very basic. It uses the jakarta poi library to access the
>office files.
>You can map the extractor to files matching a url (e.g. all files under
>/files/word/) or matching a content type (application/ms-...)
>When content is stored the extractor extracts some properties from the
>given stream and stores them as webdav properties.
>You can afterwards use DASL to search documents by these properties.
>We didn't figure out how to get descriptive property names out of the
>documents, so you can configure the property names in the Domain.xml.
>If you have a look at the Domain.xml, you can see that a cryptic
>DocumentSummaryInformation-x-y is mapped to webdav properties.
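
A minimal sketch of the lookup described above, using only the JDK (no POI
or Slide classes). The class name PropertyKeySketch and the property name
"category" are made up for illustration; the key format follows the
eventName + "-" + nr + "-" + id expression in the quoted commit. The trim()
call matters because, as far as I know, POI reports these stream names with
a leading control character (\u0005), which trim() strips.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: mirrors the key scheme in the quoted commit,
// not Slide's actual classes.
class PropertyKeySketch {
    // Maps "<stream name>-<section index>-<property id>" to a WebDAV
    // property name, as the instruction entries in Domain.xml would.
    private final Map<String, String> propertyMap = new HashMap<>();

    // Equivalent of one configured instruction (id attribute -> property attribute).
    void addInstruction(String id, String property) {
        propertyMap.put(id, property);
    }

    // Builds the cryptic key; trim() drops leading/trailing characters <= U+0020.
    static String key(String streamName, int sectionIndex, long propertyId) {
        return streamName.trim() + "-" + sectionIndex + "-" + propertyId;
    }

    // Returns the configured WebDAV property name, or null if the key is unmapped.
    String lookup(String streamName, int sectionIndex, long propertyId) {
        return propertyMap.get(key(streamName, sectionIndex, propertyId));
    }
}
```

With an instruction that maps DocumentSummaryInformation-0-2 to category,
lookup("\u0005DocumentSummaryInformation", 0, 2) returns "category", and any
key without an instruction returns null, so that property is simply skipped.
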
>It would be really helpful if you could have a closer look at the poi
>library and check out if there is some more useful information stored in
>the office documents.
>Regards,
>Daniel
>
>BTW: Many thanks go to Jan Stövesand (a colleague of mine) who figured out
>the POI things and will hopefully join the slide community soon...
>
>
>Ryan Rhodes wrote:
>
>>Beat me to the punch. I was just finishing an office extractor.
>>
>>Can you give me an idea of what this Extractor does, and what might be
>>missing, Daniel?
>>
>>thanks,
>>
>>Ryan
>>
>>
>>>From: dflorey@apache.org
>>>Reply-To: "Slide Developers Mailing List" <slide-dev@jakarta.apache.org>
>>>To: jakarta-slide-cvs@apache.org
>>>Subject: cvs commit: jakarta-slide/src/share/org/apache/slide/extractor
>>>OfficeExtractor.java
>>>Date: 28 Apr 2004 13:08:20 -0000
>>>
>>>dflorey 2004/04/28 06:08:20
>>>
>>> Added: src/share/org/apache/slide/extractor OfficeExtractor.java
>>> Log:
>>> Added MS Office metainfo extractor
>>>
>>> Revision Changes Path
>>> 1.1
>>>jakarta-slide/src/share/org/apache/slide/extractor/OfficeExtractor.java
>>>
>>> Index: OfficeExtractor.java
>>> ===================================================================
>>> package org.apache.slide.extractor;
>>>
>>> import java.io.InputStream;
>>> import java.util.*;
>>>
>>> import org.apache.poi.hpsf.*;
>>> import org.apache.poi.poifs.eventfilesystem.*;
>>> import org.apache.slide.util.conf.Configurable;
>>> import org.apache.slide.util.conf.Configuration;
>>> import org.apache.slide.util.conf.ConfigurationException;
>>>
>>> /**
>>> * The OfficeExtractor class
>>> *
>>> * @author <a href="mailto:dflorey@c1-fse.de">Daniel Florey</a>
>>> */
>>> public class OfficeExtractor extends AbstractPropertyExtractor implements Configurable {
>>>     protected List instructions = new ArrayList();
>>>     protected Map propertyMap = new HashMap();
>>>
>>>     public OfficeExtractor(String uri, String contentType) {
>>>         super(uri, contentType);
>>>     }
>>>
>>>     public Map extract(InputStream content) throws ExtractorException {
>>>         OfficePropertiesListener listener = new OfficePropertiesListener();
>>>         try {
>>>             POIFSReader r = new POIFSReader();
>>>             r.registerListener(listener);
>>>             r.read(content);
>>>         } catch (Exception e) {
>>>             throw new ExtractorException("Exception while extracting properties in OfficeExtractor");
>>>         }
>>>         return listener.getProperties();
>>>     }
>>>
>>>     class OfficePropertiesListener implements POIFSReaderListener {
>>>
>>>         private HashMap properties = new HashMap();
>>>
>>>         public Map getProperties() {
>>>             return properties;
>>>         }
>>>
>>>         public void processPOIFSReaderEvent(POIFSReaderEvent event) {
>>>             PropertySet ps = null;
>>>             try {
>>>                 ps = PropertySetFactory.create(event.getStream());
>>>             } catch (NoPropertySetStreamException ex) {
>>>                 return;
>>>             } catch (Exception ex) {
>>>                 throw new RuntimeException("Property set stream \"" + event.getPath() + event.getName() + "\": " + ex);
>>>             }
>>>             String eventName = event.getName().trim();
>>>             final long sectionCount = ps.getSectionCount();
>>>             List sections = ps.getSections();
>>>             int nr = 0;
>>>             for (Iterator i = sections.iterator(); i.hasNext();) {
>>>                 Section sec = (Section) i.next();
>>>                 int propertyCount = sec.getPropertyCount();
>>>                 Property[] props = sec.getProperties();
>>>                 for (int i2 = 0; i2 < props.length; i2++) {
>>>                     Property p = props[i2];
>>>                     int id = p.getID();
>>>                     long type = p.getType();
>>>                     Object value = p.getValue();
>>>                     String key = eventName + "-" + nr + "-" + id;
>>>                     if ( propertyMap.containsKey(key) ) {
>>>                         properties.put(propertyMap.get(key), value);
>>>                     }
>>>                 }
>>>             }
>>>         }
>>>     }
>>>
>>>     public void configure(Configuration configuration) throws ConfigurationException {
>>>         Enumeration instructions = configuration.getConfigurations("instruction");
>>>         while (instructions.hasMoreElements()) {
>>>             Configuration extract = (Configuration)instructions.nextElement();
>>>             String property = extract.getAttribute("property");
>>>             String id = extract.getAttribute("id");
>>>             propertyMap.put(id, property);
>>>         }
>>>     }
>>> }
>>>
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: slide-dev-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: slide-dev-help@jakarta.apache.org
>>>