
List:       slide-dev
Subject:    Re: cvs commit: jakarta-slide/src/share/org/apache/slide/extractor
From:       "Ryan Rhodes" <ryanshaerhodes () hotmail ! com>
Date:       2004-04-28 14:57:02
Message-ID: BAY15-F193w2a9fFaiP000350e8 () hotmail ! com

For starters, I wrote a couple of extractors that pull out the text content, 
because we are mainly interested in full-text search.

Am I correct that your extractor only pulls out properties, or am I 
confused?

Also, it looks like you used the low-level API, so will this work with any 
office document?

-Ryan
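
For reference, the lookup-key scheme used by the quoted commit can be sketched in isolation. This is only a minimal sketch, not the committed code: the class name PropertyKeySketch, its helper methods, and the sample mappings are hypothetical, and the assumption that IDs 2 and 4 in SummaryInformation mean title and author is mine. It avoids POI entirely so that it runs on its own.

```java
import java.util.HashMap;
import java.util.Map;

public class PropertyKeySketch {

    // Builds the key the same way the quoted listener does:
    // <trimmed stream name>-<section index>-<property id>.
    public static String key(String streamName, int sectionIndex, int propertyId) {
        // String.trim() drops leading/trailing characters <= U+0020,
        // which includes a control-character prefix on the stream name.
        return streamName.trim() + "-" + sectionIndex + "-" + propertyId;
    }

    // Returns the configured WebDAV property name, or null when the key is unmapped.
    public static String webdavName(Map<String, String> mapping,
                                    String streamName, int sectionIndex, int propertyId) {
        return mapping.get(key(streamName, sectionIndex, propertyId));
    }

    public static void main(String[] args) {
        // Hypothetical mapping, standing in for what Domain.xml would configure.
        Map<String, String> mapping = new HashMap<String, String>();
        mapping.put("SummaryInformation-0-2", "title");
        mapping.put("SummaryInformation-0-4", "author");

        // Stream name with an assumed U+0005 prefix.
        System.out.println(webdavName(mapping, "\u0005SummaryInformation", 0, 2));
        // An unmapped property ID is simply skipped (null here).
        System.out.println(webdavName(mapping, "\u0005SummaryInformation", 0, 99));
    }
}
```

Under these assumptions, only properties whose key appears in the mapping are returned, which is consistent with the extractor producing metadata properties rather than body text.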


>From: Daniel Florey <dflorey@c1-fse.de>
>Reply-To: "Slide Developers Mailing List" <slide-dev@jakarta.apache.org>
>To: Slide Developers Mailing List <slide-dev@jakarta.apache.org>
>Subject: Re: cvs commit: jakarta-slide/src/share/org/apache/slide/extractor OfficeExtractor.java
>Date: Wed, 28 Apr 2004 16:02:44 +0200
>
>Hi,
>sorry for that ;-)
>This extractor is very basic. It uses the jakarta poi library to access the 
>office files.
>You can map the extractor to files matching a url (e.g. all files under 
>/files/word/) or matching a content type (application/ms-...)
>When content is stored the extractor extracts some properties from the 
>given stream and stores them as webdav properties.
>You can afterwards use DASL to search documents by these properties.
>We didn't figure out how to get meaningful property names out of the 
>documents, so you can configure the property names in the Domain.xml.
>Have a look at the Domain.xml, you can see that a cryptic 
>DocumentSummaryInformation-x-y is mapped to webdav properties.
>It would be really helpful if you could have a closer look at the poi 
>library and check whether there is any more useful information stored in 
>the office documents.
>Regards,
>Daniel
>
>BTW: Many thanks go to Jan Stövesand (a colleague of mine) who figured out 
>the POI things and will hopefully join the slide community soon...
>
>
>Ryan Rhodes wrote:
>
>>Beat me to the punch.  I was just finishing an office extractor.
>>
>>Can you give me an idea of what this Extractor does, and what might be 
>>missing Daniel?
>>
>>thanks,
>>
>>Ryan
>>
>>
>>>From: dflorey@apache.org
>>>Reply-To: "Slide Developers Mailing List" <slide-dev@jakarta.apache.org>
>>>To: jakarta-slide-cvs@apache.org
>>>Subject: cvs commit: jakarta-slide/src/share/org/apache/slide/extractor OfficeExtractor.java
>>>Date: 28 Apr 2004 13:08:20 -0000
>>>
>>>dflorey     2004/04/28 06:08:20
>>>
>>>   Added:       src/share/org/apache/slide/extractor OfficeExtractor.java
>>>   Log:
>>>   Added MS Office metainfo extractor
>>>
>>>   Revision  Changes    Path
>>>   1.1                  jakarta-slide/src/share/org/apache/slide/extractor/OfficeExtractor.java
>>>
>>>   Index: OfficeExtractor.java
>>>   ===================================================================
>>>   package org.apache.slide.extractor;
>>>
>>>   import java.io.InputStream;
>>>   import java.util.*;
>>>
>>>   import org.apache.poi.hpsf.*;
>>>   import org.apache.poi.poifs.eventfilesystem.*;
>>>   import org.apache.slide.util.conf.Configurable;
>>>   import org.apache.slide.util.conf.Configuration;
>>>   import org.apache.slide.util.conf.ConfigurationException;
>>>
>>>   /**
>>>    * The OfficeExtractor class
>>>    *
>>>    * @author <a href="mailto:dflorey@c1-fse.de">Daniel Florey</a>
>>>    */
>>>   public class OfficeExtractor extends AbstractPropertyExtractor implements Configurable {
>>>       protected List instructions = new ArrayList();
>>>       protected Map propertyMap = new HashMap();
>>>
>>>       public OfficeExtractor(String uri, String contentType) {
>>>           super(uri, contentType);
>>>       }
>>>
>>>       public Map extract(InputStream content) throws ExtractorException {
>>>           OfficePropertiesListener listener = new OfficePropertiesListener();
>>>           try {
>>>               POIFSReader r = new POIFSReader();
>>>               r.registerListener(listener);
>>>               r.read(content);
>>>           } catch (Exception e) {
>>>               // Include the cause in the message so the failure can be diagnosed.
>>>               throw new ExtractorException("Exception while extracting properties in OfficeExtractor: " + e);
>>>           }
>>>           return listener.getProperties();
>>>       }
>>>
>>>       class OfficePropertiesListener implements POIFSReaderListener {
>>>
>>>           private HashMap properties = new HashMap();
>>>
>>>           public Map getProperties() {
>>>                   return properties;
>>>           }
>>>
>>>           public void processPOIFSReaderEvent(POIFSReaderEvent event) {
>>>               PropertySet ps = null;
>>>               try {
>>>                   ps = PropertySetFactory.create(event.getStream());
>>>               } catch (NoPropertySetStreamException ex) {
>>>                   // Not a property set stream; nothing to extract.
>>>                   return;
>>>               } catch (Exception ex) {
>>>                   throw new RuntimeException("Property set stream \"" + event.getPath() + event.getName() + "\": " + ex);
>>>               }
>>>               String eventName = event.getName().trim();
>>>               List sections = ps.getSections();
>>>               int nr = 0; // index of the current section, used in the lookup key
>>>               for (Iterator i = sections.iterator(); i.hasNext();) {
>>>                   Section sec = (Section) i.next();
>>>                   Property[] props = sec.getProperties();
>>>                   for (int i2 = 0; i2 < props.length; i2++) {
>>>                       Property p = props[i2];
>>>                       int id = p.getID();
>>>                       Object value = p.getValue();
>>>                       // Key format: <stream name>-<section index>-<property id>
>>>                       String key = eventName + "-" + nr + "-" + id;
>>>                       if (propertyMap.containsKey(key)) {
>>>                           properties.put(propertyMap.get(key), value);
>>>                       }
>>>                   }
>>>                   nr++; // advance to the next section
>>>               }
>>>           }
>>>       }
>>>
>>>       public void configure(Configuration configuration) throws ConfigurationException {
>>>           Enumeration instructions = configuration.getConfigurations("instruction");
>>>           while (instructions.hasMoreElements()) {
>>>               Configuration extract = (Configuration) instructions.nextElement();
>>>               String property = extract.getAttribute("property");
>>>               String id = extract.getAttribute("id");
>>>               propertyMap.put(id, property);
>>>           }
>>>       }
>>>   }
>>>
>>>
>>>
>>>---------------------------------------------------------------------
>>>To unsubscribe, e-mail: slide-dev-unsubscribe@jakarta.apache.org
>>>For additional commands, e-mail: slide-dev-help@jakarta.apache.org
>>>
>>
>>
>>
>
>
>
>

