List:       solr-user
Subject:    RE: Invalid UTF-8 character 0xffff at char #17373581, byte #17539047
From:       Markus_Jelsma <markus.jelsma () openindex ! io>
Date:       2017-02-28 16:44:37
Message-ID: zarafa.58b5a8f5.5992.3a532ac904a2471b () mail1 ! ams ! nl ! openindex ! io

Hello - you are sending code points that are not valid characters: 0xffff
is a Unicode noncharacter and is not allowed in XML. In Apache Nutch we use
this little method [1] to clean up outgoing data. PDF and other crazy format
parsers are known to sometimes emit bad characters. Using a
TolerantUpdateProcessor is probably not going to work, because the exception
is thrown by the XML parser, long before documents are passed through update
processors.

Regards,
Markus

[1]: https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/src/java/org/apache/nutch/indexwriter/solr/SolrUtils.java#L76
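
For illustration, here is a minimal sketch of that kind of client-side
cleanup. It is not the Nutch method linked above; the class and method names
(XmlCharFilter, isSafe, strip, firstInvalidIndex) are made up for this
example. It assumes you can run the extracted text through a filter before
building the update XML, and it drops anything outside the XML 1.0 character
range as well as Unicode noncharacters such as 0xffff.

```java
// Hypothetical helper, not part of Solr or Nutch.
final class XmlCharFilter {
    private XmlCharFilter() {}

    /** True if the code point is legal in XML 1.0 and is not a Unicode noncharacter. */
    static boolean isSafe(int cp) {
        // XML 1.0 "Char" production: tab, LF, CR and the ranges below.
        boolean xmlChar = cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
        // Noncharacters: U+FDD0..U+FDEF and the last two code points of every plane.
        boolean nonCharacter = (cp >= 0xFDD0 && cp <= 0xFDEF)
                || (cp & 0xFFFE) == 0xFFFE;
        return xmlChar && !nonCharacter;
    }

    /** Returns a copy of the input with every unsafe code point removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);   // unpaired surrogates come back as-is
            i += Character.charCount(cp);
            if (isSafe(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    /** Index of the first unsafe char, or -1 if the input is clean. */
    static int firstInvalidIndex(String input) {
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (!isSafe(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Calling firstInvalidIndex on each document's text before sending it is one
way to log which document contains the bad character; strip can then be
applied so that the rest of the batch still indexes.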
  
 
-----Original message-----
> From:Nick Way <nick@southeastpublishing.com>
> Sent: Tuesday 28th February 2017 17:27
> To: solr-user@lucene.apache.org
> Subject: Invalid UTF-8 character 0xffff at char #17373581, byte #17539047
> 
> Hello everyone,
> 
> We use Solr (with Adobe Coldfusion) to index circa 60,000 pdfs, however the
> daily refresh has been failing with this error "Invalid UTF-8 character
> 0xffff at char #17373581, byte #17539047...." [truncated - full error
> message is posted below]
> 
> - Can Solr be configured to skip problematic documents (eg those
> containing an invalid character)?
> - Can Solr be configured to log which document it had a problem indexing?
> - If no to both of the above, do you have any suggestions for how I can
> either detect the problematic document or stop Solr erroring on it?
> 
> 
> Thank you very much indeed.
> 
> Kind regards,
> 
> 
> Nick Way
> 
> full error message:
> 
> [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff
> at char #17373581, byte #17539047) java.lang.RuntimeException: [was class
> java.io.CharConversionException] Invalid UTF-8 character 0xffff at char
> #17373581, byte #17539047) at
> com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
> at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at
> com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
> at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at
> org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:301) at
> org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:157) at
> org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79) at
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
>  at
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
>  at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368) at
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
> at
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
> at
> org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
>  at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
> at
> org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
> at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
> at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
> at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at
> org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
>  at
> org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
> at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
> at org.mortbay.jetty.Server.handle(Server.java:326) at
> org.mortbay.jetty.HttpConnection.handleRequest(H
> 
> request:
> http://localhost:8985/solr/solr77b/update?commit=true&waitFlush=false&waitSearcher=false&wt=xml&version=2.2
>  

