'Re: Indexing speed reduced significantly with OCR'


List:       solr-user
Subject:    Re: Indexing speed reduced significantly with OCR
From:       Zheng Lin Edwin Yeo <edwinyeozl () gmail ! com>
Date:       2017-03-31 16:42:08
Message-ID: CAF2DzVXQZPwMYxBXizwKbAjgMz7Hu7OQF=EJQAQGkk1yZ-Bffg () mail ! gmail ! com


This is my comparison of the indexing speed with and without Tesseract OCR.
The smaller file takes longer to index, probably because it contains more
text for the OCR to process than the bigger file, which has less text.
Is that usually the case?

*With Tesseract OCR*

174 KB - 5.20 sec
446 KB - 2.45 sec

*Without Tesseract OCR*

174 KB - 0.77 sec
446 KB - 0.23 sec
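
To put numbers on the comparison, here is a minimal Python sketch that
computes how much slower each file is with OCR enabled. The function name
and the dictionary layout are my own; the inputs are only the four timings
listed above, so this says nothing about other files.

```python
# Timings from the comparison above, in seconds, keyed by file size in KB.
with_ocr = {174: 5.20, 446: 2.45}
without_ocr = {174: 0.77, 446: 0.23}


def slowdown(ocr_seconds, plain_seconds):
    """Return how many times longer indexing takes with OCR enabled."""
    return ocr_seconds / plain_seconds


for size_kb in with_ocr:
    factor = slowdown(with_ocr[size_kb], without_ocr[size_kb])
    print(f"{size_kb} KB: {factor:.2f}x slower with OCR")
```

For these two files the factor is roughly 6.75x for the 174 KB file and
10.65x for the 446 KB file.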


Regards,

Edwin

On 31 March 2017 at 03:57, Phil Scadden <P.Scadden@gns.cri.nz> wrote:

> Yes, that would seem an accurate assessment of the problem.
>
> -----Original Message-----
> From: Zheng Lin Edwin Yeo [mailto:edwinyeozl@gmail.com]
> Sent: Thursday, 30 March 2017 4:53 p.m.
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing speed reduced significantly with OCR
>
> Thanks for your reply.
>
> From what I see, getting more hardware to do the OCR is inevitable?
>
> Even if we run the OCR outside of the Solr indexing stream, it will still
> take a long time to process if it runs on just one machine. And we still
> need to wait for the OCR to finish converting before we can index the
> result into Solr.
>
> Regards,
> Edwin
>
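
For what it is worth, running the OCR outside the indexing stream as
discussed in the quoted message could be sketched as below. This is only a
rough illustration under my own assumptions: the `tesseract` command-line
tool is on the PATH, the inputs are image files it can read, and the
function names `tesseract_ocr` and `ocr_all` and the worker count are
hypothetical. It only spreads the OCR over several workers on one machine;
it does not remove the OCR cost, and sending the extracted text to Solr is
left out.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tesseract_ocr(path):
    """Run the tesseract CLI on one image and return the recognized text.

    Assumes the `tesseract` binary is installed; "stdout" tells it to
    write the text to standard output instead of a file.
    """
    result = subprocess.run(
        ["tesseract", path, "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def ocr_all(paths, ocr_fn=tesseract_ocr, workers=4):
    """OCR several files concurrently and return {path: extracted text}.

    Threads are enough here because each call waits on an external
    process; `workers` is an assumed value to tune per machine.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the same order as `paths`.
        return dict(zip(paths, pool.map(ocr_fn, paths)))
```

The extracted text can then be indexed as ordinary text in a separate step,
so the Solr side no longer has to wait on the OCR for each document.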

