List:       gentoo-user
Subject:    Re: [gentoo-user] multi-region OCR
From:       Landis Blackwell <blackwelllandis () gmail ! com>
Date:       2016-11-30 19:48:25
Message-ID: 1a2dfbf4-e061-3162-58c2-1da289430568 () gmail ! com

Did you train tesseract, perchance? And could I get some sample images?
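
One thing that may be worth trying on the existing scans, given the 300dpi
complaint quoted below: crop each label region out of the mostly-black image
and upscale it before handing it to tesseract. What follows is only a rough
sketch under stated assumptions, not a tested recipe. The region coordinates,
the file names and the 200% scale factor are all hypothetical, and it assumes
ImageMagick's `convert` and the `tesseract` command-line tool are installed.
The functions only build the command lines; nothing is executed.

```python
from typing import List, Tuple

# (x, y, width, height) in pixels. These boxes are hypothetical; they would
# have to be read off the actual scans.
Region = Tuple[int, int, int, int]


def crop_command(src: str, dst: str, region: Region,
                 scale_percent: int = 200) -> List[str]:
    """Build an ImageMagick command that crops one region and upscales it."""
    x, y, w, h = region
    return [
        "convert", src,
        "-crop", f"{w}x{h}+{x}+{y}",    # cut out just the label
        "+repage",                      # reset the canvas to the cropped area
        "-resize", f"{scale_percent}%", # assumed factor; tune per scan
        dst,
    ]


def ocr_command(image: str, out_base: str, lang: str = "eng") -> List[str]:
    """Build a tesseract command; text is written to <out_base>.txt."""
    return ["tesseract", image, out_base, "-l", lang]


def plan(src: str, regions: List[Region]) -> List[List[str]]:
    """Return the crop and OCR commands for every region, in order."""
    commands = []
    for i, region in enumerate(regions):
        cropped = f"region{i}.png"
        commands.append(crop_command(src, cropped, region))
        commands.append(ocr_command(cropped, f"region{i}"))
    return commands


if __name__ == "__main__":
    # Print the commands rather than running them; each list could be
    # passed to subprocess.run(cmd, check=True) once the boxes are right.
    for cmd in plan("scan.png", [(100, 150, 800, 300)]):
        print(" ".join(cmd))
```

Whether upscaling helps at all depends on how much detail the scans actually
captured, which is why sample images would be useful.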

Landis


On 11/30/2016 12:28 PM, Michael Mol wrote:
> On Wednesday, November 30, 2016 05:34:25 PM J. Roeleveld wrote:
>> On November 30, 2016 6:03:36 PM GMT+01:00, Michael Mol <mikemol@gmail.com> wrote:
>>> On Wednesday, November 30, 2016 10:43:13 AM J. Roeleveld wrote:
>>>> On Tuesday, November 29, 2016 11:18:36 PM karl@aspodata.se wrote:
>>>>> Michael Mol:
>>>>> ...
>>>>>
>>>>>> xsane would have let me do it during the scan process if I'd thought
>>>>>> of it then, but the scans are done, drives aren't there any more.
>>>>>> Something
>>>>> ...
>>>>>
>>>>> If xsane solves your need why don't you just print your scans so xsane
>>>>> can do its job ?
>>>> There has to be a way to do this without killing an entire forest...
>>> And big chunks of ink cartridges. The scans stretched the contrast so I
>>> can clearly read the drive labels through the translucent anti-static
>>> bags, which means a huge chunk of the image (what's outside the labels)
>>> is pure black.
>>>
>>> Which I could get around by spending fifteen minutes munging things in
>>> the Gimp before printing, but at that point, I may as well just
>>> transcribe things manually at that point.
>>>
>>> Looking for something reasonably simple to improve the general workflow.
>>> I'd have hoped something would have already been available on Linux;
>>> it'd be easy enough to copy the scans to my phone and feed them through
>>> Google Goggles for the desired output, but then I'm deliberately
>>> filtering company data through an outside entity.
>> Did you manage to use that link I sent?
> I did. tesseract almost worked, even separating the regions cleanly in its
> output, but it seems, sadly, that the 300dpi scans were insufficient to get a
> good read; lots of clear corruption of the text, so things like serial
> numbers, model numbers, version numbers--everything you'd care about--would be
> highly suspect.
>
> The next tool that looked like it might work, gscan2pdf, wasn't in portage,
> and with the semi-garbled output from tesseract suggesting the scans were too
> poor quality, I didn't pursue further.
>

