List:       openoffice-users
Subject:    Re: [users] Converting document in PDF format to text format for printing.
From:       Olav Pettershagen <olav.pettershagen () trysil ! online ! no>
Date:       2002-10-25 15:31:19
Message-ID: 200210251731.19753.olav.pettershagen () trysil ! online ! no

On Friday, 25 October 2002 at 15:20, Hal Vaughan wrote:
> While it's possible to extract text from a .pdf file, it's almost
> impossible to extract it with margins intact and completely formatted.  In
> a .pdf file, characters are placed in position relative to the last
> character (not always, but it's this way in all .pdf files I've examined).
>  While it would be possible to try to guess the margins, it's not always
> possible.  I spent 2 weeks earlier this year trying to find a way to
> extract text from a .pdf, do a mailmerge on the text, and dump it back into
> a new .pdf with the same formatting to print.  I came to the conclusion it
> would take a long time and serious programming to come close -- and that it
> would be almost impossible to guarantee that the extracted data kept the
> same formatting.
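
To make the quoted point concrete, here is a minimal Python sketch of the
margin-guessing problem. It assumes the text has already been pulled out of
the file as hypothetical (dx, dy, text) runs, each offset from the previous
run; the function names, the sample runs and the tolerance value are
illustrative assumptions, not part of any real PDF library.

```python
from collections import Counter

def absolute_positions(runs, origin=(0.0, 0.0)):
    """Turn (dx, dy, text) runs, each offset from the previous run,
    into (x, y, text) tuples with absolute page coordinates."""
    x, y = origin
    placed = []
    for dx, dy, text in runs:
        x, y = x + dx, y + dy
        placed.append((x, y, text))
    return placed

def guess_left_margin(placed, tolerance=1.0):
    """Guess the left margin as the most common line-start x.
    Runs whose y values fall in the same tolerance bucket count as one line."""
    line_starts = {}
    for x, y, _ in placed:
        key = round(y / tolerance)
        line_starts[key] = min(x, line_starts.get(key, x))
    buckets = Counter(round(x / tolerance) for x in line_starts.values())
    return buckets.most_common(1)[0][0] * tolerance

# Hypothetical runs: three lines start at x = 72, one is indented to x = 92.
runs = [
    (72, 700, "While"), (30, 0, "it's"),
    (-30, -12, "possible"), (40, 0, "to"),
    (-20, -12, "Indented"),
    (-20, -12, "extract"),
]
print(guess_left_margin(absolute_positions(runs)))
```

The guess is only a majority vote over line starts. A page that is mostly
indented text, set in columns, or centred would give a wrong answer, which is
the "not always possible" part.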

If someone really could write a tool that extracted all or part of the text 
in a PDF to an editable format with all (or most) of the formatting intact, 
he or she would make a fortune in no time. Translators all over the world would 
happily shell out hundreds of dollars for a copy - trust me, I'm a translator 
and PDFs are a major headache to the translator community.

There is some scanning/OCR software around that does a halfway decent job, 
but a lot of post-editing is still necessary.

Olav
