List: openoffice-users
Subject: Re: [users] Converting document in PDF format to text format for printing.
From: Olav Pettershagen <olav.pettershagen () trysil ! online ! no>
Date: 2002-10-25 15:31:19
Message-ID: 200210251731.19753.olav.pettershagen () trysil ! online ! no
On Friday 25 October 2002 15:20, Hal Vaughan wrote:
> While it's possible to extract text from a .pdf file, it's almost
> impossible to extract it with margins intact and completely formatted. In
> a .pdf file, characters are placed in position relative to the last
> character (not always, but it's this way in all .pdf files I've examined).
> While it would be possible to try to guess the margins, it's not always
> possible. I spent 2 weeks earlier this year trying to find a way to
> extract text from a .pdf, do a mailmerge on the text, and dump it back into
> a new .pdf with the same formatting to print. I came to the conclusion it
> would take a long time and serious programming to come close -- and that it
> would be almost impossible to guarantee that the extracted data kept the
> same formatting.
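
The margin-guessing problem described above can be sketched roughly as follows. This is a simplified model, not real PDF parsing: it assumes each text fragment arrives as a hypothetical (dx, dy, text) tuple whose offsets are relative to the previous fragment, and the function names and the tolerance parameter are made up for illustration. It also shows why the guess is only a guess: an indented line simply loses the vote.

```python
from collections import Counter


def absolute_positions(fragments):
    """Convert (dx, dy, text) offsets, each relative to the previous
    fragment, into absolute (x, y, text) positions. Assumed data model."""
    x = y = 0.0
    placed = []
    for dx, dy, text in fragments:
        x += dx
        y += dy
        placed.append((x, y, text))
    return placed


def guess_left_margin(placed, tolerance=1.0):
    """Guess the left margin as the most common line-start x.

    Fragments whose y values fall in the same tolerance bucket are
    treated as one line; the smallest x on each line is its start.
    """
    line_starts = {}
    for x, y, _ in placed:
        key = round(y / tolerance)
        line_starts[key] = min(x, line_starts.get(key, x))
    buckets = Counter(round(x / tolerance) for x in line_starts.values())
    return buckets.most_common(1)[0][0] * tolerance


# Hypothetical fragments: two lines starting at x=72, one indented to x=90.
fragments = [
    (72, 700, "Dear"),
    (30, 0, "customer,"),
    (-30, -14, "your"),
    (28, 0, "order"),
    (-10, -14, "indented"),
]
placed = absolute_positions(fragments)
margin = guess_left_margin(placed)
```

With only a few lines, or with centred or multi-column text, the most common start position may not be the real margin at all, which is the difficulty the quoted message points to.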
If someone really could write a tool that extracted all or part of the text
in a PDF to an editable format with all (or most) of the formatting intact,
he or she would make a fortune in no time. Translators all over the world would
happily shell out hundreds of dollars for a copy. Trust me, I'm a translator,
and PDFs are a major headache for the translator community.
There is some scanning/OCR software around that does a halfway decent job,
but a lot of post-editing is still necessary.
Olav