'Re: Review Request 114632: Improve pdf title extraction'

List:       kde-devel
Subject:    Re: Review Request 114632: Improve pdf title extraction
From:       "Albert Astals Cid" <aacid () kde ! org>
Date:       2014-01-07 21:10:13
Message-ID: 20140107211013.28120.43712 () probe ! kde ! org

> On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
> > Hm, you broke the comment :)
> 
> Luis Silva wrote:
> What do you mean? It all works fine here.
> 
> Christoph Feck wrote:
> Yes, because the compiler does not read comments.
> 
> Thomas Lübking wrote:
> Aside from this, the approach seems too naive?
> DOIs have a defined structure, leading "doi: 10" (ignoring the case and making
> colon and whitespace optional) and in general the "problematic" tokens will have a
> massive digit overhead - so this could be used as additional test ( < 25 &&
> looksLikeIndex())
> 
> Luis Silva wrote:
> @Christoph: Just (finally) understood what you meant with "breaking the comment". I
> uploaded a new patch that (hopefully) fixes the issue in the correct way.
> @Thomas: The approach was meant to be naive. In this simple form, this patch takes
> care of all index-like cases as well as most other short garbage titles without
> further parsing. What would be the point of actually knowing if a very short title
> was actually a doi or an index?
> 
> Thomas Lübking wrote:
> echo "The Lord of the Rings" | wc -m
> 22
> 
> And that's not a short title - not to mention the typical Stephen King ("It") or
> other languages that use hanzi, kanji or hanja and will never meet your arbitrary 25
> glyph requirement. Though many academic papers (in western cultures at least) in
> fact have clumsy long titles, that doesn't hold for other document types.
> 
> OTOH, if the "title" (=index) is some (md5, sha*) hash of the text, that will
> easily outnumber 25 glyphs.
> 
> So the more honest solution seems to be to just omit the title field altogether.
> 
> The alternative (don't know how expensive the document scan is) would be to check
> whether the title field seems like reasonable text, which could involve the digit
> ratio, the longest non-digit sequence ("0x12a21f56ea5") and maybe whether there's
> any digitless word at all.
> 
> Albert Astals Cid wrote:
> Honestly I don't even know why there is the rule for needing a space, looking at my
> shelf of books I can see "Cryptonomicon", "Azogue", "Portico", "Hyperion",
> "Endymion", "1984", and then I have stopped. Please, don't try to be that
> clever, I can understand if you want to rule out stuff like "Microsoft Word -
> something.doc", but imho you're already being too broad with the rule of "it
> includes microsoft". What if I have a manual about "Microsoft Visual Basic"?
> 
> Honestly, omitting or mangling the title is a very bad thing to do. If you have a
> sensible thing to run over the 1500 test pdf files I have here I'm happy to help.
> 
> Christoph Feck wrote:
> Would it make sense to refactor the code to use the (PDF supplied) document title,
> and, if for whatever reason it is believed to be wrong, append the extracted text
> that is believed to be a better title?
> 
> Luis Silva wrote:
> I can see the point Albert is making that when a pdf has a short (but valid)
> pdftitle and an unparseable first page the resulting extracted title will be
> gibberish. I also agree that mangling the title just because it seemed to be small
> is unacceptable. I must admit that I did not think about the cases of hanzi, kanji
> or hanja for which this patch would systematically force the parsing of the first
> page of the document.
> 
> The issue here is when the pdftitle does not match the real document title. In my
> database of academic papers (700+) this happens a lot. Most of my other documents
> are either prints to pdf, documents generated from their LaTeX source or Word
> documents converted to pdf, most (90%) of which lack a pdftitle and so have to be
> parsed anyway. From my experience this is a typical situation, at least amongst
> academics.
> 
> Of course, the best operating solution must cater for the most common personas, not
> just academics, but in your experience, what would that be?

I'm with Christoph here, not sure what the use case for this is, but would it be
possible to add the extra information instead of replacing it? Maybe even in a second
field? Like "title" and "thingwethinkmaybethetitle"?
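
Something along these lines, maybe. This is only a rough sketch in plain C++, not
the actual popplerextractor.cpp code: TitleInfo and pickTitles() are made-up names,
looksLikeIndex() borrows the name Thomas used above, and the "doi 10" prefix check
and the 50% digit ratio are arbitrary guesses based on his suggestions.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical result type: the PDF-supplied title is never replaced,
// the first-page guess is only stored next to it.
struct TitleInfo {
    std::string title;        // as found in the PDF metadata
    std::string guessedTitle; // "thingwethinkmaybethetitle", empty if no guess
};

// Hypothetical heuristic: true if the string looks like a DOI or an index
// number rather than a real title. The digit ratio threshold is a guess.
bool looksLikeIndex(const std::string &s)
{
    // Lower-cased copy without whitespace and colons, so that
    // "DOI: 10.1000/182" and "doi10.1000/182" are treated alike.
    std::string compact;
    for (unsigned char c : s) {
        if (!std::isspace(c) && c != ':')
            compact += static_cast<char>(std::tolower(c));
    }
    if (compact.empty())
        return false;

    // DOIs start with "doi" followed by the "10" directory indicator.
    if (compact.rfind("doi10", 0) == 0)
        return true;

    // Otherwise flag strings in which more than half the characters are digits.
    const auto digits = std::count_if(compact.begin(), compact.end(),
        [](unsigned char c) { return std::isdigit(c) != 0; });
    return static_cast<std::size_t>(digits) * 2 > compact.size();
}

// Keep the PDF title as it is; add the first-page guess as a second field
// only when the PDF title is missing or looks like an index.
TitleInfo pickTitles(const std::string &pdfTitle, const std::string &firstPageGuess)
{
    TitleInfo info{pdfTitle, ""};
    if (pdfTitle.empty() || looksLikeIndex(pdfTitle))
        info.guessedTitle = firstPageGuess;
    return info;
}
```

Note that "1984" trips the heuristic here, but since pickTitles("1984", ...) still
returns "1984" as the title and only fills in the second field, a false positive
does not lose anything.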

- Albert

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
-----------------------------------------------------------

On Jan. 6, 2014, 5:47 p.m., Luis Silva wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/114632/
> -----------------------------------------------------------
> 
> (Updated Jan. 6, 2014, 5:47 p.m.)
> 
> 
> Review request for Baloo and Vishesh Handa.
> 
> 
> Repository: kfilemetadata
> 
> 
> Description
> -------
> 
> A good portion of scientific papers in my collection had a doi or an index number
> in the title. These are in general short strings, shorter than the real
> title. I improve extraction of titles from pdfs by setting a minimum size below
> which parsing of the first page is forced. The cut-off size is arbitrarily set to
> 25 characters (three "big words").
> 
> Diffs
> -----
> 
> src/extractors/popplerextractor.cpp b056581f51d10b632799586eed3cc15ac539fe80 
> 
> Diff: https://git.reviewboard.kde.org/r/114632/diff/
> 
> 
> Testing
> -------
> 
> This improved the title extraction on my pdf collection of scientific papers by
> quite a lot.
> 
> Thanks,
> 
> Luis Silva
> 
> 

