List: kde-devel
Subject: Re: Review Request 114632: Improve pdf title extraction
From: "Luis Silva" <lacsilva () gmail ! com>
Date: 2014-01-15 15:29:47
Message-ID: 20140115152947.14108.18673 () probe ! kde ! org
> On Dec. 26, 2013, 1:57 a.m., Christoph Feck wrote:
> > Hm, you broke the comment :)
>
> On Jan. 6, 2014, 3:24 p.m., Luis Silva wrote:
> What do you mean? It all works fine here.
>
> On Jan. 6, 2014, 3:50 p.m., Christoph Feck wrote:
> Yes, because the compiler does not read comments.
>
> On Jan. 6, 2014, 4:11 p.m., Thomas Lübking wrote:
> Aside from this, the approach seems too naive.
> DOIs have a defined structure, a leading "doi: 10" (ignoring case and making the
> colon and whitespace optional), and in general the "problematic" tokens will have a
> massive digit overhead, so this could be used as an additional test
> ( < 25 && looksLikeIndex() ).
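
A minimal sketch of the combined test suggested above, written against plain std::string rather than the Qt types the real extractor uses. The function names other than looksLikeIndex(), the "more than half digits" threshold, and the byte-based length check are assumptions made for illustration; none of this is taken from the actual patch.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: true if the string starts with "doi" (any case),
// then an optional colon, optional whitespace, and then "10".
bool looksLikeDoi(const std::string &title)
{
    const std::string prefix = "doi";
    if (title.size() < prefix.size())
        return false;
    std::size_t i = 0;
    for (; i < prefix.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(title[i])) != prefix[i])
            return false;
    }
    if (i < title.size() && title[i] == ':')
        ++i;
    while (i < title.size() && std::isspace(static_cast<unsigned char>(title[i])))
        ++i;
    return title.compare(i, 2, "10") == 0;
}

// Sketch of looksLikeIndex(): a DOI prefix, or digits making up more than
// half of the non-space characters (the 50% threshold is an assumption).
bool looksLikeIndex(const std::string &title)
{
    if (looksLikeDoi(title))
        return true;
    std::size_t digits = 0, total = 0;
    for (unsigned char c : title) {
        if (std::isspace(c))
            continue;
        ++total;
        if (std::isdigit(c))
            ++digits;
    }
    return total > 0 && 2 * digits > total;
}

// The combined condition from the comment: short AND index-like.
bool shouldParseFirstPage(const std::string &title)
{
    return title.size() < 25 && looksLikeIndex(title);
}
```

With this sketch, short but ordinary titles such as "It" are left alone, while an all-digit title such as "1984" would still be treated as an index, so a digit test alone does not remove every false positive.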
>
> On Jan. 6, 2014, 5:46 p.m., Luis Silva wrote:
> @Christoph: I (finally) understood what you meant by "breaking the comment". I
> uploaded a new patch that (hopefully) fixes the issue in the correct way.
> @Thomas: The approach was meant to be naive. In this simple form, the patch takes
> care of all index-like cases, as well as most other short garbage titles, without
> further parsing. What would be the point of knowing whether a very short title was
> actually a DOI or an index?
>
> On Jan. 6, 2014, 8:23 p.m., Thomas Lübking wrote:
> echo "The Lord of the Rings" | wc -m
> 22
>
> And that's not a short title, not to mention the typical Stephen King ("It") or
> languages that use hanzi, kanji or hanja, whose titles will never meet your
> arbitrary 25-glyph requirement. Though many academic papers (in western cultures at
> least) do in fact have clumsy long titles, that doesn't hold for other document types.
> OTOH, if the "title" (=index) is some (md5, sha*) hash of the text, it will
> easily exceed 25 glyphs.
> So the more honest solution seems to be to just omit the title field altogether.
>
> The alternative (I don't know how expensive the document scan is) would be to check
> whether the title field looks like reasonable text, which could involve the digit
> ratio, the longest non-digit sequence ("0x12a21f56ea5") and maybe whether there's
> any digitless word at all.
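
The three signals named above could be computed roughly as follows. This is only a sketch under stated assumptions: it works on bytes of a std::string (so multi-byte UTF-8 glyphs count as several non-digit characters), and the struct, the function names and the cut-off values in seemsLikeReasonableText() are hypothetical, not part of kfilemetadata.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>

struct TitleStats {
    double digitRatio;              // digits / non-space characters
    std::size_t longestNonDigitRun; // longest run with no digit and no space
    bool hasDigitlessWord;          // some whitespace-separated word has no digit
};

TitleStats analyseTitle(const std::string &title)
{
    std::size_t digits = 0, total = 0, run = 0, longest = 0;
    for (unsigned char c : title) {
        if (std::isspace(c)) {
            run = 0;
            continue;
        }
        ++total;
        if (std::isdigit(c)) {
            ++digits;
            run = 0;
        } else {
            ++run;
            longest = std::max(longest, run);
        }
    }

    bool digitless = false;
    std::istringstream words(title);
    std::string word;
    while (words >> word) {
        if (std::none_of(word.begin(), word.end(),
                         [](unsigned char c) { return std::isdigit(c) != 0; })) {
            digitless = true;
            break;
        }
    }

    const double ratio = total ? static_cast<double>(digits) / total : 0.0;
    return {ratio, longest, digitless};
}

// Assumed cut-offs: at least one digitless word, fewer than half digits,
// and a non-digit run of at least two characters.
bool seemsLikeReasonableText(const std::string &title)
{
    const TitleStats s = analyseTitle(title);
    return s.hasDigitlessWord && s.digitRatio < 0.5 && s.longestNonDigitRun >= 2;
}
```

Under these assumptions "0x12a21f56ea5" is rejected (no digitless word, longest non-digit run of two), while "It" and "The Lord of the Rings" pass. A title such as "1984" is still rejected, which is the kind of case raised later in the thread.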
>
> On Jan. 6, 2014, 8:43 p.m., Albert Astals Cid wrote:
> Honestly, I don't even know why there is a rule requiring a space. Looking at my
> shelf of books I can see "Cryptonomicon", "Azogue", "Portico", "Hyperion",
> "Endymion", "1984", and then I stopped. Please don't try to be that
> clever. I can understand if you want to rule out stuff like "Microsoft Word -
> something.doc", but imho you're already being too broad with the rule "it
> includes microsoft". What if I have a manual about "Microsoft Visual Basic"?
> Honestly, omitting or mangling the title is a very bad thing to do. If you have
> something sensible to run over the 1500 test pdf files I have here, I'm happy to help.
>
> On Jan. 6, 2014, 10:16 p.m., Christoph Feck wrote:
> Would it make sense to refactor the code to use the (PDF-supplied) document title
> and, if for whatever reason it is believed to be wrong, append the extracted text
> that is believed to be a better title?
>
> On Jan. 6, 2014, 11:17 p.m., Luis Silva wrote:
> I can see Albert's point: when a pdf has a short (but valid) pdftitle and an
> unparseable first page, the resulting extracted title will be gibberish. I also
> agree that mangling the title just because it seemed to be short is unacceptable.
> I must admit that I did not think about hanzi, kanji or hanja titles, for which
> this patch would systematically force parsing of the first page of the document.
> The issue here is when the pdftitle does not match the real document title. In my
> database of academic papers (700+) this happens a lot. Most of my other documents
> are either prints to pdf, documents generated from their LaTeX source or Word
> documents converted to pdf, most (90%) of which lack a pdftitle and so have to be
> parsed anyway. In my experience this is a typical situation, at least among
> academics. Of course, the best solution must cater for the most common personas,
> not just academics, but in your experience, what would those be?
>
> On Jan. 7, 2014, 9:10 p.m., Albert Astals Cid wrote:
> I'm with Christoph here. I'm not sure what the use case for this is, but would it be
> possible to add the extra information instead of replacing it? Maybe even in a
> second field? Like "title" and "thingwethinkmaybethetitle"?
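
One way to read the "second field" idea is to never overwrite the PDF-supplied title and to store the first-page guess alongside it. The sketch below uses a plain std::map as a stand-in for whatever result container the extractor really fills; the field name "guessedTitle" and the function name are invented for illustration.

```cpp
#include <map>
#include <string>

// Stand-in for the extractor's result container (an assumption).
using Metadata = std::map<std::string, std::string>;

// Keep the embedded title untouched; add the extracted guess as a separate
// field only when it exists and differs from the embedded title.
Metadata buildTitleFields(const std::string &pdfTitle,
                          const std::string &extractedTitle)
{
    Metadata fields;
    if (!pdfTitle.empty())
        fields["title"] = pdfTitle;
    if (!extractedTitle.empty() && extractedTitle != pdfTitle)
        fields["guessedTitle"] = extractedTitle;
    return fields;
}
```

With this layout a wrong guess costs nothing, because the original title is still there, and a consumer can decide for itself which of the two fields to show.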
>
> On Jan. 8, 2014, 10:39 a.m., Vishesh Handa wrote:
> The more I think about this, the more I realize that this is really not required.
>
> Use Cases -
>
> 1. Viewing the title - The title can currently only be seen via the Dolphin sidebar.
> 2. Searching - It currently makes no difference whether the text is in the title or
> in the plain text. Both are currently given the same priority. In the future we could
> give the title (or any other field) a higher priority, but that has not been done.
>
> Given that the only real use case is (1), and it is debatable whether Dolphin users
> will actually care, perhaps we could remove this altogether. This could be implemented
> in a specialized application like Conquirere, which is built for research papers.

I agree with Vishesh. If the document text is indeed being extracted, then it
should not matter for simple searches. But what about the cases where the document
text cannot be extracted? I will withdraw this patch and submit a functionally
equivalent one to Conquirere.

- Luis
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
-----------------------------------------------------------
On Jan. 6, 2014, 5:47 p.m., Luis Silva wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/114632/
> -----------------------------------------------------------
>
> (Updated Jan. 6, 2014, 5:47 p.m.)
>
>
> Review request for Baloo and Vishesh Handa.
>
>
> Repository: kfilemetadata
>
>
> Description
> -------
>
> A good portion of the scientific papers in my collection had a DOI or an index number
> in the title field. These are in general short strings, shorter than the real
> title. I improve the extraction of titles from PDFs by setting a minimum length below
> which parsing of the first page is forced. The cut-off is arbitrarily set to
> 25 characters (three "big words").
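
The rule in the description amounts to a single length comparison. The following is an illustrative sketch only, not the code from the diff: it assumes a UTF-8 std::string instead of the extractor's Qt strings, counts code points rather than bytes, and uses made-up names for the constant and the functions.

```cpp
#include <cstddef>
#include <string>

// Cut-off quoted in the description; the constant name is made up here.
constexpr std::size_t kMinTitleLength = 25;

// Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
std::size_t countCodePoints(const std::string &utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80)
            ++n;
    }
    return n;
}

// True when the embedded title is shorter than the cut-off, i.e. when the
// first page would be parsed for a title instead.
bool forcesFirstPageParse(const std::string &pdfTitle)
{
    return countCodePoints(pdfTitle) < kMinTitleLength;
}
```

Note that "The Lord of the Rings" is 21 characters long (wc -m reports 22 because echo appends a newline), so a perfectly valid title of that length falls below the cut-off just like a DOI does.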
>
> Diffs
> -----
>
> src/extractors/popplerextractor.cpp b056581f51d10b632799586eed3cc15ac539fe80
>
> Diff: https://git.reviewboard.kde.org/r/114632/diff/
>
>
> Testing
> -------
>
> This improved the title extraction on my pdf collection of scientific papers by
> quite a lot.
>
> Thanks,
>
> Luis Silva
>
>