List: kde-devel
Subject: Re: Review Request 114632: Improve pdf title extraction
From: "Christoph Feck" <christoph () maxiom ! de>
Date: 2013-12-26 1:57:18
Message-ID: 20131226015718.4833.56590 () probe ! kde ! org
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://git.reviewboard.kde.org/r/114632/#review46156
-----------------------------------------------------------
Hm, you broke the comment :)
- Christoph Feck
On Dec. 23, 2013, 4:14 p.m., Luis Silva wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://git.reviewboard.kde.org/r/114632/
> -----------------------------------------------------------
>
> (Updated Dec. 23, 2013, 4:14 p.m.)
>
>
> Review request for Baloo and Vishesh Handa.
>
>
> Repository: kfilemetadata
>
>
> Description
> -------
>
> A good portion of the scientific papers in my collection had a DOI or an index
> number in the title. These are generally short strings, shorter than the real
> title. I improve extraction of titles from PDFs by setting a minimum length below
> which parsing of the first page is forced. The cut-off length is arbitrarily set to
> 25 characters (three "big words").
>
> Diffs
> -----
>
> src/extractors/popplerextractor.cpp b056581f51d10b632799586eed3cc15ac539fe80
>
> Diff: https://git.reviewboard.kde.org/r/114632/diff/
>
>
> Testing
> -------
>
> This improved the title extraction on my PDF collection of scientific papers by
> quite a lot.
>
> Thanks,
>
> Luis Silva
>
>
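
The heuristic in the description above (fall back to the first page when the metadata title is shorter than the cut-off) might be sketched as follows. This is only an illustration under stated assumptions, not the code from the patch: the names `kMinTitleLength`, `needsFirstPageFallback`, and `chooseTitle` are hypothetical, and `std::string` stands in for whatever string type the extractor actually uses, so length is counted in bytes here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Cut-off from the description: 25 characters. The constant name is hypothetical.
constexpr std::size_t kMinTitleLength = 25;

// Returns true when the metadata title is shorter than the cut-off,
// i.e. when it is likely a DOI or index number rather than a real title.
// Note: std::string::size() counts bytes, not user-perceived characters.
bool needsFirstPageFallback(const std::string& metadataTitle) {
    return metadataTitle.size() < kMinTitleLength;
}

// Chooses between the metadata title and a title guessed from the first page.
// The first-page guess is used only when the metadata title is too short
// and the guess is non-empty; otherwise the metadata title is kept.
std::string chooseTitle(const std::string& metadataTitle,
                        const std::string& firstPageTitle) {
    if (needsFirstPageFallback(metadataTitle) && !firstPageTitle.empty()) {
        return firstPageTitle;
    }
    return metadataTitle;
}
```

With this sketch, a metadata title such as `10.1000/182` falls below the cut-off and the first-page guess is preferred, while a title of 25 or more characters is kept unchanged.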