This is an automatically generated e-mail. To reply, visit: https://git.reviewboard.kde.org/r/114632/

Hm, you broke the comment :)

- Christoph Feck


On December 23rd, 2013, 4:14 p.m. UTC, Luis Silva wrote:

Review request for Baloo and Vishesh Handa.
By Luis Silva.

Updated Dec. 23, 2013, 4:14 p.m.

Repository: kfilemetadata

Description

A good portion of the scientific papers in my collection had a DOI or an index number in the title. These are generally short strings, shorter than the real title.
This patch improves title extraction from PDFs by setting a minimum length below which parsing of the first page is forced.
The cut-off is arbitrarily set to 25 characters (roughly three "big words").
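
The heuristic described above can be sketched roughly as follows. This is not the code from the diff: it is a minimal illustration in plain standard C++, and the names `kMinTitleLength`, `chooseTitle` and `parseFirstPage` are hypothetical. It also assumes that the stored title is kept when the first-page parse finds nothing, which the description does not state.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Cut-off from the description: titles shorter than this are treated as
// suspect (likely a DOI or an index number rather than the real title).
// Note: std::string::size() counts bytes, not characters.
constexpr std::size_t kMinTitleLength = 25;

// Hypothetical helper: returns the stored title when it is long enough,
// otherwise forces a parse of the first page via the supplied callback.
std::string chooseTitle(const std::string& storedTitle,
                        const std::function<std::string()>& parseFirstPage)
{
    if (storedTitle.size() >= kMinTitleLength) {
        return storedTitle;
    }
    const std::string parsed = parseFirstPage();
    // Assumption: fall back to the stored title if the parse yields nothing.
    return parsed.empty() ? storedTitle : parsed;
}
```

Passing the first-page parser as a callback means the more expensive page parse only runs when the stored title falls below the cut-off.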

Testing

This considerably improved title extraction on my collection of scientific-paper PDFs.

Diffs

  • src/extractors/popplerextractor.cpp (b056581f51d10b632799586eed3cc15ac539fe80)

Diff: https://git.reviewboard.kde.org/r/114632/diff/
