Hm, you broke the comment :)
- Christoph Feck
On December 23rd, 2013, 4:14 p.m. UTC, Luis Silva wrote:
Review request for Baloo and Vishesh Handa.
By Luis Silva.
Updated Dec. 23, 2013, 4:14 p.m.
Repository:
kfilemetadata
Description
A good portion of the scientific papers in my collection have a DOI or an index number in the title. These are generally short strings, shorter than the real title.
This patch improves title extraction from PDFs by setting a minimum length below which parsing of the first page is forced.
The cut-off is set arbitrarily to 25 characters (roughly three "big words").
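For readers without the diff at hand, the idea can be sketched roughly as follows. This is not the code from the patch: chooseTitle, kMinTitleLength and the parseFirstPage callback are hypothetical names used only for illustration, and it uses the standard library rather than Poppler or Qt so that it stands alone. Keeping the metadata title when the first page yields nothing is also an assumption.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical constant mirroring the 25-character cut-off described above.
constexpr std::size_t kMinTitleLength = 25;

// Returns the metadata title when it is at least kMinTitleLength long.
// Otherwise the caller-supplied first-page parser (a stand-in for the real
// extraction code) is forced to run and its result is preferred.
// Note: std::string::size() counts bytes, not Unicode characters.
std::string chooseTitle(const std::string& metadataTitle,
                        const std::function<std::string()>& parseFirstPage)
{
    if (metadataTitle.size() >= kMinTitleLength) {
        return metadataTitle;
    }
    const std::string parsed = parseFirstPage();
    // Assumption: fall back to the short metadata title if parsing found nothing.
    return parsed.empty() ? metadataTitle : parsed;
}
```

With this sketch, a metadata title such as "10.1000/182" is shorter than the cut-off, so the result of the first-page parser is used instead.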
Testing
This noticeably improved title extraction on my collection of scientific-paper PDFs.
Diffs
- src/extractors/popplerextractor.cpp (b056581f51d10b632799586eed3cc15ac539fe80)