From kde-core-devel Thu Dec 26 17:56:16 2013
From: Martin Sandsmark
Date: Thu, 26 Dec 2013 17:56:16 +0000
To: kde-core-devel
Subject: Language detection in Sonnet
Message-Id: <20131226175616.GA4884 () viritrilbia ! samfundet ! no>
X-MARC-Message: https://marc.info/?l=kde-core-devel&m=138808061808894

Dear esteemed sirs and madams,

I have spent the last couple of days re-merging an old branch for Sonnet
that enables language detection.

Simple, high-level overview of what is done: replace the filter class with a
(proper) tokenizer, using our own languagebreaks class because
QTextBoundaryFinder is broken beyond hope of salvation (imho), and implement
language recognition.

The language recognition is performed in three major steps:
 1. Looking at the script types used (QChar::script()).
 2. Trigram-based model (I abandoned the "most significant words" algorithm
    for reasons).
 3. Pure brute force on all available spelling backends (the one with the
    fewest errors is chosen).

In this branch I have also removed some dead code and whatnot.

So if you wouldn't mind, please take a look at the "langdet" branch of
Sonnet, and send any and all feedback.

https://projects.kde.org/projects/frameworks/sonnet/repository/revisions/langdet/show/

--
Martin Sandsmark
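
As a rough illustration of the three detection steps listed in the message above, here is a minimal Python sketch of such a cascade. It is not the code in the langdet branch and does not use Sonnet's API: the sample sentences, the word-list "dictionaries" standing in for spelling backends, and every function name below are hypothetical, and the trigram scoring is a deliberately naive overlap count.

```python
import unicodedata
from collections import Counter

# Hypothetical toy data: one short sample sentence per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps",
    "no": "den raske brune reven hopper over den late hunden og så sover hunden",
    "ru": "быстрая коричневая лиса прыгает через ленивую собаку",
}
# Hypothetical mapping from language to the script it is written in.
SCRIPTS = {"en": "LATIN", "no": "LATIN", "ru": "CYRILLIC"}


def script_of(ch):
    # Approximate the script by the first word of the Unicode character name,
    # e.g. "LATIN" or "CYRILLIC" (a stand-in for QChar::script()).
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def trigrams(text):
    # Count all three-character windows of the lowercased, space-padded text.
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


# Toy trigram profiles and word lists built from the samples above.
PROFILES = {lang: trigrams(sample) for lang, sample in SAMPLES.items()}
DICTIONARIES = {lang: set(sample.split()) for lang, sample in SAMPLES.items()}


def candidates_by_script(text):
    # Step 1: keep only languages whose script occurs in the text.
    used = {script_of(ch) for ch in text if ch.isalpha()}
    return [lang for lang, script in SCRIPTS.items() if script in used]


def trigram_score(text, lang):
    # Step 2: number of trigram occurrences in the text that the profile knows.
    profile = PROFILES[lang]
    return sum(n for gram, n in trigrams(text).items() if gram in profile)


def misspelled(text, lang):
    # Step 3 helper: count words missing from the language's word list.
    return sum(1 for word in text.lower().split()
               if word not in DICTIONARIES[lang])


def detect(text):
    langs = candidates_by_script(text)
    if not langs:
        return None
    if len(langs) == 1:
        return langs[0]
    scores = {lang: trigram_score(text, lang) for lang in langs}
    best = max(scores.values())
    top = [lang for lang in langs if scores[lang] == best]
    if len(top) == 1:
        return top[0]
    # Step 3: brute force over the tied candidates, fewest errors wins.
    return min(top, key=lambda lang: misspelled(text, lang))
```

With this toy data, detect("собака") is settled by the script check alone, while detect("the dog sleeps") and detect("hunden sover") both pass the Latin-script filter and are separated by the trigram scores. The spell-check pass only runs when the trigram scores tie.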