List:       kde-core-devel
Subject:    Language detection in Sonnet
From:       Martin Sandsmark <martin.sandsmark () kde ! org>
Date:       2013-12-26 17:56:16
Message-ID: 20131226175616.GA4884 () viritrilbia ! samfundet ! no

Dear esteemed sirs and madams,

I have spent the last couple of days re-merging an old branch for
Sonnet that enables language detection.

Simple, high-level overview of what is done: the filter class is replaced with
a (proper) tokenizer, which uses our own languagebreaks class because
QTextBoundaryFinder is broken beyond hope of salvation (imho), and language
recognition is implemented on top of it.

The language recognition is performed in three major steps:
    1. Looking at the script types used (QChar::script()).
    2. Trigram-based model (I abandoned the "most significant words"
        algorithm for reasons).
    3. Pure brute force over all available spelling backends (the one with
        the fewest errors is chosen).
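
To make the three steps a bit more concrete, here is a rough sketch of how
they could fit together. It is written in Python for brevity and is not the
code in the branch: the function names, the toy language data, the
`min_margin` threshold, and the way each step falls through to the next are
all assumptions made for illustration. Unicode character names stand in for
QChar::script(), and plain word sets stand in for the spelling backends.

```python
import unicodedata
from collections import Counter

def scripts_used(text):
    """Step 1: collect a rough script name for every letter in the text.
    Uses the first word of the Unicode character name (e.g. "LATIN",
    "CYRILLIC") as a stand-in for QChar::script()."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

def trigrams(text):
    """Count overlapping three-character sequences, padded with spaces."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def trigram_score(text, profile):
    """Step 2: fraction of the text's trigrams that occur in a profile."""
    counts = trigrams(text)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for t, n in counts.items() if t in profile) / total

def spelling_errors(words, dictionary):
    """Step 3 helper: number of words a (hypothetical) backend rejects."""
    return sum(1 for w in words if w.lower() not in dictionary)

def detect(text, languages, min_margin=0.1):
    """Assumed cascade: script filter, then trigrams, then brute force."""
    used = scripts_used(text)
    candidates = [c for c, lang in languages.items() if lang["script"] in used]
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]
    # Step 2: rank the remaining candidates by trigram overlap.
    scored = sorted(
        ((trigram_score(text, trigrams(languages[c]["sample"])), c)
         for c in candidates),
        reverse=True,
    )
    if scored[0][0] - scored[1][0] >= min_margin:
        return scored[0][1]
    # Step 3: no clear winner, so pick the dictionary with the fewest errors.
    words = text.split()
    return min(candidates,
               key=lambda c: spelling_errors(words, languages[c]["words"]))

# Toy data, purely illustrative; real profiles would come from large corpora.
LANGS = {
    "en": {"script": "LATIN",
           "sample": "the quick brown fox jumps over the lazy dog and then the other one",
           "words": {"the", "dog", "and", "fox"}},
    "nb": {"script": "LATIN",
           "sample": "den raske brune reven hopper over den late hunden og så den andre",
           "words": {"den", "hunden", "og", "reven"}},
    "ru": {"script": "CYRILLIC",
           "sample": "привет мир как дела",
           "words": {"привет", "мир"}},
}
```

With this toy data, `detect("привет мир", LANGS)` is settled by the script
check alone, while a Latin-script sentence has to go on to the trigram
comparison. Raising `min_margin` forces the fallback to the brute-force
spelling check.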

In this branch I have also removed some dead code and whatnot.

So if you wouldn't mind, please take a look at the "langdet" branch of
Sonnet, and send any and all feedback.

https://projects.kde.org/projects/frameworks/sonnet/repository/revisions/langdet/show/

-- 
Martin Sandsmark