List:       kde-devel
Subject:    Re: [Sonnet] Introduction and request for guidance
From:       Karthik Periagaram <karthik.periagaram () gmail ! com>
Date:       2014-07-09 3:35:24
Message-ID: 2707649.Y7FieoKxWS () aria

Splendid! Let's get hacking!

On Tuesday, July 08, 2014 17:13:14 Martin Sandsmark wrote:
> On Tuesday 8. July 2014 04.29.57 Karthik Periagaram wrote:
> > I reported this over the weekend, having encountered it all through my grad
> > years when having automatic spell check in katepart would highlight
> > "spelling errors" all over data files. So, the goal is to make sonnet
> > smarter and avoid spell checking numbers, generally speaking.
> 
> Well, I can't really think of any "correct" words with numbers in them, so why 
> not just check if the word contains any digits? Or maybe just make the 
> tokenizer split on anything that QChar::isLetter() returns false for?

I see a problem with that approach. Splitting on any non-letter character
would mean something like "cat,1" will pass spell check just because "cat"
is spelled correctly. I think it should fail the spell check, don't you?
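
To make the concern concrete, here is a minimal sketch in plain standard C++
(no Qt). The function name splitOnNonLetters is hypothetical, and
std::isalpha stands in for QChar::isLetter(), so it only recognizes ASCII
letters in the default locale. It is an illustration of the idea, not
Sonnet's tokenizer.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split text on every character that is not a letter.
// std::isalpha is an assumed stand-in for QChar::isLetter().
std::vector<std::string> splitOnNonLetters(const std::string &text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            current += static_cast<char>(c);
        } else if (!current.empty()) {
            // A non-letter ends the current token; the non-letter is dropped.
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

With this sketch, "cat,1" yields the single token "cat", so the spell checker
never sees the ",1" part. Similarly, "1.0e+1" yields just "e", which would
then be spell checked on its own.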

So, I think I have a working fix. To play with my ideas, I've been using
this test code: http://paste.kde.org/p5vt7oqht

As you can see, the main function here uses a test string containing some
alphanumeric text, which it is supposed to split into pieces and then decide
whether each piece is a valid word. A valid word is one that will be sent
to the spell checker. An invalid word will not be sent and effectively
passes the spell check. The intent is that numbers should be treated as
invalid words. Everything else gets spell checked.

The first big change I would like to propose is to use Line instead of Word
as the boundary type for the QTextBoundaryFinder. Using Word breaks 1.0e+1
into three words: 1.0e, + and 1. Using Line, on the other hand, finds
boundaries only at spaces and hyphenation points (Qt Assistant describes
them as places where text can break into multiple lines). You can run this
program with both Line and Word and see the difference in the output. I
think you'll agree that Line is very much the behavior we want.

Quick CMakeLists.txt file to build the file above (I named the cpp file as test.cpp):
http://paste.kde.org/pecdtlxcu

The findNextWord() function is simple enough. It just finds the next word
and returns a boolean: true if it found one, false when there are no more
words. The next change I want to propose is in the isValidWord() function.
Here, I first check whether the word can be converted to a double. This
catches exponential notation, which would otherwise slip past the
subsequent test because of the letter "e". Otherwise, I check for the
presence of at least one letter in the word. If there is one, the word will
be sent to the spell checker. I'm not testing for an empty string in this
test code, but that check should probably be retained in the sonnet code to
handle the empty buffer case.
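
For readers without access to the paste, the check described above could be
sketched roughly as follows in plain standard C++. This is not the pasted
test code: std::strtod stands in for the conversion to double, std::isalpha
stands in for a letter test, and looksLikeNumber is a hypothetical helper
name. The requirement of at least one digit is an extra assumption added
here, because std::strtod also accepts "inf" and "nan".

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if the whole token parses as a floating-point
// number, e.g. "42", "-3.14" or "1.0e+1".
bool looksLikeNumber(const std::string &token)
{
    // Assumption: require a digit, since strtod also accepts "inf" and "nan".
    bool hasDigit = false;
    for (unsigned char c : token) {
        if (std::isdigit(c)) {
            hasDigit = true;
            break;
        }
    }
    if (!hasDigit) {
        return false;
    }
    char *end = nullptr;
    std::strtod(token.c_str(), &end);
    // Only a number if strtod consumed every character of the token.
    return end == token.c_str() + token.size();
}

// A "valid" word is one that should be sent to the spell checker.
bool isValidWord(const std::string &token)
{
    if (token.empty()) {
        return false; // empty buffer case
    }
    if (looksLikeNumber(token)) {
        return false; // numbers are skipped, including exponential notation
    }
    for (unsigned char c : token) {
        if (std::isalpha(c)) {
            return true; // at least one letter: spell check it
        }
    }
    return false; // no letters at all, e.g. "1,234" or "+"
}
```

Under these assumptions, "1.0e+1" and "42" are skipped, while "cat" and
"cat,1" are both sent to the spell checker, where "cat,1" can then fail.
Note that std::strtod depends on the C locale for the decimal separator.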

The output of the program is basically each word identified by
QTextBoundaryFinder, followed by whether it is a valid word or not.

So, this is my proposed algorithm for the Filter class to parse the text and
send words to the spell checker. What do you think? Do you see any obvious
flaws in my assumptions?

Longer term, I think a test string like the one I used makes for an
excellent unit test (or tests) for sonnet to check its behavior against
numbers. Shorter term, I'll prep a patch for review, if you are okay with
this approach.

> > If either are actively maintaining sonnet, I'd love to pepper you with more
> > questions!
> 
> Feel free!
> 
> And please do feel free to pepper me on IRC as well, I'm "sandsmark" there. 

Pepper inbound! But seriously, I think I'll need to put this off till the
weekend, when I'll have the time to study it carefully. I mostly wanted to
clarify the execution path and confirm I understand how the classes interact
with each other (and maybe discuss the merits/shortcomings of the design).
Also, would you suggest we do that off the mailing list, or would it be
preferable to have it officially documented (maybe for the benefit of other
potential contributors)?

> And awesome that someone else wants to contribute to Sonnet, this warms my 
> heart. :-)

K.

