List: kde-devel
Subject: Re: [Sonnet] Introduction and request for guidance
From: Karthik Periagaram <karthik.periagaram () gmail ! com>
Date: 2014-07-09 3:35:24
Message-ID: 2707649.Y7FieoKxWS () aria
Splendid! Let's get hacking!
On Tuesday, July 08, 2014 17:13:14 Martin Sandsmark wrote:
> On Tuesday 8. July 2014 04.29.57 Karthik Periagaram wrote:
> > I reported this over the weekend, having encountered it all through my grad
> > years when having automatic spell check in katepart would highlight
> > "spelling errors" all over data files. So, the goal is to make sonnet
> > smarter and avoid spell checking numbers, generally speaking.
>
> Well, I can't really think of any "correct" words with numbers in them, so why
> not just check if the word contains any digits? Or maybe just make the
> tokenizer split on anything that QChar::isLetter() returns false for?
I see a problem with that approach. Splitting on any non-letter character
would mean that something like "cat,1" passes the spell check just because
"cat" is spelled correctly. I think it should fail the spell check, don't you?
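A minimal sketch of that concern, in plain standard C++ rather than Qt.
std::isalpha stands in for QChar::isLetter() here, and splitOnNonLetters is a
hypothetical name, not anything from sonnet:

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: split `text` at every character that is not an ASCII
// letter. This only approximates a tokenizer driven by QChar::isLetter().
std::vector<std::string> splitOnNonLetters(const std::string &text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalpha(c)) {
            current.push_back(static_cast<char>(c));
        } else if (!current.empty()) {
            // A non-letter ends the current token and is itself discarded.
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}
```

With this splitting, "cat,1" yields the single token "cat", so the comma and
the digit never reach the spell checker.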
So, I think I have a working fix. To play with my ideas, I've been using this
test code: http://paste.kde.org/p5vt7oqht
As you can see, the main function uses a test string containing some
alphanumeric text. It is supposed to split that string into pieces and then
decide whether each piece is a valid word or not. A valid word is one that
will be sent to the spell checker. An invalid word will not be sent, so it
effectively passes the spell check. The intent is that numbers should be
treated as invalid words. Everything else gets spell checked.
The first big change I would like to propose is to use Line instead of Word
as the boundary type for the QTextBoundaryFinder. Using Word breaks 1.0e+1
into three words: 1.0e, + and 1. Using Line, on the other hand, finds
boundaries only at spaces and hyphenation points (Qt Assistant describes them
as places where text can break into multiple lines). You can run this program
with both Line and Word and see the difference in the output. I think you'll
agree that Line is very much the behavior we want.
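To make the effect concrete without Qt, here is a rough stand-in that breaks
only at whitespace. It is an assumption-laden approximation: it does not model
hyphenation points or any other rule QTextBoundaryFinder applies, and
splitAtWhitespace is a hypothetical name.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for Line-style boundaries: break only at whitespace.
// Real line-break rules cover more cases (e.g. hyphenation points).
std::vector<std::string> splitAtWhitespace(const std::string &text)
{
    std::istringstream stream(text);
    std::vector<std::string> pieces;
    std::string piece;
    while (stream >> piece) {   // operator>> skips runs of whitespace
        pieces.push_back(piece);
    }
    return pieces;
}
```

Under this approximation a value such as 1.0e+1 stays in one piece, so a later
step can look at the number as a whole.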
Here is a quick CMakeLists.txt to build the test code above (I named the cpp
file test.cpp):
http://paste.kde.org/pecdtlxcu
The findNextWord() function is simple enough. It just finds the next word and
returns a boolean: true if it found one, false when there are no more words.
The next change I want to propose is in the isValidWord() function. Here, I
first check whether the word can be converted to a double. This catches
exponential notation such as 1.0e+1, which contains the letter "e" and would
otherwise be treated as a word by the subsequent test. Otherwise, I check for
the presence of at least one letter in the word. If there is one, the word
will be sent to the spell checker. I'm not testing for an empty string in this
test code, but that check should probably be retained in the sonnet code to
handle the empty buffer case.
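The two checks can be sketched in standard C++ as follows. This is not the
pasted test code: std::strtod and std::isalpha stand in for whatever Qt calls
it uses, looksLikeNumber is a hypothetical helper, and the empty-string guard
is the one suggested above for the sonnet code.

```cpp
#include <cctype>
#include <cstdlib>
#include <string>

// Hypothetical helper: true if the whole word parses as a floating-point
// number. Note that std::strtod also accepts strings such as "inf" and
// "nan"; whether the Qt conversion behaves the same is not checked here.
bool looksLikeNumber(const std::string &word)
{
    if (word.empty()) {
        return false;
    }
    const char *begin = word.c_str();
    char *end = nullptr;
    (void)std::strtod(begin, &end);
    // Accept only if the parse consumed every character of the word.
    return end == begin + word.size();
}

// A "valid" word is one that should be sent to the spell checker.
bool isValidWord(const std::string &word)
{
    if (word.empty() || looksLikeNumber(word)) {
        return false;   // nothing to check, or a number (incl. 1.0e+1)
    }
    for (unsigned char c : word) {
        if (std::isalpha(c)) {
            return true;   // at least one letter: spell check it
        }
    }
    return false;   // punctuation or digit groups only, e.g. "1,000"
}
```

With these rules, "1.0e+1" and "42" are skipped, while "cat" and "cat,1" are
both sent to the spell checker, where "cat,1" can then fail.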
The output of the program is basically each word identified by
QTextBoundaryFinder, followed by whether it is a valid word or not.
So, this is my proposed algorithm for the Filter class to parse the text and
send words to the spell checker. What do you think? Do you see any obvious
flaws in my assumptions?
Longer term, I think a test string like the one I used would make an excellent
unit test (or several) for checking how sonnet handles numbers. Shorter term,
I'll prepare a patch for review if you are okay with this approach.
> > If either are actively maintaining sonnet, I'd love to pepper you with more
> > questions!
>
> Feel free!
>
> And please do feel free to pepper me on IRC as well, I'm "sandsmark" there.
Pepper inbound! But seriously, I think I'll need to put this off till the
weekend, when I'll have the time to study it carefully. I mostly wanted to
clarify the execution path and confirm that I understand how the classes
interact with each other (and maybe discuss the merits and shortcomings of the
design). Also, would you suggest we do that off the mailing list, or would it
be preferable to have it officially documented (maybe for the benefit of other
potential contributors)?
> And awesome that someone else wants to contribute to Sonnet, this warms my
> heart. :-)
K.