'Patch: Flesch reading ease'


List:       koffice-devel
Subject:    Patch: Flesch reading ease
From:       Daniel Naber <daniel.naber () t-online ! de>
Date:       2001-06-07 23:46:53

Hi,

The freeze will start soon, but I hope this can still get in? It adds a
formula to calculate the Flesch readability. To get this somewhat precise
I had to change the sentence and word count so that not every "." gets
counted as a sentence anymore. This should be an improvement for most
cases, although it's impossible to get this 100% correct.
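
A rough sketch of the sentence heuristic, written in standard C++ instead
of the Qt classes used in the patch. The name countSentences is made up for
this example, and treating ".", "?" and "!" as the sentence marks is an
assumption about what KWAutoFormat::isMark() accepts:

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Estimate the number of sentences in one paragraph.
// Assumption: '.', '?' and '!' are the only sentence marks.
std::size_t countSentences(std::string s)
{
    // Strip leading and trailing whitespace.
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos)
        return 0;                       // empty or whitespace only
    const std::size_t last = s.find_last_not_of(ws);
    s = s.substr(first, last - first + 1);

    // A paragraph without a closing mark (e.g. a headline) still counts.
    const std::string marks = ".?!";
    if (marks.find(s.back()) == std::string::npos)
        s += '.';

    // Count "..." or "?!" as a single mark.
    s = std::regex_replace(s, std::regex("[.?!]+"), ".");
    // Don't count "U.S.A." as three sentences.
    s = std::regex_replace(s, std::regex("[A-Z]\\.+"), "*");

    std::size_t sentences = 0;
    for (char c : s)
        if (c == '.')
            ++sentences;
    return sentences;
}
```

With these rules countSentences("Wait... what?!") gives 2 and a headline
without any punctuation gives 1. A sentence that ends in an abbreviation
such as "U.S.A." is not counted, which is one reason the result stays an
estimate.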

This is most useful for English, but that doesn't mean it cannot be used 
for other languages. I think the absolute Flesch scores will change then, 
but the idea is still the same and it should still work.
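
The English bias comes mostly from the syllable estimate. A much simplified
sketch of the vowel-group counting the patch builds on, in standard C++ and
without the special-case pattern lists (estimateSyllables is a name made up
for this example, not a function from the patch):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Estimate the syllables of one word by counting groups of vowels.
// Simplification: the add/subtract pattern lists of the patch are left out.
std::size_t estimateSyllables(std::string word)
{
    for (char& c : word)
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

    // A trailing 'e' is usually silent in English.
    if (!word.empty() && word.back() == 'e')
        word.pop_back();

    const std::string vowels = "aeiouy";
    std::size_t groups = 0;
    bool inVowelGroup = false;
    for (char c : word) {
        const bool isVowel = vowels.find(c) != std::string::npos;
        if (isVowel && !inVowelGroup)
            ++groups;                   // a new vowel group starts here
        inVowelGroup = isVowel;
    }
    return groups == 0 ? 1 : groups;    // every word has at least one syllable
}
```

For "number" this gives 2 and for "the" it gives 1. For the German word
"Name" (two syllables) it also gives 1, because the silent-"e" rule only
fits English. That is the kind of shift in absolute scores to expect for
other languages.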

Here's an explanation which can also be added to the documentation (this
explanation has a Flesch score of 58, BTW :-)

--- cut ---
The "Flesch reading ease" score is a number between 0 and 100 which 
estimates how readable a text is. The higher the number, the easier the 
text can be read. Texts with a score of 70-80 have a fairly good 
readability.  
  
The Flesch formula uses the number of words per sentence and the number
of syllables per word. It says nothing about grammar or meaning.
Implementations of the Flesch formula on computers are not 100% precise, as
both the number of sentences and the number of syllables are estimated. The
text should be at least 200 words long; if it isn't, the score will be
marked as approximated.
--- cut ---

Maybe someone can test this with longer texts (which I don't have).
Can I apply?

Regards
 Daniel

-- 
Daniel Naber, Paul-Gerhardt-Str. 2, 33332 Guetersloh, Germany
Tel. 05241-59371, Mobil 0170-4819674

["flesch.diff" (text/plain)]

Index: kwframe.h
===================================================================
RCS file: /home/kde/koffice/kword/kwframe.h,v
retrieving revision 1.95
diff -u -r1.95 kwframe.h
--- kwframe.h	2001/06/07 17:29:37	1.95
+++ kwframe.h	2001/06/07 23:33:39
@@ -429,7 +429,8 @@
      * The default implementation calls updateFrames() and zoom(). Call the parent :) */
     virtual void finalize();
 
-    virtual void statistics( ulong & /*charsWithSpace*/, ulong & /*charsWithoutSpace*/, ulong & /*words*/, ulong & /*sentences*/ ) {}
+    virtual void statistics( ulong & /*charsWithSpace*/, ulong & /*charsWithoutSpace*/, ulong & /*words*/,
+        ulong & /*sentences*/, ulong & /*syllables*/ ) {}
 
     KWDocument* kWordDocument() const { return m_doc; }
 
Index: kwtableframeset.cc
===================================================================
RCS file: /home/kde/koffice/kword/kwtableframeset.cc,v
retrieving revision 1.103
diff -u -r1.103 kwtableframeset.cc
--- kwtableframeset.cc	2001/06/07 17:29:37	1.103
+++ kwtableframeset.cc	2001/06/07 23:33:43
@@ -1507,10 +1507,11 @@
     }
 }
 
-void KWTableFrameSet::statistics( ulong & charsWithSpace, ulong & charsWithoutSpace, ulong & words, ulong & sentences )
+void KWTableFrameSet::statistics( ulong & charsWithSpace, ulong & charsWithoutSpace, ulong & words,
+    ulong & sentences, ulong & syllables )
 {
     for (unsigned int i =0; i < m_cells.count(); i++) {
-        m_cells.at(i)->statistics( charsWithSpace, charsWithoutSpace, words, sentences );
+        m_cells.at(i)->statistics( charsWithSpace, charsWithoutSpace, words, sentences, syllables );
     }
 }
 
Index: kwtableframeset.h
===================================================================
RCS file: /home/kde/koffice/kword/kwtableframeset.h,v
retrieving revision 1.49
diff -u -r1.49 kwtableframeset.h
--- kwtableframeset.h	2001/06/07 14:39:22	1.49
+++ kwtableframeset.h	2001/06/07 23:33:43
@@ -220,7 +220,8 @@
     virtual void zoom();
 
     /** Contribute to the document statistics */
-    virtual void statistics( ulong & charsWithSpace, ulong & charsWithoutSpace, ulong & words, ulong & sentences );
+    virtual void statistics( ulong & charsWithSpace, ulong & charsWithoutSpace, ulong & words,
+        ulong & sentences, ulong & syllables );
 
     virtual void finalize();
 
Index: kwtextframeset.cc
===================================================================
RCS file: /home/kde/koffice/kword/kwtextframeset.cc,v
retrieving revision 1.255
diff -u -r1.255 kwtextframeset.cc
--- kwtextframeset.cc	2001/06/07 17:29:37	1.255
+++ kwtextframeset.cc	2001/06/07 23:33:53
@@ -424,41 +424,84 @@
     textdoc->invalidate(); // lazy layout, real update follows upon next repaint
 }
 
-void KWTextFrameSet::statistics( ulong & charsWithSpace, ulong & charsWithoutSpace, ulong & words, ulong & sentences )
+void KWTextFrameSet::statistics( ulong & charsWithSpace, ulong & charsWithoutSpace, ulong & words,
+    ulong & sentences, ulong & syllables )
 {
+    // parts of words for better counting of syllables:
+    QStringList subs_syl; 
+    subs_syl << "cial" << "tia" << "cius" << "cious" << "giu" << "ion" << "iou" << "sia$" << "ely$";
+    QStringList add_syl;
+    add_syl << "ia" << "riet" << "dien" << "iu" << "io" << "ii" << "[aeiouym]bl$" << "[aeiou]{3}"
+		<< "^mc" << "ism$" << "([^aeiouy])\\1l$" << "[^l]lien" << "^coa[dglx]." << "[^gq]ua[^auieo]" << "dnt$";
+
     QTextParag * parag = textdoc->firstParag();
     for ( ; parag ; parag = parag->next() )
     {
         QString s = parag->string()->toString();
-        bool wordStarted = false;
-        bool sentenceStarted = false;
+
+        // Character count
         for ( uint i = 0 ; i < s.length() - 1 /*trailing-space*/ ; ++i )
         {
             QChar ch = s[i];
             ++charsWithSpace;
             if ( !ch.isSpace() )
                 ++charsWithoutSpace;
-            if ( ch.isSpace() || ch.isPunct() )
-            {
-                if ( wordStarted )
-                {
-                    ++words;
-                    wordStarted = false;
-                }
-                if ( KWAutoFormat::isMark( ch ) && sentenceStarted )
-                {
-                    ++sentences;
-                    sentenceStarted = false;
-                }
+        }
+
+        // Syllable and Word count
+        // Algorithm taken from Greg Fast's Lingua::EN::Syllable module for Perl.
+        // This guesses correct for 70-90% of English words, but the overall value
+        // is quite good, as some words get a number that's too high and others get
+        // one that's too low.
+        QRegExp re("\\s+");
+        QStringList wordlist = QStringList::split(re, s);
+        words += wordlist.count();
+       	re.setCaseSensitive(false);
+        for ( QStringList::Iterator it = wordlist.begin(); it != wordlist.end(); ++it ) {
+            int word_syllables = 0;
+            QString word = *it;
+            re.setPattern("e$");
+            word.replace(re, "");              // strip the trailing 'e' from the word, not the paragraph
+            re.setPattern("[^aeiouy]+");        // '-' should perhaps be added?
+            QStringList syls = QStringList::split(re, word);
+
+            for ( QStringList::Iterator it = subs_syl.begin(); it != subs_syl.end(); ++it ) {
+        	re.setPattern(*it);
+        	if( word.contains(re) )
+	            word_syllables--;
             }
-            else
-            {
-                wordStarted = true;
-                sentenceStarted = true;
+            for ( QStringList::Iterator it = add_syl.begin(); it != add_syl.end(); ++it ) {
+        	re.setPattern(*it);
+        	if( word.contains(re) )
+                    word_syllables++;
             }
+
+            if ( word.length() == 1 )
+	        word_syllables++;
+            word_syllables += syls.count();
+	    if ( word_syllables == 0 ) 
+                word_syllables = 1;
+	    syllables += word_syllables;
+	}
+        re.setCaseSensitive(true);
+	
+        // Sentence count
+        // Clean up for better result, destroys the original text but we only want to count
+	s = s.stripWhiteSpace();
+	QChar lastchar = s.at(s.length() - 1);   // last character, not one past the end
+	if( ! s.isEmpty() && ! KWAutoFormat::isMark( lastchar ) ) {  // e.g. for headlines
+	    s = s + ".";
+	}
+        re.setPattern("[.?!]+");         // count "..." as only one "."
+        s.replace(re, ".");
+        re.setPattern("[A-Z]\\.+");      // don't count "U.S.A." as three sentences
+        s.replace(re, "*");
+        for ( uint i = 0 ; i < s.length() ; ++i )
+        {
+            QChar ch = s[i];
+            if ( KWAutoFormat::isMark( ch ) )
+                ++sentences;
         }
-        if ( wordStarted )
-            ++words;
     }
 }
 
Index: kwtextframeset.h
===================================================================
RCS file: /home/kde/koffice/kword/kwtextframeset.h,v
retrieving revision 1.95
diff -u -r1.95 kwtextframeset.h
--- kwtextframeset.h	2001/06/07 17:29:37	1.95
+++ kwtextframeset.h	2001/06/07 23:33:54
@@ -200,7 +200,8 @@
     virtual void invalidate();
 
     // Calculate statistics for this frameset
-    virtual void statistics( ulong & charsWithSpace, ulong & charsWithoutSpace, ulong & words, ulong & sentences );
+    virtual void statistics( ulong & charsWithSpace, ulong & charsWithoutSpace,
+        ulong & words, ulong& sentences, ulong & syllables );
 
     // reimplemented from QTextFlow
     virtual int adjustLMargin( int yp, int h, int margin, int space );
Index: kwview.cc
===================================================================
RCS file: /home/kde/koffice/kword/kwview.cc,v
retrieving revision 1.271
diff -u -r1.271 kwview.cc
--- kwview.cc	2001/06/07 17:34:47	1.271
+++ kwview.cc	2001/06/07 23:34:01
@@ -713,6 +713,7 @@
     ulong charsWithoutSpace = 0L;
     ulong words = 0L;
     ulong sentences = 0L;
+    ulong syllables = 0L;
     QListIterator<KWFrameSet> framesetIt( m_doc->framesetsIterator() );
     for ( ; framesetIt.current(); ++framesetIt )
     {
@@ -720,9 +721,17 @@
         // Exclude headers and footers
         if ( frameSet->frameSetInfo() == KWFrameSet::FI_BODY && frameSet->isVisible() )
         {
-            frameSet->statistics( charsWithSpace, charsWithoutSpace, words, sentences );
+            frameSet->statistics( charsWithSpace, charsWithoutSpace, words, sentences, syllables );
         }
     }
+    // calculate Flesch reading ease score:
+    float flesch = 0;
+    if( words > 0 && sentences > 0 )
+        flesch = 206.835 - (1.015 * ((double)words/sentences)) - (84.6 * syllables/words);  // no integer division
+    QString flesch_comment = "";
+    if( words < 200 ) {
+        flesch_comment = "approximately ";   // a kind of warning if too few words
+    }
 
     KDialogBase dlg( KDialogBase::Plain, i18n( "Document Statistics" ),
                      KDialogBase::Ok, KDialogBase::Ok, this, 0, true,
@@ -734,8 +743,11 @@
               "Characters (total count including spaces) : <b>%1</b><br/>"
               "Characters without spaces: <b>%2</b><br/>"
               "Words: <b>%3</b><br/>"
-              "Sentences: <b>%4</b></p>" ).
-        arg( charsWithSpace ).arg( charsWithoutSpace ).arg( words ).arg( sentences ),
+              "Sentences: <b>%4</b><br/>"
+              "English syllables: <b>%5</b><br/>"
+	      "Flesch reading ease: <b>%6%7</b></p>" ).
+        arg( charsWithSpace ).arg( charsWithoutSpace ).arg( words ).arg( sentences ).
+	arg( syllables ).arg( flesch_comment ).arg( flesch ),	// fixme: rounding
                         dlg.plainPage() ) );
     dlg.setInitialSize( QSize( 400, 200 ) ); // not too good for long translations... -> use a real layout and 5 labels
     dlg.show();


_______________________________________________
Koffice-devel mailing list
Koffice-devel@master.kde.org
http://master.kde.org/mailman/listinfo/koffice-devel

