
List:       kde-devel
Subject:    Re: Indexing Text Files and Text Encoding
From:       Milian Wolff <mail () milianw ! de>
Date:       2015-01-21 10:51:13
Message-ID: 4636956.lqlh3Sfug8 () milian-kdab2

On Tuesday 20 January 2015 13:18:03 David Narvaez wrote:
> On Tue, Jan 20, 2015 at 12:10 PM, Vishesh Handa <me@vhanda.in> wrote:
> > Hey guys
> > 
> > We have a plain text indexing plugin in KFileMetaData. It gives the plain
> > text of any file whose mimetype begins with 'text/'. We used to use
> > QString::fromUtf8 to convert this into a string. However, this may not be
> > ideal as a different encoding can exist.
> > 
> > I've just written a patch to use the system codec and if the conversion
> > fails, to abort. Does anyone have any opinions on this? I'm slightly
> > conflicted.
> > 
> > Reasons for doing this: If we cannot correctly convert it to text, we're
> > just indexing garbage. This often happens with a binary file getting
> > detected as text. [1].
>
> What about guessing the encoding from some heuristic[0]?

Or just use Qt directly:

http://stackoverflow.com/questions/18227530/check-if-utf-8-string-is-valid-in-qt/18228382#18228382

If it fails, either discard the file, or try again with the system encoding 
(if that is not UTF-8) and discard the file if that fails too.
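
To make the fallback order concrete, here is a rough sketch of that logic. It is 
written in Python rather than Qt only to keep it short and self-contained; the 
function name decode_for_indexing and its system_encoding parameter are made up 
for illustration and are not part of KFileMetaData or Qt.

```python
import codecs
import locale


def decode_for_indexing(data, system_encoding=None):
    """Return the decoded text, or None if the file should be discarded.

    Hypothetical helper: strict UTF-8 first, then the system encoding
    (only when that is something other than UTF-8), otherwise give up.
    """
    if system_encoding is None:
        # Assumption: the locale's preferred encoding stands in for
        # "the system encoding".
        system_encoding = locale.getpreferredencoding(False)

    # First attempt: strict UTF-8. Any invalid byte sequence raises.
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        pass

    # Second attempt: the system encoding, skipped when it is UTF-8 too,
    # since that would only repeat the failed attempt.
    if codecs.lookup(system_encoding).name != "utf-8":
        try:
            return data.decode(system_encoding, errors="strict")
        except UnicodeDecodeError:
            pass

    # Neither decoding worked: treat the file as binary and skip it.
    return None
```

One caveat with this approach: a single-byte system encoding that maps every 
byte value (ISO-8859-1, for example) never fails to decode, so the second step 
cannot reject binary data in that case.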

Bye
-- 
Milian Wolff
mail@milianw.de
http://milianw.de
