'Re: Indexing Text Files and Text Encoding'


List:       kde-devel
Subject:    Re: Indexing Text Files and Text Encoding
From:       David Narvaez <david.narvaez () computer ! org>
Date:       2015-01-20 18:18:03
Message-ID: CACFh1D5b_h091HHu9ogKiXCx1c0Gvhz+Fr_BZ-3sXW52XmOkbg () mail ! gmail ! com

On Tue, Jan 20, 2015 at 12:10 PM, Vishesh Handa <me@vhanda.in> wrote:
> Hey guys
> 
> We have a plain text indexing plugin in KFileMetaData. It gives the plain text of
> any file whose mimetype begins with 'text/'. We used to use QString::fromUtf8 to
> convert this into a string. However, this may not be ideal, as the file may use a
> different encoding.
> 
> I've just written a patch to use the system codec and, if the conversion fails, to
> abort. Does anyone have any opinions on this? I'm slightly conflicted.
> 
> Reasons for doing this: If we cannot correctly convert it to text, we're just
> indexing garbage. This often happens with a binary file getting detected as text
> [1].

What about guessing the encoding from some heuristic[0]?
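
To make that concrete, here is a rough sketch of the cheapest kind of
heuristic, in Python only for brevity. It is not KFileMetaData code and
not what ICU does (a real detector such as ICU's CharsetDetector uses
statistical models); decode_text and its system_encoding parameter are
made-up names for illustration. The idea: trust a BOM if there is one,
otherwise treat NUL bytes as a sign of binary data, then try strict
UTF-8 and fall back to the system codec, giving up if nothing decodes
cleanly.

```python
import codecs

# BOMs are checked first because they identify the encoding unambiguously.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def decode_text(data, system_encoding="utf-8"):
    """Return (text, encoding), or None if the data does not look like text.

    Hypothetical helper: system_encoding stands in for whatever the
    system codec would be on the indexing machine.
    """
    candidates = [enc for bom, enc in _BOMS if data.startswith(bom)]
    if not candidates:
        if b"\x00" in data:
            # NUL bytes without a BOM: most likely a binary file.
            return None
        candidates = ["utf-8", system_encoding]
    for enc in candidates:
        try:
            # Strict decoding raises instead of silently producing garbage.
            return data.decode(enc, errors="strict"), enc
        except (UnicodeDecodeError, LookupError):
            continue
    return None
```

A caller would skip indexing whenever decode_text returns None, which
keeps the "abort on failure" behaviour of the patch while still
accepting files that are not in the system encoding.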

David E. Narvaez

[0] http://icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html

