List:       perl-beginners
Subject:    Re: copy unicode (UCS-2) file
From:       Brandon McCaig <bamccaig () gmail ! com>
Date:       2014-11-26 16:10:37
Message-ID: CANUGeEZkDr2Lw3F0ZQ=+HMPQpvHOa1DxQfuUvv9foWCe3B1igQ () mail ! gmail ! com

Hans:

On Wed, Nov 26, 2014 at 8:49 AM, Hans Ginzel <hans@matfyz.cz> wrote:
> Hello!

Hi,

> Consider a small perl code below.
> It should copy text file with removing leading and trailing spaces.
>
> while (<>) {
>  s/^\s+//; s/\s+$//;
>  print $_, "\n"; # say;
> }
>
> I run it with "shell" redirection
> perl copy.pl <src.txt >dst.txt
>
> It works well for Windows ansi and utf-8 text files. But when I have
> tried an unicode (ucs-2le) source file containing
> "a\nb" this is
> FF FE 61 00 0D 00 0A 00 62 00 0D 00 0A 00
> in hex (with Byte Order Mark) I get characters in hex
> FF FE 61 00 0D 00 0D 0A 00 62 00 0D 00 0D 0A 00 0D 0A
> .
> I have attached these files but I am not sure what Mail Agents do with
> them.
>
> Variable PERL_UNICODE is not set.
>
> I have tried add -CS to the command line, but got info about Malformed
> UTF-8 character.
> I have tried adding each of these pragmas to the beginning
> use open ':encoding(UCS-2LE)';
> use open IO => ':encoding(UCS-2LE)';
> use open ':std' => ':encoding(UCS-2LE)';
> but without desired goal. I tried to combine the pragma with -CS
> option.
> I have tried use feature qw/say/; say; instead of print $_, "\n"; but
> without correct results.
>
> perl --version
> This is perl 5, version 18, subversion 2 (v5.18.2)
> built for MSWin32-x86-multi-thread-64int
>
> What is the correct way to set stdin/out to UCS-2LE, please?
> What is the correct way to print "encoding independent" new line
> character, please?
> What is the correct way to say that \s should match the "UCS-2LE way",
> please?

You can generally pass an :encoding() layer to binmode to specify the
text encoding (see perldoc -f binmode). For file handles that you
create yourself with open, you can put the layer directly in the MODE
argument (see perldoc -f open). You should of course be using the
3-or-more-argument form of open regardless, but this matters
especially if you intend to use tainted (perldoc perlsec) data to
alter the I/O layers.
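
As a minimal sketch of the open form (not tested against your files),
the layer can go straight into the MODE argument. The helper name
read_trimmed_lines and its parameters are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: read $path using the named encoding and return
# its lines with leading and trailing whitespace removed.
sub read_trimmed_lines {
    my ($path, $encoding) = @_;

    # Three-argument open with the :encoding() layer in MODE, so that
    # readline hands back decoded characters instead of raw bytes.
    open my $in, "<:encoding($encoding)", $path
        or die "Cannot open '$path': $!";

    my @lines;
    while (my $line = <$in>) {
        $line =~ s/^\s+//;    # leading whitespace
        $line =~ s/\s+$//;    # trailing whitespace, including "\r\n"
        push @lines, $line;
    }
    close $in;
    return @lines;
}
```

Called as read_trimmed_lines('src.txt', 'UCS-2LE'), the regexes see
characters rather than bytes. Note that a byte order mark, if present,
stays at the start of the first line as U+FEFF.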

I wasn't familiar with the open pragma, but according to perldoc open,
in order for it to affect the standard streams (as opposed to new file
handles that you create yourself) you must include the :std
subpragma. It looks like you tried that, but it may not have been
spelled quite right. I would guess the incantation you need is:

use open qw(:std :encoding(UCS-2LE));

Or variations thereof, or thereabouts.
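
If the pragma keeps fighting you, the same thing can be done by hand
with binmode on handles that already exist. This is only a sketch:
trim_stream and its arguments are hypothetical, and it writes the line
ending explicitly instead of relying on the platform's newline
translation.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in to $out, trimming every line.
# $encoding is an Encode name such as 'UCS-2LE'; $eol is the line
# ending to write (defaults to CRLF, an assumption made here).
sub trim_stream {
    my ($in, $out, $encoding, $eol) = @_;
    $eol = "\r\n" unless defined $eol;

    # :raw removes any byte-level newline translation first, then
    # :encoding() converts between bytes and characters.
    my $layers = ":raw:encoding($encoding)";
    binmode $in,  $layers or die "Cannot set input layers: $!";
    binmode $out, $layers or die "Cannot set output layers: $!";

    while (my $line = <$in>) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;    # also removes the "\r\n" that was read
        print {$out} $line, $eol;
    }
    return;
}

# Possible use with the standard streams:
# trim_stream(\*STDIN, \*STDOUT, 'UCS-2LE');
```

Writing $eol yourself means the encoding layer is the only thing that
touches the output, so nothing can rewrite single bytes inside a
two-byte character.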

I recommend you familiarize yourself with open(), binmode(), and the
Encode module to rid yourself of text encoding doubts. I guess since
you're using the open pragma you should also familiarize yourself with
that, but since there isn't much to the perldoc I'm guessing you
already tried to do that.
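
To show what the Encode module does without any I/O layers involved,
here is a small sketch that decodes a byte string, trims it as
characters, and encodes it again. The function name trim_ucs2le is
made up, and writing CRLF line endings is an assumption based on your
hex dump.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take UCS-2LE bytes, return UCS-2LE bytes with
# every line trimmed and terminated by CRLF.
sub trim_ucs2le {
    my ($bytes) = @_;

    my $text = decode('UCS-2LE', $bytes);    # bytes -> characters

    # Set the byte order mark aside so it does not hide leading spaces.
    my $bom = $text =~ s/^\x{FEFF}// ? "\x{FEFF}" : '';

    my @lines;
    for my $line (split /\r?\n/, $text) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @lines, $line;
    }

    # characters -> bytes, putting the BOM back if there was one
    return encode('UCS-2LE', $bom . join('', map { "$_\r\n" } @lines));
}
```

The point is that \s and "\n" only mean what you expect once the data
has been decoded to characters.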

It's actually made pretty easy in Perl, but you do need to have a
basic understanding of the system to use it properly. You can also
look into PerlIO which provides the mechanisms for doing this
automatically on a stream.
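
For what it's worth, the second hex dump quoted above looks like what
you get when the data is treated as plain bytes and every "\n" written
becomes the two bytes 0D 0A. The sketch below imitates that by hand;
simulate_bytewise_copy is a made-up name, and the explicit "\x0D\x0A"
stands in for what I assume the default Windows text mode does on
output.

```perl
use strict;
use warnings;

# Hypothetical simulation: run the trimming loop over raw bytes, the
# way the original script sees them when no encoding layer is set.
sub simulate_bytewise_copy {
    my ($bytes) = @_;
    my $out = '';

    # readline splits after every 0A byte, even one that is only half
    # of a two-byte character.
    for my $line (split /(?<=\x0A)/, $bytes) {
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;          # strips the lone 0A, not the 00 before it
        $out .= $line . "\x0D\x0A"; # "\n" written as a CR LF byte pair
    }
    return $out;
}

my $src = pack 'H*', 'fffe61000d000a0062000d000a00';
print join(' ', map { sprintf '%02X', ord } split //,
    simulate_bytewise_copy($src)), "\n";
```

With the source bytes from the question this yields the same 18 bytes
as the corrupted dump, which suggests the fix is to decode before
trimming and to keep byte-level newline translation away from the
encoded data.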

> In addition, is there a standardised way to auto-detect input encoding
> (legacy(8bit)/utf-8/ucs-2), please?

Unfortunately there's no perfect way to detect character encodings.
There are many encodings that use all of the same code points and have
no identifying features. In general, the encoding needs to be written
in the headers or content of a stream (if being sent by machine) or in
user-specified options (if being sent by humans). For example, there
are ways to specify text encoding for HTTP messages in the headers.
There are ways to specify text encoding in an HTML or XML document. Of
course, if the content is written in the encoding itself, how are you
supposed to read the specified encoding? I'm not sure. I guess you can
try to guess until the text makes sense, and hobble along until you
find the encoding, and then reinterpret the text properly.
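
One identifying feature that some files do have is a byte order mark,
like the FF FE at the start of your sample. A minimal sketch of
checking for it follows; sniff_bom is a hypothetical name, and the
absence of a BOM tells you nothing.

```perl
use strict;
use warnings;

# Hypothetical helper: guess an encoding name from a leading byte
# order mark. Returns undef when no known BOM is present.
# Caveat: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and
# would be reported as UTF-16LE by this simple check.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-8'    if $bytes =~ /^\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /^\xFE\xFF/;
    return undef;
}
```

You would read the first few bytes in raw mode, call this, and fall
back to a default or a user-supplied encoding when it returns undef.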

Search metacpan.org for "guess encoding". I know that there are
several modules that attempt to solve this problem. Note that they
aren't flawless because they can't be. It's not a problem with a
perfect solution. The best option is to have the machine or user that
is giving you data also tell you what the format/text encoding of it
is. A simple way to do this is to implement a command-line option in
your program (e.g., see Getopt::Long) that overrides a sane default
encoding (e.g., UTF-8). If you control the software you can also
define that the data *must* be in a particular encoding (e.g., UTF-8)
and just assume that it is.
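
A rough sketch of the command-line option idea, assuming an option
called --encoding (the option name and the helper
parse_encoding_option are my invention):

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);
use Encode qw(find_encoding);

# Hypothetical helper: pull --encoding out of an argument list.
# Returns the chosen encoding followed by the remaining arguments.
sub parse_encoding_option {
    my (@argv) = @_;
    my $encoding = 'UTF-8';    # sane default when the user says nothing

    GetOptionsFromArray(\@argv, 'encoding=s' => \$encoding)
        or die "Usage: copy.pl [--encoding NAME] [FILE...]\n";

    # Reject names that Encode does not know before touching any data.
    find_encoding($encoding)
        or die "Unknown encoding '$encoding'\n";

    return ($encoding, @argv);
}

# Possible use:
# my ($encoding, @files) = parse_encoding_option(@ARGV);
```

The returned name can then be dropped into an :encoding() layer for
open or binmode.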

Regards,


-- 
Brandon McCaig <bamccaig@gmail.com> <bamccaig@castopulence.org>
Castopulence Software <https://www.castopulence.org/>
Blog <http://www.bambams.ca/>
perl -E '$_=q{V zrna gur orfg jvgu jung V fnl. }.
q{Vg qbrfa'\''g nyjnlf fbhaq gung jnl.};
tr/A-Ma-mN-Zn-z/N-Zn-zA-Ma-m/;say'
