List:       busybox
Subject:    Re: Issues removing files with certain characters in their names.
From:       Rich Felker <dalias () libc ! org>
Date:       2014-05-31 1:08:51
Message-ID: 20140531010851.GB507 () brightrain ! aerifal ! cx

On Fri, May 30, 2014 at 09:18:19PM +0200, Harald Becker wrote:
> Hi Rich !
> 
> >My statement was imprecise; of course to support users still
> >stuck on legacy locales, nl_langinfo(CODESET) should be
> >consulted.
> 
> How do you determine the correct code set of a foreign file
> system on an external drive? How can you tell if all systems
> which accessed this drive have handled translations in the correct
> way?

All modern filesystems used on external devices (fat32, ntfs, udf,
...) use Unicode-based encodings for filenames, so the foreign
encoding is known and fixed.

> >> .... and not only unzip may produce such results. Think of
> >> using a USB stick on a Windows machine, then carrying that over
> >> to a Linux machine.
> >
> >The filenames are stored in UCS-2. No problem.
> 
> UCS-2 with different code page translations from an 8 bit
> charset. Translations which leave name mapping in inconsistent
> state when further translations occur.

I don't follow what you think the problem is.

> >If you mount it incorrectly, then this is user error.
> 
> Correct, all this trouble arises from someone having an
> incorrect setup. This will ripple through and may produce trouble
> on other ends.

All modern Linux-based systems use the utf8 option by default when
mounting filesystems that don't store filenames as pure byte strings
but in a Unicode-based form. You have to be rolling your own or else
actively breaking your system's default setup to get this wrong.

> >All programs are not affected. Only programs which read
> >filenames as byte strings from foreign sources (such as the
> >directory table of a zip file) are affected.
> 
> .... but how do you know the code page the zip archive uses? How
> do you know you need to do translations? I'm unsure if the archive
> contains this information, so it needs to be provided by a much
> more error-prone user.

When encountering such an archive, the unzip utility could simply exit
with an error when there are non-ASCII names unless the user specifies
the encoding. To be less error-prone, it could print the names as
interpreted in several different encodings as part of the error
message, to help the user identify which one is correct. IMO it should
also automatically assume UTF-8 and suppress the error condition if
the names all decode as valid UTF-8, since the probability of
meaningful non-UTF-8 text decoding successfully as UTF-8 is negligible.
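
A minimal sketch of the policy described above, written in Python for
brevity rather than as a proposed busybox patch. Everything named here
(decode_names, AmbiguousEncodingError, and the CANDIDATE_ENCODINGS list)
is hypothetical and chosen only for illustration; a real unzip would
read the raw names from the archive's directory table and take the
encoding from a command-line option.

```python
# Hypothetical illustration only; not taken from any real unzip.
# The candidate list is an assumption: a few legacy code pages to preview.
CANDIDATE_ENCODINGS = ("cp437", "cp1252", "shift_jis")


class AmbiguousEncodingError(Exception):
    """Names are not valid UTF-8 and the user gave no encoding."""


def decode_names(raw_names, encoding=None):
    """Decode filename byte strings taken from an archive directory."""
    if encoding is not None:
        # The user specified the encoding: use it, fail loudly if wrong.
        return [name.decode(encoding) for name in raw_names]

    try:
        # ASCII is a subset of UTF-8, so pure-ASCII names pass here too.
        return [name.decode("utf-8") for name in raw_names]
    except UnicodeDecodeError:
        pass

    # Not valid UTF-8: refuse to guess, but show each non-ASCII name as
    # it would look under every candidate encoding to help the user pick.
    lines = ["non-ASCII file names found; please specify the encoding:"]
    for name in raw_names:
        if name.isascii():
            continue
        for enc in CANDIDATE_ENCODINGS:
            shown = name.decode(enc, errors="replace")
            lines.append(f"  {enc}: {shown}")
    raise AmbiguousEncodingError("\n".join(lines))
```

With no encoding given, a name such as b"caf\xc3\xa9.txt" decodes
silently as UTF-8, while b"caf\x82.txt" raises the error and lists its
rendering under each candidate; passing encoding="cp437" then decodes it.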

Rich