[prev in list] [next in list] [prev in thread] [next in thread]
List: git
Subject: Re: [PATCH v2 4/5] convert: add support for 'checkout-encoding' attribute
From: Lars Schneider <larsxschneider () gmail ! com>
Date: 2017-12-30 19:58:02
Message-ID: 99B51DE1-0825-41FC-8DD3-677BCF7B52A5 () gmail ! com
[Download RAW message or body]
> On 29 Dec 2017, at 18:28, Torsten Bögershausen <tboegi@web.de> wrote:
>
> I probably need to look at convert.c more closer, some other comments inline.
>
>
> On Fri, Dec 29, 2017 at 04:22:21PM +0100, lars.schneider@autodesk.com wrote:
>> From: Lars Schneider <larsxschneider@gmail.com>
>>
>> Git and its tools (e.g. git diff) expect all text files in UTF-8
>> encoding. Git will happily accept content in all other encodings, too,
>> but it might not be able to process the text (e.g. viewing diffs or
>> changing line endings).
>
> UTF-8 is too strict, the text from below is more correct:
> +Git recognizes files encoded with ASCII or one of its supersets (e.g.
> +UTF-8 or ISO-8859-1) as text files. All other encodings are usually
> +interpreted as binary and consequently built-in Git text processing
> +tools (e.g. 'git diff') as well as most Git web front ends do not
> +visualize the content.
Agreed. I'll replace it.
>> Add an attribute to tell Git what encoding the user has defined for a
>> given file. If the content is added to the index, then Git converts the
>> content to a canonical UTF-8 representation. On checkout Git will
>
> Minor question about "canonical":
> Would this mean the same ?
> ...then Git converts the content into UTF-8.
I feel the word canonical makes sense in this context to express
that UTF-8 is not some randomly chosen encoding. AFAIK UTF-8 and
LF are the canonical form for text in Git, no?
>> +In these cases you can teach Git the encoding of a file in the working
> teach ? tell ?
"tell", agreed!
>> +directory with the `checkout-encoding` attribute. If a file with this
>> +attributes is added to Git, then Git reencodes the content from the
>> +specified encoding to UTF-8 and stores the result in its internal data
>> +structure.
>
> Minor Q:
>> its internal data structure.
> Should we simply write "stores the result in the index" ?
This text is targeted at the end user and I know a lot of end users
that are confused by the word "index". How about:
...stores the result in its internal data structure ("the index").
Would that work?
>
>> On checkout the content is encoded back to the specified
>> +encoding.
>
>> +
>> +Please note that using the `checkout-encoding` attribute has a number
>> +of drawbacks:
>> +
>> +- Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
>> + errors as the conversion might not be round trip safe.
>> +
>> +- Reencoding content requires resources that might slow down certain
>> + Git operations (e.g 'git checkout' or 'git add').
>> +
>> +- Git clients that do not support the `checkout-encoding` attribute or
>> + the used encoding will checkout the respective files as UTF-8 encoded.
>> + That means the content appears to be different which could cause
>> + trouble. Affected clients are older Git versions and alternative Git
>> + implementations such as JGit or libgit2 (as of January 2018).
>> +
>> +Use the `checkout-encoding` attribute only if you cannot store a file in
>> +UTF-8 encoding and if you want Git to be able to process the content as
>> +text.
>> +
>
> I would maybe rephrase a little bit (first things first):
>
> Please note that using the `checkout-encoding` attribute may have a number
> of pitfalls:
>
> - Git clients that do not support the `checkout-encoding` attribute
> will checkout the respective files as UTF-8 encoded, which typically
> causes trouble.
> Escpecialy when older Git versions are used or alternative Git
> implementations such as JGit or libgit2 (as of January 2018).
>
> - Reencoding content to non-UTF encodings (e.g. SHIFT-JIS) can cause
> errors as the conversion might not be round trip safe.
>
> - Reencoding content requires resources that might slow down certain
> Git operations (e.g 'git checkout' or 'git add').
>
> Use the `checkout-encoding` attribute only if you cannot store a file in
> UTF-8 encoding and if you want Git to be able to process the content as
> text.
Agreed!
> -----
> Side question: What happens if "the used encoding" is not supported by
> the underlying iconv lib ?
> Will Git fail, delete the file from the working tree ?
> That may be worth to mention. (And I need to do the code-review)
It should checkout the file in UTF-8 and print an error.
If you would try to add a file with an unsupported encoding,
then Git should die() and refuse the operation.
(see t0028: check unsupported encoding)
>
>> +Use the following attributes if your '*.txt' files are UTF-16 encoded
>> +with byte order mark (BOM) and you want Git to perform automatic line
>> +ending conversion based on your platform.
>> +
>> +------------------------
>> +*.txt text checkout-encoding=UTF-16
>> +------------------------
>> +
>> +Use the following attributes if your '*.txt' files are UTF-16 little
>> +endian encoded without BOM and you want Git to use Windows line endings
>> +in the working directory.
>> +
>> +------------------------
>> +*.txt checkout-encoding=UTF-16LE text eol=CRLF
>> +------------------------
>> +
>> +You can get a list of all available encodings on your platform with the
>> +following command:
>> +
>> +------------------------
>> +iconv --list
>> +------------------------
>> +
>> +
>> `ident`
>> ^^^^^^^
>>
>> diff --git a/apply.c b/apply.c
>> index 321a9fa68d..c4bd5cf1f2 100644
>> --- a/apply.c
>> +++ b/apply.c
>> @@ -2281,7 +2281,7 @@ static int read_old_data(struct stat *st, struct patch *patch,
>> * should never look at the index when explicit crlf option
>> * is given.
>> */
>> - convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf);
>> + convert_to_git(NULL, path, buf->buf, buf->len, buf, safe_crlf, 0);
>
> Hm. Do we really need another parameter here?
> The only caller that needs it (and sets it to 1) is sha1_file.c.
> When an invalid encoding/content is used, it should be safe to die() always?
> Just loud thinking..
Unfortunately it is not safe to die() always. Peff described a situation here
where you could not "checkout away" from an error state:
https://public-inbox.org/git/20171215095838.GA3567@sigill.intra.peff.net/
> If really needed, it may need less changes in the code base, if a new function
> convert_to_git_or_die() is defined, which is called by convert_to_git()
>
> (and the other alternative would be to convert safe_crlf into a bitmap,
> see https://public-inbox.org/git/20171229132828.17594-1-tboegi@web.de/T/#u
> what do you think ?)
I still need to review your patch, but a bitmap sounds like a good idea.
But why do we need a special bitmap? Can't we just pass "flags" and
calculate get_safe_crlf() in convert.c?
>>
>> +
>> + if (has_prohibited_utf_bom(enc->name, src, src_len)) {
>> + const char *error_msg = _(
>> + "BOM is prohibited for '%s' if encoded as %s");
>> + const char *advise_msg = _(
>> + "You told Git to treat '%s' as %s. A byte order mark "
>> + "(BOM) is prohibited with this encoding. Either use "
>> + "%.6s as checkout encoding or remove the BOM from the "
>> + "file.");
>> +
>> + advise(advise_msg, path, enc->name, enc->name, enc->name);
>> + if (write_obj)
>> + die(error_msg, path, enc->name);
>> + else
>> + error(error_msg, path, enc->name);
>
> As said before, I would just die(). Or do I miss something ?
> Being strict with BOMs seams like a good idea.
See above, Peffs test case. We run convert in a lot of places
(e.g. 'git diff' or if checkout away from a bad state).
>>
>> diff --git a/t/t0028-checkout-encoding.sh b/t/t0028-checkout-encoding.sh
>> new file mode 100755
>> index 0000000000..1a329ab933
>> --- /dev/null
>> +++ b/t/t0028-checkout-encoding.sh
>> @@ -0,0 +1,197 @@
>> +#!/bin/sh
>> +
>> +test_description='checkout-encoding conversion via gitattributes'
>> +
>> +. ./test-lib.sh
>> +
>> +test_expect_success 'setup test repo' '
>> +
>> + text="hallo there!\ncan you read me?" &&
>
> Is this portable ? (the "\n")
I assume it as it is used in t0001, t0008, t0024 and others.
Thanks for the review,
Lars
[prev in list] [next in list] [prev in thread] [next in thread]
Configure |
About |
News |
Add a list |
Sponsored by KoreLogic