
List:       subversion-issues
Subject:    [Issue 2194] Support Unicode encodings
From:       jespersm@tigris.org
Date:       2006-03-26 14:43:43
Message-ID: 20060326144343.7606.qmail@tigris.org

http://subversion.tigris.org/issues/show_bug.cgi?id=2194



User jespersm changed the following:

                What    |Old value                 |New value
================================================================================
                      CC|''                        |'jespersm'
--------------------------------------------------------------------------------
                 Version|1.1.x                     |1.3.x
--------------------------------------------------------------------------------




------- Additional comments from jespersm@tigris.org Sun Mar 26 06:43:42 -0800 2006 -------
I've read the discussion and I think we can implement the functionality without
changing 80% of the client code. I hope I'm not just stating the obvious, but I
hereby propose a solution for this text encoding issue, to be implemented in a
manageable number of steps:

1) Property support for specifying text encoding
2) ASCII-based encoding conversion support in the WC library (along with current
EOL and keyword handling in 'subst')
3) UTF-16 charset support in diff and merge
4) Adding auto-property support for text encoding 

The proposal is based on four main principles: 
 * It's free if you don't need it
 * Don't surprise the user
 * UTF-8 is the new ASCII
 * The Unicode character set is sufficient for all things text

Ad 1) Property support for specifying text encoding:

I propose that we introduce a new property for text files called
svn:text-encoding (or maybe svn:text-encoding-style, or perhaps just
svn:encoding?). This can take three kinds of values:
 - The name of a specific encoding, like ISO-8859-1 or UTF-8
 - The special value 'native'
 - Empty or missing (the default)

The idea is that IF svn:text-encoding is specified, then the WC library and the
clients in general are responsible for converting to and from the specified
encoding (with 'native' being the system's default encoding), and that the RA
layer only ever sees UTF-8 for these text resources. The encoding is then said
to be "managed".
This follows the style of svn:eol-style and needs support in roughly the same
places.
The 'native' mode is interesting for the case where text files (like Java source
files) do not carry the encoding with them (as e.g. XML files do).
If the text-encoding is not set, then the encoding is "unmanaged" in that it
works like it does today.
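
To make the managed/unmanaged distinction concrete, here is a minimal Python
sketch of the conversion in both directions. It is only an illustration of the
proposal, not Subversion code: the function names are hypothetical, and the
property value is passed in as a plain string.

```python
import locale


def resolve_encoding(prop_value):
    """Map a (proposed) svn:text-encoding value to a codec name.

    Returns None when the property is empty or missing, i.e. "unmanaged".
    Hypothetical helper, not part of any Subversion API.
    """
    if not prop_value:
        return None
    if prop_value == "native":
        # 'native' means the system's default encoding.
        return locale.getpreferredencoding(False)
    return prop_value


def to_repository(wc_bytes, prop_value):
    """Working copy -> repository: managed files are stored as UTF-8."""
    encoding = resolve_encoding(prop_value)
    if encoding is None:
        return wc_bytes  # unmanaged: bytes pass through untouched
    return wc_bytes.decode(encoding).encode("utf-8")


def to_working_copy(repo_bytes, prop_value):
    """Repository -> working copy: convert UTF-8 to the requested encoding."""
    encoding = resolve_encoding(prop_value)
    if encoding is None:
        return repo_bytes
    return repo_bytes.decode("utf-8").encode(encoding)
```

With the property unset, both functions return their input unchanged, which is
the "free if you don't need it" case.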

Ad 2) ASCII-based encoding conversion support in the WC library:

The first step in supporting this would be to add the support into the WC and
client libraries. For 8-bit (ASCII-based) encodings, the basic support doesn't
touch the diff code, which at this point already makes some assumptions about
the encoding, as far as I can tell.
I think the "streamy" API in svn_subst.c can be layered with the encoding support.
Also, diff output should reflect these encoding changes, to show "encoding
only" changes:

Index: cool-stuff/todo.txt
===================================================================
--- cool-stuff/todo.txt (revision 42, ISO-8859-1)
+++ cool-stuff/todo.txt (working copy, UTF-8)
svn:text-encoding = UTF-8

Property changes on: cool-stuff/todo.txt
___________________________________________________________________
Name: svn:text-encoding
   - ISO-8859-1
   + UTF-8

There are some edge cases to be considered, when the text-encoding changes from
"unmanaged" to "managed" (or back), where the diff engine would pick up all
kinds of "bogus" text changes. This may need special attention.
Another edge case: Some commit logic should be present to check that a "managed"
file being checked in is in fact valid in the said encoding (so that careless
handling of file encoding won't inadvertently break the repository data).
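
As a rough sketch of that commit-time check (Python, with a hypothetical
function name; a real implementation would live in the client libraries and
would stream the file rather than hold it in memory):

```python
def check_managed_content(wc_bytes, encoding):
    """Return None if wc_bytes is valid in `encoding`, else an error string.

    Hypothetical helper illustrating the proposed pre-commit validation.
    """
    try:
        wc_bytes.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        return "invalid %s data at byte offset %d" % (encoding, exc.start)
    except LookupError:
        # The property names an encoding this client does not know.
        return "unknown encoding: %s" % encoding
    return None
```

Note that such a check is only meaningful for encodings that can reject input:
every byte sequence is valid ISO-8859-1, so a mislabelled file would pass.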

Ad 3) Extend this support to "non-ASCII" charsets (like UTF-16), charset support
in diff and merge, etc.:

Actually this may not be a big issue at all, if the conversions are added at the
right level, since the main diffing engine would always work on UTF-8 (in fact,
it would always work on 8-bit oriented streams separated by LFs, just like it
does now).
Maybe it will even be a no-op.
The only change I can think of right now is the fixed width keyword
substitution, which today works on bytes, but could work fine on characters if
the knowledge was there.
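
To illustrate why the byte/character distinction matters for a fixed-width
field, here is a small Python sketch. It is not Subversion's keyword code; both
function names are hypothetical and the field is reduced to plain padding and
truncation.

```python
def pad_by_bytes(value, width):
    """Fit value into exactly `width` bytes of UTF-8 (byte-oriented)."""
    raw = value.encode("utf-8")[:width]  # may cut a multi-byte character
    return raw.ljust(width, b" ")


def pad_by_chars(value, width):
    """Fit value into exactly `width` characters, then encode as UTF-8."""
    return value[:width].ljust(width).encode("utf-8")
```

For a value such as "Jørgen" and a width of 2, the byte-oriented version cuts
the two-byte "ø" in half and yields invalid UTF-8, while the character-oriented
version keeps "Jø" intact (at the cost of a field whose byte length varies).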

Ad 4) Adding auto-property support for text encoding:

Plenty of options exist: BOM detection, detection of UTF-8 leading/trailing
bytes, checking for XML declarations, etc.
There should also be a configuration setting for preferring the native encoding
over the detected one (if the detector sees a file encoded with the encoding
which is also the current native encoding).
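
A minimal sketch of such a detector in Python, under several assumptions: the
function name, the `native` and `prefer_native` parameters, and the order of
the checks are all hypothetical, UTF-32 BOMs are ignored, and "valid UTF-8" is
taken as sufficient evidence for UTF-8.

```python
import codecs
import re

# Byte order marks checked in this sketch (UTF-32 is deliberately left out).
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]

# Encoding pseudo-attribute of an XML declaration at the start of the file.
_XML_DECL = re.compile(rb"""^<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']""")


def detect_encoding(data, native="ISO-8859-1", prefer_native=False):
    """Guess a value for the proposed svn:text-encoding property.

    Returns an encoding name, 'native', or None when nothing was detected.
    """
    detected = None
    for bom, name in _BOMS:
        if data.startswith(bom):
            detected = name
            break
    if detected is None:
        match = _XML_DECL.match(data)
        if match:
            detected = match.group(1).decode("ascii")
    if detected is None:
        try:
            data.decode("utf-8")
            detected = "UTF-8"
        except UnicodeDecodeError:
            return None  # leave the file unmanaged rather than guess
    # Configuration setting: prefer 'native' over the explicit name.
    if prefer_native and detected.lower() == native.lower():
        return "native"
    return detected
```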

In summary - the various text formats and their relations are expressed like
this (for "managed" content):

   WC file contents (user specified)
         || 
  Enriched contents (UTF-8, w/keywords and EOL trans)
         || 
   Pristine contents (UTF-8, as stored in FS)

'svn diff' between WC and pristine would convert the WC file up to the
"enriched" level before feeding it to the diff libraries. (I am not sure how
this would be handled for external diff packages.)
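
The conversion step can be sketched in Python with the standard difflib module
standing in for the diff library. The function name and the header labels are
hypothetical, and keyword and EOL translation are left out entirely; only the
encoding step is shown.

```python
import difflib


def managed_diff(pristine_utf8, wc_bytes, wc_encoding, path):
    """Diff a managed WC file against its UTF-8 pristine copy.

    The WC contents are converted to UTF-8 first, so the diff engine only
    ever compares UTF-8 text. Hypothetical helper, not Subversion code.
    """
    wc_utf8 = wc_bytes.decode(wc_encoding).encode("utf-8")
    old_lines = pristine_utf8.decode("utf-8").splitlines(keepends=True)
    new_lines = wc_utf8.decode("utf-8").splitlines(keepends=True)
    return list(difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="%s (pristine, UTF-8)" % path,
        tofile="%s (working copy, %s)" % (path, wc_encoding),
    ))
```

With this layering, a working file that differs from the pristine copy only in
its encoding produces an empty text diff.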

