

List:       cfe-dev
Subject:    Re: [cfe-dev] [PATCH] C++0x unicode string and character literals
From:       Eli Friedman <eli.friedman () gmail ! com>
Date:       2011-07-31 20:20:10
Message-ID: CAJdarcGPsweB0PdwHQ-kq9c24v2vspBs89Grqm-JCLdgUV+vPA () mail ! gmail ! com

On Sun, Jul 31, 2011 at 1:03 PM, Seth Cantrell <seth.cantrell@gmail.com> wrote:
> > > So I've got a couple questions.
> > > 
> > > Is the lexer really the appropriate place to be doing this?
> > > Originally CodeGenModule::GetStringForStringLiteral seemed like the
> > > thing I should be modifying, but I discovered that the string
> > > literal's bytes had already been zero extended by the time it got
> > > there. Would it be reasonable for the StringLiteralParser to just
> > > produce a UTF-8 encoded internal representation of the string and
> > > leave producing the final representation until later? I think the
> > > main complication with that is that I'll have to encode UCNs with
> > > their UTF-8 representation.
> > 
> > Given the possibility of character escapes which can't be represented
> > in UTF-8, I'm not sure we can...
> 
> Yeah, I see that's correct now. I need a way to discriminate between
> "\xF0\x9F\x9A\x80" and U"\xF0\x9F\x9A\x80" as well.
> Perhaps instead the internal representation could be a discriminated
> union, based on the string literal's Kind or CharByteWidth?
> If the final representation does have to be computed inside the string
> literal parser I'll need to get the target's endianness. I looked
> through the definition for the TargetInfo object the
> StringLiteralParser has but didn't see a way to do this. Is this info
> accessible during this phase?

A string is an array of CharByteWidth-size integers; just keeping it
in the native endianness of the compiler and letting the stuff below
clang byteswap if necessary should be sufficient.  (Granted, IIRC
clang IRGen doesn't really handle wide strings in a very intuitive
manner at the moment.)  Pretty sure there's some way to get endianness
off the TargetInfo if you really need it, though; at the very least,
it's in the target data layout string.
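
To make that concrete, here is a rough sketch of the idea rather than
anything that exists in clang. The helper names packCodeUnits and
byteswapCodeUnits are made up for illustration, and the code units are
assumed to arrive as 32-bit values with a CharByteWidth of 1, 2, or 4:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper, not part of clang: store each code unit as a
// charByteWidth-size integer in the byte order of the machine running
// this code (the "native endianness of the compiler").
std::vector<unsigned char>
packCodeUnits(const std::vector<std::uint32_t> &units, unsigned charByteWidth) {
  assert(charByteWidth == 1 || charByteWidth == 2 || charByteWidth == 4);
  std::vector<unsigned char> out(units.size() * charByteWidth);
  for (std::size_t i = 0; i < units.size(); ++i) {
    unsigned char *dst = out.data() + i * charByteWidth;
    if (charByteWidth == 1) {
      std::uint8_t v = static_cast<std::uint8_t>(units[i]);
      std::memcpy(dst, &v, sizeof v);
    } else if (charByteWidth == 2) {
      std::uint16_t v = static_cast<std::uint16_t>(units[i]);
      std::memcpy(dst, &v, sizeof v);
    } else {
      std::uint32_t v = units[i];
      std::memcpy(dst, &v, sizeof v);
    }
  }
  return out;
}

// Hypothetical helper: reverse the bytes of every code unit in place, as
// a later stage might do when the target's byte order differs from the
// host's. A width of 1 leaves the buffer unchanged.
void byteswapCodeUnits(std::vector<unsigned char> &bytes,
                       unsigned charByteWidth) {
  for (std::size_t i = 0; i + charByteWidth <= bytes.size();
       i += charByteWidth)
    std::reverse(bytes.data() + i, bytes.data() + i + charByteWidth);
}
```

With the code units {0xF0, 0x9F, 0x9A, 0x80}, a width of 1 gives the
4-byte narrow string and a width of 4 gives 16 bytes, so the two
literals from your example stay distinct without the parser needing to
know the target's byte order.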

> > > I assume eventually someone will want source and execution charset
> > > configuration, but for now I'm content to assume source is UTF-8
> > > and that the execution character sets are UTF-8, UTF-16, and
> > > UTF-32, with the target's native endianness. Is that good enough
> > > for now?
> > 
> > The C execution character set can't be UTF-16 or UTF-32 given 8-bit
> > char's.  But yes, feel free to assume the source and execution
> > charsets are UTF-8 for the moment.  (Windows is the only interesting
> > platform where this isn't the case normally.)
> 
> Well, by execution charset I just meant the literal's representation
> at execution time, so there'd be an 'execution charset' for each
> string literal type. Perhaps this isn't the right terminology.

Oh, yes, that's fine.  That term (or at least very similar ones) has
a pretty specific definition in the C standard.

-Eli

