'Re: [Wireshark-dev] Note about proto_tree_add_unicode_string (r43379)'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       wireshark-dev
Subject:    Re: [Wireshark-dev] Note about proto_tree_add_unicode_string (r43379)
From:       Guy Harris <guy () alum ! mit ! edu>
Date:       2012-06-19 19:47:28
Message-ID: CB0BE04F-4E9E-4F57-97C7-6A2CE7ABE24B () alum ! mit ! edu
[Download RAW message or body]


On Jun 19, 2012, at 12:01 PM, Jakub Zawadzki wrote:

> Hi,
> 
> String from tvb_get_ephemeral_string() still needs escaping with format_text(),
> cause it doesn't check encoding.
> 
> When you use:
> tvb_get_ephemeral_string_enc(tvb, offset, length, ENC_UTF_8 | ENC_NA);
> 
> It guarantees result encoded in UTF-8:
> * string as converted from the appropriate encoding to UTF-8 ...
> 
> (Code to do it is still in XXX's but this is bug in libwireshark and no one can \
> blame you that you used wrong function :))

Or, rather, there is no code to do it and there's an XXX comment noting that:

        case ENC_UTF_8:
                /*
                 * XXX - should map all invalid UTF-8 sequences
                 * to a "substitute" UTF-8 character.
                 */
                strbuf = tvb_get_ephemeral_string(tvb, offset, length);
                break;

*That's* why escaping is still needed - we don't handle malformed UTF-8 strings.

If we were to:

	store strings in the protocol tree as a combination of an encoding value and a raw \
bucket of bytes;

	in the display filter code:

		implement comparison of string fields with text strings by attempting to convert \
the string field value to UTF-8 at that point and having all comparisons other than \
!= fail and have != succeed if the string can't be converted;

		also support comparison of string fields with *byte* strings, e.g. a sequence of \
hex-encoded octets, and have that compare the raw bucket of bytes;

	have the "construct a filter from this field" UI generate, for string fields, a \
comparison with a UTF-8 string if the value can be converted to UTF-8 and a \
comparison with a byte string if it can't;

	in the code that generates displayable text for the protocol tree, replace anything \
in the string that can't be mapped to UTF-8 to a "substitute" UTF-8 character, \
including mapping invalid UTF-8 sequences if the string's encoding is UTF-8;

	do similar stuff if a string field's value is used to generate anything in the \
packet summary.

that might handle this, and might also defer conversion to UTF-8 until it's \
absolutely necessary, which might cut CPU usage.

(A further optimization might be to, in the display filter code implement comparison \
of string fields with text strings by attempting to convert the *text string* to the \
field's encoding at that point, with the result cached so you only have to do it \
once, and having all comparisons other than != fail and have != succeed if the string \
can't be converted.)

As for PDML:

	http://www.nbee.org/doku.php?id=netpdl:visualization_expression#data_associated_to_each_netpdl_field


says

	value (always present)
		It contains the field value as an hex string. In other words, the content of the \
value attribute in case of a 16 bit field whose value is 0800 (hex) is the string \
0800. This value is exactly the same as written in the hex file, i.e. it does not \
change in case of little-endian or big-endian fields. Please note that for bitmasked \
fields, this value corresponds to the 'unmasked' value (i.e. the real field value \
must be derived from a bitwise AND operation between the mask and the value).

	showvalue(always present)

		It contains the field value in a “printable” form. For instance, field value \
ffffffffffff (related to a MAC address) can be associated to value ffffff-ffffff and \
assigned to this attribute.

which seems to indicate that, for a string field, "value" should be the raw hex bytes \
of the string (except perhaps for byte-swapping in the case of UCS-2, UCS-4, or \
UTF-16), *NOT* anything that looks like the text, and suggests to me that "showvalue" \
would be the same as what's generated by the code that generates displayable text for \
the protocol tree, i.e. it'd have invalid UTF-8 sequences replaced by a "substitute" \
character.

However, I don't know whether that's what programs out there that read PDML would \
expect - and it doesn't include the encoding for the string, so that code wouldn't \
know what to do with the string's value. \
___________________________________________________________________________ Sent via: \
                Wireshark-dev mailing list <wireshark-dev@wireshark.org>
Archives:    http://www.wireshark.org/lists/wireshark-dev
Unsubscribe: https://wireshark.org/mailman/options/wireshark-dev
             mailto:wireshark-dev-request@wireshark.org?subject=unsubscribe


[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic