List: wireshark-dev
Subject: Re: [Wireshark-dev] Note about proto_tree_add_unicode_string (r43379)
From: Guy Harris <guy () alum ! mit ! edu>
Date: 2012-06-19 19:47:28
Message-ID: CB0BE04F-4E9E-4F57-97C7-6A2CE7ABE24B () alum ! mit ! edu
On Jun 19, 2012, at 12:01 PM, Jakub Zawadzki wrote:
> Hi,
>
> String from tvb_get_ephemeral_string() still needs escaping with format_text(),
> cause it doesn't check encoding.
>
> When you use:
> tvb_get_ephemeral_string_enc(tvb, offset, length, ENC_UTF_8 | ENC_NA);
>
> It guarantees result encoded in UTF-8:
> * string as converted from the appropriate encoding to UTF-8 ...
>
> (Code to do it is still in XXX's but this is bug in libwireshark and no one
> can blame you that you used wrong function :))
Or, rather, there is no code to do it and there's an XXX comment noting that:

    case ENC_UTF_8:
        /*
         * XXX - should map all invalid UTF-8 sequences
         * to a "substitute" UTF-8 character.
         */
        strbuf = tvb_get_ephemeral_string(tvb, offset, length);
        break;

*That's* why escaping is still needed - we don't handle malformed UTF-8 strings.
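
To make the missing step concrete, here is a minimal sketch of the kind of substitution the XXX comment above asks for. It is not the libwireshark code; sanitize_utf8() and utf8_seq_len() are hypothetical names, and the choice of U+FFFD as the "substitute" character and of replacing one byte at a time are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return the length of the well-formed UTF-8
 * sequence starting at p (at most `avail` bytes are readable), or 0 if
 * the bytes at p do not start a well-formed sequence.
 */
static size_t utf8_seq_len(const unsigned char *p, size_t avail)
{
    unsigned char c = p[0];
    unsigned long cp;
    size_t len, i;

    if (c < 0x80)
        return 1;                                   /* ASCII */
    else if (c >= 0xC2 && c <= 0xDF) { len = 2; cp = c & 0x1F; }
    else if (c >= 0xE0 && c <= 0xEF) { len = 3; cp = c & 0x0F; }
    else if (c >= 0xF0 && c <= 0xF4) { len = 4; cp = c & 0x07; }
    else
        return 0;   /* stray continuation byte, C0/C1, or F5..FF */

    if (avail < len)
        return 0;                                   /* truncated */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;                               /* not a continuation byte */
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values above U+10FFFF. */
    if (len == 3 && (cp < 0x800 || (cp >= 0xD800 && cp <= 0xDFFF)))
        return 0;
    if (len == 4 && (cp < 0x10000 || cp > 0x10FFFF))
        return 0;
    return len;
}

/*
 * Hypothetical helper: copy `in` to a newly allocated NUL-terminated
 * buffer, replacing every byte that does not start a well-formed UTF-8
 * sequence with U+FFFD (EF BF BD).  Returns NULL on allocation failure.
 * The caller frees the result.
 */
char *sanitize_utf8(const unsigned char *in, size_t in_len)
{
    char *out;
    size_t i = 0, o = 0;

    assert(in != NULL || in_len == 0);
    out = malloc(in_len * 3 + 1);   /* worst case: every byte replaced */
    if (out == NULL)
        return NULL;
    while (i < in_len) {
        size_t n = utf8_seq_len(in + i, in_len - i);
        if (n == 0) {
            memcpy(out + o, "\xEF\xBF\xBD", 3);
            o += 3;
            i += 1;
        } else {
            memcpy(out + o, in + i, n);
            o += n;
            i += n;
        }
    }
    out[o] = '\0';
    return out;
}
```

With something like this in the ENC_UTF_8 case, the result would always be valid UTF-8, although control characters would presumably still need escaping for display.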
If we were to:

    store strings in the protocol tree as a combination of an encoding value
    and a raw bucket of bytes;

    in the display filter code:

        implement comparison of string fields with text strings by
        attempting to convert the string field value to UTF-8 at that point,
        having all comparisons other than != fail and having != succeed if
        the string can't be converted;

        also support comparison of string fields with *byte* strings, e.g. a
        sequence of hex-encoded octets, and have that compare the raw bucket
        of bytes;

    have the "construct a filter from this field" UI generate, for string
    fields, a comparison with a UTF-8 string if the value can be converted
    to UTF-8 and a comparison with a byte string if it can't;

    in the code that generates displayable text for the protocol tree,
    replace anything in the string that can't be mapped to UTF-8 with a
    "substitute" UTF-8 character, including mapping invalid UTF-8 sequences
    if the string's encoding is UTF-8;

    do similar stuff if a string field's value is used to generate anything
    in the packet summary;

that might handle this, and might also defer conversion to UTF-8 until it's
absolutely necessary, which might cut CPU usage.
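
A rough sketch of the comparison rules in that proposal, to show what I mean. None of these names exist in Wireshark: string_field_t, field_enc_t, cmp_op_t and the functions are hypothetical, and the two encodings (ASCII, where bytes >= 0x80 can't be converted, and ISO 8859-1, where everything can) are chosen only to keep the example small.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encodings, just enough for the example. */
typedef enum { FIELD_ENC_ASCII, FIELD_ENC_LATIN1 } field_enc_t;

/* A string field stored as an encoding plus a raw bucket of bytes. */
typedef struct {
    field_enc_t          enc;
    const unsigned char *bytes;
    size_t               len;
} string_field_t;

typedef enum { CMP_EQ, CMP_NE, CMP_LT, CMP_GT } cmp_op_t;

/*
 * Convert the field to a malloc'ed NUL-terminated UTF-8 string, or return
 * NULL if it can't be converted.  (The sketch also returns NULL on
 * allocation failure; real code would distinguish the two.)
 */
static char *field_to_utf8(const string_field_t *f)
{
    char *out = malloc(f->len * 2 + 1);   /* Latin-1 needs at most 2 bytes */
    size_t i, o = 0;

    if (out == NULL)
        return NULL;
    for (i = 0; i < f->len; i++) {
        unsigned char c = f->bytes[i];
        if (c < 0x80) {
            out[o++] = (char)c;
        } else if (f->enc == FIELD_ENC_LATIN1) {
            out[o++] = (char)(0xC0 | (c >> 6));
            out[o++] = (char)(0x80 | (c & 0x3F));
        } else {
            free(out);                    /* not valid in this encoding */
            return NULL;
        }
    }
    out[o] = '\0';
    return out;
}

/*
 * Compare a field with a UTF-8 text string.  Conversion is deferred to
 * this point; if it fails, every comparison fails except !=.
 */
bool field_cmp_text(const string_field_t *f, cmp_op_t op, const char *text)
{
    char *s;
    int r;

    assert(f != NULL && text != NULL);
    s = field_to_utf8(f);
    if (s == NULL)
        return op == CMP_NE;
    r = strcmp(s, text);
    free(s);
    switch (op) {
    case CMP_EQ: return r == 0;
    case CMP_NE: return r != 0;
    case CMP_LT: return r < 0;
    case CMP_GT: return r > 0;
    }
    return false;
}

/* Compare a field with a byte string: only the raw bytes are looked at. */
bool field_eq_bytes(const string_field_t *f, const unsigned char *b, size_t n)
{
    assert(f != NULL);
    return f->len == n && (n == 0 || memcmp(f->bytes, b, n) == 0);
}
```

So a field holding the bytes 61 e9 tagged as ASCII would match only != against any text string, but would still match a byte-string comparison with 61:e9; the same bytes tagged as ISO 8859-1 would compare equal to the UTF-8 text "aé".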

(A further optimization might be to, in the display filter code, implement
comparison of string fields with text strings by attempting to convert the
*text string* to the field's encoding at that point, with the result cached
so you only have to do it once, and having all comparisons other than != fail
and having != succeed if the string can't be converted.)

As for PDML:
http://www.nbee.org/doku.php?id=netpdl:visualization_expression#data_associated_to_each_netpdl_field
says

    value (always present)
        It contains the field value as an hex string. In other words, the
        content of the value attribute in case of a 16 bit field whose value
        is 0800 (hex) is the string 0800. This value is exactly the same as
        written in the hex file, i.e. it does not change in case of
        little-endian or big-endian fields. Please note that for bitmasked
        fields, this value corresponds to the 'unmasked' value (i.e. the real
        field value must be derived from a bitwise AND operation between the
        mask and the value).

    showvalue (always present)
        It contains the field value in a “printable” form. For instance,
        field value ffffffffffff (related to a MAC address) can be associated
        to value ffffff-ffffff and assigned to this attribute.

which seems to indicate that, for a string field, "value" should be the raw
hex bytes of the string (except perhaps for byte-swapping in the case of
UCS-2, UCS-4, or UTF-16), *NOT* anything that looks like the text, and
suggests to me that "showvalue" would be the same as what's generated by the
code that generates displayable text for the protocol tree, i.e. it'd have
invalid UTF-8 sequences replaced by a "substitute" character.
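
As an illustration of that reading of "value", a small sketch that turns the raw bytes of a field into the hex string; bytes_to_hex() is a hypothetical name, not an existing Wireshark routine, and lower-case digits are an assumption based on the examples in the quoted text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a newly allocated NUL-terminated string
 * holding two lower-case hex digits per input byte, in the order the
 * bytes appear in the packet.  Returns NULL on allocation failure.
 * The caller frees the result.
 */
char *bytes_to_hex(const unsigned char *bytes, size_t len)
{
    static const char digits[] = "0123456789abcdef";
    char *out;
    size_t i;

    assert(bytes != NULL || len == 0);
    out = malloc(len * 2 + 1);
    if (out == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        out[2 * i]     = digits[bytes[i] >> 4];    /* high nibble */
        out[2 * i + 1] = digits[bytes[i] & 0x0F];  /* low nibble */
    }
    out[2 * len] = '\0';
    return out;
}
```

For the 16-bit field in the quoted example, the bytes 08 00 give "0800"; for a string field containing "Hi" in ASCII, the result would be "4869" regardless of whether the bytes form valid text in the field's encoding.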

However, I don't know whether that's what programs out there that read PDML
would expect - and it doesn't include the encoding for the string, so that
code wouldn't know what to do with the string's value.