List: avro-dev
Subject: Re: Avro int vs long and varint encoding
From: Micah Kornfield <emkornfield@gmail.com>
Date: 2021-03-26 21:56:31
Message-ID: CAK7Z5T-_0ZfjeuJ+fRbB3NVN7KJ+rVYB9m=JxQ2GF=+9qyyAKA@mail.gmail.com
Hi Spencer,
> Just as a guess, I suppose that int and long exist to make life a little
> easier when you're deserializing. They act as promises that, despite being
> variable-length, the integer you read out will fit in a certain-sized chunk
> of memory.
I agree this is the reason. A lot of languages (Python is not one of them)
need a consistent in-memory size for efficiency.
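For concreteness, here is a minimal sketch of that "promise" in the spirit of
the Python implementation's read_int/read_long aliasing. The function names and
the explicit range check are mine for illustration, not the actual avro library
API; the zig-zag base-128 coding itself is what the Avro spec defines:

```python
INT_MIN, INT_MAX = -(1 << 31), (1 << 31) - 1

def decode_zigzag_varint(buf: bytes, pos: int = 0):
    """Decode one zig-zag base-128 varint at pos; return (value, new_pos)."""
    shift = 0
    acc = 0
    while True:
        b = buf[pos]
        pos += 1
        acc |= (b & 0x7F) << shift  # low 7 bits of each byte carry payload
        if not (b & 0x80):          # high bit clear -> last byte
            break
        shift += 7
    # undo zig-zag: 0, 1, 2, 3, ... -> 0, -1, 1, -2, ...
    return (acc >> 1) ^ -(acc & 1), pos

def read_long(buf: bytes, pos: int = 0):
    return decode_zigzag_varint(buf, pos)

def read_int(buf: bytes, pos: int = 0):
    # identical wire format; "int" only adds a 32-bit range promise
    value, pos = decode_zigzag_varint(buf, pos)
    if not (INT_MIN <= value <= INT_MAX):
        raise ValueError("decoded value does not fit in 32 bits")
    return value, pos
```

Note the mask, shift, and branch on every input byte -- that per-byte loop is
exactly the decoder cost being discussed below.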
> But if this is the goal, it seems like fixed-size integers would do a
> better job, and would let deserializers be *much* more efficient since they
> don't need to do multiple instructions (including a conditional!) on every
> single byte of the input.
There is a tradeoff here between storage/wire efficiency and CPU
efficiency. In many workloads integers tend to be small, and branch
predictors can do a good job of hiding the cost of the per-byte branches.
When that applies, increasing the storage/wire cost by 4x-8x doesn't
make sense. Other serialization formats like protobuf [1] do let users
choose between fixed- and variable-width encodings.
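To put the 4x-8x figure in concrete terms, here is a minimal encoder sketch
(the helper name is made up; the coding follows the Avro spec's zig-zag
varint). A small value costs one byte on the wire, versus eight bytes as a
fixed-width 64-bit integer:

```python
import struct

def encode_zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit int as a zig-zag base-128 varint."""
    # zig-zag maps small-magnitude signed ints to small unsigned ints
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow: set continuation bit
        else:
            out.append(byte)
            return bytes(out)

# small values: 1 byte as a varint vs. 8 bytes fixed-width
assert len(encode_zigzag_varint(42)) == 1
assert len(struct.pack("<q", 42)) == 8
```

The flip side, as noted above, is that the varint decoder must loop over the
input byte by byte, while a fixed-width field is a single aligned load.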
I think the general assumptions about the relative costs of compute,
storage, and networking have changed to a certain degree since the time
when a lot of serialization formats were created, and that in many cases
making more use of fixed-size types makes sense. Flatbuffers [2] and
Capnproto [3] are serialization formats that take this to the extreme.
I can't speak to plans for an Avro 2 implementation, but based on my
experience it is non-trivial to get a new serialization format to the point
where it is useful (i.e. has good support across a lot of languages), and
even harder to get it widely adopted.
[1] https://developers.google.com/protocol-buffers/docs/proto3#scalar
[2] https://google.github.io/flatbuffers/
[3] https://capnproto.org/
On Fri, Mar 26, 2021 at 2:35 PM Spencer Nelson <s@spencerwnelson.com> wrote:
> Why does Avro have separate 'int' and 'long' types, and what does it mean
> for them to represent 32- and 64-bit integers?
>
> Since their binary encoding is variable-length, it appears to me that they
> are unbounded in practice. Encoder and decoder functions for int and long
> are completely identical, other than checking that a value doesn't exceed
> the 32- and 64-bit maximum values. In fact, the Python implementation's
> read_int is just an alias for read_long [1], and the same goes for
> write_int / write_long.
>
> Just as a guess, I suppose that int and long exist to make life a little
> easier when you're deserializing. They act as promises that, despite being
> variable-length, the integer you read out will fit in a certain-sized chunk
> of memory.
>
> But if this is the goal, it seems like fixed-size integers would do a
> better job, and would let deserializers be *much* more efficient since they
> don't need to do multiple instructions (including a conditional!) on every
> single byte of the input.
>
> I don't know how open folks are to dramatic changes (is Avro 2 a concept
> that anyone is talking about?) but I think that Avro would be significantly
> improved with a rethink of its integer types. Some sort of fixed-length
> integer type[s], and just one variable-length integer type, would be
> clarifying, more expressive, and more efficient, I think.
>
>
>
> [1]
>
> https://github.com/apache/avro/blob/5bd7cfe0bf742d0482bf6f54b4541b4d22cc87d9/lang/py/avro/io.py#L251-L255
>