List:       avro-dev
Subject:    Re: Avro int vs long and varint encoding
From:       Micah Kornfield <emkornfield () gmail ! com>
Date:       2021-03-26 21:56:31
Message-ID: CAK7Z5T-_0ZfjeuJ+fRbB3NVN7KJ+rVYB9m=JxQ2GF=+9qyyAKA () mail ! gmail ! com


Hi Spencer,

Just as a guess, I suppose that int and long exist to make life a little
> easier when you're deserializing. They act as promises that, despite being
> variable-length, the integer you read out will fit in a certain-sized chunk
> of memory.


I agree this is the reason. A lot of languages (Python is not one of them)
need a consistent in-memory size for efficiency.

But if this is the goal, it seems like fixed-size integers would do a
> better job, and would let deserializers be *much* more efficient since they
> don't need to do multiple instructions (including a conditional!) on every
> single byte of the input.


There is a tradeoff here between storage/wire efficiency and CPU
efficiency. In many cases integers tend to be small, and branch predictors
can do a good job of hiding the cost of the branches. When that applies,
increasing storage/on-the-wire cost by 4x-8x doesn't make sense. Other
serialization formats like protobuf [1] do let users choose between
fixed- and variable-width encodings.
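To make the storage side of the tradeoff concrete, here is a sketch of the
zigzag + base-128 varint scheme Avro uses for int/long (an illustration,
not the actual library code): a small value costs one byte on the wire,
where a fixed-width long always costs eight.

```python
import struct

def encode_long(n: int) -> bytes:
    """Zigzag + base-128 varint encoding, as used by Avro's int/long."""
    z = (n << 1) ^ (n >> 63)           # zigzag maps small magnitudes to small codes
    out = bytearray()
    while z & ~0x7F:                   # more than 7 bits remaining?
        out.append((z & 0x7F) | 0x80)  # emit low 7 bits with continuation flag set
        z >>= 7
    out.append(z)
    return bytes(out)

print(len(encode_long(3)))        # 1 byte on the wire
print(len(struct.pack("<q", 3)))  # 8 bytes as a fixed-width little-endian long
```

(Python's unbounded ints make the arithmetic shift in the zigzag step just
work for negatives; a C implementation would use an unsigned 64-bit cast.)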

I think the general assumptions about the cost of compute versus
storage/networking have changed to a certain degree since the time when a
lot of serialization formats were created, and that in many cases making
more use of fixed-size types makes sense. Flatbuffers [2] and Capnproto [3]
are serialization formats that take this to the extreme.

I can't speak to plans for an Avro 2, but based on my experience it is
non-trivial to get a new serialization format to the point where it is
useful (i.e., has good support across many languages), and even harder to
drive adoption.


[1] https://developers.google.com/protocol-buffers/docs/proto3#scalar
[2] https://google.github.io/flatbuffers/
[3] https://capnproto.org/
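For reference, the per-byte work Spencer describes looks roughly like this
in a varint decoder (a sketch, not Avro's actual implementation): a mask, a
shift, and a conditional executed for every input byte.

```python
def decode_long(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one zigzag varint; returns (value, next read position)."""
    z = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        z |= (b & 0x7F) << shift     # accumulate 7 payload bits
        shift += 7
        if not (b & 0x80):           # the branch taken on every byte
            break
    return (z >> 1) ^ -(z & 1), pos  # undo the zigzag mapping
```

Reading a fixed-width field, by contrast, is a single unaligned load plus a
byte swap at worst, which is where the CPU-efficiency argument comes from.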

On Fri, Mar 26, 2021 at 2:35 PM Spencer Nelson <s@spencerwnelson.com> wrote:

> Why does Avro have separate 'int' and 'long' types, and what does it mean
> for them to represent 32- and 64-bit integers?
> 
> Since their binary encoding is variable-length, it appears to me that they
> are unbounded in practice. Encoder and decoder functions for int and long
> are completely identical, other than checking that a value doesn't exceed
> the 32- and 64-bit maximum values. In fact, the Python implementation's
> read_int is just an alias for read_long [1], and the same goes for
> write_int / write_long.
> 
> Just as a guess, I suppose that int and long exist to make life a little
> easier when you're deserializing. They act as promises that, despite being
> variable-length, the integer you read out will fit in a certain-sized chunk
> of memory.
> 
> But if this is the goal, it seems like fixed-size integers would do a
> better job, and would let deserializers be *much* more efficient since they
> don't need to do multiple instructions (including a conditional!) on every
> single byte of the input.
> 
> I don't know how open folks are to dramatic changes (is Avro 2 a concept
> that anyone is talking about?) but I think that Avro would be significantly
> improved with a rethink of its integer types. Some sort of fixed-length
> integer type[s], and just one variable-length integer type, would be
> clarifying, more expressive, and more efficient, I think.
> 
> 
> 
> [1]
> 
> https://github.com/apache/avro/blob/5bd7cfe0bf742d0482bf6f54b4541b4d22cc87d9/lang/py/avro/io.py#L251-L255
>  


