'Re: Another columnar format Parquet'


List:       drill-dev
Subject:    Re: Another columnar format Parquet
From:       Tsuyoshi OZAWA <ozawa.tsuyoshi () gmail ! com>
Date:       2013-03-13 21:08:01
Message-ID: CAAD07OK9L0zivtykrb5crkhSybZrtF5=k3byBqTGT7Sij+TJzQ () mail ! gmail ! com

One alternative columnar store is WiredTiger, which is used by amazon.com.
It provides both columnar and record-style (row) storage behind a library
API similar to Berkeley DB.

One concern is that WiredTiger is licensed under the GPL and BSD.
However, supporting it could strengthen the Drill project.

http://wiredtiger.com/

On Wed, Mar 13, 2013 at 4:22 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
> Can you bring 5 slides on parquet?  (ppt or pptx?)
>
> On Tue, Mar 12, 2013 at 8:59 PM, Julien Le Dem <julien@twitter.com> wrote:
>
>> I should be able to come to the Drill meetup tomorrow.
>> We can chat about it then.
>> Julien
>>
>> On Tue, Mar 12, 2013 at 1:43 PM, Dmitriy Ryaboy <dvryaboy@gmail.com>
>> wrote:
>> > ColumnIO implementations return values from a column independently of other
>> > columns; RecordReaderImplementation does materialize the whole record (by
>> > using a bunch of column readers at the same time). You could construct a
>> > column-at-a-time, late materialization API by dropping directly into using
>> > column readers; so it just depends on which level of abstraction you want
>> > to hook up with.
>> >
>> > We were initially concerned with "record-oriented" frameworks so we built
>> > the record materialization machinery for them first; a more truly columnar
>> > engine should work with ColumnIO instead of RecordReaders.
>> >
>> > Also, since the API is still young, it's certainly open to discussion and
>> > improvement.
>> >
>> > D
>> >
>> >
>> > On Tue, Mar 12, 2013 at 11:16 AM, Todd Lipcon <todd@cloudera.com> wrote:
>> >
>> >> On Tue, Mar 12, 2013 at 11:11 AM, Jacques Nadeau <jacques@apache.org>
>> >> wrote:
>> >>
>> >> > Joined, thanks.  I'm glad that the approach was open for this.  I
>> think
>> >> > that helps its chances to be ubiquitous.  As much as this might be
>> >> > blasphemous to some, I really hope that the final solution to the
>> query
>> >> > wars is a collaborative solution as opposed to a competitive one.
>> >> >
>> >> > Having not looked at the code yet, do the existing read interfaces
>> >> support
>> >> > working with "late materialization" execution strategies similar to
>> some
>> >> of
>> >> > the ideas at [1]?  Definitely seems harder to implement in a
>> >> > nested/repeated environment but wanted to get a sense of the thinking
>> >> > behind the initial efforts.
>> >> >
>> >>
>> >> The existing read interface in Java is tuple-at-a-time, but there's no
>> >> reason one couldn't build a column-at-a-time late materialization approach.
>> >> It would just be a lot more "custom", and not directly user-usable, so
>> >> there's none in the initial implementation.
>> >>
>> >> Like you said, it's a little tougher with arbitrary nesting, but I think
>> >> still doable.
>> >>
>> >> -Todd
>> >>
>> >> >
>> >> > On Tue, Mar 12, 2013 at 10:40 AM, Todd Lipcon <todd@cloudera.com>
>> wrote:
>> >> >
>> >> > > Hey Jacques,
>> >> > >
>> >> > > Feel free to ping us with any questions. Despite some of the
>> _users_ of
>> >> > > Parquet competing with each other (eg query engines), we hope the
>> file
>> >> > > format itself can be easily implemented by everyone and become
>> >> > ubiquitous.
>> >> > >
>> >> > > There are a few changes still in flight that we're working on, so
>> you
>> >> may
>> >> > > want to join the parquet dev mailing list as well to follow along.
>> >> > >
>> >> > > Thanks
>> >> > > -Todd
>> >> > >
>> >> > > On Tue, Mar 12, 2013 at 10:29 AM, Jacques Nadeau <
>> jacques@apache.org>
>> >> > > wrote:
>> >> > >
>> >> > > > When you said soon, you meant very soon.  This looks like great
>> work.
>> >> > > >  Thanks for sharing it with the world.  Will come back after
>> spending
>> >> > > some
>> >> > > > time with it.
>> >> > > >
>> >> > > > thanks again,
>> >> > > > Jacques
>> >> > > >
>> >> > > >
>> >> > > >
>> >> > > > On Tue, Mar 12, 2013 at 9:50 AM, Julien Le Dem <
>> julien@twitter.com>
>> >> > > wrote:
>> >> > > >
>> >> > > > > The repo is now available: http://parquet.github.com/
>> >> > > > > Let me know if you have questions
>> >> > > > >
>> >> > > > > On Mon, Mar 11, 2013 at 11:31 AM, Jacques Nadeau <
>> >> jacques@apache.org
>> >> > >
>> >> > > > > wrote:
>> >> > > > > > There definitely seem to be some new kids on the block.  I
>> really
>> >> > > hope
>> >> > > > > that
>> >> > > > > > Drill can adopt either ORC or Parquet as a closely related
>> >> "native"
>> >> > > > > format.
>> >> > > > > >   At the moment, I'm actually more focused on the in-memory
>> >> > execution
>> >> > > > > > format and the right abstraction to support compressed
>> columnar
>> >> > > > execution
>> >> > > > > > and vectorization.  Historically, the biggest gaps I'd worry
>> >> about
>> >> > > are
>> >> > > > > > java-centricity and expectation of early materialization &
>> >> > > > decompression.
>> >> > > > > >  Once we get some execution stuff working, let's see how each
>> fits
>> >> > in.
>> >> > > > > >  Rather than start a third competing format (or fourth if you
>> >> count
>> >> > > > > > Trevni), let's either use or extend/contribute back on one of
>> the
>> >> > > > > existing
>> >> > > > > > new kids.
>> >> > > > > >
>> >> > > > > > Julien, do you think more will be shared about Parquet before
>> the
>> >> > > > Hadoop
>> >> > > > > > Summit so we can start toying with using it inside of Drill?
>> >> > > > > >
>> >> > > > > > J
>> >> > > > > >
>> >> > > > > > On Mon, Mar 11, 2013 at 11:02 AM, Ken Krugler
>> >> > > > > > <kkrugler_lists@transpac.com>wrote:
>> >> > > > > >
>> >> > > > > >> Hi all,
>> >> > > > > >>
>> >> > > > > >> I've been trying to track down status/comparisons of various
>> >> > > columnar
>> >> > > > > >> formats, and just heard about Parquet.
>> >> > > > > >>
>> >> > > > > >> I don't have any direct experience with Parquet, but Really
>> >> Smart
>> >> > > Guy
>> >> > > > > said:
>> >> > > > > >>
>> >> > > > > >> > From what I hear there are two key features that
>> >> > > > > >> > differentiate it from ORC and Trevni: 1) columns can be
>> >> > optionally
>> >> > > > > split
>> >> > > > > >> into
>> >> > > > > >> > separate files, and 2) the mechanism for shredding nested
>> >> fields
>> >> > > > into
>> >> > > > > >> > columns is taken almost verbatim from Dremel. Feature (1)
>> >> won't
>> >> > be
>> >> > > > > >> practical
>> >> > > > > >> > to use until Hadoop introduces support for a file group
>> >> locality
>> >> > > > > >> feature, but once it
>> >> > > > > >> > does this feature should enable more efficient use of the
>> >> buffer
>> >> > > > cache
>> >> > > > > >> for predicate
>> >> > > > > >> > pushdown operations.
>> >> > > > > >>
>> >> > > > > >> -- Ken
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >> On Mar 11, 2013, at 10:56am, Julien Le Dem wrote:
>> >> > > > > >>
>> >> > > > > >> > Parquet is actually implementing the algorithm described in
>> >> the
>> >> > > > > >> > "Nested Columnar Storage" section of the Dremel paper[1].
>> >> > > > > >> >
>> >> > > > > >> > [1] http://research.google.com/pubs/pub36632.html
>> >> > > > > >> >
>> >> > > > > >> > On Mon, Mar 11, 2013 at 10:41 AM, Timothy Chen <
>> >> > tnachen@gmail.com
>> >> > > >
>> >> > > > > >> wrote:
>> >> > > > > >> >> Just saw this:
>> >> > > > > >> >>
>> >> > > > > >> >> http://t.co/ES1dGDZlKA
>> >> > > > > >> >>
>> >> > > > > >> >> I know Trevni is another Dremel-inspired columnar format as well. Has anyone
>> >> > > > > >> >> seen much info on Parquet and how it's different?
>> >> > > > > >> >>
>> >> > > > > >> >> Tim
>> >> > > > > >>
>> >> > > > > >> --------------------------
>> >> > > > > >> Ken Krugler
>> >> > > > > >> +1 530-210-6378
>> >> > > > > >> http://www.scaleunlimited.com
>> >> > > > > >> custom big data solutions & training
>> >> > > > > >> Hadoop, Cascading, Cassandra & Solr
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > > >>
>> >> > > > >
>> >> > > >
>> >> > >
>> >> > >
>> >> > >
>> >> > > --
>> >> > > Todd Lipcon
>> >> > > Software Engineer, Cloudera
>> >> > >
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Todd Lipcon
>> >> Software Engineer, Cloudera
>> >>
>>



--
- Tsuyoshi
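
For readers following the quoted discussion of tuple-at-a-time reading versus column-at-a-time "late materialization", here is a minimal sketch of the difference. It is not Parquet's API: the in-memory dict of lists and the names `select_rows`, `late_materialize`, and `early_materialize` are all hypothetical, and it handles only flat columns (the nested/repeated case mentioned in the thread would also need Dremel-style repetition and definition levels, which are not shown).

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical stand-in for a columnar file: one sequence of values per column.
Columns = Dict[str, Sequence]


def select_rows(column: Sequence, predicate: Callable[[object], bool]) -> List[int]:
    """Scan a single column and return the positions whose value matches."""
    return [i for i, value in enumerate(column) if predicate(value)]


def late_materialize(columns: Columns, filter_column: str,
                     predicate: Callable[[object], bool],
                     projected: List[str]) -> List[dict]:
    """Column-at-a-time: filter on one column first, then fetch only the
    projected columns, and only at the surviving positions."""
    positions = select_rows(columns[filter_column], predicate)
    return [{name: columns[name][i] for name in projected} for i in positions]


def early_materialize(columns: Columns, filter_column: str,
                      predicate: Callable[[object], bool],
                      projected: List[str]) -> List[dict]:
    """Tuple-at-a-time: assemble every full record, then filter and project."""
    names = list(columns)
    row_count = len(columns[names[0]]) if names else 0
    result = []
    for i in range(row_count):
        record = {name: columns[name][i] for name in names}  # touches all columns
        if predicate(record[filter_column]):
            result.append({name: record[name] for name in projected})
    return result


if __name__ == "__main__":
    data = {
        "user": ["a", "b", "c", "d"],
        "age": [31, 17, 45, 12],
        "city": ["NYC", "SF", "LA", "SEA"],
    }
    print(late_materialize(data, "age", lambda age: age >= 18, ["user", "city"]))
```

Both functions return the same rows; the late variant simply never reads values from columns or positions that the filter has already ruled out, which is the property a columnar engine would want from a lower-level column reader.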
