List: drill-dev
Subject: Re: contribution
From: Timothy Chen <tnachen () gmail ! com>
Date: 2013-03-13 21:40:10
Message-ID: CAFx0iW8f9iEVtaN_kOuDN+b7ZU6NRhGYe4cxa0Oowb+y_h7d9Q () mail ! gmail ! com
Looking forward to the plumbing as well, since my JSON scan op has been
sitting there for a while now :)
Tim
On Wed, Mar 13, 2013 at 2:30 PM, David Alves <davidralves@gmail.com> wrote:
> Getting the basic plumbing to a point where we could work together on
> it/use it elsewhere as soon as you can would be awesome.
> As soon as I get that I can start on the daemons/scripts.
> I'll focus on the SE iface and on HBase pushdown for the moment.
>
> -david
>
> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <jacques@apache.org> wrote:
>
> > I'm working on some physical plan stuff as well as some basic
> > plumbing for distributed execution. It's very much in progress, so I
> > need to clean things up a bit before we could collaborate / divide
> > and conquer on it. Depending on your timing and availability, maybe
> > I could put some of this together in the next couple of days so that
> > you could plug in rather than reinvent. In the meantime, pushing
> > forward the builder stuff, additional test cases on the reference
> > interpreter, and/or thinking through the logical plan storage engine
> > pushdown/rewrite could be very useful.
> >
> > Let me know your thoughts.
> >
> > thanks,
> > Jacques
> >
> > On Wed, Mar 13, 2013 at 9:47 AM, David Alves <davidralves@gmail.com>
> wrote:
> >
> >> Hi Jacques
> >>
> >> I can assign issues to myself now, thanks.
> >> What you say wrt the logical/physical/execution layers sounds
> >> good.
> >> My main concern, for the moment, is to have something working as
> >> fast as possible, i.e. some daemons that I'd be able to deploy to a
> >> working hbase cluster and send them work to do in some form (a first
> >> step would be to treat it as a non-distributed engine where each
> >> daemon runs an instance of the prototype).
> >> Here's where I'd like to go next:
> >> - lay the groundwork for the daemons (scripts / rpc iface / wire
> >> protocol).
> >> - create an execution engine iface that makes it possible to abstract
> >> future implementations, and make it available through the rpc iface.
> >> this would sit in front of the ref impl for now and would be replaced
> >> by cpp down the line.
> >>
> >> I think we can probably concentrate on the capabilities iface a
> >> bit further down the line, but as a first approach I see it simply
> >> providing the set of ops it is able to run internally.
> >> How to abstract locality/partitioning/schema capabilities is still
> >> not clear to me though. Thoughts?
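A capabilities iface along those lines could be sketched roughly like this (a minimal sketch in Python for brevity; every name here, `Capability`, `ExecutionEngine`, `ReferenceEngine`, is hypothetical and not an actual Drill API):

```python
from enum import Enum, auto

class Capability(Enum):
    """Ops a storage/execution engine claims it can run internally (hypothetical)."""
    SCAN = auto()
    FILTER = auto()
    PROJECT = auto()
    PARTIAL_AGGREGATE = auto()

class ExecutionEngine:
    """Sketch of an engine iface that could sit in front of the ref impl."""
    def capabilities(self) -> set:
        # A bare engine only advertises scanning; an HBase SE could add FILTER, etc.
        return {Capability.SCAN}

    def execute(self, plan):
        raise NotImplementedError

class ReferenceEngine(ExecutionEngine):
    """Stand-in for the reference interpreter behind the iface."""
    def execute(self, plan):
        # Pretend to run each op in the (toy) plan in order.
        return [f"ran {op}" for op in plan]

engine = ReferenceEngine()
print(Capability.FILTER in engine.capabilities())  # False: cannot filter internally
print(engine.execute(["scan"]))
```

The caller would then consult `capabilities()` before deciding which ops to hand the engine and which to keep above it.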
> >>
> >> David
> >>
> >> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <jacques@apache.org>
> wrote:
> >>
> >>> I'm working on a presentation that will better illustrate the
> >>> layers. There are actually three key plans. Thinking to date has
> >>> been to break the plans down into logical, physical and execution.
> >>> The third hasn't been expressed well here and is entirely an
> >>> internal domain of the execution engine. Following some classic
> >>> methods: Logical expresses what we want to do; Physical expresses
> >>> how we want to do it (adding points of parallelization but not
> >>> specifying particular amounts of parallelization or node-by-node
> >>> assignments). The execution engine is then responsible for
> >>> determining the amount of parallelization of a particular plan
> >>> based on system load (likely leveraging Berkeley's Sparrow work),
> >>> task priority and specific data locality information, building
> >>> sub-DAGs to be assigned to individual nodes, and executing the
> >>> plan.
> >>>
> >>> So in the higher logical and physical levels, a single Scan and
> >>> subsequent ScanPOP should be okay... (ScanROPs have a separate
> >>> problem since they ignore the level of separation we're planning
> >>> for the real execution layer. This is why the current ref impl
> >>> turns a single Scan into potentially a union of ScanROPs... not
> >>> elegant but logically correct.)
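That Scan-to-union-of-ScanROPs expansion can be illustrated with a toy sketch (Python; the node names `scan`, `union`, `scan-rop` mirror the discussion, but the dict shapes and the `expand_scan` helper are invented here, not Drill's actual plan classes):

```python
def expand_scan(scan, fragments):
    """Expand one logical Scan into a union of per-fragment ScanROPs.

    `scan` is a dict like {"op": "scan", "source": "hbase://t1"} and
    `fragments` lists the physical pieces (e.g. HBase regions) the
    source splits into. Shapes are illustrative only.
    """
    rops = [{"op": "scan-rop", "source": scan["source"], "fragment": f}
            for f in fragments]
    if len(rops) == 1:
        return rops[0]                # no union needed for a single fragment
    return {"op": "union", "inputs": rops}

plan = expand_scan({"op": "scan", "source": "hbase://t1"},
                   fragments=["region-a", "region-b"])
print(plan["op"])           # union
print(len(plan["inputs"]))  # 2
```

The union node is what makes the expansion "logically correct": downstream operators still see a single stream, regardless of how many fragments the scan was split into.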
> >>>
> >>> The capabilities interface still needs to be defined for how a
> >>> storage engine reveals its logical capabilities and thus consumes
> >>> part of the plan.
> >>>
> >>> J
> >>>
> >>>
> >>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com>
> >> wrote:
> >>>
> >>>> Hi Lisen
> >>>>
> >>>> Some of what you are saying, like pushdown of ops (filter,
> >>>> projection, or partial aggregation) below the storage engine
> >>>> scanner level, or sub-tree execution, is actively being discussed
> >>>> in issues DRILL-13 (Storage Engine Interface) and DRILL-15 (HBase
> >>>> storage engine); your input on these issues is most welcome.
> >>>>
> >>>> HBase in particular has the notion of
> >>>> endpoints/coprocessors/filters that allow pushing this down easily
> >>>> (this is also in line with what other parallel-database-over-nosql
> >>>> implementations like Tajo do).
> >>>> A possible approach is to have the optimizer change the order of
> >>>> the ops to place them below the storage engine scanner and let the
> >>>> SE impl deal with it internally.
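One way to picture that optimizer step (a minimal sketch, assuming a plan represented as a simple list of op dicts ending in the scan; `push_below_scan`, `PUSHABLE`, and the dict shapes are all hypothetical, not Drill code):

```python
PUSHABLE = {"filter", "project"}  # ops an SE might claim it can absorb

def push_below_scan(plan, se_capabilities):
    """Move pushable ops sitting directly above the scan into the scan node.

    `plan` is an ordered list of op dicts whose last element is a
    {"op": "scan"} node. Absorbed ops are recorded in scan["pushed"]
    for the SE impl to handle internally. Illustrative shapes only.
    """
    *above, scan = plan
    scan = dict(scan, pushed=list(scan.get("pushed", [])))
    remaining = []
    for op in reversed(above):  # walk upward from the scan
        if not remaining and op["op"] in PUSHABLE and op["op"] in se_capabilities:
            scan["pushed"].append(op)  # absorb into the scan
        else:
            remaining.append(op)       # first non-pushable op stops the walk
    return list(reversed(remaining)) + [scan]

plan = [{"op": "aggregate"}, {"op": "filter", "pred": "k > 10"}, {"op": "scan"}]
optimized = push_below_scan(plan, se_capabilities={"filter"})
print([n["op"] for n in optimized])  # ['aggregate', 'scan']
```

If the SE reports no capabilities, the plan comes back unchanged, which is the safe default for engines that can only do raw scans.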
> >>>>
> >>>> There are also some other pieces missing at the moment AFAIK,
> >>>> like a distributed metadata store, the drill daemons, wiring, etc.
> >>>>
> >>>> So in summary, you're absolutely right, and if you're
> >>>> particularly interested in the HBase SE impl (as I am, for the
> >>>> moment) I'd be interested in collaborating.
> >>>>
> >>>> Best
> >>>> David
> >>>>
> >>>>
> >>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <immars@gmail.com> wrote:
> >>>>
> >>>>> Hi David,
> >>>>>
> >>>>> Very nice to see your effort on this.
> >>>>>
> >>>>> Hi Jacques,
> >>>>>
> >>>>> We are also extending the drill prototype, to see if there is
> >>>>> any chance it can meet our production needs. However, we find
> >>>>> that implementing a performant HBase storage engine is not so
> >>>>> straightforward, and requires some workarounds. The problem is in
> >>>>> the Scan interface.
> >>>>>
> >>>>> In drill's physical plan model, ScanROP is in charge of table
> >>>>> scans. The storage engine provides output for a whole data
> >>>>> source, a csv file for example. That's sufficient for input
> >>>>> sources like plain files, but for hbase it's not very efficient,
> >>>>> if not impossible, to let ScanROP retrieve a whole htable into
> >>>>> drill. Storage engines like HBase should have some ability to do
> >>>>> part of the DrQL query, like a Filter, when it can be performed by
> >>>>> specifying startRowKey and endRowKey. Storage engines like mysql
> >>>>> could do even more, such as Joins.
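The startRowKey/endRowKey idea can be simulated without HBase at all (a toy sketch: a plain dict stands in for an htable, and the hypothetical `scan_range` plays the role of an HBase Scan with start/stop rows; this is not actual HBase client code):

```python
from bisect import bisect_left

def scan_range(table, start_key=None, end_key=None):
    """Return (row_key, value) pairs with start_key <= key < end_key.

    Keys are sorted once, then binary search picks the slice, mimicking
    how a range scan avoids reading the whole table. HBase-style
    semantics: start inclusive, end exclusive.
    """
    keys = sorted(table)
    lo = 0 if start_key is None else bisect_left(keys, start_key)
    hi = len(keys) if end_key is None else bisect_left(keys, end_key)
    return [(k, table[k]) for k in keys[lo:hi]]

htable = {"row1": "a", "row2": "b", "row3": "c", "row9": "z"}
print(scan_range(htable, "row2", "row9"))  # [('row2', 'b'), ('row3', 'c')]
```

A filter on the row key translates directly into the (start, end) pair, so the engine only touches the rows in range instead of pulling the whole table and filtering above the scan.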
> >>>>>
> >>>>> Generally, it would be clearer if a ScanROP were mapped to a
> >>>>> sub-DAG of the logical plan DAG instead of a single Scan node in
> >>>>> the logical plan. If so, more implementation-specific information
> >>>>> would couple into the plan optimization & transformation phase. I
> >>>>> guess that's the price to pay when optimization comes, or is there
> >>>>> another way I failed to see?
> >>>>>
> >>>>> Please correct me if anything is wrong.
> >>>>>
> >>>>> thanks,
> >>>>>
> >>>>> Lisen
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <davidralves@gmail.com>
> >>>> wrote:
> >>>>>
> >>>>>> Hi Jacques
> >>>>>>
> >>>>>> I've submitted a first-pass patch to DRILL-15.
> >>>>>> I did this mostly because HBase will be my main target and
> >>>>>> because I wanted to get a feel for what would be a nice interface
> >>>>>> for DRILL-13. I have some thoughts that I will post soon.
> >>>>>> btw: I still can't assign issues to myself in JIRA, did you
> >>>>>> forget to add me as a contributor?
> >>>>>>
> >>>>>> Best
> >>>>>> David
> >>>>>>
> >>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <jacques@apache.org>
> >> wrote:
> >>>>>>
> >>>>>>> Hey David,
> >>>>>>>
> >>>>>>> These sound good. I've added you as a contributor on JIRA so
> >>>>>>> you can assign tasks to yourself. I think 45 and 46 are good
> >>>>>>> places to start. 15 depends on 13, and working on the two hand
> >>>>>>> in hand would probably be a good idea. Maybe we could do a
> >>>>>>> design discussion on 15 and 13 here once you have some time to
> >>>>>>> focus on it.
> >>>>>>>
> >>>>>>> Jacques
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <
> davidralves@gmail.com>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi All
> >>>>>>>>
> >>>>>>>> I have a new academic project for which I'd like to use
> >>>>>>>> drill, since none of the other parallel database
> >>>>>>>> implementations over hadoop/nosql fit just right.
> >>>>>>>> To this end I've been tinkering with the prototype, trying
> >>>>>>>> to find where I'd be most useful.
> >>>>>>>>
> >>>>>>>> Here's where I'd like to start, if you agree:
> >>>>>>>> - implement the HBase storage engine (DRILL-15)
> >>>>>>>>   - start with simple scanning and push-down of
> >>>>>>>>     selection/projection
> >>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
> >>>>>>>> - set up coding style in the wiki (formatting/imports etc,
> >>>>>>>>   DRILL-46)
> >>>>>>>> - create builders for all logical plan elements / make
> >>>>>>>>   logical plans immutable (no issue for this yet; I'd like to
> >>>>>>>>   hear your thoughts first).
> >>>>>>>>
> >>>>>>>> Please let me know your thoughts, and if you agree please
> >>>>>>>> assign the issues to me (it seems that I can't assign them
> >>>>>>>> myself).
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> David Alves
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>