List:       drill-dev
Subject:    Re: contribution
From:       Jacques Nadeau <jacques () apache ! org>
Date:       2013-03-17 1:34:38
Message-ID: CAKa9qDnyytdSAwp4o3yc8s5W++HXtDvzLw9FAOgse4MrnwdsjA () mail ! gmail ! com

Hey David,

The java-exec framework is not far enough along that it makes sense for me
to push it externally yet.  However, I did push my initial WIP physical
plan approach.  You can find it here:
https://github.com/jacques-n/incubator-drill/tree/physical_plan_updates

Hopefully, I will get further along on the java-exec stuff soon.

I'd suggest that you focus your energy on the StorageEngine API and HBase
implementation.  If you're up for it, let's do a quick Skype chat to sync
up.  Let me know your availability over the next few days.

Thanks,
Jacques



On Fri, Mar 15, 2013 at 6:59 PM, David Alves <davidralves@gmail.com> wrote:

> that'd be great thanks.
>
> -david
>
> On Mar 15, 2013, at 8:51 PM, Jacques Nadeau <jacques.drill@gmail.com>
> wrote:
>
> > I've been under the weather the last few days and haven't made much
> > progress. Let me see if I can get you something tomorrow.
> >
> > On Mar 15, 2013, at 2:36 PM, David Alves <davidralves@gmail.com> wrote:
> >
> >> Hi Jacques
> >>
> >>   Is there any chance we could get a preview of this physical plan
> >> stuff and basic plumbing for distributed execution before the weekend?
> >> Maybe in a GitHub branch somewhere?
> >>   I mean it doesn't have to be complete or even running; I'd just like
> >> to make some progress with other stuff, and keeping it in line with
> >> whichever plumbing you already have would be great.
> >>
> >> Best
> >> David
> >>
> >> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <jacques@apache.org> wrote:
> >>
> >>> I'm working on some physical plan stuff as well as some basic plumbing
> >>> for distributed execution.  It's very much in progress, so I need to clean
> >>> things up a bit before we could collaborate/divide and conquer on it.
> >>> Depending on your timing and availability, maybe I could put some of this
> >>> together in the next couple of days so that you could plug in rather than
> >>> reinvent.  In the meantime, pushing forward the builder stuff, additional
> >>> test cases on the reference interpreter and/or thinking through the logical
> >>> plan storage engine pushdown/rewrite could be very useful.
> >>>
> >>> Let me know your thoughts.
> >>>
> >>> thanks,
> >>> Jacques
> >>>
> >>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <davidralves@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Jacques
> >>>>
> >>>>      I can assign issues to myself now, thanks.
> >>>>      What you say with respect to the logical/physical/execution layers
> >>>> sounds good.
> >>>>      My main concern, for the moment, is to have something working as
> >>>> fast as possible, i.e. some daemons that I'd be able to deploy to a
> >>>> working HBase cluster and send work to in some form (a first step would
> >>>> be to treat it as a non-distributed engine where each daemon runs an
> >>>> instance of the prototype).
> >>>>      Here's where I'd like to go next:
> >>>>      - lay the groundwork for the daemons (scripts/RPC iface/wiring
> >>>> protocol).
> >>>>      - create an execution engine iface that abstracts future
> >>>> implementations, and make it available through the RPC iface. This
> >>>> would sit in front of the ref impl for now and would be replaced by C++
> >>>> down the line.
> >>>>
> >>>>      I think we can probably concentrate on the capabilities iface a
> >>>> bit down the line but, as a first approach, I see it simply providing a
> >>>> simple set of ops that it is able to run internally.
> >>>>      How to abstract locality/partitioning/schema capabilities is still
> >>>> not clear to me though. Thoughts?
> >>>>
> >>>> David
> >>>>
> >>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <jacques@apache.org>
> >>>> wrote:
> >>>>
> >>>>> I'm working on a presentation that will better illustrate the layers.
> >>>>> There are actually three key plans.  Thinking to date has been to break
> >>>>> the plans down into logical, physical and execution.  The third hasn't
> >>>>> been expressed well here and is entirely an internal domain to the
> >>>>> execution engine.  Following some classic methods: Logical expresses what
> >>>>> we want to do, Physical expresses how we want to do it (adding points of
> >>>>> parallelization but not specifying particular amounts of parallelization
> >>>>> or node by node assignments).  The execution engine is then responsible
> >>>>> for determining the amount of parallelization of a particular plan in
> >>>>> light of system load (likely leveraging Berkeley's Sparrow work), task
> >>>>> priority and specific data locality information, building sub-DAGs to be
> >>>>> assigned to individual nodes, and executing the plan.
> >>>>>
> >>>>> So in the higher logical and physical levels, a single Scan and
> >>>>> subsequent ScanPOP should be okay...  (ScanROPs have a separate problem
> >>>>> since they ignore the level of separation we're planning for the real
> >>>>> execution layer.  This is why the current ref impl turns a single Scan
> >>>>> into potentially a union of ScanROPs... not elegant but logically
> >>>>> correct.)
> >>>>>
> >>>>> The capabilities interface still needs to be defined for how a storage
> >>>>> engine reveals its logical capabilities and thus consumes part of the
> >>>>> plan.
> >>>>>
> >>>>> J
> >>>>>
> >>>>>
> >>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> Hi Lisen
> >>>>>>
> >>>>>>     Some of what you are saying, like push down of ops such as filter,
> >>>>>> projection or partial aggregation below the storage engine scanner
> >>>>>> level, or sub tree execution, is actively being discussed in issues
> >>>>>> DRILL-13 (Storage Engine Interface) and DRILL-15 (HBase storage engine);
> >>>>>> your input in these issues is most welcome.
> >>>>>>
> >>>>>>     HBase in particular has the notion of
> >>>>>> endpoints/coprocessors/filters that allow pushing this down easily
> >>>>>> (this is also in line with what other parallel database over NoSQL
> >>>>>> implementations like Tajo do).
> >>>>>>     A possible approach is to have the optimizer change the order of
> >>>>>> the ops to place them below the storage engine scanner and let the SE
> >>>>>> impl deal with it internally.
> >>>>>>
> >>>>>>     There are also some other pieces missing at the moment AFAIK, like
> >>>>>> a distributed metadata store, the drill daemons, wiring, etc.
> >>>>>>
> >>>>>>     So in summary, you're absolutely right, and if you're particularly
> >>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
> >>>>>> interested in collaborating.
> >>>>>>
> >>>>>> Best
> >>>>>> David
> >>>>>>
> >>>>>>
> >>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <immars@gmail.com> wrote:
> >>>>>>
> >>>>>>> Hi David,
> >>>>>>>
> >>>>>>> Very nice to see your effort on this.
> >>>>>>>
> >>>>>>> Hi Jacques,
> >>>>>>>
> >>>>>>> we are also extending the drill prototype, to see if there is any chance
> >>>>>>> to meet our production need. However, we find that implementing a
> >>>>>>> performant HBase storage engine is not so straightforward, and requires
> >>>>>>> some workarounds. The problem is in the Scan interface.
> >>>>>>>
> >>>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
> >>>>>>> The storage engine provides output for a whole data source, a csv file
> >>>>>>> for example. That's sufficient for an input source like a plain file,
> >>>>>>> but for hbase, it's not very efficient, if not impossible, to let ScanROP
> >>>>>>> retrieve a whole htable into drill. Storage engines like HBase should
> >>>>>>> have some ability to do part of the DrQL query, like Filter, if a filter
> >>>>>>> can be performed by specifying startRowKey and endRowKey. A storage
> >>>>>>> engine like mysql could do more, even Join.
> >>>>>>>
> >>>>>>> Generally, it would be clearer if a ScanROP were mapped to a sub-DAG of
> >>>>>>> the logical plan DAG instead of a single Scan node in the logical plan.
> >>>>>>> If so, more implementation-specific information would be coupled into
> >>>>>>> the plan optimization & transformation phase. I guess that's the price
> >>>>>>> to pay when optimization comes, or is there another way I failed to see?
> >>>>>>>
> >>>>>>> Please correct me if anything is wrong.
> >>>>>>>
> >>>>>>> thanks,
> >>>>>>>
> >>>>>>> Lisen
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <davidralves@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi Jacques
> >>>>>>>>
> >>>>>>>>    I've submitted a first pass patch to DRILL-15.
> >>>>>>>>    I did this mostly because HBase will be my main target and because
> >>>>>>>> I wanted to get a feel for what would be a nice interface for DRILL-13.
> >>>>>>>> I have some thoughts that I will post soon.
> >>>>>>>>    btw: I still can't assign issues to myself in JIRA, did you forget
> >>>>>>>> to add me as a contributor?
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <jacques@apache.org>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hey David,
> >>>>>>>>>
> >>>>>>>>> These sound good.  I've added you as a contributor on jira so you can
> >>>>>>>>> assign tasks to yourself.  I think 45 and 46 are good places to start.
> >>>>>>>>> 15 depends on 13 and working on the two hand in hand would probably be
> >>>>>>>>> a good idea.  Maybe we could do a design discussion on 15 and 13 here
> >>>>>>>>> once you have some time to focus on it.
> >>>>>>>>>
> >>>>>>>>> Jacques
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <davidralves@gmail.com>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi All
> >>>>>>>>>>
> >>>>>>>>>>   I have a new academic project for which I'd like to use drill
> >>>>>>>>>> since none of the other parallel database over hadoop/nosql
> >>>>>>>>>> implementations fit just right.
> >>>>>>>>>>   To this end I've been tinkering with the prototype, trying to
> >>>>>>>>>> find where I'd be most useful.
> >>>>>>>>>>
> >>>>>>>>>>   Here's where I'd like to start, if you agree:
> >>>>>>>>>>   - implement HBase storage engine (DRILL-15)
> >>>>>>>>>>           - start with simple scanning and push down of
> >>>>>>>>>> selection/projection
> >>>>>>>>>>   - implement the LogicalPlanBuilder (DRILL-45)
> >>>>>>>>>>   - set up coding style in the wiki (formatting/imports etc.,
> >>>>>>>>>> DRILL-46)
> >>>>>>>>>>   - create builders for all logical plan elements/make logical
> >>>>>>>>>> plans immutable (no issue for this, I'd like to hear your thoughts
> >>>>>>>>>> first).
> >>>>>>>>>>
> >>>>>>>>>>   Please let me know your thoughts, and if you agree please assign
> >>>>>>>>>> the issues to me (it seems that I can't assign them myself).
> >>>>>>>>>>
> >>>>>>>>>> Best
> >>>>>>>>>> David Alves
> >>
>
>
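
For reference, here is a minimal sketch of the row-key range pushdown that Lisen raises earlier in the thread: a filter that can be expressed as startRowKey/endRowKey is handed to the storage engine instead of being applied after a full table scan. It is written in Python purely for illustration. SortedTable, RowKeyRange, scan_then_filter and scan_with_pushdown are hypothetical names, not part of Drill or HBase, and the start-inclusive, end-exclusive range is an assumption made for the sketch.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass(frozen=True)
class RowKeyRange:
    """Hypothetical row-key filter: start inclusive, end exclusive (assumed)."""
    start_row_key: Optional[str] = None  # None means "from the first row"
    end_row_key: Optional[str] = None    # None means "through the last row"


class SortedTable:
    """Toy stand-in for a table whose rows are stored sorted by row key."""

    def __init__(self, rows: dict):
        self._keys = sorted(rows)
        self._rows = rows
        self.rows_read = 0  # how many rows the "storage engine" had to touch

    def scan(self, key_range: RowKeyRange = RowKeyRange()) -> Iterator[tuple]:
        # Binary-search the sorted keys so only rows inside the range are read.
        lo = 0 if key_range.start_row_key is None else bisect_left(
            self._keys, key_range.start_row_key)
        hi = len(self._keys) if key_range.end_row_key is None else bisect_left(
            self._keys, key_range.end_row_key)
        for key in self._keys[lo:hi]:
            self.rows_read += 1
            yield key, self._rows[key]


def scan_then_filter(table: SortedTable, key_range: RowKeyRange) -> list:
    """No pushdown: pull every row out of the table, then filter."""
    return [
        (k, v) for k, v in table.scan()
        if (key_range.start_row_key is None or k >= key_range.start_row_key)
        and (key_range.end_row_key is None or k < key_range.end_row_key)
    ]


def scan_with_pushdown(table: SortedTable, key_range: RowKeyRange) -> list:
    """Pushdown: the filter becomes the scan range inside the storage engine."""
    return list(table.scan(key_range))


if __name__ == "__main__":
    rows = {f"user{i:03d}": {"n": i} for i in range(100)}
    wanted = RowKeyRange("user010", "user020")

    full, pushed = SortedTable(rows), SortedTable(rows)
    assert scan_then_filter(full, wanted) == scan_with_pushdown(pushed, wanted)
    print(full.rows_read, pushed.rows_read)  # 100 10
```

Both paths return the same rows; the difference is only in how many rows leave the storage layer, which is the cost the thread is concerned about for large tables.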

