From drill-dev Wed Mar 13 21:40:10 2013
From: Timothy Chen
Date: Wed, 13 Mar 2013 21:40:10 +0000
To: drill-dev
Subject: Re: contribution
Message-Id:
X-MARC-Message: https://marc.info/?l=drill-dev&m=142065425229415
MIME-Version: 1
Content-Type: multipart/mixed; boundary="--e89a8fb1ed04196d6f04d7d540e0"

--e89a8fb1ed04196d6f04d7d540e0
Content-Type: text/plain; charset=ISO-8859-1

Looking forward to the plumbing as well, since my json scan op has sat there for a while now :)

Tim

On Wed, Mar 13, 2013 at 2:30 PM, David Alves wrote:

> Getting the basic plumbing to a point where we could work together on
> it/use it elsewhere as soon as you can would be awesome.
> As soon as I get that I can start on the daemons/scripts.
> I'll focus on the SE iface and on HBase pushdown for the moment.
>
> -david
>
> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau wrote:
>
> > I'm working on some physical plan stuff as well as some basic plumbing for
> > distributed execution. It's very much in progress so I need to clean things up a
> > bit before we could collaborate/divide and conquer on it. Depending on
> > your timing and availability, maybe I could put some of this together in
> > the next couple days so that you could plug in rather than reinvent. In
> > the meantime, pushing forward the builder stuff, additional test cases on
> > the reference interpreter and/or thinking through the logical plan storage
> > engine pushdown/rewrite could be very useful.
> >
> > Let me know your thoughts.
> >
> > thanks,
> > Jacques
> >
> > On Wed, Mar 13, 2013 at 9:47 AM, David Alves wrote:
> >
> >> Hi Jacques
> >>
> >> I can assign issues to me now, thanks.
> >> What you say wrt the logical/physical/execution layers sounds
> >> good.
> >> My main concern, for the moment, is to have something working as
> >> fast as possible, i.e.
some daemons that I'd be able to deploy to a working
> >> hbase cluster and send them work to do in some form (first step would be to
> >> treat it as a non-distributed engine where each daemon runs an instance of
> >> the prototype).
> >> Here's where I'd like to go next:
> >> - lay the groundwork for the daemons (scripts/rpc iface/wiring
> >> protocol).
> >> - create an execution engine iface that allows us to abstract future
> >> implementations, and make it available through the rpc iface. This would
> >> sit in front of the ref impl for now and would be replaced by cpp down the
> >> line.
> >>
> >> I think we can probably concentrate on the capabilities iface a
> >> bit down the line but, as a first approach, I see it simply providing a
> >> simple set of ops that it is able to run internally.
> >> How to abstract locality/partitioning/schema capabilities is still
> >> not clear to me though, thoughts?
> >>
> >> David
> >>
> >> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau wrote:
> >>
> >>> I'm working on a presentation that will better illustrate the layers.
> >>> There are actually three key plans. Thinking to date has been to break
> >>> the plans down into logical, physical and execution. The third hasn't been
> >>> expressed well here and is entirely an internal domain to the execution
> >>> engine. Following some classic methods: Logical expresses what we want to
> >>> do, Physical expresses how we want to do it (adding points of
> >>> parallelization but not specifying particular amounts of parallelization or
> >>> node by node assignments). The execution engine is then responsible for
> >>> determining the amount of parallelization of a particular plan along with
> >>> system load (likely leveraging Berkeley's Sparrow work), task priority and
> >>> specific data locality information, building sub-dags to be assigned to
> >>> individual nodes and executing the plan.
> >>>
> >>> So in the higher logical and physical levels, a single Scan and subsequent
> >>> ScanPOP should be okay... (ScanROPs have a separate problem since they
> >>> ignore the level of separation we're planning for the real execution layer.
> >>> This is why the current ref impl turns a single Scan into potentially
> >>> a union of ScanROPs... not elegant but logically correct.)
> >>>
> >>> The capabilities interface still needs to be defined for how a storage
> >>> engine reveals its logical capabilities and thus consumes part of the plan.
> >>>
> >>> J
> >>>
> >>>
> >>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves wrote:
> >>>
> >>>> Hi Lisen
> >>>>
> >>>> Some of what you are saying, like push down of ops like filter,
> >>>> projection or partial aggregation below the storage engine scanner level,
> >>>> or sub tree execution, is actively being discussed in issues DRILL-13
> >>>> (Storage Engine Interface) and DRILL-15 (HBase storage engine); your input
> >>>> in these issues is most welcome.
> >>>>
> >>>> HBase in particular has the notion of
> >>>> endpoints/coprocessors/filters that allow pushing this down easily (this is
> >>>> also in line with what other parallel database over nosql implementations
> >>>> like tajo do).
> >>>> A possible approach is to have the optimizer change the order of
> >>>> the ops to place them below the storage engine scanner and let the SE impl
> >>>> deal with it internally.
> >>>>
> >>>> There are also some other pieces missing at the moment AFAIK, like
> >>>> a distributed metadata store, the drill daemons, wiring, etc.
> >>>>
> >>>> So in summary, you're absolutely right, and if you're particularly
> >>>> interested in the HBase SE impl (as I am, for the moment) I'd be interested
> >>>> in collaborating.
> >>>>
> >>>> Best
> >>>> David
> >>>>
> >>>>
> >>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu wrote:
> >>>>
> >>>>> Hi David,
> >>>>>
> >>>>> Very nice to see your effort on this.
> >>>>>
> >>>>> Hi Jacques,
> >>>>>
> >>>>> we are also extending the drill prototype, to see if there is any chance to
> >>>>> meet our production need. However, we find that implementing a performant
> >>>>> HBase storage engine is not so straightforward, and requires some
> >>>>> workarounds. The problem is in the Scan interface.
> >>>>>
> >>>>> In drill's physical plan model, ScanROP is in charge of table scan. The storage
> >>>>> engine provides output for a whole data source, a csv file for example.
> >>>>> That's sufficient for an input source like a plain file, but for hbase, it's not
> >>>>> very efficient, if not impossible, to let ScanROP retrieve a whole htable
> >>>>> into drill. Storage engines like HBase should have some ability to do part
> >>>>> of the DrQL query, like Filter, if a filter can be performed by specifying
> >>>>> startRowKey and endRowKey. A storage engine like mysql could do more, even
> >>>>> Join.
> >>>>>
> >>>>> Generally, it would be clearer if a ScanROP were mapped to a sub-DAG of
> >>>>> the logical plan DAG instead of a single Scan node in the logical plan. If so, more
> >>>>> implementation-specific information would couple into the plan optimization
> >>>>> & transformation phase. I guess that's the price to pay when optimization
> >>>>> comes, or is there another way I failed to see?
> >>>>>
> >>>>> Please correct me if anything is wrong.
> >>>>>
> >>>>> thanks,
> >>>>>
> >>>>> Lisen
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves wrote:
> >>>>>
> >>>>>> Hi Jacques
> >>>>>>
> >>>>>> I've submitted a first pass patch to DRILL-15.
> >>>>>> I did this mostly because HBase will be my main target and because
> >>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13. Have
> >>>>>> some thoughts that I will post soon.
> >>>>>> btw: I still can't assign issues to myself in JIRA, did you forget
> >>>>>> to add me as a contributor?
> >>>>>>
> >>>>>> Best
> >>>>>> David
> >>>>>>
> >>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau wrote:
> >>>>>>
> >>>>>>> Hey David,
> >>>>>>>
> >>>>>>> These sound good. I've added you as a contributor on jira so you can assign
> >>>>>>> tasks to yourself. I think 45 and 46 are good places to start. 15 depends
> >>>>>>> on 13 and working on the two hand in hand would probably be a good idea.
> >>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have some
> >>>>>>> time to focus on it.
> >>>>>>>
> >>>>>>> Jacques
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <davidralves@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi All
> >>>>>>>>
> >>>>>>>> I have a new academic project for which I'd like to use drill
> >>>>>>>> since none of the other parallel database over hadoop/nosql implementations
> >>>>>>>> fit just right.
> >>>>>>>> To this end I've been tinkering with the prototype trying to find
> >>>>>>>> where I'd be most useful.
> >>>>>>>>
> >>>>>>>> Here's where I'd like to start, if you agree:
> >>>>>>>> - implement HBase storage engine (DRILL-15)
> >>>>>>>>   - start with simple scanning and push down of
> >>>>>>>>     selection/projection
> >>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
> >>>>>>>> - set up coding style in the wiki (formatting/imports etc, DRILL-46)
> >>>>>>>> - create builders for all logical plan elements/make logical plans
> >>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
> >>>>>>>>
> >>>>>>>> Please let me know your thoughts, and if you agree please assign
> >>>>>>>> the issues to me (it seems that I can't assign them myself).
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> David Alves
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>

--e89a8fb1ed04196d6f04d7d540e0--
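
The pushdown idea discussed in the thread (a filter expressed as startRowKey/endRowKey being absorbed by the storage engine scan, with the optimizer moving the op below the scanner) could be sketched roughly as follows. This is only an illustration under stated assumptions: Scan, Filter, RowKeyRange and push_down are hypothetical names for a toy plan model, not Drill's actual classes, and the [start, stop) convention is chosen to mirror HBase's inclusive start row and exclusive stop row.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Scan:
    """Hypothetical logical scan; optional row-key bounds are [start_row, stop_row)."""
    table: str
    start_row: Optional[bytes] = None  # inclusive lower bound, None = unbounded
    stop_row: Optional[bytes] = None   # exclusive upper bound, None = unbounded


@dataclass(frozen=True)
class RowKeyRange:
    """Hypothetical predicate: start <= row_key < stop (either side may be open)."""
    start: Optional[bytes] = None
    stop: Optional[bytes] = None


@dataclass(frozen=True)
class Filter:
    """Hypothetical filter node; predicate is a RowKeyRange or anything else."""
    predicate: object
    child: object


def _max_opt(a, b):
    """Tighter of two optional lower bounds (None means unbounded)."""
    if a is None:
        return b
    if b is None:
        return a
    return max(a, b)


def _min_opt(a, b):
    """Tighter of two optional upper bounds (None means unbounded)."""
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)


def push_down(node):
    """Fold a row-key-range Filter sitting directly above a Scan into the Scan.

    Filters the scan cannot absorb are left in place above it.
    """
    if not isinstance(node, Filter):
        return node
    child = push_down(node.child)
    pred = node.predicate
    if isinstance(pred, RowKeyRange) and isinstance(child, Scan):
        # Intersect the predicate's range with any bounds the scan already has.
        return replace(
            child,
            start_row=_max_opt(child.start_row, pred.start),
            stop_row=_min_opt(child.stop_row, pred.stop),
        )
    return Filter(pred, child)
```

With this sketch, push_down(Filter(RowKeyRange(b"user100", b"user200"), Scan("events"))) collapses to a single Scan bounded to that key range, while a filter on anything other than the row key stays above the Scan for the engine to evaluate.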