Hi Jacques

Is there any chance we could get a preview of this physical plan stuff and basic plumbing for distributed execution before the weekend? Maybe in a GitHub branch somewhere?
I mean it doesn't have to be complete or even running. I'd just like to make some progress with other stuff, and keeping it in line with whichever plumbing you already have would be great.

Best
David

On Mar 13, 2013, at 3:12 PM, Jacques Nadeau wrote:

> I'm working on some physical plan stuff as well as some basic plumbing for
> distributed execution. It's very much in progress, so I need to clean things up a
> bit before we could collaborate/divide and conquer on it. Depending on
> your timing and availability, maybe I could put some of this together in
> the next couple of days so that you could plug in rather than reinvent. In
> the meantime, pushing forward the builder stuff, additional test cases on
> the reference interpreter and/or thinking through the logical plan storage
> engine pushdown/rewrite could be very useful.
>
> Let me know your thoughts.
>
> thanks,
> Jacques
>
> On Wed, Mar 13, 2013 at 9:47 AM, David Alves wrote:
>
>> Hi Jacques
>>
>> I can assign issues to myself now, thanks.
>> What you say wrt the logical/physical/execution layers sounds good.
>> My main concern, for the moment, is to have something working as
>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>> HBase cluster and send them work to do in some form (first step would be to
>> treat it as a non-distributed engine where each daemon runs an instance of
>> the prototype).
>> Here's where I'd like to go next:
>> - lay the groundwork for the daemons (scripts/rpc iface/wiring protocol).
>> - create an execution engine iface that allows abstracting future
>> implementations, and make it available through the rpc iface. This would
>> sit in front of the ref impl for now and would be replaced by cpp down the
>> line.
>>
>> I think we can probably concentrate on the capabilities iface a
>> bit down the line but, as a first approach, I see it simply providing a
>> simple set of ops that it is able to run internally.
>> How to abstract locality/partitioning/schema capabilities is still
>> not clear to me though, thoughts?
>>
>> David
>>
>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau wrote:
>>
>>> I'm working on a presentation that will better illustrate the layers.
>>> There are actually three key plans. Thinking to date has been to break
>>> the plans down into logical, physical and execution. The third hasn't been
>>> expressed well here and is entirely an internal domain to the execution
>>> engine. Following some classic methods: Logical expresses what we want to
>>> do, Physical expresses how we want to do it (adding points of
>>> parallelization but not specifying particular amounts of parallelization or
>>> node by node assignments). The execution engine is then responsible for
>>> determining the amount of parallelization of a particular plan along with
>>> system load (likely leveraging Berkeley's Sparrow work), task priority and
>>> specific data locality information, building sub-dags to be assigned to
>>> individual nodes and executing the plan.
>>>
>>> So in the higher logical and physical levels, a single Scan and subsequent
>>> ScanPOP should be okay... (ScanROPs have a separate problem since they
>>> ignore the level of separation we're planning for the real execution layer.
>>> This is why the current ref impl turns a single Scan into potentially
>>> a union of ScanROPs... not elegant but logically correct.)
>>>
>>> The capabilities interface still needs to be defined for how a storage
>>> engine reveals its logical capabilities and thus consumes part of the plan.
>>>
>>> J
>>>
>>>
>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves wrote:
>>>
>>>> Hi Lisen
>>>>
>>>> Some of what you are saying, like push down of ops like filter,
>>>> projection or partial aggregation below the storage engine scanner level,
>>>> or sub tree execution, is actively being discussed in issues DRILL-13
>>>> (Storage Engine Interface) and DRILL-15 (HBase storage engine); your input
>>>> in these issues is most welcome.
>>>>
>>>> HBase in particular has the notion of
>>>> endpoints/coprocessors/filters that allow pushing this down easily (this is
>>>> also in line with what other parallel database over nosql implementations
>>>> like Tajo do).
>>>> A possible approach is to have the optimizer change the order of
>>>> the ops to place them below the storage engine scanner and let the SE impl
>>>> deal with it internally.
>>>>
>>>> There are also some other pieces missing at the moment AFAIK, like
>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>
>>>> So in summary, you're absolutely right, and if you're particularly
>>>> interested in the HBase SE impl (as I am, for the moment) I'd be interested
>>>> in collaborating.
>>>>
>>>> Best
>>>> David
>>>>
>>>>
>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> Very nice to see your effort on this.
>>>>>
>>>>> Hi Jacques,
>>>>>
>>>>> We are also extending the drill prototype, to see if there is any chance to
>>>>> meet our production need. However, we find that implementing a performant
>>>>> HBase storage engine is not so straightforward, and requires some
>>>>> workarounds. The problem is in the Scan interface.
>>>>>
>>>>> In drill's physical plan model, ScanROP is in charge of table scan. The storage
>>>>> engine provides output for a whole data source, a csv file for example.
>>>>> It's sufficient for an input source like a plain file, but for hbase, it's not
>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole htable
>>>>> into drill. Storage engines like HBase should have some ability to do part
>>>>> of the DrQL query, like Filter, if a filter can be performed by specifying
>>>>> startRowKey and endRowKey. A storage engine like mysql could do more, even
>>>>> Join.
>>>>>
>>>>> Generally, it would be clearer if a ScanROP were mapped to a sub-DAG of
>>>>> the logical plan DAG instead of a single Scan node in the logical plan. If so, more
>>>>> implementation-specific information would be coupled into the plan optimization
>>>>> & transformation phase. I guess that's the price to pay when optimization
>>>>> comes, or is there another way I failed to see?
>>>>>
>>>>> Please correct me if anything is wrong.
>>>>>
>>>>> thanks,
>>>>>
>>>>> Lisen
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves wrote:
>>>>>
>>>>>> Hi Jacques
>>>>>>
>>>>>> I've submitted a first pass patch to DRILL-15.
>>>>>> I did this mostly because HBase will be my main target and because
>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13. Have
>>>>>> some thoughts that I will post soon.
>>>>>> btw: I still can't assign issues to myself in JIRA, did you forget
>>>>>> to add me as a contributor?
>>>>>>
>>>>>> Best
>>>>>> David
>>>>>>
>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau wrote:
>>>>>>
>>>>>>> Hey David,
>>>>>>>
>>>>>>> These sound good. I've added you as a contributor on jira so you can assign
>>>>>>> tasks to yourself. I think 45 and 46 are good places to start. 15 depends
>>>>>>> on 13 and working on the two hand in hand would probably be a good idea.
>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have some
>>>>>>> time to focus on it.
>>>>>>>
>>>>>>> Jacques
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves wrote:
>>>>>>>
>>>>>>>> Hi All
>>>>>>>>
>>>>>>>> I have a new academic project for which I'd like to use drill
>>>>>>>> since none of the other parallel database over hadoop/nosql implementations
>>>>>>>> fit just right.
>>>>>>>> To this goal I've been tinkering with the prototype trying to find
>>>>>>>> where I'd be most useful.
>>>>>>>>
>>>>>>>> Here's where I'd like to start, if you agree:
>>>>>>>> - implement HBase storage engine (DRILL-15)
>>>>>>>>   - start with simple scanning and push down of selection/projection
>>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>> - setup coding style in the wiki (formatting/imports etc, DRILL-46)
>>>>>>>> - create builders for all logical plan elements/make logical plans
>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>>
>>>>>>>> Please let me know your thoughts, and if you agree please assign
>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>>
>>>>>>>> Best
>>>>>>>> David Alves
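
The pushdown idea discussed in this thread (a storage engine declaring which ops it can run internally, and a row-key filter being absorbed into the scan as a start/end row key) can be illustrated with a rough sketch. This is not Drill's API: every name below (Scan, RowKeyRangeFilter, Aggregate, EngineCapabilities, push_down) is hypothetical and only models the shape of the idea, assuming ops are listed bottom-up above a single scan and that row keys compare as plain strings.

```python
from dataclasses import dataclass, replace
from typing import FrozenSet, List, Optional, Tuple, Union


@dataclass(frozen=True)
class Scan:
    """Hypothetical logical scan; the row-key bounds are what the engine absorbs."""
    table: str
    start_row: Optional[str] = None  # inclusive lower bound, None = table start
    end_row: Optional[str] = None    # exclusive upper bound, None = table end


@dataclass(frozen=True)
class RowKeyRangeFilter:
    """Hypothetical filter that can be expressed purely as a row-key range."""
    start_row: str
    end_row: str


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical op that this sketch never pushes down."""
    function: str


Op = Union[RowKeyRangeFilter, Aggregate]


@dataclass(frozen=True)
class EngineCapabilities:
    """Hypothetical capabilities declaration: op types the engine can run itself."""
    name: str
    supported: FrozenSet[type]


def _narrow(scan: Scan, f: RowKeyRangeFilter) -> Scan:
    # Intersect the filter's range with whatever range the scan already carries.
    start = f.start_row if scan.start_row is None else max(scan.start_row, f.start_row)
    end = f.end_row if scan.end_row is None else min(scan.end_row, f.end_row)
    return replace(scan, start_row=start, end_row=end)


def push_down(scan: Scan, ops: List[Op], caps: EngineCapabilities) -> Tuple[Scan, List[Op]]:
    """Fold row-key range filters sitting directly above the scan into the scan.

    Folding stops at the first op the engine cannot absorb, so nothing is
    reordered past an op that stays above the storage engine.
    """
    remaining = list(ops)
    while remaining:
        op = remaining[0]
        if isinstance(op, RowKeyRangeFilter) and RowKeyRangeFilter in caps.supported:
            scan = _narrow(scan, op)
            remaining.pop(0)
        else:
            break
    return scan, remaining


# Usage: an HBase-like engine absorbs the range filter, a CSV-like engine does not.
hbase = EngineCapabilities("hbase", frozenset({RowKeyRangeFilter}))
csv_file = EngineCapabilities("csv", frozenset())
plan = [RowKeyRangeFilter("user100", "user200"), Aggregate("count")]

print(push_down(Scan("events"), plan, hbase))
print(push_down(Scan("events"), plan, csv_file))
```

With the HBase-like capabilities the filter disappears from the remaining ops and the scan carries the key range; with the CSV-like capabilities the plan is returned unchanged, which matches the "scan the whole source" behaviour described for plain files.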