Hey Jacques

Sorry to be a nag, but is there any chance to take a sneak peek at the
protobuf rpc stuff?
I'd really like to hack something together wrt the daemon this weekend.
Also, wrt configuration management (zk/helix) maybe you could post the
iface so that it'd be possible to hack something static (i.e. non-ft,
properties file based) just to make dist execution work.

Thanks
David

On Mar 16, 2013, at 8:34 PM, Jacques Nadeau wrote:

> Hey David,
>
> The java-exec framework is not far enough along that it makes sense for
> me to push it externally yet. However, I did push my initial wip
> physical plan approach. You can find it here:
> https://github.com/jacques-n/incubator-drill/tree/physical_plan_updates
>
> Hopefully, I will get further along on the java-exec stuff soon.
>
> I'd suggest that you focus your energy on the StorageEngine API and
> HBase implementation. If you're up for it, let's do a quick skype chat
> to sync up. Let me know your availability over the next few days.
>
> Thanks,
> Jacques
>
> On Fri, Mar 15, 2013 at 6:59 PM, David Alves wrote:
>
>> that'd be great thanks.
>>
>> -david
>>
>> On Mar 15, 2013, at 8:51 PM, Jacques Nadeau wrote:
>>
>>> I've been under the weather the last few days and haven't made much
>>> progress. Let me see if I can get you something tomorrow.
>>>
>>> On Mar 15, 2013, at 2:36 PM, David Alves wrote:
>>>
>>>> Hi Jacques
>>>>
>>>> Is there any chance we could get a preview of this physical plan
>>>> stuff and basic plumbing for distributed execution before the
>>>> weekend? Maybe in a github branch somewhere?
>>>> I mean it doesn't have to be complete or even running, I'd just like
>>>> to make some progress with other stuff and keeping it in line with
>>>> whichever plumbing you already have would be great.
>>>>
>>>> Best
>>>> David
>>>>
>>>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau wrote:
>>>>
>>>>> I'm working on some physical plan stuff as well as some basic
>>>>> plumbing for distributed execution. It's very much in progress so I
>>>>> need to clean things up a bit before we could collaborate/divide
>>>>> and conquer on it. Depending on your timing and availability, maybe
>>>>> I could put some of this together in the next couple days so that
>>>>> you could plug in rather than reinvent. In the meantime, pushing
>>>>> forward the builder stuff, additional test cases on the reference
>>>>> interpreter and/or thinking through the logical plan storage engine
>>>>> pushdown/rewrite could be very useful.
>>>>>
>>>>> Let me know your thoughts.
>>>>>
>>>>> thanks,
>>>>> Jacques
>>>>>
>>>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves wrote:
>>>>>
>>>>>> Hi Jacques
>>>>>>
>>>>>> I can assign issues to myself now, thanks.
>>>>>> What you say wrt the logical/physical/execution layers sounds
>>>>>> good.
>>>>>> My main concern, for the moment, is to have something working as
>>>>>> fast as possible, i.e. some daemons that I'd be able to deploy to
>>>>>> a working hbase cluster and send them work to do in some form
>>>>>> (first step would be to treat it as a non distributed engine where
>>>>>> each daemon runs an instance of the prototype).
>>>>>> Here's where I'd like to go next:
>>>>>> - lay the ground work for the daemons (scripts/rpc iface/wiring
>>>>>>   protocol).
>>>>>> - create an execution engine iface that allows abstracting future
>>>>>>   implementations, and make it available through the rpc iface.
>>>>>>   This would sit in front of the ref impl for now and would be
>>>>>>   replaced by cpp down the line.
>>>>>>
>>>>>> I think we can probably concentrate on the capabilities iface a
>>>>>> bit down the line but, as a first approach, I see it simply
>>>>>> providing a simple set of ops that it is able to run internally.
>>>>>> How to abstract locality/partitioning/schema capabilities is still
>>>>>> not clear to me though, thoughts?
>>>>>>
>>>>>> David
>>>>>>
>>>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau wrote:
>>>>>>
>>>>>>> I'm working on a presentation that will better illustrate the
>>>>>>> layers. There are actually three key plans. Thinking to date has
>>>>>>> been to break the plans down into logical, physical and
>>>>>>> execution. The third hasn't been expressed well here and is
>>>>>>> entirely an internal domain to the execution engine. Following
>>>>>>> some classic methods: Logical expresses what we want to do,
>>>>>>> Physical expresses how we want to do it (adding points of
>>>>>>> parallelization but not specifying particular amounts of
>>>>>>> parallelization or node by node assignments). The execution
>>>>>>> engine is then responsible for determining the amount of
>>>>>>> parallelization of a particular plan along with system load
>>>>>>> (likely leveraging Berkeley's Sparrow work), task priority and
>>>>>>> specific data locality information, building sub-dags to be
>>>>>>> assigned to individual nodes and executing the plan.
>>>>>>>
>>>>>>> So in the higher logical and physical levels, a single Scan and
>>>>>>> subsequent ScanPOP should be okay... (ScanROPs have a separate
>>>>>>> problem since they ignore the level of separation we're planning
>>>>>>> for the real execution layer. This is why the current ref impl
>>>>>>> turns a single Scan into potentially a union of ScanROPs... not
>>>>>>> elegant but logically correct.)
>>>>>>>
>>>>>>> The capabilities interface still needs to be defined for how a
>>>>>>> storage engine reveals its logical capabilities and thus consumes
>>>>>>> part of the plan.
>>>>>>>
>>>>>>> J
>>>>>>>
>>>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves wrote:
>>>>>>>
>>>>>>>> Hi Lisen
>>>>>>>>
>>>>>>>> Some of what you are saying, like push down of ops like filter,
>>>>>>>> projection or partial aggregation below the storage engine
>>>>>>>> scanner level, or sub tree execution, is actively being
>>>>>>>> discussed in issues DRILL-13 (Storage Engine Interface) and
>>>>>>>> DRILL-15 (HBase storage engine). Your input in these issues is
>>>>>>>> most welcome.
>>>>>>>>
>>>>>>>> HBase in particular has the notion of
>>>>>>>> endpoints/coprocessors/filters that allow pushing this down
>>>>>>>> easily (this is also in line with what other parallel database
>>>>>>>> over nosql implementations like tajo do).
>>>>>>>> A possible approach is to have the optimizer change the order of
>>>>>>>> the ops to place them below the storage engine scanner and let
>>>>>>>> the SE impl deal with it internally.
>>>>>>>>
>>>>>>>> There are also some other pieces missing at the moment AFAIK,
>>>>>>>> like a distributed metadata store, the drill daemons, wiring,
>>>>>>>> etc.
>>>>>>>>
>>>>>>>> So in summary, you're absolutely right, and if you're
>>>>>>>> particularly interested in the HBase SE impl (as I am, for the
>>>>>>>> moment) I'd be interested in collaborating.
>>>>>>>>
>>>>>>>> Best
>>>>>>>> David
>>>>>>>>
>>>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu wrote:
>>>>>>>>
>>>>>>>>> Hi David,
>>>>>>>>>
>>>>>>>>> Very nice to see your effort on this.
>>>>>>>>>
>>>>>>>>> Hi Jacques,
>>>>>>>>>
>>>>>>>>> We are also extending the drill prototype, to see if there is
>>>>>>>>> any chance to meet our production need. However, we find that
>>>>>>>>> implementing a performant HBase storage engine is not so
>>>>>>>>> straightforward, and requires some workarounds. The problem is
>>>>>>>>> in the Scan interface.
>>>>>>>>>
>>>>>>>>> In drill's physical plan model, ScanROP is in charge of table
>>>>>>>>> scan. The storage engine provides output for a whole data
>>>>>>>>> source, a csv file for example. That's sufficient for an input
>>>>>>>>> source like a plain file, but for hbase, it's not very
>>>>>>>>> efficient, if not impossible, to let ScanROP retrieve a whole
>>>>>>>>> htable into drill. Storage engines like HBase should have some
>>>>>>>>> ability to do part of the DrQL query, like Filter, if a filter
>>>>>>>>> can be performed by specifying startRowKey and endRowKey. A
>>>>>>>>> storage engine like mysql could do more, even Join.
>>>>>>>>>
>>>>>>>>> Generally, it would be clearer if a ScanROP is mapped to a
>>>>>>>>> sub-DAG of the logical plan DAG instead of a single Scan node
>>>>>>>>> in the logical plan. If so, more implementation-specific
>>>>>>>>> information would be coupled into the plan optimization &
>>>>>>>>> transformation phase. I guess that's the price to pay when
>>>>>>>>> optimization comes, or is there another way I failed to see?
>>>>>>>>>
>>>>>>>>> Please correct me if anything is wrong.
>>>>>>>>>
>>>>>>>>> thanks,
>>>>>>>>>
>>>>>>>>> Lisen
>>>>>>>>>
>>>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves
>>>>>>>>> <davidralves@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jacques
>>>>>>>>>>
>>>>>>>>>> I've submitted a first pass patch to DRILL-15.
>>>>>>>>>> I did this mostly because HBase will be my main target and
>>>>>>>>>> because I wanted to get a feel of what would be a nice
>>>>>>>>>> interface for DRILL-13. Have some thoughts that I will post
>>>>>>>>>> soon.
>>>>>>>>>> btw: I still can't assign issues to myself in JIRA, did you
>>>>>>>>>> forget to add me as a contributor?
>>>>>>>>>>
>>>>>>>>>> Best
>>>>>>>>>> David
>>>>>>>>>>
>>>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey David,
>>>>>>>>>>>
>>>>>>>>>>> These sound good. I've added you as a contributor on jira so
>>>>>>>>>>> you can assign tasks to yourself. I think 45 and 46 are good
>>>>>>>>>>> places to start. 15 depends on 13 and working on the two
>>>>>>>>>>> hand in hand would probably be a good idea. Maybe we could
>>>>>>>>>>> do a design discussion on 15 and 13 here once you have some
>>>>>>>>>>> time to focus on it.
>>>>>>>>>>>
>>>>>>>>>>> Jacques
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves
>>>>>>>>>>> <davidralves@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All
>>>>>>>>>>>>
>>>>>>>>>>>> I have a new academic project for which I'd like to use
>>>>>>>>>>>> drill since none of the other parallel database over
>>>>>>>>>>>> hadoop/nosql implementations fit just right.
>>>>>>>>>>>> To this goal I've been tinkering with the prototype trying
>>>>>>>>>>>> to find where I'd be most useful.
>>>>>>>>>>>>
>>>>>>>>>>>> Here's where I'd like to start, if you agree:
>>>>>>>>>>>> - implement HBase storage engine (DRILL-15)
>>>>>>>>>>>>   - start with simple scanning and push down of
>>>>>>>>>>>>     selection/projection
>>>>>>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>>>>> - setup coding style in the wiki (formatting/imports etc,
>>>>>>>>>>>>   DRILL-46)
>>>>>>>>>>>> - create builders for all logical plan elements/make
>>>>>>>>>>>>   logical plans immutable (no issue for this, I'd like to
>>>>>>>>>>>>   hear your thoughts first).
>>>>>>>>>>>>
>>>>>>>>>>>> Please let me know your thoughts, and if you agree please
>>>>>>>>>>>> assign the issues to me (it seems that I can't assign them
>>>>>>>>>>>> myself).
>>>>>>>>>>>>
>>>>>>>>>>>> Best
>>>>>>>>>>>> David Alves