From drill-dev Sun Mar 24 00:13:57 2013
From: Jacques Nadeau
Date: Sun, 24 Mar 2013 00:13:57 +0000
To: drill-dev
Subject: Re: contribution
Message-Id:
X-MARC-Message: https://marc.info/?l=drill-dev&m=142065425429505
MIME-Version: 1
Content-Type: multipart/mixed; boundary="--bcaec54b4568836fc804d8a090ae"

--bcaec54b4568836fc804d8a090ae
Content-Type: text/plain; charset=ISO-8859-1

Not yet. I will share as soon as I get something cohesive together.

Thanks,
Jacques

On Fri, Mar 22, 2013 at 12:06 PM, David Alves wrote:

> Hey Jacques
>
> Sorry to be a nag, but is there any chance to take a sneak peek at
> the protobuf rpc stuff?
> I'd really like to hack something together wrt the daemon this
> weekend.
> Also, wrt configuration management (zk/helix) maybe you could
> post the iface so that it'd be possible to hack something static (i.e.
> non-ft, properties file based) just to make dist execution work.
>
> Thanks
> David
>
> On Mar 16, 2013, at 8:34 PM, Jacques Nadeau wrote:
>
> > Hey David,
> >
> > The java-exec framework is not far enough along that it makes sense for me
> > to push it externally yet. However, I did push my initial wip physical
> > plan approach. You can find it here:
> > https://github.com/jacques-n/incubator-drill/tree/physical_plan_updates
> >
> > Hopefully, I will get further along on the java-exec stuff soon.
> >
> > I'd suggest that you focus your energy on the StorageEngine API and HBase
> > implementation. If you're up for it, let's do a quick skype chat to sync
> > up. Let me know your availability over the next few days.
> >
> > Thanks,
> > Jacques
> >
> > On Fri, Mar 15, 2013 at 6:59 PM, David Alves wrote:
> >
> >> that'd be great thanks.
> >>
> >> -david
> >>
> >> On Mar 15, 2013, at 8:51 PM, Jacques Nadeau wrote:
> >>
> >>> I've been under the weather the last few days and haven't made much
> >>> progress. Let me see if I can get you something tomorrow.
> >>>
> >>> On Mar 15, 2013, at 2:36 PM, David Alves wrote:
> >>>
> >>>> Hi Jacques
> >>>>
> >>>> Is there any chance we could get a preview of this physical plan
> >>>> stuff and basic plumbing for distributed execution before the weekend?
> >>>> maybe in a github branch somewhere?
> >>>> I mean it doesn't have to be complete or even running, I'd just like
> >>>> to make some progress with other stuff and keeping it in line with
> >>>> whichever plumbing you already have would be great.
> >>>>
> >>>> Best
> >>>> David
> >>>>
> >>>> On Mar 13, 2013, at 3:12 PM, Jacques Nadeau wrote:
> >>>>
> >>>>> I'm working on some physical plan stuff as well as some basic plumbing for
> >>>>> distributed execution. It's very much in progress so I need to clean things up a
> >>>>> bit before we could collaborate/divide and conquer on it. Depending on
> >>>>> your timing and availability, maybe I could put some of this together in
> >>>>> the next couple days so that you could plug in rather than reinvent. In
> >>>>> the meantime, pushing forward the builder stuff, additional test cases on
> >>>>> the reference interpreter and/or thinking through the logical plan storage
> >>>>> engine pushdown/rewrite could be very useful.
> >>>>>
> >>>>> Let me know your thoughts.
> >>>>>
> >>>>> thanks,
> >>>>> Jacques
> >>>>>
> >>>>> On Wed, Mar 13, 2013 at 9:47 AM, David Alves wrote:
> >>>>>
> >>>>>> Hi Jacques
> >>>>>>
> >>>>>> I can assign issues to myself now, thanks.
> >>>>>> What you say wrt the logical/physical/execution layers sounds
> >>>>>> good.
> >>>>>> My main concern, for the moment, is to have something working as
> >>>>>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
> >>>>>> hbase cluster and send them work to do in some form (first step would be to
> >>>>>> treat it as a non distributed engine where each daemon runs an instance of
> >>>>>> the prototype).
> >>>>>> Here's where I'd like to go next:
> >>>>>> - lay the groundwork for the daemons (scripts/rpc iface/wiring
> >>>>>> protocol).
> >>>>>> - create an execution engine iface that allows abstracting future
> >>>>>> implementations, and make it available through the rpc iface. this would
> >>>>>> sit in front of the ref impl for now and would be replaced by cpp down the
> >>>>>> line.
> >>>>>>
> >>>>>> I think we can probably concentrate on the capabilities iface a
> >>>>>> bit down the line but, as a first approach, I see it simply providing a
> >>>>>> simple set of ops that it is able to run internally.
> >>>>>> How to abstract locality/partitioning/schema capabilities is still
> >>>>>> not clear to me though, thoughts?
> >>>>>>
> >>>>>> David
> >>>>>>
> >>>>>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau wrote:
> >>>>>>
> >>>>>>> I'm working on a presentation that will better illustrate the layers.
> >>>>>>> There are actually three key plans. Thinking to date has been to break
> >>>>>>> the plans down into logical, physical and execution. The third hasn't been
> >>>>>>> expressed well here and is entirely an internal domain to the execution
> >>>>>>> engine. Following some classic methods: Logical expresses what we want to
> >>>>>>> do, Physical expresses how we want to do it (adding points of
> >>>>>>> parallelization but not specifying particular amounts of parallelization or
> >>>>>>> node by node assignments). The execution engine is then responsible for
> >>>>>>> determining the amount of parallelization of a particular plan along with
> >>>>>>> system load (likely leveraging Berkeley's Sparrow work), task priority and
> >>>>>>> specific data locality information, building sub-dags to be assigned to
> >>>>>>> individual nodes and executing the plan.
> >>>>>>>
> >>>>>>> So in the higher logical and physical levels, a single Scan and subsequent
> >>>>>>> ScanPOP should be okay... (ScanROPs have a separate problem since they
> >>>>>>> ignore the level of separation we're planning for the real execution layer.
> >>>>>>> This is why the current ref impl turns a single Scan into potentially
> >>>>>>> a union of ScanROPs... not elegant but logically correct.)
> >>>>>>>
> >>>>>>> The capabilities interface still needs to be defined for how a storage
> >>>>>>> engine reveals its logical capabilities and thus consumes part of the plan.
> >>>>>>>
> >>>>>>> J
> >>>>>>>
> >>>>>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Hi Lisen
> >>>>>>>>
> >>>>>>>> Some of what you are saying, like push down of ops like filter,
> >>>>>>>> projection or partial aggregation below the storage engine scanner level,
> >>>>>>>> or sub tree execution, is actively being discussed in issues DRILL-13
> >>>>>>>> (Storage Engine Interface) and DRILL-15 (HBase storage engine); your input
> >>>>>>>> in these issues is most welcome.
> >>>>>>>>
> >>>>>>>> HBase in particular has the notion of
> >>>>>>>> endpoints/coprocessors/filters that allow pushing this down easily (this is
> >>>>>>>> also in line with what other parallel database over nosql implementations
> >>>>>>>> like tajo do).
> >>>>>>>> A possible approach is to have the optimizer change the order of
> >>>>>>>> the ops to place them below the storage engine scanner and let the SE impl
> >>>>>>>> deal with it internally.
> >>>>>>>>
> >>>>>>>> There are also some other pieces missing at the moment AFAIK, like
> >>>>>>>> a distributed metadata store, the drill daemons, wiring, etc.
> >>>>>>>>
> >>>>>>>> So in summary, you're absolutely right, and if you're particularly
> >>>>>>>> interested in the HBase SE impl (as I am, for the moment) I'd be interested
> >>>>>>>> in collaborating.
> >>>>>>>>
> >>>>>>>> Best
> >>>>>>>> David
> >>>>>>>>
> >>>>>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu wrote:
> >>>>>>>>
> >>>>>>>>> Hi David,
> >>>>>>>>>
> >>>>>>>>> Very nice to see your effort on this.
> >>>>>>>>>
> >>>>>>>>> Hi Jacques,
> >>>>>>>>>
> >>>>>>>>> we are also extending the drill prototype, to see if there is any chance to
> >>>>>>>>> meet our production needs. However, we find that implementing a performant
> >>>>>>>>> HBase storage engine is not so straightforward, and requires some
> >>>>>>>>> workarounds. The problem is in the Scan interface.
> >>>>>>>>>
> >>>>>>>>> In drill's physical plan model, ScanROP is in charge of table scan. The storage
> >>>>>>>>> engine provides output for a whole data source, a csv file for example.
> >>>>>>>>> That's sufficient for an input source like a plain file, but for hbase, it's not
> >>>>>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole htable
> >>>>>>>>> into drill. Storage engines like HBase should have some ability to do part
> >>>>>>>>> of the DrQL query, like Filter, if a filter can be performed by specifying
> >>>>>>>>> startRowKey and endRowKey. A storage engine like mysql could do more, even
> >>>>>>>>> Join.
> >>>>>>>>>
> >>>>>>>>> Generally, it would be clearer if a ScanROP is mapped to a sub-DAG of
> >>>>>>>>> the logical plan DAG instead of a single Scan node in the logical plan. If so, more
> >>>>>>>>> implementation-specific information would be coupled into the plan optimization
> >>>>>>>>> & transformation phase.
> >>>>>>>>> I guess that's the price to pay when optimization
> >>>>>>>>> comes, or is there another way I failed to see?
> >>>>>>>>>
> >>>>>>>>> Please correct me if anything is wrong.
> >>>>>>>>>
> >>>>>>>>> thanks,
> >>>>>>>>>
> >>>>>>>>> Lisen
> >>>>>>>>>
> >>>>>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <davidralves@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi Jacques
> >>>>>>>>>>
> >>>>>>>>>> I've submitted a first pass patch to DRILL-15.
> >>>>>>>>>> I did this mostly because HBase will be my main target and because
> >>>>>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13. Have
> >>>>>>>>>> some thoughts that I will post soon.
> >>>>>>>>>> btw: I still can't assign issues to myself in JIRA, did you forget
> >>>>>>>>>> to add me as a contributor?
> >>>>>>>>>>
> >>>>>>>>>> Best
> >>>>>>>>>> David
> >>>>>>>>>>
> >>>>>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau wrote:
> >>>>>>>>>>
> >>>>>>>>>>> Hey David,
> >>>>>>>>>>>
> >>>>>>>>>>> These sound good. I've added you as a contributor on jira so you can assign
> >>>>>>>>>>> tasks to yourself. I think 45 and 46 are good places to start. 15 depends
> >>>>>>>>>>> on 13 and working on the two hand in hand would probably be a good idea.
> >>>>>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have some
> >>>>>>>>>>> time to focus on it.
> >>>>>>>>>>>
> >>>>>>>>>>> Jacques
> >>>>>>>>>>>
> >>>>>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <davidralves@gmail.com> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hi All
> >>>>>>>>>>>>
> >>>>>>>>>>>> I have a new academic project for which I'd like to use drill
> >>>>>>>>>>>> since none of the other parallel database over hadoop/nosql implementations
> >>>>>>>>>>>> fit just right.
> >>>>>>>>>>>> To this end I've been tinkering with the prototype trying to find
> >>>>>>>>>>>> where I'd be most useful.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Here's where I'd like to start, if you agree:
> >>>>>>>>>>>> - implement HBase storage engine (DRILL-15)
> >>>>>>>>>>>> - start with simple scanning and push down of
> >>>>>>>>>>>> selection/projection
> >>>>>>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
> >>>>>>>>>>>> - setup coding style in the wiki (formatting/imports etc, DRILL-46)
> >>>>>>>>>>>> - create builders for all logical plan elements/make logical plans
> >>>>>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Please let me know your thoughts, and if you agree please assign
> >>>>>>>>>>>> the issues to me (it seems that I can't assign them myself).
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best
> >>>>>>>>>>>> David Alves
> >>>>
> >>
> >>
> >
--bcaec54b4568836fc804d8a090ae--