Hi Jacques

I can assign issues to me now, thanks.

What you say wrt the logical/physical/execution layers sounds good.
My main concern, for the moment, is to have something working as fast as possible, i.e. some daemons that I'd be able to deploy to a working hbase cluster and send them work to do in some form (first step would be to treat it as a non-distributed engine where each daemon runs an instance of the prototype).

Here's where I'd like to go next:
- lay the groundwork for the daemons (scripts/rpc iface/wiring protocol).
- create an execution engine iface that allows abstracting future implementations, and make it available through the rpc iface. This would sit in front of the ref impl for now and would be replaced by cpp down the line.

I think we can probably concentrate on the capabilities iface a bit down the line but, as a first approach, I see it simply providing a simple set of ops that it is able to run internally.
How to abstract locality/partitioning/schema capabilities is still not clear to me though, thoughts?

David

On Mar 13, 2013, at 11:12 AM, Jacques Nadeau wrote:

> I'm working on a presentation that will better illustrate the layers.
> There are actually three key plans. Thinking to date has been to break
> the plans down into logical, physical and execution. The third hasn't been
> expressed well here and is entirely an internal domain to the execution
> engine. Following some classic methods: Logical expresses what we want to
> do, Physical expresses how we want to do it (adding points of
> parallelization but not specifying particular amounts of parallelization or
> node by node assignments).
> The execution engine is then responsible for
> determining the amount of parallelization of a particular plan along with
> system load (likely leveraging Berkeley's Sparrow work), task priority and
> specific data locality information, building sub-dags to be assigned to
> individual nodes and executing the plan.
>
> So in the higher logical and physical levels, a single Scan and subsequent
> ScanPOP should be okay... (ScanROPs have a separate problem since they
> ignore the level of separation we're planning for the real execution layer.
> This is why the current ref impl turns a single Scan into potentially
> a union of ScanROPs... not elegant but logically correct.)
>
> The capabilities interface still needs to be defined for how a storage
> engine reveals its logical capabilities and thus consumes part of the plan.
>
> J
>
>
> On Tue, Mar 12, 2013 at 10:19 PM, David Alves wrote:
>
>> Hi Lisen
>>
>> Some of what you are saying, like push down of ops like filter,
>> projection or partial aggregation below the storage engine scanner level,
>> or sub tree execution, is actively being discussed in issues DRILL-13
>> (Storage Engine Interface) and DRILL-15 (HBase storage engine), your input
>> in these issues is most welcome.
>>
>> HBase in particular has the notion of
>> endpoints/coprocessors/filters that allow pushing this down easily (this is
>> also in line with what other parallel database over nosql implementations
>> like tajo do).
>> A possible approach is to have the optimizer change the order of
>> the ops to place them below the storage engine scanner and let the SE impl
>> deal with it internally.
>>
>> There are also some other pieces missing at the moment AFAIK, like
>> a distributed metadata store, the drill daemons, wiring, etc.
>>
>> So in summary, you're absolutely right, and if you're particularly
>> interested in the HBase SE impl (as I am, for the moment) I'd be interested
>> in collaborating.
>>
>> Best
>> David
>>
>>
>> On Mar 12, 2013, at 11:44 PM, Lisen Mu wrote:
>>
>>> Hi David,
>>>
>>> Very nice to see your effort on this.
>>>
>>> Hi Jacques,
>>>
>>> we are also extending the drill prototype, to see if there is any chance to
>>> meet our production need. However, we find that implementing a performant
>>> HBase storage engine is not so straightforward, and requires some
>>> workarounds. The problem is in the Scan interface.
>>>
>>> In drill's physical plan model, ScanROP is in charge of table scan. Storage
>>> engine provides output for a whole data source, a csv file for example.
>>> It's sufficient for an input source like a plain file, but for hbase, it's not
>>> very efficient, if not impossible, to let ScanROP retrieve a whole htable
>>> into drill. Storage engines like HBase should have some ability to do part
>>> of the DrQL query, like Filter, if a filter can be performed by specifying
>>> startRowKey and endRowKey. Storage engines like mysql could do more, even
>>> Join.
>>>
>>> Generally, it would be clearer if a ScanROP is mapped to a sub-DAG of
>>> the logical plan DAG instead of a single Scan node in the logical plan. If so, more
>>> implementation-specific information would be coupled into the plan optimization
>>> & transformation phase. I guess that's the price to pay when optimization
>>> comes, or is there another way I failed to see?
>>>
>>> Please correct me if anything is wrong.
>>>
>>> thanks,
>>>
>>> Lisen
>>>
>>>
>>>
>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves wrote:
>>>
>>>> Hi Jacques
>>>>
>>>> I've submitted a first pass patch to DRILL-15.
>>>> I did this mostly because HBase will be my main target and because
>>>> I wanted to get a feel of what would be a nice interface for DRILL-13. Have
>>>> some thoughts that I will post soon.
>>>> btw: I still can't assign issues to myself in JIRA, did you forget
>>>> to add me as a contributor?
>>>>
>>>> Best
>>>> David
>>>>
>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau wrote:
>>>>
>>>>> Hey David,
>>>>>
>>>>> These sound good. I've added you as a contributor on jira so you can assign
>>>>> tasks to yourself. I think 45 and 46 are good places to start. 15 depends
>>>>> on 13 and working on the two hand in hand would probably be a good idea.
>>>>> Maybe we could do a design discussion on 15 and 13 here once you have some
>>>>> time to focus on it.
>>>>>
>>>>> Jacques
>>>>>
>>>>>
>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves wrote:
>>>>>
>>>>>> Hi All
>>>>>>
>>>>>> I have a new academic project for which I'd like to use drill
>>>>>> since none of the other parallel database over hadoop/nosql implementations
>>>>>> fit just right.
>>>>>> To this goal I've been tinkering with the prototype trying to find
>>>>>> where I'd be most useful.
>>>>>>
>>>>>> Here's where I'd like to start, if you agree:
>>>>>> - implement HBase storage engine (DRILL-15)
>>>>>>   - start with simple scanning and push down of
>>>>>>     selection/projection
>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
>>>>>> - setup coding style in the wiki (formatting/imports etc, DRILL-46)
>>>>>> - create builders for all logical plan elements/make logical plans
>>>>>>   immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>
>>>>>> Please let me know your thoughts, and if you agree please assign
>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>
>>>>>> Best
>>>>>> David Alves
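
The push-down idea that runs through this thread (a storage engine advertising "a simple set of ops that it is able to run internally", and the optimizer placing those ops below the scanner) can be illustrated with a rough sketch. This is not Drill's API: `Op`, `StorageEngine` and `split_for_pushdown` are hypothetical names invented for illustration, the plan is simplified to a linear pipeline rather than a DAG, and the example is in Python only for brevity.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Op:
    """One plan operator (hypothetical); kind is e.g. 'scan', 'filter', 'project'."""
    kind: str
    args: Tuple = ()


@dataclass
class StorageEngine:
    """Hypothetical capabilities descriptor: op kinds the engine can run internally."""
    name: str
    supported: Set[str]


def split_for_pushdown(pipeline: List[Op], engine: StorageEngine):
    """Split a linear plan (scan first) into (pushed, remaining).

    Ops directly above the scan are handed to the engine for as long as it
    supports them. The first unsupported op, and everything after it, stays
    in the query engine, because later ops depend on its output.
    """
    if not pipeline or pipeline[0].kind != "scan":
        raise ValueError("pipeline must start with a scan")
    pushed = [pipeline[0]]
    for i, op in enumerate(pipeline[1:], start=1):
        if op.kind not in engine.supported:
            return pushed, pipeline[i:]
        pushed.append(op)
    return pushed, []


if __name__ == "__main__":
    # Assumed capability set: an HBase-like engine that handles a row key
    # range filter and a projection, but not aggregation.
    hbase = StorageEngine("hbase", {"filter", "project"})
    plan = [
        Op("scan", ("events",)),
        Op("filter", ("row_key", "a", "m")),  # startRowKey / endRowKey style range
        Op("project", ("col1",)),
        Op("aggregate", ("count",)),
        Op("filter", ("count", ">", 10)),
    ]
    pushed, remaining = split_for_pushdown(plan, hbase)
    print([op.kind for op in pushed])
    print([op.kind for op in remaining])
```

In this sketch the second filter stays in the query engine even though the engine supports filters, since it consumes the output of an aggregate the engine cannot run. Locality, partitioning and schema capabilities, which the thread leaves open, are not modelled here.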