List: drill-dev
Subject: Re: contribution
From: David Alves <davidralves () gmail ! com>
Date: 2013-03-13 21:30:47
Message-ID: 34E600B8-2EC1-4094-BDCD-FE5A920D3E17 () gmail ! com
Getting the basic plumbing to a point where we could work together on it/use it elsewhere as soon as you can would be awesome.
As soon as I get that I can start on the daemons/scripts.
I'll focus on the SE iface and on HBase pushdown for the moment.
-david
On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <jacques@apache.org> wrote:
> I'm working on some physical plan stuff as well as some basic plumbing for
> distributed execution. It's very much in progress, so I need to clean things
> up a bit before we could collaborate / divide and conquer on it. Depending on
> your timing and availability, maybe I could put some of this together in
> the next couple days so that you could plug in rather than reinvent. In
> the meantime, pushing forward the builder stuff, additional test cases on
> the reference interpreter and/or thinking through the logical plan storage
> engine pushdown/rewrite could be very useful.
>
> Let me know your thoughts.
>
> thanks,
> Jacques
>
> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <davidralves@gmail.com> wrote:
>
>> Hi Jacques
>>
>> I can assign issues to me now, thanks.
>> What you say wrt the logical/physical/execution layers sounds
>> good.
>> My main concern, for the moment, is to have something working as
>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>> HBase cluster and send work to in some form (a first step would be to
>> treat it as a non-distributed engine where each daemon runs an instance of
>> the prototype).
>> Here's where I'd like to go next:
>> - lay the groundwork for the daemons (scripts/rpc iface/wiring
>> protocol).
>> - create an execution engine iface that allows us to abstract over
>> future implementations, and make it available through the rpc iface. This
>> would sit in front of the ref impl for now and would be replaced by cpp
>> down the line.
>>
>> I think we can probably concentrate on the capabilities iface a
>> bit down the line, but as a first approach I see it simply exposing the
>> set of ops that it is able to run internally.
>> How to abstract locality/partitioning/schema capabilities is still
>> not clear to me though; thoughts?
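To make the question concrete, here is one possible shape for such an iface — a sketch only, with every name hypothetical (this is not the actual DRILL-13 interface, which was still being designed at the time):

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch only: every name here is hypothetical, not the real DRILL-13 API.
public class EngineIfaceSketch {

  // The "simple set of ops it is able to run internally".
  enum Capability { SCAN, FILTER, PROJECT, PARTIAL_AGGREGATE, JOIN }

  // The iface the rpc layer would expose; the ref impl sits behind it
  // for now and a cpp engine could replace it down the line.
  interface ExecutionEngine {
    Set<Capability> capabilities();
    // Executes a serialized plan fragment, returning an opaque result.
    String execute(String planFragment);
  }

  // Trivial stand-in for the reference interpreter.
  static class RefImplEngine implements ExecutionEngine {
    public Set<Capability> capabilities() {
      return EnumSet.of(Capability.SCAN, Capability.FILTER, Capability.PROJECT);
    }
    public String execute(String planFragment) {
      return "ran: " + planFragment;
    }
  }

  public static void main(String[] args) {
    ExecutionEngine engine = new RefImplEngine();
    assert engine.capabilities().contains(Capability.FILTER);
    assert !engine.capabilities().contains(Capability.JOIN);
    System.out.println(engine.execute("scan+filter"));
  }
}
```

The capabilities set is the part David flags as unclear: it says what the engine can run, but not yet anything about locality, partitioning, or schema.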
>>
>> David
>>
>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <jacques@apache.org> wrote:
>>
>>> I'm working on a presentation that will better illustrate the layers.
>>> There are actually three key plans. Thinking to date has been to break
>>> the plans down into logical, physical and execution. The third hasn't
>>> been expressed well here and is entirely internal to the execution
>>> engine. Following some classic methods: Logical expresses what we want
>> to
>>> do, Physical expresses how we want to do it (adding points of
>>> parallelization but not specifying particular amounts of parallelization
>> or
>>> node-by-node assignments). The execution engine is then responsible for
>>> determining the amount of parallelization of a particular plan, taking
>>> into account system load (likely leveraging Berkeley's Sparrow work),
>>> task priority and specific data locality information, building sub-DAGs
>>> to be assigned to individual nodes, and executing the plan.
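A toy illustration of the split described above: the physical plan only marks where parallelization is possible, and the execution engine later picks an actual width (here from cluster size alone; a real engine would also weigh load, priority and locality). All names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the logical/physical/execution split.
// Logical says WHAT, physical marks WHERE parallelism is possible,
// and the execution engine decides HOW MUCH parallelism per point.
public class PlanLevels {

  // A physical-plan op: it names a point of parallelization but carries
  // no width and no node assignments.
  record PhysicalOp(String name, boolean parallelizable) {}

  // Execution-time decision: width 1 for serial ops, nodeCount otherwise.
  // (A real engine would fold in load, task priority and data locality.)
  static Map<String, Integer> assignWidths(List<PhysicalOp> plan, int nodeCount) {
    Map<String, Integer> widths = new LinkedHashMap<>();
    for (PhysicalOp op : plan) {
      widths.put(op.name(), op.parallelizable() ? nodeCount : 1);
    }
    return widths;
  }

  public static void main(String[] args) {
    List<PhysicalOp> plan = List.of(
        new PhysicalOp("scan", true),
        new PhysicalOp("filter", true),
        new PhysicalOp("final-merge", false));
    Map<String, Integer> widths = assignWidths(plan, 4);
    assert widths.get("scan") == 4;
    assert widths.get("final-merge") == 1;
  }
}
```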
>>>
>>> So in the higher logical and physical levels, a single Scan and
>> subsequent
>>> ScanPOP should be okay... (ScanROPs have a separate problem since they
>>> ignore the level of separation we're planning for the real execution
>> layer.
>>> This is why the current ref impl turns a single Scan into potentially
>>> a union of ScanROPs... not elegant but logically correct.)
>>>
>>> The capabilities interface still needs to be defined for how a storage
>>> engine reveals its logical capabilities and thus consumes part of the
>> plan.
>>>
>>> J
>>>
>>>
>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com>
>> wrote:
>>>
>>>> Hi Linsen
>>>>
>>>> Some of what you are saying, like push-down of ops such as filter,
>>>> projection or partial aggregation below the storage engine scanner
>>>> level, or sub-tree execution, is actively being discussed in issues
>>>> DRILL-13 (Storage Engine Interface) and DRILL-15 (HBase storage
>>>> engine); your input on these issues is most welcome.
>>>>
>>>> HBase in particular has the notion of
>>>> endpoints/coprocessors/filters that allow pushing this down easily
>>>> (this is also in line with what other parallel-database-over-NoSQL
>>>> implementations like Tajo do).
>>>> A possible approach is to have the optimizer change the order of
>>>> the ops to place them below the storage engine scanner and let the SE
>> impl
>>>> deal with it internally.
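As a concrete example of the kind of rewrite the SE impl could do internally, a row-key-prefix filter can be turned into a [startRow, stopRow) range so the filtering never leaves HBase. This is a standard HBase trick (increment the last non-0xFF byte of the prefix); the sketch below only computes the range in plain Java, rather than driving a real org.apache.hadoop.hbase.client.Scan:

```java
import java.util.Arrays;

// Sketch: derive a [startRow, stopRow) scan range from a row-key prefix,
// so a prefix filter is executed by HBase's scanner instead of in Drill.
public class PrefixRange {

  // Stop row = the prefix with its last non-0xFF byte incremented,
  // truncated after that byte. Returns null for an all-0xFF prefix,
  // meaning "scan to the end of the table".
  static byte[] stopRowForPrefix(byte[] prefix) {
    byte[] stop = Arrays.copyOf(prefix, prefix.length);
    for (int i = stop.length - 1; i >= 0; i--) {
      if (stop[i] != (byte) 0xFF) {
        stop[i]++;
        return Arrays.copyOf(stop, i + 1);
      }
    }
    return null; // prefix was all 0xFF bytes
  }

  public static void main(String[] args) {
    byte[] stop = stopRowForPrefix("row-42".getBytes());
    assert new String(stop).equals("row-43");
    assert stopRowForPrefix(new byte[] {(byte) 0xFF}) == null;
    System.out.println("stop row: " + new String(stop));
  }
}
```

With the range in hand, the SE impl would set it as the scan's start/stop rows and drop the filter op from the plan fragment it absorbed.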
>>>>
>>>> There are also some other pieces missing at the moment AFAIK,
>> like
>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>>
>>>> So in summary, you're absolutely right, and if you're
>> particularly
>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>> interested
>>>> in collaborating.
>>>>
>>>> Best
>>>> David
>>>>
>>>>
>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <immars@gmail.com> wrote:
>>>>
>>>>> Hi David,
>>>>>
>>>>> Very nice to see your effort on this.
>>>>>
>>>>> Hi Jacques,
>>>>>
>>>>> we are also extending the Drill prototype, to see if there is any
>>>>> chance it can meet our production needs. However, we find that
>>>>> implementing a performant HBase storage engine is not so
>>>>> straightforward, and requires some workarounds. The problem is in the
>>>>> Scan interface.
>>>>>
>>>>> In Drill's physical plan model, a ScanROP is in charge of the table
>>>>> scan. The storage engine provides output for a whole data source, a
>>>>> CSV file for example.
>>>>> It's sufficient for an input source like a plain file, but for HBase
>>>>> it's not very efficient, if not impossible, to let ScanROP retrieve a
>>>>> whole HTable into Drill. Storage engines like HBase should have some
>>>>> ability to do
>>>> part
>>>>> of the DrQL query, like Filter, if a filter can be performed by
>>>> specifying
>>>>> startRowKey and endRowKey. A storage engine like MySQL could do more,
>>>>> even a Join.
>>>>>
>>>>> Generally, it would be clearer if a ScanROP were mapped to a sub-DAG
>>>>> of the logical plan DAG instead of to a single Scan node in the
>>>>> logical plan. If so, more implementation-specific information would
>>>>> creep into the plan optimization & transformation phase. I guess
>>>>> that's the price to pay when optimization comes, or is there another
>>>>> way I failed to see?
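A toy sketch of the sub-DAG idea: during optimization, the ops a storage engine advertises support for are folded into the Scan, and only the rest remain above it in the plan. Every name below is hypothetical, with ops modeled as plain strings:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: fold the leading run of engine-supported ops into
// the scan, so one ScanROP stands in for a sub-DAG of the logical plan.
public class PushdownScan {

  // Ops this storage engine advertises it can run internally.
  static final List<String> ENGINE_CAN_ABSORB = List.of("filter", "project");

  // Walks a linear chain of ops (scan first); returns the ops left above
  // the scan, filling `absorbed` with those the engine took over.
  static List<String> foldIntoScan(List<String> chain, List<String> absorbed) {
    List<String> remaining = new ArrayList<>();
    boolean stillAbsorbing = true;
    for (String op : chain) {
      if (op.equals("scan")) continue; // the scan node itself
      if (stillAbsorbing && ENGINE_CAN_ABSORB.contains(op)) {
        absorbed.add(op); // becomes part of the ScanROP's sub-DAG
      } else {
        stillAbsorbing = false; // order matters: stop at the first op we keep
        remaining.add(op);
      }
    }
    return remaining;
  }

  public static void main(String[] args) {
    List<String> absorbed = new ArrayList<>();
    List<String> remaining =
        foldIntoScan(List.of("scan", "filter", "project", "groupby"), absorbed);
    assert absorbed.equals(List.of("filter", "project"));
    assert remaining.equals(List.of("groupby"));
  }
}
```

The implementation-specific part Lisen points at is ENGINE_CAN_ABSORB: the optimizer has to consult per-engine capabilities to decide where the sub-DAG boundary falls.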
>>>>>
>>>>> Please correct me if anything is wrong.
>>>>>
>>>>> thanks,
>>>>>
>>>>> Lisen
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <davidralves@gmail.com>
>>>> wrote:
>>>>>
>>>>>> Hi Jacques
>>>>>>
>>>>>> I've submitted a first-pass patch to DRILL-15.
>>>>>> I did this mostly because HBase will be my main target and
>>>> because
>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
>>>> Have
>>>>>> some thoughts that I will post soon.
>>>>>> btw: I still can't assign issues to myself in JIRA, did you
>>>> forget
>>>>>> to add me as a contributor?
>>>>>>
>>>>>> Best
>>>>>> David
>>>>>>
>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <jacques@apache.org>
>> wrote:
>>>>>>
>>>>>>> Hey David,
>>>>>>>
>>>>>>> These sound good. I've added you as a contributor on JIRA so you can
>>>>>> assign
>>>>>>> tasks to yourself. I think 45 and 46 are good places to start. 15
>>>>>> depends
>>>>>>> on 13 and working on the two hand in hand would probably be a good
>>>> idea.
>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have
>>>>>> some
>>>>>>> time to focus on it.
>>>>>>>
>>>>>>> Jacques
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <davidralves@gmail.com>
>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi All
>>>>>>>>
>>>>>>>> I have a new academic project for which I'd like to use Drill,
>>>>>>>> since none of the other parallel-database-over-Hadoop/NoSQL
>>>>>>>> implementations fit just right.
>>>>>>>> To this goal I've been tinkering with the prototype trying to
>>>>>> find
>>>>>>>> where I'd be most useful.
>>>>>>>>
>>>>>>>> Here's where I'd like to start, if you agree:
>>>>>>>> - implement HBase storage engine (DRILL-15)
>>>>>>>> - start with simple scanning and push-down of
>>>>>>>> selection/projection
>>>>>>>> - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>> - setup coding style in the wiki (formatting/imports etc,
>>>>>> DRILL-46)
>>>>>>>> - create builders for all logical plan elements/make logical
>>>>>> plans
>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>>
>>>>>>>> Please let me know your thoughts, and if you agree please
>> assign
>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>>
>>>>>>>> Best
>>>>>>>> David Alves