'Re: contribution'

List:       drill-dev
Subject:    Re: contribution
From:       David Alves <davidralves () gmail ! com>
Date:       2013-03-13 21:30:47
Message-ID: 34E600B8-2EC1-4094-BDCD-FE5A920D3E17 () gmail ! com

It would be awesome if you could get the basic plumbing, as soon as you can, to a point where we can work together on it or use it elsewhere.
As soon as I have that, I can start on the daemons/scripts.
I'll focus on the SE iface and on HBase pushdown for the moment.
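
To make the pushdown idea discussed in this thread concrete, here is a minimal sketch in Python. It is not Drill code: the names Scan, InMemorySortedTable, capabilities and push_down are all hypothetical, and the in-memory table is only a stand-in for an HBase-like store whose rows are sorted by row key. It shows a key-range filter and a projection being folded into the scan when the storage engine declares that it can run them internally.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterator, Optional, Sequence, Tuple

Row = Dict[str, str]


@dataclass(frozen=True)
class Scan:
    """Hypothetical scan op; the row keys form a half-open range [start, end)."""
    table: str
    start_row_key: Optional[str] = None        # None = unbounded
    end_row_key: Optional[str] = None          # None = unbounded (exclusive)
    columns: Optional[Tuple[str, ...]] = None  # None = all columns


class InMemorySortedTable:
    """Stand-in for an HBase-like table: rows are kept sorted by row key."""

    # Ops this toy engine declares it can run internally (an assumption,
    # not a real capabilities interface).
    capabilities = frozenset({"key_range_filter", "projection"})

    def __init__(self, rows: Dict[str, Row]):
        self.rows = dict(sorted(rows.items()))

    def execute(self, scan: Scan) -> Iterator[Tuple[str, Row]]:
        for key, row in self.rows.items():
            if scan.start_row_key is not None and key < scan.start_row_key:
                continue
            if scan.end_row_key is not None and key >= scan.end_row_key:
                break  # keys are sorted, so nothing further can match
            if scan.columns is None:
                yield key, dict(row)
            else:
                yield key, {c: row[c] for c in scan.columns if c in row}


def push_down(scan: Scan, engine: InMemorySortedTable,
              start: Optional[str] = None, end: Optional[str] = None,
              columns: Optional[Sequence[str]] = None) -> Scan:
    """Fold a key-range filter and a projection into the scan, but only
    for the ops the engine declares it supports."""
    if "key_range_filter" in engine.capabilities:
        scan = replace(scan, start_row_key=start, end_row_key=end)
    if columns is not None and "projection" in engine.capabilities:
        scan = replace(scan, columns=tuple(columns))
    return scan


# Usage: rows are inserted out of order; the table sorts them by key.
table = InMemorySortedTable({
    "u3": {"name": "carol", "city": "Lisbon"},
    "u1": {"name": "alice", "city": "Austin"},
    "u2": {"name": "bob", "city": "Porto"},
    "u4": {"name": "dave", "city": "Boston"},
})
scan = push_down(Scan("users"), table, start="u2", end="u4", columns=["name"])
rows = list(table.execute(scan))
```

In this sketch, any op the engine does not declare would simply not be folded into the scan, so a real optimizer would have to keep that op above the scan and run it itself.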

-david

On Mar 13, 2013, at 3:12 PM, Jacques Nadeau <jacques@apache.org> wrote:

> I'm working on some physical plan stuff as well as some basic plumbing for
> distributed execution.  It's very much in progress so I need to clean things up a
> bit before we could collaborate/divide and conquer on it.  Depending on
> your timing and availability, maybe I could put some of this together in
> the next couple days so that you could plug in rather than reinvent.  In
> the meantime, pushing forward the builder stuff, additional test cases on
> the reference interpreter and/or thinking through the logical plan storage
> engine pushdown/rewrite could be very useful.
> 
> Let me know your thoughts.
> 
> thanks,
> Jacques
> 
> On Wed, Mar 13, 2013 at 9:47 AM, David Alves <davidralves@gmail.com> wrote:
> 
>> Hi Jacques
>> 
>>        I can assign issues to me now, thanks.
>>        What you say wrt the logical/physical/execution layers sounds
>> good.
>>        My main concern, for the moment, is to have something working as
>> fast as possible, i.e. some daemons that I'd be able to deploy to a working
>> hbase cluster and send them work to do in some form (first step would be to
>> treat it as a non-distributed engine where each daemon runs an instance of
>> the prototype).
>>        Here's where I'd like to go next:
>>        - lay the groundwork for the daemons (scripts/rpc iface/wiring
>> protocol).
>>        - create an execution engine iface that allows us to abstract future
>> implementations, and make it available through the rpc iface. This would
>> sit in front of the ref impl for now and would be replaced by cpp down the
>> line.
>> 
>>        I think we can probably concentrate on the capabilities iface a
>> bit down the line but, as a first approach, I see it simply providing a
>> simple set of ops that it is able to run internally.
>>        How to abstract locality/partitioning/schema capabilities is still
>> not clear to me though, thoughts?
>> 
>> David
>> 
>> On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <jacques@apache.org> wrote:
>> 
>>> I'm working on a presentation that will better illustrate the layers.
>>> There are actually three key plans.  Thinking to date has been to break
>>> the plans down into logical, physical and execution.  The third hasn't
>> been
>>> expressed well here and is entirely an internal domain to the execution
>>> engine.  Following some classic methods: Logical expresses what we want
>> to
>>> do, Physical expresses how we want to do it (adding points of
>>> parallelization but not specifying particular amounts of parallelization
>> or
>>> node by node assignments).  The execution engine is then responsible for
>>> determining the amount of parallelization of a particular plan along with
>>> system load (likely leveraging Berkeley's Sparrow work), task priority
>> and
>>> specific data locality information, building sub-dags to be assigned to
>>> individual nodes and execute the plan.
>>> 
>>> So in the higher logical and physical levels, a single Scan and
>> subsequent
>>> ScanPOP should be okay...  (ScanROPs have a separate problem since they
>>> ignore the level of separation we're planning for the real execution
>> layer.
>>> This is why the current ref impl turns a single Scan into potentially
>>> a union of ScanROPs... not elegant but logically correct.)
>>> 
>>> The capabilities interface still needs to be defined for how a storage
>>> engine reveals its logical capabilities and thus consumes part of the
>> plan.
>>> 
>>> J
>>> 
>>> 
>>> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com>
>> wrote:
>>> 
>>>> Hi Lisen
>>>> 
>>>>       Some of what you are saying like push down of ops like filter,
>>>> projection or partial aggregation below the storage engine scanner
>> level,
>>>> or sub tree execution are actively being discussed in issues DRILL-13
>>>> (Storage Engine Interface) and DRILL-15 (HBase storage engine), your
>> input
>>>> in these issues is most welcome.
>>>> 
>>>>       HBase in particular has the notion of
>>>> endpoints/coprocessors/filters that allow pushing this down easily (this
>> is
>>>> also in line with what other parallel database over nosql
>> implementations
>>>> like tajo do).
>>>>       A possible approach is to have the optimizer change the order of
>>>> the ops to place them below the storage engine scanner and let the SE
>> impl
>>>> deal with it internally.
>>>> 
>>>>       There are also some other pieces missing at the moment AFAIK,
>> like
>>>> a distributed metadata store, the drill daemons, wiring, etc.
>>>> 
>>>>       So in summary, you're absolutely right, and if you're
>> particularly
>>>> interested in the HBase SE impl (as I am, for the moment) I'd be
>> interested
>>>> in collaborating.
>>>> 
>>>> Best
>>>> David
>>>> 
>>>> 
>>>> On Mar 12, 2013, at 11:44 PM, Lisen Mu <immars@gmail.com> wrote:
>>>> 
>>>>> Hi David,
>>>>> 
>>>>> Very nice to see your effort on this.
>>>>> 
>>>>> Hi Jacques,
>>>>> 
>>>>> We are also extending the Drill prototype, to see if there is any chance to
>>>>> meet our production need. However, we find that implementing a
>> performant
>>>>> HBase storage engine is not so straightforward, and requires
>> some
>>>>> workaround. The problem is in Scan interface.
>>>>> 
>>>>> In drill's physical plan model, ScanROP is in charge of table scan.
>>>> Storage
>>>>> engine provides output for a whole data source, a csv file for example.
>>>>> It's sufficient for input source like plain file, but for hbase, it's
>> not
>>>>> very efficient, if not impossible, to let ScanROP retrieve a whole
>> htable
>>>>> into drill. Storage engines like HBase should have some ability to do
>>>> part
>>>>> of the DrQL query, like Filter, if a filter can be performed by
>>>> specifying
>>>>> startRowKey and endRowKey. Storage engine like mysql could do more,
>> even
>>>>> Join.
>>>>> 
>>>>> Generally, it would be more clear if a ScanROP is mapped to a sub-DAG
>> of
>>>>> logical plan DAG instead of a single Scan node in logical plan. If so,
>>>> more
>>>>> implementation-specific information would be coupled into the plan
>>>> optimization
>>>>> & transformation phase. I guess that's the price to pay when
>> optimization
>>>>> comes, or is there another way I failed to see?
>>>>> 
>>>>> Please correct me if anything is wrong.
>>>>> 
>>>>> thanks,
>>>>> 
>>>>> Lisen
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, Mar 13, 2013 at 9:33 AM, David Alves <davidralves@gmail.com>
>>>> wrote:
>>>>> 
>>>>>> Hi Jacques
>>>>>> 
>>>>>>      I've submitted a first-pass patch to DRILL-15.
>>>>>>      I did this mostly because HBase will be my main target and
>>>> because
>>>>>> I wanted to get a feel of what would be a nice interface for DRILL-13.
>>>> Have
>>>>>> some thoughts that I will post soon.
>>>>>>      btw: I still can't assign issues to myself in JIRA, did you
>>>> forget
>>>>>> to add me as a contributor?
>>>>>> 
>>>>>> Best
>>>>>> David
>>>>>> 
>>>>>> On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <jacques@apache.org>
>> wrote:
>>>>>> 
>>>>>>> Hey David,
>>>>>>> 
>>>>>>> These sound good.  I've added you as a contributor on jira so you can
>>>>>> assign
>>>>>>> tasks to yourself.  I think 45 and 46 are good places to start.  15
>>>>>> depends
>>>>>>> on 13 and working on the two hand in hand would probably be a good
>>>> idea.
>>>>>>> Maybe we could do a design discussion on 15 and 13 here once you have
>>>>>> some
>>>>>>> time to focus on it.
>>>>>>> 
>>>>>>> Jacques
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Mar 11, 2013 at 3:02 AM, David Alves <davidralves@gmail.com>
>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi All
>>>>>>>> 
>>>>>>>>     I have a new academic project for which I'd like to use drill
>>>>>>>> since none of the other parallel database over hadoop/nosql
>>>>>> implementations
>>>>>>>> fit just right.
>>>>>>>>     To this goal I've been tinkering with the prototype trying to
>>>>>> find
>>>>>>>> where I'd be most useful.
>>>>>>>> 
>>>>>>>>     Here's where I'd like to start, if you agree:
>>>>>>>>     - implement HBase storage engine (DRILL-15)
>>>>>>>>             - start with simple scanning and pushdown of
>>>>>>>> selection/projection
>>>>>>>>     - implement the LogicalPlanBuilder (DRILL-45)
>>>>>>>>     - setup coding style in the wiki (formatting/imports etc,
>>>>>> DRILL-46)
>>>>>>>>     - create builders for all logical plan elements/make logical
>>>>>> plans
>>>>>>>> immutable (no issue for this, I'd like to hear your thoughts first).
>>>>>>>> 
>>>>>>>>     Please let me know your thoughts, and if you agree please
>> assign
>>>>>>>> the issues to me (it seems that I can't assign them myself).
>>>>>>>> 
>>>>>>>> Best
>>>>>>>> David Alves
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 

