
List:       drill-dev
Subject:    Re: contribution
From:       David Alves <davidralves () gmail ! com>
Date:       2013-03-13 16:47:02
Message-ID: B2829968-530D-4300-89CD-3D346B280BB9 () gmail ! com

Hi Jacques

	I can assign issues to myself now, thanks.
	What you say wrt the logical/physical/execution layers sounds good.
	My main concern, for the moment, is to have something working as fast as possible,
i.e. some daemons that I'd be able to deploy to a working HBase cluster and send
work to in some form (first step would be to treat it as a non-distributed engine
where each daemon runs an instance of the prototype).  Here's where I'd like to go
next:
	- lay the groundwork for the daemons (scripts/rpc iface/wiring protocol).
	- create an execution engine iface that abstracts future implementations,
and make it available through the rpc iface. This would sit in front of the ref impl
for now and would be replaced by cpp down the line.
	I think we can probably concentrate on the capabilities iface a bit down the line
but, as a first approach, I see it simply providing a set of ops that the storage
engine is able to run internally.  How to abstract locality/partitioning/schema
capabilities is still not clear to me though, thoughts?
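
	To make the "set of ops" idea concrete, here is a minimal Java sketch. All names
(OpType, StorageEngineCapabilities, PushDownPlanner, FilterProjectEngine) are
hypothetical and not part of the current prototype; it assumes the ops sitting above
a scan are given bottom-up, and that only the leading run of supported ops can be
pushed below the scanner.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical op kinds; the real set would come from the logical plan model.
enum OpType { FILTER, PROJECT, PARTIAL_AGGREGATE, JOIN }

// Hypothetical capabilities iface: the storage engine only advertises
// which ops it is able to run internally.
interface StorageEngineCapabilities {
    Set<OpType> supportedOps();
}

// Example engine (assumption): can run filters and projections internally.
final class FilterProjectEngine implements StorageEngineCapabilities {
    @Override
    public Set<OpType> supportedOps() {
        return EnumSet.of(OpType.FILTER, OpType.PROJECT);
    }
}

// Result of splitting a chain of ops at the storage engine boundary.
final class PushDownSplit {
    final List<OpType> pushedDown; // handed to the storage engine
    final List<OpType> remaining;  // left for the execution engine

    PushDownSplit(List<OpType> pushedDown, List<OpType> remaining) {
        this.pushedDown = pushedDown;
        this.remaining = remaining;
    }
}

final class PushDownPlanner {
    // opsAboveScan is ordered bottom-up (closest to the scan first).
    // Ops are pushed down until the first one the engine cannot run;
    // everything from that point on stays above the scanner.
    static PushDownSplit split(List<OpType> opsAboveScan,
                               StorageEngineCapabilities caps) {
        Set<OpType> supported = caps.supportedOps();
        List<OpType> pushed = new ArrayList<>();
        List<OpType> rest = new ArrayList<>();
        boolean pushing = true;
        for (OpType op : opsAboveScan) {
            if (pushing && supported.contains(op)) {
                pushed.add(op);
            } else {
                pushing = false;
                rest.add(op);
            }
        }
        return new PushDownSplit(pushed, rest);
    }
}
```

	This deliberately says nothing about locality, partitioning or schema; it only
covers the "which ops can the engine run" part.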

David

On Mar 13, 2013, at 11:12 AM, Jacques Nadeau <jacques@apache.org> wrote:

> I'm working on a presentation that will better illustrate the layers.
> There are actually three key plans.  Thinking to date has been to break
> the plans down into logical, physical and execution.  The third hasn't been
> expressed well here and is entirely an internal domain to the execution
> engine.  Following some classic methods: Logical expresses what we want to
> do, Physical expresses how we want to do it (adding points of
> parallelization but not specifying particular amounts of parallelization or
> node by node assignments).  The execution engine is then responsible for
> determining the amount of parallelization of a particular plan along with
> system load (likely leveraging Berkeley's Sparrow work), task priority and
> specific data locality information, building sub-dags to be assigned to
> individual nodes and execute the plan.
> 
> So in the higher logical and physical levels, a single Scan and subsequent
> ScanPOP should be okay...  (ScanROPs have a separate problem since they
> ignore the level of separation we're planning for the real execution layer.
> This is why the current ref impl turns a single Scan into potentially
> a union of ScanROPs... not elegant but logically correct.)
> 
> The capabilities interface still needs to be defined for how a storage
> engine reveals its logical capabilities and thus consumes part of the plan.
> 
> J
> 
> 
> On Tue, Mar 12, 2013 at 10:19 PM, David Alves <davidralves@gmail.com> wrote:
> 
> > Hi Lisen
> > 
> > Some of what you are saying like push down of ops like filter,
> > projection or partial aggregation below the storage engine scanner level,
> > or sub tree execution are actively being discussed in issues DRILL-13
> > (Storage Engine Interface) and DRILL-15 (HBase storage engine); your input
> > in these issues is most welcome.
> > 
> > HBase in particular has the notion of
> > endpoints/coprocessors/filters that allow pushing this down easily (this is
> > also in line with what other parallel database over nosql implementations
> > like tajo do).
> > A possible approach is to have the optimizer change the order of
> > the ops to place them below the storage engine scanner and let the SE impl
> > deal with it internally.
> > 
> > There are also some other pieces missing at the moment AFAIK, like
> > a distributed metadata store, the drill daemons, wiring, etc.
> > 
> > So in summary, you're absolutely right, and if you're particularly
> > interested in the HBase SE impl (as I am, for the moment) I'd be interested
> > in collaborating.
> > 
> > Best
> > David
> > 
> > 
> > On Mar 12, 2013, at 11:44 PM, Lisen Mu <immars@gmail.com> wrote:
> > 
> > > Hi David,
> > > 
> > > Very nice to see your effort on this.
> > > 
> > > Hi Jacques,
> > > 
> > > We are also extending the Drill prototype to see if there is any chance it
> > > can meet our production needs. However, we find that implementing a
> > > performant HBase storage engine is not so straightforward and requires some
> > > workarounds. The problem is in the Scan interface.
> > > 
> > > In Drill's physical plan model, ScanROP is in charge of table scans. The
> > > storage engine provides output for a whole data source, a csv file for
> > > example. That is sufficient for an input source like a plain file, but for
> > > HBase it's not very efficient, if not impossible, to let ScanROP retrieve a
> > > whole htable into Drill. Storage engines like HBase should have some ability
> > > to do part of the DrQL query, like Filter, if a filter can be performed by
> > > specifying startRowKey and endRowKey. A storage engine like mysql could do
> > > more, even Join.
> > > 
> > > Generally, it would be clearer if a ScanROP were mapped to a sub-DAG of the
> > > logical plan DAG instead of a single Scan node in the logical plan. If so,
> > > more implementation-specific information would be coupled into the plan
> > > optimization & transformation phase. I guess that's the price to pay when
> > > optimization comes, or is there another way I failed to see?
> > > 
> > > Please correct me if anything is wrong.
> > > 
> > > thanks,
> > > 
> > > Lisen
> > > 
> > > 
> > > 
> > > On Wed, Mar 13, 2013 at 9:33 AM, David Alves <davidralves@gmail.com> wrote:
> > > 
> > > > Hi Jacques
> > > > 
> > > > I've submitted a first-pass patch to DRILL-15.
> > > > I did this mostly because HBase will be my main target and because
> > > > I wanted to get a feel of what would be a nice interface for DRILL-13.
> > > > Have some thoughts that I will post soon.
> > > > btw: I still can't assign issues to myself in JIRA, did you forget
> > > > to add me as a contributor?
> > > > 
> > > > Best
> > > > David
> > > > 
> > > > On Mar 11, 2013, at 2:13 PM, Jacques Nadeau <jacques@apache.org> wrote:
> > > > 
> > > > > Hey David,
> > > > > 
> > > > > These sound good.  I've added you as a contributor on jira so you can
> > > > > assign tasks to yourself.  I think 45 and 46 are good places to start.
> > > > > 15 depends on 13 and working on the two hand in hand would probably be
> > > > > a good idea. Maybe we could do a design discussion on 15 and 13 here
> > > > > once you have some time to focus on it.
> > > > > 
> > > > > Jacques
> > > > > 
> > > > > 
> > > > > On Mon, Mar 11, 2013 at 3:02 AM, David Alves <davidralves@gmail.com> wrote:
> > > > > 
> > > > > > Hi All
> > > > > > 
> > > > > > I have a new academic project for which I'd like to use drill
> > > > > > since none of the other parallel database over hadoop/nosql
> > > > > > implementations fit just right.
> > > > > > To this goal I've been tinkering with the prototype trying to
> > > > > > find where I'd be most useful.
> > > > > > 
> > > > > > Here's where I'd like to start, if you agree:
> > > > > > - implement HBase storage engine (DRILL-15)
> > > > > > - start with simple scanning and push down of
> > > > > > selection/projection
> > > > > > - implement the LogicalPlanBuilder (DRILL-45)
> > > > > > - setup coding style in the wiki (formatting/imports etc,
> > > > > > DRILL-46)
> > > > > > - create builders for all logical plan elements/make logical
> > > > > > plans immutable (no issue for this, I'd like to hear your
> > > > > > thoughts first).
> > > > > > 
> > > > > > Please let me know your thoughts, and if you agree please assign
> > > > > > the issues to me (it seems that I can't assign them myself).
> > > > > > 
> > > > > > Best
> > > > > > David Alves
> > > > 
> > > > 
> > 
> > 

