List: hadoop-user
Subject: Re: Examples of chained MapReduce?
From: James Kennedy <james.kennedy () troove ! net>
Date: 2007-06-25 16:30:21
Message-ID: 467FED9D.2030908 () troove ! net
Ok guys, thank you for your responses.
Devaraj Das wrote:
> I haven't confirmed this, but I vaguely remember that the resource schedulers
> (Torque, Condor) provide a feature that lets one submit a DAG of jobs. The
> resource manager doesn't invoke a node in the DAG unless all nodes pointing
> to it have successfully finished (or something like that), and the resource
> scheduler framework does the bookkeeping to take care of failed jobs.
> In hadoop there is an effort "Integration of Hadoop with batch schedulers"
> https://issues.apache.org/jira/browse/HADOOP-719
> I am not sure whether it handles the use case where one submits a chain of
> jobs, but I think it potentially could.
>
> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Sunday, June 24, 2007 6:10 AM
> To: hadoop-user@lucene.apache.org
> Subject: Re: Examples of chained MapReduce?
>
>
> We are still evaluating Hadoop for use in our main-line analysis systems,
> but we already have the problem of workflow scheduling.
>
> Our solution for that was to implement a simpler version of Amazon's Simple
> Queue Service. This allows us to have multiple redundant workers for some
> tasks, or to throttle the rate at which other tasks are processed.
>
> The basic idea is that queues contain XML tasks. Tasks are read from the
> queue by workers, but are kept in a holding pen for a queue-specific time
> period after they are read. If the task completes normally, the worker
> deletes it; if the timeout expires before the worker completes the task,
> it is added back to the queue.
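The holding-pen scheme described above can be sketched roughly as follows. This is an in-memory toy, not Amazon's or Veoh's implementation; the names (`SimpleTaskQueue`, `visibility_timeout`) are made up for illustration.

```python
import time

class SimpleTaskQueue:
    """Toy SQS-style queue: a task handed to a worker is parked in a
    holding pen and reappears on the queue if the worker does not
    delete it before the queue's timeout expires."""

    def __init__(self, visibility_timeout=30.0):
        self.timeout = visibility_timeout
        self.ready = []          # tasks available to workers
        self.holding_pen = {}    # task -> deadline for completion

    def put(self, task):
        self.ready.append(task)

    def get(self, now=None):
        """Hand a task to a worker and start its visibility timer."""
        now = time.time() if now is None else now
        self._requeue_expired(now)
        if not self.ready:
            return None
        task = self.ready.pop(0)
        self.holding_pen[task] = now + self.timeout
        return task

    def delete(self, task):
        """Called by the worker on normal completion."""
        self.holding_pen.pop(task, None)

    def _requeue_expired(self, now):
        for task, deadline in list(self.holding_pen.items()):
            if deadline <= now:          # worker timed out
                del self.holding_pen[task]
                self.ready.append(task)  # task goes back on the queue
```

A worker would loop on `get()`, do the work, and call `delete()` only on success; a crashed or hung worker simply lets the timeout return the task to the queue.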
>
> Workers are structured as a triple of scripts that are executed by a manager
> process. These are a pre-condition that determines whether any work should be
> done (usually a check for available local disk space or available CPU
> cycles), an item qualification (run against a particular item, in case the
> work is subject to resource reservation), and a worker script.
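A minimal sketch of that manager loop, assuming each of the three pieces is an executable script that signals its answer through its exit code; the script names and the single-item calling convention are my assumptions.

```python
import subprocess

def run_worker(precondition, qualify, work, item):
    """Sketch of the three-script manager: each stage is an external
    script; a nonzero exit code at any stage stops processing."""
    # Pre-condition: e.g. enough local disk or idle CPU to take any work.
    if subprocess.run([precondition]).returncode != 0:
        return False
    # Qualification: check this particular item, e.g. a resource reservation.
    if subprocess.run([qualify, item]).returncode != 0:
        return False
    # Worker proper: actually process the item.
    return subprocess.run([work, item]).returncode == 0
```

Usage might look like `run_worker("check_disk.sh", "reserve_item.sh", "process.sh", "segment-001")`, where the script names are hypothetical.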
>
> Even this tiny little framework suffices for quite complex workflows and
> work constraints. It would be very easy to schedule map-reduce tasks via a
> similar mechanism.
>
> On 6/23/07 5:34 AM, "Andrzej Bialecki" <ab@getopt.org> wrote:
>
>
>> James Kennedy wrote:
>>
>>> But back to my original question... Doug suggests that dependence on
>>> a driver process is acceptable. But has anyone needed true MapReduce
>>> chaining or tried it successfully? Or is it generally accepted that
>>> a multi-MapReduce algorithm should always be driven by a single process?
>>>
>> I would argue that this functionality is outside the scope of Hadoop.
>> As far as I understand your question, you need orchestration, which
>> involves the ability to record the state of previously executed
>> map-reduce jobs and to start the next map-reduce jobs based on that
>> state, possibly a long time after the first job completes and from a
>> different process.
>>
>> I'm frequently facing this problem, and so far I've been using a
>> poor-man's workflow system consisting of a bunch of cron jobs, shell
>> scripts, and simple marker files to record the current state of the
>> data. In a similar way you can implement advisory application-level
>> locking, using lock files.
>>
>> Example: adding a new batch of pages to a Nutch index involves many
>> steps, starting with fetchlist generation, fetching, parsing, updating
>> the db, extraction of link information, and indexing. Each of these
>> steps consists of one (or several) map-reduce jobs, and the input to
>> the next jobs depends on the output of the previous jobs. What you
>> referred to in your previous email was a single-app driver for this
>> workflow, called Crawl. But I'm using slightly modified versions of the
>> individual tools, which on successful completion create marker files
>> (e.g. fetching.done). The other tools check for the existence of these
>> files and either perform their function or exit (e.g. when I want to
>> run updatedb on a segment that has been fetched but not parsed).
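The marker-file gating described above can be sketched like this. The `.done` suffix follows the fetching.done example; everything else (the state directory, the `action` callback) is assumed for illustration.

```python
import os

def run_step(name, depends_on, action, state_dir="state"):
    """Poor-man's orchestration: run `action` only if every prerequisite
    marker file exists and this step's own marker does not, then drop a
    marker on success so later steps (and later processes) can see it."""
    os.makedirs(state_dir, exist_ok=True)
    marker = os.path.join(state_dir, name + ".done")
    if os.path.exists(marker):
        return False                       # already done; skip
    for dep in depends_on:
        if not os.path.exists(os.path.join(state_dir, dep + ".done")):
            return False                   # prerequisite missing; exit
    action()                               # e.g. launch the map-reduce job
    open(marker, "w").close()              # record success for later steps
    return True
```

Because the state lives on disk, each step can run from a cron job in its own process, long after the previous step finished, which is exactly the orchestration property discussed above.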
>>
>> To summarize this long answer - I think this functionality belongs in
>> the application layer built on top of Hadoop, and IMHO we are better
>> off not implementing it in Hadoop proper.
>>
>>
>
>
>