List:       mesos-user
Subject:    Re: HDFS on Mesos
From:       Maxime Brugidou <maxime.brugidou () gmail ! com>
Date:       2014-06-30 14:18:06
Message-ID: CAHReGaiCSgf3ee0kPM_mrKg_eLxq4PmvkqRdnVanvE_PA0dnJg () mail ! gmail ! com

Since I wanted to learn the scheduler API, I did a quick and dirty proof of
concept to run HDFS over Mesos: https://github.com/brugidou/hdfs-mesos

I actually run it with Deimos/Docker and it's very rough: there is not much
exception checking, the logging is poor, and a lot of things are hard-coded.
It only runs one namenode for the entire cluster and one datanode on each
slave. You can theoretically run multiple clusters by using different
cluster names and data directories. There is no HA, security, or JournalNode
support. It runs hadoop-2.4.1 (the latest release).
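
The datanode-per-slave logic is conceptually just a resourceOffers handler
that launches one task per slave it has not seen yet. Here is a rough sketch
using the old Python scheduler bindings; this is illustrative only, not the
actual code from the repo, and the command, resource sizes and ZooKeeper URL
are placeholders:

    from mesos.interface import Scheduler, mesos_pb2
    import mesos.native

    class HdfsScheduler(Scheduler):
        def __init__(self):
            self.datanodes = set()  # slave ids already running a datanode

        def resourceOffers(self, driver, offers):
            for offer in offers:
                if offer.slave_id.value in self.datanodes:
                    driver.declineOffer(offer.id)
                    continue
                task = mesos_pb2.TaskInfo()
                task.task_id.value = "datanode-" + offer.slave_id.value
                task.slave_id.value = offer.slave_id.value
                task.name = "hdfs-datanode"
                task.command.value = "bin/hdfs datanode"  # placeholder command
                cpus = task.resources.add()
                cpus.name = "cpus"
                cpus.type = mesos_pb2.Value.SCALAR
                cpus.scalar.value = 1
                mem = task.resources.add()
                mem.name = "mem"
                mem.type = mesos_pb2.Value.SCALAR
                mem.scalar.value = 1024
                self.datanodes.add(offer.slave_id.value)
                driver.launchTasks(offer.id, [task])

    if __name__ == "__main__":
        framework = mesos_pb2.FrameworkInfo()
        framework.user = ""  # let Mesos fill in the current user
        framework.name = "hdfs-poc"
        driver = mesos.native.MesosSchedulerDriver(
            HdfsScheduler(), framework, "zk://localhost:2181/mesos")
        driver.run()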

The hard part is knowing where the namenode will run. To simplify this, I
chose to pre-select (via configuration) the host where the namenode runs.
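
Concretely, the pre-selection just means the scheduler compares each offer's
hostname against a configured value before starting the namenode there. A
sketch (the config value and the make_namenode_task helper are hypothetical,
not the repo's actual option names):

    # Hypothetical value, read from configuration at startup.
    NAMENODE_HOST = "namenode1.example.com"

    def resourceOffers(self, driver, offers):
        for offer in offers:
            if offer.hostname == NAMENODE_HOST and not self.namenode_started:
                driver.launchTasks(offer.id, [self.make_namenode_task(offer)])
                self.namenode_started = True
                continue
            # ... datanode handling as in the sketch above ...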

Other than that, I doubt I chose the easiest framework to begin with...


On Thu, Jun 26, 2014 at 4:21 PM, Rick Richardson <rick.richardson@gmail.com>
wrote:

> It might be that some tighter integration beyond a framework is needed. A
> killer docker/chroot feature would simply be providing a standard port to
> all containers that is an open socket to a namenode.
> 
> As this is more about general-purpose storage, it would probably be nice
> to use something with fewer sharp edges, like CephFS or Lustre. HDFS
> requires the data author to think about the size and shape of their data.
> 
> 
> 
> On Thu, Jun 26, 2014 at 7:03 AM, Maxime Brugidou <
> maxime.brugidou@gmail.com> wrote:
> 
> > This has been discussed before, apparently:
> > http://mail-archives.apache.org/mod_mbox/mesos-user/201401.mbox/%3CCAAoDHHF4CvcsjFJ5zuSUAhbLw+0iie5ARHpmHJVKUCVjMqTNsg@mail.gmail.com%3E
> > 
> > I think this topic will become more important now that external
> > containerization is out. The write-outside-sandbox pattern won't work in
> > a chroot or Docker container, AFAIK.
> > 
> > In addition, the Docker pattern for persistent data storage is to use a
> > data-only Docker container. I'm not sure whether that is appropriate here.
> > On Jun 26, 2014 12:42 PM, "Maxime Brugidou" <maxime.brugidou@gmail.com>
> > wrote:
> > 
> > > There is clearly a need for persistent storage management in Mesos from
> > > what I can observe.
> > > 
> > > The current sandbox is what I consider ephemeral storage, since it
> > > gets lost when the task exits. It can recover after a slave failure
> > > using the recovery mechanism, but it won't survive a slave reboot, for
> > > example.
> > > 
> > > Other frameworks I know of that seem to use or need persistent storage
> > > are Cassandra and Kafka. I wonder what has been done in those
> > > frameworks to survive a DC power outage, for example. Is all data lost?
> > > 
> > > As Vinod said, if we want to implement persistent storage ourselves,
> > > we need to track the resource "manually" using attributes or
> > > ZooKeeper. This "trick" will be reimplemented over and over by
> > > frameworks and will be outside Mesos' control (I don't even know
> > > whether it is feasible with Docker containerization).
> > > 
> > > The proper way would be to have a persistent disk resource type (or
> > > something equivalent) that lets you keep data on disk. The resource
> > > would belong to a user/framework, and we could have quotas. I have no
> > > idea how to implement that since I'm not familiar with the details,
> > > but it could use simple FS quotas and directories inside the Mesos
> > > work directory itself (so ephemeral and persistent storage share the
> > > same pool), or it could take the form of raw storage using LVM volumes
> > > to enable other sorts of applications... Or it could be both,
> > > actually: Mesos could have a raw volume group to use for any sort of
> > > temporary/ephemeral and persistent volumes.
> > > 
> > > This is probably very complex, since you will need tools to report
> > > storage usage and do some cleanup (or have a TTL/expiry mechanism).
> > > But I believe every storage framework will reinvent this outside Mesos
> > > otherwise.
> > > On Jun 26, 2014 1:01 AM, "Vinod Kone" <vinodkone@gmail.com> wrote:
> > > 
> > > > Thanks for listing this out Adam.
> > > > 
> > > > Data Residency:
> > > > > - Should we destroy the sandbox/hdfs-data when shutting down a DN?
> > > > > - If starting DN on node that was previously running a DN, can/should
> > > > > we try to revive the existing data?
> > > > > 
> > > > 
> > > > I think this is one of the key challenges for a production-quality
> > > > HDFS on Mesos. Currently, since the sandbox is deleted after a task
> > > > exits, if all the data nodes that hold a block (and its replicas)
> > > > get lost/killed for whatever reason, there would be data loss. A
> > > > short-term solution would be to write outside the sandbox and use
> > > > slave attributes to track where to re-launch data node tasks.
> > > > 
> > > > 
> > > > 
> 
> 
> --
> 
> "Historically, the most terrible things - war, genocide, and slavery -
> have resulted not from disobedience, but from obedience"
> --  Howard
> Zinn
> 
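
P.S. For what it's worth, here is a minimal sketch of the
write-outside-the-sandbox plus slave-attribute approach discussed in the
quoted thread above. The attribute name, data directory and mesos-slave
flag value are made up for illustration; Mesos does not provide this out of
the box:

    # Operators tag slaves with e.g. --attributes=hdfs_data:true, and the
    # scheduler only (re)launches datanodes on offers carrying that tag,
    # pointing dfs.datanode.data.dir at a directory outside the sandbox so
    # blocks survive the task exit and the sandbox GC.
    DATA_DIR = "/var/lib/hdfs-mesos/data"  # hypothetical persistent path

    def offer_has_hdfs_data(offer):
        for attr in offer.attributes:
            if attr.name == "hdfs_data" and attr.text.value == "true":
                return True
        return False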

