
List:       cassandra-user
Subject:    Re: expiring data out of Cassandra/time to live
From:       Ryan Daum <ryan () thimbleware ! com>
Date:       2010-03-31 20:53:37
Message-ID: m2p22dabfc71003311353ia59f214azb613fc5b0bd4b812 () mail ! gmail ! com

On that topic, what exactly is keeping this feature out of the official
releases?

On Wed, Mar 31, 2010 at 3:43 PM, Daniel Kluesing <dk@bluekai.com> wrote:

>  We also applied this patch to the 0.6 branch and have been running it for
> a bit over a week. Works well, would love to see it get into trunk/0.7
> proper.
>
>
>
> *From:* Ryan Daum [mailto:ryan@thimbleware.com]
> *Sent:* Wednesday, March 31, 2010 11:49 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: expiring data out of Cassandra/time to live
>
>
>
> I was able to successfully merge this patch into the 0.6 branch a few weeks
> ago by doing the following:
>
>
>
>    - Downloading the patch
>    - Checking out the trunk of Cassandra from github
>    - Rolling back (checking out) the git repo to the same date that the
>    patch was submitted to Jira
>    - Applying the patch
>    - Committing to Git
>    - Merging forward to the 0.6 branch
>    - Resolving one or two minor conflicts.
>
>
>
> R
>
>
>
> On Wed, Mar 31, 2010 at 2:46 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>
> Sounds like you want to follow
> https://issues.apache.org/jira/browse/CASSANDRA-699.  There is a patch
> there but I wouldn't recommend merging it if Java scares you. :)
>
>
> On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore
> <mike.e.gallamore@googlemail.com> wrote:
> > Hello everyone,
> >
> > I saw a thread on the incubator user chat that started a few months ago:
> >
> > http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02047.html
> > It looks like this is the new official user mailing list, so I'll add my
> > thoughts/question here.
> >
> > Is there any way to set a TTL on data stored in Cassandra? Deleting old
> > SSTables isn't enough for my needs. I need the data to go away after a
> > fixed period of time. Here is what I'm trying to do and my reasoning for
> > why I think Cassandra, and not something like Flare/Memcache, meets my
> > need:
> >
> > I'm building a reputation system. We get lots of data at my work (in the
> > 10's of GB of reputation data a day). The trick is that old data is not
> > useful, as a sender's IP address might have changed, they might have had
> > a bot on their system and now have removed it, etc. So I need to be able
> > to keep data for a fixed period of time, and afterwards it isn't needed
> > and ideally would be GC'd out.
> >
> > We want to do one thing if we either never heard of the individual or at
> > least not since the expiry time, and another thing based on the
> > reputation data that is stored in Cassandra if it is current. So ideally
> > a Cassandra call for a key for someone whose reputation is expired would
> > return nothing and we'd reply with our default reputation for that
> > individual. There really is no point using network bandwidth to return
> > all the fields associated with that key only to look at a timestamp and
> > end up ignoring it anyway. Similarly, the latency of requesting first the
> > timestamp and then the data in two separate requests is prohibitive.
> >
> > Why Cassandra:
> >
> >    - Our data is complex and is hard to handle completely in a key/value
> >    sense. In the past we were doing this and just encoding the complex
> >    structure inside of JSON, but this isn't ideal. It is very nice
> >    algorithmically to be able to say: give me this column, or update this
> >    element of this hash, etc., rather than having to pull the old version,
> >    decode, modify, re-encode and push back to a cache-based system.
> >    - Our data is large (in the low TBs at the moment, but expected to grow
> >    to 50-100TB of live data).
> >    - We need quick responses for both searches and writes: typically for
> >    each thing we track we get a request for the reputation, the message
> >    gets processed, and then we get feedback from the recipient. So reads
> >    and writes are symmetric.
> >    - High request rate: millions per hour.
> >    - Hundreds of millions of unique reputations (this is why crawling
> >    through the data with a script purging old data doesn't make sense).
> >    - Availability/load balancing is a must. Data needs to be replicated; a
> >    disk copy is useful so that if we have a power outage we don't lose the
> >    system.
> >    - It would be interesting to keep a local subset of our data at
> >    customers' sites and have them "replicate up" their data rather than
> >    send their feedback in a different manner that then has to be processed
> >    and pumped into our datastore (hopefully this is possible with
> >    Cassandra with some creative choices of how the data is hashed between
> >    nodes).
> >
> > Does the capability to set an expiry time exist? If not, are there any
> > plans to add it? My Java experience is very limited (I'm accessing
> > Cassandra via Thrift/Perl) so it isn't something I'd be able to jump in
> > and run with myself.
> >
>
>
>
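
For anyone skimming the thread, the read behaviour Mike is asking for (an expired key looks exactly like a key that was never written, so the caller falls back to a default in one round trip) can be illustrated with a toy in-memory store. This is only a sketch of the semantics, not Cassandra's API and not what the CASSANDRA-699 patch does internally; `ExpiringStore`, `reputation_for` and `DEFAULT_REPUTATION` are made-up names for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Columns = Dict[str, str]


class ExpiringStore:
    """Toy in-memory store (hypothetical): each row carries an expiry time."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._rows: Dict[str, Tuple[float, Columns]] = {}

    def put(self, key: str, columns: Columns, ttl_seconds: float) -> None:
        # Store the absolute expiry time alongside the data.
        self._rows[key] = (self._clock() + ttl_seconds, dict(columns))

    def get(self, key: str) -> Optional[Columns]:
        entry = self._rows.get(key)
        if entry is None:
            return None
        expires_at, columns = entry
        if self._clock() >= expires_at:
            # Expired rows are dropped and reported as missing, so the
            # caller never receives stale columns it would only discard.
            del self._rows[key]
            return None
        return columns


# Assumed placeholder value; the thread does not say what the default is.
DEFAULT_REPUTATION: Columns = {"score": "neutral"}


def reputation_for(store: ExpiringStore, sender: str) -> Columns:
    """Return the stored reputation if still current, else the default."""
    found = store.get(sender)
    return found if found is not None else DEFAULT_REPUTATION
```

The point of doing the expiry check on the storage side is the one made in the original message: the client gets "nothing" back for stale data instead of fetching every column just to inspect a timestamp.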

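The backport steps Ryan lists above (check out trunk as of the patch date, apply, commit, merge forward to 0.6) could be scripted roughly as follows. This is a hedged sketch, not the exact commands anyone in the thread ran: the branch names `trunk`, `cassandra-0.6` and `ttl-backport` are assumptions, as is the patch file name you pass in, and a patch generated from Subversion may need an extra `-p0` passed to `git apply`.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], str]


def git(args: List[str]) -> str:
    """Run one git command in the current repo and return its stdout."""
    done = subprocess.run(
        ["git", *args], check=True, capture_output=True, text=True
    )
    return done.stdout.strip()


def backport_patch(
    patch_file: str,
    patch_date: str,
    trunk: str = "trunk",            # assumed branch name
    target: str = "cassandra-0.6",   # assumed branch name
    work: str = "ttl-backport",      # hypothetical scratch branch
    run: Runner = git,
) -> str:
    """Apply a patch to trunk as of patch_date, then merge into target."""
    # Last trunk commit made before the patch was attached to Jira.
    base = run(["rev-list", "-n", "1", "--before=" + patch_date, trunk])
    # Branch from that commit so the patch meets the code it was written for.
    run(["checkout", "-b", work, base])
    # --index stages the changes (including new files) as they are applied.
    run(["apply", "--index", patch_file])
    run(["commit", "-m", "Apply " + patch_file])
    # Merge forward; with the default runner a conflict makes git exit
    # non-zero, which raises CalledProcessError and leaves the conflicted
    # merge in the working tree to be resolved by hand.
    run(["checkout", target])
    run(["merge", work])
    return base
```

Passing a different `run` callable makes it possible to dry-run the sequence and print the commands before touching a real checkout.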