List: cassandra-user
Subject: Re: expiring data out of Cassandra/time to live
From: Ryan Daum <ryan () thimbleware ! com>
Date: 2010-03-31 20:53:37
Message-ID: m2p22dabfc71003311353ia59f214azb613fc5b0bd4b812 () mail ! gmail ! com
On that topic, what exactly is keeping this feature out of the official
releases?
On Wed, Mar 31, 2010 at 3:43 PM, Daniel Kluesing <dk@bluekai.com> wrote:
> We also applied this patch to the 0.6 branch and have been running it for
> a bit over a week. Works well, would love to see it get into trunk/0.7
> proper.
>
>
>
> *From:* Ryan Daum [mailto:ryan@thimbleware.com]
> *Sent:* Wednesday, March 31, 2010 11:49 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: expiring data out of Cassandra/time to live
>
>
>
> I was able to successfully merge this patch into the 0.6 branch a few weeks
> ago by doing the following:
>
>
>
> - Downloading the patch
> - Checking out the trunk of Cassandra from github
> - Rolling back (checking out) the git repo to the same date that the
> patch was submitted to Jira
> - Applying the patch
> - Committing to Git
> - Merging forward to the 0.6 branch
> - Resolving one or two minor conflicts.
>
>
>
> R
>
>
>
> On Wed, Mar 31, 2010 at 2:46 PM, Jonathan Ellis <jbellis@gmail.com> wrote:
>
> Sounds like you want to follow
> https://issues.apache.org/jira/browse/CASSANDRA-699. There is a patch
> there but I wouldn't recommend merging it if Java scares you. :)
>
>
> On Wed, Mar 31, 2010 at 1:39 PM, Mike Gallamore
> <mike.e.gallamore@googlemail.com> wrote:
> > Hello everyone,
> >
> > I saw a thread on the incubator user chat that started a few months ago:
> >
> > http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg02047.html
> > It looks like this is the new official user mailing list so I'll add my
> > thoughts/question here.
> >
> > Is there any way to set a TTL on data stored in Cassandra? Deleting old
> > SSTables isn't enough for my needs. I need the data to go away after a
> > fixed period of time. Here is what I'm trying to do, and my reasoning for
> > why I think Cassandra, and not something like Flare/Memcache, meets my
> > need:
> >
> > I'm building a reputation system. We get lots of data at my work (in the
> > tens of GB of reputation data a day). The trick is that old data is not
> > useful, as a sender's IP address might have changed, they might have had
> > a bot on their system and now have removed it, etc. So I need to be able
> > to keep data for a fixed period of time, and afterwards it isn't needed
> > and ideally would be GC'd out.
> >
> > We want to do one thing if we have either never heard of the individual
> > or at least not since the expiry time, and another thing based on the
> > reputation data that is stored in Cassandra if it is current. So ideally
> > a Cassandra call for a key for someone whose reputation is expired would
> > return nothing and we'd reply with our default reputation for that
> > individual. There really is no point using network bandwidth to return
> > all the fields associated with that key only to look at a timestamp and
> > end up ignoring it anyway. Similarly, the latency of requesting first the
> > timestamp and then the data in two separate requests is prohibitive.
> >
> > Why Cassandra:
> >
> > - Our data is complex and is hard to handle completely in a key/value
> >   sense. In the past we were doing this and just encoding the complex
> >   structure inside of JSON, but this isn't ideal. It is very nice
> >   algorithmically to be able to say: give me this column, or update this
> >   element of this hash, etc., rather than having to pull the old version,
> >   decode, modify, re-encode and push back to a cache-based system.
> > - Our data is large (in the low TBs at the moment, but expected to grow
> >   to 50-100TB of live data).
> > - We need quick response for both searches and writes: typically for each
> >   thing we track we get a request for the reputation, the message gets
> >   processed and then we get feedback from the recipient. So reads and
> >   writes are symmetric.
> > - High request rate: millions per hour.
> > - Hundreds of millions of unique reputations (this is why crawling
> >   through the data with a script purging old data doesn't make sense).
> > - Availability/load balancing is a must. Data needs to be replicated; a
> >   disk copy is useful so if we have a power outage we don't lose the
> >   system.
> > - It would be interesting to keep a local subset of our data at customers'
> >   sites and have them "replicate up" their data rather than send their
> >   feedback in a different manner that then has to be processed and pumped
> >   into our datastore (hopefully this is possible with Cassandra with some
> >   creative choices of how the data is hashed between nodes).
> >
> > Does the capability to set an expiry time exist? If not, are there any
> > plans to add it? My Java experience is very limited (I'm accessing
> > Cassandra via Thrift/Perl) so it isn't something I'd be able to jump in
> > and run with myself.
> >
>
>
>
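
For anyone skimming the thread, the read behaviour Mike describes (an
expired reputation comes back as "nothing", and the caller falls back to a
default) can be sketched roughly as below. This is only a toy in-memory
illustration in Python: it is not Cassandra's API and not the
CASSANDRA-699 patch. The names ExpiringColumns, reputation_or_default, the
"score" column and the injectable clock are all made up for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ExpiringColumns:
    """Toy in-memory map of (key, column) -> value with a per-write TTL.

    Hypothetical illustration only; this is not how Cassandra stores data.
    """

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        # The clock is injectable so expiry can be exercised without sleeping.
        self._clock = clock
        self._cells: Dict[Tuple[str, str], Tuple[str, float]] = {}

    def insert(self, key: str, column: str, value: str, ttl_seconds: float) -> None:
        # Store the value together with the absolute time at which it expires.
        self._cells[(key, column)] = (value, self._clock() + ttl_seconds)

    def get(self, key: str, column: str) -> Optional[str]:
        cell = self._cells.get((key, column))
        if cell is None:
            return None
        value, expires_at = cell
        if self._clock() >= expires_at:
            # Expired: drop the cell and answer as if it was never written.
            del self._cells[(key, column)]
            return None
        return value


def reputation_or_default(store: ExpiringColumns, sender: str,
                          default: str = "unknown") -> str:
    # One lookup; the caller never inspects a timestamp itself.
    score = store.get(sender, "score")
    return default if score is None else score
```

The point of the sketch is that the expiry check sits on the storage side,
so the client makes a single request and never has to fetch a timestamp
first, which is the round trip Mike wants to avoid.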