'Re: Client-side deduplication during extraction'


List:       tarsnap-users
Subject:    Re: Client-side deduplication during extraction
From:       James Cass <jamescass.sc () gmail ! com>
Date:       2017-11-20 12:21:03
Message-ID: CAF81nx-rRgD99_ohi7vL3P8Xzv3989n0PCXaSojVBAmzTPjCzw () mail ! gmail ! com

+1 for me.  This sounds like a good idea.
That's my 2 satoshis.  :-)

On Sun, Nov 19, 2017 at 8:03 PM, Colin Percival <cperciva@tarsnap.com> wrote:

> On 11/19/17 12:37, Robie Basak wrote:
> > On Sat, Apr 08, 2017 at 07:52:54PM -0700, Colin Percival wrote:
> >> On 04/04/17 13:06, Robie Basak wrote:
> >>> Since the redundancy is there and my client has all the details,
> >>> is there any way I can take advantage of this?
> >>
> >> Not right now.  This is something I've been thinking about implementing,
> >> but it's rather complicated (the tarsnap "read" path would need to look at
> >> data on disk to see what it can "reuse", and normally it doesn't read any
> >> files from disk).
> >
> > In case it helps others, I hacked together a client-side cache for this
> > one task. It appears to have worked. Patch below.
>
> Ah yes, I was thinking in terms of "notice that we're extracting the file
> 'foo' and there is already a file 'foo', then read that file in and split
> it into blocks in case any can be reused" -- the case you've covered here
> of keeping a cache of downloaded blocks is much simpler (but only covers
> the "multiple downloads of the same data" case, not the more general case
> of "synchronizing" a system with an archive).
>
> > This is absolutely a hack and not production ready (no concurrency, bad
> > error handling, hardcoded cache path whose directory must be created in
> > advance and permissions set manually, etc), but for a one-off task it
> > was enough for me to get my data out.
> > [snip patch]
>
> Yes, this patch definitely looks like it does what you want.  I'd consider
> including it (well, with details tidied up) but I'm not sure if anyone else
> would want to use this functionality... anyone else on the list interested?
>
> --
> Colin Percival
> Security Officer Emeritus, FreeBSD | The power to serve
> Founder, Tarsnap | www.tarsnap.com | Online backups for the truly paranoid
>



