
List:       pgsql-hackers
Subject:    Re: [HACKERS] Page replacement algorithm in buffer cache
From:       Jeff Janes <jeff.janes@gmail.com>
Date:       2013-03-31 18:27:06
Message-ID: CAMkU=1zVSyNRR_AQh4j_w6h37+qyvAz4fY8A+QP8e0dsuBg7Fw@mail.gmail.com

On Friday, March 22, 2013, Ants Aasma wrote:

> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> > well if you do a non-locking test first you could at least avoid some
> > cases (and, if you get the answer wrong, so what?) by jumping to the
> > next buffer immediately.  if the non locking test comes good, only
> > then do you do a hardware TAS.
> >
> > you could in fact go further and dispense with all locking in front of
> > usage_count, on the premise that it's only advisory and not a real
> > refcount.  so you only then lock if/when it's time to select a
> > candidate buffer, and only then when you did a non locking test first.
> >  this would of course require some amusing adjustments to various
> > logical checks (usage_count <= 0, heh).
>
> Moreover, if the buffer happens to miss a decrement due to a data
> race, there's a good chance that the buffer is heavily used and
> wouldn't need to be evicted soon anyway.  (If you arrange it to be a
> read-test-inc/dec-store operation then you will never go out of
> bounds.)  However, clocksweep and usage_count maintenance is not what
> is causing contention, because that workload is distributed.  The
> issue is pinning and unpinning.


That is one of multiple issues.  Contention on the BufFreelistLock is
another one.  I agree that usage_count maintenance is unlikely to become a
bottleneck unless one or both of those is fixed first (and maybe not even
then).
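
For concreteness, the unlocked usage_count scheme sketched above might look
roughly like the following.  This is a minimal illustrative sketch, not
PostgreSQL source: the struct, the cap, and the lock handling are
hypothetical stand-ins, assuming C11 atomics.

#include <stdbool.h>
#include <stdatomic.h>

#define MAX_USAGE_COUNT 5

typedef struct BufferSketch
{
    int usage_count;        /* advisory; read and written without a lock */
    atomic_flag locked;     /* stand-in for the per-buffer spinlock */
} BufferSketch;

/* One clock-sweep visit; returns true if buf was claimed for eviction. */
static bool
sweep_visit(BufferSketch *buf)
{
    int count = buf->usage_count;   /* non-locking test first */

    if (count > 0)
    {
        /*
         * Read-test-store decrement: we store a value derived from the
         * one we read, so a racing update can be lost, but the counter
         * can never leave the range [0, MAX_USAGE_COUNT].  A lost
         * decrement only means a hot buffer survives one extra sweep.
         */
        buf->usage_count = count - 1;
        return false;       /* recently used; keep sweeping */
    }

    /* Only a plausible candidate pays for the hardware TAS. */
    if (atomic_flag_test_and_set(&buf->locked))
        return false;       /* someone else holds it; move on */

    if (buf->usage_count > 0)
    {
        /* Recheck under the lock; the unlocked value is only advisory. */
        atomic_flag_clear(&buf->locked);
        return false;
    }

    /* ... evict the page here ... */
    atomic_flag_clear(&buf->locked);
    return true;
}

The pin side would do the mirror-image bounded increment (store count + 1
only when count < MAX_USAGE_COUNT), which is what keeps the counter in
range at both ends even when updates are lost.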

...



> The issue with the current buffer management algorithm is that it
> seems to scale badly with increasing shared_buffers.


I do not think that this is the case.  Neither of the SELECT-only
contention points (pinning/unpinning of index root blocks when all data is
in shared_buffers, and BufFreelistLock when all data is not in
shared_buffers) is made worse by increasing shared_buffers, in anything I
have seen.  They do scale badly with the number of concurrent processes,
though.
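
To make the pinning point concrete: in a conventional design, every pin of
a page, even a read-only one, bumps a shared reference count under a
per-buffer lock, so all backends descending the same index serialize on its
root block no matter how large shared_buffers is.  A hypothetical sketch,
not PostgreSQL source:

#include <stdatomic.h>

typedef struct PinSketch
{
    atomic_flag lock;   /* stand-in for the per-buffer spinlock */
    int refcount;       /* how many backends have this page pinned */
} PinSketch;

static void
pin_page(PinSketch *buf)
{
    /*
     * Every reader of a hot index root block spins on the same flag
     * and bounces its cache line between cores, which is why this
     * contention grows with the number of concurrent processes rather
     * than with shared_buffers.
     */
    while (atomic_flag_test_and_set_explicit(&buf->lock,
                                             memory_order_acquire))
        ;   /* spin */
    buf->refcount++;
    atomic_flag_clear_explicit(&buf->lock, memory_order_release);
}

static void
unpin_page(PinSketch *buf)
{
    while (atomic_flag_test_and_set_explicit(&buf->lock,
                                             memory_order_acquire))
        ;   /* spin */
    buf->refcount--;
    atomic_flag_clear_explicit(&buf->lock, memory_order_release);
}

Even replacing the lock with an atomic fetch-and-add would leave the
refcount's cache line ping-ponging between cores, so the serialization is
inherent in sharing one counter per hot page.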

The reports of write-heavy workloads not scaling well with shared_buffers
do not seem to be driven by the buffer management algorithm, or at least
not by the freelist part of it.  They mostly seem to center on the kernel
and the I/O controllers.

Cheers,

Jeff



