'Re: [GENERAL] debugging server connection issue'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       postgresql-general
Subject:    Re: [GENERAL] debugging server connection issue
From:       Stephen Constable <sjconsta () scisat ! ca>
Date:       2016-03-31 12:37:15
Message-ID: CACfdzQmdT6=dUgUFehn12+sC9trCLFf=h6d76MTwgTKFYQU7QA () mail ! gmail ! com
[Download RAW message or body]

In case anybody looks back on this thread in the future, I fixed the
problem (at least temporarily) by running the following in CentOS 7:
# echo "1025 65535" >/proc/sys/net/ipv4/ip_local_port_range
# sysctl -w net.ipv4.tcp_timestamps=1
# sysctl -w net.ipv4.tcp_tw_recycle=1
# sysctl -w tcp_tw_reuse=1

On both the database server and the client nodes.  It seems the clients
were eating up all the available ephemeral ports opening and closing
database connections.  I'm going to have a long talk with the developer
about why this is done in the first place :)

Cheers, and thanks for the tips.
--Steve

On Tue, Mar 29, 2016 at 6:51 PM Adrian Klaver <adrian.klaver@aklaver.com>
wrote:

> On 03/29/2016 03:25 PM, Stephen Constable wrote:
> > Sorry, my client environment is Linux.
>
> Hmm, so I was reading win32.c wrong. It is mapping a Windows error
> message to that string.
>
> >
> > My current theory is that my clients are running out of available
> > ephemeral ports, like in this thread:
> >
> http://dba.stackexchange.com/questions/59650/pgbouncer-works-great-but-occasionally-becomes-unavailable
> (but
> > I"m not currently using pg bouncer).  I tried pg bouncer before and had
> > the same errors, which in retrospect makes the client-side issue seem
> > more likely.  Are there any configuration variables I can set to reduce
> > the number of ephemeral ports required in the postgresql client
> > libraries?  Otherwise, I will attempt to reconfigure the OS of the
> > client machines tomorrow morning.
>
> Not sure how that would work. To make a network connection would seem to
> me to require a port.
>
> Are you seeing the same sort of port churn on your 8.4 machine?
>
> Is the fact that is processing results slower maybe giving the ports a
> chance to timeout their wait time, versus not on the newer faster machine?
>
> The issue, to me at least, seems to be less the number of jobs, but the
> number of connections each job is producing. T
> >
> > Thanks,
> > Steve
> >
> > On Tue, Mar 29, 2016 at 4:44 PM Adrian Klaver <adrian.klaver@aklaver.com
> > <mailto:adrian.klaver@aklaver.com>> wrote:
> >
> >     On 03/29/2016 01:28 PM, Stephen Constable wrote:
> >      > My apologies, I'm not sure what part of the networking stack the
> >      > messages are coming from.  It also states:
> >      > """
> >      > could not connect to server: Cannot assign requested address
> >      > Is the server running on host "<hostname>" and accepting
> >      > TCP/IP connections on port <port>?
> >      > """
> >
> >     Alright I lied, the above is a Postgres error message. I am just not
> >     used to seeing 'Cannot assign requested address'. Turns out it is in
> >     interfaces/libpq/win32.c.
> >
> >     So your client is running on Windows?
> >
> >
> >      > This error is only printed under a 32-job load, never a single
> >     job load.
> >      >
> >      > The processes are indeed connecting over a local network.
> >      >
> >      > I have only enabled the logging of connections and disconnections
> >     since
> >      > I figured that would be the most telling :) perhaps that was not
> the
> >      > best idea.  but, FYI, I see over 5000 such notices in a single
> >     minute.
> >      > I will reconfigure the logging to be more verbose.
> >      >
> >      > Thanks,
> >      > Steve
> >      >
> >      > On Tue, Mar 29, 2016 at 4:21 PM Adrian Klaver
> >     <adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
> >      > <mailto:adrian.klaver@aklaver.com
> >     <mailto:adrian.klaver@aklaver.com>>> wrote:
> >      >
> >      >     On 03/29/2016 01:10 PM, Stephen Constable wrote:
> >      >      > Hi All,
> >      >      >
> >      >      > I'm a new-ish sysadmin working on porting legacy
> >     scientific code
> >      >     from a
> >      >      > local server/client to new supercomputer environment.  My
> >     work is
> >      >     mostly
> >      >      > done, except that my postgres database doesn't seem to be
> >     able to
> >      >     keep
> >      >      > up with the new environment.  The application is written
> >     in-house
> >      >     in a
> >      >      > mixture of FORTAN 77 and C, and uses postgres BLOBS as its
> >     main data
> >      >      > store.  This application in particular only reads from the
> >      >     database, it
> >      >      > never writes, which *should* make it easy to scale.
> >      >      >
> >      >      > My main problem is that this client application is unable
> to
> >      >     connect to
> >      >      > the database under a modest load (32 simultaneous jobs).
> >     The client
> >      >      > error logs print out messages like "could not connect to
> >     server:
> >      >     Cannot
> >      >      > assign requested address" and "Cannot connect to database
> >      >     [runlog]!!!"
> >      >      > (an important database of ours).  The "cannot assign
> requested
> >      >     address"
> >      >
> >      >     Well those do not look like Postgres error messages to me, so
> >     the first
> >      >     thing would be to determine what part of the stack is
> >     generating them.
> >      >
> >      >     Is the client software connecting to the database over a
> network?
> >      >
> >      >     Are you using connection pooling?
> >      >
> >      >      > message makes me think it's a configuration issue.  The
> >     logs are
> >      >     flooded
> >      >      > with hundreds of connection and disconnection notices per
> >      >     second.  This
> >      >
> >      >     Might want to turn off logging connections/disconnections:
> >      >
> >      >
> >
> http://www.postgresql.org/docs/9.4/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHAT
> >      >
> >      >     log_connections (boolean)
> >      >
> >      >     log_disconnections (boolean)
> >      >
> >      >      > same code and configuration runs fine on our mid-2000's
> >     Solaris
> >      >     10 box
> >      >      > with postgres 8.4 (albeit very slowly) but totally fails
> >     with these
> >      >      > connection errors on a modern Dell system running CentOS 7
> or
> >      >     FreeBSD 10
> >      >      > (I tested both) with postgres 9.4.
> >      >      >
> >      >      > While the database is under load (and jobs are actively
> >     failing),
> >      >     select
> >      >      > count(*) from pg_stat_activity returns 30-34 ish
> >     connections, show
> >      >      > max_connections returns 100, and show
> >     superuser_reserved_connections
> >      >      > shows 3.  My only other hint is that right after a fresh
> >     install of
> >      >      > CentOS 7 my job success rate was around 50%, and now it has
> >      >     approached
> >      >      > approximately 5%, so something is changing over time.
> >      >      >
> >      >      > Does anyone have any advice or experience with similar
> issues?
> >      >
> >      >     What else does the Postgres log show besides the
> >      >     connections/disconnections, that might be of interest?
> >      >
> >      >     What does the system log show?
> >      >
> >      >      >
> >      >      > Thanks,
> >      >      > Steve
> >      >      >
> >      >
> >      >
> >      >     --
> >      >     Adrian Klaver
> >      > adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
> >     <mailto:adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com
> >>
> >      >
> >
> >
> >     --
> >     Adrian Klaver
> >     adrian.klaver@aklaver.com <mailto:adrian.klaver@aklaver.com>
> >
>
>
> --
> Adrian Klaver
> adrian.klaver@aklaver.com
>

[Attachment #3 (text/html)]

<div dir="ltr">In case anybody looks back on this thread in the future, I fixed the \
problem (at least temporarily) by running the following in CentOS 7:<div># echo \
&quot;1025 65535&quot; &gt;/proc/sys/net/ipv4/ip_local_port_range<br></div><div>#  \
sysctl -w net.ipv4.tcp_timestamps=1</div><div># sysctl -w \
net.ipv4.tcp_tw_recycle=1</div><div>#  sysctl -w \
tcp_tw_reuse=1</div><div><br></div><div>On both the database server and the client \
nodes.   It seems the clients were eating up all the available ephemeral ports \
opening and closing database connections.   I&#39;m going to have a long talk with \
the developer about why this is done in the first place \
:)</div><div><br></div><div>Cheers, and thanks for the \
tips.</div><div>--Steve</div></div><br><div class="gmail_quote"><div dir="ltr">On \
Tue, Mar 29, 2016 at 6:51 PM Adrian Klaver &lt;<a \
href="mailto:adrian.klaver@aklaver.com">adrian.klaver@aklaver.com</a>&gt; \
wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 \
.8ex;border-left:1px #ccc solid;padding-left:1ex">On 03/29/2016 03:25 PM, Stephen \
Constable wrote:<br> &gt; Sorry, my client environment is Linux.<br>
<br>
Hmm, so I was reading win32.c wrong. It is mapping a Windows error<br>
message to that string.<br>
<br>
&gt;<br>
&gt; My current theory is that my clients are running out of available<br>
&gt; ephemeral ports, like in this thread:<br>
&gt; <a href="http://dba.stackexchange.com/questions/59650/pgbouncer-works-great-but-occasionally-becomes-unavailable" \
rel="noreferrer" target="_blank">http://dba.stackexchange.com/questions/59650/pgbouncer-works-great-but-occasionally-becomes-unavailable</a> \
(but<br> &gt; I&quot;m not currently using pg bouncer).   I tried pg bouncer before \
and had<br> &gt; the same errors, which in retrospect makes the client-side issue \
seem<br> &gt; more likely.   Are there any configuration variables I can set to \
reduce<br> &gt; the number of ephemeral ports required in the postgresql client<br>
&gt; libraries?   Otherwise, I will attempt to reconfigure the OS of the<br>
&gt; client machines tomorrow morning.<br>
<br>
Not sure how that would work. To make a network connection would seem to<br>
me to require a port.<br>
<br>
Are you seeing the same sort of port churn on your 8.4 machine?<br>
<br>
Is the fact that is processing results slower maybe giving the ports a<br>
chance to timeout their wait time, versus not on the newer faster machine?<br>
<br>
The issue, to me at least, seems to be less the number of jobs, but the<br>
number of connections each job is producing. T<br>
&gt;<br>
&gt; Thanks,<br>
&gt; Steve<br>
&gt;<br>
&gt; On Tue, Mar 29, 2016 at 4:44 PM Adrian Klaver &lt;<a \
href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a><br> &gt; &lt;mailto:<a \
href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a>&gt;&gt; wrote:<br> &gt;<br>
&gt;        On 03/29/2016 01:28 PM, Stephen Constable wrote:<br>
&gt;         &gt; My apologies, I&#39;m not sure what part of the networking stack \
the<br> &gt;         &gt; messages are coming from.   It also states:<br>
&gt;         &gt; &quot;&quot;&quot;<br>
&gt;         &gt; could not connect to server: Cannot assign requested address<br>
&gt;         &gt; Is the server running on host &quot;&lt;hostname&gt;&quot; and \
accepting<br> &gt;         &gt; TCP/IP connections on port &lt;port&gt;?<br>
&gt;         &gt; &quot;&quot;&quot;<br>
&gt;<br>
&gt;        Alright I lied, the above is a Postgres error message. I am just not<br>
&gt;        used to seeing &#39;Cannot assign requested address&#39;. Turns out it is \
in<br> &gt;        interfaces/libpq/win32.c.<br>
&gt;<br>
&gt;        So your client is running on Windows?<br>
&gt;<br>
&gt;<br>
&gt;         &gt; This error is only printed under a 32-job load, never a single<br>
&gt;        job load.<br>
&gt;         &gt;<br>
&gt;         &gt; The processes are indeed connecting over a local network.<br>
&gt;         &gt;<br>
&gt;         &gt; I have only enabled the logging of connections and \
disconnections<br> &gt;        since<br>
&gt;         &gt; I figured that would be the most telling :) perhaps that was not \
the<br> &gt;         &gt; best idea.   but, FYI, I see over 5000 such notices in a \
single<br> &gt;        minute.<br>
&gt;         &gt; I will reconfigure the logging to be more verbose.<br>
&gt;         &gt;<br>
&gt;         &gt; Thanks,<br>
&gt;         &gt; Steve<br>
&gt;         &gt;<br>
&gt;         &gt; On Tue, Mar 29, 2016 at 4:21 PM Adrian Klaver<br>
&gt;        &lt;<a href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a> &lt;mailto:<a \
href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a>&gt;<br> &gt;         &gt; &lt;mailto:<a \
href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a><br> &gt;        &lt;mailto:<a \
href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a>&gt;&gt;&gt; wrote:<br> &gt;         \
&gt;<br> &gt;         &gt;        On 03/29/2016 01:10 PM, Stephen Constable \
wrote:<br> &gt;         &gt;         &gt; Hi All,<br>
&gt;         &gt;         &gt;<br>
&gt;         &gt;         &gt; I&#39;m a new-ish sysadmin working on porting \
legacy<br> &gt;        scientific code<br>
&gt;         &gt;        from a<br>
&gt;         &gt;         &gt; local server/client to new supercomputer environment.  \
My<br> &gt;        work is<br>
&gt;         &gt;        mostly<br>
&gt;         &gt;         &gt; done, except that my postgres database doesn&#39;t \
seem to be<br> &gt;        able to<br>
&gt;         &gt;        keep<br>
&gt;         &gt;         &gt; up with the new environment.   The application is \
written<br> &gt;        in-house<br>
&gt;         &gt;        in a<br>
&gt;         &gt;         &gt; mixture of FORTAN 77 and C, and uses postgres BLOBS as \
its<br> &gt;        main data<br>
&gt;         &gt;         &gt; store.   This application in particular only reads \
from the<br> &gt;         &gt;        database, it<br>
&gt;         &gt;         &gt; never writes, which *should* make it easy to \
scale.<br> &gt;         &gt;         &gt;<br>
&gt;         &gt;         &gt; My main problem is that this client application is \
unable to<br> &gt;         &gt;        connect to<br>
&gt;         &gt;         &gt; the database under a modest load (32 simultaneous \
jobs).<br> &gt;        The client<br>
&gt;         &gt;         &gt; error logs print out messages like &quot;could not \
connect to<br> &gt;        server:<br>
&gt;         &gt;        Cannot<br>
&gt;         &gt;         &gt; assign requested address&quot; and &quot;Cannot \
connect to database<br> &gt;         &gt;        [runlog]!!!&quot;<br>
&gt;         &gt;         &gt; (an important database of ours).   The &quot;cannot \
assign requested<br> &gt;         &gt;        address&quot;<br>
&gt;         &gt;<br>
&gt;         &gt;        Well those do not look like Postgres error messages to me, \
so<br> &gt;        the first<br>
&gt;         &gt;        thing would be to determine what part of the stack is<br>
&gt;        generating them.<br>
&gt;         &gt;<br>
&gt;         &gt;        Is the client software connecting to the database over a \
network?<br> &gt;         &gt;<br>
&gt;         &gt;        Are you using connection pooling?<br>
&gt;         &gt;<br>
&gt;         &gt;         &gt; message makes me think it&#39;s a configuration issue. \
The<br> &gt;        logs are<br>
&gt;         &gt;        flooded<br>
&gt;         &gt;         &gt; with hundreds of connection and disconnection notices \
per<br> &gt;         &gt;        second.   This<br>
&gt;         &gt;<br>
&gt;         &gt;        Might want to turn off logging \
connections/disconnections:<br> &gt;         &gt;<br>
&gt;         &gt;<br>
&gt;        <a href="http://www.postgresql.org/docs/9.4/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHAT" \
rel="noreferrer" target="_blank">http://www.postgresql.org/docs/9.4/interactive/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHAT</a><br>
 &gt;         &gt;<br>
&gt;         &gt;        log_connections (boolean)<br>
&gt;         &gt;<br>
&gt;         &gt;        log_disconnections (boolean)<br>
&gt;         &gt;<br>
&gt;         &gt;         &gt; same code and configuration runs fine on our \
mid-2000&#39;s<br> &gt;        Solaris<br>
&gt;         &gt;        10 box<br>
&gt;         &gt;         &gt; with postgres 8.4 (albeit very slowly) but totally \
fails<br> &gt;        with these<br>
&gt;         &gt;         &gt; connection errors on a modern Dell system running \
CentOS 7 or<br> &gt;         &gt;        FreeBSD 10<br>
&gt;         &gt;         &gt; (I tested both) with postgres 9.4.<br>
&gt;         &gt;         &gt;<br>
&gt;         &gt;         &gt; While the database is under load (and jobs are \
actively<br> &gt;        failing),<br>
&gt;         &gt;        select<br>
&gt;         &gt;         &gt; count(*) from pg_stat_activity returns 30-34 ish<br>
&gt;        connections, show<br>
&gt;         &gt;         &gt; max_connections returns 100, and show<br>
&gt;        superuser_reserved_connections<br>
&gt;         &gt;         &gt; shows 3.   My only other hint is that right after a \
fresh<br> &gt;        install of<br>
&gt;         &gt;         &gt; CentOS 7 my job success rate was around 50%, and now \
it has<br> &gt;         &gt;        approached<br>
&gt;         &gt;         &gt; approximately 5%, so something is changing over \
time.<br> &gt;         &gt;         &gt;<br>
&gt;         &gt;         &gt; Does anyone have any advice or experience with similar \
issues?<br> &gt;         &gt;<br>
&gt;         &gt;        What else does the Postgres log show besides the<br>
&gt;         &gt;        connections/disconnections, that might be of interest?<br>
&gt;         &gt;<br>
&gt;         &gt;        What does the system log show?<br>
&gt;         &gt;<br>
&gt;         &gt;         &gt;<br>
&gt;         &gt;         &gt; Thanks,<br>
&gt;         &gt;         &gt; Steve<br>
&gt;         &gt;         &gt;<br>
&gt;         &gt;<br>
&gt;         &gt;<br>
&gt;         &gt;        --<br>
&gt;         &gt;        Adrian Klaver<br>
&gt;         &gt; <a href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a> &lt;mailto:<a \
href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a>&gt;<br> &gt;        &lt;mailto:<a \
href="mailto:adrian.klaver@aklaver.com" target="_blank">adrian.klaver@aklaver.com</a> \
&lt;mailto:<a href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a>&gt;&gt;<br> &gt;         &gt;<br>
&gt;<br>
&gt;<br>
&gt;        --<br>
&gt;        Adrian Klaver<br>
&gt;        <a href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a> &lt;mailto:<a \
href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a>&gt;<br> &gt;<br>
<br>
<br>
--<br>
Adrian Klaver<br>
<a href="mailto:adrian.klaver@aklaver.com" \
target="_blank">adrian.klaver@aklaver.com</a><br> </blockquote></div>



[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic