List:       postgresql-general
Subject:    Re: Multi master disjoint cluster
From:       Vijaykumar Jain <vijaykumarjain.github () gmail ! com>
Date:       2022-10-26 16:58:15
Message-ID: CAM+6J95ZzSrdqLTi6fRquOP8gkn_R4YaqZPCoFp+itDhoyV77w () mail ! gmail ! com

Actually, please ignore my email.

Re-reading it, it looks like I did not research this thoroughly and just
asked without actually implementing either option, and without a clear goal
on what I can compromise on while keeping zero read downtime.
I'll write better next time.

On Wed, Oct 26, 2022, 10:04 PM Vijaykumar Jain <
vijaykumarjain.github@gmail.com> wrote:

> Hi all,
>
> tl;dr
> I have a simple question:
> given a choice, if I can write the same data to two databases in parallel,
> should I opt for a primary/replica setup or a multi-writer/master setup?
> This setup can make use of Kafka consumer groups (like two replication
> slots, each with its own LSN offsets) to write to both nodes of a db pair
> in parallel via the application layer.
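>
> What I mean is roughly this (a minimal sketch, assuming confluent-kafka
> and psycopg2; the topic, table, and connection strings are made up): the
> same topic is consumed by two independent consumer groups, each committing
> its own offsets and writing to one of the two nodes.
>
> # Rough sketch: one consumer process per (group, target node) pair.
> # Running it twice, with GROUP_ID/DB_DSN set per node, gives two
> # independent offset streams, much like two replication slots.
> import json
> import os
>
> import psycopg2
> from confluent_kafka import Consumer
>
> GROUP_ID = os.environ["GROUP_ID"]  # e.g. "writer-node-a" or "writer-node-b"
> DB_DSN = os.environ["DB_DSN"]      # e.g. "host=node-a dbname=app"
>
> consumer = Consumer({
>     "bootstrap.servers": "kafka:9092",
>     "group.id": GROUP_ID,
>     "enable.auto.commit": False,   # advance offsets only after the DB write
>     "auto.offset.reset": "earliest",
> })
> consumer.subscribe(["events"])     # hypothetical topic name
>
> conn = psycopg2.connect(DB_DSN)
> while True:
>     msg = consumer.poll(1.0)
>     if msg is None or msg.error():
>         continue
>     row = json.loads(msg.value())
>     with conn.cursor() as cur:
>         # hypothetical target table; upsert keeps replays idempotent
>         cur.execute(
>             "INSERT INTO events (id, payload) VALUES (%s, %s) "
>             "ON CONFLICT (id) DO UPDATE SET payload = EXCLUDED.payload",
>             (row["id"], json.dumps(row)),
>         )
>     conn.commit()
>     consumer.commit(message=msg, asynchronous=False)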
>
> The churn of data is really high; there is a lot of WAL generated, around
> 500 GB/hr.
>
> If I go with primary/replica (logical replication is not feasible):
>
> I need to ensure both are on the same major version. Upgrades are tricky
> (we don't have QA), so our only option is a schema dump to check upgrade
> compatibility. For the data, we trust PostgreSQL :). (I wish we had ZFS
> everywhere, but no.)
>
> Any operations taking exclusive table locks (although later versions have
> far fewer blocking operations) can impact queries on the replica as well
> (leaving aside delay settings).
>
> Corruption can cause downtime (we have tons of it, and RAID to protect
> against it). If the replica is having issues, we can zero the pages on the
> replica and do some repair work once we isolate the problem pages, else
> resync the replica from the primary. But if the primary is having issues,
> we copy data from the replica to disk and copy it into the primary after
> truncating, etc. Some downtime, but not a lot. (I am not an expert at data
> recovery.) I mostly rely on amcheck, dd, and RAID checks.
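>
> For the amcheck part, a rough sketch of what I mean (psycopg2, assuming
> the amcheck extension is available; the connection string is made up) is
> to walk the btree indexes and flag anything bt_index_check() complains
> about before deciding whether to zero pages or resync:
>
> # Rough sketch: run amcheck's bt_index_check() over all btree indexes
> # and report possible corruption.
> import psycopg2
>
> conn = psycopg2.connect("host=node-a dbname=app")  # hypothetical DSN
> conn.autocommit = True                             # isolate each check
> with conn.cursor() as cur:
>     cur.execute("CREATE EXTENSION IF NOT EXISTS amcheck")
>     cur.execute("""
>         SELECT c.oid::regclass
>         FROM pg_class c
>         JOIN pg_am am ON am.oid = c.relam
>         WHERE c.relkind = 'i' AND am.amname = 'btree'
>           AND c.relpersistence = 'p'
>     """)
>     indexes = [r[0] for r in cur.fetchall()]
> for idx in indexes:
>     try:
>         with conn.cursor() as cur:
>             cur.execute("SELECT bt_index_check(%s::regclass)", (idx,))
>     except psycopg2.Error as e:
>         print(f"possible corruption in {idx}: {e}")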
>
> We don't use PITR (too many WALs × 58) or delayed replication, as we can't
> afford more servers.
>
> DDL deploys are guaranteed by replication, so there is no need to try
> 2PC-like machinery at the app layer. (Although apps use deploy tools to
> ensure the DDLs are eventually consistent and idempotent.)
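>
> By consistent and idempotent DDLs I mean roughly this (a sketch with
> psycopg2; table, column, and host names are made up), applied to every
> node by the deploy tool:
>
> # Rough sketch: idempotent DDL the deploy tool can safely re-run.
> import psycopg2
>
> for dsn in ("host=node-a dbname=app", "host=node-b dbname=app"):
>     conn = psycopg2.connect(dsn)
>     with conn, conn.cursor() as cur:   # commits on success
>         cur.execute(
>             "CREATE TABLE IF NOT EXISTS events "
>             "(id bigint PRIMARY KEY, payload jsonb)"
>         )
>         cur.execute(
>             "ALTER TABLE events "
>             "ADD COLUMN IF NOT EXISTS created_at timestamptz DEFAULT now()"
>         )
>     conn.close()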
>
> Basically, primary/replica relieves the app: whatever is on the primary is
> eventually on the replica, so there is a single source of truth.
>
> But with multiple writers, any app mishandling (like a bug in exception
> handling, etc.) can result in divergence, and the setup is no longer
> mirrored.
> We need checks/reconciliation to ensure both write nodes in a pair have
> nearly identical data at the end of the day, so we can trust this setup
> independently of any app mistakes.
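>
> By checks/reconciliation I mean something like this (a rough sketch with
> psycopg2; the table list, connection strings, and fingerprint query are
> just illustrations): compare a cheap per-table fingerprint across the two
> nodes at the end of the day and flag anything that differs.
>
> # Rough sketch: end-of-day reconciliation between the two write nodes.
> import psycopg2
>
> TABLES = ["events"]                    # hypothetical table list
> DSNS = {
>     "node-a": "host=node-a dbname=app",
>     "node-b": "host=node-b dbname=app",
> }
>
> def table_fingerprint(dsn, table):
>     conn = psycopg2.connect(dsn)
>     try:
>         with conn.cursor() as cur:
>             # row count plus an order-independent sum of per-row hashes
>             cur.execute(
>                 f"SELECT count(*), coalesce(sum(hashtext(t::text)::bigint), 0) "
>                 f"FROM {table} t"      # table names come from our own list
>             )
>             return cur.fetchone()
>     finally:
>         conn.close()
>
> for table in TABLES:
>     fps = {node: table_fingerprint(dsn, table) for node, dsn in DSNS.items()}
>     if len(set(fps.values())) > 1:
>         print(f"possible divergence in {table}: {fps}")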
>
> But if the app layer is robust, we have almost no downtime for reads and
> writes, we can have both nodes on different versions (without logical
> replication), can query both nodes in real time with no real replication
> lag issues or conflicts, can do blue/green-style upgrades, canary-test
> some changes on one node if needed, etc.
>
> Am I making sense at all? Or am I sounding confused, as if I don't know
> the difference between primary/replica and multi-writer? This is not a
> BDR-like thing; the nodes don't really need each other unless we are doing
> some recovery.
>
> My point is, we have 58 such primary/replica shards (each 10 TB+,
> consistent hashing at the app layer, no FDW), and there is no scope for
> read downtime, so any issue like post-upgrade performance degradation
> (if any) gives me chills. And we have no QA to test against real data.
>
> There are too many dimensions to shard on, and aggregations need to run
> across the shards (yes, there is no scope for data isolation).
>
>
