'Re: [Slony1-general] [Slony 2.0.1] Events'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       slony1-general
Subject:    Re: [Slony1-general] [Slony 2.0.1] Events
From:       Cyril Scetbon <cscetbon.ext () orange-ftgroup ! com>
Date:       2009-04-28 15:25:43
Message-ID: 49F71FF7.8030706 () orange-ftgroup ! com
[Download RAW message or body]

I'm still investigating ....

The slon restart does not resolve the error. After the restart I have 
something like :

select ev_origin, min(ev_seqno), max(ev_seqno),
        date_trunc('minutes', min(now() - ev_timestamp)),
        date_trunc('minutes', max(now() - ev_timestamp)),
        min(now() - ev_timestamp) > '00:30:00' as agehi
    from _pns_slony_voila_archi_0.sl_event group by ev_origin;
 ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
-----------+---------+---------+------------+------------+-------
       103 |      65 |      66 | 00:01:00   | 00:47:00   | f
       104 |      60 |      61 | 00:01:00   | 00:47:00   | f
       102 |      33 |      34 | 00:01:00   | 00:47:00   | f
       101 | 4541581 | 4542350 | 00:00:00   | 00:13:00   | f
(4 rows)

So, another event is generated (for example, 66 from node 103)

log extract :
2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 33 processing
2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_102: no sets need 
syncing for this event
2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event 
102,33 SYNC
2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 102,33 ignored 
- duplicate
2009-04-28 15:51:53 CEST DEBUG2 remoteListenThread_104: queue event 
104,60 SYNC
2009-04-28 15:51:53 CEST DEBUG2 remoteWorker_event: event 104,60 ignored 
- duplicate
2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_104: SYNC 60 processing
2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_104: no sets need 
syncing for this event
2009-04-28 15:51:53 CEST DEBUG2 remoteWorkerThread_103: SYNC 65 processing
2009-04-28 15:51:53 CEST DEBUG1 remoteWorkerThread_103: no sets need 
syncing for this event

after a few minutes I have :

select ev_origin, min(ev_seqno), max(ev_seqno),
        date_trunc('minutes', min(now() - ev_timestamp)),
        date_trunc('minutes', max(now() - ev_timestamp)),
        min(now() - ev_timestamp) > '00:30:00' as agehi
    from _pns_slony_voila_archi_0.sl_event group by ev_origin;
 ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
-----------+---------+---------+------------+------------+-------
       103 |      66 |      66 | 00:12:00   | 00:12:00   | f
       104 |      61 |      61 | 00:12:00   | 00:12:00   | f
       102 |      34 |      34 | 00:12:00   | 00:12:00   | f
       101 | 4542310 | 4543008 | 00:00:00   | 00:11:00   | f

log extract :
2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_103: SYNC 66 processing
2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_103: no sets need 
syncing for this event
2009-04-28 16:37:49 CEST DEBUG2 remoteWorkerThread_104: SYNC 61 processing
2009-04-28 16:37:49 CEST DEBUG1 remoteWorkerThread_104: no sets need 
syncing for this event
2009-04-28 16:37:53 CEST DEBUG2 remoteWorkerThread_102: SYNC 34 processing
2009-04-28 16:37:53 CEST DEBUG1 remoteWorkerThread_102: no sets need 
syncing for this event

Precedent events (65,60 and 33) have been cleared, and a new one per 
node has been generated. But, the error continues and after 30 minutes I 
have :

select ev_origin, min(ev_seqno), max(ev_seqno),
        date_trunc('minutes', min(now() - ev_timestamp)),
        date_trunc('minutes', max(now() - ev_timestamp)),
        min(now() - ev_timestamp) > '00:30:00' as agehi
    from _pns_slony_voila_archi_0.sl_event group by ev_origin;
 ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
-----------+---------+---------+------------+------------+-------
       103 |      66 |      66 | 00:41:00   | 00:41:00   | t
       104 |      61 |      61 | 00:41:00   | 00:41:00   | t
       102 |      34 |      34 | 00:41:00   | 00:41:00   | t
       101 | 4543634 | 4544715 | 00:00:00   | 00:18:00   | f

In Summary :

Events are not correctly propagated from nodes 102, 103 and 104. If I 
restart every slons processes new events are generated on these nodes 
but still not correctly propagated. However the last event on each node 
seems to be propagated thanks to the restart stage.

Any idea ?

Cyril Scetbon wrote:
> Hi,
>
> I found in slony log files that the worker has restarted the process 
> on each host in my configuration (1 provider, 3 receivers) at the same 
> time :
>
> 2009-04-28 08:47:37 CEST INFO   remoteWorkerThread_101: SYNC 4516360 
> sync_event timing:  pqexec (s/count)- provider 0.001/1 - subscriber 
> 0.035/1 - IUD 0.000/2
> 2009-04-28 08:47:38 CEST ERROR  remoteListenThread_102: "select 
> ev_origin, ev_seqno, ev_timestamp,        ev_snapshot,        
> "pg_catalog".txid_snapshot_xmin(ev_snapshot),        
> "pg_catalog".txid_snapshot_xmax(ev_snapshot),        ev_type,        
> ev_data1, ev_data2,        ev_data3, ev_data4,        ev_data5,
> ev_data6,        ev_data7, ev_data8 from 
> "_pns_slony_voila_archi_0".sl_event e where (e.ev_origin = '102' and 
> e.ev_seqno > '29') or (e.ev_origin = '103' and e.ev_seqno > '62') 
> order by e.ev_origin, e.ev_seqno limit 40" - FATAL:  terminating 
> connection due to administrator command
> 2009-04-28 08:47:38 CEST ERROR  remoteListenThread_102: "select 
> con_origin, con_received,     max(con_seqno) as con_seqno,     
> max(con_timestamp) as con_timestamp from 
> "_pns_slony_voila_archi_0".sl_confirm where con_received <> 104 group 
> by con_origin, con_received" 2009-04-28 08:47:38 CEST CONFIG slon: 
> child terminated status: 11; pid: 28152, current worker pid: 28152
> 2009-04-28 08:47:38 CEST CONFIG slon: restart of worker in 10 seconds
>
> The origin of the first error is that postgresql has been restarted on 
> one receiver without stopping slon before. Is it a known source of 
> errors ?
> The slon process has been terminated with status=0 on the receiver 
> where postgresql has been restarted and has segfault (chid status=11) 
> on the others (each watchdog has restart a new slon process). This 
> fact seems to cause the error with events and confirmations.
>
> Regards.
>
> Cyril Scetbon wrote:
>> A simple restart of all slon processes seems to have resolved the 
>> issue. weird ....
>>
>> Cyril Scetbon wrote:
>>> I use test_slony_state and see some informations like :
>>>
>>> Check of event info
>>> ---------------------------------------------------
>>> Problem : Events not propagating to node 2
>>> Problem : Events not propagating to node 4
>>> Problem : Events not propagating to node 3
>>>
>>> Check of sl_confirm aging
>>> ---------------------------------------------------
>>> Confirmations not propagating from 2 to 1
>>> Confirmations not propagating from 2 to 3
>>> Confirmations not propagating from 2 to 4
>>> Confirmations not propagating from 3 to 1
>>> Confirmations not propagating from 3 to 2
>>> Confirmations not propagating from 3 to 4
>>> Confirmations not propagating from 4 to 1
>>> Confirmations not propagating from 4 to 2
>>> Confirmations not propagating from 4 to 3
>>>
>>> You can see the results on one of my databases :
>>>
>>> - for Confirmations
>>>
>>> select con_origin, con_received, min(con_seqno) as minseq, 
>>> max(con_seqno) as maxseq, date_trunc('minutes', 
>>> min(now()-con_timestamp)) as age1, date_trunc('minutes', 
>>> max(now()-con_timestamp)) as age2, min(now() - con_timestamp) > 
>>> '00:30:00' as tooold from _pns_slony_voila_archi_0.sl_confirm group 
>>> by con_origin, con_received order by con_origin, con_received;
>>> con_origin | con_received | minseq  | maxseq  |   age1   |   age2   
>>> | tooold 
>>> ------------+--------------+---------+---------+----------+----------+-------- 
>>>
>>>        101 |          102 | 4464029 | 4464792 | 00:00:00 | 00:16:00 | f
>>>        101 |          103 | 4464027 | 4464792 | 00:00:00 | 00:16:00 | f
>>>        101 |          104 | 4464024 | 4464792 | 00:00:00 | 00:16:00 | f
>>>        102 |          101 |      29 |      29 | 03:39:00 | 03:39:00 | t
>>>        102 |          103 |      29 |      29 | 03:39:00 | 03:39:00 | t
>>>        102 |          104 |      29 |      29 | 03:39:00 | 03:39:00 | t
>>>        103 |          101 |      62 |      62 | 03:39:00 | 03:39:00 | t
>>>        103 |          102 |      62 |      62 | 03:39:00 | 03:39:00 | t
>>>        103 |          104 |      62 |      62 | 03:39:00 | 03:39:00 | t
>>>        104 |          101 |      57 |      57 | 03:39:00 | 03:39:00 | t
>>>        104 |          102 |      57 |      57 | 03:39:00 | 03:39:00 | t
>>>        104 |          103 |      57 |      57 | 03:39:00 | 03:39:00 | t
>>>
>>>
>>> - for Events
>>>
>>> select ev_origin, min(ev_seqno), max(ev_seqno),
>>>         date_trunc('minutes', min(now() - ev_timestamp)),
>>>         date_trunc('minutes', max(now() - ev_timestamp)),
>>>         min(now() - ev_timestamp) > '00:30:00' as agehi
>>>     from _pns_slony_voila_archi_0.sl_event group by ev_origin;
>>> ev_origin |   min   |   max   | date_trunc | date_trunc | agehi
>>> -----------+---------+---------+------------+------------+-------
>>>       103 |      62 |      62 | 03:49:00   | 03:49:00   | t
>>>       104 |      57 |      57 | 03:49:00   | 03:49:00   | t
>>>       102 |      29 |      29 | 03:49:00   | 03:49:00   | t
>>>       101 | 4464493 | 4465346 | 00:00:00   | 00:14:00   | f
>>>
>>>
>>> What can be the source of these errors and how to track them ? FYI, 
>>> I have logs but in debug level 1.
>>>
>>> I did not have issues in 1.2.15.
>>>
>>> Regards.
>>
>

-- 
Cyril SCETBON
_______________________________________________
Slony1-general mailing list
Slony1-general@lists.slony.info
http://lists.slony.info/mailman/listinfo/slony1-general
[prev in list] [next in list] [prev in thread] [next in thread]