List:       linux-ha-dev
Subject:    [Linux-ha-dev] Partial CTS Test Run: March 15 : 8 nodes,
From:       Andrew Beekhof (GMail) <beekhof () gmail ! com>
Date:       2005-03-15 22:45:17
Message-ID: e47a24a7bac85efea3ba944efcd2a478 () gmail ! com

Notes:
In addition to a resource that follows the DC, each node now has a 
resource with a preference to run there.  The audits check that, if 
node X is running, resource X is running only on that node.
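The per-node audit described above could be sketched roughly as follows.  
This is only an illustration, not the actual CTS audit code: the function 
name, the data structures, and the convention that resource X is named 
after node X are all assumptions made for the example.  The post does not 
say whether a stopped resource counts as a failure, so the sketch flags 
only resources that are active somewhere other than their preferred node.

```python
def audit_preferred_resources(running_nodes, placement):
    """Return a list of (node, stray_nodes) violations.

    running_nodes: set of node names that are currently up.
    placement: dict mapping a resource name to the set of nodes where
               that resource is active.
    Assumption (hypothetical): the resource preferring node X is
    itself named X.
    """
    violations = []
    for node in sorted(running_nodes):
        active_on = placement.get(node, set())
        # Anything active outside the preferred node is a violation.
        stray = active_on - {node}
        if stray:
            violations.append((node, sorted(stray)))
    return violations
```

Resources whose preferred node is down are skipped, since the audit rule 
only applies while node X is running.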

The deadtimes being used for testing have been dramatically reduced:
	keepalive      100ms
	warntime       1
	deadtime       2
	initdead       4

Only a partial run, but some interesting results anyway.
Andrew

Failures:
Test 73 (Restart)
	This actually passed, as the audit shows: it notes that the node is 
indeed up.  The actual problem was log loss plus a faulty backup test 
(which has since been fixed).

Test 76 (Restart)
	This test passed but heartbeat aborted due to excessive CPU usage 
during the audit.  This is a known problem in large Heartbeat clusters 
and is being investigated as bug 298.

Test 77 (SimulStop)
	The same CPU problem as test 76.

Test 80 (SimulStart)
	I was a bit slow killing the lrmd on c001n01 (so that I didn't have to 
keep explaining why it was still running; c001n06 I got to fast enough).  
In the end I killed all heartbeat-related processes.
	A few minutes later, heartbeat aborted due to CPU usage on c001n06, 
n02, n07, n04, n05, n03 and finally on n08.  At that point I aborted 
the tests and wrote this up.

Development aborts were triggered during tests 3 (StartOnebyOne) and 53 
(StopOnebyOne):
	crmd: [29805]: CRIT: fn(copy_lrm_op): Triggered dev assert at utils.c:1222 : op->rsc != NULL
These are related to LRM failures; more information can be found 
under bug 332.

BadNews:
Test 53 (StopOnebyOne)
	Mar 15 21:31:33 c001n08 crmd: [29805]: ERROR: fn(do_lrm_rsc_op): Operation start on c001n07 failed
An unknown error prevented the operation from completing.  
Investigation is currently blocked by bug 333.

Test 24 (SimulStart)
	Mar 15 19:31:27 c001n01 crmd: [7144]: ERROR: fn(cib_client_sync_from): Could not retrive current CIB.
	Mar 15 19:31:27 c001n01 crmd: [7144]: ERROR: fn(do_dc_join_finalize): Sync from c001n05 resulted in an error: No master service is currently active
This error message is somewhat inaccurate, and a new "remote host timed 
out" error has since been created.  The cause is likely that the CIB 
on c001n05 had not yet updated its membership list from CCM, so our 
request was discarded.  A lower timeout and retry procedure is being 
tested.
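As a rough illustration of the "lower timeout plus retry" idea, here is a 
minimal sketch.  It is not the crmd/CIB implementation: SyncTimeout, 
sync_with_retry, the fetch callable, and all the default values are 
hypothetical names and numbers chosen for the example.

```python
import time


class SyncTimeout(Exception):
    """Hypothetical stand-in for a "remote host timed out" error."""


def sync_with_retry(fetch, attempts=3, timeout=2.0, delay=0.5, sleep=time.sleep):
    """Call fetch(timeout) until it succeeds or attempts run out.

    fetch: callable that returns the synced data or raises SyncTimeout.
    sleep: injectable so the function can be exercised without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(timeout)
        except SyncTimeout as exc:
            last_error = exc
            # Give the peer time to catch up (for example, to refresh
            # its membership list) before asking again.
            if attempt < attempts:
                sleep(delay)
    raise last_error
```

A short timeout with a few retries means a request discarded by a peer 
that is not ready yet costs only a brief delay rather than failing the 
join outright.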

Test 78 (SimulStop)
	Mar 15 23:01:18 c001n01 lrmd: [13887]: ERROR: already running: [pid 13458].
	Mar 15 23:01:18 c001n01 lrmd: [13887]: ERROR: Startup aborted (already running).  Shutting down.
	Mar 15 23:03:21 c001n06 lrmd: [24795]: ERROR: already running: [pid 16706].
	Mar 15 23:03:21 c001n06 lrmd: [24795]: ERROR: Startup aborted (already running).  Shutting down.
These are a side effect of the heartbeat failures in tests 76 and 77.

Test 79 (SimulStop)
	Mar 15 23:05:15 c001n01 lrmd: [13961]: ERROR: already running: [pid 13458].
	Mar 15 23:05:15 c001n01 lrmd: [13961]: ERROR: Startup aborted (already running).  Shutting down.
	Mar 15 23:06:51 c001n06 lrmd: [25005]: ERROR: already running: [pid 16706].
	Mar 15 23:06:51 c001n06 lrmd: [25005]: ERROR: Startup aborted (already running).  Shutting down.
Because the running lrmd is no longer a managed process of the current 
heartbeat, it is never shut down.  It would be nice if heartbeat could 
adopt orphan lrmd processes, but I can't see it happening any time soon.
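Until then, the orphans can at least be spotted from the logs.  The 
following is a small sketch that pulls the host and the PID of the 
already-running lrmd out of lines like the ones above.  The function name 
and the regular expression are assumptions based solely on the log lines 
quoted in this report, not on any documented log format.

```python
import re

# Matches e.g. "... c001n01 lrmd: [13887]: ERROR: already running: [pid 13458]."
ALREADY_RUNNING = re.compile(
    r"(?P<host>\S+) lrmd: \[\d+\]: ERROR: already running: \[pid (?P<orphan>\d+)\]"
)


def find_orphan_lrmds(lines):
    """Map each host to the set of PIDs of lrmd processes reported as already running."""
    orphans = {}
    for line in lines:
        match = ALREADY_RUNNING.search(line)
        if match:
            orphans.setdefault(match.group("host"), set()).add(int(match.group("orphan")))
    return orphans
```

Fed the messages from tests 78 and 79, this reports one lingering lrmd 
per affected node, which matches the same PID being reported in both tests.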

--
Andrew Beekhof

"Would the last person to leave please turn out the enlightenment?" - 
TISM

_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/