List: linux-ha-dev
Subject: [Linux-ha-dev] Partial CTS Test Run: March 15 : 8 nodes,
From: Andrew Beekhof (GMail) <beekhof () gmail ! com>
Date: 2005-03-15 22:45:17
Message-ID: e47a24a7bac85efea3ba944efcd2a478 () gmail ! com
Notes:
In addition to a resource that follows the DC, each node now has a
resource with a preference to run there. The audits check that if
node X is running, resource X is running only on that node.
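The audit rule above can be sketched roughly as follows. This is only
an illustration, not the CTS audit code: the function name, the
"rsc_<node>" naming scheme and the data structures are all assumptions
made for the example, and "only running on that node" is read here as
"running on that node and nowhere else".

```python
def audit_preferred_resources(online_nodes, placements):
    """Return a list of (resource, preferred_node, actual_nodes) violations.

    online_nodes: set of node names that are currently up.
    placements:   dict mapping resource name -> set of nodes it runs on.

    Assumption (hypothetical naming): the resource that prefers node X
    is called "rsc_X".
    """
    violations = []
    for node in sorted(online_nodes):
        rsc = "rsc_" + node
        running_on = placements.get(rsc, set())
        # The node is up, so its resource must run there and nowhere else.
        if running_on != {node}:
            violations.append((rsc, node, sorted(running_on)))
    return violations
```

Resources whose preferred node is down are deliberately not checked,
since the rule only applies while node X is running.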
The deadtimes being used for testing have been dramatically reduced:
keepalive 100ms
warntime 1
deadtime 2
initdead 4
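To put these numbers in perspective, the small calculation below works
out how many keepalive intervals fit into each timeout. It assumes
that values without a unit are in seconds; the helper name to_ms is
made up for this sketch.

```python
def to_ms(value):
    """Convert a timing value to milliseconds.

    Assumption: a bare number means seconds, an "ms" suffix means
    milliseconds.
    """
    value = value.strip()
    if value.endswith("ms"):
        return int(value[:-2])
    return int(float(value) * 1000)


timings = {"keepalive": "100ms", "warntime": "1", "deadtime": "2", "initdead": "4"}
ms = {name: to_ms(value) for name, value in timings.items()}

# Number of keepalive intervals that fit into each timeout.
missed = {
    name: ms[name] // ms["keepalive"]
    for name in ("warntime", "deadtime", "initdead")
}
print(missed)  # {'warntime': 10, 'deadtime': 20, 'initdead': 40}
```

Under that assumption a node is declared dead after roughly 20 missed
heartbeats.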
Only a partial run, but some interesting results anyway.
Andrew
Failures:
Test 73 (Restart)
This actually passed, as can be seen during the audit which notes that
the node is indeed up. The actual problem was log loss plus a faulty
backup test (which has since been fixed).
Test 76 (Restart)
This test passed but heartbeat aborted due to excessive CPU usage
during the audit. This is a known problem in large Heartbeat clusters
and is being investigated as bug 298.
Test 77 (SimulStop)
The same CPU problem as in test 76.
Test 80 (SimulStart)
I was a bit slow killing the lrmd on c001n01 (so that I didn't have to
keep explaining why it was still running; I got to c001n06 fast
enough). In the end I killed all heartbeat-related processes.
A few minutes later, heartbeat aborted due to CPU usage on c001n06,
n02, n07, n04, n05, n03 and finally on n08. At that point I aborted
the tests and wrote this up.
Development aborts were triggered during tests 3 (StartOnebyOne) and 53
(StopOnebyOne):
crmd: [29805]: CRIT: fn(copy_lrm_op): Triggered dev assert at
utils.c:1222 : op->rsc != NULL
These are related to LRM failures; more information can be found
under bug 332.
BadNews:
Test 53 (StopOnebyOne)
Mar 15 21:31:33 c001n08 crmd: [29805]: ERROR: fn(do_lrm_rsc_op):
Operation start on c001n07 failed
An unknown error prevented the operation from completing.
Investigation is currently blocked by bug 333.
Test 24 (SimulStart)
Mar 15 19:31:27 c001n01 crmd: [7144]: ERROR: fn(cib_client_sync_from):
Could not retrive current CIB.
Mar 15 19:31:27 c001n01 crmd: [7144]: ERROR: fn(do_dc_join_finalize):
Sync from c001n05 resulted in an error: No master service is currently
active
This error message is somewhat inaccurate, and a new "remote host timed
out" error has since been created. The likely cause is that the CIB
on c001n05 had not yet updated its membership list from CCM, so our
request was discarded. A lower timeout combined with a retry procedure
is being tested.
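The general shape of a "short timeout plus retry" approach is sketched
below. It is not the crmd code: SyncTimeout, sync_with_retry, the
fetch callback and all default values are hypothetical and only
illustrate the idea of retrying a sync request that was silently
discarded by the remote side.

```python
import time


class SyncTimeout(Exception):
    """Raised by the (hypothetical) fetch callback when the remote host times out."""


def sync_with_retry(fetch, attempts=3, timeout_s=2.0, delay_s=0.5, sleep=time.sleep):
    """Call fetch(timeout_s) until it succeeds or the attempts run out.

    fetch: callable taking a timeout in seconds and returning the CIB,
           or raising SyncTimeout if no reply arrived in time.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(timeout_s)
        except SyncTimeout as err:
            last_error = err
            # Pause before the next attempt so the peer has a chance
            # to update its membership list.
            if attempt < attempts:
                sleep(delay_s)
    raise last_error
```

A short timeout keeps each failed attempt cheap, while the retry
covers the window in which the peer has not yet learned about us.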
Test 78 (SimulStop)
Mar 15 23:01:18 c001n01 lrmd: [13887]: ERROR: already running: [pid
13458].
Mar 15 23:01:18 c001n01 lrmd: [13887]: ERROR: Startup aborted (already
running). Shutting down.
Mar 15 23:03:21 c001n06 lrmd: [24795]: ERROR: already running: [pid
16706].
Mar 15 23:03:21 c001n06 lrmd: [24795]: ERROR: Startup aborted (already
running). Shutting down.
These are a side effect of the heartbeat failures in tests 76 and 77.
Test 79 (SimulStop)
Mar 15 23:05:15 c001n01 lrmd: [13961]: ERROR: already running: [pid
13458].
Mar 15 23:05:15 c001n01 lrmd: [13961]: ERROR: Startup aborted (already
running). Shutting down.
Mar 15 23:06:51 c001n06 lrmd: [25005]: ERROR: already running: [pid
16706].
Mar 15 23:06:51 c001n06 lrmd: [25005]: ERROR: Startup aborted (already
running). Shutting down.
Because the running lrmd is no longer a managed process of the current
heartbeat, it is never shut down. It would be nice if heartbeat could
adopt orphan lrmd processes, but I can't see it happening any time soon.
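For readers unfamiliar with this kind of "already running" check, the
usual pattern is to test whether a previously recorded pid still
belongs to a live process. The sketch below shows that pattern in
general terms; it is not taken from lrmd, and both function names are
invented for the example.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this pid exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 checks for existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def should_start(recorded_pid):
    """Refuse to start while an earlier instance is still alive."""
    return recorded_pid is None or not pid_is_alive(recorded_pid)
```

With a check like this, a new instance keeps refusing to start for as
long as the orphaned one survives, which matches the repeated errors
seen in tests 78 and 79.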
--
Andrew Beekhof
"Would the last person to leave please turn out the enlightenment?" -
TISM
_______________________________________________________
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/