
List:       lustre-discuss
Subject:    [Lustre-discuss] Getting MDS failover to work reliably
From:       rs2006ts () hotmail ! com (RS RS)
Date:       2006-08-11 15:40:25
Message-ID: BAY120-F23DE8F87E4C97F21787A05CD4B0 () phx ! gbl


OK, redundancy works for me sometimes, but not always.
I used Heartbeat to fail over from the active MDS (10.200.1.251) to the
standby MDS (10.200.1.252).  I then ran "cat /mnt/lustre/*" on the
client (10.200.1.40), and the client hung.  The OST doesn't
even try to contact the MDS, as far as I can tell.

Here are the key parts of /var/log/messages on my three systems.
Does anyone know what is wrong, and how I can fix it?

-Roger

On MDS:  Why is it giving me an error when the client tries to connect?

Aug 11 17:29:33 roger-ha-2 heartbeat: info: mach_down takeover complete for node roger-ha-1.
Aug 11 17:29:33 roger-ha-2 heartbeat[10093]: info: Exiting status process 10150 returned rc 0.
Aug 11 17:29:33 roger-ha-2 heartbeat[10093]: debug: RscMgmtProc 'status' exited code 0
Aug 11 17:29:41 roger-ha-2 kernel: LustreError: 11139:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.200.1.40@tcp portal 12 match 0x3a offset 0 length 240: no match
Aug 11 17:31:50 roger-ha-2 kernel: LustreError: 11139:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.200.1.40@tcp portal 12 match 0x44 offset 0 length 240: no match


On OST:  Why doesn't it try to connect to Standby MDS?

Aug 11 17:29:09 blade-lustre2 kernel: Lustre: A connection with 10.200.1.251 timed out; the network or that node may be down.
Aug 11 17:29:09 blade-lustre2 kernel: Lustre: 6767:0:(router.c:184:lnet_notify()) Upcall: NID 10.200.1.251@tcp is dead
Aug 11 17:29:09 blade-lustre2 kernel: Lustre: 4:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked portals upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,10.200.1.251@tcp,down,1155331715
Aug 11 17:32:38 blade-lustre2 kernel: Lustre: ost1-ts: haven't heard from 10.200.1.251@tcp in 243 seconds. Last request was at 1155331715. I think it's dead, and I am evicting it.
(I didn't see any more output for several minutes)


On OSC:  Why can't it find the MDS?
Aug 11 17:29:21 blade-lustre0 kernel: Lustre: A connection with 10.200.1.251 timed out; the network or that node may be down.
Aug 11 17:29:21 blade-lustre0 kernel: Lustre: 5539:0:(router.c:184:lnet_notify()) Upcall: NID 10.200.1.251@tcp is dead
Aug 11 17:29:21 blade-lustre0 kernel: Lustre: 4:0:(linux-debug.c:96:libcfs_run_upcall()) Invoked portals upcall /usr/lib/lustre/lnet_upcall ROUTER_NOTIFY,10.200.1.251@tcp,down,1155331702
Aug 11 17:29:39 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-2_UUID
Aug 11 17:31:44 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID
Aug 11 17:31:47 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-2_UUID
Aug 11 17:33:52 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID
Aug 11 17:35:36 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID
Aug 11 17:35:36 blade-lustre0 kernel: Lustre: previously skipped 1 similar messages
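
To compare the excerpts, it may help to merge them into one timeline.
Below is a rough Python sketch, not a Lustre or Heartbeat tool. The
function names (parse_line, merged_timeline, switch_gaps) are made up
for illustration. The year 2006 is an assumption, since syslog lines
carry no year, and the sketch assumes the clocks on the hosts are
roughly in sync. Only a subset of the log lines from this message is
included.

```python
import re
from datetime import datetime

# Excerpts copied from the logs in this message, one syslog entry per string.
LOG_LINES = [
    "Aug 11 17:29:33 roger-ha-2 heartbeat: info: mach_down takeover complete for node roger-ha-1.",
    "Aug 11 17:29:41 roger-ha-2 kernel: LustreError: 11139:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.200.1.40@tcp portal 12 match 0x3a offset 0 length 240: no match",
    "Aug 11 17:31:50 roger-ha-2 kernel: LustreError: 11139:0:(lib-move.c:152:lnet_match_md()) Dropping PUT from 12345-10.200.1.40@tcp portal 12 match 0x44 offset 0 length 240: no match",
    "Aug 11 17:29:39 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-2_UUID",
    "Aug 11 17:31:44 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID",
    "Aug 11 17:31:47 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-2_UUID",
    "Aug 11 17:33:52 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID",
    "Aug 11 17:35:36 blade-lustre0 kernel: Lustre: Changing connection for MDC_blade-lustre0_ha-mds_MNT_client to roger-ha-1_UUID",
]

# "Mon DD HH:MM:SS host rest-of-message"
SYSLOG_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")


def parse_line(line, year=2006):
    """Return (timestamp, host, message), or None if the line is not syslog-shaped."""
    match = SYSLOG_RE.match(line)
    if match is None:
        return None
    stamp, host, message = match.groups()
    # The year is assumed; syslog timestamps do not include it.
    when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return when, host, message


def merged_timeline(lines, year=2006):
    """Parse all lines and sort them by timestamp across hosts."""
    events = [e for e in (parse_line(l, year) for l in lines) if e is not None]
    return sorted(events, key=lambda e: e[0])


def switch_gaps(events):
    """Seconds between consecutive 'Changing connection' messages."""
    times = [when for when, _host, msg in events if "Changing connection" in msg]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    timeline = merged_timeline(LOG_LINES)
    for when, host, message in timeline:
        print(when.strftime("%H:%M:%S"), host, message[:60])
    print("seconds between MDC connection switches:", switch_gaps(timeline))
```

On these excerpts the gaps between connection switches come out as 125,
3, 125 and 104 seconds, and each "Dropping PUT" on roger-ha-2 is logged
two to three seconds after the client switches to roger-ha-2_UUID.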
