
List:       ssic-linux-users
Subject:    Re: [SSI-users] Full HA with only 2 computers  ?? --- Drbd root-failover
From:       Andreas <roos () convis ! de>
Date:       2004-05-24 13:42:23
Message-ID: c8sua8$p82$1 () sea ! gmane ! org

Hello

Jaideep Dharap wrote:
> 
> I have tried drbd-root failover successfully. I have compiled a tar-ball 
> that includes a How-to, sample configuration files,
> and the openssi-enabled drbd code. The process does require manual 
> tweaking of the ramdisk, since it is not yet integrated 
> with mkinitrd and installation. But the steps are pretty 
> straightforward and outlined in the How-to.
> The tar ball is available at
>     http://www.openssi.org/contrib/.
> I am working on an RPM that should install the modules and drbd utilities 
> on an openssi cluster. Right now the tar-ball
> contains code that needs to be compiled and installed.
> Let me know if there are any questions, and let us all know how it goes 
> for you if you do end up doing drbd-failover :-).
>                             Jai.
> 
I was able to follow the instructions in the how-to, but now I have a 
problem. I have a configuration with two nodes (Debian); node 1 is the 
initnode. After node 2 boots, the resync process starts, and after that 
the cluster works fine. The problem I have is with the failover. After I 
turn off node 1, node 2 takes over. While recovering, it runs the script 
rc.sysrecover, and I think that script must be updated too: for DEVICE it 
still calls findfs. I changed that line to DEVICE=/dev/nbd/0 and it 
works fine. Before I did that, /etc/mtab was wrong because fix_mtab 
wasn't called. The output of df was


NOTAVAIL 	3842376	3113847	256799	90%	/


But as I said, that was easy to fix. Was that the correct fix?
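
In case it helps anyone else, the change looks roughly like this; the 
exact findfs call in rc.sysrecover may differ on other installs, and 
LABEL=/ is only shown as an example:

    # rc.sysrecover: original line (roughly), resolving the root device via findfs
    #DEVICE=`findfs LABEL=/`
    # hard-coded to the DRBD/NBD root device so the surviving node mounts
    # its own copy of the root filesystem after the failover
    DEVICE=/dev/nbd/0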

The next problem I have is that after the failover, when I try to reboot node 
2 (the last remaining node in the cluster), I get a kernel panic. 
That occurs when the system tries to unmount the local filesystems.

Another problem I have is with the boot manager. I still use lilo, but 
after the sync with node 1, lilo no longer works. I think the 
synchronisation changes the MBR of node 2's disk, so lilo can no 
longer boot. After I start node 2 with a Knoppix CD and run lilo 
again (after a chroot), it works again.
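
For reference, what I did from the Knoppix shell was roughly the 
following; the partition name /dev/hda1 is only an example, use 
whatever node 2's root filesystem is actually on:

    mount /dev/hda1 /mnt        # node 2's root filesystem (assumed partition)
    chroot /mnt /sbin/lilo      # reinstall the boot loader into the MBR
    umount /mnt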


I hope somebody has some ideas that might help.

Andreas

>>
>> Eric Piollet wrote:
>>
>>> I have only 2 computers:
>>> Computer n1 runs openldap + sendmail (or later postfix) + imap + 
>>> DNS + LAMPP on RH 9 (groupware applications). I would like to have 
>>> full OpenSSI with my 2 computers:
>>> Services: I can get some benefit from using 2 nodes instead of one.
>>> HA: replication from computer n1 to computer n2 -> *without a shared 
>>> disk*, but a little like a drbd setup.
>>> So if my computer n1 is down, computer n2 can reboot with its 
>>> own disk without losing my data.
>>>  
>>> Is that possible at the moment?
>>>  
>>>  
>>>  
>>>
>>
>> I don't have a good answer for you, but I can tell you what I've tried 
>> so far, and hopefully some others on the list with more knowledge of 
>> OpenSSI will chime in.
>>
>> My first approach was to use DRBD to mirror the root filesystem (and 
>> another filesystem) to the second node. However, I was never able to 
>> figure out how to get the boot sequence to handle the mounting of a 
>> root filesystem on a DRBD device, because the timing of the boot 
>> process didn't match the timing of the DRBD device becoming available. 
>> I know several people on the list are working on this approach, but I 
>> haven't heard anything recently about the status of their efforts. I 
>> also don't have a clear picture of how the failover would work. My 
>> intent was to keep the root filesystem mirrored so that in the case of 
>> the primary node's failure, the secondary would boot from its copy of 
>> the primary's root filesystem (instead of booting from an Etherboot 
>> CDROM, as it does otherwise), and should come up as though it were the 
>> primary node. However, this still seems to have the problem that the 
>> MAC addresses in /etc/clustertab would reflect the NICs in the old 
>> primary. Nevertheless, this seems to be the best long-term approach, 
>> and any comments from others on the list who are working on this would 
>> be welcome.
>>
>> I've also considered using either ISCSI or Lustre with a separate 
>> (probably non-SSI) machine as the root filesystem, but this represents 
>> a single point of failure. I'm also not clear whether Lustre offers 
>> any advantage over ISCSI here - it seems to add an unnecessary level 
>> of complexity to the boot process.
>>
>> My current thinking is to mirror the primary's root filesystem to the 
>> secondary via periodic rsyncs. I may be able to get away with this 
>> because the systems should be fairly static once they're configured, 
>> and there isn't much critical application data stored on the root. 
>> Obviously this approach won't work for every application. The 
>> advantage I see of doing it this way is that I don't have to deal with 
>> the complexity of getting DRBD involved in the boot sequence, and I 
>> can exclude the few files (/etc/clustertab is all I know about so far) 
>> that should be kept un-mirrored on the secondary. I might still use 
>> DRBD for non-root filesystems if I needed real mirroring.
>>
>> While this probably gets me a backup primary that can be brought up 
>> fairly quickly in the case of a total failure of the original primary, 
>> I'm still not clear on what I need to do to automate the failover. I 
>> assume I need to modify the heartbeat scripts, and probably other boot 
>> scripts, to force a reboot of the secondary node and restart 
>> processes. Any pointers on which files I should be looking at would be 
>> appreciated.
>>
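
For reference, the periodic rsync mirroring described in the quoted 
message above could look something like this; the hourly schedule, the 
host name "node2", and the extra exclusions are only assumptions, 
/etc/clustertab is the one file mentioned in the thread:

    # root crontab entry (assumed hourly schedule)
    # 0 * * * *  rsync -aHx --delete --exclude=/etc/clustertab \
    #     --exclude=/proc --exclude=/tmp  /  node2:/
    #
    # -a  archive mode (permissions, owners, timestamps, symlinks)
    # -H  preserve hard links
    # -x  stay on the root filesystem
    # --delete  remove files on node2 that no longer exist on the primary
    # --exclude=/etc/clustertab  keep the secondary's own clustertab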




_______________________________________________
Ssic-linux-users mailing list
Ssic-linux-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/ssic-linux-users
