List: ssic-linux-users
Subject: Re: [SSI-users] Full HA with only 2 computers ?? --- Drbd root-failover
From: Andreas <roos () convis ! de>
Date: 2004-05-24 13:42:23
Message-ID: c8sua8$p82$1 () sea ! gmane ! org
Hello
Jaideep Dharap wrote:
>
> I have tried drbd-root failover successfully. I have compiled a tar-ball
> that includes a How-to, sample configuration files
> and the openssi-enabled drbd code. The process does require manual
> tweaking of the ramdisk since it is not yet integrated
> with mkinitrd and installation. But the steps are pretty
> straightforward and outlined in the How-to.
> The tar ball is available at
> http://www.openssi.org/contrib/.
> I am working on an rpm that should install the modules and drbd utilities
> on an openssi cluster. Right now the tar-ball
> contains code that needs to be compiled and installed.
> Let me know if there are any questions and let us all know how it goes
> for you if you do end up doing drbd-failover :-).
> Jai.
>
I was able to follow the instructions in the how-to, but now I have a
problem. I have a configuration with two nodes (Debian); node 1 is the
initnode. After node 2 boots, the resync process starts, and after that
the cluster works fine. The problem is with the failover. After I
turn off node 1, node 2 takes over. While recovering, it runs the script
rc.sysrecover. I think that script must be updated too: for DEVICE it
still calls findfs. I changed that line to DEVICE=/dev/nbd/0 and it
works fine. Before I did that, /etc/mtab was wrong because fix_mtab
wasn't called. The output of df was
NOTAVAIL 3842376 3113847 256799 90% /
But like I said, that was easy to fix. Was that correct?
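
To illustrate the kind of change described above, here is a minimal
sketch of a device lookup that prefers the DRBD device and only falls
back to findfs when that device is missing. It is not the actual content
of rc.sysrecover: the function name pick_root_device and the variables
ROOT_DRBD_DEVICE and ROOT_LABEL are assumptions made for illustration.
Only /dev/nbd/0 and the use of findfs come from the description above.

```shell
# Sketch only, not the real rc.sysrecover logic.
# ROOT_DRBD_DEVICE and ROOT_LABEL are hypothetical variable names.
pick_root_device() {
    # Device used in the message above; override via ROOT_DRBD_DEVICE.
    drbd_dev="${ROOT_DRBD_DEVICE:-/dev/nbd/0}"
    if [ -b "$drbd_dev" ]; then
        # The DRBD block device exists, so use it directly.
        echo "$drbd_dev"
    else
        # Fall back to a label lookup (the label value is an assumption).
        findfs "LABEL=${ROOT_LABEL:-/}"
    fi
}

# Possible use inside a recovery script:
#   DEVICE=$(pick_root_device)
```

Hard-coding DEVICE=/dev/nbd/0 is simpler; the fallback only matters if
the same script also has to run on a cluster without DRBD root.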
The next problem is that when I try to reboot node 2 after the failover
(it is the last remaining node in the cluster), I get a kernel panic.
That occurs when the system tries to unmount the local filesystems.
Another problem is with the boot manager. I still use lilo, but
after the sync with node 1, lilo no longer works on node 2. I think
the synchronisation changes the MBR of node 2's disk so that
lilo cannot work. After I boot node 2 from a Knoppix CD and run lilo
again (after chroot), it works again.
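
The rescue steps just described (boot from a live CD, chroot into the
node's root filesystem, rerun lilo) could be sketched roughly as below.
The partition /dev/hda1 and the mount point /mnt/node2 are placeholders,
not values from this setup, and the helper name reinstall_lilo is made
up. By default the function only prints the commands; set DRY_RUN=0 to
actually run them. On a devfs system /dev may also need to be available
inside the chroot, which this sketch does not handle.

```shell
# Sketch only: rerun lilo from a rescue environment.
# The default partition and mount point are hypothetical placeholders.
reinstall_lilo() {
    root_part="${1:-/dev/hda1}"   # assumed root partition of node 2
    mnt="${2:-/mnt/node2}"        # assumed temporary mount point

    # Print each command in dry-run mode (default), otherwise execute it.
    run() {
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "$*"
        else
            "$@"
        fi
    }

    run mkdir -p "$mnt"
    run mount "$root_part" "$mnt"
    run chroot "$mnt" /sbin/lilo   # rewrites the boot loader from lilo.conf
    run umount "$mnt"
}
```

Running reinstall_lilo with no arguments prints the four commands so
they can be checked before anything is changed on disk.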
I hope somebody has some ideas on how to help.
Andreas
>>
>> Eric Piollet wrote:
>>
>>> I have only 2 computers.
>>> Computer n1 runs openldap + sendmail (or later postfix) + imap +
>>> DNS + LAMPP on RH 9 (groupware applications). I would like to run
>>> full openssi across my 2 computers:
>>> Services: I can get some benefit from using 2 nodes instead of one.
>>> HA: Replication from computer n1 to computer n2 -> *without a shared
>>> disk*, but somewhat like a drbd system.
>>> So if my computer n1 is down, computer n2 can reboot from its
>>> own disk without losing my data.
>>>
>>> Is it possible at this time?
>>>
>>>
>>>
>>>
>>
>> I don't have a good answer for you, but I can tell you what I've tried
>> so far, and hopefully some others on the list with more knowledge of
>> OpenSSI will chime in.
>>
>> My first approach was to use DRBD to mirror the root filesystem (and
>> another filesystem) to the second node. However, I was never able to
>> figure out how to get the boot sequence to handle the mounting of a
>> root filesystem on a DRBD device, because the timing of the boot
>> process didn't match the timing of the DRBD device becoming available.
>> I know several people on the list are working on this approach, but I
>> haven't heard anything recently about the status of their efforts. I
>> also don't have a clear picture of how the failover would work. My
>> intent was to keep the root filesystem mirrored so that in the case of
>> the primary node's failure, the secondary would boot from its copy of
>> the primary's root filesystem (instead of booting from an Etherboot
>> CDROM, as it does otherwise), and should come up as though it were the
>> primary node. However, this still seems to have the problem that the
>> MAC addresses in /etc/clustertab would reflect the NICs in the old
>> primary. Nevertheless, this seems to be the best long-term approach,
>> and any comments from others on the list who are working on this would
>> be welcome.
>>
>> I've also considered using either ISCSI or Lustre with a separate
>> (probably non-SSI) machine as the root filesystem, but this represents
>> a single point of failure. I'm also not clear whether Lustre offers
>> any advantage over ISCSI here - it seems to add an unnecessary level
>> of complexity to the boot process.
>>
>> My current thinking is to mirror the primary's root filesystem to the
>> secondary via periodic rsyncs. I may be able to get away with this
>> because the systems should be fairly static once they're configured,
>> and there isn't much critical application data stored on the root.
>> Obviously this approach won't work for every application. The
>> advantage I see of doing it this way is that I don't have to deal with
>> the complexity of getting DRBD involved in the boot sequence, and I
>> can exclude the few files (/etc/clustertab is all I know about so far)
>> that should be kept un-mirrored on the secondary. I might still use
>> DRBD for non-root filesystems if I needed real mirroring.
>>
>> While this probably gets me a backup primary that can be brought up
>> fairly quickly in the case of a total failure of the original primary,
>> I'm still not clear on what I need to do to automate the failover. I
>> assume I need to modify the heartbeat scripts, and probably other boot
>> scripts, to force a reboot of the secondary node and restart
>> processes. Any pointers on which files I should be looking at would be
>> appreciated.
>>
>>
>>
>> -------------------------------------------------------
>> This SF.Net email is sponsored by: Oracle 10g
>> Get certified on the hottest thing ever to hit the market... Oracle
>> 10g. Take an Oracle 10g class now, and we'll give you the exam FREE.
>> http://ads.osdn.com/?ad_id149&alloc_id66&op=click
>> _______________________________________________
>> Ssic-linux-users mailing list
>> Ssic-linux-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/ssic-linux-users
>>
>>