
List:       evms-devel
Subject:    [Evms-devel] Re: [Evms-cluster] Error while intializing the EVMS engine
From:       Steve Dobbelstein <steved () us ! ibm ! com>
Date:       2004-12-07 20:56:24
Message-ID: OFE72B8127.C5AB5FF4-ON06256F63.007140C8-06256F63.0073072E () us ! ibm ! com





vijay agrawal <agrawal_vijay2000@yahoo.co.in> wrote on 12/07/2004 03:01:23
AM:

> hi steve/evms-cluster guys/linux-HA cluster guys !

Greetings.

> i have taken all the evms daemon logs and engine logs
> again by using -d debug option. One more thing i want
> to state is that..
> i am telling the sequence that i have used for
> starting each service/daemon.
> first of all i have started the heartbeat service at
> machine1 (master node) then i started the heartbeat
> service at fire node. then i start the ccm service by
> using /usr/lib/heartbeat/ccm on the command line at
> machine1 and after that on fire node. then i started
> the evmsd with -d debug option on the machine1 after
> starting it gives the following error:
> .......Daemon: There was an error when connecting to
> linux-HA. The error code was 11 : Resource temporarily
> unavailable. EVMS only manage local devices on this
> system. ( this thing i forgot to include in my
> previous message, i am sorry for that..actually this
> problem is displayed only a single time after
> rebooting the machine if i stop the emvsd and then run
> again it doesn't displays this error. )

With some older versions of Linux-HA it can take a while for the heartbeat
code to settle before it is ready.  I used to wait ten seconds between
starting heartbeat and starting evmsd in my testing.  In this case, the old
programmer's method of trying the same action and hoping for a different
result is actually the right thing to do. :)

> then i started the evmsd with the -d debug option on
> the fire node.
> then i started the evms engine by using evmsgui -d
> debug on the machine1 node. then after waiting for
> around 25 minutes it gives an error message:
>
> Engine: there was an error when starting EVMS
> on the other node in the cluster.the error code was
> connection timed out : evms will only manage local
> devices on this system.
>
> am i wrong in any sequence ???

The sequence is correct.

> If yes what is the right sequence for starting these
> services/daemons?
> Is it due to some version mismatch.
> I am using fedora core 1 with a built kernel 2.6.8.1
> on both the machine and heartbeat version 1.0.4 and
> evms version 2.4.0.

EVMS can handle the older versions of Linux-HA, although they haven't been
tested with EVMS version 2.4.0.

> If there can be any version mismatch plz suggest me
> the most suitable version of EVMS and heartbeat with
> fedora core 1. (plz ignore the heartbeat-1.2.3 version
> as i have already tried this version and at that time
> my ccm is not working well)
>
> thanks and best regards
> vijay

I looked at the logs you attached.  There seems to be an installation
problem on the fire machine.

Some background for your edification:   When you run evmsgui on machine1,
EVMS will open the Engine on the other nodes in the cluster as well.  It
does this by sending a request to open the Engine to the daemons running on
the other nodes in the cluster.  The daemon then launches evmsd_worker
which will handle the invocation of EVMS APIs on that node of the cluster,
including the initial evms_open_engine() API.

The evms-daemon.log on fire shows that the daemon tried to execvp()
evmsd_worker and got back error code 2: No such file or directory.  Is
evmsd_worker installed on fire?  (It should be installed on machine1, too.)
Is it installed in a path where evmsd can find it?  The default location is
/sbin.

What happens is that the daemon waits for evmsd_worker to handle the
evms_open_engine(), which it never does since it was never run.  The Engine
back on machine1 waits for 10 minutes before it times out the call to open
the engine on the other nodes.  Since the open of the Engine failed, it
then issues a call to close the Engine on the other nodes in the cluster in
case any of the other nodes successfully opened the Engine.  (Yes, I know
that in a two-node environment there is no need to close the Engine on the
other node if the Engine didn't open.  But EVMS is written to handle more
than two nodes in a cluster.)  The Engine then waits another 10 minutes
before it times out the call to close the Engine on the other node.  Thus,
you have a 20+ minute delay before evmsgui finally fails.

Make sure that evmsd_worker is installed in a place where evmsd can find it
and launch it.  Then see if evmsgui will start successfully for you.

Steve D.

> --- Steve Dobbelstein <steved@us.ibm.com> wrote:
> >
> >
> >
> >
> > vijay agrawal <agrawal_vijay2000@yahoo.co.in> wrote
> > on 12/06/2004 05:59:53
> > AM:
> >
> > > hi steve !!
> > > thanks for replying..
> > > i am attaching the log of both the machine ..
> > > machine1 is the master node and the fire machine
> > is
> > > the another node of the cluster. the error is same
> > at
> > > the starting of EVMS
> > > Engine: there was an error when starting EVMS on
> > > the other node in the cluster.the error code was
> > 110:
> > > connection timed out : evms will only manage local
> > > devices on this system.
> > >
> > > i have looked in the logs too..
> > > it's giving out some membership errors and an
> > error
> > > that ece deamon is not running yet.
> > > i have attached the ha log and debug files too.
> > >
> > > plz help me in this matter..
> > >
> > > thanks and best regards..
> > > vijay
> >
> > Hi, Vijay.
> >
> > I only got a few clues from the logs.
> >
> > The latest daemon log from machine fire
> > (evms-daemon.log) showed that the
> > Linux-HA plug-in in EVMS got errors back when it
> > tried to send a message
> > over the IPC to the other node.  I don't know what
> > would cause that
> > offhand.  The error is coming from heartbeat.
> >
> > The other daemon logs from both machines look OK.
> >
> > The Engine logs from machine machine1 show that the
> > logging level was left
> > at "default".  There is not a whole lot of debugging
> > information at the
> > default debug level.  Please rerun the test with the
> > debug level set to
> > "debug" on machine1.  Either run "evmsgui -d debug"
> > or set the
> > "debug_level" in the "engine" section of evms.conf
> > to "debug" on machine1.
>
> >
> > Steve D.
> >
> > > --- Steve Dobbelstein <steved@us.ibm.com> wrote:
> > > >
> > > >
> > > >
> > > >
> > > > vijay agrawal <agrawal_vijay2000@yahoo.co.in>
> > wrote
> > > > on 12/03/2004 06:57:48
> > > > AM:
> > > >
> > > > > HI all !
> > > >
> > > > Hi, Vijay.
> > > >
> > > > > I have switched to the 1.0.4 version of
> > heartbeat
> > > > and
> > > > > Now it's fine. Now there is no error while
> > > > starting
> > > > > /usr/lib/heartbeat/ccm and it runs on both the
> > > > machine
> > > > > on the cluster.After that i run the evmsd on
> > both
> > > > the
> > > > > machines. Now when i started evmsgui on the
> > master
> > > > > node it displays a message for a long time
> > that
> > > > evms
> > > > > is starting engine at the other nodes of the
> > > > cluster
> > > > > then after a long time it gives an error
> > ..below
> > > > is
> > > > > the error:
> > > > > Engine: there was an error when starting EVMS
> > on
> > > > the
> > > > > other node in the cluster.the error code was
> > 110:
> > > > > connection timed out : evms will only manage
> > local
> > > > > devices on this system.
> > > >
> > > > This usually happens when there is a break in
> > > > communication between the
> > > > nodes.  On older versions of Linux-HA I have
> > > > occasionally seen messages
> > > > sent to the CCM but then no reply that the CCM
> > sent
> > > > the message
> > > > successfully.  The EVMS Engine waits 5 minutes
> > > > before timing out the
> > > > request.  It times out the request so that the
> > code
> > > > isn't stuck waiting
> > > > forever.
> > > >
> > > > > Sometimes it shows a different error message
> > that
> > > > is
> > > > > as follows:
> > > > >
> > > > > Engine : there was an error when starting EVMS
> > on
> > > > the
> > > > > other nodes in the cluster. The error code was
> > 12:
> > > > > cannot allocate memory. EVMS will only manage
> > > > local
> > > > > expecting command name.
> > > > >
> > > > > In case of second error message when i strace
> > EVMS
> > > > > command then i found that it waits for a long
> > time
> > > > at
> > > > > the API futex ( that is used for locking
> > memory in
> > > > > user space, plz check for it) then some other
> > > > process
> > > > > gives a wake up call to the futex and it
> > returns
> > > > after
> > > > > a long time so may be the error is due to this
> > > > locking
> > > > > .( I am not sure).
> > > >
> > > > This is strange.  I have never seen EVMS fail to
> > > > allocate memory.  EVMS
> > > > uses standard C functions for memory allocation
> > --
> > > > malloc(), calloc(),
> > > > realloc(), strdup(), memalign(), and free().  It
> > > > doesn't manipulate any
> > > > futex directly.  I'm not sure what is going
> > wrong
> > > > either.
> > > >
> > > > > then i look for the
> > /usr/lib/heartbeat/api_test
> > > > > it is giving following output:
> > > > > info : PID=10247
> > > > > info : Setting message filter mode
> > > > > info : cluster node: fire :status :active
> > > > > ERROR:node fire : intf : eth0 ifstatus :up
> > > > > info : cluster node : machine 1 : status :
> > active
> > > > > ERROR:node machine1 : intf : eth0 ifstatus: up
> > > > > info : sleeping....
> > > > > ERROR: sent ping request to cluster
> > > > > ERROR: waiting for message
> > > > > notice: got message 1 of type [hbapi-clstat]
> > from
> > > > > machine1]
> > > > > notice: got message 2 of type [hbapi-clstat]
> > from
> > > > > machine1]
> > > > > info: sent ping reply(0) to [machine 1]
> > > > > info: sent ping reply(1) to [machine 1]
> > > > > info: sent ping reply(2) to [machine 1]
> > > > > info: sent ping reply(3) to [machine 1]
> > > > > ....................................
> > > > > info: sent ping reply(8) to [machine 1]
> > > > > info: sent ping reply(9) to [machine 1]
> > > > > notice: got message 3 of type [ping reply]
> > from
> > > > > machine1]
> > > > > notice: got message 4 of type [ping reply]
> > from
> > > > > machine1]
> > > > > notice: got message 5 of type [ping reply]
> > from
> > > > > machine1]
> > > > > notice: got message 6 of type [ping reply]
> > from
> > > > > machine1]
> > > > > notice: got message 7 of type [ping reply]
> > from
> > > > > machine1]
> > > > > notice: got message 8 of type [ping reply]
> > from
> >
> === message truncated ===



_______________________________________________
Evms-devel mailing list
Evms-devel@lists.sourceforge.net
To subscribe/unsubscribe, please visit:
https://lists.sourceforge.net/lists/listinfo/evms-devel
