List:       dhcp-hackers
Subject:    RE: dhcp-3.0.1rc12 server daemon woes :-(
From:       "Nick Garfield" <Nicholas.Garfield () cern ! ch>
Date:       2003-10-17 22:30:22
Message-ID: AC71B50DCA239B469EE08BABF32AD278012EDB38 () cernxchg01 ! cern ! ch

Bill,

I have found a fix for this problem.  I stopped the secondary, deleted
the leases file on the secondary server and restarted the server.  After
the file was rebuilt (and the MCLT had passed) the "peer holds all free
leases" messages were gone and leases were getting assigned correctly.
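
To confirm that the messages have really stopped, the per-server count
can be scripted rather than done with grep by hand.  A minimal sketch,
assuming syslog lines laid out like the log excerpts quoted further down
(the function name and the regular expression are my own, not part of
dhcpd):

```python
import re
from collections import Counter

# Assumed syslog layout, based on the excerpts quoted in this thread:
#   "Oct 15 00:58:42 ip-srv-4 dhcpd: DHCPDISCOVER from ...: peer holds all free leases"
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+dhcpd: (?P<msg>.*)$"
)

def count_peer_holds(lines):
    """Return a Counter mapping host name -> number of
    "peer holds all free leases" messages logged by dhcpd."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "peer holds all free leases" in m.group("msg"):
            counts[m.group("host")] += 1
    return counts

if __name__ == "__main__":
    # Path taken from the grep commands quoted below.
    with open("/var/log/today/dhcp.log") as fh:
        for host, n in count_peer_holds(fh).items():
            print(host, n)
```

A host that never logs the message simply does not appear in the result.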

Unfortunately this solution does not explain why or how this occurred.
I guess it is a problem with the MCLT.  The MCLT was set to 600 seconds
(less than the max-lease-time of 3600 seconds).  I thought that setting
a shorter MCLT might help if the "communications interrupted" problem
re-occurred.  I have reset the MCLT to 3600 seconds.
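
To watch for the "communications interrupted" state coming back, the
"my state" lines in dhcpd.leases can be checked automatically.  A rough
sketch, assuming the line format shown in the grep output quoted below
(the helper name is hypothetical):

```python
import re

# Assumed line format, copied from the grep output quoted in this thread:
#   "  my state normal at 4 2003/10/16 08:11:52;"
STATE_RE = re.compile(
    r"^\s*my state (?P<state>[\w-]+) at \d+ "
    r"(?P<stamp>\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2});"
)

def last_failover_state(lines):
    """Return (state, timestamp) from the last "my state" line, or None
    if the input contains no such line."""
    last = None
    for line in lines:
        m = STATE_RE.match(line)
        if m:
            last = (m.group("state"), m.group("stamp"))
    return last

if __name__ == "__main__":
    # Path taken from the grep commands quoted below.
    with open("/usr/local/etc/dhcpd.leases") as fh:
        print(last_failover_state(fh))
```

Only the last entry is reported because the leases file is appended to,
so later lines supersede earlier ones.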

If you are interested, for analysis I can send you the dhcpd.leases file
which was causing the problem.
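
For a first look at such a file, the free/backup split can be tallied
per binding state.  This is only a sketch: it assumes the usual layout
of "lease <ip> { ... }" blocks containing a "binding state <state>;"
line, and that the last block for an address is the current one.  The
function name is made up for illustration.

```python
import re
from collections import Counter

# Assumed dhcpd.leases layout (not shown in this thread):
#   lease 192.0.2.10 {
#     binding state free;
#     next binding state free;
#   }
LEASE_RE = re.compile(r"^lease (?P<ip>\S+) \{")
# Anchored at line start so "next binding state ...;" is not matched.
BINDING_RE = re.compile(r"^\s*binding state (?P<state>[\w-]+);")

def binding_state_summary(lines):
    """Count addresses per binding state, using the last entry seen
    for each address."""
    current_ip = None
    latest = {}
    for line in lines:
        m = LEASE_RE.match(line)
        if m:
            current_ip = m.group("ip")
            continue
        m = BINDING_RE.match(line)
        if m and current_ip is not None:
            latest[current_ip] = m.group("state")
        elif line.startswith("}"):
            current_ip = None
    return Counter(latest.values())

if __name__ == "__main__":
    # Path taken from the grep commands quoted below.
    with open("/usr/local/etc/dhcpd.leases") as fh:
        print(binding_state_summary(fh))
```

Comparing the "free" and "backup" counts on each server should show
whether the split is really as even as the reports suggest.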

Bug #9365 is still a problem.
Thanks for the ping patch....that works fine.  

And thanks for the support!

Regards

Nick

> -----Original Message-----
> From: Stephens, Bill {PBSG} [mailto:Bill.Stephens@pbsg.com] 
> Sent: Thursday, October 16, 2003 3:27 PM
> To: 'dhcp-hackers@isc.org'
> Subject: RE: dhcp-3.0.1rc12 server daemon woes :-(
> 
> 
> What do the leases look like when you run the leasstate script from
> Kevin Miller's site (http://www.contrib.andrew.cmu.edu/~kevinm/dhcp)?
> 
> I noticed, going back through the archives, that you ran into this same
> problem in July.  How did you correct the issue then?
> 
> -----Original Message-----
> From: Nick Garfield [mailto:Nicholas.Garfield@cern.ch] 
> Sent: Thursday, October 16, 2003 4:12 AM
> To: dhcp-hackers@isc.org
> Subject: RE: dhcp-3.0.1rc12 server daemon woes :-(
> 
> Bill,
> 
> I rolled back to rc11 yesterday and I actually find no improvement on
> the "peer holds all free leases" problem on the secondary.
> 
> ip-srv-3 = primary
> ip-srv-4 = secondary
> 
> Time on both servers is OK.
> 
> ip-srv-3.cern.ch/ROOT[4] date
> Thu Oct 16 11:07:32 CEST 2003
> 
> ip-srv-4.cern.ch/ROOT[5] date
> Thu Oct 16 11:07:34 CEST 2003
> 
> Let's have a look at how many "peer holds all free leases" messages
> there are on both servers.
> 
> ip-srv-3.cern.ch/ROOT[3] grep holds /var/log/today/dhcp.log | wc -l
>       0
> 
> ip-srv-4.cern.ch/ROOT[8] grep "peer holds all free leases" /var/log/today/dhcp.log | wc -l
>    1248
> 
> Here we can clearly see the problem.
> 
> Communication is OK:
> 
> ip-srv-3.cern.ch/ROOT[1] grep "my state" /usr/local/etc/dhcpd.leases
>   my state normal at 4 2003/10/16 08:11:52;
>   my state normal at 4 2003/10/16 08:11:52;
>   my state normal at 4 2003/10/16 08:11:52;
>   my state normal at 4 2003/10/16 08:11:52;
>   my state normal at 4 2003/10/16 08:11:52;
>   my state communications-interrupted at 4 2003/10/16 08:48:52;
>   my state communications-interrupted at 4 2003/10/16 08:48:52;
>   my state normal at 4 2003/10/16 08:48:52;
> ip-srv-3.cern.ch/ROOT[2]
> 
> 
> ip-srv-4.cern.ch/ROOT[1] grep "my state" /usr/local/etc/dhcpd.leases
>   my state normal at 4 2003/10/16 08:48:46;
>   my state normal at 4 2003/10/16 08:48:46;
>   my state normal at 4 2003/10/16 08:48:46;
>   my state normal at 4 2003/10/16 08:48:46;
>   my state normal at 4 2003/10/16 08:48:46;
> ip-srv-4.cern.ch/ROOT[2]
> 
> 
> Is there any information that I can give you to make more sense of this?
> 
> Thanks
> 
> Nick
> 
> 
> 
> > -----Original Message-----
> > From: Stephens, Bill {PBSG} [mailto:Bill.Stephens@pbsg.com]
> > Sent: Wednesday, October 15, 2003 6:58 PM
> > To: 'dhcp-hackers@isc.org'
> > Subject: RE: dhcp-3.0.1rc12 server daemon woes :-(
> >
> >
> > Check to make sure the partners are communicating (my state in
> > dhcpd.leases).  Also check to make sure clocks are in sync (run an ntp
> > daemon on both servers).
> >
> > -----Original Message-----
> > From: Nick Garfield [mailto:Nicholas.Garfield@cern.ch]
> > Sent: Wednesday, October 15, 2003 10:48 AM
> > To: dhcp-hackers@isc.org
> > Subject: dhcp-3.0.1rc12 server daemon woes :-(
> >
> > Hello Hackers list,
> >
> > I sent the email below to the dhcp-server list earlier today.  I think
> > that this list is probably a more appropriate place for it :-)
> >
> > I cannot offer much help in the knowledge of the code....but.... if you
> > can tell me what/how to begin debugging the code then I might be of some
> > use.
> >
> > Please see the email text below for details of the problem.
> >
> > Thanks
> >
> > Nick
> >
> >
> >
> > ------------------------------------------
> > Hi,
> >
> > I upgraded our failover servers to the latest dhcpd on 9th September
> > from dhcp-3.0.1rc11.  The only noticeable "feature" to begin with were
> > lots of "ping timeout statements" in the log files.  I posted a query
> > about these with no response from the list.
> >
> > Now, it seems, I have run into more serious problems which have taken
> > some time to verify.  Previously, with rc11, I had problems with "peer
> > holds all free leases" messages on both servers.  This situation would
> > occur only when the communication link was cut between the two servers.
> > Re-establishment of the link - say rebooting a router - would leave the
> > servers in the "communications interrupted/lost" state.
> >
> > The situation above was not so bad - I wrote a script to send an alarm
> > when/if the problem occurs.
> >
> > The new error is worse.  Since I upgraded to rc12 I have been getting
> > occasional phone calls: Mr. X can't get an address on subnet (1), Mr. Y
> > can't get an address (2), etc.
> >
> > I had to rule out saturated hubs, switches, wireless access points etc
> > etc.  No problem here.
> >
> > Next, I had to rule out that the pools were over-utilized - I wrote a
> > few perl functions to write reports on this.  The pools run between
> > 0-50% utilization.
> >
> > Armed with the information above I confidently tell the users, "There is
> > no problem here....".
> >
> > That is, until yesterday.  I receive the usual "I can't get an
> > address..." phone call: I go through the usual routine....servers
> > up.....servers communicating.....network alive.....pools have lots of
> > free addresses....   hmmm.....  what is going on?
> >
> > The pool in question has 54 addresses,  4 were used and 50 free.
> >
> > Normally at this point I would say, "Everything is fine", but I instead
> > looked in the logs:
> >
> > On the secondary there were A LOT of "peer holds all free leases"
> > messages.  On the primary there were none of these messages.
> >
> > This makes no sense because my report scripts show the binding-state of
> > each lease and the split was roughly 50/50 free/backup with only one
> > address reported as expired and a few active.
> >
> > From the primary log:
> >
> > Oct 15 00:58:42 ip-srv-3 dhcpd: DHCPDISCOVER from 00:0b:ac:e6:b3:c8 via 137.138.194.65: load balance to peer boson
> >
> > From the secondary log:
> >
> > Oct 15 00:58:42 ip-srv-4 dhcpd: DHCPDISCOVER from 00:0b:ac:e6:b3:c8 via 137.138.194.65: peer holds all free leases
> >
> > So here is the problem.....when this situation occurs NO OFFER WAS SENT
> > TO THE CLIENT :-(
> > I do not understand why the primary has tried to communicate with the
> > failover server on receiving a DISCOVER message.  IMO the load balancing
> > should only occur AFTER a DHCPACK has been received.
> >
> > I conclude that there are two choices.
> >
> > (1) Roll-back to rc11.
> >
> > (2) Play with MCLT as suggested in the DHCP Handbook 2nd ed.  The HA
> > part of the conf file looks like this:
> >
> >   max-response-delay 60;
> >   max-unacked-updates 10;
> >   mclt 600;
> >   split 128;
> >   load balance max seconds 3;
> >
> > and the global options in the conf file are as follows:
> >=20
> > not authoritative; #by default, see PB services below
> > allow bootp;
> > ddns-update-style none;
> > deny duplicates;
> > default-lease-time 1800; # for all devices but network devices
> > max-lease-time 3600;
> >
> > I believe the MCLT and lease-times are configured correctly, therefore I
> > plan to roll-back to rc11, unless anyone can give me a good reason to do
> > otherwise.
> >
> > The total conf file is 30,000 lines (26,000 lines used in host
> > statements), therefore too large to send in an email.
> > Both deny/allow- unknown/known clients and client-classing (blocking)
> > are used in the config.
> >
> > I would be pleased to provide the maintainer with the configurations,
> > leases files, diagrams and trace files if they would be of any help.
> >
> > If anyone has any useful suggestions about why the above is happening
> > then I would be very pleased to hear about your experiences with rc12.
> >
> > Thanks,
> >
> > Nick
> >
> >
> >
> >
> > Nick Garfield
> > IT/CS Campus Networking Section
> > CERN
> > Geneva
> > Switzerland
> >
> > Tel:+41 22 76 74 533
> >
> >
> 
> 
