'[opennms-install] Antw: Re: Performance issues and false outages?'


List:       opennms-install
Subject:    [opennms-install] Antw: Re: Performance issues and false outages?
From:       "Michael Seibold" <Michael.Seibold () Gek ! de>
Date:       2009-08-18 7:48:45
Message-ID: 4A8A78FD.8752.00AD.0 () Gek ! de

Hi Kjetil,

You should look at your ICMP response time graphs in OpenNMS. You might want to
check them from time to time to see whether there are intermittent higher values.

In my experience, those strange outages are often due to some unknown problem in
the network or the systems. You changed only one thing that might have improved
OpenNMS performance: the max_fsm_pages setting of PostgreSQL.

If you change this back and the outages don't reappear, they were probably caused
by intermittent higher response times in the network and are now hidden by the
higher timeout / retry values.

If they reappear you know for sure that it has been a database / performance problem.
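
To illustrate how retries can hide intermittent loss, here is a minimal sketch,
not taken from OpenNMS itself. It assumes that every attempt is lost
independently with the same probability, which is a simplification (real loss
often comes in bursts). The function name and the 5% loss rate are made up for
the example.

```python
def poll_failure_probability(loss_rate: float, retries: int) -> float:
    """Chance that a single poll is reported as failed.

    Assumption: each attempt is lost independently with probability
    loss_rate, and the poll fails only if every attempt is lost.
    """
    attempts = retries + 1  # the first try plus the configured retries
    return loss_rate ** attempts


if __name__ == "__main__":
    # Hypothetical 5% packet loss, compared across retry settings.
    for retries in (0, 1, 2, 3):
        p = poll_failure_probability(0.05, retries)
        print(f"retries={retries}: P(poll fails) = {p:.6f}")
```

Under this assumption, each extra retry multiplies the chance of a reported
outage by the loss rate, so a lossy path can look healthy once retries are
raised.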

I have configured strafeping differently from how it was designed, in order to
find such "strange" errors, which are normally hidden by the retries / long
timeouts: as I didn't configure retries, every lost strafeping generates a
"strafeping outage". During normal operation in a local network (and even in our
WAN, which has a quite good QoS configuration) there shouldn't be any ping
losses, yet there are sometimes intermittent strafeping "outages" in the WAN.
Most of the time they signify that the WAN provider has some problem in his
backbone (if the strafeping outages come from different locations), due to
defective lines, routing problems, overloaded routers, etc. And most of the time
the provider himself doesn't see the problems at all! Using OpenNMS maps with a
geographic view of our locations to see the availability (which is affected by
the strafeping outages), we can often narrow down in which part of our country
the provider has a problem and tell him to look there...

So strafeping outages in our configuration are a sign that something is going
wrong but the users can still work (probably with reduced performance), while
ICMP outages are considered to be "loss of function" for our users.
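
The idea that losses seen from several locations at about the same time point
to the provider's backbone can be sketched roughly as follows. This is only an
illustration, not how OpenNMS or its maps work internally: the (location,
timestamp) input format, the function names, and the 5-minute window are all
assumptions, and the window length is assumed to divide 60 evenly.

```python
from collections import defaultdict
from datetime import datetime


def window_start(when: datetime, window_minutes: int) -> datetime:
    """Floor a timestamp to the start of its window (window must divide 60)."""
    floored_minute = when.minute - when.minute % window_minutes
    return when.replace(minute=floored_minute, second=0, microsecond=0)


def correlated_windows(outages, window_minutes=5, min_locations=2):
    """Return the windows in which several distinct locations lost pings.

    outages: iterable of (location_name, datetime) pairs (hypothetical format).
    Result: dict mapping window start -> sorted list of affected locations.
    """
    buckets = defaultdict(set)
    for location, when in outages:
        buckets[window_start(when, window_minutes)].add(location)
    return {
        start: sorted(locations)
        for start, locations in sorted(buckets.items())
        if len(locations) >= min_locations
    }
```

A window that lists only one location suggests a local line or device; a window
that lists many locations is the kind of pattern worth reporting to the
provider.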

- Michael



>>> Kjetil Roso <kjetil@roso.no> 18.08.2009 02:18 >>>
Michael, Ronald and Tarus,

Thanks for your replies! They are very much appreciated.
I've read all your replies and have done some more investigations on my
system and network. The network is more or less healthy, and different
network monitors show normal traffic and load. There are no overload
situations whatsoever.
The pings and snmpwalks are performed from a different server (on the same
subnet, though) during the "outages".
This OpenNMS server is a dedicated server, with no other applications
running. In my cacti-installation, the SNMP-timeout is 500ms with 3 retries.
In Cacti, the average response times for the same nodes are around 3ms.
I haven't registered any CRC-errors, but there are some ifInDiscards and
ifInErrors, but those interfaces are not applicable for the OpenNMS
performance (different vlan).

What has been done since my initial posting?
I overlooked a parameter in postgresql.conf:
max_fsm_pages has been changed from 204800 to 2048000.

I have grep'ed the logs for ERROR, FATAL and Exception:
In snmp4j-internal.log, I found countless entries of
[DefaultUDPTransportMapping_127.0.0.1/0]
org.snmp4j.transport.DefaultUdpTransportMapping:
java.lang.InterruptedException
I Googled this message, and got some indications that this message means
that an SNMP request has timed out. I increased the values for "timeout"
and "retry" in snmp-config.xml to a timeout of 2000 and retry set to 3.
I restarted OpenNMS 8 hours ago, and the "dataCollectionFailed"-events have
disappeared. The false outages are also gone. Everything seems to work to my
full satisfaction. I will give it a day or two before I dare to enable the
notifications, but it certainly looks promising.
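
As a rough illustration of what these values mean for how long a poller waits
before giving up, here is a small sketch. It assumes the timeout is in
milliseconds, that "retry" counts additional attempts after the first one, and
that every attempt waits the full timeout with no backoff. Whether OpenNMS and
Cacti count retries exactly this way is an assumption, and the function name is
made up.

```python
def worst_case_wait_ms(timeout_ms: int, retries: int) -> int:
    """Longest wait before a request is declared failed.

    Assumption: one initial attempt plus `retries` further attempts,
    each waiting the full timeout, with no backoff between them.
    """
    return timeout_ms * (retries + 1)


if __name__ == "__main__":
    # Values mentioned in this thread.
    print("OpenNMS (2000 ms, 3 retries):", worst_case_wait_ms(2000, 3), "ms")
    print("Cacti   (500 ms, 3 retries): ", worst_case_wait_ms(500, 3), "ms")
```

Under these assumptions a longer total wait makes a request far less likely to
be reported as failed, which is why the change can hide slow responses as well
as fix genuine timeouts.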

Everything is now working excellently, but I still get loads of
[DefaultUDPTransportMapping_127.0.0.1/0]
org.snmp4j.transport.DefaultUdpTransportMapping:
java.lang.InterruptedException -entries in snmp4j-internal.log. I find that
somewhat annoying, but I'll investigate that later.
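
One possible way to investigate this later is to count the entries per hour and
compare the busy hours with the times of the outages. The sketch below is only
a starting point. It assumes that each matching log line begins with a
timestamp whose first 13 characters are "YYYY-MM-DD HH"; the real layout of
snmp4j-internal.log may differ, so prefix_len would have to be adjusted. The
function name is made up.

```python
from collections import Counter

MARKER = "java.lang.InterruptedException"


def count_marker_per_hour(lines, marker=MARKER, prefix_len=13):
    """Count lines containing `marker`, grouped by their leading characters.

    Assumption: the first `prefix_len` characters of a matching line
    identify the hour, e.g. "2009-08-18 02". Adjust for the real format.
    """
    counts = Counter()
    for line in lines:
        if marker in line:
            counts[line[:prefix_len]] += 1
    return counts
```

Any iterable of strings works as input, so an open file object for the log can
be passed directly, and counts.most_common(5) then lists the five busiest
hours.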

Before landing on OpenNMS, I tried other NMS applications. With all these
applications I experienced no problems regarding polling and false outages.
I also regularly do network monitoring/traffic analysis. This is why I
eliminated network issues as the cause of the polling problems.
Anyway, it looks like my issues regarding OpenNMS performance and outages
are solved. In a few days I will add 100 more devices and 1000+ interfaces
to the application. Hopefully the performance will hold up.

Thanks again!!

Kjetil    
  


On 17.08.09 01.21, in article C6AE6103.102C%kjetil@roso.no, "Kjetil Roso"
<kjetil@roso.no> wrote:

> Hi,
> 
> I'm running OpenNMS v1.6.5-1 (stable) on the following server configuration:
> Java Version:    1.5.0_18 Sun Microsystems Inc.
> Java Virtual Machine:    1.5.0_18-b02 Sun Microsystems Inc.
> Operating System:    Linux 2.6.23.1-42.fc8 (amd64)
> Hardware: HP ProLiant G5 3.0GHz XEON-QuadCore with 6 GB RAM
> 
> OpenNMS parameters:
> JAVA_HEAP_SIZE=4096
> All PostgreSQL-tuning is according to the wiki:
> "Systems_with_lots_of_RAM_and_PostgreSQL_8.2".
> All other relevant configuration files are "out of the box".
> 
> I'm currently monitoring 42 nodes (5xCisco7606, 30xCisco ME3400G + some
> other devices). For the time being, I am only monitoring the ICMP service on
> the nodes. Every aspect of OpenNMS is working great except for the two
> issues below that I need help to resolve.
> 
> Occasionally, I get storms of
> "uei.opennms.org/nodes/dataCollectionFailed"-events. I cannot see any reason
> for this. This has no other practical consequences than generating a lot of
> events in the event-log in addition to generating "holes" in my graphs.
> These states of data collection failures typically last from 2 to 30
> minutes.
> I'm running Cacti on another server polling the same devices without any
> errors. Cacti has been running for 2 years without any mentionable data
> collecting-errors.
> 
> My main problem is ICMP-outages. 3-4 times a day, I get a storm of
> ICMP-outages on all nodes. This results in a notification storm telling
> "Node down". The outages last for approx. 20 minutes. The "funny" thing is
> that during the outages, all nodes are working fine. Both ping and snmpwalk
> work fine. There are no indications in the nodes' syslog-messages either.
> top shows me that java takes up 25% of both memory and cpu. PostgreSQL takes
> up 10% of both cpu and memory (at most).
> 
> I have eliminated network problems and all nodes have less than 9%
> utilization. I have also grep'ed all daemon log-files for "error" and
> "fatal" with no results. All signs indicate that these problems are related
> to OpenNMS.
> 
> Most likely, I have missed a few tweaks and settings on the way, but I need
> some tips on where to look.
> 
> Any tips and hints are deeply appreciated,
> 
> Kjetil
> Network Manager
> 
> 
> 
> ------------------------------------------------------------------------------
> Let Crystal Reports handle the reporting - Free Crystal Reports 2008 30-Day
> trial. Simplify your report design, integration and deployment - and focus on
> what you do best, core application coding. Discover what's new with
> Crystal Reports now.  http://p.sf.net/sfu/bobj-july 
> _______________________________________________
> Please read the OpenNMS Mailing List FAQ:
> http://www.opennms.org/index.php/Mailing_List_FAQ 
> 
> opennms-install mailing list
> 
> To *unsubscribe* or change your subscription options, see the bottom of this
> page:
> https://lists.sourceforge.net/lists/listinfo/opennms-install 
> 




