
List:       netcool-users
Subject:    Re: Superseding/hiding events
From:       "Dennis A. Hahn" <mrwizard () shellus ! com>
Date:       1998-07-23 17:42:36


James started this thread by asking:
> Has anyone got any generic approaches to this sort of scenario?

In the foggy reaches of my mind, I recall a particular topic of discussion in
San Francisco that was fascinating. James brought the memory back to life, and
E.T.'s rigorous description made it even more clear to me (nice job!)

I hope not to draw attention away from James' need, but let me twist the
requirement just a little bit harder.

Suppose, hypothetically, that you have a collection of events. By the term "event"
I mean an entry within alerts.status which represents a current/active
problem for which you also know the associated clearing condition. (Yes, dear
readers, that distinction is important -- dealing with "random" alarms is not
an area I'm trying to focus upon.) Also suppose that the clearing condition has
not yet occurred.

Now with a wave of my namesake's wand, and a *lot* of smoke and mirrors,
suppose I can tell the, ahem, ROOT CAUSE of all these events. How this is
accomplished is left as an exercise to the student and all the casual
observers (I for one still haven't got a clue how to do this generically, but
I am still working on solving one of the remaining golden myths of Network
Management...).

Time for some examples:

1) James' "filesystem >= 95% full" condition certainly is a ROOT CAUSE of his
"filesystem >= 90% full" condition.

2) The proverbial "router interface down" certainly is a ROOT CAUSE of numerous
"ping" storms (and the student above knew exactly which ping events should be
hidden by which interface down).

3) A "response time failure" of a particular application running on host A,
caused by a port wrap condition on host B: host B was providing the DNS
services for host C, and host C could not transmit its data to host A because
of an untranslated name (boy, do I have war stories!!!).
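
Example 1 is the one case where the cause can be worked out mechanically. A
minimal sketch of that obvious case, assuming each threshold event carries
hypothetical Node, Filesystem and Threshold fields (these names are
illustrative, not actual alerts.status columns):

```python
from collections import defaultdict

def threshold_suppressions(events):
    """Map each lower-threshold event's Serial to the Serial hiding it.

    events: iterable of dicts with hypothetical keys
    Serial, Node, Filesystem, Threshold.
    """
    # Group events that describe the same filesystem on the same node.
    groups = defaultdict(list)
    for ev in events:
        groups[(ev["Node"], ev["Filesystem"])].append(ev)

    suppressed_by = {}
    for group in groups.values():
        # The highest threshold reached is treated as the ROOT CAUSE.
        root = max(group, key=lambda ev: ev["Threshold"])
        for ev in group:
            if ev["Threshold"] < root["Threshold"]:
                suppressed_by[ev["Serial"]] = root["Serial"]
    return suppressed_by
```

Examples 2 and 3 need topology or dependency knowledge that this sketch does
not attempt.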

Anyway, let's suppose you have identified Event A, which in turn has spawned
Events A.1 through A.n.  There are times when the suppressed alarm is obviously
caused by another condition (see example 1), other times when you have to
provisionally assume that a suppressed alarm is caused only by the other
condition even though that is not necessarily true (see example 2), and the
remaining times you have to maintain your membership in the hair replacement
clubs (see example 3).

By necessity, you cannot always assume that a suppressed alarm (an "effect")
has not happened once you can identify the ROOT CAUSE (the "cause").  In
addition, there may be multiple layers of hiding required: first an edge router
goes down, then a backbone router.  As I think about it, perhaps multiple
layering might not be the best answer, and only a single overriding layer
should be applied: fixing the (latest) identified ROOT CAUSE and then
re-evaluating the remaining collection is probably the safer approach.

Regardless, it's a *nasty* problem.



But for the moment (and with great pleading on my knees begging someone else
to figure out how the above ROOT CAUSE identification was done!), let's assume
we have only the capabilities we have today, and hope that version 4 does some
really slick internal operations to support "the need for speed".

Step 1) Define yet another alerts.status integer field called @SuppressedBy.
The probes do not touch this field, leaving it in its initial condition of
zero.  Obviously, set this as an updated field within any gateway.

Step 2) ...by reading this statement, you hereby agree that if you figure out
how to do the generic ROOT CAUSE determination, you will celebrate accordingly,
and in such rejoicing agree to send the NOC a copy of your accomplishments --
not gloating, but a copy of the routine showing how you accomplished this feat!
For simpler implementations, where the ROOT CAUSE is obvious, well, still
rejoice; life's too short to miss out on some joyous moments!

Step 3) Within your automagic checking, let's say you've identified Event A.
Subsequently set the @SuppressedBy field for Events A.1 through A.n to the
value of Event A's @Serial.

Step 4) For your EventLists, set the filter condition to only select
@SuppressedBy entries equal to zero in addition to all other filter conditions.
Also, this is a great first check, because it's checking an integer field for
an exact value -- think optimization!

Step 5) When the eventual Clearing condition for Event A occurs (remember that
caveat in the beginning of this note?), then within the associated Action of
that clearing trigger, don't forget to also reset the @SuppressedBy fields back
to zero. On the other hand, you could also select all alerts.status entries
that have non-zero @SuppressedBy fields and verify that the suppressing event's
Serial still exists within alerts.status (pretty ugly though).

Step 6) Well, this is really step 2 all over again, always scanning through
your collection of events looking for ROOT CAUSES.
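
Steps 1, 3, 4 and 5 can be sketched roughly as follows. This is only an
illustration: it uses Python's sqlite3 module and a cut-down stand-in table
called status, not the real ObjectServer or its SQL dialect, and the function
names are made up for the example.

```python
import sqlite3

def make_status_table():
    # Step 1: SuppressedBy is an integer field that starts out as zero.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE status ("
        " Serial INTEGER PRIMARY KEY,"
        " Summary TEXT,"
        " SuppressedBy INTEGER NOT NULL DEFAULT 0)"
    )
    return conn

def suppress(conn, root_serial, effect_serials):
    # Step 3: point Events A.1 through A.n at Event A's Serial.
    conn.executemany(
        "UPDATE status SET SuppressedBy = ? WHERE Serial = ?",
        [(root_serial, serial) for serial in effect_serials],
    )

def visible_events(conn):
    # Step 4: the event list filter only selects unsuppressed entries.
    rows = conn.execute(
        "SELECT Serial FROM status WHERE SuppressedBy = 0 ORDER BY Serial"
    )
    return [row[0] for row in rows]

def clear_event(conn, serial):
    # Step 5: when Event A clears, reset whatever it was hiding.
    conn.execute(
        "UPDATE status SET SuppressedBy = 0 WHERE SuppressedBy = ?", (serial,)
    )
    conn.execute("DELETE FROM status WHERE Serial = ?", (serial,))

def release_orphans(conn):
    # Step 5, the "pretty ugly" alternative: reset any entry whose
    # suppressing Serial no longer exists in the table.
    conn.execute(
        "UPDATE status SET SuppressedBy = 0"
        " WHERE SuppressedBy != 0"
        " AND SuppressedBy NOT IN (SELECT Serial FROM status)"
    )
```

How the Serial values passed to suppress() get chosen is exactly the step 2
question, and the sketch leaves it open.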



Would love to see the discussion about this!



(Side note to the Micromuse Developers: what I'm thinking about is that the
@SuppressedBy implementation is really a linked list associated with an entry
within alerts.status, referenced by the provided @Serial values. Upon deletion
of an entry, the chain could be chased to reset the @SuppressedBy fields very
easily. Should a subsequent event desire to override the initial suppression,
as in the backbone router example above, then the initial suppression linked
list would automatically be joined to the subsequent linked list. I realize
that this requires yet another method definition within NCOMS.sql, but I
think the ability described above would justify it. Just a thought: do you
think this is doable/desirable?)
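
To make the linked-list idea a little more concrete, here is a rough sketch in
Python. It is not anything that exists in the product; the class and method
names are invented, and plain dictionaries stand in for the chain that would
hang off each alerts.status entry.

```python
class SuppressionChains:
    """Hypothetical in-memory model of the @SuppressedBy chains."""

    def __init__(self):
        self.suppressed_by = {}  # Serial -> suppressing Serial (absent = 0)
        self.members = {}        # root Serial -> Serials it currently hides

    def suppress(self, root, effects):
        chain = self.members.setdefault(root, [])
        for serial in effects:
            self.suppressed_by[serial] = root
            chain.append(serial)

    def override(self, new_root, old_root):
        # Join the old chain (and the old root itself) onto the new root,
        # as in the edge router / backbone router case.
        joined = [old_root] + self.members.pop(old_root, [])
        self.suppress(new_root, joined)

    def delete(self, root):
        # If root was itself hidden, unhook it from its suppressor's chain.
        parent = self.suppressed_by.pop(root, None)
        if parent is not None:
            self.members[parent].remove(root)
        # Chase root's own chain and reset every member.
        released = self.members.pop(root, [])
        for serial in released:
            del self.suppressed_by[serial]
        return released
```

For example, after suppress(10, [11, 12]) and override(20, 10), deleting 20
releases 10, 11 and 12 in one pass, which matches the single overriding layer
discussed earlier.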

-------------------------------------------------------------------------------
Dennis Hahn - KB5EEG  <*|;{)}
Shell Services International (Global End To End Systems Management Development)
1500 Old Spanish Trail, Houston, Texas, USA 77054  (Mail Stop: IC-2H18G2)
Voice: 1-713-245-3629  Fax: 1-713-245-1664  E-mail: mrwizard@shellus.com
