List:       netsaint-devel
Subject:    [netsaint-devel] A long list of things to discuss; WAS: Re: question...
From:       "Brandon Knitter" <knitterb () blandsite ! org>
Date:       2000-10-04 4:57:53

I had to start with this one...

> Its hidden so well that even I don't know where it is... Truthfully,
> there isn't one yet.  I just haven't had the time to learn CVS.  I don't
> want to screw things up when I configure things and I have to learn
> how to handle branching, etc. before I do anything.  If and when I
> do get it setup, it'll probably be read-only.  I'll just periodically dump
> whatever working code I have to the CVS tree.  That way
> developers can track changes that I make without having to wait for
> another alpha or beta release.

WOW!  How do you develop?  Seriously, do yourself a favor and get
used to using CVS.  It's soooo incredibly easy.  Really, there are only
about 2 commands you need to know to do the most basic stuff:

    cvs add <file/dir>
    cvs commit <file>

That's it.  Can I help you get your stuff into CVS?  Please?  We rely
so heavily on Netsaint that I really want to help you out here.  How
do you have multiple ppl working on stuff?  I see a Netsaint project
at SourceForge, is that yours?  If so, let's open that up.
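
If it helps, a bare-bones session looks something like this (the
repository path and module name here are made up, obviously):

    cvs -d /home/cvsroot checkout netsaint
    cd netsaint
    vi base/checks.c                        (hack away)
    cvs commit -m "fix reaper timing" base/checks.c

New files just need a "cvs add" before the commit, and that's really
all there is to day-to-day use.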

BTW: how many ppl work on Netsaint itself?  I see a lot of plugins
and such, but I'm curious about the number working on the core
product.

> I can see the arguments against a DB backend, but the fact is that
> users will be able to choose what types of data they want stored in
> the old format or in a database.  I personally like the idea of
> keeping status info in a db because of the big speed
> improvements, especially if you are monitoring a large number of
> services and hosts.  As mentioned before the overhead of keeping
> status data in a text file is very high.

Yeah, the load on the status file is huge!  We checked out a few things
and we're working on a prototype that would change this drastically.
We're thinking that changing the main execution while loop to a select
loop would greatly improve performance.  Also, as part of that select
loop we could write to the status.log only as necessary.
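
Just to sketch what I mean (none of these names are from the actual
Netsaint source; handle_reaped_results() and flush_status_log() are
placeholders I made up):

    /* Hypothetical select()-driven main loop: block on the reaper
     * pipe with a timeout, update state in memory only, and rewrite
     * status.log just once per flush interval. */
    #include <sys/time.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <time.h>

    #define STATUS_FLUSH_INTERVAL 30  /* seconds between status.log writes */

    void handle_reaped_results(int fd);  /* placeholder */
    void flush_status_log(void);         /* placeholder */

    void main_loop(int reaper_fd){
        fd_set readfds;
        struct timeval tv;
        time_t last_flush = time(NULL);

        for(;;){
            FD_ZERO(&readfds);
            FD_SET(reaper_fd, &readfds);
            tv.tv_sec = 1;   /* wake at least once a second */
            tv.tv_usec = 0;

            if(select(reaper_fd + 1, &readfds, NULL, NULL, &tv) > 0
               && FD_ISSET(reaper_fd, &readfds))
                handle_reaped_results(reaper_fd);  /* memory only */

            if(time(NULL) - last_flush >= STATUS_FLUSH_INTERVAL){
                flush_status_log();  /* one big write, not many */
                last_flush = time(NULL);
            }
        }
    }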

Another idea is to just leave all the data in Netsaint altogether.  A
TCP request could be made to Netsaint to get the latest info.  And for
backwards compatibility, you could do the status.log writing every x
seconds, or just create a named pipe.
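
The named-pipe variant could be as simple as this sketch (the path and
the write_all_status_to() helper are assumptions on my part, and in the
real daemon this would have to live in its own process or use
O_NONBLOCK, since opening a FIFO for writing blocks until someone
reads):

    /* Sketch: serve the in-memory status through a FIFO, so that
     * "cat status.fifo" always gets a fresh dump. */
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <stdio.h>

    void write_all_status_to(FILE *fp);  /* placeholder */

    void serve_status_once(void){
        const char *path = "/usr/local/netsaint/var/status.fifo";
        FILE *fp;

        mkfifo(path, 0644);        /* no-op if it already exists */
        fp = fopen(path, "w");     /* blocks until a reader shows up */
        if(fp != NULL){
            write_all_status_to(fp);
            fclose(fp);
        }
    }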

I think your idea of speeding things up by going to the DB has merit.
But a DB has a whole lot more work to do per write than a single file
does.  I keep coming back to this, but the status.log file should not
be written every time items are reaped; it should be written at a
specified interval.  That approach would yield a much bigger
performance gain than moving to a DB.

The fact that the number of parallel processes is not decremented until
all of the outstanding processes have been reaped is an example of this
design flaw.  Instead of decrementing the count once everything has
been reaped, you should really decrement it as soon as each process
completes.  Granted, I have reviewed the code and can see why that
isn't possible as things stand.

What we see (with a 20-parallel-process configuration) is 20 processes
do their check, then there are 0, then there is a shitload of syscalls
to status.log and then status.tmp (all context switches because of
I/O).  Once status.tmp has been created and renamed into place, another
20 checks go, and we loop.

The only thing that changes is that the pauses will be due to talking
to the DB instead.  And I can't see talking to a DB being any faster
than talking to a file.

Once again: if you write the reaper's changes to memory and then
schedule an independent task to write status.log to disk (from this
memory structure), you will see a large performance gain.

> I guess the only real time there would be a problem is if the db
> server was down when NetSaint shutdown/restarted and you had
> chosen to save retention info to the db.  NetSaint isn't going to wait
> around to save the retention info - it's just going to keep on going.
> That's why I've added automatic dumps of retention data at
> user-specified intervals in the 0.0.7 code.

You should probably have "last saved state" info in a file just in case
someone wants it.

You may also want to handle other DB failure modes: running out of
space, locks, dropped connections, etc.  DBs in general are not going
to speed up your application or reduce its complexity.  What they will
do is offer more data mining ability, which is why I suggest you just
store information changes in the DB (insert only, data warehouse
style).
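
To make the insert-only idea concrete, something like this (the
state_log table and its columns are invented, not a proposed schema,
and a real version would escape the strings with mysql_escape_string()
first):

    /* Sketch of insert-only logging via the MySQL C API: every state
     * change becomes a new row, nothing is ever updated in place, so
     * the full history stays around for mining. */
    #include <mysql/mysql.h>
    #include <stdio.h>

    void log_state_change(MYSQL *conn, const char *host,
                          const char *service, int state,
                          const char *output){
        char query[1024];

        snprintf(query, sizeof(query),
                 "INSERT INTO state_log (host, service, state, output, "
                 "logged_at) VALUES ('%s', '%s', %d, '%s', NOW())",
                 host, service, state, output);
        mysql_query(conn, query);
    }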

Generally speaking, databases should not be used to save state
information such as configuration.  The overhead is just too big, and
the gain doesn't justify the risk.

> The latest release of MySQL does support experimental
> transactions using Berkeley DB tables, but I haven't tried them out
> yet.  The way the db code is being written, I think the only time
> you'd really need transaction support is when you save retention
> information.  You don't want to blow away old retention info until
> you know the new stuff has been committed...

Postgres has always supported transactions, and it's often been
shown to be faster than MySQL for multi-user applications.
[Editorial: I don't want to get into any pissing matches or
benchmark tests; I don't think Netsaint will push either one that far.]

I would be worried about that "beta" part! :)  We don't need any part of
a monitoring system to be beta....hmmmmmmmmm....

Is there a reason for mySQL?  I'm a big Postgres fan just cause of all
the neato stuff it has.  Oh yeah, it compiles for me too! :)

Seriously though, the transaction support is very important for things like
host and service updates.  If you need to update both, and then mark the
service as checked, you want to make sure they are all done, or not at all,
so that the service check fires again.  That's where transactions come in.
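
For example, libpq-flavored (the table and column names are invented,
and the per-statement error checks and PQclear() calls are mostly
omitted for brevity): either all three statements land, or none do,
and the check simply fires again on the next pass.

    /* Sketch of an all-or-nothing host+service update in Postgres. */
    #include <libpq-fe.h>

    void update_check_result(PGconn *conn){
        PGresult *res;

        PQexec(conn, "BEGIN");
        PQexec(conn, "UPDATE hosts SET last_state = 0 WHERE name = 'www1'");
        PQexec(conn, "UPDATE services SET last_state = 0 WHERE host = 'www1'");
        res = PQexec(conn, "UPDATE services SET checked = 1 WHERE host = 'www1'");

        if(PQresultStatus(res) == PGRES_COMMAND_OK)
            PQexec(conn, "COMMIT");
        else
            PQexec(conn, "ROLLBACK");  /* check fires again next time */
        PQclear(res);
    }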

DBs in general tend to slow things down.  Unless you plan to have the
clients themselves update the info (not the reaper), you aren't going
to win much performance here.  If you are going to use a DB (cool
feature, although I am scared) then you should really use it properly
and make sure you lock data for writes in case another write comes
along, or a dirty read attempt is made.

I would still like historical information.  For instance, get the
logging of system states and notifications into the DB (insert-only
model) before the real-time status of things.  I'd rather see the
real-time stats kept in memory in Netsaint, and dropped to disk for
backwards compatibility.

> Internally implemented checks might be nice, but I see the ability
> to write a simple shell script to perform a check of some kind as
> being a very handy feature.

Agreed 100%!  I am not at all suggesting that the ability to write
command line applications needs to go away.  What I am thinking of is
a bridge to the other daemon, or a built-in version of a very common
check like HTTP.  A programmer's API would eventually need to be
developed for openness, but I'm not thinking that far ahead.

> The problem I see with this is what happens if the second daemon
> (the one that's running all the checks) gets hosed?  You still have
> the same situation to deal with.

You are correct: if that daemon fails, then you are hosed.  But if I
run out of file descriptors or processes (we've had the latter), I'm
hosed again!  So what we do to make this better is to have a check go
to the daemon.  If that check is in Netsaint, I'll get the warning; if
it's another file system check, I could still be out of file
descriptors.

Chicken and egg, who's gonna monitor the monitor server? :)

> Passive checks (those submitted via the command file) actually
> incur much less overhead, as NetSaint only has to fork once to
> process all of them.  I believe there was some talk on the plugin
> devel list a while back about adding the ability to do multiple
> simultaneous checks using the check_snmp plugin.  Basically, the
> results for multiple checks would be dumped to the external
> command file for NetSaint to pick up.

Well, that explains it.  I didn't really understand passive checks.  I
think I do now and will be making some mods tomorrow.  So in other
words, this will fire the check but not wait around for a response,
and it's the check's job (the script on the file system) to notify
Netsaint via the netsaint.cmd file?
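
(If I'm reading the docs right, the script would append a line like
this to the command file; I'm going from memory on the exact format,
so double-check it:

    [970027073] PROCESS_SERVICE_CHECK_RESULT;www1;HTTP;0;HTTP ok - 0.2 second response time

where the fields are timestamp, command, host name, service
description, return code, and plugin output.)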

> Writes to the pipe will not be interleaved because I've placed
> restrictions on the max size of the message structure to be under
> 512 bytes.  I believe that POSIX states that any writes to a pipe
> that are under 512 bytes in length are atomic.  Many OSes support
> larger atomic writes, but that's the minimum for POSIX compliance.

Took us a while, but we found that.  That explains why the output
length is coded for a length of 352.  We added up struct
service_message_struct and what do ya know, it equals 504, less than
the size of the standard PIPE_BUF.  We're big on asserts over here, so
you should really add a line of code somewhere:

    assert(sizeof(struct service_message_struct) <= PIPE_BUF);  /* needs <assert.h> and <limits.h> */

> This is also a possibility.  If you use a db as a backend for status
> data it doesn't matter much.  However, maybe when using a standard
> text file to store the status info, the core should just refresh the
> whole darn thing every 30 seconds or something.  I like the idea of
> an immediate update, but it does cause a lot of overhead.
>
> I would recommend using a db for the status data backend in the
> future, but that's just me... :-)

That's fair to recommend a DB! :)  I just know how much of a bitch they are
and how often they fail.  Although this is a small one! :)

We've thought of a couple of ways to make the status.log writing
quicker.  The main one is to keep all the status info in memory and
just write out a new status.log every x seconds.  This would mean the
reaper, instead of collecting and writing to the log, would collect
into a memory structure; then every x seconds that structure would be
dumped to status.log on disk.  Instead of "grepping" the status.log
file to create the new .tmp file with the new info, the updates could
be made more precisely in memory, and the write to disk would be
faster.
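
A sketch of that dump step, riffing on the flush_status_log() I
handwaved earlier (the status_entry list is made up): write the whole
in-memory list to status.tmp and rename() it into place, so readers
never see a half-written status.log.

    #include <stdio.h>

    struct status_entry{
        char line[512];               /* preformatted status.log line */
        struct status_entry *next;
    };

    extern struct status_entry *status_list;  /* hypothetical in-memory state */

    void flush_status_log(void){
        struct status_entry *e;
        FILE *fp = fopen("status.tmp", "w");

        if(fp == NULL)
            return;
        for(e = status_list; e != NULL; e = e->next)
            fprintf(fp, "%s\n", e->line);
        fclose(fp);

        /* rename() is atomic, so nobody ever reads a partial file */
        rename("status.tmp", "status.log");
    }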

There are a lot of system calls at startup as well: a ton of them to
status.log as it reads in the status.sav file.  It seems that as it
updates each service, it's doing the host check to make sure the host
is okay (a thing you noted about services previously).  Removing this
need at startup would surely speed things up.

> Sure, send it to the devel list if its a small patch.  Otherwise,
> maybe just a URL where we can grab it.

BTW, I'm serious about CVS.  You should really look into that.  You
don't /have/ to do anything like branching.  It is nice, but is not a
requirement.  I think once you start to use it you will be amazed.
There is a reason so many ppl use it.  If time is a constraint, I'd be
more than happy to help you set it all up.  It would make it easier
for us to send you patches as well!

-bk
