Prevention is better than cure

I loved this posting. I think it applies just as much to DBAs and system administrators. I have to confess I have had the occasional thrill of playing the hero at the scene of a car crash of a situation. It can feel good. In fact, I think you can easily get addicted to being in this type of situation, where you hold the key piece of knowledge needed to fix a particular problem. At times I think I have been on the edge of enjoying it too much, almost hoping for another opportunity to be a troubleshooter. But I see the danger in enjoying being a troubleshooter, and I can see the benefit to the organisation of being a “troublepreventer”.

Recently, on the Oracle-L mailing list, someone asked for a list of tasks that a good DBA should do on a regular basis. The person was given short shrift: they were told that if there are tasks your DBA is performing on a daily basis, you should ask why they are doing them and why they have not automated them.

I also like the idea of “exception based reporting”: if everything about a system, database, etc. is running within specified criteria, I don’t want to know about it; I only want to hear about it when a monitored metric moves outwith its acceptable range. However, this does require that you monitor the system closely. As an example, I would suggest monitoring the I/O response time on your database, and/or critical query response times, and then, if these move outwith acceptable limits, flagging it up and getting someone to look at it.
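To make the idea concrete, here is a minimal sketch of exception based reporting in Python. The metric names, the sample values and the warning/critical thresholds are all made up for illustration. The only outside convention it borrows is the usual Nagios plugin status codes (0 for OK, 1 for WARNING, 2 for CRITICAL). How the response times are actually measured is left out entirely.

```python
from __future__ import annotations

from dataclasses import dataclass

# Conventional Nagios plugin status codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class Check:
    name: str        # hypothetical metric name
    value_ms: float  # latest measured value, in milliseconds
    warn_ms: float   # assumed warning threshold
    crit_ms: float   # assumed critical threshold


def evaluate(check: Check) -> int:
    """Map a measurement onto a Nagios-style status code."""
    if check.value_ms >= check.crit_ms:
        return CRITICAL
    if check.value_ms >= check.warn_ms:
        return WARNING
    return OK


def exceptions_only(checks: list[Check]) -> list[str]:
    """Return one message per check outside its limits; say nothing otherwise."""
    labels = {WARNING: "WARNING", CRITICAL: "CRITICAL"}
    report = []
    for check in checks:
        status = evaluate(check)
        if status != OK:
            report.append(f"{labels[status]}: {check.name} = {check.value_ms:.1f} ms")
    return report


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    samples = [
        Check("io_response", 8.0, warn_ms=20.0, crit_ms=50.0),
        Check("critical_query", 950.0, warn_ms=500.0, crit_ms=2000.0),
    ]
    for line in exceptions_only(samples):
        print(line)
```

With these sample values the healthy I/O check produces no output at all, and only the slow query is reported. Sensible thresholds would have to come from watching the real system for a while first.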

Of course, I should eat my own dog food a bit more here.


5 thoughts on “Prevention is better than cure”

  1. I completely agree with the idea of being a troublepreventer – I can’t stand repeating the same things day after day. As for the Oracle monitoring – let’s have some Nagios checks written please 🙂

  2. Er yeah, I did say I need to eat my own dog food more on this one. Though were you not banging on about some generic Nagios plugin that would allow you to run any script as a check?

  3. Jason, depending on how well you want to know your systems, you might want to check them regularly. Actually, you should watch your systems for a while BEFORE you can come up with sensible criteria for what is unusual, and then from time to time verify whether adjustments are required.

    But don’t take me wrong – I agree with your idea. 🙂

  4. Hiya Alex,

    As I said, part of what prompted this was seeing the posting about a DBA checklist on Oracle-L. One of the things on the checklist was checking every day whether your instance was up!?!

    If you need to be checking that (rather than monitoring it), I think the game is up.

    I take your point about needing a baseline, but after a while of managing a system I think you should have developed a good intuition for what is normal behaviour.

    You using Nagios there at Pythian? Or have you rolled your own?

  5. We use our own home grown tool.

    Intuition… hm… that leads to guesses. 😉

    In the system of collected metrics, it’s a set of thresholds. Sometimes we use more complex models with probability and normal deviation, but those are more difficult to tune.
