As we mentioned earlier, the more things are automated, the more stable the server will be. In general, there are three things that we want to ensure:

  1. Apache is up and properly serving requests. Remember that it can be running but unable to serve requests (for example, if there is a stale lock and all processes are waiting to acquire it).

  2. All the resources that mod_perl relies on are available and working. This might include database engines, SMTP services, NIS or LDAP services, etc.

  3. The system is healthy. Make sure that system resources are not running short: watch for a small amount of free RAM, heavy swapping, or low disk space.
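To make the three categories concrete, here is a minimal sketch in Python with one probe per category. It is an illustration under stated assumptions, not a complete monitoring tool: the function names, timeouts, and the 10% free-space threshold are all hypothetical choices, and a real setup would also look at memory and swap usage.

```python
import http.client
import shutil
import socket
import urllib.error
import urllib.request


def apache_serves_requests(url, timeout=5.0):
    """Category 1: succeed only if a real HTTP request completes in time.

    A process check is not enough, because the server can be running
    while every child is blocked (for example, on a stale lock).
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        return False


def service_reachable(host, port, timeout=3.0):
    """Category 2: check that a dependency (database, SMTP, LDAP) accepts TCP connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def disk_has_space(path, min_free_fraction=0.10):
    """Category 3: check that the filesystem holding `path` is not nearly full."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction
```

A TCP connection only shows that a service is listening, not that it answers queries correctly, so a stricter probe would issue a real request to each dependency, just as the first function does for Apache.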

None of these categories has a higher priority than the others. A system administrator is responsible for the proper functioning of the whole system. Even if the administrator is responsible for just part of the system, she must still ensure that her part does not cause problems for the system as a whole. If any of the above categories is not monitored, the system is not safe.

A specific setup may well have additional concerns that are not covered here, but they will most likely fall into one of the above categories.

Before we delve into details, we should mention that all automated tools can be divided into two categories: tools that know how to detect problems and notify the owner, and tools that not only detect problems but also try to solve them, notifying the owner about both the problems and the results of the attempt to solve them.

Automatic tools are generally called watchdogs. They can alert the owner when there is a problem, just as a watchdog will bark when something is wrong. They will also try to solve problems themselves when the owner is not around, just as watchdogs will bite thieves when their owners are asleep.

Some tools can take corrective action without human intervention when something goes wrong (e.g., during the night or on weekends). For some problems, however, only a human can resolve the situation. In such cases, the tool should not attempt to fix anything and should only notify the owner. For example, if a hardware failure occurs, it is almost certain that a human will have to intervene.
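The distinction between the two kinds of tools can be sketched in a few lines of Python. This is a hypothetical illustration, not an existing tool: `Check`, `run_watchdog`, and the message texts are invented names. A check with no `repair` action only notifies the owner, which is the right choice for problems such as hardware failures. A check with a `repair` action attempts the fix, probes again, and reports both the problem and the outcome.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Check:
    name: str
    probe: Callable[[], bool]                    # returns True when healthy
    repair: Optional[Callable[[], None]] = None  # None: leave the fix to a human


def run_watchdog(checks: Iterable[Check], notify: Callable[[str], None]) -> None:
    """Run each check once, notifying about problems and repair results."""
    for check in checks:
        if check.probe():
            continue  # healthy, nothing to report
        if check.repair is None:
            # Detect-and-notify only: do not touch anything.
            notify(f"{check.name}: problem detected, human intervention required")
            continue
        try:
            check.repair()
        except Exception as exc:
            notify(f"{check.name}: problem detected, repair raised {exc!r}")
            continue
        # Probe again so the owner learns whether the repair actually worked.
        if check.probe():
            notify(f"{check.name}: problem detected, repair succeeded")
        else:
            notify(f"{check.name}: problem detected, repair failed")
```

In practice `notify` might send email or a page, and `repair` might restart the server. Both are passed in as plain callables here so that the policy (what may be fixed automatically) stays separate from the mechanism.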

Below are some techniques and tools that apply to each category.