It's not so easy to detect hanging processes. There is no way you can tell how long the request is taking to process by using plain system utilities such as ps and top. The reason is that each Apache process serves many requests without quitting. System utilities can tell how long the process has been running since its creation, but this information is useless in our case, since Apache processes normally run for extended periods.
However, there are a few approaches that can help to detect a hanging process. If the hanging process happens to demand lots of resources, it's quite easy to spot it by using the top utility. You will see the same process show up in the first few lines of the automatically refreshed report. (But often the hanging process uses few resources—e.g., when waiting for some event to happen.)
Another easy case is when some process thrashes the error_log file, writing millions of error messages there. Generally this process uses lots of resources and is also easily spotted by using top.
Two other tools that report the status of Apache processes are the mod_status module, which is usually accessed from the /server_status location, and the Apache::VMonitor module, covered in Chapter 5.
Both tools provide counters of requests processed per Apache process. You can watch the report for a few minutes and try to spot any process with an unchanging number of processed requests and a status of W (waiting). This means that it has hung.
But if you have 50 processes, it can be quite hard to spot such a process. Apache::Watchdog::RunAway is a hanging-processes monitor and terminator that implements this feature and could be used to solve this kind of problem. It's covered in Chapter 5.
If you have a really bad problem, where processes hang one after the other, the time will come when the number of hanging processes is equal to the value of MaxClients. This means that no more processes will be spawned. As far as the users are concerned, your server will be down. It is easy to detect this situation, attempt to resolve it, and notify the administrator using a simple crontab watchdog that periodically requests some very light script (see Chapter 5).
In the watchdog, you set a timeout appropriate for your service, which may be anything from a few seconds to a few minutes. If the server fails to respond before the timeout expires, the watchdog spots trouble and attempts to restart the server. After a restart an email report is sent to the administrator saying that there was a problem and whether or not the restart was successful.
If you get such reports constantly, something is wrong with your web service and you should review your code. Note that it's possible that your server is being overloaded by more requests than it can handle, so the requests are being queued and not processed for a while, which triggers the watchdog's alarm. If this is the case, you may need to add more servers or more memory, or perhaps split a single machine across a cluster of machines.
 
Continue to: