Most indexing robots are polite and pause between requests. However, some search engine robots behave very badly, issuing too many requests too often, thus slowing down the service for human users. While everybody wants their sites to be indexed by search engines, it is really annoying when an initially welcomed spider gives the server a hard time, eventually becoming an unwanted spider.
A common remedy for keeping impolite robots off a site is based on an AccessHandler that checks the name of the robot and disallows access to the server if it is listed in the robot blacklist. For an example of such an AccessHandler, see the Apache::BlockAgent module, available from http://www.modperl.com/.
Unfortunately, some robots have learned to work around this blocking technique, masquerading as human users by using user agent strings identifying them as conventional browsers. This prevents us from blocking just by looking at the robot's name—we have to be more sophisticated and beat the robots by turning their own behavior against them. Robots work much faster than humans, so we can gather statistics over a period of time, and when we detect too many requests issued too fast from a specific IP, this IP can be blocked.
The Apache::SpeedLimit module, also available from http://www.modperl.com/, provides this advanced filtering technique.
There might be a problem with proxy servers, however, where many users browse the Web via a single proxy. These users are seen from the outside world (and from our sites) as coming from the proxy's single IP address or from one of a small set of IP addresses. In this case, Apache::SpeedLimit cannot be used, since it might block legitimate users and not just robots. However, we could modify the module to ignore specific IP addresses that we designate as acceptable.
Randal Schwartz wrote Stonehenge::Throttle for one of his Linux Magazine columns. This module does CPU percentage-based throttling. The module looks at the recent CPU usage over a given window for a given IP. If the percentage exceeds a threshold, a 503 error and a correct Retry-After: header are sent, telling for how long access from this IP is banned. The documentation can be found at http://www.stonehenge.com/merlyn/LinuxMag/col17.html, and the source code is available at http://www.stonehenge.com/merlyn/LinuxMag/col17.listing.txt.
Neil Gunton has developed a Spambot Trap (http://www.neilgunton.com/spambot_trap/) that keeps robots harvesting email addresses away from your web content. One of the important components of the trap is the robots.txt file, which is a standard mechanism for controlling which agents can reach your site and which areas can be browsed. This is an advisory mechanism, so if the agent doesn't follow the standard it will simply ignore the rules of the house listed in this file. For more information, refer to the W3C specification at http://www.w3.org/TR/html401/appendix/notes.html#h-B.4.1.1.