404 Error Log

I didn't like any of the 404 error log systems I found, so I cobbled some of them together into one I like. This document describes the system I use. It's organized in the order in which events occur when a 404 error is generated. This system works for Apache with SHTML (server-side includes) turned on. You should know the basics of .htaccess, SHTML, and CGI with Perl.

.htaccess

You need to add a line to your .htaccess file that directs the server to call the cgi script when a 404 error happens:

.htaccess
ErrorDocument 404 /cgi-bin/404-handler.cgi?errorpage=/404.html

404-handler.cgi is the name of the script that does the first part of the work. 404.html is the name of the web page that is served up to the user when a 404 error occurs. I added this parameter because I wanted every section of my web site to have its own special 404 error page. It also makes it easy to delete the reference to the handler. The entry affects this directory and every directory below it, so if a part of your web site in a different directory wants its own 404 file, add another .htaccess file there.
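As a sketch of that last point, a section kept in a directory named /photos (a made-up name, as is its 404.html file) could carry its own .htaccess file. The errorpage path is still written relative to the root of the web site, because the handler prepends its $pathprefix to it.

```apache
# /photos/.htaccess (hypothetical directory and error page)
ErrorDocument 404 /cgi-bin/404-handler.cgi?errorpage=/photos/404.html
```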

404-handler.cgi

Because of the ErrorDocument entry in the .htaccess file, this cgi file is invoked every time a 404 error happens. Comments are interspersed in the file to explain how it works. This file lives in your cgi executables folder.

404-handler.cgi

#!/usr/bin/perl
use CGI::Carp qw(fatalsToBrowser);
use CGI qw(:standard);
# 404 error handler
# Michael Roeder
# 5/5/02
# Parameter: ?errorpage=path/to/error.html
#
# Each subdirectory in the web site structure can have its own
# .htaccess file and 404.html file. Each .htaccess file can
# set an ErrorDocument line like this one:
#
#ErrorDocument 404 /cgi-bin/mroeder/404-handler.cgi?errorpage=/404.html
#
# The first part points the server to the 404 handler script.
# The second part points the handler script to the 404 error page.
# If no errorpage is specified in the ErrorDocument line, then
# the default behavior is to serve up the $defaulterrorpage.
# If the errorpage is specified but is missing, this script will
# print an ugly error message.
# If the $defaulterrorpage is needed but is missing, this script
# will likewise print an ugly error message.
#
# Set $logfile to a text file that is to receive the log entries.
# This file needs to be world-writable.
$logfile = '/wwwpages/404-log.txt';
#
# Set $pathprefix to the root directory of the web site.
$pathprefix = '/wwwpages';
#
# Set $defaulterrorpage to the error page in that directory.
$defaulterrorpage = '/404.html';
#
# If the log file is missing, print an ugly error message.
# The log file is the whole point of this script.
unless (-e $logfile) {error("$logfile does not exist.")}
# -w checks whether the user the web server runs as can write the file
unless (-w $logfile) {error("$logfile is not writable by the web server.")}
#
# Build the data for the log file.
#
# get the date and time
($sec,$min,$hour,$mday,$mon,$year,$wday,$yday) = localtime(time);
# localtime returns a 0-based month and the years since 1900;
# sprintf zero-pads each field to a fixed width
$dt = sprintf("%04d%02d%02d.%02d%02d", $year + 1900, $mon + 1, $mday, $hour, $min);
#
# get the requested URL
$redirect_url = "$ENV{'REDIRECT_URL'}";
#
# Get the URL we were linked from
$http_referrer = "$ENV{'HTTP_REFERER'}";
if ($http_referrer eq "") {
$http_referrer = "&nbsp;";
}
else {
$http_referrer = "<a href=\"$http_referrer\">$http_referrer</a>";
}
#
# Get the address of the remote web browser.
$remote_address = "$ENV{'REMOTE_ADDR'}";
#
# Get the name of the remote web browser.
$remote_host = "$ENV{'REMOTE_HOST'}";
if ($remote_host eq "") {
$remote_host = "&nbsp;";
# it would be nice if this could do an nslookup and report that
}
#
# Get the type of browser
$user_agent = "$ENV{'HTTP_USER_AGENT'}";
if ($user_agent eq "") {
$user_agent = "&nbsp;";
}
#
# prepare the HTML fragments to make a nice table row.
$entrystart = " <tr>\n <td>";
$delimiter = "</td>\n <td>";
$entryend = "</td>\n </tr>\n";
#
# build log entry, which is to be one line of table cells
$logentry = $entrystart . $dt . $delimiter . $redirect_url . $delimiter .
$http_referrer . $delimiter . $remote_address . $delimiter .
$remote_host . $delimiter . $user_agent . $entryend ;
#
# write the log entry to the file
open(FILE, '>>', $logfile) or error("Could not open $logfile for append: $!");
flock(FILE,2) or error("Could not lock file $logfile: $!");
print FILE $logentry;
close(FILE);
#
# having logged the 404 error, now give the right 404 page
#
# start with the default error page
$errorpage = $pathprefix . $defaulterrorpage;
# get the specified error page parameter
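$myerrorpage = param('errorpage');
#
# if something was specified, then set the specified error page
if (defined $myerrorpage and $myerrorpage ne '') {
# refuse any path that tries to climb out of $pathprefix
if ($myerrorpage =~ /\.\./) {error("Invalid errorpage parameter.")}
$errorpage = $pathprefix . $myerrorpage;
}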
#
# does the error page exist?
unless (-e $errorpage) {
error("$errorpage does not exist.")
};
#
# print the 404-error page to the output
open (FILE, '<', $errorpage) or error("Could not open $errorpage: $!");
$content = '';
while (<FILE>){
$content .= $_;
}
close (FILE);
print "Content-type: text/html\n\n";
print "$content";
# done
exit(0);
#
# print an error message to the browser and stop
sub error
{
my $error = $_[0];
print "Content-Type: text/html\n\n";
print "<h2>Error</h2><b>$error</b>\n\n";
exit(0);
}
#eof 404-handler.cgi
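
A comment in the script wishes it could do an nslookup when REMOTE_HOST is empty. Here is a minimal sketch of one way that might be done with Perl's built-in gethostbyaddr. The helper name lookup_host is an assumption and is not part of the script above. A reverse lookup can be slow, so it may not be worth doing on every 404.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Socket qw(inet_aton AF_INET);

# Hypothetical helper: return the host name for a dotted-quad IPv4
# address, or "&nbsp;" when the address cannot be resolved.
sub lookup_host {
    my ($address) = @_;
    # accept only dotted-quad text, so inet_aton never does a forward lookup
    return '&nbsp;' unless defined $address
        && $address =~ /^\d{1,3}(?:\.\d{1,3}){3}$/;
    my $packed = inet_aton($address);
    return '&nbsp;' unless defined $packed;
    # in scalar context gethostbyaddr returns just the host name
    my $name = gethostbyaddr($packed, AF_INET);
    return (defined $name && $name ne '') ? $name : '&nbsp;';
}

# Falls back to the loopback address when run outside the web server.
print lookup_host($ENV{'REMOTE_ADDR'} || '127.0.0.1'), "\n";
```

In the handler, the result could replace the "&nbsp;" that is assigned when REMOTE_HOST is empty.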

404-log.txt

This starts out as an empty text file. Each time the web server runs the cgi script, a 404 error is appended to this file. The lines shown here are just an example of what the cgi script creates: a series of table rows, with one cell for each piece of information I want to collect. This file lives in the html documents folder. Its permissions must be set so that the web server can write to it; the simplest way is to let anyone write to it. To clear the log, delete the contents of this file.

404-log.txt
<tr>
<td>20040307.1600</td>
<td>/wrongfile.html</td>
<td>&nbsp;</td>
<td>10.0.0.10</td>
<td>&nbsp;</td>
<td>Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/85 (KHTML, like Gecko) Safari/85.6</td>
</tr>
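
Because every entry has the same six cells, the log is also easy to process with a few lines of Perl. The following is only a sketch: the helper names parse_log and count_bad_urls are made up, and the second sample row is invented so there is something to count.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split log text into rows, each row an array
# reference holding the six cell values in order.
sub parse_log {
    my ($text) = @_;
    my @rows;
    while ($text =~ m{<tr>(.*?)</tr>}gs) {
        my $row = $1;
        my @cells = $row =~ m{<td>(.*?)</td>}gs;
        push @rows, \@cells if @cells == 6;
    }
    return @rows;
}

# Hypothetical helper: count how often each bad URL (second cell) appears.
sub count_bad_urls {
    my %count;
    $count{ $_->[1] }++ for @_;
    return %count;
}

# First row is the sample entry above; the second row is invented.
my $sample = <<'END';
<tr>
<td>20040307.1600</td>
<td>/wrongfile.html</td>
<td>&nbsp;</td>
<td>10.0.0.10</td>
<td>&nbsp;</td>
<td>Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/85 (KHTML, like Gecko) Safari/85.6</td>
</tr>
<tr>
<td>20040307.1605</td>
<td>/wrongfile.html</td>
<td>&nbsp;</td>
<td>10.0.0.11</td>
<td>&nbsp;</td>
<td>&nbsp;</td>
</tr>
END

my %count = count_bad_urls(parse_log($sample));
for my $url (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    print "$count{$url}\t$url\n";
}
```

To run it against the real log, read the contents of 404-log.txt into $sample instead of using the inline text.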

404-report.shtml

This file is the wrapper for the 404 error report. The .shtml file type makes the Apache server process it for special commands before it is sent out to you. As you can see, the report page defines a table with only one row, the column headers. The #include command then pulls in the rows that are collected in the log file.

404-report.shtml
<html>
<head>
<title>404 Error Report</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
</head>
<body bgcolor="#FFFFFF">
<table border="1" cellpadding="4" cellspacing="0" bordercolorlight="#6666FF" bordercolordark="#000066" bgcolor="#CCCCFF">
<tr bgcolor="#9999FF">
<td><b>Date &amp; Time</b></td>
<td><b>Bad URL</b></td>
<td><b>Linked From</b></td>
<td><b>Remote Host Address</b></td>
<td><b>Remote Host Name</b></td>
<td><b>User Agent</b></td>
</tr>
<!--#include virtual="404-log.txt" -->
</table>
</body>
</html>

To see the report, simply visit your site and type in the name of the 404 report file. It's probably not a good idea to make the name of your 404 report file public. Here's an example of what the report looks like:

Date & Time   | Bad URL         | Linked From | Remote Host Address | Remote Host Name | User Agent
20040307.1600 | /wrongfile.html |             | 10.0.0.10           |                  | Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-us) AppleWebKit/85 (KHTML, like Gecko) Safari/85.6

Each new entry adds a new row to the table.
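
One thing to watch: the report page includes the log verbatim, so whatever a visitor sends in the Referer or User-Agent header is rendered as HTML. A minimal sketch of escaping those values before they are logged might look like this. The helper names escape_html and referrer_cell are assumptions, not part of the handler as written, and the sketch does not check the URL scheme of the link.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: make a string safe to place inside HTML text
# or inside a double-quoted attribute value.
sub escape_html {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/&/&amp;/g;    # must run first, or it would re-escape the others
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: build the "Linked From" cell the way the
# handler does, but with the referrer escaped.
sub referrer_cell {
    my ($referrer) = @_;
    return '&nbsp;' unless defined $referrer && $referrer ne '';
    my $safe = escape_html($referrer);
    return "<a href=\"$safe\">$safe</a>";
}

print referrer_cell('http://example.com/?a=1&b=<script>'), "\n";
```

The same escape_html call could be applied to the requested URL and the user agent before they go into the log entry.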


404 error log. Revised March 7, 2004.
URL: http://www.sonic.net/~mroeder/404errorlog.html
Home Page: http://www.sonic.net/~mroeder/index.html