Grok the Web

A Programmer's Guide to the New Software Development Paradigm

by Andrew Schulman


Chapter 5

Stealing Cycles

Last revised: April 15, 1997


As suggested by the previous chapter, the web consists not only of hyperlinked documents, but also of processes which you can tap into. This chapter continues with this theme, but instead of turning forms into other forms, we dispense with forms completely, and turn them (where possible) into URLs: distributed computation in the form of hypertext links! This requires understanding <FORM METHOD=get> vs. <FORM METHOD=post>. Using what seem at first like some "stupid web tricks" (hooking up NetCraft to the URL-minder), this chapter will show how to pipe one CGI process into another. That you can do this suggests that web sites are really tools, or software components. So perhaps a better chapter title would be "The Tools Approach to the Web."
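One thing worth showing up front: with METHOD=get, the browser just tacks the form's fields onto the URL as a ?name=value&name=value query string (METHOD=post ships them in the request body instead). Here is a minimal sketch of that encoding -- my own illustrative helpers, not part of geturl.c -- where spaces become '+', other non-alphanumeric bytes become %XX, and pairs get joined with '&':

```c
/* formurl.c -- sketch (not part of geturl.c): how a METHOD=get form's
   fields end up in the URL. Spaces become '+', other non-alphanumeric
   bytes become %XX hex escapes, and pairs get joined with '&'. */
#include <stdio.h>
#include <string.h>
#include <ctype.h>

void urlencode(const char *s, char *out)
{
    for ( ; *s; s++)
    {
        if (isalnum((unsigned char) *s))
            *out++ = *s;
        else if (*s == ' ')
            *out++ = '+';
        else
            out += sprintf(out, "%%%02X", (unsigned char) *s);
    }
    *out = '\0';
}

/* append "name=value" onto the end of a query string */
void addfield(char *query, const char *name, const char *value)
{
    char enc[256];
    if (*query)
        strcat(query, "&");
    strcat(query, name);
    strcat(query, "=");
    urlencode(value, enc);
    strcat(query, enc);
}
```

So adding city=santa rosa and state=ca yields city=santa+rosa&state=ca, and a plain link to http://www.census.gov/cgi-bin/gazetteer?city=santa+rosa&state=ca is a form submission with no form in sight.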

As for stealing, see sidebar on legalities in chapter 3; need separate sidebar here? (see below)

This chapter includes GETURL, demonstrating point (Berners-Lee) that client is not necessarily a browser. HTTP so simple, all sorts of tools to do it. For example: geturl -stdin < test.txt | awk -f server.awk

geturl.c is now over 300 lines: does -post; handles numeric-IP URLs like http://012.345.678.901; handles URLs of form <A HREF="http://xxx"> so can do geturl http://xxx | grep HREF | geturl -stdin; also handles Location: redirection; etc. So also need to show simplest-possible version, geturl1.c. Maybe use ftp://ftp.ora.com/pub/examples/windows/win95.update/sd96/geturl.c

geturl -post tracknum=1Z742E220310270799 http://wwwapps.ups.com/tracking/tracking.cgi

Change geturl.c: socket = open_url(s); request(socket); read(socket); close(socket);

Need to fix redirection: if url doesn't contain "//" or "/" then is relative URL! (e.g., www.truevalue.com -> index.cgi).
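A sketch of that fix (with a hypothetical join_url helper; the real change belongs in geturl.c's Location: handling). If the target has no "//" it is relative: a leading "/" means host-relative, and anything else replaces the last path component:

```c
/* joinurl.c -- sketch of the relative-URL fix for redirection.
   join_url is a hypothetical helper name, not existing geturl.c code. */
#include <stdio.h>
#include <string.h>

void join_url(const char *base_host, const char *base_path,
              const char *loc, char *out)
{
    const char *slash;
    if (strstr(loc, "//"))                      /* already absolute */
        { strcpy(out, loc); return; }
    if (*loc == '/')                            /* host-relative */
        { sprintf(out, "http://%s%s", base_host, loc); return; }
    slash = strrchr(base_path, '/');            /* path-relative: replace last component */
    sprintf(out, "http://%s%.*s/%s", base_host,
        slash ? (int) (slash - base_path) : 0, base_path, loc);
}
```

So a Location: of just index.cgi from www.truevalue.com/ comes back as http://www.truevalue.com/index.cgi.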

Need to support :port in URLs?
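Probably yes. A sketch of the change urlsplit() would need -- peel an optional :port off the hostname, defaulting to 80:

```c
/* hostport.c -- sketch: peel an optional :port off a hostname, the way
   urlsplit() would need to. Modifies host in place; defaults to 80. */
#include <stdlib.h>
#include <string.h>

int split_port(char *host)
{
    char *colon = strchr(host, ':');
    if (! colon)
        return 80;              /* HTTP_PORT */
    *colon = '\0';              /* seal off the hostname */
    return atoi(colon + 1);
}
```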

Add -links and -tag name,name,name switches to geturl: links dumps out all links (HREF, IMG, FRAME, etc.); tag dumps out specified tags. Need to improve -split.
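For -links, the scan could look something like this (a deliberately naive sketch: only the upper-case, double-quoted HREF="..." form; IMG SRC, FRAME SRC, case-insensitivity, etc. would need the same loop with other markers):

```c
/* links.c -- sketch for the -links switch: dump whatever sits inside
   HREF="..." attributes, via a caller-supplied emit callback. */
#include <string.h>

int find_links(const char *html, void (*emit)(const char *))
{
    static char link[512];
    const char *s = html, *end;
    int n = 0;
    while ((s = strstr(s, "HREF=\"")) != 0)
    {
        s += 6;                             /* skip past HREF=" */
        if (! (end = strchr(s, '\"')))
            break;                          /* unterminated attribute */
        if (end - s < (int) sizeof(link))
        {
            memcpy(link, s, (size_t) (end - s));
            link[end - s] = '\0';
            if (emit)
                emit(link);
            n++;
        }
        s = end + 1;
    }
    return n;
}
```

Handing puts() in as the emit callback gives exactly the geturl http://xxx | grep HREF style output, without the pipe.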

Did a simpler version of geturl.c, called http.c:

// http.c -- about 100 lines of code to get http data
// but adding redirection support adds another 65 lines (mostly for urlsplit())!
// change http to call urlsplit, and recursively call http -- take full URL

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <io.h>
#include "winsock.h"

#define WINSOCK_VERSION 0x0101
#define HTTP_PORT 80

#define GET       1
#define HEAD      2
#define POST      3

int http(int request, char *host, char *pathname, char *post_data, char *buffer, int bufsize)
{
    static int did_startup = 0;
    static WSADATA wsaData;
    static char *requ_str = (char *) 0;
    LPHOSTENT pHostEnt;
    SOCKADDR_IN sockAddr;
    int len, num_recv;
    struct in_addr *addr;
    u_char b[4];
    SOCKET sock = INVALID_SOCKET;

    if (! did_startup)
    {
        if (WSAStartup(WINSOCK_VERSION, &wsaData))
            return 0;
        did_startup = 1;
    }
    if (! requ_str)
        if (! (requ_str = malloc(2048)))
            return 0;
    
    sockAddr.sin_family = AF_INET;
    sockAddr.sin_port = htons(HTTP_PORT);
    if (isdigit(*host))
    {
        // already have IP address; don't need DNS lookup
        // (sscanf %u needs int-sized targets, so go through temporaries)
        unsigned b0, b1, b2, b3;
        sscanf(host, "%u.%u.%u.%u", &b0, &b1, &b2, &b3);
        b[0] = (u_char) b0; b[1] = (u_char) b1;
        b[2] = (u_char) b2; b[3] = (u_char) b3;
        sockAddr.sin_addr.s_addr = *((u_long *) b);
    }
    else
    {
        if (!(pHostEnt = gethostbyname(host)))
            return 0;
        sockAddr.sin_addr = *((LPIN_ADDR)*pHostEnt->h_addr_list);
    }

    if ((sock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) == INVALID_SOCKET)
        return 0;
    
    if (connect(sock, (LPSOCKADDR)&sockAddr, sizeof(sockAddr)) != 0)
        return 0;

    // note: HTTP wants CRLF line endings; many servers tolerate bare \n,
    // but send the real thing
    switch (request)
    {
        case GET:
            sprintf(requ_str, "GET %s HTTP/1.0\r\n\r\n", pathname);
            break;
        case HEAD:
            sprintf(requ_str, "HEAD %s HTTP/1.0\r\n\r\n", pathname);
            break;
        case POST:
            if ((! post_data) || (! *post_data))
                return 0;
            sprintf(requ_str, 
                "POST %s HTTP/1.0\r\n"
                "Content-type: application/x-www-form-urlencoded\r\n"
                "Content-length: %u\r\n\r\n"
                "%s",   // no trailing newline: Content-length doesn't count one
                pathname, strlen(post_data), post_data);
            break;
        default:
            return 0;
    }
            
    if (send(sock, requ_str, strlen(requ_str), 0) == SOCKET_ERROR)
        return 0;

    // always leave room to null-terminate: redirect() runs strstr() on this buffer
    for (num_recv=0; bufsize>1; num_recv += len, buffer += len, bufsize -= len)
    {
        if ((len = recv(sock, buffer, bufsize-1, 0)) <= 0)
            break;
        // if run out of buffer, should return some sort of error!
    }
    *buffer = '\0';

    closesocket(sock);

    return num_recv;
}

// should first check for "302 Redirection"!!
// should only do inside header, before first "\n\n"!!

static char *location_str = "Location:" ;
static int location_len = 0;

char *redirect(char *buf, int len)
{
    static char *url = (char *) 0;
    char *s, *s2;

    if (! len)
        return (char *) 0;

    if (! url)
        if (! (url = malloc(512)))
            return (char *) 0;
    if (! location_len)
        location_len = strlen(location_str);

    // whoops, not guaranteed to have a null-terminated string here; use len!!
    
    if (s = strstr(buf, location_str))
    {
        if (s2 = strchr(s, '\r'))
            *s2 = '\0'; // seal off line
        if (s2 = strchr(s, '\n'))
            *s2 = '\0'; // seal off line
        strncpy(url, s+location_len+1, 511);
        url[511] = '\0';    // strncpy doesn't null-terminate when it truncates
        return url;
    }
    return (char *) 0;
}

#define NAME_SIZE   512

// much simpler than one in geturl.c
int urlsplit(char *url, char *hostname, char *pathname)
{
    static char *buf = (char *) 0;
    
    char *h, *p, *s;
    int protocol = HTTP_PORT;

    if (! buf)
        if (! (buf = malloc(2048)))
            return 0;
    if (! location_len)
        location_len = strlen(location_str);

    while (isspace(*url))
        url++;
    
    if (strstr(url, "//"))
    {
        if (strncmp(url, "http://", 7) == 0)
            { url += 7; protocol = HTTP_PORT; }
        else
            return 0;
    }
    
    h = url;
    p = url;
    while (*p && (*p != '/')) 
        p++;
    strncpy(pathname, *p ? p : "/", NAME_SIZE);
    *p = 0;
    strncpy(hostname, h, NAME_SIZE);
    return protocol;
}

void fail(const char *s) { puts(s); exit(1); }

int main(int argc, char *argv[])
{
    char buf[20480];
    char host[NAME_SIZE], path[NAME_SIZE];
    int num;
    char *url = (argc < 2) ? "http://www.microsoft.com" : argv[1];

    puts(url);
    if (urlsplit(url, host, path ) != HTTP_PORT)
        fail("Sorry, can't do!");

    num = http(GET, host, path, 0, buf, 20480);

#if 0
redirection loop for, e.g., www.microsoft.com:
    http://www.microsoft.com
    http://msid.msn.com/mps_id_sharing/redirect.asp?www.microsoft.com/default.asp
    http://www.microsoft.com/default.asp?NewGuid=11f377aeb4fb11d0889f08002bb74f65
    http://www.microsoft.com/default.asp?MSID=11f377aeb4fb11d0889f08002bb74f65
#endif

    while (url = redirect(buf, num)) // possible multiple redirections
    {
        puts(url);
        if (urlsplit(url, host, path) == HTTP_PORT)
            num = http(GET, host, path, 0, buf, 20480);
        else
            num = 0;
    }
    puts( num ? buf : "Failed!" );
}
Does geturl properly repost after redirection? Having trouble finding an example! Discussed at http://lists.w3.org/Archives/Public/www-talk/msg01692.html: HTTP 1.0 draft says "If the 302 status code is received in response to a request using the POST method, the user agent must not automatically redirect the request unless it can be confirmed by the user, since this might change the conditions under which the request was issued."
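In other words, before chasing a Location: header, geturl should parse the status line, and should refuse to auto-redirect a POST. A sketch of the missing check (hypothetical helper names; look_for_reloc() would call may_redirect() before following the Location):

```c
/* status.c -- sketch of the missing check: parse the status code off the
   response's first line, and only auto-follow a redirect for GET/HEAD,
   never for POST (per the HTTP/1.0 draft language quoted above). */
#include <stdio.h>

int http_status(const char *buf)
{
    int code;
    /* status line looks like: HTTP/1.0 302 Moved Temporarily */
    if (sscanf(buf, "HTTP/%*s %d", &code) != 1)
        return 0;               /* no status line at all */
    return code;
}

int may_redirect(const char *buf, int was_post)
{
    int code = http_status(buf);
    return (code == 301 || code == 302) && ! was_post;
}
```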

usage: geturl [options] <http://whatever or -stdin>
  options:
  -noloc : don't do HTTP relocations (default on)
  -base <addr> : use addr as base for all relative URLs
  -head : do HTTP HEAD (default GET)
  -post <data> : do HTTP POST of data
  -input <file> : get all HTTP headers from file
  -stdin : get URLs from stdin
  -split : break HTML output into lines on tags
// GETURL.C -- Win32 console app -- rename gethttp, allow w/o http://
// Andrew Schulman, February 1997
// andrew@ora.com
// cl geturl.c wsock32.lib
// geturl [-head] [-split] [-input file] [-post data] <http://whatever or -stdin>
// also handles URLs of form <A HREF="http://xxxx">, converts to http://xxxx

// TODO:
// -- accept filename on cmdline to save to (e.g., for GIF)
// -- doesn't work for ftp:// yet!
// -- https:// doesn't really work, obviously!
// -- support mailto: to show SMTP vs. MAPI?
// -- make non-Windows version (no WSAStartup, winsock.h, etc.)
// -- support Connection: Keep-Alive

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <ctype.h>
#include <io.h>
#include "winsock.h"

void fail(const char *s) { puts(s); exit(1); }

#define msg(s)  { printf("FAIL: %s\n", s); return 0; }

char *nomem = "insufficient memory";

#define BUFFER_SIZE  20480
#define WINSOCK_VERSION 0x0101
#define NO_FLAGS 0
#define HTTP_PORT 80
#define HTTPS_PORT 443
#define FTP_PORT 21

// options
int do_head = 0, do_manual_input = 0, do_post = 0, do_split = 0, do_base = 0;
int do_loc = 1, do_verbose = 1;
char *input_file, *post_data, *base;

SOCKET ConnectWebServerSocket(char *host, int port)
{
    static int did_startup = 0;
    static WSADATA wsaData;
    LPHOSTENT pHostEnt;
    SOCKADDR_IN sockAddr;
    struct in_addr *addr;
    u_char b[4];
    SOCKET sock = INVALID_SOCKET;

    if (! did_startup)
    {
        if (WSAStartup(WINSOCK_VERSION, &wsaData))
            msg("WSAStartup");
        did_startup = 1;
    }
    
    sockAddr.sin_family = AF_INET;
    sockAddr.sin_port = htons(port);
    if (isdigit(*host))
    {
        // already have IP address; don't need DNS lookup
        // (sscanf %u needs int-sized targets, so go through temporaries)
        unsigned b0, b1, b2, b3;
        sscanf(host, "%u.%u.%u.%u", &b0, &b1, &b2, &b3);
        b[0] = (u_char) b0; b[1] = (u_char) b1;
        b[2] = (u_char) b2; b[3] = (u_char) b3;
        sockAddr.sin_addr.s_addr = *((u_long *) b);
    }
    else
    {
        if (!(pHostEnt = gethostbyname(host)))
            msg("gethostbyname");
        sockAddr.sin_addr = *((LPIN_ADDR)*pHostEnt->h_addr_list);
    }

    // TODO: since gethostbyname is expensive (DNS), see if host is same as last time?
    
    // TODO: SO_USELOOPBACK for http://localhost, "local CGI", etc.
    // TODO: maybe use SO_KEEPALIVE?
    if ((sock = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)) == INVALID_SOCKET)
        msg("socket");
    
    if (do_verbose)
    {
        addr = (struct in_addr *) &sockAddr.sin_addr;
        memcpy(b, (u_char *) &addr->S_un.S_un_b, sizeof(unsigned long));
        printf("Connecting to %u.%u.%u.%u\n", b[0], b[1], b[2], b[3]);
    }

    if (connect(sock, (LPSOCKADDR)&sockAddr, sizeof(sockAddr)) != 0)
        msg("connect");
    
    return sock;
}

int SendWebQuery(SOCKET sock, char * szQuery)
{
    char *request;

    if (! (request = malloc(2048)))
        msg(nomem);
    if (do_manual_input)
    {
        char line[80];
        FILE *f = fopen(input_file, "r");
        if (! f)    // fopen returns NULL on failure, not -1
            msg("Can't open input file");
        request[0] = 0;
        while (fgets(line, 80, f))
        {
            if (strlen(request)+strlen(line) >= 2048)
                msg("Sorry, 2k limit");
            strcat(request, line[0] ? line : "\n");
        }
        fclose(f);
    }
    else if (do_post)
    {
        sprintf(request, 
            "POST %s HTTP/1.0\r\n"              // CRLF per the HTTP spec
            "Content-type: application/x-www-form-urlencoded\r\n"
            "Content-length: %u\r\n\r\n"
            "%s",   // no trailing newline: Content-length doesn't count one
            szQuery, strlen(post_data), post_data);
    }
    else
    {
        sprintf(request, "%s %s HTTP/1.0\r\n\r\n", 
            (do_head ? "HEAD" : "GET"), szQuery);
    }
    if (send(sock, request, strlen(request), NO_FLAGS) == 
        SOCKET_ERROR)
        msg("send");
    free(request);
    return 0;
}

int get_url(char *url);

static char *location_str = "Location:" ;
static int location_len = 0;

// should first check for "302 Redirection"!!
// should only do inside header, before first "\n\n"!!
int look_for_reloc(char *buf, int len)
{
    char *s, *s2;
    if (! location_len) location_len = strlen(location_str);

    if (s = strstr(buf, location_str))
    {
        char *url;
        if (! (url = malloc(512)))
            msg(nomem);
        if (s2 = strchr(s, '\r'))
            *s2 = '\0'; // seal off line
        if (s2 = strchr(s, '\n'))
            *s2 = '\0'; // seal off line
        strncpy(url, s, 511);
        url[511] = '\0';    // strncpy doesn't null-terminate when it truncates
        puts(url);
        if (get_url(url))
        {
            free(url);
            return 1;
        }
        else
            puts("--- Couldn't get redirected URL ---");
        free(url);
    }
    return 0;
}
                                        
UINT RecvWebFile(SOCKET sock, char *buf)
{
    int len;
    if ((len = recv(sock, buf, BUFFER_SIZE-1, NO_FLAGS)) == SOCKET_ERROR)
        msg("recv");
    buf[len] = '\0';    // look_for_reloc does strstr, which needs a terminated buffer
    if (do_loc && look_for_reloc(buf, len))
        return 0;
    return len;
}

int get_file(SOCKET sock, char *pathname);

// confusing name, because also have -split option!
int split(char *url, char *hostname, char *pathname)
{
    char *buf, *h, *f, *s;
    int protocol = HTTP_PORT;

    if (! location_len) location_len = strlen(location_str);

    if (! (buf = malloc(2048)))
        msg(nomem);

    // can do redirection for now with geturl | grep Location: | geturl -stdin
    if (strncmp(url, location_str, location_len) == 0)
        url += location_len;
    
    // if quotes in URL, then yank out stuff within quotes
    // to handle <A HREF="xxxx"> lines gracefully
    if (strchr(url, '\"'))
    {
        strcpy(buf, url);
        url = buf;
        while (*url != '\"') url++; url++;
        if (s = strchr(url, '\"')) 
            *s = '\0';   // block off any close quote
    }
    
    while (isspace(*url))
        url++;
    
    if (do_base && (! strstr(url, "//")))
    {
        static char *buf = (char *) 0;
        if (! buf)
            if (! (buf = malloc(2048)))
                msg(nomem);
        strcpy(buf, base);
        strcat(buf, url);
        url = buf;
    }

    if (strstr(url, "//"))
    {
        if (strncmp(url, "http://", 7) == 0)
            { url += 7; protocol = HTTP_PORT; }
        else if (strncmp(url, "https://", 8) == 0)
            { url += 8; protocol = HTTPS_PORT; }
        else if (strncmp(url, "ftp://", 6) == 0)
            { url += 6; protocol = FTP_PORT; }
        else
            { protocol = 0; msg("protocol"); }
    }
    
    h = url;
    f = url;
    while (*f && (*f != '/')) 
        f++;
    strcpy(pathname, *f ? f : "/");
    *f = 0;
    strcpy(hostname, h);
    free(buf);
    return protocol;
}

void display(char *buf, int recv)
{
    if (do_split)
    {
        char *s;
        int i;
        for (i=recv, s=buf; i--; s++)
            switch (*s)
            {
                case '<' : putchar('\n'); putchar('<'); break;
                case '>' : putchar('>'); putchar('\n'); break;
                default : putchar(*s);
            }
    }
    else
        fwrite(buf, recv, 1, stdout);
}

int get_file(SOCKET sock, char *pathname)
{
    char *buf;
    UINT recv;
    char *s;

    if (! (buf = malloc(BUFFER_SIZE)))
        msg(nomem);
    if (SendWebQuery(sock, pathname) != 0)
        msg("SendWebQuery");
    while (((recv = RecvWebFile(sock, buf)) != 0))
        display(buf, recv);
    free(buf);
    putchar('\n');
    return 1;
}

int get_url(char *url)
{
    char hostname[256], pathname[512];
    SOCKET sock;
    int protocol;

    if (! (protocol = split(url, hostname, pathname)))
        return 0;

    // printf("get_url: [%u,%s,%s]\n", protocol, hostname, pathname);

    // msg() makes every failure path in ConnectWebServerSocket return 0,
    // so test for that, not INVALID_SOCKET
    if (! (sock = ConnectWebServerSocket(hostname, protocol)))
        msg("ConnectWebServerSocket");
    get_file(sock, pathname);
    closesocket(sock);
    
    return 1;
}

char *usage = "usage: geturl [options] <http://whatever or -stdin>\n"
    "  options:\n"
    "  -noloc : don't do HTTP relocations (default on)\n"
    "  -base <addr> : use addr as base for all relative URLs\n"
    "  -head : do HTTP HEAD (default GET)\n"
    "  -post <data> : do HTTP POST of data\n"
    "  -input <file> : get all HTTP headers from file\n"
    "  -stdin : get URLs from stdin\n"
    "  -split : break HTML output into lines on tags\n";
        
int main(int argc, char *argv[])
{
    int i;
    int do_stdin = 0;
    if (argc < 2)
        fail(usage);
    for (i=1; i<argc; i++)
    {
        if ((argv[i][0] == '-') || (argv[i][0] == '/'))
        {
            char *option = strupr(&argv[i][1]);
            #define OPTION(x) (strcmp(option, x) == 0)
            if (OPTION("SPLIT"))        do_split = 1;
            else if (OPTION("STDIN"))   do_stdin = 1;
            else if (OPTION("HEAD"))    do_head = 1;
            else if (OPTION("NOLOC"))   do_loc = 0;
            else if (OPTION("INPUT")) { do_manual_input = 1; 
                                        input_file = argv[++i]; }
            else if (OPTION("POST"))  { do_post = 1;
                                        post_data = argv[++i]; }
            else if (OPTION("BASE"))  { do_base = 1;
                                        base = argv[++i]; }
            else                        fail(usage);
        }
        else
        {
            puts(argv[i]);
            if (! get_url(argv[i]))
                puts("--- Couldn't get URL ---");
        }
    }
    
    if (do_stdin)
    {
        char buf[1024];
        while (fgets(buf, sizeof(buf), stdin))   // gets() has no bounds check
        {
            buf[strcspn(buf, "\r\n")] = '\0';    // trim fgets's trailing newline
            if (! buf[0])
                continue;
            puts(buf);
            if (! get_url(buf))
                puts("--- Couldn't get URL ---");
        }
    }
}
Need Unix version of geturl, without WSAStartup, winsock.h, etc.

Also note Rob Adams's URL utility from Phar Lap: "Richard just put your GETURL.EXE utility on our server. I thought you might be interested in a similar program that I have been using since May '94. I am going to add Cookie support, but I am going to wait until I need it for some reason. I use this script for:

fetch Dilbert once-per-day
fetch news stories once-per-day and mail them to my netscape mail reader
Get 1.0 headers to see what server is running.
Get stock quotes on demand.
In short, I use it the same way that you use GETURL -- I run this program and feed its output thru a munger of some kind.... This script runs on SunOS 4.1 and Windows 95, with Perl5. Here is the usage message:
Usage: bin/url [options] URL
Options:  -u user               http 1.0 user authentication
          -p pass               http 1.0 user password
          -1                    use http 1.0
          -h                    only fetch http 1.0 headers
          -t type               If-Modified-Since 'today'
                                type one of 'rfc1123', 'rfc850', 'asctime'
          -v                    verbose
          -H                    fix broken HEAD
          -r URL                Add 'Referer' line to 1.0 header
URL:      http://site.domain.name/path/path/path
          head://site/path      like http, but only get headers
          gopher://site/path    use gopher instead of http
          finger://site/user    use finger
          time://site           get time-of-day
See \books\controlweb\url.pl. Mostly options handling. Main part is: getservbyname, gethostbyname, socket, bind, connect, then read via perl "while(<S>)".

$AF_INET = 2;
$SOCK_STREAM = 1;
$IPPROTO_TCP = 6;
$crlf = "\r\n";

($name,$aliases,$port) = getservbyname($port,'tcp')
	unless $port =~ /^\d+$/;
($name,$aliases,$type,$len,$thataddr) = gethostbyname($them);

$sockaddr = 'S n a4 x8';
$this = pack($sockaddr, $AF_INET, 0, "\0\0\0\0");
$that = pack($sockaddr, $AF_INET, $port, $thataddr);

$thataddr || die "Host \"$them\" could not be resolved";

socket(S, $AF_INET, $SOCK_STREAM, $IPPROTO_TCP) || die $!;
bind(S, $this) || die $!;
connect(S,$that) || die $!;

select(S); $| = 1;
print $msg;
select(STDOUT);
print $msg if $opt_v;

while(<S>)
{
	last if $xport eq "gopher" && /^\.\r?\n?$/;
	last if $opt_H && /^\r?\n?$/;
	if ( /^\.\r?\n?$/ && $xport eq "gopher" )
	{
		next;
	}
	print;
}
Also note perl version of geturl on web: http://www.kluge.net/NES/geturl/

Also note Java version of geturl: ftp://ftp.ora.com/pub/examples/windows/win95.update/sd96/GetURL.java

All this demonstrates the utter simplicity of HTTP. Meanwhile MS is trying to get people to write to its wrappers upon wrappers upon wrappers on top of HTTP (see article on URL monikers and OLE hyperlinks) -- in effect an attempt to mystify HTTP.

Except note one important point: geturl becoming more complex. Yes, basics of HTTP incredibly simple. But add support for Location:, cookies, Connection: Keep-Alive, relative URLs with -base option, etc. Do the MS wrappers support all this stuff?

Sample use of geturl: avcount. Assuming at least that AV numbers remain stable. Uses AltaVista count option (have to explain AV URL interface somewhere). Basically, AVCOUNT just combines GETURL with a PC version of grep (the -split option to GETURL makes it easy to find lines with grep) to get count of documents on web (known to AV) that match criteria. Can use to get rough idea of use of Java, plug-ins, and ActiveX:

 
C:\>type avcount.bat
geturl -split http://www.altavista.digital.com/cgi-bin/query?pg=aq&
what=web&fmt=n&q=%1 | grep "documents match the query"

Some sample queries:						RMS 4/15/97:

C:>avcount applet:*
141717 documents match the query.			800,000

C:>avcount embed:*
35487 documents match the query.			 30,000			

C:>avcount object:*
9990 documents match the query.				 60,000
Another batch file that can easily be built on top of geturl.exe is whatserv.bat:
C:\>type whatserv.bat
geturl -head http://%1 | grep Server:

C:\>whatserv www.ora.com
Server: WN/1.15.1

C:\>whatserv software.ora.com
Server: WebSitePro/1.1h

C:\>whatserv www.microsoft.com
Server: Microsoft-IIS/3.0

C:\>whatserv www.netscape.com
Server: Netscape-Enterprise/2.01
{{Tim: "a wealth of small tools (a la UNIX) that can be strung together using what amounts to simple scripting".}}

{{At this point, a lot of notes from here on look incoherent, or repeat material covered in chapters 1 and 2?? Also note that piping results of Netcraft into URL-minder doesn't seem to work!! Instead, try doing my own whatserv.pl, and point URL-minder at that. Save previous in history whenever change.}}

Pointing URL-minder at Netcraft didn't work: why? Contact URL-minder about false positives

http://www.netmind.com/URL-minder/faq.html deals with this question: "Q: The URL-minder keeps sending me e-mail that says that one of my registered pages has changed, but when I look at the page with my Web browser it seems the same. What gives? A: There are several possible explanations. Look on the page in question for an access counter that tells the number of visitors the page has had, or something minor that might be changing on a regular basis. If the page has an access counter, you might consider sending mail to the person who maintains the page and asking them to use the special tags provided for just that scenario. Still another is that the underlying HTML for the page has changed, even though the appearance of the page in your browser has not (this sometimes happens when a link in the page is changed because the resource it points to has moved)." Special tags described at http://www.netmind.com/URL-minder/controlled.html: <URL-MINDER-IGNORE>. Try to get NetCraft to adopt this! Also see http://www.netmind.com/URL-minder/example.html with hidden input types. New form at http://www.netmind.com/URL-minder/new/advanced.html.

See rainer1.html, netcraft1.html -- compare when get URL-minder notification to see what if anything has changed!

Why netcraft1.html, netcraft2.html so different? Need webgrep.pl, point URL-minder at that. Write in perl. With url=url and pat=pattern.

Better example: Program that extracts latitude/longitude from name server at http://www.census.gov/cgi-bin/gazetteer (URL such as http://www.census.gov/cgi-bin/gazetteer?zip=95404, or http://www.census.gov/cgi-bin/gazetteer?city=santa+rosa&state=ca), creates URL from it to drive Xerox Map Viewer.
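A sketch of the glue, with the assumptions loudly flagged: I haven't pinned down the gazetteer's exact output line or the Map Viewer's URL syntax yet, so both the "Location:" scan and the map URL below are hypothetical stand-ins, to be replaced with the real formats:

```c
/* gaz2map.c -- sketch of piping one CGI into another. ASSUMPTIONS:
   the "Location: <lat> <lon>" line and the map-viewer URL shape below
   are hypothetical placeholders, not the services' real interfaces --
   check the actual gazetteer output and Map Viewer docs first. */
#include <stdio.h>
#include <string.h>

int gaz_to_map(const char *page, char *url)
{
    double lat, lon;
    const char *s = strstr(page, "Location:");
    if (! s || sscanf(s, "Location: %lf %lf", &lat, &lon) != 2)
        return 0;
    sprintf(url, "http://mapviewer.example.com/map?lat=%.2f&lon=%.2f",
        lat, lon);
    return 1;
}
```

Feed the resulting URL back through geturl and the pipe is complete: one CGI's output has become another CGI's input.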

This chapter will contain in part material about looking at sites with NetCraft, then employing URL-minder to receive notices of changes. At first it seems like "stupid web tricks," but the point is how multiple processes on the web can sometimes be combined, in much the same way that the output from one Unix program can be piped into another. Web Pipes, really. Problem: "bit rot", brittleness. Besides the NetCraft/URL-minder combo, other examples include DejaNews/URL-minder. Hmm, I need to locate some other CGIs on the web, besides URL-minder, whose input is a URL. HTML lints, of course, but what else? AltaVista "link:" and "host:" options! This is a good example, because it will provide an opportunity to discuss the role of AltaVista, how its database works, non-obvious uses, etc. Always remind the reader that this is a process they're running on another machine: distributed computation!

CGI processes: I say "processes" rather than "programs" to emphasize that these things are running today, right now, on machines across the planet, and you can employ them, without having to write one yourself. You do have to figure out what their "API" is, though (see Jon Udell articles on implicit CGI APIs). And this API can change out from under you (so what else is new?).

Most books on CGI assume the reader is going to write their own. Not here: emphasis is on employing existing CGIs in your own HTML.

This chapter should also discuss CGI processes whose output is an embeddable image: Web-Counter, US Naval Observatory time, etc. Not pipe-able, but embeddable.

(Use some of Ron Petrusha's excellent "CGI for Non-Programmers" chapter from Win-CGI book here.)

(URL-minder problems: does checksum, doesn't know about actual contents of page, so get "false positives": says change when not a significant one.)
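The false positives make more sense once you picture the mechanism. URL-minder's actual algorithm isn't published, but checksum-based change detection presumably has roughly this shape (my guess, not their code):

```c
/* cksum.c -- a guess at the general shape of URL-minder-style change
   detection: checksum the raw HTML, report a change when the sum
   differs. Any differing byte -- an access counter, a rotating ad --
   flips the sum, which is exactly where false positives come from. */
#include <stddef.h>

unsigned long page_checksum(const char *page, size_t len)
{
    unsigned long sum = 0;
    size_t i;
    for (i = 0; i < len; i++)
        sum = sum * 31 + (unsigned char) page[i];   /* order-sensitive hash */
    return sum;
}

int page_changed(const char *old_page, size_t old_len,
                 const char *new_page, size_t new_len)
{
    return page_checksum(old_page, old_len) != page_checksum(new_page, new_len);
}
```

A sum over raw bytes knows nothing about which bytes matter, which is why the special ignore tags have to exist at all.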

"A Hands-On Guide" as a subtitle?

See http://www.ups.com on the side of a bus, go to the site, check out Package Tracking at http://www.ups.com/tracking/tracking.html, enter a package number (1Z 742 E22 03 3009 8519), get back results in http://wwwapps.ups.com/tracking/tracking.cgi. Notice the ".cgi" in that URL. Here's a very practical app on the web, and they're using CGI. Now consider NetCraft's server-survey page, http://www.netcraft.com/cgi-bin/Survey/whats (we won't have to keep going through this page, as we'll see). View Document Source and look at how the form is created:

<form action=whats method=GET>
<strong>Host</strong>: <input name=host> <strong>Port</strong>:<input name=port value=80 size=3>  <input type=submit value="Examine"> <input type=reset>
After changing the action to refer explicitly to the NetCraft site, we could put this form on a different page, accessing the same CGI program running at NetCraft. In other words, forms and CGIs are not tied to each other. This has huge implications..., etc., etc.

Anyway, notice that answer comes back with this URL:

http://www.netcraft.com/cgi-bin/Survey/whats?host=www.ups.com&port=80

Ok, this means we can bypass the NetCraft form and just create URLs for any site we want to check. Try it with the other UPS site:

http://www.netcraft.com/cgi-bin/Survey/whats?host=wwwapps.ups.com&port=80

The answer for this machine is "wwwapps.ups.com is running Netscape-Communications/1.1."

For that matter, what is the NetCraft site running?

http://www.netcraft.com/cgi-bin/Survey/whats?host=www.netcraft.com&port=80

"www.netcraft.com is running NCSA/1.5."

Now, what if we wanted to find out when UPS changes to a different server? There's another site on the web, URL-minder, which lets you know when any arbitrary page has been changed:

http://www.netmind.com/URL-minder/new/register.html

Its form looks like this:

<FORM METHOD="GET" ACTION="http://www.netmind.com/cgi-bin/uncgi/minder-register">
<P>Enter or paste in a URL (Web page address):<BR>
<INPUT TYPE=TEXT SIZE=78 NAME=url><BR>
Your Internet e-mail address:<BR>
<INPUT TYPE=TEXT SIZE=40 NAME="required-email">
<INPUT TYPE=SUBMIT VALUE=" Register this URL "> 
</P></FORM>
Hmm, yet another CGI. There's a pattern here. Oh yeah, what's this URL-minder machine running?

http://www.netcraft.com/cgi-bin/Survey/whats?host=www.netmind.com&port=80

"www.netmind.com is running Apache/1.1.1"

Interesting, we've used one CGI program running on one type of server to look at two other servers.

Anyway, we can use the URL-minder to find out when any machine changes its server type. Remember we said earlier that URL-minder lets you know when any page has been changed. That really means when the HTML associated with some URL is different from the HTML associated with the same URL the last time they checked. So this works on CGI programs too. (Hmm, this is confusing. Need to explain it slower.)

Just go to http://www.netmind.com/URL-minder/new/register.html, and as the URL enter: http://www.netcraft.com/cgi-bin/Survey/whats?host=www.ups.com&port=80

Look carefully at what we've done here. We're using one CGI program, running on one machine using one type of server, to periodically examine the output from another CGI program, running on a completely different machine using a different type of server. Distributed computation! Set in motion with a silly-looking 5-line HTML form. Involving on the back-end some perl scripts, most likely. This stuff was supposed to be hard.

{{sidebar?}}

{{end possible sidebar?}}

You can have a list of all your registered URLs sent to you via email, with a URL such as the following. Note how simply entering a URL causes email to be sent to you:

http://www.netmind.com/cgi-bin/uncgi/minder-list?required-email=andrew@ora.com

(Why send by email, not display? Think about it...)

Where are these machines located? What more can we find out about them? There's a cool utility, traceroute, which will tell you this sort of thing. There are even web-based versions of traceroute, though these show the route from the machine hosting traceroute, not from your machine. (Confusion; explain.)

http://www.yahoo.com/Computers_and_Internet/Software/
Communications_and_Networking/Networking/Utilities/Traceroute/

(Logical organization of Yahoo URLs, so more generally can go to http://www.yahoo.com/Computers_and_Internet/Software/
Communications_and_Networking/Networking/Utilities/)

http://hplyot.obspm.fr/cgi-bin/nph-traceroute knows my machine name! Explain how it knows this.

http://hplyot.obspm.fr/cgi-bin/nph-traceroute?pc234.west.ora.com

Type in the name of a machine (or follow here for a trace to your place (pc234.west.ora.com)) you want to trace from Observatoire de Paris Meudon (France):

Examine output; explain particularly crossing ocean!

 8  raspail-ip.eurogate.net (194.206.207.18)  7 ms  13 ms  8 ms
 9  Reston.eurogate.net (194.206.207.5)  148 ms  136 ms  137 ms
10  gsl-sl-dc-fddi.gsl.net (204.59.144.198)  137 ms  145 ms *
11  sl-dc-8-F/T.sprintlink.net (198.67.0.8)  140 ms  150 ms  139 ms
12  * sl-mae-e-H2/0-T3.sprintlink.net (144.228.10.42)  138 ms *
13  maeeast-2.bbnplanet.net (192.41.177.2)  351 ms  147 ms  133 ms
14  washdc1-br2.bbnplanet.net (4.0.1.245)  143 ms *  136 ms
SprintLink?

http://www.beach.net/traceroute.html

<form method="post" action="/cgi-bin/nph-traceroute">
Enter the host or IP address for traceroute:<br>
<input name="hostname" size="20">
<input type="submit" value=" Start Trace ">
</form>
Answer comes back in http://www.beach.net/cgi-bin/nph-traceroute, looks like:

Route from www.Beach.Net to www.netcraft.com

 1  irx1.DPCSYS.COM (207.124.154.1)  1.112 ms  1.036 ms  1.007 ms
 2  s1.cisco.snni.com (165.113.229.225)  50.143 ms  20.214 ms  21.774 ms
 3  gateway-sfo1.atm.us.crl.net (165.113.56.30)  98.233 ms  99.518 ms  96.158 ms
 4  border3-hssi1-0.SanFrancisco.mci.net (149.20.64.9)  104.410 ms  102.635 ms  122.241 ms
 5  core1-fddi-0.SanFrancisco.mci.net (204.70.2.161)  110.787 ms  117.600 ms  104.914 ms
 6  bordercore4-hssi0-0-gw.WestOrange.mci.net (166.48.11.250)  335.815 ms  375.252 ms  433.819 ms
 7  * * bordercore4-hssi0-0-gw.WestOrange.mci.net (166.48.11.250)  453.337 ms
 8  bordercore4-hssi0-0-gw.WestOrange.mci.net (166.48.11.250)  543.479 ms  395.386 ms  594.489 ms
 9  194.72.24.158 (194.72.24.158)  673.688 ms  409.967 ms  332.418 ms
10  telehouse-transit-e3-2.ukcore.bt.net (194.72.27.158)  364.822 ms  451.895 ms  442.639 ms
11  londonc-smds-f1-0.ukcore.bt.net (194.72.7.12)  518.666 ms  434.343 ms *
12  * londonc-access1-e1.ukcore.bt.net (194.72.4.50)  447.981 ms  894.712 ms
13  netcraft.customer.bt.net (194.72.10.26)  926.771 ms !H *  542.796 ms !H
This shows more than the route from one machine to another (a route that can change at any moment: the decentralized, dynamically-routed network DARPA planned for). It also reminds us that machines along the way could peek at the message in transit (security implications galore). And we learn that www.netcraft.com is actually netcraft.customer.bt.net, and that machines in London are on the path. Hmm, this machine is in the UK.

How else could we have found this out? Reverse DNS lookup utilities on the web:

http://www.bankes.com/nslookup.htm

A complicated form, but because it uses METHOD=GET rather than POST, the whole submission turns into a URL:

http://www.bankes.com/cgi-local/nslookup.pl?FROM_1=207&FROM_2=158&FROM_3=193&FROM_4=188&SEARCH=DNS&FROM_DNS=www.netcraft.com&STEP_DIRECTION=PLUS&STEP_1=&STEP_2=&STEP_3=&STEP_4=1

It returns a list like this:

194.72.238.5                          www.netcraft.co.uk
194.72.238.6                         news.netcraft.co.uk
194.72.238.7                         mail.netcraft.co.uk
....
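What the browser does to a METHOD=GET form is mechanical enough to sketch in a few lines of C: each field's value is URL-encoded, the name=value pairs are joined with &, and the result is appended to the ACTION URL after a ?. A hypothetical url_encode (illustrative, not from geturl.c) that follows the form-data encoding rules:

```c
/* Sketch of what a browser does to METHOD=GET form data: letters and
   digits pass through, a space becomes '+', and any other byte
   becomes %XX in hex. */
#include <ctype.h>
#include <string.h>

/* URL-encode src into dst; dst must hold at least 3*strlen(src)+1 bytes. */
void url_encode(const char *src, char *dst)
{
    static const char hex[] = "0123456789ABCDEF";
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (isalnum(c) || strchr("-_.", c))
            *dst++ = (char)c;           /* safe character: pass through */
        else if (c == ' ')
            *dst++ = '+';               /* spaces become '+' in form data */
        else {
            *dst++ = '%';               /* everything else: %XX escape */
            *dst++ = hex[c >> 4];
            *dst++ = hex[c & 15];
        }
    }
    *dst = '\0';
}
```

So a field containing "a b/c" travels as a+b%2Fc, which is why a long query string like the nslookup.pl URL above can be pasted into a browser, or handed straight to geturl.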
Also http://www.aetc.af.mil/AETC-NetMgmt/reverse-dns-lookup.html, which uses this complicated form:

<FORM METHOD="POST" FORM ACTION="/cgi-bin/reverse-dns-query"><UL>
   <H3>Please enter the complete host IP address (<I>in decimal!!</I>):</H3>
   <P>
   <INPUT TYPE="text" NAME="A1" MAXLENGTH="3" SIZE="3">.
   <INPUT TYPE="text" NAME="A2" MAXLENGTH="3" SIZE="3">.
   <INPUT TYPE="text" NAME="A3" MAXLENGTH="3" SIZE="3">.
   <INPUT TYPE="text" NAME="A4" MAXLENGTH="3" SIZE="3">
   <P>
Please choose the <STRONG>Domain Name Server</STRONG> to query:
<SELECT NAME="NS">
<OPTION SELECTED>nic.ddn.mil
<OPTION>server.af.mil
<OPTION>lanman.aetc.af.mil
</SELECT> <P>
   <INPUT TYPE="submit" VALUE=" Submit Query ">
   <INPUT TYPE="reset" VALUE=" Clear Form ">
   <P></UL>
<HR><P></FORM>
Can we turn this one into a simple URL? There are actually five variables here (A1 through A4, plus NS). Try putting them in the URL as if the form had said METHOD=GET. Sometimes this works, sometimes it doesn't; it depends on how the CGI program on the back end reads its input.

http://www.aetc.af.mil/cgi-bin/reverse-dns-query?a1=194&a2=72&a3=238&a4=5&ns=nic.ddn.mil

Hmm, the machine is not happy. Here's why GET vs. POST matters: with GET, the server hands the form data to the CGI program in the QUERY_STRING environment variable; with POST, the data arrives on the program's standard input (CONTENT_LENGTH bytes of it). A program written to read only its standard input never sees data pasted into the URL. (Note too that the URL above lowercases the field names, a1 rather than A1; form field names are case-sensitive, so that alone could break it.) When the back end checks REQUEST_METHOD and handles both cases, the trick works; the nslookup.pl gateway above, which uses GET in the first place, shows how convenient that is.

Ping? (Speaking of which, there's the infamous "ping of death" attack.)

http://www.tio.com/tracer.html (has traceroute from different sites too!)

http://www.amazing.com/internet/club-traceroute.html (traceroute club!! people actually enjoy looking at complicated routes from Baltic states, etc.)

Anyway, ping any machine from CMU with simple URL:

http://www.net.cmu.edu/bin/ping?www.netcraft.com

www.netcraft.com:  896 145 158 148 162 365 153 174P 305PP 328
www.netcraft.com:  10/10  succ. = 100.00%; 145 min, 896 max, 28
How can you find out your own name or IP address right now?

http://www.bankes.com/cgi-local/nslookup.pl?FROM_1=207&FROM_2=158&FROM_3=193&FROM_4=188&FROM_DNS=www.anywhere.com&SEARCH=HOST&STEP_DIRECTION=PLUS&STEP_1=&STEP_2=&STEP_3=&STEP_4=1

It looks like www.anywhere.com is a hack to make the gateway fall back on your own IP address. Note that the address is not in any HTTP header: the server gets it from the TCP connection itself, and passes it to the CGI program as REMOTE_ADDR. (To see what a browser does send, telnet to a web server's port 80 and type a request by hand, or point the browser at a fake server of your own.)

198.112.209.234 pc234.west.ora.com

Yikes, comes back with complete list in numerical order....
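The server side of the "who are you" trick is nearly a one-liner. A hypothetical CGI fragment (REMOTE_ADDR and REMOTE_HOST are standard CGI environment variables; describe_caller is made up for illustration):

```c
/* How a CGI program learns who is calling: the server sets REMOTE_ADDR
   from the TCP connection (always), and REMOTE_HOST only if it did a
   reverse-DNS lookup on that address. */
#include <stdio.h>
#include <stdlib.h>

/* Write a one-line description of the caller into buf. */
void describe_caller(char *buf, size_t size)
{
    const char *addr = getenv("REMOTE_ADDR");   /* e.g. 198.112.209.234 */
    const char *host = getenv("REMOTE_HOST");   /* e.g. pc234.west.ora.com */
    snprintf(buf, size, "%s %s", addr ? addr : "?",
             host ? host : "(no name)");
}
```

Any CGI gateway, traceroute or otherwise, has this information available for free; that's how nph-traceroute could offer a trace "to your place."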

nslookup: http://www.infobear.com/nslookup.html

whois gateway: http://www.ilhawaii.net/whois.html (searches NIC database; often busy)

NSLookup: http://www.ec.lu.se/hdn/internet/dns/nslookup.html (has whole language; warns against overuse???)

nslookup's ls -h command lists CPU and operating system (HINFO) information for the domain; it's a synonym for ls -t HINFO.

SATAN??? (http://www.netsys.com/firewalls/firewalls-9504/0034.html is description of what SATAN does)

http://www.yahoo.com/Computers_and_Internet/Software/
Communications_and_Networking/Unix_Utilities/SATAN/

(Notice Yahoo organizational structure URLs!)

http://www.fish.com/satan/ (Security Administrator's Tool for Analyzing Networks) -- Dan Farmer

Source list at http://www.ensta.fr/internet/unix/sys_admin/satan.html

perl5 is available via anonymous ftp from ftp.netlabs.com

SATAN won't run on a PC or Mac, unless you're running some version of Unix there. (Hmm, should perl on NT have the goal of hosting SATAN?)

"SATAN was written because we realized that computer systems are becoming more and more dependent on the network, and at the same becoming more and more vulnerable to attack via that same network."

The rationale for SATAN is given in a paper posted in December 1993 (ftp.win.tue.nl:/pub/security/admin-guide-to-cracking.101.Z, flat text compressed with the UNIX compress command).

Finally, there's the DNS Tools page at the InterNIC (http://www.rs.internic.net/tools/tools.html):

http://www.rs.internic.net/cgi-bin/whois?netcraft.com

Netcraft Ltd NETCRAFT-DOM
   Rockfield House

   Granville Road
   Bath, BA1 9BQ
   UK

   Domain Name: NETCRAFT.COM

   Administrative Contact, Technical Contact, Zone Contact:
      Prettejohn, Mike  MP132  postmaster@NETCRAFT.DEMON.CO.UK
      +44 117 651467

   Record last updated on 06-Jan-97.
   Record created on 18-Oct-94.

   Domain servers in listed order:

   SERVER.NETCRAFT.CO.UK        194.72.238.2
   NS0.BT.NET                   194.72.6.51
http://www.rs.internic.net/cgi-bin/whois?194.72.238.2
[No name] NS2103-HST

   Hostname: SERVER.NETCRAFT.CO.UK
   Address: 194.72.238.2
   System: Intel 486PC running Free BSD

   Record last updated on 04-Aug-95.
http://www.rs.internic.net/cgi-bin/whois?194.72.6.51
British Telecommunications Plc BT-HST

   Hostname: NS0.BT.NET
   Address: 194.72.6.51
   System: ? running ?

   Coordinator:
      Titley, Nigel  NT13  Nigel.Titley@BT.NET
      +44 1442 237674 +44 1442 237000 (FAX) +44 1442 237728

   Domain Server

   Record last updated on 13-Jan-97.
But it can only look up second-level names (ups.com rather than www.ups.com).

http://www.rs.internic.net/cgi-bin/whois?194.72.238.5 -- no match

{{sidebar on CGI legalities}}

Chapter 3 discussed the legalities of IMG reuse; similar questions arise when reusing someone else's CGI. The chapter title "Stealing Cycles" might have made you wince.

Aside from the issues in chapter 3, an important one here is the Standard for Robot Exclusion (SRE): any /robots.txt file a site provides must be respected by the CGI reuser. See Clinton Wong, p. x, 7, 203-205. See http://info.webcrawler.com/mak/projects/robots/eval.html, http://info.webcrawler.com/mak/projects/robots/norobots.html, http://www.kollar.com/robots.html. Example:

http://www.altavista.com/robots.txt
Content-type: text/plain

User-agent: *
Disallow: /stage/
Disallow: /sites/
Disallow: /howdy/
Disallow: /keith/
Disallow: /webcannon/

http://www.yahoo.com/robots.txt

User-agent: *
Disallow: /gnn
Disallow: /msn
Disallow: /pacbell

# Rover is a bad dog 
User-agent: Roverbot
Disallow: /
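A polite CGI reuser can apply the exclusion test itself. The core of it, once the relevant User-agent record has been found, is just a prefix match against each Disallow line; a minimal sketch (parsing robots.txt itself is omitted):

```c
/* Sketch of the robots.txt exclusion test: a path is off-limits if it
   begins with any Disallow prefix from the record whose User-agent
   matched (or from the "*" record). */
#include <string.h>

/* disallow: NULL-terminated array of Disallow prefixes from one record */
int robot_excluded(const char *path, const char *disallow[])
{
    int i;
    for (i = 0; disallow[i] != NULL; i++)
        if (strncmp(path, disallow[i], strlen(disallow[i])) == 0)
            return 1;               /* prefix match: stay away */
    return 0;
}
```

Against Yahoo's "*" record above, anything under /gnn, /msn, or /pacbell is off-limits, while the Computers_and_Internet pages are fair game.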
If a robot is excluded from part of a site, then probably other automated uses should stay away from it as well. {{end sidebar on CGI legalities}}