momspider -h" or if any invalid command-line option is given:
usage: momspider [-h] [-e errorfile] [-o outfile] [-i instructfile]
                 [-d maxdepth] [-a avoidfile] [-s sitesfile]
                 [-A system_avoidfile] [-S system_sitesfile]

MOMspider/1.00   WWW Spider for multi-owner maintenance of
                 distributed hypertext infostructures.

Options:                                                    [DEFAULT]
  -h  Help -- just display this message and quit.
  -e  Append error history to the following file.           [STDERR]
  -o  Append output history to the following file.          [STDOUT]
  -i  Get your instructions from the following file.        [$HOME/.momspider-instruct]
  -d  Maximum traversal depth.                               [20]
  -a  Read/write the user's URLs to avoid into the following file.
                                                            [$HOME/.momspider-avoid]
  -s  Read/write the user's sites visited into the following file.
                                                            [$HOME/.momspider-sites]
  -A  Read the systemwide URLs to avoid from the following file.
                                                            [$MOMSPIDER_HOME/system-avoid]
  -S  Read the systemwide sites visited from the following file.
                                                            [$MOMSPIDER_HOME/system-sites]

A more in-depth explanation of each command-line option is as follows:
-h
    Help -- just display the usage message above and quit.
-e errfile
    Append the error history to the file errfile. It is recommended
    that this option always be used when the process is going to be
    run for longer than ten minutes. Since MOMspider writes its output
    unbuffered, you can monitor the file as the program proceeds
    through its tasks. If no -e option is given, the error output is
    written to STDERR.
-o outfile
    Append the output history to the file outfile. If outfile already
    exists, it will be moved to outfile.bak before a new file is
    started. It is recommended that this option always be used when
    the process is going to be run for longer than ten minutes. Since
    MOMspider writes its output unbuffered, you can monitor the file
    as the program proceeds through its tasks. If no -o option is
    given, the output is written to STDOUT.
-i instructfile
    Read the file instructfile for MOMspider's instructions, which
    tell it what other options to set and what tasks to perform during
    the process.
-a avoidfile
    Read and write the user's list of URLs to avoid using the file
    avoidfile. If avoidfile already exists, it will be moved to
    avoidfile.bak before a new file is written. The avoidfile is
    rewritten after every update to MOMspider's internal avoid table.
-s sitesfile
    Read and write the user's record of sites already visited (i.e.
    those whose /robots.txt file has already been checked) to the file
    sitesfile. If sitesfile already exists, it will be moved to
    sitesfile.bak before a new file is written. The sitesfile is
    rewritten after every update to MOMspider's internal sites table.
-A system_avoidfile
    Read the systemwide list of URLs to avoid from the file
    system_avoidfile.
-S system_sitesfile
    Read the systemwide record of sites already visited (i.e. those
    whose /robots.txt file has already been checked) from the file
    system_sitesfile.
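For example, a long test run might be started with explicit output and error files so that its progress can be monitored while it works (the file names below are hypothetical):

    # Hypothetical invocation: log output and errors to files,
    # read the default instruction file, and limit traversal depth to 15
    momspider -d 15 -i $HOME/.momspider-instruct -o $HOME/mom.out -e $HOME/mom.err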
Once you have the test documents in place, create an instruction file which will traverse that hierarchy. Start with just a single Tree traversal task which points to the top node, and later expand it into multiple tasks reflecting the hierarchical levels. Also, use a file://localhost/ URL to point to the top -- MOMspider will not invoke its internal speed limits while traversing local file URLs and thus the program will run much faster on a local-only tree.
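As a rough sketch, such a local test task might look like the following; the attribute names are modeled on the Owner examples later in this document and every value is hypothetical, so see the instruction file documentation for the exact syntax:

    <Tree
        Name         LocalTest
        TopURL       file://localhost/usr/local/httpd/docroot/test/Welcome.html
        IndexURL     http://myserver/MOM/LocalTest.html
        IndexFile    /usr/local/httpd/docroot/MOM/LocalTest.html
        EmailAddress webmaster
        EmailBroken
        EmailChanged 1
    >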
You can test most of the features/options of MOMspider on a local file tree. Some things you can't test are redirected files and the avoid tables. Once you have tired of testing on the local files, just change your task instructions so that they point to the real "Top URLs" and run MOMspider again. At this point, you should note a change in speed as MOMspider intentionally slows down to avoid overloading your server. If possible, you should monitor the server's performance as it responds to the requests. If you have a slow server, you should increase the delay times as specified in the default configuration options.
Another thing you will note is that the spider will start checking for /robots.txt files on remote HTTP servers before the first test of a URL at that site. This behavior is part of the robot exclusion protocol and is explained in the document on avoiding URLs.
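For reference, a /robots.txt file is just a plain text file following the robot exclusion protocol; a hypothetical one that keeps MOMspider out of two directories while leaving the rest of the site open would look like:

    # hypothetical /robots.txt on a remote server
    User-agent: MOMspider
    Disallow: /cgi-bin/
    Disallow: /private/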
Finally, you should always test a new instruction file before running it as a batch process. If MOMspider encounters a problem, try running the program with the perl debugger (perl -d momspider ...) and stepping through the instructions by hand.
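For example (the file names here are hypothetical):

    # run MOMspider under the perl debugger with a small test instruction file
    perl -d momspider -i test-instruct -o test.out -e test.err
    # at the debugger prompt: 'n' executes the next statement, 's' steps into
    # subroutine calls, 'b subname' sets a breakpoint, and 'c' continues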
What you need to do first is partition your server documents (mentally) into their associated infostructures. If you don't understand what an infostructure is, read the WWW94 paper. The TopURL of each infostructure should be exactly the same as whatever is used in other documents which link to it.
If your server is structured properly, most identifiable infostructures should reside in their own directory hierarchy. If so, a Tree traversal (or series of Tree traversals if it contains nested infostructures) can encompass each infostructure separately from the rest of the server documents and thus produce an index specific to that structure. Higher-level tasks should use the Exclude directive to leaf those portions of the infostructure that were already traversed in a prior task -- links will automatically be added to the lower-level index file wherever its top URL appears in the other indexes.
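For illustration, a higher-level task that leafs a subtree already covered by a prior task might look like the sketch below; the Exclude syntax and every value shown are assumptions patterned after the Owner examples later in this document, so check the instruction file documentation for the exact form:

    <Tree
        Name         Department
        TopURL       http://myserver/Welcome.html
        IndexURL     http://myserver/MOM/Department.html
        IndexFile    /usr/local/httpd/docroot/MOM/Department.html
        Exclude      http://myserver/projects/
        EmailAddress webmaster
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >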
Unfortunately, not all infostructures are located within a single directory hierarchy. If you are lucky enough to have a server that can send HTML metainformation as headers in response to a HEAD request, then you can use the strategy described in Making Document Metainformation Visible and the Owner traversal type. None of the widely available HTTP servers currently support that capability.
Finally, the last instruction should be a Site traversal starting at your server's root (or welcome page). It should exclude all of the URLs from the prior Tree traversals and have at least one reachable link to all the other documents that were missed by prior traversal tasks. If your existing server root document cannot do this, you may want to create a dummy document that just points to each real top-level document (i.e. a table-of-contents for your server) and use that as your final top URL.
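Under the same assumptions as the sketch above, such a final catch-all task might look like:

    <Site
        Name         Everything
        TopURL       http://myserver/Welcome.html
        IndexURL     http://myserver/MOM/Everything.html
        IndexFile    /usr/local/httpd/docroot/MOM/Everything.html
        Exclude      http://myserver/projects/
        Exclude      http://myserver/people/
        EmailAddress webmaster
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >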
See the examples directory for a number of example instruction files. In particular, the file ICS-instruct will show you what I use to run MOMspider on all of my server's contents at UC Irvine's Department of Information and Computer Science.
Please e-mail to Roy Fielding <fielding@ics.uci.edu> a cut-and-pasted copy of the "Summary of Process Results" generated by MOMspider on the FIRST FULL TEST of your site (i.e. BEFORE you fix any of the problems reported). THIS IS VERY IMPORTANT as it will allow us to perform further research into the usability of distributed hypertext and the effectiveness of tools like MOMspider. Any other comments you wish to send will also be welcome.

If your site is not partitionable into separate infostructures, MOMspider can still be run on the entire site using a Site traversal. The only problem is that the resulting HTML index file will probably be too large for any normal web browsing client to handle. My best advice in that case is to start restructuring your server contents so that they are more hierarchical (readers like that better anyway).
Once you have a working instruction file, you can set it up to run periodically by including an entry in your system's crontab. At large University sites like ours where the server contents change often, it is sufficient to run MOMspider once per week on the entire site. Except for safety-critical applications, I cannot imagine a site where such testing is needed more often. Most business sites (once initial document creation is completed) should be maintainable with just one test every other week, with even less needed if the site does not reference many external sites.
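For example, a weekly run early Sunday morning could be scheduled with a crontab entry like the following sketch (the schedule, installation path, and log file names are hypothetical):

    # hypothetical crontab entry: run MOMspider at 3:00am every Sunday
    0 3 * * 0 /usr/local/momspider/momspider -o $HOME/mom.out -e $HOME/mom.err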
You can also offer to maintain an index of each user's hotlist by giving each user an Owner task, as in the following examples:

    <Owner
        Name         Fred
        TopURL       http://myserver/~fred/hotlist.html
        IndexURL     http://myserver/MOM/hotlists/Fred.html
        IndexFile    /usr/local/httpd/docroot/MOM/hotlists/Fred.html
        EmailAddress fred
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >
    <Owner
        Name         Wilma
        TopURL       http://myserver/~wilma/hotlist.html
        IndexURL     http://myserver/MOM/hotlists/Wilma.html
        IndexFile    /usr/local/httpd/docroot/MOM/hotlists/Wilma.html
        EmailAddress wilma
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >
    <Owner
        Name         Barney
        TopURL       http://myserver/~barney/hotlist.html
        IndexURL     http://myserver/MOM/hotlists/Barney.html
        IndexFile    /usr/local/httpd/docroot/MOM/hotlists/Barney.html
        EmailAddress barney
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >

If you have a lot of users, this could eventually be very popular and at the same time be much more efficient than each individual user doing their own testing.
Why? Because it is a terribly inefficient use of network resources. Up to 95% of a normal site's MOMspider tests (HEAD requests) and all of its traversals (GET requests) will be performed on the server at that site. If the user of MOMspider is located at that site, those requests are essentially free and have no impact on other network sites. In contrast, running MOMspider on a remote site places ALL of those requests on the network between your site and the remote one. If that network happens to be a public one such as the Internet, you will be misusing the limited network resources and people will get VERY upset. If many users decide to do so, I will be forced to recall MOMspider and issue special licenses only to those who are known to be responsible.
There are only three circumstances in which running on a remote site is okay: