momspider -h" or if any invalid command-line option is given:
usage: momspider [-h] [-e errorfile] [-o outfile] [-i instructfile]
                 [-d maxdepth] [-a avoidfile] [-s sitesfile]
                 [-A system_avoidfile] [-S system_sitesfile]

MOMspider/1.00   WWW Spider for multi-owner maintenance of
                 distributed hypertext infostructures.

Options:                                                    [DEFAULT]
  -h  Help -- just display this message and quit.
  -e  Append error history to the following file.           [STDERR]
  -o  Append output history to the following file.          [STDOUT]
  -i  Get your instructions from the following file.        [$HOME/.momspider-instruct]
  -d  Maximum traversal depth.                               [20]
  -a  Read/write the user's URLs to avoid into the following file.
                                                            [$HOME/.momspider-avoid]
  -s  Read/write the user's sites visited into the following file.
                                                            [$HOME/.momspider-sites]
  -A  Read the systemwide URLs to avoid from the following file.
                                                            [$MOMSPIDER_HOME/system-avoid]
  -S  Read the systemwide sites visited from the following file.
                                                            [$MOMSPIDER_HOME/system-sites]

A more in-depth explanation of each command-line option is as follows:
-h
    Help -- just display the usage message above and quit.
-e errfile
    Append the error history to the file errfile. It is recommended
    that this option always be used when the process is going to be
    run for longer than ten minutes. Since MOMspider writes its output
    unbuffered, you can monitor the file as the program proceeds
    through its tasks. If no -e option is given, the error output is
    written to STDERR.
-o outfile
    Append the output history to the file outfile. If outfile already
    exists, it will be moved to outfile.bak before a new file is
    started. It is recommended that this option always be used when
    the process is going to be run for longer than ten minutes. Since
    MOMspider writes its output unbuffered, you can monitor the file
    as the program proceeds through its tasks. If no -o option is
    given, the output is written to STDOUT.
-i instructfile
    Read the file instructfile for MOMspider's instructions, which
    tell it what other options to set and what tasks to perform during
    the process.
-a avoidfile
    Read and write the user's list of URLs to avoid using the file
    avoidfile. If avoidfile already exists, it will be moved to
    avoidfile.bak before a new file is written. The avoidfile is
    rewritten after every update to MOMspider's internal avoid table.
-s sitesfile
    Read and write the user's record of sites already visited (i.e.
    those whose /robots.txt file has already been checked) to the file
    sitesfile. If sitesfile already exists, it will be moved to
    sitesfile.bak before a new file is written. The sitesfile is
    rewritten after every update to MOMspider's internal sites table.
-A system_avoidfile
    Read the systemwide list of URLs to avoid from the file
    system_avoidfile.
-S system_sitesfile
    Read the systemwide record of sites already visited (i.e. those
    whose /robots.txt file has already been checked) from the file
    system_sitesfile.
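For example, a long test run might be started with explicit output and error files so that its progress can be monitored while it works (the file names below are hypothetical):

    # Hypothetical invocation: log output and errors to files,
    # read the default instruction file, and limit traversal depth to 15
    momspider -d 15 -i $HOME/.momspider-instruct -o $HOME/mom.out -e $HOME/mom.err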
Once you have the test documents in place, create an instruction file which will traverse that hierarchy. Start with just a single Tree traversal task which points to the top node, and later expand it into multiple tasks reflecting the hierarchical levels. Also, use a file://localhost/ URL to point to the top -- MOMspider will not invoke its internal speed limits while traversing local file URLs and thus the program will run much faster on a local-only tree.
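As a rough sketch, such a local test task might look like the following; the attribute names are modeled on the Owner examples later in this document and every value is hypothetical, so see the instruction file documentation for the exact syntax:

    <Tree
        Name         LocalTest
        TopURL       file://localhost/usr/local/httpd/docroot/test/Welcome.html
        IndexURL     http://myserver/MOM/LocalTest.html
        IndexFile    /usr/local/httpd/docroot/MOM/LocalTest.html
        EmailAddress webmaster
        EmailBroken
        EmailChanged 1
    >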
You can test most of the features/options of MOMspider on a local file tree. Some things you can't test are redirected files and the avoid tables. Once you have tired of testing on the local files, just change your task instructions so that they point to the real "Top URLs" and run MOMspider again. At this point, you should note a change in speed as MOMspider intentionally slows down to avoid overloading your server. If possible, you should monitor the server's performance as it responds to the requests. If you have a slow server, you should increase the delay times as specified in the default configuration options.
Another thing you will note is that the spider will start checking for /robots.txt files on remote HTTP servers before the first test of a URL at that site. This behavior is part of the robot exclusion protocol and is explained in the document on avoiding URLs.
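For reference, a /robots.txt file is just a plain text file following the robot exclusion protocol; a hypothetical one that keeps MOMspider out of two directories while leaving the rest of the site open would look like:

    # hypothetical /robots.txt on a remote server
    User-agent: MOMspider
    Disallow: /cgi-bin/
    Disallow: /private/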
Finally, you should always test a new instruction file before running it as a batch process. If MOMspider encounters a problem, try running the program with the perl debugger (perl -d momspider ...) and stepping through the instructions by hand.
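For example (the file names here are hypothetical):

    # run MOMspider under the perl debugger with a small test instruction file
    perl -d momspider -i test-instruct -o test.out -e test.err
    # at the debugger prompt: 'n' executes the next statement, 's' steps into
    # subroutine calls, 'b subname' sets a breakpoint, and 'c' continues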
What you need to do first is partition your server documents (mentally) into their associated infostructures. If you don't understand what an infostructure is, read the WWW94 paper. The TopURL of each infostructure should be exactly the same as whatever is used in other documents which link to it.
If your server is structured properly, most identifiable infostructures should reside in their own directory hierarchy. If so, a Tree traversal (or series of Tree traversals if it contains nested infostructures) can encompass each infostructure separately from the rest of the server documents and thus produce an index specific to that structure. Higher-level tasks should use the Exclude directive to leaf those portions of the infostructure that were already traversed in a prior task -- links will automatically be added to the lower-level index file wherever its top URL appears in the other indexes.
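For illustration, a higher-level task that leafs a subtree already covered by a prior task might look like the sketch below; the Exclude syntax and every value shown are assumptions patterned after the Owner examples later in this document, so check the instruction file documentation for the exact form:

    <Tree
        Name         Department
        TopURL       http://myserver/Welcome.html
        IndexURL     http://myserver/MOM/Department.html
        IndexFile    /usr/local/httpd/docroot/MOM/Department.html
        Exclude      http://myserver/projects/
        EmailAddress webmaster
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >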
Unfortunately, not all infostructures are located within a single directory hierarchy. If you are lucky enough to have a server that can send HTML metainformation as headers in response to a HEAD request, then you can use the strategy described in Making Document Metainformation Visible and the Owner traversal type. None of the widely available HTTP servers currently support that capability.
Finally, the last instruction should be a Site traversal starting at your server's root (or welcome page). It should exclude all of the URLs from the prior Tree traversals and have at least one reachable link to all the other documents that were missed by prior traversal tasks. If your existing server root document cannot do this, you may want to create a dummy document that just points to each real top-level document (i.e. a table-of-contents for your server) and use that as your final top URL.
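Under the same assumptions as the sketch above, such a final catch-all task might look like:

    <Site
        Name         Everything
        TopURL       http://myserver/Welcome.html
        IndexURL     http://myserver/MOM/Everything.html
        IndexFile    /usr/local/httpd/docroot/MOM/Everything.html
        Exclude      http://myserver/projects/
        Exclude      http://myserver/people/
        EmailAddress webmaster
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >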
See the examples directory for a number of example instruction files. In particular, the file ICS-instruct will show you what I use to run MOMspider on all of my server's contents at UC Irvine's Department of Information and Computer Science.
Please e-mail to Roy Fielding <fielding@ics.uci.edu> a cut-and-pasted copy of the "Summary of Process Results" generated by MOMspider on the FIRST FULL TEST of your site (i.e. BEFORE you fix any of the problems reported). THIS IS VERY IMPORTANT as it will allow us to perform further research into the usability of distributed hypertext and the effectiveness of tools like MOMspider. Any other comments you wish to send will also be welcome.

If your site is not partitionable into separate infostructures, MOMspider can still be run on the entire site using a Site traversal. The only problem is that the resulting HTML index file will probably be too large for any normal web browsing client to handle. My best advice in that case is to start restructuring your server contents so that they are more hierarchical (readers like that better anyway).
Once you have a working instruction file, you can set it up to run periodically by including an entry in your system's crontab. At large University sites like ours where the server contents change often, it is sufficient to run MOMspider once per week on the entire site. Except for safety-critical applications, I cannot imagine a site where such testing is needed more often. Most business sites (once initial document creation is completed) should be maintainable with just one test every other week, with even less needed if the site does not reference many external sites.
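For example, a weekly run early Sunday morning could be scheduled with a crontab entry like the following sketch (the schedule, installation path, and log file names are hypothetical):

    # hypothetical crontab entry: run MOMspider at 3:00am every Sunday
    0 3 * * 0 /usr/local/momspider/momspider -o $HOME/mom.out -e $HOME/mom.err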
You can also offer to maintain an index of each user's hotlist by giving each user an Owner task, as in the following examples:

    <Owner
        Name         Fred
        TopURL       http://myserver/~fred/hotlist.html
        IndexURL     http://myserver/MOM/hotlists/Fred.html
        IndexFile    /usr/local/httpd/docroot/MOM/hotlists/Fred.html
        EmailAddress fred
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >
    <Owner
        Name         Wilma
        TopURL       http://myserver/~wilma/hotlist.html
        IndexURL     http://myserver/MOM/hotlists/Wilma.html
        IndexFile    /usr/local/httpd/docroot/MOM/hotlists/Wilma.html
        EmailAddress wilma
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >
    <Owner
        Name         Barney
        TopURL       http://myserver/~barney/hotlist.html
        IndexURL     http://myserver/MOM/hotlists/Barney.html
        IndexFile    /usr/local/httpd/docroot/MOM/hotlists/Barney.html
        EmailAddress barney
        EmailBroken
        EmailRedirected
        EmailChanged 1
    >

If you have a lot of users, this could eventually be very popular and at the same time be much more efficient than each individual user doing their own testing.
Why? Because it is a terribly inefficient use of network resources. Up to 95% of a normal site's MOMspider tests (HEAD requests) and all of its traversals (GET requests) will be performed on the server at that site. If the user of MOMspider is located at that site, those requests are essentially free and have no impact on other network sites. In contrast, running MOMspider on a remote site places ALL of those requests on the network between your site and the remote one. If that network happens to be a public one such as the Internet, you will be misusing the limited network resources and people will get VERY upset. If many users decide to do so, I will be forced to recall MOMspider and issue special licenses only to those who are known to be responsible.
There are only three circumstances in which running on a remote site is okay: