MOMspider Instruction Files
The behavior of a MOMspider process is defined primarily by its
instruction file. At the beginning of processing, MOMspider reads the
entire file and loads the instructions into internal tables. The
instruction file is plain text; its location is given by the -i
command-line option or, if that option is absent, by the default name
set in the configuration defaults.
A MOMspider instruction file consists of a series of (optional)
global directives followed by a series of
traversal tasks. MOMspider sets the configuration
options associated with the global directives and then proceeds to perform
each of the listed tasks in the given order. After completing the last
task, MOMspider will output a summary of the overall process results
and then exit.
The format for the instruction file is fairly rigid. Blank lines and
any lines beginning with '#' are ignored. Every other directive must
fit on a single line (regardless of length), since there is no
line-continuation character. A task instruction begins with a
"<TYPE" directive, where TYPE is Site, Tree, or Owner, and ends with a
">" directive on a line by itself. Several examples are provided with
the distribution. All instructions are case-sensitive.
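The format rules above can be sketched as a small reader. This is only an illustration in Python, not MOMspider's own code: the function name parse_instructions, the sample file contents (addresses, URLs, pathnames), and the decision to accept leading whitespace on directives inside a task are all assumptions made for the example.

```python
def parse_instructions(text):
    """Split instruction text into (global_directives, tasks)."""
    global_directives, tasks, current = [], [], None
    for raw in text.splitlines():
        # Blank lines and lines beginning with '#' are ignored.
        if not raw.strip() or raw.startswith("#"):
            continue
        line = raw.strip()  # assumption: indentation inside a task is allowed
        parts = line.split(None, 1)
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        if current is None:
            if line.startswith("<"):
                # "<Site", "<Tree" or "<Owner" opens a task.
                current = {"type": line[1:], "directives": []}
            else:
                global_directives.append((name, value))
        elif line == ">":
            # ">" on a line by itself closes the current task.
            tasks.append(current)
            current = None
        else:
            current["directives"].append((name, value))
    if current is not None:
        raise ValueError("task was not terminated by '>'")
    return global_directives, tasks


# Hypothetical instruction file; every value below is made up.
SAMPLE = """\
# Global directives come first, one per line, flush-left.
ReplyTo webmaster@example.org
MaxDepth 20

<Site
    Name Example
    TopURL http://www.example.org/
    IndexURL http://www.example.org/MOM/index.html
    IndexFile /tmp/MOM/index.html
    EmailAddress webmaster@example.org
    EmailBroken
>
"""

global_directives, tasks = parse_instructions(SAMPLE)
```

Each directive is kept as a (name, value) pair so that repeated directives such as Exclude are preserved in order; directives that take no argument, such as EmailBroken, get an empty value.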
All global directives should be listed at the top of the instruction
file, one per line, with the directive name flush-left. The following
global directives are available:
-
SystemAvoid
pathname
- This directive specifies that the systemwide
avoid file for this process can be found
at the given pathname. If present, this directive overrides the
default configuration, but can itself be overridden on the
command-line by the -A
option.
-
SystemSites
pathname
- This directive specifies that the systemwide
sites file for this process can be found
at the given pathname. If present, this directive overrides the
default configuration, but can itself be overridden on the
command-line by the -S
option.
-
AvoidFile
pathname
- This directive specifies that the user's writable
avoid file for this process can be found
at the given pathname. If present, this directive overrides the
default configuration, but can itself be overridden on the
command-line by the -a
option.
-
SitesFile
pathname
- This directive specifies that the user's writable
sites file for this process can be found
at the given pathname. If present, this directive overrides the
default configuration, but can itself be overridden on the
command-line by the -s
option.
-
SitesCheck
N
- This directive specifies the number of days between checks of a site's
/robots.txt
file as per the
robot exclusion protocol. The default
is usually fifteen (15) days.
-
ReplyTo
email_address
- This directive specifies the real e-mail address of the person running
this MOMspider. This address MUST correspond to the
human being that should be notified in case someone is having problems
with how you have been running MOMspider. The default address is
normally set by libwww-perl to be user@hostname, but
should be re-specified here if the default address does not receive
e-mail.
-
MaxDepth
N
- This directive specifies the maximum allowed depth of any MOMspider
traversal. Its purpose is to prevent the spider from crawling down
a "black hole" -- an infinitely recursive and self-modifying URL.
The default value (usually 20) should be larger than the depth of any
traversal hierarchy that MOMspider will ever need to traverse.
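The override order described for SystemAvoid, SystemSites, AvoidFile, and SitesFile (configuration default, then instruction-file directive, then command-line option) can be illustrated with a short Python sketch. The function name resolve_setting and the pathnames are hypothetical and are not taken from MOMspider itself.

```python
def resolve_setting(config_default, directive_value=None, cli_value=None):
    """Pick the effective value: command line > instruction file > default."""
    if cli_value is not None:
        return cli_value          # e.g. a pathname given with -A, -S, -a or -s
    if directive_value is not None:
        return directive_value    # e.g. a pathname given by a global directive
    return config_default


# Hypothetical pathnames, for illustration only.
only_default = resolve_setting("/usr/local/lib/mom/avoid")
with_directive = resolve_setting("/usr/local/lib/mom/avoid", "/etc/mom/avoid")
with_option = resolve_setting("/usr/local/lib/mom/avoid", "/etc/mom/avoid",
                              "/tmp/avoid")
```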
Traversal tasks are compound instructions, consisting of a set of
task directives surrounded by angle brackets and the type of the traversal.
For each task, MOMspider traverses the web, in breadth-first
order, from the specified top document down to each leaf node. A leaf
node is defined to be any information object which is not of
content-type HTML (and thus cannot contain any further links) or
which is outside the given infostructure. MOMspider determines the
boundaries of an infostructure according to the task's traversal type:
Site, Tree, or Owner.
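A minimal Python sketch of such a traversal follows. It is not MOMspider's implementation: the web is modelled as an in-memory dictionary, the names traverse, in_site, and in_tree are invented for the example, and two details are assumptions: comparing the host:port text of the URLs to decide whether two URLs share a site, and treating a document at the maximum depth as a leaf.

```python
from collections import deque
from urllib.parse import urlsplit


def in_site(url, top):
    """Site boundary: same host and port text as the top document."""
    return urlsplit(url).netloc.lower() == urlsplit(top).netloc.lower()


def in_tree(url, top):
    """Tree boundary: same site, path at or below the top document's directory."""
    top_dir = urlsplit(top).path.rsplit("/", 1)[0] + "/"
    return in_site(url, top) and urlsplit(url).path.startswith(top_dir)


def traverse(web, top, in_scope, max_depth=20):
    """Breadth-first walk from top; returns (traversed, leaves)."""
    traversed, leaves = [], []
    seen = {top}
    queue = deque([(top, 0)])
    while queue:
        url, depth = queue.popleft()
        content_type, links = web.get(url, ("unknown", []))
        # A leaf is tested but its links are never followed.
        if (content_type != "text/html"
                or not in_scope(url, top)
                or depth >= max_depth):
            leaves.append(url)
            continue
        traversed.append(url)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return traversed, leaves


# Hypothetical web: URL -> (content type, outgoing links).
WEB = {
    "http://a.example/docs/index.html": ("text/html", [
        "http://a.example/docs/guide.html",
        "http://a.example/logo.gif",
        "http://b.example/",
    ]),
    "http://a.example/docs/guide.html": ("text/html", [
        "http://a.example/docs/index.html",
        "http://a.example/other/page.html",
    ]),
    "http://a.example/logo.gif": ("image/gif", []),
    "http://a.example/other/page.html": ("text/html", []),
    "http://b.example/": ("text/html", []),
}
TOP = "http://a.example/docs/index.html"

site_traversed, site_leaves = traverse(WEB, TOP, in_site)
tree_traversed, tree_leaves = traverse(WEB, TOP, in_tree)
```

With this sample, the Site traversal follows links through /other/page.html because it is on the same host, while the Tree traversal treats that page as a leaf because it lies outside /docs/.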
Tasks are performed in the order they are listed in the file. In general,
it is most useful to list the tasks bottom-up by their position in the
hierarchy. This makes more information available to the later,
higher-level indexes, which may link to the indexes of earlier tasks.
The following task directives are available:
-
<Site
- This directive indicates the start of a task instruction for a
Site traversal. Site traversal specifies that any URL which
points to a site (the pairing of hostname/IP address and port) other
than that of the top document is considered a leaf node.
-
<Tree
- This directive indicates the start of a task instruction for a
Tree traversal. Tree traversal specifies that any document
not at or below the "level" of the top document is considered a leaf
node, where level is determined by the pathname in the URL. Note
that a tree traversal of any URL at the server's root level will have
the same effect as a Site traversal of that URL.
-
<Owner
- This directive indicates the start of a task instruction for an
Owner traversal. Owner traversal specifies that any document
beyond the top which does not contain an "Owner:"
metainformation header
equal to the infostructure name is considered a leaf node. On most
current servers, this effectively means that only the top URL is
traversed.
-
Name
infostructure_name
- Specifies the infostructure name. This is used both to identify
the infostructure in generated messages and also as the owner name
for Owner traversals. The name is required for all tasks and
must be a single word (no whitespace).
-
TopURL
URL
- Specifies the URL of the top of the infostructure
to be traversed. If it is relative, the URL is resolved as a
file://localhost/
URL relative to the current working
directory at process start. The top URL is required for all tasks
and must be a single word (no whitespace). Any fragment identifier
will be ignored.
-
IndexURL
URL
- Specifies the URL of the HTML index file that will be
produced for this task. This directive is required and the URL
must be in absolute form.
-
IndexFile
pathname.html
- Specifies the pathname of the actual file for the HTML index.
This directive is required and must specify a valid pathname.
If the file already exists, it will be renamed pathname.old.html
and a link to it will be included in the new index.
-
IndexTitle
string
- Specifies the character string to use as the HTML index title and
also the subject line of any e-mail message. This directive is optional.
If not present, the title will be "MOMspider Index for Name"
where Name is the infostructure name.
-
ChangeWindow
N
- Specifies the window in N days (N being a natural number) prior to
the current date within which a tested URL's Last-modified date
is considered "interesting" and should be highlighted in the HTML index.
If N=0, no last-modification dates are considered interesting.
This directive is optional and defaults to seven (7) days.
-
ExpireWindow
N
- Specifies the window in N days (N being a natural number) after
the current date within which a traversed URL's Expires date
is considered "interesting" and should be highlighted in the HTML index.
If N=0, no expiration dates are considered interesting.
This directive is optional and defaults to zero (0). Since Expires
dates are rarely used on the Web, this directive is rarely useful.
-
EmailAddress
email_addresses
- Specifies the e-mail addresses to which an automatically generated
message should be sent if one or more of the other Email directives
below applies to any of the URLs tested during this task. This
directive is required if any other Email directive is given and is
optional otherwise. The format should be exactly the same as that
given to the "To:" header when sending normal e-mail messages.
-
EmailBroken
- Specifies that an e-mail message should be generated if any of the
tested links in this task are found to be broken. This directive
is optional and, if present, requires that EmailAddress also
be given.
-
EmailRedirected
- Specifies that an e-mail message should be generated if any of the
tested links in this task are found to be redirected. This directive
is optional and, if present, requires that EmailAddress also
be given.
-
EmailChanged
N
- Specifies that an e-mail message should be generated if any of the
tested links in this task are found to have been changed within the
past N days, where N is a natural number. Note that this directive
is similar to, but independent of, the ChangeWindow directive.
This directive is optional and, if present (with N > 0), requires that
EmailAddress also be given.
-
EmailExpired
N
- Specifies that an e-mail message should be generated if any of the
traversed documents in this task will expire within the
next N days, where N is a natural number. Note that this directive
is similar to, but independent of, the ExpireWindow directive.
This directive is optional and, if present (with N > 0), requires that
EmailAddress also be given.
-
Exclude
URLprefix
- Specifies that the given URLprefix should be added to the
Leaf Table such that all URLs encountered
during this task's traversal which begin with the given prefix will only
be tested and not traversed. Multiple Exclude directives can
be specified for any task. The IndexURL is automatically
excluded at the beginning of every task.
-
>
- This directive, on a line by itself, signals the end of the current
task instruction. Each task must be terminated before the next begins.
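The requirements and defaults listed for the task directives can be collected into a small checker. The following Python sketch is illustrative only: check_task, recently_changed, the dictionary representation of a task, and the message texts are all hypothetical, and treating both ends of the ChangeWindow as inclusive is an assumption rather than documented behavior.

```python
from datetime import date, timedelta

REQUIRED = ("Name", "TopURL", "IndexURL", "IndexFile")


def check_task(task):
    """Return (settings, problems) for a task given as {directive: value}."""
    problems = [name + " is required" for name in REQUIRED
                if not task.get(name)]
    for name in ("Name", "TopURL"):
        if len(task.get(name, "").split()) > 1:
            problems.append(name + " must be a single word")
    # EmailBroken and EmailRedirected take no argument; the other two
    # only count when N > 0.
    wants_mail = ("EmailBroken" in task
                  or "EmailRedirected" in task
                  or int(task.get("EmailChanged", 0)) > 0
                  or int(task.get("EmailExpired", 0)) > 0)
    if wants_mail and not task.get("EmailAddress"):
        problems.append("EmailAddress is required by the other Email directives")
    settings = dict(task)
    settings.setdefault("IndexTitle",
                        "MOMspider Index for " + task.get("Name", ""))
    settings["ChangeWindow"] = int(task.get("ChangeWindow", 7))
    settings["ExpireWindow"] = int(task.get("ExpireWindow", 0))
    return settings, problems


def recently_changed(last_modified, today, change_window):
    """True if last_modified lies within change_window days before today."""
    if change_window <= 0:
        return False  # N=0: no modification date is interesting
    return today - timedelta(days=change_window) <= last_modified <= today


# Hypothetical task that asks for mail but gives no address.
task = {
    "Name": "Example",
    "TopURL": "http://www.example.org/",
    "IndexURL": "http://www.example.org/MOM/index.html",
    "IndexFile": "/tmp/MOM/index.html",
    "EmailBroken": "",
}
settings, problems = check_task(task)
```

Here problems holds one entry, because EmailBroken is present without EmailAddress, and settings carries the default IndexTitle, ChangeWindow, and ExpireWindow values described above.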
Roy Fielding <fielding@ics.uci.edu>
Department of Information and Computer Science,
University of California, Irvine, CA 92717-3425
Last modified: Wed Aug 10 01:15:17 1994