"Big jobs usually go to the men who prove their ability to outgrow small ones."
-- John Wooden
The StructureTooBig SiteSpider is a utility for crawling websites and
evaluating the site content for broken links, slow pages, and external links.
To use the SiteSpider, you simply enter a URL, specify a crawl depth, and click
the Crawl button. The specified URL is requested and parsed for other links,
which are then requested and parsed. This continues until all links have been
requested, or the maximum crawl depth has been reached. (A crawl depth of zero
means unlimited: the crawl will not stop until all links have been requested.)
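
The crawl loop described above can be sketched roughly as follows. This is a
minimal Python illustration, not SiteSpider's actual code: the names `crawl`,
`fetch_links`, and the in-memory `SITE` map are hypothetical, no network
requests are made, and the sketch assumes the start URL counts as depth 1.

```python
from collections import deque

# Hypothetical in-memory "website": each URL maps to the links found on it.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c"],
    "/c": ["/d"],
    "/d": [],
}


def crawl(start_url, fetch_links, max_depth=0):
    """Breadth-first crawl starting at start_url.

    fetch_links(url) returns the links found on that page.
    max_depth=0 means unlimited depth (assumption: the start URL is depth 1).
    Returns (requested, skipped), where skipped holds (url, reason) pairs.
    """
    requested = []                     # URLs in the order they were requested
    skipped = []                       # links that were never requested
    seen = {start_url}                 # every link is requested at most once
    pending = deque([(start_url, 1)])  # parsed but not yet requested

    while pending:
        url, depth = pending.popleft()
        requested.append(url)
        for link in fetch_links(url):
            if link in seen:
                skipped.append((link, "duplicate"))
            elif max_depth and depth + 1 > max_depth:
                skipped.append((link, "beyond depth"))
            else:
                seen.add(link)
                pending.append((link, depth + 1))
    return requested, skipped


requested, skipped = crawl("/", lambda url: SITE.get(url, []), max_depth=2)
# requested -> ["/", "/a", "/b"]; "/c" is skipped as beyond the crawl depth
```

With `max_depth=0` the same call keeps going until the pending queue is empty,
which corresponds to the unlimited crawl mentioned above.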
The image below illustrates what a simple crawl may look like:
The main tree makes it very easy to see all links visually. Errors will stand
out in red with the actual response code (e.g. "404") in text beside the URL.
Links that are not requested (because they are duplicates or beyond the
specified crawl depth) are gray, and successfully requested links are green.
During the crawl, the status bar at the bottom of the window displays the
current status, including the number of links examined, the number of links
pending (links found during parsing but not yet requested), and the current
crawl depth. Each link is requested only once: duplicate links are neither
requested nor parsed.
While the tree is useful for seeing the structure of the links on the website,
the other tabs break down the information into reports by status code (for
example, all 400-level response codes, all 500-level response codes, etc.). The
External Links tab displays all links that lead to other websites.
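
As an illustration of how such reports could be derived from crawl results,
here is a small Python sketch. It is not how SiteSpider itself is implemented:
the function names and the sample URLs are hypothetical, and a link is assumed
to be external when its host name differs from that of the crawled site.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def status_class(code):
    """Map a status code to its class label, e.g. 404 -> '400-level'."""
    return f"{code // 100 * 100}-level"


def is_external(link, site_url):
    """Assumption: external means the host differs from the crawled site's host."""
    return urlsplit(link).hostname != urlsplit(site_url).hostname


def build_reports(results, site_url):
    """results is a list of (url, status_code) pairs from a finished crawl."""
    by_status = defaultdict(list)
    external = []
    for url, code in results:
        by_status[status_class(code)].append(url)
        if is_external(url, site_url):
            external.append(url)
    return dict(by_status), external


# Hypothetical sample data
results = [
    ("http://example.com/", 200),
    ("http://example.com/missing", 404),
    ("http://other.example.org/page", 200),
    ("http://example.com/error", 500),
]
by_status, external = build_reports(results, "http://example.com/")
```

Note that this simple host comparison treats `www.example.com` and
`example.com` as different sites; a real tool may well be more lenient.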
The Slowest Pages tab is useful for isolating problem areas of a site, as it
sorts all of the links by response time. The image below illustrates an example
from my site; the maximum timeout is configurable in the application settings.
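
The ordering behind a report like this is simply a sort on response time. A
minimal Python sketch, with a hypothetical `slowest_pages` helper and made-up
timings:

```python
def slowest_pages(timings, limit=None):
    """Sort (url, seconds) pairs so the slowest responses come first."""
    ranked = sorted(timings, key=lambda item: item[1], reverse=True)
    return ranked if limit is None else ranked[:limit]


# Hypothetical timings, in seconds
timings = [("/a", 0.2), ("/b", 1.5), ("/c", 0.7)]
top_two = slowest_pages(timings, limit=2)
# top_two -> [("/b", 1.5), ("/c", 0.7)]
```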
Finally, there are a number of settings to configure the way SiteSpider operates:
Here's an explanation of each setting:
Follow Internal Redirects: If checked, any redirects on the
site being crawled will be followed automatically. If unchecked, redirects will
be flagged with a 302 response and not followed.
Follow External Redirects: Same as above, but applies only to
links that lead to pages off the site being crawled -- in other words, links to
other websites.
Keep Original Uri When Redirecting: If checked, the original
URI of the link is the one used in the reports. If unchecked, the
destination URI (after all redirects) is used in the reports.
Request Delay: Time, in milliseconds, to pause between each
request.
Request Timeout: Time, in seconds, to wait before a request
times out (reported with a status code of 408).
Crawl Internal Links Only: If checked, only pages that are
internal to the current website are parsed for links. External links are
still requested, but their content is parsed and crawled only when this option
is unchecked.
User Agent: The user agent sent by the crawler to identify
itself to the web server.
Content Types: Specifies the content types that will be parsed
for links to other pages. If you want to parse and crawl XML files, for
example, text/xml can be added here.
Authentication: If a website requires authentication, the
credentials can be entered here. The "Use My Credentials" checkbox specifies that
SiteSpider will use the credentials of the logged-in user when crawling the
site.
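
To make the interplay of these settings concrete, the following Python sketch
models a few of them as a plain data class with three small decision helpers.
It is only an illustration: the field names mirror the settings above, but the
default values are placeholders rather than SiteSpider's real defaults, and the
helper functions are hypothetical. It also assumes that with Crawl Internal
Links Only checked, external pages are requested but never parsed.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlSettings:
    # Defaults below are placeholders, not the application's actual defaults.
    follow_internal_redirects: bool = True
    follow_external_redirects: bool = True
    keep_original_uri: bool = True
    request_delay_ms: int = 0      # pause between requests, in milliseconds
    request_timeout_s: int = 30    # per-request timeout, in seconds
    crawl_internal_links_only: bool = True
    content_types: list = field(default_factory=lambda: ["text/html"])


def should_follow_redirect(settings, is_internal):
    """Pick the redirect setting that applies to this kind of link."""
    if is_internal:
        return settings.follow_internal_redirects
    return settings.follow_external_redirects


def reported_uri(settings, original, destination):
    """Choose which URI appears in the reports after a redirect."""
    return original if settings.keep_original_uri else destination


def should_parse(settings, is_internal, content_type):
    """Decide whether a fetched page is parsed for further links."""
    if settings.crawl_internal_links_only and not is_internal:
        return False
    # Drop parameters: "text/html; charset=utf-8" -> "text/html"
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in settings.content_types


settings = CrawlSettings()
xml_settings = CrawlSettings(
    crawl_internal_links_only=False,
    content_types=["text/html", "text/xml"],
)
```

With the placeholder defaults, `should_parse(settings, False, "text/html")` is
false because the page is external, while `xml_settings` would allow an
external `text/xml` response to be parsed.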
The link data can be saved and loaded using the File menu. It is stored in a
standard XML format that you can use in other applications.
Depending on the size of the site, the settings, and your hardware, crawling a
site may take quite a bit of time and a great deal of resources. Use the Crawl
Depth setting to scale this appropriately and keep the Crawl Internal Links Only
setting checked whenever possible. If these settings are left unconstrained, a
crawl may reach hundreds of thousands of links within two or three levels.
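
A rough back-of-the-envelope calculation shows how quickly this can happen. The
Python sketch below computes an upper bound under two simplifying assumptions
that are not from SiteSpider itself: every page carries the same number of
links, and no link is a duplicate. Real crawls will usually come in lower
because duplicates are skipped.

```python
def max_links(links_per_page, levels):
    """Upper bound on links discovered after the given number of levels."""
    return sum(links_per_page ** level for level in range(1, levels + 1))


# 60 links per page, three levels: 60 + 3,600 + 216,000
three_levels = max_links(60, 3)   # -> 219660
# 500 links per page, two levels: 500 + 250,000
two_levels = max_links(500, 2)    # -> 250500
```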
You can use this software for your own personal use, but do
so at your own risk. You are responsible for any use or misuse
of this software. This software may not be redistributed
or copied in any way without express permission.