"Big jobs usually go to the men who prove their ability to outgrow small ones."
-- John Wooden


introducing the sitespider
The StructureTooBig SiteSpider is a utility for crawling websites and evaluating the site content for broken links, slow pages, and external links. To use the SiteSpider, you simply enter a URL, specify a crawl depth, and click the Crawl button. The specified URL is requested and parsed for other links, which are then requested and parsed. This continues until all links have been requested, or the maximum crawl depth has been reached. (A crawl depth of zero is essentially unlimited -- the crawl will not stop until all links have been requested.)
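
To make the crawl loop concrete, here is a minimal Python sketch of the idea described above. It is not SiteSpider's actual code: the `crawl` function and the `fetch_links` callable are hypothetical names, an in-memory dictionary stands in for real HTTP requests, and I am assuming the starting URL counts as depth 0 and that a breadth-first order is used.

```python
from collections import deque


def crawl(start_url, fetch_links, max_depth=0):
    """Breadth-first crawl that returns (url, depth) for every requested link.

    fetch_links is a hypothetical callable: it requests a URL and returns
    the links found on that page. max_depth=0 means "no limit".
    """
    seen = {start_url}                 # a link is only ever queued once
    pending = deque([(start_url, 0)])  # parsed but not yet requested
    requested = []

    while pending:
        url, depth = pending.popleft()
        links = fetch_links(url)       # request and parse the page
        requested.append((url, depth))
        if max_depth and depth >= max_depth:
            continue                   # at the limit: do not queue its links
        for link in links:
            if link not in seen:       # skip duplicates
                seen.add(link)
                pending.append((link, depth + 1))
    return requested


# A tiny in-memory "site" standing in for real HTTP requests (assumed data).
site = {"/": ["/a", "/b"], "/a": ["/b", "/c"], "/b": ["/"], "/c": ["/d"], "/d": []}
print(crawl("/", lambda url: site.get(url, []), max_depth=1))
# prints [('/', 0), ('/a', 1), ('/b', 1)]
```

With `max_depth=0` the same call keeps going until the queue is empty, so `/c` and `/d` are requested as well.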

The image below illustrates what a simple crawl may look like:



The main tree makes it very easy to see all links visually. Errors will stand out in red with the actual response code (e.g. "404") in text beside the URL. Links that are not requested (because they are duplicates or beyond the specified crawl depth) are gray, and successfully requested links are green.

During the crawl, the status bar at the bottom of the window displays the current status, including the number of links examined, the number of links pending (links that have been parsed but not yet requested), and the current crawl depth. A link is only requested once: duplicate links are not requested or parsed.

analyzing the data
While the tree is useful for seeing the structure of the links on the website, the other tabs break down the information into reports by status code (for example, all 400-level response codes, all 500-level response codes, etc.). The external links tab will display all links that lead to different websites.
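
As a rough illustration of the status code reports, the grouping could look something like the Python sketch below. The function name and the `(url, status_code)` tuple layout are my own assumptions for the example, not SiteSpider's internal format.

```python
from collections import defaultdict


def group_by_status_class(results):
    """Group (url, status_code) pairs into buckets such as '4xx' and '5xx'."""
    groups = defaultdict(list)
    for url, status in results:
        groups[f"{status // 100}xx"].append(url)  # 404 -> '4xx', 500 -> '5xx'
    return dict(groups)


# Sample crawl results (made-up URLs for illustration only).
results = [
    ("http://example.com/", 200),
    ("http://example.com/missing", 404),
    ("http://example.com/gone", 410),
    ("http://example.com/error", 500),
]
for status_class, urls in sorted(group_by_status_class(results).items()):
    print(status_class, urls)
```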

The Slowest Pages tab is useful for isolating problem areas of a site, as it sorts all of the links by response time. The image below illustrates an example from my site; the maximum timeout is configurable in the application settings.
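
The idea behind the Slowest Pages report is just a sort on response time. A minimal Python sketch, assuming each result is a hypothetical `(url, seconds)` pair:

```python
def slowest_pages(results, limit=10):
    """Return up to `limit` (url, seconds) pairs, slowest first."""
    return sorted(results, key=lambda item: item[1], reverse=True)[:limit]


# Made-up timings for illustration only.
timings = [
    ("http://example.com/", 0.21),
    ("http://example.com/search", 3.80),
    ("http://example.com/about", 0.09),
]
print(slowest_pages(timings, limit=2))
# prints [('http://example.com/search', 3.8), ('http://example.com/', 0.21)]
```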



settings
Finally, there are a number of settings that configure the way SiteSpider operates:



Here's an explanation of each setting:

Follow Internal Redirects: If checked, any redirects on the site being crawled will be followed automatically. If unchecked, redirects will be flagged with a 302 response and not followed.
Follow External Redirects: Same as above, but applies only to links that lead to pages off the site being crawled -- in other words, links to other websites.
Keep Original Uri When Redirecting: If checked, the original Uri of the link is the one that is used in the reports. If unchecked, the destination Uri (after all the redirects) is used in the reports.

Request Delay: Time, in milliseconds, to pause between each request.
Request Timeout: Time, in seconds, to wait before timing out on a request (status code of 408).

Crawl Internal Links Only: If checked, only pages that are internal to the current website are parsed for links. External links are still requested, but their content will not be parsed and crawled unless this option is unchecked.

User Agent: The user agent sent by the crawler to identify itself to the web server.

Content Types: Specifies the content types that will be parsed for links to other pages. If you want to parse and crawl XML files, for example, text/xml can be added here.

Authentication: If a website requires authentication, those settings can be set here. The "Use My Credentials" checkbox specifies that SiteSpider will use the credentials of the logged-in user when crawling the site.
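
To show how the Crawl Internal Links Only and Content Types settings might work together, here is a small Python sketch of the "should this page be parsed for links?" decision. Everything in it is an assumption for illustration: the `CrawlSettings` class, the function names, the default of `text/html`, and the choice to treat a link as internal when its host matches the host of the site being crawled.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class CrawlSettings:
    internal_only: bool = True            # "Crawl Internal Links Only"
    content_types: tuple = ("text/html",)  # "Content Types" (assumed default)


def is_internal(url, site_root):
    """Assumption: a link is internal when its host matches the site's host."""
    return urlparse(url).netloc.lower() == urlparse(site_root).netloc.lower()


def should_parse(url, content_type, site_root, settings):
    """Decide whether a requested page should be parsed for more links."""
    if settings.internal_only and not is_internal(url, site_root):
        return False  # external pages are requested but not parsed
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in settings.content_types


root = "http://example.com/"
settings = CrawlSettings()
print(should_parse("http://example.com/a", "text/html; charset=utf-8", root, settings))
print(should_parse("http://other.example/a", "text/html", root, settings))
```

Adding `text/xml` to `content_types` in this sketch is the equivalent of adding it to the Content Types setting.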

The link data can be saved and loaded using the File menu. It is stored in a standard XML format that you can use in other applications.

a note on resources
Depending on the size of the site, the settings, and your hardware, crawling a site may take quite a bit of time and a great deal of resources. Use the Crawl Depth setting to scale this appropriately and keep the Crawl Internal Links Only setting checked whenever possible. If these settings are not constrained, a crawl may reach hundreds of thousands of links within two or three levels.
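
A quick back-of-the-envelope calculation shows how fast this grows. The Python sketch below assumes, purely for illustration, that every page contributes 75 links that have not been seen before; real sites repeat many links, so treat the result as an upper bound rather than a prediction.

```python
def max_links(links_per_page, depth):
    """Upper bound on links found if every page adds links_per_page new links."""
    return sum(links_per_page ** level for level in range(1, depth + 1))


for depth in (1, 2, 3):
    print(depth, max_links(75, depth))
# prints:
# 1 75
# 2 5700
# 3 427575
```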

terms and downloads
You can use this software for your own personal use, but do so at your own risk. You are responsible for any use or misuse of this software. This software may not be redistributed or copied in any way without express permission.

Downloads (.NET Framework 2.0 required)
Download Application (50k)