This is the public wiki for the Heritrix archival crawler project. (Some material is still at the old wiki.)
To contribute to this wiki, register with the linked JIRA issue tracker. The same login may then be used for issue reporting there and for wiki editing and comments here.
Latest Releases
The most recent official releases are 1.14.1 (August 2008) and 2.0.1 (August 2008).
Latest development builds on the 1.x line for the adventurous are always available on our CruiseControl continuous build box.
Latest development builds on the 2.x line for the adventurous are always available via our Continuum continuous build box (look in the 'working copy' dist/target directory for latest build artifacts) and Maven2 repository (look in the '2.0.1-SNAPSHOT' directory for the latest in-progress work as Mavenized libraries).
Documentation
The Heritrix User Manual covers getting started with Heritrix and many advanced topics. The User Manual is generally focused on Heritrix 1.x versions and has not been fully updated for 1.12/1.14 or the larger changes in 2.0, but even for 2.0 it provides a strong basis for understanding Heritrix features and configuration.
For 2.0.0, the Heritrix 2 Tutorial wiki pages and Heritrix 2.0.0 Release Notes provide a starting point. All settings also have help text in the web UI under 'details'.
For developers, the Heritrix Developer Manual provides a guide to extending and customizing Heritrix code for your own purposes, though of course the source code itself, which is fairly well-commented, is the best guide.
For future documentation improvements, we have a Documentation Wishlist.
Upcoming Releases / Roadmap
Design and initial implementation are proceeding on releases 2.2.0 and beyond, as part of the third phase of sponsored 'smart crawler' work. The major theme of this work is enabling adaptive and continuous revisit crawling at large scale. The three major areas of work for this project, in the rough order they will be addressed, are:
- standardizing the configuration/settings file format for ease of composition and archiving; improved rapid checkpointing for stable long-running crawls
- refactoring and possibly combining the internal uri-history and already-included data structures; improving flexibility of frontier queues (offering the possibility of long-lived timing queues and multiple queues per host/exclusion-grouping)
- enabling revisit of discovered URIs according to a swappable policy, which may take into account desired revisit intervals and detected URI change rates on previous visits
This work is expected to be extensively tested in mid- to late 2008 and officially released before the end of the year.
Other areas of upcoming attention, though not yet scheduled for specific target releases, include:
- improving the usability and documentation of recently-added features (duplication-suppression; tunable prioritization) in typical operator workflows
- improving the automated test coverage with simulated crawling, especially for non-default feature configurations and longer/performance-intensive test runs
- better crawling of web video content with default configurations
- a web-services interface as an alternative to JMX for remote control of the crawler
- new heuristics and knowledge-sharing for trap/spam reduction
- improved during-crawl queue-oriented reporting, including better predictions of completion times
- improved options for crawling access-controlled (password/other-auth) sites
Work in these areas may appear in 2.2.0 or later releases.
Developer Notes
Latest (for 2.2 and beyond):
General:
Regarding 2.0:
Regarding 1.12.X (duplicate reduction functionality):
Crawl Operator Notes