Home
Added by Paul Jack, last edited by Gordon Mohr on Aug 07, 2008

This is the public wiki for the Heritrix archival crawler project. (Some material is still at the old wiki.)

To contribute to this wiki, register with the linked JIRA issue tracker. The same login may then be used for issue reporting there and for wiki editing and comments here.

Latest Releases

The most recent official releases are 1.14.1 (August 2008) and 2.0.1 (August 2008).

For the adventurous, the latest development builds on the 1.x line are always available on our CruiseControl continuous build box.

For the adventurous, the latest development builds on the 2.x line are always available via our Continuum continuous build box (look in the 'working copy' dist/target directory for the latest build artifacts) and our Maven2 repository (look in the '2.0.1-SNAPSHOT' directory for the latest in-progress work as Mavenized libraries).

Documentation

The Heritrix User Manual covers getting started with Heritrix and many advanced topics. The User Manual is generally focused on Heritrix 1.x versions and has not been fully updated for 1.12/1.14 or for the larger changes in 2.0. Even for 2.0, however, it provides a strong basis for understanding Heritrix features and configuration.

For 2.0.0, the Heritrix 2 Tutorial wiki pages and Heritrix 2.0.0 Release Notes provide a starting point. All settings also have help text in the web UI under 'details'.

For developers, the Heritrix Developer Manual provides a guide to extending and customizing Heritrix code for your own purposes, though of course the source code itself, which is fairly well-commented, is the best guide.

For future documentation improvements, we have a Documentation Wishlist.

Upcoming Releases / Roadmap

Design and initial implementation are proceeding on releases 2.2.0 and beyond, as part of the third phase of sponsored 'smart crawler' work. The major theme of this work is enabling adaptive and continuous revisit crawling at large scale. The three major areas of work for this project, in the rough order they will be addressed, are:

  • standardizing the configuration/settings file format for ease of composition and archiving; improved rapid checkpointing for stable long-running crawls
  • refactoring and possibly combining the internal uri-history and already-included data structures; improving flexibility of frontier queues (offering the possibility of long-lived timing queues and multiple queues per host/exclusion-grouping)
  • enabling revisit of discovered URIs according to a swappable policy, which may take into account desired revisit intervals and detected URI change rates on previous visits
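
The "swappable policy" idea in the last item can be pictured with a small sketch. This is an illustration only, not Heritrix code: the names `RevisitPolicy`, `FixedIntervalPolicy`, and `AdaptivePolicy` are hypothetical, and the halve-on-change / double-on-no-change rule is simply one assumed way a policy might react to detected change rates.

```java
import java.time.Duration;

public class RevisitPolicySketch {

    /** Hypothetical pluggable policy: how long to wait before revisiting a URI. */
    interface RevisitPolicy {
        Duration nextInterval(Duration previousInterval, boolean changedSinceLastVisit);
    }

    /** Always revisits at one operator-chosen interval, ignoring change history. */
    static final class FixedIntervalPolicy implements RevisitPolicy {
        private final Duration interval;

        FixedIntervalPolicy(Duration interval) {
            this.interval = interval;
        }

        @Override
        public Duration nextInterval(Duration previousInterval, boolean changedSinceLastVisit) {
            return interval;
        }
    }

    /**
     * Assumed adaptive rule: halve the interval when the content changed,
     * double it when it did not, and clamp the result to [min, max].
     */
    static final class AdaptivePolicy implements RevisitPolicy {
        private final Duration min;
        private final Duration max;

        AdaptivePolicy(Duration min, Duration max) {
            this.min = min;
            this.max = max;
        }

        @Override
        public Duration nextInterval(Duration previousInterval, boolean changedSinceLastVisit) {
            Duration candidate = changedSinceLastVisit
                    ? previousInterval.dividedBy(2)
                    : previousInterval.multipliedBy(2);
            if (candidate.compareTo(min) < 0) {
                return min;
            }
            if (candidate.compareTo(max) > 0) {
                return max;
            }
            return candidate;
        }
    }

    public static void main(String[] args) {
        // Swapping the policy means changing only this line.
        RevisitPolicy policy = new AdaptivePolicy(Duration.ofHours(1), Duration.ofDays(30));

        Duration interval = Duration.ofDays(1);
        interval = policy.nextInterval(interval, true);   // content changed: shorter wait
        System.out.println("after a change:  " + interval.toHours() + " hours");
        interval = policy.nextInterval(interval, false);  // content unchanged: longer wait
        System.out.println("after no change: " + interval.toHours() + " hours");
    }
}
```

Because the scheduling code in such a design would depend only on the interface, an operator could trade the adaptive rule for a fixed interval (or any other rule) without touching the rest of the crawler.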

This work is expected to be extensively tested in mid- to late 2008 and officially released before the end of the year.

Other areas of upcoming attention, though not yet scheduled for specific target releases, include:

  • improving the usability and documentation of recently-added features (duplication-suppression; tunable prioritization) in typical operator workflows
  • improving the automated test coverage with simulated crawling, especially for non-default feature configurations and longer/performance-intensive test runs
  • better crawling of web video content with default configurations
  • a web-services interface as an alternative to JMX for remote-control of the crawler
  • new heuristics and knowledge-sharing for trap/spam reduction
  • improved during-crawl queue-oriented reporting, including better predictions of completion times
  • improved options for crawling access-controlled (password/other-auth) sites

Work in these areas may appear in 2.2.0 or later releases.

Developer Notes

Latest (for 2.2 and beyond):

General:

Regarding 2.0:

Regarding 1.12.X (duplicate reduction functionality):

Crawl Operator Notes


