2.0.0 Release Notes
Added by Gordon Mohr, last edited by Gordon Mohr on May 28, 2008  (view change)

Heritrix 2.0.0 Release Notes

Heritrix 2.0.0 is now available.

Heritrix 2.0.0 offers an updated user interface, a new settings framework and configuration file formats, new options for controlling the order in which URIs and sites are collected, more options for remote-controlling a crawl engine, and many other bug fixes and improvements.

What's New

For Crawl Operators

The Web UI for administering a crawl is now a separate application from the 'crawl engine' that actually collects/analyzes/stores information. As a result, while you can launch the engine and web UI inside the same Java VM, as previously, you can also launch the Web UI and engine in different JVMs, even on different machines. Also, one Web UI can administer multiple local or remote crawl engines. Finally, because the Web UI communicates with the engine strictly using "JMX" (Java Management Extensions), anything it can do can also be done by other JMX-speaking software, such as command-line utilities or alternate crawl management applications.

The model for configuring the crawler has changed significantly. The same settings exist, but they are now collected on 'sheets', which use a different on-disk format than Heritrix 1.x. An implicit/virtual sheet ('default') is never actually stored but contains settings from source code default values. A 'global' sheet collects whole-crawler and all-URL settings.

Other settings that only apply to some URLs may be collected in any number of other custom-named sheets, and then mapped to URLs not just by domain (as in 1.x) but by any SURT prefix. A single sheet can be mapped to many URLs; a URL may have more than one sheet mapped to it. (For example, a sheet 'slowly' with extra-long politeness delays can be mapped to sites a.com and b.com, while another sheet 'shallowly' with small max-hops limits can be mapped to b.com and c.com. Then, b.com will be crawled both 'slowly' and 'shallowly'.)
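The layering of sheets described above can be illustrated with a small Python model. This is only a sketch under stated assumptions, not Heritrix code: the setting names ('politeness-delay-ms', 'max-hops'), the SURT conversion (a simplified version that ignores ports and other normalization), and the rule that later-listed associations override earlier ones are all hypothetical choices made for the example.

```python
from urllib.parse import urlsplit


def surt_prefix(url: str) -> str:
    """Return a simplified SURT form of a URL.

    Host labels are reversed and comma-joined, e.g.
    http://www.b.com/news -> http://(com,b,www,)/news
    Real SURT handling normalizes more than this sketch does.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    reversed_host = ",".join(reversed(host.split("."))) + ","
    return f"{parts.scheme}://({reversed_host})" + (parts.path or "/")


# Hypothetical setting names and values, for illustration only.
DEFAULT = {"politeness-delay-ms": 1000, "max-hops": 20}  # implicit 'default' sheet
GLOBAL = {"max-hops": 15}                                # 'global' sheet
SHEETS = {
    "slowly": {"politeness-delay-ms": 30000},
    "shallowly": {"max-hops": 2},
}

# (SURT prefix, sheet name) pairs; a prefix without a closing parenthesis
# covers the domain and all of its subdomains.
ASSOCIATIONS = [
    ("http://(com,a,", "slowly"),
    ("http://(com,b,", "slowly"),
    ("http://(com,b,", "shallowly"),
    ("http://(com,c,", "shallowly"),
]


def sheets_for(url: str) -> list:
    """Names of the custom sheets whose SURT prefix matches the URL."""
    surt = surt_prefix(url)
    return [name for prefix, name in ASSOCIATIONS if surt.startswith(prefix)]


def effective_settings(url: str) -> dict:
    """Overlay default, then global, then each matching sheet (assumed order)."""
    settings = {**DEFAULT, **GLOBAL}
    for name in sheets_for(url):
        settings.update(SHEETS[name])
    return settings
```

In this model, effective_settings("http://b.com/") picks up both the long delay from 'slowly' and the small hop limit from 'shallowly', while a URL on an unmapped site sees only the default and global values.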

Both individual URIs and the queues of URIs within a crawler frontier may now have an integer 'precedence' value assigned to them, via separate URI and queue precedence policies configured via settings sheets. A URI's precedence affects the order in which it is collected relative to other URIs in the same queue. A queue's precedence affects which queues actively provide URIs for crawling when many queues are available and waiting. Lower numbers mean earlier (higher ranked) consideration - 1 is intended as the highest precedence. Policies already included with Heritrix, combined with sheet override configuration, can achieve a variety of desired prioritizations of crawl work. As with other crawler components, custom policies can also be implemented in Java.
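To make the two levels of precedence concrete, here is a minimal Python sketch. It is a deliberate simplification and not the frontier's actual algorithm: it always drains the best-ranked non-empty queue first and ignores politeness delays, queue rotation, and everything else a real frontier does. The class and function names are hypothetical.

```python
import heapq
from itertools import count


class PrecedenceQueue:
    """A per-site queue whose URIs come out lowest precedence number first."""

    def __init__(self, name, precedence):
        self.name = name
        self.precedence = precedence  # queue-level precedence
        self._heap = []
        self._order = count()  # tie-breaker: FIFO among equal precedence

    def add(self, uri, precedence):
        heapq.heappush(self._heap, (precedence, next(self._order), uri))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def crawl_order(queues):
    """Return URIs in the order this simplified model would collect them."""
    order = []
    pending = [q for q in queues if len(q)]
    while pending:
        # Lower number means earlier consideration; 1 is the highest precedence.
        best = min(pending, key=lambda q: q.precedence)
        order.append(best.pop())
        pending = [q for q in pending if len(q)]
    return order


news = PrecedenceQueue("news.example", precedence=1)
old = PrecedenceQueue("old.example", precedence=5)
old.add("http://old.example/", 1)
news.add("http://news.example/deep", 3)
news.add("http://news.example/", 1)
news.add("http://news.example/about", 3)

order = crawl_order([old, news])
```

Here the 'news.example' queue is served before 'old.example' because of its queue precedence, and within it the front page (URI precedence 1) is collected before the two precedence-3 URIs.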

For Developers

Heritrix has been split into subprojects of related (and reusable) functionality.

The innermost subproject, 'commons', includes web archiving core utility classes usable by a crawler and other related applications, such as analysis and access tools.

The next subproject, 'modules', includes crawl-focused modules with some potential applicability outside a crawler. It requires 'commons'.

The next subproject, 'engine', is a crawler with no user-interface beyond the JMX remote-control functionality. It requires both 'commons' and 'modules'.

The next subproject, 'webui', is the web-browser-based user-interface. Its output is a WAR file that can be run in the same JVM as the crawl engine, or elsewhere. It requires 'engine', 'modules', and 'commons'.

Many classes have moved or been renamed.

Heritrix 2 is now built with Maven 2, and our continuous build system has migrated to Continuum.

This somewhat encumbers usage in Eclipse. We have a guide to setting up Heritrix inside Eclipse at Setting up the new Heritrix in Eclipse. Key points to remember are: (1) an initial Maven build, either from the command-line or the M2Eclipse extension, is required to populate your local Maven repository with necessary 3rd-party classes; (2) you must set the M2_REPO build-path variable for the Eclipse project to find that local repository; (3) the ctrl-shift-t shortcut for finding a class by typing its name is helpful for locating classes in whatever subproject or package they now reside.

Known Limitations

Converting 1.x crawl configurations to 2.0 remains a manual process. For now, we recommend consulting a 1.x order.xml (or various override settings.xml files) while using the web UI to construct an analogous crawl configuration. (While a few classes or settings have moved, names and capabilities remain substantially alike.) We are still collecting feedback on Heritrix 2 configuration formats and options and expect to make changes in future releases to ease crawl design, transition from previous configurations, and archival of the entire crawl configuration.

Heritrix 2 has received very little testing so far on platforms other than Linux; functionality and support scripts which worked in 1.x will likely need updating, even if currently included in the distribution. An issue to note under Windows is that viewing crawl job files/directories in other programs (including a command-shell) may prevent a runnable job from moving between states (ready to active, active to completed), as in Heritrix 2.0 this requires a directory rename. (See HER-1477: http://webteam.archive.org/jira/browse/HER-1477 .)

Documentation is very limited compared to the 1.x user manual and developer manual.

The focus for Heritrix 2.0 has been to deploy the new prioritization features, refactored internals, and settings options. In some cases lesser-used old functionality has been removed or is temporarily broken. A few notable areas of removed functionality:

The old specialized Scope classes have been eliminated; there is a 'scope' setting on relevant modules, which always expects a module of type DecideRule (which includes sequence compositions of DecideRules).

The contributed AdaptiveRevisitFrontier has not yet been updated for operation under 2.0.0.

The 2.x settings system has no current analogue to the 'refinements' feature of Heritrix 1.x.
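As an illustration of the 'scope' setting mentioned above, where a single DecideRule may itself be a sequence of DecideRules, the following Python sketch models one plausible way such a sequence can be evaluated. It assumes that each rule answers ACCEPT, REJECT, or PASS (no opinion) and that the last non-PASS answer in the sequence wins; those semantics, and every function name here, are assumptions made for the example rather than a description of the Heritrix classes.

```python
from enum import Enum
from urllib.parse import urlsplit


class Decision(Enum):
    ACCEPT = "accept"
    REJECT = "reject"
    PASS = "pass"  # the rule has no opinion about this URI


def reject_all(uri, hops):
    """Starting point: nothing is in scope unless a later rule accepts it."""
    return Decision.REJECT


def accept_host_suffix(suffix):
    """Accept URIs on the given domain or any of its subdomains."""
    def rule(uri, hops):
        host = urlsplit(uri).hostname or ""
        if host == suffix or host.endswith("." + suffix):
            return Decision.ACCEPT
        return Decision.PASS
    return rule


def reject_too_many_hops(max_hops):
    """Reject URIs discovered more than max_hops links from a seed."""
    def rule(uri, hops):
        return Decision.REJECT if hops > max_hops else Decision.PASS
    return rule


def decide_sequence(rules):
    """Compose rules; the result is itself a rule, so sequences can nest."""
    def rule(uri, hops):
        decision = Decision.PASS
        for inner in rules:
            outcome = inner(uri, hops)
            if outcome is not Decision.PASS:
                decision = outcome  # a later opinion overrides an earlier one
        return decision
    return rule


scope = decide_sequence([
    reject_all,
    accept_host_suffix("example.com"),
    reject_too_many_hops(3),
])
```

With this scope, a URI on www.example.com found one hop from a seed is accepted, the same URI found five hops away is rejected, and URIs on other hosts remain rejected by the first rule.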

You can view a complete list of tracked issues reported in Heritrix 2.0.0 as released.

If you discover other gaps relative to 1.x, please report them via our public issue tracker.

Getting the 2.0.0 Release

This release may be obtained from our Sourceforge files area:

Heritrix2 Releases

Getting Started

For operators:

2.0 Tutorial

For developers:

Setting up the new Heritrix in Eclipse

Documentation for 2.0 is limited compared to 1.x, but notes on using new features will be available on the project wiki. See Precedence Feature Notes for information about the URI and queue prioritization features.

Resolved Issues

The following tracked issues are recorded as addressed in this 2.0.0 release:

http://webteam.archive.org/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10033

IA Webteam JIRA (55 issues)
Type | Key | Summary | Status
Task | HER-1476 | release: create wiki release notes, force/tag official build, upload to sourceforge | Resolved
Bug | HER-1475 | Some processors (ex. ExtractorUniversal, ExtractorXML) don't appear in Add New Element -> Pick an object type to create dropdown | Resolved
Bug | HER-1471 | CrawlURI.setStateProvider doing contending redundant work | Resolved
Bug | HER-1465 | "No CrawlURI from ready non-empty queue" SEVERE errors | Resolved
Improvement | HER-1461 | sort order of completed jobs: lex unhelpful, reverse chronological? | Resolved
Bug | HER-1456 | can't remove list item in override sheet | Resolved
Bug | HER-1455 | surt 'dump' files not landing where expected (are they anywhere?) | Resolved
Bug | HER-1454 | job 'copy' misses important files | Resolved
Bug | HER-1452 | NPE in managerThread on crawl terminate | Resolved
Improvement | HER-1451 | Easier embedding for 2.0 | Resolved
Bug | HER-1449 | OOME in Deflater.init() after short busy crawl | Resolved
Bug | HER-1445 | Version number not placed in heritrix.properties file. | Closed
Bug | HER-1443 | crawl slows/stalls when many queues exhausted in a row (as when their URIs are preselected-out or mapped-elsewhere) | Resolved
Improvement | HER-1439 | rename 'OVERLY_EAGER_LINK_DETECTION' | Resolved
Bug | HER-1437 | lists, maps that aren't yet 'add' overriden in local sheet, details page misleads as to edittability | Resolved
Bug | HER-1434 | TrapSuppressExtractor missing from heritrix2 | Resolved
Bug | HER-1433 | HopsPathMatchesRegExpDecideRule has no configuration options | Resolved
Improvement | HER-1432 | MatchesListRegExpDecideRule list-logic boolean should be renamed | Resolved
Bug | HER-1431 | 'details' page 'add new element' drop-downs don't stay up | Resolved
Improvement | HER-1430 | in edit sheet details of lists and maps, a 'delete' option makes sense | Resolved
Improvement | HER-1429 | when editting sheet, navigation to 'details' loses unsubmitted edits | Resolved
Improvement | HER-1428 | no place for descriptive names of scope rules | Resolved
Improvement | HER-1427 | 'submit' on settings sheet 'add new element' forms sends back to sheet, which is often wrong | Resolved
Improvement | HER-1426 | eliminate operator-from as requirement for valid configuration | Resolved
Bug | HER-1423 | JMXSheetManager get() and resolve() issues | Resolved
Bug | HER-1422 | web ui can't reorder (move up, move down) scope decide-rules | Resolved
Bug | HER-1414 | login page needs submit button | Resolved
Bug | HER-1401 | SURT add/remove pages indistinguishable | Resolved
Bug | HER-1400 | Confusing descriptive text on SURT add/remove screen | Resolved
Bug | HER-1399 | Error page after re-authentication due to session time-out. | Resolved
Bug | HER-1396 | README.txt is missing a CR/LF at the end of the file. | Resolved
Bug | HER-1395 | Unconventional command-line behavior -- -h/--help | Resolved
Bug | HER-1392 | Unconventional tarball/package name | Resolved
Bug | HER-1390 | not all ignored seed lines are logged | Resolved
Bug | HER-1386 | Bashisms in shell scripts result in ".kill: 195: No such process" error on Ubuntu 7.10 | Resolved
Bug | HER-1384 | crawl status errors in crawl report | Resolved
New Feature | HER-1381 | Ability to rotate log files via JMX/Web UI | Resolved
Bug | HER-1375 | Add a remote crawl engine via web UI only allows 2 engines | Resolved
Bug | HER-1373 | Reloading a just launched job causes an exception | Resolved
Bug | HER-1372 | filter for operator-from does not accept emails with dots in the address | Resolved
Bug | HER-1371 | When calling bin/heritrix with no options, script errors | Resolved
Bug | HER-1369 | heritrix -r option: conflicting docs, help text, behavior | Resolved
Improvement | HER-1368 | Reference list of dependencies and 3rd-party-code licenses | Resolved
Improvement | HER-1358 | include managerThread in threads report | Resolved
Bug | HER-1348 | Tables in web UI should be prettier | Resolved
New Feature | HER-1347 | Offer dump of entire associations database | Resolved
Bug | HER-1346 | Redirect state directory in config.txt | Resolved
Bug | HER-1338 | heritrix.properties should be named logging.properties | Resolved
Bug | HER-1313 | Web UI should only display actual job directories | Resolved
Bug | HER-1308 | CrawlJobManagerImpl.writeLines is inefficient, has mysterious ranged-replace capability | Resolved
Bug | HER-1279 | JerichoExtractorHTML fails to extract links when multiple attributes per element exist | Resolved
Bug | HER-1278 | JBDB override class obsoleted by recent update of JE.jar | Resolved
Bug | HER-1275 | Can't add list elements in an override sheet | Resolved
Bug | HER-1265 | PersistLogProcessor, PersistProcessor log robustness on write, close, read | Resolved
Bug | HER-1236 | Allow variables in file paths | Resolved

For issues fixed in previous test releases, see:

Reporting Issues

Bugs and other issues or suggested improvements/features may be submitted through our public issue tracker.

The project discussion list is hosted at Yahoo Groups.
