Heritrix 2.0.0 Release Notes
Heritrix 2.0.0 is now available.
Heritrix 2.0.0 offers an updated user interface, a new settings framework and configuration file formats, new options for controlling the order in which URIs and sites are collected, more options for remote-controlling a crawl engine, and many other bug fixes and improvements.
What's New
For Crawl Operators
The Web UI for administering a crawl is now a separate application from the 'crawl engine' that actually collects/analyzes/stores information. As a result, while you can launch the engine and web UI inside the same Java VM, as previously, you can also launch the Web UI and engine in different JVMs, even on different machines. Also, one Web UI can administer multiple local or remote crawl engines. Finally, because the Web UI communicates with the engine strictly using "JMX" (Java Management Extensions), anything it can do can also be done by other JMX-speaking software, such as command-line utilities or alternate crawl management applications.
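As a rough illustration of what "other JMX-speaking software" could look like, the sketch below is a minimal generic JMX client in Java. It uses only the standard javax.management API. The names and operations of the crawl engine's own MBeans are not given in these notes, so the example reads a built-in JVM MBean (java.lang:type=Runtime) instead; the class name JmxClientSketch and its helper methods are made up for this example. When a JMX service URL is passed as the first argument it connects to that remote JVM, otherwise it inspects its own JVM.

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import java.util.TreeSet;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper class; not part of Heritrix.
final class JmxClientSketch {

    /** Lists the canonical names of all MBeans registered under a JMX domain. */
    static Set<String> listMBeans(MBeanServerConnection conn, String domain) throws Exception {
        Set<String> names = new TreeSet<>();
        for (ObjectName name : conn.queryNames(new ObjectName(domain + ":*"), null)) {
            names.add(name.getCanonicalName());
        }
        return names;
    }

    /** Reads one attribute of one MBean and renders it as text. */
    static String readAttribute(MBeanServerConnection conn, String mbean, String attribute)
            throws Exception {
        return String.valueOf(conn.getAttribute(new ObjectName(mbean), attribute));
    }

    private static void report(MBeanServerConnection conn) throws Exception {
        // "java.lang" exists in every JVM; an engine's own domain would be queried the same way.
        for (String name : listMBeans(conn, "java.lang")) {
            System.out.println(name);
        }
        System.out.println("VM: " + readAttribute(conn, "java.lang:type=Runtime", "VmName"));
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // args[0] is a JMX service URL supplied by the operator.
            try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(args[0]))) {
                report(connector.getMBeanServerConnection());
            }
        } else {
            // No URL given: fall back to the MBean server of this JVM.
            report(ManagementFactory.getPlatformMBeanServer());
        }
    }
}
```

The same two calls (queryNames and getAttribute), together with MBeanServerConnection.invoke for operations, are the building blocks a command-line utility or alternate management application would use.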
The model for configuring the crawler has changed significantly. The same settings exist, but they are now collected on 'sheets', which use a different on-disk format than Heritrix 1.x. An implicit/virtual sheet ('default') is never actually stored but contains the default values from the source code. A 'global' sheet collects whole-crawler and all-URL settings.
Other settings that apply only to some URLs may be collected in any number of other custom-named sheets, and then mapped to URLs not just by domain (as in 1.x) but by any SURT prefix. A single sheet can be mapped to many URLs; a URL may have more than one sheet mapped to it. (For example, a sheet 'slowly' with extra-long politeness delays can be mapped to sites a.com and b.com, while another sheet 'shallowly' with small max-hops limits can be mapped to b.com and c.com. Then, b.com will be crawled both 'slowly' and 'shallowly'.)
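The 'slowly'/'shallowly' example can be modelled in a few lines of Java. This is only a sketch of the layering idea, not the actual settings classes or file format: the class SheetSketch, the setting names politeness-delay-ms and max-hops, and the rule that sheets on longer (more specific) prefixes are applied later are all assumptions made for illustration. URLs are passed in already converted to SURT form, for example http://(com,b,)/ for b.com.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of sheets and SURT-prefix associations; not Heritrix code.
final class SheetSketch {
    private final Map<String, Map<String, Object>> sheets = new LinkedHashMap<>();
    // SURT prefix -> names of sheets mapped to it, iterated in sorted prefix order.
    private final TreeMap<String, List<String>> associations = new TreeMap<>();

    void defineSheet(String name, Map<String, Object> settings) {
        sheets.put(name, new LinkedHashMap<>(settings));
    }

    void associate(String surtPrefix, String sheetName) {
        if (!sheets.containsKey(sheetName)) {
            throw new IllegalArgumentException("unknown sheet: " + sheetName);
        }
        associations.computeIfAbsent(surtPrefix, k -> new ArrayList<>()).add(sheetName);
    }

    /** Effective settings for one SURT: default, then global, then every matching sheet. */
    Map<String, Object> resolve(String surt) {
        Map<String, Object> effective = new LinkedHashMap<>();
        effective.putAll(sheets.getOrDefault("default", Map.of()));
        effective.putAll(sheets.getOrDefault("global", Map.of()));
        for (Map.Entry<String, List<String>> e : associations.entrySet()) {
            if (surt.startsWith(e.getKey())) {
                for (String sheetName : e.getValue()) {
                    effective.putAll(sheets.get(sheetName));
                }
            }
        }
        return effective;
    }

    /** Builds the a.com / b.com / c.com example from the text. */
    static SheetSketch example() {
        SheetSketch s = new SheetSketch();
        s.defineSheet("default", Map.of("politeness-delay-ms", 3000, "max-hops", 20));
        s.defineSheet("slowly", Map.of("politeness-delay-ms", 30000));
        s.defineSheet("shallowly", Map.of("max-hops", 2));
        s.associate("http://(com,a,", "slowly");
        s.associate("http://(com,b,", "slowly");
        s.associate("http://(com,b,", "shallowly");
        s.associate("http://(com,c,", "shallowly");
        return s;
    }

    public static void main(String[] args) {
        SheetSketch s = example();
        System.out.println("a.com: " + s.resolve("http://(com,a,)/"));
        System.out.println("b.com: " + s.resolve("http://(com,b,)/index.html"));
        System.out.println("c.com: " + s.resolve("http://(com,c,www,)/"));
    }
}
```

In this model b.com ends up with both the long delay and the small hop limit, while a URL that matches no prefix falls back to the values on the 'default' sheet.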
Both individual URIs and the queues of URIs within a crawler frontier may now have an integer 'precedence' value assigned to them, via separate URI and queue precedence policies configured via settings sheets. A URI's precedence affects the order in which it is collected relative to other URIs in the same queue. A queue's precedence affects which queues actively provide URIs for crawling when many queues are available and waiting. Lower numbers mean earlier (higher ranked) consideration - 1 is intended as the highest precedence. Policies already included with Heritrix, combined with sheet override configuration, can achieve a variety of desired prioritizations of crawl work. As with other crawler components, custom policies can also be implemented in Java.
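To make the two levels of precedence concrete, here is a deliberately simplified Java model. It is not the frontier implementation: the class PrecedenceSketch and its methods are invented for this example, politeness delays and queue rotation are ignored, and the choice to fully drain the best-ranked non-empty queue before touching lower-ranked ones is an assumption. It only shows that lower numbers are served first, both among queues and among the URIs inside one queue.

```java
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical toy frontier; not Heritrix code.
final class PrecedenceSketch {

    private static final class Entry {
        final String uri;
        final int precedence;
        final long seq; // insertion order, used to break ties
        Entry(String uri, int precedence, long seq) {
            this.uri = uri;
            this.precedence = precedence;
            this.seq = seq;
        }
    }

    private static final class WorkQueue {
        final String name;
        final int precedence;
        final PriorityQueue<Entry> uris = new PriorityQueue<>(
                Comparator.comparingInt((Entry e) -> e.precedence)
                          .thenComparingLong(e -> e.seq));
        WorkQueue(String name, int precedence) {
            this.name = name;
            this.precedence = precedence;
        }
    }

    private final Map<String, WorkQueue> queues = new HashMap<>();
    private long counter;

    void addQueue(String name, int precedence) {
        queues.put(name, new WorkQueue(name, precedence));
    }

    void schedule(String queueName, String uri, int uriPrecedence) {
        WorkQueue q = queues.get(queueName);
        if (q == null) {
            throw new IllegalArgumentException("unknown queue: " + queueName);
        }
        q.uris.add(new Entry(uri, uriPrecedence, counter++));
    }

    /** Returns the next URI to crawl, or null when every queue is empty. */
    String next() {
        WorkQueue best = null;
        for (WorkQueue q : queues.values()) {
            if (q.uris.isEmpty()) {
                continue;
            }
            boolean better = best == null
                    || q.precedence < best.precedence
                    || (q.precedence == best.precedence && q.name.compareTo(best.name) < 0);
            if (better) {
                best = q;
            }
        }
        return best == null ? null : best.uris.poll().uri;
    }

    public static void main(String[] args) {
        PrecedenceSketch frontier = new PrecedenceSketch();
        frontier.addQueue("a.com", 1);
        frontier.addQueue("b.com", 3);
        frontier.schedule("b.com", "http://b.com/", 1);
        frontier.schedule("a.com", "http://a.com/deep/page", 5);
        frontier.schedule("a.com", "http://a.com/", 1);
        for (String uri = frontier.next(); uri != null; uri = frontier.next()) {
            System.out.println(uri);
        }
    }
}
```

Here the a.com queue (precedence 1) is served before b.com (precedence 3), and within a.com the URI with precedence 1 comes out before the one with precedence 5 even though it was scheduled later.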
For Developers
Heritrix has been split into subprojects of related (and reusable) functionality.
The innermost subproject, 'commons', includes web archiving core utility classes usable by a crawler and other related applications, such as analysis and access tools.
The next subproject, 'modules', includes crawl-focused modules with some potential applicability outside a crawler. It requires 'commons'.
The next subproject, 'engine', is a crawler with no user-interface beyond the JMX remote-control functionality. It requires both 'commons' and 'modules'.
The next subproject, 'webui', is the web-browser-based user-interface. Its output is a WAR file that can be run in the same JVM as the crawl engine, or elsewhere. It requires 'engine', 'modules', and 'commons'.
Many classes have moved or been renamed.
Heritrix 2 is now built with Maven 2, and our continuous build system has migrated to Continuum.
This somewhat complicates use in Eclipse. We have a guide to setting up Heritrix inside Eclipse at Setting up the new Heritrix in Eclipse. Key points to remember are: (1) an initial Maven build, either from the command-line or the M2Eclipse extension, is required to populate your local Maven repository with the necessary third-party classes; (2) you must set the M2_REPO build-path variable for the Eclipse project to find that local repository; (3) the ctrl-shift-t shortcut for finding a class by typing its name is helpful for locating classes in whatever subproject or package they now reside.
Known Limitations
Converting 1.x crawl configurations to 2.0 remains a manual process. For now, we recommend consulting a 1.x order.xml (or various override settings.xml files) while using the web UI to construct an analogous crawl configuration. (While a few classes or settings have moved, names and capabilities remain substantially alike.) We are still collecting feedback on Heritrix 2 configuration formats and options and expect to make changes in future releases to ease crawl design, transition from previous configurations, and archival of the entire crawl configuration.
Heritrix 2 has received very little testing so far on platforms other than Linux; functionality and support scripts which worked in 1.x will likely need updating, even if currently included in the distribution. An issue to note under Windows is that viewing crawl job files/directories in other programs (including a command-shell) may prevent a runnable job from moving between states (ready to active, active to completed), as in Heritrix 2.0 this requires a directory rename. (See HER-1477: http://webteam.archive.org/jira/browse/HER-1477.)
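The Windows issue above follows from how a rename-based state change behaves. The Java sketch below is a hypothetical illustration only: the class JobStateSketch and the directory layout it uses (the state as a prefix of the job directory name, such as ready-myjob and active-myjob) are assumptions, not taken from the Heritrix sources. It shows that the transition is a single directory move, so it either succeeds as a whole or is refused, which is what typically happens on Windows while another program holds the directory open.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration of a state change done by renaming a job directory.
final class JobStateSketch {

    /**
     * Renames jobsRoot/"from-jobName" to jobsRoot/"to-jobName".
     * Returns false when the file system refuses the rename, for example
     * because the source is missing or (on Windows) is open in another program.
     */
    static boolean transition(Path jobsRoot, String jobName, String from, String to) {
        Path source = jobsRoot.resolve(from + "-" + jobName);
        Path target = jobsRoot.resolve(to + "-" + jobName);
        try {
            Files.move(source, target, StandardCopyOption.ATOMIC_MOVE);
            return true;
        } catch (IOException e) {
            // The job stays in its previous state; the caller may retry later.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("jobs");
        Files.createDirectory(root.resolve("ready-myjob"));
        System.out.println("ready -> active: " + transition(root, "myjob", "ready", "active"));
        System.out.println("active -> completed: " + transition(root, "myjob", "active", "completed"));
    }
}
```

A practical consequence is that closing file browsers and command shells that are positioned inside a job directory before launching or finishing a job avoids the refused rename.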
Documentation is very limited compared to the 1.x user manual and developer manual.
The focus for Heritrix 2.0 has been to deploy the new prioritization features, refactored internals, and settings options. In some cases lesser-used old functionality has been removed or is temporarily broken. A few notable areas of removed functionality:
The old specialized Scope classes have been eliminated; there is a 'scope' setting on relevant modules, which always expects a module of type DecideRule (which includes sequence compositions of DecideRules).
The contributed AdaptiveRevisitFrontier has not yet been updated for operation under 2.0.0.
The 2.x settings system has no current analogue to the 'refinements' feature of Heritrix 1.x.
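For the first item in the list above, the idea of a scope built from a sequence of decide rules can be sketched as follows. This is not the DecideRule API itself: the types ScopeSketch, Rule and Decision are invented for the example, and the composition rule used here (every rule is consulted in order and the last decision other than PASS wins) is stated as an assumption about how such sequences behave.

```java
import java.util.List;

// Hypothetical stand-ins for decide rules; not the Heritrix classes.
final class ScopeSketch {

    enum Decision { ACCEPT, REJECT, PASS }

    interface Rule {
        Decision decide(String uri);
    }

    /** Composes rules: each is applied in order and the last non-PASS decision wins. */
    static Rule sequence(List<Rule> rules) {
        return uri -> {
            Decision result = Decision.PASS;
            for (Rule rule : rules) {
                Decision d = rule.decide(uri);
                if (d != Decision.PASS) {
                    result = d;
                }
            }
            return result;
        };
    }

    static Rule rejectAll() {
        return uri -> Decision.REJECT;
    }

    static Rule acceptPrefix(String prefix) {
        return uri -> uri.startsWith(prefix) ? Decision.ACCEPT : Decision.PASS;
    }

    static Rule rejectSuffix(String suffix) {
        return uri -> uri.endsWith(suffix) ? Decision.REJECT : Decision.PASS;
    }

    public static void main(String[] args) {
        // Start by rejecting everything, rule one site in, then rule large images out.
        Rule scope = sequence(List.of(
                rejectAll(),
                acceptPrefix("http://example.com/"),
                rejectSuffix(".iso")));
        System.out.println(scope.decide("http://example.com/index.html"));
        System.out.println(scope.decide("http://example.com/big.iso"));
        System.out.println(scope.decide("http://other.org/"));
    }
}
```

Because a sequence is itself a rule, a whole sequence can be placed wherever a single rule is expected, which is why one 'scope' setting of this type can replace the specialized scope classes.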
You can view a complete list of tracked issues reported in Heritrix 2.0.0 as released.
If you discover other gaps relative to 1.x, please report them via our public issue tracker.
Getting the 2.0.0 Release
This release may be obtained from our Sourceforge files area:
Heritrix2 Releases
Getting Started
For operators:
2.0 Tutorial
For developers:
Setting up the new Heritrix in Eclipse
Documentation for 2.0 is limited compared to 1.x, but notes on using new features will be available on the project wiki. See Precedence Feature Notes for information about the URI and queue prioritization features.
Resolved Issues
The following tracked issues are recorded as addressed in this 2.0.0 release:
http://webteam.archive.org/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10033
IA Webteam JIRA (55 issues)
Key | Summary | Status
HER-1476 | release: create wiki release notes, force/tag official build, upload to sourceforge | Resolved
HER-1475 | Some processors (ex. ExtractorUniversal, ExtractorXML) don't appear in Add New Element -> Pick an object type to create dropdown | Resolved
HER-1471 | CrawlURI.setStateProvider doing contending redundant work | Resolved
HER-1465 | "No CrawlURI from ready non-empty queue" SEVERE errors | Resolved
HER-1461 | sort order of completed jobs: lex unhelpful, reverse chronological? | Resolved
HER-1456 | can't remove list item in override sheet | Resolved
HER-1455 | surt 'dump' files not landing where expected (are they anywhere?) | Resolved
HER-1454 | job 'copy' misses important files | Resolved
HER-1452 | NPE in managerThread on crawl terminate | Resolved
HER-1451 | Easier embedding for 2.0 | Resolved
HER-1449 | OOME in Deflater.init() after short busy crawl | Resolved
HER-1445 | Version number not placed in heritrix.properties file. | Closed
HER-1443 | crawl slows/stalls when many queues exhausted in a row (as when their URIs are preselected-out or mapped-elsewhere) | Resolved
HER-1439 | rename 'OVERLY_EAGER_LINK_DETECTION' | Resolved
HER-1437 | lists, maps that aren't yet 'add' overriden in local sheet, details page misleads as to edittability | Resolved
HER-1434 | TrapSuppressExtractor missing from heritrix2 | Resolved
HER-1433 | HopsPathMatchesRegExpDecideRule has no configuration options | Resolved
HER-1432 | MatchesListRegExpDecideRule list-logic boolean should be renamed | Resolved
HER-1431 | 'details' page 'add new element' drop-downs don't stay up | Resolved
HER-1430 | in edit sheet details of lists and maps, a 'delete' option makes sense | Resolved
HER-1429 | when editting sheet, navigation to 'details' loses unsubmitted edits | Resolved
HER-1428 | no place for descriptive names of scope rules | Resolved
HER-1427 | 'submit' on settings sheet 'add new element' forms sends back to sheet, which is often wrong | Resolved
HER-1426 | eliminate operator-from as requirement for valid configuration | Resolved
HER-1423 | JMXSheetManager get() and resolve() issues | Resolved
HER-1422 | web ui can't reorder (move up, move down) scope decide-rules | Resolved
HER-1414 | login page needs submit button | Resolved
HER-1401 | SURT add/remove pages indistinguishable | Resolved
HER-1400 | Confusing descriptive text on SURT add/remove screen | Resolved
HER-1399 | Error page after re-authentication due to session time-out. | Resolved
HER-1396 | README.txt is missing a CR/LF at the end of the file. | Resolved
HER-1395 | Unconventional command-line behavior -- -h/--help | Resolved
HER-1392 | Unconventional tarball/package name | Resolved
HER-1390 | not all ignored seed lines are logged | Resolved
HER-1386 | Bashisms in shell scripts result in ".kill: 195: No such process" error on Ubuntu 7.10 | Resolved
HER-1384 | crawl status errors in crawl report | Resolved
HER-1381 | Ability to rotate log files via JMX/Web UI | Resolved
HER-1375 | Add a remote crawl engine via web UI only allows 2 engines | Resolved
HER-1373 | Reloading a just launched job causes an exception | Resolved
HER-1372 | filter for operator-from does not accept emails with dots in the address | Resolved
HER-1371 | When calling bin/heritrix with no options, script errors | Resolved
HER-1369 | heritrix -r option: conflicting docs, help text, behavior | Resolved
HER-1368 | Reference list of dependencies and 3rd-party-code licenses | Resolved
HER-1358 | include managerThread in threads report | Resolved
HER-1348 | Tables in web UI should be prettier | Resolved
HER-1347 | Offer dump of entire associations database | Resolved
HER-1346 | Redirect state directory in config.txt | Resolved
HER-1338 | heritrix.properties should be named logging.properties | Resolved
HER-1313 | Web UI should only display actual job directories | Resolved
HER-1308 | CrawlJobManagerImpl.writeLines is inefficient, has mysterious ranged-replace capability | Resolved
HER-1279 | JerichoExtractorHTML fails to extract links when multiple attributes per element exist | Resolved
HER-1278 | JBDB override class obsoleted by recent update of JE.jar | Resolved
HER-1275 | Can't add list elements in an override sheet | Resolved
HER-1265 | PersistLogProcessor, PersistProcessor log robustness on write, close, read | Resolved
HER-1236 | Allow variables in file paths | Resolved
For issues fixed in previous test releases, see:
Reporting Issues
Bugs and other issues or suggested improvements/features may be submitted through our public issue tracker.
The project discussion list is hosted at Yahoo Groups.