Release Notes - 1.14.0
Added by Gordon Mohr, last edited by Gordon Mohr on May 01, 2008

Release Notes - 1.14.0 (April 2008)

These are the project wiki Release Notes for the 1.14.0 release.

Release 1.14.0 adds a number of small features to the Heritrix 1.x line, most notably upgrading support for the WARC archived-web-content format to version 0.17 (ISO Committee Draft). This release also includes 41 bug fixes or other incremental improvements, including several based on community contributions or requests.

The 1.14.0 release is now available from the archive-crawler SourceForge project.

Notable Changes

WARC/0.17 support (HER-1180)

The WARC support now matches the 0.17 specification version (ISO Committee Draft). The prefix 'Experimental' has been removed from WARC support class names.

'Public suffix'-based queue policy (HER-1175, HER-1177)

A new TopmostAssignedSurtQueueAssignmentPolicy assigns URIs to queues based on the information from publicsuffix.org. Specifically, the queue name will be based on the SURT form of the topmost domain that may be assigned from a name registry. This tends to group related subdomains in the same queue.
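As an illustration of the grouping effect, here is a minimal Java sketch that derives a queue key from a hostname. It is not the Heritrix implementation. The class name, the method name, and the four-entry suffix set are hypothetical stand-ins: the real list from publicsuffix.org is far larger and has wildcard and exception rules that this sketch ignores, and the exact key format Heritrix produces may differ.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; not the Heritrix class of similar name. */
final class TopmostAssignedSurtSketch {

    // Hypothetical, tiny stand-in for the publicsuffix.org list.
    private static final Set<String> PUBLIC_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "org", "uk", "co.uk"));

    /** Returns a SURT-style key for the topmost registrable domain of host. */
    static String queueKey(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");

        // Scan from the left so the first match is the longest public suffix.
        int suffixStart = -1;
        for (int i = 0; i < labels.length; i++) {
            String candidate =
                    String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (PUBLIC_SUFFIXES.contains(candidate)) {
                suffixStart = i;
                break;
            }
        }

        // Keep one label to the left of the suffix (the registered name).
        // If no suffix matched, fall back to the whole hostname.
        int first = (suffixStart < 0) ? 0 : Math.max(0, suffixStart - 1);

        // SURT form lists labels from most to least general, comma-terminated.
        StringBuilder key = new StringBuilder();
        for (int i = labels.length - 1; i >= first; i--) {
            key.append(labels[i]).append(',');
        }
        return key.toString();
    }
}
```

Under these assumptions, www.example.com and blog.example.com both map to the key com,example, and so would share one queue, while news.bbc.co.uk maps to uk,co,bbc,.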

Hosts Report (HER-1254)

The hosts report automatically dumped at the end of a crawl has two additional fields per listed host: number of URIs discovered but not fetched due to robots.txt rules, and number of URIs still pending/queued when the crawl ended.

BdbFrontier "dump-pending-at-close" Option (HER-1255)

BdbFrontier has a new 'expert' setting, "dump-pending-at-close". If true, during crawl termination, all URIs still pending/queued will be logged to the crawl.log with status '0' (untried).
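Once such entries are in the crawl.log, the still-pending URIs can be pulled out by filtering on the status field. The following is a minimal Java sketch of that filter. It assumes the crawl.log is whitespace-delimited with the timestamp first, the status code second, and the URI fourth; check that layout against your own logs before relying on it. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; field positions are an assumption. */
final class PendingUriFilter {

    /** Returns the URIs of all log lines whose status field is "0" (untried). */
    static List<String> untriedUris(List<String> crawlLogLines) {
        List<String> uris = new ArrayList<>();
        for (String line : crawlLogLines) {
            String[] fields = line.trim().split("\\s+");
            // Assumed layout: [0]=timestamp, [1]=status, [2]=size, [3]=URI.
            if (fields.length >= 4 && "0".equals(fields[1])) {
                uris.add(fields[3]);
            }
        }
        return uris;
    }
}
```

The input list could come from java.nio.file.Files.readAllLines on the crawl.log, although for a very large log a streaming read would be the better choice.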

JMX 'dumpUris' operation (HER-1154)

CrawlJob offers a new 'dumpUris' JMX operation, which offers options similar to the view URIs option in the web admin UI, but dumps URIs to a local file.
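These notes do not give the parameter list of 'dumpUris', so rather than guess it, a client can ask the MBean for the operation's signature before invoking it. The following Java sketch does that using only the standard javax.management API. The class and method names are hypothetical, and the ObjectName pattern that matches a CrawlJob bean in your installation is something you would need to supply.

```java
import java.util.ArrayList;
import java.util.List;
import javax.management.MBeanInfo;
import javax.management.MBeanOperationInfo;
import javax.management.MBeanParameterInfo;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

/** Illustrative sketch only; lists the signature of a named JMX operation. */
final class JmxOperationLister {

    /**
     * For every MBean matching the pattern, returns one line per operation
     * with the given name, showing its parameter types and names.
     */
    static List<String> describeOperation(MBeanServerConnection conn,
                                          ObjectName pattern,
                                          String operationName) throws Exception {
        List<String> found = new ArrayList<>();
        for (ObjectName name : conn.queryNames(pattern, null)) {
            MBeanInfo info = conn.getMBeanInfo(name);
            for (MBeanOperationInfo op : info.getOperations()) {
                if (!op.getName().equals(operationName)) {
                    continue;
                }
                StringBuilder sig = new StringBuilder(op.getName()).append('(');
                MBeanParameterInfo[] params = op.getSignature();
                for (int i = 0; i < params.length; i++) {
                    if (i > 0) {
                        sig.append(", ");
                    }
                    sig.append(params[i].getType()).append(' ')
                       .append(params[i].getName());
                }
                found.add(name + " : " + sig.append(')'));
            }
        }
        return found;
    }
}
```

Against a running crawler, the connection would typically come from JMXConnectorFactory.connect with a JMXServiceURL for the crawler's JMX port. With the reported signature in hand, the operation can then be called through MBeanServerConnection.invoke.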

Fixes for OutOfMemoryError (OOME) Risks (HER-1449, HER-1171)

Two distinct risks for triggering an OutOfMemoryError have been removed. The first was heap memory exhaustion in large crawls that require many queues (HER-1171). The second was non-heap memory exhaustion in fast crawls that need little heap memory, where garbage-collection-triggered finalization may lag (HER-1449).

Renamed settings: 'overly-eager-link-detection' (HER-1439) and 'bind-address' (HER-1045)

Two settings with potentially confusing names have been renamed. 'overly-eager-link-detection' in ExtractorHTML and JerichoExtractorHTML, with a default value of 'true', has been renamed 'extract-value-attributes' to reflect its effect more accurately. 'bind-address' in FetchHTTP, with a default value of the empty string, has been renamed 'http-bind-address', for consistency with 'http-proxy-host' and 'http-proxy-port' and to avoid confusion with the admin web UI bind address.

If you use order.xml configuration files from a previous version that still contain the old setting names, Heritrix will log non-fatal warnings/alerts about "Unknown attribute". To avoid these warnings, either rename the old settings in the order.xml or, if you are happy with the default values, simply delete them.
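For sites with many saved configurations, the rename can be scripted. The following minimal Java sketch rewrites the two old setting names in the text of an order.xml. It assumes settings appear as elements carrying a name="..." attribute and that no other module uses an attribute with either old name; both assumptions should be checked against your own files. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; a plain-text rename, not an XML-aware migration. */
final class SettingRenamer {

    // Old setting name -> new setting name, as given in these release notes.
    private static final Map<String, String> RENAMES = new LinkedHashMap<>();
    static {
        RENAMES.put("overly-eager-link-detection", "extract-value-attributes");
        RENAMES.put("bind-address", "http-bind-address");
    }

    /** Returns the order.xml text with old setting names replaced. */
    static String renameSettings(String orderXml) {
        String result = orderXml;
        for (Map.Entry<String, String> e : RENAMES.entrySet()) {
            // Match the whole quoted attribute value so that an already
            // renamed "http-bind-address" is left untouched.
            result = result.replace(
                    "name=\"" + e.getKey() + "\"",
                    "name=\"" + e.getValue() + "\"");
        }
        return result;
    }
}
```

Because the match includes the opening quote, running the rename twice gives the same result as running it once. Keep a backup of the original order.xml before overwriting it.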

Additional contributors

In addition to the usual suspects, this release includes contributed fixes or functionality from:

  • Matt Sanford
  • Eric C. Jensen
  • Kohei TAKEDA

All Tracked Changes

The following tracked issues are recorded as addressed in this 1.14.0 release:

http://webteam.archive.org/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10020

IA Webteam JIRA (48 issues)
Type | Key | Summary | Status
Improvement | HER-1490 | rename 'experimental' WARC classes | Resolved
Bug | HER-1488 | fix sourceforge website update from build | Resolved
Bug | HER-1460 | ClassicScope forceAcceptFilter inert, confusing | Resolved
Bug | HER-1449 | OOME in Deflater.init() after short busy crawl | Resolved
Bug | HER-1448 | [...] not treated as whitespace in href, affecting link extraction | Closed
Improvement | HER-1439 | rename 'OVERLY_EAGER_LINK_DETECTION' | Resolved
Bug | HER-1390 | not all ignored seed lines are logged | Resolved
Improvement | HER-1387 | Upgrade JE to 3.2.74 [heritrix1] | Resolved
Bug | HER-1322 | Update relative URI derelativization, esp. of naked ?query-strings, to match RFC3986 | Resolved
Bug | HER-1295 | broken-fetch retries triggering unwanted 304 responses | Resolved
Bug | HER-1292 | ARCWriter writes incorrect length in ARC files for 304 responses. | Resolved
Bug | HER-1289 | RuntimeException "CrawlServer must deserialize in a ToeThread" when doing recovery from log | Resolved
Bug | HER-1281 | crawl stalls with plenty of inactive queues; StoredQueue sync/corruption? | Resolved
Bug | HER-1280 | do not by default GET form action URLs declared as POST, because it can cause problems/complaints | Resolved
Bug | HER-1279 | JerichoExtractorHTML fails to extract links when multiple attributes per element exist | Resolved
Bug | HER-1278 | JBDB override class obsoleted by recent update of JE.jar | Resolved
Bug | HER-1265 | PersistLogProcessor, PersistProcessor log robustness on write, close, read | Resolved
Bug | HER-1261 | AdaptiveRevisitFrontier ignores JMX importUri(s) | Resolved
Improvement | HER-1259 | ARFrontier improvements (QueueAssignmentPolicy configuration, selective rescheduling) | Resolved
Improvement | HER-1256 | upgrade BDB-JE to 3.2.44 | Resolved
New Feature | HER-1255 | Add option to dump queues to crawl.log at end of crawl | Resolved
New Feature | HER-1254 | Per-host tallies of robots exclusions, uncrawled URIs at end of early-terminated crawl | Resolved
Bug | HER-1248 | (sub)domains of infinite breadth kill crawler throughput | Resolved
Bug | HER-1239 | FetchHTTP throws NullPointerException on checkpoint if cookies are disabled | Resolved
Bug | HER-1192 | Heritrix extracts and schedules invalid https: URI | Resolved
Bug | HER-1191 | https: (no slashes) in recovery from recovery log causes NPE in UURIFactory.checkHttpSchemeSpecificPartSlashPrefix | Resolved
Bug | HER-1188 | Deadlock risk in WorkQueueFrontier.receive and WorkQueueFrontier.deepestUri and WorkQueueFrontier.next | Closed
Improvement | HER-1181 | apply scope to recovery replays | Resolved
New Feature | HER-1180 | support iso-draft WARC (0.17) in writer/reader | Resolved
Bug | HER-1178 | ARCReader cannot handle ARC-files with records larger than 2 GB | Resolved
New Feature | HER-1177 | publicsuffix-based HashCrawlMapper reduce-pattern | Resolved
Improvement | HER-1176 | make queue-assignment-policy overridable, changeable | Resolved
New Feature | HER-1175 | publicsuffix-based queue assignment policy (TopmostAssignedSurtQueueAssignmentPolicy) | Closed
Bug | HER-1173 | forms with large values get "ArrayIndexOutOfBoundsException: 200000" (jetty limit) | Resolved
Bug | HER-1172 | (patch) ExtractorUniversal file handle leak | Closed
Bug | HER-1171 | Queues-of-queues (inactive,retired,etc.) grow without bound; OutOfMemoryError | Resolved
Bug | HER-1169 | CrawlMapper 'check-uri' doesn't work in late position (if fetch status != 0) | Resolved
Bug | HER-1168 | CrawlMapper broken by outlinks/outcandidates split | Resolved
Bug | HER-1166 | BloomFilter32bitSplit artificially limits effective size | Resolved
Bug | HER-1163 | "java.lang.ArithmeticException: / by zero" in org.archive.io.RecordingOutputStream.checkLimits(RecordingOutputStream.java:271) | Resolved
New Feature | HER-1154 | [contrib] JMX dumpUris method supporting the same behavior as the frontier search on the wui | Resolved
Bug | HER-1128 | ExtractorHTML fails to extract FRAME SRC link without whitespace before SRC | Resolved
Bug | HER-1123 | ContentTypeNotMatchesRegExpDecideRule not usable due to typo in src/conf/modules/DecideRule.options | Resolved
Improvement | HER-1045 | [contrib] make it easier to set FetchHTTP local bind address, use rotation of alternate local addresses | Resolved
Improvement | HER-1003 | Interruptible regex - TimeoutCharSequence? | Closed
Improvement | HER-907 | Log jmxclient start/pause/resume/stop/etc. events | Resolved
Bug | HER-661 | "failed get" "user-mapped section open" exceptions | Resolved
Bug | HER-616 | The UURI class may throw NullPointerException in getReferencedHost() | Resolved
