Release Notes - 1.14.0 (April 2008)
These are the project wiki Release Notes for the 1.14.0 release.
Release 1.14.0 adds a number of small features to the Heritrix 1.x line, most notably upgrading support for the WARC archived-web-content format to version 0.17 (ISO Committee Draft). This release also includes 41 bug fixes or other incremental improvements, including several based on community contributions or requests.
The 1.14.0 release is now available at the archive-crawler SourceForge project.
Notable Changes
WARC/0.17 support (HER-1180)
The WARC support now matches the 0.17 specification version (ISO Committee Draft). The prefix 'Experimental' has been removed from WARC support class names.
'Public suffix'-based queue policy (HER-1175, HER-1177)
A new TopmostAssignedSurtQueueAssignmentPolicy assigns URIs to queues based on the information from publicsuffix.org. Specifically, the queue name will be based on the SURT form of the topmost domain that may be assigned from a name registry. This tends to group related subdomains in the same queue.
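To make the grouping concrete, here is a minimal Python sketch of the idea, not the Heritrix implementation. It assumes a tiny hard-coded subset of public suffixes (the real publicsuffix.org list is much larger and has wildcard and exception rules), and the function names and the exact key format (reversed labels, comma-separated, trailing comma) are illustrative assumptions.

```python
# Tiny illustrative subset of public suffixes (an assumption for this sketch);
# the real list at publicsuffix.org is far larger and has wildcard rules.
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def topmost_assigned_domain(host):
    """Return the matching public suffix plus one more label.

    Falls back to the whole host when no suffix in the subset matches.
    """
    labels = host.lower().rstrip(".").split(".")
    # Scan from the longest candidate suffix so "co.uk" wins over "uk".
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in PUBLIC_SUFFIXES:
            # Keep one label to the left of the suffix, if there is one.
            start = max(i - 1, 0)
            return ".".join(labels[start:])
    return ".".join(labels)


def surt_queue_key(host):
    """Reverse the labels of the topmost assigned domain, SURT-style."""
    domain = topmost_assigned_domain(host)
    return ",".join(reversed(domain.split("."))) + ","


if __name__ == "__main__":
    for h in ("www.example.com", "images.example.com", "blog.example.co.uk"):
        print(h, "->", surt_queue_key(h))
```

In this sketch, www.example.com and images.example.com map to the same key, which is the grouping effect described above.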
Hosts Report (HER-1254)
The hosts report automatically dumped at the end of a crawl has two additional fields per listed host: number of URIs discovered but not fetched due to robots.txt rules, and number of URIs still pending/queued when the crawl ended.
BdbFrontier "dump-pending-at-close" Option (HER-1255)
BdbFrontier has a new 'expert' setting, "dump-pending-at-close". If true, during crawl termination, all URIs still pending/queued will be logged to the crawl.log with status '0' (untried).
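Once pending URIs are written to the crawl.log, they can be summarized with a short script. The Python sketch below counts status-0 lines per host. It assumes the usual whitespace-separated crawl.log layout with the timestamp, fetch status, size, and URI as the first four fields; the function name and the sample lines are made up for illustration, and the exact content of the dumped lines may differ.

```python
from collections import Counter
from urllib.parse import urlsplit


def pending_by_host(lines):
    """Count log lines whose fetch status is 0 (untried), grouped by URI host."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Assumed layout: timestamp, status, size, URI, then further fields.
        if len(fields) < 4 or fields[1] != "0":
            continue
        host = urlsplit(fields[3]).hostname or "(no host)"
        counts[host] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines; a real run would iterate over open("crawl.log").
    sample = [
        "2008-04-01T12:00:00.000Z 200 1234 http://example.com/ - - text/html",
        "2008-04-01T12:00:01.000Z 0 - http://example.com/a L http://example.com/ -",
        "2008-04-01T12:00:02.000Z 0 - http://example.org/b L http://example.com/ -",
    ]
    for host, n in pending_by_host(sample).most_common():
        print(host, n)
```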
JMX 'dumpUris' operation (HER-1154)
CrawlJob has a new 'dumpUris' JMX operation, which offers options similar to the view URIs option in the web admin UI, but dumps URIs to a local file.
Fixes for OutOfMemoryError (OOME) Risks (HER-1449, HER-1171)
Two distinct risks of triggering an OutOfMemoryError have been removed. One concerned heap memory exhaustion in large crawls requiring many queues; the other concerned non-heap memory exhaustion when garbage-collection-triggered finalization lags in a fast crawl needing little heap memory.
Renamed settings: 'overly-eager-link-detection' (HER-1439) and 'bind-address' (HER-1045)
Two settings with potentially confusing names have been renamed. 'overly-eager-link-detection' in ExtractorHTML and JerichoExtractorHTML, with a default value of 'true', has been renamed 'extract-value-attributes' to more accurately reflect its effect. 'bind-address' in FetchHTTP, with a default value of the empty string, has been renamed 'http-bind-address', for consistency with 'http-proxy-host' and 'http-proxy-port' and to avoid confusion with the admin web UI bind address.
If you use order.xml configuration files from a previous version that contain the old setting names, Heritrix will log non-fatal warnings/alerts about "Unknown attribute". To avoid these warnings, either rename the old settings in the order.xml or, if you are happy with the default values, simply delete the old settings.
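For many order.xml files, the rename can be scripted. The following Python sketch rewrites the two old names wherever they occur as a name="..." attribute value. It assumes settings are stored as elements carrying a name attribute (for example, a boolean element named 'overly-eager-link-detection'); the function name is hypothetical. Because it renames every matching attribute in the file, not only those under ExtractorHTML, JerichoExtractorHTML, or FetchHTTP, review the result before using it.

```python
import re

# Old setting name -> name used from 1.14.0 onward.
RENAMES = {
    "overly-eager-link-detection": "extract-value-attributes",
    "bind-address": "http-bind-address",
}


def rename_settings(order_xml_text):
    """Rewrite name="old" attributes to the new names; all other text is kept."""
    def swap(match):
        quote, name = match.group(1), match.group(2)
        return "name=" + quote + RENAMES.get(name, name) + quote

    # Match the whole attribute value so that an existing
    # "http-bind-address" is never treated as a partial "bind-address".
    return re.sub(r"""\bname=(["'])([^"']*)\1""", swap, order_xml_text)


if __name__ == "__main__":
    old = (
        '<boolean name="overly-eager-link-detection">true</boolean>\n'
        '<string name="bind-address"></string>\n'
    )
    print(rename_settings(old))
```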
Additional contributors
In addition to the usual suspects, this release includes contributed fixes or functionality from:
- Matt Sanford
- Eric C. Jensen
- Kohei TAKEDA
All Tracked Changes
The following tracked issues are recorded as addressed in this 1.14.0 release:
http://webteam.archive.org/jira/secure/ReleaseNote.jspa?projectId=10021&styleName=Html&version=10020
IA Webteam JIRA (48 issues)
| Key | Summary | Status |
|---|---|---|
| HER-1490 | rename 'experimental' WARC classes | Resolved |
| HER-1488 | fix sourceforge website update from build | Resolved |
| HER-1460 | ClassicScope forceAcceptFilter inert, confusing | Resolved |
| HER-1449 | OOME in Deflater.init() after short busy crawl | Resolved |
| HER-1448 | line-break character not treated as whitespace in href, affecting link extraction | Closed |
| HER-1439 | rename 'OVERLY_EAGER_LINK_DETECTION' | Resolved |
| HER-1390 | not all ignored seed lines are logged | Resolved |
| HER-1387 | Upgrade JE to 3.2.74 [heritrix1] | Resolved |
| HER-1322 | Update relative URI derelativization, esp. of naked ?query-strings, to match RFC3986 | Resolved |
| HER-1295 | broken-fetch retries triggering unwanted 304 responses | Resolved |
| HER-1292 | ARCWriter writes incorrect length in ARC files for 304 responses. | Resolved |
| HER-1289 | RuntimeException "CrawlServer must deserialize in a ToeThread" when doing recovery from log | Resolved |
| HER-1281 | crawl stalls with plenty of inactive queues; StoredQueue sync/corruption? | Resolved |
| HER-1280 | do not by default GET form action URLs declared as POST, because it can cause problems/complaints | Resolved |
| HER-1279 | JerichoExtractorHTML fails to extract links when multiple attributes per element exist | Resolved |
| HER-1278 | JBDB override class obsoleted by recent update of JE.jar | Resolved |
| HER-1265 | PersistLogProcessor, PersistProcessor log robustness on write, close, read | Resolved |
| HER-1261 | AdaptiveRevisitFrontier ignores JMX importUri(s) | Resolved |
| HER-1259 | ARFrontier improvements (QueueAssignmentPolicy configuration, selective rescheduling) | Resolved |
| HER-1256 | upgrade BDB-JE to 3.2.44 | Resolved |
| HER-1255 | Add option to dump queues to crawl.log at end of crawl | Resolved |
| HER-1254 | Per-host tallies of robots exclusions, uncrawled URIs at end of early-terminated crawl | Resolved |
| HER-1248 | (sub)domains of infinite breadth kill crawler throughput | Resolved |
| HER-1239 | FetchHTTP throws NullPointerException on checkpoint if cookies are disabled | Resolved |
| HER-1192 | Heritrix extracts and schedules invalid https: URI | Resolved |
| HER-1191 | https: (no slashes) in recovery from recovery log causes NPE in UURIFactory.checkHttpSchemeSpecificPartSlashPrefix | Resolved |
| HER-1188 | Deadlock risk in WorkQueueFrontier.receive and WorkQueueFrontier.deepestUri and WorkQueueFrontier.next | Closed |
| HER-1181 | apply scope to recovery replays | Resolved |
| HER-1180 | support iso-draft WARC (0.17) in writer/reader | Resolved |
| HER-1178 | ARCReader cannot handle ARC-files with records larger than 2 GB | Resolved |
| HER-1177 | publicsuffix-based HashCrawlMapper reduce-pattern | Resolved |
| HER-1176 | make queue-assignment-policy overridable, changeable | Resolved |
| HER-1175 | publicsuffix-based queue assignment policy (TopmostAssignedSurtQueueAssignmentPolicy) | Closed |
| HER-1173 | forms with large values get "ArrayIndexOutOfBoundsException: 200000" (jetty limit) | Resolved |
| HER-1172 | (patch) ExtractorUniversal file handle leak | Closed |
| HER-1171 | Queues-of-queues (inactive,retired,etc.) grow without bound; OutOfMemoryError | Resolved |
| HER-1169 | CrawlMapper 'check-uri' doesn't work in late position (if fetch status != 0) | Resolved |
| HER-1168 | CrawlMapper broken by outlinks/outcandidates split | Resolved |
| HER-1166 | BloomFilter32bitSplit artificially limits effective size | Resolved |
| HER-1163 | "java.lang.ArithmeticException: / by zero" in org.archive.io.RecordingOutputStream.checkLimits(RecordingOutputStream.java:271) | Resolved |
| HER-1154 | [contrib] JMX dumpUris method supporting the same behavior as the frontier search on the wui | Resolved |
| HER-1128 | ExtractorHTML fails to extract FRAME SRC link without whitespace before SRC | Resolved |
| HER-1123 | ContentTypeNotMatchesRegExpDecideRule not usable due to typo in src/conf/modules/DecideRule.options | Resolved |
| HER-1045 | [contrib] make it easier to set FetchHTTP local bind address, use rotation of alternate local addresses | Resolved |
| HER-1003 | Interruptible regex - TimeoutCharSequence? | Closed |
| HER-907 | Log jmxclient start/pause/resume/stop/etc. events | Resolved |
| HER-661 | "failed get" "user-mapped section open" exceptions | Resolved |
| HER-616 | The UURI class may throw NullPointerException in getReferencedHost() | Resolved |