Welcome to NutchWAX!
NutchWAX is a set of add-ons to Nutch for indexing and searching archived web data.
These add-ons are developed and maintained by the Internet Archive Web Team in conjunction with a broad community of contributors, partners and end-users.
The name "NutchWAX" stands for "Nutch (W)eb (A)rchive e(X)tensions".
Since NutchWAX is a set of add-ons to Nutch, you should already be familiar with Nutch before using NutchWAX.
NutchWAX 0.12.x
The latest and greatest version of NutchWAX is 0.12.x. This release is a significant rewrite of NutchWAX compared to the 0.10 release. The impetus for the rewrite was twofold:
- Catch up to the latest versions of Nutch and Hadoop. The 0.10 release was tied to older versions of both.
- Refactor the NutchWAX add-ons to leverage the Nutch plugin framework and public APIs, rather than copying, pasting and editing code from Nutch internal classes.
NutchWAX 0.12.x is bundled as a Nutch contrib package, with the goal that eventually all of the NutchWAX plugins and extensions will be integrated into mainline Nutch.
Website changes
Thus far, the focus has been on completing the 0.12 release, in particular the source code and bundled documentation. In the coming weeks the NutchWAX-related web pages will be overhauled. We will likely add a blog to this wiki page with regular updates and notes regarding ongoing NutchWAX development.
So, bookmark this as your NutchWAX homepage.
Download, build and install
At this time, NutchWAX 0.12 is available only from the SourceForge-hosted Subversion server. For details on how to download, build and install NutchWAX 0.12, see:
http://archive-access.svn.sourceforge.net/svnroot/archive-access/trunk/archive-access/projects/nutchwax/archive/INSTALL.txt
To get the source for the 0.12.1 release, check out:
http://archive-access.svn.sourceforge.net/svnroot/archive-access/tags/nutchwax-0_12_1
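For anyone scripting the download, the checkout can be sketched in Python as below. This is only a minimal sketch: it assumes the svn command-line client is installed and on the PATH, and the helper names (checkout_command, checkout) and the local directory name are hypothetical, chosen for illustration. The tag URL is the one given above.

```python
import subprocess

# Tag URL for the 0.12.1 release, copied from the text above.
TAG_URL = (
    "http://archive-access.svn.sourceforge.net/svnroot/"
    "archive-access/tags/nutchwax-0_12_1"
)


def checkout_command(url, dest):
    """Build the argument list for `svn checkout URL DEST` (hypothetical helper)."""
    return ["svn", "checkout", url, dest]


def checkout(url, dest):
    """Run the checkout; raises CalledProcessError if svn exits non-zero.

    Assumes the `svn` client is installed and reachable on the PATH.
    """
    subprocess.run(checkout_command(url, dest), check=True)
```

Calling checkout(TAG_URL, "nutchwax-0.12.1") would place the source in a local directory named nutchwax-0.12.1 (an arbitrary name). The build and install steps are described in INSTALL.txt and are not reproduced here.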
Discussion & Mailing List
Participate in the NutchWAX discussion by subscribing to the archive-access-discuss mailing list at:
http://lists.sourceforge.net/lists/listinfo/archive-access-discuss