Help on WARC tools
This document provides instructions for creating and using archived web documents stored in ISO-standard WARC ("web archive") format, using the tools and utilities available on this site.
WARC (Web ARChives) files are sets of documents and metadata that have been downloaded from a website or group of sites for archival storage. The WARC format is an international standard (ISO 28500:2009; see http://archive-access.sourceforge.net/warc/ for draft specifications) that is used by the Internet Archive, the U.S. Library of Congress, and other organizations. WARC files can be searched and navigated using open-source utilities provided by the Internet Archive (see http://archive-access.sourceforge.net/).
The tools used here are:
- Heritrix web crawler (http://crawler.archive.org/)
- A slightly extended version of the Hanzo Warc-tools (http://code.hanzoarchives.com/warc-tools/overview). See versions of file "warc-tools-mandal-x.x.zip" in MoinMoinCustomizations.
- lynx (for formatting text-only dumps of archived document contents)
Rationale
The system employed here allows us to search and navigate the text in WARC files. Design features:
- No need for a web server, Java container, etc., to navigate archived web pages. Static html files work on any platform.
- Regex-based searching (highly customizable).
- Pages can be indexed alongside other content (e.g., by a desktop search engine or within a CMS).
- Search query results are stored as static files. They can be revisited, redistributed, etc. Visited links are highlighted according to browser settings.
- Lynx output provides a list of URLs and a simplified (streamlined) text-only layout, which make textual analysis efficient (i.e., the presentation is optimized for rapid scanning of many documents rather than for layout with advertising, Flash, and other junk).
- WARC and HTTP headers are presented at the top of the rendered page.
Creating a web archive
See the instructions that accompany Heritrix.
Full-text search index
Using the modified warc-tools we can create a searchable index of text content.
python warc-tools/warcfilter.py -T response filename.warc.gz > filtered.warc
- Filters the archive to only response records (i.e., filters out the records that describe requests made by the web crawler).
python warc-tools/warchtmlindex.py filtered.warc > index.html
- Creates an index.html document summarizing the archive contents.
mkdir html
python warc-tools/filesdump.py filtered.warc
- Creates:
- fulltext.html, which includes the text of each record, one per line;
- a directory "html/" containing text-only content;
- a zip archive containing all of the above, including the "index.html" file created in the previous step.
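The steps above can also be chained from a small driver script. The following is a minimal sketch and not part of warc-tools: the function names build_steps and run_steps are hypothetical, while the script names, the "-T response" option, and the file names are copied from the commands shown above. It assumes that a "python" executable is on the PATH and that the script is run from the directory that contains "warc-tools/".

```python
import subprocess
from pathlib import Path


def build_steps(archive, tools_dir="warc-tools"):
    """Return the indexing steps as (argv, stdout_path) pairs.

    stdout_path names the file that receives the command's standard
    output, or is None when the command writes its own output files.
    """
    tools = Path(tools_dir)
    return [
        # Keep only the response records.
        (["python", str(tools / "warcfilter.py"), "-T", "response", archive],
         "filtered.warc"),
        # Summarize the filtered archive as an HTML table.
        (["python", str(tools / "warchtmlindex.py"), "filtered.warc"],
         "index.html"),
        # Dump the text content (fulltext.html, html/, zip archive).
        (["python", str(tools / "filesdump.py"), "filtered.warc"], None),
    ]


def run_steps(steps):
    """Run each step in order, stopping at the first command that fails."""
    Path("html").mkdir(exist_ok=True)  # same effect as "mkdir html"
    for argv, stdout_path in steps:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "wb") as out:
                subprocess.run(argv, stdout=out, check=True)
```

Calling run_steps(build_steps("filename.warc.gz")) would run the three commands in sequence. Because of check=True, a failing command raises an exception instead of letting later steps run on incomplete output.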
Other useful functions:
python warc-tools/warcindex.py filename.warc.gz > index.csv
- Creates a spreadsheet summarizing the archive contents.
python warc-tools/warcextract.py filename.warc.gz 171305 > index.txt
- Extracts the record at offset 171305.
python warc-tools/warcextract.py filename.warc.gz 171305 | lynx -stdin -dump > lynx.txt
- Generates a text-only version of the record.
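To illustrate why a byte offset is enough to locate a record, here is a minimal sketch. It assumes that the ".warc.gz" file is compressed one record per gzip member (the usual layout for crawler-written archives), so that an offset such as 171305 points at the start of a gzip member. This is not the warcextract.py implementation, and the function name read_record_at is hypothetical.

```python
import zlib


def read_record_at(path, offset, chunk_size=64 * 1024):
    """Decompress the single gzip member that starts at byte `offset`."""
    # wbits = 16 + MAX_WBITS makes zlib expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    parts = []
    with open(path, "rb") as f:
        f.seek(offset)
        # decomp.eof becomes True at the end of the first gzip member,
        # so later records in the file are never decoded.
        while not decomp.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                break  # truncated file: return what was decoded so far
            parts.append(decomp.decompress(chunk))
    return b"".join(parts)
```

If the offset does not fall on the start of a gzip member, zlib raises an error rather than returning arbitrary data.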
Navigating
The file "index.html" is a useful place to begin; the table lists the URI, content type, and content length of each resource in the archive file. Clicking on a URI will show the text content (if available) for that file. Hyperlinks within the document are numbered; clicking on one of these will take you to the corresponding URL in a list of links at the bottom of the page. Clicking on one of the URLs in the list will take you back to the index page; if the URL is in the index you will jump to that URL in the index, otherwise you will be taken to the top of the index. Although this setup requires three clicks to get from a hyperlink to the archived target file, it has the advantage of always showing what the hyperlinks reference, even if we don't have the file itself archived.
The text-only versions of these archived documents (formatted by lynx, which can accommodate just about any type of text/* document) are intended for full-text search and content scanning. These files are small, have no dependencies (stylesheets, images, scripts, etc.), and--unlike other WARC viewing utilities--can be viewed locally without a web server. For graphical access reproducing the original web sites and associated visual media, please install the Wayback Machine (http://archive-access.sourceforge.net/projects/wayback), which looks nicer but is less efficient for textual data mining. Unfortunately, installing and configuring the Wayback Machine for an offline wiki is something of a challenge.
Searching
All the text content for the collected pages from a given site is included in a single index file (fulltext.html), which can be searched and read without the need for dedicated applications such as Wayback and NutchWAX. This should be sufficient for content-based research purposes.
MoinMoinCustomizations/warc-search.sh - Call this in the top-level directory of an archive dump (i.e., the directory containing "fulltext.html"). The argument is the search term. For example, if warc-search.sh is located in the "WARC" directory and the archive dump in "dump1":
$ cd ~/WARC/dump1/
$ ../warc-search.sh "Alexandria"
The above command will produce a page called "query_Alexandria.html" that looks like the following (the links in the search results have been rendered inactive here; in the actual search results they should link to corresponding files in the "html/" directory):
Search results for "Alexandria"
- http://www.archive.org/web/web.php
HTTP/1.1 200 OK Date: Wed, 30 Apr 2008 20:48:46 GMT Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g X-Powered-By: PHP/5.0.5-2ubuntu1.4 Connection: close Content-Type: text/html; charset=UTF-8 (logo) Web | Moving Images | Texts | Audio | Software | Education | Pais not currently supported. http://archive.bibalex.org, the Internet archive at the New Library of Alexandria, Egypt, mirrors the Wayback Machine. Try your search there when you have trouble connecting to the
- http://www.archive.org/about/faqs.php
HTTP/1.1 200 OK Date: Wed, 30 Apr 2008 20:50:17 GMT Server: Apache/2.0.54 (Ubuntu) PHP/5.0.5-2ubuntu1.4 mod_ssl/2.0.54 OpenSSL/0.9.7g X-Powered-By: PHP/5.0.5-2ubuntu1.4 Connection: close Content-Type: text/html; charset=UTF-8 (logo) Web | Moving Images | Texts | Audio | Software | Education |much early 20th-century media -- television and radio, for example -- was not saved. The Library of Alexandria -- an ancient center of learning containing a copy of every book in the world -- disappeared when i
rg/search.php?query=collection%3Aetree&sort;=-%2Fmetadata%2Fndba 115. http://www.unesco.org/webworld/alexandria_new/ 116. http://www.archive.org/donate 117. http://www.videolan.org/ 118. http://www.elecard.com/
- (etc.)
This script provides the first 400 characters from each matching record in the fulltext index (providing URL and headers), and 100+100 characters of context for 1-3 matches of the search term.
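The excerpting behaviour described above can be sketched as follows. This is an illustration rather than the code of warc-search.sh: the function name excerpt and its parameters are hypothetical, each record is assumed to be one line of text from the fulltext index, and case-insensitive matching is an assumption.

```python
import re


def excerpt(record, pattern, head=400, context=100, max_matches=3):
    """Return (leading text, context snippets) for one record.

    Returns None when the pattern does not occur in the record.
    """
    regex = re.compile(pattern, re.IGNORECASE)  # case-insensitivity is assumed
    snippets = []
    for match in regex.finditer(record):
        # Take up to `context` characters on each side of the match.
        start = max(match.start() - context, 0)
        end = min(match.end() + context, len(record))
        snippets.append(record[start:end])
        if len(snippets) == max_matches:
            break
    if not snippets:
        return None
    # The first `head` characters carry the URL and headers of the record.
    return record[:head], snippets
```

Applying excerpt to every line of the fulltext index and keeping the results that are not None would give the material for one results page.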
Use a pipe ("|") as "OR" operator for multiple terms. Try "library.{4,10}Alexandria" to match "Library of Alexandria", for example.
There is no default "AND" operator that accommodates all possible permutations of the search terms. This approach is suited to pulling out every occurrence of a phrase for exhaustive listing and analysis of phrases in a corpus, rather than to keyword frequency/density ranking for efficient retrieval of "best-matching" documents, as in a question-answer search pattern (like Google).
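The difference between "OR" and "AND" can be shown with a short sketch. The helper names matches_any and matches_all are hypothetical and are not part of warc-search.sh, and case-insensitive matching is again an assumption. An "AND" can be emulated by testing each term separately, which avoids writing out every ordering of the terms in one expression.

```python
import re


def matches_any(text, terms):
    """OR: join the terms with a pipe, as in a search for "term1|term2"."""
    return re.search("|".join(terms), text, re.IGNORECASE) is not None


def matches_all(text, terms):
    """AND: test each term on its own, so the order in the text is irrelevant."""
    return all(re.search(term, text, re.IGNORECASE) for term in terms)


sample = "The Library of Alexandria disappeared"
# A bounded gap of 4 to 10 characters bridges " of " between the two words.
phrase_found = re.search(r"library.{4,10}Alexandria", sample, re.IGNORECASE) is not None
```

Filtering records with matches_all before excerpting them would restrict the results to records that contain every term, at the cost of one regex pass per term.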
