Web Archives and Large-Scale Data: Preliminary Techniques for Facilitating Research

Date

2012-05-25

Authors

Woodward, Nicholas
Norsworthy, Kent

Abstract

The Latin American Government Documents Archive (LAGDA) is a collaborative project of the University of Texas Libraries, the Nettie Lee Benson Latin American Collection, and the Latin American Network Information Center (LANIC) at The University of Texas at Austin. It seeks to preserve and facilitate access to a wide range of ministerial and presidential documents from 18 Latin American and Caribbean countries. Web crawling is conducted quarterly using the Internet Archive’s Archive-It application. The resulting Archive contains copies of the websites of approximately 300 government ministries and presidencies, captured from 2005 to the present.

Currently, LAGDA comprises approximately 66.6 million documents archived from the Internet, totaling 5.6 terabytes of data. The collection grows by about 250 gigabytes with each quarterly crawl. Content in the Archive includes not only the full-text versions of official documents, but also original video and audio recordings of key regional leaders, all archived in the ARC file format produced by the Heritrix web crawler. Archive contents include thousands of annual and "state of the nation" reports, plans and programs, and speeches by presidents and government ministers. The data include HTML-formatted pages, Microsoft Word documents, Adobe PDF files and RTF documents, as well as various audio and video formats. Metadata for the collection is only sparsely populated.
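The size figures above imply some simple arithmetic, worked through in the sketch below. It assumes decimal units (1 TB = 1,000 GB) and a steady 250 GB per crawl, neither of which the abstract states explicitly; the function names are illustrative only.

```python
# Figures quoted in the abstract.
TOTAL_DOCUMENTS = 66_600_000
TOTAL_TERABYTES = 5.6
GROWTH_GB_PER_CRAWL = 250
CRAWLS_PER_YEAR = 4  # crawls are quarterly


def average_document_kb(total_tb: float, documents: int) -> float:
    """Mean size per archived document in kilobytes (assumes 1 TB = 1e9 KB)."""
    return total_tb * 1e9 / documents


def projected_size_tb(current_tb: float, growth_gb_per_crawl: float, crawls: int) -> float:
    """Collection size after a number of further crawls (assumes constant growth)."""
    return current_tb + growth_gb_per_crawl * crawls / 1000


print(round(average_document_kb(TOTAL_TERABYTES, TOTAL_DOCUMENTS), 1))  # 84.1
print(round(projected_size_tb(TOTAL_TERABYTES, GROWTH_GB_PER_CRAWL, CRAWLS_PER_YEAR), 1))  # 6.6
```

Under these assumptions the average archived document is roughly 84 KB, and the collection would grow by about one terabyte per year.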

Promoting research on the collection is a central component of the LAGDA project. To that end, staff have collaborated with researchers at the Texas Advanced Computing Center (TACC), using the LAGDA data to develop text-mining methods for document representation and classification. This work includes implementing several strategies to automatically classify and categorize information contained in the Archive in order to facilitate search and browse capabilities. Additionally, LANIC and TACC have worked together to create methods for research on subcollections in the Archive, e.g. presidential speeches or ministerial documents. Preliminary results of these efforts have been encouraging, and they are initial steps towards solutions that will make large-scale data more accessible to researchers.
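The abstract does not say which classification methods were used. Purely as an illustration of what automatic document classification can look like, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing. The class name, the labels ("speech", "report"), and the four training snippets are all invented for this example and are not drawn from LAGDA.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (Spanish accents included)."""
    return re.findall(r"[a-záéíóúñü]+", text.lower())


class NaiveBayesClassifier:
    """Hypothetical minimal multinomial Naive Bayes; not the project's actual method."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token frequencies
        self.doc_counts = Counter()              # label -> number of training documents
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def classify(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in sorted(self.doc_counts):
            # Log prior plus log likelihood of each token, with add-one smoothing.
            score = math.log(self.doc_counts[label] / total_docs)
            label_total = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (label_total + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, made up for illustration only.
classifier = NaiveBayesClassifier()
classifier.train("discurso del presidente ante la nación", "speech")
classifier.train("mensaje del presidente a la asamblea", "speech")
classifier.train("informe anual del ministerio de salud", "report")
classifier.train("plan nacional del ministerio de educación", "report")

print(classifier.classify("discurso del presidente"))  # speech
print(classifier.classify("informe del ministerio"))   # report
```

A classifier of this kind could in principle help separate subcollections such as presidential speeches from ministerial reports, though a real system would need far more training data and text extracted from the archived ARC records.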

The challenges presented by LAGDA are similar to those confronting academic libraries across the country, which increasingly hold “big data” collections that call for new strategies and tools for data analysis. Nascent projects such as LAGDA offer initial insights into how academic libraries can work collaboratively to facilitate research on the large-scale collections that are increasingly prevalent in today’s digital world.

The presentation will focus on the following components:
Challenges presented by Web-archived data
“Big data” and data-driven research
The role of libraries in data analysis
The future of “big data” and libraries

Description

Presentation slides for the 2012 Texas Conference on Digital Libraries (TCDL).
