Website Archive Glossary

Archive – A repository containing records, documents, or other materials of enduring, evidential, legal, or historical value that are preserved so as to provide continual access in accordance with user access policies.

Blocked Site Error – Error message indicating that an authorized person has requested that the information be excluded from our collection.

Capture – The process of copying digital information from the live web to an archive.

Collection – A group of records, documents, or other materials of enduring value related by common ownership or common subject matter within an archive.

Crawl – The information captured by a crawler (see definition below) on a single visit to each of the specified URLs.

Crawler – A software agent that captures information from the web. Our crawler starts with a list of URLs to visit. As it visits these URLs, it captures the documents on these web pages.
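
The behavior described in this entry can be illustrated with a short Python sketch. It is not the archive's actual crawler: the `crawl` function, the `fetch` callback, and the example.org addresses are all hypothetical, and an in-memory dictionary stands in for the live web so the example runs without a network connection.

```python
from typing import Callable, Dict, List, Optional


def crawl(seed_urls: List[str],
          fetch: Callable[[str], Optional[bytes]]) -> Dict[str, bytes]:
    """Visit each seed URL once and keep a copy of what was found there."""
    captured: Dict[str, bytes] = {}
    for url in seed_urls:
        if url in captured:          # skip duplicates in the seed list
            continue
        document = fetch(url)        # hypothetical fetch callback
        if document is not None:     # None models a document that could not be fetched
            captured[url] = document
    return captured


# An in-memory stand-in for the live web (hypothetical addresses).
FAKE_WEB = {
    "http://example.org/": b"<html>home page</html>",
    "http://example.org/logo.png": b"\x89PNG...",
}

seeds = [
    "http://example.org/",
    "http://example.org/logo.png",
    "http://example.org/missing.html",
]
result = crawl(seeds, FAKE_WEB.get)
print(sorted(result))
```

The missing page is simply left out of the result. A real crawler would also record why a capture failed.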

Document – A resource on the web that has a distinct web address. It could be an embedded image, a whole web page, a PDF file, or any other component of a web page.

Domain – A series of alphanumeric strings separated by periods that serves as the address of a computer network connection and identifies the owner of the address.

Failed Connection – Error message indicating that the server that contains the information you requested is down.

Full-text Search – A search that compares the search terms against every word in a document, as opposed to searching only an abstract or a set of keywords associated with the document.
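
As a rough illustration, a full-text search can be sketched in Python as a scan over every word of every document. The `full_text_search` function and the sample documents are hypothetical, and a production search service would normally use a prebuilt index rather than rescanning the text on each query.

```python
import re
from typing import Dict, List


def full_text_search(documents: Dict[str, str], term: str) -> List[str]:
    """Return the names of documents whose text contains `term` as a whole word."""
    wanted = term.lower()
    hits = []
    for name, text in documents.items():
        # Split the full text into lowercase words and compare each one.
        words = re.findall(r"[a-z0-9']+", text.lower())
        if wanted in words:
            hits.append(name)
    return sorted(hits)


# Hypothetical captured documents.
docs = {
    "minutes.html": "Minutes of the county board meeting",
    "budget.html": "Annual budget report for the county",
}
print(full_text_search(docs, "budget"))
print(full_text_search(docs, "County"))
```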

Harvesting – Gathering and organizing the contents of the web pages captured by the crawler.

HTML – A markup language used to structure text and multimedia documents and to set up hypertext links between documents, used extensively on the web. It can be created and processed with a wide range of tools from simple text editors to sophisticated authoring software.

Hyperlink – A reference in an online document to another online resource. Hyperlinks are typically activated by clicking on a highlighted word or icon at a particular location on the screen, which will result in the display of the referenced resource.
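
To show what a hyperlink looks like to software such as a crawler, the following Python sketch pulls link targets out of a small HTML fragment using the standard library. The `LinkCollector` class, the sample page, and the example.org and example.com addresses are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the targets of <a href="..."> hyperlinks in an HTML document."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative references against the page's own address.
                self.links.append(urljoin(self.base_url, value))


page = ('<p>See the <a href="/about.html">about page</a> or '
        '<a href="http://example.com/">another site</a>.</p>')
collector = LinkCollector("http://example.org/index.html")
collector.feed(page)
print(collector.links)
```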

JavaScript – A popular scripting language that is widely supported in web browsers and other web tools. It adds interactive functions to HTML pages, which are otherwise static.

Not In Archive – Error message indicating that the archived site contains a redirect and that the site you are redirected to is not in the archive or cannot be found on the live web.

Path Index Error – Error message indicating a problem in our database wherein the information requested is not available (generally because of a machine or software issue).

Redirect – A technique used on the web to forward the user from one URL to another. This may be due to a file’s location change on the same server or a file’s relocation to a new server.
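
A chain of redirects can be modeled as a simple lookup table. The Python sketch below is illustrative only: the `follow_redirects` function, the `max_hops` limit, and the example.org addresses are hypothetical, and real redirects are issued by the web server rather than read from a dictionary.

```python
from typing import Dict, List


def follow_redirects(url: str, redirects: Dict[str, str],
                     max_hops: int = 10) -> List[str]:
    """Return the chain of URLs visited, starting at `url`."""
    chain = [url]
    while chain[-1] in redirects:
        if len(chain) > max_hops:
            raise RuntimeError("too many redirects")
        target = redirects[chain[-1]]
        if target in chain:
            raise RuntimeError("redirect loop")
        chain.append(target)
    return chain


# Hypothetical redirect table: a file moved on the same server,
# then the whole site moved to a new server.
redirects = {
    "http://example.org/old.html": "http://example.org/new.html",
    "http://example.org/new.html": "http://www.example.org/new.html",
}
chain = follow_redirects("http://example.org/old.html", redirects)
print(chain)
```

The last URL in the chain is the one whose content the user finally sees. If that URL was never captured, the Not In Archive error described above applies.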

Robots.txt – A text file placed in the root directory of a web site that asks crawlers not to capture or index all or specific pages of the site. The Robots Exclusion Protocol provides a format for designating which directories and files are off limits to the crawler.
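
As a small illustration, Python's standard `urllib.robotparser` module can read robots.txt rules and report whether a given address is off limits. The rules, the "ExampleCrawler" user-agent name, and the example.org addresses below are made up for this sketch and do not describe any particular site or crawler.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt file, given here as a list of lines.
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# "ExampleCrawler" is a made-up user-agent name.
allowed_public = parser.can_fetch(
    "ExampleCrawler", "http://example.org/public/index.html")
allowed_private = parser.can_fetch(
    "ExampleCrawler", "http://example.org/private/report.pdf")
print(allowed_public, allowed_private)
```

Here everything under /private/ is excluded for all crawlers, while the rest of the site may be captured.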

Robots.txt Query Exclusion – Error message indicating that the requested information was not captured by the crawler because the site owner asked that the information be excluded from capture using a robots.txt file.

Server Side Image Maps – Image maps where the image is stored on the server and accessed each time a portion of the image is selected by a user. Image maps are images where various portions of the image are assigned a specific action (e.g., link to another page or run a program).

Streaming Media – A one-way transmission (audio, video, etc.) over a data network that is played as it is received and is not stored permanently on the requesting computer.

URL (Uniform Resource Locator) – A web address, usually consisting of the access protocol (http), the domain name, and optionally the path to a file or resource residing on that server (/records/archives/).
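
The parts of a URL named in this entry can be separated with Python's standard `urllib.parse` module. The address used here is hypothetical: only the /records/archives/ path comes from this glossary, and www.example.org is a placeholder domain.

```python
from urllib.parse import urlsplit

# Hypothetical address; www.example.org is a placeholder domain.
parts = urlsplit("http://www.example.org/records/archives/")

print(parts.scheme)   # the access protocol
print(parts.netloc)   # the domain name
print(parts.path)     # the path to the resource on that server
```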

Web – A decentralized, global network connecting millions of computers. It allows computer users to communicate information to each other.

Web Page – A resource on the web, usually in HTML/XML format and with hypertext links to enable navigation from one page or section to another, displayed with a web browser. A web page can contain any of the following:
Graphics (.gif, .jpeg or .png)
Audio (.mid or .wav)
Interactive multimedia content that requires a plug-in such as Flash, Shockwave or VML
Applets (subprograms that run inside the page) which often provide motion graphics, interaction, and sound

Web Site – A set of interconnected web pages, usually including a homepage, generally located on the same server, and prepared and maintained as a collection of information by a person, group, or organization.