GLAM/Model projects/Improving the quality of OCR

In the summer of 2011, the US National Archives and Records Administration (NARA) began a major content contribution of tens of thousands of digital files, many of which were textual (or spoken) documents. As part of this effort, we also sought to use Wikisource to crowdsource the transcription of the contributed documents, which makes them more useful and accessible to the institution's researchers. In addition to the transcriptions themselves, the project led to several exciting developments, such as the inclusion of these transcriptions in the NARA online catalog, the creation of new introductory material for Wikisource, and the development of a new "Transcribe!" button for documents that streamlines the process for new users.

Why Wikisource

Wikisource, like Wikipedia, is a member of the Wikimedia family and runs on the same wiki platform. The wiki platform is well suited to transcription, and Wikisource has an already developed procedure for transcribing documents that produces well-formatted texts, arranged for web viewing, whose transcriptions have been vetted by multiple editors. By the time each document is completed, its transcription will have been proofread by one human and then validated by another. Transcription takes place on a per-page basis, and each page is transcribed or viewed side-by-side with the corresponding image of that page. Each page's current status is indicated by a color code at the top, and pages are arranged on index pages.

What we did

While there have been other projects on Wikipedia and Wikimedia Commons, this was the first large-scale partnership between a cultural institution and the English-language Wikisource.

Creation of a project portal
We began with the WS:NARA project page in order to coordinate contributions by editors. WikiProjects are places where editors self-organize around particular subjects (like military history, mathematics, or your institution's holdings).
Upload of documents
Documents are uploaded to Wikimedia Commons. Wikimedians desire the highest resolution possible, but it is particularly important to ensure the quality is at least good enough for the text to be easily legible. NARA's uploads (textual and graphic) are listed at commons:Category:Media contributed by the National Archives and Records Administration.
Transcription of documents
The complete transcription process consists of roughly three steps:
1. An index page is created to contain the bibliographic information and page list.
2. Each page is transcribed and proofread side-by-side with page scans.
3. When completely transcribed, the entire document is displayed in a viewer-friendly format.
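The three steps above, together with the proofread-then-validate rule described earlier, can be sketched as a small data model. This is only an illustration under stated assumptions: the names `Status`, `Page`, and `Index` are hypothetical and do not correspond to the actual Wikisource software, and the real workflow has more page states than the three modelled here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    # Simplified, hypothetical set of page states for illustration.
    NOT_PROOFREAD = "not proofread"
    PROOFREAD = "proofread"
    VALIDATED = "validated"


@dataclass
class Page:
    number: int
    text: str = ""
    status: Status = Status.NOT_PROOFREAD
    proofreader: Optional[str] = None

    def proofread(self, editor: str, text: str) -> None:
        # Step 2: one editor transcribes the page against its scan.
        self.text = text
        self.proofreader = editor
        self.status = Status.PROOFREAD

    def validate(self, editor: str) -> None:
        # A second, different editor confirms the transcription.
        if self.status is not Status.PROOFREAD:
            raise ValueError("page must be proofread before validation")
        if editor == self.proofreader:
            raise ValueError("validator must differ from the proofreader")
        self.status = Status.VALIDATED


@dataclass
class Index:
    # Step 1: the index holds bibliographic information and the page list.
    title: str
    pages: List[Page] = field(default_factory=list)

    def is_complete(self) -> bool:
        return bool(self.pages) and all(
            p.status is Status.VALIDATED for p in self.pages
        )

    def full_text(self) -> str:
        # Step 3: assemble the finished document in page order.
        if not self.is_complete():
            raise ValueError("not every page has been validated")
        ordered = sorted(self.pages, key=lambda p: p.number)
        return "\n\n".join(p.text for p in ordered)
```

In this sketch a document only becomes available as a whole once every page has been checked by two different people, which mirrors the vetting described above.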
Development of new resources
"Transcribe!" button.

Editing aid.
Online catalog linking

Results

This project is ongoing and will last as long as there are documents in the National Archives' holdings that have not been transcribed(!). At the time of writing:

Over 20,000 pages of documents have been uploaded.
45 documents have been fully transcribed, validated, and added to the NARA online catalog.