Creating Email Archives from PDFs: The Covid-19 Corpus

Awardees

Matthew J. Connelly
Professor of History and Co-Director of ISERP

$98,630

Columbia University will contribute email archiving solutions at both ends of the email stewardship cycle: acquisition and preservation on one end, and research access on the other. The focus will be on records of government responses to the Covid-19 pandemic that are being released through FOIA requests and made available online by journalists. Because these emails are released embedded in PDFs, researchers face a number of challenges in accessing the records: they cannot easily determine the scope or arrangement of the collections, or find descriptions of the contents of the main components. To address these challenges, Columbia will build an open-source tool and associated library that takes email embedded in PDFs as input and generates an MBOX file as output, thereby making these records compatible with existing email archiving solutions. In addition, the project team will process a large corpus of records on Covid-19 obtained through FOIA, both to enhance its value to researchers and to develop it into a new collection within the Freedom of Information Archive (FOIArchive), an aggregated database of government records.
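
To illustrate the PDF-to-MBOX idea described above, here is a minimal sketch in Python. It is not the project's tool. It assumes the text of each email has already been extracted from the PDF by some other step, and that each email begins with simple "Name: value" header lines. The function names `parse_email_text` and `write_mbox` are hypothetical; only the standard-library `mailbox` and `email` modules are used.

```python
import mailbox
from email.message import EmailMessage

# Header names this sketch recognizes (an assumption; real releases vary widely).
HEADER_FIELDS = ("From", "To", "Cc", "Subject", "Date")


def parse_email_text(text):
    """Split already-extracted email text into a header dict and a body string."""
    headers = {}
    lines = text.strip().splitlines()
    body_start = len(lines)
    for i, line in enumerate(lines):
        name, sep, value = line.partition(":")
        field = name.strip().title()
        if sep and field in HEADER_FIELDS:
            headers[field] = value.strip()
        else:
            # First line that is not a recognized header starts the body.
            body_start = i
            break
    body = "\n".join(lines[body_start:]).strip()
    return headers, body


def write_mbox(email_texts, mbox_path):
    """Append one message per extracted email text to an MBOX file."""
    box = mailbox.mbox(mbox_path)
    try:
        for text in email_texts:
            headers, body = parse_email_text(text)
            msg = EmailMessage()
            for name, value in headers.items():
                msg[name] = value
            msg.set_content(body)
            box.add(msg)
    finally:
        box.close()  # flushes pending messages to disk
    return mbox_path
```

A real pipeline would also have to handle OCR errors, redactions, missing or malformed headers, and threads printed as a single page, none of which this sketch attempts.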