Project Tycho, Data for Health: Open Access to Newly Digitized United States Weekly Nationally Notifiable Disease Surveillance Data from 1888 to the Present

Monday, June 23, 2014: 4:00 PM
109, Nashville Convention Center
Willem Gijsbert Van Panhuis, University of Pittsburgh, Pittsburgh, PA
John Grefenstette, University of Pittsburgh Graduate School of Public Health, Pittsburgh, PA
Anne Cross, University of Pittsburgh Graduate School of Public Health, Pittsburgh, PA
Donald S Burke, University of Pittsburgh Graduate School of Public Health, Pittsburgh, PA

Brief Summary
BACKGROUND: Public health agencies in the United States, such as the Public Health Service before 1950 and the Centers for Disease Control after 1950, have published nationally notifiable disease reports for cities and states every week since 1888 in journals such as Public Health Reports and the Morbidity and Mortality Weekly Report. Because most of these reports have been publicly available in PDF or paper format only, opportunities to use this wealth of information for statistical and computational analysis have been greatly restricted.

METHODS: We identified all 6500 weekly nationally notifiable disease surveillance reports published since 1888 and digitized the PDF or paper files into Excel spreadsheets using independent double data entry. All numeric disease reports (defined as counts) and contextual information such as reporting locations, dates, and disease names were extracted from these spreadsheets using semi-automatic computational algorithms. All extracted information was standardized and made publicly available without restrictions through an online user interface.

RESULTS: The online database, named after Tycho Brahe (1546-1601), provides tools for the exploration and retrieval of datasets selected by users. Available data have been classified into three levels, each with different content. Level 1 includes data that have been standardized into a common format for specific studies. Level 2 includes data that have been reported in a common and consistent format, e.g., diseases reported for a one-week period and without disease subcategories that changed over time. Level 3 includes all data available in raw format. Although level 3 is the most complete level of data, the large heterogeneity in the types and formats of the reports it includes requires extensive standardization before use in any analysis. All levels of data can be freely accessed and used for any purpose on www.tycho.pitt.edu after registration and agreement to a Creative Commons Attribution license.

CONCLUSIONS: The Project Tycho database, with 125 years of newly digitized weekly US notifiable disease data, creates a new paradigm for the availability and use of large-scale public health data. We aim to expand this database into a resource for integrated, publicly available disease surveillance data from around the world. This will accelerate new multi-disciplinary translational approaches that integrate public health data with large-scale data from other domains such as electronic medical records, genomic data, climate data, and social media data, maximizing opportunities to use available data for better health.
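
ILLUSTRATION: The abstract states that reports were digitized using independent double data entry but does not describe how the two entries were reconciled. The Python sketch below shows one common way such a check could work, under stated assumptions: each entry is modeled as a mapping from a (location, disease, week) key to a count, and any cell where the two entries differ or where one entry is missing is flagged for manual review. The function name, the key layout, and the sample values are hypothetical and are not taken from Project Tycho.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical cell key: (location, disease, week-ending date).
Key = Tuple[str, str, str]
Mismatch = Tuple[Key, Optional[int], Optional[int]]


def find_discrepancies(entry_a: Dict[Key, int],
                       entry_b: Dict[Key, int]) -> List[Mismatch]:
    """Return every cell on which two independent data entries disagree.

    A cell present in only one entry is reported with None for the
    entry that lacks it.
    """
    mismatches: List[Mismatch] = []
    # Walk the union of keys so cells missing from either entry are caught.
    for key in sorted(set(entry_a) | set(entry_b)):
        a = entry_a.get(key)
        b = entry_b.get(key)
        if a != b:
            mismatches.append((key, a, b))
    return mismatches


if __name__ == "__main__":
    # Invented values for illustration only; not real surveillance counts.
    first = {("PA", "MEASLES", "1928-01-07"): 120,
             ("NY", "MEASLES", "1928-01-07"): 95}
    second = {("PA", "MEASLES", "1928-01-07"): 120,
              ("NY", "MEASLES", "1928-01-07"): 59,
              ("OH", "MEASLES", "1928-01-07"): 12}
    for key, a, b in find_discrepancies(first, second):
        print(key, "entry A:", a, "entry B:", b)
```

In this sketch, cells on which both entries agree are accepted as-is, and only the flagged cells would need to be checked against the original PDF or paper report.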