BACKGROUND: Linking records from multiple disease registries requires that consistent matching elements be available in all sources. Patient race, ethnicity, sex, birth country, and behavioral risk factors are important for examining health disparities but are typically not used in matching algorithms. The completeness, quality, and categorization of these data often vary between sources.
METHODS: We matched data on persons reported to the New York City (NYC) Department of Health and Mental Hygiene’s (DOHMH) Bureaus of HIV Prevention and Control (BHIV), Sexually Transmitted Disease Prevention and Control (BSTDC), Tuberculosis Control (BTBC), and Communicable Disease (BCD) and its Office of Vital Statistics (OVS) from 2000 to 2013, and to its Primary Care Information Project (PCIP) from 2006 to 2013. We summarized discordance in race/ethnicity data among one-to-one matches with non-missing values and developed a hierarchical algorithm to resolve discordance, based on data completeness and collection methods. Because the granularity and coding of race and ethnicity data varied between registries, data were cleaned and then collapsed into a single race/ethnicity variable aligned with 2010 U.S. Census categories.
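The discordance summary described above can be sketched as follows. This is a minimal illustration, not the study's code: the function names `is_missing` and `discordance_rate`, the set of values treated as missing, and the example category labels are all assumptions, and the values are assumed to have already been collapsed into a common set of categories.

```python
from typing import Iterable, Optional, Tuple

# Assumption: these strings stand for "missing or unknown"; real registries
# may use other codes.
MISSING_LABELS = {"", "unknown", "missing"}


def is_missing(value: Optional[str]) -> bool:
    """Return True if a race/ethnicity value is absent or coded as unknown."""
    return value is None or value.strip().lower() in MISSING_LABELS


def discordance_rate(
    pairs: Iterable[Tuple[Optional[str], Optional[str]]]
) -> Optional[float]:
    """Share of one-to-one matched pairs whose two values disagree.

    Pairs in which either value is missing are excluded from the
    denominator. Returns None if no pair is comparable.
    """
    comparable = [
        (a, b) for a, b in pairs if not is_missing(a) and not is_missing(b)
    ]
    if not comparable:
        return None
    discordant = sum(1 for a, b in comparable if a != b)
    return discordant / len(comparable)


# Hypothetical matched pairs (value in registry A, value in registry B).
example_pairs = [
    ("Black", "Black"),      # concordant
    ("White", "Hispanic"),   # discordant
    (None, "Asian"),         # excluded: missing in registry A
    ("Unknown", "White"),    # excluded: unknown in registry A
]
rate = discordance_rate(example_pairs)
```

Restricting the denominator to pairs with two non-missing values keeps incompleteness from being counted as disagreement.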
RESULTS: Completeness of the race/ethnicity variable varied greatly between data sources, ranging from <1% of data missing or unknown in registries that conduct routine investigation and provide case management (BTBC) or medical chart extraction (BHIV) to >50% in registries that primarily receive data through electronic laboratory reporting (ELR) (BSTDC, PCIP). Discordance was greatest when programs in which events were not routinely investigated (BSTDC) were compared with those in which events were routinely investigated (BTBC, BHIV); when matched to STD data, 11.8% of TB and 20.5% of HIV records were discordant. Under the hierarchical algorithm, variable values were selected from a single source in the following order: BTBC (routine investigation and case management), BHIV (medical chart extraction), OVS (data collected by funeral directors), BCD (routinely investigated cases), BSTDC (subset of routinely investigated cases), BCD (non-routinely investigated cases), then PCIP (ELR only).
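The source ordering above can be illustrated with a short sketch. It assumes that the algorithm takes the value from the highest-ranked source that has a non-missing value for the person; the abstract states only the order, so that rule is an assumption. The source keys (for example `BCD_routine`), the function names, and the missing-value labels are hypothetical and are not taken from the study.

```python
from typing import Dict, Optional, Tuple

# Priority order as listed in the results; the key names are hypothetical.
SOURCE_PRIORITY = [
    "BTBC",            # routine investigation and case management
    "BHIV",            # medical chart extraction
    "OVS",             # data collected by funeral directors
    "BCD_routine",     # routinely investigated cases
    "BSTDC_routine",   # subset of routinely investigated cases
    "BCD_nonroutine",  # non-routinely investigated cases
    "PCIP",            # ELR only
]

# Assumption: these strings stand for "missing or unknown".
MISSING_LABELS = {"", "unknown", "missing"}


def is_missing(value: Optional[str]) -> bool:
    """Return True if a race/ethnicity value is absent or coded as unknown."""
    return value is None or value.strip().lower() in MISSING_LABELS


def resolve_race_ethnicity(
    values_by_source: Dict[str, Optional[str]]
) -> Tuple[Optional[str], Optional[str]]:
    """Pick one value for a matched person by walking the priority list.

    Returns (source, value) for the first source with a non-missing value,
    or (None, None) if every source is missing or unknown.
    """
    for source in SOURCE_PRIORITY:
        value = values_by_source.get(source)
        if not is_missing(value):
            return source, value
    return None, None


# Hypothetical person matched across three registries.
person = {"PCIP": "White", "BHIV": "Hispanic", "BTBC": "Unknown"}
chosen_source, chosen_value = resolve_race_ethnicity(person)
```

In this example the BTBC value is unknown, so the sketch falls through to BHIV even though PCIP also holds a value. Returning the source alongside the value keeps the provenance of each resolved record.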
CONCLUSIONS: Demographic data are often more complete when persons are interviewed as part of routine case investigation, or when other follow-up is performed, than when data are received through electronic laboratory reporting (ELR) only. Initiatives to improve the completeness and quality of demographic data are needed to strengthen our ability to identify at-risk populations and address health disparities. Record linkage and data integration offer an opportunity to fill in missing data with values from other sources. The lessons learned and the methods developed may be useful to other jurisdictions interested in data matching and harmonization.