BACKGROUND: Foodborne illness remains a major public health challenge in the United States, causing an estimated 48 million illnesses, 128,000 hospitalizations, and 3,000 deaths annually (Scallan et al., 2011). Outbreak investigations provide critical information on the pathogens, foods, and food-pathogen pairs causing illness. However, only approximately 60% of reported foodborne illness outbreaks identify a confirmed or suspected etiology, and of those, only half implicate a food vehicle (Gould et al., 2013). Analytical tools used during hypothesis generation have the potential to help investigators develop new hypotheses or confirm existing ones more rapidly.
METHODS: National data on reported foodborne Shiga toxin-producing Escherichia coli (STEC) outbreaks from 1998-2014 were available from the Centers for Disease Control and Prevention’s (CDC) Foodborne Outbreak Surveillance System. Previous work identified demographic and outbreak predictors associated with food sources in STEC outbreaks (White et al., 2016), including age (percentage of cases in an outbreak <5, 5-10, 10-19, 20-50, and ≥50 years), percentage female (percentage of female cases involved in the outbreak), multistate (binary variable indicating whether the outbreak occurred in multiple states), number of cases, duration, and season of the outbreak. A conditional inference tree method was used to develop a decision tree with a multi-class response (food source category). Model accuracy was assessed using cross-validation. The performance of the decision tree in predicting outbreak food sources was then compared with that of previously used logistic regression methods.
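As a rough illustration of the variable-selection idea behind a conditional inference tree, the sketch below tests each candidate predictor for association with the food-source label using a permutation test, then picks the predictor with the smallest Bonferroni-adjusted p-value; if no predictor falls below the significance level, splitting stops. This is a simplified stand-in, not the analysis used in this study: the between-group sum of squares is swapped in for the test statistic of a full conditional inference framework, the data are synthetic, and the function names (`between_group_ss`, `permutation_p_value`, `select_split_variable`) are hypothetical.

```python
import random
from collections import defaultdict


def between_group_ss(values, labels):
    """Between-class sum of squares of a numeric predictor.

    Larger values mean the class means are further apart.
    (Assumption: a simple stand-in for a conditional inference test statistic.)
    """
    overall_mean = sum(values) / len(values)
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return sum(
        len(g) * (sum(g) / len(g) - overall_mean) ** 2 for g in groups.values()
    )


def permutation_p_value(values, labels, n_perm=500, rng=None):
    """Monte Carlo permutation p-value for predictor/label association."""
    rng = rng or random.Random(0)
    observed = between_group_ss(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any real association
        if between_group_ss(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def select_split_variable(predictors, labels, alpha=0.05, n_perm=500, seed=0):
    """Return (name, adjusted_p) for the most significant predictor.

    The name is None when no adjusted p-value is <= alpha (stop splitting).
    """
    rng = random.Random(seed)
    n_tests = len(predictors)
    adjusted = {
        name: min(1.0, n_tests * permutation_p_value(vals, labels, n_perm, rng))
        for name, vals in predictors.items()
    }
    best = min(adjusted, key=adjusted.get)
    return (best if adjusted[best] <= alpha else None), adjusted[best]


# Synthetic data for illustration only (NOT the study's data or findings).
data_rng = random.Random(42)
labels = ["beef"] * 40 + ["vegetable"] * 30 + ["dairy"] * 30
pct_female = (
    [data_rng.gauss(45, 8) for _ in range(40)]
    + [data_rng.gauss(65, 8) for _ in range(30)]
    + [data_rng.gauss(50, 8) for _ in range(30)]
)
n_cases = [data_rng.gauss(20, 5) for _ in labels]  # unrelated to the label

chosen, p_adj = select_split_variable(
    {"pct_female": pct_female, "n_cases": n_cases}, labels
)
print(chosen, p_adj)
```

A full tree would apply this selection step recursively, choosing a split point on the selected variable and repeating within each child node until no predictor is significant.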
RESULTS: From 1998-2014, 470 outbreaks were reported to CDC. Outbreaks without an identified food source (153), with a complex food source (80), or with an uncommon food source (31) were excluded, leaving 206 outbreaks for analysis. The most common food sources were beef (125 outbreaks), vegetable (51), and dairy (30). The decision tree had 5 levels, with a total of 19 nodes and 10 terminal nodes. The variables selected to define branching and the probability of the three food sources were percentage female, season, and age categories. Percentage female was the primary predictor, and age categories (percentage <5 years and percentage >50 years) were secondary predictors. The decision tree correctly predicted 98% of beef outbreaks and 50% of dairy and vegetable outbreaks.
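The counts above can be checked with a few lines of arithmetic. The sketch below confirms that the exclusions leave the three analyzed food-source categories, then derives an approximate overall accuracy and a majority-class baseline. The overall figure is not reported in this abstract; it rests on the assumption that the reported per-class rates (98% for beef, 50% for each of dairy and vegetable) apply to the full class counts.

```python
reported = 470
excluded = {
    "no identified food source": 153,
    "complex food source": 80,
    "uncommon food source": 31,
}
included = {"beef": 125, "vegetable": 51, "dairy": 30}

# Outbreaks remaining after exclusions should equal the three analyzed classes.
analyzed = reported - sum(excluded.values())
assert analyzed == sum(included.values())

# Assumption: per-class rates from the abstract apply to the full class counts.
per_class_rate = {"beef": 0.98, "vegetable": 0.50, "dairy": 0.50}
expected_correct = sum(included[k] * per_class_rate[k] for k in included)

overall_accuracy = expected_correct / analyzed      # derived, not reported
majority_baseline = included["beef"] / analyzed     # always predicting "beef"

print(analyzed, round(overall_accuracy, 3), round(majority_baseline, 3))
```

Under these assumptions, comparing the derived overall accuracy with the majority-class baseline gives a sense of how much the tree adds beyond always predicting the most common food source.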
CONCLUSIONS: Decision trees are not only a useful prediction tool but also provide a visually intuitive way to demonstrate relationships between predictors and outcomes. This study demonstrates the feasibility of using decision tree modeling to develop a hypothesis-generation tool for foodborne outbreak investigations.