BACKGROUND: The machine learning algorithm used in this pilot is based on Bayesian prediction models, which are commonly used for classifying narrative text. This algorithm has been demonstrated to auto-code occupational injury narratives into discrete event (mechanism of injury) groups with high accuracy. While manual coding of narrative text may never be completely replaced by automated methods, previous studies have shown that this algorithm can identify the narratives that would most benefit from manual review.
METHODS: The Naïve Bayes algorithm was applied (using Textminer software) to the narratives of 50,000 records randomly sampled from the 2011 Bureau of Labor Statistics Survey of Occupational Injuries and Illnesses (BLS SOII) micro-data for the U.S. The algorithm was trained on a sub-sample of these cases (40,000), using the BLS-assigned Occupational Injury and Illness Classification System (OIICS) codes for 2-digit Event categories. The Bayesian probabilities calculated from the training dataset were then used to predict the 2-digit Event category from the narratives of the remaining 10,000 cases, which made up the prediction dataset. The algorithm was first run on all individual (single) words in the narratives of the prediction set (SW Model), and then run a second time treating each 2-word sequence as a single word (SEQ Model). Results of the SW Model and the SEQ Model were combined to identify the set of cases for which both models agreed on the algorithm-assigned Event category (Combined Model). To evaluate algorithm performance at the 2-digit OIICS level for the three models, sensitivity and positive predictive value (PPV) were calculated using the BLS manually assigned categories as the gold standard.
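The SW, SEQ, and Combined models described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Textminer implementation: it assumes a multinomial Naïve Bayes with add-one smoothing, whitespace tokenization, and a SEQ model built from two-word sequences only. The function names, the toy narratives, and the use of "42" as a second category label are all hypothetical.

```python
import math
from collections import Counter, defaultdict

def tokens(text, seq=False):
    words = text.lower().split()
    if seq:
        # SEQ Model: treat each adjacent two-word sequence as a single token
        return [a + "_" + b for a, b in zip(words, words[1:])]
    return words  # SW Model: individual words

def train(records, seq=False):
    """records: iterable of (narrative, event_code) pairs."""
    class_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for text, code in records:
        class_counts[code] += 1
        for t in tokens(text, seq):
            token_counts[code][t] += 1
            vocab.add(t)
    return class_counts, token_counts, vocab

def predict(model, text, seq=False):
    class_counts, token_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for code in sorted(class_counts):
        # log prior plus smoothed log likelihood of each known token
        score = math.log(class_counts[code] / total)
        denom = sum(token_counts[code].values()) + len(vocab)
        for t in tokens(text, seq):
            if t in vocab:  # tokens unseen in training are ignored
                score += math.log((token_counts[code][t] + 1) / denom)
        if score > best_score:
            best, best_score = code, score
    return best

def combined(sw_model, seq_model, text):
    """Return the Event code if both models agree, else None."""
    sw = predict(sw_model, text)
    sq = predict(seq_model, text, seq=True)
    return sw if sw == sq else None

# Made-up training narratives; labels are illustrative strings only.
train_set = [
    ("strained back lifting heavy box", "71"),
    ("hurt shoulder lifting patient", "71"),
    ("slipped on wet floor and fell", "42"),
    ("fell on icy sidewalk", "42"),
]
sw_model = train(train_set)
seq_model = train(train_set, seq=True)
```

A call such as `combined(sw_model, seq_model, "fell on wet floor")` returns a code only when the two models agree; a `None` result marks a case that would fall outside the Combined Model's agreement set.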
RESULTS: PPV increased overall from 69.1% for the SEQ Model to 79.5% for the Combined Model. Sensitivity for the majority of the high-frequency 2-digit Event categories in the Combined Model was 85% or higher. PPV values were generally lower than the sensitivity values. One exception was Category 71 (Overexertion from Outside Sources), which had a PPV of 90% and accounted for 29% of cases in the combined dataset.
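For readers less familiar with the two measures, a minimal sketch of how per-category sensitivity and PPV can be computed against a gold standard follows. The function name and the toy label lists are hypothetical and do not reproduce the study data.

```python
def sensitivity_and_ppv(gold, predicted, category):
    """Sensitivity = TP / (TP + FN); PPV = TP / (TP + FP) for one category."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == category and p == category)
    fn = sum(1 for g, p in pairs if g == category and p != category)
    fp = sum(1 for g, p in pairs if g != category and p == category)
    sens = tp / (tp + fn) if (tp + fn) else None
    ppv = tp / (tp + fp) if (tp + fp) else None
    return sens, ppv

# Toy example: manually assigned codes vs. algorithm-assigned codes
gold = ["71", "71", "71", "42", "42"]
pred = ["71", "71", "71", "71", "42"]
print(sensitivity_and_ppv(gold, pred, "71"))  # (1.0, 0.75)
print(sensitivity_and_ppv(gold, pred, "42"))  # (0.5, 1.0)
```

In the toy data, category "71" captures every true case (sensitivity 1.0) but also absorbs one case from another category, which lowers its PPV to 0.75.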
CONCLUSIONS: Use of the Naïve Bayes algorithm to auto-code Event information in the BLS SOII shows promise. Operationally, if a user restricts attention to the ‘Agree’ cases in the Combined Model and accepts the algorithm-assigned Event categories only for the high-frequency categories where PPV is 85% or greater (Event categories 26 and 71), the user could auto-code 34% (2,504) of the ‘Agree’ cases with relatively high confidence in the accuracy. This translates into 25% (2,504/10,000) of all cases.
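The operational rule described above can be sketched as a simple filter: accept the algorithm-assigned code when both models agree and the category's PPV meets the threshold, and send everything else to manual review. The function name and the toy cases are hypothetical; of the PPV values below, only the 0.90 for category 71 comes from the results, and the others are placeholders.

```python
def split_for_review(cases, ppv_by_category, threshold=0.85):
    """cases: list of (sw_prediction, seq_prediction) pairs.

    Returns (auto_coded, manual_review) as lists of case indices.
    """
    auto_coded, manual_review = [], []
    for i, (sw_pred, seq_pred) in enumerate(cases):
        agree = sw_pred == seq_pred
        if agree and ppv_by_category.get(sw_pred, 0.0) >= threshold:
            auto_coded.append(i)
        else:
            manual_review.append(i)
    return auto_coded, manual_review

# 0.90 for "71" is reported in the results; "26" and "42" are placeholders.
ppv_by_category = {"71": 0.90, "26": 0.85, "42": 0.70}
cases = [("71", "71"), ("71", "42"), ("42", "42"), ("26", "26")]
auto_coded, manual_review = split_for_review(cases, ppv_by_category)

# Share of all prediction-set cases auto-coded in the study: 2,504 of 10,000
share_of_all_cases = 2504 / 10000  # 0.2504, reported as 25%
```

Here cases 0 and 3 would be auto-coded, while case 1 (models disagree) and case 2 (category PPV below the threshold) would go to manual review.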