American Journal of Epidemiology
26 July 2024
Abstract
Fall-related injuries (FRIs) are a major cause of hospitalizations among older patients, but identifying them in unstructured clinical notes poses challenges for large-scale research. In this study, we developed and evaluated Natural Language Processing (NLP) models to address this issue. We utilized all available clinical notes from Mass General Brigham for 2,100 older adults, identifying 154,949 paragraphs of interest through automatic scanning for FRI-related keywords. Two clinical experts directly labeled 5,000 paragraphs to generate benchmark-standard labels, while annotation of 3,689 validated patterns indirectly labeled 93,157 paragraphs with validated-standard labels. Five NLP models (vanilla BERT, RoBERTa, Clinical-BERT, Distil-BERT, and a support vector machine [SVM]) were trained using 2,000 benchmark paragraphs and all validated paragraphs. BERT-based models were trained in three stages: Masked Language Modeling, general Boolean Question Answering (QA), and QA for FRI. For validation, 500 benchmark paragraphs were used, and the remaining 2,500 were used for testing. Performance was compared using precision, recall, F1 score, and area under the ROC (AUROC) and precision-recall (AUPR) curves, with RoBERTa showing the best performance. Its precision was 0.90 [0.88-0.91], recall [0.90-0.93], F1 score 0.90 [0.89-0.92], and both AUROC and AUPR 0.96 [0.95-0.97]. These NLP models accurately identify FRIs from unstructured clinical notes, potentially improving the efficiency of research based on clinical notes.
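The sketch below illustrates, in broad strokes, the final stage of the pipeline the abstract describes: fine-tuning a transformer classifier (here the generic `roberta-base` checkpoint) to flag paragraphs that mention an FRI, then scoring it with the reported metrics (precision, recall, F1, AUROC, AUPR). The checkpoint name, label convention (1 = FRI present), hyperparameters, 0.5 decision threshold, and train/test split handling are all illustrative assumptions, not the authors' configuration, and the earlier Masked Language Modeling and general Boolean QA stages are omitted.

```python
"""Hedged sketch: fine-tune a RoBERTa paragraph classifier for FRI detection
and compute the evaluation metrics named in the abstract. All names and
hyperparameters are assumptions for illustration only."""
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)


class ParagraphDataset(Dataset):
    """Paragraphs of interest with binary labels (1 = FRI present)."""
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_length, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item


def train_and_evaluate(train_texts, train_labels, test_texts, test_labels,
                       model_name="roberta-base", epochs=3, lr=2e-5):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name,
                                                               num_labels=2)
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    train_loader = DataLoader(
        ParagraphDataset(train_texts, train_labels, tokenizer),
        batch_size=16, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    # Standard supervised fine-tuning loop (cross-entropy over 2 labels).
    model.train()
    for _ in range(epochs):
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    # Score held-out test paragraphs.
    model.eval()
    probs = []
    with torch.no_grad():
        for i in range(0, len(test_texts), 32):
            enc = tokenizer(test_texts[i:i + 32], truncation=True,
                            padding=True, max_length=256,
                            return_tensors="pt").to(device)
            logits = model(**enc).logits
            probs.extend(torch.softmax(logits, dim=-1)[:, 1].cpu().tolist())

    preds = [int(p >= 0.5) for p in probs]  # assumed 0.5 threshold
    return {
        "precision": precision_score(test_labels, preds),
        "recall": recall_score(test_labels, preds),
        "f1": f1_score(test_labels, preds),
        "auroc": roc_auc_score(test_labels, probs),
        "aupr": average_precision_score(test_labels, probs),
    }
```

In the study itself, the benchmark-standard paragraphs are split 2,000/500/2,500 for training, validation, and testing, and validated-standard paragraphs augment the training set; a bootstrap over the test set would give the bracketed confidence intervals reported above.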
Citation
Ge, Wendong, Lilian M. G. Coelho, Maria A. Donahue, Hunter J. Rice, Deborah Blacker, John Hsu, Joseph P. Newhouse, Sonia Hernandez-Diaz, Sebastien Haneuse, M. Brandon Westover, and Lidia M. V. R. Moura. "Automated Identification of Fall-related Injuries in Unstructured Clinical Notes." American Journal of Epidemiology (26 July 2024).