In healthcare, we now have more accessible electronic information than ever before, and the excitement is palpable around the potential for both big data and machine learning (ML). For example, computer scientists at Stanford recently developed deep neural networks to diagnose skin cancer from images of skin lesions.1

ML based on structured data typically uses existing data (e.g., predicting future outcomes with different treatment options based on past outcomes) without the need for new and expensive human annotation. Healthcare, however, is complicated by the approximately 80% of unstructured data that make up a typical electronic health record (EHR). An interesting comment from a northeastern health system Chief Medical Information Officer illustrates the problem. He said, “My team had plenty of questions to ask of the data, but most of them couldn’t be answered because the data was trapped in unstructured text.” In the end, they ended up answering less interesting questions where the data was more easily at hand.

Language in EHRs is complex, and key features are determined by subtle clues. “Trying to quit smoking” indicates a current smoker, whereas “quit smoking five years ago” indicates an ex-smoker. This prevents direct ML of outcomes from unstructured text. Instead, Natural Language Processing (NLP) needs to be employed to turn the text into a set of features for ML to use.

NLP is a technology that has been commercially available for many years, and is used widely in the biomedical and healthcare space. NLP components often rely on ML. For example, determining the structure of the language (e.g., part-of-speech tagging) is typically done using ML since there is plenty of representative data available where words have been annotated as nouns, verbs, etc.

Turning Unstructured Data into Valuable Information

Some NLP systems, especially academic systems, also incorporate ML for extracting features. This works well in competition settings where there has been effort to create a well-annotated corpus. These techniques, however, have had less success in commercial settings in which good quality, large-scale, and representative annotated data is rarely available. Annotating a gold standard is expensive, requiring development of detailed annotation guidelines and use of multiple annotators to judge reproducibility of results (measured by inter-annotator agreement). This can be particularly challenging when the subject matter experts are busy healthcare professionals.

How can we improve this? Rule-based NLP techniques that extract features from text using linguistic patterns and terminologies avoid the need for annotated training data, and typically provide good precision (correct results). However, this has sometimes been at the expense of good recall (found results).

Data-driven, rule-based NLP (agile text mining) is a more recent alternative. This can achieve high recall and precision for any particular feature in hours without the need for annotated training sets. While production rules may need additional development, a swift “first pass” will often be enough to establish whether a feature will be a useful contributor for an ML model.

The following are three examples of successful NLP-ML initiatives in which NLP is used to rapidly and effectively extract and structure the key features from unstructured text to feed into ML algorithms for predictive modeling:

1. NLP for Product Vigilance in Medical Affairs

A top-10 pharma company uses NLP to annotate and categorize “voice of the customer” (VoC) call feeds for pharmacovigilance. VoC call transcripts are a rich seam of potential patient reported outcomes, side effects, drug interactions, and more.

Researchers in the predictive analytics group built a workflow to process the call transcripts, using agile text mining to make sense of the unstructured feeds. The calls are categorized and tagged for key metadata such as caller demographics and reason for calling (e.g., complaint, formulation information, side effect, drug-drug interactions).

The extracted features are used as the structured substrate for ML algorithms to assist in the categorization of call feeds, and to build predictive models around the different products.

Adaptation to healthcare—by collecting similar information through patient portal messages and nurse notes—could similarly provide valuable healthcare insights: For example, when there is 1) a formulary addition of a new medication or 2) as part of the vigilance workflow for patient safety and risk management from an alert.

2. NLP Aiding Pneumonia Prediction

Kaiser Permanente developed a system that categorizes potential pneumonia patients2 based on interpretation of their chest X-rays in radiology reports. The approach blends leading-edge commercial off-the-shelf healthcare NLP capabilities and rule-based ML algorithms to provide a classifier method.

This collaboratively developed system was assessed against a 300-patient gold standard and achieved excellent selectivity and specificity scores of over 90%. It has been used to classify hundreds of thousands of patient reports.

3. NLP with ML to Predict Failure or Success of Cancer Drugs in Clinical Trials

In a publication3 last year, researchers from Roche and Humboldt University of Berlin describe how they used NLP to systematically identify all MEDLINE abstracts containing both the protein target and the specific disease indication of a known set of successfully approved or failed cancer therapeutics (for example, abstracts containing both HER2 and breast cancer, or c-Kit and gastrointestinal stromal tumor).

The researchers applied ML classifiers and found that the NLP-extracted data features could be used to predict success or failure of target-indication pairs, and hence, approved or failed drugs. They conclude: “These patterns allow predicting success of drugs in Phase II or III with remarkably high accuracy.”

The authors proposed that these methods (combining text mining with ML) could easily be applied to other disease areas and provide an indication of the direction of successful development.


NLP can provide access to the approximately 80% of data otherwise trapped in unstructured text, turning it at large scale into features that can drive predictive models. The combination of NLP and ML provides a powerful paradigm for developing predictive models in healthcare and life science applications.


1. “Dermatologist-level Classification of Skin Cancer with Deep Neural Networks.” Nature. 2017 Feb 2; 542(7639):115-118. doi: 10.1038/nature21056.

2. “Automated Identification of Pneumonia in Chest Radiograph Reports in Critically Ill Patients.” BMC Med Inform Decis Mak. 2013; 13: 90. doi: 10.1186/1472-6947-13-90.

3. “Reflection of Successful Anticancer Drug Development Processes in the Literature.” Drug Discov Today. 2016 Nov;21(11):1740-1744. doi: 10.1016/j.drudis.2016.07.008.

  • David Milward
    David Milward

    Chief Technology Officer at Linguamatics

    David Milward is Chief Technology Officer at Linguamatics. He is a pioneer of interactive text mining, and a founder of Linguamatics. He has over 20 years of experience in natural language processing (NLP) product development, consultancy, and research.


    You May Also Like

    Movers & Shakers

    AbelsonTaylor AbelsonTaylor, the largest independent advertising agency in the healthcare field, has hired Ernie ...

    Personalizing the Customer Experience

    Brand teams that wish to create a personalized experience for the customer via multi-channel ...

    The Impact of Artificial Intelligence on Data Analytics

    Artificial intelligence (AI) and its ability to impact data analytics is on everyone’s mind. ...