Record matching on mortality data

I’m looking forward to teaming up with my HHS Entrepreneur-in-Residence cohorts Paula Braun and Adam Culbertson. We have a “perfect storm” coming up, where all three of our projects intersect. Paula is working on modernizing the nation’s mortality reporting capabilities. Adam has been working with HIMSS (Healthcare Information and Management Systems Society) to improve algorithms and methods for matching patient records. And I, for the DDOD project, have been working on a use case to leverage the NDI (National Death Index) for outcomes research. So the goals of mortality system modernization, patient matching and outcomes research are converging.

Patient Matching Exercise

To that end, Adam organized a hackathon at the HIMSS Innovation Center in Cleveland for August 2015.  This event throws in one more twist: the FHIR (Fast Healthcare Interoperability Resources) specification.  FHIR is a flexible standard for exchanging healthcare information electronically using RESTful APIs.  The hackathon intends to demonstrate what can be accomplished when experts from different domains combine their insights on patient matching and add FHIR as a catalyst.  The event is broken into two sections:

Section 1:  Test Your Matching Algorithms
Connect matching algorithms to a FHIR resource server containing synthetic patient resources. The matching algorithms will be updated to take in FHIR patient resources and then perform a de-duplication of the records. A final list of patient resources should be produced. Basic performance metrics can then be calculated to determine the success of the matching exercise. Use the provided tools, or bring your own and connect them up.

Section 2:  Development Exercise
Develop applications that allow EHRs to easily update the status of patients who are deceased. A synthetic centralized mortality database, such as the National Death Index or a state’s vital statistics registry, will be made available through a FHIR interface.  External data sources, such as EHRs, will be matched against this repository to flag decedents. The applications should be tailored to deliver data to decision makers. This scenario will focus on how different use cases drive different requirements for matching.
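
Both sections start from FHIR patient resources, so a matcher first has to reduce each resource to the handful of attributes it compares. The following is only a minimal sketch of that step, not the hackathon's tooling. It assumes the resource has already been fetched and parsed from JSON into a Python dict; the function name `flatten_patient` and the output field names are made up for illustration, and the sample resource is synthetic.

```python
def flatten_patient(resource):
    """Reduce a parsed FHIR Patient resource to the fields a matcher compares."""
    # A Patient may carry several names and addresses; this sketch uses the first.
    name = (resource.get("name") or [{}])[0]
    address = (resource.get("address") or [{}])[0]

    family = name.get("family", "")
    # Early FHIR drafts model the family name as a list of strings,
    # later releases as a single string, so accept both.
    if isinstance(family, list):
        family = " ".join(family)

    return {
        "id": resource.get("id"),
        "family": family,
        "given": " ".join(name.get("given", [])),
        "birth_date": resource.get("birthDate"),
        "city": address.get("city"),
        "state": address.get("state"),
        # FHIR records death as either deceasedBoolean or deceasedDateTime.
        "deceased": bool(
            resource.get("deceasedBoolean") or resource.get("deceasedDateTime")
        ),
    }


# Synthetic example resource, not taken from any real server.
sample = {
    "resourceType": "Patient",
    "id": "example-1",
    "name": [{"family": "Smith", "given": ["William", "J."]}],
    "birthDate": "1973-01-02",
    "address": [{"city": "Berkeley", "state": "CA"}],
}
record = flatten_patient(sample)
```

For the Section 2 scenario, the `deceased` flag is the value an EHR would update once its record has been matched against the mortality repository.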

Matching algorithms for patient records

Patient matching and de-duplication is an important topic in EHRs (Electronic Health Records) and HIEs (Health Information Exchanges), where identifying a patient uniquely impacts clinical care quality, patient safety, and research results. It becomes increasingly important as organizations exchange records electronically and patients seek treatment across multiple healthcare providers. (See related assessment titled “Patient Identification and Matching Report” that was delivered to HHS’s ONC in 2014.)

We’re looking forward to reporting on progress on all three initiatives and the common goal.

This topic is covered on the HHS IDEA Lab blog:

Appendix: Background on patient matching

Additional challenges occur because real-world data often has errors, variations and missing attributes.  Common errors could include misspellings and transpositions.  Many first names in particular could be written in multiple ways, including variations in spelling, formality, abbreviations and initials.  In large geographies, it’s also common for there to be multiple patients with identical first and last names.

Data set   | Name             | Date of birth | City of residence
Data set 1 | William J. Smith | 1/2/73        | Berkeley, California
Data set 2 | Smith, W. J.     | 1973.1.2      | Berkeley, CA
Data set 3 | Bill Smith       | Jan 2, 1973   | Berkeley, Calif.
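
Much of the variation in the three rows above can be removed by normalizing each attribute before any comparison takes place. Here is a rough sketch of what that might look like. The lookup tables `NICKNAMES` and `STATE_ALIASES` are tiny hypothetical stand-ins for real reference data, and the date formats assume month/day/year for slashed dates and year.month.day for dotted ones, which would need to be confirmed for each data source.

```python
from datetime import datetime

# Assumed formats, one per row of the example table.
DATE_FORMATS = ("%m/%d/%y", "%Y.%m.%d", "%b %d, %Y")
# Hypothetical lookups; real systems would use much larger reference tables.
NICKNAMES = {"bill": "william"}
STATE_ALIASES = {"california": "CA", "calif.": "CA", "ca": "CA"}


def normalize_date(text):
    """Try each known format and return a datetime.date."""
    for fmt in DATE_FORMATS:
        try:
            # Caution: %y maps two-digit years 69-99 to 19xx and 00-68 to 20xx,
            # which is risky for older birth dates.
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")


def normalize_name(text):
    """Return (first, last) in lower case; expects a non-empty name."""
    if "," in text:  # "Last, First Middle"
        last, _, rest = text.partition(",")
        parts = rest.split()
    else:  # "First Middle Last"
        parts = text.split()
        last = parts.pop()
    first = parts[0].strip(".").lower() if parts else ""
    return NICKNAMES.get(first, first), last.strip().lower()


def normalize_city(text):
    """Return (city, state) with the state mapped to a two-letter code if known."""
    city, _, state = text.partition(",")
    state = state.strip().lower()
    return city.strip().lower(), STATE_ALIASES.get(state, state.upper())


rows = [
    ("William J. Smith", "1/2/73", "Berkeley, California"),
    ("Smith, W. J.", "1973.1.2", "Berkeley, CA"),
    ("Bill Smith", "Jan 2, 1973", "Berkeley, Calif."),
]
normalized = [
    (normalize_name(n), normalize_date(d), normalize_city(c)) for n, d, c in rows
]
```

After this step the dates and cities of all three rows agree, but the second row's first name is still only the initial "w". Normalization alone cannot resolve that, which is where approximate matching comes in.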

Although there’s a broad range of matching algorithms, they can be divided into two main categories:

  • Deterministic algorithms search for an exact match between attributes
  • Probabilistic algorithms score an approximate match between records
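
To make the distinction concrete, here is a small sketch of both categories. The approximate matcher is a simple weighted string-similarity score built on Python's `difflib`, standing in for a real probabilistic model; it is not a full statistical record linkage method. The field names, weights and threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

FIELDS = ("first", "last", "birth_date", "city")
# Assumed weights; a real system would estimate these from data.
WEIGHTS = {"first": 0.2, "last": 0.3, "birth_date": 0.4, "city": 0.1}


def deterministic_match(a, b):
    """Match only if every compared attribute is exactly equal."""
    return all(a[f] == b[f] for f in FIELDS)


def similarity(x, y):
    """String similarity between 0.0 and 1.0 (crude when applied to dates)."""
    return SequenceMatcher(None, x, y).ratio()


def match_score(a, b):
    """Weighted sum of per-field similarities, between 0.0 and 1.0."""
    return sum(WEIGHTS[f] * similarity(a[f], b[f]) for f in FIELDS)


def approximate_match(a, b, threshold=0.8):
    """Match if the weighted score reaches an assumed threshold."""
    return match_score(a, b) >= threshold


full = {"first": "william", "last": "smith",
        "birth_date": "1973-01-02", "city": "berkeley"}
initial = {"first": "w", "last": "smith",
           "birth_date": "1973-01-02", "city": "berkeley"}
other = {"first": "wilma", "last": "smythe",
         "birth_date": "1980-06-15", "city": "oakland"}

exact = deterministic_match(full, initial)      # first names differ
approx = approximate_match(full, initial)       # score is about 0.85
unrelated = approximate_match(full, other)
```

The exact comparison rejects the pair that differs only by an abbreviated first name, while the scored comparison accepts it and still rejects the unrelated record. Where the threshold sits determines how many borderline pairs end up needing a second look.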

Both kinds of algorithm are often supplemented with exception-driven manual review. From a broader, mathematical perspective, the concept we’re dealing with is entity resolution (ER). There’s a good introductory ER tutorial that summarizes the work in Entity Resolution for Big Data, presented at KDD 2013. Although it looks at the discipline more generically, it’s still quite applicable to patient records. It delves into the areas of Data Preparation, Pairwise Matching, Algorithms in Record Linkage, De-duplication, and Canonicalization. To enable scalability, it suggests the use of blocking techniques and canopy clustering. These capabilities are needed so often that they may be built into commercial enterprise software. IBM’s InfoSphere MDM (Master Data Management) is an example.
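
The scalability problem is that comparing every record with every other record grows quadratically. Blocking sidesteps this by only comparing records that share a cheap key. The sketch below illustrates the idea; the key (birth year plus first letter of the last name) is an assumption picked for the example, and canopy clustering is not shown.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Assumed key: birth year plus first letter of the last name.
    return record["birth_date"][:4], record["last"][:1]


def candidate_pairs(records):
    """Yield only the pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


records = [
    {"id": "a", "last": "smith", "birth_date": "1973-01-02"},
    {"id": "b", "last": "smith", "birth_date": "1973-01-02"},
    {"id": "c", "last": "smythe", "birth_date": "1980-06-15"},
    {"id": "d", "last": "jones", "birth_date": "1973-03-04"},
]

all_pairs = list(combinations(records, 2))      # every possible comparison
blocked_pairs = list(candidate_pairs(records))  # comparisons after blocking
```

With four records there are six possible pairs, and blocking leaves just one to score. The trade-off is that an error in the blocking key itself, such as a mistyped birth year, hides a true match, so keys are usually chosen from the most reliable attributes.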

Metrics for patient matching

When comparing multiple algorithms for effectiveness, we have a couple of good metrics: precision and recall. Precision identifies how many of the matches were relevant, while recall identifies how many of the relevant items were matched. F-Measure combines the two. It should be noted that the accuracy metric, which is the ratio of items accurately identified to the total number of items, should be avoided. It suffers from the “accuracy paradox”, where an algorithm with lower accuracy may actually be more predictive. This happens easily in matching, because true non-matches vastly outnumber true matches.


  • Precision:     p = TP/(TP+FP)
  • Recall:    r = TP/(TP+FN)
  • F-Measure =  2 p r / (p + r)
  • Accuracy:   a = (TP+TN)/(TP+TN+FP+FN)
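
A quick worked example shows why accuracy misleads here. The counts below are invented for illustration: 1,000 candidate pairs of which only 10 are true matches. One matcher never declares a match; the other finds most of the true matches at the cost of some false positives.

```python
def matching_metrics(tp, fp, fn, tn):
    """Compute the four metrics from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Invented counts: 1,000 pairs, 10 of them true matches.
never_matches = matching_metrics(tp=0, fp=0, fn=10, tn=990)
useful_matcher = matching_metrics(tp=8, fp=15, fn=2, tn=975)

for label, m in (("never matches", never_matches),
                 ("useful matcher", useful_matcher)):
    print(label, {k: round(v, 3) for k, v in m.items()})
```

The matcher that finds nothing scores 99% accuracy with zero recall, while the one that finds 8 of the 10 true matches scores slightly lower on accuracy (98.3%) but far higher on F-Measure.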

In the long run, the challenge can also be approached from the other side. In other words, how can the quality of data entry and storage within an organization be improved? This approach could reap benefits in downstream matching, reducing the need for complex algorithms and improving accuracy. AHIMA published a primer on Patient Matching in HIEs, in which they go as far as calling for a nationwide standard that would facilitate more accurate matching. They suggest standardizing on commonly defined demographic elements, eliminating use of free text entry except for proper names, and ensuring multiple values aren’t combined in single fields.
