Category Archives: blog

Data.world Platform

Recently I’ve been playing around with the relatively new Data.world platform, which aims to be the social network for data people.  I tried some of the query and visualization features, as well as uploading datasets.  It has a lot of promise and I really hope it takes off.  At the moment, the biggest challenge for me is distinguishing which datasets are trustworthy and clean.  I also think they have an opportunity to greatly improve dataset search if they could provide advanced search directives, such as explicitly specifying fields, geographic locations, time frames or semantic tags.

Today, I received an email from Data.world listing 10 favorite features of their users.  They’re listed below and I’m sure exploring them could provide for countless hours of fun.

10) Add metadata with data dictionaries, file tags, and column descriptions
9) Instantly query and join .csv and .json files using SQL
8) Join your local data with other datasets in data.world
7) Showcase your code, data, and documentation in Python and R notebooks
6) See inside files before downloading
5) Use integrations for R, Python, or JDBC to pull data into your tools of choice
4) Enrich your analyses with U.S. Census ACS summary files
3) Export datasets as Tabular Data Packages, a standard machine-readable format
2) Pull data directly into Tableau to create your visualizations
1) Add and sync files from GitHub, Google Drive, or S3
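Feature 9 (querying .csv files with SQL) can be approximated locally to get a feel for it. Here's a minimal Python sketch that uses an in-memory SQLite table in place of an uploaded file; the table and data are invented for illustration:

```python
import sqlite3

# Stand-in rows for a small uploaded CSV dataset (hypothetical data).
rows = [("NC", "Raleigh", 469298), ("VA", "Richmond", 226610)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (state TEXT, city TEXT, population INTEGER)")
conn.executemany("INSERT INTO cities VALUES (?, ?, ?)", rows)

# The same flavor of SQL you could run against files on data.world.
big_cities = conn.execute(
    "SELECT state, city FROM cities WHERE population > 300000").fetchall()
print(big_cities)  # [('NC', 'Raleigh')]
```

On the platform itself, the file name acts as the table name and the joins can span datasets, which is what makes feature 8 possible.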

 

Open Data Panel at All Things Open Conference

Open Data Panel to be Featured at All Things Open

Open Data will be a featured panel discussion at the All Things Open conference this year.  With a new administration set to transition into place in January and multiple new initiatives starting at both the state and federal levels, the topic has never been more important.  The session, which will take place Wednesday, October 26 at 1:30 pm ET, will feature some of the foremost experts in the world.

Topics to be discussed will include:

  • The new Open Data Transition Report
  • Future opportunities for open data at the federal and local levels, including the DATA Act
  • How the Open Data landscape is evolving, particularly through Demand-Driven Open Data (DDOD)
  • How the panel’s insights can help local governments create demand-driven open data programs

The world-class lineup of panel members will include:

  • Joel Gurin (President and Founder, Center for Open Data Enterprise)
  • Hudson Hollister (Founder and Executive Director, Data Coalition)
  • David Portnoy (Founder, Demand-Driven Open Data)
  • Tony Fung (Deputy Secretary of Technology, State of Virginia)
  • Andreas Addison (President, Civic Innovator LLC)
  • Sam McClenney (Economist, RTI International)
  • Caroline Sullivan (Wake County Commissioner)

The panel is open to attendees of All Things Open, the largest “open” technology event on the east coast of the United States.

 

Rheumatoid Arthritis Data Challenge

Looking forward to seeing the evolution of the Rheumatoid Arthritis Data Challenge.  Here are the parameters…

  • Title: Rheumatoid Arthritis Data Challenge
  • Announcement date: March 8, 2016
  • Award date: May 10, 2016
  • Summary:
The Rheumatoid Arthritis Data Challenge is a code-a-thon, described as:

“Striking at the heart of a key issue in health outcomes research, participants will be provided access to a secured development environment in a staged competition over three weeks to create the best competitive algorithms to gauge clinical response in Rheumatoid Arthritis management.”
The challenge is hosted by Health Datapalooza in May 2016. It’s sponsored by Optum, AcademyHealth, and the US Department of Health and Human Services (HHS). The challenge uses non-governmental de-identified administrative claims data and electronic health record (EHR) clinical data, with the goal of establishing algorithms to predict clinical response to rheumatoid arthritis management. Applications are open to any team of health data enthusiasts, but only 15 teams will be selected to participate. (Register at: https://hdpalooza.wufoo.com/forms/rheumatoid-arthritis-data-challenge-reg-form/). Winners will be announced at the Health Datapalooza on May 10, 2016, with $40,000 in prizes to be awarded.

Public Access Repositories for Federally Funded Research

According to OSTP, there has been growth in the use of public access repositories intended to store the results of federally funded research.  That’s good news.  Despite a mandate from February 2013 that such results be made available, adoption by the research community has been slow.  Challenges include the competitive nature of research, mixing of multiple sources of funding, licensing conflicts with private peer-reviewed publications, privacy concerns for study subjects, and many others.  Even the raw data and source code behind the calculations need to be made available.  For a research study, the clearest measure of meeting this mandate is complete reproducibility.

So while we’re quite far from the ultimate goal, there have been incremental gains.  The HHS statistical agencies (including NIH, AHRQ, CDC, FDA and ASPR) in particular have been using two systems: PubMed Central and CDC Stacks.  According to the latest figures from OSTP, on a typical weekday PubMed has more than 1.2 million unique users downloading 2 million articles.  While that’s impressive, the actual growth in the number of articles in the two years since the mandate is approximately 30% (from about 2.7 million to 3.5 million).  So much more work remains.

 

Open Access repositories at a glance

Plans for Demand-Driven Open Data 2.0

Demand-Driven Open Data (DDOD) is a component of HHS’s Health Data Initiative (HDI), represented publicly by HealthData.gov.  DDOD is a framework of tools and methods to provide a systematic, ongoing and transparent mechanism for industry and academia to tell HHS more about their data needs.  The DDOD project description has recently been updated on the HHS IDEA Lab website: http://www.hhs.gov/idealab/projects-item/demand-driven-open-data/.  The writeup includes the problem description, background and history, the DDOD solution and process, and future plans.

In November 2015, the project underwent an extensive evaluation of the activities and accomplishments of the prior year.  Based on the observations, plans are in place to deploy DDOD 2.0 in 2016.  On the process side, the new version will have clearly defined SOPs (standard operating procedures), better instructions for data requesters and data program owners, and up-front validation of use cases.  On the technology side, DDOD will integrate with the current HealthData.gov platform, with the goals of optimizing data discoverability and usability.  It will also include dashboards, data quality analytics, and automated validation of use case content.  These features will help guide DDOD operations and the HealthData.gov workflow.

Invisible Illness Codathon

Identifying Datasets for Invisible Illness Codathon

Several datasets were identified for use at a recent White House codathon on mental illness and suicide prevention.  (See related press release.)  Many of them were from HHS (U.S. Department of Health and Human Services) agencies: CDC, SAMHSA, and AHRQ.  Datasets throughout government were tagged with “Suicide” for easy retrieval.  These tags were then ingested and aggregated up to Data.gov, specifically http://catalog.data.gov/dataset?tags=suicide.
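Since Data.gov runs on CKAN, this kind of tag-based retrieval can also be done programmatically through its search API. Here's a sketch that just constructs the request URL; the endpoint path is the standard CKAN one, which I'm assuming applies to catalog.data.gov:

```python
from urllib.parse import urlencode

# Standard CKAN search endpoint; catalog.data.gov is a CKAN instance.
BASE = "https://catalog.data.gov/api/3/action/package_search"

def tag_search_url(tag: str, rows: int = 10) -> str:
    """Build a CKAN package_search URL filtering datasets by tag."""
    return BASE + "?" + urlencode({"fq": f'tags:"{tag}"', "rows": rows})

print(tag_search_url("suicide"))
```

Fetching that URL returns a JSON envelope whose `result.results` array holds the matching dataset records.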

Source: White House – Suicide Prevention/Mental Health & Data for Invisible Illnesses

Data sources

  • WHO Statistical Information System (WHOSIS) – An interactive database bringing together core health statistics for the 193 WHO Member States. It comprises more than 70 indicators, which can be accessed by way of a quick search, by major categories, or through user-defined tables. The data can be further filtered, tabulated, charted and downloaded.
  • International Crime Victims Surveys
  • National Inpatient Sample (NIS) – The NIS is a database of hospital inpatient stays used to identify, track, and analyze national trends in health care utilization, access, charges, quality, and outcomes. The NIS is the largest all-payer inpatient care database that is publicly available in the United States, containing data from approximately 8 million hospital stays from about 1,000 hospitals sampled to approximate a 20-percent stratified sample of U.S. community hospitals.
  • National Survey on Drug Use and Health (NSDUH) – Beginning in 2008, the National Survey on Drug Use and Health started asking about suicidal thoughts and behaviors of all adults aged 18 or older. Along with responses to the suicide-related questions, the survey collects nationally and state-representative information on socio-demographic items such as age group, sex, ethnicity, employment, and income.
  • Pan American Health Organization, Regional Core Health Data Initiative – In 1995, the Regional Core Health Data and Country Profile Initiative was launched by the Pan American Health Organization to monitor the attainment of health goals of the Member States. The initiative includes a database with 117 health-related indicators, country health profiles, and reference documents.
  • The American Association of Suicidology – The goal of the American Association of Suicidology (AAS) is to understand and prevent suicide. The Research Division of AAS is dedicated to advancing knowledge about suicidal behavior through science.
  • Suicide Attack Database – The current CPOST-SAD release contains the universe of suicide attacks from 1982 through June 2015, a total of 4,620 attacks in over 40 countries.
  • Behavioral Risk Factor Surveillance System (BRFSS) – Collects data on a variety of behavioral health issues through a national telephone survey developed by the US Centers for Disease Control and Prevention (CDC) and administered to a sample of households in the US. Some states include questions on suicidal behavior.
  • Department of Defense Suicide Event Report (DoDSER) Data – The Department of Defense Suicide Event Report (DoDSER) is the system of record for health surveillance related to suicide ideations, attempts, and deaths.

 

Overview for using these data sources

 

 

Record matching on mortality data

I’m looking forward to teaming up with my fellow HHS Entrepreneurs-in-Residence Paula Braun and Adam Culbertson.  We have a “perfect storm” coming up, where all three of our projects intersect.  Paula is working on modernizing the nation’s mortality reporting capabilities.  Adam has been working with HIMSS (the Healthcare Information and Management Systems Society) to improve algorithms and methods for matching patient records.  And I, for the DDOD project, have been working on a use case to leverage the NDI (National Death Index) for outcomes research.  So the goals of mortality system modernization, patient matching and outcomes research are converging.

To that end, Adam organized a hackathon at the HIMSS Innovation Center in Cleveland for August 2015.  This event throws in one more twist: the FHIR (Fast Healthcare Interoperability Resources) specification.  FHIR is a flexible standard for exchanging healthcare information electronically using RESTful APIs.  The hackathon intends to demonstrate what can be accomplished when experts from different domains combine their insights on patient matching and add FHIR as a catalyst.  The event is broken into two sections:

Section 1: Test Your Matching Algorithms
Connect matching algorithms to a FHIR resource server containing synthetic patient resources.  The matching algorithms will be updated to take in FHIR patient resources and then perform a de-duplication of the records.  A final list of patient resources should be produced.  Basic performance metrics can then be calculated to determine the success of the matching exercise.  Use the provided tools, or bring your own and connect them up.

Section 2: Development Exercise
Develop applications that allow EHRs to easily update the status of patients who are deceased. A synthetic centralized mortality database, such as the National Death Index or a state’s vital statistics registry, will be made available through a FHIR interface.  External data sources, such as EHRs, will be matched against this repository to flag decedents. The applications should be tailored to deliver data to decision makers. This scenario will focus on how different use cases drive different requirements for matching.
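For the matching exercise, a matcher first has to pull identifying fields out of FHIR Patient resources. Here's a minimal sketch against a hand-made resource (not one from the hackathon server); the field layout follows recent FHIR releases, and older DSTU2 resources differ slightly (e.g. the family name is a list):

```python
# A hand-made FHIR Patient resource (JSON structure per recent FHIR specs).
patient = {
    "resourceType": "Patient",
    "id": "example",
    "name": [{"family": "Smith", "given": ["William", "J."]}],
    "birthDate": "1973-01-02",
    "deceasedBoolean": False,
}

def match_key(resource: dict) -> tuple:
    """Extract the fields a matcher would compare: name plus birth date."""
    name = resource["name"][0]
    return (name["family"].lower(),
            " ".join(name["given"]).lower(),
            resource.get("birthDate"))

print(match_key(patient))  # ('smith', 'william j.', '1973-01-02')
```

De-duplication then reduces to comparing these keys (exactly or approximately) across all resources returned by the server.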

Matching algorithms for patient records

Patient matching and de-duplication is an important topic in EHRs (Electronic Health Records) and HIEs (Health Information Exchanges), where uniquely identifying a patient impacts clinical care quality, patient safety, and research results.  It becomes increasingly important as organizations exchange records electronically and patients seek treatment across multiple healthcare providers.  (See the related assessment titled “Patient Identification and Matching Report” that was delivered to HHS’s ONC in 2014.)

We’re looking forward to reporting on progress on all three initiatives and the common goal.

This topic is covered on the HHS IDEA Lab blog:  http://www.hhs.gov/idealab/2015/08/10/teaming-advance-patient-matching-hackathon/

Appendix: Background on patient matching

Additional challenges occur because real-world data often has errors, variations and missing attributes.  Common errors could include misspellings and transpositions.  Many first names in particular could be written in multiple ways, including variations in spelling, formality, abbreviations and initials.  In large geographies, it’s also common for there to be multiple patients with identical first and last names.

Data set     Name              Date of birth   City of residence
Data set 1   William J. Smith  1/2/73          Berkeley, California
Data set 2   Smith, W. J.      1973.1.2        Berkeley, CA
Data set 3   Bill Smith        Jan 2, 1973     Berkeley, Calif.

Although there’s a broad range of matching algorithms, they can be divided into two main categories:

  • Deterministic algorithms search for an exact match between attributes
  • Probabilistic algorithms score an approximate match between records
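The contrast between the two categories can be shown in a few lines of Python. The probabilistic score below uses the standard library's difflib as a stand-in for a real scoring model; the normalization is deliberately trivial:

```python
from difflib import SequenceMatcher

def deterministic_match(a: str, b: str) -> bool:
    """Exact match after trivial normalization."""
    return a.strip().lower() == b.strip().lower()

def probabilistic_score(a: str, b: str) -> float:
    """Approximate similarity score in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Variants of the same person from the table above.
print(deterministic_match("William J. Smith", "Smith, W. J."))  # False
score = probabilistic_score("William J. Smith", "Bill Smith")
print(round(score, 2))
```

The deterministic check fails on every variant in the table, while the probabilistic score still registers partial similarity, which is why production systems pair a score threshold with manual review of borderline cases.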

These are often supplemented with exception-driven manual review.  From a broader, mathematical perspective, the concept we’re dealing with is entity resolution (ER).  There’s a good introductory ER tutorial that summarizes the work in Entity Resolution for Big Data, presented at KDD 2013.  Although it looks at the discipline more generically, it’s still quite applicable to patient records.  It delves into the areas of data preparation, pairwise matching, algorithms in record linkage, de-duplication, and canonicalization.  To enable scalability, it suggests the use of blocking techniques and canopy clustering.  These capabilities are needed so often that they may be built into commercial enterprise software; IBM’s InfoSphere MDM (Master Data Management) is an example.
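Blocking, one of those scalability techniques, can be sketched quickly: records are grouped by a cheap key so that expensive pairwise comparison only happens within a block. The records and the key below are invented for illustration:

```python
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "last": "Smith", "birth_year": 1973},
    {"id": 2, "last": "Smyth", "birth_year": 1973},
    {"id": 3, "last": "Jones", "birth_year": 1980},
]

def blocking_key(rec: dict) -> tuple:
    # Cheap key: first letter of last name plus birth year.
    return (rec["last"][0].upper(), rec["birth_year"])

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

# Only pairs sharing a block get compared: 1 candidate pair instead of 3.
candidate_pairs = [
    (a["id"], b["id"])
    for block in blocks.values()
    for a, b in combinations(block, 2)
]
print(candidate_pairs)  # [(1, 2)]
```

The trade-off is recall: a bad key (say, exact last name) would put "Smith" and "Smyth" in different blocks and the pair would never be scored.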

Metrics for patient matching

When comparing multiple algorithms for effectiveness, we have a couple of good metrics: precision and recall.  Precision identifies how many of the matches were relevant, while recall identifies how many of the relevant items were matched.  F-measure combines the two.  It should be noted that the accuracy metric, which is the ratio of items accurately identified to the total number of items, should be avoided.  It suffers from the “accuracy paradox”, where lower measures of accuracy may actually be more predictive.

 

  • Precision:   p = TP/(TP+FP)
  • Recall:      r = TP/(TP+FN)
  • F-Measure:   F = 2pr/(p+r)
  • Accuracy:    a = (TP+TN)/(TP+TN+FP+FN)
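As a sanity check, these formulas translate directly into code; the confusion-matrix counts below are made up:

```python
def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f_measure(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)

# Illustrative counts: 80 true matches found, 20 false matches, 40 missed.
p, r = precision(tp=80, fp=20), recall(tp=80, fn=40)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3))  # 0.8 0.667 0.727
```

Note how a matcher that declares everything a non-match can still score high accuracy when true matches are rare, which is exactly the accuracy paradox mentioned above.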

In the long run, the challenge can also be approached from the other side: how can the quality of data entry and storage within an organization be improved?  This approach could reap benefits in downstream matching, reducing the need for complex algorithms and improving accuracy.  AHIMA published a primer on patient matching in HIEs, in which they go as far as calling for a nationwide standard that would facilitate more accurate matching.  They suggest standardizing on commonly defined demographic elements, eliminating the use of free-text entry except for proper names, and ensuring multiple values aren’t combined in single fields.

Using DDOD to identify and index data assets

Part of implementing the Federal Government’s M-13-13 “Open Data Policy – Managing Information as an Asset” is to create and maintain an Enterprise Data Inventory (EDI).   EDI is supposed to catalog government-wide SRDAs (Strategically Relevant Data Assets).  The challenge is that the definition of an SRDA is subjective within the context of an internal IT system, there’s not enough budget to catalog the huge number of legacy systems, and it’s hard to know when you’re done documenting the complete set.

Enter DDOD (Demand-Driven Open Data).  While it doesn’t solve these challenges directly, its practical approach to managing open data initiatives certainly can improve the situation.  Every time an internal “system of record” is identified for a DDOD Use Case, we’re presented with a new opportunity to make sure that an internal system is included in the EDI.  Already, DDOD has been able to identify missing assets.

DDOD helps with EDI and field-level data dictionary

But DDOD can do even better.  By working one use case at a time, we have the opportunity to catalog each data asset at a much more granular level.  The data assets on HealthData.gov and Data.gov are cataloged at the dataset level, using the W3C DCAT (Data Catalog) Vocabulary.  The goal is to catalog datasets associated with DDOD use cases down to the field-level data dictionary.  Ultimately, we’d want to attain a level of sophistication at which we’re semantically tagging fields using controlled vocabularies.
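To make "field level" concrete, here's a sketch of a DCAT-style catalog entry extended with a field dictionary. The `fields` key and its structure are invented for illustration, not an HHS or W3C standard, and the vocabulary URL is a placeholder:

```python
# A dataset-level entry (DCAT-style keys) extended with a hypothetical
# field-level dictionary under "fields".
entry = {
    "title": "Example Hospital Quality Dataset",
    "description": "Illustrative quality-of-care measures",
    "keyword": ["hospital", "quality"],
    "fields": [
        {"name": "provider_id", "type": "string",
         "description": "Facility identifier",
         "semantic_tag": "http://example.org/vocab/ProviderId"},
        {"name": "measure_score", "type": "number",
         "description": "Score for the quality measure"},
    ],
}

print([f["name"] for f in entry["fields"]])
```

With entries like this, a search engine can index individual columns and their semantic tags rather than just dataset titles, which is what enables the cross-dataset linking described below.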

Performing field-level cataloging has a couple of important advantages.  First, it enables better indexing and more sophisticated data discovery on HealthData.gov and other HHS portals.  Second, it identifies opportunities to link across datasets from different organizations and even across different domains.  The mechanics of DDOD in relation to EDI, HealthData.gov, data discoverability and linking are further explained in the Data Owners section of the DDOD website.

Note: The HHS EDI is not currently available as a stand-alone data catalog, but it’s incorporated into http://www.healthdata.gov/data.json, because that catalog includes all three access levels: public, restricted public, and non-public datasets.
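Since data.json follows the Project Open Data metadata schema, those three access levels can be tallied straight from its `dataset` array. A sketch over a small inline stand-in rather than the live file:

```python
from collections import Counter

# Inline stand-in for http://www.healthdata.gov/data.json (titles invented).
catalog = {
    "dataset": [
        {"title": "A", "accessLevel": "public"},
        {"title": "B", "accessLevel": "restricted public"},
        {"title": "C", "accessLevel": "non-public"},
        {"title": "D", "accessLevel": "public"},
    ]
}

counts = Counter(d["accessLevel"] for d in catalog["dataset"])
print(dict(counts))  # {'public': 2, 'restricted public': 1, 'non-public': 1}
```

Running the same tally against the live file would show how much of the EDI is exposed only as restricted or non-public listings.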

DDOD Love from Health Datapalooza 2015

Health Datapalooza

Demand-Driven Open Data (DDOD) has gotten a lot of coverage throughout Health Datapalooza 2015.  I participated in 4 panels throughout the week and had the opportunity to explain DDOD to many constituents.

  • Developer HealthCa.mp
    Developer HealthCa.mp is a collaborative event for learning about existing and emerging APIs that can be used to develop applications that will help consumers, patients and/or beneficiaries achieve better care through access to health data, especially their own!  Areas of focus include:
    • Prototype BlueButton on FHIR API from CMS
    • Project Argonaut
    • Privacy on FHIR initiative
    • Sources of population data from CMS and elsewhere around HHS
  • Health Datapalooza DataLab
    HHS has so much data! Medicare, substance abuse and mental health, social services and disease prevention are only some of the MANY topical domains where HHS provides huge amounts of free data for public consumption. It’s all there on HealthData.gov! Don’t know how the data might be useful for you? In the DataLab you’ll meet the people who collect and curate this trove of data assets as they serve up their data for your use. But if you still want inspiration, many of the data owners will co-present with creative, insightful, innovative users of their data to truly demonstrate its alternative value for positive disruptions in health, health care, and social services.

    Moderator: Damon Davis, U.S. Department of Health & Human Services

    Panelists: Natasha Alexeeva, Caretalia; Christina Bethell, PhD, MBA, MPH, Johns Hopkins; Lily Chen, PhD, National Center for Health Statistics; Steve Cohen, Agency for Healthcare Research & Quality; Manuel Figallo, SAS; Reem Ghandour, DrPH, MPA, Maternal and Child Health Bureau; Jennifer King, U.S. Department of Health & Human Services; Jennie Larkin, PhD, National Institutes of Health; Brooklyn Lupari, Substance Abuse & Mental Health Services Administration; Rick Moser, PhD, National Cancer Institute; David Portnoy, MBA, U.S. Department of Health & Human Services; Chris Powers, PharmD, Centers for Medicare and Medicaid Services; Elizabeth Young, RowdMap

  • No, You Can’t Always Get What You Want: Getting What You Need from HHS
    While more data is better than less, pushing out any ol’ data isn’t good enough.  As the Data Liberation movement matures, the folks releasing the data face a major challenge in determining what’s the most valuable stuff to put out.  How do they move from smorgasbord to intentionally curated data releases prioritizing the highest-value data?  Folks at HHS are wrestling with this, going out of their way to make sure they understand what you want and ensure you get the yummy data goodies you’re craving.  Learn how HHS is using your requests and feedback to share data differently.  This session explores HHS’s new initiative, Demand-Driven Open Data (DDOD): a lean startup approach to public-private collaboration.  A new initiative out of the HHS IDEA Lab, DDOD is bold and ambitious, intending to change the fundamental data sharing mindset throughout HHS agencies: from quantity of datasets published to actual value delivered.

    Moderator: Damon Davis, U.S. Department of Health & Human Services

    Panelists: Phil Bourne, National Institutes of Health (NIH); Niall Brennan, Centers for Medicare & Medicaid Services; Jim Craver, MMA, Centers for Disease Control & Prevention; Chris Dymek, EdD, U.S. Department of Health & Human Services; Taha Kass-Hout, Food & Drug Administration; Brian Lee, MPH, Centers for Disease Control & Prevention; David Portnoy, MBA, U.S. Department of Health & Human Services

  • Healthcare Entrepreneurs Boot Camp: Matching Public Health Data with Real-World Business Models
    If you’ve ever considered starting something using health data, whether a product, a service, an offering in an existing business, or a start-up company to take over the world, this is something you won’t want to miss.  In this highly interactive, games-based brew-ha, we pack the room full of flat-out gurus to get an understanding of what it takes to be a healthcare entrepreneur.  Your guides will come from finance and investment; clinical research and medical management; sales and marketing; technology and information services; operations and strategy; analytics and data science; government and policy; business, product, and line owners from payers and providers; and some successful entrepreneurs who have been there and done it for good measure.  We’ll take your idea from the back of a napkin and give you the know-how to make it a reality!

    Orchestrators: Sujata Bhatia, MD, PhD, Harvard University; Niall Brennan, Centers for Medicare & Medicaid Services; Joshua Rosenthal, PhD, RowdMap; Marshall Votta, Leverage Health Solutions

    Panelists: Michael Abate, JD, Dinsmore & Shohl LLP; Stephen Agular, Zaffre Investments; Chris Boone, PhD, Health Data Consortium; Craig Brammer, The Health Collaborative; John Burich, Passport Health Plan; Jim Chase, MHA, Minnesota Community Measurement; Arnaub Chatterjee, Merck; Henriette Coetzer, MD, RowdMap; Jim Craver, MAA, Centers for Disease Control; Michelle De Mooy, Center for Democracy and Technology; Gregory Downing, PhD, U.S. Department of Health & Human Services; Chris Dugan, Evolent Health; Margo Edmunds, PhD, AcademyHealth; Douglas Fridsma, MD, PhD, American Medical Informatics Association; Tina Grande, MHS, Healthcare Leadership Council; Mina Hsiang, US Digital Service; Jessica Kahn, Centers for Medicare & Medicaid Services; Brian Lee, MPH, Centers for Disease Control; David Portnoy, MBA, U.S. Department of Health & Human Services; Aaron Seib, National Association for Trusted Exchange; Maksim Tsvetovat, OpenHealth; David Wennberg, MD, The Dartmouth Institute; Niam Yaraghi, PhD, Brookings Institution; Jean-Ezra Yeung, Ayasdi

 

There were follow-up publications as well.  Among them was “HHS on a mission to liberate health data” from GCN.

GCN article on DDOD
HHS found that its data owners were releasing datasets that were easy to generate and least risky to release, without much regard to what data consumers could really use. The DDOD framework lets HHS prioritize data releases based on the data’s value, because every request is considered a use case. It lets users — be they researchers, nonprofits or local governments — request data in a systematic, ongoing and transparent way and ensures there will be data consumers for information that’s released, providing immediate, quantifiable value to both the consumer and HHS.

My list of speaking engagements at Palooza is here.

Investment Model for Pharma

I had the opportunity to attend a presentation on “Entry and Investment Decisions in the Pharmaceutical Industry” by Anita Rao, PhD, of the Booth School of Business, University of Chicago.  The concepts examined are applicable to any product that has lengthy periods of pre-launch R&D investment in the presence of competing products.  But there’s an aspect of this particular research that’s unique to pharmaceuticals: the uncertainty factor introduced by FDA’s drug approval process.  With that in mind, the paper analyzes historical FDA data to infer how firms working on potentially competing products may respond to each other’s actions prior to approval.

Quick side note…  I love what you can do by analyzing readily available historical data in a new way.  I think there’s an opportunity to improve on this model by leveraging valuable data that’s still buried deep within the FDA.  That’s exactly the kind of opportunity Demand-Driven Open Data (DDOD) was designed to address.

There was one key question the model aimed to answer: what net effect would accelerating the drug approval process have on investment decisions and the NPV (net present value) of each product?  The conclusion reached was that the increased incentive from an accelerated return on investment was significantly stronger than the disincentive from the risk of intensified competition.

Most immediately, this model has the potential to assist investors in making better decisions in regulated industries with substitute products and long investment horizons.  Investors in medical devices, agriculture and alternative energy might also be able to use this model.

But I’d love for this model to go beyond use by investors and help inform public policy.  For that to happen, it needs to take into account a bigger picture, including the cost to the regulating body and, by implication, typically the taxpayer.  So in this particular case, we would need to assess the cost FDA bears in its current approval process, as well as estimate the likely increases from accelerating approvals.

To take the concept further, there are certain questions that policymakers could address in order to maximize the total economic value to all participants.  For example, are there opportunities for firms to fund or offset some of the additional cost of accelerating the approval process?  Is there an efficient way for the FDA to prioritize approvals more dynamically based on economic or public health value?  And is there a way to do so without significant conflicts of interest and with minimal additional risk to consumers?