Category Archives: semantic web

Schema.org publishes health plan and provider network schemas

Some good news on healthcare standards

I have been working with the Google semantic web group for many months to design several schemas that represent healthcare provider networks and health insurance plan coverage.  The good news is that these schemas have been officially published for use with Schema.org.  This is a first step toward wider adoption of a consistent designation for this type of information.  The schemas are listed below, followed by a minimal markup sketch:

Health Insurance Plan: List of health plans and their corresponding networks of providers and drug formularies. http://pending.webschemas.org/HealthInsurancePlan
Health Plan Network: Defines a network of providers within a health plan. http://pending.webschemas.org/HealthPlanNetwork
Health Plan Cost Sharing Specification: List of costs to be paid by the covered beneficiary. http://pending.webschemas.org/HealthPlanCostSharingSpecification
Health Plan Formulary: List of drugs covered by a health plan. http://pending.webschemas.org/HealthPlanFormulary
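
To make the intended use concrete, here is a minimal sketch of marking up a single plan in JSON-LD with these types, generated from Python for readability.  The property names come from the published schemas, but the plan identifier, URL and network name are invented, so check the Schema.org pages above for the authoritative definitions.

```
import json

# A minimal JSON-LD sketch of a HealthInsurancePlan referencing one network and
# one formulary.  Property names should be verified against the Schema.org pages
# above; the plan ID, URL and network name are invented for illustration.
plan = {
    "@context": "http://schema.org",
    "@type": "HealthInsurancePlan",
    "healthPlanId": "12345XX9876543",  # hypothetical HIOS-style plan identifier
    "healthPlanMarketingUrl": "https://example.com/plans/silver",
    "includesHealthPlanNetwork": {
        "@type": "HealthPlanNetwork",
        "healthPlanNetworkId": "PREFERRED",
    },
    "includesHealthPlanFormulary": {
        "@type": "HealthPlanFormulary",
        "offersPrescriptionByMail": True,
    },
}

print(json.dumps(plan, indent=2))
```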

Now for the background…

In November 2015, the US health agency Centers for Medicare & Medicaid Services (CMS) enacted a new regulatory requirement for health insurers who list plans on insurance marketplaces. They must now publish a machine-readable version of their provider network directory and health plan coverage in a specified JSON format and update it at least monthly. Many major health insurance companies across the US have already started publishing their health plan coverage, provider directories and drug formularies to this standard.

The official schema is kept in a GitHub repository: https://github.com/CMSgov/QHP-provider-formulary-APIs.  This format makes it possible to see which changes were made and when.  The repository also has an issues section to facilitate ongoing discussion about the optimal adoption of the standard.  There’s a website that provides a more detailed explanation of the background of this effort: https://www.cms.gov/CCIIO/Resources/Data-Resources/marketplace-puf.html.

This website also includes the “Machine-readable URL PUF” seed file, which points to the actual data published by each insurance company.  The file contains URLs that can be crawled to aggregate the latest plan and provider data.
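
As a rough sketch of such a crawl, the snippet below assumes the seed file has been exported to a CSV named machine_readable_url_puf.csv with the index URL in a column called “URL Submitted” (both the file name and the column name are assumptions to check against the actual PUF), and that each index.json lists plan_urls, provider_urls and formulary_urls as described in the QHP-provider-formulary-APIs spec.

```
import csv
import json
import urllib.request

# Sketch: aggregate plan/provider/formulary URLs from the CMS seed file.
# Assumes the "Machine-readable URL PUF" has been exported to the CSV below with
# the index URL in a column named "URL Submitted" (file and column names are
# assumptions); each index.json is expected to list plan_urls, provider_urls and
# formulary_urls per the QHP-provider-formulary-APIs spec.
def collect_urls(seed_csv="machine_readable_url_puf.csv"):
    collected = {"plan_urls": [], "provider_urls": [], "formulary_urls": []}
    with open(seed_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            index_url = (row.get("URL Submitted") or "").strip()
            if not index_url:
                continue
            try:
                with urllib.request.urlopen(index_url, timeout=30) as resp:
                    index = json.load(resp)
            except Exception as err:
                print("skipping", index_url, err)
                continue
            for key in collected:
                collected[key].extend(index.get(key, []))
    return collected

if __name__ == "__main__":
    urls = collect_urls()
    print({key: len(values) for key, values in urls.items()})
```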

In terms of adoption, U.S. health plans that participate in insurance marketplaces have published: *

  • 39 states
  • 398 health plans
  • ~26,000 URLs describing insurance coverage, provider networks, drug formularies

* Updated November 2016


A group of companies representing the provider, payer and consumer segments of healthcare convened to discuss the standard throughout 2015.  The considerations that went into the formation of the standard can be found at: http://ddod.healthdata.gov/wiki/Interoperability:_Provider_network_directories

Open Data Discoverability

I’m adding a working document to cover the topic of open data discoverability and usability.  This appears to be an area in desperate need of attention.  I have come across it tangentially throughout much of my work, and it deserves to be aggregated and curated.  There are also some lingering opportunities to make practical use of semantic web concepts.  There are vast repositories of data assets throughout government, academia and industry that could be better leveraged.  So let’s make it happen.

 

DDOD featured on Digital Gov

The Demand-Driven Open Data (DDOD) program has recently been featured on DigitalGov.  (See the DigitalGov article.)

It should be added that a major project in the works is the merging of DDOD tools and methodologies into the larger HealthData.gov program.  The effort seeks to maximize the value of existing data assets from across HHS agencies (CMS, FDA, CDC, NIH, etc.).  New features to enhance data discoverability and usability are already planned.

We’re also looking into how to improve the growing knowledge base of DDOD use cases by leveraging semantic web and linked open data (LOD) concepts.  A couple of years ago, HHS organized the Health Data Platform Metadata Challenge – Health 2.0.  The findings from this exercise could be leveraged for both DDOD and HealthData.gov.


Using DDOD to identify and index data assets

Part of implementing the Federal Government’s M-13-13 “Open Data Policy – Managing Information as an Asset” is to create and maintain an Enterprise Data Inventory (EDI).   EDI is supposed to catalog government-wide SRDAs (Strategically Relevant Data Assets).  The challenge is that the definition of an SRDA is subjective within the context of an internal IT system, there’s not enough budget to catalog the huge number of legacy systems, and it’s hard to know when you’re done documenting the complete set.

Enter DDOD (Demand-Driven Open Data).  While it doesn’t solve these challenges directly, its practical approach to managing open data initiatives can certainly improve the situation.  Every time an internal “system of record” is identified for a DDOD Use Case, we’re presented with a new opportunity to make sure that system is included in the EDI.  DDOD has already been able to identify missing assets.

DDOD helps with EDI and field-level data dictionary

But DDOD can do even better.  By focusing on one Use Case at a time, we have the opportunity to catalog each data asset at a much more granular level.  The data assets on HealthData.gov and Data.gov are cataloged at the dataset level, using the W3C DCAT (Data Catalog) Vocabulary.  The goal is to catalog the datasets associated with DDOD Use Cases down to the level of a field-level data dictionary.  Ultimately, we’d want to attain a level of sophistication at which we’re semantically tagging fields using controlled vocabularies.

Performing field-level cataloging has a couple of important advantages.  First, it enables better indexing and more sophisticated data discovery on HealthData.gov and other HHS portals.  Second, it identifies opportunities to link across datasets from different organizations and even across different domains.  The mechanics of DDOD in relation to EDI, HealthData.gov, data discoverability and linking are further explained in the Data Owners section of the DDOD website.
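
As a rough illustration of what that field-level target could look like (not an official HealthData.gov or EDI schema), a dataset-level record might be extended with a fields array in which each field carries a description and a concept URI from a controlled vocabulary.  The dataset, field names and URIs below are made up.

```
import json

# Illustrative only: a dataset entry that carries both dataset-level (DCAT-style)
# metadata and a field-level data dictionary, with each field tagged against a
# controlled vocabulary.  Dataset, field names and concept URIs are made up.
dataset_entry = {
    "title": "Hospital General Information",          # dataset-level metadata
    "identifier": "hospital-general-information",
    "distribution": [{"mediaType": "text/csv"}],
    "fields": [                                        # field-level data dictionary
        {"name": "provider_id", "type": "string",
         "description": "Identifier of the hospital",
         "concept": "http://example.org/vocab/ProviderIdentifier"},
        {"name": "state", "type": "string",
         "description": "Two-letter state abbreviation",
         "concept": "http://example.org/vocab/USState"},
    ],
}

print(json.dumps(dataset_entry, indent=2))
```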

Note: HHS EDI is not currently available as a stand-alone data catalog.  But it’s incorporated into http://www.healthdata.gov/data.json, because this catalog includes all 3 types of access levels: public, restricted public, and non-public datasets.

Field-level data dictionaries for open data

Typically, publicly available open data repositories, especially those hosted or indexed via CKAN, have been described only at the dataset level.  That is, datasets are typically described in a DCAT-compatible schema.  This includes the metadata schema required by Project Open Data for Data.gov and all agency-specific data hosting websites.

But ideally, the cataloging of these datasets should move to a more granular level of detail: the field level.  Doing so makes it possible for search capabilities to go well beyond the typical tags and predefined categories.  With fields defined, we can quickly find all datasets that have common fields.  That in turn makes it easier to find opportunities for linking across datasets and allows for a related-dataset recommendation engine.  The solution becomes even more powerful if the fields are labeled with a predefined semantic vocabulary that is globally and uniquely defined.  (See the approach described in the Health 2.0 Metadata Challenge.)
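
As a minimal sketch of why this matters, assume each dataset’s fields have been tagged with globally unique concept URIs; an inverted index from tag to dataset immediately surfaces common fields, which are the candidates for linking and for a related-dataset recommendation engine.  The dataset names, field names and vocabulary URLs below are invented.

```
from collections import defaultdict

# Sketch: once each dataset's fields are tagged with globally unique concept URIs,
# an inverted index from tag to dataset surfaces common fields across datasets.
# Dataset names, field names and vocabulary URLs are invented.
catalog = {
    "hospital-compare": {"provider_id": "http://example.org/vocab/NPI",
                         "state": "http://example.org/vocab/USState"},
    "physician-referrals": {"npi": "http://example.org/vocab/NPI",
                            "referral_count": "http://example.org/vocab/Count"},
}

datasets_by_tag = defaultdict(set)
for dataset, fields in catalog.items():
    for field, tag in fields.items():
        datasets_by_tag[tag].add(dataset)

# Any tag shared by two or more datasets is a potential join key / recommendation.
for tag, datasets in sorted(datasets_by_tag.items()):
    if len(datasets) > 1:
        print(tag, "is shared by", sorted(datasets))
```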

One challenge to this goal is that CKAN has not historically been good at defining a standard, machine-readable data dictionary.  We’ve examined a range of standards and suggestions for defining data dictionaries.  These include common SQL DDL, XML, JSON, and YAML formats; a small sketch that maps a SQL table definition to a JSON Table Schema-style field list follows this list.

* ANSI SQL Standard 
   - DDL (Data Definition Language): "CREATE TABLE"
   - SQL/Schemata
 
        ```
        testdb-# \d company
                    Table "public.company"
          Column   |     Type      | Modifiers
        -----------+---------------+-----------
         id        | integer       | not null
         name      | text          | not null
         address   | character(50) |
         join_date | date          |
        Indexes:
            "company_pkey" PRIMARY KEY, btree (id)
        ```

* JSON Table Schema: http://dataprotocols.org/json-table-schema/

    ```
    "schema": {
      "fields": [
        {
          "name": "name of field (e.g. column name)",
          "title": "A nicer human readable label or title for the field",
          "type": "A string specifying the type",
        },
        ... more field descriptors
      ],
      "primaryKey": ...
      "foreignKeys": ...
    }
    ```

* YAML schema files used for Doctrine ORM: http://doctrine.readthedocs.org/en/latest/en/manual/yaml-schema-files.html

* XML schema syntax for Google's DSPL (Dataset Publishing Language): https://developers.google.com/public-data/docs/schema/dspl9

* W3 XML Schema: http://www.w3.org/TR/xmlschema-2/#built-in-primitive-datatypes
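
As noted above, here is a small sketch that bridges two of these options by deriving a JSON Table Schema-style field list from a relational table definition.  It uses an in-memory SQLite table standing in for the “company” example; against PostgreSQL or Oracle you would read information_schema or ALL_TAB_COLUMNS instead.

```
import json
import sqlite3

# Sketch: derive a JSON Table Schema-style field list from a relational table.
# SQLite's PRAGMA table_info stands in for information_schema / ALL_TAB_COLUMNS.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE company (
        id        INTEGER NOT NULL,
        name      TEXT    NOT NULL,
        address   CHAR(50),
        join_date DATE
    )
""")

fields = [
    {"name": name, "type": col_type.lower(),
     "constraints": {"required": bool(notnull)}}
    for _cid, name, col_type, notnull, _default, _pk
    in conn.execute("PRAGMA table_info(company)")
]

print(json.dumps({"fields": fields, "primaryKey": "id"}, indent=2))
```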


CSV storage formats
* Open Knowledge Data Packager - CKAN Extension http://ckan.org/2014/06/09/the-open-knowledge-data-packager/

* Tabular Data Package Spec: http://dataprotocols.org/tabular-data-package/

* The above two are also part of a W3C standards track:
http://www.w3.org/TR/tabular-data-model/

 

Enter the all-powerful CSV

CSV is often a desired format because of its high interoperability.  However, it suffers from the fact that its metadata needs to be kept separately.  This in turn causes challenges with version control, broken links and correctly identifying the column order.  There’s also the all-too-common and annoying test that has to be performed to determine whether the first row is data or a column header.

So is there an elegant, machine-readable, standard-ish way to embed the metadata within the data file itself?  OKFN suggests that the solution could be accomplished via Tabular Data Packages.  Basically, you have the option to provide the data “inline”, directly in the datapackage.json file.  The inline data sits alongside the full schema (as per JSON Table Schema) and CSV dialect (as per the CSVDDF Dialect specification) in the same file.  We just need simple scripts that can eventually extract these components into separate CSV files and a JSON Table Schema.  Open Knowledge Data Packager is a CKAN extension that makes use of JSON Table Schema and Tabular Data Package for hosting datasets on the CKAN FileStore.
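
Here is a minimal sketch of that idea: a Tabular Data Package-style document carrying inline data, a JSON Table Schema and a CSV dialect in a single datapackage.json structure, plus the kind of simple script that extracts the inline rows back out to CSV.  The resource name, fields and values are made up.

```
import csv
import json

# Sketch of a Tabular Data Package-style datapackage.json with inline data,
# a JSON Table Schema and a CSV dialect, plus a script that writes the inline
# rows back out as a CSV file.  Resource name and fields are illustrative.
datapackage = {
    "name": "provider-directory-sample",
    "resources": [{
        "name": "providers",
        "schema": {"fields": [{"name": "npi", "type": "string"},
                              {"name": "specialty", "type": "string"}]},
        "dialect": {"delimiter": ",", "header": True},
        "data": [{"npi": "1234567890", "specialty": "cardiology"},
                 {"npi": "0987654321", "specialty": "pediatrics"}],
    }],
}

resource = datapackage["resources"][0]
field_names = [f["name"] for f in resource["schema"]["fields"]]
with open(resource["name"] + ".csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.DictWriter(out, fieldnames=field_names,
                            delimiter=resource["dialect"]["delimiter"])
    writer.writeheader()
    writer.writerows(resource["data"])

# The schema (and dialect) can likewise be written to their own files.
print(json.dumps(resource["schema"], indent=2))
```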

Finally, there’s a helpful article on Implementing CSV on the Web, and the W3C’s CSV working group is seeking feedback on a model and vocabulary for tabular data.

 

Is “SchemaStore” CKAN’s mystical unicorn?

As mentioned previously, CKAN hasn’t been strong at storing and managing standard, machine-readable data dictionaries.  So a special shout-out goes to Greg Lawrence, who has figured out how to work around this limitation.  He built a CKAN “SchemaStore” and a custom Java app that indexes content into CKAN’s DataStore object, grabbing the needed information by running SQL exports on Oracle tables.  The code that enables SchemaStore is incorporated into the BC Data Catalogue CKAN extension on GitHub.  The field tags are defined in the edc_datasets.py file of this repository.

An example of the SchemaStore implementation can be found in this sample dataset under the “Object Description” section.  Here you’re able to see all of the relevant elements from the Oracle table object: Column Name, Short Name, Data Type, Data Precision, and Comments.  The data dictionary for this dataset is in machine readable JSON format.  For example, the first 3 fields of the data dictionary are:

```
"details": [
  {
    "data_precision": "0",
    "column_comments": "The date and time the information was entered.",
    "data_type": "DATE",
    "short_name": "TIMESTAMP",
    "column_name": "ENTRY_TIMESTAMP"
  },
  {
    "data_precision": "0",
    "column_comments": "The identification of the user that created the initial record.",
    "data_type": "VARCHAR2",
    "short_name": "ENT_USR_ID",
    "column_name": "ENTRY_USERID"
  },
  {
    "data_precision": "0",
    "column_comments": "A feature code is most importantly a means of linking a features to its name and definition.",
    "data_type": "VARCHAR2",
    "short_name": "FEAT_CODE",
    "column_name": "FEATURE_CODE"
  }, ...
],
```

See related issue for Data.gov (“Make field level metadata searchable and link common fields across the catalog”): https://github.com/GSA/data.gov/issues/640
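
As a rough sketch of how such a field-level dictionary could be stored in CKAN, the standard datastore_create action can take the column descriptions as records.  This is not the BC Data Catalogue extension’s own code, and the CKAN URL, API key and resource ID below are placeholders.

```
import json
import urllib.request

# Rough sketch: load a field-level data dictionary (like the "details" list above)
# into CKAN's DataStore via the datastore_create action API.  The CKAN URL,
# API key and resource_id are placeholders.
CKAN_URL = "https://catalog.example.org"
API_KEY = "YOUR-API-KEY"

data_dictionary = [
    {"column_name": "ENTRY_TIMESTAMP", "data_type": "DATE",
     "column_comments": "The date and time the information was entered."},
]

payload = json.dumps({
    "resource_id": "RESOURCE-ID",   # existing resource the dictionary describes
    "fields": [{"id": "column_name", "type": "text"},
               {"id": "data_type", "type": "text"},
               {"id": "column_comments", "type": "text"}],
    "records": data_dictionary,
    "force": True,
}).encode("utf-8")

request = urllib.request.Request(
    CKAN_URL + "/api/3/action/datastore_create", data=payload,
    headers={"Content-Type": "application/json", "Authorization": API_KEY},
)
with urllib.request.urlopen(request) as response:
    print(json.load(response)["success"])
```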

What Happened to the Semantic Web?

It looks bleak

Over the past few years, questions have been asked about the viability of the Semantic Web (aka SemWeb) envisioned by Tim Berners-Lee.  In the strictest sense, the original standards set out by the W3C have not proliferated at any great pace and have not been widely adopted commercially. There are also no multi-billion dollar acquisitions or IPOs in the SemWeb space.  Even in government and academia, the vast majority of “open data” is in traditional relational form (rather than RDF linked datasets) and doesn’t reference widely adopted ontologies.

Evidence of decline?

 

But it’s a matter of framing

The outlook changes drastically if we look at the question a bit differently. Rather than defining the SemWeb as the original set of standards or narrow vision, what if we look at related technologies that it may have spawned or influenced?  Then a number of success stories emerge.  We have the tremendous growth of Schema.org and the adoption of Microdata among the three big search engines: Google, Yahoo, and Bing.  We also have SemWeb concepts applied in Google’s Knowledge Graph, Google Rich Data Snippets, and the Facebook Social Graph.  Even IBM’s Watson is no longer just an IBM Research project; it’s being commercialized into IBM’s verticals, including healthcare, insurance and finance. So SemWeb technologies are alive, in a sense.  For the purpose of clarity, let’s refer to the original W3C vision discussed since 2001 as the “old SemWeb” and the recent commercial successes as the “new SemWeb”.  Of course, these are fuzzy definitions, since the new SemWeb is not formally defined.

 

What’s wrong with the original vision?

The W3C breaks the elements of the old SemWeb into: (1) Linked Data, (2) Vocabularies, (3) Inference, and (4) Query.  Each of these is widely in use today, but in ways that differ from the original specs.  For example, linked data implemented as Microdata or JSON-LD has gained popularity over the heavier and more verbose RDF/XML.  Most websites forgo formally defined OWL ontologies for vocabularies found in databases like Schema.org or Freebase.  Rule engines and reasoners are already built into products we use; they are what happens in the “brains” of Google’s page rank and ad optimization algorithms.  And instead of the SPARQL query language, humans often interact with the new SemWeb through natural language searches, while machines do so through RESTful APIs.  IBM’s Watson translates questions into sophisticated queries involving federation and inference against its knowledge base.

There are a couple of other difficulties with the old SemWeb worth noting.  It’s been said that it’s too rigid to effectively keep up with today’s rate of data creation and structural evolution; the overhead of frequent updates to ontologies, tagging and linkages is just too high.  Another problem is the anemic adoption of the SPARQL language.  The high level of both technical and domain proficiency required to leverage SPARQL directly, especially when it comes to federated queries or those involving inference, is simply impractical in most commercial situations.  However, it might be feasible to have such skills in a highly specialized domain, such as the human genome project.  (See the post on a case study of such a SemWeb implementation.)

But even in highly specialized domains, you run into another problem: ontological realism.  This is the problem of ontological “silos” that naturally occur when ontologies are optimized for a specific domain and then need to integrate with ontologies built for neighboring domains.  Such silos reduce the effectiveness of SemWeb efforts, because they impair the ability to run queries and inference across multiple data sources.  There needs to be a widely adopted base ontology and corresponding design methodology that works across multiple domains, yet doesn’t interfere with any specific domain.  Because ontologies need to evolve over time, consistent effort is needed to adhere to such methodologies and avoid eventual silos.

Why has adoption of the old SemWeb lagged that of simpler implementations, like Schema.org?  One could draw an analogy to the adoption of API integration standards, where REST/JSON has overtaken SOAP/XML.  (See chart below.)  To understand why, we need to look at the domains in which these technologies are applied.  The compelling use case of loose coupling between unrelated companies or independent teams favored the simplicity of REST.  That said, within the confines of large corporate environments, the rigor of SOAP implementations still makes sense.

Analogy of REST vs. SOAP to the semantic web

 

When does it make sense?

One of the biggest challenges to the adoption of the old SemWeb has been the lack of clear commercial benefits.  To many corporate CIOs and CTOs, any potential benefit was overshadowed by the TCO (total cost of ownership, including migration overhead and ongoing maintenance).  No doubt the technology and concepts proposed for the old SemWeb are exhilarating.  But rather than falling in love with the technology, the key to adoption has been the existence and realization of a clear business case.  That’s exactly what’s been happening for the successful implementations of the new SemWeb.  For example, Google sees tremendous ROI in implementing its Knowledge Graph, because it greatly improves ad revenue.  Webmasters and Google’s advertisers, in turn, are eager to organize and tag their content per Schema.org for the purpose of SEO/SEM.

Sure, that’s fine for deep-pocketed visionaries like Google.  But how about the risk averse?  How would they know when there’s likely to be sufficient ROI in adopting SemWeb technologies?  CEOs and CTOs looking to incorporate such technologies into their product lines might watch for a trend of increasing acquisitions or VC funding for SemWeb-related services.  CIOs looking to support their business operations might wait to hear about success stories from similar corporate implementations.  Researchers and universities may ask whether there have been any discoveries substantially aided by SemWeb initiatives.

Additionally, there may be some hope even for the aspects of the old SemWeb vision that haven’t gained adoption yet.  The LOD2 Technology Stack is being funded by the European Commission within the Seventh Framework Programme. It is a set of standards and integrated semantic web tools being developed in conjunction with the EU Open Data Portal. It’s too early to see any obvious success stories. But it’s quite possible that such government support will lead to unexpected new developments from SemWeb efforts. After all, the US Department of Defense’s funding of ARPANET led to the development of the Internet.

There are many paths to adopting the new SemWeb.  Go find yours.

Case study in Linked Data and Semantic Web: The Human Genome Project

The National Human Genome Research Institute’s “GWAS Catalog” (Genome-Wide Association Studies) project is a successful implementation of Linked Data (http://linkeddata.org/) and Semantic Web (http://www.w3.org/standards/semanticweb/) concepts.  This article discusses how the project has been implemented, challenges faced and possible paths for the future.