Data.world Platform

Recently I’ve been playing around with the relatively new Data.world platform, which aims to be the social network for data people.  I tried some of the query and visualization features, as well as uploading datasets.  It seems to have a lot of promise, and I really hope it takes off.  At the moment, the biggest challenge for me is distinguishing which datasets are trustworthy and clean.  I also think they have an opportunity to greatly improve dataset search by providing advanced search directives, such as explicitly specifying fields, geographic locations, time frames or semantic tags.

Today, I received an email from Data.world listing their users’ 10 favorite features.  They’re listed below, and I’m sure exploring them could provide countless hours of fun.

10) Add metadata with data dictionaries, file tags, and column descriptions
9) Instantly query and join .csv and .json files using SQL
8) Join your local data with other datasets in data.world
7) Showcase your code, data, and documentation in Python and R notebooks
6) See inside files before downloading
5) Use integrations for R, Python, or JDBC to pull data into your tools of choice (see the sketch after this list)
4) Enrich your analyses with U.S. Census ACS summary files
3) Export datasets as Tabular Data Packages, a standard machine-readable format
2) Pull data directly into Tableau to create your visualizations
1) Add and sync files from GitHub, Google Drive, or S3
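
As a quick illustration of items 9 and 5, here is a minimal sketch of querying a hosted dataset with SQL through the datadotworld Python client and pulling the results into pandas.  The dataset key and table name are hypothetical, and the client first needs to be configured with your API token (via the dw configure command).

import datadotworld as dw

# Hypothetical dataset key and table name; replace with a dataset you can access.
DATASET_KEY = 'some-owner/some-dataset'

# Run a SQL query against the dataset hosted on data.world...
results = dw.query(DATASET_KEY, 'SELECT * FROM some_table LIMIT 10')

# ...and work with the results as a pandas DataFrame.
df = results.dataframe
print(df.head())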

 


Open Data Panel at All Things Open Conference

Open Data Panel to be Featured at All Things Open

Open Data will be a featured panel discussion at the All Things Open conference this year.  With a new administration set to transition into place in January and multiple new initiatives starting at both the state and federal levels, the topic has never been more important.  The session, which will take place Wednesday, October 26 at 1:30 pm ET, will feature some of the foremost open data experts in the world.

Topics to be discussed will include:

  • The new Open Data Transition Report
  • Future opportunities for open data at the federal and local levels, including the DATA Act
  • How the open data landscape is evolving, particularly through Demand-Driven Open Data (DDOD)
  • How the panel’s insights can help local governments create demand-driven open data programs

The world-class lineup of panel members will include:

  • Joel Gurin (President and Founder, Center for Open Data Enterprise)
  • Hudson Hollister (Founder and Executive Director, Data Coalition)
  • David Portnoy (Founder, Demand-Driven Open Data)
  • Tony Fung (Deputy Secretary of Technology, State of Virginia)
  • Andreas Addison (President, Civic Innovator LLC)
  • Sam McClenney (Economist, RTI International)
  • Caroline Sullivan (Wake County Commissioner)

The panel is open to attendees of All Things Open, the largest “open” technology event on the East Coast of the United States.

 


Python Serverless Microframework

AWS has introduced a Python serverless microframework.  It’s a beautiful concept, making it super simple to create and deploy an API with virtually infinite scalability.  It does so by leveraging Amazon API Gateway and AWS Lambda without the associated learning curve.  That said, the ideal use case for the framework is rapid prototyping or highly scalable deployment of a very simple REST API.  It’s the right choice as long as the API you need is a good candidate for development in Python with Flask-like view decorators.

The framework is accessible via the Chalice GitHub repository, which provides a CLI (command line tool) for creating, deploying, and managing your app.  All you need to deploy a new API is to put your app in a Python file.  How little effort is needed to get going?  How about this…

pip install chalice
chalice new-project helloworld
cd helloworld
chalice deploy

Now you’re ready to hit the endpoint you configured!
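
For reference, the app.py that new-project scaffolds looks roughly like the following: a single decorated function becomes an API Gateway endpoint backed by Lambda.

# A minimal Chalice app (roughly what new-project generates for you).
from chalice import Chalice

app = Chalice(app_name='helloworld')

@app.route('/')
def index():
    # Returned dicts are automatically serialized to a JSON response.
    return {'hello': 'world'}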

(Note that as a tradeoff for simplicity, not all API Gateway and Lambda features are exposed through Chalice.  There is, however, a simple way to consume AWS Lambda’s built-in logging via Amazon CloudWatch Logs, such as the chalice logs command.)


Predictive Analytics: Why IoT is different

The challenge: Moving from batch retrospective to real-time predictive analytics

First, let’s get the lay of the land.  Batch-processed analytics has a solid track record, with well-defined use cases and best practices.  Yet most analytics today are still limited to getting retrospective answers.  Many organizations are turning to more forward-looking predictive analytics to broaden the actions they can take beyond the confines of historical decisions and to add the ability to ask what-if questions.

The Internet of Things (IoT), on the other hand, is still quite new, with methodologies for solving its unique problems, such as real-time processing and connectivity limitations, still being hashed out.  Now add the desire to move from retrospective to predictive, and you have a world of new challenges.  In the past few years there have been great strides in both technologies and methods for processing data in real time.  But the move to predictive often takes an even bigger leap, because each increment of additional insight requires an exponential increase in data ingestion and processing capacity.  As a result, predictive analytics still accounts for only a small fraction of a typical organization’s analytical capabilities.

 

IoT adds another twist: Network and processing bottlenecks

What makes IoT different when it comes to predictive analytics?  In some applications, the majority of collected data loses value within milliseconds.  Historically, data collection has been the hard part of a predictive analytics system.  That’s shifting, however, especially with advances in industrial IoT technologies.  IoT brings capabilities to scale the volume of data collection while reducing latency.  As a result, collection is becoming the easy part, and the bottlenecks move to sanitization, modeling and integration.  This in turn makes the downstream components of analytics and taking action more challenging.

When looking at optimizing an IoT implementation, it’s important to balance the roles and capabilities of “edge” vs. “cloud”.  Edge refers to specialized infrastructure that can improve performance through physical proximity.  It enables analytics and knowledge generation to occur closer to the source of the data.  Edge gives you responsiveness, but not scale.  Cloud, on the other hand, gives you scale but not responsiveness.  

 

Making it work: Configuring the 3 components

There are three core components in any IoT implementation.  First, collection, which includes sensing, network, storage and query capability.  Second, learning, to analyze the data and generate predictions.  And third, taking action, typically through automated methods, on the analytics from the prior stages.  In a traditional cloud-centric IoT architecture, while the actual sensors sit outside of the cloud, the collect, learn and act components of the system often run into responsiveness challenges.  In situations where data collection volumes are particularly large, the overhead of network communications has a significant impact on cost, sometimes up to 50% of the entire system’s cost.

Moving any portion of these three components to the edge results in performance gains, because less data needs to be moved between them.  More buffering and storage can occur at the edge as the cost of memory and disk continues to drop.  (It should be noted, however, that the storage and query functions of the system may become less prevalent as data is processed and acted on in real time.)  Various aspects of data filtering, computation and predictive analytics can then be executed at the edge, so that only the data required for centralized processing needs to be moved, as in the sketch below.  These gains need to be balanced against the increased system complexity that results from edge resources not being continuously connected to the network.  Clearly there are many ways to implement and fine-tune predictive analytics for IoT, and doing so will only get easier as the field matures.
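
As a purely illustrative sketch (not taken from the talk referenced below), here is one way an edge node might buffer readings locally and forward only the unusual ones to the cloud.  The window size, threshold, and send_to_cloud callback are all hypothetical.

import json
import statistics
from collections import deque

# Hypothetical edge-side filter: keep a short window of sensor readings and
# forward only statistically unusual values for centralized processing.
WINDOW = deque(maxlen=100)      # recent readings buffered at the edge
THRESHOLD_SIGMAS = 3.0          # assumed z-score cutoff

def handle_reading(value, send_to_cloud):
    """Buffer locally; forward only readings that deviate from the recent norm."""
    WINDOW.append(value)
    if len(WINDOW) < 10:
        return                               # not enough history yet
    mean = statistics.mean(WINDOW)
    stdev = statistics.pstdev(WINDOW) or 1e-9
    if abs(value - mean) / stdev >= THRESHOLD_SIGMAS:
        send_to_cloud(json.dumps({"value": value, "window_mean": mean}))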

Predix IoT cloud platform
For context…  The subject matter for this post came from a talk by Venu Vasudevan, Professor of Electrical & Computer Engineering at Rice University, where we discussed what makes IoT more challenging specifically when it comes to predictive analytics.  The topic was presented at an IoT meetup for the Predix platform, a cloud-based IoT platform-as-a-service (PaaS) created by GE.  It’s open source and built on Cloud Foundry’s stack.  Predix is available on AWS and Azure cloud services.


Leveraging healthcare data for consumer solutions

On April 23, 2016, over 300 developers from around the country descended on San Francisco for the weekend to tackle some of the hardest challenges facing the nation.  The event was called BayesHack, sponsored by the nonprofit Bayes Impact.  There were representatives from 7 cabinet-level federal agencies present to set up the 11 “prompts”, mentor the teams and judge the entries.  The prompts for the U.S. Department of Health and Human Services and the Department of Veterans Affairs asked challenging questions on how to leverage existing datasets…

  • How can data connect individuals with the health providers they need?
  • How can data get help to sufferers of opioid addiction?
  • How can data predict and prevent veteran suicide?
  • How can data tackle End Stage Renal Disease (ESRD) and Chronic Kidney Disease (CKD)?

 

As part of the judging process, the teams had to pitch their solutions to both agency and private sector judges, such as partners at Andreessen Horowitz.  All teams submitted their code to the event’s GitHub account, so that it could be used for judging and to ensure that it would be available in the public domain.  For hackathons such as this one, it’s important to recognize that even if similar commercial products already exist, getting solutions into the public domain makes it possible for others to build on them later.  (Incidentally, this focus on actual working prototypes via GitHub is surprisingly lacking from many hackathons.  Bayes did a great job focusing on potential implementation beyond just the weekend.)

Of particular focus was the “How can data connect individuals with the health providers they need?” prompt, since the underlying data has only recently become available due to a regulatory requirement.  The data consists of commercial healthcare provider networks for plans on ACA insurance marketplaces, including plan coverage, practice locations, specialties, copays and drug formularies.  There were 7 team submissions, most of which focused on usability for consumers and advanced analytics for policy makers.  Some teams expanded the scope to include not just insurance selection, but access to care in general.

To summarize some of the novel ideas in the solutions…

  • Simplified mobile-first user experience, resembling TurboTax for health selection
  • Visualizations and what-if analysis for policy makers
  • Voice recognition and NLP, as in Google freeform search instead of menus and buttons
  • Ranking algorithms and recommendation engines
  • Ingesting additional 3rd party information (such as Vitals, Yelp, and Part D claims) for consumers who need additional information before they can make an informed choice
  • Providing an API for other apps to leverage
  • Enabling self-reporting of network accuracy, like GasBuddy for health plan coverage

Here are some notable entries for this prompt:

The Hhs-marketplace team created an app that leverages chart visualizations to let a consumer compare plan attributes against benchmarks, such as state averages.  In their example, a user enters a zip code and the specialists they’re interested in seeing; the app finds the plans that meet those criteria, displays cost comparisons for them and provides a graphical comparison of the options.

 

The Fhir salamander team created a mobile-first responsive web front end that takes the user through a series of simple menu choices to arrive at recommended plans.  Along the way, for convenience and efficiency, it lets the user click a button to place a telephone call to a plan (to ensure that the doctor they want is taking new patients from that plan) or to view the summary plan description files.

In working on the challenge, the team transcribed the JSON provider network schema into a relational model.  They reported identifying data quality issues and therefore needing to clean up the raw data in order to use it for analytics.  They also generated state-level statistics to assist in comparisons.  The app is written in JavaScript, while the analytics are in Python.  They feel that the relational model, the code to load it and the code to clean up the data could be reused elsewhere.  While the AWS website (http://tiny.cc/bayeshhs_fsdemo) is no longer live, the deck is available (http://tiny.cc/bayeshhs_fs).

The Hhs insights team produced an interactive provider density map.  Their approach was to target policy makers rather than consumers.  For that purpose, they built aggregate analytics and related visualizations; for example, their code uses DOL MSAs (Metropolitan Statistical Areas) for GeoJSON calculations and visualizations.  To enable the needed analytics, they had to take on the challenge of normalizing the JSON schema of provider networks into a tabular format, as well as pre-calculating several aggregate metrics.
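
As a hedged illustration of the kind of JSON-to-tabular normalization described by the last two teams, here is a minimal Python sketch that flattens a CMS-style providers.json file into provider-plan rows.  The field names only approximate the QHP provider schema and should be verified against the CMSgov/QHP-provider-formulary-APIs repository.

import csv
import json

def flatten_providers(json_path, csv_path):
    # Assumption: the file is a JSON array of provider records with fields
    # resembling npi, addresses, specialty, and plans[{plan_id, network_tier}].
    with open(json_path) as f:
        providers = json.load(f)
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["npi", "state", "specialty", "plan_id", "network_tier"])
        for p in providers:
            address = (p.get("addresses") or [{}])[0]
            for plan in p.get("plans", []):
                for spec in p.get("specialty", ["UNKNOWN"]):
                    writer.writerow([
                        p.get("npi"),
                        address.get("state"),
                        spec,
                        plan.get("plan_id"),
                        plan.get("network_tier"),
                    ])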

The Hhs marketplace finder team created an app that displays the pros and cons of the top 5 plan options for the user, along with visualizations that make quantitative comparisons easy to understand.  Bad choices are suppressed to avoid screen clutter.  It starts with fewer than 10 simple questions, then adds a prediction of the user’s healthcare needs, determined from statistics by age, gender, preexisting conditions and location.  Eventually it would make it possible for a user to estimate their total cost under different events, such as hospitalization or illness.

A data science team from Berkeley, calling themselves Semantic Search, submitted an extremely ambitious project: essentially, a Google PageRank for healthcare decisions.  Instead of the menus and buttons of a traditional app UI, this solution used a freeform field for the user to indicate what they were looking for.  The goal is to let a consumer who is not tech-savvy explain their situation in a natural way, without the interface and technology getting in the way.  Under the covers it uses natural language processing, ranking algorithms and a recommendation engine.  The user is ultimately presented with the top few plans, along with explanations of why they were recommended.  To make the solution possible, the app has to collect behavioral data logs, use logistic regression to predict the probability that a given plan would work, and leverage the LETOR ranking mechanism to provide answers.
As an interesting side note, a Schema.org standard for U.S. health insurance networks has recently been adopted.  Eventually, medical groups and insurance companies will be able to publish semantically tagged information directly to the web, bypassing the current single point of collection at CMS.  This would allow for a growth of relevant data that could be used by applications like this one.

 

 

Disclaimer: The challenge prompt used for HHS does not constitute the department’s official stance or endorsement of this activity.  It was used in an unofficial capacity only and intended to take advantage of data newly available from industry due to changes in regulations of the health insurance marketplace.

 


Schema.org publishes health plan and provider network schemas

Some good news on healthcare standards

I have been working with the Google semantic web group for many months to design several schemas that represent healthcare provider networks and health insurance plan coverage.  The good news is that these schemas have now been officially published for use with Schema.org.  This is the first step toward wider adoption of a more consistent representation of this type of information.  The schemas are:

  • Health Insurance Plan: List of health plans and their corresponding networks of providers and drug formularies. http://pending.webschemas.org/HealthInsurancePlan
  • Health Plan Network: Defines a network of providers within a health plan. http://pending.webschemas.org/HealthPlanNetwork
  • Health Plan Cost Sharing Specification: List of costs to be paid by the covered beneficiary. http://pending.webschemas.org/HealthPlanCostSharingSpecification
  • Health Plan Formulary: List of drugs covered by a health plan. http://pending.webschemas.org/HealthPlanFormulary
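
To make the idea concrete, here is a hedged sketch of the kind of JSON-LD a publisher might emit using these types.  The property names shown (healthPlanId, includesHealthPlanNetwork, healthPlanNetworkId) are my recollection of the pending schema and should be checked against the schema pages linked above; the plan and network values are hypothetical.

import json

# Hedged example: JSON-LD markup a plan publisher might embed in a web page.
# Verify property names against the pending.webschemas.org pages above.
plan_markup = {
    "@context": "http://schema.org",
    "@type": "HealthInsurancePlan",
    "name": "Example Silver 2000",          # hypothetical plan name
    "healthPlanId": "12345XX9999999",       # hypothetical HIOS-style plan ID
    "includesHealthPlanNetwork": {
        "@type": "HealthPlanNetwork",
        "name": "Example Statewide PPO Network",
        "healthPlanNetworkId": "EXAMPLE-PPO",
    },
}

print(json.dumps(plan_markup, indent=2))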

Now for the background…

In November 2015, the US health agency Centers for Medicare & Medicaid Services (CMS) enacted a new regulatory requirement for health insurers who list plans on insurance marketplaces. They must now publish a machine-readable version of their provider network directory and health plan coverage in a specified JSON standard, and update it at least monthly. Many major health insurance companies across the US have already started to publish their health plan coverage, provider directories and drug formularies to this standard.

The official schema is kept in a GitHub repository: https://github.com/CMSgov/QHP-provider-formulary-APIs.  This format makes it possible to see which changes were made and when.  The repository also has an issues section to facilitate ongoing discussion about the optimal adoption of the standard.  There’s a website that goes into a more detailed explanation of the background of this effort: https://www.cms.gov/CCIIO/Resources/Data-Resources/marketplace-puf.html.

This website also includes the “Machine-readable URL PUF” seed file that links to the actual data published by each insurance company.  The seed file contains URLs that can be crawled to aggregate the latest plan and provider data.
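
As a hedged sketch of what that crawl might look like, the snippet below walks a local copy of the seed file and collects each issuer’s plan, provider, and formulary URLs.  The column name and index.json keys are assumptions based on the published format and should be verified against the actual PUF and the CMS schema repository.

import csv
import requests

SEED_CSV = "machine-readable-url-puf.csv"   # hypothetical local copy of the seed file

def collect_data_urls(seed_csv=SEED_CSV):
    urls = {"plan_urls": [], "provider_urls": [], "formulary_urls": []}
    with open(seed_csv, newline="") as f:
        for row in csv.DictReader(f):
            index_url = row.get("URL Submitted")      # assumed column name
            if not index_url:
                continue
            try:
                index = requests.get(index_url, timeout=30).json()
            except (requests.RequestException, ValueError):
                continue                              # skip unreachable or malformed issuers
            for key in urls:
                urls[key].extend(index.get(key, []))  # assumed index.json keys
    return urls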

In terms of adoption, U.S. health plans that participate in insurance marketplaces have so far published data covering: *

  • 39 states
  • 398 health plans
  • ~26,000 URLs describing insurance coverage, provider networks, drug formularies

* Updated November 2016


A group of companies representing the provider, payer and consumer segments of healthcare convened throughout 2015 to discuss the standard.  The considerations that went into the formation of the standard can be found at: http://ddod.healthdata.gov/wiki/Interoperability:_Provider_network_directories


Open Referral standard

The DDOD program is currently assisting the proponents of a new open standard for publishing human services data, called Open Referral.  In order for us to be able to justify promoting this standard and publishing data to it, we’re first looking to develop clear and concise use cases.

The Background

Open Referral is a standard that originally came out of a Code for America initiative a couple of years ago, with the goal of automating updates to the directories of human services offered across many programs.  Doing so would not only make the offered services more discoverable, but also lower the cost of administration for the service providers and referring organizations.

The Problem: A landscape of siloed directories

It’s hard to see the safety net. Which agencies provide what services to whom? Where and how can people access them? These details are always in flux. Nonprofit and government agencies are often under-resourced and overwhelmed, and it may not be a priority for them to push information out to attract more customers.

So there are many ‘referral services’ — such as call centers, resource directories, and web applications — that collect directory information about health, human, and social services. However, these directories are all locked in fragmented and redundant silos. As a result of this costly and ineffective status quo:

  • People in need have difficulty discovering and accessing services that can help them live better lives.
  • Service providers struggle to connect clients with other services that can help meet complex needs.
  • Decision-makers are unable to gauge the effectiveness of programs at improving community health.
  • Innovators are stymied by lack of access to data that could power valuable tools for any of the above.  

– Source: Open Referral project description

For potential use cases, a small handful of government programs have been identified as potential pilots.

 

The Competition

Open Referral is not without competing standards.  In fact, the AIRS/211 Taxonomy is already widely used among certified providers of information and referral services, such as iCarol.  However, AIRS/211 has two drawbacks in comparison with Open Referral.  

First, it’s not a free and open standard.  While there are sample PDFs available for parts of the taxonomy, a full spec requires a subscription.

“If you wish to evaluate the Taxonomy prior to subscribing, you can register for evaluation purposes and have access to the full Taxonomy for a limited period of time through the search function.”  – Source: AIRS/211 Download page and Subscription page

The taxonomy also requires an annual license fee, which could be a challenge to continue funding in perpetuity for government and nonprofit organizations.

“Organizations need a license to engage in any use of the Taxonomy.”
— Source: AIRS/211 Subscription page

Second, the AIRS/211 taxonomy is highly structured and extensive.  While that has advantages for consistency and interoperability, it raises other challenges.  It imposes a steep learning curve and therefore sets potential barriers for organizations without technical expertise.  Open Referral states that it is a more lightweight option.

It should also be noted that there’s a CivicServices schema defined for use with Schema.org.  Its approach is to embed machine-readable “Microdata” throughout human-readable HTML web pages.  Schema.org standards are intended to be interpreted by search engines like Google, Bing and Yahoo when indexing a website.  That said, the degree of adoption for CivicServices in particular, from either search engines or information publishers, is unclear at this point.

 

Onward!

In concept, the Open Referral standard would lower the cost and lag time for organizations to update the services relevant to their constituents.  The standard is being evangelized by Greg Bloom, who started with Code for America and has been reaching out to organizations that would be consuming this data (such as Crisis Text Line, Purple Binder and iCarol) for the purpose of defining a compelling use case.

There’s a DDOD writeup on this topic at “Interoperability: Directories of health, human and social services”, intended to facilitate creation of practical use cases.

 

 

Further reading…

Additional information on Open Referral can be found at:


Rheumatoid Arthritis Data Challenge

Looking forward to seeing the evolution of the Rheumatoid Arthritis Data Challenge.  Here are the parameters…

  • Title: Rheumatoid Arthritis Data Challenge
  • Announcement date: March 8, 2016
  • Award date: May 10, 2016
  • Summary:
The Rheumatoid Arthritis Data Challenge is a code-a-thon, described as:

“Striking at the heart of a key issue in health outcomes research, participants will be provided access to a secured development environment in a staged competition over three weeks to create the best competitive algorithms to gauge clinical response in Rheumatoid Arthritis management.”
The challenge is hosted by Health Datapalooza in May 2016. It’s sponsored by Optum, AcademyHealth, and the US Department of Health and Human Services (HHS). The challenge uses non-governmental de-identified administrative claims data and electronic health record (EHR) clinical data, with the goal of establishing algorithms to predict clinical response to rheumatoid arthritis management. Applications are open to any team of health data enthusiasts, but only 15 teams will be selected to participate. (Register at: https://hdpalooza.wufoo.com/forms/rheumatoid-arthritis-data-challenge-reg-form/). Winners will be announced at the Health Datapalooza on May 10, 2016, with $40,000 in prizes to be awarded.

Open Data Discoverability

I’m adding a working document to cover the topic of open data discoverability and usability.  This appears to be an area in desperate need of attention.  I have come across it tangentially throughout much of my work, and it deserves to be aggregated and curated.  There are also some lingering opportunities to make practical use of semantic web concepts.  There are vast repositories of data assets throughout government, academia and industry that could be better leveraged.  So let’s make it happen.

 


DDOD featured on Digital Gov

The Demand-Driven Open Data (DDOD) program has recently been featured on DigitalGov.  (See the DigitalGov article.)

It should be added that a major project in the works is the merging of DDOD tools and methodologies into the larger HealthData.gov program.  The effort seeks to maximize the value of existing data assets from across HHS agencies (CMS, FDA, CDC, NIH, etc.).  New features to enhance data discoverability and usability are already planned.

We’re also looking into how to improve the growing knowledge base of DDOD use cases by leveraging semantic web and linked open data (LOD) concepts.  A couple years ago, HHS organized the Health Data Platform Metadata Challenge – Health 2.0.  The findings from this exercise could be leveraged for both DDOD and HealthData.gov.

