
The Birth of Demand-Driven Open Data

And so it begins

My project as an Entrepreneur-in-Residence with the HHS IDEA Lab is called “Innovative Design, Development and Linkages of Databases”.  Think of it as Web 3.0 (the next generation of machine-readable and programmable internet applications) applied to open government, with a focus on healthcare and social service applications.  The underlying hypothesis was that by investigating how HHS could better leverage its vast data repositories as a strategic asset, we would discover innovative ways to create value by linking datasets across different agencies.

So to sum up…  I was to find opportunities across a trillion-dollar organization, where the experts already working with the data have a lifetime of domain-specific experience and several acronyms after their names.  And I was to accomplish this within one year, without any dedicated resources.  Pretty easy, right?

My hope was that my big data experience in industry — both for startups and large scale enterprises — was a sufficient catalyst to make progress.  And I had one other significant asset to make it all come together…  I was fortunate that the project was championed by a phenomenal group of internal backers: Keith Tucker and Cynthia Colton, who lead the Enterprise Data Inventory (EDI) in the Office of the Chief Information Officer (OCIO), and Damon Davis, who heads up the Health Data Initiative and HealthData.gov.

Tell me your data fantasies

The first step was to set out on a journey of discovery.  With guidance and clout from the internal sponsors, I was able to secure meetings with leaders and innovators for big data and analytics efforts across HHS.  I had the privilege of engaging in stimulating discussions at CMS, FDA, NIH, CDC, NCHS, ONC, ASPE and several other organizations.

Upon attempting to synthesize the information gathered into something actionable, I noticed that past open data projects fell into two camps.  In the first camp were those with ample examples of external organizations doing fantastic and often unexpected things with the data.  In the second were projects that may have been successfully implemented from a technical perspective, but where it wasn’t clear whether or how the data was being used.

The “aha” moment

That’s when it hit me — we were trying to solve the wrong problem.  The greatest value created with existing HHS data — and thereby the most innovative linkages — has come from industry, researchers and citizen activists.  That meant we could accomplish the main goals of the project by looking at the problem a bit differently.  Instead of outright building the linkages that we think have value, we can accelerate the rate at which external organizations do what they do best.

It seems so obvious now.  In fact, I had experienced this phenomenon myself.  Prior to my HHS fellowship, I built an online marketplace for medical services called Symbiosis Health.  I made use of three datasets from different HHS organizations, but I did so with great difficulty.  Each had deficiencies that I thought should be easy to fix: providing more frequent refreshes, adding a field that enables joins to another dataset, publishing a data dictionary, or consolidating data sources.  If only I could have told someone at HHS what we needed!

Let’s pivot this thing

Thus, the “pivot” was made.  While pivoting is a well-known concept for rapid course correction in Lean Startup circles, it’s not something typically associated with government.  Entrepreneurs are supposed to allow themselves to make mistakes and course-correct quickly.  Government is supposed to plan ahead and stay the course.  In this case, though, we have the best of both worlds — the IDEA Lab.  It gives access to all the resources and deep domain expertise of HHS, but with the ability to pivot and keep iterating without being weighed down by original assumptions.  I feel fortunate to have the opportunity to work in such an environment.

Pivoting into Demand-Driven Open Data


So what exactly is this thing?

The project born from this pivot is called Demand-Driven Open Data (DDOD).  It’s a framework of tools and methods that provides a systematic, ongoing and transparent mechanism for industry and academia to tell HHS what data they need.  With DDOD, all open data efforts are managed in terms of “use cases”, which enables limited resources to be allocated based on value.  It’s the Lean Startup approach to open data: minimize up-front development by acquiring customers before you build the product.

As the use cases are completed, several things happen.  Beyond the actual work of adding and improving datasets, both the specifications and the solutions associated with the use cases are documented and made publicly available on the DDOD website.  Additionally, for the datasets involved and the linkages enabled, we add or enhance relevant tagging, dataset-level metadata, data dictionaries, cross-dataset relationships and long-form dataset descriptions.  This, in turn, accelerates future discovery of datasets.  And best of all, it stimulates the linking we wanted in the first place, through coded relationships and field-level matching.
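To make this concrete, here’s a minimal sketch in Python of what a use case record might look like; the field names and the toy prioritization logic are my own illustrative assumptions, not the actual DDOD schema or workflow:

    # Hypothetical sketch of a DDOD use case record; the field names are
    # illustrative assumptions, not the official DDOD schema.
    use_case = {
        "id": "example-001",
        "title": "Join provider quality scores to cost data",
        "requester": "external startup or research group",
        "datasets": [
            "example HHS dataset A",
            "example HHS dataset B",
        ],
        "requested_improvements": [
            "more frequent refreshes",
            "add a shared key field to enable joins",
            "publish a data dictionary",
        ],
        "estimated_value": "qualitative statement of expected impact",
        "status": "proposed",  # e.g. proposed -> in progress -> completed
    }

    def prioritize(use_cases):
        """Toy prioritization: open use cases first, then by number of requested fixes."""
        return sorted(
            use_cases,
            key=lambda uc: (uc["status"] == "completed",
                            -len(uc["requested_improvements"])),
        )

    print(prioritize([use_case])[0]["title"])

The point of tracking requests in a structure like this is simply that each improvement can be tied to a named consumer and weighed against others, rather than guessed at internally.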

How does it fit into the big picture?

It’s beautiful how the pieces come together.  DDOD fits quite well with HHS’s existing Health Data Initiative (HDI) and HealthData.gov.  While DDOD is demand-driven from outside of HHS, you can think of HDI as its supply-driven counterpart, guided by brilliant subject matter experts throughout HHS.  Finally, HealthData.gov is the data indexing and discovery platform that serves as a home for both of these components.  In fact, we’re looking for DDOD to serve as the community section of HealthData.gov.

Let’s roll!

So now the fun begins.  Next up…  More adventure as we work through actual pilot use cases.  We’ll also cover some cool potential components of DDOD that would put more emphasis on the “linkages” aspect of the project.  These include usage analytics, data maturity reporting, and semantic tagging of the dataset catalog and fields in the data dictionary.  Stay tuned.

In the meantime, you can get involved in two ways…  Get the word out to your network about the opportunities provided by DDOD.  Or, if you have actual use cases to add, go to http://demand-driven-open-data.github.io/ and get them entered.


What Happened to the Semantic Web?

It looks bleak

Over the past few years, questions have been raised about the viability of the Semantic Web (aka SemWeb) envisioned by Tim Berners-Lee.  In the strictest sense, the original standards set out by the W3C have not proliferated at any great pace and have not been widely adopted commercially.  There have also been no multi-billion-dollar acquisitions or IPOs in the SemWeb space.  Even in government and academia, the vast majority of “open data” is in traditional relational form (rather than RDF linked datasets) and doesn’t reference widely adopted ontologies.

Evidence of decline?


But it’s a matter of framing

The outlook changes drastically if we look at the question a bit differently.  Rather than defining the SemWeb as the original set of standards or narrow vision, what if we look at the related technologies it may have spawned or influenced?  Then a number of success stories emerge.  We have the tremendous growth of Schema.org and the adoption of Microdata by the three big search engines: Google, Yahoo, and Bing.  We also have SemWeb concepts applied in Google’s Knowledge Graph, Google’s Rich Snippets, and Facebook’s Social Graph.  Even IBM’s Watson is no longer just an IBM Research project; it’s being commercialized into IBM’s verticals, including healthcare, insurance and finance.  So SemWeb technologies are alive — in a sense.  For clarity, let’s refer to the original W3C vision discussed since 2001 as the “old SemWeb” and the recent commercial successes as the “new SemWeb”.  Of course, these are fuzzy definitions, since the new SemWeb is not formally defined.


What’s wrong with the original vision?

The W3C breaks the elements of the old SemWeb into: (1) Linked Data, (2) Vocabularies, (3) Inference, and (4) Query.  Each of these is widely in use today, but in ways that differ from the original specs.  For example, linked data implemented as Microdata or JSON-LD has gained popularity over the heavier and more verbose RDF/XML.  Most websites forgo formally defined OWL ontologies in favor of shared vocabularies such as Schema.org or Freebase.  Rule engines and reasoners are already built into products we use; they’re part of what happens in the “brains” of Google’s ranking and ad-optimization algorithms.  And instead of the SPARQL query language, humans often interact with the new SemWeb through natural-language searches, while machines do so through RESTful APIs.  IBM’s Watson, for instance, translates questions into sophisticated queries involving federation and inference against its knowledge base.
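As a small illustration of the difference in weight, here’s a Python sketch that emits a Schema.org-style JSON-LD description; the organization and the specific properties are made up for the example:

    import json

    # A Schema.org-style JSON-LD description of a (fictional) organization.
    # The equivalent triples in RDF/XML would need namespace declarations
    # and nested XML elements; here it's just a plain dictionary.
    doc = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": "Example Health Data Co.",                   # fictional
        "url": "https://example.org",
        "sameAs": ["https://en.wikipedia.org/wiki/Example"],
    }

    print(json.dumps(doc, indent=2))

Markup like this can be dropped into a web page as-is, which is a big part of why the search engines were able to get webmasters to adopt it.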

There are a couple of other difficulties with the old SemWeb worth noting.  It’s been said that it’s too rigid to keep up with today’s rate of data creation and structural evolution; the overhead of frequently updating ontologies, tagging and linkages is just too high.  Another problem is the anemic adoption of the SPARQL query language.  The high level of both technical and domain proficiency required to use SPARQL directly — especially when it comes to federated queries or those involving inference — is simply impractical in most commercial situations.  It might, however, be feasible to have such skills in a highly specialized domain, such as the human genome project.  (See my post on a case study of such a SemWeb implementation.)
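For readers who haven’t seen SPARQL in practice, here’s a minimal sketch using the open-source rdflib Python library (assuming it’s installed); the triples and query are toy examples, and real federated or inference-heavy queries are far more involved:

    from rdflib import Graph

    # A few toy triples in Turtle syntax.
    TURTLE = """
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    <http://example.org/alice> foaf:name "Alice" ;
                               foaf:knows <http://example.org/bob> .
    <http://example.org/bob>   foaf:name "Bob" .
    """

    g = Graph()
    g.parse(data=TURTLE, format="turtle")

    # A basic SPARQL query: whom does Alice know, by name?
    QUERY = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?name WHERE {
        <http://example.org/alice> foaf:knows ?friend .
        ?friend foaf:name ?name .
    }
    """

    for row in g.query(QUERY):
        print(row[0])  # prints: Bob

Even this trivial example requires knowing the graph model, the vocabulary prefixes and the query syntax, which hints at why most organizations reach for a REST API instead.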

But even in highly specialized domains, you run into another problem: ontological silos, which naturally occur when ontologies are optimized for a specific domain and then need to integrate with ontologies built for neighboring domains.  (This is the problem that the “ontological realism” methodology aims to address.)  Such silos reduce the effectiveness of SemWeb efforts because they impair the ability to run queries and inference across multiple data sources.  What’s needed is a widely adopted base ontology, and a corresponding design methodology, that works across multiple domains without interfering with any specific one.  And because ontologies need to evolve over time, consistent effort is required to adhere to such a methodology and avoid eventual silos.

Why has adoption of the old SemWeb lagged behind that of simpler implementations like Schema.org?  One can draw an analogy to the adoption of API integration standards, where REST/JSON has overtaken SOAP/XML.  (See chart below.)  To understand why, we need to look at the domains in which these technologies are applied.  The compelling use case of loose coupling between unrelated companies or independent teams favored the simplicity of REST.  That said, within the confines of large corporate environments, the rigor of SOAP implementations still makes sense.

[Chart: REST vs. SOAP adoption, as an analogy to the Semantic Web]
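To make the REST-versus-SOAP contrast tangible, here’s a rough sketch of the same hypothetical “get dataset info” call done both ways with Python’s requests library; the endpoints and element names are invented for illustration:

    import requests  # pip install requests

    # REST/JSON: one call expresses the intent; the endpoint is hypothetical.
    resp = requests.get("https://api.example.org/datasets/123", timeout=10)
    dataset = resp.json()  # e.g. {"id": 123, "title": "...", "updated": "..."}

    # SOAP/XML: the same request needs an envelope, namespaces, and a POST.
    SOAP_BODY = """<?xml version="1.0" encoding="UTF-8"?>
    <soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
                   xmlns:ex="http://example.org/datasets">
      <soap:Body>
        <ex:GetDataset>
          <ex:DatasetId>123</ex:DatasetId>
        </ex:GetDataset>
      </soap:Body>
    </soap:Envelope>"""

    resp = requests.post(
        "https://api.example.org/soap",  # hypothetical SOAP endpoint
        data=SOAP_BODY,
        headers={"Content-Type": "text/xml; charset=utf-8",
                 "SOAPAction": "GetDataset"},
        timeout=10,
    )

The extra ceremony buys you contracts and tooling that large enterprises value, which mirrors the trade-off between the old SemWeb’s rigor and the new SemWeb’s lighter-weight conventions.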


When does it make sense?

One of the biggest challenges to the adoption of the old SemWeb has been the lack of clear commercial benefits.  To many corporate CIOs and CTOs, any potential benefit was overshadowed by the TCO (total cost of ownership, including migration overhead and ongoing maintenance).  No doubt the technology and concepts proposed for the old SemWeb are exhilarating.  But the key to adoption hasn’t been falling in love with the technology; it’s been the existence and realization of a clear business case.  That’s exactly what’s been happening for the successful implementations of the new SemWeb.  For example, Google sees tremendous ROI in its Knowledge Graph, because it greatly improves ad revenue.  Webmasters and Google’s advertisers, in turn, are eager to organize and tag their content per Schema.org for the purposes of SEO/SEM.

Sure, that’s fine for deep-pocketed visionaries like Google.  But what about the risk-averse?  How would they know when there’s likely to be sufficient ROI in adopting SemWeb technologies?  CEOs and CTOs looking to incorporate such technologies into their product lines might watch for a trend of increasing acquisitions or VC funding for SemWeb-related services.  CIOs looking to support their business operations might wait for success stories from similar corporate implementations.  Researchers and universities may ask whether there have been any discoveries substantially aided by SemWeb initiatives.

Additionally, there may be some hope even for the aspects of the old SemWeb vision that haven’t gained adoption yet.  The LOD2 Technology Stack is being funded by the European Commission within the Seventh Framework Programme. It is a set of standards and integrated semantic web tools being developed in conjunction with the EU Open Data Portal. It’s too early to see any obvious success stories. But it’s quite possible that such government support will lead to unexpected new developments from SemWeb efforts. After all, the US Department of Defense’s funding of ARPANET led to the development of the Internet.

There are many paths to adopting the new SemWeb.  Go find yours.

Comparison of MPP Data Warehouse Platforms

Revisiting a favorite topic: a comparison of MPP (massively parallel processing) data warehouse platforms.  It covers key differences, architectures, trends, costs, maturity and market share.  This discussion is also an update on emerging trends in the Hadoop ecosystem.