Monthly Archives: April 2015

Field-level data dictionaries for open data

Typically, publicly available open data repositories, especially those hosted or indexed via CKAN, have been described only at the dataset level.  That is, datasets are typically described in a DCAT-compatible schema.  This includes the metadata schema required by Project Open Data for Data.gov and all agency-specific data hosting websites.

But ideally, the cataloging of these datasets should move to a more granular level of detail: the field level.  Doing so makes it possible for search capabilities to go well beyond the typical tags and predefined categories.  With fields defined, we can quickly find all datasets that have common fields.  That in turn makes it easier to find opportunities for linking across datasets and allows for a related-dataset recommendation engine.  The solution becomes even more powerful if the fields are labeled with a predefined semantic vocabulary whose terms are globally and uniquely defined.  (See the approach described in the Health2.0 Metadata Challenge.)
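
To make the idea concrete, here's a minimal sketch of such a related-dataset recommendation: the dataset names and field labels below are made up, and a real catalog would index thousands of datasets, but the mechanics are just set intersection over field-level metadata.

```
# Sketch: recommend related datasets by counting shared field labels.
# The catalog contents here are hypothetical.

catalog = {
    "hospital-visits":   {"facility_id", "zip_code", "visit_date"},
    "facility-ratings":  {"facility_id", "rating", "zip_code"},
    "census-income":     {"zip_code", "median_income"},
}

def shared_fields(a, b):
    """Fields two datasets have in common."""
    return catalog[a] & catalog[b]

def related(dataset):
    """Rank the other datasets by how many fields they share with this one."""
    scores = {other: len(shared_fields(dataset, other))
              for other in catalog if other != dataset}
    return sorted((o for o, s in scores.items() if s > 0),
                  key=lambda o: -scores[o])

print(related("hospital-visits"))
```

With a shared semantic vocabulary, the field labels become globally unique identifiers and the same intersection works across catalogs, not just within one.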

One challenge to this goal is that CKAN has not historically been good at defining a standard, machine-readable data dictionary.  We’ve examined a range of standards and suggestions for defining data dictionaries, including common SQL DDL, XML, JSON, and YAML formats.

* ANSI SQL Standard 
   - DDL (Data Definition Language): "CREATE TABLE"
   - SQL/Schemata
 
        ```
        testdb-# \d company
                    Table "public.company"
          Column   |     Type      | Modifiers
        -----------+---------------+-----------
         id        | integer       | not null
         name      | text          | not null
         address   | character(50) |
         join_date | date          |
        Indexes:
            "company_pkey" PRIMARY KEY, btree (id)
        ```

* JSON Table Schema: http://dataprotocols.org/json-table-schema/

    ```
    "schema": {
      "fields": [
        {
          "name": "name of field (e.g. column name)",
          "title": "A nicer human readable label or title for the field",
          "type": "A string specifying the type"
        },
        ... more field descriptors
      ],
      "primaryKey": ...,
      "foreignKeys": ...
    }
    ```

* YAML schema files used for Doctrine ORM: http://doctrine.readthedocs.org/en/latest/en/manual/yaml-schema-files.html

* XML schema syntax for Google's DSPL (Dataset Publishing Language): https://developers.google.com/public-data/docs/schema/dspl9

* W3 XML Schema: http://www.w3.org/TR/xmlschema-2/#built-in-primitive-datatypes
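
One appeal of the JSON Table Schema option is how little code it takes to consume.  A minimal sketch (plain Python, no libraries; the descriptor and the tiny type-to-caster mapping below are illustrative, covering only a small subset of the spec's types):

```
import json
from datetime import date

# A hypothetical JSON Table Schema-style descriptor.
schema = json.loads("""
{
  "fields": [
    {"name": "id",        "type": "integer"},
    {"name": "name",      "type": "string"},
    {"name": "join_date", "type": "date"}
  ]
}
""")

# Map schema types to Python casters (a tiny subset of the spec).
CASTERS = {
    "integer": int,
    "string": str,
    "date": date.fromisoformat,
}

def cast_row(row, fields):
    """Cast one CSV row (a list of strings) against the field descriptors."""
    return {f["name"]: CASTERS[f["type"]](value)
            for f, value in zip(fields, row)}

print(cast_row(["7", "Acme", "2015-04-01"], schema["fields"]))
```

The same field descriptors that drive casting can also drive search indexing, which is the field-level cataloging goal above.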


## CSV storage formats
* Open Knowledge Data Packager - CKAN Extension http://ckan.org/2014/06/09/the-open-knowledge-data-packager/

* Tabular Data Package Spec: http://dataprotocols.org/tabular-data-package/

* The above two are also part of a W3C standards track:
http://www.w3.org/TR/tabular-data-model/

 

Enter the all-powerful CSV

The CSV format is often desired for its high interoperability.  However, it suffers from the fact that its metadata must be kept separately defined, which in turn causes challenges with version control, broken links, and correctly identifying the column order.  There’s also the all-too-common, annoying test that has to be performed to determine whether the first row is data or a column header.
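
That header-or-data test can at least be automated; Python's standard library ships a heuristic for it (the sample data here is made up):

```
import csv

sample = "id,name,join_date\n1,Acme,2015-04-01\n2,Globex,2015-04-02\n"

# csv.Sniffer.has_header() guesses whether the first row is a header by
# comparing its columns against the types and lengths of the rows below it.
sniffer = csv.Sniffer()
print(sniffer.has_header(sample))  # a heuristic, not a guarantee
```

It remains a guess, though, which is exactly why embedding the schema alongside the data is attractive.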

So is there an elegant, machine-readable, standard-ish way to embed the metadata within the data file itself?  OKFN suggests that this could be accomplished via Tabular Data Packages.  Basically, you have the option to provide the data “inline” directly in the datapackage.json file, in addition to specifying the full schema (per JSON Table Schema) and the CSV dialect (per the CSVDDF Dialect specification) in the same file.  We then just need simple scripts that extract these components into separate CSV files and a JSON Table Schema.  Open Knowledge Data Packager is a CKAN extension that makes use of JSON Table Schema and Tabular Data Package for hosting datasets on the CKAN FileStore.
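
Such an extraction script need not be complicated.  A sketch (the package contents and file names below are made up, and a real Tabular Data Package has more required metadata) that splits an inline-data resource back into a plain CSV file plus a standalone schema file:

```
import csv
import json

# A minimal datapackage.json with inline data (illustrative, not spec-complete).
datapackage = {
    "name": "companies",
    "resources": [{
        "name": "company",
        "schema": {
            "fields": [
                {"name": "id", "type": "integer"},
                {"name": "name", "type": "string"}
            ]
        },
        "data": [
            {"id": 1, "name": "Acme"},
            {"id": 2, "name": "Globex"}
        ]
    }]
}

for resource in datapackage["resources"]:
    fields = [f["name"] for f in resource["schema"]["fields"]]

    # Write the inline rows out as a plain CSV file.
    with open(resource["name"] + ".csv", "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fields)
        writer.writeheader()
        writer.writerows(resource["data"])

    # Write the field descriptors out as a standalone JSON Table Schema.
    with open(resource["name"] + ".schema.json", "w") as out:
        json.dump(resource["schema"], out, indent=2)
```

The reverse direction (bundling an existing CSV and schema into one datapackage.json) is equally mechanical.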

Finally, there’s a helpful article on Implementing CSV on the Web, and W3C’s CSV working group is seeking feedback on its model and vocabulary for tabular data.

 

Is “SchemaStore” CKAN’s mystical unicorn?

As mentioned previously, CKAN hasn’t been strong at storing and managing standard, machine-readable data dictionaries.  So a special shout-out goes to Greg Lawrence, who has figured out how to work around this limitation.  He’s built a CKAN “SchemaStore” and a custom Java app that indexes content into CKAN’s DataStore object, grabbing the needed information by running SQL exports on Oracle tables.  The code that enables SchemaStore is incorporated into the BC Data Catalogue CKAN extension on GitHub; the field tags are defined in the edc_datasets.py file of that repository.

An example of the SchemaStore implementation can be found in this sample dataset under the “Object Description” section.  Here you’re able to see all of the relevant elements from the Oracle table object: Column Name, Short Name, Data Type, Data Precision, and Comments.  The data dictionary for this dataset is in machine readable JSON format.  For example, the first 3 fields of the data dictionary are:

```
"details": [
  {
    "data_precision": "0",
    "column_comments": "The date and time the information was entered.",
    "data_type": "DATE",
    "short_name": "TIMESTAMP",
    "column_name": "ENTRY_TIMESTAMP"
  },
  {
    "data_precision": "0",
    "column_comments": "The identification of the user that created the initial record.",
    "data_type": "VARCHAR2",
    "short_name": "ENT_USR_ID",
    "column_name": "ENTRY_USERID"
  },
  {
    "data_precision": "0",
    "column_comments": "A feature code is most importantly a means of linking a features to its name and definition.",
    "data_type": "VARCHAR2",
    "short_name": "FEAT_CODE",
    "column_name": "FEATURE_CODE"
  }, ...
],
```

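
Entries like these could be mapped into JSON Table Schema field descriptors fairly mechanically.  A sketch of that conversion (the Oracle-to-JSON-Table-Schema type mapping here is my own assumption, covering only a few common Oracle types):

```
# Map Oracle column types (as exported by SchemaStore) to JSON Table
# Schema types.  Assumed mapping; extend as needed.
TYPE_MAP = {"VARCHAR2": "string", "DATE": "date", "NUMBER": "number"}

def to_jts_field(detail):
    """Convert one SchemaStore 'details' entry into a JTS field descriptor."""
    return {
        "name": detail["column_name"],
        "title": detail["short_name"],
        "type": TYPE_MAP.get(detail["data_type"], "string"),
        "description": detail["column_comments"],
    }

detail = {
    "data_precision": "0",
    "column_comments": "The date and time the information was entered.",
    "data_type": "DATE",
    "short_name": "TIMESTAMP",
    "column_name": "ENTRY_TIMESTAMP",
}
print(to_jts_field(detail))
```

A converter along these lines would let SchemaStore-style catalogs interoperate with the Tabular Data Package tooling discussed earlier.
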
See related issue for Data.gov (“Make field level metadata searchable and link common fields across the catalog”):  https://github.com/GSA/data.gov/issues/640