
ENH: Linked Datasets (RDF) #3402

Closed
34 tasks
westurner opened this issue Apr 19, 2013 · 36 comments
Labels
Ideas Long-Term Enhancement Discussions

Comments

@westurner
Contributor

westurner commented Apr 19, 2013

ENH: Linked Datasets (RDF)

  • This is very much a meta ticket.
  • There are a number of bare links here.
  • They are included for documentation purposes.

(UPDATE: see westurner/pandasrdf#1)

Use Case

So I:

  • retrieved some data
    • from somewhere
    • about a certain #topic
  • performed analysis
    • with certain transformations and aggregations
    • with certain versions of certain tools
    • confirmed/rejected a [null] hypothesis

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html


  • Series (1D)
    • index
    • data
      • NumPy datatypes
  • DataFrame (2D)
    • index
    • column(s)
      • NumPy datatypes
  • Panel (3D)
  • Panel4D (4D)

Read or parse a data format into a DataSet:

Add metadata:

  • Add RDF metadata (RDFa, JSONLD)

Save or serialize a DataSet into a data format:

  • pandas.DataFrame.
    • to_csv
    • to_dict
    • to_excel
    • to_gbq
    • to_html
    • to_latex
    • to_panel
    • to_period
    • to_records
    • to_sparse
    • to_sql
    • to_stata
    • to_string
    • to_timestamp
    • to_wide
  • to_ RDF
  • to_ CSVW
  • to_ HTML + RDFa
  • to_ JSONLD
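A to_rdf could start very simply: one triple per cell, with row and column identifiers minted under a base namespace. Below is a minimal, hypothetical sketch (the function name, base URI, and data are all illustrative, not an existing pandas API):

```python
import pandas as pd

def to_triples(df, base="http://example.com/datasets/df#"):
    """Hypothetical: emit one (subject, predicate, object) triple per cell."""
    triples = []
    for row_id, row in df.iterrows():
        subject = f"{base}row-{row_id}"  # one subject URI per row
        for col, value in row.items():
            triples.append((subject, f"{base}{col}", value))
    return triples

df = pd.DataFrame({"population": [100, 200]}, index=["a", "b"])
triples = to_triples(df)
# triples[0] == ('http://example.com/datasets/df#row-a',
#                'http://example.com/datasets/df#population', 100)
```

A real implementation would mint proper URIs for the index and columns from .meta, and hand the triples to a library such as rdflib for serialization.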

Share or publish a serialized DataSet with the internet:

Implementation

What changes would be needed for Pandas core to support this workflow?

  • .meta schema
  • to_rdf for Series, DataFrames, Panels, and Panel4Ds
  • read_rdf for Series, DataFrames, Panels, and Panel4Ds
  • ~@datastep process decorators
  • ~DataSet
  • ~DataCatalog of precomputed aggregations/views/slices.
  • Units support (.meta?)

.meta schema

It's easy enough to serialize a dict and a table to naive RDF.

For interoperability, it would be helpful to standardize with a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata
within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
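In the meantime, nothing stops an analyst from keeping that metadata in a plain out-of-band dict shaped like the proposed .meta['columns'][colname]['schema']; a sketch (keys and URIs are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"population": [100, 200]})

# out-of-band columnar metadata, shaped like the proposed .meta schema
meta = {
    "columns": {
        "population": {
            "schema": "http://example.com/schema#population",
            "unit": "persons",
        }
    }
}
```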

Ontology Resources
CSV2RDF (csvw)
W3C PROV (prov:)
schema.org (schema:)
  • http://schema.org
  • http://www.w3.org/wiki/WebSchemas
  • http://schema.rdfs.org/
  • https://schema.org/docs/full.html :
    • schema:Dataset -- A body of structured information describing some topic(s) of interest.
      • [schema:Thing, schema:CreativeWork]
      • distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
      • spatial, temporal
      • catalog -- A data catalog which contains a dataset (DataCatalog)
    • schema:DataCatalog -- collection of Datasets
      • [schema:Thing, schema:CreativeWork]
      • dataset -- A dataset contained in a catalog. (Dataset)
    • schema:DataDownload -- A dataset in downloadable form.
      • [schema:Thing, schema:CreativeWork]
      • contentSize
      • contentURL
      • uploadDate
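The schema:Dataset / DataDownload terms above translate directly into a JSON-LD document; a minimal sketch as a plain Python dict (all values are illustrative):

```python
import json

dataset = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "http://example.com/data.csv",
        "contentSize": "10 kB",
        "uploadDate": "2013-04-19",
    },
}
doc = json.dumps(dataset, indent=2)  # serialized JSON-LD document
```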
W3C RDF Data Cube (qb:)

to_rdf

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:

  • output fmt
  • JSON-LD: compaction


  • Series.meta
  • Series.to_rdf()
  • DataFrame.meta
  • DataFrame.to_rdf()
  • Panel.meta
  • Panel.to_rdf()
  • Panel4D.meta
  • Panel4D.to_rdf()

read_rdf

http://pandas.pydata.org/pandas-docs/dev/remote_data.html

  • Series.read_rdf()
  • DataFrame.read_rdf()
  • Panel.read_rdf()
  • Panel4D.read_rdf()

Arguments to read_rdf would need to describe which dimensions of data to
read into 1D/2D/3D/4D form.
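For 2D output, one naive reading is a pivot: one row per subject, one column per predicate. A sketch with illustrative triple data:

```python
import pandas as pd

triples = [
    ("#row-a", "#population", 100),
    ("#row-a", "#area", 50),
    ("#row-b", "#population", 200),
    ("#row-b", "#area", 75),
]

# one row per subject, one column per predicate
df = (
    pd.DataFrame(triples, columns=["s", "p", "o"])
      .pivot(index="s", columns="p", values="o")
)
```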

@datastep / PROV

  • Objective: Additive journal of transformations
  • Link to source script(s) URIs
  • Decorator for annotating data transformations with metadata.
  • Generate PROV metadata for data transformations
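A hypothetical @datastep decorator might simply append one record per transformation to an additive journal; the names and fields below are illustrative, and a real version would emit proper PROV terms:

```python
import functools
from datetime import datetime, timezone

PROV_JOURNAL = []  # additive journal of transformation records

def datastep(func):
    """Hypothetical decorator: record provenance for each transformation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROV_JOURNAL.append({
            "activity": func.__name__,
            "args": repr(args),
            "endedAtTime": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper

@datastep
def normalize(values):
    total = sum(values)
    return [v / total for v in values]

result = normalize([1, 1, 2])  # result == [0.25, 0.25, 0.5]
```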

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)

DataCatalog

A collection of Datasets.

  • DataCatalog = {that=df1, this=df1.group().apply(), also_this=df2}
    • 'this is an aggregation of that'
      • 'this' has a URI
      • 'that' has a URI
  • What if there is no metadata for df2?

Units support

RDF Datatypes

JSON-LD (RDF in JSON)

Linked Data Primer

Linked Data Abstractions

  • Graphs are represented as triples of (s,p,o)
  • Subject, Predicate, Object
  • Queries are patterns with ?references
    • graph.triples((None, None, None))
    • SELECT ?s, ?p, ?o WHERE { ?s ?p ?o };
  • subjects are linked to objects by predicates
    • subjects and predicates are identified by URI 'keys'
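The pattern-matching idea needs no special machinery to demonstrate; None below plays the role of a SPARQL ?variable (data illustrative):

```python
def match(triples, pattern):
    """Return triples matching a (s, p, o) pattern; None is a wildcard."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

graph = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
]

match(graph, (None, "ex:knows", None))  # matches both triples
match(graph, ("ex:alice", None, None))  # matches only alice's triple
```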

URIs and URLs

  • a URI is like a URL
  • usually, we expect URLs to be 'dereferenceable' HTTP URIs
  • a URI may start with a different URI prefix
    • urn:
    • uuid:

SQL and Linked Data

  • there exist standard mappings for whole SQL tablesets
    • rdb2rdf
    • similar to application scaffolding
    • ACL support adds complexity
  • virtuoso supports SQL and RDF and SPARQL
  • rdflib-sqlalchemy maps RDF onto SQL tables
    • fairly inefficiently, when compared to native triplestores

Named Graphs

  • Quads: (g, s, p, o)
  • g: sometimes called the 'context' of a triple
  • Metadata about GRAPH ?g
  • Multiple named graphs in one file: TriX, TriG
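In quad terms, filtering on g scopes a query to a single named graph (graph names are illustrative):

```python
quads = [
    ("urn:graph:2013", "ex:a", "ex:p", "ex:b"),
    ("urn:graph:2014", "ex:a", "ex:p", "ex:c"),
]

def triples_in(quads, g):
    """Return the (s, p, o) triples belonging to named graph g."""
    return [(s, p, o) for (ctx, s, p, o) in quads if ctx == g]
```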

Linked Data Formats

  • NTriples
  • RDF/XML
    • TriX
  • Turtle, N3
    • TriG
  • JSON-LD

Choosing Schema

  • XSD, RDF, RDFS, DCTERMS
  • Which schema is most popular?
  • Which schema is a best fit for the data?
  • Which schema will search engines index for us?
  • What do the queries look like?
  • Years Later... What is OWL?
  • Why would we start with RDFS now?

Linked Data Process, Provenance, and Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

  • where and how was it downloaded? (digital sense)
  • how was it collected? (process control sense)

Datasets have structure:

  • Tabular, Hierarchical
  • 1D, 2D, 3D, 4D
  • Graph-based
    • Chains
    • Flows
  • Schema

5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data

@ghost

ghost commented Apr 20, 2013

Hi,

Thanks for the thoroughly-researched idea-issue.
Most of the spec links you provided are exploratory in practice, and have yet (to my knowledge)
to take over the world. Controversially perhaps, I'm including RDF in that statement, even though it
has certainly gotten a lot of attention and there are real services built on top of it (Freebase,
OpenCalais, semantic search engines, and so on).

DataFrame metadata has come up again and again; please read through the (long) metadata
discussion in #2495 to catch up on some of the issues already discussed.
#3297 is planned for 0.12, but has nothing to do with RDF and has very limited scope, since
it's intended to answer a different use case. However, users would be free to
embed their own JSON schemas under .meta, so it's somewhat open-ended.

The next step after that, embedding metadata in axis labels, is interesting, but right now isn't
planned for a specific release. Although I'm sure the `quantities` users would find that useful.

IMO, it's premature to bake these specs into pandas at this point in the life of
the semantic web.
Is there a fundamental reason why all this can't be done in an auxiliary package,
on top of pandas?

That's my opinion, other devs may feel differently.

@ghost

ghost commented Apr 27, 2013

Bringing over comments made by @westurner in GH3297:

https://www.google.com/search?q=sdmx+json
http://json-stat.org

@westurner
Contributor Author

Thx.

@westurner
Contributor Author

From https://news.ycombinator.com/item?id=5657935 :

In terms of http://en.wikipedia.org/wiki/Linked_data , there are a number of standard (overlapping) URI-based schema for describing data with structured attributes:


@westurner
Contributor Author

@y-p

Most of the spec links you provided are exploratory in practice, and have yet (to my knowledge)
to take over the world. Controversially perhaps, I'm including RDF in that statement, even though it
has certainly gotten a lot of attention and there are real services built on top of it (Freebase,
OpenCalais, semantic search engines, and so on).

@dr-leo
Contributor

dr-leo commented Jul 25, 2013

I stumbled upon this proposal while looking for SDMX tools that might help read economic data from Eurostat, the OECD, IMF, BIS, and the like. So a DataFrame.to_rdf method would need to be complemented by a read_sdmx function. Well, the mentioned data providers offer CSV files as well. But the benefits of working with XML- and EDIFACT-based formats such as those described on http://sdmx.org/ are obvious.

I don't know what level of generality would be appropriate for IO of just SDMX. But it might be interesting to look at Eurostat's SDMX Reference Implementation and the other material available at https://webgate.ec.europa.eu/fpfis/mwikis/sdmx/index.php.

Starting "small" with SDMX might be appropriate to do within pandas. A more general semantic-web-focused approach can be studied at http://www.cubicweb.org.

@benjello
Contributor

It would definitely be great to be able to read data from Eurostat, the OECD, IMF, BIS using pandas !

@jreback
Contributor

jreback commented Jul 25, 2013

If someone is interested, they could follow the paradigm of pandas.io.wb.py (the World Bank dataset):
basically, wrap functions that get the data and return a frame.

@westurner
Contributor Author

read_sdmx would be great.

write_rdf would also be great. (to_triples)

TODO: re-topical-cluster globs of links in this thread. Here are three more:

@westurner
Contributor Author

It may well be easy enough to transform .meta to RDF.

The more challenging part is, IMHO, storing the procedural metadata while/in applying transforms to the Series, DataFrames, and Panels.

From a provenance and reproducibility standpoint: how do downstream users who are not reading the Python source which produced the calculations compare/review the applied data analysis methods (and findings) with RDF metadata?

[EDIT]

There should be a link to the revision id and/or version of the code in the .meta information.

@westurner
Contributor Author

General Ontology Resources:

@dr-leo
Contributor

dr-leo commented Jul 26, 2013

All this looks very interesting.

Again, I recommend a deeper dive into CubicWeb, a web framework
supporting RDF and other semantic web standards. It also implements a
SparQL-like query language called RQL. Apart from reusing some of its
core components it seems worth exploring whether in the long term
CubicWeb could be used as a web front end for admin and representation
tasks relating to datasets.

There is no doubt a lot of speculation in these statements. But we
should avoid reinventing wheels.

The pandas.io.wb.py module is child's play compared to teaching
pandas RDF. The latter goal should probably be pursued in a separate
project such as pandas-rdf or pandas-sdmx, as has been suggested before.
That said, I know nothing about the relationship between RDF and SDMX.

Writing pandas.io.eurostat.py, oecd.py, and bis.py modules along the lines of
wb.py should not be too difficult, especially if one focuses on CSV-formatted
data. Still, using SDMX could make the user's life much easier and richer.

To add both complexity and to the links collection: Some elements of the
SDMX standard build on EDIFACT. Here,

https://pypi.python.org/pypi/bots-open-source-edi-translator/version%203.0.0

could come in handy.


@westurner
Contributor Author

TODO: re-work description toward something more actionable (research -> development)


@westurner
Contributor Author

@benjello

It would definitely be great to be able to read data from Eurostat, the OECD, IMF, BIS using pandas !

Do you have a few links to API specs and or Python implementations? AFAIK there is not yet an extension API for pandas IO and remote_data providers.

[[re: try/except imports / setuptools]]

@benjello
Contributor

@westurner: I am not an expert, but I think that what is needed is a Python library to access SDMX content on the OECD, ECB, IMF, etc. data servers. I didn't find any yet. But there is plenty of documentation on SDMX.

@dr-leo
Contributor

dr-leo commented Oct 20, 2013

The reference implementation of the SDMX framework as freely available
on the Eurostat website is written in Java. I am unaware of any other
implementation.

The SDMX specification at sdmx.org is not rocket science, but covers
several hundred pages.

You may want to set up a separate project, say, PySDMX, and spend some
time to understand the reference implementation, divide it into
tractable chunks and port these to Python. PySDMX could then use pandas
as a storage backend. It could also be designed so as to easily
interface with CubicWeb and friends.

Maybe there are mailing lists on SDMX and its implementations where one
could ask related questions and reach out for potential contributors.

Leo


@jtratner
Contributor

2 comments here:

  1. I'd encourage anyone interested in connecting SDMX to pandas to either submit a PR to pandas or work on a Python reader and then submit a PR to hook the package into pandas. That's the best way to get support for SDMX into pandas, especially if the spec is 100s of pages and there's an option to load from CSV (and you'd end up with the same thing from loading SDMX into pandas vs. loading CSV into pandas).
  2. Special support for RDF is not within scope for pandas right now, both because of what @y-p said and because it's not clear how users would want to use it (particularly if this is complex enough to require query languages). I'd imagine that you could keep the RDF descriptors in a column (or whatever you need to use for comparison) and then use those descriptors to traverse after you're finished transforming the data.

If you need help hooking in an already-written SDMX reader into pandas, feel free to ask.

@jtratner
Contributor

@westurner why do you post many links with no summary of them? Not really
helpful for narrowing things down.

@westurner
Contributor Author

@jtratner
Sorry about the noise: I find it easier to get the research together (without Markdown || ReStructuredText formatting).

I am working on a more implementation-focused description for this issue. This appears to be a strong candidate for a meta-ticket, which I understand is not usually especially helpful. This may very well belong out of core (import rdflib, most likely), but this seems to be a good place to coordinate efforts. To be clear, I have no working generalized implementation of this: I have one-offs for specific datasets, and it seems wasteful. A read_rdf and a to_html5_rdfa could be so helpful.

Storing columnar RDF dataset metadata out-of-band from Series.meta and DataFrame.meta is the easiest thing to do right now.

For the meantime, for reference, above are links to SDMX and (newer, more comprehensive) RDF Data Cube Vocabulary standards.

@jtratner
Contributor

@westurner okay - that's helpful :) [and it's much more understandable if that's your process for working towards something] - btw, metadata does (kind of) have some support now (i.e., if you add a property to a subclass and list the property name in _metadata, e.g., _metadata = ['rdf'], it will generally be moved over to new objects).
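The mechanism described can be sketched as a small subclass (class name and attribute are illustrative; propagation via __finalize__ is only partial for some operations):

```python
import pandas as pd

class RDFFrame(pd.DataFrame):
    # attributes named in _metadata are (generally) carried over to
    # objects derived from this one via __finalize__
    _metadata = ["rdf"]

    @property
    def _constructor(self):
        return RDFFrame

df = RDFFrame({"x": [1, 2, 3]})
df.rdf = {"subject": "http://example.com/datasets/x"}  # illustrative URI
```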

What I don't understand from everything you've laid out is what you're looking for with read_rdf (to_html5_rdfa actually seems pretty straightforward once you know where data is stored). Are you looking to get data + the associated RDF triple with it? Or keep all of the RDF data from the file you've read in? Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame.

@westurner
Contributor Author

btw, metadata does (kind of) have some support now (i.e., if you add a property to a subclass and list the property name in _metadata, e.g., _metadata = ['rdf'], it will generally be moved over to new objects).

What I don't understand from everything you've laid out is what you're looking for with read_rdf

A read_rdf may have to be a bit more schema- and query-opinionated (i.e., read_sdmx_rdf, read_which_datacube_rdf).

(to_html5_rdfa actually seems pretty straightforward once you know where data is stored).

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L557 (HTMLFormatter)

Are you looking to get data + the associated RDF triple with it?

Like more granular than to_triples? I can't think of a specific use case ATM, but that might also be helpful.

Or keep all of the RDF data from the file you've read in?

More so this, I think. ETL [+ documentation] -> Publish

Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame.

👍
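That naive column-of-strings approach might look like the following (the triple strings are illustrative):

```python
import pandas as pd

# store each row's raw RDF as an N-Triples string in an extra column
df = pd.DataFrame({
    "population": [100, 200],
    "rdf": [
        '<#row-a> <#population> "100" .',
        '<#row-b> <#population> "200" .',
    ],
})
```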

@westurner
Contributor Author

I have:

  • no progress to report
  • completely rewritten the description for this issue

@westurner
Contributor Author

https://github.com/mhausenblas/web.instata

Turn your plain old tabular data (POTD) into Web data with web.instata: it takes CSV as input and generates an HTML document with the data items marked up with Schema.org terms.

@westurner
Contributor Author

CSV on the Web Working Group Charter
http://www.w3.org/2013/05/lcsv-charter.html

Data on the Web Best Practices Working Group Charter
http://www.w3.org/2013/05/odbp-charter.html

5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

@ghost

ghost commented Feb 4, 2014

@westurner , you haven't posted any new links in a while. is everything ok?

@westurner
Contributor Author

Stayin' alive. I'll close this for now?

@mmalter

mmalter commented Feb 27, 2014

I am working on exactly dr-leo's proposal for a PySDMX module. I will release something on GitHub in a few days.

We need to figure out a way to expose the keys of a time series. Should I subclass DataFrame to provide additional properties? I know pandas makes heavy use of new.

@dr-leo
Contributor

dr-leo commented Mar 1, 2014

Hi,

I am very pleased to read this and will certainly test it asap.

I am afraid I don't understand the background of your question on
exposing Timeseries keys.

Leo


@westurner
Contributor Author

  • column.name
  • column.meta.unit
  • column.meta.precision

I am now thinking that the easiest approach here -- for columnar metadata in pandas (this is an open problem with CSV and most tabular/spreadsheet formats) -- would be dataframe.meta['columns'][column_id].

As mentioned earlier, this is probably not a job for pandas, but for an external "pandas-rdf".


@westurner
Contributor Author

Added:

  • to_ CSVW
