
ENH: Linked Datasets (RDF) #3402

Closed
34 tasks
westurner opened this issue Apr 19, 2013 · 36 comments
Labels
Ideas Long-Term Enhancement Discussions

Comments

@westurner
Contributor

westurner commented Apr 19, 2013

ENH: Linked Datasets (RDF)

  • This is very much a meta ticket.
  • There are a number of bare links here.
  • They are included for documentation purposes.

(UPDATE: see westurner/pandasrdf#1)

Use Case

So I:

  • retrieved some data
    • from somewhere
    • about a certain #topic
  • performed analysis
    • with certain transformations and aggregations
    • with certain versions of certain tools
    • confirmed/rejected a [null] hypothesis

and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.

User Story

As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").

Status Quo: Pandas IO

How do I go from a [CSV] to a DataFrame to something shareable with a URL?

http://pandas.pydata.org/pandas-docs/dev/io.html


  • Series (1D)
    • index
    • data
      • NumPy datatypes
  • DataFrame (2D)
    • index
    • column(s)
      • NumPy datatypes
  • Panel (3D)
  • Panel4D (4D)

Read or parse a data format into a DataSet:

Add metadata:

  • Add RDF metadata (RDFa, JSONLD)

Save or serialize a DataSet into a data format:

  • pandas.DataFrame.
    • to_csv
    • to_dict
    • to_excel
    • to_gbq
    • to_html
    • to_latex
    • to_panel
    • to_period
    • to_records
    • to_sparse
    • to_sql
    • to_stata
    • to_string
    • to_timestamp
    • to_wide
  • to_ RDF
  • to_ CSVW
  • to_ HTML + RDFa
  • to_ JSONLD
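A to_rdf could start very simply: one triple per cell, with row and column identifiers minted under a base namespace. Below is a minimal, hypothetical sketch (the function name, base URI, and data are all illustrative, not an existing pandas API):

```python
import pandas as pd

def to_triples(df, base="http://example.com/datasets/df#"):
    """Hypothetical: emit one (subject, predicate, object) triple per cell."""
    triples = []
    for row_id, row in df.iterrows():
        subject = f"{base}row-{row_id}"  # one subject URI per row
        for col, value in row.items():
            triples.append((subject, f"{base}{col}", value))
    return triples

df = pd.DataFrame({"population": [100, 200]}, index=["a", "b"])
triples = to_triples(df)
# triples[0] == ('http://example.com/datasets/df#row-a',
#                'http://example.com/datasets/df#population', 100)
```

A real implementation would mint proper URIs for the index and columns from .meta, and hand the triples to a library such as rdflib for serialization.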

Share or publish a serialized DataSet with the internet:

Implementation

What changes would be needed for Pandas core to support this workflow?

  • .meta schema
  • to_rdf for Series, DataFrames, Panels, and Panel4Ds
  • read_rdf for Series, DataFrames, Panels, and Panel4Ds
  • ~@datastep process decorators
  • ~DataSet
  • ~DataCatalog of precomputed aggregations/views/slices.
  • Units support (.meta?)

.meta schema

It's easy enough to serialize a dict and a table to naive RDF.

For interoperability, it would be helpful to standardize with a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.

There is currently no standard method for storing columnar metadata
within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
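In the meantime, nothing stops an analyst from keeping that metadata in a plain out-of-band dict shaped like the proposed .meta['columns'][colname]['schema']; a sketch (keys and URIs are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"population": [100, 200]})

# out-of-band columnar metadata, shaped like the proposed .meta schema
meta = {
    "columns": {
        "population": {
            "schema": "http://example.com/schema#population",
            "unit": "persons",
        }
    }
}
```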

Ontology Resources
CSV2RDF (csvw)
W3C PROV (prov:)
schema.org (schema:)
  • http://schema.org
  • http://www.w3.org/wiki/WebSchemas
  • http://schema.rdfs.org/
  • https://schema.org/docs/full.html :
    • schema:Dataset -- A body of structured information describing some topic(s) of interest.
      • [schema:Thing, schema:CreativeWork]
      • distribution -- A downloadable form of this dataset, at a specific location, in a specific format (DataDownload)
      • spatial, temporal
      • catalog -- A data catalog which contains a dataset (DataCatalog)
    • schema:DataCatalog -- collection of Datasets
      • [schema:Thing, schema:CreativeWork]
      • dataset -- A dataset contained in a catalog. (Dataset)
    • schema:DataDownload -- A dataset in downloadable form.
      • [schema:Thing, schema:CreativeWork]
      • contentSize
      • contentURL
      • uploadDate
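The schema:Dataset / DataDownload terms above translate directly into a JSON-LD document; a minimal sketch as a plain Python dict (all values are illustrative):

```python
import json

dataset = {
    "@context": "http://schema.org/",
    "@type": "Dataset",
    "name": "Example dataset",
    "distribution": {
        "@type": "DataDownload",
        "contentUrl": "http://example.com/data.csv",
        "contentSize": "10 kB",
        "uploadDate": "2013-04-19",
    },
}
doc = json.dumps(dataset, indent=2)  # serialized JSON-LD document
```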
W3C RDF Data Cube (qb:)

to_rdf

http://pandas.pydata.org/pandas-docs/dev/io.html

Arguments:

  • output fmt
  • JSON-LD: compaction


  • Series.meta
  • Series.to_rdf()
  • DataFrame.meta
  • DataFrame.to_rdf()
  • Panel.meta
  • Panel.to_rdf()
  • Panel4D.meta
  • Panel4D.to_rdf()

read_rdf

http://pandas.pydata.org/pandas-docs/dev/remote_data.html

  • Series.read_rdf()
  • DataFrame.read_rdf()
  • Panel.read_rdf()
  • Panel4D.read_rdf()

Arguments to read_rdf would need to describe which dimensions of data to
read into 1D/2D/3D/4D form.
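For 2D output, one naive reading is a pivot: one row per subject, one column per predicate. A sketch with illustrative triple data:

```python
import pandas as pd

triples = [
    ("#row-a", "#population", 100),
    ("#row-a", "#area", 50),
    ("#row-b", "#population", 200),
    ("#row-b", "#area", 75),
]

# one row per subject, one column per predicate
df = (
    pd.DataFrame(triples, columns=["s", "p", "o"])
      .pivot(index="s", columns="p", values="o")
)
```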

@datastep / PROV

  • Objective: Additive journal of transformations
  • Link to source script(s) URIs
  • Decorator for annotating data transformations with metadata.
  • Generate PROV metadata for data transformations
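A hypothetical @datastep decorator might simply append one record per transformation to an additive journal; the names and fields below are illustrative, and a real version would emit proper PROV terms:

```python
import functools
from datetime import datetime, timezone

PROV_JOURNAL = []  # additive journal of transformation records

def datastep(func):
    """Hypothetical decorator: record provenance for each transformation."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        PROV_JOURNAL.append({
            "activity": func.__name__,
            "args": repr(args),
            "endedAtTime": datetime.now(timezone.utc).isoformat(),
        })
        return result
    return wrapper

@datastep
def normalize(values):
    total = sum(values)
    return [v / total for v in values]

result = normalize([1, 1, 2])  # result == [0.25, 0.25, 0.5]
```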

Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)

DataCatalog

A collection of Datasets.

  • DataCatalog = {that=df1, this=df1.group().apply(), also_this=df2}
    • 'this is an aggregation of that'
      • 'this' has a URI
      • 'that' has a URI
  • What if there is no metadata for df2?

Units support

RDF Datatypes

JSON-LD (RDF in JSON)

Linked Data Primer

Linked Data Abstractions

  • Graphs are represented as triples of (s,p,o)
  • Subject, Predicate, Object
  • Queries are patterns with ?references
    • graph.triples((None, None, None))
    • SELECT ?s, ?p, ?o WHERE { ?s ?p ?o };
  • subjects are linked to objects by predicates
    • subjects and predicates are identified by URI 'keys'
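The pattern-matching idea needs no special machinery to demonstrate; None below plays the role of a SPARQL ?variable (data illustrative):

```python
def match(triples, pattern):
    """Return triples matching a (s, p, o) pattern; None is a wildcard."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

graph = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
]

match(graph, (None, "ex:knows", None))  # matches both triples
match(graph, ("ex:alice", None, None))  # matches only alice's triple
```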

URIs and URLs

  • a URI is like a URL
  • usually, we expect URLs to be 'dereferenceable' HTTP URIs
  • a URI may start with a different URI prefix
    • urn:
    • uuid:

SQL and Linked Data

  • there exist standard mappings for whole SQL tablesets
    • rdb2rdf
    • similar to application scaffolding
    • ACL support adds complexity
  • virtuoso supports SQL and RDF and SPARQL
  • rdflib-sqlalchemy maps RDF onto SQL tables
    • fairly inefficiently, when compared to native triplestores

Named Graphs

  • Quads: (g, s, p, o)
  • g: sometimes called the 'context' of a triple
  • Metadata about GRAPH ?g
  • Multiple named graphs in one file: TriX, TriG
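In quad terms, filtering on g scopes a query to a single named graph (graph names are illustrative):

```python
quads = [
    ("urn:graph:2013", "ex:a", "ex:p", "ex:b"),
    ("urn:graph:2014", "ex:a", "ex:p", "ex:c"),
]

def triples_in(quads, g):
    """Return the (s, p, o) triples belonging to named graph g."""
    return [(s, p, o) for (ctx, s, p, o) in quads if ctx == g]
```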

Linked Data Formats

  • NTriples
  • RDF/XML
    • TriX
  • Turtle, N3
    • TriG
  • JSON-LD

Choosing Schema

  • XSD, RDF, RDFS, DCTERMS
  • Which schema is most popular?
  • Which schema is a best fit for the data?
  • Which schema will search engines index for us?
  • What do the queries look like?
  • Years Later... What is OWL?
  • Why would we start with RDFS now?

Linked Data Process, Provenance, and Schema

DataSets have [implicit] URIs:

http://example.com/datasets/#<key>

Shared or published DataSets have URLs:

http://ckan.example.org/datasets/<key>

DataSets are about certain things:

e.g. URIs for #Tags, Categories, Taxonomy, Ontology

DataSets are derived from somewhere, somehow:

  • where and how was it downloaded? (digital sense)
  • how was it collected? (process control sense)

Datasets have structure:

  • Tabular, Hierarchical
  • 1D, 2D, 3D, 4D
  • Graph-based
    • Chains
    • Flows
  • Schema

5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

☆ Publish data on the Web in any format (e.g., PDF, JPEG) accompanied by an explicit Open License (expression of rights).
☆☆ Publish structured data on the Web in a machine-readable format (e.g., XML).
☆☆☆ Publish structured data on the Web in a documented, non-proprietary data format (e.g., CSV, KML).
☆☆☆☆ Publish structured data on the Web as RDF (e.g., Turtle, RDFa, JSON-LD, SPARQL).
☆☆☆☆☆ In your RDF, have the identifiers be links (URLs) to useful data sources.

https://en.wikipedia.org/wiki/Linked_Data

@ghost

ghost commented Apr 20, 2013

Hi,

Thanks for the thoroughly-researched idea-issue.
Most of the spec links you provided are exploratory in practice, and have yet (to my knowledge)
to take over the world. Controversially perhaps, I'm including RDF in that statement, even though it
has certainly gotten a lot of attention and there are real services built on top of it (Freebase,
OpenCalais, semantic search engines, and so on).

DataFrame metadata has come up again and again; please read through the (long) metadata
discussion in #2495 to catch up on some of the issues already discussed.
#3297 is planned for 0.12, but has nothing to do with RDF and has very limited scope, since
it's intended to answer a different use case. However, users would be free to
embed their own JSON schemas under .meta, so it's somewhat open-ended.

The next step after that, embedding metadata in axis labels, is interesting, but right now isn't
planned for a specific release. Although I'm sure the `quantities` users would find that useful.

IMO, it's premature to bake these specs into pandas at this point in the life of
the semantic web.
Is there a fundamental reason why all this can't be done in an auxiliary package,
on top of pandas?

That's my opinion, other devs may feel differently.

@ghost

ghost commented Apr 27, 2013

Bringing over comments made by @westurner in GH3297:

https://www.google.com/search?q=sdmx+json
http://json-stat.org

@westurner
Contributor Author

Thx.

@westurner
Contributor Author

From https://news.ycombinator.com/item?id=5657935 :

In terms of http://en.wikipedia.org/wiki/Linked_data , there are a number of standard (overlapping) URI-based schema for describing data with structured attributes:


@westurner
Contributor Author

@y-p

Most of the spec links you provided are exploratory in practice, and have yet (to my knowledge)
to take over the world. Controversially perhaps, I'm including RDF in that statement, even though it
has certainly gotten a lot of attention and there are real services built on top of it (Freebase,
OpenCalais, semantic search engines, and so on).

@dr-leo
Contributor

dr-leo commented Jul 25, 2013

I stumbled upon this proposal while looking for SDMX tools that might help read economic data from Eurostat, the OECD, IMF, BIS, and the like. So a DataFrame.to_rdf method would need to be complemented by a read_sdmx function. Well, the mentioned data providers offer CSV files as well. But the benefits of working with XML- and EDIFACT-based formats such as those described on http://sdmx.org/ are obvious.

I don't know what level of generality would be appropriate for IO of just SDMX. But it might be interesting to look at Eurostat's SDMX Reference Implementation and the other material available at https://webgate.ec.europa.eu/fpfis/mwikis/sdmx/index.php.

Starting "small" with SDMX might be appropriate to do within pandas. A more general semantic-web-focused approach can be studied at http://www.cubicweb.org.

@benjello
Contributor

It would definitely be great to be able to read data from Eurostat, the OECD, IMF, BIS using pandas !

@jreback
Contributor

jreback commented Jul 25, 2013

If someone is interested, they could follow the paradigm of pandas.io.wb.py (the World Bank dataset):
basically, wrap functions that get the data and return a frame.

@westurner
Contributor Author

read_sdmx would be great.

write_rdf would also be great. (to_triples)

TODO: re-topical-cluster globs of links in this thread. Here are three more:

@westurner
Contributor Author

It may well be easy enough to transform .meta to RDF.

The more challenging part is, IMHO, storing the procedural metadata while/in applying transforms to the Series, DataFrames, and Panels.

From a provenance and reproducibility standpoint: how do downstream users who are not reading the Python source which produced the calculations compare/review the applied data analysis methods (and findings) with RDF metadata?

[EDIT]

There should be a link to the revision id and/or version of the code in the .meta information.

@westurner
Contributor Author

General Ontology Resources:

@dr-leo
Contributor

dr-leo commented Jul 26, 2013

All this looks very interesting.

Again, I recommend a deeper dive into CubicWeb, a web framework
supporting RDF and other semantic web standards. It also implements a
SparQL-like query language called RQL. Apart from reusing some of its
core components it seems worth exploring whether in the long term
CubicWeb could be used as a web front end for admin and representation
tasks relating to datasets.

There is no doubt a lot of speculation in these statements. But we
should avoid reinventing wheels.

The pandas.io.wb.py module is child's play compared to teaching
pandas RDF. The latter goal should probably be pursued in a separate
project such as pandas-rdf or pandas-sdmx, as has been suggested before.
That said, I know nothing about the relationship between RDF and SDMX.

Writing pandas.io.eurostat.py, oecd.py, and bis.py modules along the lines of
wb.py should not be too difficult, especially if one focuses on CSV-formatted
data. Still, using SDMX could make the user's life much easier and richer.

To add both complexity and to the links collection: Some elements of the
SDMX standard build on EDIFACT. Here,

https://pypi.python.org/pypi/bots-open-source-edi-translator/version%203.0.0

could come in handy.


@westurner
Contributor Author

TODO: re-work description toward something more actionable (research -> development)


@westurner
Contributor Author

@benjello

It would definitely be great to be able to read data from Eurostat, the OECD, IMF, BIS using pandas !

Do you have a few links to API specs and or Python implementations? AFAIK there is not yet an extension API for pandas IO and remote_data providers.

[[re: try/except imports / setuptools]]

@benjello
Contributor

@westurner: I am not an expert, but I think that what is needed is a Python library to access SDMX content on the OECD, ECB, IMF, etc. data servers. I didn't find any yet. But there is plenty of documentation on SDMX.

@dr-leo
Contributor

dr-leo commented Oct 20, 2013

The reference implementation of the SDMX framework as freely available
on the Eurostat website is written in Java. I am unaware of any other
implementation.

The SDMX specification at sdmx.org is not rocket science, but covers
several hundred pages.

You may want to set up a separate project, say, PySDMX, and spend some
time to understand the reference implementation, divide it into
tractable chunks and port these to Python. PySDMX could then use pandas
as a storage backend. It could also be designed so as to easily
interface with CubicWeb and friends.

Maybe there are mailing lists on SDMX and its implementations where one
could ask related questions and reach out for potential contributors.

Leo


@jtratner
Contributor

2 comments here:

  1. I'd encourage anyone interested in connecting SDMX to pandas to either submit a PR to pandas or work on a Python reader and then submit a PR to hook the package into pandas. That's the best way to get support for SDMX into pandas, especially if the spec is 100s of pages and there's an option to load from CSV (and you'd end up with the same thing from loading SDMX into pandas vs. loading CSV into pandas).
  2. Special support for RDF is not within scope for pandas right now, both because of what @y-p said and because it's not clear how users would want to use it (particularly if this is complex enough to require query languages). I'd imagine that you could keep the RDF descriptors in a column (or whatever you need to use for comparison) and then use those descriptors to traverse after you're finished transforming the data.

If you need help hooking in an already-written SDMX reader into pandas, feel free to ask.

@jtratner
Contributor

@westurner why do you post many links with no summary of them? Not really
helpful for narrowing things down.

@westurner
Contributor Author

@jtratner
Sorry about the noise: I find it easier to get the research together (without Markdown || ReStructuredText formatting).

I am working on a more implementation-focused description for this issue. This appears to be a strong candidate for a meta-ticket, which I understand is not usually especially helpful. This may very well belong out of core (import rdflib, most likely), but this seems to be a good place to coordinate efforts. To be clear, I have no working generalized implementation of this: I have one-offs for specific datasets, and it seems wasteful. A read_rdf and a to_html5_rdfa could be so helpful.

Storing columnar RDF dataset metadata out-of-band from Series.meta and DataFrame.meta is the easiest thing to do right now.

For the meantime, for reference, above are links to SDMX and (newer, more comprehensive) RDF Data Cube Vocabulary standards.

@jtratner
Contributor

@westurner okay - that's helpful :) [and it's much more understandable if that's your process for working towards something] - btw, metadata does (kind of) have some support now (i.e., if you add a property to a subclass and list the property name in _metadata, e.g., _metadata = ['rdf'], it will generally be moved over to new objects).
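The mechanism described can be sketched as a small subclass (class name and attribute are illustrative; propagation via __finalize__ is only partial for some operations):

```python
import pandas as pd

class RDFFrame(pd.DataFrame):
    # attributes named in _metadata are (generally) carried over to
    # objects derived from this one via __finalize__
    _metadata = ["rdf"]

    @property
    def _constructor(self):
        return RDFFrame

df = RDFFrame({"x": [1, 2, 3]})
df.rdf = {"subject": "http://example.com/datasets/x"}  # illustrative URI
```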

What I don't understand from everything you've laid out is what you're looking for with read_rdf (to_html5_rdfa actually seems pretty straightforward once you know where data is stored). Are you looking to get data + the associated RDF triple with it? Or keep all of the RDF data from the file you've read in? Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame.

@westurner
Contributor Author

btw, metadata does (kind of) have some support now (i.e., if you add a property to a subclass and list the property name in _metadata, e.g., _metadata = ['rdf'], it will generally be moved over to new objects).

What I don't understand from everything you've laid out is what you're looking for with read_rdf

A read_rdf may have to be a bit more schema- and query-opinionated (i.e., read_sdmx_rdf, read_which_datacube_rdf).

(to_html5_rdfa actually seems pretty straightforward once you know where data is stored).

https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L557 (HTMLFormatter)

Are you looking to get data + the associated RDF triple with it?

Like more granular than to_triples? I can't think of a specific use case ATM, but that might also be helpful.

Or keep all of the RDF data from the file you've read in?

More so this, I think. ETL [+ documentation] -> Publish

Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame.

👍
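That naive column-of-strings approach might look like the following (the triple strings are illustrative):

```python
import pandas as pd

# store each row's raw RDF as an N-Triples string in an extra column
df = pd.DataFrame({
    "population": [100, 200],
    "rdf": [
        '<#row-a> <#population> "100" .',
        '<#row-b> <#population> "200" .',
    ],
})
```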

@westurner
Contributor Author

I have:

  • no progress to report
  • completely rewritten the description for this issue

@westurner
Contributor Author

https://github.com/mhausenblas/web.instata

Turn your plain old tabular data (POTD) into Web data with web.instata: it takes CSV as input and generates an HTML document with the data items marked up with Schema.org terms.

@westurner
Contributor Author

CSV on the Web Working Group Charter
http://www.w3.org/2013/05/lcsv-charter.html

Data on the Web Best Practices Working Group Charter
http://www.w3.org/2013/05/odbp-charter.html

5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data

@ghost

ghost commented Feb 4, 2014

@westurner , you haven't posted any new links in a while. is everything ok?

@westurner
Contributor Author

Stayin' alive. I'll close this for now?

@mmalter

mmalter commented Feb 27, 2014

I am working on exactly dr-leo's proposal for a PySDMX module. I will release something on GitHub in a few days.

We need to figure out a way to expose the keys of a time series. Should I subclass DataFrame to provide additional properties? I know pandas makes heavy use of new.

@dr-leo
Contributor

dr-leo commented Mar 1, 2014

Hi,

I am very pleased to read this and will certainly test it asap.

I am afraid I don't understand the background of your question on
exposing Timeseries keys.

Leo


@westurner
Contributor Author

  • column.name
  • column.meta.unit
  • column.meta.precision

I am now thinking that the easiest approach here -- for columnar metadata in pandas (this is an open problem with CSV and most tabular/spreadsheet formats) -- would be dataframe.meta['columns'][column_id].

As mentioned earlier, this is probably not a job for pandas, but for an external "pandas-rdf".


@westurner
Contributor Author

Added:

  • to_ CSVW
