ENH: Linked Datasets (RDF) #3402
Comments
Hi, Thanks for the thoroughly-researched idea-issue. DataFrame metadata has come up again and again, please read through the (long) metadata The next step after that, embedding metadata in axis labels is interesting, but right now isn't IMO, it's premature to bake these specs into pandas at this point in the life of That's my opinion, other devs may feel differently. |
Bringing over comments made by @westurner in GH3297 : https://www.google.com/search?q=sdmx+json |
Thx. |
From https://news.ycombinator.com/item?id=5657935 : In terms of http://en.wikipedia.org/wiki/Linked_data , there are a number of standard (overlapping) URI-based schema for describing data with structured attributes:
|
|
I stumbled upon this proposal while looking for SDMX tools that might help read economic data from Eurostat, the OECD, IMF, BIS and the like. So a DataFrame.to_rdf method would need to be complemented by a read_sdmx function. Well, the mentioned data providers offer CSV files as well. But the benefits of working with XML and EDIFACT-based formats such as described on http://sdmx.org/ are obvious. I don't know what level of generality would be appropriate to IO just SDMX. But it might be interesting to look at Eurostat's SDMX Reference Implementation and the other material available at https://webgate.ec.europa.eu/fpfis/mwikis/sdmx/index.php. Starting "small" with SDMX might be appropriate to do within pandas. A more general semantic web focused approach can be studied at http://www.cubicweb.org. |
It would definitely be great to be able to read data from Eurostat, the OECD, IMF, BIS using pandas! |
If someone is interested, they could follow the paradigm of pandas.io.wb.py (the World Bank dataset) |
TODO: re-topical-cluster globs of links in this thread. Here are three more:
|
It may well be easy enough to transform The more challenging part is, IMHO, storing the procedural metadata while/in applying transforms to the From a provenance and reproducibility standpoint: how do downstream users who are not reading the Python source which produced the calculations compare/review the applied data analysis methods (and findings) with RDF metadata? [EDIT] There should be a link to the revision id and/or version of the code in the |
General Ontology Resources: |
All this looks very interesting. Again, I recommend a deeper dive into CubicWeb, a web framework There is no doubt a lot of speculation in these statements. But we the pandas.io.wb.py module is a children's game compared to teaching Writing pandas.io.eurostat.py, oecd.py bis.py modules along the lines of To add both complexity and to the links collection: Some elements of the https://pypi.python.org/pypi/bots-open-source-edi-translator/version%203.0.0 could come in handy. On 25.07.2013 22:28, Wes Turner wrote:
|
TODO: re-work description toward something more actionable (research -> development) |
@dr-leo
|
Do you have a few links to API specs and or Python implementations? AFAIK there is not yet an extension API for pandas IO and remote_data providers. |
@westurner : I am not an expert, but I think that what is needed is a Python library to access SDMX content on the OECD, ECB, IMF etc. data servers. I didn't find any yet. But there is plenty of documentation on SDMX. |
The reference implementation of the SDMX framework as freely available The SDMX specification at sdmx.org is not rocket science, but covers You may want to set up a separate project, say, PySDMX, and spend some Maybe there are mailing lists on SDMX and its implementations where one Leo On 20.10.2013 12:01, Mahdi Ben Jelloul wrote:
|
2 comments here:
If you need help hooking in an already-written SDMX reader into pandas, feel free to ask. |
@westurner why do you post many links with no summary of them? Not really |
@jtratner I am working on a more implementation-focused description for this issue. This appears to be a strong candidate for a meta-ticket, which I do understand are not usually specifically helpful. This may very well belong out of core ( Storing columnar RDF dataset metadata out-of-band from For the meantime, for reference, above are links to SDMX and (newer, more comprehensive) RDF Data Cube Vocabulary standards. |
@westurner okay - that's helpful :) [and it's much more understandable if that's your process for working towards something] - btw, metadata does (kind of) have some support now (i.e., if you add a property to a subclass and add the property name as What I don't understand from everything you've laid out is what you're looking for with read_rdf (to_html5_rdfa actually seems pretty straightforward once you know where data is stored). Are you looking to get data + the associated RDF triple with it? Or keep all of the RDF data from the file you've read in? Seems like you could get a really naive implementation just by storing all the RDF as a big list of strings in a separate column of a DataFrame. |
A
https://github.com/pydata/pandas/blob/master/pandas/core/format.py#L557 (
Like more granular than
Moreso this, I think. ETL [+ documentation] -> Publish
👍 |
I have:
|
https://github.com/mhausenblas/web.instata
|
CSV on the Web Working Group Charter Data on the Web Best Practices Working Group Charter 5 ★ Open Data |
@westurner , you haven't posted any new links in a while. is everything ok? |
Stayin' alive. I'll close this for now? |
I am working on exactly dr-leo's proposal for a pysdmx module. I will release something on GitHub in a few days. We need to figure out a way to expose the keys of a time series. Should I subclass DataFrame to provide additional properties? I know pandas makes heavy use of __new__. |
Hi, I am very pleased to read this and will certainly test it asap. I am afraid I don't understand the background of your question on Leo On 27.02.2014 14:32, Michaël Malter wrote:
|
I am now thinking that the easiest approach here -- for columnar metadata in pandas (this is an open problem with CSV and most tabular/spreadsheet formats) -- would be As mentioned earlier, this is probably not a job for pandas; but for an external "pandas-rdf". |
Added:
|
ENH: Linked Datasets (RDF)
(UPDATE: see westurner/pandasrdf#1)
Use Case
So I:
and I want to share my findings so that others can find, review, repeat, reproduce, and verify (confirm/reject) a given conclusion.
User Story
As a data analyst, I would like to share or publish Series, DataFrames, Panels, and Panel4Ds as structured, hierarchical, RDF linked data ("DataSet").
Status Quo: Pandas IO
http://pandas.pydata.org/pandas-docs/dev/io.html
Read or parse a data format into a DataSet:
pandas.read_*
read_clipboard
read_csv
read_excel
read_fwf
read_gbq
read_hdf
read_html
read_json
read_msgpack
read_pickle
read_sql
read_stata
read_table
pandas.HDFStore
Add metadata:
Save or serialize a DataSet into a data format:
pandas.DataFrame.
to_csv
to_dict
to_excel
to_gbq
to_html
to_latex
to_panel
to_period
to_records
to_sparse
to_sql
to_stata
to_string
to_timestamp
to_wide
Share or publish a serialized DataSet with the internet:
GET/POST /container/filename.csv  # [.json|.xml|.xls|.rdf|.html]
GET/POST to /container/filename.csv
python -m SimpleHTTPServer 8088  (Python 2; on Python 3: python -m http.server 8088)
Implementation
What changes would be needed for Pandas core to support this workflow?
.meta schema
to_rdf for Series, DataFrames, Panels, and Panel4Ds
read_rdf for Series, DataFrames, Panels, and Panel4Ds
@datastep process decorators
DataSet
DataCatalog of precomputed aggregations/views/slices
(.meta?)
.meta schema
It's easy enough to serialize a dict and a table to naive RDF.
For interoperability, it would be helpful to standardize with a common
set of terms/symbols/structures/schema for describing
the tabular, hierarchical data which pandas is designed to handle.
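A minimal sketch of what "naive RDF" could look like, using only the standard library. Everything here is an assumption for illustration: the base URI, the example.org vocabulary, the shape of columns_meta, and the helper names to_ntriples and to_jsonld_context are hypothetical, and plain dicts stand in for a DataFrame and its metadata.

```python
import json

# Hypothetical base URI and per-column metadata (not a pandas API).
BASE = "http://example.org/dataset/1"
columns_meta = {
    "name": {"schema": "http://example.org/vocab#name"},
    "value": {"schema": "http://example.org/vocab#value"},
}
rows = [{"name": "a", "value": 1}, {"name": "b", "value": 2}]


def to_ntriples(rows, columns_meta, base):
    """Serialize row dicts to N-Triples: one subject per row, one triple per cell."""
    lines = []
    for i, row in enumerate(rows):
        subject = f"<{base}#row{i}>"
        for col, value in row.items():
            predicate = f"<{columns_meta[col]['schema']}>"
            # Naive: every value becomes a plain string literal, with
            # json.dumps supplying the quoting and escaping.
            literal = json.dumps(str(value))
            lines.append(f"{subject} {predicate} {literal} .")
    return "\n".join(lines)


def to_jsonld_context(columns_meta):
    """Map each column name to its term URI as a JSON-LD @context."""
    return {"@context": {col: meta["schema"] for col, meta in columns_meta.items()}}


print(to_ntriples(rows, columns_meta, BASE))
print(json.dumps(to_jsonld_context(columns_meta), indent=2))
```

The sketch drops datatypes and units entirely, which is the interoperability gap a shared schema would have to close.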
There is currently no standard method for storing columnar metadata within Pandas (e.g. in .meta['columns'][colname]['schema'], or as a JSON-LD @context).
Ontology Resources
RDFS (rdfs:)
OWL (owl:)
CSV2RDF (csvw)
W3C PROV (prov:)
schema.org (schema:)
W3C RDF Data Cube (qb:)
to_rdf
http://pandas.pydata.org/pandas-docs/dev/io.html
Arguments:
fmt
.
Series.meta
Series.to_rdf()
DataFrame.meta
DataFrame.to_rdf()
Panel.meta
Panel.to_rdf()
Panel4D.meta
Panel4D.to_rdf()
read_rdf
http://pandas.pydata.org/pandas-docs/dev/remote_data.html
Series.read_rdf()
DataFrame.read_rdf()
Panel.read_rdf()
Panel4D.read_rdf()
Arguments to read_rdf would need to describe which dimensions of data to read into 1D/2D/3D/4D form.
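As a rough illustration of the dimension-mapping problem, the sketch below pivots a list of (subject, predicate, object) triples into a 2D nested dict. The function name triples_to_table and its index argument are hypothetical and are not a proposed read_rdf signature; plain tuples stand in for parsed RDF.

```python
def triples_to_table(triples, index="subject"):
    """Pivot (s, p, o) triples into {row_key: {column_key: value}}.

    index="subject": one row per subject, one column per predicate.
    index="predicate": the transposed layout.
    """
    table = {}
    for s, p, o in triples:
        row_key, col_key = (s, p) if index == "subject" else (p, s)
        table.setdefault(row_key, {})[col_key] = o
    return table


triples = [
    ("ex:row0", "ex:name", "a"),
    ("ex:row0", "ex:value", "1"),
    ("ex:row1", "ex:name", "b"),
]
by_subject = triples_to_table(triples)
by_predicate = triples_to_table(triples, index="predicate")
```

The nested dict could then be handed to pandas.DataFrame.from_dict(by_subject, orient="index"); missing cells (here ex:row1 has no ex:value) would become NaN.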
@datastep / PROV
Ten Simple Rules for Reproducible Computational Research (3, 4, 5, 7, 8, 10)
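One possible reading of a @datastep process decorator is sketched below: it records what ran, when, and with which keyword parameters. The decorator body, the PROVENANCE_LOG list, and the record keys (loosely named after the PROV terms startedAtTime and endedAtTime) are all assumptions for illustration, not an existing pandas or PROV API.

```python
import functools
from datetime import datetime, timezone

PROVENANCE_LOG = []  # hypothetical in-memory provenance store


def datastep(func):
    """Record one provenance entry per call of the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        started = datetime.now(timezone.utc).isoformat()
        result = func(*args, **kwargs)
        PROVENANCE_LOG.append({
            "activity": func.__name__,
            "startedAtTime": started,
            "endedAtTime": datetime.now(timezone.utc).isoformat(),
            # Only keyword arguments are logged, as their repr().
            "parameters": {k: repr(v) for k, v in kwargs.items()},
        })
        return result
    return wrapper


@datastep
def scale(values, factor=1):
    """Toy transform standing in for a real analysis step."""
    return [v * factor for v in values]


scaled = scale([1, 2, 3], factor=10)
```

A fuller version would also need to capture the code revision and input dataset identifiers, which this sketch leaves out.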
DataCatalog
A collection of Datasets.
DataCatalog = dict(that=df1, this=df1.groupby(...).apply(...), also_this=df2)
Units support
RDF Datatypes
from rdflib.namespace import XSD, RDF, RDFS
from rdflib import URIRef, Literal
JSON-LD (RDF in JSON)
Linked Data Primer
Linked Data Abstractions
graph.triples((None, None, None))
SELECT ?s ?p ?o WHERE { ?s ?p ?o }
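In rdflib, graph.triples((None, None, None)) treats None as a wildcard, much like the ?s ?p ?o variables in a SPARQL pattern. The pure-Python sketch below mimics that matching over a list of tuples; the match helper is hypothetical and is not how rdflib implements it.

```python
def match(triples, pattern):
    """Yield triples matching an (s, p, o) pattern; None matches anything."""
    for triple in triples:
        if all(want is None or want == got for want, got in zip(pattern, triple)):
            yield triple


data = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:name", "Alice"),
    ("ex:b", "ex:name", "Bob"),
]

everything = list(match(data, (None, None, None)))  # all triples
about_a = list(match(data, ("ex:a", None, None)))   # fixed subject
names = list(match(data, (None, "ex:name", None)))  # fixed predicate
```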
URIs and URLs
urn:
uuid:
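For datasets that have no published URL yet, a urn:uuid: identifier is one option. The sketch below uses Python's standard uuid module; the example.org URL is a placeholder, and choosing uuid5 over the URL namespace for stable identifiers is an assumption, not something this issue prescribes.

```python
import uuid

# Random identifier: different on every run.
random_id = uuid.uuid4().urn

# Name-based identifier: the same input URL always yields the same URN.
url = "http://example.org/container/filename.csv"  # placeholder URL
stable_id = uuid.uuid5(uuid.NAMESPACE_URL, url).urn

print(random_id)
print(stable_id)
```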
SQL and Linked Data
Named Graphs
GRAPH ?g
Linked Data Formats
Choosing Schema
Linked Data Process, Provenance, and Schema
DataSets have [implicit] URIs:
Shared or published DataSets have URLs:
DataSets are about certain things:
DataSets are derived from somewhere, somehow:
Datasets have structure:
5 ★ Open Data
http://5stardata.info/
http://www.w3.org/TR/ld-glossary/#x5-star-linked-open-data
https://en.wikipedia.org/wiki/Linked_Data