Conversions for various frictionless datapackage properties #8

ghost · 2018-06-28T15:29:49Z

First PR, tracked from project put together by @Stephen-Gates and ODIQueensland.

Following latest frictionless datapackage spec:
Fixes for:

licenses (including extras)
sources (using url and organization as primary, and extras)
profile
contributors (maintainer and author, and extras)

Used references from:
#2
frictionlessdata/ckanext-datapackager#59
frictionlessdata/ckanext-datapackager#62

@amercader
I figured that once I work on datapackage upload, source organization can be checked and if a match will pull in the url.
Otherwise have tried to follow code-style and conversations with @Stephen-Gates
I've updated tests, but please let me know if more needed.
Other PRs to follow, so feedback welcome.

…d filtered for valid properties.

ghost · 2018-07-04T00:05:35Z

Hi @amercader
We're looking to make regular pull requests at least every week to fortnight for the integration work with ODIQueensland (introduced via @Stephen-Gates ) for the datapackage to dataset work (here) - please let me know anything you require from me to ensure we can maintain this momentum. Thanks in advance.

amercader · 2018-07-04T10:35:48Z

Hi @mattredbox thank you very much for this. I'm traveling this week but should be able to start reviewing this on Friday or early next week. Weekly PR sounds great

ghost · 2018-07-04T10:42:25Z

awesome thanks @amercader

ghost · 2018-07-11T01:35:14Z

Hi @amercader
FYI : There's a mapping in the PR that I will comment-out for now as it is undecided ie:
ckan source url and how to map to datapackage (I had coupled it to organisation just to have it available, but it needs more discussion I think). next PR update on its way.

amercader

@mattRedBox see initial comments. In the future please split each field / feature in a separate commit as it makes reviewing much easier. Cheers

amercader · 2018-07-13T14:19:56Z

ckan_datapackage_tools/converter.py

+        extras_dict = dict(extras)
+        result.update(_parse_profile(extras_dict))
+        _parse_extra_sources(extras_dict, sources)
+        _parse_extra_licenses(extras_dict, licenses)


AFAICT this PR does not include changes to support ingesting data packages so I don't know how are you planning to implement it, but I would assume that if a DP had two licenses in the licenses property they would be both kept in the licenses extra, regardless off whether the first is used to populate license_id and co. If that is the case (which would be my preference), this will create a duplicated license when exporting a DP.

{ "name": "my-dp", "licenses": [ {"name": "cc-zero"}, {"name": "cc-by-sa"} ] }

would generate the following CKAN dataset:

{ "name": "my-dp", "license_id": "cc-zero", "extras": [ {"key": "licenses", "value": [ {"name": "cc-zero"}, {"name": "cc-by-sa"} ] } ] }

With this code the output dp would be

{ "name": "my-dp", "licenses": [ {"name": "cc-zero"}, {"name": "cc-zero"}, {"name": "cc-by-sa"} ] }

Basically we need a check on _parse_extra_licenses() to see if a license already exists in the list (and the same approach should be followed by any property of which CKAN stores one item and DP allows many).

Does this make sense?

Hi @amercader
Yes no problem - will make more effort to align commits with field/feature.

Yes the ingest will be following shortly (sorry on leave atm but will have available soon)...

Yes I can duplicate the licenses in extras, if that follows the overall pattern you had in mind, but I'm not sure I follow why? From comments I've seen in other parts of the code I thought the idea was that, for ingest, extras would/should (eventually) simply contain what remains after relevant properties are populated? So in the ingest that follows, the 'primary license' pops the first license from the license array?

I know it makes sense purely from a ckanext-datapackager point of view to not duplicate the first license, but from a general integration point of view if other third party tools want to interact with the licenses extra they won't necessarily know that this only contains n -1 licenses and they need to go find the other one on license_id

Hi @amercader
Yep makes sense

Hi @amercader
Sorry for closing this and other issue - there's been delay/change in project requirements on our side - I'll pick this up again with you once this is established. I have some other PRs for this and ckanext-datapackager, that I will contribute regardless, but there is just a delay on our side that I want to see resolved first.

No problem @mattRedBox thanks for letting me know

amercader · 2018-07-13T14:32:02Z

ckan_datapackage_tools/converter.py


+def _parse_extra_source(extra_source):


I would expect any incoming sources property to be stored as-is, ie in the DP format, so it should be just a matter of dumping whatever is stored in the sources extra into the DP's sources property. eg sources in extras won't have name and url, they'll have whatever is on the DP spec

Yes, I've since just done the same (sorry haven't pushed up next changes yet, but will asap). Ok I'll also remove the enforcing of datapackage mandatory properties here too and leave for the validation library (so setting a datapackage source even if title is missing)?

Oh, maybe I misunderstood the issue. My point is that why the stored sources in the CKAN extra should have a name and url property if these are not part of the spec but maybe what you are trying to do is support old versions of the spec? If that't the case it makes perfect sense, apologies.

Hi @amercader
Yes that's correct - sorry I'll make this more clear next time

amercader · 2018-07-13T14:42:28Z

ckan_datapackage_tools/converter.py

+    all_sources = []
+    source = {}
+    organization = dataset_dict.get('organization')
+    if organization.get('is_organization', False):


I'm reluctant to use info from organizations on the DP output unless it can also be set on the way, when ingesting a DP. It's going to be difficult to ensure that imported DP are assigned to CKAN organizations reliably. The only thing we have to match both are the sources title property. But maybe it's worth having sources always populated even if not supporting it on the way in.

yep same as previous comment.
The only thing I'm not sure of, though, is what to do about dataset's source url.
Because the latest frictionless states that 'title' is mandatory, but if there is an existing dataset source url, surely downloaded datapackage should be capturing this somehow (as it also has 'sources'). Or is this a bigger question for discussion with @pwalsh or @rufuspollock ?
In another, related project, Data Curator, in which we capture datapackage properties, we are looking to store properties 'silently' from incoming datasets to ensure that the properties still exist on upload again to ckan, but because 'source'(s) exists in both dataset and datapackage, I thought perhaps this might need to be handled some other way? Or should the dataset source['url'] be mapped to the datapackage contributor author path? (and then vice-versa on ingest?)
For that matter, as it seems important for a dataset to have an organisation, I'm not sure there's an easy way to capture that in a datapackage or whether datapackage should capture this at all (as it has major impact on how datasets are loaded into ckan). We could:

ignore it in datapackage

map it to a single datapackage source (with only title):
-> ignoring it on ingest or
-> checking the ingest organisation matches?
Apologise if these are questions that @Stephen-Gates may have already squared away with you, but due to many conflicting commitments, unfortunately for us, Stephen has had to step back from his role with ODIQueensland, so doing what I can to fill in gaps here :P

These questions were not answered because they are tricky to answer :)

I think my preference would be to only add organizations to sources in the way out for now as this is the most robust approach. Trying to match sources to organizations on the way in just using the title seems too brittle.

Hi @amercader
Yes I agree - perhaps we could add organisation (just the title) on the way out - but I might just leave this out for now as then there's a problem with not repeating it in 'extras' on the way in with repeated imports.
I've realised since that probably the mistake I made was thinking a CKAN dataset source is/should be the same as a datapackage source. I realise now that with the new contributors, and that author_email, author_name and source_url were all being collected together, then source_url from ckan should probably also go the contributor, with role 'author' as contributor['path'] in the datapackage.

amercader · 2018-07-13T14:45:15Z

ckan_datapackage_tools/converter.py

-def _parse_extras(dataset_dict):
-    result = {}
+def _parse_extra_license(extra_license):
+    license = {}


Similarly to my comment regarding sources above, I'd expect whatever is stored in the licenses extra to be the same that the DP so it should not be necessary to parse or map it.

OK, so again just to be clear (for me ;) ), the ingest of a datapackage licenses should actually create the same in extras, but in addition, also push the first license to dataset's license, so that this first license is captured in both dataset license and extras?

Yes, that's correct. See my other comment above

amercader · 2018-07-13T14:46:11Z

ckan_datapackage_tools/converter.py

+    all_sources = []
+    source = {}
+    organization = dataset_dict.get('organization')
+    if organization.get('is_organization', False):


is_organization should always be True for dataset_dict['organization'] objects

Yep OK, thanks. wasn't sure about this as I saw the reference thought perhaps there might be cases where this is false.

Matthew Mulholland added 5 commits June 29, 2018 01:12

linted and corrected test result.

929d15c

corrected lint for error message.

1d619fb

Added tests for extras. Corrected logic so mandatory values upheld an…

47cc98c

…d filtered for valid properties.

removed debug log.

30b71ff

linted.

31fc56d

amercader reviewed Jul 13, 2018

View reviewed changes

ghost closed this Jul 18, 2018

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conversions for various frictionless datapackage properties #8

Conversions for various frictionless datapackage properties #8

ghost commented Jun 28, 2018 •

edited by ghost

Loading

ghost commented Jul 4, 2018

amercader commented Jul 4, 2018

ghost commented Jul 4, 2018

ghost commented Jul 11, 2018

amercader left a comment

amercader Jul 13, 2018

ghost Jul 13, 2018 •

edited by ghost

Loading

amercader Jul 18, 2018

ghost Jul 18, 2018

ghost Jul 18, 2018 •

edited by ghost

Loading

amercader Jul 19, 2018

amercader Jul 13, 2018

ghost Jul 13, 2018 •

edited by ghost

Loading

amercader Jul 18, 2018

ghost Jul 18, 2018

amercader Jul 13, 2018

ghost Jul 14, 2018 •

edited by ghost

Loading

amercader Jul 18, 2018

ghost Jul 18, 2018

amercader Jul 13, 2018

ghost Jul 14, 2018 •

edited by ghost

Loading

amercader Jul 18, 2018

amercader Jul 13, 2018

ghost Jul 14, 2018


		def _parse_extra_source(extra_source):

Conversions for various frictionless datapackage properties #8

Conversions for various frictionless datapackage properties #8

Conversation

ghost commented Jun 28, 2018 • edited by ghost Loading

ghost commented Jul 4, 2018

amercader commented Jul 4, 2018

ghost commented Jul 4, 2018

ghost commented Jul 11, 2018

amercader left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jul 13, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jul 18, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jul 13, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jul 14, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Jul 14, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost commented Jun 28, 2018 •

edited by ghost

Loading

ghost Jul 13, 2018 •

edited by ghost

Loading

ghost Jul 18, 2018 •

edited by ghost

Loading

ghost Jul 13, 2018 •

edited by ghost

Loading

ghost Jul 14, 2018 •

edited by ghost

Loading

ghost Jul 14, 2018 •

edited by ghost

Loading