Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversions for various frictionless datapackage properties #8

Closed
wants to merge 5 commits into from
Closed

Conversation

ghost
Copy link

@ghost ghost commented Jun 28, 2018

Hi @amercader,

First PR, tracked from project put together by @Stephen-Gates and ODIQueensland.

Following latest frictionless datapackage spec:
Fixes for:

  • licenses (including extras)
  • sources (using url and organization as primary, and extras)
  • profile
  • contributors (maintainer and author, and extras)

Used references from:
#2
frictionlessdata/ckanext-datapackager#59
frictionlessdata/ckanext-datapackager#62

@amercader
I figured that once I work on datapackage upload, source organization can be checked and if a match will pull in the url.
Otherwise have tried to follow code-style and conversations with @Stephen-Gates
I've updated tests, but please let me know if more needed.
Other PRs to follow, so feedback welcome.

@ghost
Copy link
Author

ghost commented Jul 4, 2018

Hi @amercader
We're looking to make regular pull requests at least every week to fortnight for the integration work with ODIQueensland (introduced via @Stephen-Gates ) for the datapackage to dataset work (here) - please let me know anything you require from me to ensure we can maintain this momentum. Thanks in advance.

@amercader
Copy link
Member

Hi @mattredbox thank you very much for this. I'm traveling this week but should be able to start reviewing this on Friday or early next week. Weekly PR sounds great

@ghost
Copy link
Author

ghost commented Jul 4, 2018

awesome thanks @amercader

@ghost
Copy link
Author

ghost commented Jul 11, 2018

Hi @amercader
FYI : There's a mapping in the PR that I will comment-out for now as it is undecided ie:
ckan source url and how to map to datapackage (I had coupled it to organisation just to have it available, but it needs more discussion I think). next PR update on its way.

Copy link
Member

@amercader amercader left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattRedBox see initial comments. In the future please split each field / feature in a separate commit as it makes reviewing much easier. Cheers

extras_dict = dict(extras)
result.update(_parse_profile(extras_dict))
_parse_extra_sources(extras_dict, sources)
_parse_extra_licenses(extras_dict, licenses)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT this PR does not include changes to support ingesting data packages so I don't know how are you planning to implement it, but I would assume that if a DP had two licenses in the licenses property they would be both kept in the licenses extra, regardless off whether the first is used to populate license_id and co. If that is the case (which would be my preference), this will create a duplicated license when exporting a DP.

{
  "name": "my-dp",
  "licenses": [
    {"name": "cc-zero"},
    {"name": "cc-by-sa"}
]
}

would generate the following CKAN dataset:

{
  "name": "my-dp",
  "license_id": "cc-zero",
  "extras": [
    {"key": "licenses", "value": [
      {"name": "cc-zero"},
      {"name": "cc-by-sa"}
   ]
   }
  ]
}

With this code the output dp would be

{
  "name": "my-dp",
  "licenses": [
    {"name": "cc-zero"},
    {"name": "cc-zero"},
    {"name": "cc-by-sa"}
]
}

Basically we need a check on _parse_extra_licenses() to see if a license already exists in the list (and the same approach should be followed by any property of which CKAN stores one item and DP allows many).

Does this make sense?

Copy link
Author

@ghost ghost Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @amercader
Yes no problem - will make more effort to align commits with field/feature.

Yes the ingest will be following shortly (sorry on leave atm but will have available soon)...

Yes I can duplicate the licenses in extras, if that follows the overall pattern you had in mind, but I'm not sure I follow why? From comments I've seen in other parts of the code I thought the idea was that, for ingest, extras would/should (eventually) simply contain what remains after relevant properties are populated? So in the ingest that follows, the 'primary license' pops the first license from the license array?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it makes sense purely from a ckanext-datapackager point of view to not duplicate the first license, but from a general integration point of view if other third party tools want to interact with the licenses extra they won't necessarily know that this only contains n -1 licenses and they need to go find the other one on license_id

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @amercader
Yep makes sense

Copy link
Author

@ghost ghost Jul 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @amercader
Sorry for closing this and other issue - there's been delay/change in project requirements on our side - I'll pick this up again with you once this is established. I have some other PRs for this and ckanext-datapackager, that I will contribute regardless, but there is just a delay on our side that I want to see resolved first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem @mattRedBox thanks for letting me know


def _parse_extra_source(extra_source):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect any incoming sources property to be stored as-is, ie in the DP format, so it should be just a matter of dumping whatever is stored in the sources extra into the DP's sources property. eg sources in extras won't have name and url, they'll have whatever is on the DP spec

Copy link
Author

@ghost ghost Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've since just done the same (sorry haven't pushed up next changes yet, but will asap). Ok I'll also remove the enforcing of datapackage mandatory properties here too and leave for the validation library (so setting a datapackage source even if title is missing)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, maybe I misunderstood the issue. My point is that why the stored sources in the CKAN extra should have a name and url property if these are not part of the spec but maybe what you are trying to do is support old versions of the spec? If that't the case it makes perfect sense, apologies.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @amercader
Yes that's correct - sorry I'll make this more clear next time

all_sources = []
source = {}
organization = dataset_dict.get('organization')
if organization.get('is_organization', False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reluctant to use info from organizations on the DP output unless it can also be set on the way, when ingesting a DP. It's going to be difficult to ensure that imported DP are assigned to CKAN organizations reliably. The only thing we have to match both are the sources title property. But maybe it's worth having sources always populated even if not supporting it on the way in.

Copy link
Author

@ghost ghost Jul 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep same as previous comment.
The only thing I'm not sure of, though, is what to do about dataset's source url.
Because the latest frictionless states that 'title' is mandatory, but if there is an existing dataset source url, surely downloaded datapackage should be capturing this somehow (as it also has 'sources'). Or is this a bigger question for discussion with @pwalsh or @rufuspollock ?
In another, related project, Data Curator, in which we capture datapackage properties, we are looking to store properties 'silently' from incoming datasets to ensure that the properties still exist on upload again to ckan, but because 'source'(s) exists in both dataset and datapackage, I thought perhaps this might need to be handled some other way? Or should the dataset source['url'] be mapped to the datapackage contributor author path? (and then vice-versa on ingest?)
For that matter, as it seems important for a dataset to have an organisation, I'm not sure there's an easy way to capture that in a datapackage or whether datapackage should capture this at all (as it has major impact on how datasets are loaded into ckan). We could:

  • ignore it in datapackage
  • map it to a single datapackage source (with only title):
    -> ignoring it on ingest or
    -> checking the ingest organisation matches?
    Apologise if these are questions that @Stephen-Gates may have already squared away with you, but due to many conflicting commitments, unfortunately for us, Stephen has had to step back from his role with ODIQueensland, so doing what I can to fill in gaps here :P

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These questions were not answered because they are tricky to answer :)

I think my preference would be to only add organizations to sources in the way out for now as this is the most robust approach. Trying to match sources to organizations on the way in just using the title seems too brittle.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @amercader
Yes I agree - perhaps we could add organisation (just the title) on the way out - but I might just leave this out for now as then there's a problem with not repeating it in 'extras' on the way in with repeated imports.
I've realised since that probably the mistake I made was thinking a CKAN dataset source is/should be the same as a datapackage source. I realise now that with the new contributors, and that author_email, author_name and source_url were all being collected together, then source_url from ckan should probably also go the contributor, with role 'author' as contributor['path'] in the datapackage.

def _parse_extras(dataset_dict):
result = {}
def _parse_extra_license(extra_license):
license = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to my comment regarding sources above, I'd expect whatever is stored in the licenses extra to be the same that the DP so it should not be necessary to parse or map it.

Copy link
Author

@ghost ghost Jul 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, so again just to be clear (for me ;) ), the ingest of a datapackage licenses should actually create the same in extras, but in addition, also push the first license to dataset's license, so that this first license is captured in both dataset license and extras?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's correct. See my other comment above

all_sources = []
source = {}
organization = dataset_dict.get('organization')
if organization.get('is_organization', False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_organization should always be True for dataset_dict['organization'] objects

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep OK, thanks. wasn't sure about this as I saw the reference thought perhaps there might be cases where this is false.

@ghost ghost closed this Jul 18, 2018
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant