-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversions for various frictionless datapackage properties #8
Conversation
Hi @amercader |
Hi @mattredbox thank you very much for this. I'm traveling this week but should be able to start reviewing this on Friday or early next week. Weekly PR sounds great |
awesome thanks @amercader |
Hi @amercader |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattRedBox see initial comments. In the future please split each field / feature in a separate commit as it makes reviewing much easier. Cheers
extras_dict = dict(extras) | ||
result.update(_parse_profile(extras_dict)) | ||
_parse_extra_sources(extras_dict, sources) | ||
_parse_extra_licenses(extras_dict, licenses) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT this PR does not include changes to support ingesting data packages so I don't know how are you planning to implement it, but I would assume that if a DP had two licenses in the licenses
property they would be both kept in the licenses
extra, regardless off whether the first is used to populate license_id
and co. If that is the case (which would be my preference), this will create a duplicated license when exporting a DP.
{
"name": "my-dp",
"licenses": [
{"name": "cc-zero"},
{"name": "cc-by-sa"}
]
}
would generate the following CKAN dataset:
{
"name": "my-dp",
"license_id": "cc-zero",
"extras": [
{"key": "licenses", "value": [
{"name": "cc-zero"},
{"name": "cc-by-sa"}
]
}
]
}
With this code the output dp would be
{
"name": "my-dp",
"licenses": [
{"name": "cc-zero"},
{"name": "cc-zero"},
{"name": "cc-by-sa"}
]
}
Basically we need a check on _parse_extra_licenses()
to see if a license already exists in the list (and the same approach should be followed by any property of which CKAN stores one item and DP allows many).
Does this make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amercader
Yes no problem - will make more effort to align commits with field/feature.
Yes the ingest will be following shortly (sorry on leave atm but will have available soon)...
Yes I can duplicate the licenses in extras, if that follows the overall pattern you had in mind, but I'm not sure I follow why? From comments I've seen in other parts of the code I thought the idea was that, for ingest, extras would/should (eventually) simply contain what remains after relevant properties are populated? So in the ingest that follows, the 'primary license' pops the first license from the license array?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it makes sense purely from a ckanext-datapackager point of view to not duplicate the first license, but from a general integration point of view if other third party tools want to interact with the licenses
extra they won't necessarily know that this only contains n -1 licenses and they need to go find the other one on license_id
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amercader
Yep makes sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amercader
Sorry for closing this and other issue - there's been delay/change in project requirements on our side - I'll pick this up again with you once this is established. I have some other PRs for this and ckanext-datapackager, that I will contribute regardless, but there is just a delay on our side that I want to see resolved first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem @mattRedBox thanks for letting me know
|
||
def _parse_extra_source(extra_source): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect any incoming sources
property to be stored as-is, ie in the DP format, so it should be just a matter of dumping whatever is stored in the sources
extra into the DP's sources
property. eg sources in extras won't have name
and url
, they'll have whatever is on the DP spec
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I've since just done the same (sorry haven't pushed up next changes yet, but will asap). Ok I'll also remove the enforcing of datapackage mandatory properties here too and leave for the validation library (so setting a datapackage source even if title is missing)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, maybe I misunderstood the issue. My point is that why the stored sources
in the CKAN extra should have a name
and url
property if these are not part of the spec but maybe what you are trying to do is support old versions of the spec? If that't the case it makes perfect sense, apologies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amercader
Yes that's correct - sorry I'll make this more clear next time
all_sources = [] | ||
source = {} | ||
organization = dataset_dict.get('organization') | ||
if organization.get('is_organization', False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm reluctant to use info from organizations on the DP output unless it can also be set on the way, when ingesting a DP. It's going to be difficult to ensure that imported DP are assigned to CKAN organizations reliably. The only thing we have to match both are the sources
title
property. But maybe it's worth having sources always populated even if not supporting it on the way in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep same as previous comment.
The only thing I'm not sure of, though, is what to do about dataset's source url.
Because the latest frictionless states that 'title' is mandatory, but if there is an existing dataset source url, surely downloaded datapackage should be capturing this somehow (as it also has 'sources'). Or is this a bigger question for discussion with @pwalsh or @rufuspollock ?
In another, related project, Data Curator, in which we capture datapackage properties, we are looking to store properties 'silently' from incoming datasets to ensure that the properties still exist on upload again to ckan, but because 'source'(s) exists in both dataset and datapackage, I thought perhaps this might need to be handled some other way? Or should the dataset source['url'] be mapped to the datapackage contributor author path? (and then vice-versa on ingest?)
For that matter, as it seems important for a dataset to have an organisation, I'm not sure there's an easy way to capture that in a datapackage or whether datapackage should capture this at all (as it has major impact on how datasets are loaded into ckan). We could:
- ignore it in datapackage
- map it to a single datapackage source (with only title):
-> ignoring it on ingest or
-> checking the ingest organisation matches?
Apologise if these are questions that @Stephen-Gates may have already squared away with you, but due to many conflicting commitments, unfortunately for us, Stephen has had to step back from his role with ODIQueensland, so doing what I can to fill in gaps here :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These questions were not answered because they are tricky to answer :)
I think my preference would be to only add organizations to sources in the way out for now as this is the most robust approach. Trying to match sources to organizations on the way in just using the title seems too brittle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @amercader
Yes I agree - perhaps we could add organisation (just the title) on the way out - but I might just leave this out for now as then there's a problem with not repeating it in 'extras' on the way in with repeated imports.
I've realised since that probably the mistake I made was thinking a CKAN dataset source is/should be the same as a datapackage source. I realise now that with the new contributors, and that author_email, author_name and source_url were all being collected together, then source_url from ckan should probably also go the contributor, with role 'author' as contributor['path'] in the datapackage.
def _parse_extras(dataset_dict): | ||
result = {} | ||
def _parse_extra_license(extra_license): | ||
license = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to my comment regarding sources
above, I'd expect whatever is stored in the licenses
extra to be the same that the DP so it should not be necessary to parse or map it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so again just to be clear (for me ;) ), the ingest of a datapackage licenses should actually create the same in extras, but in addition, also push the first license to dataset's license, so that this first license is captured in both dataset license and extras?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's correct. See my other comment above
all_sources = [] | ||
source = {} | ||
organization = dataset_dict.get('organization') | ||
if organization.get('is_organization', False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is_organization
should always be True for dataset_dict['organization']
objects
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep OK, thanks. wasn't sure about this as I saw the reference thought perhaps there might be cases where this is false.
Hi @amercader,
First PR, tracked from project put together by @Stephen-Gates and ODIQueensland.
Following latest frictionless datapackage spec:
Fixes for:
Used references from:
#2
frictionlessdata/ckanext-datapackager#59
frictionlessdata/ckanext-datapackager#62
@amercader
I figured that once I work on datapackage upload, source organization can be checked and if a match will pull in the url.
Otherwise have tried to follow code-style and conversations with @Stephen-Gates
I've updated tests, but please let me know if more needed.
Other PRs to follow, so feedback welcome.