-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
provide automated DOI registration #699
Conversation
35a20e6
to
e5786f4
Compare
django/library/models.py
Outdated
some fields DataCite do not want or have. | ||
""" | ||
metadata = {} | ||
codemeta = release.codemeta.metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to go through codemeta at all for any of this? Seems like it may be easier to follow if we just transform straight from the release, and use shared methods for any instances where fields may have the same transformation as codemeta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
originally we thought that all the info from codemeta will go straight into the datacite doi but found out that's not the case. i can refactor to use the release info as much as possible instead the codemeta. @alee , ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sgfost - can you elaborate on the "shared methods" and how this is done in python syntax? for example CodeMeta convert_authors method, how to share that with the DataCiteMetadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@asuworks and I had a conversation about this yesterday - there's two ways we could think about this, one is that codemeta represents our "Intermediate Representation (IR)" metadata that we use to translate to all other metadata. That would be fine if there's a non-lossy transformation between our object model metadata -> codemeta -> datacite
However the issue is if there's additional / richer metadata fields in datacite than codemeta provides that we could use but aren't available so we need to directly use our object model metadata to transform into datacite, in that case it might be better to have a direct transformation from our object model metadata -> datacite
I think we did some analysis on this in the requirements spec in the metadata crosswalk table but don't have it clear in mind at the moment, we should look at that and respond definitively here if this is the case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to answer your question though we could think of "shared methods" as functions that take our object model and convert it into a fully materialized in-memory intermediate representation (Python dictionary based or otherwise) that can be:
- used to more easily transform into codemeta or datacite
- easily tested without having to deal with making additional database queries, etc
- maybe other benefits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that, I think the Transformer
naming makes sense. Any reason to not keep the build()
/transform()
as a @classmethod
for the child classes? It makes the interface slightly simpler, not sure if there is any downside
Also, it may make more sense to directly call methods that the base transformer class provides, rather than using its transform()
which is sorta just an arbitrary collection of fields that wouldn't mean much if we, say, needed a 3rd metadata crosswalk for something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason to use a base class transform() is based on the assumption that it would generate an object which would cover (mostly) all the required fields in an acceptable format.
In a child class we would just need to "fix" the wrong or missing attributes.
@sgfost can you provide a quick code sample of your idea?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since i'm new to our code and python, i'll defer the decision to Allen and Scott. should i hold off on my refactor and work on the fabrica api until a decision is made or should i refactor using the CodebaseRelease info?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
something to this effect @asuworks, similar strategy just calling the methods from the base directly rather than building everything and picking out what we want from a dict
class SharedMetadataTransformer:
def __init__(self, release):
self.release = release
def convert_contrib_to_authors(self):
# something
pass
class CodemetaTransformer(SharedMetadataTransformer):
INITIAL_DATA = {
"@context": "http://schema.org",
# ...
}
def transform(self):
self.metadata = self.INITIAL_DATA.copy()
self.metadata.update(
# ...
codemeta_specific="something"
authors=self.convert_contrib_to_authors()
# ...
)
return self.metadata
@monaw We'll still need to identify any shared transforms and remove the datacite class's dependency on codemeta regardless of how exactly it ends up looking. Focusing on the api client in the meantime isn't a bad idea though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ideas! I could see the use of having transform() generate a Python dictionary and return it every time, but we could also still do something like DataCiteMetadata.build()
or CodeMeta.build()
and use the transformer within it. One thing that might be good to do is to cache the Python dictionary that gets built up by the transformer, to have a more stateful object (though I certainly appreciate the functional programmingness of this design).
I tend to prefer composition over inheritance so might suggest that DataCiteMetadata and CodeMeta have a MetadataTransformer instance that can convert a CodebaseRelease into a pure in-memory data structure that can then be used as faithful source material for further CodeMeta or DataCite transformations. The MetadataTransformer can then be mocked and tested without db dependencies etc.
Other thoughts?
a2b559f
to
f403211
Compare
…done, author and contributors remaining to do as well as testing; team decided to cache metadata dictionary so will hold off on refactor for now (comses.net/comses#699)
a859111
to
ed5f198
Compare
…done, author and contributors remaining to do as well as testing; team decided to cache metadata dictionary so will hold off on refactor for now (comses.net/comses#699)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, very thorough. Some thoughts from a first pass over everything.
also, I haven't fully combed through the code in all of the tasks in doi.py but I wonder if there any duplication between them that can be cleaned up?
django/library/models.py
Outdated
# FIXME: what is the difference between | ||
# CodebaseRelease.objects.filter(codebase=r.codebase).order_by("-version_number").all() | ||
# and | ||
# ordered_codebase_releases: List[CodebaseRelease] = codebase.ordered_releases() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
codebase.ordered_releases(has_change_perm=True)
should be equivalent
django/library/doi.py
Outdated
http_status = 204 | ||
message = str(dc_nce) | ||
|
||
except DataCiteBadRequestError as cd_bre: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a typo
django/library/models.py
Outdated
return ReleaseContributor.objects.authors(self.codebase_release) | ||
|
||
|
||
class CodeMetaMetadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These metadata transformers still feel a bit unwieldy or difficult to read through, especially the huge build..()
methods. One idea is dataclasses instead of the metadata = {}
and consistently using property methods instead of a mix of classmethod
getters, property
/cached_property
, and direct assignment in build()
.
Though this might be a refactor that can happen after this merge. I'll be implementing something like a GithubMetadata
rather soon so that may be a good time to revisit
django/library/doi.py
Outdated
|
||
""" | ||
RECURRENT TASKS | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of these should be available to be invoked from the management command. I also think the input() waits should be made optional so that we can schedule them as cron jobs for example
…, default publication year to this year if none, updated identifier and creators fields (comses/comses.net/comses#699)
…, default publication year to this year if none, updated identifier and creators fields (comses/comses.net/comses#699)
8a81b3e
to
ca941ec
Compare
…ode notes (issue comses/comses.net/comses#699)
the issue that i was working on was the datacite python package schema43.validate() test was failing. one thing i noticed was that the creators metadata was empty. the test code does publish() the release but yet the creators were empty. oddly during debugging, sometimes the creators will have 2 test_user entries but i didn't figure out why sometimes it was empty and sometimes it had 2...perhaps there is something about the compute_contributors() caching that i don't understand. that's as far as i got before my time ran out. i'm really sorry to leave this issue unsolved! since DataCite requires the creators metadata, i can see why schema43.validate() failed but there may be other additional reasons why the validation is failing. for more info, see DOI feature documentation |
0e29fcb
to
b35f00c
Compare
…ode notes (issue comses/comses.net/comses#699)
393075c
to
63877cb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core doi creation logic seems alright to me, the only real improvement I can think of other than minor cleanup is doing metadata updates on post_save signals rather than the complicated-seeming comparison. This should probably be done with celery though..
Otherwise, the only thing I'm confused by is the remove_dois_from_not_peer_reviewed_releases
command. If the DOI exists and points to the release, is there harm in keeping it stored, even though its sorta 'legacy' data?
@@ -2438,14 +2510,90 @@ def __str__(self): | |||
return f"[peer review] {invitation.candidate_reviewer} submitted? {self.reviewer_submitted}, recommendation: {self.get_recommendation_display()}" | |||
|
|||
|
|||
class CodeMeta: | |||
class CommonMetadata: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems like this should be a collection of helper functions/static methods instead of a built dictionary, is this a worthwhile change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One of the initial goals was to also provide an easier way to mock out the metadata tests and improve testing speed by not needing a DB but the design didn't end up that way. That's part of the reason that the CodeMeta and DataCite metadata classes were initialized with a bare dictionary but without validation it's kind of a mess. Let's discuss at or after tomorrow's dev meeting?
|
||
@classmethod | ||
def convert_platforms(cls, codebase_release: CodebaseRelease): | ||
return [tag.name for tag in codebase_release.platform_tags.all()] | ||
def from_codebase(cls, codebase: Codebase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be better if ReleaseDataciteMetadata
and CodebaseDataciteMetadata
were 2 separate sibling classes that inherit from the same schema or at least the base fields that are currently duplicated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds like a good way to separate the logic, I'll work on that while continuing to clean up the way their dictionaries are built up
Good call, though I think celery is probably overkill for this, a cron job should be fine since these aren't long-running or frequent processes. I don't think I would do it on
I think the goal is to re-mint old handle.net and legacy PIDs to be uniform though @asuworks would probably be better to clarify here... |
We wanted to get rid of legacy DOIs for consistency. There were not so many legacy DOIs in prod, as far as I remember... |
d9f7c5d
to
e73617a
Compare
remove codemeta tests entirely, will add them back in comses#699
e45893c
to
401cdc6
Compare
297ff46
to
543030f
Compare
- use datacite python client to mint DOIs - add basic scaffolding to settings / config.ini and a get_datacite_client() method to doi module - add custom setup and cleanup for property testing Co-authored-by: Anton Suharev <asuworks@users.noreply.github.com> Co-authored-by: Scott Foster <sgfost@users.noreply.github.com> Co-authored-by: Mona Wong <monaw@users.noreply.github.com>
- add DATACITE_DRY_RUN (default=true) to django settings and .env.template - check with `hasattr` before deleting cached values - make keywords case insensitive - add __init__ method to CodeMetaValidationTest to setup instance variables properly - default publication year to this year if empty for codebase release (comses/planning#146) - start to move DOI tasks to management commands Recurring Tasks: ./manage.py mint_dois --interactive --dry-run ./manage.py sync_doi_metadata --interactive --dry-run - skip parameters (interactive, dry-run), to run in production - push datacite sandbox into default settings - DataCite creator givenName, familyName, and name all must be set explicitly, the DataCite fabrica form performs an ORCID lookup to populate those fields - rename to CodeMetaSchema/DataCiteSchema, replace "Metadata" with "Schema" for something slightly shorter and better reading than CodeMetaMetadata - add two subtype classes to handle creating a DataCiteSchema from a Codebase or a CodebaseRelease. Should consider a pydantic data model in the future - quiet down exceptions for degenerate codebases w/o Licenses and return partially consistent proxy objects if the codebase is not yet published - remove ContributorAffiliation tags prefetch - remove spuriously additional build_aip from archive creation - start to refactor various Factory test mock classes - add a flag defer_fs currently only used by tests but in the future could support creating the published archive asynchronously as a scheduled task (huey / temporal / etc) - hypothesis tests were generating inconsistent results due to issues with our state generation for Codebases, CodebaseRelease, Users, etc. switching to get_or_create for now, this may have downstream effects but probably shouldn't - minor logger tuning to remove unnecessary messaging
ad9aec9
to
83a0bed
Compare
83a0bed
to
9754634
Compare
- prefix all one-off destructive DOI commands with `doi_` - add reset_staging to mint new DOIs on staging using the datacite sandbox, doi_reset_staging -> step 3, doi_mint_parent_codebase_dois - bump deps for datacite schema 4.5 and django cve
9754634
to
6c2725b
Compare
add support for automated metadata translation from our object model into the DataCite schema with eventual DOI publication via the datacite Python library
refs https://github.com/comses/planning/issues/146