Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

provide automated DOI registration #699

Merged
merged 5 commits into from
Dec 5, 2024
Merged

Conversation

alee
Copy link
Member

@alee alee commented Mar 14, 2024

add support for automated metadata translation from our object model into the DataCite schema with eventual DOI publication via the datacite Python library

refs https://github.com/comses/planning/issues/146

@alee alee force-pushed the datacite_doi_registration branch from 35a20e6 to e5786f4 Compare March 14, 2024 23:00
some fields DataCite do not want or have.
"""
metadata = {}
codemeta = release.codemeta.metadata
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to go through codemeta at all for any of this? Seems like it may be easier to follow if we just transform straight from the release, and use shared methods for any instances where fields may have the same transformation as codemeta

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

originally we thought that all the info from codemeta will go straight into the datacite doi but found out that's not the case. i can refactor to use the release info as much as possible instead the codemeta. @alee , ok?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sgfost - can you elaborate on the "shared methods" and how this is done in python syntax? for example CodeMeta convert_authors method, how to share that with the DataCiteMetadata?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@asuworks and I had a conversation about this yesterday - there's two ways we could think about this, one is that codemeta represents our "Intermediate Representation (IR)" metadata that we use to translate to all other metadata. That would be fine if there's a non-lossy transformation between our object model metadata -> codemeta -> datacite

However the issue is if there's additional / richer metadata fields in datacite than codemeta provides that we could use but aren't available so we need to directly use our object model metadata to transform into datacite, in that case it might be better to have a direct transformation from our object model metadata -> datacite

I think we did some analysis on this in the requirements spec in the metadata crosswalk table but don't have it clear in mind at the moment, we should look at that and respond definitively here if this is the case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to answer your question though we could think of "shared methods" as functions that take our object model and convert it into a fully materialized in-memory intermediate representation (Python dictionary based or otherwise) that can be:

  1. used to more easily transform into codemeta or datacite
  2. easily tested without having to deal with making additional database queries, etc
  3. maybe other benefits

Copy link
Contributor

@sgfost sgfost Mar 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that, I think the Transformer naming makes sense. Any reason to not keep the build()/transform() as a @classmethod for the child classes? It makes the interface slightly simpler, not sure if there is any downside

Also, it may make more sense to directly call methods that the base transformer class provides, rather than using its transform() which is sorta just an arbitrary collection of fields that wouldn't mean much if we, say, needed a 3rd metadata crosswalk for something

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason to use a base class transform() is based on the assumption that it would generate an object which would cover (mostly) all the required fields in an acceptable format.
In a child class we would just need to "fix" the wrong or missing attributes.
@sgfost can you provide a quick code sample of your idea?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since i'm new to our code and python, i'll defer the decision to Allen and Scott. should i hold off on my refactor and work on the fabrica api until a decision is made or should i refactor using the CodebaseRelease info?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

something to this effect @asuworks, similar strategy just calling the methods from the base directly rather than building everything and picking out what we want from a dict

class SharedMetadataTransformer:
    def __init__(self, release):
        self.release = release

    def convert_contrib_to_authors(self):
        # something
        pass

class CodemetaTransformer(SharedMetadataTransformer):
    INITIAL_DATA = {
            "@context": "http://schema.org",
            # ...
    }

    def transform(self):
        self.metadata = self.INITIAL_DATA.copy()
        self.metadata.update(
                # ...
                codemeta_specific="something"
                authors=self.convert_contrib_to_authors()
                # ...
        )
        return self.metadata

@monaw We'll still need to identify any shared transforms and remove the datacite class's dependency on codemeta regardless of how exactly it ends up looking. Focusing on the api client in the meantime isn't a bad idea though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice ideas! I could see the use of having transform() generate a Python dictionary and return it every time, but we could also still do something like DataCiteMetadata.build() or CodeMeta.build() and use the transformer within it. One thing that might be good to do is to cache the Python dictionary that gets built up by the transformer, to have a more stateful object (though I certainly appreciate the functional programmingness of this design).

I tend to prefer composition over inheritance so might suggest that DataCiteMetadata and CodeMeta have a MetadataTransformer instance that can convert a CodebaseRelease into a pure in-memory data structure that can then be used as faithful source material for further CodeMeta or DataCite transformations. The MetadataTransformer can then be mocked and tested without db dependencies etc.

Other thoughts?

@asuworks asuworks self-requested a review March 27, 2024 01:01
@asuworks asuworks marked this pull request as ready for review March 27, 2024 03:48
@alee alee force-pushed the datacite_doi_registration branch 5 times, most recently from a2b559f to f403211 Compare April 3, 2024 17:45
monaw added a commit to monaw/comses.net that referenced this pull request Apr 8, 2024
…done, author and contributors remaining to do as well as testing; team decided to cache metadata dictionary so will hold off on refactor for now (comses.net/comses#699)
@alee alee force-pushed the datacite_doi_registration branch from a859111 to ed5f198 Compare April 9, 2024 22:16
monaw added a commit to monaw/comses.net that referenced this pull request Apr 9, 2024
…done, author and contributors remaining to do as well as testing; team decided to cache metadata dictionary so will hold off on refactor for now (comses.net/comses#699)
Copy link
Contributor

@sgfost sgfost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, very thorough. Some thoughts from a first pass over everything.

also, I haven't fully combed through the code in all of the tasks in doi.py but I wonder if there any duplication between them that can be cleaned up?

# FIXME: what is the difference between
# CodebaseRelease.objects.filter(codebase=r.codebase).order_by("-version_number").all()
# and
# ordered_codebase_releases: List[CodebaseRelease] = codebase.ordered_releases()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

codebase.ordered_releases(has_change_perm=True) should be equivalent

http_status = 204
message = str(dc_nce)

except DataCiteBadRequestError as cd_bre:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a typo

return ReleaseContributor.objects.authors(self.codebase_release)


class CodeMetaMetadata:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These metadata transformers still feel a bit unwieldy or difficult to read through, especially the huge build..() methods. One idea is dataclasses instead of the metadata = {} and consistently using property methods instead of a mix of classmethod getters, property/cached_property, and direct assignment in build().

Though this might be a refactor that can happen after this merge. I'll be implementing something like a GithubMetadata rather soon so that may be a good time to revisit


"""
RECURRENT TASKS
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these should be available to be invoked from the management command. I also think the input() waits should be made optional so that we can schedule them as cron jobs for example

monaw added a commit to alee/comses.net that referenced this pull request May 25, 2024
…, default publication year to this year if none, updated identifier and creators fields (comses/comses.net/comses#699)
alee pushed a commit to alee/comses.net that referenced this pull request Jun 19, 2024
…, default publication year to this year if none, updated identifier and creators fields (comses/comses.net/comses#699)
@alee alee force-pushed the datacite_doi_registration branch 2 times, most recently from 8a81b3e to ca941ec Compare June 19, 2024 17:57
monaw added a commit to alee/comses.net that referenced this pull request Jul 18, 2024
@monaw
Copy link
Contributor

monaw commented Jul 18, 2024

the issue that i was working on was the datacite python package schema43.validate() test was failing. one thing i noticed was that the creators metadata was empty. the test code does publish() the release but yet the creators were empty. oddly during debugging, sometimes the creators will have 2 test_user entries but i didn't figure out why sometimes it was empty and sometimes it had 2...perhaps there is something about the compute_contributors() caching that i don't understand. that's as far as i got before my time ran out. i'm really sorry to leave this issue unsolved! since DataCite requires the creators metadata, i can see why schema43.validate() failed but there may be other additional reasons why the validation is failing. for more info, see DOI feature documentation

@alee alee force-pushed the datacite_doi_registration branch from 0e29fcb to b35f00c Compare August 12, 2024 21:12
alee pushed a commit to alee/comses.net that referenced this pull request Aug 12, 2024
@alee alee force-pushed the datacite_doi_registration branch from 393075c to 63877cb Compare August 13, 2024 03:04
Copy link
Contributor

@sgfost sgfost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core doi creation logic seems alright to me, the only real improvement I can think of other than minor cleanup is doing metadata updates on post_save signals rather than the complicated-seeming comparison. This should probably be done with celery though..

Otherwise, the only thing I'm confused by is the remove_dois_from_not_peer_reviewed_releases command. If the DOI exists and points to the release, is there harm in keeping it stored, even though its sorta 'legacy' data?

@@ -2438,14 +2510,90 @@ def __str__(self):
return f"[peer review] {invitation.candidate_reviewer} submitted? {self.reviewer_submitted}, recommendation: {self.get_recommendation_display()}"


class CodeMeta:
class CommonMetadata:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems like this should be a collection of helper functions/static methods instead of a built dictionary, is this a worthwhile change?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One of the initial goals was to also provide an easier way to mock out the metadata tests and improve testing speed by not needing a DB but the design didn't end up that way. That's part of the reason that the CodeMeta and DataCite metadata classes were initialized with a bare dictionary but without validation it's kind of a mess. Let's discuss at or after tomorrow's dev meeting?


@classmethod
def convert_platforms(cls, codebase_release: CodebaseRelease):
return [tag.name for tag in codebase_release.platform_tags.all()]
def from_codebase(cls, codebase: Codebase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be better if ReleaseDataciteMetadata and CodebaseDataciteMetadata were 2 separate sibling classes that inherit from the same schema or at least the base fields that are currently duplicated?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds like a good way to separate the logic, I'll work on that while continuing to clean up the way their dictionaries are built up

@alee
Copy link
Member Author

alee commented Aug 14, 2024

The core doi creation logic seems alright to me, the only real improvement I can think of other than minor cleanup is doing metadata updates on post_save signals rather than the complicated-seeming comparison. This should probably be done with celery though..

Good call, though I think celery is probably overkill for this, a cron job should be fine since these aren't long-running or frequent processes. I don't think I would do it on post_save, and instead batch it nightly.

Otherwise, the only thing I'm confused by is the remove_dois_from_not_peer_reviewed_releases command. If the DOI exists and points to the release, is there harm in keeping it stored, even though its sorta 'legacy' data?

I think the goal is to re-mint old handle.net and legacy PIDs to be uniform though @asuworks would probably be better to clarify here...

@asuworks
Copy link
Contributor

asuworks commented Aug 14, 2024

We wanted to get rid of legacy DOIs for consistency. There were not so many legacy DOIs in prod, as far as I remember...
We would need to handle them separately. Do we want this?

@alee alee force-pushed the datacite_doi_registration branch 2 times, most recently from d9f7c5d to e73617a Compare August 16, 2024 06:09
alee added a commit to sgfost/comses.net that referenced this pull request Aug 30, 2024
remove codemeta tests entirely, will add them back in comses#699
@alee alee force-pushed the datacite_doi_registration branch 3 times, most recently from e45893c to 401cdc6 Compare November 19, 2024 00:38
@alee alee force-pushed the datacite_doi_registration branch 4 times, most recently from 297ff46 to 543030f Compare November 27, 2024 09:27
alee and others added 4 commits November 27, 2024 02:29
- use datacite python client to mint DOIs
- add basic scaffolding to settings / config.ini and a get_datacite_client() method to doi module
- add custom setup and cleanup for property testing

Co-authored-by: Anton Suharev <asuworks@users.noreply.github.com>
Co-authored-by: Scott Foster <sgfost@users.noreply.github.com>
Co-authored-by: Mona Wong <monaw@users.noreply.github.com>
- add DATACITE_DRY_RUN (default=true) to django settings and .env.template
- check with `hasattr` before deleting cached values
- make keywords case insensitive
- add __init__ method to CodeMetaValidationTest to setup instance
  variables properly
- default publication year to this year if empty for codebase release (comses/planning#146)
- start to move DOI tasks to management commands

Recurring Tasks:
./manage.py mint_dois --interactive --dry-run
./manage.py sync_doi_metadata --interactive --dry-run

- skip parameters (interactive, dry-run), to run in production
- push datacite sandbox into default settings
- DataCite creator givenName, familyName, and name all must be set
  explicitly, the DataCite fabrica form performs an ORCID lookup to
  populate those fields
- rename to CodeMetaSchema/DataCiteSchema, replace "Metadata" with
  "Schema" for something slightly shorter and better reading than
  CodeMetaMetadata
- add two subtype classes to handle creating a DataCiteSchema from a
  Codebase or a CodebaseRelease. Should consider a pydantic data model
  in the future
- quiet down exceptions for degenerate codebases w/o Licenses and return
  partially consistent proxy objects if the codebase is not yet published
- remove ContributorAffiliation tags prefetch
- remove spuriously additional build_aip from archive creation
- start to refactor various Factory test mock classes
- add a flag defer_fs currently only used by tests but in the future
  could support creating the published archive asynchronously as a
  scheduled task (huey / temporal / etc)
- hypothesis tests were generating inconsistent results due to issues
  with our state generation for Codebases, CodebaseRelease, Users, etc.
  switching to get_or_create for now, this may have downstream effects
  but probably shouldn't
- minor logger tuning to remove unnecessary messaging
@alee alee force-pushed the datacite_doi_registration branch 14 times, most recently from ad9aec9 to 83a0bed Compare December 4, 2024 07:16
@alee alee force-pushed the datacite_doi_registration branch from 83a0bed to 9754634 Compare December 5, 2024 03:46
- prefix all one-off destructive DOI commands with `doi_`
- add reset_staging to mint new DOIs on staging using the datacite
  sandbox, doi_reset_staging -> step 3, doi_mint_parent_codebase_dois
- bump deps for datacite schema 4.5 and django cve
@alee alee force-pushed the datacite_doi_registration branch from 9754634 to 6c2725b Compare December 5, 2024 04:46
@alee alee merged commit 9d96cfc into comses:main Dec 5, 2024
6 of 7 checks passed
@alee alee deleted the datacite_doi_registration branch December 12, 2024 22:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants