Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Codebase git mirroring #718

Open
wants to merge 25 commits into
base: main
Choose a base branch
from
Open

Codebase git mirroring #718

wants to merge 25 commits into from

Conversation

sgfost
Copy link
Contributor

@sgfost sgfost commented May 7, 2024

part 1 (1-way mirror):

adds a button that allows model submitters to create an auto-updating, read-only git repository archive which is hosted on a central organization

additions

  • include CITATION.cff and LICENSE files in archive packages (resolves comses/planning#234)
    • CITATION.cff is translated from codemeta
    • LICENSE file is built from license text templates in the License model
  • library.fs.CodebaseGitRepositoryApi: functionality for building/updating a git repository from a Codebase
  • library.github.GithubApi: provides an interface over PyGithub for interacting with repositories on github
  • library.github.GithubRepoNameValidator: provides validate() to make sure a repo name is valid and unused
  • huey for async task processing which runs on the server container (resolves comses/planning#231)
    • mirror_codebase() and update_mirrored_codebase() huey tasks which call the CodebaseGitRepositoryApi to build the git repo on the file system and then GithubApi to create/push to the remote
  • feature overview page at /github/
  • button + form on the release detail page for mirroring a codebase

configuration steps

  1. create an app on the comses-model-library organization with the following permissions:
    • Administration: read and write
    • Contents: read and write
    • Metadata: read only
  2. generate client secret and private key
  3. install the app on the organization and add the installation id to .env
  4. add the app id, client id, and organization name to .env
  5. add the private key and client secret to secrets/

image

django/library/views.py Fixed Show fixed Hide fixed
@sgfost sgfost force-pushed the git-mirror branch 2 times, most recently from ca4ade6 to 4be9693 Compare September 11, 2024 23:08
django/library/views.py Fixed Show fixed Hide fixed
@sgfost sgfost force-pushed the git-mirror branch 2 times, most recently from 6b4e25f to 5a356fb Compare October 4, 2024 19:34
django/library/views.py Dismissed Show dismissed Hide dismissed
@sgfost sgfost marked this pull request as ready for review October 18, 2024 22:47
@sgfost sgfost force-pushed the git-mirror branch 2 times, most recently from f297ac9 to 2d48eab Compare December 5, 2024 18:44
Copy link
Member

@alee alee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great, thanks for pushing this forward @sgfost!

some minor notes on the overall structure included as inline comments. the circular dependency from tasks <-> models is tricky, not sure how to get around it without some other layer of indirection like a service layer but it's probably fine to just keep the hyperlocal import, and I tend to agree with the no-service-layer django argument https://www.b-list.org/weblog/2020/mar/16/no-service/

@@ -173,5 +173,13 @@ def initialize_test_shared_folders():
)


def clear_test_shared_folder(dir=settings.REPOSITORY_ROOT):
for fs in os.scandir(dir):
Copy link
Member

@alee alee Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this cleaner if we move to pathlib like the other file where you used iterdir()? e.g.,

folder = Path(dir)
for item in folder.iterdir():
    if item.is_file():
        item.unlink()
    elif item.is_dir():
        shutil.rmtree(item, ignore_errors=True)

django/library/fs.py Outdated Show resolved Hide resolved
django/library/github.py Outdated Show resolved Hide resolved
django/library/fs.py Outdated Show resolved Hide resolved
r=True,
)
# copy over files from the sip storage and add to the index
for file in sip_storage.list(absolute=True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would CodebaseReleaseStorage benefit from a copy_all convenience method that takes an optional callback method to be invoked on each file that's copied, or returns a list of all copied files' paths? That might simplify this to:

files = sip_storage.copy_all(dest_path)
self.repo.index.add(files)

django/library/models.py Show resolved Hide resolved
}


def release_to_codemeta(release: CodebaseRelease):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked closely at the codemeticulous pydantic data classes yet but it might be cleaner to still have wrapper DataCiteSchema and CodeMetaSchema classes that also encapsulate the schema-specific transformations needed to properly populate these data classes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, there will probably end up being more than just codemeta generation in this module so it would at least scope the functions

django/requirements.txt Outdated Show resolved Hide resolved
frontend/src/components/GithubMirrorModal.vue Outdated Show resolved Hide resolved
frontend/vite.config.ts Show resolved Hide resolved
@alee
Copy link
Member

alee commented Dec 13, 2024

looks great, thanks for pushing this forward @sgfost!

some minor notes on the overall structure included as inline comments. the circular dependency from tasks <-> models is tricky, not sure how to get around it without some other layer of indirection like a service layer but it's probably fine to just keep the hyperlocal import, and I tend to agree with the no-service-layer django argument https://www.b-list.org/weblog/2020/mar/16/no-service/

For the updated metadata situation, we could consider pushing the metadata changes as a single commit on the release branch. If the release branch is also head then it also advances the main branch, otherwise it only lives on that release branch... could get messy though

@sgfost
Copy link
Contributor Author

sgfost commented Dec 13, 2024

For the updated metadata situation, we could consider pushing the metadata changes as a single commit on the release branch. If the release branch is also head then it also advances the main branch, otherwise it only lives on that release branch... could get messy though

I think that is by far the best way to handle it. Question is whether adding release branches at all introduces too much complexity. For this (mirroring), probably not, but my thinking up until now was that it would for the fully synced mode

I may have been totally wrong though, its definitely a bit more complex but may solve more problems than it creates. I don't think we'd have to expose authors to the complexity, i.e. require that they make release branches, as the CML side could just generate one from the release tag

Sample workflow:

  • start sync from a codebase with releases 1.0.0 and 1.1.0
  • generate 1.0.0 1.1.0 and main branches
  • push to remote
  • author commits any number of times to main
  • author cuts a release on github, 2.0.0
  • CML pulls in changes, makes a release branch for 2.0.0 from the 2.0.0 tag
  • author decides to make a release on the CML now, 2.1.0 so we make a branch 2.1.0 and point main there
  • push to remote, which adds branches 2.0.0, 2.1.0, and updates main

The one major problem I can think of is handling updates to the release branches incoming from github. Metadata changes might be able to be resolved but source code, etc. should be frozen like they are in the CML, but there's no way to enforce this Solution: ignore anything that happens to release branches on the github side, only pull in new releases. Don't sync metadata from a CML release when the release branch has been added to remotely (compare hash)

* fix release ordering to sort by semantic version number rather than by
  string
this API is responsible for managing a local git repository mirror for a
comses codebase. PUBLIC release archives are commits/tags in the history

`build()` and `append_releases()` are the two main API methods which
construct (or rebuild) a git repo and add new releases to the repo,
respectively
* indicate that ordered releases method on codebase now returns a list and
  add public_releases() which returns a queryset
currently these are not retroactively inserted into archive packages
since that would require rebuilding everything. Generating git repos,
however, will add them if they are missing

** includes an experimental refactor of metadata transformations which
is used to implement the citation file format generation

resolves comses/planning#234
currently a synchronous process with 0 error handling

* add file size checking to the git repo fs api
sgfost and others added 17 commits December 20, 2024 13:52
this allows for retries regardless of where the failure occurred

the main points are:
- only build the git repo if it doesn't exist
- only create the github repo if it doesn't exist
- pushing changes that exist remotely is already handled gracefully
https://huey.readthedocs.io/

The huey service is essentially a mirror of the server with a connection
to the same db and redis service that runs the huey consumer process

**setup is currently only for development**
* now only attempt to create github releases if they don't already exist
* fixed bug where local_releases were being overwritten with only the
  last updated releases instead of adding to the set
this is considerably easier to manage than creating a clone service, not
sure if there is any potential downside to not isolating the two
processes
checks that repo names are valid and available

TODO: add a /github page for more in depth information about the feature
and a summary in the action modal
will need to really clean up metadata coversions, especially before
adding syncing but the one-way transformers idea likely won't hold up

ideas for a better approach:
- pydantic for validation/structuring
- codemeta as an intermediate format
adding these manually was an easily forgotten step that wouldn't be
noticed in dev but would fail to build in prod
mirroring strategy is not currently planned to be used with user
repositories

* use absolute url in github description
the main benefit of this, for now, is being able to update release
metadata without re-writing history, which is implemented in
`CodebaseGitRepositoryApi.update_release_branch()` and processed
asynchronously by tasks that are called by post_save signals for
Codebase and CodebaseRelease

+ renaming and other minor refactoring

[no ci]
introduces a simple scheme for caching properties of a codebase or
release that get invalidated every time the codebase or release are
updated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants