-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Codebase git mirroring #718
base: main
Are you sure you want to change the base?
Conversation
35de762
to
a12a8c1
Compare
ca4ade6
to
4be9693
Compare
6b4e25f
to
5a356fb
Compare
f297ac9
to
2d48eab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great, thanks for pushing this forward @sgfost!
some minor notes on the overall structure included as inline comments. the circular dependency from tasks <-> models is tricky, not sure how to get around it without some other layer of indirection like a service layer but it's probably fine to just keep the hyperlocal import, and I tend to agree with the no-service-layer django argument https://www.b-list.org/weblog/2020/mar/16/no-service/
@@ -173,5 +173,13 @@ def initialize_test_shared_folders(): | |||
) | |||
|
|||
|
|||
def clear_test_shared_folder(dir=settings.REPOSITORY_ROOT): | |||
for fs in os.scandir(dir): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this cleaner if we move to pathlib like the other file where you used iterdir()? e.g.,
folder = Path(dir)
for item in folder.iterdir():
if item.is_file():
item.unlink()
elif item.is_dir():
shutil.rmtree(item, ignore_errors=True)
r=True, | ||
) | ||
# copy over files from the sip storage and add to the index | ||
for file in sip_storage.list(absolute=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would CodebaseReleaseStorage benefit from a copy_all
convenience method that takes an optional callback method to be invoked on each file that's copied, or returns a list of all copied files' paths? That might simplify this to:
files = sip_storage.copy_all(dest_path)
self.repo.index.add(files)
django/library/metadata.py
Outdated
} | ||
|
||
|
||
def release_to_codemeta(release: CodebaseRelease): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't looked closely at the codemeticulous pydantic data classes yet but it might be cleaner to still have wrapper DataCiteSchema
and CodeMetaSchema
classes that also encapsulate the schema-specific transformations needed to properly populate these data classes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, there will probably end up being more than just codemeta generation in this module so it would at least scope the functions
For the updated metadata situation, we could consider pushing the metadata changes as a single commit on the release branch. If the release branch is also head then it also advances the main branch, otherwise it only lives on that release branch... could get messy though |
I think that is by far the best way to handle it. Question is whether adding release branches at all introduces too much complexity. For this (mirroring), probably not, but my thinking up until now was that it would for the fully synced mode I may have been totally wrong though, its definitely a bit more complex but may solve more problems than it creates. I don't think we'd have to expose authors to the complexity, i.e. require that they make release branches, as the CML side could just generate one from the release tag Sample workflow:
|
* fix release ordering to sort by semantic version number rather than by string
this API is responsible for managing a local git repository mirror for a comses codebase. PUBLIC release archives are commits/tags in the history `build()` and `append_releases()` are the two main API methods which construct (or rebuild) a git repo and add new releases to the repo, respectively
* indicate that ordered releases method on codebase now returns a list and add public_releases() which returns a queryset
currently these are not retroactively inserted into archive packages since that would require rebuilding everything. Generating git repos, however, will add them if they are missing ** includes an experimental refactor of metadata transformations which is used to implement the citation file format generation resolves comses/planning#234
currently a synchronous process with 0 error handling * add file size checking to the git repo fs api
this allows for retries regardless of where the failure occurred the main points are: - only build the git repo if it doesn't exist - only create the github repo if it doesn't exist - pushing changes that exist remotely is already handled gracefully
https://huey.readthedocs.io/ The huey service is essentially a mirror of the server with a connection to the same db and redis service that runs the huey consumer process **setup is currently only for development**
* now only attempt to create github releases if they don't already exist * fixed bug where local_releases were being overwritten with only the last updated releases instead of adding to the set
this is considerably easier to manage than creating a clone service, not sure if there is any potential downside to not isolating the two processes
checks that repo names are valid and available TODO: add a /github page for more in depth information about the feature and a summary in the action modal
will need to really clean up metadata coversions, especially before adding syncing but the one-way transformers idea likely won't hold up ideas for a better approach: - pydantic for validation/structuring - codemeta as an intermediate format
adding these manually was an easily forgotten step that wouldn't be noticed in dev but would fail to build in prod
mirroring strategy is not currently planned to be used with user repositories * use absolute url in github description
the main benefit of this, for now, is being able to update release metadata without re-writing history, which is implemented in `CodebaseGitRepositoryApi.update_release_branch()` and processed asynchronously by tasks that are called by post_save signals for Codebase and CodebaseRelease + renaming and other minor refactoring [no ci]
introduces a simple scheme for caching properties of a codebase or release that get invalidated every time the codebase or release are updated
part 1 (1-way mirror):
adds a button that allows model submitters to create an auto-updating, read-only git repository archive which is hosted on a central organization
additions
CITATION.cff
andLICENSE
files in archive packages (resolves comses/planning#234)CITATION.cff
is translated from codemetaLICENSE
file is built from license text templates in theLicense
modellibrary.fs.CodebaseGitRepositoryApi
: functionality for building/updating a git repository from aCodebase
library.github.GithubApi
: provides an interface overPyGithub
for interacting with repositories on githublibrary.github.GithubRepoNameValidator
: providesvalidate()
to make sure a repo name is valid and unusedhuey
for async task processing which runs on the server container (resolves comses/planning#231)mirror_codebase()
andupdate_mirrored_codebase()
huey tasks which call theCodebaseGitRepositoryApi
to build the git repo on the file system and thenGithubApi
to create/push to the remote/github/
configuration steps
comses-model-library
organization with the following permissions:.env
.env
secrets/