Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

updating metadata on all draft assets #1450

Open
satra opened this issue Jan 25, 2023 · 10 comments · May be fixed by #1545
Open

updating metadata on all draft assets #1450

satra opened this issue Jan 25, 2023 · 10 comments · May be fixed by #1545
Assignees
Labels
maintenance Action to maintain the system (neither a bugfix nor an enhancement)

Comments

@satra
Copy link
Member

satra commented Jan 25, 2023

i'm starting this issue so that we can do this within the next two weeks.

requires re-extracting metadata from all draft assets and uploading it to the draft version of the dataset. we won't touch published versions here. this is to ensure that all draft datasets are compliant with current schema.

i think this is best done at an admin process level. let's start with all nwb datasets. it would involve running a script on assets from all draft versions to update it.

@jwodder - could you please provide a dandi api-based function to re-extract metadata from a remote nwb file (i.e. a dandi asset url)? this is not getting the asset metadata but to reextract it from the nwb file itself.

cc/ @yarikoptic

@yarikoptic
Copy link
Member

Are you expecting some new metadata being extracted from nwb what can't be achieved through metadata record version upgrade?

@satra
Copy link
Member Author

satra commented Jan 30, 2023

Are you expecting some new metadata being extracted from nwb

yes. our metadata extraction routines have changed several times, so the older assets are definitely out of sync with our current metadata extraction routines. to start with we can limit updates to assets with versions lower than current.

@yarikoptic
Copy link
Member

support for that was added already in dandi-cli . I do not think it is an issue to keep in dandi-archive since I don't think code should be developed here now - just a matter of running that command/script. but I can be wrong ;)

@waxlamp
Copy link
Member

waxlamp commented Mar 13, 2023

I believe this issue was filed as a task reminder, not as something to change in the archive codebase. If this can be carried out with dandi-cli, that's fantastic.

I think we still need some clarity on exactly what to do, how to do it, and who will do it. Can we get those things decided so we can keep this moving forward?

@satra
Copy link
Member Author

satra commented Mar 13, 2023

we may need to do this via the management console:

for each draft dandiset/asset:

  • extract metadata of asset
  • lock dandiset
  • check asset still exists
  • update asset
  • unlock dandiset

since this is a update operation, i don't want the process to conflict with any upload or other actions (e.g. checksum calculation) that may be going on. this may not be something the CLI can do. also the CLI won't have admin privileges unless we use such a token.

@yarikoptic
Copy link
Member

or just finally expose dandiset locking as admin-only operation in API, and adjust already coded in implementation since it seems that we would need similar mechanism nearly for any archive wide operations which might interfere with original owner's actions.

@jjnesbitt jjnesbitt assigned jjnesbitt and unassigned yarikoptic Mar 16, 2023
@jjnesbitt
Copy link
Member

Running the numbers against the current draft assets in prod (ignoring zarrs), I believe this will involve the processing of ~700GB in order to accomplish. We might be able to bring this down if we can define which assets are considered "old" and which are okay.

@satra
Copy link
Member Author

satra commented Mar 17, 2023

how did you get that number? it shouldn't actually download all the data per asset to do this. the updated code works on cached/remote streaming to do this.

@jjnesbitt
Copy link
Member

how did you get that number?

Summing the size of all assets in draft versions, excluding zarr assets.

it shouldn't actually download all the data per asset to do this. the updated code works on cached/remote streaming to do this.

So it only needs to access a portion of the stored s3 file? I was using the nwb2asset function as shown in this example, so if that uses partial access we're good.

@satra
Copy link
Member Author

satra commented Mar 18, 2023

as long as the asset uses the as_readable mode that would work. dandi cli added a helper service script, and this is the relevant section: https://github.com/dandi/dandi-cli/blob/dc68e02eb6dfcb3b2ce82cab22bdadf1b1f31999/dandi/cli/cmd_service_scripts.py#L89

@jjnesbitt jjnesbitt linked a pull request Mar 20, 2023 that will close this issue
@waxlamp waxlamp added the maintenance Action to maintain the system (neither a bugfix nor an enhancement) label Feb 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Action to maintain the system (neither a bugfix nor an enhancement)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants