Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add yellow-fever to manifest #1083

Merged
merged 1 commit into from
Dec 16, 2024
Merged

Add yellow-fever to manifest #1083

merged 1 commit into from
Dec 16, 2024

Conversation

genehack
Copy link
Contributor

@genehack genehack commented Dec 5, 2024

Also bumps resource index revision number in testing and production configs.

Note: needs to merge semi-concurrently with the work for nextstrain/yellow-fever#25.

@genehack genehack requested a review from joverlee521 December 5, 2024 19:25
@nextstrain-bot nextstrain-bot temporarily deployed to nextstrain-s-add-yellow-zarguy December 5, 2024 19:25 Inactive
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes look good to me, but I was extremely confused why yellow-fever was not showing up at https://nextstrain-s-add-yellow-zarguy.herokuapp.com/pathogens

Turns out the current yellow-fever dataset on S3 does not have a version id

$ aws s3api list-object-versions --bucket nextstrain-data --prefix yellow-fever_meta.json
{
    "Versions": [
        {
            "ETag": "\"1390b7046efbd1f0150bbf2656206c0c\"",
            "Size": 758,
            "StorageClass": "STANDARD",
            "Key": "yellow-fever_meta.json",
            "VersionId": "null",
            "IsLatest": true,
            "LastModified": "2017-05-03T21:30:48.000Z"
        }
    ],
    "RequestCharged": null
}

So it gets filtered out during the resource index generation. Is there a way to add a version id to an S3 object or do we have to download and re-upload it?

@jameshadfield
Copy link
Member

Turns out the current yellow-fever dataset on S3 does not have a version id

(From memory) that's because it predates us using versioning in the bucket. I'd suggest uploading a current YF build (that one's from 2017) to the bucket.

@joverlee521
Copy link
Contributor

Turns out the current yellow-fever dataset on S3 does not have a version id

(From memory) that's because it predates us using versioning in the bucket. I'd suggest uploading a current YF build (that one's from 2017) to the bucket.

Ah that makes sense! If we want to retain the version history on nextstrain.org/pathogens that links back to the 2017 build, it would still need the version id though.

data/manifest_core.json Show resolved Hide resolved
data/manifest_core.json Show resolved Hide resolved
@genehack
Copy link
Contributor Author

genehack commented Dec 6, 2024

Turns out the current yellow-fever dataset on S3 does not have a version id

(From memory) that's because it predates us using versioning in the bucket. I'd suggest uploading a current YF build (that one's from 2017) to the bucket.

Ah that makes sense! If we want to retain the version history on nextstrain.org/pathogens that links back to the 2017 build, it would still need the version id though.

As this Stack Overflow post points out, the object does have a version id: it's the literal string "null" (and a pox upon the AWS engineer who decided to do this) — so, depending on how our code expects that version id to work, things might work out okay?

Also bumps resource index revision number in testing and production
configs, and updates an old comment that was not accurate now that
yellow-fever is in the manifest.
@genehack genehack force-pushed the add-yellow-fever-to-manifest branch from 26b9565 to 86a6850 Compare December 6, 2024 19:13
@genehack genehack temporarily deployed to nextstrain-s-add-yellow-zarguy December 6, 2024 19:13 Inactive
@jameshadfield
Copy link
Member

Ah that makes sense! If we want to retain the version history on nextstrain.org/pathogens that links back to the 2017 build, it would still need the version id though

Only Amazon S3 generates version IDs, and they cannot be edited, and similarly for last modification date. So we can re-upload, and it'll get a (new) version ID, but then it'll look like a 2024 dataset to the resource indexer. If you really want to show this as from 2017 you could extend the indexer to allow a "special cases" mapping of versionId to date.

@joverlee521
Copy link
Contributor

As this Stack Overflow post points out, the object does have a version id: it's the literal string "null" (and a pox upon the AWS engineer who decided to do this) — so, depending on how our code expects that version id to work, things might work out okay?

Ah yes, the distinction between null and "null" 🤦‍♀️

I forgot the resource indexer doesn't check the actual S3 objects. It pulls metadata from the S3 inventory, which reports the "null" version id as an empty string "" in the third column of the CSV:

"nextstrain-data","yellow-fever_meta.json","","true","false","2017-05-03T21:30:48.000Z","1390b7046efbd1f0150bbf2656206c0c"

The empty string matches the falsy check and gets filtered out of the resource index.

So we can re-upload, and it'll get a (new) version ID, but then it'll look like a 2024 dataset to the resource indexer. If you really want to show this as from 2017 you could extend the indexer to allow a "special cases" mapping of versionId to date.

@genehack how important is it to show the dataset is from 2017 in the snapshots chart?

@genehack
Copy link
Contributor Author

genehack commented Dec 6, 2024

Ah yes, the distinction between null and "null" 🤦‍♀️

You love hate to see it, right?

So we can re-upload, and it'll get a (new) version ID, but then it'll look like a 2024 dataset to the resource indexer. If you really want to show this as from 2017 you could extend the indexer to allow a "special cases" mapping of versionId to date.

@genehack how important is it to show the dataset is from 2017 in the snapshots chart?

I mean, it would certainly be nice to do so — I wonder if the user metadata thing mentioned in that SO post would be a way to get a version ID without changing the last modified date…

I may experiment with that over the weekend…

@genehack
Copy link
Contributor Author

genehack commented Dec 9, 2024

@genehack how important is it to show the dataset is from 2017 in the snapshots chart?

I mean, it would certainly be nice to do so — I wonder if the user metadata thing mentioned in that SO post would be a way to get a version ID without changing the last modified date…

I may experiment with that over the weekend…

So, I have mucked around with this in a test bucket, and done some more web searching, and TL;DR I don't see any way to get a version ID on that that S3 object that doesn't also change the last-modified date. I will chat with @trvrb to confirm this, but I suspect if the two options are "display the old dataset but with a modern date" and "don't display the old dataset at all", the latter is better.

I wonder if the resource indexer could be updated so that it didn't turn the "null" string into the empty string, and instead passed on it on through into the CSV? @joverlee521, thoughts?

@joverlee521
Copy link
Contributor

I wonder if the resource indexer could be updated so that it didn't turn the "null" string into the empty string, and instead passed on it on through into the CSV?

The CSV is generated by AWS S3 inventory so we don't have control over the empty string version id. I found some docs on converting the empty strings to null strings, but it seems like a manual operation not a config setting.

@genehack
Copy link
Contributor Author

The CSV is generated by AWS S3 inventory so we don't have control over the empty string version id. I found some docs on converting the empty strings to null strings, but it seems like a manual operation not a config setting.

womp womp

@tsibley
Copy link
Member

tsibley commented Dec 11, 2024

If we want to preserve that original yellow-fever version, we can make our indexer not filter out objects with VersionId == "" in the S3 inventory.

@genehack
Copy link
Contributor Author

If we want to preserve that original yellow-fever version, we can make our indexer not filter out objects with VersionId == "" in the S3 inventory.

…which will also surface all the other things that are currently dropped due to being old/unversioned. That feels like the pull tab on a can of worms?

@genehack
Copy link
Contributor Author

After reflection and much discussion, the consensus (which I agree with) was that surfacing that old data was not going to be worth the trouble. Going ahead and merging this; will enable yellow-fever automation and fire off a build next.

@genehack genehack merged commit a0cde63 into master Dec 16, 2024
8 checks passed
@genehack genehack deleted the add-yellow-fever-to-manifest branch December 16, 2024 23:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants