Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

hotfix to ensure sample-names are always strings #49

Merged

Conversation

charles-cowart
Copy link
Contributor

No description provided.

@charles-cowart
Copy link
Contributor Author

Confirmed that Qiita dependency is failing to install due to codecov dependency having been removed from Pipy/pip:
home-assistant/core#91283. Looking into a solution.

@charles-cowart
Copy link
Contributor Author

Codecov updated to use their new native binary. The stdout from the codecov step seems to imply that codecov isn't fully set up. Looking at the earlier version using the v1 shell script, it appears this was also the case so this PR won't be a step backwards. I will make it a low-priority issue in GitHub to get codecov running for this project.

@charles-cowart
Copy link
Contributor Author

Hi @wasade! If you don't mind reviewing this since Antonio's still out, I'd greatly appreciate it, thanks! No rush.

qp_klp/process_amplicon_job.py Outdated Show resolved Hide resolved
delimiter='\t',
index_col='sample_name').to_dict(
'index')
index_col='sample_name')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

index_col does not interact as expected with dtype. This will still incur a type conversion. Converting it to str after the fact is lossy. Please set remove, and set an explicit index with set_index after parse

>>> df = pd.read_csv('clinical_metadata_formatted.tsv', sep='\t', dtype=str, index_col='sample-id')
>>> df.index[:5]
Float64Index([10317.000106498, 10317.000107092, 10317.000106499,
              10317.000107093, 10317.000106684],
             dtype='float64', name='sample-id')

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wasade I can see it - it didn't happen in the case that spawned this hot fix but in general if the index column gets interpreted as a numeric directly instead of getting casted to string first with the other columns before then being converted when it becomes the index, you would get potential data loss.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're in agreement. Please do not allow the index column to be cast to numeric.

qp_klp/process_metagenomics_job.py Outdated Show resolved Hide resolved
qp_klp/process_metagenomics_job.py Outdated Show resolved Hide resolved
Copy link
Contributor

@wasade wasade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a regression test that asserts that data from numeric-like index values is not lost. Specifically, construct a file to parse, which at a minimum contains "123.000" and "1e-3" as sample IDs, and assert any relevant parsing logic does not alter these IDs

@charles-cowart
Copy link
Contributor Author

@wasade all tests, including the new one are passing. Ready for review!

Copy link
Contributor

@wasade wasade left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay thanks!

@charles-cowart charles-cowart merged commit 6a2cbcb into qiita-spots:main Apr 19, 2023
@charles-cowart charles-cowart deleted the sample_names_are_strings_hotfix branch April 19, 2023 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants