-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
hotfix to ensure sample-names are always strings #49
hotfix to ensure sample-names are always strings #49
Conversation
Confirmed that Qiita dependency is failing to install due to codecov dependency having been removed from Pipy/pip: |
Codecov updated to use their new native binary. The stdout from the codecov step seems to imply that codecov isn't fully set up. Looking at the earlier version using the v1 shell script, it appears this was also the case so this PR won't be a step backwards. I will make it a low-priority issue in GitHub to get codecov running for this project. |
Hi @wasade! If you don't mind reviewing this since Antonio's still out, I'd greatly appreciate it, thanks! No rush. |
qp_klp/process_amplicon_job.py
Outdated
delimiter='\t', | ||
index_col='sample_name').to_dict( | ||
'index') | ||
index_col='sample_name') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
index_col
does not interact as expected with dtype
. This will still incur a type conversion. Converting it to str
after the fact is lossy. Please set remove, and set an explicit index with set_index
after parse
>>> df = pd.read_csv('clinical_metadata_formatted.tsv', sep='\t', dtype=str, index_col='sample-id')
>>> df.index[:5]
Float64Index([10317.000106498, 10317.000107092, 10317.000106499,
10317.000107093, 10317.000106684],
dtype='float64', name='sample-id')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wasade I can see it - it didn't happen in the case that spawned this hot fix but in general if the index column gets interpreted as a numeric directly instead of getting casted to string first with the other columns before then being converted when it becomes the index, you would get potential data loss.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're in agreement. Please do not allow the index column to be cast to numeric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add a regression test that asserts that data from numeric-like index values is not lost. Specifically, construct a file to parse, which at a minimum contains "123.000" and "1e-3" as sample IDs, and assert any relevant parsing logic does not alter these IDs
@wasade all tests, including the new one are passing. Ready for review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay thanks!
No description provided.