Fix split_fasta.py bug and improve runtime #175

Merged
merged 3 commits into from
Mar 24, 2021

Conversation

skrakau
Member

@skrakau skrakau commented Mar 24, 2021

There were two problems with the split_fasta.py script:

Bug:

  • sort_values() was applied to df, but the result was never stored, so the sort had no effect. I added inplace=True.
  • Even after sorting, the index still refers to the positions before sorting. The script stops writing sequences to individual files once the index reaches max_sequences, but the first max_sequences entries by original position were not the longest and did not necessarily have a length >= length_threshold. Thus sequences were missed. To address this, I reset the index.
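The two bug points can be reproduced on a toy DataFrame. This is only a minimal sketch: the data, the column names "id" and "length", and the final loop are assumptions made for illustration and are not taken from split_fasta.py.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions, not those of split_fasta.py.
df = pd.DataFrame({"id": ["a", "b", "c", "d"], "length": [500, 3000, 1200, 4000]})

# Without inplace=True (or reassignment) the sorted result is simply discarded.
df.sort_values(by="length", ascending=False)
unsorted_ids = df["id"].tolist()  # still in the original order

# Sorting in place reorders the rows but keeps the old index labels.
df.sort_values(by="length", ascending=False, inplace=True)
stale_index = df.index.tolist()  # labels still refer to the pre-sort positions

# Resetting the index makes labels 0..n-1 follow the sorted order.
df.reset_index(drop=True, inplace=True)

max_sequences = 2
# With the reset index, "index < max_sequences" selects the longest sequences.
longest_ids = [row["id"] for index, row in df.iterrows() if index < max_sequences]
```

Without the reset, a check such as index < max_sequences would pick rows by their original position rather than by length, which matches the missed sequences described above.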

Runtime (see #166):

  • sort_values() was applied to all sequences (O(n log n)). This is not necessary, since all sequences below the length_threshold can be separated out first. Only the sequences >= length_threshold then need to be sorted: the longest max_sequences are kept and the remaining ones are added to the pooled list.
  • The file with the unbinned sequences, which previously required > 300h, now runs through in a couple of minutes.
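The filter-then-sort idea can be sketched as follows. This is a dependency-free illustration in plain Python rather than the pandas code of the actual script; the function name split_by_length and the dict-based input are hypothetical.

```python
def split_by_length(lengths, length_threshold, max_sequences):
    """Return (ids to write individually, ids to pool).

    lengths is a hypothetical dict mapping sequence id -> sequence length.
    """
    # One linear pass separates short sequences; they never enter the sort.
    candidates = [(n, sid) for sid, n in lengths.items() if n >= length_threshold]
    pooled = [sid for sid, n in lengths.items() if n < length_threshold]

    # Only the sequences at or above the threshold are sorted, longest first.
    candidates.sort(reverse=True)

    # The longest max_sequences are kept; the rest join the pooled list.
    individual = [sid for _, sid in candidates[:max_sequences]]
    pooled.extend(sid for _, sid in candidates[max_sequences:])
    return individual, pooled


individual, pooled = split_by_length(
    {"a": 500, "b": 3000, "c": 1200, "d": 4000, "e": 2500},
    length_threshold=1000,
    max_sequences=2,
)
```

The sort now costs O(k log k) for the k sequences above the threshold instead of O(n log n) for all n, which helps most when the majority of sequences are short.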

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@skrakau skrakau requested a review from d4straub March 24, 2021 10:02
@github-actions

github-actions bot commented Mar 24, 2021

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit e43c974

  • ✅ 117 tests passed
  • ❔ 4 tests were ignored
  • ❗ 4 tests had warnings

❗ Test warnings:

❔ Tests ignored:

  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_dev.yml
  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_release.yml
  • conda_env_yaml - No environment.yml file found - skipping conda_env_yaml test
  • conda_dockerfile - No environment.yml / Dockerfile file found - skipping conda_dockerfile test

✅ Tests passed:

Run details

  • nf-core/tools version 1.13.2
  • Run at 2021-03-24 10:17:05

Collaborator

@d4straub d4straub left a comment

LGTM

@skrakau skrakau merged commit 7cfc745 into nf-core:dev Mar 24, 2021
@skrakau skrakau deleted the improve_splitting_unbinned branch May 31, 2021 13:40
2 participants