Implement spectra extraction workflow using pyTDFSDK with parallelization #26

Open
wants to merge 18 commits into
base: main

Conversation

Contributor

@alex-l-kong alex-l-kong commented Oct 15, 2024

What is the purpose of this PR?

Using SCiLS to extract spectral information is a long and cumbersome process, and we have no control over its development. To expedite this, we implement our own extraction functionality in Python.

How did you implement your changes?

Initially, timsconvert was looked into as a solution. There were a few problems with this approach:

  1. Poor support for multiple runs
  2. Lack of support for normalization techniques (ex. TIC)
  3. Inconsistent .imzml file definitions
  4. Sub-optimal pyimzml memory usage

To address the issues with SCiLS and timsconvert, we use ProcessPoolExecutor from the concurrent.futures module. Its primary benefit is parallel computation, which significantly reduces the time required to extract each spectrum, a step that would otherwise run serially on a per-run, per-spot basis. With a 16-core machine, we could parallelize the extraction of 5-10 runs at a time.
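As a rough illustration of the parallelization pattern, here is a minimal sketch that submits one task per run to a ProcessPoolExecutor. The names `extract_run` and `extract_all`, the example run paths, and the placeholder work inside the worker are all hypothetical and are not taken from this PR; the one-process-per-run split is likewise an assumption.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed


def extract_run(run_path: str) -> tuple[str, int]:
    """Hypothetical per-run worker standing in for the real spot-by-spot extraction."""
    # Placeholder work so the sketch runs anywhere; real code would open the run
    # and read every spot's spectrum here.
    n_spots = len(run_path)
    return run_path, n_spots


def extract_all(run_paths: list[str], max_workers: int = 8) -> dict[str, int]:
    """Process several runs in parallel, submitting one task per run."""
    results: dict[str, int] = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its run so failures can be traced to a run.
        futures = {pool.submit(extract_run, path): path for path in run_paths}
        for future in as_completed(futures):
            run_path, n_spots = future.result()
            results[run_path] = n_spots
    return results


if __name__ == "__main__":
    # The __main__ guard is required on Windows, where worker processes are spawned.
    print(extract_all(["run_a.d", "run_bb.d"], max_workers=2))
```

Capping `max_workers` below the core count leaves headroom for memory, since each worker would hold a full run's spectra at once.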

One challenge is reproducing the bins SCiLS uses to define the m/z peaks. SCiLS bins the data to reduce the number of m/z datapoints used for downstream analysis. However, because we extract only raw spectra, we need to do the following (NOTE: both of these have been addressed):

  1. Implement TIC normalization
  2. Reverse engineer a binning function that approximates the peaks.
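For reference, TIC (total ion current) normalization divides every intensity in a spectrum by the sum of that spectrum's intensities. The sketch below is a generic version of that technique; the function name `tic_normalize` and the handling of all-zero spectra are assumptions, not the implementation in this PR.

```python
def tic_normalize(intensities: list[float]) -> list[float]:
    """Scale a spectrum so its intensities sum to 1 (TIC normalization)."""
    tic = sum(intensities)
    if tic == 0:
        # Assumption: an empty or all-zero spectrum is returned unchanged
        # rather than raising a division-by-zero error.
        return list(intensities)
    return [value / tic for value in intensities]
```

For example, `tic_normalize([1.0, 3.0])` returns `[0.25, 0.75]`.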

The binning function is run using `raw_mz / 200 / 1000` to define the endpoints. This closely matches what SCiLS does, since the conversion happens to mDa.
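One possible reading of this rule is sketched below. It assumes `raw_mz / 200 / 1000` is a half-width applied symmetrically around a peak's m/z, and that a bin's value is the sum of the raw intensities falling inside it. Both assumptions, along with the names `bin_endpoints` and `bin_intensity`, are illustrative only and may differ from the actual binning function.

```python
def bin_endpoints(center_mz: float) -> tuple[float, float]:
    """Return (low, high) bin endpoints around a peak.

    Assumption: raw_mz / 200 / 1000 is used as a symmetric half-width.
    """
    half_width = center_mz / 200 / 1000
    return center_mz - half_width, center_mz + half_width


def bin_intensity(
    mz_values: list[float], intensities: list[float], center_mz: float
) -> float:
    """Sum the raw intensities whose m/z falls inside the bin for center_mz."""
    low, high = bin_endpoints(center_mz)
    return sum(
        intensity
        for mz, intensity in zip(mz_values, intensities)
        if low <= mz <= high
    )
```

Under these assumptions, a peak at m/z 200 gets a half-width of 0.001, so the bin spans roughly 199.999 to 200.001.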

The pyTDFSDK library provides a simple connection/cursor workflow that allows us to easily query each spot for its corresponding spectrum.
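The pyTDFSDK calls themselves are not shown in this description, so the sketch below uses Python's standard sqlite3 module purely as a stand-in for the connection/cursor pattern. The in-memory database, the `spots` table, its columns, and the spot names are all hypothetical.

```python
import sqlite3

# In-memory stand-in database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE spots (frame INTEGER PRIMARY KEY, spot_name TEXT)")
cursor.executemany(
    "INSERT INTO spots (frame, spot_name) VALUES (?, ?)",
    [(1, "R00X001Y001"), (2, "R00X002Y001")],
)
conn.commit()


def frames_for_spots(cur: sqlite3.Cursor) -> dict[str, int]:
    """Map each spot name to the frame that holds its spectrum."""
    cur.execute("SELECT spot_name, frame FROM spots ORDER BY frame")
    return dict(cur.fetchall())


spot_to_frame = frames_for_spots(cursor)
conn.close()
```

With a spot-to-frame mapping like this in hand, each spot's spectrum could then be read by frame, which is the step the real library would handle.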

Remaining issues

pyTDFSDK is a Windows-only package, meaning it will not be possible to test on macOS.

Aggregating spots across different runs is still a challenge because de-identification is needed. This remains a work in progress.

@alex-l-kong alex-l-kong self-assigned this Oct 15, 2024

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@alex-l-kong alex-l-kong changed the title from "Implement workflow using pyTDFSDK and dask" to "Implement spectra extraction workflow using pyTDFSDK and dask" Oct 15, 2024
@coveralls

coveralls commented Oct 15, 2024

Pull Request Test Coverage Report for Build 13041160211

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 75 (0.0%) changed or added relevant lines in 1 file are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-23.0%) to 73.883%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| --- | --- | --- | --- |
| src/maldi_tools/load_maldi_data.py | 0 | 75 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| src/maldi_tools/extraction.py | 1 | 99.38% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 6777233943: | -23.0% |
| Covered Lines: | 215 |
| Relevant Lines: | 291 |

💛 - Coveralls

@alex-l-kong alex-l-kong changed the title from "Implement spectra extraction workflow using pyTDFSDK and dask" to "Implement spectra extraction workflow using pyTDFSDK with parallelization" Nov 11, 2024
2 participants