Implement spectra extraction workflow using `pyTDFSDK` with parallelization #26

alex-l-kong · 2024-10-15T20:41:58Z

What is the purpose of this PR?

Extracting spectra with SCiLS is slow and cumbersome, and we have no control over its development. To speed this up, this PR implements our own extraction workflow in Python.

How did you implement your changes?

Initially, we looked into timsconvert as a solution. There were a few problems with this approach:

- Poor support for multiple runs
- Lack of support for normalization techniques (e.g., TIC)
- Inconsistent .imzml file definitions
- Sub-optimal pyimzml memory usage

To address the issues with SCiLS and timsconvert, we use ProcessPoolExecutor from the concurrent.futures module. Its main benefit is parallel computation, which significantly reduces the time required to extract the spectra; extraction would otherwise happen serially on a per-run, per-spot basis. On a 16-core machine, we could parallelize the extraction of 5-10 runs at a time.
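A minimal sketch of the per-run parallelization described above, assuming one worker call per run. The names `extract_run` and `extract_all_runs` are hypothetical and are not taken from this PR; `extract_run` is a placeholder for the real per-run extraction. Setting `max_workers=1` falls back to a serial loop, which is convenient for debugging.

```python
from concurrent.futures import ProcessPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def extract_run(run_path: str) -> int:
    """Hypothetical placeholder for the real per-run extraction.

    The real worker would open the run and read every spot's spectrum;
    here it only returns the length of the path so the sketch is runnable.
    """
    return len(run_path)


def extract_all_runs(
    run_paths: Iterable[str],
    worker: Callable[[str], int] = extract_run,
    max_workers: int = 4,
) -> Dict[str, int]:
    """Run `worker` once per run path, one process per run when max_workers > 1."""
    paths = list(run_paths)
    if max_workers <= 1:
        # Serial fallback: same results, no process pool involved.
        return {path: worker(path) for path in paths}

    results: Dict[str, int] = {}
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the run it belongs to.
        futures = {pool.submit(worker, path): path for path in paths}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

When using more than one worker, the worker must be a module-level function so it can be pickled, and on Windows the call to `extract_all_runs` should sit under an `if __name__ == "__main__":` guard.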

One challenge in matching SCiLS is the binning it uses to define the m/z peaks, which reduces the number of m/z data points used for downstream analysis. Because our workflow extracts only raw spectra, we need to do the following ourselves (NOTE: both of these have been addressed):

- Implement TIC normalization
- Reverse-engineer a binning function that approximates the peaks
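For reference, standard TIC (total ion current) normalization divides each intensity in a spectrum by the sum of that spectrum's intensities. The sketch below shows that textbook form only; the function name `tic_normalize` is hypothetical, and the SCiLS-based variant added in this PR may differ in detail. Returning zeros for a zero-sum spectrum is one possible way to avoid a division by zero.

```python
from typing import List, Sequence


def tic_normalize(intensities: Sequence[float]) -> List[float]:
    """Divide each intensity by the spectrum's total ion current (sum of intensities)."""
    total = float(sum(intensities))
    if total == 0.0:
        # Assumption: an empty pixel stays all-zero instead of raising.
        return [0.0 for _ in intensities]
    return [value / total for value in intensities]
```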

~~The binning function uses `raw_mz / 200 / 1000` to define the bin endpoints. This closely matches what SCiLS does, since the conversion is to mDa.~~
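A bin width that is a fixed fraction of the m/z value gives geometrically (logarithmically) spaced bin edges. The sketch below illustrates that general idea only: the functions `log_bin_edges` and `bin_spectrum` are hypothetical, the edges are parameterized by a bin count rather than by the constant in the struck-through note, and nothing here is claimed to reproduce the exact SCiLS scheme.

```python
import bisect
from typing import List, Sequence


def log_bin_edges(mz_min: float, mz_max: float, num_bins: int) -> List[float]:
    """Return num_bins + 1 edges spaced by a constant ratio between mz_min and mz_max."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    if mz_min <= 0 or mz_max <= mz_min:
        raise ValueError("require 0 < mz_min < mz_max")
    ratio = (mz_max / mz_min) ** (1.0 / num_bins)
    return [mz_min * ratio**k for k in range(num_bins + 1)]


def bin_spectrum(
    mz: Sequence[float], intensity: Sequence[float], edges: Sequence[float]
) -> List[float]:
    """Sum the intensities that fall into each bin; values outside the edges are dropped."""
    binned = [0.0] * (len(edges) - 1)
    for m, value in zip(mz, intensity):
        if m < edges[0] or m > edges[-1]:
            continue
        # bisect_right places a value equal to an inner edge in the bin to its right;
        # the min() keeps a value equal to the last edge in the last bin.
        index = min(bisect.bisect_right(edges, m) - 1, len(binned) - 1)
        binned[index] += value
    return binned
```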

The pyTDFSDK library provides a simple connection/cursor workflow that allows us to easily query each spot for its corresponding spectrum.
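The connection/cursor pattern can be illustrated with Python's standard `sqlite3` module alone. This is a sketch under stated assumptions: the table name `MaldiFrameInfo`, its columns, and the spot names are illustrative stand-ins built in memory, not a confirmed schema, and the actual spectrum read through pyTDFSDK is not shown.

```python
import sqlite3
from typing import List, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory database with an assumed, illustrative spot table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE MaldiFrameInfo (Frame INTEGER PRIMARY KEY, SpotName TEXT)"
    )
    conn.executemany(
        "INSERT INTO MaldiFrameInfo VALUES (?, ?)",
        [(1, "R00X001Y001"), (2, "R00X002Y001")],
    )
    conn.commit()
    return conn


def list_spots(conn: sqlite3.Connection) -> List[Tuple[int, str]]:
    """Use a cursor to fetch (frame id, spot name) pairs in frame order."""
    cursor = conn.cursor()
    cursor.execute("SELECT Frame, SpotName FROM MaldiFrameInfo ORDER BY Frame")
    return cursor.fetchall()
```

Each returned frame id would then be passed to the library's spectrum-reading call, one spot at a time.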

Remaining issues

pyTDFSDK is a Windows-only package, meaning it will not be possible to test on macOS.

Aggregating spots across different runs is still a challenge because de-identification is needed. This remains a work in progress.

review-notebook-app · 2024-10-15T20:42:04Z

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.

Powered by ReviewNB

coveralls · 2024-10-15T20:49:45Z

Pull Request Test Coverage Report for Build 13041160211

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 75 (0.0%) changed or added relevant lines in 1 file are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage decreased (-23.0%) to 73.883%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| src/maldi_tools/load_maldi_data.py | 0 | 75 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| src/maldi_tools/extraction.py | 1 | 99.38% |

Totals
Change from base Build 6777233943:	-23.0%
Covered Lines:	215
Relevant Lines:	291

💛 - Coveralls

New dask-based workflow

05ad9ce

alex-l-kong self-assigned this Oct 15, 2024

alex-l-kong changed the title ~~Implement workflow using pyTDFSDK and dask~~ Implement spectra extraction workflow using pyTDFSDK and dask Oct 15, 2024

alex-l-kong added 15 commits October 24, 2024 11:00

Smorgasbord of changes, all of which are failed parallelization attempts...

5e8622d

Simplify workflow to single-threaded

63f8ccd

Fix binned_mz indexing scheme

6f89473

Reformatting

dcec39e

Full multiprocessing support added

87e5a68

OCD formatting

9cbd2e3

Clarify comment about the path to the binary file

1d72c30

Upload simplified copy of the maldi-load.ipynb notebook

a4592bd

Add TIC normalization

016f631

Allow option for TIC normalization for testing and flexibility

69d32e1

Use logarithmic binning method from SCiLS

ab42f60

Binning method fully finalized

3b14661

Add error check for num_bins

d02340d

Properly handle pixels with zero-sum intensity

e164243

Add SCiLS-based TIC normalization

789312c

alex-l-kong changed the title ~~Implement spectra extraction workflow using pyTDFSDK and dask~~ Implement spectra extraction workflow using pyTDFSDK with parallelization Nov 11, 2024

alex-l-kong added 2 commits January 29, 2025 13:52

Add automatic num_bins calculation to TDFSDK workflow

ac697f3

Documentation fix

b37dfc6
