
[prs] Distance-based PRS with corrected betas #565

Merged

Conversation

akotlar (Collaborator) commented Aug 21, 2024

  • Corrects beta adjustments for prevalence
  • Adds a distance-based clustering approach, used by default since it is slightly more accurate on big_daly
  • Verified to be nearly identical to the output of Dave's Perl implementation (to be PR'd separately) for identical SNPs

I've attached Dave's output alongside what I believe is ours. I will double-check by re-running, but our PRS values are within 10% of Dave's at most, and the remaining differences could potentially come down to numerical precision. I spent a good deal of time tracking down the discrepancy, and it appears to arise in the final step, when adjusting dosages (the ancestry-adjusted allele frequencies were identical, as were the genotypes). I will look into this further, but this PR is a strict improvement and likely correct.
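For readers unfamiliar with distance-based pruning, here is a minimal sketch of the general idea. This is not the PR's actual prune_by_window implementation; the CHROM/POS/P column names and the greedy keep-most-significant strategy are assumptions for illustration.

```python
import pandas as pd

def prune_by_window_sketch(scores: pd.DataFrame, ld_window_bp: int) -> pd.DataFrame:
    """Greedy distance-based pruning sketch: keep the most significant SNP
    first, then drop any SNP within ld_window_bp of an already-kept SNP on
    the same chromosome. Assumes CHROM, POS, and P columns.
    """
    kept_positions: dict[str, list[int]] = {}
    keep_idx = []
    # Visit SNPs in order of increasing p-value (most significant first)
    for idx, row in scores.sort_values("P").iterrows():
        positions = kept_positions.setdefault(row["CHROM"], [])
        if all(abs(row["POS"] - p) > ld_window_bp for p in positions):
            positions.append(row["POS"])
            keep_idx.append(idx)
    return scores.loc[keep_idx]
```

Under these assumptions, two SNPs 50 kb apart with a 100 kb window would collapse to the more significant of the two, while SNPs on different chromosomes are never pruned against each other.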

Archive 2.zip

Summary by CodeRabbit

  • New Features

    • Introduced streaming functionality for proteomics data analysis.
    • Enhanced polygenic risk score calculations to include new parameters for better accuracy and flexibility.
    • Added detailed logging for the PRS job listener to improve monitoring and diagnostics.
    • Updated data processing logic to handle new parameters effectively and enhance the output format.
    • Added structured management for covariate data through new classes.
  • Bug Fixes

    • Addressed an error related to variable initialization within the proteomics notebook.
  • Documentation

    • Improved docstrings and documentation clarity for functions and parameters across modules, ensuring better guidance for users.

coderabbitai bot (Contributor) commented Aug 21, 2024

Walkthrough

The changes add polygenic risk score calculation functionality and streaming-based data handling to the proteomics analysis codebase. Documentation and logging have been expanded for clarity and monitoring, and the data classes have been restructured to better manage covariates, improving robustness and accuracy in genetic data processing.

Changes

Files and change summaries:

  • python/python/bystro/examples/...Query.ipynb: Updates to execution counts, introduction of streaming functionality, added user credentials, and an identified variable-initialization bug.
  • python/python/bystro/prs/handler.py, python/python/bystro/prs/messages.py, python/python/bystro/prs/prs_types.py: Enhanced calculate_prs_scores function with new parameters, and expanded PRSJobData attributes to improve genetic data processing.
  • python/python/bystro/prs/listener.py: Added logging functionality for improved tracking of PRS job operations.
  • python/python/bystro/prs/preprocess_for_prs.py: Refactored code for clarity, added documentation, introduced constants for data handling, and improved PRS calculation logic.
  • python/python/bystro/utils/covariates.py: New file defining structured data classes for covariates and experiment mappings to enhance data integrity and management.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant API
    participant PRSHandler
    participant PRSJobData
    participant Logger
    
    User->>API: Initiate PRS Job
    API->>PRSHandler: Call calculate_prs_scores()
    PRSHandler->>PRSJobData: Extract parameters
    PRSJobData->>PRSHandler: Return parameter data
    PRSHandler->>Logger: Log job initiation
    PRSHandler->>PRSHandler: Process data
    PRSHandler->>Logger: Log processing details
    PRSHandler->>API: Return results
    User->>API: Receive PRS results

🐇 In the fields where we hop and play,
New features bloom, bright as day.
Data streaming flows like a stream,
Together we build, we share a dream.
With logs to guide us, we leap and bound,
In coding joy, our hearts are found! 🌼



coderabbitai bot (Contributor) left a comment

Actionable comments posted: 4

Outside diff range, codebase verification and nitpick comments (2)
python/python/bystro/prs/preprocess_for_prs.py (2)

512-555: Improve inline comments for clarity in prune_by_window.

The inline comments could be improved for better clarity and understanding.

Consider revising the comments to provide more context on the pruning process.


Line range hint 632-954: Enhance logging in generate_c_and_t_prs_scores.

The logging statements could be more descriptive to aid in debugging.

Consider adding more context to the logging messages to provide better insights during execution.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a805717 and ac79fd5.

Files selected for processing (7)
  • python/python/bystro/examples/ProteomicsProxiedQuery.ipynb (4 hunks)
  • python/python/bystro/prs/handler.py (3 hunks)
  • python/python/bystro/prs/listener.py (2 hunks)
  • python/python/bystro/prs/messages.py (3 hunks)
  • python/python/bystro/prs/preprocess_for_prs.py (16 hunks)
  • python/python/bystro/prs/prs_types.py (1 hunks)
  • python/python/bystro/prs/tests/test_preprocess_for_prs.py (10 hunks)
Additional comments not posted (13)
python/python/bystro/prs/prs_types.py (1)

11-11: Enhancement: New attributes added to PRSJobData.

The addition of covariates_path and training_populations enhances the flexibility of the PRSJobData class. The default values ensure backward compatibility.

Also applies to: 14-14

python/python/bystro/prs/listener.py (1)

5-5: Enhancement: Logging functionality added.

The logging setup with a DEBUG level is appropriate for tracking operations and diagnosing issues. This addition will aid in monitoring the execution of the PRS job listener.

Also applies to: 17-23

python/python/bystro/prs/messages.py (1)

22-23: Enhancement: Expanded PRSJobData attributes.

The addition of new attributes such as training_populations, covariates_path, and others enhances the flexibility and specificity of the data being processed. The removal of populations_path and the update to index_name reflect a more streamlined approach to handling population data.

Also applies to: 31-38, 47-57

python/python/bystro/prs/handler.py (1)

4-4: Enhancement: Expanded functionality in calculate_prs_scores.

The integration of new parameters and covariate data into the scoring process enhances the function's accuracy and flexibility. The changes to the output file naming convention improve clarity and traceability.

Also applies to: 26-33, 42-47, 53-65, 69-71

python/python/bystro/prs/tests/test_preprocess_for_prs.py (1)

26-37: Enhancement: Updated test suite for preprocess_for_prs.

The addition of new fixtures and updates to existing ones ensure that the tests align with the current requirements of the codebase. The removal of a test indicates a potential refactor or removal of functionality.

Also applies to: 50-54, 67-73, 160-171, 269-360

python/python/bystro/prs/preprocess_for_prs.py (6)

1-17: Docstring improvements look good!

The docstring provides a clear overview of the expected input format and the purpose of the file.


172-173: Function _load_association_scores looks good!

The function is well-structured and uses constants for column names and types, improving maintainability.


223-238: Function _convert_rsid_to_query_format looks good!

The function is straightforward and correctly handles empty input with a ValueError.


Line range hint 244-286: Function _extract_af_and_loci_overlap looks good!

The function is well-structured, and the use of constants for assembly checks improves readability.


Line range hint 288-310: Function _calculate_ancestry_weighted_af looks good!

The function is well-implemented, and the use of DataFrame operations enhances performance.


313-344: Function _preprocess_scores looks good!

The function is well-structured, and the use of constants for column names improves maintainability.

python/python/bystro/examples/ProteomicsProxiedQuery.ipynb (2)

25-26: Import changes look good!

The addition of stream_file from bystro.api.streaming is appropriate for the streaming functionality.


75-78: Streaming functionality looks good!

The use of a generator to stream data line by line is efficient for handling large datasets.

Comment on lines +387 to 390
  if row[ALLELE_COMPARISON_COMPUTED_COLUMN] == REVERSE_MATCH_COMPUTED_VALUE:
      for col in row.index:
-         if col != "allele_comparison" and row[col] != -1:
+         if col != ALLELE_COMPARISON_COMPUTED_COLUMN and row[col] != -1:
              row[col] = 2 - row[col]
Optimize dosage adjustment using DataFrame operations.

The loop can be optimized by using DataFrame operations for better performance.

Consider using DataFrame operations to adjust dosages instead of iterating through rows.
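A vectorized version of the loop above might look like the following sketch. The column constants and the `flip_reverse_match_dosages` name are hypothetical, not taken from the actual module.

```python
import pandas as pd

# Hypothetical constants, mirroring (but not copied from) preprocess_for_prs.py
ALLELE_COMPARISON_COL = "allele_comparison"
REVERSE_MATCH = "reversed match"

def flip_reverse_match_dosages(df: pd.DataFrame) -> pd.DataFrame:
    """Vectorized equivalent of the row loop: for rows flagged as a
    reverse match, replace each non-missing dosage d with 2 - d,
    leaving -1 (missing) values untouched."""
    out = df.copy()
    dosage_cols = [c for c in out.columns if c != ALLELE_COMPARISON_COL]
    mask = out[ALLELE_COMPARISON_COL] == REVERSE_MATCH
    block = out.loc[mask, dosage_cols]
    # where() keeps values where the condition holds (-1 missing markers)
    # and substitutes the flipped dosage everywhere else
    out.loc[mask, dosage_cols] = block.where(block == -1, 2 - block)
    return out
```

This replaces a Python-level loop over rows and columns with two masked DataFrame operations, which is typically much faster on large genotype matrices.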

Comment on lines +558 to +617
def _prune_scores(
scores: pd.DataFrame,
assembly: str,
p_value_threshold: float,
min_abs_beta: float,
max_abs_beta: float,
ld_window_bp: int | None = None,
distance_based_cluster: bool = True,
training_populations: list | None = None,
) -> pd.DataFrame:
"""
Prune scores based on p-value threshold
"""
# prune preprocessed_scores to only include loci with p-values below the threshold
# and filter down to sites with beta values within range
scores = scores[scores[P_COLUMN] <= p_value_threshold]
scores = scores[
(scores[BETA_COLUMN].abs() >= min_abs_beta) & (scores[BETA_COLUMN].abs() <= max_abs_beta)
]

if distance_based_cluster:
if ld_window_bp is None:
raise ValueError("If distance_based_cluster is True, ld_window_bp must be provided")

# if debugging against Dave Cutler's PRS calculator, uncomment to keep only those snps that
# are in the cutler_snp_set, based on the VARIANT_ID_COLUMN
# preprocessed_scores = preprocessed_scores[preprocessed_scores[VARIANT_ID_COLUMN].isin(_cutler_snp_set)] # noqa
scores = prune_by_window(scores, ld_window_bp)

# TODO: check for non-direct match and warn
return _preprocess_scores(scores)

if training_populations is None:
raise ValueError("If distance_based_cluster is False, training_populations must be provided")

if len(training_populations) > 1:
raise ValueError(
(
"PRS training data from only one superpopulation "
f"or population is currently supported: found {training_populations}."
)
)
sumstat_population = training_populations[0]

if sumstat_population in SUPERPOP_TO_POP_MAP:
sumstat_population = SUPERPOP_TO_POP_MAP[sumstat_population]

try:
sumstat_ld_map_path = get_map_file(assembly, sumstat_population)
except ValueError as e:
raise ValueError(f"{sumstat_population} is likely not supported. Failed to get map file: {e}")

# TODO: check for non-direct match and warn
preprocessed_scores = _preprocess_scores(scores)

scores_after_c_t, _loci_and_allele_comparison = ld_clump(
preprocessed_scores, str(sumstat_ld_map_path)
)

return preprocessed_scores.loc[scores_after_c_t.index]
Improve error handling in _prune_scores.

The error messages could be more descriptive to provide better user feedback.

Consider providing more context in the error messages to help users understand the issue.

@@ -40,12 +51,33 @@
   }
  ],
  "source": [
-   "user = auth.login('email', 'password', 'https://bystro-dev.emory.edu')\n",
+   "user = auth.login('akotlar@bu.edu', 'gonxug-xubram-nUbny6', 'https://bystro-dev.emory.edu')\n",
Avoid hardcoding credentials.

Hardcoding sensitive information like credentials is a security risk.

Consider using environment variables or a secure vault to manage credentials securely.
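One way to follow this advice in the notebook is sketched below. The BYSTRO_EMAIL and BYSTRO_PASSWORD variable names are hypothetical, not part of the Bystro API.

```python
import os

def get_credentials() -> tuple[str, str]:
    """Read notebook login credentials from the environment instead of
    hardcoding them in a committed cell."""
    email = os.environ.get("BYSTRO_EMAIL")
    password = os.environ.get("BYSTRO_PASSWORD")
    if not email or not password:
        raise RuntimeError(
            "Set BYSTRO_EMAIL and BYSTRO_PASSWORD before running this notebook"
        )
    return email, password

# Hypothetical usage in the notebook cell:
# user = auth.login(*get_credentials(), 'https://bystro-dev.emory.edu')
```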

Comment on lines 612 to 627
{
"cell_type": "code",
"execution_count": 9,
"id": "458ab2e0-3203-4d72-a697-7e0d37d65a4d",
"metadata": {},
"outputs": [
{
"ename": "NameError",
"evalue": "name 'joined_df' is not defined",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mNameError\u001b[0m Traceback (most recent call last)",
"Cell \u001b[0;32mIn[9], line 1\u001b[0m\n\u001b[0;32m----> 1\u001b[0m \u001b[43mjoined_df\u001b[49m\n",
"\u001b[0;31mNameError\u001b[0m: name 'joined_df' is not defined"
]
Fix NameError for joined_df.

The variable joined_df is referenced before being defined, indicating a potential bug.

Verify the initialization and usage of joined_df to ensure it is defined before use.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ac79fd5 and 5237ffd.

Files selected for processing (3)
  • python/python/bystro/prs/messages.py (3 hunks)
  • python/python/bystro/prs/preprocess_for_prs.py (16 hunks)
  • python/python/bystro/prs/prs_types.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • python/python/bystro/prs/messages.py
  • python/python/bystro/prs/prs_types.py
Additional comments not posted (13)
python/python/bystro/prs/preprocess_for_prs.py (13)

Line range hint 17-66: Imports and constants look good.

The imports and constants are well-organized and relevant to the file's functionality.


Line range hint 168-173: Function _load_association_scores is well-implemented.

The function correctly loads and returns GWAS summary statistics with harmonized columns and types.


Line range hint 175-193: Function _load_genetic_maps_from_feather is robust and well-implemented.

The function includes error handling and logging, enhancing its robustness.


Line range hint 212-221: Function _convert_loci_to_query_format is well-structured.

The function correctly converts loci to a query format and includes error handling for empty input.


223-238: Function _convert_rsid_to_query_format is well-structured.

The function correctly converts rsids to a query format and includes error handling for empty input.


Line range hint 244-284: Function _extract_af_and_loci_overlap is comprehensive and consistent.

The function handles unsupported assemblies and uses constants for allele frequency fields.


Line range hint 288-310: Function _calculate_ancestry_weighted_af is well-implemented.

The function calculates ancestry-weighted allele frequencies and saves intermediate results for debugging.


313-344: Function _preprocess_scores is clear and well-structured.

The function handles strand issues by generating SNPID columns for effect and reference alleles.


Line range hint 347-368: Function _preprocess_genetic_maps is straightforward and correct.

The function correctly processes genetic maps and returns a dictionary of upper bound values.


Line range hint 371-382: Function compare_alleles is clear and readable.

The function uses a custom enum for return values, enhancing readability.


387-390: Optimize dosage adjustment using DataFrame operations.

The loop can be optimized by using DataFrame operations for better performance.

Consider using DataFrame operations to adjust dosages instead of iterating through rows.


558-617: Improve error handling in _prune_scores.

The error messages could be more descriptive to provide better user feedback.

Consider providing more context in the error messages to help users understand the issue.


Line range hint 632-954: Function generate_c_and_t_prs_scores is comprehensive and well-documented.

The function includes detailed docstrings and logging for debugging. The changes align with the PR objectives.

However, ensure that all function calls to generate_c_and_t_prs_scores match the new signature.

Run the following script to verify the function usage:

Verification successful

Function usage matches the new signature.

The generate_c_and_t_prs_scores function is correctly used in the codebase, with parameters matching the updated function signature.

  • Files:
    • python/python/bystro/prs/handler.py
    • python/python/bystro/prs/tests/test_preprocess_for_prs.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify all function calls to `generate_c_and_t_prs_scores` match the new signature.

# Test: Search for the function usage. Expect: Only occurrences of the new signature.
rg --type python -A 5 $'generate_c_and_t_prs_scores'

Length of output: 82


Script:

#!/bin/bash
# Description: Verify all function calls to `generate_c_and_t_prs_scores` match the new signature.

# Correcting the previous command by removing the incorrect file type specification.
rg -A 5 'generate_c_and_t_prs_scores'

Length of output: 3039

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (3)
python/python/bystro/utils/covariates.py (3)

27-35: Clarify the documentation for from_path.

The documentation should specify the expected format of the file more explicitly, including the delimiter and any assumptions about the data.

Consider adding more details to the docstring for clarity.


39-45: Enhance error handling for header parsing.

The error message for a missing sample_id could be more informative by including the expected header format.

Consider enhancing the error message to guide the user on the correct format.


55-58: Improve error message for empty sample IDs.

The error message could be more informative by suggesting possible reasons for an empty sample_id.

Consider enhancing the error message to provide more context.
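A sketch of what a more informative header error might look like; the expected header list, message wording, and function name are assumptions, not the actual covariates.py code.

```python
# Hypothetical expected header for a tab-delimited covariates file
EXPECTED_HEADER = ["sample_id", "sex", "phenotype", "population", "superpopulation"]

def validate_header(fields: list[str], path: str) -> None:
    """Raise a ValueError that tells the user what was found and what the
    expected tab-delimited header looks like, rather than just reporting
    that sample_id is missing."""
    if not fields or fields[0] != "sample_id":
        raise ValueError(
            f"{path}: first header column must be 'sample_id'; "
            f"got {fields[:1] or '<empty>'}. Expected header: "
            + "\t".join(EXPECTED_HEADER)
        )
```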

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5237ffd and 5c07be6.

Files selected for processing (1)
  • python/python/bystro/utils/covariates.py (1 hunks)
Additional comments not posted (4)
python/python/bystro/utils/covariates.py (4)

4-14: LGTM!

The Covariates class is well-defined with appropriate use of msgspec.Struct and clear documentation.


17-20: LGTM!

The ExperimentMapping class is well-defined with appropriate use of msgspec.Struct and clear attribute definitions.


23-25: LGTM!

The ExperimentMappings class is well-structured with a clear purpose and appropriate use of msgspec.Struct.


60-71: Verify covariate field matching.

Ensure that the covariate fields in the file match those defined in the Covariates class.

Run the following script to verify the covariate field matching:

Verification successful

Covariate Field Matching Verified

The fields in the Covariates class (sex, phenotype, population, superpopulation) match those expected in the code snippet. The code correctly checks these fields against the header values. No discrepancies found.

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify covariate fields in the file match those in the `Covariates` class.

# Test: Search for covariate fields in the file. Expect: All fields match those in `Covariates`.
ast-grep --lang python --pattern $'class Covariates {
  $$$
}'

Length of output: 512


Script:

#!/bin/bash
# Description: Verify covariate fields in the file match those in the `Covariates` class.

# Find the Covariates class and extract its fields
ast-grep --lang python --pattern 'class Covariates: $$$' --json | jq -r '.[].captures[].text' | while read -r class_def; do
  # Extract fields from the Covariates class
  echo "$class_def" | ast-grep --lang python --pattern 'def __init__(self, $fields): $$$' --json | jq -r '.[].captures[].text'
done

Length of output: 115


Script:

#!/bin/bash
# Description: Locate the Covariates class definition and its fields in the codebase.

# Search for the Covariates class definition in the codebase
rg 'class Covariates' -A 10 --json

Length of output: 2922

Comment on lines +36 to +37
with open(path, "r") as f:
lines = f.readlines()
Optimize file reading.

Using readlines() can be memory-intensive for large files. Consider using a generator or iterating over the file object directly.

Use this diff to optimize file reading:

- with open(path, "r") as f:
-     lines = f.readlines()
+ with open(path, "r") as f:
+     lines = (line for line in f)

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

Outside diff range, codebase verification and nitpick comments (1)
python/python/bystro/examples/ProteomicsProxiedQuery.ipynb (1)

Line range hint 604-624: Fix NameError for joined_df.

The variable joined_df is referenced before being defined, indicating a potential bug. Verify the initialization and usage of joined_df to ensure it is defined before use.

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5c07be6 and fc29676.

Files selected for processing (1)
  • python/python/bystro/examples/ProteomicsProxiedQuery.ipynb (16 hunks)
Additional comments not posted (5)
python/python/bystro/examples/ProteomicsProxiedQuery.ipynb (5)

35-36: Verify the necessity of the stream_file import.

Ensure that the stream_file import is used in the notebook. If not, consider removing it to avoid unnecessary imports.


Line range hint 86-98: Query logic looks good.

The query for variants within 100kb of gene transcription start sites is well-implemented.


96-98: Display logic is correct.

The use of query_result_df.head() to display the dataframe is appropriate.


646-646: Dosage set creation is correct.

The creation of a set from joined_df['dosage'] is appropriate, assuming joined_df is correctly initialized.


675-688: Visualization logic is correct.

The plot of mean normalized sample intensity by dosage is well-implemented and informative.

Comment on lines +41 to 44
"execution_count": 3,
"id": "66258792-32fa-4daf-a136-c2b37738066c",
"metadata": {},
"outputs": [
Avoid hardcoding credentials.

Hardcoding sensitive information like credentials is a security risk. Consider using environment variables or a secure vault to manage credentials securely.

austinTalbot7241993 (Collaborator) left a comment

LGTM

@austinTalbot7241993 austinTalbot7241993 merged commit df384ad into bystrogenomics:master Aug 21, 2024
8 checks passed