Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Integrate peakpredictor #44

Merged
merged 32 commits into from
Apr 23, 2021

Conversation

simonvh
Copy link
Member

@simonvh simonvh commented Feb 22, 2021

Still a work in progress, just want to see where the conflicts are for now.

This is a replacement for ananse binding which uses a more complex model to predict binding, based on ATAC and/or H3K27ac in a transcription factor-specific manner.

@simonvh
Copy link
Member Author

simonvh commented Feb 25, 2021

This PR does the following things:

  • It replaces the "old" model with a new one, which is flexible and can use ATAC-seq and/or H3K27ac. It is trained on ~100 TFs in different cell types. Depending on which data is available the appropriate model is used.
  • You can use either the pre-defined regions for human, or a set of custom regions. The latter works according to the code from @siebrenf.
  • Quantile normalization still uses the "old" sampling approach. This could be improved, but I'm currently unclear on what the best approach would be.
  • It should be possible to use the pfm + motif2factors.txt for any species, and should not change the TF names. There is currently still a check on valid TFs for the default option (human) as the current motif2factors.txt is messy.
  • You can specify which TFs you want to predict. There is one thing still to change here, as motif scanning is done for all the motifs, which is then not very efficient.

@simonvh simonvh requested review from bioqxu, siebrenf and Maarten-vd-Sande and removed request for bioqxu February 25, 2021 08:12
Copy link
Member

@Maarten-vd-Sande Maarten-vd-Sande left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some general remarks, again not everything related to this PR.

Looks good

Comment on lines +42 to +44
parser.add_argument(
"-v", "--version", action="version", version=f"%(prog)s v{__version__}"
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For a cookie 🍪 I can add tab completion 🙃

help="one or more BED format files with putative enhancer regions (e.g. narrowPeak, broadPeak)",
metavar="",
nargs='+', # >= 1 files
p = subparsers.add_parser(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Easier to name each subparser to what it actually represents.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah but that is not related to your changes 🤐

ananse/utils.py Outdated
return factors

factors = [line.strip() for line in open(fname)]
return factors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no newline 😱

Comment on lines +217 to +231
def check_input_factors(factors):
"""Check factors.

Factors can eiher be a list of transcription factors, or a filename of a
file that containts TFs. Returns a list of factors.
If factors is None, it will return the default transcription factors.

Returns
-------
list
List of TF names.
"""
# Load factors
if factors is None:
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factors is not a kwarg, so unlikely it is None? Also I don't see how it returns the default TFs

Comment on lines 93 to 96
if line[1] == "":
realFC = 0
else:
realFC = float(line[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personal preference, but I am linking pep to make it look like it isn't (https://www.python.org/dev/peps/pep-0008/#programming-recommendations)

For sequences, (strings, lists, tuples), use the fact that empty sequences are false:

# Correct:
if not seq:
if seq:

# Wrong:
if len(seq):
if not len(seq):

Anyways, I think line[1] should have a descriptive name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole function could be pandas, but again.. Not part of the PR 🙃

Comment on lines 188 to 193
def filter_TF(scores_df, network=None, tpmfile=None, tpm=20, overlap=0.98):
"""Filter TFs:
1) it have high expression in origin cell type;
2) 98% of its target genes are also regulated by previous TFs.
1) it have high expression in origin cell type;
2) 98% of its target genes are also regulated by previous TFs.
"""

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to PR. But the amount of (unexpected to me) filtering is strange, this is one example. I think this is something we should take a look at and think about.

E.g. removing not-validated TFs by default will of course make the AUPRC better, since we do not have true data of those. However, if we are confident in the model I am not sure if this is the way to go

raise ValueError("Need either ATAC-seq or H3K27ac BAM file(s).")

if genome is None:
logger.info("Assuming genome is hg38")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps logger.warning? I feel this should really be seen by people, and more explicit is more better.

valid_factors = valid_factors.loc[
valid_factors["Pseudogene"].isnull(), "HGNC approved gene symbol"
].values
valid_factors = [f for f in valid_factors if f not in ["EP300"]]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this contain duplicates? Otherwise list(set([...]))

if self.is_human_genome():
factor = factor.upper()

if self.is_human_genome() and factor not in valid_factors:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are we sure that all valid_factors are actually uppercase in the case of human?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be!

Copy link
Member

@siebrenf siebrenf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, I only noticed the review request when Maarten mentioned it 🙀

I have two serious questions in _load_motifs, the rest looks good (and therefore I assume it works well :p )

Comment on lines +211 to +212
if len(self.f2m) == 1:
logger.debug("using motifs for 1 factor")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lovely touch ❤️

Comment on lines +219 to +220
if self.pfmfile is not None:
logger.debug("Reading default file")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

default file is loaded when self.pfmfile is None right?

Comment on lines +218 to +238
tmp_f2m = {}
if self.pfmfile is not None:
logger.debug("Reading default file")
tmp_f2m = self._load_factor2motifs(indirect=True)

for k, v in self.f2m.items():
if k in tmp_f2m:
tmp_f2m[k] += v
else:
tmp_f2m[k] = v

self.motif_graph = nx.Graph()
d = []
for f1 in tmp_f2m:
for f2 in tmp_f2m:
jaccard = len(set(tmp_f2m[f1]).intersection(set(tmp_f2m[f2]))) / len(
set(tmp_f2m[f1]).union(set(tmp_f2m[f2]))
)
d.append([f1, f2, jaccard])
if jaccard > 0:
self.motif_graph.add_edge(f1, f2, weight=1 - jaccard)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of this depends on tmp_f2m. Is it supposed to run only if if self.pfmfile is not None?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this can only be done with the default motif file I think.

Comment on lines 108 to 110
* Motif scores.
* The average peak coverage.
* The distance from the peak to nearest TSS.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        * Motif scores.
        * The average peak coverage (and their regions).
        * The distance from the peak to nearest TSS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for (my) understanding, I'd like a note that _avg and _dist are used later on, with reference data only.

Comment on lines +260 to +271
fname = f"{self.data_dir}/{title}.qnorm.ref.txt.gz"
if os.path.exists(fname):
logger.debug(f"quantile normalization for {title}")
qnorm_ref = pd.read_table(fname, index_col=0)["qnorm_ref"].values
if len(self.regions) != len(qnorm_ref):
qnorm_ref = np.random.choice(
qnorm_ref, size=len(self.regions), replace=True
)

tmp = qnorm.quantile_normalize(tmp, target=qnorm_ref)
else:
tmp = np.log1p(tmp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

distributions.py is WIP, but the part that used Quans file were in order. Why not use that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code replaces Quan's code. If the distributions.py changes we can use it here.


Basically, this will select the columns that are available,
based on the different types of data that are loaded.
Reference regions will have the mmost information.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmost

"""
if factor is None and motifs is None:
raise ValueError("Need either a TF name or one or more motifs.")

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add # TODO: remove?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

exit(1)


def predict_peaks(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a point on formality (and can therefore be ignored if it takes too long).

half of this function is checking the user input, the other half is running the code. The latter part definitely belongs here, but the input checking should maybe move to commands/, as it is the control script.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see your point. However, I like that the code in commands is actually very minimal. This means that the exact same functionality will be available as API-style functionality. What can be done is to move it to a separate function.

@simonvh simonvh merged commit 3c0c82b into vanheeringen-lab:refactor Apr 23, 2021
@simonvh simonvh deleted the peakpredictor branch April 23, 2021 07:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants