-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: Integrate peakpredictor #44
Conversation
…rated list. Added ncpus option.
This PR does the following things:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some general remarks, again not everything related to this PR.
Looks good
parser.add_argument( | ||
"-v", "--version", action="version", version=f"%(prog)s v{__version__}" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a cookie 🍪 I can add tab completion 🙃
help="one or more BED format files with putative enhancer regions (e.g. narrowPeak, broadPeak)", | ||
metavar="", | ||
nargs='+', # >= 1 files | ||
p = subparsers.add_parser( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easier to name each subparser to what it actually represents.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah but that is not related to your changes 🤐
ananse/utils.py
Outdated
return factors | ||
|
||
factors = [line.strip() for line in open(fname)] | ||
return factors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no newline 😱
def check_input_factors(factors): | ||
"""Check factors. | ||
|
||
Factors can eiher be a list of transcription factors, or a filename of a | ||
file that containts TFs. Returns a list of factors. | ||
If factors is None, it will return the default transcription factors. | ||
|
||
Returns | ||
------- | ||
list | ||
List of TF names. | ||
""" | ||
# Load factors | ||
if factors is None: | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
factors is not a kwarg, so unlikely it is None? Also I don't see how it returns the default TFs
ananse/influence.py
Outdated
if line[1] == "": | ||
realFC = 0 | ||
else: | ||
realFC = float(line[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personal preference, but I am linking pep to make it look like it isn't (https://www.python.org/dev/peps/pep-0008/#programming-recommendations)
For sequences, (strings, lists, tuples), use the fact that empty sequences are false:
# Correct:
if not seq:
if seq:
# Wrong:
if len(seq):
if not len(seq):
Anyways, I think line[1]
should have a descriptive name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole function could be pandas, but again.. Not part of the PR 🙃
ananse/influence.py
Outdated
def filter_TF(scores_df, network=None, tpmfile=None, tpm=20, overlap=0.98): | ||
"""Filter TFs: | ||
1) it have high expression in origin cell type; | ||
2) 98% of its target genes are also regulated by previous TFs. | ||
1) it have high expression in origin cell type; | ||
2) 98% of its target genes are also regulated by previous TFs. | ||
""" | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to PR. But the amount of (unexpected to me) filtering is strange, this is one example. I think this is something we should take a look at and think about.
E.g. removing not-validated TFs by default will of course make the AUPRC better, since we do not have true data of those. However, if we are confident in the model I am not sure if this is the way to go
ananse/peakpredictor.py
Outdated
raise ValueError("Need either ATAC-seq or H3K27ac BAM file(s).") | ||
|
||
if genome is None: | ||
logger.info("Assuming genome is hg38") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps logger.warning
? I feel this should really be seen by people, and more explicit is more better.
ananse/peakpredictor.py
Outdated
valid_factors = valid_factors.loc[ | ||
valid_factors["Pseudogene"].isnull(), "HGNC approved gene symbol" | ||
].values | ||
valid_factors = [f for f in valid_factors if f not in ["EP300"]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this contain duplicates? Otherwise list(set([...]))
if self.is_human_genome(): | ||
factor = factor.upper() | ||
|
||
if self.is_human_genome() and factor not in valid_factors: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are we sure that all valid_factors are actually uppercase in the case of human?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the delay, I only noticed the review request when Maarten mentioned it 🙀
I have two serious questions in _load_motifs
, the rest looks good (and therefore I assume it works well :p )
if len(self.f2m) == 1: | ||
logger.debug("using motifs for 1 factor") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lovely touch ❤️
if self.pfmfile is not None: | ||
logger.debug("Reading default file") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
default file is loaded when self.pfmfile is None
right?
tmp_f2m = {} | ||
if self.pfmfile is not None: | ||
logger.debug("Reading default file") | ||
tmp_f2m = self._load_factor2motifs(indirect=True) | ||
|
||
for k, v in self.f2m.items(): | ||
if k in tmp_f2m: | ||
tmp_f2m[k] += v | ||
else: | ||
tmp_f2m[k] = v | ||
|
||
self.motif_graph = nx.Graph() | ||
d = [] | ||
for f1 in tmp_f2m: | ||
for f2 in tmp_f2m: | ||
jaccard = len(set(tmp_f2m[f1]).intersection(set(tmp_f2m[f2]))) / len( | ||
set(tmp_f2m[f1]).union(set(tmp_f2m[f2])) | ||
) | ||
d.append([f1, f2, jaccard]) | ||
if jaccard > 0: | ||
self.motif_graph.add_edge(f1, f2, weight=1 - jaccard) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of this depends on tmp_f2m
. Is it supposed to run only if if self.pfmfile is not None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this can only be done with the default motif file I think.
ananse/peakpredictor.py
Outdated
* Motif scores. | ||
* The average peak coverage. | ||
* The distance from the peak to nearest TSS. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Motif scores.
* The average peak coverage (and their regions).
* The distance from the peak to nearest TSS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (my) understanding, I'd like a note that _avg
and _dist
are used later on, with reference data only.
fname = f"{self.data_dir}/{title}.qnorm.ref.txt.gz" | ||
if os.path.exists(fname): | ||
logger.debug(f"quantile normalization for {title}") | ||
qnorm_ref = pd.read_table(fname, index_col=0)["qnorm_ref"].values | ||
if len(self.regions) != len(qnorm_ref): | ||
qnorm_ref = np.random.choice( | ||
qnorm_ref, size=len(self.regions), replace=True | ||
) | ||
|
||
tmp = qnorm.quantile_normalize(tmp, target=qnorm_ref) | ||
else: | ||
tmp = np.log1p(tmp) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
distributions.py
is WIP, but the part that used Quans file were in order. Why not use that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code replaces Quan's code. If the distributions.py
changes we can use it here.
ananse/peakpredictor.py
Outdated
|
||
Basically, this will select the columns that are available, | ||
based on the different types of data that are loaded. | ||
Reference regions will have the mmost information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mmost
""" | ||
if factor is None and motifs is None: | ||
raise ValueError("Need either a TF name or one or more motifs.") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add # TODO: remove
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
?
exit(1) | ||
|
||
|
||
def predict_peaks( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a point on formality (and can therefore be ignored if it takes too long).
half of this function is checking the user input, the other half is running the code. The latter part definitely belongs here, but the input checking should maybe move to commands/
, as it is the control script.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. However, I like that the code in commands is actually very minimal. This means that the exact same functionality will be available as API-style functionality. What can be done is to move it to a separate function.
Still a work in progress, just want to see where the conflicts are for now.
This is a replacement for
ananse binding
which uses a more complex model to predict binding, based on ATAC and/or H3K27ac in a transcription factor-specific manner.