WIP: Integrate peakpredictor #44

simonvh · 2021-02-22T14:32:43Z

Still a work in progress, just want to see where the conflicts are for now.

This is a replacement for ananse binding which uses a more complex model to predict binding, based on ATAC and/or H3K27ac in a transcription factor-specific manner.

simonvh · 2021-02-25T08:11:52Z

This PR does the following things:

It replaces the "old" model with a new one, which is flexible and can use ATAC-seq and/or H3K27ac. It is trained on ~100 TFs in different cell types. Depending on which data is available the appropriate model is used.
You can use either the pre-defined regions for human, or a set of custom regions. The latter works according to the code from @siebrenf.
Quantile normalization still uses the "old" sampling approach. This could be improved, but I'm currently unclear on what the best approach would be.
It should be possible to use the pfm + motif2factors.txt for any species, and this should not change the TF names. There is currently still a check on valid TFs for the default option (human), as the current motif2factors.txt is messy.
You can specify which TFs you want to predict. One thing still needs to change here: motif scanning is currently done for all motifs, which is not very efficient.
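
One possible way to handle the last point is to restrict the motif set before scanning. This is only a minimal sketch, assuming the factor-to-motif mapping is a plain dict (like `f2m` elsewhere in this PR); `motifs_for_factors` is a hypothetical helper, not something that exists in ANANSE:

```python
def motifs_for_factors(f2m, factors):
    """Return the sorted motif IDs needed for the requested factors.

    f2m: dict mapping factor name -> list of motif IDs (assumed layout).
    factors: iterable of requested factor names; unknown names are skipped.
    """
    wanted = set()
    for factor in factors:
        wanted.update(f2m.get(factor, []))
    return sorted(wanted)


# Made-up motif IDs, for illustration only.
f2m = {"GATA1": ["M1", "M2"], "SPI1": ["M3"], "TP53": ["M2", "M4"]}
print(motifs_for_factors(f2m, ["GATA1", "TP53"]))
```

Only the returned motifs would then need to be scanned; whether the scanner can accept such a subset is not something this sketch verifies.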

Maarten-vd-Sande

Just some general remarks, again not everything related to this PR.

Looks good

Maarten-vd-Sande · 2021-03-08T11:09:30Z

scripts/ananse

+    parser.add_argument(
+        "-v", "--version", action="version", version=f"%(prog)s v{__version__}"
+    )


For a cookie 🍪 I can add tab completion 🙃

Maarten-vd-Sande · 2021-03-08T11:13:22Z

scripts/ananse

-        help="one or more BED format files with putative enhancer regions (e.g. narrowPeak, broadPeak)",
-        metavar="",
-        nargs='+',  # >= 1 files
+    p = subparsers.add_parser(


Easier to name each subparser to what it actually represents.

Ah but that is not related to your changes 🤐

Maarten-vd-Sande · 2021-03-08T11:14:33Z

ananse/utils.py

+        return factors
+
+    factors = [line.strip() for line in open(fname)]
+    return factors


no newline 😱

Maarten-vd-Sande · 2021-03-08T11:34:50Z

ananse/utils.py

+def check_input_factors(factors):
+    """Check factors.
+
+    Factors can either be a list of transcription factors, or a filename of a
+    file that contains TFs. Returns a list of factors.
+    If factors is None, it will return the default transcription factors.
+
+    Returns
+    -------
+    list
+        List of TF names.
+    """
+    # Load factors
+    if factors is None:
+        return


factors is not a kwarg, so unlikely it is None? Also I don't see how it returns the default TFs
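
A rough sketch of what the docstring seems to promise, assuming the defaults are passed in by the caller. The `default_factors` parameter is hypothetical and not part of the PR; the file handling also uses a context manager so the handle gets closed:

```python
import os


def check_input_factors(factors=None, default_factors=None):
    """Return a list of TF names, or the defaults if nothing was given.

    factors: None, a list of TF names, or a list holding one filename.
    default_factors: hypothetical fallback list supplied by the caller.
    """
    if factors is None:
        return list(default_factors) if default_factors is not None else None

    if isinstance(factors, str):
        factors = [factors]

    # A single existing path is treated as a file with one TF per line.
    if len(factors) == 1 and os.path.isfile(factors[0]):
        with open(factors[0]) as handle:
            return [line.strip() for line in handle if line.strip()]

    return list(factors)
```

Making `factors` a keyword argument with a `None` default also makes the `is None` branch reachable in normal use.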

Maarten-vd-Sande · 2021-03-08T11:42:24Z

ananse/influence.py

+            if line[1] == "":
                realFC = 0
            else:
                realFC = float(line[1])


Personal preference, but I am linking pep to make it look like it isn't (https://www.python.org/dev/peps/pep-0008/#programming-recommendations)

For sequences, (strings, lists, tuples), use the fact that empty sequences are false:

    # Correct:
    if not seq:
    if seq:

    # Wrong:
    if len(seq):
    if not len(seq):

Anyways, I think line[1] should have a descriptive name.

This whole function could be pandas, but again.. Not part of the PR 🙃
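
To illustrate both suggestions (truthiness check and a descriptive name), here is a small sketch. The column layout (gene, fold change, p-value) and the name `parse_fold_change` are assumptions for the example, not taken from influence.py:

```python
def parse_fold_change(field):
    """Convert a fold-change column to float; an empty field counts as 0."""
    field = field.strip()
    if not field:  # empty strings are falsy, see the PEP 8 quote above
        return 0.0
    return float(field)


# Hypothetical tab-separated line with an empty fold-change column.
line = "GATA1\t\t0.01".split("\t")
gene, fold_change_field = line[0], line[1]
print(gene, parse_fold_change(fold_change_field))
```

Note that, unlike the `== ""` comparison in the diff, this version also treats a whitespace-only field as missing.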

Maarten-vd-Sande · 2021-03-08T11:48:32Z

ananse/influence.py

 def filter_TF(scores_df, network=None, tpmfile=None, tpm=20, overlap=0.98):
    """Filter TFs:
-        1) it have high expression in origin cell type;
-        2) 98% of its target genes are also regulated by previous TFs. 
+    1) it have high expression in origin cell type;
+    2) 98% of its target genes are also regulated by previous TFs.
    """



Unrelated to PR. But the amount of (unexpected to me) filtering is strange, this is one example. I think this is something we should take a look at and think about.

E.g. removing not-validated TFs by default will of course make the AUPRC better, since we do not have true data of those. However, if we are confident in the model I am not sure if this is the way to go
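
For discussion, this is how I read the second criterion in the docstring, written out as a sketch. It is my interpretation (taking "previous TFs" to mean the TFs already kept, in rank order), not the actual code in influence.py, and it ignores the expression criterion entirely:

```python
def filter_redundant_tfs(ranked_targets, overlap=0.98):
    """Keep TFs whose targets are not already covered by earlier TFs.

    ranked_targets: list of (tf_name, set_of_target_genes), best first.
    A TF is dropped when at least `overlap` of its targets are already
    regulated by the TFs kept before it.
    """
    kept = []
    covered = set()
    for tf, targets in ranked_targets:
        if targets and len(targets & covered) / len(targets) >= overlap:
            continue  # nearly all targets are explained by earlier TFs
        kept.append(tf)
        covered |= targets
    return kept


ranked = [("A", {"g1", "g2"}), ("B", {"g1", "g2"}), ("C", {"g3"})]
print(filter_redundant_tfs(ranked))
```

Written like this, it is obvious that the result depends on the ranking order, which is part of why I think these filters deserve a second look.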

Maarten-vd-Sande · 2021-03-08T11:51:59Z

ananse/peakpredictor.py

+            raise ValueError("Need either ATAC-seq or H3K27ac BAM file(s).")
+
+        if genome is None:
+            logger.info("Assuming genome is hg38")


Perhaps logger.warning? I feel this should really be seen by people, and more explicit is more better.
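
Something along these lines, as a sketch only. It uses the standard library `logging` module to stay self-contained (the project itself logs through loguru, whose logger also has a `warning` method), and `resolve_genome` is a hypothetical helper name:

```python
import logging

logger = logging.getLogger("peakpredictor_sketch")

DEFAULT_GENOME = "hg38"


def resolve_genome(genome=None):
    """Return the genome name, warning when falling back to the default."""
    if genome is None:
        logger.warning("No genome specified, assuming %s", DEFAULT_GENOME)
        return DEFAULT_GENOME
    return genome
```

With default logging settings a warning is shown while info messages are not, so users would see the assumption without changing their configuration.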

Maarten-vd-Sande · 2021-03-08T12:03:21Z

ananse/peakpredictor.py

+        valid_factors = valid_factors.loc[
+            valid_factors["Pseudogene"].isnull(), "HGNC approved gene symbol"
+        ].values
+        valid_factors = [f for f in valid_factors if f not in ["EP300"]]


Can this contain duplicates? If so, list(set([...])) would remove them.
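
If the order of the factors matters, `set` would scramble it. A small sketch of an order-preserving alternative, assuming the input is a sequence of gene symbols; `unique_in_order` is a hypothetical helper:

```python
def unique_in_order(names, exclude=()):
    """Drop duplicates and excluded names while keeping first-seen order."""
    excluded = set(exclude)
    # dict.fromkeys keeps insertion order, so the first occurrence wins.
    return [name for name in dict.fromkeys(names) if name not in excluded]


print(unique_in_order(["GATA1", "EP300", "SPI1", "GATA1"], exclude=["EP300"]))
```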

Maarten-vd-Sande · 2021-03-08T12:06:18Z

ananse/peakpredictor.py

+                if self.is_human_genome():
+                    factor = factor.upper()
+
+                if self.is_human_genome() and factor not in valid_factors:


are we sure that all valid_factors are actually uppercase in the case of human?
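
If we cannot guarantee that, comparing both sides in upper case would avoid false rejections (HGNC symbols such as C9orf72 do contain lowercase letters). A minimal sketch with a hypothetical helper name:

```python
def is_valid_factor(factor, valid_factors):
    """Case-insensitive membership test for a TF name."""
    valid_upper = {name.upper() for name in valid_factors}
    return factor.upper() in valid_upper


print(is_valid_factor("C9ORF72", ["GATA1", "C9orf72"]))
```

In real code the upper-cased set would be built once rather than on every call.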

siebrenf

Sorry for the delay, I only noticed the review request when Maarten mentioned it 🙀

I have two serious questions in _load_motifs, the rest looks good (and therefore I assume it works well :p )

siebrenf · 2021-03-08T12:12:41Z

ananse/peakpredictor.py

+        if len(self.f2m) == 1:
+            logger.debug("using motifs for 1 factor")


lovely touch ❤️

siebrenf · 2021-03-08T12:20:26Z

ananse/peakpredictor.py

+        if self.pfmfile is not None:
+            logger.debug("Reading default file")


default file is loaded when self.pfmfile is None right?

siebrenf · 2021-03-08T12:21:19Z

ananse/peakpredictor.py

+        tmp_f2m = {}
+        if self.pfmfile is not None:
+            logger.debug("Reading default file")
+            tmp_f2m = self._load_factor2motifs(indirect=True)
+
+        for k, v in self.f2m.items():
+            if k in tmp_f2m:
+                tmp_f2m[k] += v
+            else:
+                tmp_f2m[k] = v
+
+        self.motif_graph = nx.Graph()
+        d = []
+        for f1 in tmp_f2m:
+            for f2 in tmp_f2m:
+                jaccard = len(set(tmp_f2m[f1]).intersection(set(tmp_f2m[f2]))) / len(
+                    set(tmp_f2m[f1]).union(set(tmp_f2m[f2]))
+                )
+                d.append([f1, f2, jaccard])
+                if jaccard > 0:
+                    self.motif_graph.add_edge(f1, f2, weight=1 - jaccard)


all of this depends on tmp_f2m. Is it supposed to run only if self.pfmfile is not None?

Yeah, this can only be done with the default motif file I think.
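
As a side note on the loop in this hunk, the pairwise Jaccard computation could be sketched as below. This is an illustration, not the PR's code: it uses plain tuples instead of the networkx graph, visits each unordered pair once and skips self-pairs (the original loops over all ordered pairs), and guards against an empty union:

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def motif_overlap_edges(f2m):
    """Return (factor1, factor2, distance) for factors sharing motifs."""
    edges = []
    for f1, f2 in combinations(sorted(f2m), 2):
        score = jaccard(f2m[f1], f2m[f2])
        if score > 0:
            edges.append((f1, f2, 1 - score))
    return edges


# Made-up motif IDs, for illustration only.
f2m = {"A": ["m1", "m2"], "B": ["m2", "m3"], "C": ["m4"]}
print(motif_overlap_edges(f2m))
```

Each tuple could then be passed to `add_edge(f1, f2, weight=distance)` as in the diff.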

siebrenf · 2021-03-08T12:38:29Z

ananse/peakpredictor.py

+        * Motif scores.
+        * The average peak coverage.
+        * The distance from the peak to nearest TSS.


        * Motif scores.
        * The average peak coverage (and their regions).
        * The distance from the peak to nearest TSS.

for (my) understanding, I'd like a note that _avg and _dist are used later on, with reference data only.

siebrenf · 2021-03-08T12:53:35Z

ananse/peakpredictor.py

+        fname = f"{self.data_dir}/{title}.qnorm.ref.txt.gz"
+        if os.path.exists(fname):
+            logger.debug(f"quantile normalization for {title}")
+            qnorm_ref = pd.read_table(fname, index_col=0)["qnorm_ref"].values
+            if len(self.regions) != len(qnorm_ref):
+                qnorm_ref = np.random.choice(
+                    qnorm_ref, size=len(self.regions), replace=True
+                )
+
+            tmp = qnorm.quantile_normalize(tmp, target=qnorm_ref)
+        else:
+            tmp = np.log1p(tmp)


distributions.py is WIP, but the parts that used Quan's file were in order. Why not use that?

This code replaces Quan's code. If the distributions.py changes we can use it here.
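
For readers following along, the sampling approach in this hunk can be sketched in plain Python. This is an illustration of the idea (resample the reference to the right length, then map ranks onto it), not the implementation in the qnorm package; ties are broken by input order here, which is an arbitrary choice:

```python
import random


def quantile_normalize_to_target(values, target, seed=None):
    """Map values onto the target distribution by rank.

    If the lengths differ, the target is resampled with replacement,
    mirroring the np.random.choice(..., replace=True) call in the diff.
    """
    rng = random.Random(seed)
    if len(target) != len(values):
        target = rng.choices(target, k=len(values))
    sorted_target = sorted(target)

    # Indices of values from smallest to largest.
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    for rank, index in enumerate(order):
        result[index] = sorted_target[rank]
    return result


print(quantile_normalize_to_target([3.0, 1.0, 2.0], [10.0, 20.0, 30.0]))
```

Because of the random resampling, results for regions of a different length than the reference are only reproducible with a fixed seed.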

siebrenf · 2021-03-08T12:55:23Z

ananse/peakpredictor.py

+
+        Basically, this will select the columns that are available,
+        based on the different types of data that are loaded.
+        Reference regions will have the mmost information.


siebrenf · 2021-03-08T12:57:02Z

ananse/peakpredictor.py

+        """
+        if factor is None and motifs is None:
+            raise ValueError("Need either a TF name or one or more motifs.")
+


add # TODO: remove?

siebrenf · 2021-03-08T13:05:52Z

ananse/peakpredictor.py

+        exit(1)
+
+
+def predict_peaks(


this is a point on formality (and can therefore be ignored if it takes too long).

half of this function is checking the user input, the other half is running the code. The latter part definitely belongs here, but the input checking should maybe move to commands/, as it is the control script.

I see your point. However, I like that the code in commands is actually very minimal. This means that the exact same functionality will be available as API-style functionality. What can be done is to move it to a separate function.
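
A sketch of what such a split could look like. The function and parameter names are hypothetical and the prediction step is left out; the point is only that validation raises an exception instead of calling `exit(1)`, so the command line and API callers share the same checks:

```python
def validate_inputs(atac_bams=None, histone_bams=None, genome=None):
    """Check user input and return normalized values.

    Raises ValueError instead of exiting, so both the CLI and API
    callers can decide how to report the problem.
    """
    if not atac_bams and not histone_bams:
        raise ValueError("Need either ATAC-seq or H3K27ac BAM file(s).")
    return {
        "atac_bams": list(atac_bams or []),
        "histone_bams": list(histone_bams or []),
        "genome": genome or "hg38",
    }


def predict_peaks(atac_bams=None, histone_bams=None, genome=None):
    """Entry point shared by the command line and the Python API."""
    params = validate_inputs(atac_bams, histone_bams, genome)
    # ... the actual prediction would run here ...
    return params
```

The code in commands/ could then stay minimal: call `predict_peaks`, catch `ValueError`, and print the message before exiting.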

simonvh and others added 17 commits February 17, 2021 19:44

initial start on peakpredictor

53377b7

working version of peakpredictor

0bbd205

merged changes

883bad4

Merge branch 'refactor' into peakpredictor

014a329

further integration of peakpredictor

d62460b

Merge branch 'peakpredictor' of github.com:simonvh/ANANSE into peakpredictor

b7e574f

simplifying loguru output a bit

c29aeca

removed enhancer command

a0f44be

working version of peakpredictor in ananse binding, with basic model included

f6fde89

Using factor from motif annotation, except for hg38 where we use a curated list. Added ncpus option.

e0ef5f9

small update

6cc2dea

merged refactor

7752ebf

merged

d2f0edd

black

7eb6660

cleaned up the command line arguments

1dec88b

update docstring

3f02b90

reduce code complexity

c8281a6

simonvh requested review from bioqxu, siebrenf and Maarten-vd-Sande and removed request for bioqxu February 25, 2021 08:12

simonvh added 3 commits February 26, 2021 08:42

do not use mouse TFs when predicting in human

8a526b0

add missing files to MANIFEST; limit memory usage

fe28f28

also compute expression network

6abbc91

Maarten-vd-Sande approved these changes Mar 8, 2021

siebrenf approved these changes Mar 8, 2021

simonvh added 3 commits March 24, 2021 13:29

add factor activity to network

2fc0eaa

binding -> prob for influence

e6777dc

remove dependency online file

1adf782

This was referenced Mar 25, 2021

Update influence.py #59

Closed

Update influence to use pandas #60

Closed

Check filtering steps in influence #61

Open

simonvh and others added 9 commits March 25, 2021 19:20

changed based on review

f0d533c

fix ananse import

3183cae

fix network columns

0c07e12

fix bug in ananse network

0f61b64

merged

5cfafd4

fix custom motif file

0cda2a8

remove tmp files

e18c873

merge refactor

bf2fcba

merged

1dec813

simonvh merged commit 3c0c82b into vanheeringen-lab:refactor Apr 23, 2021

simonvh deleted the peakpredictor branch April 23, 2021 07:13
