Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default y-min in chromsome scatter plots #643

Closed
micknudsen opened this issue Jul 8, 2021 · 12 comments · Fixed by #645
Closed

Default y-min in chromsome scatter plots #643

micknudsen opened this issue Jul 8, 2021 · 12 comments · Fixed by #645

Comments

@micknudsen
Copy link
Contributor

I wonder if this should be min instead of max?

y_min = max(-5.0, min(y.min() - .1, -.3)) if len(y) else -1.1

With the current default settings, some biallelic deletion are not shown. Here is an example (CNVkit v0.9.9) without --y-min specified:

Screen Shot 2021-07-08 at 16 39 44

and with --y-min=-10:

Screen Shot 2021-07-08 at 16 39 50

@tetedange13
Copy link
Contributor

Hi @micknudsen,

Not an author of CNVkit, but you are probably right
=> I think expected behaviour was (ignoring small plotting corrections):

if -0.5 < y.min():
    y_min = -0.5  # No matter the value of "y.min()"
else:  # In other words: "y.min() < -0.5"
    y_min = y.min() 

Just to precise your issue, minimal command to reproduce is: cnvkit.py scatter -c chr4 a_sample.cnr
=> In other words this is happening as soon as scatter is asked to focus on a specific chromosome / region (no need to have segment, nor b-allele)
=> As you suggested, seems to be fixed by replacing max() by min() at line you highlighted

Thanks a lot for reporting this !

Hope this helps.
Have a nice day.
Felix.

@tskir
Copy link
Collaborator

tskir commented Jul 13, 2021

This old issue of mine might be related: #385.

However, I'm not sure this behaviour is unexpected. I think (and @etal can confirm/deny) the rationale was to prevent very negative scores (such as the ones from biallelic deletions) from compressing the Y axis. Indeed, as you can see on your second graph, variations in most of the signals become hard to see now that the range on the Y axis is so wide.

However, I agree that the data shouldn't just disappear like it happens now. The best solution I can think of is to keep the current default threshold (–5), but to clip the data to that threshold rather than hiding it. @etal @tetedange13 @micknudsen what do you think of this proposal?

@micknudsen
Copy link
Contributor Author

Maybe one could put a hard lower bound on the y-axis – as it is now – and then indicate segments falling below this with a line (or some other indicator) below the axis? Something along the line (pun intended) in which IGV indicates downsampling of reads by showing a thick, black line above the alignment. This could also be applied to very high values.

@tetedange13
Copy link
Contributor

@tskir ,

The problem I see with clipping, is cases like @micknudsen has: important number of data located relatively at same coordinates
=> If clipped, they will turn into a big strange cluster of points at -5

Idea of a line looks cool and not that hard to implement
=> Could be plotted as soon as any data < -5 (absent otherwise) and disabled when user passes --y-min and/or --y-max
=> Maybe we could aslo add a tiny vertical line on top of horizontal one, to discretely mark coordinates of masked data ?

No matter the choice made, I think we should also add more logging about this
=> Because for now data simply disappear from the plot, as you said and no warning is raised
=> In most case it is OK, because data with log_ratio<-5 are often "antitarget" one, with little importance, but can be risky

@tetedange13
Copy link
Contributor

tetedange13 commented Jul 15, 2021

  • Added following logging.warning:

WARNING: Setting default 'y_min=<y_min_cutoff>' so <nb_hidden_probes> probes are hidden --> Use parameter '--y-min <y.min()>' to see them

  • 1st try:
    scatter1

  • 2nd try:
    scatter2

@etal
Copy link
Owner

etal commented Jul 21, 2021

Lemme look at this a little more.

The intended behavior was:

  • If there are no segments in this window at all, use -1.1 as the y-axis lower limit
  • If the segments are all positive, set the y-axis lower limit to -.3 so you can still see the origin
  • If there is a segment below -5, set the y-axis lower limit to -5 for aesthetic reasons (QUESTIONABLE)
  • Otherwise, just use the minimum y-value, minus a little extra padding of .1

Aside from the segments, there are almost always lots of spurious single-bin datapoints (probes) and I think it's OK to ignore these in setting the plot limits and focus our attention on the segments.

I agree that showing clipping on the y-axis is much better than failing to plot those segments at all. I'm not sure the rug plot is exactly the right solution. I'd prefer to keep the color the same as the segments, only represent the segments (ignore individual probes, unless the input was only .cnr) and cling to the bottom spine rather than add another h-line. Maybe a rug plot with orange points facing upward, on a black line identical to the plot's bottom border? Or just spans of orange on that line, or a small triangle on the bottom to indicate each clipped segment?

@tetedange13
Copy link
Contributor

tetedange13 commented Jul 22, 2021

Hi @etal,

Thanks for your feedback ! This makes sense to rather focus on masked segments
=> Is this the kind of representation you have in mind for masked segments? (small orange triangle at bottom-right):
masked_segment

Corresponding code:

    if segments and hidden_seg.sum():
        x_hidden = segments.start[hidden_seg] * MB
        y_hidden = np.array([y_min] * len(x_hidden)) + 0.2
        axis.scatter(x_hidden, y_hidden, marker='^', linewidth=3, snap=False,
                     color=segment_color)

Have a nice day.
Felix.

@etal
Copy link
Owner

etal commented Jul 22, 2021

Yes, that mockup is very close. To ensure it's not mistaken for an ordinary segment, I'd like to avoid rounded corners (so it looks sharp and somewhat alarming), and ideally, overlap the plot's lower spine (higher z-order or something like that, so it obscures the plot boundary itself and is clearly "breaking through" -- if that's possible). To avoid rounding, maybe switch linewidth=3 to linewidth=1 or =0 in the call to axis.scatter?

@tetedange13
Copy link
Contributor

tetedange13 commented Aug 1, 2021

Hi @etal,

  • To avoid rounding, trick is to use edgecolor='none'
  • Following your advice I could make triangles overlap, adding clip_on=False, zorder=10 (no y-offset needed)

Result bellow is still with linewidth=3 (tell me if triangles size LGTY ?)
=> I used "amplicon.{cnr,cns}" from CNVkit test files
Figure_1

Have a nice day.
Felix.

@etal
Copy link
Owner

etal commented Aug 3, 2021

Hey Felix, this looks great! If you have a PR for this feature I'd be very happy to merge it.

@tetedange13
Copy link
Contributor

tetedange13 commented Aug 4, 2021

Hi @etal ,

I updated #645

Have a nice day.
Felix.

@etal etal closed this as completed in #645 Aug 8, 2021
@etal
Copy link
Owner

etal commented Aug 8, 2021

Great, I've reviewed and merged #645. This feature is now part of CNVkit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants