Analyze yeast cytosolic 8x dicodon screen data #3

kychen37 · 2022-04-01T23:56:10Z

GanttStart: 2022-03-29

Background

Transformed at high efficiency using Bloom lab's liquid recovery protocol: https://github.com/rasilab/rqc_aggregation_aging/issues/98
Library prepped: https://github.com/rasilab/rqc_aggregation_aging/issues/104

Ideas TODO

Analyze the endogenous fragment data in the same library and see if the codon or amino acid effects can account for their mRNA effects.
Put the same library in different quality control factor mutants to see their effects on codon and aa mRNA effects.
Correlate with CSC, AASC, tAI
Correlate with ribosome pause scores from Weinberg 2016

Analysis Links

snakemake on cluster: https://github.com/rasilab/lab_analysis_code/tree/master/kchen/analysis/deepseq/20220329_exp51_cyto_8x_dicodon
analysis in devcontainer: https://github.com/rasilab/rqc_aggregation_aging/tree/deepseq/analysis/deepseq/20220329_exp51_cyto_8x_dicodon

Brief conclusion

analysis used for publication here and here

kychen37 · 2022-04-12T17:24:50Z

@rasi

Next week I will go through this data and https://github.com/rasilab/rqc_aggregation_aging/issues/101 more because I want to present it for joint lab meeting

rasi · 2022-04-12T17:27:47Z

@kychen37 Ok. I remember you showing some data in the short updates, but I don't remember ever us discussing this. Can you update the issue whenever you finish some analysis (I think you used to do this previously?) I can give more useful feedback if I have some time to think about your results than just looking at them for the first time during group meeting.

kychen37 · 2022-04-15T00:56:49Z

@rasi

Code: https://github.com/rasilab/rqc_aggregation_aging/blob/master/analysis/deepseq/20220329_exp51_cyto_8x_dicodon/scripts/plot_alignment_statistics.ipynb

kychen37 · 2022-04-15T01:01:15Z

Prelim analysis, just making a heatmap and starting to look at inserts that are still missing despite good representation:

Code: https://github.com/rasilab/rqc_aggregation_aging/blob/master/analysis/deepseq/20220329_exp51_cyto_8x_dicodon/scripts/plot_insert_mrna_levels.ipynb

kychen37 · 2022-04-15T01:03:01Z

Note to self: my branches are kind of messed up right now, my most recent analyses were done in master (I had switched to doing these analyses in deepseq and forgot, then continued in master). Pulling from master didn't update the deepseq branch, I may need to 'sync' or something, for now the master branch is the most up to date for the scripts of this analysis only

kychen37 · 2022-05-27T23:20:09Z

@rasi Update I will revisit this data next week to see what is pertinent for my committee meeting

rasi · 2022-06-06T22:01:56Z

@kychen37 Post the figures as Issue comments so that we can come back to them easily (the commit links above are enough to figure out where the figure is).

Why is the data not centered around 0 unlike here: https://github.com/rasilab/lab_analysis_code/blob/master/rasi/analysis/deepseq/20210703_pb_8xdicodon_resequence_grna_mrna/scripts/plot_human_8xdicodon_effects_files/figure-gfm/unnamed-chunk-6-1.png?

kychen37 · 2022-06-06T23:04:59Z

@rasi I've been trying to figure that out but I'm not sure. Could it have to do with the fact that the mean of all the dipeptide lfcs is slightly negative?

rasi · 2022-06-06T23:15:04Z

@kychen37 The mean and median cannot be this different. First, calculate simple log fold change without bootstrapping and make sure you understand what is going on before doing the bootstrap. The bootstrap is just for error bars.

kychen37 · 2022-06-06T23:16:19Z

@rasi sorry I forgot that my lfc column had already been median normalized when I calculated the median and said it was zero. The median is actually -1.5 and mean is -1.7

kychen37 · 2022-06-06T23:18:37Z

After median-normalization, the median = 0 and mean = -0.19

rasi · 2022-06-06T23:20:09Z

Either way, the above plot should be approximately centered around 0 if everything is done correctly.

rasi · 2022-06-06T23:21:44Z

Can you also add horizontal lines for each amino acid similar to what is in Phil's paper?

kychen37 · 2022-06-06T23:45:38Z

@rasi I replotted the dipeptide heatmap using your code because I realized I wasn't looking at missing dipeptides correctly, turns out there is a group of dipeptides that are missing (in grey):

kychen37 · 2022-06-07T00:06:19Z

@rasi Does bootstrapping involve normalizing lfcs to the mean (I thought it didn't)? My data appears to skew negative (unnormalized lfcs range from -5 to 0.1), and the bootstrap function takes from these. Should I have it take from median-normalized lfcs instead (which are centered on zero)? It didn't look like you had done that in Phil's data so I didn't do it for mine

rasi · 2022-06-07T01:38:32Z

@kychen37 Ok, your explanation makes sense. It looks like I did not median normalize because my mRNA and gRNA had almost equal counts. But it will be good to do it for yours since your read counts are skewed.

rasi · 2022-06-07T01:45:03Z

@kychen37 Most of the missing dipeptides are hydrophobic or bulky. Maybe these are just toxic as you noticed in your plasmid library. Can you calculate the bootstrap error bars for the remaining dipeptides, so that you can highlight ones that are atleast 2xSD below median value?

Also, can you make the above plot for all codons? I assume Arg is so destabilizing primarily because it has two rare codons CGG and CGA, but this will be good to see.

kychen37 · 2022-06-07T05:08:33Z

@rasi I went with a mean-normalization since the bootstrap used mean so I figured I would keep it consistent:

code

rasi · 2022-06-07T05:20:05Z

The codon plot looks good! Will be useful to spread out the Y-axis a bit.

rasi · 2022-06-07T05:21:52Z

I recommend making frame plots similar to 1E in Phil's paper and a schematic that mirrors 1A as closely as possible.

kychen37 · 2022-06-07T05:40:48Z

@kychen37 Most of the missing dipeptides are hydrophobic or bulky. Maybe these are just toxic as you noticed in your plasmid library.

@rasi I'm trying to figure out what the effect of slice(1) is in the human code shown here:

The way I had plotted dipeptides before (which did not include slice(1) and associated grouping) I get less missing dipeptides than when I include it, I directly compared the two methods here. I think I'm just not really understanding what the slice function is doing to the data and if I need it?

rasi · 2022-06-07T05:46:23Z

@kychen37: slice(1) takes the first element of each group (https://dplyr.tidyverse.org/reference/slice.html).

It should not drop any dipeptides since you are grouping by dipeptides in the previous step.

kychen37 · 2022-06-07T06:01:19Z

@rasi

code

median-normalized frame effects if we want it zero-centered:

not normalized:

rasi · 2022-06-07T06:13:32Z

Looks good and better than what I might have guessed :-) It will be useful to carefully look at the off-diagonal elements of the above plot to see whether it is noise or something interesting.

Try the same frame plot for codon pairs as well.

Look through the language in Phil's paper and see whether you can place the observations above in the context of what is known about translation in yeast (lot more than in human).

See some of my TODO's at the top comment. Add to them if you think of additional experiments to do (this can be a useful checklist to come back to as you dig deeper).

kychen37 · 2022-06-07T19:44:40Z

@rasi bootstrapped error bars for dipeptide levels of just the VK/FK dipeptides + some controls. I'm going to show this one because the full dipeptide plot is too big

rasi · 2022-06-07T20:46:18Z

Sounds good. Add end caps to the error bars to make it a bit easier to see.
I would still start showing the full heat map from #3 after you show the single codon and single aa effects.

kychen37 · 2022-06-10T22:34:33Z

Breakdown of dicodons that are missing from this sequencing run, code

library_missing_from	n_dicodons_missing	possible reason for the dicodon being missing
gdna only	0
mrna only	108	since it's present in gdna and linkageseq, these mrnas may be severely destabilized, or mrna library was bottlenecked
gdna and mrna only	141	since it's present in linkageseq but not gdna or mrna, this implies either toxicity or bottlenecking
gdna, mrna, and linkage	180	something to do with oligo pool synthesis or cloning of the plasmid library, maybe bottlenecked at e.coli step?
TOTAL	429

For the 180 dicodons that are missing from linkage sequencing, this is the dipeptide breakdown (what proportion of inserts for each dipeptide were missing in linkage sequencing, top ~20):

For the 180 dicodons that are missing from linkage sequencing of the integrating plasmid library, 130 of these dicodons are also missing in the dual-barcode 2um plasmid library. The dipeptide breakdown of these 130 common missing dicodons (top ~20):

@rasi

kychen37 · 2022-06-10T22:38:54Z

Update

@rasi

Next week I will look into endogenous inserts from this sequencing run and plan remaining experiments + paper

kychen37 · 2022-06-15T01:19:14Z

CODE

Remake Fig1A from Presnyak 2015 to make sure the CSCs I calculated matched there's.

CSC = correlation coefficient of (frequency of a codon within a gene) vs (total RNA half life from Presnyak2015 supplemental table 1)
My remake:

Presnyak 2015:

There are couple codons where the orders are a little off, there might have been a normalization step I was missing but overall it looks very similar

Correlating the Coller lab CSCs with my LFCs

after redoing CSCs above, I found their published CSCs in Forrest2020 supplement. Very similar to my re-calculated but decided to just plot using their published data instead
maybe a moderate correlation between the two metrics

Plotting all data together

kychen37 · 2022-06-15T19:51:46Z

8x dicodon median-normalized codon average LFCs color-coded by Coller designations of 'unstable' vs 'stable' (aka CSC < 0 or > 0)

Correlating AASC with booststrapped average amino acid LFC

kychen37 · 2022-06-15T19:53:05Z

@rasi there is a moderate correlation between Coller lab CSCs and my average codon LFCs, there is basically no correlation between Coller lab AASCs and my average amino acid effects though, see comments above

phiburke · 2022-06-15T20:39:38Z

@Katharine Just wanted to say, that colored codon plot looks really good.

I wouldn't worry too much about the amino acid level effects not matching up well; AA-level effects in your library are probably driven more by localized repeats, whereas they're looking at transcriptome-wide aggregate effect measurements, so it's a very different thing.

Also, if you're doing literature comparisons you should totally compare your data to Gamble2016, Adjacent Codons Act in Concert to Modulate Translation Efficiency in Yeast from Stan Fields' group. That's probably the closest published experimental measurement to what you've done. Also, they only look at protein levels, so even if things don't match up well it might still be interesting (eg. some dicodons effect on mRNA but not protein level).

kychen37 · 2022-06-16T16:29:40Z

Thanks @phiburke ! My memory of Gamble2016 is that they just found a bunch of rare codons but it's been a while so definitely worth a re-look!

kychen37 · 2022-06-17T23:55:26Z

Update
@rasi
In addition to plots above, I am still interested in trying to correlate LFCs with tAI, CAI, and ribosome pause scores.

kychen37 · 2022-07-01T16:38:39Z

Update & TODO

I want to use the below resources to calculate tAI and CAI for my 6000-insert library
For CAI plot: Use tool or similar tools as described here to generate codon usage table and CAI scores from the 6000 insert library, then correlate with LFC
For tAI plot:
- use the genomic tRNA database to get the tRNA gene copy number
- follow dos Reis et al 2004 described here for calculating the tAI or find a tAI calculator like above

@rasi

kychen37 · 2022-07-09T00:46:44Z

@rasi
Update

calculated tAI for the 6000-insert library here
plotted LFC vs tAI here
weak correlation between LFC and tAI:

I'm currently thinking of other ways to analyze the data
I will also do this plot for CAI = geometric mean of relative adaptiveness of all codons in the gene sequence (which was calculated already in code above)

kychen37 · 2022-07-28T18:21:51Z

Frydman optimized tAI, maybe Weinberg2016, find tAI for each codon
separate dicodon and endogenous (keep as separate experiments)

kychen37 · 2022-08-12T21:18:50Z

Update

@rasi

Plots using tAI

Plots using Frydman normalized tAI (nTE): code

I want to play around with how to plot these like we discussed in short updates (look more closely at the cloud of points with very low nTE/tAI and LFC, slice and dice some more)

kychen37 · 2022-08-16T21:12:41Z

Update

Plots for CAI

kychen37 · 2022-12-22T21:35:41Z

Summary of different types of data filtering

Read cutoffs based on insert & barcode-level plots

alignment stats as described above would suggest the following cutoffs:

insert_reads_cutoff = 500
barcode_reads_cutoff = 10
n_barcodes_cutoff = 6

dipeptide heat map applying the above insert/barcode-level cutoffs:

code:

diaa_lfc <- barcode_counts %>%
    filter(insert_num != 7000) %>% # remove spike-ins
    filter(barcode_count >= barcode_reads_cutoff) %>%
    group_by(sample_name, insert_num) %>%
    summarize(insert_count = sum(barcode_count), n_barcodes = n(), .groups='drop') %>%
    inner_join(insert_annotations, by = "insert_num") %>%
    pivot_wider(names_from = sample_name, values_from = c(insert_count, n_barcodes)) %>%
    drop_na() %>%
    filter( (n_barcodes_wt_gdna >= n_barcodes_cutoff) & (n_barcodes_wt_mrna >= n_barcodes_cutoff) ) %>%
    filter( (insert_count_wt_gdna >= insert_reads_cutoff) & (insert_count_wt_mrna >= insert_reads_cutoff) ) %>%
    ungroup() %>%
    group_by(diaa) %>%
    summarize(diaa_count_gdna = sum(insert_count_wt_gdna), diaa_count_mrna = sum(insert_count_wt_mrna), diaa_n_barc_gdna = sum(n_barcodes_wt_gdna), diaa_n_barc_mrna = sum(n_barcodes_wt_mrna)) %>%
    mutate(diaa_lfc = log2(diaa_count_mrna) - log2(diaa_count_gdna)) %>%
    ungroup() %>%
    mutate(lfc_med = diaa_lfc - median(diaa_lfc))

Read cutoffs using Burke2022 code

The code from Burke2022 is slightly different from above in that it groups by dipeptide first and then imposes the insert_reads_cutoff/barcode cutoffs, so the filtering is much more lenient
dipeptide heat map using Burke2022 code:

code:

diaa_lfc_burke2022 <- barcode_counts %>% 
  filter(insert_num != 7000) %>% # remove spike-ins
  group_by(insert_num, sample_name) %>%
  summarize(count = sum(barcode_count), n_barcodes = dplyr::n()) %>%
  ungroup() %>% 
  inner_join(insert_annotations, by = "insert_num") %>%
  group_by(diaa, aa1, aa2, sample_name) %>%
  mutate(count = sum(count), n_barcodes = sum(n_barcodes)) %>%
  ungroup() %>%
  group_by(diaa, aa1, aa2) %>%
  filter(sum(count) >= insert_reads_cutoff, sum(n_barcodes) >= n_barcodes_cutoff) %>%
  ungroup() %>% 
  pivot_wider(names_from = sample_name, values_from = c("count", "n_barcodes")) %>%
  mutate(lfc = log2(count_wt_mrna) - log2(count_wt_gdna)) %>% 
  group_by(diaa, aa1, aa2) %>%
  slice(1) %>% 
  ungroup() %>%
  drop_na() %>% 
  mutate(lfc_med = lfc - median(lfc))

Minimal read cutoffs -> bootstrap by dipeptide -> filter by SD

code

barcode_reads_cutoff = 10
n_barcodes_cutoff = 2

calc_lfc_bootstrap <- function(data, indices) {
  d <- data[indices,]
  log2(sum(d$barcode_count_mrna)) - log2(sum(d$barcode_count_gdna))
}

wt_diaa_bootstrap_diaa <- wt_dicodon %>%
  filter(sample_name == 'wt') %>%
  group_by(diaa) %>%
  nest() %>%
  mutate(lfc_boot = map(data, function(df) boot::boot(data=df, statistic=calc_lfc_bootstrap, R=100)$t)) %>%
  select(-data) %>%
  mutate(lfc = map_dbl(lfc_boot, mean)) %>%
  mutate(lfc_sd = map_dbl(lfc_boot, sd)) %>%
  select(-lfc_boot) %>%
  ungroup() %>%
  mutate(lfc_mean = lfc - mean(lfc)) %>%
  mutate(lfc_med = lfc - median(lfc)) %>%
  mutate(lfc_max = lfc - max(lfc)) %>%
  mutate(strain = 'wt') %>%
  filter(lfc_sd <= 0.25)

@rasi let me know if you have any thoughts on the proper/best way to plot this

kychen37 · 2023-01-03T20:41:27Z

Destabilized dipeptides vs frameshifts

bootstrapping grouped by dipeptides
filter dipeptides with bootstrapped sds <= 0.15 based on:

find dipeptides where bootstrapped dipeptide lfcs are >=2 standard deviations below the median, these dipeptides are labeled here in the plot of mrna & grna read counts in log2 space:

for these destabilized dipeptides, plot their effects against frameshift effects
- get +1 frameshift of each dicodon belonging to destabilized dipeptide
- get bootstrapped dipeptide lfc for frameshifts, removing any dicodons that happen to correspond to destabilized set of dicodons or that happen to include CGG/CGA

kychen37 · 2023-01-03T20:44:06Z

@rasi

code

median-normalized frame effects if we want it zero-centered:

not normalized:

remake of frame effects plot above, this time imposing read cutoffs (500 reads per insert, 10 reads per barcode, 6 barcodes per insert) and removing any dicodons that go missing:

kychen37 added the analysis An analysis that we decide to carry out label Apr 1, 2022

kychen37 self-assigned this Apr 1, 2022

kychen37 mentioned this issue Aug 17, 2024

Transform and library prep fk8 DMS library #2

Open

6 tasks

kychen37 closed this as completed Sep 25, 2023

MariaT0 transferred this issue from another repository Aug 17, 2024

rasi reopened this Aug 17, 2024

Analyze yeast cytosolic 8x dicodon screen data #3

Analyze yeast cytosolic 8x dicodon screen data #3

Comments

kychen37 commented Apr 1, 2022 • edited Loading

Background

Ideas TODO

Analysis Links

Brief conclusion

kychen37 commented Apr 12, 2022

rasi commented Apr 12, 2022

kychen37 commented Apr 15, 2022 • edited Loading

kychen37 commented Apr 15, 2022

kychen37 commented Apr 15, 2022

kychen37 commented May 27, 2022

rasi commented Jun 6, 2022 • edited Loading

kychen37 commented Jun 6, 2022 • edited Loading

rasi commented Jun 6, 2022

kychen37 commented Jun 6, 2022

kychen37 commented Jun 6, 2022

rasi commented Jun 6, 2022

rasi commented Jun 6, 2022

kychen37 commented Jun 6, 2022 • edited Loading

kychen37 commented Jun 7, 2022

rasi commented Jun 7, 2022

rasi commented Jun 7, 2022

kychen37 commented Jun 7, 2022

rasi commented Jun 7, 2022

rasi commented Jun 7, 2022

kychen37 commented Jun 7, 2022

rasi commented Jun 7, 2022

kychen37 commented Jun 7, 2022 • edited Loading

rasi commented Jun 7, 2022

kychen37 commented Jun 7, 2022 • edited Loading

rasi commented Jun 7, 2022

kychen37 commented Jun 10, 2022 • edited Loading

Breakdown of dicodons that are missing from this sequencing run, code

kychen37 commented Jun 10, 2022

kychen37 commented Jun 15, 2022 • edited Loading

Remake Fig1A from Presnyak 2015 to make sure the CSCs I calculated matched there's.

Correlating the Coller lab CSCs with my LFCs

Plotting all data together

kychen37 commented Jun 15, 2022 • edited Loading

8x dicodon median-normalized codon average LFCs color-coded by Coller designations of 'unstable' vs 'stable' (aka CSC < 0 or > 0)

Correlating AASC with booststrapped average amino acid LFC

kychen37 commented Jun 15, 2022

phiburke commented Jun 15, 2022 • edited Loading

kychen37 commented Jun 16, 2022

kychen37 commented Jun 17, 2022

kychen37 commented Jul 1, 2022 • edited Loading

kychen37 commented Jul 9, 2022 • edited Loading

kychen37 commented Jul 28, 2022

kychen37 commented Aug 12, 2022 • edited Loading

Update

kychen37 commented Aug 16, 2022 • edited Loading

Update

kychen37 commented Dec 22, 2022 • edited Loading

Summary of different types of data filtering

Read cutoffs based on insert & barcode-level plots

Read cutoffs using Burke2022 code

Minimal read cutoffs -> bootstrap by dipeptide -> filter by SD

kychen37 commented Jan 3, 2023 • edited Loading

Destabilized dipeptides vs frameshifts

kychen37 commented Jan 3, 2023 • edited Loading

kychen37 commented Apr 1, 2022 •

edited

Loading

kychen37 commented Apr 15, 2022 •

edited

Loading

rasi commented Jun 6, 2022 •

edited

Loading

kychen37 commented Jun 6, 2022 •

edited

Loading

kychen37 commented Jun 6, 2022 •

edited

Loading

kychen37 commented Jun 7, 2022 •

edited

Loading

kychen37 commented Jun 7, 2022 •

edited

Loading

kychen37 commented Jun 10, 2022 •

edited

Loading

kychen37 commented Jun 15, 2022 •

edited

Loading

kychen37 commented Jun 15, 2022 •

edited

Loading

phiburke commented Jun 15, 2022 •

edited

Loading

kychen37 commented Jul 1, 2022 •

edited

Loading

kychen37 commented Jul 9, 2022 •

edited

Loading

kychen37 commented Aug 12, 2022 •

edited

Loading

kychen37 commented Aug 16, 2022 •

edited

Loading

kychen37 commented Dec 22, 2022 •

edited

Loading

kychen37 commented Jan 3, 2023 •

edited

Loading

kychen37 commented Jan 3, 2023 •

edited

Loading