-
Notifications
You must be signed in to change notification settings - Fork 67
Proposed Analysis: Molecularly subtype chordomas #250
Comments
This is what I think the table summarizing the results would look like based on the above:
|
@mkoptyra and I will start on this next week. |
Brachyury / T / TBXT protein expression is a standard test clinically to diagnose chordoma. |
Documenting a whiteboarding session with @mkoptyra from this morning: I will post my thoughts on the SMARCB1 expression jitterplot at the bottom of this photo later today. |
Figured out my source of confusion from yesterday morning @mkoptyra 🙂 #248 (comment) We might want to wait until the next data release to proceed with the SMARCB1 jitter plot because then we can more easily work with the expression data for this purpose, but I'll give it a bit of a think. |
Thank you! That makes sense reading the #248 (imagine, that I even understood that conversation ;) ) |
Okay, I apologize for the delay @mkoptyra. We got v11 out this week so this is ready to be revisited. There's a greater level of wrangling required than I fully appreciated! In addition to the fields outlined here #250 (comment), we need to include the To get the SMARCB1 jitter plot in the photo here #250 (comment), you will first need to read in the collapsed expression data: expression_data <- read_rds("../../data/pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds") or, preferably, using expression_data <- read_rds(file.path("..", "..", "data", "pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds")) In what you have so far in your notebook, you have a data frame called Side note: We are in the process of getting copy number consensus calls (#128) through review so the copy number file input may change.
> chordoma_loss
# A tibble: 4 x 7
biospecimen_id status copy_number ploidy ensembl gene_symbol cytoband
<chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 BS_6F49F7WH loss 1 3 ENSG00000099956 SMARCB1 22q11.23
2 BS_9GN1QA3Q loss 1 2 ENSG00000099956 SMARCB1 22q11.23
3 BS_BWZTMWTM loss 1 2 ENSG00000099956 SMARCB1 22q11.23
4 BS_JTBM5TSE loss 3 4 ENSG00000099956 SMARCB1 22q11.23 Any copy neutral events are removed from the file that you're using. So to make the jitter plot:
So first thing, let's get a data frame that contains the identifier information we need but limited to chordoma samples: chordoma_id_df <- histologies_df %>%
# only rows with chordoma samples
filter(short_histology == "Chordoma") %>%
# select only these columns that we'll need later
select(Kids_First_Biospecimen_ID, sample_id, Kids_First_Participant_ID,
experimental_strategy) and then to get the copy neutral data frame described above: copy_neutral_df <- chordoma_id_df %>%
# the copy events can only be taken from WGS data
# not RNA-seq data
# we also only want biospecimens where a loss was
# not recorded to avoid duplicates
filter(experimental_strategy == "WGS",
!(Kids_First_Biospecimen_ID %in% chordoma_loss$biospecimen_id)) %>%
# if there's no loss, let's assume status is copy
# neutral
mutate(status = "neutral") %>%
# let's get the columns to match chordoma_loss
rename(biospecimen_id = Kids_First_Biospecimen_ID) %>%
select(biospecimen_id, status) which looks like: > copy_neutral_df
# A tibble: 5 x 2
biospecimen_id status
<chr> <chr>
1 BS_59FR1NC2 neutral
2 BS_5B6XZ7YP neutral
3 BS_FBJ516WW neutral
4 BS_HN8DE43A neutral
5 BS_XEVMEYFS neutral Now it's time to join the losses with the neutrals to get a new data frame: chordoma_copy <- chordoma_loss %>%
select(biospecimen_id, status) %>%
bind_rows(copy_neutral_df) And we need to get the chordoma_copy <- chordoma_copy %>%
# get only the Kids_First_Biospecimen_ID, sample_id
# columns from our identifier data.frame
# then use biospecimen IDs to add the sample_id info
inner_join(select(chordoma_id_df,
Kids_First_Biospecimen_ID,
sample_id),
by = c("biospecimen_id" = "Kids_First_Biospecimen_ID")) The result looks like: > chordoma_copy
# A tibble: 9 x 3
biospecimen_id status sample_id
<chr> <chr> <chr>
1 BS_6F49F7WH loss 7316-3295
2 BS_9GN1QA3Q loss 7316-2935
3 BS_BWZTMWTM loss 7316-1101
4 BS_JTBM5TSE loss 7316-3632
5 BS_59FR1NC2 neutral 7316-723
6 BS_5B6XZ7YP neutral 7316-4062
7 BS_FBJ516WW neutral 7316-921
8 BS_HN8DE43A neutral 7316-2248
9 BS_XEVMEYFS neutral 7316-431 Okay, now we're ready to look at SMARCB1 expression values only in chordoma: # get the row that contains the SMARCB1 values
# gene symbols are rownames
smarcb1_expression <- expression_data[which(rownames(expression_data) == "SMARCB1"), ]
# now only the columns correspond to chordoma samples
smarcb1_expression <- smarcb1_expression[, which(colnames(expression_data) %in% chordoma_samples) ] And then > str(smarcb1_expression)
'data.frame': 1 obs. of 10 variables:
$ BS_0VRSD9V3: num 113
$ BS_64HCD9K3: num 33.7
$ BS_67PX06P3: num 15.7
$ BS_DBDXCXT5: num 18.7
$ BS_EJ9JKM1C: num 3.89
$ BS_GFM6EA61: num 20.9
$ BS_GSVCN2XC: num 43.8
$ BS_P4D8Y3S8: num 7.1
$ BS_W36RZSFA: num 53.1
$ BS_YB07VF1X: num 14.3 So we can do the following to get it close to the final table we want and to get this ready for plotting: # transpose such that samples are rows
smarcb1_expression <- t(smarcb1_expression) %>%
# make a data.frame
as.data.frame() %>%
# we want the rownames that are biospecimen identifers
# as their own column called Kids_First_Biospecimen_ID
tibble::rownames_to_column("Kids_First_Biospecimen_ID") %>%
# give SMARCB1 column a slightly better column name
rename(SMARCB1_expression = SMARCB1) That step gives us this: > smarcb1_expression
Kids_First_Biospecimen_ID SMARCB1_expression
1 BS_0VRSD9V3 113.08
2 BS_64HCD9K3 33.73
3 BS_67PX06P3 15.68
4 BS_DBDXCXT5 18.66
5 BS_EJ9JKM1C 3.89
6 BS_GFM6EA61 20.87
7 BS_GSVCN2XC 43.79
8 BS_P4D8Y3S8 7.10
9 BS_W36RZSFA 53.15
10 BS_YB07VF1X 14.29 This also needs smarcb1_expression <- smarcb1_expression %>%
inner_join(select(chordoma_id_df,
Kids_First_Biospecimen_ID,
sample_id),
by = "Kids_First_Biospecimen_ID") Now we're ready to get the copy number information and the expression values into the same data frame using chordoma_smarcb1_df <- smarcb1_expression %>%
# any missing samples will get filled with NA
# when using a full join
full_join(chordoma_copy, by = "sample_id") To get: > str(chordoma_smarcb1_df)
'data.frame': 10 obs. of 5 variables:
$ Kids_First_Biospecimen_ID: chr "BS_0VRSD9V3" "BS_64HCD9K3" "BS_67PX06P3" "BS_DBDXCXT5" ...
$ SMARCB1_expression : num 113.08 33.73 15.68 18.66 3.89 ...
$ sample_id : chr "7316-2248" "7316-406" "7316-3632" "7316-4062" ...
$ biospecimen_id : chr "BS_HN8DE43A" NA "BS_JTBM5TSE" "BS_5B6XZ7YP" ...
$ status : chr "neutral" NA "loss" "neutral" ... All of the information we need to look at SMARCB1 is in a single data frame, where each row is a chordoma sample. Now to get a bit closer to what is in this comment #250 (comment), we can add the participant ID in using the data frame that contains the identifiers. # this step is what adds in the participant identifier
# we use the sample_id to match between the two data.frame
chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
inner_join(distinct(select(chordoma_id_df,
sample_id,
Kids_First_Participant_ID)),
by = "sample_id")
# here we're combining the two biospecimen identifiers to
# a single column where all biospecimen IDs for a sample
# will be separated by a comma
chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
mutate(Kids_First_Biospecimen_ID = if_else(
# one sample is missing WGS data, so if that's true
# only include biospecimen ID from RNA-seq
is.na(biospecimen_id),
Kids_First_Biospecimen_ID,
paste(Kids_First_Biospecimen_ID, biospecimen_id, sep = ", ")
)) And we can drop the redundant biospecimen ID column and rearrange the data frame such that the identifiers come first, followed by focal SMARCB1 status and SMARCB1 expression values: chordoma_smarcb1_df <- chordoma_smarcb1_df %>%
select(Kids_First_Participant_ID,
Kids_First_Biospecimen_ID,
sample_id,
status,
SMARCB1_expression) %>%
# we'll give 'status' a more descriptive name
rename(focal_SMARCB1_status = status)
Now for the plot -- first step is putting
# this specifies that this is the data we want to plot
chordoma_smarcb1_df %>%
# drop the sample that doesn't have WGS data
tidyr::drop_na() %>%
# this step specifies what should go on the x- and y-axes
ggplot(aes(x = focal_SMARCB1_status,
y = SMARCB1_expression)) +
# we want a jitter plot where the points aren't too far
# apart that's what width does
geom_jitter(width = 0.1) +
# this is plotting the median as a blue diamond
stat_summary(fun.y = "median",
geom = "point",
size = 3,
color = "blue",
shape = 18) There are lots of ways to customize the plot to your liking. One way is to change the theme (see Let me know if you have any questions @mkoptyra :) |
Per this discussion https://github.com/AlexsLemonade/OpenPBTA-analysis/pull/475/files#r371877081 - what's left here is to evaluate the poorly differentiated calls in light of how the expression levels look overall (#387) and to have a domain expert weigh in. |
I am going to call this subsumed by #608. |
Scientific goals
What are the scientific goals of the analysis?
Subtype chordomas into poorly differentiated if contain INI1/SMARCB1 loss.
Proposed methods
What methods do you plan to use to accomplish the scientific goals?
Add Poorly-differentiated subtype if the chordomas show SMARCB1 loss and loss of expression.
Required input data
What input data will you use for this analysis?
Histologies file, SNVs, copy number, RNA expression data.
Proposed timeline
What is the timeline for the analysis?
1-2 days
Relevant literature
If there is relevant scientific literature, put links to those items here.
Link to Loss of SMARCB1/INI1 expression in poorly differentiated chordomas.
Link to Poorly differentiated chordoma with loss of SMARCB1/INI1 expression in pediatric patients: A report of two cases and review of the literature.
The text was updated successfully, but these errors were encountered: