Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MultiCCA clustering and tumor samples #1

Open
galadrielbriere opened this issue Apr 30, 2019 · 0 comments
Open

MultiCCA clustering and tumor samples #1

galadrielbriere opened this issue Apr 30, 2019 · 0 comments

Comments

@galadrielbriere
Copy link

Hello,

I'm trying to run some parts of your benchmark and I have some questions about your code and some of the choices you made.

First, I have a question about the MultiCCA run :

cca.ret = PMA::MultiCCA(omics.transposed, ncomponents = MAX.NUM.CLUSTERS)
sample.rep = omics.transposed[[1]] %*% cca.ret$ws[[1]]

It seems here that only the fisrt omic dataset is used to generate sample.rep, reducing it using the canonical variates found for this dataset. sample.rep is then used for the clustering. Why did you choose the first omic ? Can we consider using another dataset ? Let's say :

sample.rep = omics.transposed[[2]] %*% cca.ret$ws[[2]]

What are the consequences on the results ?

Second, in the same MultiCCA run, the silhouette values of clusters are computed to chose coherent clusters :

 sils = c()
  clustering.per.num.clusters = list()
  for (num.clusters in 2:MAX.NUM.CLUSTERS) {
    cur.clustering = kmeans(sample.rep, num.clusters, iter.max=100, nstart=30)$cluster  
    sil = get.clustering.silhouette(list(t(sample.rep)), cur.clustering)
    sils = c(sils, sil)
    clustering.per.num.clusters[[num.clusters - 1]] = cur.clustering
}
 cca.clustering = clustering.per.num.clusters[[which.min(sils)]]

I don't understand the last line of this code : why did you choose the min average silhouette width ? I thought the higher the silhouette value, the better was the clustering. Shouldn't it be which.max(sils) instead ?

Finally, my last question is about the choice of removing some tissues from the datasets :

filter.non.tumor.samples <- function(raw.datum, only.primary=only.primary) {
  # 01 is primary, 06 is metastatic, 03 is blood derived cancer
  if (!only.primary)
    return(raw.datum[,substring(colnames(raw.datum), 14, 15) %in% c('01', '03', '06')])
  else
    return(raw.datum[,substring(colnames(raw.datum), 14, 15) %in% c('01')])
}

Why did you chose to select only primary tumors for some cancers and discard other sample types like metastatic or recurrent tumor ? Is it coherent to discard only "normal" samples and keep the information on the samples types (not running the fix.patient.names function) so that the clusters also take this information ?

I hope my questions are clear,
Thank you in advance !
Galadriel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant