Add support for SSM files within ICGC projects #4

Open
victorlin opened this issue Nov 28, 2018 · 4 comments

@victorlin

I'm wondering if it would be appropriate to add support for reading files such as simple_somatic_mutation.open.BRCA-US.tsv.gz.

Reading this data is the initial step of a project I will be starting soon. I would be more than willing to implement the parser.

For reference, the file has these columns:

  1. icgc_mutation_id
  2. icgc_donor_id
  3. project_code
  4. icgc_specimen_id
  5. icgc_sample_id
  6. matched_icgc_sample_id
  7. submitted_sample_id
  8. submitted_matched_sample_id
  9. chromosome
  10. chromosome_start
  11. chromosome_end
  12. chromosome_strand
  13. assembly_version
  14. mutation_type
  15. reference_genome_allele
  16. mutated_from_allele
  17. mutated_to_allele
  18. quality_score
  19. probability
  20. total_read_count
  21. mutant_allele_read_count
  22. verification_status
  23. verification_platform
  24. biological_validation_status
  25. biological_validation_platform
  26. consequence_type
  27. aa_mutation
  28. cds_mutation
  29. gene_affected
  30. transcript_affected
  31. gene_build_version
  32. platform
  33. experimental_protocol
  34. sequencing_strategy
  35. base_calling_algorithm
  36. alignment_algorithm
  37. variation_calling_algorithm
  38. other_analysis_algorithm
  39. seq_coverage
  40. raw_data_repository
  41. raw_data_accession
  42. initial_data_release_date
@Ad115
Owner

Ad115 commented Feb 4, 2019

Sorry for the late response. Of course! It would be great. Feel free to make the changes and send me a pull request! 💃

@Ad115
Owner

Ad115 commented Feb 4, 2019

Also, it would be great if you used the facilities in the standard library for gzip files and tsv files.
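As a rough illustration of the standard-library route, here is a minimal sketch that reads a gzipped, tab-separated file with gzip and csv. The function name read_ssm_tsv is hypothetical, and the sketch assumes the file's first row is a header naming the columns listed above.

```python
import csv
import gzip


def read_ssm_tsv(path):
    """Yield one dict per row of a gzipped TSV file (hypothetical helper)."""
    # "rt" opens the compressed stream in text mode; newline="" is what the
    # csv module recommends for file objects it reads from.
    with gzip.open(path, mode="rt", newline="") as handle:
        # Assumption: the first row is a header with the column names.
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            yield row
```

Each yielded row maps column names to string values, so a field such as row["chromosome_start"] would still need converting to int by the caller.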

The parser for project SSMs could be another class analogous to SSM_Reader, but what should it be called? Maybe Project_SSM_Reader? Or should we extend the interface of the existing parser? I'm thinking of something like:

reader = SSM_Reader(filename='simple_somatic_mutation.open.BRCA-US.tsv.gz', file_type='project ssm')

Another option that seems sensible to me is to refactor out the dependency on vcf.Reader, so that one could switch from VCF to TSV without making a separate class. Something like:

# The old behavior:
reader = SSM_Reader(filename='simple_somatic_mutation.aggregated.vcf.gz', file_type='vcf')

# The new behavior:
reader = SSM_Reader(filename='simple_somatic_mutation.open.BRCA-US.tsv.gz', file_type='tsv')

In each case, a different reader (vcf.Reader or csv.reader) would be instantiated internally.
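The dispatch could look roughly like the sketch below. The names open_ssm_records and _iter_tsv_rows are hypothetical and not part of the existing library; the VCF branch assumes PyVCF is installed and imports it lazily, so the TSV branch needs only the standard library.

```python
import csv
import gzip


def _iter_tsv_rows(filename):
    # Written as a generator so the gzip handle is closed when iteration ends.
    with gzip.open(filename, "rt", newline="") as handle:
        yield from csv.reader(handle, delimiter="\t")


def open_ssm_records(filename, file_type):
    """Hypothetical helper: choose the underlying reader from file_type."""
    if file_type == "vcf":
        # PyVCF is third-party; the lazy import keeps the TSV path stdlib-only.
        import vcf
        return vcf.Reader(filename=filename)
    if file_type == "tsv":
        return _iter_tsv_rows(filename)
    raise ValueError(f"unsupported file_type: {file_type!r}")
```

Note that the two branches yield different things (PyVCF record objects versus lists of strings), so a real SSM_Reader would still have to normalize them into one record type.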

It'd be great to hear your thoughts on the subject 🌝

@victorlin
Author

It seems like ICGC provides SSM data in two formats: VCF-like and ICGC-like Mutation Format. I'll refer to the "ICGC-like" format as TSV for now.

It's worth noting that the TSV format isn't only available per-project. All SSM data downloaded from the web portal is in the TSV format. It seems to be the more widely available option, whereas the VCF format is only available by downloading everything at once from the data release in DCC/current/Summary.

That is just my personal understanding of the ICGC structure. Maybe this library could automatically detect either format:

reader1 = SSM_Reader(filename='simple_somatic_mutation.aggregated.vcf.gz')
reader2 = SSM_Reader(filename='simple_somatic_mutation.open.BRCA-US.tsv.gz')

or have the user specify which is passed in:

reader1 = SSM_Reader(vcf='simple_somatic_mutation.aggregated.vcf.gz')
reader2 = SSM_Reader(tsv='simple_somatic_mutation.open.BRCA-US.tsv.gz')

@Ad115
Owner

Ad115 commented Feb 5, 2019

I love the idea of automatic detection of the file format, given that every well-formed VCF must start with a line specifying the format, according to the VCF specification.

Given that, I cannot think of a plausible case for having the user specify the file format manually. Maybe it would be good to keep it as an optional switch, or to remove the option altogether.
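A minimal sketch of such detection follows, relying on the ##fileformat line that opens a well-formed VCF. The function name detect_ssm_format is hypothetical, and the TSV check assumes the file starts with a header row whose first column is icgc_mutation_id, as in the column list earlier in this thread.

```python
import gzip


def detect_ssm_format(path):
    """Return 'vcf' or 'tsv' by inspecting the first line (hypothetical helper)."""
    # Assumption: compressed files are recognizable by their .gz suffix.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        first_line = handle.readline()
    if first_line.startswith("##fileformat=VCF"):
        return "vcf"
    # Assumption: the TSV has a header row beginning with icgc_mutation_id.
    if first_line.startswith("icgc_mutation_id\t"):
        return "tsv"
    raise ValueError(f"could not detect SSM file format of {path!r}")
```

An optional file_type argument could bypass this check for files that do not match either pattern.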


2 participants