Merge pull request #1739 from milaboratory/4-7-0-RC-plus-mitool-integration

4 7 0 rc plus mitool integration
gnefedev authored Aug 7, 2024
2 parents 4adb74a + 0f7f626 commit c9a1bb8
Showing 359 changed files with 12,563 additions and 2,697 deletions.
6 changes: 3 additions & 3 deletions build.gradle.kts
Original file line number Diff line number Diff line change
@@ -134,7 +134,7 @@ val toObfuscate: Configuration by configurations.creating {
val obfuscationLibs: Configuration by configurations.creating


val mixcrAlgoVersion = "4.6.0-120-develop"
val mixcrAlgoVersion = "4.6.0-206-4-7-0-RC-plus-mitool-integration"
// may be blank (will be inherited from mixcr-algo)
val milibVersion = ""
// may be blank (will be inherited from mixcr-algo or milib)
@@ -178,8 +178,8 @@ dependencies {
toObfuscate("io.repseq:repseqio") { exclude("*", "*") }
toObfuscate("com.milaboratory:milm2-jvm") { exclude("*", "*") }

// proguard require classes that were inherited
obfuscationLibs("com.github.ajalt.clikt:clikt:$cliktVersion") { exclude("*", "*") }
// required to call mitool
implementation("com.github.ajalt.clikt:clikt:$cliktVersion")

// required for buildLibrary (to call repseqio)
implementation("com.beust:jcommander:$jcommanderVersion")
30 changes: 27 additions & 3 deletions changelogs/v4.6.1.md → changelogs/v4.7.0.md
@@ -1,15 +1,35 @@
## 🚀 New features and major changes
## ❗ Breaking changes

- Starting with MiXCR 4.7.0, users must specify the assembling feature for every preset where the protocol does not
define it. Use `--assemble-clonotypes-by Feature`, or `--assemble-contigs-by Feature` for fragmented data (such as
RNA-seq or 10x VDJ data). This keeps assembling features consistent when integrating different samples or sample
types, such as 10x single-cell VDJ and AIRR sequencing data, for downstream analyses like inferring alleles or
building SHM trees. The previous behavior for fragmented data, which assembled sequences as long as possible, is
still available via `--assemble-contigs-by-cell` for single-cell data or `--assemble-longest-contigs` for
RNA-seq/Exome-seq data.
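
As a rough illustration of the requirement above, a wrapper script could pick the assembling-feature option before calling `mixcr analyze`. The sketch below is hypothetical: the function name `assembling_feature_flag` and the data-type labels `amplicon` and `fragmented` are invented for this example, and the feature values (`CDR3`, `VDJRegion`) are simply the ones used in this commit's itests. It only prints the option, so it runs without MiXCR installed.

```shell
# Hypothetical helper (not part of MiXCR): map a data-type label to the
# assembling-feature option that must now be passed explicitly.
assembling_feature_flag() {
  case "$1" in
    amplicon)   echo "--assemble-clonotypes-by CDR3" ;;       # value used in the amplicon itests
    fragmented) echo "--assemble-contigs-by VDJRegion" ;;     # value used in the 10x-vdj-bcr itests
    *)          echo "unknown data type: $1" >&2; return 1 ;; # refuse to guess a feature
  esac
}

# Print the option only; the remaining mixcr arguments are left to the caller.
assembling_feature_flag amplicon
```

Failing on an unknown label mirrors the intent of the breaking change: the feature should be chosen deliberately rather than defaulted.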

## 🚀 Major fixes and upgrades

- Ability to trigger realignment of left or right read boundaries with the global alignment algorithm using the
parameters `rightForceRealignmentTrigger Feature` or `leftForceRealignmentTrigger Feature` when the reads do
not span the CDR3 region (rescues alignments for fragmented single-cell data).
- Fixed `assemble` behavior in presets for single-cell data (in some cases consensuses were assembled from reads coming
from different cells)
- Ability to override the `relativeMeanScore` and `maxHits` parameters in `assemble` and `assembleContigs` steps
(improves V gene assignment)
- Consensus assembly in `assemble` is now performed separately for each chain. This prevents different expression
levels from affecting the consensus assembly algorithm. This change is especially important for single-cell
presets with cell-level assembly (most of the MiXCR presets for single-cell data).
- Export of trees and tree nodes now supports imputed features
- Options `--dont-correct-tag-with-name <tag_name>` or `--dont-correct-tag-type (Molecule|Cell|Sample)` can be
specified to skip tag correction. This degrades the overall quality of analysis but decreases memory
consumption
- The MiTool pipeline is now integrated into the `10x-sc-xcr-vdj` preset, which improves the overall quality of `analyze`

## 🛠️ Minor improvements & fixes

- The default input quality filter (`badQualityThreshold`) in the `assemble` stage was decreased to 10.
- Added validation for `assembleCells` that input files are assembled by a fixed feature
- Export of trees and tree nodes now supports imputed features
- Fixed parsing of optional arguments
for `exportShmTreesWithNodes`: `-nMutationsRelative`, `-aaMutations`, `-nMutations`, `-aaMutationsRelative`, `-allNMutations`, `-allAAMutations`, `-allNMutationsCount`, `-allAAMutationsCount`.
- Fixed parsing of optional arguments for `exportClones`
@@ -44,7 +64,11 @@
- Fixed naming of output files of `downsample` command
- `--output-not-used-reads` of `analyze` command now works with bam input files too, alongside `--not-aligned-(R1|R2)`
and `--not-parsed-(R1|R2)` of `align` command
- fix `replaceWildcards` behaviour on parsing BAM. It led before to discarding of quality on `align`
- Fix `replaceWildcards` behaviour on parsing BAM. Previously it led to quality being discarded on `align`
- `v_call`, `d_call`, `j_call` and `c_call` columns in airr now output only the best hit, not the whole list
- Stable behavior of `replaceWildcards`. Previously it depended on the position of a read in the file; now it depends
on read content
- If the sample sheet supplied by the `--sample-sheet[-strict]` option has a `*` symbol after a tag name, it is preserved

## New Presets

29 changes: 17 additions & 12 deletions itests/case-IR.sh
@@ -19,19 +19,21 @@ assert() {
set -eux

mixcr analyze generic-lt-single-cell-amplicon \
--tag-pattern "^(R1:*)\^(R2:*)\^(CELL1:*)\^(CELL2:*)" \
--assemble-clonotypes-by CDR3 \
--tag-pattern "^(CELL1:*)\^(CELL2:*)\^(R1:*)\^(R2:*)" \
--species hsa \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
subset_B004-7_S247_L001_I1_001.fastq.gz \
subset_B004-7_S247_L001_I2_001.fastq.gz \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
output_normal

mixcr analyze generic-lt-single-cell-amplicon \
--tag-pattern "^(R1:*)\^(R2:*)\^(CELL1:*)\^(CELL2:*)" \
--assemble-clonotypes-by CDR3 \
--tag-pattern "^(CELL1:*)\^(CELL2:*)\^(R1:*)\^(R2:*)" \
--species hsa \
--rna \
--floating-left-alignment-boundary \
@@ -41,38 +43,41 @@ mixcr analyze generic-lt-single-cell-amplicon \

## R2 as UMI
mixcr analyze generic-lt-single-cell-amplicon-with-umi \
--tag-pattern "^(R1:*)\^(UMI:*)\^(CELL1:*)\^(CELL2:*)" \
--assemble-clonotypes-by CDR3 \
--tag-pattern "^(CELL1:*)\^(CELL2:*)\^(R1:*)\^(UMI:*)" \
--species hsa \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
subset_B004-7_S247_L001_I1_001.fastq.gz \
subset_B004-7_S247_L001_I2_001.fastq.gz \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
output_UMI1

# R1 as UMI and payload
mixcr analyze generic-lt-single-cell-amplicon-with-umi \
--tag-pattern "^N{16}(UMI:N{10})(R1:*)\^(R2:*)\^(CELL1:*)\^(CELL2:*)" \
--assemble-clonotypes-by CDR3 \
--tag-pattern "^(CELL1:*)\^(CELL2:*)\^N{16}(UMI:N{10})(R1:*)\^(R2:*)" \
--species hsa \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
subset_B004-7_S247_L001_I1_001.fastq.gz \
subset_B004-7_S247_L001_I2_001.fastq.gz \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
output_UMI2

# R1+R2+I1
mixcr analyze generic-lt-single-cell-amplicon \
--tag-pattern "^(R1:*)\^(R2:*)\^(CELL1:*)" \
--assemble-clonotypes-by CDR3 \
--tag-pattern "^(CELL1:*)\^(R1:*)\^(R2:*)" \
--species hsa \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
subset_B004-7_S247_L001_I1_001.fastq.gz \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
subset_B004-7_S247_L001_I1_001.fastq.gz \
output_R1_R2_I1
16 changes: 9 additions & 7 deletions itests/case-base_single_cell.sh
@@ -20,6 +20,7 @@ set -euxo pipefail

mixcr analyze --verbose 10x-sc-xcr-vdj \
--species hs \
--assemble-contigs-by-cells \
single_cell_vdj_t_subset_R1.fastq.gz \
single_cell_vdj_t_subset_R2.fastq.gz \
base_single_cell.raw
@@ -31,14 +32,15 @@ mixcr analyze --verbose 10x-sc-xcr-vdj \
single_cell_vdj_t_subset_R2.fastq.gz \
base_single_cell.vdjcontigs

assert "cat base_single_cell.vdjcontigs.assembleContigs.report.json | head -n 1 | jq -r .finalCloneCount" "6"
assert "cat base_single_cell.vdjcontigs.assembleContigs.report.json | head -n 1 | jq -r .finalCloneCount" "7"

assert "mixcr exportClones --no-header base_single_cell.vdjcontigs.assembledCells.clns | wc -l" "6"
assert "mixcr exportClones --no-header --split-by-tags Cell base_single_cell.vdjcontigs.assembledCells.clns | wc -l" "6"
assert "mixcr exportClones --no-header --split-by-tags Molecule base_single_cell.vdjcontigs.assembledCells.clns | wc -l" "59"
assert "mixcr exportClones --no-header -tags Molecule base_single_cell.vdjcontigs.assembledCells.clns | wc -l" "59"
assert "mixcr exportClones --no-header base_single_cell.vdjcontigs.assembledCells.clns | wc -l" "7"
assert "mixcr exportClones --no-header --split-by-tags Cell base_single_cell.vdjcontigs.assembledCells.clns | wc -l" "7"
assert "mixcr exportClones --no-header --split-by-tags Molecule base_single_cell.vdjcontigs.assembledCells.clns | wc -l" "92"
assert "mixcr exportClones --no-header -tags Molecule base_single_cell.vdjcontigs.assembledCells.clns | wc -l" "92"
assert "mixcr exportClones --no-header --drop-default-fields -cellGroup base_single_cell.vdjcontigs.assembledCells.clns | sort | uniq | wc -l" "3"

assert "mixcr exportClones --no-header --add-export-clone-grouping tag:CELL --drop-default-fields -readFraction base_single_cell.vdjcontigs.assembledCells.clns | jq -s add" "3"
assert "mixcr exportClones --no-header --add-export-clone-grouping tag:CELL --drop-default-fields -uniqueTagFraction Molecule base_single_cell.vdjcontigs.assembledCells.clns | jq -s add" "3"
# I didn't find a normal way to round a number in bash
#assert "mixcr exportClones --no-header --add-export-clone-grouping tag:CELL --drop-default-fields -readFraction base_single_cell.vdjcontigs.assembledCells.clns | jq -s add" "3"
#assert "mixcr exportClones --no-header --add-export-clone-grouping tag:CELL --drop-default-fields -uniqueTagFraction Molecule base_single_cell.vdjcontigs.assembledCells.clns | jq -s add" "3"
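
One possible way to do the rounding that the disabled assertions above would need is to let `awk` sum the values and format the total with `printf`. This is a sketch, not part of the commit: `round_sum` is a hypothetical helper name, and it assumes the export prints one numeric fraction per line. It rounds to the nearest integer rather than strictly up, which should be enough when the sum is only off by floating-point noise.

```shell
# Hypothetical helper: sum one number per line from stdin and print the
# total rounded to the nearest integer (awk's printf does the rounding).
round_sum() {
  awk '{ sum += $1 } END { printf "%.0f\n", sum }'
}

# Three fractions whose sum is 3 up to floating-point noise.
printf '0.9999999\n1.0000001\n1\n' | round_sum   # prints 3
```

Under these assumptions, piping the export into `round_sum` in place of `jq -s add` would let the assertions compare against `"3"` again; whether that matches the intended check is left to the test author.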

2 changes: 2 additions & 0 deletions itests/case-export_preset.sh
@@ -3,13 +3,15 @@
set -euxo pipefail

mixcr analyze --verbose \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary J\
test-tcr-shotgun test_R1.fastq test_R2.fastq result

mixcr exportPreset --preset-name test-tcr-shotgun \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
5 changes: 3 additions & 2 deletions itests/case-parse-header.sh
@@ -17,8 +17,9 @@ assert() {
set -euxo pipefail

mixcr analyze -f test-tag-from-header \
--assemble-clonotypes-by CDR3 \
sample_IGH_{{R}}.fastq \
case_header_parse

assert "cat case_header_parse.TAGCTT.assemble.report.json | head -n 1 | jq .readsInClones" "64"
assert "cat case_header_parse.GAGCTT.assemble.report.json | head -n 1 | jq .readsInClones" "68"
assert "cat case_header_parse.TAGCTT.assemble.report.json | head -n 1 | jq .readsInClones" "65"
assert "cat case_header_parse.GAGCTT.assemble.report.json | head -n 1 | jq .readsInClones" "66"
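
The `assert` helper called above is defined earlier in each script, and its body is not part of this diff. A minimal sketch of how such a helper might look, assuming it takes a command string and an expected value, is shown here; the whitespace trimming is an added assumption to cope with `wc -l` implementations that pad their output.

```shell
# Hypothetical re-creation of an assert helper; the real definition is not
# shown in this diff. Usage: assert "<command>" "<expected output>"
assert() {
  local command="$1" expected="$2" actual
  # Run the command string and trim leading/trailing whitespace from its output.
  actual=$(eval "$command" | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
  if [ "$actual" != "$expected" ]; then
    echo "assertion failed: '$command' returned '$actual', expected '$expected'" >&2
    return 1
  fi
}

assert "printf 'a\nb\nc\n' | wc -l" "3" && echo "ok"   # prints ok
```

Returning a non-zero status instead of calling `exit` keeps the sketch usable in an interactive shell; under `set -e`, as in the scripts here, a failed assertion would still stop the run.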
1 change: 1 addition & 0 deletions itests/case-qc.sh
@@ -31,6 +31,7 @@ set -euxo pipefail

mixcr analyze --verbose 10x-vdj-tcr-qc-test \
--species hs \
--assemble-contigs-by-cells \
single_cell_vdj_t_subset_R1.fastq.gz \
single_cell_vdj_t_subset_R2.fastq.gz \
result
2 changes: 1 addition & 1 deletion itests/case-reset_whitelist.sh
@@ -26,4 +26,4 @@ mixcr analyze --verbose 10x-vdj-tcr \
single_cell_vdj_t_subset_R2.fastq.gz \
base_single_cell

assert "cat base_single_cell.assembleContigs.report.json | head -n 1 | jq -r .finalCloneCount" "6"
assert "cat base_single_cell.assembleContigs.report.json | head -n 1 | jq -r .finalCloneCount" "7"
6 changes: 4 additions & 2 deletions itests/case-single_cell_reproducible_hash.sh
@@ -21,6 +21,7 @@ cd result_1

mixcr analyze --verbose 10x-vdj-bcr \
--species hs \
--assemble-contigs-by VDJRegion \
../single_cell_vdj_t_subset_R1.fastq.gz \
../single_cell_vdj_t_subset_R2.fastq.gz \
result
@@ -33,6 +34,7 @@ cd result_2

mixcr analyze --verbose 10x-vdj-bcr \
--species hs \
--assemble-contigs-by VDJRegion \
../single_cell_vdj_t_subset_R1.fastq.gz \
../single_cell_vdj_t_subset_R2.fastq.gz \
result
@@ -44,8 +46,8 @@ if ! cmp result_1/result_report.yaml result_2/result_report.yaml; then
diff result_1/result_report.yaml result_2/result_report.yaml
fi

first_sha=$(shasum result_1/result.vdjca | awk '{print $1}')
assert "shasum result_2/result.vdjca | awk '{print \$1}'" "$first_sha"
first_sha=$(shasum result_1/result.alignments.vdjca | awk '{print $1}')
assert "shasum result_2/result.alignments.vdjca | awk '{print \$1}'" "$first_sha"

first_sha=$(shasum result_1/result.refined.vdjca | awk '{print $1}')
assert "shasum result_2/result.refined.vdjca | awk '{print \$1}'" "$first_sha"
4 changes: 2 additions & 2 deletions itests/case-single_cell_vdjcacontigs_reproducible_hash.sh
@@ -46,8 +46,8 @@ if ! cmp result_1/result_report.yaml result_2/result_report.yaml; then
diff result_1/result_report.yaml result_2/result_report.yaml
fi

first_sha=$(shasum result_1/result.vdjca | awk '{print $1}')
assert "shasum result_2/result.vdjca | awk '{print \$1}'" "$first_sha"
first_sha=$(shasum result_1/result.alignments.vdjca | awk '{print $1}')
assert "shasum result_2/result.alignments.vdjca | awk '{print \$1}'" "$first_sha"

first_sha=$(shasum result_1/result.refined.vdjca | awk '{print $1}')
assert "shasum result_2/result.refined.vdjca | awk '{print \$1}'" "$first_sha"
5 changes: 5 additions & 0 deletions itests/case-tag_validation.sh
@@ -51,6 +51,7 @@ mixcr analyze generic-lt-single-cell-amplicon \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
--assemble-clonotypes-by CDR3 \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
subset_B004-7_S247_L001_I1_001.fastq.gz \
@@ -65,6 +66,7 @@ mixcr analyze generic-lt-single-cell-amplicon \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
--assemble-clonotypes-by CDR3 \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
output 2>err
@@ -78,6 +80,7 @@ mixcr analyze generic-lt-single-cell-amplicon \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
--assemble-clonotypes-by CDR3 \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
output 2>err
@@ -91,6 +94,7 @@ mixcr analyze generic-amplicon \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
--assemble-clonotypes-by CDR3 \
subset_B004-7_S247_L001_R1_001.fastq.gz \
output 2>err

@@ -103,6 +107,7 @@ mixcr analyze generic-amplicon \
--rna \
--floating-left-alignment-boundary \
--floating-right-alignment-boundary C \
--assemble-clonotypes-by CDR3 \
subset_B004-7_S247_L001_R1_001.fastq.gz \
subset_B004-7_S247_L001_R2_001.fastq.gz \
output 2>err
2 changes: 2 additions & 0 deletions itests/case-usage_of_template.sh
@@ -3,6 +3,7 @@
set -euxo pipefail

mixcr analyze --verbose generic-tcr-amplicon \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
@@ -13,6 +14,7 @@ mixcr analyze --verbose generic-tcr-amplicon \
[[ -f use_of_templates_1.contigs.clns ]] || exit 1

mixcr align -p generic-tcr-amplicon \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
1 change: 1 addition & 0 deletions itests/case-use_arguments_from_file_on_export.sh
@@ -17,6 +17,7 @@ assert() {
set -euxo pipefail

mixcr analyze --verbose generic-tcr-amplicon \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
1 change: 1 addition & 0 deletions itests/case001.sh
@@ -3,6 +3,7 @@
set -euxo pipefail

mixcr align -p generic-amplicon --species hs \
--assemble-clonotypes-by CDR3 \
--dna \
-OsaveOriginalReads=true \
--floating-left-alignment-boundary \
2 changes: 2 additions & 0 deletions itests/case003.sh
@@ -3,6 +3,7 @@
set -euxo pipefail

mixcr analyze --verbose generic-amplicon --dry-run \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
@@ -11,6 +12,7 @@ mixcr analyze --verbose generic-amplicon --dry-run \
test_R1.fastq test_R2.fastq case3

mixcr analyze --verbose generic-amplicon \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
7 changes: 4 additions & 3 deletions itests/case004.sh
@@ -4,6 +4,7 @@ set -euxo pipefail

# Checking generic pipeline with relatively big input files
mixcr analyze --verbose generic-amplicon \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
@@ -12,9 +13,9 @@ mixcr analyze --verbose generic-amplicon \
CD4M1_test_R1.fastq.gz CD4M1_test_R2.fastq.gz case4

# Checking AIRR export on big files
mixcr exportAirr --imgt-gaps case4.vdjca case4.vdjca.imgt.airr.tsv
mixcr exportAirr --imgt-gaps --from-alignment case4.vdjca case4.vdjca.imgta.airr.tsv
mixcr exportAirr case4.vdjca case4.vdjca.airr.tsv
mixcr exportAirr --imgt-gaps case4.alignments.vdjca case4.vdjca.imgt.airr.tsv
mixcr exportAirr --imgt-gaps --from-alignment case4.alignments.vdjca case4.vdjca.imgta.airr.tsv
mixcr exportAirr case4.alignments.vdjca case4.vdjca.airr.tsv

mixcr exportAirr --imgt-gaps case4.clna case4.clna.imgt.airr.tsv
mixcr exportAirr --imgt-gaps --from-alignment case4.clna case4.clna.imgta.airr.tsv
1 change: 1 addition & 0 deletions itests/case005.sh
@@ -9,6 +9,7 @@ gzip -dc CD4M1_test_R2.fastq.gz CD4M1_test_R2.fastq.gz | tr 'N' 'A' > case5_R2.f
#mixcr analyze --verbose amplicon --assemble '-OcloneClusteringParameters=null' --impute-germline-on-export -s hs --starting-material rna --contig-assembly --5-end v-primers --3-end j-primers --adapters adapters-present case5_R1.fastq case5_R2.fastq case5

mixcr analyze --verbose generic-amplicon \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
1 change: 1 addition & 0 deletions itests/case006.sh
@@ -6,6 +6,7 @@ touch empty_R1.fastq
touch empty_R2.fastq

mixcr analyze --verbose generic-amplicon \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \
1 change: 1 addition & 0 deletions itests/case007.sh
@@ -8,6 +8,7 @@ gzip -dc CD4M1_test_R1.fastq.gz | head -n 1012 | tail -n 4 >>case7_R1.fastq
gzip -dc CD4M1_test_R2.fastq.gz | head -n 1012 | tail -n 4 >>case7_R2.fastq

mixcr analyze --verbose generic-amplicon \
--assemble-clonotypes-by CDR3 \
--species hs \
--rna \
--floating-left-alignment-boundary \