How to best handle "Invalid trim interval for read id ... Trimming will be skipped."? #1149

Open
claczny opened this issue Nov 23, 2024 · 1 comment
Labels: barcode, question, trim

claczny commented Nov 23, 2024

Issue Report

Please describe the issue:

When running dorado-0.8.2-linux-x64 on barcoded bacterial native DNA sequence data from a MinION run, I am getting [debug] messages like the following: Invalid trim interval for read id b1879b87-0723-4d45-95f9-6a9f7064580f: 34-14. Trimming will be skipped.
The basecalling of this run is still ongoing but I guess this affects only a minority of reads (ca. 300 reads thus far).

I would have two main questions:

    1. Could you please clarify what exactly the issue is and what it means "downstream"? Does it mean that no trimming happens at all, i.e., that adapters and barcodes (no primers in this case) would remain in the reads? Put differently, how should the resulting basecalled reads be further processed? I am running dorado demux afterwards, followed by hybracter, which includes Porechop_ABI. This will be a hybrid assembly, also involving Illumina data.
    2. Is this information also reported when the user has not added the -v option?

Steps to reproduce the issue:

Please list any steps to reproduce the issue.

Run environment:

  • Dorado version: 0.8.2
  • Dorado command: "basecaller" "sup" "--device" "cuda:all" "--kit-name" "SQK-NBD114-24" "-v" "/PATH/TO/pod5" "--modified-bases" "4mC_5mC"
  • Operating system:
  • Hardware (CPUs, Memory, GPUs): 4x NVIDIA V100 32GB (Tesla V100-SXM2-32GB)
  • Source data type (e.g., pod5 or fast5 - please note we always recommend converting to pod5 for optimal basecalling performance): pod5
  • Source data location (on device or networked drive - NFS, etc.):
  • Details about data (flow cell, kit, read lengths, number of reads, total dataset size in MB/GB/TB): FLO-MIN114, SQK-NBD114-24
  • Dataset to reproduce, if applicable (small subset of data to share as a pod5 to reproduce the issue):

Logs

  • Please provide output trace of dorado (run dorado with -v, or -vv on a small subset)
[2024-11-23 00:36:26.206] [info] Running: "basecaller" "sup" "--device" "cuda:all" "--kit-name" "SQK-NBD114-24" "-v" "/PATH/TO/pod5" "--modified-bases" "4mC_5mC"
[2024-11-23 00:37:49.575] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2024-11-23 00:37:49.614] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0 with httplib
[2024-11-23 00:37:49.634] [error] Failed to download dna_r10.4.1_e8.2_400bps_sup@v5.0.0: SSL server verification failed
[2024-11-23 00:37:49.634] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0 with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 27  178M   27 48.7M    0     0  88.7M      0  0:00:02 --:--:--  0:00:02 88.5M
100  178M  100  178M    0     0   121M      0  0:00:01  0:00:01 --:--:--  121M
[2024-11-23 00:48:28.879] [debug] - matching modification model found: dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v2
[2024-11-23 00:48:34.890] [warning] Unknown certs location for current distribution. If you hit download issues, use the envvar `SSL_CERT_FILE` to specify the location manually.
[2024-11-23 00:48:34.890] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v2 with httplib
[2024-11-23 00:48:34.908] [error] Failed to download dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v2: SSL server verification failed
[2024-11-23 00:48:34.908] [info]  - downloading dna_r10.4.1_e8.2_400bps_sup@v5.0.0_4mC_5mC@v2 with curl
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
 60 18.4M   60 11.1M    0     0  40.5M      0 --:--:-- --:--:-- --:--:-- 40.4M
100 18.4M  100 18.4M    0     0  56.1M      0 --:--:-- --:--:-- --:--:-- 55.9M
[2024-11-23 00:50:59.981] [info] > Creating basecall pipeline
[2024-11-23 00:50:59.981] [debug] CRFModelConfig { qscale:1.050000 qbias:1.300000 stride:6 bias:1 clamp:0 out_features:4096 state_len:5 outsize:4096 blank_score:0.000000 scale:1.000000 num_features:1 sample_rate:5000 sample_type:DNA mean_qscore_start_pos:60 SignalNormalisationParams { strategy:pa StandardisationScalingParams { standardise:1 mean:93.692398 stdev:23.506744}} BasecallerParams { chunk_size:12288 overlap:600 batch_size:0} convs: { 0: ConvParams { insize:1 size:64 winlen:5 stride:1 activation:swish} 1: ConvParams { insize:64 size:64 winlen:5 stride:1 activation:swish} 2: ConvParams { insize:64 size:128 winlen:9 stride:3 activation:swish} 3: ConvParams { insize:128 size:128 winlen:9 stride:2 activation:swish} 4: ConvParams { insize:128 size:512 winlen:5 stride:2 activation:swish}} model_type: tx { crf_encoder: CRFEncoderParams { insize:512 n_base:4 state_len:5 scale:5.000000 blank_score:2.000000 expand_blanks:1 permute:1} transformer: TxEncoderParams { d_model:512 nhead:8 depth:18 dim_feedforward:2048 deepnorm_alpha:2.449490}}}
[2024-11-23 00:51:10.417] [debug] TxEncoderStack: use_koi_tiled false.
[2024-11-23 00:51:10.460] [debug] TxEncoderStack: use_koi_tiled false.
[2024-11-23 00:51:10.462] [debug] TxEncoderStack: use_koi_tiled false.
[2024-11-23 00:51:10.464] [debug] TxEncoderStack: use_koi_tiled false.
[2024-11-23 00:51:12.667] [debug] cuda:0 memory available: 33.39GB
[2024-11-23 00:51:12.667] [debug] cuda:3 memory available: 33.39GB
[2024-11-23 00:51:12.667] [debug] cuda:1 memory available: 33.39GB
[2024-11-23 00:51:12.667] [debug] cuda:1 memory limit 31.62GB
[2024-11-23 00:51:12.667] [debug] cuda:3 memory limit 31.62GB
[2024-11-23 00:51:12.667] [debug] cuda:0 memory limit 31.62GB
[...]
[2024-11-23 00:51:30.668] [info] cuda:3 using chunk size 6144, batch size 640
[2024-11-23 00:51:30.669] [debug] cuda:3 Model memory 22.84GB
[2024-11-23 00:51:30.669] [debug] cuda:3 Decode memory 2.77GB
[2024-11-23 00:51:30.679] [info] cuda:1 using chunk size 6144, batch size 640
[2024-11-23 00:51:30.679] [debug] cuda:1 Model memory 22.84GB
[2024-11-23 00:51:30.679] [debug] cuda:1 Decode memory 2.77GB
[2024-11-23 00:51:30.690] [info] cuda:0 using chunk size 6144, batch size 640
[2024-11-23 00:51:30.690] [debug] cuda:0 Model memory 22.84GB
[2024-11-23 00:51:30.690] [debug] cuda:0 Decode memory 2.77GB
[2024-11-23 00:51:30.692] [info] cuda:2 using chunk size 6144, batch size 640
[2024-11-23 00:51:30.692] [debug] cuda:2 Model memory 22.84GB
[2024-11-23 00:51:30.692] [debug] cuda:2 Decode memory 2.77GB
[2024-11-23 00:51:33.626] [debug] BasecallerNode chunk size 12288
[2024-11-23 00:51:33.626] [debug] BasecallerNode chunk size 6144
[2024-11-23 00:51:33.675] [debug] Load reads from file /PATH/TO/pod5/FBA20348_6439aea7_a5bd1332_10.pod5
[2024-11-23 00:51:37.434] [debug] > Kits to evaluate: 1
[2024-11-23 01:00:02.783] [debug] Invalid trim interval for read id b1879b87-0723-4d45-95f9-6a9f7064580f: 34-14. Trimming will be skipped.
[2024-11-23 01:01:12.033] [debug] Invalid trim interval for read id b30f72d8-a84b-4c91-a213-7c960cf80a78: 33-13. Trimming will be skipped.
[2024-11-23 01:02:32.494] [debug] Invalid trim interval for read id e8c03500-5f44-4c2a-8cc4-8743413e4e94: 26-6. Trimming will be skipped.

Thank you very much!

Best wishes and stay safe,

Cedric

malton-ont (Collaborator) commented

Hi @claczny,

This issue occurs when the trim intervals for the adapter and the barcode overlap in such a way that the entire read would be removed. For example, adapter trimming determines that it should retain region 34-50, but barcode trimming determines that it should retain 8-14. As there is no sensible way to perform trimming in this instance, we skip it entirely.
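
To make the conflict concrete, here is a minimal Python sketch. It is an illustration only, not dorado's implementation: the function names are hypothetical, and it assumes the two "retain" regions are combined by taking the later start and the earlier end.

```python
from typing import Tuple

# (start, end) of the region a trimming step wants to keep, in bases.
Interval = Tuple[int, int]


def combine_keep_intervals(adapter_keep: Interval, barcode_keep: Interval) -> Interval:
    """Combine two 'retain' regions (assumed rule: latest start, earliest end)."""
    start = max(adapter_keep[0], barcode_keep[0])
    end = min(adapter_keep[1], barcode_keep[1])
    return (start, end)


def is_valid_interval(interval: Interval) -> bool:
    """An interval is usable only if it keeps at least one base."""
    start, end = interval
    return start < end


if __name__ == "__main__":
    # Values taken from the example above: adapter keeps 34-50, barcode keeps 8-14.
    combined = combine_keep_intervals((34, 50), (8, 14))
    print(combined, is_valid_interval(combined))
```

Under that assumption, the example above combines to 34-14, where the start lies past the end, so nothing would be left to keep. That has the same shape as the intervals reported in the debug messages.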

Reads that suffer from this issue tend to be very short and can usually be excluded based on read length. As they will be untrimmed, you could also align the adapter sequence against them and discard reads with a high alignment score.
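
If you would rather identify the affected reads directly, the debug messages can be parsed out of a saved log. The following is a minimal sketch, assuming the `-v` output was captured as text and each message sits on one line in the format shown in the report above. The names `SkippedTrim` and `find_skipped_trims` are hypothetical helpers, not part of dorado.

```python
import re
from typing import List, NamedTuple


class SkippedTrim(NamedTuple):
    read_id: str
    start: int
    end: int


# Matches the debug message format quoted in the report above.
_SKIPPED_TRIM = re.compile(
    r"Invalid trim interval for read id ([0-9a-fA-F-]+): (\d+)-(\d+)\. "
    r"Trimming will be skipped\."
)


def find_skipped_trims(log_text: str) -> List[SkippedTrim]:
    """Return one record per 'Invalid trim interval' message found in the log text."""
    return [
        SkippedTrim(m.group(1), int(m.group(2)), int(m.group(3)))
        for m in _SKIPPED_TRIM.finditer(log_text)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python find_skipped.py dorado.log  (the file name is only an example)
    with open(sys.argv[1], encoding="utf-8") as handle:
        for record in find_skipped_trims(handle.read()):
            print(record.read_id)
```

The resulting read IDs can then be used to exclude those reads, or to inspect their lengths before deciding on a length cutoff.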

If you want to try to retain these reads there are two other options available:

  1. Run dorado trim on your output to remove any remaining adapters. This will, however, leave any untrimmed barcodes in place.
  2. Rebasecall your data without barcoding but with --no-trim, and then demux and barcode in the second step:
dorado basecaller sup --device cuda:all --no-trim -v /PATH/TO/pod5 --modified-bases 4mC_5mC > calls.bam
dorado demux --kit-name SQK-NBD114-24 -v --output-dir /output/dir calls.bam

This skips explicit adapter trimming, so there is no conflict between the intervals, and adapters are removed anyway by the barcode trimming during demux. If you also want to remove adapters from any unclassified reads, use option 1 on the unclassified.bam file.

malton-ont added the question, barcode, and trim labels on Nov 25, 2024