DECIPHER: Alignment larger than the maximum allowable size #54

Open
Charmy0619 opened this issue Jan 13, 2025 · 4 comments
Labels
DSL2 Prioritize for DSL2 implementation

Comments

Charmy0619 commented Jan 13, 2025

Hi,

I am running PacBio data and am having trouble at the AlignReadsDECIPHER step.
Here is the error message:

Error executing process > 'AlignReadsDECIPHER (AlignReadsDECIPHER:R1)'

Caused by:
  Process `AlignReadsDECIPHER (AlignReadsDECIPHER:R1)` terminated with an error exit status (1)

Command executed [/home/qj5/ped_xtan25_chi_link/qj5/test16S_TADA/src/TADA/templates/AlignReadsDECIPHER.R]:

  #!/usr/bin/env Rscript
  .libPaths(c("/mmfs1/home/qj5/R/x86_64-pc-linux-gnu-library/4.4", .libPaths()))
  suppressPackageStartupMessages(library(dada2))
  suppressPackageStartupMessages(library(DECIPHER))
  
  seqs <- readDNAStringSet("asvs.md5.nochim.R1.fna")
  alignment <- AlignSeqs(seqs,
             anchor=NA,
             processors = 64)
  writeXStringSet(alignment, "aligned_seqs.R1.fasta")

Command exit status:
  1

Command output:
  Determining distance matrix based on shared 9-mers:
  ================================================================================
  
  Time difference of 2.04 secs
  
  Clustering into groups by similarity:
  ================================================================================
  
  Time difference of 0.37 secs
  
  Aligning Sequences:
  ================================================================================
  
  Time difference of 949.63 secs
  
  Iteration 1 of 2:
  
  Determining distance matrix based on alignment:
  ================================================================================
  
  Time difference of 1.3 secs
  
  Reclustering into groups by similarity:
  ================================================================================
  
  Time difference of 0.25 secs
  
  Realigning Sequences:
  ===============================================================================

Command error:
  Error in f(p.profile, s.profile) : 
    Alignment larger (6,205,073,628) than the maximum allowable size (2,147,483,647).
  Calls: AlignSeqs -> .align -> do.call -> do.call -> <Anonymous> -> f
  Execution halted

Work dir:
  /mmfs1/projects/ped_xtan25_chi/qj5/16S_pacbio_2024_dam/src/work/15/04113d39009483d31a03e09bfe1530

Tip: you can replicate the issue by changing to the process work dir and entering the command `bash .command.run`

I checked the "asvs.md5.nochim.R1.fna" file:

> seqs <- readDNAStringSet("asvs.md5.nochim.R1.fna")
> # Number of sequences
> num_sequences <- length(seqs)
> print(num_sequences)
[1] 814

> # Sequence lengths (this definition was not in the original paste; presumably width(seqs))
> sequence_lengths <- width(seqs)
> summary(sequence_lengths)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1000    1166    1374    1355    1486    1796
> # Calculate total base pairs (bp) in all sequences
> total_bp <- sum(width(seqs))
> print(total_bp)
[1] 1102620

I also tried increasing the memory to 240G, but it still fails.

Can anyone help?
Thanks.
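
A note on the numbers in the error message: the reported ceiling of 2,147,483,647 equals the largest 32-bit signed integer, which is also R's `.Machine$integer.max`. The sketch below only checks that arithmetic with the values from the log. It assumes nothing about how DECIPHER computes the alignment size internally. If the ceiling is an integer limit of this kind rather than a memory limit, that would be consistent with the extra memory not helping.

```r
# Values copied from the error message above.
reported_size  <- 6205073628   # held as a double; too large for an R integer
reported_limit <- 2147483647

# The limit matches the largest 32-bit signed integer, 2^31 - 1.
limit_is_int_max <- reported_limit == .Machine$integer.max
limit_is_2_31    <- reported_limit == 2^31 - 1

# How far over the limit the attempted alignment was.
overshoot <- reported_size / reported_limit

print(limit_is_int_max)
print(limit_is_2_31)
print(round(overshoot, 2))
```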

@Charmy0619 Charmy0619 changed the title Alignment larger than the maximum allowable size DECIPHER: Alignment larger than the maximum allowable size Jan 13, 2025
@cjfields
Contributor

@Charmy0619 this is an unusual one. Are you running this on UIUC resources or somewhere else?

@Charmy0619
Author

> @Charmy0619 this is an unusual one. Are you running this on UIUC resources or somewhere else?

Thank you for your comment. It is unusual, and it has not happened to me before. This time I did not run it on the UIUC Biocluster HPC; I set up the pipeline on the UIC lakeshore HPC.

I consulted Erik, the developer of DECIPHER. He mentioned, "This error occurs when the alignment dramatically expands in width during alignment. This typically indicates there are non-homologous sequences in the input, which should not be aligned." After that, I experimented with the iteration and refinement settings based on his suggestion. The pipeline runs to completion without iteration.
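
A minimal sketch of what running "without iteration" could look like, assuming the change is made in the `AlignSeqs` call from the template shown in the error report. `iterations` and `refinements` are documented `AlignSeqs` arguments; the wrapper name `align_without_iteration` and the choice to also set `refinements = 0` are illustrative assumptions, not necessarily the settings used here or anything TADA ships.

```r
# Hypothetical wrapper (not part of TADA): align with the iterative
# re-clustering / re-alignment passes turned off.
align_without_iteration <- function(fasta_in, fasta_out, processors = 1) {
  if (!requireNamespace("DECIPHER", quietly = TRUE)) {
    stop("The DECIPHER package is required for this sketch.")
  }
  seqs <- Biostrings::readDNAStringSet(fasta_in)
  alignment <- DECIPHER::AlignSeqs(seqs,
                                   anchor = NA,
                                   iterations = 0,   # skip the "Iteration 1 of 2" passes
                                   refinements = 0,  # assumption: also skip refinement
                                   processors = processors)
  Biostrings::writeXStringSet(alignment, fasta_out)
  invisible(alignment)
}

# Example call, using the file names from the failing process:
# align_without_iteration("asvs.md5.nochim.R1.fna",
#                         "aligned_seqs.R1.fasta",
#                         processors = 64)
```

Skipping iteration avoids the realignment stage where the error was raised in the log. Going by the explanation quoted above, though, the underlying cause is likely non-homologous input, so the resulting alignment may still be worth inspecting before it is used for a tree.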

I am wondering whether this will cause problems for the results. If I am correct, it should only affect the tree. As you know, I previously ran rumen fluid data without any problem. However, the current dataset is from mice and may be contaminated with mitochondrial sequences.

@cjfields
Contributor

@Charmy0619 one possibility is to skip the alignment + phylogenetic tree step, particularly if you are concerned there are contaminants present. In the main branch this can be done by setting runTree to either false or '' (empty string). In the DSL2 work on dev this will be much simpler, but that code isn't ready for use at this time.

That said, I normally haven't found mitochondrial or chloroplast 16S rRNA to be an issue, but other contaminants (off-target sequences, for example) can certainly be a problem.

@cjfields
Contributor

@Charmy0619 as a quick follow-up: we don't currently pre-screen sequences prior to DECIPHER, though this step has been proposed as a new feature (see #60 for tracking). It will take a little time to implement, but you could essentially emulate it by skipping alignment + tree while still allowing taxonomic assignment. Screen out any ASVs that have no assignment, then run either DECIPHER or another MSA tool (e.g., muscle5), then use fasttree to generate an ML-based tree. Happy to walk you through these steps, just email me.
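
The screening step described above can be sketched roughly as follows. This assumes a taxonomy matrix with one row per ASV and one column per rank, with `NA` where no assignment was made (the general shape of dada2's `assignTaxonomy` output), and row names that match the sequence names in the FASTA file. The helper name `keep_assigned`, the choice of `Phylum` as the screening rank, the toy data, and the output file name in the comments are all hypothetical.

```r
# Hypothetical helper: keep only the ASVs that have an assignment at `rank`.
# `seqs` can be any named object that supports `[` and names(), e.g. a
# DNAStringSet or a named character vector; `tax` is a matrix or data frame
# with ASV IDs as row names.
keep_assigned <- function(seqs, tax, rank = "Phylum") {
  stopifnot(rank %in% colnames(tax))
  assigned_ids <- rownames(tax)[!is.na(tax[, rank])]
  seqs[names(seqs) %in% assigned_ids]
}

# Toy data (made up for illustration only).
toy_seqs <- c(asv1 = "ACGTACGT", asv2 = "TTGACCAA", asv3 = "GGGTTTCC")
toy_tax <- matrix(c("Bacteria", "Bacteria", NA,
                    "Firmicutes", NA, NA),
                  nrow = 3,
                  dimnames = list(c("asv1", "asv2", "asv3"),
                                  c("Kingdom", "Phylum")))

screened <- keep_assigned(toy_seqs, toy_tax)
print(names(screened))

# With real data, the screened set could then be aligned and written out for
# tree building with an external tool such as FastTree. `tax` would come from
# the pipeline's taxonomy output (loading not shown):
# seqs <- Biostrings::readDNAStringSet("asvs.md5.nochim.R1.fna")
# aligned <- DECIPHER::AlignSeqs(keep_assigned(seqs, tax), anchor = NA)
# Biostrings::writeXStringSet(aligned, "aligned_seqs.screened.fasta")
```

Screening at a higher rank such as `Kingdom` would be more permissive; which rank is appropriate depends on the reference database and is a judgment call rather than something this thread settles.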

@cjfields cjfields added the DSL2 Prioritize for DSL2 implementation label Feb 4, 2025