Deal with accessions with non-existing files #139

Open
bmlab-sg opened this issue Mar 24, 2023 · 6 comments

@bmlab-sg

Description of feature

Hi,

In SRA, some run accessions have no associated files.
For example, BioProject PRJEB18755 has several runs that are complete ghosts (ERR2013571, ERR2013572, ERR2013573, ...), while others are fine.
When these ghost accessions are included in the input, the pipeline first retries:

[60/81e7b9] NOTE: Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)` failed -- Execution is retried (2)

and then terminates with errors:

Command error:
  [ERROR] There is no content for id ERR2013581. Maybe you lack the right permissions?

Of course, one option is to filter out these entries before feeding the list to the pipeline, but it would be great if these errors could be ignored.
Or is there already an option for this that I am missing?
Thanks for any info on this; being able to deal with it easily would be extremely helpful!

@bmlab-sg bmlab-sg added the enhancement Improvement for existing functionality label Mar 24, 2023
@Midnighter
Contributor

If you just want to ignore the errors, you can create a local Nextflow configuration:

process {
  withName: SRA_IDS_TO_RUNINFO {
    errorStrategy = 'ignore'
  }
}

@drpatelh drpatelh added this to the 1.10 milestone Apr 25, 2023
@drpatelh
Member

Did this solution work for you, @bmlab-sg? We could try to have the pipeline ignore these sorts of IDs, but we would need some way to detect them, via the metadata or otherwise.

@bmlab-sg
Author

@drpatelh - yes, that solution mostly solves this issue.
After looking at a few datasets, it seems that AvgSpotLen and/or Bases being >0 can be a good marker for filtering out these ghosts.
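
The filtering idea above can be sketched in Python. This is only an illustration, not something the pipeline does: it assumes the run metadata is available as CSV text with columns named `Run`, `AvgSpotLen` and `Bases` (column names vary between exports, so adjust them as needed), and the function name `split_ghost_runs` and the accessions in the usage note are made up.

```python
import csv
import io


def _is_positive(value):
    """Return True if value parses as a number greater than zero."""
    try:
        return float(value or 0) > 0
    except ValueError:
        return False


def split_ghost_runs(metadata_csv, run_col="Run", size_cols=("AvgSpotLen", "Bases")):
    """Split run accessions into (usable, ghosts).

    A run counts as usable if at least one of size_cols is > 0,
    following the "AvgSpotLen and/or Bases" heuristic. Column names
    are assumptions about the metadata export, not a fixed schema.
    """
    usable, ghosts = [], []
    for row in csv.DictReader(io.StringIO(metadata_csv)):
        has_data = any(_is_positive(row.get(col)) for col in size_cols)
        (usable if has_data else ghosts).append(row[run_col])
    return usable, ghosts
```

For instance, with the made-up table `"Run,AvgSpotLen,Bases\nRUN_OK,150,3000000\nRUN_GHOST,0,0\n"`, `RUN_OK` ends up in the usable list and `RUN_GHOST` in the ghost list. Only the usable accessions would then be written to the pipeline input file.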

@drpatelh
Member

Cool. Thanks for the update. We can check whether these metadata fields are exposed; if so, we can add conditional filtering to the pipeline so that it doesn't hard fail in these scenarios.

@drpatelh drpatelh assigned robsyme and drpatelh and unassigned robsyme May 5, 2023
@drpatelh
Member

drpatelh commented May 6, 2023

I can no longer reproduce this issue. This could be due to the recent changes to the ENA API that were addressed in #148.

I now get [ERROR] No matches found for database id ERR2013613! and no metadata can be retrieved via the API URL below, which means we can't explicitly filter by Bases or any other field:
https://www.ebi.ac.uk/ena/portal/api/filereport?accession=ERR2013613&result=read_run&fields=run_accession%2Cexperiment_accession
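
As a rough sketch of how such a check could be scripted outside the pipeline, the snippet below rebuilds the filereport URL above and inspects a response body for data rows. It is not the logic of `sra_ids_to_runinfo.py`. The helper names `filereport_url` and `rows_in_report` are made up, and it assumes the endpoint answers with tab-separated text whose first line is a header; how an unmatched accession is actually reported (empty body, header only, or an HTTP status) should be verified against the live API.

```python
import csv
import io
from urllib.parse import urlencode

ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession, fields=("run_accession", "experiment_accession")):
    """Build the filereport query URL for one accession."""
    query = urlencode(
        {"accession": accession, "result": "read_run", "fields": ",".join(fields)}
    )
    return f"{ENA_FILEREPORT}?{query}"


def rows_in_report(tsv_text):
    """Parse a TSV response body into a list of row dicts.

    Assumption: the first line is a header. An empty body or a
    header-only body both give an empty list, i.e. no metadata.
    """
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
```

The body can be fetched with any HTTP client (for example `urllib.request.urlopen(filereport_url("ERR2013613"))`); an accession for which `rows_in_report` returns an empty list could then be skipped before it reaches the pipeline.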

ERR2013613

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013613)` terminated with an error exit status (1)

Command executed:

  echo ERR2013613 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      ERR2013613.runinfo.tsv \
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id ERR2013613!
  Line: 'ERR2013613'

ERR2013581

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013581)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (ERR2013581)` terminated with an error exit status (1)

Command executed:

  echo ERR2013581 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      ERR2013581.runinfo.tsv \
  
  
  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id ERR2013581!
  Line: 'ERR2013581'

I will close this issue for now, but please feel free to re-open it if you encounter the problem again, and include the relevant IDs so we can reproduce and fix it.

@drpatelh drpatelh closed this as completed May 6, 2023
@rohitrrj

rohitrrj commented Jul 19, 2024

Hello @drpatelh,
I recently encountered this issue while working on PRJNA1079722. Multiple runs in this project (SRR29688921, SRR29688964, SRR29688955, SRR29688939, SRR29688945, SRR29688933) seem to cause this same error. However, these don't seem to be "ghosts" like the ones you found previously: each of these runs seems to host data for the associated sample. Below is the error for one of them:

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (SRR29688955)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO (SRR29688955)` terminated with an error exit status (1)

Command executed:

  echo SRR29688955 > id.txt
  sra_ids_to_runinfo.py \
      id.txt \
      SRR29688955.runinfo.tsv \

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:SRA_IDS_TO_RUNINFO":
      python: $(python --version | sed 's/Python //g')
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  [ERROR] No matches found for database id SRR29688955!
  Line: 'SRR29688955'

@rohitrrj rohitrrj reopened this Jul 19, 2024