diff --git a/README.md b/README.md
index 6fda8ab3..eb523352 100644
--- a/README.md
+++ b/README.md
@@ -54,18 +54,17 @@ This project aims to provide an easy way to show biodiversity within a geographi
 Species are created from samples representing the physical or genetic entity of the species.
 
-The samples can be inserted locally via excel (following the format of the ERGA manifest) or via form.ยก
+The samples can be inserted locally via excel (following the format of the ERGA manifest) or via form.
 
-This project offers additional services for sequencing projects:
+This project offers additional services for sequencing projects (under the Earth Biogenome scope):
 
-* cronjob to collect public information related to the project (genomic data)
+* cronjob to collect INSDC data related to the project (genomic data): assemblies, reads, sample metadata and taxonomy.
 * export of an excel file containing locally inserted samples to submit to COPO (https://copo-project.org/)
 
 IMPORTANT:
-This project uses the metadata of the ERGA manifesto (https://github.com/ERGA-consortium/COPO-manifest) and is mainly intended to retrieve data from BioSamples(https://www.ebi.ac.uk/biosamples) for samples metadata and ENA (https://www.ebi.ac.uk/ena/browser/home) for reads and assemblies. For specific project needs you can open an issue.
+This project uses the metadata of the ERGA manifest (https://github.com/ERGA-consortium/COPO-manifest) and is mainly intended to retrieve data from BioSamples (https://www.ebi.ac.uk/biosamples) for sample metadata and ENA (https://www.ebi.ac.uk/ena/browser/home) for reads and assemblies. For specific project needs you can open an issue.
 
-Use the `BLANK_README.md` to get started.

(back to top)

@@ -97,8 +96,8 @@ The .env file contains many parts that have to be configured depending on the ne
 RESTKEY=secretPassword #change this in production!! --> password that will be inserted to access the admin area
 
 The CRONJOB part:
-    Configure this part if you want to retrieve public data from ENA/BioSamples and/or NCBI (the sample metadata format must be compliant with the ENA checklist)
-    the cronjob will automatically check for read data in ENA if PROJECTS and/or PROJECT_ACCESSION are present
+    Configure this part if you want to retrieve public data from ENA/BioSamples and/or NCBI (the sample metadata format must be compliant with the ENA checklist).
+    The cronjob will automatically check for read data in ENA if PROJECTS and/or PROJECT_ACCESSION are present.
 PROJECTS= --> list of projects (comma separated) whose name appears in the sample metadata submitted to the ENA/BioSamples
 PROJECT_ACCESSION --> bioproject accession to retrieve data from NCBI
 EXEC_TIME=600 --> how often, in seconds, the job should be performed
@@ -106,9 +105,7 @@ The .env file contains many parts that have to be configured depending on the ne
 The DATA PORTAL part:
 This part has some default values that can be modified
-    RANKS=--> ordered, descending list of taxonomic ranks you want to display. Note that is a rank is not present in the species' lineage it will be skipped, for instance you may find phylum nodes that has as a children class nodes.
-    MAX_NODES=90 --> number of max leaves to display in the tree of life page (numbers greaters than 150 may affect performance and visualization)
-
+    ROOT_NODE=the INSDC bioproject accession which will be used as the root project of the application
 
 To add a custom logo and an icon follow these steps:
@@ -154,7 +151,7 @@ Here is a list of the APIs consumed:
 ## Sequencing Project
 
-For sequencing projects with the aim to sequence species within a geographical context, it is strongly recommended to submit public samples to the ENA via the [COPO web service](https://copo-project.org/), this service ensure that all the submitted samples share the same format before submission to ENA. It will, then, be responsibility of the single project to upload assemblies and reads to ENA/NCBI and associate them with the sample accession submitted through COPO.
+For sequencing projects, it is strongly recommended to submit public samples to the ENA via the [COPO web service](https://copo-project.org/); this service ensures that all the submitted samples share the same format before submission to ENA (INSDC). It is then the responsibility of each project to upload assemblies and reads to ENA/NCBI and associate them with the sample accessions submitted through COPO.
 To facilitate the sample submission to COPO this project provides the possibility to download the samples inserted locally in an excel compliant with the [ERGA submission manifest](https://github.com/ERGA-consortium/COPO-manifest). The generated excel can then be submitted to COPO. Once the samples are publicly available in BioSamples the data portal will link the accession to the sample unique name and will start checking for new assemblies and/or reads every time the cronjob is executed (the EXEC_TIME env variable). IMPORTANT: the ERGA manifest will change over time, this portal will try to keep it up to date.
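As a quick reference, the sketch below shows how these variables are consumed by the cronjob bootstrap (`handle_tasks` in `server/cronjobs/import_records.py`, further down in this diff). The values are examples only; PRJEB40665 is the DTOL bioproject that also appears in `BIOPROJECTS_MAPPER` later in this diff.

```python
# Example values only; the parsing mirrors import_records()/handle_tasks().
import os

os.environ.setdefault('PROJECTS', 'DTOL,VGP')            # comma-separated project names
os.environ.setdefault('PROJECT_ACCESSION', 'PRJEB40665')  # example root bioproject
os.environ.setdefault('EXEC_TIME', '600')                 # seconds between cronjob runs

projects = [p.strip() for p in os.getenv('PROJECTS').split(',') if p]
accession = os.getenv('PROJECT_ACCESSION')
interval = int(os.getenv('EXEC_TIME') or 172800)          # the code falls back to 48 hours

print(projects, accession, interval)
```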
@@ -179,7 +176,7 @@ The import function uses the BioSamples API to retrieve samples metadata via the
 
 - [ ] Add Changelog
 - [ ] Add JSON schema (OAS docs)
-- [ ] Add local names management
+- [X] Add local names management
 - [X] Add organism photo management
 - [X] Add sample accession import feature
 - [ ] Add custom fields management
diff --git a/client/package.json b/client/package.json
index 14f15d32..44e478d6 100644
--- a/client/package.json
+++ b/client/package.json
@@ -13,7 +13,6 @@
     "bootstrap-vue": "^2.21.2",
     "d3": "^7.0.0",
     "ol": "^6.10.0",
-    "swagger-ui": "^4.4.1",
     "vue": "^2.6.14",
     "vue-router": "^3.1.3",
     "vuex": "^3.6.2",
diff --git a/client/src/components/base/LastPublishedBanner.vue b/client/src/components/base/LastPublishedBanner.vue
index dc795e28..187cd5f2 100644
--- a/client/src/components/base/LastPublishedBanner.vue
+++ b/client/src/components/base/LastPublishedBanner.vue
@@ -1,7 +1,5 @@
+
\ No newline at end of file
diff --git a/client/src/components/modal/DataModal.vue b/client/src/components/modal/DataModal.vue
index 139a6d6e..3ffe4f0c 100644
--- a/client/src/components/modal/DataModal.vue
+++ b/client/src/components/modal/DataModal.vue
@@ -3,16 +3,18 @@
+
diff --git a/client/src/views/OrganismDetailsPage.vue b/client/src/views/OrganismDetailsPage.vue
index eea06133..17eeefa6 100644
--- a/client/src/views/OrganismDetailsPage.vue
+++ b/client/src/views/OrganismDetailsPage.vue
@@ -88,10 +88,11 @@ export default {
             return portalService.getCoordinatesBySampleIds({ids:records})
         })
         .then(response =>{
-            if(response){
-                this.$nextTick(()=>{
-                    this.geojson = {...response.data}
-                })
+            if(response.data && response.data.features.length){
+                this.geojson = {...response.data}
+            }
+            else{
+                this.geojson = null
             }
         })
         .catch(e => {
diff --git a/client/src/views/SampleDetailsPage.vue b/client/src/views/SampleDetailsPage.vue
index 55da9c9e..6d936e8e 100644
--- a/client/src/views/SampleDetailsPage.vue
+++ b/client/src/views/SampleDetailsPage.vue
@@ -75,13 +75,15 @@ export default {
             this.sample = response.data
             this.$store.commit('portal/setBreadCrumb', {value: {text: accession, to: {name: 'sample-details', params:{accession: accession}}}})
             this.$store.dispatch('portal/hideLoading')
-            console.log(this.sample)
             return portalService.getCoordinatesBySampleIds({ids: [this.sample._id]})
         })
         .then(response =>{
-            if(response){
+            if(response.data && response.data.features.length){
                 this.geojson = {...response.data}
             }
+            else{
+                this.geojson = null
+            }
         })
         .catch(e => {
diff --git a/server/app.py b/server/app.py
index 9e309376..eee7d757 100644
--- a/server/app.py
+++ b/server/app.py
@@ -1,15 +1,11 @@
 from flask import Flask
 from flask_cors import CORS
-from apscheduler.schedulers.background import BackgroundScheduler
 from config import BaseConfig
 from db import initialize_db
 from rest import initialize_api
-from datetime import datetime,timedelta
-from cronjobs.import_records import import_records
-import os
+from cronjobs.import_records import handle_tasks
 from flask_jwt_extended import JWTManager
-from flask_apscheduler import APScheduler
 
 app = Flask(__name__)
@@ -22,12 +18,9 @@
 jwt = JWTManager(app)
 
-TIME= os.getenv('EXEC_TIME')
-if os.getenv('PROJECTS') or os.getenv('PROJECT_ACCESSION'):
-    PROJECTS = os.getenv('PROJECTS').split(',')
-    sched = BackgroundScheduler(daemon=True)
-    sched.add_job(import_records, "interval", id="interval-job", start_date=datetime.now()+timedelta(seconds=20),seconds=int(TIME))
-    sched.start()
+
+handle_tasks()
+
 # # if __name__ == '__main__':
 #     app.run(debug=True,host='0.0.0.0')
\ No newline at end of file
diff --git a/server/cronjobs/import_from_NCBI.py b/server/cronjobs/import_from_NCBI.py
index fc5bbd7e..91512bf4 100644
--- a/server/cronjobs/import_from_NCBI.py
+++ b/server/cronjobs/import_from_NCBI.py
@@ -1,28 +1,66 @@
 import requests
 import time
 from utils import ena_client,utils
-from services import sample_service,organisms_service,geo_loc_service, bioproject_service
-from db.models import Assembly,SecondaryOrganism
-from mongoengine.queryset.visitor import Q
-from datetime import datetime, timedelta
+from services import organisms_service,geo_loc_service, bioproject_service,annotations_service
+from db.models import Assembly, Experiment,SecondaryOrganism
+from datetime import datetime
 
-
-SAMPLE_QUERY = (Q(last_check=None) | Q(last_check__lte=datetime.now()- timedelta(days=2)))
-
-##import from NCBI
-##retrieve assemblies with bioprojects
 def import_from_NCBI(project_accession):
     assemblies = get_assemblies(project_accession)
-    if assemblies:
-        existing_assemblies = Assembly.objects(accession__in=[assembly['assembly_accession'] for assembly in assemblies])
-        if existing_assemblies:
-            assemblies = [ass for ass in assemblies if ass['assembly_accession'] not in [ex['accession'] for ex in existing_assemblies]]
-        if not assemblies:
-            print('NO NEW ASSEMBLIES')
-            return
-        parse_data(assemblies, project_accession)
-    print('DONE')
+    existing_assembly_accessions=Assembly.objects.scalar('accession')
+    for ass in assemblies:
+        if ass['assembly_accession'] in existing_assembly_accessions:
+            continue
+        sample_accession=ass['biosample_accession']
+        ass_obj = Assembly(accession = ass['assembly_accession'],assembly_name= ass['display_name'], sample_accession= sample_accession).save()
+        organism = organisms_service.get_or_create_organism(str(ass['org']['tax_id']))
+        sample_obj = SecondaryOrganism.objects(accession=sample_accession).first()
+        ##parse sample
+        if not sample_obj:
+            required_metadata=dict(accession=sample_accession,taxid=organism.taxid,scientificName=organism.organism)
+            sample_obj = SecondaryOrganism(**handle_biosample(ass,required_metadata))
+        ##save coordinates
+        organism.assemblies.append(ass_obj)
+        sample_obj.assemblies.append(ass_obj)
+        #get reads
+        experiments = ena_client.get_reads(sample_obj.accession)
+        for exp in experiments:
+            if Experiment.objects(experiment_accession=exp['experiment_accession']).first():
+                continue
+            exp_obj = Experiment(**exp).save()
+            organism.experiments.append(exp_obj)
+            sample_obj.experiments.append(exp_obj)
+        sample_obj.last_check = datetime.utcnow()
+        #get bioproject lineage
+        bioproject_accessions = [bioproject.accession for bioproject in bioproject_service.create_bioprojects_from_NCBI(ass['bioproject_lineages']) if bioproject.accession != project_accession]
+        for b_acc in bioproject_accessions:
+            if not b_acc in organism.bioprojects:
+                organism.bioprojects.append(b_acc)
+            if not b_acc in sample_obj.bioprojects:
+                sample_obj.bioprojects.append(b_acc)
+        #get annotations
+        annotation = annotations_service.parse_annotation(organism,ass_obj)
+        if annotation:
+            organism.annotations.append(annotation)
+            print(organism.annotations)
+        sample_obj.save()
+        geo_loc_service.get_or_create_coordinates(sample_obj)
+        if not sample_obj.id in organism.insdc_samples:
+            organism.insdc_samples.append(sample_obj)
+        organism.save()
+    print('ASSEMBLIES FROM NCBI IMPORTED')
+
+def handle_biosample(assembly, required_metadata):
+    extra_metadata=dict()
+    if not 'biosample' in assembly.keys() or not 'attributes' in assembly['biosample'].keys():
+        #retrieve sample metadata from EBI/BioSamples
+        resp = ena_client.get_sample_from_biosamples(required_metadata['accession'])
+        extra_metadata = resp['_embedded']['samples'][0]['characteristics'] if '_embedded' in resp.keys() else dict()
+    else:
+        biosample_metadata = assembly['biosample']
+        for attr in biosample_metadata['attributes']:
+            extra_metadata[attr['name']] = [dict(text=attr['value'])]
+    return {**required_metadata, **utils.parse_sample_metadata(extra_metadata)}
 
 ##retrieve assemblies by bioproject in NCBI
 def get_assemblies(project_accession):
@@ -43,55 +81,3 @@ def get_assemblies(project_accession):
         assemblies.extend([ass['assembly'] for ass in result['assemblies']])
     return assemblies
 
-## get biosample accession from assemblies
-def parse_data(assemblies, project_accession):
-    samples_not_found=set()
-    for assembly in assemblies:
-        sample_accession=assembly['biosample_accession']
-        organism = organisms_service.get_or_create_organism(str(assembly['org']['tax_id']))
-        sample_obj = SecondaryOrganism.objects(accession=sample_accession).first()
-        if not sample_obj:
-            sample_obj = SecondaryOrganism(accession=sample_accession,taxid=organism.taxid,scientificName=organism.organism).save()
-            organism.insdc_samples.append(sample_obj)
-        if not 'biosample' in assembly.keys() or not 'attributes' in assembly['biosample'].keys():
-            #retrieve sample metadata from EBI/BioSamples
-            create_sample_from_biosamples(sample_obj, samples_not_found)
-        else:
-            biosample = assembly['biosample']
-            sample_metadata=dict()
-            for attr in biosample['attributes']:
-                sample_metadata[attr['name']] = [dict(text=attr['value'])]
-            metadata = utils.parse_sample_metadata(sample_metadata)
-            sample_obj.modify(**metadata)
-            geo_loc_service.get_or_create_coordinates(sample_obj)
-        ass_obj = Assembly.objects(accession = assembly['assembly_accession']).upsert_one(accession = assembly['assembly_accession'],assembly_name= assembly['display_name'], sample_accession= sample_obj.accession)
-        if not organism.assemblies or not ass_obj.id in [ass.id for ass in organism.assemblies]:
-            organism.assemblies.append(ass_obj)
-            sample_obj.modify(push__assemblies=ass_obj)
-        sample_service.get_reads([sample_obj])
-        bioproject_accessions = [bioproject.accession for bioproject in bioproject_service.create_bioprojects(assembly['bioproject_lineages']) if bioproject.accession != project_accession]
-        for b_acc in bioproject_accessions:
-            if not b_acc in organism.bioprojects:
-                organism.bioprojects.append(b_acc)
-            if not b_acc in sample_obj.bioprojects:
-                sample_obj.bioprojects.append(b_acc)
-        #save triggers status tracking
-        sample_obj.save()
-        organism.save()
-    if len(list(samples_not_found))>0:
-        print('SAMPLES NOT FOUND IN BIOSAMPLES: ', samples_not_found)
-    print('NCBI DATA IMPORTED')
-    # get_reads(samples_accessions)
-
-
-def create_sample_from_biosamples(sample_obj, samples_not_found):
-    resp = ena_client.get_sample_from_biosamples(sample_obj.accession)
-    if '_embedded' in resp.keys():
-        metadata = utils.parse_sample_metadata(resp['_embedded']['samples'][0]['characteristics'])
-        sample_obj.modify(**metadata)
-        geo_loc_service.get_or_create_coordinates(sample_obj)
-    else:
-        print('SAMPLE NOT FOUND')
-        samples_not_found.add(sample_obj.accession)
-        print(sample_obj.accession)
-
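For reference, `handle_biosample` above normalizes the NCBI-embedded attributes to the same shape BioSamples returns under `['_embedded']['samples'][0]['characteristics']`, so `utils.parse_sample_metadata` can consume either source. A minimal sketch with invented attribute values:

```python
# Illustrative only: the normalization handle_biosample() applies to the
# NCBI-embedded biosample attributes (values here are invented).
ncbi_attributes = [
    {'name': 'organism part', 'value': 'MUSCLE'},
    {'name': 'collection date', 'value': '2021-05-01'},
]
extra_metadata = {attr['name']: [dict(text=attr['value'])] for attr in ncbi_attributes}
# -> {'organism part': [{'text': 'MUSCLE'}], 'collection date': [{'text': '2021-05-01'}]}
# i.e. the same list-of-{'text': ...} shape the BioSamples API returns.
print(extra_metadata)
```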
diff --git a/server/cronjobs/import_from_biosample.py b/server/cronjobs/import_from_biosample.py
index c36cf6c2..77370424 100644
--- a/server/cronjobs/import_from_biosample.py
+++ b/server/cronjobs/import_from_biosample.py
@@ -1,69 +1,59 @@
+from db.models import SecondaryOrganism,Experiment
+from utils import ena_client,utils,constants
+from services import organisms_service,geo_loc_service
 from datetime import datetime
-from db.models import SecondaryOrganism,Assembly
-from utils import ena_client,utils
-from services import sample_service,organisms_service,geo_loc_service
-from mongoengine.queryset.visitor import Q
 
 def import_from_EBI_biosamples(PROJECTS):
     print('STARTING IMPORT BIOSAMPLES JOB')
-    samples = collect_samples(PROJECTS)
-    if len(samples) == 0:
-        print('NO SAMPLES FOUND')
-        return
-    samples_accessions=[sample['accession'] for sample in samples]
-    existing_samples = SecondaryOrganism.objects(accession__in=samples_accessions)
-    if existing_samples.count() > 0:
-        samples = [sample for sample in samples if sample['accession'] not in [ex_sam['accession'] for ex_sam in existing_samples]]
-    print('NEW SAMPLES: ',len(samples))
-    if len(samples) > 0:
-        samples_accessions=[sample['accession'] for sample in samples]
-        for sample in samples:
+    sample_dict = collect_samples(PROJECTS) ##return dict with project names as keys
+    existing_samples = SecondaryOrganism.objects.scalar('accession')
+    for project in sample_dict.keys():
+        for sample in sample_dict[project]:
+            if sample['accession'] in existing_samples:
+                continue
             taxid = str(sample['taxId'])
-            metadata = utils.parse_sample_metadata(sample['characteristics'])
-
-            organism = organisms_service.get_or_create_organism(taxid) ##add common names
+            characteristics = utils.parse_sample_metadata(sample['characteristics'])
+            organism = organisms_service.get_or_create_organism(taxid)
             if not organism:
-                #TODO CALL NCBI
+                print('TAXID NOT FOUND:',taxid)
+                print('SKIPPING SAMPLE CREATION')
                 continue
-            if not 'scientificName' in metadata.keys():
-                metadata['scientificName'] = organism.organism
-            metadata['taxid'] = taxid
-            metadata['accession'] = sample['accession']
-            sample_obj = SecondaryOrganism(**metadata).save()
+            characteristics['scientificName'] = organism.organism #overwrite or create scientificName
+            required_attr=dict(accession=sample['accession'],taxid=taxid)
+            sample_obj = SecondaryOrganism(**required_attr,**characteristics)
+            ##link with bioproject
+            if project in constants.BIOPROJECTS_MAPPER.keys():
+                project_accession = constants.BIOPROJECTS_MAPPER[project]
+                sample_obj.bioprojects.append(constants.BIOPROJECTS_MAPPER[project])
+                if not project_accession in organism.bioprojects:
+                    organism.bioprojects.append(project_accession)
+            ##get experiments
+            experiments = ena_client.get_reads(sample_obj.accession)
+            for exp in experiments:
+                if Experiment.objects(experiment_accession=exp['experiment_accession']).first():
+                    continue ##ena sometimes returns duplicates
+                exp_obj = Experiment(**exp).save()
+                organism.experiments.append(exp_obj)
+                sample_obj.experiments.append(exp_obj)
+            sample_obj.last_check = datetime.utcnow()
+            ## we rely on the NCBI job to retrieve assemblies
+            sample_obj.save()
             geo_loc_service.get_or_create_coordinates(sample_obj)
             if not sample_obj.sample_derived_from:
                 organism.insdc_samples.append(sample_obj)
-                organism.save()
-            assemblies = ena_client.parse_assemblies(sample_obj.accession)
-            if len(assemblies) > 0:
-                print('ASSEMBLY PRESENT')
-                existing_assemblies=Assembly.objects(accession__in=[ass['accession'] for ass in assemblies])
-                if len(existing_assemblies) > 0:
-                    assemblies=[ass for ass in assemblies if ass['accession'] not in [ex_as['accession'] for ex_as in existing_assemblies]]
-                if len(assemblies) > 0:
-                    for ass in assemblies:
-                        if not 'sample_accession' in ass.keys():
-                            ass['sample_accession'] = sample_obj.accession
-                    assemblies = Assembly.objects.insert([Assembly(**ass) for ass in assemblies])
-                    organism.assemblies.extend(assemblies)
-                    organism.save()
-                    sample_obj.assemblies.extend(assemblies)
-            sample_obj.last_checked=datetime.utcnow()
-            sample_obj.save()
-            print('GETTING READS')
-            sample_service.get_reads([sample_obj])
+            organism.save()
     print('APPENDING SPECIMENS')
     ##append specimens as a backup if biosamples api fails
     append_specimens()
     print('DATA FROM ENA/BIOSAMPLES IMPORTED')
 
 def collect_samples(PROJECTS):
-    samples = list()
+    samples = dict()
     for project in PROJECTS:
         biosamples = ena_client.get_biosamples(project)
         print('length ebi biosamples', len(biosamples))
         if biosamples:
-            samples.extend(biosamples)
+            samples[project] = biosamples
     return samples
 
 def append_specimens():
@@ -78,7 +68,6 @@ def append_specimens():
 
 
 
-
 
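A hedged sketch of the structure `collect_samples` now returns (the accession and values are made up); the project-name key is what links samples and organisms to a bioproject via `constants.BIOPROJECTS_MAPPER`, defined later in this diff:

```python
# Illustrative only: collect_samples() groups BioSamples records by the
# project names listed in the PROJECTS env variable (values are invented).
sample_dict = {
    'DTOL': [
        {
            'accession': 'SAMEA00000000',  # hypothetical BioSamples accession
            'taxId': 9606,
            'characteristics': {'organism part': [{'text': 'MUSCLE'}]},
        },
    ],
}
# 'DTOL' then maps to bioproject PRJEB40665 via constants.BIOPROJECTS_MAPPER.
print(list(sample_dict.keys()))
```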
diff --git a/server/cronjobs/import_records.py b/server/cronjobs/import_records.py
index 57a8542f..f662b5fd 100644
--- a/server/cronjobs/import_records.py
+++ b/server/cronjobs/import_records.py
@@ -1,6 +1,10 @@
 ##import data job
+from apscheduler.schedulers.background import BackgroundScheduler
+
+from services.organisms_service import get_or_create_organism
 from .import_from_NCBI import import_from_NCBI
 from .import_from_biosample import import_from_EBI_biosamples
+from services.bioproject_service import create_bioproject_from_ENA
 from db.models import SecondaryOrganism,Experiment,Organism
 from mongoengine.queryset.visitor import Q
 from datetime import datetime, timedelta
@@ -8,14 +12,14 @@
 import os
 
-SAMPLE_QUERY = Q(accession__ne=None) & (Q(last_check=None) | Q(last_check__lte=datetime.now()- timedelta(days=5)))
+SAMPLE_QUERY = Q(accession__ne=None) & (Q(last_check=None) | Q(last_check__lte=datetime.now()- timedelta(days=2)))
 
 def import_records():
     PROJECTS = [p.strip() for p in os.getenv('PROJECTS').split(',') if p]
     ACCESSION = os.getenv('PROJECT_ACCESSION')
     if ACCESSION:
         import_from_NCBI(ACCESSION)
-    if len(PROJECTS)>0:
+    if PROJECTS:
         import_from_EBI_biosamples(PROJECTS)
     update_samples()
 
@@ -27,25 +31,27 @@ def update_samples():
         return
     print('SAMPLES TO UPDATE: ',len(samples))
     for sample in samples:
-        accession = sample.accession
-        experiments = ena_client.get_reads(accession)
+        experiments = ena_client.get_reads(sample.accession)
         if not experiments:
             sample.modify(last_check=datetime.utcnow())
             continue
-        unique_exps=list({v['experiment_accession']:v for v in experiments}.values()) #avoid duplicate records bug in ENA (when ranges are assigned to a biosample)
-        if sample.experiments:
-            existing_exps = Experiment.objects(experiment_accession__in=[exp['experiment_accession'] for exp in unique_exps])
-            new_exps = [Experiment(**exp) for exp in unique_exps if exp['experiment_accession'] not in [exp['experiment_accession'] for exp in existing_exps]]
-        else:
-            new_exps = [Experiment(**exp) for exp in unique_exps]
-        if not new_exps:
-            sample.modify(last_check=datetime.utcnow())
-            continue
-        Experiment.objects.insert(new_exps, load_bulk=False)
-        sample = SecondaryOrganism.objects(accession=accession).first()
-        sample.modify(push_all__experiments=new_exps, last_check=datetime.utcnow())
-        org = Organism.objects(taxid=sample.taxid).first()
-        org.experiments.extend(new_exps)
-        #trigger status update
-        org.save()
-
\ No newline at end of file
+        organism = get_or_create_organism(sample.taxid)
+        existing_experiments = Experiment.objects.scalar('experiment_accession')
+        for exp in experiments:
+            if exp['experiment_accession'] in existing_experiments:
+                continue
+            exp_obj = Experiment(**exp).save()
+            organism.experiments.append(exp_obj)
+            sample.experiments.append(exp_obj)
+        sample.last_check = datetime.utcnow()
+        organism.save()
+        sample.save()
+
+def handle_tasks():
+    PROJECT_ACCESSION=os.getenv('PROJECT_ACCESSION')
+    if PROJECT_ACCESSION:
+        create_bioproject_from_ENA(PROJECT_ACCESSION)
+    TIME = os.getenv('EXEC_TIME') if os.getenv('EXEC_TIME') else 172800 ##48 hours by default
+    sched = BackgroundScheduler(daemon=True)
+    sched.add_job(import_records, "interval", id="interval-job", start_date=datetime.now()+timedelta(seconds=20), seconds=int(TIME))
+    sched.start()
\ No newline at end of file
diff --git a/server/db/models.py b/server/db/models.py
index d7c2e699..af92821f 100644
--- a/server/db/models.py
+++ b/server/db/models.py
@@ -9,7 +9,7 @@ class TrackStatus(Enum):
     SAMPLE = 'Biosample Submitted'
     READS = 'Reads Submitted'
     ASSEMBLIES = 'Assemblies Submitted'
-    ANN_SUBMITTED = 'Annotation Submitted'
+    ANN_SUBMITTED = 'Annotations Created'
 
 def handler(event):
     """Signal decorator to allow use of callback functions as class decorators."""
@@ -194,13 +194,15 @@ class SecondaryOrganism(db.Document):
     custom_fields = db.DictField()
     meta = {
         'indexes': [
-            {'fields':('accession','tube_or_well_id'), 'unique':True}
+            {'fields':('accession','tube_or_well_id'), 'unique':True},
         ]
     }
 
 @handler(db.pre_save)
 def update_modified(sender, document):
-    if document.assemblies:
+    if document.annotations:
+        document.trackingSystem= TrackStatus.ANN_SUBMITTED
+    elif document.assemblies:
         document.trackingSystem= TrackStatus.ASSEMBLIES
     elif document.experiments:
         document.trackingSystem= TrackStatus.READS
@@ -218,6 +220,24 @@ class Geometry(db.EmbeddedDocument):
             'coordinates',
         ]
     }
+
+class Annotation(db.Document):
+    name = db.StringField(required=True,unique=True)
+    gffGzLocation = db.StringField(required=True,unique=True)
+    pageURL=db.StringField()
+    annotationSource=db.StringField(default='https://github.com/FerriolCalvet/geneidBLASTx-nf')
+    tabIndexLocation = db.StringField()
+    targetGenome = db.StringField(required=True)
+    assemblyAccession=db.StringField()
+    lengthTreshold = db.StringField()
+    evidenceSource = db.StringField()
+    created = db.StringField()
+    meta = {
+        'indexes': [
+            'name'
+        ]
+    }
+
 # ##TODO Migrate samples geo attributes to this model -> test
 class GeoCoordinates(db.Document):
     geo_loc = db.StringField(unique=True,required=True)
@@ -237,6 +257,7 @@ class Organism(db.Document):
     tolid_prefix = db.StringField()
     bioprojects = db.ListField(db.StringField())
     common_name = db.ListField(db.StringField())
+    annotations = db.ListField(db.LazyReferenceField(Annotation))
    insdc_common_name = db.StringField()
     local_samples = db.ListField(db.LazyReferenceField(SecondaryOrganism))
     insdc_samples = db.ListField(db.LazyReferenceField(SecondaryOrganism))
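The reordered `update_modified` handler above derives a document's `trackingSystem` from the richest data attached to it, now ranking annotations above assemblies. A pure-Python sketch of that escalation (the status strings mirror the `TrackStatus` enum; the accession in the usage line is hypothetical):

```python
# Pure-Python sketch of the escalation implemented by update_modified above.
def tracking_status(annotations, assemblies, experiments):
    if annotations:
        return 'Annotations Created'
    if assemblies:
        return 'Assemblies Submitted'
    if experiments:
        return 'Reads Submitted'
    return 'Biosample Submitted'

# Hypothetical assembly accession: assemblies present, no annotations yet.
print(tracking_status([], ['GCA_000000000.1'], []))  # -> 'Assemblies Submitted'
```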
diff --git a/server/rest/data_input_api.py b/server/rest/data_input_api.py
index dd56af44..a3744e64 100644
--- a/server/rest/data_input_api.py
+++ b/server/rest/data_input_api.py
@@ -4,7 +4,7 @@
 from datetime import timedelta
 import os
 import json
-from db.models import GeoCoordinates, TaxonNode, SecondaryOrganism, Organism, Assembly, Experiment,BioProject
+from db.models import GeoCoordinates,Annotation, TaxonNode, SecondaryOrganism, Organism, Assembly, Experiment,BioProject
 
 class Login(Resource):
     def post(self):
@@ -22,6 +22,7 @@ def post(self):
 
     @jwt_required()
     def delete(self):
+        Annotation.drop_collection()
         TaxonNode.drop_collection()
         GeoCoordinates.drop_collection()
         BioProject.drop_collection()
diff --git a/server/rest/samples_api.py b/server/rest/samples_api.py
index adbf6352..60b627a5 100644
--- a/server/rest/samples_api.py
+++ b/server/rest/samples_api.py
@@ -12,6 +12,8 @@
 from mongoengine.queryset.visitor import Q
 from utils.pipelines import SamplePipeline,SamplePipelinePrivate
 import json
+from flask import current_app as app
+
 
 #CRUD operations on sample
 class SamplesApi(Resource):
@@ -24,6 +26,7 @@ def get(self,accession):
         else:
             result = sample.aggregate(*SamplePipeline).next()
             result['_id'] = str(result['_id'])
+            app.logger.info(result)
             return Response(json.dumps(result),mimetype="application/json", status=200)
         raise NotFound
 
@@ -98,7 +101,7 @@ def post(self):
             organism.assemblies.extend(assemblies)
             organism.save()
             sample.assemblies.extend(assemblies)
-            sample.last_checked=datetime.utcnow()
+            sample.last_check=datetime.utcnow()
             sample.save()
             return Response(json.dumps(f'sample with id {id} has been saved'),mimetype="application/json", status=201)
         else:
diff --git a/server/services/annotations_service.py b/server/services/annotations_service.py
new file mode 100644
index 00000000..272f8d2c
--- /dev/null
+++ b/server/services/annotations_service.py
@@ -0,0 +1,15 @@
+from utils.utils import get_annotations
+from db.models import Annotation,Assembly
+
+GENOME_BROWSER_URL='https://genome.crg.cat/geneid-predictions/#/organisms/'
+
+def parse_annotation(organism_obj, ass_obj):
+    response = get_annotations(organism_obj.organism)
+    if not response or not 'annotations' in response.keys():
+        return
+    for ann in response['annotations']:
+        if ass_obj.assembly_name == ann['targetGenome'] and not Annotation.objects(name=ann['name']).first():
+            page_url=GENOME_BROWSER_URL+organism_obj.organism
+            annotation = Annotation(pageURL=page_url, assemblyAccession=ass_obj.accession,**ann).save()
+            return annotation
\ No newline at end of file
diff --git a/server/services/bioproject_service.py b/server/services/bioproject_service.py
index a9ecf158..9cb9c707 100644
--- a/server/services/bioproject_service.py
+++ b/server/services/bioproject_service.py
@@ -1,6 +1,10 @@
 from db.models import BioProject
+from utils.ena_client import get_bioproject
+import os
 
-def create_bioprojects(bioprojects):
+ROOT_PROJECT = os.getenv('PROJECT_ACCESSION')
+
+def create_bioprojects_from_NCBI(bioprojects):
     ##first save all bioprojects
     saved_bioprojects=list()
     for projects_container in bioprojects:
@@ -18,4 +22,16 @@ def create_bioprojects(bioprojects):
             parent_project = BioProject.objects(accession=p_acc).first()
             project_obj.modify(add_to_set__parents=parent_project)
             # project_obj.parents.append(BioProject.objects(accession = p_acc).first())
-    return saved_bioprojects
\ No newline at end of file
+    return saved_bioprojects
+
+def create_bioproject_from_ENA(project_accession):
+    if BioProject.objects(accession=project_accession).first():
+        return
+    resp = get_bioproject(project_accession)
+    for r in resp:
+        if 'study_accession' in r.keys() and r['study_accession'] == project_accession:
+            bioproject_obj = BioProject(accession=project_accession, title=r['description']).save()
+            if bioproject_obj.accession != ROOT_PROJECT:
+                root_proj = BioProject.objects(accession=ROOT_PROJECT).first()
+                bioproject_obj.parents.append(root_proj)
+                bioproject_obj.save()
\ No newline at end of file
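`parse_annotation` above matches each imported assembly against the CRG geneid-predictions API. A hedged sketch of the payload shape it expects (the field names are the ones the code and the `Annotation` model reference; all values here are invented):

```python
# Illustrative payload shape consumed by parse_annotation(); values are invented.
response = {
    'annotations': [
        {
            'name': 'Homo_sapiens.ASM123v1.geneid',        # hypothetical annotation name
            'targetGenome': 'ASM123v1',                    # matched against Assembly.assembly_name
            'gffGzLocation': 'https://example.org/ann.gff3.gz',
            'lengthTreshold': '50',                        # hypothetical value
            'evidenceSource': 'UniProt',                   # hypothetical value
        },
    ],
}
# When targetGenome matches and the name is new, the dict is unpacked into
# Annotation(pageURL=..., assemblyAccession=..., **ann) and saved.
print(response['annotations'][0]['name'])
```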
--- a/server/services/data_service.py
+++ b/server/services/data_service.py
@@ -1,14 +1,15 @@
-from db.models import Assembly, Experiment, SecondaryOrganism
+from db.models import Assembly, Experiment, SecondaryOrganism,Annotation
 
 DB_MODEL_MAPPER={
     'assemblies': Assembly,
     'experiments':Experiment,
     'local_samples':SecondaryOrganism,
-    'insdc_samples':SecondaryOrganism
+    'insdc_samples':SecondaryOrganism,
+    'annotations':Annotation
 }
 
 def get_data(model, ids):
-    return DB_MODEL_MAPPER[model].objects(id__in=ids).to_json()
+    return DB_MODEL_MAPPER[model].objects(id__in=ids).exclude('id').to_json()
 
 def get_last_created(model):
     return DB_MODEL_MAPPER[model].objects.order_by('-id').first().to_json()
diff --git a/server/services/geo_loc_service.py b/server/services/geo_loc_service.py
index 4930948c..18a0844b 100644
--- a/server/services/geo_loc_service.py
+++ b/server/services/geo_loc_service.py
@@ -25,7 +25,7 @@ def geoloc_samples(bioproject=None):
     if not bioproject or bioproject == PROJECT_ACCESSION:
         geo_objs = list(GeoCoordinates.objects.aggregate(*GeoCoordinatesPipeline))
     else:
-        sample_ids = [sample.id for sample in SecondaryOrganism.objects(bioprojects=bioproject)]
+        sample_ids = SecondaryOrganism.objects(bioprojects=bioproject,sample_derived_from=None).scalar('id')
         geo_objs = list(GeoCoordinates.objects(biosamples__in=sample_ids).aggregate(*GeoCoordinatesPipeline))
     if geo_objs:
         FEATURE_COLLECTION_OBJECT['features'] = geo_objs
diff --git a/server/services/sample_service.py b/server/services/sample_service.py
index 88f02bc0..3e812621 100644
--- a/server/services/sample_service.py
+++ b/server/services/sample_service.py
@@ -58,6 +58,8 @@ def delete_samples(ids):
     samples_to_delete.delete()
     return {'success':'samples: '+ ','.join(ids) + ' deleted'}
 
+#TODO handle save outside method
+##this should only return reads
 def get_reads(samples):
     for sample in samples:
         accession = sample.accession
diff --git a/server/services/search_service.py b/server/services/search_service.py
index 75575fcf..d7b105f0 100644
--- a/server/services/search_service.py
+++ b/server/services/search_service.py
@@ -12,7 +12,7 @@ def query_search(offset=0, limit=20,
                 sortOrder=None, sortColumn=None,
                 taxName=ROOT_NODE, insdc_samples='false',
                 local_samples='false', assemblies='false',
-                experiments='false', filter=None, option=None, onlySelectedData='false', bioproject=PROJECT_ACCESSION):
+                experiments='false', annotations='false',filter=None, option=None, onlySelectedData='false', bioproject=PROJECT_ACCESSION):
     query=dict()
     json_resp=dict()
     filter_query = get_query_filter(filter, option) if filter else None
@@ -20,7 +20,7 @@ def query_search(offset=0, limit=20,
     query['taxon_lineage'] = tax_node if tax_node else TaxonNode.objects(name=ROOT_NODE).first()
     if bioproject and not bioproject==PROJECT_ACCESSION:
         query['bioprojects'] = bioproject
-    insdc_dict = dict(insdc_samples=insdc_samples,local_samples=local_samples,assemblies=assemblies,experiments=experiments)
+    insdc_dict = dict(insdc_samples=insdc_samples,local_samples=local_samples,assemblies=assemblies,experiments=experiments,annotations=annotations)
     get_insdc_query(insdc_dict,query,onlySelectedData)
     organisms = Organism.objects(filter_query, **query).exclude(*FIELDS_TO_EXCLUDE) if filter_query else Organism.objects.filter(**query).exclude(*FIELDS_TO_EXCLUDE)
     if sortColumn:
@@ -31,9 +31,6 @@ def query_search(offset=0, limit=20,
     return json.dumps(json_resp)
 
 def get_insdc_query(insdc_dict, query, only_selected_data):
-    values = insdc_dict.values()
-    if all(value=='false' for value in values):
-        return
     if only_selected_data == 'false':
         for key in insdc_dict.keys():
             if insdc_dict[key] == 'true':
diff --git a/server/utils/constants.py b/server/utils/constants.py
index ed7bd570..f942de03 100644
--- a/server/utils/constants.py
+++ b/server/utils/constants.py
@@ -1,4 +1,12 @@
+#TODO handle mapping between biosamples attribute project name and bioproject accession
+BIOPROJECTS_MAPPER={
+    'VGP':'PRJNA489243',
+    'DTOL':'PRJEB40665'
+}
+
+
+
 CHECKLIST_FIELD_GROUPS = [
     {'fields': [
         {'label': 'organism part','model':'organism_part', 'description': "The part of organism's anatomy or substance arising from an organism from which the biomaterial was derived, excludes cells.", 'type': 'text_choice_field', 'mandatory': 'mandatory', 'multiplicity': 'single', 'options': ['WHOLE_ORGANISM', 'HEAD', 'THORAX', 'ABDOMEN', 'CEPHALOTHORAX', 'BRAIN', 'EYE', 'FAT_BODY', 'INTESTINE', 'BODYWALL', 'TERMINAL_BODY', 'ANTERIOR_BODY', 'MID_BODY', 'POSTERIOR_BODY', 'HEPATOPANCREAS', 'LEG', 'BLOOD', 'LUNG', 'HEART', 'KIDNEY', 'LIVER', 'ENDOCRINE_TISSUE', 'SPLEEN', 'STOMACH', 'PANCREAS', 'MUSCLE', 'MODULAR_COLONY', 'TENTACLE', 'FIN', 'SKIN', 'SCAT', 'EGGSHELL', 'SCALES', 'MOLLUSC_FOOT', 'HAIR', 'GILL_ANIMAL', '**OTHER_SOMATIC_ANIMAL_TISSUE**', 'OVIDUCT', 'GONAD', 'OVARY_ANIMAL', 'TESTIS', 'SPERM_SEMINAL_FLUID', 'EGG', '**OTHER_REPRODUCTIVE_ANIMAL_TISSUE**', 'WHOLE_PLANT', 'SEEDLING', 'SEED', 'LEAF', 'FLOWER', 'BLADE', 'STEM', 'PETIOLE', 'SHOOT', 'BUD', 'THALLUS_PLANT', 'BRACT', '**OTHER_PLANT_TISSUE**', 'MYCELIUM', 'MYCORRHIZA', 'SPORE_BEARING_STRUCTURE', 'HOLDFAST_FUNGI', 'STIPE', 'CAP', 'GILL_FUNGI', 'THALLUS_FUNGI', 'SPORE', '**OTHER_FUNGAL_TISSUE**', 'NOT_COLLECTED', 'NOT_APPLICABLE', 'NOT_PROVIDED', 'UNICELLULAR_ORGANISMS_IN_CULTURE', 'MULTICELLULAR_ORGANISMS_IN_CULTURE']},
diff --git a/server/utils/ena_client.py b/server/utils/ena_client.py
index 348db4c1..43c40b91 100644
--- a/server/utils/ena_client.py
+++ b/server/utils/ena_client.py
@@ -1,5 +1,5 @@
 import requests
-from flask import current_app as app
+from flask import current_app as app, request
 import time
 
 def get_taxon_from_ena(taxon_id):
@@ -25,6 +25,13 @@ def get_tolid(taxid):
     else:
         return response[0]['prefix']
 
+def get_bioproject(project_accession):
+    resp = requests.get(f"https://www.ebi.ac.uk/ena/portal/api/filereport?accession={project_accession}&format=JSON&result=study")
+    if resp.status_code != 200:
+        return list()
+    else:
+        return resp.json()
+
 def get_biosamples_page(url , samples):
     response = requests.get(url)
     if response.status_code != 200:
"associated_traditional_knowledge_applicable":0,"ethics_permits_mandatory":0, "sampling_permits_mandatory":0, "regulatory_compliance":0,"nagoya_permits_mandatory":0, "collector_orcid_id":0,"sample_coordinator_orcid_id":0}, - "assemblies" : {"_id":0}, + "assemblies" : {"_id":0,"created":0}, "experiments": {"_id":0} } } @@ -167,9 +177,9 @@ { "created":0, "last_check":0, - "assemblies" : {"_id":0}, + "assemblies" : {"_id":0, "created":0}, "experiments": {"_id":0} - + } } ] \ No newline at end of file diff --git a/server/utils/utils.py b/server/utils/utils.py index b199cbe0..cfcdd9a9 100644 --- a/server/utils/utils.py +++ b/server/utils/utils.py @@ -1,6 +1,13 @@ from lxml import etree from flask import make_response,jsonify from .constants import CHECKLIST_FIELD_GROUPS +import requests + +def get_annotations(org_name): + response = requests.get(f'https://genome.crg.cat/geneid-predictions/api/organisms/{org_name}') + if response.status_code != 200: + return + return response.json() def parse_taxon(xml): root = etree.fromstring(xml) @@ -44,7 +51,7 @@ def parse_sample_metadata(metadata): sample['geographic_location_country'] = metadata[key][0]['text'] else: custom_fields[key] = metadata[key][0]['text'] - if len(custom_fields.keys()) > 0: + if custom_fields.keys(): sample['custom_fields'] = custom_fields return sample