GWAS Repository Information

Overall Protocol

This document describes the layout of this repository, which contains several manual trials of our GWAS SNP filtering method.

The main protocol for each GWAS analysis is to:

  1. Identify an Enformer track list relevant to the trait being studied. Tracks are labelled according to tissue type in target_dnase_ataq_tracks_labelled.csv.

  2. Merge the 1000 Genomes SAD scores for each track in the track list with the summary statistics for the given GWAS using a matching file. These files match rows of summary statistics by either SNP id (rs id) or chromosome and position pairs, depending on the information provided in the summary statistics. For gwas 1, this is called "". For gwas 3, this is called "". For gwas 4, this is called "". To replicate this in a new study: the 1000 Genomes SAD score data are stored in 1000genomes_as_csv. Download the summary statistics for your study to the repository, then merge the SAD score data with the summary statistics on a common column, chosen according to the variables in the summary statistics (e.g. rs-id, or [chr,pos]).

  3. Replicate the given GWAS using a pre-filtered list of SNPs. These files are labelled "". They first filter the SNPs based upon global thresholds for Enformer SAD track scores, and then adjust the p-value threshold based upon the number of SNPs filtered from the given study. This is done with "" for gwas 1, "" for gwas 3, and "" for gwas 4. To replicate this in a new study, first apply GWAS filtering: compute global thresholds for Enformer SAD track scores across the n_0 initial SNPs, then filter the SNP list according to these thresholds. Count the SNPs that remain (n_1) and compute 5e-8 x (n_0/n_1). Because n_1 is at most n_0, this gives a higher (less stringent) p-value threshold. Then recreate the GWAS selection process using the methods outlined in the study, with this new significance threshold and the filtered SNP list.

  4. Following the above steps, analyse the results against those of the original study.
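
Steps 2 and 3 can be sketched roughly as follows. This is a minimal illustration and not the repository's actual scripts: the column names (`snp`, `chr`, `pos`, `p`, `sad_0`), the function names, the use of a quantile of |SAD| as the global threshold, and the rule of keeping a SNP that reaches the threshold on at least one track are all assumptions made for the example. Only the join keys and the 5e-8 x (n_0/n_1) adjustment come from the protocol above.

```python
def merge_on_common_key(sad_rows, sumstat_rows):
    """Inner-join SAD rows with summary-statistic rows (lists of dicts).

    Joins on "snp" when the summary statistics have that column, otherwise
    on the ("chr", "pos") pair. These column names are hypothetical.
    """
    use_rsid = bool(sumstat_rows) and "snp" in sumstat_rows[0]

    def key(row):
        if use_rsid:
            return row["snp"]
        return (str(row["chr"]), str(row["pos"]))

    sad_by_key = {key(row): row for row in sad_rows}
    merged = []
    for row in sumstat_rows:
        sad = sad_by_key.get(key(row))
        if sad is not None:
            merged.append({**sad, **row})
    return merged


def global_thresholds(rows, track_cols, quantile):
    """Per-track threshold: a quantile of |SAD| over all rows (assumed rule)."""
    if not rows:
        raise ValueError("no SNPs to compute thresholds from")
    thresholds = {}
    for col in track_cols:
        magnitudes = sorted(abs(float(row[col])) for row in rows)
        index = min(int(quantile * len(magnitudes)), len(magnitudes) - 1)
        thresholds[col] = magnitudes[index]
    return thresholds


def filter_snps(rows, thresholds):
    """Keep SNPs whose |SAD| reaches the threshold on any track (assumed rule)."""
    return [
        row for row in rows
        if any(abs(float(row[col])) >= cut for col, cut in thresholds.items())
    ]


def adjusted_p_threshold(n_0, n_1, base=5e-8):
    """Significance threshold after filtering: base x (n_0 / n_1)."""
    if not 0 < n_1 <= n_0:
        raise ValueError("expected 0 < n_1 <= n_0")
    return base * (n_0 / n_1)


# Toy data, invented purely for illustration.
sad = [
    {"snp": "rs1", "sad_0": "0.10"},
    {"snp": "rs2", "sad_0": "-0.40"},
    {"snp": "rs3", "sad_0": "0.20"},
    {"snp": "rs4", "sad_0": "0.05"},
    {"snp": "rs5", "sad_0": "0.90"},
]
sumstats = [
    {"snp": "rs1", "p": "3e-9"},
    {"snp": "rs2", "p": "8e-8"},
    {"snp": "rs3", "p": "0.2"},
    {"snp": "rs4", "p": "6e-8"},
    {"snp": "rs9", "p": "0.5"},
]

merged = merge_on_common_key(sad, sumstats)  # rs5 and rs9 have no partner
cuts = global_thresholds(merged, ["sad_0"], quantile=0.5)
kept = filter_snps(merged, cuts)
p_cut = adjusted_p_threshold(len(merged), len(kept))
significant = [row["snp"] for row in kept if float(row["p"]) < p_cut]
```

In this toy run half of the matched SNPs survive the filter, so the threshold doubles from 5e-8 to 1e-7, and a SNP with p = 8e-8 that would miss the conventional cut-off passes the adjusted one.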

Note that the names of all files associated with a specific gwas begin with “gwas_#”. To see which study each number corresponds to, look at references.txt. The analysis for gwas 2 was not completed due to limitations in study replicability. The following analyses have been conducted, as of Oct 30, 2024:


gwas 1:

  • Original summary statistics are available in GWAS/gwas_1_and_2_summary_statistics_data/GCST90277450.tsv
  • The matched files (result of step 2 in Overall Protocol) are stored in gwas_1_matching. The file for a track t is labelled as result_SAD.csv, and houses the SAD scores for each SNP in 1KG. This can be reproduced by running: python ./gwas_1_and_2_summary_statistics_data/GCST90277450.tsv ./1000genomes_as_csv [0,1,9,76,78,80,81,172,179,216,240,261,278,319,326,338,355,370,403,411,421,458,462,469,499,524,552,580,582,602,644,669]
  • Intermediate files produced during runs are stored in GWAS/gwas_1_intermediate_files.
  • Results for each threshold t are stored in GWAS/filtered_snps_gwas_1/filtered_snps_gwas_1_threshold=.csv.
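
The reproduction command above passes three positional arguments: the summary statistics file, the SAD score directory, and a bracketed track list. A minimal sketch of how such arguments could be parsed is given below. The parser, the argument names, and `parse_track_list` are hypothetical and are not taken from the repository's script.

```python
import argparse
import json


def parse_track_list(text):
    """Parse a bracketed, comma-separated list of track indices, e.g. "[0,1,9]"."""
    tracks = json.loads(text)
    is_int_list = isinstance(tracks, list) and all(
        isinstance(t, int) and not isinstance(t, bool) for t in tracks
    )
    if not is_int_list:
        raise argparse.ArgumentTypeError(f"expected a list of integers, got {text!r}")
    return tracks


def build_parser():
    # Hypothetical interface mirroring the three arguments in the command above.
    parser = argparse.ArgumentParser(description="Hypothetical matching-script interface.")
    parser.add_argument("sumstats_path", help="summary statistics file")
    parser.add_argument("sad_dir", help="directory of per-track SAD score CSVs")
    parser.add_argument("tracks", type=parse_track_list, help='track list, e.g. "[0,1,9]"')
    return parser


args = build_parser().parse_args(
    [
        "./gwas_1_and_2_summary_statistics_data/GCST90277450.tsv",
        "./1000genomes_as_csv",
        "[0,1,9,76]",
    ]
)
```

When typing the command in a shell, quoting the track list (for example `"[0,1,9]"`) avoids the shell treating the square brackets as a glob pattern.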

The results of GWAS 1 were analysed using the following data:

  • eQTL (Data available in GWAS/EQTLs)
  • FANTOM5_annotations (Data available in GWAS/FANTOM5_annotations)

gwas 1 single track analysis:

  • A separate analysis was run to understand the effect of using a single SAD track as a threshold. The results are stored in GWAS/gwas_1_single_track_analysis/GWAS_1_leading_SNPs_by_track and GWAS/gwas_1_single_track_analysis/GWAS_1_leading_snps_by_track_random. These results were produced using the GWAS/gwas_1_single_track_analysis/ file. The notebook comparing against the results for all tracks is GWAS/gwas_1_single_track_analysis/gwas_1_all_vs_single_track.ipynb, and the one for a random set of tracks is GWAS/gwas_1_single_track_analysis/gwas_1_all_vs_single_track_random.ipynb. A large-scale version of this analysis is currently underway.


gwas 2:

  • Original summary statistics are available in GWAS/gwas_1_and_2_summary_statistics_data/PGC_UKB_depression_genome-wide.txt
  • The matched files (result of step 2 in Overall Protocol) are stored in gwas_1_matching. The file for a track t is labelled as result_SAD.csv, and houses the SAD scores for each SNP in 1KG. This can be reproduced by running: python ./gwas_1_and_2_summary_statistics_data/PGC_UKB_depression_genome-wide.txt ./1000genomes_as_csv [0,1,9,76,78,80,81,172,179,216,240,261,278,319,326,338,355,370,403,411,421,458,462,469,499,524,552,580,582,602,644,669]
  • Results for each threshold t are stored in GWAS/filtered_snps_gwas_2/filtered_snps_gwas_2_threshold=.csv.
  • Intermediate files produced during runs are stored in GWAS/gwas_2_intermediate_files.


gwas 3:

  • Summary statistics are available in GWAS/gwas_3_scz_original_files.
  • Intermediate files produced during runs are stored in GWAS/gwas_3_scz_intermediate_files.
  • Results are available in GWAS/gwas_3_scz_result_files.


gwas 4:

  • Summary statistics are available in GWAS/gwas_4_alz_summary_statistics. The study we are looking at is labelled PGCALZ2sumstatsExcluding23andMe.txt. The summary statistics from a 2019 ALZ GWAS by the same group (a precursor to this one) are also stored there, as they are used to label SNPs in this study with the correct SNP ids.
  • The original results of the study, which included 23andMe data that we do not have access to, are available in gwas_4_alz_original_results_final.csv.
  • The matched files (result of step 2 in Overall Protocol) are stored in gwas_1_matching. The file for a track t is labelled as result_SAD.csv, and houses the SAD scores for each SNP in 1KG. This can be reproduced by running: python ./GWAS/gwas_4_alz_summary_statistics/PGCALZ2sumstatsExcluding23andMe.txt ./1000genomes_as_csv [0,1,9,76,78,80,81,172,179,216,240,261,278,319,326,338,355,370,403,411,421,458,462,469,499,524,552,580,582,602,644,669].
  • Intermediate files produced during runs are stored in GWAS/gwas_4_alz_intermediate_files.
  • Results are available in GWAS/gwas_4_alz_result_files.

Other Directories:


  • At the beginning of the project, we wanted to ensure that we could run Enformer + Tangermeme in PyTorch. The files for doing so are contained here. These were only tests to ensure functionality; nothing is built on them yet.

Finemapping: GWAS/finemapping and GWAS/fine_mapping_results

  • Used to do finemapping analysis on GWAS 1

Genome Wide Complex Trait Analysis (GCTA): GWAS/gcta_output_directory and GWAS/gcta

  • GCTA software website:
  • GCTA install is in GWAS/gcta
  • GWAS/gcta_output_directory housed intermediate gcta files while gwas_2 was running, but is currently empty. GWAS 2 was not able to be replicated well, so the analysis was not completed.



Genome assembly: GWAS/genome_assembly

  • Contains genome annotations. Version v46lift37 (GWAS/genome_assembly/gencode.v46lift37.annotation.gff3) is the most applicable to the GRCh37 (hg19) assembly, which gwas 1-4 use.
  • Contains "". This was used to create the exon_regions that are used to identify coding SNPs in the GWAS filtering procedure. This routine creates the regions for coding SNPs, which are stored in GWAS/genome_assembly/exon_regions_v2.csv
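
As a rough illustration of how exon regions can be derived from a GFF3 annotation and used to flag coding SNPs, consider the sketch below. It is not the routine used in this repository. The function names, the choice of the `exon` feature type, and the (chromosome, start, end) layout are assumptions, and the actual columns of exon_regions_v2.csv may differ. The sketch relies only on the standard GFF3 layout: nine tab-separated columns with 1-based, inclusive start and end coordinates.

```python
def exon_regions_from_gff3(lines):
    """Collect unique (chrom, start, end) tuples for every 'exon' feature."""
    regions = set()
    for line in lines:
        if line.startswith("#"):
            continue  # header and comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        regions.add((fields[0], int(fields[3]), int(fields[4])))
    return sorted(regions)


def is_in_exon(chrom, pos, regions):
    """True if the 1-based position lies inside any exon region (inclusive)."""
    return any(c == chrom and start <= pos <= end for c, start, end in regions)


# Toy annotation, invented for illustration (not real GENCODE records).
toy_gff3 = [
    "##gff-version 3\n",
    "chr1\ttoy\tgene\t100\t350\t.\t+\t.\tID=gene1\n",
    "chr1\ttoy\texon\t100\t200\t.\t+\t.\tID=exon1\n",
    "chr1\ttoy\texon\t300\t350\t.\t+\t.\tID=exon2\n",
    "chr1\ttoy\texon\t100\t200\t.\t+\t.\tID=exon1_other_transcript\n",
]

regions = exon_regions_from_gff3(toy_gff3)
coding = is_in_exon("chr1", 150, regions)
intronic = is_in_exon("chr1", 250, regions)
```

The linear scan in `is_in_exon` is fine for a toy example. For genome-wide SNP lists an interval index or a sorted-merge approach would scale better.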


leading SNPs_SAD_track_visualization

  • This analysis compared the Enformer SAD scores of the track list we identified for psychiatric conditions ([0,1,9,76,78,80,81,172,179,216,240,261,278,319,326,338,355,370,403,411,421,458,462,469,499,524,552,580,582,602,644,669]) with those of a random track list (see gwas_random_tracks_matching for the track list and the files analysed).
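
For reference, a random track list of the same size as the psychiatric list could be drawn as in the sketch below. This is only an illustration: the pool size `N_TRACKS`, the fixed seed, and the choice to exclude the psychiatric tracks from the pool are assumptions, and the random list actually used is the one stored in gwas_random_tracks_matching.

```python
import random

PSYCHIATRIC_TRACKS = [
    0, 1, 9, 76, 78, 80, 81, 172, 179, 216, 240, 261, 278, 319, 326, 338,
    355, 370, 403, 411, 421, 458, 462, 469, 499, 524, 552, 580, 582, 602,
    644, 669,
]


def sample_random_tracks(n_total, k, exclude=(), seed=None):
    """Draw k distinct track indices from range(n_total), skipping `exclude`."""
    excluded = set(exclude)
    pool = [t for t in range(n_total) if t not in excluded]
    return sorted(random.Random(seed).sample(pool, k))


# Placeholder pool size; in practice, use the number of tracks listed in
# target_dnase_ataq_tracks_labelled.csv.
N_TRACKS = 684

random_tracks = sample_random_tracks(
    N_TRACKS, len(PSYCHIATRIC_TRACKS), exclude=PSYCHIATRIC_TRACKS, seed=0
)
```

Fixing the seed makes the draw reproducible, so the same random list can be regenerated when the comparison is rerun.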


  • Used for running analyses for GWAS 1 on random track sets. Any intermediate files were placed here instead of in gwas_1_intermediate_files.