The IEU GWAS database comprises over 10,000 curated, QC’d and harmonised complete GWAS summary datasets and can be queried using an API. See here for documentation on the API itself. This R package is a wrapper to make generic calls to the API, plus convenience functions for specific queries.

Authentication

Most datasets in the database are public and don’t need authentication. But if you want to access a private dataset that is linked to your (gmail) email address, you need to authenticate the query using a method known as Google OAuth2.0.

Essentially, you run ieugwasr::get_access_token() at the start of your session. This opens a web browser asking for your Google username and password. Once you provide them, a directory called ieugwasr_oauth is created in your working directory. This directory contains a file whose name looks like this: <random_string>_<email@address>. It is a binary file (not human readable) which contains your access token, and it acts as a convenient way to hold a randomly generated password.

If you are using a server which doesn’t have a graphical user interface then the ieugwasr::get_access_token() method is not going to work. You need to generate the ieugwasr_oauth directory and token file on a computer that has a web browser, and then copy that directory (containing the token file) to the relevant working directory on your server.

If you are using R in a working directory that does not have write permissions then this command will fail. Please navigate to a directory that does have write permissions.

If you need to run this in a non-interactive script then you can generate the token file on an interactive computer, copy that file to the working directory that R will be running from, and then run the batch (non-interactive) job.

You can test whether you have authenticated by checking your access token. The check returns NULL if you are not authenticated, or a long random token string if you are.

To unauthenticate, simply delete the relevant file in the ieugwasr_oauth folder, or delete the folder entirely.
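
The file handling described in this section can be sketched in base R. This is a minimal illustration, not part of the package: has_token_file() and unauthenticate() are hypothetical helper names, and the sketch assumes the ieugwasr_oauth directory sits in the current working directory as described above. The call to ieugwasr::get_access_token() is guarded so that it only runs in an interactive session with a writable working directory.

```r
# Directory name taken from the text; it is created in the working directory.
oauth_dir <- file.path(getwd(), "ieugwasr_oauth")

# Hypothetical helper: TRUE if the token directory exists and holds a file.
has_token_file <- function(path = oauth_dir) {
  dir.exists(path) && length(list.files(path)) > 0
}

# Hypothetical helper: unauthenticate by deleting the whole token directory.
unauthenticate <- function(path = oauth_dir) {
  unlink(path, recursive = TRUE)
  invisible(!dir.exists(path))
}

# The token directory is written to the working directory, so check it is writable.
can_write <- unname(file.access(getwd(), mode = 2) == 0)

if (interactive() && can_write && requireNamespace("ieugwasr", quietly = TRUE)) {
  ieugwasr::get_access_token()  # opens a browser for the Google login
}
```

On a server without a browser, has_token_file() is one way to confirm that the copied directory has arrived in the right place before starting a batch job.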

General API queries

The API has a number of endpoints documented here. A general way to access them in R is using the api_query function. There are two types of endpoints: GET and POST.

GET - you provide a single URL which includes the endpoint and query. For example, for the association endpoint you can obtain some rsids in some studies.
POST - here you send a “payload” to the endpoint. The path specifies the endpoint and you add a list of query specifications. This is useful, for example, when querying long lists of rsids.

The api_query function returns a response object from the httr package. See below for a list of functions that make the input and output to api_query more convenient.
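
As a rough sketch of the two query styles, the snippet below builds a GET-style path and a POST-style payload for the association endpoint. The rsids and study IDs are illustrative, and the path layout (endpoint/studies/rsids, comma separated) and the payload names rsid and id are assumptions that should be checked against the API documentation. The network calls are guarded so that nothing is sent unless the session is interactive.

```r
# Illustrative rsids and study IDs.
rsids <- c("rs234", "rs123")
ids <- c("ieu-a-2", "ieu-a-7")

# GET style: endpoint and query packed into a single path (assumed layout).
get_path <- paste(
  "associations",
  paste(ids, collapse = ","),
  paste(rsids, collapse = ","),
  sep = "/"
)

# POST style: the path names the endpoint, the query is sent as a list payload.
post_payload <- list(rsid = rsids, id = ids)

if (interactive() && requireNamespace("ieugwasr", quietly = TRUE)) {
  r_get <- ieugwasr::api_query(get_path)
  r_post <- ieugwasr::api_query("associations", query = post_payload)
  class(r_post)  # a response object from httr
}
```

The POST form scales better because the list of rsids travels in the request body rather than in the URL.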

By default, an association lookup will search for LD proxies using 1000 genomes reference data (Europeans only; the reference panel has INDELs removed and only retains SNPs with MAF > 0.01). This behaviour can be turned off using proxies=0 as an argument.

Note that the queries are performed on rsids, but chromosome:position values will be automatically converted. A range query can be done by supplying a <chr>:<start>-<end> value.
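
A minimal sketch of these options, assuming the package's associations() function with its variants, id and proxies arguments. The variants, study IDs and region are illustrative values, and the calls only run in an interactive session.

```r
# Illustrative inputs: an rsid, a chromosome:position value and a range.
variants <- c("rs123", "7:105561135")
study_ids <- c("ieu-a-2", "ieu-a-7")
region <- "7:105561135-105563135"

if (interactive() && requireNamespace("ieugwasr", quietly = TRUE)) {
  # Default behaviour: LD proxies are searched for missing variants.
  with_proxies <- ieugwasr::associations(variants = variants, id = study_ids)

  # Proxy lookup switched off.
  no_proxies <- ieugwasr::associations(variants = variants, id = study_ids, proxies = 0)

  # Range query over a chromosome:start-end interval.
  in_region <- ieugwasr::associations(variants = region, id = "ieu-a-2", proxies = 0)
}
```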

PheWAS can also be performed in only specific subsets of the data. The datasets in the IGD are organised by batch. You can see information about the batches at https://gwas.mrcieu.ac.uk/datasets/ or retrieve a list of batches and their descriptions from within R.
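
The sketch below restricts a PheWAS to one batch. It assumes the package's phewas() and batches() functions; the rsid and p-value threshold are illustrative. batch_of() is a hypothetical helper that assumes a batch name is the dataset ID with its trailing number removed (for example ieu-a for ieu-a-2), which is worth confirming against the batch list itself.

```r
# Hypothetical helper: derive a batch name from a dataset ID (assumed convention).
batch_of <- function(dataset_id) sub("-[^-]*$", "", dataset_id)

variant <- "rs1205"          # illustrative rsid
pval_threshold <- 1e-5       # illustrative threshold
batch_subset <- batch_of("ieu-a-2")

if (interactive() && requireNamespace("ieugwasr", quietly = TRUE)) {
  available_batches <- ieugwasr::batches()
  hits_all <- ieugwasr::phewas(variants = variant, pval = pval_threshold)
  hits_subset <- ieugwasr::phewas(
    variants = variant,
    pval = pval_threshold,
    batch = batch_subset
  )
}
```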

There are 5 super-populations that can be requested via the pop argument. By default this will use the Europeans subset (EUR super-population). The reference panel has INDELs removed and only retains SNPs with MAF > 0.01 in the selected population.

Note that you can perform the same operation locally if you provide a path to plink and a bed/bim/fam LD reference dataset.
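
As an illustration of the pop argument and the local alternative, the following sketch assumes the package's ld_clump() function with an input table holding rsid, pval and id columns. The input values are made up, the five codes are the standard 1000 genomes super-population labels, and both file paths are placeholders.

```r
# The five 1000 genomes super-population codes.
super_pops <- c("AFR", "AMR", "EAS", "EUR", "SAS")

# Made-up input: one row per variant.
clump_input <- data.frame(
  rsid = c("rs234", "rs123"),
  pval = c(1e-8, 5e-9),
  id = "ieu-a-2"
)

pop <- "EAS"
stopifnot(pop %in% super_pops)

if (interactive() && requireNamespace("ieugwasr", quietly = TRUE)) {
  # Clump through the API against the chosen reference panel.
  clumped_api <- ieugwasr::ld_clump(clump_input, pop = pop)

  # Same operation locally; both paths are placeholders.
  clumped_local <- ieugwasr::ld_clump(
    clump_input,
    plink_bin = "/path/to/plink",
    bfile = "/path/to/reference_data"
  )
}
```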

Generating an LD matrix uses the API by default but is limited to only 500 variants. You can use, instead, local plink and LD reference data in the same manner as in the ld_clump function.

There are 5 super-populations that can be requested via the pop argument. By default this will use the Europeans subset (EUR super-population). The reference panel has INDELs removed and only retains SNPs with MAF > 0.01 in the selected population.
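
One way to respect the 500-variant limit is to choose between the API and a local run based on the number of variants. This is only a sketch: it assumes the package's ld_matrix() function with pop, plink_bin and bfile arguments, fits_api_limit() is a hypothetical helper, the rsids are illustrative and the paths are placeholders.

```r
# Hypothetical helper: TRUE if the request is within the API limit from the text.
fits_api_limit <- function(variants, limit = 500) {
  length(unique(variants)) <= limit
}

variants <- c("rs234", "rs123", "rs333")  # illustrative rsids

if (interactive() && requireNamespace("ieugwasr", quietly = TRUE)) {
  if (fits_api_limit(variants)) {
    ld <- ieugwasr::ld_matrix(variants, pop = "EUR")
  } else {
    # Placeholder paths to a local plink binary and reference panel.
    ld <- ieugwasr::ld_matrix(
      variants,
      plink_bin = "/path/to/plink",
      bfile = "/path/to/reference_data"
    )
  }
}
```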

Variant information

Translating between rsids and chromosome:position, while also retrieving other information, is supported.
The chrpos argument can accept the following formats:
<chr>:<position>
<chr>:<start>-<end>
For example:

a <- variants_chrpos(c("7:105561135-105563135", "10:44865737"))

This provides a table with dbSNP variant IDs, gene info, and various other metadata. Similar data can be obtained by searching by rsid:

b <- variants_rsid(c("rs234", "rs333"))

A list of variants within a particular gene region can also be found by providing an Ensembl or Entrez gene ID (e.g. ENSG00000123374 or 1017).
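
A short sketch of the gene lookup, assuming the package's variants_gene() function. The helper gene_id_type() is hypothetical and simply tells the two ID styles apart by their shape; the IDs are passed as character strings.

```r
# Hypothetical helper: classify a gene ID as Ensembl or Entrez by its format.
gene_id_type <- function(ids) {
  ifelse(grepl("^ENSG[0-9]+$", ids), "ensembl", "entrez")
}

gene_ids <- c("ENSG00000123374", "1017")

if (interactive() && requireNamespace("ieugwasr", quietly = TRUE)) {
  by_ensembl <- ieugwasr::variants_gene(gene_ids[1])
  by_entrez <- ieugwasr::variants_gene(gene_ids[2])
}
```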

Extracting GWAS summary data based on gene region

Here is an example of how to obtain summary data for some datasets for a gene region. As an example, we’ll extract CDK2 (Entrez gene ID 1017) from a BMI dataset (ieu-a-2).
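
The two steps can be chained roughly as follows: look up the variants in the gene, then request their associations in the dataset. This is a sketch under assumptions: it uses the package's variants_gene() and associations() functions, and because the name of the rsid column in the returned variant table is not stated here, the hypothetical helper rsid_column() tries two candidate names (name and rsid) rather than relying on one.

```r
# Hypothetical helper: find the column holding rsids in a variant table.
# The candidate column names are assumptions about the returned table.
rsid_column <- function(tbl, candidates = c("name", "rsid")) {
  found <- intersect(candidates, names(tbl))
  if (length(found) == 0) stop("no rsid column found")
  found[1]
}

if (interactive() && requireNamespace("ieugwasr", quietly = TRUE)) {
  # Variants located within CDK2 (Entrez gene ID 1017).
  cdk2_variants <- ieugwasr::variants_gene("1017")

  # Summary data for those variants in the BMI dataset.
  cdk2_bmi <- ieugwasr::associations(
    variants = cdk2_variants[[rsid_column(cdk2_variants)]],
    id = "ieu-a-2"
  )
}
```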

1000 genomes annotations

The OpenGWAS database contains population annotations from the 1000 genomes project: the alternative allele frequencies and the LD scores for each variant, calculated for each super-population separately. Variants are only present if they have MAF > 1% in at least one super-population. You can access this information in different ways.
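
As a hedged sketch of the access routes, the snippet below assumes the package's afl2_rsid(), afl2_chrpos() and afl2_list() functions for lookups by rsid, by region, and for a pre-selected variant list. The rsids and region are illustrative, and parse_region() is a hypothetical helper used only to sanity-check the region string before it is sent.

```r
# Hypothetical helper: split "<chr>:<start>-<end>" into its parts.
parse_region <- function(region) {
  parts <- strsplit(region, "[:-]")[[1]]
  list(chr = parts[1], start = as.numeric(parts[2]), end = as.numeric(parts[3]))
}

rsids <- c("rs234", "rs123")   # illustrative rsids
region <- "1:100000-900000"    # illustrative region
stopifnot(parse_region(region)$end > parse_region(region)$start)

if (interactive() && requireNamespace("ieugwasr", quietly = TRUE)) {
  by_rsid <- ieugwasr::afl2_rsid(rsids)
  by_region <- ieugwasr::afl2_chrpos(region)
  variant_list <- ieugwasr::afl2_list()
}
```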

We have tried to provide useful cloud-based functionality for many operations, including relatively demanding LD operations. If you are running a large number of LD operations, we request that you think about performing those locally rather than through the API. We have tried to write the software to enable this to work seamlessly. Some examples are given below.

library(ieugwasr)
#> OpenGWAS updates:
#> Date: 2024-03-07
#> [>] There is exceptional load on the OpenGWAS servers.
#> [>] Urgent infrastructure development being performed.
#> [>] See local options for analysis: https://mrcieu.github.io/gwasvcf/.

LD clumping

The API has a wrapper around plink version 1.90 and can use it to perform clumping with an LD reference panel from 1000 genomes reference data.

There are 5 super-populations that can be requested via the pop argument. By default this will use the Europeans subset (EUR super-population). The reference panel has INDELs removed and only retains SNPs with MAF > 0.01 in the selected population.

Note that you can perform the same operation locally if you provide a path to plink and a bed/bim/fam LD reference dataset.

The LD reference data contains a reference panel for each of the 5 super-populations in the 1000 genomes reference dataset. For example, for the European super-population it has the following files:
EUR.bed
EUR.bim
EUR.fam

Now suppose that in R you have a data frame, dat, containing the variants you want to clump.
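
A sketch of local clumping against the European panel is shown below. The layout of dat (columns rsid, pval and trait_id) is an assumption made for illustration, the reference path is a placeholder, and plink is looked up on the PATH. The call to ld_clump() only runs when the plink binary and all three panel files are actually found.

```r
# Assumed layout of dat; column names and values are illustrative.
dat <- data.frame(
  rsid = c("rs234", "rs123"),
  pval = c(1e-8, 5e-9),
  trait_id = c("ieu-a-2", "ieu-a-2")
)

# plink takes the shared prefix of EUR.bed/EUR.bim/EUR.fam, not a single file.
bfile <- "/path/to/reference/EUR"        # placeholder path
plink_bin <- unname(Sys.which("plink"))  # "" if plink is not on the PATH

panel_files <- paste0(bfile, c(".bed", ".bim", ".fam"))
ready <- nzchar(plink_bin) && all(file.exists(panel_files))

if (ready && requireNamespace("ieugwasr", quietly = TRUE)) {
  # Map the columns of dat onto the rsid, pval and id names used for clumping.
  clumped <- ieugwasr::ld_clump(
    data.frame(rsid = dat$rsid, pval = dat$pval, id = dat$trait_id),
    plink_bin = plink_bin,
    bfile = bfile
  )
}
```

Checking for the panel files up front gives a clearer failure than letting plink report a missing file part-way through a batch run.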

Generating an LD matrix uses the API by default but is limited to only 500 variants. You can use, instead, local plink and LD reference data in the same manner as in the ld_clump function.

To automatically extract variants from a dataset, and search for LD proxies when a requested variant is not present in the dataset, please look at the options available in the gwasvcf package: https://mrcieu.github.io/gwasvcf/