EDF1: Gene expression patterns across whole blood samples

We used a total of 1,061 whole blood samples from our control cohorts and rare disease samples. A: Density plot representing the proportion of annotated junctions covered per gene. These are the subset of genes for which at least one junction is covered with at least 5 uniquely mapped reads in at least 20% of the samples. On average, 86% of junctions (blue dashed line; median of 100%, red dashed line) fulfil those criteria. B: Percentage of genes from disease gene panels in which at least one junction is covered with at least 5 uniquely mapped reads in at least 20% of samples. We observe that about 50% of genes from the OMIM, Neurology, Musculoskeletal, Ophthalmology or Hematology panels meet this criterion. C: Tolerance to different types of mutations (from ExAC) as a function of expression status in a single versus multiple tissues (two-sided Wilcoxon test, p-value≤2×10^−16). Analysis performed on 620 individuals from GTEx v7 across 22 tissues. Boxplots represent the median value, with lower and upper hinges corresponding to the 25th and 75th percentiles; lower and upper whiskers extend from the hinge to the smallest and largest value at most 1.5 * inter-quartile range from the hinge, respectively. Genes that are expressed in multiple tissues tend to be more sensitive to missense and LoF mutations. D: Number of LoF-intolerant genes stratified by expression level in blood. We considered genes with pLI score ≥ 0.9 as LoF intolerant.
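The coverage criterion used in panels A and B can be illustrated with a minimal sketch. The function name, the input layout (one list of per-sample read counts per junction) and the toy counts are assumptions made for illustration; this is not the code used to produce the figure.

```python
def covered_junction_fraction(counts, min_reads=5, min_sample_frac=0.2):
    """Return the fraction of a gene's junctions that pass the coverage filter.

    counts: list of junctions, each a list of uniquely mapped read counts
            with one value per sample (hypothetical layout).
    A junction passes when at least `min_sample_frac` of the samples
    have `min_reads` or more reads.
    """
    if not counts:
        return 0.0
    passing = 0
    for junction in counts:
        n_covered = sum(1 for reads in junction if reads >= min_reads)
        if n_covered >= min_sample_frac * len(junction):
            passing += 1
    return passing / len(counts)


# Toy gene with three annotated junctions measured in five samples.
toy_gene = [
    [10, 7, 0, 5, 2],  # 3 of 5 samples reach 5 reads: passes
    [0, 5, 0, 6, 0],   # 2 of 5 samples reach 5 reads: passes
    [0, 0, 4, 3, 1],   # no sample reaches 5 reads: fails
]
fraction = covered_junction_fraction(toy_gene)
```

For the toy gene, two of the three junctions pass, so the gene would contribute a value of about 0.67 to a density plot like the one in panel A.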

EDF2: Correction for batch effects - Expression data.

Analyses performed on n=909 DGN samples and 143 rare disease samples (cases and family controls). A: Plot of the first two principal components computed on uncorrected gene expression data. Samples are coloured by batch. The largest cluster (green dots) corresponds to the DGN control samples (n=909). B: Plot of the first two principal components computed on gene expression data after regressing out the significant surrogate variables found by SVA. C: Correlation between known covariates and all significant surrogate variables (SVs). We observed that SV2 is highly correlated with the read type and the sequencing technology, corresponding to differences between DGN and the other samples.

EDF3: Use of regression splines in expression data normalization

A: Normalized gene expression residuals from 1052 samples in an example gene without correction (left panel), after regressing out significant surrogate variables (SVs) (middle panel), and after regressing out significant SVs plus regression splines on the top SVs significantly associated with batch and study (right panel). Residuals were plotted against SV2 for illustration purposes (SV2 is significantly associated with batch; p-value<1e-30, two-sided t-test from linear regression, no adjustment for multiple testing). B: Mean number of outlier genes per sample (n=990) in each batch (absolute Z-score>8) after correction with SVs (left panel) and SVs with regression splines (right panel). SD is displayed above each bar. Regression splines resulted in a more consistent number of outlier genes across samples in all batches. C: Benjamini & Hochberg adjusted p-values resulting from a per-gene likelihood ratio test comparing the linear regression model fit with and without regression splines. Regression splines improve the model fit for 2,644 genes (p-value≤0.05, 17.6% of all genes in the dataset). Red dashed line indicates the p-value=0.05 cutoff. D: Change in R^2, in decreasing order, across all genes in the dataset (n=14,988) after correcting data using significant SVs with regression splines, compared to correcting data using significant SVs without regression splines. Mean change in R^2 is 0.036 (SD=0.025).
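The per-sample outlier count shown in panel B can be sketched as follows, assuming the expression values have already been corrected and that Z-scores are computed per gene across samples. The function name, the dictionary layout and the toy values are hypothetical, and the toy threshold is lowered to 2.0 only because ten samples cannot produce an absolute Z-score as large as the cutoff quoted in the legend.

```python
from statistics import mean, stdev


def outlier_counts(residuals, threshold):
    """Count outlier genes per sample.

    residuals: dict mapping gene -> list of corrected expression values,
               one per sample, in the same sample order for every gene.
    Returns a list holding, for each sample, the number of genes whose
    absolute Z-score exceeds `threshold`.
    """
    n_samples = len(next(iter(residuals.values())))
    counts = [0] * n_samples
    for values in residuals.values():
        mu, sd = mean(values), stdev(values)
        if sd == 0:
            continue  # a constant gene cannot have outliers
        for i, value in enumerate(values):
            if abs((value - mu) / sd) > threshold:
                counts[i] += 1
    return counts


# Toy data: three genes measured in ten samples.
toy = {
    "geneA": [0.0] * 9 + [5.0],   # last sample is over-expressed
    "geneB": [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 1.1, 0.9, 1.0],
    "geneC": [-4.0] + [0.0] * 9,  # first sample is under-expressed
}
counts = outlier_counts(toy, threshold=2.0)
```

Averaging such counts within each batch gives the kind of per-batch summary displayed in the bar plots.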

EDF4: Impact of the number of controls on loss-of-function intolerance enrichment.

A: Enrichment of under-expression outliers from cases (n=64, red) in LoF-sensitive genes as the number of controls increases (7,600 random subsets for each sample size indicated in the legend). This enrichment was not observed for rare disease family member controls (gray, n=34). B: Benjamini & Hochberg adjusted -log10(p-value) associated with the enrichment at different numbers of controls (two-sided t-test, n=64 cases). Horizontal line indicates the 0.05 significance cutoff. The p-values decrease as the number of controls increases. When switching cases for controls (gray), we observed significant negative log odds when using a smaller number of controls, but this trend disappeared when using the full set of 900 controls. For A and B: Boxplots represent the median value, with lower and upper hinges corresponding to the 25th and 75th percentiles; lower and upper whiskers extend from the hinge to the smallest and largest value at most 1.5 * inter-quartile range from the hinge, respectively.
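The Benjamini & Hochberg adjustment mentioned in panel B can be sketched in a few lines. This is a generic implementation of the standard step-up procedure, written for illustration with made-up p-values; it is not taken from the analysis code.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.20]  # made-up raw p-values
adj = benjamini_hochberg(raw)
```

An adjusted value is then compared with the 0.05 cutoff (or plotted as -log10) in the same way as a raw p-value would be.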

EDF5: Percentage of samples left when filtering outliers.

Filters have varying impacts on the number of samples with at least one candidate gene. By combining several layers of filters, we drastically reduce the number of candidate genes but also the number of samples for which we have candidates. We recommend relaxing filter stringency after looking at the sets of genes that match the most stringent criterion. A: Expression outliers. After filtering for outlier genes matching HPO terms, with a deleterious rare variant within 10kb, we observed less than 2.6% of samples with over 25 candidate genes. B: Splicing outliers. When keeping only genes with an HPO match and a deleterious rare variant within 20bp of the outlier junction, we observed less than 1.3% of samples with more than 5 candidate genes.

EDF6: Correction for batch effects - Splicing data.

Analyses performed on 65 PIVUS samples and 143 rare disease samples. A: Plot of the first two principal components (PCs) computed on uncorrected splicing ratio data. Samples are coloured by batch. We observed that PC1 separated PIVUS control samples (left) from rare disease samples (right). B: Plot of the first two PCs on splicing ratios after regressing out the PCs that explain up to 95% of the variance in the data. Batches were no longer separated on the first PCs. C: Correlation between known covariates and the first 10 PCs. We observed that PC1 is highly correlated with batch, whereas PCs 2 and 3 separated samples from one institution (batch 1, CHEO) from the others. We also observed that PC1 is highly correlated with RIN, highlighting differences in quality across samples.

EDF7: Allele specific expression across rare disease samples

A: Prevalence of ASE events in rare disease samples (n=112). Results are displayed separately for exome and genome sequencing. B: Difference in the proportion of genes matching HPO terms for the top 20 ASE outliers per case in comparison to random genes (100 random gene sets for each sample, n=109 samples). Analysis performed for all genes, genes with pLI ≥ 0.9, genes with a rare variant (RV) and genes with an RV with CADD score ≥ 10. The top 20 ASE outlier genes are enriched for overlap with HPO-associated genes per case, regardless of the filters applied to the extreme ASE genes and background genes (p-value ≤ 1×10^−4, two-sided Wilcoxon test). For A and B: Boxplots represent the median value, with lower and upper hinges corresponding to the 25th and 75th percentiles; lower and upper whiskers extend from the hinge to the smallest and largest value at most 1.5 * inter-quartile range from the hinge, respectively. C: Rare deleterious variants are biased towards the alternative allele across all samples. A stop-gain variant was highly expressed in EFHD2 for one sample where there were matching symptoms.
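The boxplot convention described in these legends (hinges at the quartiles, whiskers reaching the most extreme values within 1.5 * inter-quartile range of the hinges) can be sketched as below. The function name and toy values are hypothetical, and the quartile rule is an assumption: `statistics.quantiles` with `method="inclusive"` is used here, while plotting libraries may compute hinges slightly differently.

```python
from statistics import quantiles


def boxplot_stats(values):
    """Compute hinges, whisker ends and outlying points for a boxplot."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # Whiskers stop at the most extreme data points inside the fences.
        "lower_whisker": min(v for v in data if v >= low_fence),
        "upper_whisker": max(v for v in data if v <= high_fence),
        # Points beyond the fences are drawn individually.
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
```

In this toy example the value 30 lies beyond the upper fence, so the upper whisker ends at 9 and 30 is reported as an outlying point.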

EDF8: Diagnostic rate after analysis of 80 distinct cases.

A: Overview of cases. Solved: causal gene found and further validated. Strong candidate: Strong candidate after RNAseq analysis (* Out of a subset of 30 affected individuals for which we have prior candidate gene information from the literature). Unsolved: Other cases for which further investigation is needed. B: Percentage of cases for which the prior candidate gene is in the final set of filtered genes (outlier with a deleterious rare variant in a gene linked to symptoms). Analysis was performed only on the subset of 30 cases for which we have both prior candidate gene information and genetic information. Shuffling candidates corresponds to the percentage of cases for which we observe a prior candidate gene in the most stringent gene list when shuffling gene lists across individuals (10,000 permutations). On average, no match is found. Shuffling genes corresponds to the percentage of prior candidate genes we observed within the final set of DNA-only filters when sampling from this list a matched number of genes corresponding to the expression filters. Average matched percentage is 4.1% after 10,000 permutations. Real data corresponds to the percentage of cases for which we found a candidate gene in the most stringent RNA-based filter set. We find a match for 7 affected samples out of 30, i.e. 25.9 % of cases. There are significantly more matches in real data in comparison to permuted data (two-sided Wilcoxon rank sum test, p-value<10^−5). Boxplots represent the median value, with lower and upper hinges corresponding to the 25th and 75th percentiles; lower and upper whiskers extend from the hinge to the smallest and largest value at most 1.5 * inter-quartile range from the hinge, respectively.
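The "shuffling candidates" control in panel B can be illustrated with a small sketch that reassigns filtered gene lists across individuals and recomputes the match percentage. The function names, case identifiers and gene names are all hypothetical placeholders, and the number of permutations is reduced to keep the example quick; this is not the code behind the figure.

```python
import random


def match_rate(prior, filtered):
    """Percentage of cases whose prior candidate gene is in their filtered list."""
    hits = sum(1 for case, gene in prior.items() if gene in filtered[case])
    return 100.0 * hits / len(prior)


def shuffled_match_rates(prior, filtered, n_perm=10_000, seed=0):
    """Match percentages after shuffling gene lists across individuals."""
    rng = random.Random(seed)
    cases = list(prior)
    gene_lists = [filtered[case] for case in cases]
    rates = []
    for _ in range(n_perm):
        rng.shuffle(gene_lists)  # reassign gene lists to cases at random
        rates.append(match_rate(prior, dict(zip(cases, gene_lists))))
    return rates


# Placeholder data: one prior candidate gene and one filtered list per case.
prior = {"case1": "GENE1", "case2": "GENE2", "case3": "GENE3"}
filtered = {
    "case1": {"GENE1", "GENE4"},
    "case2": {"GENE5"},
    "case3": {"GENE3"},
}
observed = match_rate(prior, filtered)
null_rates = shuffled_match_rates(prior, filtered, n_perm=1000)
mean_null = sum(null_rates) / len(null_rates)
```

Comparing `observed` with the distribution in `null_rates` mirrors the comparison between the real data and shuffled candidates described in the legend.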

EDF9: Identification of disease gene through expression outlier detection.

MECR case. A: Proband results. After our most stringent filter, there are 11 candidate genes left and MECR is ranked 2nd by Z-score. B: Proband’s brother. After filtering, only 15 out of 1,099 expression outliers are left and MECR is ranked 10th in that list.

EDF10: Solved case without genetic data: ASAH1 case.

A: After filtering our detected splicing outliers for genes related to the phenotype (through HPO IDs), only one candidate was left, ASAH1 (Z-score = 3.9), for which we previously confirmed the association with the SMA-PME phenotype in the case. B: Sashimi plot of the ASAH1 gene for the case and 2 controls. For the case (red track), we observed an alternative transcript skipping exon 6 (supported by 142 reads). This pattern was never observed in controls.
