
Introduction

The biology of cancer {#Intro-biology}

Cancer was the second leading cause of death worldwide in 2018, with almost 10 million deaths [@Bray2018], and could in the near future become the leading cause [@Dagenais2020]. The disease can affect different parts of the body, although some tissues are more frequently altered than others. Lung cancer, on which the work described in this manuscript focuses, is one of the most common cancers and the deadliest according to the 2018 GLOBOCAN database (a project of the [IARC]{acronym-label="IARC" acronym-form="singular+short"} providing worldwide cancer statistics) [@Bray2018]. Cancer is a complex disease that is highly controlled by the genome [@Stratton2009; @Wishart2015]. It originates from normal cells whose genetic information has been altered. Those alterations can result from endogenous processes as well as from exogenous processes like environmental exposures and lifestyle [@Eggert2011; @Luch2005]. As a result of these alterations, tumor cells acquire specific capabilities that allow them to grow in an uncontrolled way, in contrast to normal cells. These capabilities are referred to as the hallmarks of cancer and are listed in Figure [fig:intro_hallmarks]{reference-type="ref" reference="fig:intro_hallmarks"} [@Hanahan2011].

The hallmarks of cancer. From Hanahan and Weinberg [@Hanahan2011]{width="55%"}

[[fig:intro_hallmarks]]{#fig:intro_hallmarks label="fig:intro_hallmarks"}

The first part of the introduction describes how genomic changes can influence cancer development and how technological advances in genomics have shed light on the mechanisms involved.

The central dogma of molecular biology

In the first half of the 20th century, Avery and colleagues isolated and identified [DNA]{acronym-label="DNA" acronym-form="singular+short"} as the molecule constituting our chromosomes, previously defined as the carriers of our hereditary material by Thomas Morgan [@Avery1944; @Morgan]. In 1953, Watson and Crick proposed a new structure for the [DNA]{acronym-label="DNA" acronym-form="singular+short"} molecule, the double helix [@Watson1953] (See Figure [fig:intro_fig1]{reference-type="ref" reference="fig:intro_fig1"}A). Five years later, Francis Crick formulated how the information contained in the sequence of nucleic acids is processed to produce the proteins needed by our cells, in what is called the central dogma of molecular biology (Figure [fig:intro_fig1]{reference-type="ref" reference="fig:intro_fig1"}B-C).

The [DNA]{acronym-label="DNA" acronym-form="singular+short"} molecule and the central dogma of molecular biology. A) The structure of [DNA]{acronym-label="DNA" acronym-form="singular+short"}: the double helix molecule is composed of two complementary strands of nucleotides. B) Representation of the steps described by the central dogma of molecular biology. C) Illustration of the molecules resulting from the central dogma transfers at a higher resolution. Created with BioRender.com{width="95%"}

[[fig:intro_fig1]]{#fig:intro_fig1 label="fig:intro_fig1"}

Three main transfers are described by the central dogma: replication, transcription and translation (See Figure [fig:intro_fig1]{reference-type="ref" reference="fig:intro_fig1"}B). During replication, the [DNA]{acronym-label="DNA" acronym-form="singular+short"} molecule is duplicated to provide the needed information to progeny cells. Through the two other steps, the information contained in [DNA]{acronym-label="DNA" acronym-form="singular+short"} is used to generate proteins. First, transcription consists in reading the [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequence to synthesize a single-stranded molecule of the same length, the [RNA]{acronym-label="RNA" acronym-form="singular+short"}. During translation, the transcribed molecule is then read in a reading frame of three nucleotides; each triplet, called a codon, encodes one amino acid, the building block of a protein (See Figure [fig:intro_fig1]{reference-type="ref" reference="fig:intro_fig1"}C). Note that the genetic code is redundant; multiple codons can encode the same amino acid. The conversion of the information encoded in our genes into functional gene products like proteins is referred to as gene expression.
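As a minimal illustration of the reading-frame logic described above, the Python sketch below translates a short coding sequence codon by codon. The codon table is deliberately truncated to the few entries needed for the example (the real genetic code has 64 codons), so this is only a sketch, not a complete translator.

```python
# Minimal sketch of translation: read a coding sequence in triplets (codons)
# and map each codon to an amino acid. The codon table below is deliberately
# truncated to the few entries needed for this example.
CODON_TABLE = {
    "ATG": "M",  # methionine, also the start codon
    "GCC": "A",  # alanine
    "GCA": "A",  # alanine again: the genetic code is redundant
    "AAA": "K",  # lysine
    "TTC": "F",  # phenylalanine
    "TAA": "*",  # stop codon
}

def translate(cds: str) -> str:
    """Translate a coding DNA sequence into a one-letter protein string."""
    protein = []
    for i in range(0, len(cds) - 2, 3):          # step through the reading frame
        amino_acid = CODON_TABLE[cds[i:i + 3]]   # one codon -> one amino acid
        if amino_acid == "*":                    # a stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCAAATTCTAA"))  # -> "MAKF"
```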

Since the formulation of the central dogma, other mechanisms have been identified as determinant for the expression of a protein. Firstly, the [RNA]{acronym-label="RNA" acronym-form="singular+short"} molecule resulting from transcription, which contains regions coding for the final amino acid sequence (exons) and non-coding regions (introns), is actually a [pre-mRNAs]{acronym-label="pre-mRNAs" acronym-form="singular+short"}. The step transforming the precursor [RNA]{acronym-label="RNA" acronym-form="singular+short"} into mature [mRNAs]{acronym-label="mRNAs" acronym-form="singular+short"} is called splicing and consists in removing intronic regions and joining exons together; through alternative splicing, different combinations of exons can be joined (See Figure [fig:intro_fig1]{reference-type="ref" reference="fig:intro_fig1"}B). Hence, one pre-mRNA can lead to multiple [mRNAs]{acronym-label="mRNAs" acronym-form="singular+short"} that are then transported outside of the nucleus to be translated into different proteins. While around 20,000 genes are described, many more proteins are generated as a result of alternative splicing.

Although all of our cells share the same genetic information and follow the same dogma, cells in distinct tissues differentiate and do not express the same proteins at the same time. Such differences can be explained by the fact that several regulatory processes control gene expression levels. Firstly, gene transcription depends on transcription factors, which represent around 7% of genes [@Weinberg2014]. They bind specifically to control regions of genes, provide or prevent access to the [DNA]{acronym-label="DNA" acronym-form="singular+short"} and can control multiple genes [@Weinberg2014]. The fact that a gene, for example a transcription factor, can influence multiple genes and thus multiple possibly unrelated phenotypes is referred to as pleiotropy. After transcription, [mRNAs]{acronym-label="mRNAs" acronym-form="singular+short"} can also be regulated by other [RNA]{acronym-label="RNA" acronym-form="singular+short"} molecules, like the [miRNAs]{acronym-label="miRNAs" acronym-form="singular+short"}, which can trigger the degradation of [mRNAs]{acronym-label="mRNAs" acronym-form="singular+short"}. Besides, differences in gene expression can be controlled via non-genetic mechanisms like epigenetic processes, including histone modifications and [DNA]{acronym-label="DNA" acronym-form="singular+short"} methylation. Histones are proteins around which the [DNA]{acronym-label="DNA" acronym-form="singular+short"} is wrapped and which therefore control [DNA]{acronym-label="DNA" acronym-form="singular+short"} accessibility (Figure [fig:intro_fig2]{reference-type="ref" reference="fig:intro_fig2"}). For example, histone phosphorylation leads to the condensation of the chromatin and inhibits gene expression [@Weinberg2014]. [DNA]{acronym-label="DNA" acronym-form="singular+short"} methylation consists in the addition of a methyl group to cytosine nucleotides located at [CpG]{acronym-label="CpG" acronym-form="singular+short"} dinucleotide sites (a cytosine followed by a guanine nucleotide). Such positions are not homogeneously distributed across the genome and are more frequently observed in so-called [CpG]{acronym-label="CpG" acronym-form="singular+short"} islands, themselves mainly found in the regulatory regions of genes, the promoters. It has been observed that methylation of [CpG]{acronym-label="CpG" acronym-form="singular+short"} sites in promoters can repress gene expression, while methylation of positions in the gene body positively correlates with gene expression [@Ma2013a].

Regulation of transcription. The figure represents different configurations of [DNA]{acronym-label="DNA" acronym-form="singular+short"} packaging. The [DNA]{acronym-label="DNA" acronym-form="singular+short"} molecule is wrapped around histone proteins that are themselves gathered in complexes called nucleosomes. This packaging forms the chromatin structure. This structure can be more or less compact (open versus condensed chromatin), which influences gene expression. When the chromatin is open, transcription factors can access the [DNA]{acronym-label="DNA" acronym-form="singular+short"} molecule, and [RNA]{acronym-label="RNA" acronym-form="singular+short"} polymerases can initiate the transcription. Note that the structure of the chromatin can be influenced by histone modifications and [DNA]{acronym-label="DNA" acronym-form="singular+short"} methylation events. Created with BioRender.com{width="80%"}

[[fig:intro_fig2]]{#fig:intro_fig2 label="fig:intro_fig2"}

Finally, post-translational events like enzymatic modifications of proteins or protein cleavage can occur and increase the number of proteins that can be generated in human cells, adding yet another layer of complexity.

As such, the numerous steps of transferring the [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequence information to proteins reflect the complexity behind protein expression. Any of these steps can be disrupted and result in altered molecules and proteins, leading to cancer development.

Cancer: a genomic disease

Our [DNA]{acronym-label="DNA" acronym-form="singular+short"} continuously undergoes diverse alterations, and their accumulation over time can cause cancer. Researchers started to investigate the role of genomes in cancer at the end of the 19th century. In 1890, David von Hansemann, by observing cancer cell division under a microscope, identified abnormal chromosomes for the first time. This observation, among others, led Theodor Boveri 20 years later to suggest that cancer was a consequence of alterations in our inherited [DNA]{acronym-label="DNA" acronym-form="singular+short"} [@Stratton2009]. His hypothesis was supported in the mid 20th century by the identification of a recurrent alteration resulting in a peculiar chromosome 22 (the Philadelphia chromosome) in [CML]{acronym-label="CML" acronym-form="singular+short"}. While those alterations were observed at the chromosomal level, genomes can be impacted by a multitude of alterations detectable at a much finer scale, down to the modification of a single nucleotide in the [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequence.

At any position of the genome, the nucleotides might vary from one individual to another as well as between cells of an individual; those variations are called [SNVs]{acronym-label="SNVs" acronym-form="singular+short"}. Also, larger events like nucleotide [indels]{acronym-label="indels" acronym-form="singular+short"} of up to 1,000 bases and structural variations (chromosomal rearrangements or large [indels]{acronym-label="indels" acronym-form="singular+short"}) can alter the [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequence. All of these genomic changes are called mutations.

The timing of somatic mutation acquisition. Mutations can be inherited at birth (germline mutations, in green) or acquired during the life course (somatic mutations, in yellow, blue and red). They can have little to no impact (passenger mutations, represented by circles) or confer an advantage to the cell (driver mutations, represented by stars). Adapted from Stratton et al. [@Stratton2009]{width="100%"}

[[fig:intro_fig3]]{#fig:intro_fig3 label="fig:intro_fig3"}

Mutations can occur at different moments in life (See Figure [fig:intro_fig3]{reference-type="ref" reference="fig:intro_fig3"}). Some mutations are inherited at birth since they are present in the germ line cells (sperm and egg) transmitted by parents to the offspring. They are called germline mutations and are found in all the cells of an individual, normal cells as well as tumor cells. Such mutations are observed at different frequencies in different populations and are called [SNP]{acronym-label="SNP" acronym-form="singular+short"}s. Another category of mutations can also be found in all cells of the body even though they were not transmitted by our parents: those occurring early in life, during gestation. They are called de novo mutations. Finally, the rest of the mutations found in humans are acquired later in life as a result of errors in [DNA]{acronym-label="DNA" acronym-form="singular+short"} maintenance or of exogenous damage (See next section). Those mutations occur in cells outside the germ line and are called somatic mutations.

Also, whether they are germline or somatic, mutations can have different impacts. Most mutations have, due to the redundancy of the genetic code, little to no impact on the genes encoded around them; these are the passenger mutations [@Vogelstein2013]. Others, though, alter the gene product and confer a selective advantage to the cell, e.g. faster proliferation or better survival in comparison to neighbouring cells [@Stratton2009]. Those mutations are called driver mutations, as they are thought to contribute to "driving the carcinogenic process", and are preserved by positive selection. In 2018, the Cancer Gene Census described more than 700 driver genes (genes carrying driver mutations). Among them, 90% were associated with somatic mutations and 20% contained germline mutations [@Sondka2018; @CancerGeneCensus]. Generally, two types of driver genes exist: oncogenes and [TSG]{acronym-label="TSG" acronym-form="singular+short"}s. Oncogenes are genes whose functions are to promote cell growth and proliferation or to inhibit apoptosis, and their oncogenic activation usually results from a gain of function. A mutation in an oncogene can thus deregulate one of these processes, resulting in uncontrolled proliferation and cancer. The first mutation identified as causing cancer was discovered in 1982 by Reddy et al. and activated an oncogene named HRAS [@Reddy1982]. Besides mutations, other processes like over-expression of genes via amplification or chromosomal translocations can activate this category of genes. In contrast to oncogenes, [TSG]{acronym-label="TSG" acronym-form="singular+short"}s restrain cellular growth and proliferation and are often referred to as "gatekeeper" genes. Mutations in [TSG]{acronym-label="TSG" acronym-form="singular+short"}s tend to result in a loss of function: the genes are inactivated, their negative regulation of cell proliferation is lost, and abnormal growth follows. In 1971, Knudson proposed the two-hit hypothesis, which stipulates that both alleles (the versions of a gene inherited from our mother and father; identical alleles lead to the homozygous state, two different alleles to the heterozygous state) of a [TSG]{acronym-label="TSG" acronym-form="singular+short"} must be inactivated or lost for the gene to lose its normal functions [@Knudson1971]. This hypothesis seemed to explain familial cancer cases [@Martinez-Jimenez2020]. Indeed, when the first hit is an inherited germline mutation, the cancer susceptibility of a person increases since only one additional alteration is needed to disrupt the [TSG]{acronym-label="TSG" acronym-form="singular+short"} functions. The second alteration can result from different events: a mutation in the second allele, the loss or translocation of chromosome pieces, or the loss of an entire chromosome; the two latter events cause what is called [LOH]{acronym-label="LOH" acronym-form="singular+short"} [@Eggert2011].

In the case of the two-hit hypothesis, two mutations in the same gene are required for cancer initiation. However, it has been described that cancer is rather a multi-step process, meaning that multiple mutations in more than one gene are usually involved. A certain number of alterations in key pathways are necessary, and it can take several years for a cancer to develop [@Weinberg2014]. However, the multi-step process can be accelerated. Firstly, as mentioned previously, the inheritance of germline mutations speeds up cancer development: one driver mutation might be present from birth, increasing the probability that the remaining necessary events, which generally follow a stochastic process, will also occur [@Weinberg2014]. Also, even if multiple [DNA]{acronym-label="DNA" acronym-form="singular+short"} repair mechanisms fix most of the alterations that a genome endures, the [DNA]{acronym-label="DNA" acronym-form="singular+short"} repair pathways themselves can be disrupted, accelerating the accumulation of alterations. Such an event increases the mutation rate of an individual and generates what is called a "mutator phenotype" [@Stratton2009; @Loeb1991]. Finally, driver genes can also be altered by epigenetic changes, which occur more frequently and thus increase the chance of disrupting key biological pathways for cancer development.

Cancer: an environmental disease

Mutations can arise from endogenous processes, for example errors happening during [DNA]{acronym-label="DNA" acronym-form="singular+short"} replication. In that regard, the appearance of mutations across the genome seems random, and the advent of a driver mutation leading to cancer development seems associated with bad luck. This idea was developed by Tomasetti et al. [@Tomasetti2015] in a controversial paper, published in 2015, suggesting that the majority of cancer mutations were due to "bad luck". In 2017, the same authors confirmed that mutations due to random errors represent a large proportion of the mutations in multiple cancers, while specifying that, even if luck and randomness do play a role in cancer development, other factors like exogenous processes also impact our [DNA]{acronym-label="DNA" acronym-form="singular+short"} and contribute to cancer development [@Tomasetti2017].

Cancer incidence varies depending on the country considered. Lung cancer incidence, for example, is much higher in Asia, Europe and North America than in Africa [@globocan_lung]. Those differences can be explained by the fact that cancer has a heritable component that differs across parts of the world and by the fact that environmental exposures differ across countries. Studies exploring cancer rates in migrant populations have shown, though, that the differences observed among populations could not be explained by the genetic component alone [@Peto2001]. In the second half of the 20th century, epidemiological studies indicated that several environmental exposures were associated with cancer incidence, showing that many cancers could be prevented. One of the most striking findings was that of Doll et al., showing that smokers had a twenty-fold higher risk of developing lung cancer than non-smokers [@Doll1950]. In the same period, chemical agents were identified as being able to induce cancer, i.e. being carcinogenic [@Loeb2008]. Some of these agents were also defined as mutagenic agents, i.e. agents inducing [DNA]{acronym-label="DNA" acronym-form="singular+short"} damage.

Some carcinogens can impact cancer evolution without causing [DNA]{acronym-label="DNA" acronym-form="singular+short"} alterations; they are non-mutagenic agents and are considered tumor promoters. One example of a tumor promoter is alcohol, a cytotoxic substance. Its consumption leads to the death of epithelial cells in the mouth and throat, which triggers the division of stem cells to regenerate the epithelium. If tobacco consumption precedes this event, tobacco-induced mutations might be present in the dividing cells, and clonal expansion of these mutations may lead to cancer [@Weinberg2014]. In that case, smoking acts as a tumor initiator and alcohol as a promoter by stimulating cell proliferation. Such an interaction between alcohol and smoking is observed in head and neck cancers. Note, however, that alcohol can also have a mutagenic effect due to metabolites generated during ethanol oxidation, like acetaldehyde [@Seitz2010]. Other examples of tumor promoters are steroid hormones acting as mitogenic agents and chronic inflammation (e.g. due to viruses).

We have seen that mutations in our genome can result from endogenous processes like replication errors or [DNA]{acronym-label="DNA" acronym-form="singular+short"} repair defects and from exposure to carcinogens. Observing these mutations across the whole genome has revealed patterns. Indeed, each of these processes can generate what is called a mutational signature, i.e. a specific combination of mutations [@Alexandrov2013]. The first studies of mutational signatures focused on single base substitutions (six possible substitutions: C$>$A, C$>$T, C$>$G, T$>$A, T$>$C, T$>$G) and their tri-nucleotide contexts (the 5' and 3' nucleotides flanking the substitution), leading to 96 possible classes of mutations. The classification of all mutations found in cancer genomes into those 96 groups and the use of mathematical methods (See section 1.4{reference-type="ref" reference="Intro-method"}) to decompose the mutational processes enabled the identification of a limited but diverse set of signatures. In the case of lung cancers, comparing the DNA of smokers with that of non-smokers revealed an increase of mutations in smokers, mainly due to an elevation of C to A (C$>$A) mutations, probably caused by the tendency of tobacco carcinogens to induce this particular change [@Nik-Zainal2015]. In melanoma samples, an increase of C$>$T substitutions has been identified as a result of [UV]{acronym-label="UV" acronym-form="singular+short"} light exposure [@Alexandrov2014]. In 2015, COSMIC provided a curated set of 30 mutational signatures based on previously published studies on different cancer types [@Cosmic_2015]. Recently, the methods to disentangle mutational signatures in human genomes have been extended. In 2020, Alexandrov et al. considered a wider context to classify single base substitutions, using two flanking bases on each side of the mutated position, and also analyzed other types of mutations like double base substitutions and [indels]{acronym-label="indels" acronym-form="singular+short"}. This work expanded the repertoire of mutational signatures to more than 60 signatures in total [@Alexandrov2020].
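To make the 96-class construction concrete, the hedged Python sketch below enumerates those classes and assigns a single base substitution, given its trinucleotide context, to one of them. Collapsing purine-centred changes onto the pyrimidine strand follows the usual signature convention; the function and variable names are illustrative only.

```python
# Sketch: classify single base substitutions into the 96 trinucleotide classes
# used for mutational-signature analysis (6 pyrimidine-centred substitutions
# x 16 flanking-base combinations). Purine-centred changes are converted to
# their reverse complement so that the reference base is always C or T.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
BASES = "ACGT"
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

# The 96 classes, written e.g. "A[C>A]G" (5' base, substitution, 3' base)
CLASSES_96 = [f"{five}[{sub}]{three}"
              for sub in SUBSTITUTIONS for five in BASES for three in BASES]

def classify(trinucleotide: str, alt: str) -> str:
    """Return the 96-class label of a substitution given its trinucleotide
    context (reference strand, mutated base in the middle) and the alternate base."""
    five, ref, three = trinucleotide
    if ref in "AG":  # purine reference: switch to the reverse-complement strand
        five, ref, three = COMPLEMENT[three], COMPLEMENT[ref], COMPLEMENT[five]
        alt = COMPLEMENT[alt]
    return f"{five}[{ref}>{alt}]{three}"

assert len(CLASSES_96) == 96
print(classify("TCA", "A"))  # C>A in a T_A context -> "T[C>A]A"
print(classify("TGA", "T"))  # the same event seen on the opposite strand -> "T[C>A]A"
```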

Although some signatures result from endogenous processes, like defects in [DNA]{acronym-label="DNA" acronym-form="singular+short"} repair, or from unknown processes, multiple signatures have been associated with preventable exposures. Considering the important impact of environmental exposures, Wild suggested in 2005 the concept of the exposome, which corresponds to all the exposures encountered by an individual during their lifetime (e.g. lifestyle, exposure to chemicals). He expressed the need to measure such exposures on the same scale as genomic events are measured [@Wild2005]. On the genome side, indeed, remarkable technological advances have been made in the past decades, allowing researchers to explore the human genome at high resolution. The evolution of these technologies is described in the next section.

The era of genomics {#Intro-ngs}

From arrays to next generation sequencing

The identification of the genomic variations leading to cancer has been enabled by multiple technical and technological advances that occurred after the discovery of the [DNA]{acronym-label="DNA" acronym-form="singular+short"} structure. Since that discovery, researchers have attempted to decipher the information hidden in the double helix molecule. One fundamental advance in genomics was the development of first-generation sequencing by Frederick Sanger in the 1970s. After automatization, this technique led to the sequencing of the first human genome in the context of the [HGP]{acronym-label="HGP" acronym-form="singular+short"}, which started in the 1980s, took 13 years and cost around 3 billion dollars to deliver, in 2003, the sequence of the 3 billion nucleotides that constitute our [DNA]{acronym-label="DNA" acronym-form="singular+short"}. At that time, the largest genome sequenced was the 20,000 times smaller genome of the Epstein-Barr virus [@Roberts2001]. While many researchers thought it was impossible, the project was completed and delivered the first version of the human reference genome which, after being revised and improved, is now used on a day-to-day basis in genomics. However, first-generation sequencing was too slow and costly to be applied in larger research projects aiming, in that period, to catalogue the genetic variations involved in human diseases.

The array technology {#the-array-technology .unnumbered}

In the same period, microarray technologies were far less expensive. This technique consists in fixing on an array [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequences, called probes, designed to bind (by hybridization) to target sequences in a sample. The target sequences are labelled to measure the hybridization and quantify the target molecules.

Microarrays. A) [SNP]{acronym-label="SNP" acronym-form="singular+short"} arrays: fragmented [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequences bind to designed probes on the microarray, which generates an intensity signal that varies depending on the allele carried by the [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequences. B) Expression arrays: tagged complementary [DNA]{acronym-label="DNA" acronym-form="singular+short"}, reverse-transcribed from [mRNAs]{acronym-label="mRNAs" acronym-form="singular+short"} molecules, bind to gene-specific probes, which generates a fluorescence signal used to compare expression levels in different cell conditions. Created with BioRender.com{width="90%"}

[[fig:intro_arrays]]{#fig:intro_arrays label="fig:intro_arrays"}

In order to study genomic variations across the genome, specific microarrays were developed: the genotyping or [SNP]{acronym-label="SNP" acronym-form="singular+short"}s arrays. Those arrays contain unique probe sequences, targeting specific positions of the genome, which hybridize to single-stranded [DNA]{acronym-label="DNA" acronym-form="singular+short"} that has been fragmented. This generates intensity signals that vary depending on the allele carried by the [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequence binding to each probe. This intensity, indicating the presence or absence of each allele, is then converted into genotypes [@Laframboise2009] (See Figure [fig:intro_arrays]{reference-type="ref" reference="fig:intro_arrays"}A). The [SNP]{acronym-label="SNP" acronym-form="singular+short"} arrays developed for commercial purposes have evolved, interrogating from 10,000 to millions of sites simultaneously in a given individual [@Xing2016]. Key products of these technologies were developed by Affymetrix and Illumina Inc. Those arrays have been used for different purposes. They allowed the identification of copy number changes or, for arrays with high marker density, the detection of [LOH]{acronym-label="LOH" acronym-form="singular+short"} events by identifying regions without heterozygous positions [@Beroukhim2006; @Dutt2007]. They have also been used to identify germline variants associated with a given disease through [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} [@XueyingMao2007]. As illustrated in Figure [fig:intro_gwas]{reference-type="ref" reference="fig:intro_gwas"}, [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} interrogate millions of positions across the genome by individually testing their association with a specific trait, like smoking traits, and reveal positions significantly associated with that trait; a minimal sketch of such a per-variant test is shown after the figure.

Genome-wide association studies. The figure illustrates a [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} identifying [SNP]{acronym-label="SNP" acronym-form="singular+short"}s associated with the number of cigarettes smoked per day. For each position, the association between the variant genotypes and the number of cigarettes per day is tested (rs789 example). The association p-values are represented in a Manhattan plot (left panel). [SNP]{acronym-label="SNP" acronym-form="singular+short"}s reaching the genome-wide significance threshold of $5\times10^{-8}$ are considered as true associations. Those [SNP]{acronym-label="SNP" acronym-form="singular+short"}s do not, however, always correspond to the causal variant but often tag a nearby [SNP]{acronym-label="SNP" acronym-form="singular+short"} in linkage disequilibrium. Created with BioRender.com{width="100%"}

[[fig:intro_gwas]]{#fig:intro_gwas label="fig:intro_gwas"}
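To make this per-variant testing concrete, here is a minimal, hedged Python sketch of an association scan for a quantitative trait under an additive genetic model. The simulated genotypes, trait values, sample size and effect size are purely illustrative assumptions, not real study data.

```python
# Sketch of the per-variant test underlying a GWAS on a quantitative trait
# (additive model: genotypes coded as 0/1/2 copies of the effect allele).
# Data are simulated; in a real study genotypes come from the SNP array
# (plus imputation) and the trait from the phenotype questionnaire.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
n_individuals, n_snps = 5_000, 1_000
genotypes = rng.binomial(2, 0.3, size=(n_individuals, n_snps))  # 0/1/2 alleles
# Simulated "cigarettes per day": only SNP 0 has a true effect here.
trait = 10 + 1.5 * genotypes[:, 0] + rng.normal(0, 5, n_individuals)

# One linear regression per SNP, collecting the association p-values
p_values = np.array([linregress(genotypes[:, j], trait).pvalue
                     for j in range(n_snps)])

significant = np.where(p_values < 5e-8)[0]   # genome-wide significance threshold
print("SNPs reaching 5e-8:", significant)    # expected: only SNP 0
```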

Although [SNP]{acronym-label="SNP" acronym-form="singular+short"}s arrays are limited to the positions assayed, many more positions can be studied based on the arrays. Indeed, [SNP]{acronym-label="SNP" acronym-form="singular+short"}s are transmitted to the offspring together with other nearby [SNP]{acronym-label="SNP" acronym-form="singular+short"}s in blocks called haplotypes. This relationship between [SNP]{acronym-label="SNP" acronym-form="singular+short"}s is called [LD]{acronym-label="LD" acronym-form="singular+short"}. Knowing the [SNP]{acronym-label="SNP" acronym-form="singular+short"} composition of a haplotype makes it possible to predict the genotype of [SNP]{acronym-label="SNP" acronym-form="singular+short"}s that were not assayed by the array, using the information of the assayed positions in the haplotype. Hence, genotyping hundreds of thousands of [SNP]{acronym-label="SNP" acronym-form="singular+short"}s actually allows the imputation of the genotypes of millions of other variants thanks to [LD]{acronym-label="LD" acronym-form="singular+short"}; a toy sketch of this idea is given below. Defining haplotypes required, however, studying such genomic structure in different samples to build a reference map; this was the goal of the [HapMap]{acronym-label="HapMap" acronym-form="singular+short"} project, started in 2002 [@Belmont2003; @Hapmap_britannica].
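As a toy illustration of the [LD]-based reasoning described above, the sketch below predicts the allele at an untyped position from the reference haplotypes that match the alleles observed at nearby typed (tag) positions. The haplotypes and the function name `impute_untyped` are hypothetical; real imputation software relies on probabilistic models over large phased reference panels.

```python
# Toy sketch of haplotype-based imputation: the allele at an untyped SNP is
# predicted from the reference haplotypes that carry the same alleles at the
# nearby typed (tag) SNPs. This only conveys the idea that LD lets typed
# positions inform untyped ones.
REFERENCE_HAPLOTYPES = [
    # alleles at (tag SNP 1, tag SNP 2, untyped SNP) on phased reference chromosomes
    ("A", "C", "T"),
    ("A", "C", "T"),
    ("G", "T", "C"),
    ("A", "C", "T"),
    ("G", "T", "C"),
]

def impute_untyped(tag1: str, tag2: str) -> str:
    """Return the most frequent allele at the untyped SNP among reference
    haplotypes carrying the observed tag alleles."""
    matches = [hap[2] for hap in REFERENCE_HAPLOTYPES if hap[:2] == (tag1, tag2)]
    return max(set(matches), key=matches.count)

print(impute_untyped("A", "C"))  # -> "T": the A-C haplotype carries T
```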

Microarray platforms have also been used to study other molecular layers like the transcriptome and the methylome. For the analysis of expression profiles, microarrays have made it possible to measure and compare the expression levels of specific genes in cells under different conditions, e.g. diseased versus healthy cells or treated versus non-treated cells. Figure [fig:intro_arrays]{reference-type="ref" reference="fig:intro_arrays"}B describes the main steps of an expression array experiment. The [mRNAs]{acronym-label="mRNAs" acronym-form="singular+short"} molecules extracted from both types of cells, after being reverse-transcribed to [cDNA]{acronym-label="cDNA" acronym-form="singular+short"} and labelled with a fluorescent dye, hybridize to the gene-specific probes fixed on the array. The array is then scanned using fluorescent imaging [@Tarca2006]. The amount of fluorescence detected at each probe is proportional to the amount of [mRNAs]{acronym-label="mRNAs" acronym-form="singular+short"} in the cells. While these measures do not provide an absolute quantification of gene expression levels, they make it possible to compare expression levels between conditions. Arrays have also been used to study the epigenome by allowing the detection and analysis of methylation events. The most commonly used methylation arrays are the Illumina arrays [@Illumina_Methylation_infinium]. As for the [SNP]{acronym-label="SNP" acronym-form="singular+short"}s arrays, probes are designed to target specific loci of the human genome, in this case [CpG]{acronym-label="CpG" acronym-form="singular+short"} positions. The number of positions interrogated by such arrays varies from 25,000 to 850,000 depending on the array (e.g. Illumina 25K, 450K and 850K arrays). Probes are designed and fixed to the array to bind to both methylated and unmethylated loci (Figure [fig:intro_methylation]{reference-type="ref" reference="fig:intro_methylation"}). This binding is enabled by a chemical process called bisulfite conversion, which converts unmethylated cytosines to uracil and leaves methylated cytosines unchanged. At the hybridization step, a single-base extension is performed with labelled nucleotides, allowing the methylated and unmethylated signals to be distinguished at each locus (Figure [fig:intro_methylation]{reference-type="ref" reference="fig:intro_methylation"}). The ratio between the two signals at a locus provides a value, called the $\beta$ value, which indicates the level of methylation. This value ranges between 0 and 1, 0 corresponding to an unmethylated and 1 to a fully methylated position; a small computation example is given after the figure.

The Illumina Infinium methylation assay (From [@Illumina_Methylation_infinium]). This figure represents the probes used for methylation profiling by Illumina. A) Infinium type I probes. Two site-specific probes are found on the array: probes allowing methylated sites with the preserved cytosine to bind (methylated bead M) and probes designed for the unmethylated site with the thymine nucleotide resulting from bisulfite conversion and whole-genome amplification (unmethylated bead U). B) Infinium type II probes. Only one probe per locus is required to bind to both methylated and unmethylated sites. In that case, single-base extension with labelled nucleotides is used.{width="80%"}

[[fig:intro_methylation]]{#fig:intro_methylation label="fig:intro_methylation"}
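As a concrete illustration of the $\beta$ value described above, the short Python sketch below computes it from methylated and unmethylated probe intensities. The offset term (commonly set to 100 to stabilise the ratio at low intensities) and the intensity values are illustrative assumptions rather than the exact Illumina processing pipeline.

```python
# Sketch of the beta-value computation: the methylated signal divided by the
# total signal at each locus, with a small offset to stabilise the ratio when
# both intensities are low. The numbers below are made up for illustration.
import numpy as np

def beta_values(methylated: np.ndarray, unmethylated: np.ndarray,
                offset: float = 100.0) -> np.ndarray:
    """Return per-locus methylation levels in [0, 1]."""
    return methylated / (methylated + unmethylated + offset)

meth = np.array([12000.0, 300.0, 6000.0])    # methylated (M) intensities
unmeth = np.array([400.0, 11000.0, 5500.0])  # unmethylated (U) intensities
print(beta_values(meth, unmeth))  # ~[0.96, 0.03, 0.52]
```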

Next-generation sequencing {#next-generation-sequencing .unnumbered}

While the [SNP]{acronym-label="SNP" acronym-form="singular+short"} arrays provided access to the genotype information of millions of positions, there was still a need to re-sequence human genomes more efficiently and to access the complete [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequence to better identify genetic variations. Around 2005, the second generation of sequencing methods, called [NGS]{acronym-label="NGS" acronym-form="singular+short"}, was developed.

Next Generation Sequencing methods. The figure describes the [NGS]{acronym-label="NGS" acronym-form="singular+short"} steps consisting in: i) fragmenting the nucleic acid molecule, ii) amplifying the fragments (using [PCR]{acronym-label="PCR" acronym-form="singular+short"}), iii) sequencing the resulting copies using single-base extension, which adds labelled nucleotides one after the other, their signals being detected by digital imaging. The sequencing reads are then aligned to a reference genome to assemble them into a single sequence or to detect mutations across the genome. In the case of [RNA]{acronym-label="RNA" acronym-form="singular+short"} sequencing, the reads align to exonic regions of the genes and are counted to quantify gene expression levels. Created with BioRender.com{width="80%"}

[[fig:intro_ngs]]{#fig:intro_ngs label="fig:intro_ngs"}

The main change in these new methods, in comparison to the first generation, was the parallelization of the sequencing, which allowed millions of sequences, called reads, to be produced at the same time and hence drastically decreased the time and cost of sequencing [@Wetterstrand] (Figure [fig:intro_ngs]{reference-type="ref" reference="fig:intro_ngs"}). [NGS]{acronym-label="NGS" acronym-form="singular+short"} methods enabled the rapid re-sequencing of different parts and lengths of the genome. The entire genome sequence (except some highly problematic regions) can be accessed with [WGS]{acronym-label="WGS" acronym-form="singular+short"}. The restricted sequencing of coding regions (exonic regions) can be performed with [WES]{acronym-label="WES" acronym-form="singular+short"}. Finally, it is possible to sequence specific regions of the genome, usually genes, using targeted sequencing. Based on these techniques, bioinformatics methods have been developed to detect germline as well as somatic variants. They consist in mapping (or aligning) the sequenced reads to a reference genome; positions that differ from the reference are identified as variants (Figure [fig:intro_ngs]{reference-type="ref" reference="fig:intro_ngs"}). A mismatch between a sequenced genome and the reference genome is expected around every 1,000 bases. To distinguish somatic from germline mutations, [DNA]{acronym-label="DNA" acronym-form="singular+short"} from both tumor and normal cells of the same individual has to be sequenced. The tumor [DNA]{acronym-label="DNA" acronym-form="singular+short"} is compared to the normal [DNA]{acronym-label="DNA" acronym-form="singular+short"}, and variants found only in the tumor cells are classified as somatic mutations; a simplified sketch of this comparison is given below. Somatic mutations are expected approximately every 1,000,000 bases, depending on the cancer type [@Alexandrov2013].
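The sketch below illustrates, in a highly simplified way, the tumor/normal comparison just described: read counts at a candidate position are turned into variant allele fractions and the variant is labelled germline or somatic. The thresholds and names are illustrative assumptions; production somatic callers use probabilistic models, base and mapping qualities and many additional filters.

```python
# Simplified sketch of tumor/normal variant classification from allele counts.
from dataclasses import dataclass

@dataclass
class SiteCounts:
    ref_reads: int  # reads supporting the reference allele
    alt_reads: int  # reads supporting the alternate allele

    @property
    def vaf(self) -> float:  # variant allele fraction
        total = self.ref_reads + self.alt_reads
        return self.alt_reads / total if total else 0.0

def classify_variant(tumor: SiteCounts, normal: SiteCounts,
                     min_vaf: float = 0.05, max_normal_vaf: float = 0.02) -> str:
    if tumor.vaf < min_vaf:
        return "no variant / noise"           # too little support in the tumor
    if normal.vaf > max_normal_vaf:
        return "germline variant"             # also present in the normal sample
    return "somatic mutation"                 # tumor-only variant

print(classify_variant(SiteCounts(80, 20), SiteCounts(100, 0)))  # somatic mutation
print(classify_variant(SiteCounts(55, 45), SiteCounts(52, 48)))  # germline variant
```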

While the [DNA]{acronym-label="DNA" acronym-form="singular+short"} sequencing techniques have been used to detect [DNA]{acronym-label="DNA" acronym-form="singular+short"} mutations, they do not explore the expression or methylation layers. In 2008, sequencing of the [RNA]{acronym-label="RNA" acronym-form="singular+short"} molecule ([RNA-Seq]{acronym-label="RNA-Seq" acronym-form="singular+short"}) was performed to study expression profiles. In this technique, the [mRNAs]{acronym-label="mRNAs" acronym-form="singular+short"} molecules are fragmented and converted to complementary [DNA]{acronym-label="DNA" acronym-form="singular+short"} before sequencing, and the resulting reads are aligned to the reference genome [@Wang2009]. After the alignment step, the reads can be assigned to genes, and the abundance of reads mapped to a gene, quantified as the number of mapped reads, reflects the expression level of that gene (Figure [fig:intro_ngs]{reference-type="ref" reference="fig:intro_ngs"}). A high read count indicates that a gene is active and transcribed in that sample. Comparing read count distributions between samples from different conditions, e.g. samples with and without disease or diseased samples under different treatments, can be used to identify genes involved in or causing a specific condition; a minimal normalization example is sketched below. [RNA-Seq]{acronym-label="RNA-Seq" acronym-form="singular+short"} can also be used to identify different transcripts of a gene as well as gene rearrangements like translocations.
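To illustrate the read-count comparison described above, here is a minimal Python sketch that normalizes raw counts by library size (counts per million) and computes a per-gene log2 fold change between conditions. The count values are invented, and dedicated tools such as DESeq2 or edgeR apply more sophisticated normalization and statistical testing.

```python
# Minimal sketch of comparing RNA-Seq read counts between two conditions:
# scale raw counts by library size (counts per million, CPM) so that samples
# sequenced at different depths become comparable, then compute a log2 fold
# change per gene.
import numpy as np

# rows: features, columns: samples (2 healthy, then 2 diseased)
counts = np.array([
    [      500,      450,      2000,      2200],  # gene A: higher in disease
    [      300,      320,       350,       380],  # gene B: roughly unchanged
    [1_000_000,  900_000, 1_200_000, 1_100_000],  # aggregate of all other genes
])
library_sizes = counts.sum(axis=0)        # total mapped reads per sample
cpm = counts / library_sizes * 1e6        # counts per million

healthy = cpm[:, :2].mean(axis=1)
diseased = cpm[:, 2:].mean(axis=1)
log2_fc = np.log2(diseased / healthy)
print(np.round(log2_fc[:2], 2))           # gene A ~ +1.9, gene B ~ 0
```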

Note that other recent techniques, while not described in this thesis, also exist to access different omics layers. A sequencing technique has been developed for the analysis of the methylome, bisulfite sequencing, which in contrast to methylation arrays can interrogate millions of [CpG]{acronym-label="CpG" acronym-form="singular+short"} positions across the whole genome as well as positions in targeted regions. Also, the study of chromatin accessibility and of [DNA]{acronym-label="DNA" acronym-form="singular+short"}-binding proteins is possible thanks to [ATAC-seq]{acronym-label="ATAC-seq" acronym-form="singular+short"} and to chromatin immunoprecipitation followed by sequencing ([ChiP-Seq]{acronym-label="ChiP-Seq" acronym-form="singular+short"}), respectively [@Furey2012; @Yan2020]. Finally, while the sequencing methods presented so far process [DNA]{acronym-label="DNA" acronym-form="singular+short"} coming from a bulk of cells, single-cell sequencing methods have been developed to perform molecular characterization at the level of individual cells. These methods allow the identification of distinct populations of cells in a tumor and hence the study of tumor heterogeneity and the tumor microenvironment [@Hwang2018; @Finotello2019].

The decreasing costs of genotyping and sequencing methods have enabled genomics studies involving large cohorts [@Wetterstrand]. Sequencing a human genome today costs less than 1,000 dollars using [NGS]{acronym-label="NGS" acronym-form="singular+short"} methods, while it would still cost millions with the Sanger method. Multiple research groups have coordinated their efforts to create large consortia for that purpose and, in many cases, have shared the resulting data with the scientific community. The next section provides an overview of some of these initiatives.

Large public databases

The Cancer Genome Atlas {#the-cancer-genome-atlas .unnumbered}

[TCGA]{acronym-label="TCGA" acronym-form="singular+short"} is a public database providing access to data from around 10,000 patients whose tumors have undergone multi-omics characterization. The project was launched in 2005 by the [NIH]{acronym-label="NIH" acronym-form="singular+short"} and aimed at characterizing the genomic alterations underlying several cancer types. For that purpose, multiple omics data were generated [@Tomczak2015]. The tumor and normal samples from most of the [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} participants have been sequenced using [WES]{acronym-label="WES" acronym-form="singular+short"}. Based on these data, multiple variant callers have been used to catalogue the germline and somatic mutations present in each sample. Genotyping has been performed to analyze copy number variations. The transcriptome of most samples has also been sequenced, using [RNA]{acronym-label="RNA" acronym-form="singular+short"} and [miRNAs]{acronym-label="miRNAs" acronym-form="singular+short"} sequencing. The methylation profiles of the tumors were explored with 25K or 450K methylation arrays. Finally, protein expression profiling has been performed based on [RPPA]{acronym-label="RPPA" acronym-form="singular+short"}. In addition to the molecular data, clinical and environmental exposure data were collected when possible. The [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} projects also delivered the histopathological images associated with each tumor. Based on these diverse omics and clinical datasets, "marker papers" describing the molecular landscape of each tumor type have been published. While the tissues explored at the beginning of the initiative were limited to lung, brain and ovaries, the [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} data today encompass molecular data from 33 different cancer types. Those cancer-specific studies led to the identification of the genomic alterations causing each cancer type, and hence to the discovery of new driver genes and potential cancer biomarkers, i.e. molecules found in the body as indicators of a disease or specific condition. Also, cancer subtypes were characterized at the molecular level and subtype-specific alterations were identified, which resulted in new clinical management of tumors [@Weinstein2013]. In parallel to the cancer-specific studies, the [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} research network launched, in 2012, the Pan-Cancer Atlas initiative aiming at exploring the commonalities between cancer types, distinguishing tissue-specific determinants of cancer and increasing the statistical power for the identification of genomic alterations [@Weinstein2013]. This initiative was completed in 2018, and the data have been released and associated with 27 papers, published in Cell, focusing on three main topics: i) cell-of-origin patterns and cancer subgrouping, ii) oncogenic processes, and iii) signaling pathways involved in cancer [@PanCanatlas_site].

The International Cancer Genome Consortium (ICGC) initiatives {#the-international-cancer-genome-consortium-icgc-initiatives .unnumbered}

The [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} studies focused their efforts on the characterization of cancer exomes. However, exomes represent only 1% of the human genome, and much more can be discovered by exploring the remaining 99% of the genome. In 2007, the [ICGC]{acronym-label="ICGC" acronym-form="singular+short"} project was launched to study more than 20,000 whole genomes from 50 cancer types having an impact in multiple regions of the world (the 25k initiative). The international consortium aimed at generating a catalogue of the somatic mutations in those cancer types, sharing the resulting datasets and complementing them with transcriptomic and epigenomic datasets [@Hudson2010; @Cieslik2020]. Based on the samples included in the [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} and [ICGC]{acronym-label="ICGC" acronym-form="singular+short"} projects, the [PCAWG]{acronym-label="PCAWG" acronym-form="singular+short"} project, an [ICGC]{acronym-label="ICGC" acronym-form="singular+short"} initiative also known as the Pan-Cancer project, has arisen [@Campbell2020]. The project relied on more than 2,600 samples from 38 different tumor types and aimed at meta-analyzing whole-genome data across cancers, along the same lines as the Pan-Cancer Atlas project. The first results from these data were released in 2020 in a series of publications in Nature [@Cieslik2020]. While the [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} initiative enabled the study of the coding regions of the samples, the [PCAWG]{acronym-label="PCAWG" acronym-form="singular+short"} project, thanks to the use of whole genome sequences, was designed to explore broader mutational patterns in coding and non-coding regions, from small events to large ones like structural variations. For example, chromoplexy and chromothripsis events, which are complex chromosomal rearrangements resulting from catastrophic genomic events, have been observed in more cancers than expected, in 17.8% and 22.3% of the tumors, respectively [@Cieslik2020]. Also, one major result from the [PCAWG]{acronym-label="PCAWG" acronym-form="singular+short"} project has been the expansion of the mutational signatures mentioned in section 1.1{reference-type="ref" reference="Intro-biology"} [@Alexandrov2020], as well as the discovery of 16 structural variant signatures [@Li2020].

UKbiobank {#ukbiobank .unnumbered}

The previously described projects mainly targeted the somatic landscape of genomes. Other large projects have enabled the research community to explore the germline component of human disease. The largest public dataset focusing on germline genetics has been generated by the UKbiobank project, which started in 2010 in the UK. This project gathered data from a population-based cohort of around 500,000 participants aged between 40 and 69 [@Bahcall2018] and had as its main objective to improve our understanding of the interactions between genetics and multiple human diseases. For that purpose, all participants were genotyped. In addition, multiple other biological samples, like urine, blood and saliva, as well as physical measures, e.g. brain [MRI]{acronym-label="MRI" acronym-form="singular+short"} and heart and eye measurements, were collected. It is a prospective cohort; participants are followed up and linked to electronic health records [@Bycroft2018]. The genotyping data of the full cohort were released in 2017. Based on this dataset and the large panel of phenotypes, a multitude of [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} studies related to human diseases have been performed and their resulting summary statistics made available. In 2019, around 100 [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} studies resulting from the UKbiobank data were available in the [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} catalogue, which provides curated [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} summary statistics [@Buniello2019]. The follow-up of the participants established that, by 2018, 79,000 of them had been diagnosed with cancer [@Bycroft2018], which means that cancer-related traits can also be studied using this dataset. After the release of the genotyped and imputed data, [WES]{acronym-label="WES" acronym-form="singular+short"} and [WGS]{acronym-label="WGS" acronym-form="singular+short"} sequencing of the samples was initiated. Part of the exome data, around 50,000 exomes, has already been released, and about 200,000 exomes are expected by the end of 2020. These data foreshadow future key findings in genomics, a better understanding of molecular and phenotypic interactions and probably an improved translation of those findings into the clinic.

Data sharing {#data-sharing .unnumbered}

With the increasing number of genomics studies, public repositories, like [dbGAP]{acronym-label="dbGAP" acronym-form="singular+short"}, the [EGA]{acronym-label="EGA" acronym-form="singular+short"} or [GEO]{acronym-label="GEO" acronym-form="singular+short"}, have been established to store petabytes of genomics data that can be accessed by the research community. In addition, large projects, like [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} and [ICGC]{acronym-label="ICGC" acronym-form="singular+short"}, have worked on solutions to improve data storage and accessibility. One of the goals of those projects was to promote open-access data and the development of tools to foster the reuse of the data by the research community [@Weinstein2013; @Hudson2010]. In 2010, the [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} provided data in open access for the first time [@TCGA_milestones] and updated and extended the content of the open-access data over the years. In 2016, the [GDC]{acronym-label="GDC" acronym-form="singular+short"} was launched by the [NCI]{acronym-label="NCI" acronym-form="singular+short"} to store all the [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} data [@Jensen2017]. For each omics layer, the data are categorized by level: low-level data (raw and unnormalized data), which generally enable the re-identification of individuals, are under controlled access, while higher-level data (processed data, clinical data), which do not permit re-identification, are available without restriction. In addition to providing data storage, the [GDC]{acronym-label="GDC" acronym-form="singular+short"} also aimed at harmonizing and sharing the bioinformatics pipelines used to process the data [@Jensen2017; @Gao2019]. The processed data resulting from the Pan-Cancer Atlas papers are also available via the [NIH]{acronym-label="NIH" acronym-form="singular+short"} [GDC]{acronym-label="GDC" acronym-form="singular+short"} website [@PanCancerAtlas_data] and allow researchers to explore broader genomic features like immune variables [@Thorsson2018] or biological pathway measures [@Knijnenburg2018]. Also, cloud computing solutions have been developed to facilitate the analysis of large public genomic datasets while avoiding the download and duplication of the data. The [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} and [ICGC]{acronym-label="ICGC" acronym-form="singular+short"} data are available and can be analyzed on the cloud, for example via the [CGC]{acronym-label="CGC" acronym-form="singular+short"} [@Lau2017] or the [ISB-CGC]{acronym-label="ISB-CGC" acronym-form="singular+short"} [@Reynolds2017]. Also, the [ICGC]{acronym-label="ICGC" acronym-form="singular+short"} consortium, to process the [PCAWG]{acronym-label="PCAWG" acronym-form="singular+short"} data, has developed a computational tool, Butler, which simplifies genomic analyses that have to be run in cloud environments (academic or commercial) [@Yakneen2020].

In the past decades, the development of genomics technologies and the implementation of large consortia have made it possible to characterize human cancers at the molecular level. Our understanding of cancer causes and of the biological mechanisms underlying tumor development has improved. Also, thanks to the identification of correlations between molecular events and patients' prognosis and response to treatment, molecular studies have impacted the way tumors are classified and managed in the clinic.

The example of lung cancer {#Intro-lung}

Lung cancer subtypes and etiology

Lung cancer subtypes. Each lung cancer type occurs at a different frequency and at a distinct location in the lung (from proximal to distal locations). Each box in the figure is associated with one cancer type and provides its characteristics (frequency, localisation, etiology and overall 5-year survival rate (5y SR)) [@Travis2010; @Derks2018; @Simbolo2019a; @ASCO]. Figure created with BioRender.com{width="100%"}

[[fig:intro_lung]]{#fig:intro_lung label="fig:intro_lung"}

As mentioned at the beginning of the manuscript, lung cancer is one of the most common and deadliest cancers worldwide. Several subtypes of lung cancer have been identified (Figure [fig:intro_lung]{reference-type="ref" reference="fig:intro_lung"}). The most common lung cancers are usually divided into two groups: the [SCLC]{acronym-label="SCLC" acronym-form="singular+short"} and the [NSCLC]{acronym-label="NSCLC" acronym-form="singular+short"} tumors, representing around 20% and 75% of lung cancers, respectively [@Politi2015]. The second group is further separated into two main subgroups: the [LUAD]{acronym-label="LUAD" acronym-form="singular+short"} and the [LUSC]{acronym-label="LUSC" acronym-form="singular+short"}. Also, rarer forms of lung cancer exist. Multiple lung cancer subtypes, including such rarer cancers, were grouped into one category, the lung neuroendocrine tumors, by the 2015 [WHO]{acronym-label="WHO" acronym-form="singular+short"} classification [@Travis2015]. This group comprises the pulmonary carcinoids, including the typical and atypical carcinoids, [LCNEC]{acronym-label="LCNEC" acronym-form="singular+short"} as well as the previously mentioned [SCLC]{acronym-label="SCLC" acronym-form="singular+short"} tumors. Each lung cancer type can be distinguished by different etiologies, histopathological characteristics, molecular profiles and clinical outcomes (See Figure [fig:intro_lung]{reference-type="ref" reference="fig:intro_lung"}).

The strongest risk factor for lung cancer is smoking. Indeed, [SCLC]{acronym-label="SCLC" acronym-form="singular+short"}s and [LCNEC]{acronym-label="LCNEC" acronym-form="singular+short"}s are frequently found in heavy smokers. Smoking is also a major risk factor for [LUAD]{acronym-label="LUAD" acronym-form="singular+short"} and [LUSC]{acronym-label="LUSC" acronym-form="singular+short"} cancers [@Campbell2016]. However, lung cancer can also develop in non-smokers. In particular, the [LUAD]{acronym-label="LUAD" acronym-form="singular+short"} category corresponds to the lung cancer type most commonly found in never smokers. Although the etiology of the pulmonary carcinoids is not clear, the majority of these tumors are found in non-smokers [@Derks2018]. In addition, only around 15% of smokers develop lung cancer, suggesting that other factors mediate lung cancer risk.

Lung cancer susceptibility

While exposures other than smoking, like air pollution, radon, heavy metals or asbestos, have been identified as lung cancer risk factors [@DeAlencar2020], genetics also contributes to the disease risk. In line with this hypothesis, it has been shown that having a family history of lung cancer confers a 2.5-fold increase in lung cancer risk [@Amos1999]. Further evidence of germline susceptibility to lung cancer has been revealed by [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} studies, with the identification of common variants associated with lung cancer. Genes involved in nicotine addiction (CHRNA genes) and telomere maintenance (TERT), as well as genes related to the DNA repair and cell-cycle pathways (e.g. CHEK2, RAD52 or CDKN2A), have been identified [@Bosse2018]. Also, some lung cancer associated variants were identified as related to the propensity to smoke [@Thorgeirsson2008; @McKay2017], and genetic correlations between lung cancer and smoking traits, like smoking initiation, smoking cessation or smoking intensity, have been described [@McKay2017]. Such observations provided evidence that susceptibility variants could influence lung cancer risk through environmental exposures. Hence, [GWAS]{acronym-label="GWAS" acronym-form="singular+short"} studies have provided insights into lung cancer etiology as well as into the biological pathways involved in the disease. However, the variants identified so far do not account for most of the heritability of lung cancer, estimated at 18% and today still largely unexplained [@McKay2017].

Lung cancer molecular profiling

In the past decades, the molecular profiles of human tumors, including lung tumors, have also been explored thanks to the development of [NGS]{acronym-label="NGS" acronym-form="singular+short"} studies. Such studies have, for example, established that lung cancers are among the cancer types with the highest mutational burden (total number of mutations for a given amount of [DNA]{acronym-label="DNA" acronym-form="singular+short"}) [@Lawrence2013]. As mentioned in Section 1.1{reference-type="ref" reference="Intro-biology"}, in smoking-related cancers those mutations revealed a signature associated with tobacco consumption. Among the [COSMIC]{acronym-label="COSMIC" acronym-form="singular+short"} signatures identified by Alexandrov et al. [@Alexandrov2013; @Alexandrov2020], the smoking signature corresponds to Signature 4 ([COSMIC]{acronym-label="COSMIC" acronym-form="singular+short"} version 2) and SBS4 ([COSMIC]{acronym-label="COSMIC" acronym-form="singular+short"} version 3). Those signatures result from [DNA]{acronym-label="DNA" acronym-form="singular+short"} damage caused mainly by benzo[a]pyrene, a mutagenic compound found in tobacco smoke whose effects on [DNA]{acronym-label="DNA" acronym-form="singular+short"} have been shown in experimental mutagenesis studies [@Nik-Zainal2015]. Even though smoking heavily impacts the lung tissue, it has been shown that quitting smoking can restore the damaged tissue [@Yoshida2020].

In addition, molecular analyses of lung tumors have identified cancer driver genes in the different cancer types. Among those genes, the [EGFR]{acronym-label="EGFR" acronym-form="singular+short"} gene, which belongs to the protein kinase family and is currently known to be mutated in around 15% of [LUAD]{acronym-label="LUAD" acronym-form="singular+short"} samples [@Collisson2014], was linked to therapeutic response in 2004 [@Politi2015]. Indeed, [LUAD]{acronym-label="LUAD" acronym-form="singular+short"} tumors carrying activating mutations in the [EGFR]{acronym-label="EGFR" acronym-form="singular+short"} gene respond to tyrosine kinase inhibitor therapy, and these patients show improved survival compared with cancer patients treated with chemotherapy. Such molecular studies largely influenced the way lung tumors are classified, since it is only since these discoveries that [NSCLC]{acronym-label="NSCLC" acronym-form="singular+short"}s have been further sub-classified. Guidelines published in 2013 introduced molecular testing, mainly based on [EGFR]{acronym-label="EGFR" acronym-form="singular+short"} and ALK alterations, into clinical practice for [NSCLC]{acronym-label="NSCLC" acronym-form="singular+short"} patients. In 2018, these guidelines were updated and new alterations, such as rearrangements of the tyrosine kinase ROS1, are now recommended for molecular testing [@Lindeman2018]. In 2012 and 2014, the [TCGA]{acronym-label="TCGA" acronym-form="singular+short"} marker papers on the two lung cancer cohorts ([LUAD]{acronym-label="LUAD" acronym-form="singular+short"} and [LUSC]{acronym-label="LUSC" acronym-form="singular+short"}) were published. The authors expanded the molecular profiling of these tumors and hence the list of driver genes, improving the understanding of the biological mechanisms involved and providing new opportunities for patient management [@Network2012; @Collisson2014]. Those studies also explored the transcriptomic, methylation and proteomic data from the lung tumors. Based on their expression profiles, [LUAD]{acronym-label="LUAD" acronym-form="singular+short"} tumors were divided into subtypes that could help refine the classification of these tumors [@Collisson2014].

The identification of driver genes in lung cancer has also led to the proposal of molecular targets for early detection. The molecular profiling of [SCLC]{acronym-label="SCLC" acronym-form="singular+short"}s is an example of such an application. [SCLC]{acronym-label="SCLC" acronym-form="singular+short"}s are characterized by the universal inactivation of both the RB1 and TP53 genes [@Peifer2012a; @George2015; @Fernandez-Cuesta2019]. In 2016, Fernandez-Cuesta et al. analyzed [ctDNA]{acronym-label="ctDNA" acronym-form="singular+short"}, i.e. fragments of tumor [DNA]{acronym-label="DNA" acronym-form="singular+short"} released into the bloodstream that can be used as molecular biomarkers, in [SCLC]{acronym-label="SCLC" acronym-form="singular+short"}s. They showed that TP53 mutations were detectable in the [ctDNA]{acronym-label="ctDNA" acronym-form="singular+short"} of the [SCLC]{acronym-label="SCLC" acronym-form="singular+short"} cases [@Fernandez-cuesta2016]. [ctDNA]{acronym-label="ctDNA" acronym-form="singular+short"} applications extend to multiple cancer types: in 2018, Cohen et al. described a blood test called CancerSEEK, which detects proteins and mutations in cell-free [DNA]{acronym-label="DNA" acronym-form="singular+short"} for the early detection of eight different cancer types, including lung cancer [@Cohen2018]. Such tests nevertheless face sensitivity issues due to the low abundance of mutated [DNA]{acronym-label="DNA" acronym-form="singular+short"} in body fluids; hence, adapted bioinformatics tools are needed. I contributed to the optimization of one such tool, Needlestack, a highly sensitive multi-sample variant caller [@Delhomme2020].

Even though rare forms of lung cancer are less explored than the common ones, recent molecular studies have started to characterize the lung neuroendocrine tumors as well [@Fernandez-Cuesta2014; @George2018; @Rekhtman2016; @Simbolo2019]. These studies have revealed that, on top of their histopathological differences, the lung neuroendocrine neoplasms are also distinct molecular entities [@Fernandez-Cuesta2019]. A low mutational burden has been observed in typical and atypical pulmonary carcinoids, in contrast to the highly mutated [LCNEC]{acronym-label="LCNEC" acronym-form="singular+short"}s and [SCLC]{acronym-label="SCLC" acronym-form="singular+short"}s [@Derks2018]. The transcriptomic profiles of these tumors have also been investigated; these analyses identified molecular subgroups within the different cancer types, revealing their molecular heterogeneity [@George2018; @Rudin2019]. The work described in chapters [Chapter1]{reference-type="ref" reference="Chapter1"} and [Chapter2/4]{reference-type="ref" reference="Chapter2/4"} of this thesis contributed to the molecular characterization of the lung neuroendocrine tumors.

The discoveries described in this section were made possible by the large amount of data generated during the era of genomics (See Section 1.2{reference-type="ref" reference="Intro-ngs"}). However, the analysis of these data has raised multiple challenges that required the use and development of specific computational methods. The next section describes those aspects.

Interpreting high dimensional data {#Intro-method}

The evolution of genotyping and sequencing technologies has led to the generation of high dimensional datasets. In Section 1.2{reference-type="ref" reference="Intro-ngs"}, we have seen for example that arrays can interrogate thousands to millions of positions across the genome and that sequencing techniques can provide the entire genome sequence or the expression levels of thousands of genes. While the amount of information unveiled by these methods is colossal, it also brings several challenges, and adapted computational methods are required to analyze and interpret the data. The issues resulting from high dimensionality are associated with what is called the curse of dimensionality, first introduced by Bellman in 1961, which stipulates that the number of samples needed to interpret high dimensional data analyses appropriately increases exponentially with the number of dimensions [@Altman2018]. In omics datasets, even though large cohorts have been implemented (see Section 1.2{reference-type="ref" reference="Intro-ngs"}), the number of variables (also known as features), $p$, can greatly exceed the number of samples, $n$, included in the study. This introduces the $n \ll p$ problem, which leads to multiple issues.

Firstly, usual statistical models like regression models need to be adapted since they require $p<n$. There is also a substantial amount of noise in the generated data that can mask the true signal, i.e. not all the measured features are of interest [@Domingos2012; @Ronan2016]. In addition, when the number of dimensions increases, the data points occupy a more voluminous space and a larger proportion of this space is empty: the data are said to be sparse (See Figure [fig:intro_highdim]{reference-type="ref" reference="fig:intro_highdim"}) [@Altman2018]. High data sparsity affects basic properties that we are used to in two or three dimensions, such as distances. In high dimensions, distances between points increase and all points appear roughly equidistant from each other [@Ronan2016; @Altman2018]. Also, the higher the dimensionality, the lower the correlations between features tend to be. For these reasons, it is statistically more difficult to distinguish groups of points with similar characteristics from random groupings; hence, larger sample sizes are required to detect meaningful relationships. Another issue resulting from high dimensionality is multi-collinearity: since the number of features is high, the information they carry can be correlated and become redundant, and some variables might be expressed as a linear combination of others, which makes the data interpretation more difficult [@Altman2018]. Finally, the nature of omics datasets complicates data visualization. In this section, we first discuss different strategies to explore such complex datasets and then focus on methods that attempt to mitigate the curse of dimensionality: the dimensionality reduction methods.

Illustration of data sparsity. Figure from [@Ronan2016]. The figure represents how the data occupy the available space when going from a one-dimensional space to two and three-dimensional spaces (from left to right panels). {width="80%"}

[[fig:intro_highdim]]{#fig:intro_highdim label="fig:intro_highdim"}
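To make the effect of dimensionality on distances concrete, the short simulation below is a minimal, illustrative sketch (not part of the analyses described in this thesis; it assumes NumPy and SciPy are available). It draws random points in a unit hypercube and shows how the relative spread of pairwise Euclidean distances shrinks as the number of dimensions grows, i.e. all points end up at roughly the same distance from each other.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n = 200  # number of points (samples)

for p in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, p))   # n points drawn uniformly in the p-dimensional unit cube
    d = pdist(X)                   # all pairwise Euclidean distances
    # As p increases, the mean distance grows but the relative spread shrinks,
    # illustrating the concentration of distances in high dimensions.
    print(f"p={p:4d}  mean distance={d.mean():.2f}  relative spread (max-min)/mean={(d.max() - d.min()) / d.mean():.2f}")
```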

Supervised and unsupervised methods

Different approaches exist to analyze high dimensional data like omics data. When specific biological hypotheses need to be tested, confirmatory data analyses based on inference models can be used. It can also happen that there are no predefined hypotheses and that the goal is to "let the data talk"; in that case, [EDA]{acronym-label="EDA" acronym-form="singular+short"} is more appropriate [@Holmes]. A broad panel of statistical methods exists to assist both approaches. Among them, a large proportion can be grouped in the popular category of machine learning methods. The term [ML]{acronym-label="ML" acronym-form="singular+short"} was coined by Arthur Samuel in the late 1950s to describe a group of computer algorithms able to learn without being explicitly programmed. Depending on the definition of learning, different classes of [ML]{acronym-label="ML" acronym-form="singular+short"} methods have been established. In 1997, Tom Mitchell proposed a formal definition of learning: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E." [@Mitchell1997]. This definition matches a class of [ML]{acronym-label="ML" acronym-form="singular+short"} methods, the supervised learning methods, used for classification and regression tasks. A common example is the identification of spam emails, where labelling emails as spam or non-spam would be the task T, learning from a set of labelled emails the experience E, and the proportion of correctly classified emails the performance measure P. However, [ML]{acronym-label="ML" acronym-form="singular+short"} algorithms that simply learn from the input dataset without predefined ground truth (labelled data) also exist and are part of the unsupervised [ML]{acronym-label="ML" acronym-form="singular+short"} methods. Those methods learn underlying structures in the data; hence algorithms like clustering or dimensionality reduction methods such as [PCA]{acronym-label="PCA" acronym-form="singular+short"}, which was developed even before [ML]{acronym-label="ML" acronym-form="singular+short"}, are often included in the unsupervised learning category. In the next paragraphs, both supervised and unsupervised learning are described (See Figure [fig:intro_supervisedVSunsupervised]{reference-type="ref" reference="fig:intro_supervisedVSunsupervised"}).

Machine learning methods: supervised vs unsupervised methods. A) Supervised methods: a model is trained on several variables (features) to recognize predefined labels. The trained model is then applied to an unlabelled dataset for prediction purposes. B) Unsupervised methods: a model learns structures underlying a dataset that has not been labelled. These methods are divided into two main categories: clustering methods, which identify subgroups of samples, and dimensionality reduction methods, which explore the data in lower dimensions and highlight specific structures. Figure adapted from [@Libbrecht2015]. {width="100%"}

[[fig:intro_supervisedVSunsupervised]]{#fig:intro_supervisedVSunsupervised label="fig:intro_supervisedVSunsupervised"}

Supervised analyses

The goal of supervised methods is to predict the value of an outcome based on a set of features given as inputs. Depending on the type of outcome, supervised analyses can be further divided into two main categories: classification and regression problems. In classification problems, the outcome is categorical, e.g. a binary variable distinguishing a diseased from a healthy status or a multi-class variable like cancer subtypes. In regression problems, the objective is to predict a continuous variable. Note that some regression models, like logistic regression, where the outcome variable is discrete, can nevertheless be used to perform classification. The main steps of a supervised analysis consist in: i) defining the labels of each sample in the dataset, ii) training the model to classify the samples in the correct category, and iii) applying the trained model to a dataset containing independent and unknown instances (Figure [fig:intro_supervisedVSunsupervised]{reference-type="ref" reference="fig:intro_supervisedVSunsupervised"}A). Several types of supervised methods exist and have to be chosen with regard to the nature of the data. The simplest supervised models are regression models. While the most common regression algorithms model linear relationships, other methods like [SVM]{acronym-label="SVM" acronym-form="singular+short"} or neural networks can adapt to non-linear data. Another parameter that determines the type of method to use is the data type: some methods deal only with numerical features while others, like decision trees, are more flexible. Figure [fig:intro_randomforest]{reference-type="ref" reference="fig:intro_randomforest"} describes a method based on decision trees, the random forest algorithm.

The random forest method. Figure from [@Denisko2018]. A labelled dataset (A) is taken as input and processed by multiple decision trees (B and C) built using random selections of features and samples. The decision trees form a random forest (D). Each tree classifies the input samples and the votes given by the different trees are then combined to provide the final predictions, the label with the most votes being chosen (here, the red label). {width="80%"}

[[fig:intro_randomforest]]{#fig:intro_randomforest label="fig:intro_randomforest"}
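As an illustration of the supervised workflow described above, the sketch below is a minimal, hypothetical example (not an analysis from this thesis), assuming scikit-learn and a synthetic matrix that mimics an $n \ll p$ omics setting. It trains a random forest on labelled samples and evaluates it on held-out samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic labelled dataset: 100 samples x 500 features, 2 classes (e.g. tumor subtypes)
X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)

# Step i)/ii): split the labelled data and train the model on the training part
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=500, random_state=0)  # 500 trees, each grown on a bootstrap sample
forest.fit(X_train, y_train)

# Step iii): apply the trained model to independent samples and assess its performance
print("test accuracy:", forest.score(X_test, y_test))
print("indices of the 5 most important features:", np.argsort(forest.feature_importances_)[::-1][:5])
```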

Regardless of the method used, the model and its results have to generalize to other datasets. In order to assess generalizability, the [ML]{acronym-label="ML" acronym-form="singular+short"} algorithm has to be trained on a training dataset, and a testing dataset containing independent samples has to be used to validate the results. Two main errors underlie the generalization issue: bias and variance [@Domingos2012]. The first scenario occurs when the model underfits the data, i.e. the model performs poorly even on the training data, for example because it is not complex enough (See Figure [fig:intro_biasVSvariance]{reference-type="ref" reference="fig:intro_biasVSvariance"} left panel). A model that underfits the data is also unable to generalize to other datasets. In the second case, when the number of features is too large or the number of samples too small, the chance of encountering features that perfectly discriminate two output categories or perfectly predict an outcome increases. The model then performs correctly on the training dataset but fails to generalize to other datasets and is described as a high-variance model. Such a performance discrepancy indicates that the model overfits (See Figure [fig:intro_biasVSvariance]{reference-type="ref" reference="fig:intro_biasVSvariance"} right panel). Note that in high dimensional data, overfitting and data sparsity, both resulting from the $n \ll p$ problem mentioned at the beginning of this section, can be linked. Indeed, since the number of samples in the training dataset is fixed and limited, the entire input space is not covered; the machine learning algorithm has therefore not faced all possible configurations during the learning phase, and the ability of the model to generalize can be diminished.

High bias and high variance models. Created with BioRender.com. {width="80%"}

[[fig:intro_biasVSvariance]]{#fig:intro_biasVSvariance label="fig:intro_biasVSvariance"}

One method that can be used to detect as well as overcome overfitting is cross-validation. The method consists in randomly splitting the dataset into $k$ folds and iteratively training the model on $k-1$ folds while reserving the remaining fold for testing (See Figure [fig:intro_crossvalidation]{reference-type="ref" reference="fig:intro_crossvalidation"}). The overall performance of the model can be assessed by averaging the performances obtained on the testing fold of each iteration. As a result, no sample is used simultaneously for training and testing, yet the entire dataset contributes both to the training and to the testing phase. Hence, cross-validation can also be beneficial in studies with low sample sizes. One extreme case of cross-validation is the leave-one-out analysis, where $k=n$, the number of samples: at each iteration, a single sample is set aside from the training set and predicted.

$K$-fold cross-validation. Figure from [@BradleyBrandonGreenwell]. The figure illustrates 5-fold cross-validation. Five rounds are thus represented. In each of them, 4 folds are used to train the model and the model is tested on the remaining fold. The performances resulting from the test phase in each round are then averaged to estimate the overall performance of the model and its ability to generalize. {width="90%"}

[[fig:intro_crossvalidation]]{#fig:intro_crossvalidation label="fig:intro_crossvalidation"}
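The sketch below is an illustrative example of the scheme shown in the figure above, assuming scikit-learn and the same kind of synthetic data as before (the model and parameter values are arbitrary). It runs 5-fold cross-validation as well as the leave-one-out extreme case.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: train on 4 folds, test on the remaining fold, repeated 5 times
scores = cross_val_score(model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("mean accuracy over the 5 test folds:", scores.mean())

# Leave-one-out: as many folds as samples, each test fold holding a single sample
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("leave-one-out accuracy:", loo_scores.mean())
```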

In addition, to find a compromise between bias and variance, parameter tuning and algorithm optimization might be required. Note that a third dataset, referred to as the validation dataset, can be introduced for this optimization step. In this setting, multiple models (e.g. one algorithm with different sets of parameters, or different algorithms) learn on the training set, and their performances are evaluated on the validation dataset. The model with the best performance is then applied to the testing dataset.
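A minimal sketch of this three-way split is given below (a hypothetical example with scikit-learn and synthetic data; the candidate models and parameter values are arbitrary): candidate models are trained on the training set, compared on the validation set, and only the selected model is evaluated once on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=150, n_features=300, n_informative=10, random_state=0)

# Split into training (60%), validation (20%) and test (20%) sets
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# One algorithm with different parameter settings, each trained on the training set
candidates = {depth: RandomForestClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
              for depth in (2, 5, None)}

# Model selection on the validation set; final, unbiased evaluation on the test set
best_depth = max(candidates, key=lambda d: candidates[d].score(X_val, y_val))
print("selected max_depth:", best_depth)
print("test accuracy of the selected model:", candidates[best_depth].score(X_test, y_test))
```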

Unsupervised analyses

Unsupervised algorithms are hypothesis-free methods and can be associated with exploratory analyses [@Oskolkov]. The goal of such methods is usually to identify and extract useful properties of the data [@Eraslan2019]. In contrast to supervised methods, the elements of the dataset are not labelled: no predefined groups are given to the algorithm. Thus, it is not possible to compare the algorithm output with a predefined truth, and the data do not need to be split into training and testing datasets (Figure [fig:intro_supervisedVSunsupervised]{reference-type="ref" reference="fig:intro_supervisedVSunsupervised"}B). Since there is no direct feedback on the performance of the unsupervised model, validation of the results is often required.

As for the supervised analyses, several unsupervised algorithms exist. A commonly used category of unsupervised methods that can unveil structure in the data is the group of clustering algorithms (e.g. $k$-means clustering, hierarchical clustering, density-based clustering). These methods aim at grouping elements together based on common patterns observed in the set of features; in the field of cancer, clustering algorithms can be used, for example, to identify new subtypes of cancer based on molecular data (see the sketch below). The second commonly used group of unsupervised methods is the dimensionality reduction methods, which are described in more detail in the next paragraphs.
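The sketch referred to above is a minimal, illustrative example assuming scikit-learn: $k$-means is applied to an unlabelled synthetic matrix containing three hidden groups, mimicking the search for putative molecular subgroups.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabelled synthetic data: 150 samples x 50 features, generated around 3 hidden centers
X, _ = make_blobs(n_samples=150, n_features=50, centers=3, random_state=0)

# k-means groups the samples into k clusters without any predefined labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
```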

Dimensionality reduction methods

The goal of [DR]{acronym-label="DR" acronym-form="singular+short"} methods is to transform a high dimensional dataset into a low dimensional representation of the data while preserving as much as possible of its initial structure. For example, if three clusters exist in the studied dataset, a lower dimensional representation of the same data should also reveal these three clusters. [DR]{acronym-label="DR" acronym-form="singular+short"} methods are part of the feature extraction techniques, which aim at finding latent structures in the data. These methods allow a large number of features to be summarized and transformed into a smaller number of variables, which mitigates the curse of dimensionality and is valuable for data visualization. Note that these methods differ from feature selection methods, which select the most important features of the initial dataset [@Hastie2017]. Two main families of [DR]{acronym-label="DR" acronym-form="singular+short"} methods exist: matrix factorization methods (e.g. PCA, PLS, ICA, NMF) and neighbor graph approaches (e.g. t-SNE and UMAP).

Examples of matrix factorization methods

Omics datasets, after pre-processing, often result in data matrices. For example, in the case of [RNA-Seq]{acronym-label="RNA-Seq" acronym-form="singular+short"}, after aligning the reads to a reference genome (See Figure [fig:intro_ngs]{reference-type="ref" reference="fig:intro_ngs"}), read counting is performed and generates a matrix in which rows represent the genes (the features) and columns represent the samples (the observations), each entry being the read count of a gene in a sample. Matrix factorization consists in decomposing this initial matrix into two smaller matrices (Figure [fig:intro_MF]{reference-type="ref" reference="fig:intro_MF"}). This decomposition generates a smaller number of new variables.

Matrix factorization methods. The input matrix is decomposed, under specific constraints, into two smaller matrices defined by new variables that can be used to reveal structures and patterns in the data.{width="90%"}

[[fig:intro_MF]]{#fig:intro_MF label="fig:intro_MF"}
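To make the decomposition concrete, the sketch below is an illustrative example assuming scikit-learn; the matrix is synthetic pseudo-count data (here arranged as samples x genes, the transpose of the count matrix described above) and the number of factors is arbitrary. It factorizes the matrix into the two smaller matrices of the figure above, using [NMF]{acronym-label="NMF" acronym-form="singular+short"}, one of the methods discussed in the next paragraph.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.poisson(lam=5, size=(60, 2000)).astype(float)  # 60 samples x 2000 genes of pseudo-counts

# Non-negative matrix factorization: X is approximated by the product W @ H
nmf = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(X)   # samples x factors: weight of each sample on the 5 new variables
H = nmf.components_        # factors x genes: contribution of each gene to the 5 new variables
print(W.shape, H.shape)    # (60, 5) and (5, 2000): two much smaller matrices than X
```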

A classical matrix factorization method is Principal Component Analysis ([PCA]{acronym-label="PCA" acronym-form="singular+short"}). The goal of [PCA]{acronym-label="PCA" acronym-form="singular+short"} is to project the data onto a lower dimensional space while maximizing the variance retained in this space. In [PCA]{acronym-label="PCA" acronym-form="singular+short"}, the new variables correspond to linear combinations of the initial features. The matrix factorization results in a loading matrix and a score matrix. In the first matrix, the columns correspond to the new variables, called principal components, and the rows indicate the contribution of each feature to these latent variables. The principal components are orthogonal; they correspond to the directions of maximal variance and are ranked by the amount of variance explained, i.e. the first principal component captures the largest share of the variation in the dataset. The second matrix contains the coordinates of the samples in the projected space (see the sketch below). While [PCA]{acronym-label="PCA" acronym-form="singular+short"} maximizes the variance in the data, similar methods use other criteria. For example, [ICA]{acronym-label="ICA" acronym-form="singular+short"}, a method attempting to disentangle independent signals that are linearly mixed, maximizes the independence between the new variables. Other methods have, in addition, specific constraints [@Stein-Obrien2018]. [NMF]{acronym-label="NMF" acronym-form="singular+short"}, for example, constrains the decomposed matrices to be non-negative; this method has enabled the extraction of de novo mutational signatures from whole genome sequencing data [@Alexandrov2013a]. One limitation of these methods is that they are linear models. In the following paragraphs, two non-linear methods based on neighbor graphs are presented.
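The PCA sketch referred to above is an illustrative example assuming scikit-learn and a synthetic matrix; it shows the score and loading matrices and the fraction of variance explained by the first component.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 1000))            # 80 samples x 1000 features
X = StandardScaler().fit_transform(X)      # centre (and scale) the features before PCA

pca = PCA(n_components=10)
scores = pca.fit_transform(X)              # score matrix: coordinates of the samples in the projected space
loadings = pca.components_                 # each row is a principal component; entries give feature contributions
print(scores.shape, loadings.shape)        # (80, 10) and (10, 1000)
print("variance explained by PC1:", pca.explained_variance_ratio_[0])
```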

Examples of neighbor graph methods

The principle of [DR]{acronym-label="DR" acronym-form="singular+short"} methods based on neighbor graphs is to use distances and similarities between neighbors to represent the structure of the data in high dimensions, and then to embed this representation in a lower dimensional space.

A method called [t-SNE]{acronym-label="t-SNE" acronym-form="singular+short"} [@VanDerMaaten2008] has been widely used in the past years to perform [DR]{acronym-label="DR" acronym-form="singular+short"}. The [t-SNE]{acronym-label="t-SNE" acronym-form="singular+short"} method can be seen as a neighbor graph based algorithm [@McInnes2018] in the sense that similarity scores based on Euclidean distances between neighbors are computed to embed the high dimensional structure in a two-dimensional space. Sample positions in the two-dimensional space are randomly initialized and then moved iteratively so that the pairwise sample similarities match those in the original space. [t-SNE]{acronym-label="t-SNE" acronym-form="singular+short"} nevertheless has limitations. Firstly, the method can be computationally intensive when applied to very large datasets. Also, the interpretation of a [t-SNE]{acronym-label="t-SNE" acronym-form="singular+short"} representation must be performed with caution: the method retains local structures but has limited ability to maintain the global structure [@McInnes2018].
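A minimal t-SNE sketch is given below (illustrative only, assuming scikit-learn and synthetic data; the perplexity value, which roughly controls the effective number of neighbors considered, is arbitrary).

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic high dimensional data with 4 hidden groups
X, _ = make_blobs(n_samples=300, n_features=100, centers=4, random_state=0)

# Embed the 100-dimensional data in two dimensions for visualization
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (300, 2): two coordinates per sample, typically used for plotting
```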

Recently, a novel method called [UMAP]{acronym-label="UMAP" acronym-form="singular+short"} [@McInnes2018] was developed and is increasingly replacing [t-SNE]{acronym-label="t-SNE" acronym-form="singular+short"}. [UMAP]{acronym-label="UMAP" acronym-form="singular+short"} is based on topological theory: the algorithm builds what is called a simplicial complex, which is a representation of the data as a weighted graph (See Figure [fig:intro_UMAP_topo]{reference-type="ref" reference="fig:intro_UMAP_topo"}), the weights corresponding to the likelihood that a connection exists between two points [@McInnes_doc; @AndyCoenen].

[UMAP]{acronym-label="UMAP" acronym-form="singular+short"} topological representation. A) The building blocks of a simplicial complex, the simplices. B) An example of a simplicial complex. Figures from [@McInnes_doc].{width="90%"}

[[fig:intro_UMAP_topo]]{#fig:intro_UMAP_topo label="fig:intro_UMAP_topo"}

As mentioned at the beginning of Section 1.4{reference-type="ref" reference="Intro-method"}, data sparsity increases in high dimensional spaces. To connect all the points in the simplicial complex, [UMAP]{acronym-label="UMAP" acronym-form="singular+short"} varies the radius within which neighbors are searched by fixing the number of neighbors to consider around each point [@McInnes_doc; @AndyCoenen]. This number of neighbors influences how the data structure is preserved, low and high values favoring local and global structures, respectively. Once the graphical representation of the high dimensional data is constructed, a low dimensional representation is optimized so that it is as close as possible to the high dimensional one. One of the advantages of [UMAP]{acronym-label="UMAP" acronym-form="singular+short"} over [t-SNE]{acronym-label="t-SNE" acronym-form="singular+short"} is that the method better maintains the global structure of the data. Also, [UMAP]{acronym-label="UMAP" acronym-form="singular+short"} is computationally more efficient [@McInnes2018]. Note that [UMAP]{acronym-label="UMAP" acronym-form="singular+short"} can be applied to a lower dimensional dataset resulting, for example, from another [DR]{acronym-label="DR" acronym-form="singular+short"} method like [PCA]{acronym-label="PCA" acronym-form="singular+short"}.
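The sketch below is an illustrative example assuming the umap-learn package and synthetic data (the parameter values are arbitrary). It embeds the same kind of matrix with UMAP; the `n_neighbors` parameter balances the preservation of local versus global structure, as described above.

```python
import umap
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, n_features=100, centers=4, random_state=0)

# n_neighbors: small values favor local structure, larger values favor global structure;
# min_dist controls how tightly points are packed in the low dimensional embedding.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=0)
embedding = reducer.fit_transform(X)
print(embedding.shape)  # (300, 2)
```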

Multi-omics data integration

The methods previously described take a single dataset as input. [DR]{acronym-label="DR" acronym-form="singular+short"} methods processing multiple matrices also exist and can be used to integrate multi-omics datasets. Such integration nevertheless raises multiple challenges. Firstly, the data to integrate are heterogeneous: the nature of the collected data differs, hence their statistical properties can vary. Also, not all omics datasets may be available for every sample included in the analysis, for technical reasons or due to quality issues; distinct patterns of missing data can therefore occur in each omics dataset. Besides, integrating multiple datasets amplifies the curse of dimensionality issues already encountered in each dataset individually.

In 2018, a method called [MOFA]{acronym-label="MOFA" acronym-form="singular+short"} was developed to integrate multi-omics data while addressing the previously mentioned challenges [@Argelaguet2018]. [MOFA]{acronym-label="MOFA" acronym-form="singular+short"} is an unsupervised method based on matrix factorization (See Section 1.4{reference-type="ref" reference="Intro-method"}) and can be seen as an extension of [PCA]{acronym-label="PCA" acronym-form="singular+short"} to multi-omics data, where the different omics datasets are called modalities or views. It is a factor analysis method, which reduces the dimensions of the data to a smaller number of unobserved variables, called the latent factors. These factors differ from the [PCA]{acronym-label="PCA" acronym-form="singular+short"} components: the latter are linear combinations of the initial features, while in factor analysis the initial features are expressed as linear combinations of the latent factors, plus a residual noise term (see the schematic formulation below). To enable the integration of multi-omics data (modalities), [MOFA]{acronym-label="MOFA" acronym-form="singular+short"} supports different noise models depending on the nature of the data (continuous, count or binary data). Based on this model, [MOFA]{acronym-label="MOFA" acronym-form="singular+short"} identifies different sources of variation across multiple omics data. [MOFA]{acronym-label="MOFA" acronym-form="singular+short"} nevertheless presents several limitations: the model does not capture non-linear relationships and assumes feature independence [@Argelaguet2018]. Also, additional information accounting for sample structure, such as groups of samples, batches or sample conditions, was not available in the initial version of [MOFA]{acronym-label="MOFA" acronym-form="singular+short"} but has recently been introduced in a second version, [MOFA]{acronym-label="MOFA" acronym-form="singular+short"}+ [@Argelaguet2020]. In this framework, the [MOFA]{acronym-label="MOFA" acronym-form="singular+short"} dimensionality reduction is performed with regard to additional sample information (e.g. batch or cluster information) to identify sources of variation shared between groups or exclusive to one of them.
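As a schematic and simplified formulation of such a factor analysis model (a sketch of the general idea rather than the exact published specification), each modality $m$, measured on $N$ samples and $D_m$ features, is approximated by the product of a factor matrix shared across modalities and modality-specific weights:

$$Y^{(m)} \approx Z\,{W^{(m)}}^{\top} + \epsilon^{(m)},$$

where $Y^{(m)}$ is the $N \times D_m$ data matrix of modality $m$, $Z$ is the $N \times K$ matrix of the $K$ latent factors shared across modalities, $W^{(m)}$ is the $D_m \times K$ weight (loading) matrix of modality $m$, and $\epsilon^{(m)}$ is the modality-specific noise term whose distribution depends on the chosen noise model.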

Other integrative methods can take sample structure into consideration. For example, the [PLS]{acronym-label="PLS" acronym-form="singular+short"} method, which is a matrix factorization method, attempts to relate two matrices: a response matrix and a matrix gathering explanatory variables. The advantage of this method is that it ensures that the new variables resulting from the dimensionality reduction explain the response data. In that sense, the [PLS]{acronym-label="PLS" acronym-form="singular+short"} method can be considered a supervised [DR]{acronym-label="DR" acronym-form="singular+short"} framework. While [PCA]{acronym-label="PCA" acronym-form="singular+short"} maximizes the variance of the components, [PLS]{acronym-label="PLS" acronym-form="singular+short"} maximizes the covariance between the latent components of the response and explanatory datasets [@Kim; @Hastie2017]. When the response is a categorical variable, a variant of [PLS]{acronym-label="PLS" acronym-form="singular+short"} called [PLS-DA]{acronym-label="PLS-DA" acronym-form="singular+short"} can be used to perform classification tasks, e.g. the prediction of sample groups (see the sketch below). In 2017, the Lê Cao team published the mixOmics framework implementing multivariate analysis tools, including the [PLS]{acronym-label="PLS" acronym-form="singular+short"} methods previously described [@Rohart2017]. The mixOmics tools also include the [DIABLO]{acronym-label="DIABLO" acronym-form="singular+short"} method, a multivariate dimensionality reduction method that can be used for supervised multi-omics data integration [@Singh2019]. [DIABLO]{acronym-label="DIABLO" acronym-form="singular+short"} maximizes the correlation between the features of the different omics datasets, one of these datasets corresponding to the sample labels. Hence, the method extracts what the authors call multi-omics signatures, which are discriminant and can be used for prediction in a supervised framework.
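As an illustration of the PLS-DA idea referred to above, the sketch below is a hypothetical Python example on synthetic data; mixOmics itself is an R package, and this sketch only mimics the general principle with scikit-learn. PLS regression is applied to a one-hot encoded class membership matrix, and each sample is assigned to the class with the highest predicted value.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.datasets import make_classification

# Synthetic explanatory matrix X (100 samples x 300 features) and a 3-class outcome y
X, y = make_classification(n_samples=100, n_features=300, n_informative=10,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
Y = np.eye(3)[y]  # one-hot encoded response matrix (class membership)

# PLS finds latent components maximizing the covariance between X and the response Y
pls = PLSRegression(n_components=2).fit(X, Y)
scores = pls.transform(X)                # coordinates of the samples on the latent components
pred = pls.predict(X).argmax(axis=1)     # assign each sample to the class with the highest predicted value
print("training accuracy:", (pred == y).mean())
```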