PHD2023-02
Combined deep learning and synthetic-based approaches to unravel the genetic determinants of enhancer versus promoter activity of Epromoters
Host laboratory and collaborators
Aitor González (TAGC) / aitor.gonzalez@univ-amu.fr
Badih Ghattas (I2M) / badih.ghattas@univ-amu.fr
Salvatore Spicuglia (TAGC) / salvatore.spicuglia@inserm.fr
Abstract
Regulation of gene transcription is accomplished by proximal (promoters) and distal (enhancers) regulatory elements. However, this strict dichotomy is now being challenged, and a major question in the field is to define the genetic determinants of the different regulatory activities. The Spicuglia team has previously identified Epromoters as cis-regulatory elements with both enhancer and promoter (E/P) activities and is currently using high-throughput approaches to evaluate both activities in thousands of wild-type and mutant DNA sequences. In this project, we will build a sequence-based deep learning model of Epromoters to unravel the genetic determinants of enhancer vs. promoter activities. The model will be challenged and refined through back-and-forth exchanges between model predictions, experimental validation and synthetic generation of Epromoters. A. González and S. Spicuglia will supervise the overall project. A. González will supervise data processing, integration and analysis. B. Ghattas will supervise the design and validation of the deep learning models to predict E/P activities. S. Spicuglia will supervise the experimental work to generate model input data and experimentally validate the predictions.
Keywords
Cis-regulatory elements, genetic variants, machine learning, synthetic biology
Objectives
1) To process NGS data from the dual reporter assay.
2) To create a deep learning model of DNA sequences to predict E/P activities.
3) To design de-novo synthetic regulatory sequences with defined E/P activities.
4) To experimentally evaluate E/P activities in synthetic sequences.
5) To analyse the model predictions to infer the logic of E/P activities at the DNA sequence level and to assess natural genetic variants.
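As an illustration of objective 2, a sequence-based model first needs the DNA sequences in numerical form. The sketch below shows one common choice, one-hot encoding, together with reverse-complement generation for strand augmentation. The function names and the handling of unknown bases are assumptions made for illustration; the proposal does not specify the encoding that will be used.

```python
# Hypothetical helpers (not part of the proposal): encode DNA for a sequence model.
ALPHABET = "ACGT"

def one_hot_encode(sequence):
    """Return one 4-element row per base (order A, C, G, T).

    Bases outside the alphabet, such as N, are encoded as all zeros
    (an assumption; other conventions exist).
    """
    rows = []
    for base in sequence.upper():
        row = [0.0] * len(ALPHABET)
        index = ALPHABET.find(base)
        if index >= 0:
            row[index] = 1.0
        rows.append(row)
    return rows

def reverse_complement(sequence):
    """Return the reverse complement, e.g. to augment training data with both strands."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement.get(base, "N") for base in reversed(sequence.upper()))
```

The resulting matrix (sequence length by 4) is the typical input of convolutional models such as those cited in the references, with one output per measured activity.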
Proposed approach (experimental / theoretical / computational)
Regulation of gene transcription in higher eukaryotes is accomplished through transcription start site (TSS)-proximal (promoters) and TSS-distal (enhancers) regulatory elements. It is now well established that enhancer elements play an essential role during development and cell differentiation, and that genetic alterations in these elements are a major cause of human disease. The classical definition of enhancers implies the ability to activate gene expression at a distance, whereas promoters induce local gene expression. However, this basic dichotomy has been challenged by broad similarities between promoters and enhancers. Using high-throughput enhancer reporter assays, the team of Salvatore Spicuglia at the TAGC laboratory demonstrated that a subset of gene promoters, termed Epromoters, also function as bona fide enhancers and regulate distal gene expression. Subsequent studies from our lab and others have suggested that single nucleotide polymorphisms (SNPs) affecting distal gene expression are significantly enriched within Epromoters. Overall, these findings have significant implications for the understanding of complex gene regulation in normal development and open the intriguing possibility that variants associated with physiological traits or disease and lying within a subset of (E)promoters might also directly impact distal gene expression. Previous work has suggested that the intrinsic and external factors controlling the promoter and enhancer activities of a given Epromoter are not necessarily the same. To understand the mechanistic basis of proximal versus distal regulation, the Spicuglia team recently developed a dual reporter assay that simultaneously measures the promoter and enhancer activity of a given Epromoter. This approach will allow the assessment of hundreds of wild-type or mutated Epromoters in parallel, including natural genetic variants, in human lymphoblastoid cells.
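To make the data processing step concrete, the following minimal sketch turns read counts into activity values. It assumes that each activity is summarised as the log2 ratio of output (RNA) reads to input (DNA) reads with a pseudocount, a normalisation commonly used for massively parallel reporter assays. The actual readouts and normalisation of the dual reporter assay are not described in this proposal, so the function names, dictionary keys and pseudocount are hypothetical.

```python
import math

def log2_activity(rna_count, dna_count, pseudocount=1.0):
    """log2 ratio of output (RNA) to input (DNA) reads; the pseudocount avoids division by zero."""
    return math.log2((rna_count + pseudocount) / (dna_count + pseudocount))

def dual_activities(counts):
    """Compute both activities of one tested sequence.

    `counts` is a hypothetical record holding separate RNA read counts for the
    promoter and enhancer readouts and a shared DNA input count.
    """
    return {
        "promoter": log2_activity(counts["promoter_rna"], counts["dna"]),
        "enhancer": log2_activity(counts["enhancer_rna"], counts["dna"]),
    }
```

In practice, counts would also be normalised for sequencing depth and combined across replicates before model training.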
In the present project, the PhD student will develop deep learning approaches to computationally model the sequence determinants of enhancer and promoter activity in Epromoters. Based on these models, he/she will infer the grammatical rules dictating the enhancer versus promoter activity of Epromoters and prioritize genetic variants according to their impact on regulatory potential and transcriptional activity in a large panel of genotyped individuals. Predictions will be assessed experimentally by biologists of the Spicuglia team in a back-and-forth strategy until the development of a stable model of the genetic determinants influencing enhancer versus promoter activity.
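One way to prioritise variants with a trained model is in silico mutagenesis: predict both activities for the reference and the mutated sequence and take the difference. The sketch below illustrates the idea under stated assumptions. `toy_predict` is a hypothetical stand-in for the trained model, and its motif counts are arbitrary placeholders, not proposed determinants of E/P activity.

```python
def toy_predict(sequence):
    """Stand-in for a trained model: returns (enhancer, promoter) scores.

    The motifs counted here are arbitrary placeholders chosen for illustration.
    """
    enhancer = float(sequence.count("GATA"))
    promoter = float(sequence.count("TATA"))
    return enhancer, promoter

def variant_effect(predict, sequence, position, alt_base):
    """Predicted change (alternative minus reference) in enhancer and promoter scores."""
    mutated = sequence[:position] + alt_base + sequence[position + 1:]
    ref_enhancer, ref_promoter = predict(sequence)
    alt_enhancer, alt_promoter = predict(mutated)
    return alt_enhancer - ref_enhancer, alt_promoter - ref_promoter

def saturation_mutagenesis(predict, sequence):
    """Score every single-base substitution, keyed by (0-based position, alternative base)."""
    effects = {}
    for position, ref_base in enumerate(sequence):
        for alt_base in "ACGT":
            if alt_base != ref_base:
                effects[(position, alt_base)] = variant_effect(
                    predict, sequence, position, alt_base
                )
    return effects
```

Variants whose predicted effect differs between the two outputs are natural candidates for experimental testing, since they may separate enhancer from promoter activity.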
Interdisciplinarity
Dr. Salvatore Spicuglia is an expert in mammalian transcriptional regulation and his lab implements high-throughput reporter assays to quantitatively assess enhancer activity in mammals.
Dr. Aitor González is a bioinformatician who uses a variety of methods, including machine learning, to analyse gene regulatory regions and variants.
Prof. Badih Ghattas is an expert in statistics and machine learning who has applied predictive models in genomics for many years.
The student will collaborate with members of the TAGC laboratory under the supervision of Salvatore Spicuglia and Aitor González to prepare the sequences that will be generated by experimentalists. A deep learning model able to predict the enhancer vs. promoter activities will then be created under the supervision of Aitor González and Badih Ghattas. With the help of all three supervisors, the model will be used to gain insights such as the sequence composition that defines promoter vs. enhancer activities.
Expected profile
The PhD candidate should have a master's degree in bioinformatics or a related field, with a solid background in computer science, statistics and/or mathematics. The candidate should be interested in "omics" data analysis, genomics and gene regulation. Previous experience in handling NGS data and/or deep learning, and in collaborating with experimental biologists, is an advantage.
Is this project the continuation of an existing project or an entirely new one? In the case of an existing project, please explain the links between the two projects
This is a new CENTURI project. The TAGC has the necessary funding and human resources to cover the experimental part of the project (ANR, H2020 ITN).
2 to 5 references related to the project
- Bernardo P. de Almeida, Franziska Reiter, Michaela Pagani, Alexander Stark. DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of enhancers. doi:10.1101/2021.10.05.463203
- W Kopp, R Monti, A Tamburrini, U Ohler, A Akalin. Deep learning for genomics using Janggu. Nat Commun. 2020 Jul 13;11(1):3488. doi: 10.1038/s41467-020-17155-y.
- Andersson, R., Sandelin, A. Determinants of enhancer and promoter activities of regulatory elements. Nat Rev Genet 21, 71–87 (2020). https://doi.org/10.1038/s41576-019-0173-8
- Medina A, Santiago D, Puthier D, Spicuglia S. (2018) Wide-spread enhancer activity from core promoters. TiBS. 43(6):452-468.
- Core, L., Martins, A., Danko, C. et al. Analysis of nascent RNA identifies a unified architecture of initiation regions at mammalian promoters and enhancers. Nat Genet 46, 1311–1320 (2014). https://doi.org/10.1038/ng.3142
3 main publications from each PI over the last 5 years
Aitor González
- Anthony Baptista, Aitor Gonzalez, Anaïs Baudot. Universal multilayer network exploration by random walk with restart. Communications Physics, Nature Research, 2022, 5 (1), pp.170. ⟨10.1038/s42005-022-00937-9⟩.
- Aitor Gonzalez*, Vincent Dubut, Emmanuel Corse, Reda Mekdad, Thomas Dechatre, et al. VTAM: A robust pipeline for validating metabarcoding data using internal controls. 2021. bioRxiv. doi:10.1101/2020.11.06.371187.
- Aitor Gonzalez*, Marie Artufel, Pascal Rihet. TAGOOS: genome-wide supervised learning of non-coding loci associated to complex phenotypes. Nucleic Acids Research, doi:10.1093/nar/gkz320.
*Corresponding author
Badih Ghattas
- Fournel J, Bartoli A, Bendahan D, Guye M, Bernard M, Rauseo E, Khanji MY, Petersen SE, Jacquier A, Ghattas B. Medical image segmentation automatic quality control: A multi-dimensional approach. Med Image Anal. 2021 Dec;74:102213. doi: 10.1016/j.media.2021.102213. Epub 2021 Aug 12.
- Bartoli A, Fournel J, Bentatou Z, Habib G, Lalande A, Bernard M, Boussel L, Pontana F, Dacher JN, Ghattas B, Jacquier A. Deep Learning-based Automated Segmentation of Left Ventricular Trabeculations and Myocardium on Cardiac MR Images: A Feasibility Study. Radiol Artif Intell. 2020 Nov 25;3(1):e200021. doi:10.1148/ryai.2020200021.
- Jaotombo F, Pauly V, Auquier P, Orleans V, Boucekine M, Fond G, Ghattas B, Boyer L. Machine-learning prediction of unplanned 30-day rehospitalization using the French hospital medico-administrative database. Medicine (Baltimore). 2020 Dec 4;99(49):e22361. doi: 10.1097/MD.0000000000022361.
Salvatore Spicuglia
- Santiago-Algarra D, Souaid C, Singh H, Dao LTM, Hussain S, Medina-Rivera A, Ramirez-Navarro L, Castro-Mondragon JA, Sadouni N, Charbonnier G, Spicuglia S. (2021) Epromoters function as a hub to recruit key transcription factors required for the inflammatory response. Nat Commun. Nov 18;12(1):6660. doi: 10.1038/s41467-021-26861-0.
- Belhocine M, Simonin M, Abad Flores JD, Cieslak A, Manosalva I, Pradel L, Smith C, Mathieu EL, Charbonnier G, Martens JHA, Stunnenberg HG, Maqbool MA, Mikulasova A, Russell LJ, Rico D, Puthier D, Ferrier P, Asnafi V, Spicuglia S. (2021). Dynamic of broad H3K4me3 domains uncover an epigenetic switch between cell identity and cancer-related genes. Genome Research. Jun 23. Advanced online doi:10.1101/gr.266924.120
- Dao LTM, Galindo-Albarrán AO, Castro-Mondragon JA, Andrieu-Soler C, Medina-Rivera A, Souaid C, Charbonnier G, Griffon A, Vanhille L, Stephen T, Alomairi J, Martin D, Torres M, Fernandez N, Soler E, van Helden J, Puthier D, Spicuglia S (2017). Genome-wide characterization of mammalian promoters with distal enhancer functions. Nat Genet. 49(7):1073-1081. PMID: 28581502.