Supervised learning of disease-specific models to predict gene regulatory loci


Complex diseases are influenced by common variants. Many SNPs fall in non-coding regions and are likely gene regulatory SNPs (rSNPs). Supervised machine learning models take particular sets of rSNPs or regions, together with molecular data such as eQTLs, transcription factor binding sites and motifs, to predict functional regions [1]. In the case of rSNPs, most supervised models are trained with rare variants showing large effects and do not focus on particular diseases or on particular classes of non-coding regions such as intergenic and intronic regions [2; 3]. However, several international initiatives exist to associate genetic and phenotypic variation, such as the INSERM Genomic Variability 2018, in which the TAGC lab participates (https://bit.ly/2TIfSeo). Recently, we developed a new method to train a supervised model with common regulatory variants associated with complex diseases, which we have run on intergenic and intronic regions (http://tagoos.readthedocs.io). In the present project, we plan to make the TAGOOS method disease-specific.


Keywords: complex diseases; common variants; machine learning; gene regulatory regions


The objective of this Postdoc project is to develop a disease-specific model of common regulatory variants for four disease classes in the GWAS catalog, namely cardiovascular, digestive, immune system and nervous system diseases. This model will be used to predict new non-coding genomic regions and SNPs involved in these diseases. The resulting predictions will help uncover shared and specific contributions of the genome to these disease classes.

Proposed approach

Disease-specific SNPs will be selected in non-coding regions and expanded with correlated SNPs in linkage disequilibrium (LD) in different human populations. The SNPs will be annotated with molecular data relevant to gene regulation, such as transcription factor binding sites and motifs. This list of SNPs will be pruned to select SNPs in independent LD blocks. Annotation selection and decorrelation strategies will be applied to deal with the large number of annotations [4]. The resulting matrix of annotated SNPs will be used to train supervised algorithms such as gradient boosting or neural networks. To take into account different classes of non-coding regions, such as distal intergenic, promoter or intronic regions, supervised or unsupervised learning approaches will be developed and compared. The best model will be used to score every position of the human genome and prioritize new SNPs involved in these disease classes. These genome scores will help uncover shared and specific contributions of the genome to these disease classes.
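As an illustration of the pruning step described above, the following is a minimal sketch of greedy LD pruning in Python. It is not the TAGOOS implementation. The SNP identifiers, genotype vectors, p-values, the r2 threshold of 0.8 and the function names r_squared and ld_prune are all assumptions made for the example, and r2 is approximated here by the squared Pearson correlation of genotype dosages.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype dosage vectors (0/1/2)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as unlinked
    return cov * cov / (var_x * var_y)


def ld_prune(genotypes, p_values, r2_threshold=0.8):
    """Greedy pruning: visit SNPs from most to least significant and keep a SNP
    only if its r2 with every SNP already kept is below the threshold."""
    ranked = sorted(genotypes, key=lambda snp: p_values[snp])
    kept = []
    for snp in ranked:
        if all(r_squared(genotypes[snp], genotypes[k]) < r2_threshold for k in kept):
            kept.append(snp)
    return kept


# Hypothetical toy data: dosages for six individuals and association p-values.
genotypes = {
    "rs_a": [0, 1, 2, 0, 1, 2],
    "rs_b": [0, 1, 2, 0, 1, 2],  # in perfect LD with rs_a
    "rs_c": [2, 0, 1, 1, 0, 2],  # uncorrelated with rs_a and rs_b
}
p_values = {"rs_a": 1e-8, "rs_b": 1e-9, "rs_c": 1e-6}

pruned = ld_prune(genotypes, p_values)
print(pruned)  # ['rs_b', 'rs_c']
```

In this toy example rs_a is dropped because it is perfectly correlated with the more significant rs_b, while rs_c is retained as an independent signal. In practice, LD would more likely be taken from reference population panels than computed from a handful of samples.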


The Postdoc candidate should have a Doctoral thesis in an area related to Mathematics, Computer Science, Bioinformatics or Population Genetics, with interest and experience in machine learning, data analysis and human genetics. Aitor González (TAGC) is a bioinformatician who uses machine learning to analyze the non-coding regions and variants of the genome. Badih Ghattas (I2M) is a mathematician with expertise in statistical modeling and prediction using machine and deep learning, and with extensive experience collaborating with biologists [4; 5]. Pascal Rihet (TAGC) uses quantitative genetics methods and experimental validation to find genetic markers of complex diseases such as malaria [6; 7]. The prospective postdoc candidate will thus benefit from the expertise of the TAGC and I2M laboratories, which are side by side on the Luminy campus.

Expected profile

The Postdoc candidate should have a Doctoral degree in an area related to Mathematics, Computer Science, Biophysics, Bioinformatics or Population Genetics, with interest and/or experience in statistics, machine learning, data analysis and human genetics. We expect the candidate to have at least one publication in a peer-reviewed journal related to any of these topics. The Postdoc candidate will work under the joint supervision of a bioinformatics scientist (Aitor González, TAGC, Marseille, France), a geneticist (Pascal Rihet, TAGC, Marseille, France) and a statistician (Badih Ghattas, I2M, Marseille, France).


[1] Seyres, D.; Darbo, E.; Perrin, L.; Herrmann, C. and González, A. (2016). LedPred: an R/bioconductor package to predict regulatory sequences using support vector machines. Bioinformatics (Oxford, England) 32: 1091-1093.

[2] Amlie-Wolf, A.; Tang, M.; Mlynarski, E. E.; Kuksa, P. P.; Valladares, O.; Katanic, Z.; Tsuang, D.; Brown, C. D.; Schellenberg, G. D. and Wang, L.-S. (2018). INFERNO: inferring the molecular mechanisms of noncoding genetic variants. Nucleic acids research 46: 8740-8753.

[3] Wang, J.; Dayem Ullah, A. Z. and Chelala, C. (2018). IW-Scoring: an Integrative Weighted Scoring framework for annotating and prioritizing genetic variations in the noncoding genome. Nucleic acids research 46: e47.

[4] Ghattas, B.; Pierre, M. and Laurent, B. (2019). Assessing variable importance in clustering: a new method based on unsupervised binary decision trees. Computational Statistics 34: 301-321.

[5] Lopez, F.; Granjeaud, S.; Ara, T.; Ghattas, B. and Gautheret, D. (2006). The disparate nature of "intergenic" polyadenylation sites. RNA (New York, N.Y.) 12: 1794-1801.

[6] Baaklini, S.; Afridi, S.; Nguyen, T. N.; Koukouikila-Koussounda, F.; Ndounga, M.; Imbert, J.; Torres, M.; Pradel, L.; Ntoumi, F. and Rihet, P. (2017). Beyond genome-wide scan: Association of a cis-regulatory NCR3 variant with mild malaria in a population living in the Republic of Congo. PloS one 12: e0187818.

[7] Thiam, A.; Baaklini, S.; Mbengue, B.; Nisar, S.; Diarra, M.; Marquet, S.; Fall, M. M.; Sanka, M.; Thiam, F.; Diallo, R. N.; Torres, M.; Dieye, A. and Rihet, P. (2018). NCR3 polymorphism, haematological parameters, and severe malaria in Senegalese patients. PeerJ 6: e6048.