Development of a Convolutional Neural Network for Automated Copy Number Variants Validation and its Application in the UKB


Author(s): Simone Montalbano, Andres Ingason

Affiliation(s): Institute of Biological Psychiatry



Structural variants are a major source of variation in the human genome. In particular, copy number variants (CNVs) have been associated with multiple diseases and syndromes. CNVs are typically defined as deletions or duplications spanning ~50 kbp to ~10 Mbp. Genotyping arrays remain the most widely used platform for detecting CNVs, especially in large biobanks. However, CNV calling algorithms are prone to producing a high number of false positives (from 10% to more than 50%, depending on sample quality), so analysts must manually 'validate' calls by visual inspection. This has largely limited CNV research to the so-called recurrent loci. We have developed a machine learning algorithm based on the convolutional neural network architecture that automates the visual validation of CNVs across the whole human genome. It was trained on ~15,000 human-validated examples from UK Biobank (UKB) singletons and Icelandic trios at deCODE genetics and has an accuracy above 90% across multiple cohorts and chip types, making it on average as good as a human analyst. We showcase the application of this tool in the UKB, creating the first genome-wide map of validated CNVs in a large biobank population. Furthermore, we describe how CNVs are distributed across the genome and how regions differ in being permissive or intolerant to the presence of CNVs. Finally, we show how to group CNVs so that they can be treated like SNPs in association analysis, and we present the results of associating genome-wide CNVs with a wide selection of phenotypes.