Locus-to-Gene (L2G)
Overview of the Open Targets Locus-to-Gene algorithm
Using genetic and functional genomics features, the Locus-to-Gene (L2G) machine-learning algorithm ranks the most likely causal genes at each GWAS credible set. The L2G score, which ranges from 0 to 1, measures the likelihood that a gene is causal for a particular GWAS locus.
The L2G model included in the Platform is based on the original method published by Mountjoy et al. (Nature Genetics, 2021), but contains several enhancements, so its performance can differ from that of the published model.
Features
📖 Gentropy
The predictive features used by the L2G algorithm are designed to capture various genetic and genomic contexts that influence the likelihood of a gene being causal at a given GWAS credible set.
Features are divided into four categories. Each feature below is listed by name, followed by its description and its range of values:
distanceTssMean
Average distance between all variants in the credible set and the TSS of a gene's canonical transcript. The distance of each variant is weighted by its posterior probability
(0, 1)
distanceTssMeanNeighbourhood
Ratio between distanceTssMean for a gene and the maximum distanceTssMean for any given gene in the vicinity
(0, 1)
distanceSentinelTss
Distance between the sentinel variant and the TSS of a gene's canonical transcript
(0, 1)
distanceSentinelTssNeighbourhood
Ratio between distanceSentinelTss for a gene and the maximum distanceSentinelTss for any given gene in the vicinity
(0, 1)
distanceSentinelFootprint
Distance between sentinel variant and a gene's footprint
(0, 1)
distanceSentinelFootprintNeighbourhood
Ratio between distanceSentinelFootprint for a gene and the maximum distanceSentinelFootprint for any given gene in the vicinity
(0, 1)
distanceFootprintMean
Average distance between all variants in the credible set and a gene's footprint. The distance of each variant is weighted by its posterior probability
(0, 1)
distanceFootprintMeanNeighbourhood
Ratio between distanceFootprintMean for a gene and the maximum distanceFootprintMean for any given gene in the vicinity
(0, 1)
eQtlColocClppMaximum
Maximum CLPP across all eQTL studies for a gene
(0, 1)
pQtlColocClppMaximum
Maximum CLPP across all pQTL studies for a gene
(0, 1)
sQtlColocClppMaximum
Maximum CLPP across all sQTL and tuQTL studies for a gene
(0, 1)
eQtlColocH4Maximum
Maximum H4 across all eQTL studies for a gene
(0, 1)
pQtlColocH4Maximum
Maximum H4 across all pQTL studies for a gene
(0, 1)
sQtlColocH4Maximum
Maximum H4 across all sQTL and tuQTL studies for a gene
(0, 1)
eQtlColocClppMaximumNeighbourhood
Ratio between eQtlColocClppMaximum for a gene and the maximum eQtlColocClppMaximum for any protein-coding gene in the vicinity
(0, 1)
pQtlColocClppMaximumNeighbourhood
Ratio between pQtlColocClppMaximum for a gene and the maximum pQtlColocClppMaximum for any protein-coding gene in the vicinity
(0, 1)
sQtlColocClppMaximumNeighbourhood
Ratio between sQtlColocClppMaximum for a gene and the maximum sQtlColocClppMaximum for any protein-coding gene in the vicinity
(0, 1)
eQtlColocH4MaximumNeighbourhood
Ratio between eQtlColocH4Maximum for a gene and the maximum eQtlColocH4Maximum for any protein-coding gene in the vicinity
(0, 1)
pQtlColocH4MaximumNeighbourhood
Ratio between pQtlColocH4Maximum for a gene and the maximum pQtlColocH4Maximum for any protein-coding gene in the vicinity
(0, 1)
sQtlColocH4MaximumNeighbourhood
Ratio between sQtlColocH4Maximum for a gene and the maximum sQtlColocH4Maximum for any protein-coding gene in the vicinity
(0, 1)
vepMaximum
Maximum VEP score across all variants in the credible set
(0, 1)
vepMaximumNeighbourhood
Ratio between vepMaximum for a gene and the maximum vepMaximum for any protein-coding gene in the vicinity
(0, 1)
vepMean
Average VEP score across all variants in the credible set. The score of each variant is weighted by its posterior probability
(0, 1)
vepMeanNeighbourhood
Ratio between vepMean for a gene and the maximum vepMean for any protein-coding gene in the vicinity
(0, 1)
geneCount500kb
Number of genes within 250 kb upstream and downstream of the sentinel variant of a credible set
(0, 60000)
proteinGeneCount500kb
Number of protein-coding genes within 250 kb upstream and downstream of the sentinel variant of a credible set
(0, 60000)
credibleSetConfidence
Degree of confidence we assign to the credible set definition based on the fine-mapping methodology: 1 when fine-mapped with SuSiE using in-sample LD; 0.75 when fine-mapped with SuSiE using out-of-sample LD; 0.5 when fine-mapped with PICS and the locus is based on the analysis of summary statistics; 0.25 when fine-mapped with PICS and the locus was reported as a top hit according to the GWAS Catalog
(0, 1)
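As an illustration of how the posterior-weighted, neighbourhood and confidence features above could be derived, the following minimal Python sketch computes a weighted mean, the ratio to the local maximum, and the confidence lookup. The function names, the dictionary keys, the input layout and the example numbers are assumptions made for illustration and are not Gentropy's implementation. The scaling that brings raw distances into the (0, 1) range is not shown.

```python
from typing import Dict, List, Tuple


def weighted_mean(values_and_weights: List[Tuple[float, float]]) -> float:
    """Average of per-variant values, each weighted by its posterior probability."""
    total_weight = sum(weight for _, weight in values_and_weights)
    if total_weight == 0:
        return 0.0
    return sum(value * weight for value, weight in values_and_weights) / total_weight


def neighbourhood_ratio(feature_by_gene: Dict[str, float]) -> Dict[str, float]:
    """Ratio between each gene's value and the maximum value among genes in the vicinity."""
    local_max = max(feature_by_gene.values(), default=0.0)
    if local_max == 0:
        return {gene: 0.0 for gene in feature_by_gene}
    return {gene: value / local_max for gene, value in feature_by_gene.items()}


# Values as listed for credibleSetConfidence; the keys are hypothetical labels.
CREDIBLE_SET_CONFIDENCE = {
    "susie_in_sample_ld": 1.0,
    "susie_out_of_sample_ld": 0.75,
    "pics_summary_statistics": 0.5,
    "pics_gwas_catalog_top_hit": 0.25,
}

# Hypothetical, already-scaled per-variant values for one credible set:
# each tuple is (value for this gene, posterior probability of the variant).
values = {
    "GENE_A": [(0.8, 0.6), (0.4, 0.4)],
    "GENE_B": [(0.2, 0.6), (0.2, 0.4)],
}
mean_by_gene = {gene: weighted_mean(pairs) for gene, pairs in values.items()}
ratio_by_gene = neighbourhood_ratio(mean_by_gene)
```

In this toy input the gene with the largest weighted mean receives a neighbourhood ratio of 1, and every other gene is expressed relative to it.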
Training set
The L2G model is trained on prior knowledge of gene-trait associations collated from several sources. Each association is then matched to representative credible sets that support it. The following steps describe how the training set is composed:
Effector gene list
The effector gene list (EGL) represents the set of biologically validated gene-trait associations. This list is derived from several sources:
Manually curated gold standards (“medium” and “high” confidence) from Open Targets Genetics (OTG) 22.10.
Gene-indication pairs for the pharmacological target in all phase III or IV clinical trials according to the latest ChEMBL release.
Gene-disease or phenotype mappings with evidence score ≥ 0.95 from ClinVar, UniProt, Gene2Phenotype, Genomics England PanelApp and ClinGen from the latest Open Targets Platform release.
To ensure the uniqueness of gene-EFO pairs, the combined list was de-duplicated.
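The combination and de-duplication of the three sources can be sketched as follows. This is a minimal illustration under assumed inputs: the record fields (`gene`, `efo`, `confidence`, `phase`, `score`), the function name and the example identifiers are hypothetical and do not reflect the actual source schemas.

```python
from typing import Iterable, List, Set, Tuple

GeneTrait = Tuple[str, str]  # (gene ID, EFO trait ID)


def build_effector_gene_list(
    gold_standards: Iterable[dict],
    clinical_trials: Iterable[dict],
    platform_evidence: Iterable[dict],
) -> List[GeneTrait]:
    """Combine the three sources and return unique gene-EFO pairs."""
    pairs: Set[GeneTrait] = set()  # a set removes duplicate pairs
    for row in gold_standards:
        # Keep only medium- and high-confidence curated gold standards.
        if row["confidence"].lower() in {"medium", "high"}:
            pairs.add((row["gene"], row["efo"]))
    for row in clinical_trials:
        # Keep targets of phase III or IV clinical trials.
        if row["phase"] >= 3:
            pairs.add((row["gene"], row["efo"]))
    for row in platform_evidence:
        # Keep gene-disease mappings with evidence score >= 0.95.
        if row["score"] >= 0.95:
            pairs.add((row["gene"], row["efo"]))
    return sorted(pairs)


# Hypothetical example records.
egl = build_effector_gene_list(
    gold_standards=[
        {"gene": "ENSG_A", "efo": "EFO_1", "confidence": "High"},
        {"gene": "ENSG_B", "efo": "EFO_1", "confidence": "Low"},
    ],
    clinical_trials=[
        {"gene": "ENSG_A", "efo": "EFO_1", "phase": 4},
        {"gene": "ENSG_C", "efo": "EFO_2", "phase": 2},
    ],
    platform_evidence=[
        {"gene": "ENSG_D", "efo": "EFO_3", "score": 0.97},
        {"gene": "ENSG_E", "efo": "EFO_3", "score": 0.5},
    ],
)
```

The pair supported by both a gold standard and a clinical trial appears only once in the result, which is the effect of the de-duplication step.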
Positives
For each gene-trait pair from the effector gene list, positive gene-credible set pairs were extracted using the following criteria:
Only protein-coding genes.
Removed any pair with a distanceSentinelTss feature less than 0.1.
Only include gene-trait pairs supported by at least two credible sets in different studies sharing the same lead variant. We added this criterion to reduce the chance of false positives among the credible sets.
Among all positive credible set-gene pairs, we removed duplicates based on functional genomics features.
Removed credible sets involved in more than two positives.
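The first three criteria can be sketched as a simple filter. The record layout and function name below are assumptions for illustration only, and the last two steps (de-duplication on functional genomics features and removal of credible sets involved in more than two positives) are omitted for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def select_positives(
    candidates: List[dict],
    min_distance_feature: float = 0.1,
    min_studies: int = 2,
) -> List[dict]:
    """Apply the protein-coding, distance and replication criteria to candidate pairs."""
    # Criteria 1 and 2: protein-coding genes with distanceSentinelTss >= 0.1.
    kept = [
        c for c in candidates
        if c["is_protein_coding"] and c["distanceSentinelTss"] >= min_distance_feature
    ]
    # Criterion 3: count distinct studies per gene-trait pair and lead variant.
    studies: Dict[Tuple[str, str, str], Set[str]] = defaultdict(set)
    for c in kept:
        studies[(c["gene"], c["trait"], c["lead_variant"])].add(c["study"])
    return [
        c for c in kept
        if len(studies[(c["gene"], c["trait"], c["lead_variant"])]) >= min_studies
    ]


def make(gene, credible_set, study, lead, coding, distance):
    """Build one hypothetical candidate record for the example."""
    return {
        "gene": gene, "trait": "EFO_1", "credible_set": credible_set, "study": study,
        "lead_variant": lead, "is_protein_coding": coding, "distanceSentinelTss": distance,
    }


candidates = [
    make("G1", "cs1", "S1", "v1", True, 0.9),   # replicated in two studies: kept
    make("G1", "cs2", "S2", "v1", True, 0.9),
    make("G2", "cs3", "S3", "v2", True, 0.8),   # single study: dropped
    make("G3", "cs1", "S1", "v1", False, 0.9),  # not protein-coding: dropped
    make("G4", "cs1", "S1", "v1", True, 0.05),  # distance feature below 0.1: dropped
    make("G4", "cs2", "S2", "v1", True, 0.05),
]
positives = select_positives(candidates)
```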
Negatives
All other protein-coding genes in the window are classified as negatives as long as they don't have a strong functional interaction (STRINGdb score > 0.8) with any positive genes for the trait.
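A minimal sketch of this rule is shown below. It assumes that the genes in the window are already restricted to protein-coding genes and that STRINGdb scores are available as a lookup keyed by unordered gene pairs. Both the data layout and the function name are hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set


def select_negatives(
    genes_in_window: Iterable[str],
    positive_genes: Set[str],
    string_scores: Dict[FrozenSet[str], float],
    threshold: float = 0.8,
) -> List[str]:
    """Return genes in the window that are neither positives nor strong interactors of one."""
    negatives = []
    for gene in genes_in_window:
        if gene in positive_genes:
            continue
        # A missing pair is treated as having no known interaction (score 0).
        interacts = any(
            string_scores.get(frozenset((gene, positive)), 0.0) > threshold
            for positive in positive_genes
        )
        if not interacts:
            negatives.append(gene)
    return negatives


# Hypothetical example: G2 interacts strongly with the positive gene G1.
negatives = select_negatives(
    genes_in_window=["G1", "G2", "G3", "G4"],
    positive_genes={"G1"},
    string_scores={frozenset(("G1", "G2")): 0.9, frozenset(("G1", "G3")): 0.5},
)
```

Excluding strong interactors avoids labelling as negative a gene that may be functionally linked to the known effector gene.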
The L2G model
📖 Gentropy
The Locus-to-Gene model is retrained for every data release using the training set and feature matrix described above. As in Mountjoy et al., the model is trained with a gradient-boosting algorithm from the scikit-learn library, using nested cross-validation and hyperparameter tuning.
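The general pattern of nested cross-validation with hyperparameter tuning can be sketched with scikit-learn as follows. This is only an illustration of the technique on synthetic data: the parameter grid, the number of folds and the scoring metric are assumptions, not the settings used for the L2G model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix (X) and the positive/negative labels (y).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop: tunes hyperparameters. Outer loop: estimates generalisation performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Hypothetical search grid, kept small so the example runs quickly.
param_grid = {"n_estimators": [20, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="average_precision",
)

# Each outer fold reruns the whole inner search on its own training split,
# so the held-out outer fold never influences hyperparameter selection.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="average_precision")
```

Keeping the tuning inside the outer loop is what makes the outer scores an unbiased estimate of performance on unseen loci.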
L2G predictions
📖 Gentropy
The trained L2G model is applied to every GWAS credible set in the Open Targets Platform. All predictions below 0.05 are filtered out and feature contributions are added using SHAP analysis as described here. These results feature as Target-Disease evidence, as well as L2G annotation for every credible set in the Platform.
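The score filter can be sketched as below. The record layout and function name are assumptions for illustration, the ranking within each credible set is added only for readability, and the SHAP step is not shown.

```python
from typing import List


def filter_predictions(predictions: List[dict], threshold: float = 0.05) -> List[dict]:
    """Drop predictions scoring below the threshold and rank the rest within each credible set."""
    kept = [p for p in predictions if p["score"] >= threshold]
    # Sort by credible set, then by descending L2G score.
    return sorted(kept, key=lambda p: (p["credible_set"], -p["score"]))


# Hypothetical predictions for two credible sets.
predictions = [
    {"credible_set": "cs1", "gene": "G1", "score": 0.82},
    {"credible_set": "cs1", "gene": "G2", "score": 0.03},
    {"credible_set": "cs2", "gene": "G3", "score": 0.05},
    {"credible_set": "cs1", "gene": "G4", "score": 0.40},
]
filtered = filter_predictions(predictions)
```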
Reference
Mountjoy, E., Schmidt, E.M., Carmona, M. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet 53, 1527–1533 (2021)