Locus-to-Gene (L2G)
Overview of the Open Targets Locus-to-Gene algorithm
The Locus-to-Gene (L2G) machine-learning algorithm ranks the most likely causal genes at each GWAS credible set, based on genetic and functional genomics features. The L2G score, which ranges from 0 to 1, measures the likelihood that a gene is causal at a particular GWAS locus.
The L2G model included in the Platform is based on the original method published by Mountjoy et al. Nature Genetics (2021), but contains several enhancements that can result in different performance.
Features
📖 Gentropy
The predictive features used by the L2G algorithm are designed to capture various genetic and genomic contexts that influence the likelihood of a gene being causal at a given GWAS credible set.
Features are divided into four categories:
distanceTssMean
Average distance between all variants in the credible set and the TSS of a gene's canonical transcript. The distance of each variant is weighted by its posterior probability.
Range: (0, 1)
distanceTssMeanNeighbourhood
Ratio between distanceTssMean for a gene and the maximum distanceTssMean for any given gene in the vicinity.
Range: (0, 1)
distanceSentinelTss
Distance between the sentinel variant and the TSS of a gene's canonical transcript.
Range: (0, 1)
distanceSentinelTssNeighbourhood
Ratio between distanceSentinelTss for a gene and the maximum distanceSentinelTss for any given gene in the vicinity.
Range: (0, 1)
distanceSentinelFootprint
Distance between the sentinel variant and a gene's footprint.
Range: (0, 1)
distanceSentinelFootprintNeighbourhood
Ratio between distanceSentinelFootprint for a gene and the maximum distanceSentinelFootprint for any given gene in the vicinity.
Range: (0, 1)
distanceFootprintMean
Average distance between all variants in the credible set and a gene's footprint. The distance of each variant is weighted by its posterior probability.
Range: (0, 1)
distanceFootprintMeanNeighbourhood
Ratio between distanceFootprintMean for a gene and the maximum distanceFootprintMean for any given gene in the vicinity.
Range: (0, 1)
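The posterior-probability weighting behind distanceTssMean can be illustrated with a minimal sketch. It assumes the weighted average is normalised by the total posterior probability of the credible set, and it returns a raw distance in base pairs; the transformation that maps the feature into the (0, 1) range is not described here and is left out. The function name and the input layout are hypothetical.

```python
def weighted_mean_distance(variants, tss_position):
    """Posterior-probability-weighted mean distance to a TSS.

    variants: list of (position, posterior_probability) tuples (hypothetical layout).
    Assumption: the weighted sum is divided by the total posterior probability.
    """
    total_pp = sum(pp for _, pp in variants)
    if total_pp == 0:
        raise ValueError("credible set has no posterior probability mass")
    weighted = sum(abs(pos - tss_position) * pp for pos, pp in variants)
    return weighted / total_pp


# Toy credible set: three variants with their posterior probabilities.
credible_set = [(1_000_100, 0.7), (1_000_900, 0.2), (1_002_000, 0.1)]
mean_distance = weighted_mean_distance(credible_set, tss_position=1_000_000)
print(round(mean_distance))  # about 450 bp: 100*0.7 + 900*0.2 + 2000*0.1
```

Variants with a high posterior probability dominate the average, so a gene whose TSS is close to the most likely causal variant gets a small distance even if the credible set contains distant low-probability variants.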
Neighbourhood features
While some features are computed independently for each gene, others describe a gene relative to the other genes in its neighbourhood (+/- 500,000 bp).
For a more detailed description of how each feature is computed, see the L2G Feature documentation.
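As a rough illustration of a neighbourhood feature, the sketch below divides each gene's feature value by the maximum value observed among nearby genes. It assumes the +/- 500,000 bp window is centred on the sentinel variant and that gene position is represented by the TSS; both are assumptions, and all function names and identifiers are hypothetical.

```python
def genes_in_window(gene_tss, sentinel_position, window=500_000):
    """Genes whose TSS lies within +/- window bp of the sentinel (assumed centre)."""
    return {
        gene for gene, tss in gene_tss.items()
        if abs(tss - sentinel_position) <= window
    }


def neighbourhood_ratios(feature_by_gene):
    """Ratio between each gene's value and the maximum value in the vicinity."""
    max_value = max(feature_by_gene.values())
    if max_value == 0:
        return {gene: 0.0 for gene in feature_by_gene}
    return {gene: value / max_value for gene, value in feature_by_gene.items()}


gene_tss = {"GENE_A": 1_000_000, "GENE_B": 1_300_000, "GENE_C": 2_000_000}
feature = {"GENE_A": 0.9, "GENE_B": 0.45, "GENE_C": 0.3}

nearby = genes_in_window(gene_tss, sentinel_position=1_100_000)
ratios = neighbourhood_ratios({g: feature[g] for g in nearby})
print(sorted(ratios.items()))
```

The gene with the highest value in the window always gets a ratio of 1, so the neighbourhood feature expresses how a gene compares with its local competitors rather than its absolute value.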
Training set
The L2G model is trained on prior knowledge of gene-trait associations collated from several sources, which is then matched to representative credible sets supporting each association. The training set is composed as follows:
Effector gene list
The effector gene list (EGL) represents the set of biologically validated gene-trait associations. This list is derived from several sources:
Manually curated gold standards (“medium” and “high” confidence) from OTG 22.10.
Gene-indication pairs for the pharmacological target in all phase III or IV clinical trials according to the latest ChEMBL release.
Gene-disease or phenotype mappings with evidence score ≥ 0.95 from ClinVar, UniProt, Gene2Phenotype, Genomics England PanelApp and ClinGen from the latest Open Targets Platform release.
To ensure the uniqueness of gene-EFO pairs, the combined list was de-duplicated.
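The combination and de-duplication step could look like the following minimal sketch. The input layout, the function name, and the identifiers are hypothetical; the only values taken from the text are the 0.95 evidence score cut-off and the de-duplication on gene-EFO pairs.

```python
def build_effector_gene_list(gold_standards, clinical_targets, scored_evidence,
                             min_score=0.95):
    """Combine sources into a de-duplicated list of (gene_id, efo_id) pairs.

    gold_standards, clinical_targets: iterables of (gene_id, efo_id).
    scored_evidence: iterable of (gene_id, efo_id, evidence_score).
    """
    pairs = set(gold_standards)
    pairs.update(clinical_targets)
    # The score cut-off only applies to the scored evidence sources.
    pairs.update((gene, efo) for gene, efo, score in scored_evidence
                 if score >= min_score)
    return sorted(pairs)


gold = [("GENE_1", "EFO_A")]
clinical = [("GENE_1", "EFO_A"), ("GENE_2", "EFO_B")]  # first pair is a duplicate
scored = [("GENE_3", "EFO_C", 0.97), ("GENE_4", "EFO_D", 0.50)]

effector_gene_list = build_effector_gene_list(gold, clinical, scored)
print(effector_gene_list)
```

Using a set keyed on the gene-EFO pair means that an association reported by several sources appears only once in the final list.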
Positives
For each gene-trait pair from the effector gene list, positive gene-credible set pairs were extracted using the following criteria:
Only protein-coding genes.
Removed any pair with a distanceSentinelTss feature value less than 0.1.
Only include gene-trait pairs supported by at least two credible sets in different studies sharing the same lead variant. We added this criterion to reduce the chance of false positives among the credible sets.
Among all positive credible set-gene pairs, we removed duplications based on functional genomics features.
Removed credible sets involved in more than two positives.
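Three of the criteria above can be sketched as a simple filter. This is one possible reading, not the actual pipeline: it assumes replication is counted as distinct studies per gene, trait and lead variant, and that protein-coding genes are flagged with the biotype string "protein_coding". The record layout and function name are hypothetical, and the de-duplication on functional genomics features and the cap on positives per credible set are not covered.

```python
from collections import defaultdict


def select_positives(candidates, min_distance_feature=0.1, min_studies=2):
    """Filter candidate gene-credible set pairs (hypothetical dict records)."""
    # Keep protein-coding genes whose distanceSentinelTss feature is >= 0.1.
    kept = [
        c for c in candidates
        if c["biotype"] == "protein_coding"
        and c["distanceSentinelTss"] >= min_distance_feature
    ]
    # Count distinct studies per (gene, trait, lead variant).
    studies = defaultdict(set)
    for c in kept:
        studies[(c["gene_id"], c["trait_id"], c["lead_variant"])].add(c["study_id"])
    return [
        c for c in kept
        if len(studies[(c["gene_id"], c["trait_id"], c["lead_variant"])]) >= min_studies
    ]


def row(gene, study, variant, biotype="protein_coding", feature=0.9):
    return {"gene_id": gene, "trait_id": "T1", "study_id": study,
            "lead_variant": variant, "biotype": biotype,
            "distanceSentinelTss": feature}


candidates = [
    row("G1", "S1", "V1"),
    row("G1", "S2", "V1"),                 # replicates G1 in a second study
    row("G2", "S1", "V2"),                 # only one study: dropped
    row("G3", "S1", "V3", biotype="lncRNA"),  # not protein-coding: dropped
    row("G1", "S3", "V1", feature=0.05),   # feature below 0.1: dropped
]
positives = select_positives(candidates)
print([(p["gene_id"], p["study_id"]) for p in positives])
```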
Negatives
All other protein-coding genes in the window are classified as negatives as long as they don't have a strong functional interaction (STRINGdb score > 0.8) with any positive genes for the trait.
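The negative selection rule can be sketched as follows, assuming the genes in the window have already been restricted to protein-coding genes and that STRINGdb scores are available as a lookup keyed on unordered gene pairs. The function name and data layout are hypothetical; the 0.8 cut-off comes from the text.

```python
def select_negatives(window_genes, positive_genes, string_scores,
                     max_interaction=0.8):
    """Genes in the window that are not positives and do not interact strongly
    (score > max_interaction) with any positive gene for the trait."""
    negatives = []
    for gene in window_genes:
        if gene in positive_genes:
            continue
        interacts = any(
            string_scores.get(frozenset((gene, positive)), 0.0) > max_interaction
            for positive in positive_genes
        )
        if not interacts:
            negatives.append(gene)
    return negatives


window_genes = ["G1", "G2", "G3", "G4"]
positive_genes = {"G1"}
string_scores = {
    frozenset(("G1", "G2")): 0.95,  # strong interaction: G2 is excluded
    frozenset(("G1", "G3")): 0.40,  # weak interaction: G3 stays a negative
}
negatives = select_negatives(window_genes, positive_genes, string_scores)
print(negatives)
```

Excluding strong interactors avoids labelling as negative a gene that may act in the same pathway as a known effector gene.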
The L2G model
📖 Gentropy
The Locus-to-Gene model is trained for every data release using the training set and feature matrix described above. Similarly to Mountjoy et al., the model is trained with a gradient-boosting algorithm from the scikit-learn library, using nested cross-validation and hyperparameter tuning.
The L2G model will be updated in each release to include new features and/or fixes. You should expect some variation in the prediction scores with each release.
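For readers unfamiliar with nested cross-validation, here is a minimal scikit-learn sketch on synthetic data. The estimator (GradientBoostingClassifier), the parameter grid, the fold counts and the scoring metric are all assumptions chosen for illustration; they are not the settings of the production model.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix: 80 gene-credible set pairs, 4 features.
rng = np.random.default_rng(0)
X = rng.random((80, 4))
y = (X[:, 0] + 0.1 * rng.standard_normal(80) > 0.5).astype(int)

# Inner loop tunes hyperparameters; outer loop estimates generalisation.
inner_cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},  # illustrative grid
    cv=inner_cv,
    scoring="average_precision",
)

# Each outer fold refits the whole search, so tuning never sees the held-out data.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="average_precision")
print(outer_scores.mean())
```

Keeping the hyperparameter search inside the outer loop prevents the performance estimate from being inflated by tuning on the same data used for evaluation.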
L2G predictions
📖 Gentropy
The trained L2G model is applied to every GWAS credible set in the Open Targets Platform. All predictions below 0.05 are filtered out and feature contributions are added using SHAP analysis as described here. These results feature as Target-Disease evidence, as well as L2G annotation for every credible set in the Platform.
The GWAS Associations evidence set is constructed based on GWAS credible sets linking to a protein-coding gene with an L2G score higher than 0.05.
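The score filter can be illustrated with a short sketch. It follows the statement that predictions below 0.05 are filtered out, so a score of exactly 0.05 is kept here; that boundary handling is an assumption. The record layout and function name are hypothetical, and the SHAP step is not shown.

```python
def filter_predictions(predictions, threshold=0.05):
    """Drop predictions below the threshold and rank genes within each credible set."""
    kept = [p for p in predictions if p["score"] >= threshold]
    # Sort by credible set, then by descending L2G score.
    return sorted(kept, key=lambda p: (p["credible_set_id"], -p["score"]))


predictions = [
    {"credible_set_id": "cs1", "gene_id": "G1", "score": 0.80},
    {"credible_set_id": "cs1", "gene_id": "G2", "score": 0.01},  # filtered out
    {"credible_set_id": "cs1", "gene_id": "G3", "score": 0.05},
    {"credible_set_id": "cs2", "gene_id": "G4", "score": 0.30},
]
ranked = filter_predictions(predictions)
print([(p["credible_set_id"], p["gene_id"]) for p in ranked])
```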
Reference
Mountjoy, E., Schmidt, E.M., Carmona, M. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet 53, 1527–1533 (2021)