Locus-to-Gene (L2G)

Overview of the Open Targets Locus-to-Gene algorithm

Based on genetic and functional genomics features, the Locus-to-Gene (L2G) machine-learning algorithm ranks the most likely causal genes at each GWAS credible set. The L2G score, which ranges from 0 to 1, measures the likelihood that a gene is causal for a particular GWAS locus.

Features

📖 Gentropy

The predictive features used by the L2G algorithm are designed to capture various genetic and genomic contexts that influence the likelihood of a gene being causal at a given GWAS credible set.

Features are divided into four categories:

| Feature Name | Description | Range |
| --- | --- | --- |
| distanceTssMean | Average distance between all variants in the credible set and the TSS of a gene's canonical transcript. The distance of each variant is weighted by its posterior probability | (0, 1) |
| distanceTssMeanNeighbourhood | Ratio between distanceTssMean for a gene and the maximum distanceTssMean for any given gene in the vicinity | (0, 1) |
| distanceSentinelTss | Distance between the sentinel variant and the TSS of a gene's canonical transcript | (0, 1) |
| distanceSentinelTssNeighbourhood | Ratio between distanceSentinelTss for a gene and the maximum distanceSentinelTss for any given gene in the vicinity | (0, 1) |
| distanceSentinelFootprint | Distance between sentinel variant and a gene's footprint | (0, 1) |
| distanceSentinelFootprintNeighbourhood | Ratio between distanceSentinelFootprint for a gene and the maximum distanceSentinelFootprint for any given gene in the vicinity | (0, 1) |
| distanceFootprintMean | Average distance between all variants in the credible set and a gene's footprint. The distance of each variant is weighted by its posterior probability | (0, 1) |
| distanceFootprintMeanNeighbourhood | Ratio between distanceFootprintMean for a gene and the maximum distanceFootprintMean for any given gene in the vicinity | (0, 1) |
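
As a rough illustration of the posterior-probability weighting described for distanceTssMean and distanceFootprintMean, the sketch below averages per-variant distances using each variant's posterior probability as its weight. The function name, the raw base-pair inputs, and the choice to normalise by the sum of posterior probabilities are assumptions made for illustration. The transformation that maps distances into the documented (0, 1) range is not described on this page and is not reproduced here.

```python
def weighted_mean_distance(distances, posterior_probabilities):
    """Posterior-probability-weighted mean of per-variant distances.

    Hypothetical helper: `distances` holds the raw distance (bp) from each
    variant in a credible set to a gene's TSS (or footprint), and
    `posterior_probabilities` holds each variant's posterior probability.
    Scaling to the documented (0, 1) range is deliberately omitted.
    """
    if len(distances) != len(posterior_probabilities):
        raise ValueError("one posterior probability is needed per variant")
    total_weight = sum(posterior_probabilities)
    if total_weight == 0:
        raise ValueError("posterior probabilities sum to zero")
    weighted_sum = sum(
        d * pp for d, pp in zip(distances, posterior_probabilities)
    )
    return weighted_sum / total_weight


# Made-up credible set with three variants, for illustration only.
distances_bp = [1_000, 5_000, 20_000]
posteriors = [0.5, 0.25, 0.25]
mean_distance = weighted_mean_distance(distances_bp, posteriors)
```

Variants with a higher posterior probability pull the average towards their own distance, so a gene close to the most probable causal variant gets a smaller weighted distance than an unweighted mean would give.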

Neighbourhood features

While some features are computed independently for each gene, others describe a gene relative to the other genes in its neighbourhood (+/- 500,000 bp).
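
The Neighbourhood variants of the distance features are described as a ratio between a gene's value and the maximum value among genes in the vicinity. A minimal sketch of that ratio is shown below, assuming the per-gene values for one credible set are already available. The function name, the dictionary layout, and the gene identifiers are hypothetical.

```python
def neighbourhood_ratio(feature_by_gene):
    """Divide each gene's feature value by the maximum value at the locus.

    `feature_by_gene` is a hypothetical mapping of gene ID -> feature value
    for all genes within the +/- 500,000 bp window of one credible set.
    """
    max_value = max(feature_by_gene.values())
    if max_value == 0:
        # Avoid division by zero when every gene scores 0 (assumed handling).
        return {gene: 0.0 for gene in feature_by_gene}
    return {gene: value / max_value for gene, value in feature_by_gene.items()}


# Made-up feature values for three genes near one credible set.
scores = {"GENE_A": 0.8, "GENE_B": 0.4, "GENE_C": 0.2}
ratios = neighbourhood_ratio(scores)
```

The gene with the largest value at the locus always receives a ratio of 1, and every other gene is expressed relative to it.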

For a more detailed description of how each feature is computed, see the L2G Feature documentation.

Training set

The L2G model is trained on prior knowledge of gene-trait associations collated from different sources, which are then matched to representative credible sets supporting each association. The following steps describe how the training set is composed:

Effector gene list

The effector gene list (EGL) represents the set of biologically validated gene-trait associations. This list is derived from several sources:

  1. Manually curated gold standards (“medium” and “high” confidence) from OTG 22.10.

  2. Gene-indication pairs for the pharmacological target in all phase III or IV clinical trials according to the latest ChEMBL release.

  3. Gene-disease or phenotype mappings with evidence score ≥ 0.95 from ClinVar, UniProt, Gene2Phenotype, Genomics England PanelApp and ClinGen from the latest Open Targets Platform release.

To ensure the uniqueness of gene-EFO pairs, the combined list was de-duplicated.
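
The combination and de-duplication step can be pictured as a union of gene-EFO pairs across the sources. The sketch below is an illustration only: the function name, the tuple layout, and the placeholder identifiers are assumptions, and the per-source filters (confidence level, clinical phase, evidence score) are assumed to have been applied already.

```python
def build_effector_gene_list(*sources):
    """Combine gene-EFO pairs from several sources, keeping each pair once.

    Each source is a hypothetical list of (gene_id, efo_id) tuples that has
    already passed its own source-specific filter.
    """
    seen = set()
    combined = []
    for source in sources:
        for gene_id, efo_id in source:
            pair = (gene_id, efo_id)
            if pair not in seen:
                seen.add(pair)
                combined.append(pair)
    return combined


# Placeholder identifiers, not real genes or EFO terms.
gold_standards = [("GENE_A", "EFO_1"), ("GENE_B", "EFO_2")]
chembl_pairs = [("GENE_A", "EFO_1"), ("GENE_C", "EFO_3")]
platform_pairs = [("GENE_B", "EFO_2"), ("GENE_D", "EFO_1")]
egl = build_effector_gene_list(gold_standards, chembl_pairs, platform_pairs)
```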

Positives

For each gene-trait pair from the effector gene list, positive gene-credible set pairs were extracted using the following criteria:

  1. Only protein-coding genes were included.

  2. Pairs with a distanceSentinelTss feature less than 0.1 were removed.

  3. Only gene-trait pairs supported by at least two credible sets in different studies sharing the same lead variant were included. This criterion reduces the chance of false positives among the credible sets.

  4. Among all positive credible set-gene pairs, duplicates were removed based on functional genomics features.

  5. Credible sets involved in more than two positives were removed.
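
The filtering criteria above can be sketched as a sequence of passes over candidate gene-credible set pairs. This is one possible reading, not the pipeline's implementation: the record fields, the function name, and the grouping used for criterion 3 (same gene, trait, and lead variant, counted across distinct studies) are assumptions. Criterion 4 is left out because the de-duplication rule is not specified in enough detail here.

```python
from collections import Counter, defaultdict


def select_positives(candidates, min_distance_score=0.1, min_studies=2,
                     max_positives_per_set=2):
    """Apply criteria 1, 2, 3 and 5 to hypothetical candidate records."""
    # Criteria 1 and 2: protein-coding genes with distanceSentinelTss >= 0.1.
    kept = [
        c for c in candidates
        if c["is_protein_coding"]
        and c["distance_sentinel_tss"] >= min_distance_score
    ]

    # Criterion 3 (assumed reading): the same gene, trait and lead variant
    # must be seen in at least two different studies.
    studies = defaultdict(set)
    for c in kept:
        studies[(c["gene_id"], c["trait_id"], c["lead_variant"])].add(c["study_id"])
    kept = [
        c for c in kept
        if len(studies[(c["gene_id"], c["trait_id"], c["lead_variant"])]) >= min_studies
    ]

    # Criterion 5: drop credible sets involved in more than two positives.
    per_set = Counter(c["credible_set_id"] for c in kept)
    return [c for c in kept if per_set[c["credible_set_id"]] <= max_positives_per_set]


def row(cs, study, lead, gene, trait, coding, score):
    """Build one made-up candidate record."""
    return {"credible_set_id": cs, "study_id": study, "lead_variant": lead,
            "gene_id": gene, "trait_id": trait, "is_protein_coding": coding,
            "distance_sentinel_tss": score}


candidates = [
    row("cs1", "S1", "V1", "GENE_A", "EFO_1", True, 0.9),
    row("cs2", "S2", "V1", "GENE_A", "EFO_1", True, 0.8),
    row("cs3", "S3", "V2", "GENE_B", "EFO_1", True, 0.05),   # fails criterion 2
    row("cs4", "S4", "V3", "GENE_C", "EFO_2", True, 0.7),    # fails criterion 3
    row("cs5", "S5", "V4", "GENE_D", "EFO_3", False, 0.9),   # fails criterion 1
]
positives = select_positives(candidates)
positive_ids = [p["credible_set_id"] for p in positives]
```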

Negatives

All other protein-coding genes in the window are classified as negatives as long as they don't have a strong functional interaction (STRINGdb score > 0.8) with any positive genes for the trait.
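
A minimal sketch of this negative selection is shown below. It assumes the genes in the window, the positive genes for the trait, and pairwise STRINGdb scores are already loaded into plain Python structures. The function name, the data layout, and the gene identifiers are hypothetical.

```python
def select_negatives(genes_in_window, positive_genes, string_scores,
                     max_interaction=0.8):
    """Return protein-coding genes that can serve as negatives.

    genes_in_window: list of (gene_id, is_protein_coding) tuples.
    positive_genes:  set of gene IDs that are positives for the trait.
    string_scores:   dict mapping frozenset({gene_a, gene_b}) -> STRINGdb
                     score; missing pairs are treated as no interaction.
    """
    negatives = []
    for gene_id, is_protein_coding in genes_in_window:
        if not is_protein_coding or gene_id in positive_genes:
            continue
        # Exclude genes with a strong interaction (> 0.8) with any positive.
        interacts = any(
            string_scores.get(frozenset((gene_id, positive)), 0.0) > max_interaction
            for positive in positive_genes
        )
        if not interacts:
            negatives.append(gene_id)
    return negatives


# Made-up locus: GENE_A is the positive, GENE_B interacts strongly with it.
window = [("GENE_A", True), ("GENE_B", True), ("GENE_C", True), ("GENE_D", False)]
interactions = {
    frozenset(("GENE_A", "GENE_B")): 0.95,
    frozenset(("GENE_A", "GENE_C")): 0.30,
}
negatives = select_negatives(window, {"GENE_A"}, interactions)
```

Excluding strong interactors keeps genes that may be biologically linked to a positive out of the negative class, where they could otherwise act as mislabelled examples.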

The L2G model

📖 Gentropy

The Locus-to-Gene model is trained for every data release using the training set and feature matrix described above. As in Mountjoy et al., the model uses a gradient-boosting algorithm from the scikit-learn library, with nested cross-validation and hyperparameter tuning.
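
For readers unfamiliar with nested cross-validation, the sketch below shows the general pattern with scikit-learn: an inner loop tunes hyperparameters while an outer loop estimates performance on held-out folds. It is not the Gentropy training code. The synthetic data, the hyperparameter grid, the fold counts, and the scoring metric are all assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for a feature matrix and labels (NOT Open Targets data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid; the values used in the real pipeline are not given here.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Inner loop: pick the best hyperparameters on each outer training fold.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="average_precision",
)

# Outer loop: score the tuned model on folds it never saw during tuning.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="average_precision")
mean_score = float(np.mean(outer_scores))
```

Keeping tuning inside the inner loop means the outer scores are not inflated by hyperparameters that were selected on the same data used for evaluation.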

The L2G model will be updated in each release to include new features and/or fixes. You should expect some variation in the prediction scores with each release.

L2G predictions

📖 Gentropy

The trained L2G model is applied to every GWAS credible set in the Open Targets Platform. All predictions below 0.05 are filtered out and feature contributions are added using SHAP analysis as described here. These results feature as Target-Disease evidence, as well as L2G annotation for every credible set in the Platform.

The GWAS Associations evidence set is constructed based on GWAS credible sets linking to a protein-coding gene with an L2G score higher than 0.05.
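
The two thresholding rules can be sketched as simple filters over prediction records. The record layout and function names below are hypothetical, and the boundary handling mirrors the wording of the text: predictions below 0.05 are dropped, while evidence requires a score higher than 0.05.

```python
L2G_THRESHOLD = 0.05


def filter_predictions(predictions, threshold=L2G_THRESHOLD):
    """Drop predictions that score below the threshold."""
    return [p for p in predictions if p["score"] >= threshold]


def gwas_evidence(predictions, threshold=L2G_THRESHOLD):
    """Keep protein-coding genes with a score higher than the threshold."""
    return [
        p for p in predictions
        if p["is_protein_coding"] and p["score"] > threshold
    ]


# Made-up predictions for a single credible set.
predictions = [
    {"gene_id": "GENE_A", "score": 0.90, "is_protein_coding": True},
    {"gene_id": "GENE_B", "score": 0.05, "is_protein_coding": True},
    {"gene_id": "GENE_C", "score": 0.01, "is_protein_coding": True},
    {"gene_id": "GENE_D", "score": 0.40, "is_protein_coding": False},
]
kept = [p["gene_id"] for p in filter_predictions(predictions)]
evidence = [p["gene_id"] for p in gwas_evidence(predictions)]
```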

Reference

Mountjoy, E., Schmidt, E.M., Carmona, M. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet 53, 1527–1533 (2021)
