
Locus-to-Gene (L2G)

Overview of the Open Targets Locus-to-Gene algorithm

Using genetic and functional genomics features, the Locus-to-Gene (L2G) machine-learning algorithm ranks the genes most likely to be causal at each GWAS credible set. The L2G score, which ranges from 0 to 1, measures the likelihood that a gene is causal for a particular GWAS locus.

The L2G model included in the Platform is based on the original method published by Mountjoy et al. in Nature Genetics (2021), but it includes several enhancements, so its performance may differ from that of the published model.

Features

📖 Gentropy

The predictive features used by the L2G algorithm are designed to capture various genetic and genomic contexts that influence the likelihood of a gene being causal at a given GWAS credible set.

Features are divided into four categories:

| Feature Name | Description | Range |
| --- | --- | --- |
| distanceTssMean | Average distance between all variants in the credible set and the TSS of a gene's canonical transcript. The distance of each variant is weighted by its posterior probability | (0, 1) |
| distanceTssMeanNeighbourhood | Ratio between distanceTssMean for a gene and the maximum distanceTssMean for any given gene in the vicinity | (0, 1) |
| distanceSentinelTss | Distance between the sentinel variant and the TSS of a gene's canonical transcript | (0, 1) |
| distanceSentinelTssNeighbourhood | Ratio between distanceSentinelTss for a gene and the maximum distanceSentinelTss for any given gene in the vicinity | (0, 1) |
| distanceSentinelFootprint | Distance between the sentinel variant and a gene's footprint | (0, 1) |
| distanceSentinelFootprintNeighbourhood | Ratio between distanceSentinelFootprint for a gene and the maximum distanceSentinelFootprint for any given gene in the vicinity | (0, 1) |
| distanceFootprintMean | Average distance between all variants in the credible set and a gene's footprint. The distance of each variant is weighted by its posterior probability | (0, 1) |
| distanceFootprintMeanNeighbourhood | Ratio between distanceFootprintMean for a gene and the maximum distanceFootprintMean for any given gene in the vicinity | (0, 1) |
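The posterior-probability weighting used by the two "Mean" distance features can be illustrated with a minimal sketch. The input layout and function name are hypothetical, and dividing by the sum of posteriors is an assumption about the weighting. The result is in raw base pairs; how the pipeline scales distances into the (0, 1) range is not described here, so it is not shown.

```python
def weighted_mean_distance(variants):
    """Posterior-probability-weighted mean distance, in base pairs.

    `variants` is a list of (distance_bp, posterior_probability) tuples for
    one credible set and one gene. This layout is a hypothetical example,
    not the Gentropy schema.
    """
    total_weight = sum(pp for _, pp in variants)
    if total_weight == 0:
        raise ValueError("posterior probabilities sum to zero")
    # Assumption: the weighted sum is normalised by the sum of the weights.
    return sum(dist * pp for dist, pp in variants) / total_weight


# Three variants: distance to the gene's TSS and posterior probability.
credible_set = [(1_000, 0.7), (20_000, 0.2), (150_000, 0.1)]
mean_distance = weighted_mean_distance(credible_set)
print(round(mean_distance))
```

Variants with a high posterior probability dominate the average, so a distant low-probability variant has little effect on the feature.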

| Feature Name | Description | Range |
| --- | --- | --- |
| eQtlColocClppMaximum | Maximum CLPP across all eQTL studies for a gene | (0, 1) |
| pQtlColocClppMaximum | Maximum CLPP across all pQTL studies for a gene | (0, 1) |
| sQtlColocClppMaximum | Maximum CLPP across all sQTL and tuQTL studies for a gene | (0, 1) |
| eQtlColocH4Maximum | Maximum H4 across all eQTL studies for a gene | (0, 1) |
| pQtlColocH4Maximum | Maximum H4 across all pQTL studies for a gene | (0, 1) |
| sQtlColocH4Maximum | Maximum H4 across all sQTL and tuQTL studies for a gene | (0, 1) |
| eQtlColocClppMaximumNeighbourhood | Ratio between eQtlColocClppMaximum for a gene and the maximum eQtlColocClppMaximum for any protein-coding gene in the vicinity | (0, 1) |
| pQtlColocClppMaximumNeighbourhood | Ratio between pQtlColocClppMaximum for a gene and the maximum pQtlColocClppMaximum for any protein-coding gene in the vicinity | (0, 1) |
| sQtlColocClppMaximumNeighbourhood | Ratio between sQtlColocClppMaximum for a gene and the maximum sQtlColocClppMaximum for any protein-coding gene in the vicinity | (0, 1) |
| eQtlColocH4MaximumNeighbourhood | Ratio between eQtlColocH4Maximum for a gene and the maximum eQtlColocH4Maximum for any protein-coding gene in the vicinity | (0, 1) |
| pQtlColocH4MaximumNeighbourhood | Ratio between pQtlColocH4Maximum for a gene and the maximum pQtlColocH4Maximum for any protein-coding gene in the vicinity | (0, 1) |
| sQtlColocH4MaximumNeighbourhood | Ratio between sQtlColocH4Maximum for a gene and the maximum sQtlColocH4Maximum for any protein-coding gene in the vicinity | (0, 1) |

| Feature Name | Description | Range |
| --- | --- | --- |
| vepMaximum | Maximum VEP score across all variants in the credible set | (0, 1) |
| vepMaximumNeighbourhood | Ratio between vepMaximum for a gene and the maximum vepMaximum for any protein-coding gene in the vicinity | (0, 1) |
| vepMean | Average VEP score across all variants in the credible set. The score of each variant is weighted by its posterior probability | (0, 1) |
| vepMeanNeighbourhood | Ratio between vepMean for a gene and the maximum vepMean for any protein-coding gene in the vicinity | (0, 1) |

| Feature Name | Description | Range |
| --- | --- | --- |
| geneCount500kb | Number of genes within 250kb upstream and downstream of the sentinel variant of a credible set | (0, 60000) |
| proteinGeneCount500kb | Number of protein-coding genes within 250kb upstream and downstream of the sentinel variant of a credible set | (0, 60000) |
| credibleSetConfidence | Degree of confidence assigned to the credible set definition based on the fine-mapping methodology: 1, when fine-mapped with SuSiE using in-sample LD; 0.75, when fine-mapped with SuSiE using out-of-sample LD; 0.5, when fine-mapped with PICS and the locus is based on the analysis of summary statistics; 0.25, when fine-mapped with PICS and the locus was reported as a top hit according to the GWAS Catalog | (0, 1) |
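The four credibleSetConfidence values can be read as a simple lookup. The sketch below is only an illustration of that mapping: the function name and its arguments are hypothetical and do not reflect how Gentropy stores fine-mapping metadata.

```python
def credible_set_confidence(method, in_sample_ld=False, from_summary_statistics=False):
    """Return the confidence value for a credible set.

    Hypothetical helper: `method` is "SuSiE" or "PICS"; the two flags
    describe the LD reference (SuSiE) or the origin of the locus (PICS).
    """
    if method == "SuSiE":
        # In-sample LD is trusted more than out-of-sample LD.
        return 1.0 if in_sample_ld else 0.75
    if method == "PICS":
        # Loci from summary statistics rank above GWAS Catalog top hits.
        return 0.5 if from_summary_statistics else 0.25
    raise ValueError(f"unknown fine-mapping method: {method!r}")


print(credible_set_confidence("SuSiE", in_sample_ld=True))
print(credible_set_confidence("PICS"))
```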

Neighbourhood features

While some features are computed independently for each gene, others describe how a gene compares with the other genes in its neighbourhood (+/- 500,000 bp).
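As a rough illustration of a neighbourhood feature, the sketch below divides each gene's value by the maximum value among the genes in the vicinity, following the "Ratio between ... and the maximum ..." wording in the feature tables. The function name, the gene labels, and the handling of an all-zero neighbourhood are assumptions; the input is assumed to already contain only the relevant genes in the window.

```python
def neighbourhood_ratio(feature_by_gene):
    """Scale each gene's feature by the maximum in its neighbourhood.

    `feature_by_gene` maps gene identifiers to a non-negative feature value
    (for example eQtlColocH4Maximum) for every gene in the window.
    """
    max_value = max(feature_by_gene.values())
    if max_value == 0:
        # Assumption: with no signal anywhere, every ratio is set to 0.
        return {gene: 0.0 for gene in feature_by_gene}
    return {gene: value / max_value for gene, value in feature_by_gene.items()}


# Hypothetical H4 maxima for three genes around one credible set.
h4_by_gene = {"GENE_A": 0.9, "GENE_B": 0.45, "GENE_C": 0.0}
ratios = neighbourhood_ratio(h4_by_gene)
print(ratios)
```

The gene with the strongest signal in the window always receives a ratio of 1, which lets the model distinguish "best in the locus" from "high in absolute terms".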

For a more detailed description of how each feature is computed, see the L2G Feature documentation.

Training set

The L2G model is trained on prior knowledge of gene-trait associations collated from several sources. These associations are then matched to representative credible sets that support them. The following steps describe how the training set is composed:

Effector gene list

The effector gene list (EGL) represents the set of biologically validated gene-trait associations. This list is derived from several sources:

  1. Manually curated gold standards (“medium” and “high” confidence) from OTG 22.10.

  2. Gene-indication pairs for the pharmacological target in all phase III or IV clinical trials according to the latest ChEMBL release.

  3. Gene-disease or phenotype mappings with evidence score ≥ 0.95 from ClinVar, UniProt, Gene2Phenotype, Genomics England PanelApp and ClinGen from the latest Open Targets Platform release.

To ensure the uniqueness of gene-EFO pairs, the combined list was de-duplicated.
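The de-duplication step can be sketched as follows. This is a minimal illustration, assuming each source has already been reduced to (gene, EFO) pairs; the function name and the placeholder identifiers are hypothetical.

```python
def build_effector_gene_list(*sources):
    """Combine several lists of (gene_id, efo_id) pairs, keeping each pair once.

    The first occurrence of a pair is kept and the input order is preserved.
    """
    seen = set()
    combined = []
    for source in sources:
        for gene_id, efo_id in source:
            pair = (gene_id, efo_id)
            if pair not in seen:
                seen.add(pair)
                combined.append(pair)
    return combined


# Placeholder identifiers standing in for the three kinds of source.
gold_standards = [("GENE_1", "EFO_A"), ("GENE_2", "EFO_B")]
clinical_trials = [("GENE_2", "EFO_B"), ("GENE_3", "EFO_A")]
curated_mappings = [("GENE_1", "EFO_A"), ("GENE_1", "EFO_C")]

effector_gene_list = build_effector_gene_list(
    gold_standards, clinical_trials, curated_mappings
)
print(effector_gene_list)
```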

Positives

For each gene-trait pair from the effector gene list, positive gene-credible set pairs were extracted using the following criteria:

  1. Only protein-coding genes were included.

  2. Pairs with a distanceSentinelTss feature below 0.1 were removed.

  3. Only gene-trait pairs supported by at least two credible sets from different studies sharing the same lead variant were included. This criterion reduces the chance of false positives among the credible sets.

  4. Among all positive credible set-gene pairs, duplicates were removed based on functional genomics features.

  5. Credible sets involved in more than two positives were removed.
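Criterion 3 can be sketched as a grouping step. The code below is one possible reading, in which a gene-trait pair qualifies when a single lead variant is reported by at least two distinct studies. The record fields and the function name are hypothetical.

```python
from collections import defaultdict


def supported_pairs(records, min_studies=2):
    """Return gene-trait pairs backed by enough independent studies.

    `records` is an iterable of dicts with the hypothetical keys
    gene_id, trait_id, study_id and lead_variant, one per credible set.
    """
    studies_by_key = defaultdict(set)
    for record in records:
        key = (record["gene_id"], record["trait_id"], record["lead_variant"])
        studies_by_key[key].add(record["study_id"])
    # Keep the pair if any single lead variant is seen in enough studies.
    return {
        (gene_id, trait_id)
        for (gene_id, trait_id, _), studies in studies_by_key.items()
        if len(studies) >= min_studies
    }


records = [
    {"gene_id": "GENE_1", "trait_id": "EFO_A", "study_id": "S1", "lead_variant": "V1"},
    {"gene_id": "GENE_1", "trait_id": "EFO_A", "study_id": "S2", "lead_variant": "V1"},
    {"gene_id": "GENE_2", "trait_id": "EFO_B", "study_id": "S1", "lead_variant": "V2"},
    {"gene_id": "GENE_2", "trait_id": "EFO_B", "study_id": "S3", "lead_variant": "V3"},
]
print(supported_pairs(records))
```

In this example only the first pair is kept, because the second pair's two credible sets have different lead variants.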

Negatives

All other protein-coding genes in the window are classified as negatives as long as they don't have a strong functional interaction (STRINGdb score > 0.8) with any positive genes for the trait.
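The negative selection rule can be sketched as a filter over the genes in the window. This is an illustration only: the function name is hypothetical, the interaction scores are assumed to be available as a mapping keyed by unordered gene pairs, and the window is assumed to contain only protein-coding genes.

```python
def select_negatives(window_genes, positive_genes, string_scores, threshold=0.8):
    """Pick negative genes for one credible set.

    `string_scores` maps frozenset({gene_a, gene_b}) to an interaction score;
    missing pairs are treated as having no interaction (an assumption).
    """
    negatives = []
    for gene in window_genes:
        if gene in positive_genes:
            continue
        interacts_with_positive = any(
            string_scores.get(frozenset((gene, positive)), 0.0) > threshold
            for positive in positive_genes
        )
        if not interacts_with_positive:
            negatives.append(gene)
    return negatives


window_genes = ["GENE_1", "GENE_2", "GENE_3", "GENE_4"]
positive_genes = {"GENE_1"}
string_scores = {
    frozenset(("GENE_1", "GENE_2")): 0.95,  # strong interaction: excluded
    frozenset(("GENE_1", "GENE_3")): 0.40,  # weak interaction: kept
}
negatives = select_negatives(window_genes, positive_genes, string_scores)
print(negatives)
```

Excluding strong interaction partners of a positive gene avoids labelling genes as negatives when they may plausibly act in the same pathway.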

The L2G model

📖 Gentropy

The Locus-to-Gene model is trained for every data release using the training set and feature matrix described above. As in Mountjoy et al., the model uses a gradient-boosting algorithm. It is trained with the scikit-learn library, using nested cross-validation and hyperparameter tuning.
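For readers unfamiliar with the approach, the following is a generic sketch of nested cross-validation around a scikit-learn gradient-boosting classifier. It is not the Gentropy training code: the synthetic data, the parameter grid, the number of folds, and the scoring metric are all assumptions chosen to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic stand-in for the feature matrix (rows: credible set-gene pairs).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Inner loop tunes hyperparameters; outer loop estimates performance.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

# Hypothetical grid, kept deliberately small.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid=param_grid,
    cv=inner_cv,
    scoring="average_precision",
)

# Each outer fold reruns the full inner search on its training portion.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="average_precision")

# Final model: tune on all data, then score with the positive-class probability.
final_model = search.fit(X, y).best_estimator_
scores = final_model.predict_proba(X)[:, 1]
print(outer_scores.mean())
```

Keeping hyperparameter tuning inside the inner loop means the outer-fold scores are not biased by the tuning itself, which is the main reason for nesting.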

The L2G model will be updated in each release to include new features and/or fixes. You should expect some variation in the prediction scores with each release.

L2G predictions

📖 Gentropy

The trained L2G model is applied to every GWAS credible set in the Open Targets Platform. Predictions below 0.05 are filtered out, and feature contributions are added using SHAP analysis. These results appear as Target-Disease evidence and as the L2G annotation for every credible set in the Platform.

The GWAS Associations evidence set is constructed based on GWAS credible sets linking to a protein-coding gene with an L2G score higher than 0.05.
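The two thresholding rules can be sketched as simple filters. The record layout and function names are hypothetical, and the boundary handling follows the wording literally: predictions "below 0.05" are dropped, while evidence requires a score "higher than 0.05".

```python
L2G_THRESHOLD = 0.05


def keep_predictions(predictions, threshold=L2G_THRESHOLD):
    """Drop predictions whose score is below the threshold."""
    return [p for p in predictions if p["score"] >= threshold]


def gwas_evidence(predictions, threshold=L2G_THRESHOLD):
    """Keep protein-coding genes with a score strictly above the threshold."""
    return [
        p
        for p in predictions
        if p["biotype"] == "protein_coding" and p["score"] > threshold
    ]


# Hypothetical predictions for one credible set.
predictions = [
    {"gene_id": "GENE_1", "biotype": "protein_coding", "score": 0.82},
    {"gene_id": "GENE_2", "biotype": "lncRNA", "score": 0.30},
    {"gene_id": "GENE_3", "biotype": "protein_coding", "score": 0.01},
]
kept = keep_predictions(predictions)
evidence = gwas_evidence(predictions)
print([p["gene_id"] for p in kept], [p["gene_id"] for p in evidence])
```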

Reference

Mountjoy, E., Schmidt, E.M., Carmona, M. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet 53, 1527–1533 (2021)

The L2G model is available on Hugging Face (https://huggingface.co/opentargets/locus_to_gene) and in the FTP location.