Manually-curated safety data to support target prioritisation
Throughout the drug discovery and development process, target safety assessments help researchers understand the role of the drug target in normal physiology, as well as the potential unintended adverse consequences and safety liabilities of modulating the target with a chemical compound or drug (Brennan, 2017).
To support target prioritisation, we have manually curated experimental data and insights from publications and other well-known sources of target safety and toxicity data, including ToxCast, AOPWiki, and PharmGKB (via Open Targets downstream analysis of these toxicity datasets).
Safety data is available on the target profile page and can be accessed to provide a systematic view of potentially relevant target safety liabilities.
Target safety datasets are mapped to the correct Ensembl gene ID and ingested during our initial pipeline steps to enrich the target annotation object. The data is available for download as part of the target core annotation from our data download page.
Bowes J, Brown AJ, Hamon J, Jarolimek W, Sridhar A, Waldron G, Whitebread S. Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nat Rev Drug Discov. 2012 Dec;11(12):909-22. doi: 10.1038/nrd3845. PMID: 23197038.
Brennan R.J. (2017) Target Safety Assessment: Strategies and Resources. In: Gautier JC. (eds) Drug Safety Evaluation. Methods in Molecular Biology, vol 1641. Humana Press, New York, NY. doi: 10.1007/978-1-4939-7172-5_12
Force T, Kolaja KL. Cardiotoxicity of kinase inhibitors: the prediction and translation of preclinical models to clinical outcomes. Nat Rev Drug Discov. 2011 Feb;10(2):111-26. doi: 10.1038/nrd3252. PMID: 21283106.
Ann M. Richard, Richard S. Judson, Keith A. Houck, Christopher M. Grulke, Patra Volarath, Inthirany Thillainadarajah, Chihae Yang, James Rathman, Matthew T. Martin, John F. Wambaugh, Thomas B. Knudsen, Jayaram Kancherla, Kamel Mansouri, Grace Patlewicz, Antony J. Williams, Stephen B. Little, Kevin M. Crofton, and Russell S. Thomas. ToxCast Chemical Landscape: Paving the Road to 21st Century Toxicology. Chemical Research in Toxicology 2016 29 (8), 1225-1251. doi: 10.1021/acs.chemrestox.6b00135. PMID: 27367298.
Lamore SD, Ahlberg E, Boyer S, Lamb ML, Hortigon-Vinagre MP, Rodriguez V, Smith GL, Sagemark J, Carlsson L, Bates SM, Choy AL, Stålring J, Scott CW, Peters MF. Deconvoluting Kinase Inhibitor Induced Cardiotoxicity. Toxicol Sci. 2017 Jul 1;158(1):213-226. doi: 10.1093/toxsci/kfx082. PMID: 28453775; PMCID: PMC5837613.
Lynch JJ 3rd, Van Vleet TR, Mittelstadt SW, Blomme EAG. Potential functional and pathological side effects related to off-target pharmacological activity. J Pharmacol Toxicol Methods. 2017 Sep;87:108-126. doi: 10.1016/j.vascn.2017.02.020. PMID: 28216264.
Urban L, Whitebread S, Hamon J et al. Screening for safety-relevant off-target affinities. In: Polypharmacology in Drug Discovery. Peters JU (Ed.). John Wiley and Sons, NJ, USA (2012). doi: 10.1002/9781118098141.ch2.
Welcome!
The Open Targets Platform is a comprehensive tool that supports systematic identification and prioritisation of potential therapeutic drug targets.
By integrating publicly available datasets including data generated by the Open Targets consortium, the Platform builds and scores target-disease associations to assist in drug target identification and prioritisation. It also integrates relevant annotation information about targets, diseases/phenotypes, drugs, variants, studies, and credible sets as well as their most relevant relationships.
The Platform is a freely available resource that is actively maintained with quarterly updates. Our data can be accessed through an intuitive web user interface, an API, and data downloads. Likewise, our pipeline and infrastructure codebases are open-source and can be used to create a self-hosted private instance of the Platform with custom data. For more information, please review our Licence documentation and, if you use our data and/or pipelines, please cite our latest publication.
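As a minimal sketch of programmatic access, the snippet below queries the Platform's public GraphQL API for basic target annotation using Python. The endpoint URL and the target fields shown (id, approvedSymbol, biotype) reflect the v4 API, but field availability should always be checked against the GraphQL schema browser.

```python
# Minimal sketch: query the Open Targets Platform GraphQL API for a target.
# Verify endpoint and field names against the live GraphQL schema browser.
import requests

API_URL = "https://api.platform.opentargets.org/api/v4/graphql"

query = """
query targetInfo($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    id
    approvedSymbol
    biotype
  }
}
"""

variables = {"ensemblId": "ENSG00000157764"}  # BRAF, used here as an example

response = requests.post(API_URL, json={"query": query, "variables": variables})
response.raise_for_status()
print(response.json()["data"]["target"])
```

The same pattern extends to any other entity query exposed by the schema.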
Check out our blog to learn more about the Platform and the Open Targets research programme.
You can also join the Open Targets Community and follow us on:
LinkedIn: Open Targets
Bluesky: @opentargets.org
Twitter: @opentargets
YouTube: Open Targets
For additional help with the Open Targets Platform, or to report bugs, data issues, or submit a feature request, please post on the Open Targets Community, using the relevant categories and tags. If the request you would like to make has already been posted, please like the post to indicate you would like this to be prioritised.
Data for assessing tractability with small molecule, antibody, and other clinical modalities
To support target prioritisation, the Open Targets Platform includes tractability data that identifies key details, including whether there is a binding site suitable for small molecule binding, an accessible epitope for antibody-based therapy, relevant data for using Proteolysis Targeting Chimeras (PROTACs), or a compound in clinical trials with a modality other than small molecule or antibody.
The tractability data can assist in target prioritisation by identifying potential drug targets suitable for discovery pipelines and therapeutic modalities that are most likely to succeed. It also supports further investigation of targets for which there are no ligands or experimental structures or those targets outside a "druggable" target family but with strong genetic associations.
Our target tractability assessment is based on modified versions of Approaches to target tractability assessment – a practical perspective and The PROTACtable genome, with workflows that generate tractability assessments for small molecule (SM), antibody (AB), Proteolysis Targeting Chimeras (PR), and other clinical (OC) modalities.
The tractability assessments displayed on the Platform's target profile pages are the result of an open-source computational pipeline that performs in silico tractability assessments with small molecule, antibody, PROTAC, and other clinical modality workflows.
Data sources used in the pipeline include UniProt, HPA, PDBe, DrugEBIlity, ChEMBL, Pfam, InterPro, Complex Portal, DrugBank, Gene Ontology, and BioModels.
Assessments common to all modalities, ingested from ChEMBL, are:
Approved Drug: the target has clinical precedence with Phase IV drugs
Advanced Clinical: the target has clinical precedence with Phase II or III drugs
Phase 1 Clinical: the target has clinical precedence with Phase I drugs.
We also include additional assessments specific to each modality.
Structure with Ligand: Target has been co-crystallised with a small molecule (source: Protein Data Bank)
High-Quality Ligand: Target with ligand(s) (PFI ≤ 7, SMART hits ≤ 2, scaffolds ≥ 2) (source: ChEMBL)
High-Quality Pocket: Target has a DrugEBIlity score of ≥ 0.7 (source: DrugEBIlity)
Med-Quality Pocket: Target has a DrugEBIlity score between 0 and 0.7 (source: DrugEBIlity)
Druggable Family: Target is considered druggable as per Finan et al’s Druggable Genome pipeline.
UniProt loc high conf: High confidence that the subcellular location of the target is either plasma membrane, extracellular region/matrix, or secretion (source: UniProt)
GO CC high conf: High confidence that the subcellular location of the target is either plasma membrane, extracellular region/matrix, or secretion (source: Gene Ontology)
UniProt loc med conf: Medium confidence that the subcellular location of the target is either plasma membrane, extracellular region/matrix, or secretion (source: UniProt)
UniProt SigP or TMHMM: Target has a predicted signal peptide or transmembrane regions, and is not predicted to be destined to organelles (source: UniProt SigP, TMHMM)
GO CC med conf: Medium confidence that the subcellular location of the target is either plasma membrane, extracellular region/matrix, or secretion (source: Gene Ontology)
Human Protein Atlas loc: High confidence that the target is located in the Plasma membrane (source: HPA)
Literature: Target mentioned in a set of manually curated PROTAC-related publications (source: Europe PMC)
UniProt Ubiquitination: Target tagged with the UniProt keyword “Ubl conjugation [KW-0832]”, which indicates that the protein has a ubiquitination site, based on evidence from the literature (source: UniProt)
Database Ubiquitination: Target has reported ubiquitination sites in PhosphoSitePlus, mUbiSiDa (2013), or Kim et al. 2011
Half-life Data: Target has available half-life data (source: Mathieson et al. 2018)
Small Molecule Binder: Target has a reported small-molecule ligand in ChEMBL with a measured activity of at least 10 μM in a target-based assay (source: ChEMBL)
The data is available for download as part of the target core annotation from our data downloads page.
Alternatively, you can also download the input TSV file with the per-target assessments via FTP. To access this file, visit our FTP site and click on the release version (e.g. 21.04), followed by "input", followed by "annotation-files". You can then download the tractability_buckets TSV file. Descriptions of the columns found in the input file can be found in the pipeline README.md file.
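As a hedged sketch of working with the downloaded file, the snippet below loads the tractability_buckets TSV with pandas and filters one small molecule bucket. The column names used (Bucket_1_sm, ensembl_gene_id, symbol) are assumptions about the pipeline output and may differ between releases; check the README.md descriptions before relying on them.

```python
# Sketch: load the tractability_buckets TSV and list targets flagged with an approved
# drug in the small molecule workflow. Column names are assumptions and may differ
# between releases; check the pipeline README.md for the authoritative schema.
import pandas as pd

df = pd.read_csv("tractability_buckets.tsv", sep="\t")

# Hypothetical column: Bucket_1_sm == 1 flags targets with an approved small molecule drug
approved_sm = df[df["Bucket_1_sm"] == 1]
print(approved_sm[["ensembl_gene_id", "symbol"]].head())
```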
Brown KK, Hann MM, Lakdawala AS, Santos R, Thomas PJ, Todd K. Approaches to target tractability assessment - a practical perspective. Medchemcomm. 2018 Feb 14;9(4):606-613. doi: 10.1039/c7md00633k. PMID: 30108951; PMCID: PMC6072525.
Schneider M, Radoux CJ, Hercules A, Ochoa D, Dunham I, Zalmas LP, Hessler G, Ruf S, Shanmugasundaram V, Hann MM, Thomas PJ, Queisser MA, Benowitz AB, Brown K, Leach AR. The PROTACtable genome. Nat Rev Drug Discov. 2021 Jul 20. doi: 10.1038/s41573-021-00245-x. PMID: 34285415.
Systematically capturing target–target interactions of different kinds
The Molecular Interactions data aggregates and integrates interaction evidence reported in several resources to provide a systematic view of potentially relevant drug targets. Each of the integrated resources captures relationships of a different nature, including physical binary interactions, enzymatic reactions, or functional relationships. The information available here aims to capture not only the topology of the interaction network, but also the supporting experimental evidence reported in each of the databases.
In order to maximise coverage, the network contains all reported binary relationships between gene products (proteins and RNAs). Although the main focus is on interactions between human molecules, the data also includes additional interactions between human gene products and molecules encoded in the genomes of infectious pathogens (viruses and bacteria).
IntAct – http://www.ebi.ac.uk/intact – is a freely available, open source database for molecular interaction data. IntAct contains physical interactions derived from literature curation or direct user submissions.
Interactions are scored using the MI score. Benefiting from the PSI-MI controlled vocabulary, the IntAct MI score provides a normalised (0 to 1) score that weights how recurrently an interaction has been reported, together with the confidence of the reported experimental techniques. Note that a high-scoring interaction can be due to high-confidence evidence, but also to a social bias towards studying certain proteins. Generally speaking, scores > 0.4 correspond to medium- to high-confidence interactions, although some good-quality high-throughput interactions might still be scored below that threshold. More information on the MI score can be found in the IntAct documentation.
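As an illustration of applying the 0.4 guideline mentioned above, the sketch below filters a locally downloaded copy of the molecular interactions dataset to medium/high-confidence IntAct evidence. The file path and column names (sourceDatabase, scoring, targetA, targetB) are assumptions about the dataset schema and should be verified for your release.

```python
# Sketch: keep medium/high-confidence IntAct interactions (MI score > 0.4) from a local
# copy of the interactions dataset. Column names are assumptions about the schema and
# should be checked against the downloaded files for your release.
import pandas as pd

interactions = pd.read_parquet("interaction/")  # hypothetical local copy of the dataset

intact = interactions[interactions["sourceDatabase"] == "intact"]
confident = intact[intact["scoring"] > 0.4]
print(confident[["targetA", "targetB", "scoring"]].head())
```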
Interactions are grouped by interaction detection method and interaction type. As a consequence, the same pair of interactors might be split into multiple entries if individual proteins are reported to have different biological roles.
For IntAct, please note:
The network only contains human and selected pathogen data from IntAct
The majority of interactions are not directional and not signed. However, there are a proportion of interactions where the biological role of the participants can be stated and directionality specified (e.g. enzymatic reactions)
Reactome – https://reactome.org/ – is an open source, open access, manually curated and peer-reviewed pathway database.
For Reactome, please note:
Only human-human interactions are provided
Interactions are directional and signed, with biological roles assigned to each participant if possible
Protein interactions in Reactome are inferred from pathways and complexes based on an internal Reactome method.
SIGNOR, the SIGnaling Network Open Resource – https://signor.uniroma2.it/ – contains signaling information published in the scientific literature, which is manually curated and stored in a structured format.
For SIGNOR, please note:
SIGNOR only contains human data
Interactions are directional and signed, with biological roles assigned to each participant
The network pulls information from the SIGNOR relations file
STRING – https://string-db.org – contains functionally interacting proteins. While most interactions in the other resources capture different types of physical interaction between molecules, functionally associated proteins do not necessarily interact physically. Both direct (physical) and indirect (functional) associations are derived from computational predictions, from knowledge transfer between organisms, or from interactions aggregated from other (primary) databases. STRING interactions provide an overall combined_score, as well as each of the pieces of information that compose this score. More information on STRING scoring can be found on their documentation page.
The multipartite network displayed in the Open Targets Platform is the result of post-processing the information stored in a Neo4j graph database (graphDB). The graphDB does not provide STRING information but contains ComplexPortal information on stable protein complexes as an additional data source. The information of the graphDB is then exported together with STRINGdb and mapped to the Open Targets Platform targets (Ensembl Gene IDs).
The resulting dataset, as well as all intermediate files, can be found in the Open Targets Platform Data Access section or on the IntAct FTP site.
When using this data please remember to acknowledge the sources:
Orchard S, Ammari M, Aranda B, Breuza L, Briganti L, Broackes-Carter F, Campbell NH, Chavali G, Chen C, del-Toro N, Duesbury M, Dumousseau M, Galeota E, Hinz U, Iannuccelli M, Jagannathan S, Jimenez R, Khadake J, Lagreid A, Licata L, Lovering RC, Meldal B, Melidoni AN, Milagros M, Peluso D, Perfetto L, Porras P, Raghunath A, Ricard-Blum S, Roechert B, Stutz A, Tognolli M, van Roey K, Cesareni G, Hermjakob H. The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res. 2014 Jan;42(Database issue):D358-63. doi: 10.1093/nar/gkt1115. Epub 2013 Nov 13. PMID: 24234451; PMCID: PMC3965093.
Jassal B, Matthews L, Viteri G, Gong C, Lorente P, Fabregat A, Sidiropoulos K, Cook J, Gillespie M, Haw R, Loney F, May B, Milacic M, Rothfels K, Sevilla C, Shamovsky V, Shorser S, Varusai T, Weiser J, Wu G, Stein L, Hermjakob H, D'Eustachio P. The reactome pathway knowledgebase. Nucleic acids research. 2020 Jan;48(D1) D498-D503. doi: 10.1093/nar/gkz1031. PubMed PMID: 31691815. PubMed Central PMCID: PMC7145712.
Licata L, Lo Surdo P, Iannuccelli M, Palma A, Micarelli E, Perfetto L, Peluso D, Calderone A, Castagnoli L, Cesareni G. SIGNOR 2.0, the SIGnaling Network Open Resource 2.0: 2019 update. Nucleic Acids Res. 2020 Jan 8;48(D1):D504-D510. doi: 10.1093/nar/gkz949. PMID: 31665520; PMCID: PMC7145695.
Szklarczyk D, Kirsch R, Koutrouli M, Nastou K, Mehryary F, Hachilif R, Gable AL, Fang T, Doncheva NT, Pyysalo S, Bork P, Jensen LJ, von Mering C. The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 2023 Jan 6;51(D1):D638-D646. doi: 10.1093/nar/gkac1000. PMID: 36370105; PMCID: PMC9825434.
Identifying evidence implicating drug targets with diseases or phenotypes constitutes one of the pivotal challenges of the Open Targets Platform. Several steps are required to aggregate, integrate, validate, and score the collected information, in order to contextualise and weight the underlying evidence. The Platform provides a framework to allow the interactive or programmatic interrogation of target-disease evidence.
The Open Targets data model focuses on five main entities:
These entities are annotated with a variety of public data sources, as well as information about the relationships between them. A more detailed description of the annotations is provided throughout the documentation. Some examples are:
To start addressing therapeutic hypotheses users can find alternative ways to interface with the data:
Please note we are in the process of updating our training content following the most recent updates (25.03).
A disease or phenotype in the Platform is understood as any disease, phenotype, biological process, or measurement that might have any type of causal relationship with a human target. The EMBL-EBI Experimental Factor Ontology (EFO) is used as the scaffold for the disease or phenotype entity.
In order to maximise the alignment of the ontology with clinical applications, a few modifications have been made to EFO. Some high-level terms have been removed (e.g. disease by anatomical region) and others have been rearranged to align them with a less anatomical and more clinical interpretation. For each EFO release, the EFO OTAR slim can be found alongside the official release.
Target: understood as any candidate drug binding molecule
Disease or phenotype: including any disease indications, phenotypes, measurements, biological processes and other relevant traits.
Variant: DNA variation that has been associated with a disease, trait, or phenotype
Study: source of evidence linking genetic variants to traits, diseases, and molecular phenotypes.
Drug: molecules that can act as medicinal products.
A pivotal component of the Platform is the integration of potentially causal evidence linking targets and diseases. The definition of Platform evidence, as well as expanded documentation on each of the data sources, is available in the Evidence section.
In order to contextualise the information, all evidence referring to a unique target–disease pair is aggregated in the form of associations. Expanded documentation on how associations are built and scored is available in the Associations section.
The Platform web application, available at https://platform.opentargets.org, provides a user interface to navigate data relevant for building therapeutic hypotheses. Starting from the home page search box, users can navigate to entity pages, associations pages, or evidence pages.
More complex hypotheses might require advanced interfaces or programmatic access. The Platform aims to deliver a number of alternative ways to access data to support the most intensive queries.
More details about the open-source codebase and development can be found in the corresponding documentation section.
Description; cross-references; synonyms; location; ontology and classification
In the Open Targets Platform, a study is a qualifying genome-wide association study for binary or quantitative traits.
Studies might capture different types of associations, such as curated associations (top hits) from literature or post-processing GWAS summary statistics.
For each study meeting the inclusion criteria, harmonisation, fine-mapping, colocalisation, and Locus-to-Gene analysis are performed, resulting in a list of significant credible sets and their most likely associated genes. The resulting 95% credible sets can be visualised in all study pages. Within the same study, credible sets might result from different fine-mapping pipelines but still be presented in the same study with their respective provenance and confidence.
Studies without GWAS-significant associations
The study will still be presented in the Open Targets Platform if no significant associations are found.
To provide rich context for the downstream interpretation of studies and their associations, we capture a wide range of annotations, whenever possible in a standardised way to enable interoperability.
For example:
The measured trait/phenotype is standardised using the Experimental Factor Ontology (EFO), as is the background trait where applicable
For molecular QTLs, the affected gene/protein is standardised to Ensembl gene identifiers
Depending on the granularity of the ingested study metadata, rich cohort information is captured, including sample sizes, ancestry composition, and a list of named cohorts
We indicate whether a study's association data was ingested as summary statistics, together with the corresponding quality control measurements performed on the summary statistics
Besides the PubMed identifier, rich publication information is also captured, including first author and publication year, for easier data access
Integrated studies can be split into two major groups that determine how their associations would be utilised in the target identification process.
A GWAS identifies associations between genetic variants and traits or phenotypes by analysing genome-wide variation data from a population. These studies cover binary or quantitative traits reported in the integrated sources.
Shared trait studies
GWAS studies sharing the same disease or phenotype (EFO) as the study are listed.
Sources: GWAS Catalog, FinnGen
A molQTL study identifies genetic variation that can be significantly associated with changes at the molecular level. As such, associations of these studies provide invaluable help to put GWAS derived loci into context via colocalisation. The type of molecular trait measured defines the type of QTL:
eQTLs impact gene expression
pQTLs impact protein abundance
sQTLs have an effect on gene splicing
tuQTLs impact transcript usage
sceQTLs impact gene expression at the single-cell level
All molQTL studies are annotated with the gene whose expression levels are regulated, referred to as the affected gene.
In addition to the affected gene, molQTL studies also capture the tissue/cell type in which the change in the molecular trait was measured, providing further context for interpretation.
Each molQTL study captures the unique combination of the following annotations:
The publication authoring the study
The gene product measured in the study
The quantitative method (e.g. aptamer) used to measure the trait, when available
The cell type or tissue where the trait is measured, when available
The experimental conditions (e.g. interferon-stimulated macrophages)
Sources: eQTL Catalogue, UK Biobank Pharma Proteomics Project (UKB-PPP)
Studies must meet predefined criteria to ensure consistent representation, and increased reliability of associations for downstream analysis.
Each integrated study needs to meet the following criteria:
GWAS studies need to have a valid disease or phenotype identifier (EFO)
QTL studies must have an affected gene and tissue/cell-type valid identifier (biosample)
Valid study type (e.g. gwas or a QTL type from a defined set)
Data licensed for commercial usage
The fine-mapping results for all 95% credible sets in the study are displayed with their most relevant metadata, as well as the top Locus-to-Gene assignment.
Because the same study might be processed through different fine-mapping pipelines, it's possible to observe credible sets obtained with different methods even within the same study. The credible set exclusion criteria describes the rules applied to avoid duplicated credible sets resulting from separate fine-mapping pipelines.
Source: Open Targets
Data supporting core essentiality of a target
Core essential target genes are unlikely to tolerate inhibition and are susceptible to causing adverse events if modulated. This is crucial safety information for drug discovery scientists looking to develop inhibition strategies.
To support target prioritisation with an extra focus on target safety, the Open Targets Platform includes key information on target core essentiality in the context of 28 tissues assayed in the DepMap portal. In this project and its ancillary projects (e.g. Achilles), cell fitness is measured after the inhibition of individual genes across a number of cell lines. As a result, a gene is catalogued as core essential if the majority of cell lines die after its inhibition/knockout. Although the assayed cell lines are derived from cancers, this experiment represents a good proxy for whether loss of function is tolerated across a diverse set of tissues.
Essentiality assessment: The Chronos dependency score is based on data from a cell depletion assay. A lower Chronos score indicates a higher likelihood that the gene of interest is essential in a given cell line. A score of 0 indicates a gene is not essential; correspondingly -1 is comparable to the median of all pan-essential genes.
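To illustrate how the Chronos scale can be used, the sketch below flags genes that are depleted in most cell lines of a score matrix. The -0.5 cut-off, the 90% majority rule, and the input file name are illustrative assumptions, not the exact criteria applied by DepMap or the Platform.

```python
# Sketch: flag genes whose Chronos scores suggest essentiality in most cell lines.
# The -0.5 cut-off and the 90% "majority of cell lines" rule are illustrative
# assumptions, not the exact criteria used by DepMap or the Platform.
import pandas as pd

# Hypothetical matrix: rows = cell lines, columns = genes, values = Chronos scores
chronos = pd.read_csv("CRISPRGeneEffect.csv", index_col=0)

depleted_fraction = (chronos < -0.5).mean(axis=0)   # fraction of cell lines depleted per gene
core_essential = depleted_fraction[depleted_fraction > 0.9].index
print(f"{len(core_essential)} genes flagged as core essential under these assumptions")
```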
Together with a DepMap target annotation widget, essential target genes are annotated with a `Core essential gene` chip on the target page.
The goal of the Dependency Map (DepMap) portal is to empower the research community to make discoveries related to cancer vulnerabilities by providing open access to key cancer dependency datasets and to analytical and visualisation tools.
When using this data please remember to acknowledge the sources:
Pacini C, Dempster JM, Boyle I, Gonçalves E, Najgebauer H, Karakoc E, van der Meer D, Barthorpe A, Lightfoot H, Jaaks P, McFarland JM, Garnett MJ, Tsherniak A, Iorio F. Integrated cross-study datasets of genetic dependencies in cancer. Nat Commun. 2021 Mar 12;12(1):1661. doi: 10.1038/s41467-021-21898-7. PMID: 33712601; PMCID: PMC7955067.
To support further assessments about the suitability of targets for a discovery pipeline, the Open Targets Platform integrates information from various sources on chemical probes and Target Enabling Packages (TEPs).
A chemical probe is a small molecule that can act as chemical modulator of a system, such as a cell or an organism, by reversibly binding to a biological target. Chemical probes are expected to have a minimum standard of high affinity, binding selectivity and efficacy. Chemical probes do not need to meet the same requirements in terms of pharmacokinetics, pharmacodynamics, and bioavailability as drugs.
Target Enabling Packages (TEPs) are a critical mass of reagents and knowledge allowing for the rapid biochemical and chemical exploration of a given target.
As noted by the Structural Genomics Consortium, each TEP contains:
Protein production methods
Biochemical/biophysical assays for activity, affinity
Structures of the protein, potentially including wild type and disease mutant proteins; full-length or domains; protein-ligand complexes; structures of close homologues
Initial chemical matter from a fragment or small molecule screen
Additional components of TEPs may be developed on a case-by-case basis, based upon reasonable scientific need, in collaboration with TEP target nominators:
An antibody or nanobody
Cell-based assay
CRISPR knockout
The Probes & Drugs Portal – https://www.probes-drugs.org/home/ – is a public resource joining together focused libraries of bioactive compounds (probes, drugs, specific inhibitor sets etc.) with commercially available screening libraries. The purpose of the Portal is to reflect the current state of bioactive compound space and to enable its exploration from different points of view. It is intended to serve as a central hub in chemical biology research by providing a unique integration of the most relevant probes sources such as the Chemical Probes Portal, Open Science Probes, and ProbeMiner.
The Structural Genomics Consortium (SGC) – https://www.thesgc.org/ – is a public-private partnership focused on accelerating drug discovery using open science. The SGC’s core research operations are funded by pharmaceutical companies, governments, and charities, all of which act as research partners and participate in the governance of the SGC.
Chemical probes datasets are downloaded from each data source listed above, mapped to the correct Ensembl gene ID, and ingested during our initial pipeline steps. The data is available for download as part of the target core annotation from our data download page.
Baseline RNA and protein expression data helps us to ascertain whether the target is expressed in all tissues or selectively in one or few tissues or cell types. The availability of the target molecule in the location of interest is critical at different stages of the drug development process.
We combine baseline expression information from three sources:
Expression Atlas: RNA expression meta-analysis from RNA-sequencing experiments
Human Protein Atlas: immunohistochemistry-based proteomics data for normal tissues
Genotype-Tissue Expression (GTEx) Program: RNA baseline expression variation data per tissue
Expression Atlas – https://www.ebi.ac.uk/gxa/home – is a manually-curated, freely available data analysis and visualisation resource that provides information about gene and protein expression in more than 3,000 experiments.
The Human Protein Atlas – https://www.proteinatlas.org – is an international collaboration that develops various open access knowledge resources that map all the human proteins in cells, tissues and organs using an integration of various 'omics technologies. The Human Protein Atlas consists of six separate parts, each focusing on a particular aspect of the genome-wide analysis of the human proteins. At the moment, we only use the Tissue Atlas, showing the distribution of the proteins across all major tissues and organs in the human body.
The Genotype-Tissue Expression (GTEx) Portal – https://gtexportal.org/home/ – is an open access data resource that provides data on gene expression, QTLs, and histology images. The Portal is part of the GTEx consortium's ongoing effort to build a comprehensive public resource to study tissue-specific gene expression and regulation and its relationship to genetic variation.
Open Targets, in collaboration with Expression Atlas, has performed a meta-analysis of over 18,000 samples from 50 different tissues and more than 30 cell types from the following experiments:
RNA-seq of 53 human tissue samples from GTEx (E-MTAB-5214)
RNA-seq of 16 human tissues from the Illumina Body Map project (E-MTAB-513)
RNA-seq of 13 human tissues from the ENCODE project (Snyder Lab) (E-MTAB-4344)
RNA-seq of 6 human tissues from Kaessmann Lab (E-MTAB-3716)
mRNA-seq of 32 human tissues from Human Protein Atlas (E-MTAB-2836)
mRNA-seq of rare types of cells of different haemopoietic lineages from healthy individuals in the BLUEPRINT project (E-MTAB-3819)
RNA-seq of common types of cells of different haemopoietic lineages from healthy individuals in the BLUEPRINT project (E-MTAB-3827)
mRNA-seq of plasma cells of tonsil from healthy individuals from the BLUEPRINT project (E-MTAB-4754)
The tissue- and cell-based samples in these experiments are processed separately to avoid batch effects during normalisation. The samples of each group are then processed together to generate an expression table of normalised Transcripts Per Million (TPMs) units for every gene in each tissue or cell type according to the following steps:
Technical replicates are aggregated
Genes that are expressed below a pre-defined threshold (i.e. minimum of 10 raw reads in at least 15 samples) are filtered out
Samples are then normalised in a two-step process according to Risso et al. 2014 using the RUV (Remove Unwanted Variation) method:
The Coefficient of Variation (CV) was estimated for each gene across all the samples and used to select the least variable genes.
The least variable 1,000 genes were used as negative controls; that is, assumed not to be differentially expressed, to train RUVg to remove unwanted variation
Tissues are mapped to anatomical ontology terms using the Uber-anatomy ontology. Tissues that can't be mapped are then discarded
Samples from the same tissue across different experiments are averaged by median and then merged in the final matrix
Expression tables of the tissue- and cell-based experiments are combined
We analyse this expression file further to compute two values for each gene:
Binned value of expression: The normalised expression values are divided into 10 bins of the same width. Note that this is not the same as the deciles, which all contain the same number of items in them
Tissue specificity: Z-scores are calculated for each gene and each tissue and then they are binned based on quantiles of a perfect normal distribution. We compute the tissue specificity of a target as the number of standard deviations from the mean of the log RNA expression of the target across the available tissues. A target is considered to be tissue specific if the z-score is greater than 0.674 (or the 75th percentile of a perfect normal distribution). We remove data for under-expressed targets before the z-score calculation. This allows us to extract the tissues for which a gene is specific, defined as the expression value being above the 75th z-score percentile — in practice, anything in bin 2 or above.
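The sketch below reproduces the z-score logic described above on an assumed genes-by-tissues TPM matrix: log-transform, z-score each gene across tissues, and call a tissue specific when the z-score exceeds 0.674 (the 75th percentile of a standard normal distribution). It is a simplified illustration rather than the exact pipeline implementation.

```python
# Sketch of the tissue-specificity logic described above: z-score the log expression of
# each gene across tissues and call a tissue "specific" when z > 0.674 (the 75th
# percentile of a standard normal). The input layout (genes x tissues, TPM) is assumed.
import numpy as np
import pandas as pd

tpm = pd.read_csv("baseline_expression_tpm.tsv", sep="\t", index_col=0)  # hypothetical file

log_expr = np.log2(tpm + 1)
zscores = log_expr.sub(log_expr.mean(axis=1), axis=0).div(log_expr.std(axis=1), axis=0)

specific = zscores > 0.674  # boolean matrix: gene is considered specific to that tissue
print(specific.sum(axis=1).sort_values(ascending=False).head())
```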
Normalised files with binned value of expression and tissue specificity values for each target are available from our data download page.
GTEx Consortium. The Genotype-Tissue Expression (GTEx) project. Nat Genet. 2013 Jun;45(6):580-5. doi: 10.1038/ng.2653. PMID: 23715323; PMCID: PMC4010069.
Papatheodorou I, Moreno P, Manning J, Fuentes AM, George N, Fexova S, Fonseca NA, Füllgrabe A, Green M, Huang N, Huerta L, Iqbal H, Jianu M, Mohammed S, Zhao L, Jarnuczak AF, Jupp S, Marioni J, Meyer K, Petryszak R, Prada Medina CA, Talavera-López C, Teichmann S, Vizcaino JA, Brazma A. Expression Atlas update: from tissues to single cells. Nucleic Acids Res. 2020 Jan 8;48(D1):D77-D83. doi: 10.1093/nar/gkz947. PMID: 31665515; PMCID: PMC7145605.
Uhlén M, Fagerberg L, Hallström BM, Lindskog C, Oksvold P, Mardinoglu A, Sivertsson Å, Kampf C, Sjöstedt E, Asplund A, Olsson I, Edlund K, Lundberg E, Navani S, Szigyarto CA, Odeberg J, Djureinovic D, Takanen JO, Hober S, Alm T, Edqvist PH, Berling H, Tegel H, Mulder J, Rockberg J, Nilsson P, Schwenk JM, Hamsten M, von Feilitzen K, Forsberg M, Persson L, Johansson F, Zwahlen M, von Heijne G, Nielsen J, Pontén F. Proteomics. Tissue-based map of the human proteome. Science. 2015 Jan 23;347(6220):1260419. doi: 10.1126/science.1260419. PMID: 25613900.
A target in the Platform is understood as any naturally-occurring molecule that can be targeted by a medicinal product. EMBL-EBI's Ensembl database is used as source for human targets in the Platform, with the Ensembl gene ID as the primary identifier.
Criteria for target inclusion:
Genes from all biotypes encoded in canonical chromosomes
Genes in alternative assemblies encoding for a reviewed protein product.
This definition accounts for some of the complexities of human targets: targets are not only protein-coding genes; RNAs and pseudogenes are also considered. However, the current definition has some potential drawbacks. Some drug targets are the result of interactions between genes (e.g. gene fusions) or proteins (e.g. protein complexes). The current target entity does not yet consider these cases, which will be addressed in the future.
Protein, positional, and structural information (ProtVista); Subcellular location; Gene Ontology
Pathways
Comparative genomics
Mouse phenotypes
Cancer hallmarks
Known drugs
CRISPR-Cas9 cancer cell line dependency
Genetic constraint
Data supporting pharmacogenetics annotation for a target
Pharmacogenetics is the study of how genetic variation may change your response to a specific drug. Through the integration of pharmacogenetics data into the Platform, our objective is to apply clinical annotations available in the Pharmacogenetics Knowledgebase (PharmGKB) to aid target prioritisation for drug discovery.
We also enhance these annotations by adding detailed information on variant consequence prediction, drug response categories, specific drug information, and whether the gene is a direct target of the drug. This process involves applying advanced phenotype extraction techniques to offer a refined representation of phenotypes, thereby providing a more precise understanding of the genetic determinants influencing treatment outcomes.
We have also included annotations for star (*) alleles, a nomenclature used in the pharmacogenetics field for representing key functional variants involved in drug responses.
When using this data please remember to acknowledge the sources:
Whirl-Carrillo M, Huddart R, Gong L, Sangkuhl K, Thorn CF, Whaley R, Klein TE. An Evidence-Based Framework for Evaluating Pharmacogenomics Knowledge for Personalized Medicine. Clin Pharmacol Ther. 2021 Sep;110(3):563-572. doi: 10.1002/cpt.2350. Epub 2021 Jul 22. PMID: 34216021; PMCID: PMC8457105.
Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, Thorn CF, Altman RB, Klein TE. Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther. 2012 Oct;92(4):414-7. doi: 10.1038/clpt.2012.96. PMID: 22992668; PMCID: PMC3660037.
To further refine how we define a drug in the Platform, we subset ChEMBL's drug dataset to capture molecules that meet the following criteria:
Drugs for which the indication is known
Drugs for which the target they modulate is identified
Drugs that are listed in DrugBank
Drugs that are acknowledged as chemical probes
A drug in the Platform might belong to different modalities, including small molecules, antibodies, or oligonucleotides among others. However, some biologic therapies such as vaccines, blood products, or cell therapies are not represented in our drug set. Moreover, the molecule-centric definition implies multi-ingredient drugs won't be represented and only their individual active moieties might be available on the site.
In the ChEMBL representation of drugs, a clear distinction is made between parent bioactive molecules and their corresponding child molecules. Parent molecules encompass the original, unmodified form of the active ingredient, while child molecules refer to modified versions, such as salts. In the Platform, both parent and child molecules are included, ensuring comprehensive coverage of the drug landscape.
When it comes to data propagation, parent molecules retain their own distinct information, as well as aggregate the specific details from all their child molecules. On the other hand, child molecules solely capture their own individual information, without incorporating any data from other molecules. This consistent approach extends to various aspects, including indications, mechanism of action, and drug warnings. By adopting this systematic framework, our Platform facilitates accurate representation and analysis of the diverse molecular entities and their associated properties.
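As a toy illustration of this propagation rule, the sketch below aggregates indications from child molecules (e.g. salt forms) onto their parent while leaving the children untouched. The data structures and identifiers are invented for the example and do not reflect the ChEMBL schema.

```python
# Toy sketch of the parent/child propagation rule described above: a parent molecule
# aggregates indications from all of its child molecules (e.g. salt forms), while each
# child keeps only its own. Data structures and IDs are illustrative, not the ChEMBL schema.
molecules = {
    "CHEMBL_PARENT": {"parent": None, "indications": {"hypertension"}},
    "CHEMBL_CHILD_SALT": {"parent": "CHEMBL_PARENT", "indications": {"migraine"}},
}

def propagated_indications(mol_id: str) -> set[str]:
    """Return own indications plus, for parent molecules, those of all child molecules."""
    own = set(molecules[mol_id]["indications"])
    children = (m for m, v in molecules.items() if v["parent"] == mol_id)
    for child in children:
        own |= molecules[child]["indications"]
    return own

print(propagated_indications("CHEMBL_PARENT"))      # {'hypertension', 'migraine'}
print(propagated_indications("CHEMBL_CHILD_SALT"))  # {'migraine'}
```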
Data describing pharmacovigilance annotation for the drug
Before approval, new therapeutic drug treatments are extensively tested in clinical trials. However, some side effects are only identified once the drug is prescribed to larger cohorts of patients, with one or more medical conditions, for a sustained period of time, or in combination with other treatments.
For this reason, regulatory agencies—for example, the Food & Drug Administration or the European Medicines Agency—provide pharmacovigilance programs to monitor and survey Adverse Drug Reactions (ADRs).
First we apply a set of filters to the reports as described below:
Only reports submitted by health professionals (primarysource.qualification in (1,2,3))
Exclude reports that resulted in death (no entries with seriousnessdeath=1)
Only drugs that were considered by the reporter to be the cause of the event (drugcharacterization=1)
Next, we sought to map the drugs in the FAERS reports to the drugs in the Open Targets Platform (ChEMBL IDs). Any of the listed FAERS drug name fields were used when exact matches were available.
Due to the nature of the surveillance reports, it's relatively common for the indication for which a drug was prescribed to appear in the list of significant ADRs. Given the current structure of the data provided in a FAERS report, we cannot distinguish whether this reflects a problem with the dosage at which the drug was prescribed or an excessive phenotypic characterisation of the patient in the report.
The Clinical Signs and Symptoms data in the Platform aims to capture other diseases or phenotypes that occur as a consequence, or in conjunction with a primary disease.
Disease–phenotype relationships are not only useful to better characterise the disease phenotypic space, but also to serve as proxies for additional causal evidence that can help prioritise new or existing targets.
Common and rare variation in Open Targets Platform
Variant identifier
Variant identifiers of SNPs and small indels are created based on genomic location and alleles, for example 6_160589086_A_G, where A is the reference allele at position 160,589,086 on chromosome 6 and the alternate allele is G. To be consistent with gnomAD, we use a 1-based coordinate system.
For longer insertions (200+) and deletions, where keeping the full length of the allele in the variant identifier is impractical, the allele string is hashed to create the identifier, which, when available, might contain the chromosome and position as well. Example: OTVAR_11_614383_9cc2ae367cc98c283cb510e8ea29c9f0
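The sketch below shows how an identifier in this style could be constructed. The use of MD5 and the exact OTVAR_<chromosome>_<position>_<hash> layout are assumptions inferred from the example above rather than a documented specification.

```python
# Sketch: build a variant identifier in the style described above. The use of MD5 and
# the OTVAR_<chrom>_<pos>_<hash> layout are assumptions inferred from the example shown,
# not a documented specification.
import hashlib

MAX_ALLELE_LENGTH = 200  # long insertions/deletions get a hashed identifier

def variant_id(chrom: str, pos: int, ref: str, alt: str) -> str:
    if len(ref) <= MAX_ALLELE_LENGTH and len(alt) <= MAX_ALLELE_LENGTH:
        return f"{chrom}_{pos}_{ref}_{alt}"
    digest = hashlib.md5(f"{ref}_{alt}".encode()).hexdigest()
    return f"OTVAR_{chrom}_{pos}_{digest}"

print(variant_id("6", 160589086, "A", "G"))      # 6_160589086_A_G
print(variant_id("11", 614383, "A", "T" * 500))  # OTVAR_11_614383_<md5 digest>
```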
Variants are annotated with an integrated view of variant effects from multiple methods. Based on all predictions or annotations, we normalise the variant's likely deleteriousness to a common scale.
To make the predicted variant effects comparable across different methods, raw predictions from each method were normalised to a unified scale ranging from likely benign, to uncertain, to likely deleterious.
Every variant is annotated with the predicted consequence for all canonical transcripts in a +/- 500 kb window, allowing us to understand the likely effects on neighbouring coding or non-coding genes. For all variant-transcript pairs in the region, this information includes:
Distance from transcription start site (TSS)
Distance to footprint
Predicted functional consequence based on Ensembl VEP
Amino-acid consequence relative to the UniProt reference protein
The list of variant sources includes:
The 'Clinical Precedence' widget in the Open Targets Platform provides users with an overview of the clinical development stage of each drug, as defined by the EMBL-EBI ChEMBL database. This widget is designed to aid in the prioritisation of drug targets by highlighting the extent of clinical development for each drug.
The clinical development stages are categorised as follows:
In the Platform, this information offers valuable insights into the clinical development of potential drug targets. By showing the current phase of clinical development, the 'Clinical Precedence' widget can inform decisions about target prioritisation and help identify drugs that may be suitable for repurposing or further development.
Data supporting pharmacogenetics annotation for a drug
Pharmacogenetics is the study of how genetic variation may change your response to a specific drug. Through the integration of pharmacogenetics data into the Platform, our objective is to apply clinical annotations available in the Pharmacogenetics Knowledgebase (PharmGKB) to aid target prioritisation for drug discovery.
We also enhance these annotations by adding detailed information on variant consequence prediction, drug response categories, specific drug information, and whether the gene is a direct target of the drug. This process involves applying advanced phenotype extraction techniques to offer a refined representation of phenotypes, thereby providing a more precise understanding of the genetic determinants influencing treatment outcomes.
We have also included annotations for star (*) alleles, a nomenclature used in the pharmacogenetics field for representing key functional variants involved in drug responses.
When using this data please remember to acknowledge the sources:
Whirl-Carrillo M, Huddart R, Gong L, Sangkuhl K, Thorn CF, Whaley R, Klein TE. An Evidence-Based Framework for Evaluating Pharmacogenomics Knowledge for Personalized Medicine. Clin Pharmacol Ther. 2021 Sep;110(3):563-572. doi: 10.1002/cpt.2350. Epub 2021 Jul 22. PMID: 34216021; PMCID: PMC8457105.
Whirl-Carrillo M, McDonagh EM, Hebert JM, Gong L, Sangkuhl K, Thorn CF, Altman RB, Klein TE. Pharmacogenomics knowledge for personalized medicine. Clin Pharmacol Ther. 2012 Oct;92(4):414-7. doi: 10.1038/clpt.2012.96. PMID: 22992668; PMCID: PMC3660037.
Sources: PharmGKB and selected publications
PharmGKB is an NIH-funded comprehensive resource that provides information about how human genetic variation affects response to medications. PharmGKB collects, curates and disseminates knowledge about clinically actionable gene–drug associations and genotype–phenotype relationships, focusing on the impact of genetic variation on drug response for clinicians and researchers.
We have modelled and harmonised the data from the Clinical Annotations section of PharmGKB, which provides information about variant–drug pairs based on collating and summarising variant annotations. Variant annotations are curated manually from the scientific literature. Clinical Annotations then provide an overarching, curated summary of the association between a genetic variant and particular drug responses, based on multiple variant annotations. The likely consequence of each genotype of the variant on drug response is represented, in comparison to the other genotypes of that variant.
A drug in the Platform is understood as any bioactive molecule with drug-like properties as defined in the EMBL-EBI ChEMBL database.
The FDA Adverse Event Reporting System (FAERS) is a database that contains millions of public reports with information on adverse events and medication errors submitted to the FDA. The database is designed to support the FDA's post-marketing safety surveillance program for drug and therapeutic biologic products. Adverse events and medication errors are mapped to terms in the Medical Dictionary for Regulatory Activities (MedDRA) terminology.
While the recurrence of a given adverse event is relevant, it is the specificity of the event to the drug that might flag concerns. In order to obtain a list of significant drug–ADR associations, we have implemented an analysis similar to the one described by Maciejewski et al. (2017).
Remove events curated manually to exclude uninformative events
The significant drug–ADR pairs were then evaluated using the Likelihood Ratio Test (LRT) as previously described by Huang et al. (2013). The significance of a given drug–ADR pair is implicitly corrected by how often a drug is found in a report and how often an event is reported across drugs. This way, we prevent the drug–ADR associations from being biased by over-represented ADRs (e.g. headache, nausea) or drugs (e.g. paracetamol, ibuprofen). In order to assess significance, an LRT critical value for every drug is calculated using an empirical Monte Carlo simulation, similar to previously described implementations.
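For orientation, the sketch below computes a drug–event log-likelihood ratio in the form described by Huang et al. (2013). It does not reproduce the report counting or the Monte Carlo estimation of per-drug critical values used in the Platform pipeline.

```python
# Sketch of the log-likelihood ratio for a drug-event pair in the form described by
# Huang et al. (2013). n_ij: reports with drug i and event j; n_i: reports with drug i;
# n_j: reports with event j; n: total reports. The Monte Carlo estimation of the
# critical value used to call significance is not reproduced here.
import math

def log_likelihood_ratio(n_ij: int, n_i: int, n_j: int, n: int) -> float:
    expected = n_i * n_j / n  # expected co-occurrence if drug and event were independent
    if n_ij == 0:
        return 0.0
    if n_ij == n_i:
        return n_ij * math.log(n_ij / expected)
    return (n_ij * math.log(n_ij / expected)
            + (n_i - n_ij) * math.log((n_i - n_ij) / (n_i - expected)))

# Example: 40 of a drug's 500 reports mention an event seen in 2,000 of 1,000,000 reports
print(round(log_likelihood_ratio(40, 500, 2000, 1_000_000), 2))
```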
All pharmacovigilance data is available for download from our data downloads page.
Huang L, Zalkikar J, Tiwari RC. Likelihood ratio test-based method for signal detection in drug classes using FDA's AERS database. J Biopharm Stat. 2013;23(1):178-200.
Maciejewski M, Lounkine E, Whitebread S, Farmer P, DuMouchel W, Shoichet BK, Urban L. Reverse translation of adverse event reports paves the way for de-risking preclinical off-targets. Elife. 2017 Aug 8;6:e25818.
In order to maximise the list of available disease–phenotype links, the Platform ingests data from the Monarch Merged Disease Ontology (MONDO) and the Human Phenotype Ontology (HPO), the latter in a joint effort with the Monarch Initiative.
The Monarch Merged Disease Ontology (MONDO) is an open-source, semi-automatically constructed ontology that integrates multiple disease resources to build a single, coherent merged ontology. Although originally constructed in an entirely automatic way using identifiers from source databases and ontologies, manually curated cross-ontology axioms have since been added, and a native MONDO ID system was developed and implemented to reduce confusion with source databases and ontologies.
The Human Phenotype Ontology (HPO) provides a standardised vocabulary of phenotypic abnormalities encountered in human disease and contains over 13,000 terms and over 156,000 annotations to hereditary diseases. HPO has been developed using medical literature, Orphanet, DECIPHER, and OMIM.
The relationships described in MONDO and HPO are also enriched with annotations such as sex, typical age of onset, or the frequency with which patients with the disease present the phenotype. To improve interoperability with the rest of the Platform, diseases and phenotypes are mapped to the Experimental Factor Ontology (EFO) when possible.
Complete disease–phenotype relationship datasets are available for download from our data downloads page.
Köhler S, Carmody L, Vasilevsky N, Jacobsen JOB, Danis D, Gourdine JP, Gargano M, Harris NL, Matentzoglu N, McMurry JA, Osumi-Sutherland D, Cipriani V, Balhoff JP, Conlin T, Blau H, Baynam G, Palmer R, Gratian D, Dawkins H, Segal M, Jansen AC, Muaz A, Chang WH, Bergerson J, Laulederkind SJF, Yüksel Z, Beltran S, Freeman AF, Sergouniotis PI, Durkin D, Storm AL, Hanauer M, Brudno M, Bello SM, Sincan M, Rageth K, Wheeler MT, Oegema R, Lourghi H, Della Rocca MG, Thompson R, Castellanos F, Priest J, Cunningham-Rundles C, Hegde A, Lovering RC, Hajek C, Olry A, Notarangelo L, Similuk M, Zhang XA, Gómez-Andrés D, Lochmüller H, Dollfus H, Rosenzweig S, Marwaha S, Rath A, Sullivan K, Smith C, Milner JD, Leroux D, Boerkoel CF, Klion A, Carter MC, Groza T, Smedley D, Haendel MA, Mungall C, Robinson PN. Expansion of the Human Phenotype Ontology (HPO) knowledge base and resources. Nucleic Acids Res. 2019 Jan 8;47(D1):D1018-D1027. doi: 10.1093/nar/gky1105. PMID: ; PMCID: PMC6324074.
Shefchek KA, Harris NL, Gargano M, Matentzoglu N, Unni D, Brush M, Keith D, Conlin T, Vasilevsky N, Zhang XA, Balhoff JP, Babb L, Bello SM, Blau H, Bradford Y, Carbon S, Carmody L, Chan LE, Cipriani V, Cuzick A, Della Rocca M, Dunn N, Essaid S, Fey P, Grove C, Gourdine JP, Hamosh A, Harris M, Helbig I, Hoatlin M, Joachimiak M, Jupp S, Lett KB, Lewis SE, McNamara C, Pendlington ZM, Pilgrim C, Putman T, Ravanmehr V, Reese J, Riggs E, Robb S, Roncaglia P, Seager J, Segerdell E, Similuk M, Storm AL, Thaxon C, Thessen A, Jacobsen JOB, McMurry JA, Groza T, Köhler S, Smedley D, Robinson PN, Mungall CJ, Haendel MA, Munoz-Torres MC, Osumi-Sutherland D. The Monarch Initiative in 2019: an integrative data and analytic platform connecting phenotypes to genotypes across species. Nucleic Acids Res. 2020 Jan 8;48(D1):D704-D715. doi: 10.1093/nar/gkz997. PMID: ; PMCID: PMC7056945.
In the Open Targets Platform, a variant refers to any human variation that has been reported in at least one of our data sources. All variation is mapped to the GRCh38 build and enriched with functional annotation. The Platform currently captures single nucleotide polymorphisms (SNPs) and insertions/deletions.
All variants shown in the Platform are reported in at least one of our sources.
Alternate allele frequencies from the gnomAD variation database are reported for all major populations when available.
Source:
Source:
ClinVar: Submitted variants at all clinical significances
UniProt: Literature-based curation of disease-associated variants
PharmGKB: Variants corresponding to genotypes associated with drug responses
For more details of the clinical phases, please refer to the ChEMBL documentation.
PharmGKB is an NIH-funded comprehensive resource that provides information about how human genetic variation affects response to medications. PharmGKB collects, curates and disseminates knowledge about clinically actionable gene–drug associations and genotype–phenotype relationships, focusing on the impact of genetic variation on drug response for clinicians and researchers.
We have modelled and harmonised the data from the Clinical Annotations section of PharmGKB, which provides information about variant–drug pairs based on collating and summarising variant annotations. Variant annotations are curated manually from the scientific literature. Clinical Annotations then provide an overarching, curated summary of the association between a genetic variant and particular drug responses, based on multiple variant annotations. The likely consequence of each genotype of the variant on drug response is represented, in comparison to the other genotypes of that variant.
Molecule information, Indications, Mechanisms of action, Drug warnings (black box and withdrawn warnings)
drug.medicinalproduct
drug.medicinalproduct
drug.openfda.generic_name
synonyms
drug.openfda.brand_name
pref_name
drug.openfda.substance_name
trade_names
Variant effect prediction methods and descriptions:
AlphaMissense: A deep learning model that predicts the likely pathogenicity of missense variants, building on protein sequence context and AlphaFold-derived structural information
FoldX: FoldX is a computational tool that predicts the impact of mutations on protein stability and structure by calculating changes in free energy, helping to assess the potential functional consequences of missense variants (ref). The data represented in the Platform was generated by an Open Targets project team using the FoldX algorithm to predict stability changes for protein variants based on all human AlphaFold2 (ref) predicted structures with confidence scores of pLDDT > 70
GERP: GERP (Genomic Evolutionary Rate Profiling) scores are used to identify regions of the genome that are evolutionarily conserved and likely to be functionally important, where higher conservation indicates a potentially deleterious impact of variants (ref)
LOFTEE: LOFTEE (Loss-Of-Function Transcript Effect Estimator) is a tool used to identify and annotate high-confidence loss-of-function variants in human genetic data, focusing on variants that likely disrupt gene function (ref)
SIFT: SIFT (Sorting Intolerant From Tolerant) predicts whether an amino acid substitution affects protein function based on sequence homology and the physical properties of amino acids (ref)
Ensembl VEP: The Ensembl Variant Effect Predictor annotates variants with their predicted molecular consequences on genes, transcripts, and protein sequence
4 (Phase IV): Drugs that have already been given regulatory approval for the treatment of specific diseases in certain regions. Only Phase IV trials for approved indications are included
3 / 2 / 1 (Phase III, II, and I): Drugs that are currently being tested in clinical trials for potential indications
0.5 (Early Phase I): Drugs that are in the preliminary stages of clinical testing
-1 (Unknown): Drug or clinical candidate drug where a clinical phase cannot be assigned
null (Preclinical): Compounds with bioactivity data that have not yet reached clinical trials
Open Targets post-GWAS analysis pipelines
The Platform is built upon a significant effort to analyse genome-wide association studies and interpret them in the context of functional genomics studies. This effort to inform target identification and prioritisation leverages Gentropy, a highly scalable Python framework for post-GWAS analysis.
A more detailed view on the data and methodology is available in:
Data sources used to ingest variant annotation, GWAS and functional genomics information.
Fine-mapping pipelines to identify likely causal variants in GWAS-significant signals in different sources
Colocalisation performed on overlapping GWAS and molQTL credible sets
Locus-to-Gene predictions to assign likely causal genes to identified credible sets
Overview of the GWAS association and functional genomics data sources
The Platform ingests GWAS, functional genomics and additional datasets to aid the interpretation of the observed signals. These include:
The NHGRI-EBI GWAS Catalog is a data source that provides detailed, structured genome-wide association study data in standardised format of summary statistics and curated associations (top-hits) to EFO traits. We rely on the GWAS Catalog for a rich source of genetic associations, utilising the data for analysis and interpretation.
The GWAS Catalog information feeds the Open Targets Platform by providing study metadata and GWAS associations from two sources:
The Platform ingests the GWAS Catalog curated associations (top-hits) from the table “All associations - with added ontology annotations, GWAS Catalog study accession numbers and genotyping technology” (see GWAS Catalog download page). The GWAS Catalog curation process may result in multiple GWAS being grouped under one study accession, in which case one study accession will need to be split into multiple studies in the Platform. An example of such a study is: GCST001762, where the reported trait is 'Obesity-related traits', but the underlying association data informs which top-hit comes from which specific trait, for example 'BMI z-score change'. In this case, we create as many new studies as there are association-level trait annotations. The new study identifiers are generated from the original study accession plus a suffix. The disease annotations of the newly created studies are inherited from the association data. A similar action is required when results from different ancestries are pooled under one study accession.
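As a toy illustration of this splitting rule, the sketch below derives suffixed study identifiers from association-level traits. The input rows and the suffix scheme are invented for the example; the Platform's actual identifiers may differ.

```python
# Toy sketch of the study-splitting rule described above: one GWAS Catalog accession with
# several association-level traits becomes several Platform studies, each identified by
# the accession plus a suffix. The example traits and the "_1", "_2" suffix scheme are
# illustrative only.
import pandas as pd

associations = pd.DataFrame({
    "studyAccession": ["GCST001762"] * 3,
    "associationTrait": ["BMI z-score change", "BMI z-score change", "Another example trait"],
})

traits = associations.groupby("studyAccession")["associationTrait"].unique()
for accession, trait_list in traits.items():
    for i, trait in enumerate(trait_list, start=1):
        new_id = accession if len(trait_list) == 1 else f"{accession}_{i}"
        print(new_id, "->", trait)
```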
The harmonisation process includes a number of steps that flag any associations with quality concerns. When harmonising GWAS Catalog curated associations (GCCA), the following cases are flagged:
Variant interaction associations
Variant location is not available in an association source
Variant location data is inconsistent (chromosome and position values don’t match)
Lead variant is duplicated for the same study
The provided variant location doesn't match any gnomAD variant
The risk allele couldn't be mapped to the gnomAD reference
The lead variant had palindromic alleles
The lead variant p-value was above the genome-wide significance threshold (p-value > 1e-8)
Curated associations are converted into the Gentropy StudyLocus format and uniformly flagged with the Study locus from curated top hit quality control flag. A corresponding StudyIndex is generated to capture the study metadata.
The Platform ingests GWAS Catalog studies from full summary statistics when they have been harmonised following the protocol. Each harmonised study is converted into the Gentropy SummaryStatistics format as described in the Gentropy documentation, including additional quality controls:
filtering out SNPs with unavailable beta, standard error, or p-value
filtering out SNPs with zero or Inf values for beta and standard error
filtering out SNPs with negative p-values or standard error
filtering out SNPs with p-values equal to 1
In some cases, this results in empty SummaryStatistics datasets, and these studies are excluded from further analysis.
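A minimal sketch of these row-level filters, assuming a summary statistics table with beta, se and pval columns (the column names are illustrative placeholders, not the actual Gentropy schema):

```python
import numpy as np
import pandas as pd

def filter_summary_statistics(sumstats: pd.DataFrame) -> pd.DataFrame:
    """Drop SNPs that fail the row-level sanity checks listed above."""
    beta, se, pval = sumstats["beta"], sumstats["se"], sumstats["pval"]
    keep = (
        beta.notna() & se.notna() & pval.notna()   # unavailable beta / SE / p-value
        & np.isfinite(beta) & np.isfinite(se)      # Inf values
        & (beta != 0) & (se != 0)                  # zero beta or standard error
        & (se > 0) & (pval > 0)                    # negative standard error or p-value
        & (pval != 1)                              # p-values equal to 1
    )
    return sumstats.loc[keep]
```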
The Open Targets team performs additional manual curation for the GWAS Catalog summary statistics (GCSS) studies from “All studies - With study accession numbers, ontology annotations, genotyping technology, cohort identifiers and full summary statistics availability”. The following information is extracted from the corresponding publication:
Study type — The GCSS studies identified as pQTL or microbiome GWAS are flagged and excluded from further analysis.
Analysis type — studies are flagged if performed with any of the following analyses: multivariate analysis, ExWAS, non-additive model, metabolite, GxG, GxE, or case-case study. Studies with any analysis flag other than “metabolite” are excluded from SuSiE fine-mapping and are fine-mapped using PICS instead.
A StudyIndex dataset is created as part of this pipeline containing all the available metadata for the included GWAS Catalog studies. If available, the ancestries from GWAS Catalog are mapped to gnomAD ancestry suffixes using the dictionary here. The ancestry isn't assigned if it is unavailable or not present in the dictionary.
GWAS summary statistics quality control (QC) is performed for all GWAS Catalog studies with available summary statistics, following the methods described in Winkler et al. (2014):
The P-Z test. This check estimates the mean and standard deviation of the difference between the log p-values reported in the study and those derived from the reported betas and standard errors. If at least one of these values for the study is greater than 0.05, the study fails QC and is flagged with the label The PZ QC check values are not within the expected range.
The mean beta check. This check estimates the mean value of the beta across all SNPs in the study. If the absolute mean value is more than 0.05, the study fails QC and is flagged with the label The mean beta QC check value is not within the expected range.
The genomic control (GC) lambda check. This check estimates the additive GC lambda of the study (see Tsepilov et al. (2013)). If the GC lambda value is outside the [0.7, 2.5] range, the study fails QC and is flagged with the label The GC lambda value is not within the expected range.
Number of variants. All summary statistics with fewer than 2,000,000 variants do not fail QC but are flagged with the label The number of SNPs in the study is below the expected threshold. These studies are not eligible for SuSiE fine-mapping.
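A simplified sketch of these study-level checks, assuming NumPy arrays of betas, standard errors and reported p-values. The thresholds follow the description above; the GC lambda here is the standard median-based estimate rather than the additive lambda of Tsepilov et al., and the exact Gentropy implementation may differ:

```python
import numpy as np
from scipy import stats

def summary_statistics_qc(beta, se, pval, n_variant_threshold=2_000_000):
    """Return a dict of QC flags for one study, following the checks described above."""
    flags = {}

    # P-Z check: compare reported log10 p-values with those implied by beta / se.
    z = beta / se
    implied_logp = np.log10(2 * stats.norm.sf(np.abs(z)))
    diff = np.log10(pval) - implied_logp
    flags["pz_check_failed"] = (abs(np.mean(diff)) > 0.05) or (np.std(diff) > 0.05)

    # Mean beta check: the average effect size should be close to zero.
    flags["mean_beta_check_failed"] = abs(np.mean(beta)) > 0.05

    # Genomic control lambda: median chi-square over its expected value under the null.
    gc_lambda = np.median(z ** 2) / stats.chi2.ppf(0.5, df=1)
    flags["gc_lambda_check_failed"] = not (0.7 <= gc_lambda <= 2.5)

    # Number of variants: small studies are flagged (not failed) and skipped for SuSiE.
    flags["too_few_variants"] = len(beta) < n_variant_threshold

    return flags
```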
FinnGen is an academia-industry partnership that aims to produce genome variant data for 500,000 Finns. The genomic data is then combined with phenotype data collected by national health registries, including extensive longitudinal registry data available on all Finns. More details on the protocols are available in the FinnGen documentation. Because the Finnish population has been genetically isolated, fine-mapping was performed with a suitable population-specific LD reference panel. The Platform includes the fine-mapping results produced by the FinnGen team using SuSiE and FINEMAP, based on a reference panel of whole-genome sequencing data from Finnish individuals.
The Platform includes the 95% credible sets as described in the documentation after converting them to Gentropy StudyIndex and StudyLocus objects. Credible sets with a lead SNP p-value >= 1e-5 are marked with a Subsignificant p-value flag.
The eQTL Catalogue aims to provide unified gene expression, protein-level and splicing QTLs from publicly available human studies. Using a standardised pipeline (QTLmap), the eQTL Catalogue is integrated in the Platform as a rich source of molecular QTLs (molQTLs). The pipeline uses SuSiE as the fine-mapping method and results in 95% CSs.
All credible sets and study information are reformatted to Gentropy's StudyIndex and StudyLocus datasets. Credible sets with a lead SNP p-value >= 1e-3 are marked with a Subsignificant p-value flag.
📖 Gentropy
The Pharma Proteomics Project is a pre-competitive biopharmaceutical consortium characterising the plasma proteomic profiles of 54,219 participants in the UK Biobank.
The Platform includes GWAS full summary statistics for 2,954 proteins from participants of European ancestry, obtained from the Synapse platform. Data is converted to SummaryStatistics and StudyIndex format applying the following modifications:
SNPs with MAF<1e-4 and INFO<0.8 are filtered out
Align the order of effect and reference alleles with the gnomAD annotation (sketched below): if the alleles are reversed, the sign of the effect size is changed; if the allele combination does not match the reference, the SNP is filtered out
Flag QTLs as cis or trans based on a 5Mb window from the affected gene.
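A schematic sketch of the allele-alignment step, assuming per-SNP effect/other alleles and a gnomAD reference/alternate annotation (field names are illustrative):

```python
def harmonise_allele(effect_allele, other_allele, gnomad_ref, gnomad_alt, beta):
    """Align study alleles to the gnomAD reference; return (beta, keep) after harmonisation."""
    if effect_allele == gnomad_alt and other_allele == gnomad_ref:
        return beta, True    # already aligned
    if effect_allele == gnomad_ref and other_allele == gnomad_alt:
        return -beta, True   # alleles reversed: flip the sign of the effect
    return beta, False       # allele combination does not match gnomAD: drop the SNP
```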
gnomAD (Genome Aggregation Database) is a comprehensive resource that provides aggregated genomic data from large-scale sequencing projects. It encompasses variants from diverse populations and is widely used for variant annotation and population genetics studies.
gnomAD (v4) variant annotation is used to provide extra annotation for variants included in the Platform and to assist some of our harmonisation pipelines. Among the most relevant annotations included are population allelic frequencies, in silico predictors and cross-references. All variants are available in GRCh38 coordinates.
gnomAD v2.1.1 LD matrices are used to create a multi-ancestry LD reference for the PICS fine-mapping method. All variants are lifted over to build GRCh38.
🌍 Website
The Platform uses Pan-UK Biobank project LD matrices for three ancestries: Non-Finnish European (NFE), Central/South Asian (CSA) and African (AFR). See the descriptive summary from Pan-UK Biobank. The LD was computed for each chromosome in a 10 Mb radius. Only SNPs with INFO > 0.8 and MAC > 20 in each population are used to calculate LD. Matrices are stored in hail BlockMatrix format. We use these LD matrices for SuSiE fine-mapping.
The Platform uses Ensembl as a source of gene, transcript and variant annotations. Affected genes in QTL studies are validated using a fixed version of Ensembl, and all variants are annotated using Ensembl VEP.
📖 Gentropy
The Experimental Factor Ontology, UBERON and Cell Ontology are composed as a meta-ontology to capture every tissue or cell type that could be described in a QTL study.
Each unique target–disease pair in the Open Targets Platform is defined as an association. For example, while there might be several pieces of evidence referring to CFTR and Cystic fibrosis from multiple sources, one single association contextualises all this information within the Platform. Also, since multiple pieces of evidence might refer to the same or similar associations, the Platform undertakes a series of steps to quantify their relative strength for a given association.
The Platform associations aim to aggregate all evidence referring to the target and disease, but the complex phenotypic representation of the disease might sometimes cause different pieces of evidence to be annotated against slightly different levels of granularity of the disease.
For example, the association between inflammatory bowel disease and NOD2 is annotated with multiple pieces of evidence referring to these two terms specifically. The Platform refers to associations described by aggregated evidence between two specific terms in our data sources as direct associations. By default, direct associations are displayed in the web application when listing associated diseases or phenotypes with a target of interest.
However, evidence can sometimes be informative to discriminate targets in similar diseases or phenotypes. For example, when evaluating the inflammatory bowel disease and NOD2 association, other pieces of evidence describing the relationship between Crohn's disease and NOD2 might also be informative.
To approach this problem systematically, the Platform makes use of the properties of the disease ontology (EFO), to select all evidence referring to NOD2 in the context of inflammatory bowel disease or any of its ontology descendants (including Crohn's disease). This type of association is referred to in the platform as an indirect association. Indirect associations are displayed in the web application when displaying all the evidence for a target-disease association or listing associated targets with a disease or phenotype of interest.
Summary
An association page for diseases associated with a target (e.g. NOD2 associations page) includes direct evidence only.
An association page for targets associated with a disease (e.g. Inflammatory Bowel Disease associations page) includes both direct and indirect evidence.
An evidence page (e.g. NOD2 and inflammatory bowel disease) displays both direct and indirect evidence.
The same calculations are applied to compute the association scores; the only difference is the evidence included in the calculation.
Both direct and indirect associations can be queried using the GraphQL API or our data downloads page.
Note
RNA expression data type evidence is not propagated in the ontology. We made this decision to prevent parent terms from having long lists of associated targets with weak RNA expression association scores.
Deciding what constitutes a strong association is open to interpretation. While some data sources can be more deterministic about the underlying causal evidence, others can occasionally point to targets indirectly linked to the disease. The Platform relies on a series of heuristics to maximise the transparency of the target and disease rankings.
The Platform's scoring by data source aims to take into account variations in how the data sources organise their evidence.
For example, some data sources are more stringent when it comes to defining what constitutes a single piece of evidence, whereas others rely on their internal evidence score to stratify the strength of the evidence they present.
Additionally, some data sources will capture the meaningful association in one single piece of evidence. In other data sources, the repetition of evidence increases our confidence in the association.
The scoring by data source aims to balance all these differences and provide a consensus view on the strength of the underlying evidence for a particular source.
For all cases, the Platform defines a data source association score by calculating a harmonic sum using the full vector of evidence scores as defined for each data source, following these steps:
The pieces of evidence are sorted in descending order and assigned an incremental value that indicates their position in the sorted list (the top-scoring item has a positional id of 1, the second has a positional id of 2, and so on).
The harmonic sum for each data source is then calculated by summing the result of dividing each evidence score by (positional id^2).
To ensure the result is between 0 and 1, the harmonic sum is normalised by dividing the result by the maximum theoretical harmonic sum, which is the one calculated using an infinite vector of ones. The platform derives this calculation (which approximates to 1.644) by using a vector of 1,000 ones.
Example
To calculate the data source association score for a vector of evidence scored 1, 0.9 and 0.8, the Platform follows this logic:
Step 1: Sorting/Indexing
Step 2: Harmonic sum calculation
Step 3: Scaling
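A minimal Python sketch of this calculation applied to the example vector above; the scaling constant (~1.644) is approximated with a vector of 1,000 ones, as described in the steps:

```python
def harmonic_sum(scores, max_terms=1_000):
    """Normalised harmonic sum used for the data source association score."""
    # Step 1: sort evidence scores in descending order; positional ids are 1, 2, 3, ...
    ordered = sorted(scores, reverse=True)
    # Step 2: harmonic sum of score / positional_id^2
    raw = sum(score / (i ** 2) for i, score in enumerate(ordered, start=1))
    # Step 3: scale by the maximum theoretical harmonic sum (vector of ones), ~1.644
    max_theoretical = sum(1 / (i ** 2) for i in range(1, max_terms + 1))
    return raw / max_theoretical

print(harmonic_sum([1, 0.9, 0.8]))  # ~0.80: (1/1 + 0.9/4 + 0.8/9) = 1.314, divided by ~1.644
```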
The data type association score aims to capture the strength of the supporting evidence at the data type level (e.g. Genetic Associations). A second harmonic sum is calculated by using the vector of data source association scores weighted by the data source weights.
Association score by data type scaling
While the harmonic sum calculation remains mostly the same for data sources, data types and overall, the scaling factor is slightly modified in the association by data type calculation.
So that data types only featuring one data source (e.g. text mining) are not penalised, the maximum theoretical harmonic sum score is calculated based on a vector of as many ones as there are data sources in the respective data type. In this way, the scaling factor of a data type with one data source will be 1.0/1^2 = 1, whereas a data type with 3 data sources will be scaled by the result of calculating 1.0/1^2 + 1/2^2 + 1/3^2 = 1.36.
The overall association score aims to summarise all the aggregated evidence for a given target-disease association. The score is derived by calculating the harmonic sum of the association score by data source weighted by the data source weights, regardless of their data type categorisation. The algorithm to compute the scores is the same as the association by data source, resulting in a score between 0 and 1.
To calculate both data type and overall association scores, evidence is weighted using a factor that aims to calibrate the relevance of each data source relative to others. The default weights used in the web application can be modified by the user to adjust to different prioritisation strategies, both in the API and in the user interface through the "Advanced Option" tab from the new Associations on the Fly page.
Europe PMC: 0.2
Expression Atlas: 0.2
IMPC: 0.2
PROGENy: 0.5
SLAPenrich: 0.5
Cancer Biomarkers: 0.5
SysBio: 0.5
OTAR Projects (partner preview only): 0.5
Others: 1
There are a few important considerations regarding association scores. As described above, association scores are a heuristic based on the availability of data. While scores are useful to rank lists of targets or diseases, they should not be interpreted as a confidence score for the target-disease association.
For example, under-studied diseases are unlikely to produce high-scoring targets due to the lack of available evidence. In such diseases, a relatively low-scoring target might still be the top-ranked target and potentially a very interesting lead from a therapeutic standpoint.
Similarly, not all associations with available target–disease evidence should be considered legitimate target–disease associations. Some of our data sources rely on predictions to assess the relationship between a target and a disease, so such evidence should be considered with caution, always taking its relative support into account.
Fine-mapping is a statistical analysis technique used to pinpoint the specific genetic variant(s) most likely responsible for a trait association identified in a GWAS. The main result is a credible set (CS): the minimal set of variants, each assigned a posterior inclusion probability (PIP) of being causal, that together reach a predefined cumulative probability. The Platform uses 95% CSs, meaning the CS has a 0.95 probability of containing the causal variant. The sum of PIPs for all variants in a CS has to be within the range [0.95, 1].
Open Targets Platform fine-mapping strategies (see more in Fine-mapping pipelines):
(Summary table of fine-mapping strategies per data source; the recoverable entries describe study-specific SuSiE fine-mapping with in-sample LD. See the Fine-mapping pipelines section for the full breakdown.)
📖 Gentropy
Clumping is a technique for selecting the most significant variants within a region or set of variants in LD, essentially collapsing signals into loci with an increased chance of containing a single causal variant for the phenotype. Three different methods are used for clumping:
The method is based on an iterative procedure in which the variant with the strongest p-value is selected and all other variants within a predefined distance (radius) from this variant are clumped together. The procedure is repeated as long as there is at least one significant variant. The output of the method is a list of lead variants. There are two parameters: the distance to clump and the p-value significance threshold.
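A simplified sketch of the distance-based clumping procedure, assuming a table of variants with chrom, pos and pval columns (illustrative names; the Gentropy implementation operates on Spark DataFrames and differs in detail):

```python
import pandas as pd

def distance_based_clumping(variants: pd.DataFrame,
                            radius: int = 500_000,
                            p_threshold: float = 1e-8) -> pd.DataFrame:
    """Iteratively pick the most significant variant and clump everything within `radius` of it."""
    remaining = variants[variants["pval"] <= p_threshold].copy()
    leads = []
    while not remaining.empty:
        lead = remaining.loc[remaining["pval"].idxmin()]       # strongest remaining signal
        leads.append(lead)
        same_chrom = remaining["chrom"] == lead["chrom"]
        within = (remaining["pos"] - lead["pos"]).abs() <= radius
        remaining = remaining[~(same_chrom & within)]          # clump and remove neighbours
    return pd.DataFrame(leads)
```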
The method is applied to the results of the distance-based clumping (list of lead variants). It is again based on an iterative procedure in which the variant with the strongest p-value is selected and all other variants with high LD (r2>=0.5) are clumped together.
The method consists of three steps:
In the first step, we perform the regular distance-based clumping.
In the second step, we filter the input summary statistics by the baseline p-value (default is 1e-5). We then clump variants that are closer to each other than the cutoff distance (default is 250,000 bp). Next, we keep only clumps containing at least one variant reaching genome-wide significance. At this stage, we define a list of loci consisting of information about the lead variant and the locus boundaries (the leftmost and rightmost variants in the clump). We then subtract/add the flanking distance (default 100,000 bp) from the left/right boundaries respectively, to avoid loci of size 0 when a clump consists of a single lead variant.
In the third step, we select the loci larger than the specified large locus size threshold (1,500,000 bp by default) and 'break' each of the large loci using the lead variants from the distance-based clumping that lie within the boundaries of the large locus. For each of these distance-based lead variants, we assign boundaries of +/- half the large locus size threshold. Thus, each large locus is divided into several overlapping loci of the size of the large locus size threshold. The small loci are unaffected by the splitting.
The general procedure results in a list that contains information about the lead variant and the locus boundaries (the leftmost and rightmost variants in the cluster), and the largest locus size doesn't exceed the large locus size threshold. The boundaries of the locus are then used to define the region for fine-mapping and LD matrix ingestion. We use this procedure because the locus breaker results in smaller locus sizes on average than window-based clumping.
Acknowledgements: We are grateful to our colleagues at Human Technopole, Sodbo Sharapov and Nicola Pirastu, for advice on this method.
📖 Gentropy
The PICS algorithm was originally implemented in Farh et al. (2015) investigating the fine-mapping of causal autoimmune disease variants. It is a method to fine-map the most likely causal variants associated with a trait or disease within a haplotype. The algorithm is based on the calculation of the Posterior Inclusion Probabilities (PIP) of tag variants linked to the lead variant by LD within the target population. Only five populations are used for PICS fine-mapping: African-American (AFR), American Admixed/Latino (AMR), East Asian (EAS), Finnish (FIN), and Non-Finnish European (NFE). The calculation is performed by using all the proxy variants where r2 ≥ 0.5 and the default parameter k=6.4 as reported in the original paper.
📖 Gentropy
If both summary statistics and high-precision LD are available for the locus, we use SuSiE-inf fine-mapping (see Cui et al. (2024) for more details). This method is a generalisation of the original SuSiE, allowing modelling of infinitesimal effects alongside fewer larger causal effects.
SuSiE-inf has two approaches for updating estimates of the variance components: Method of Moments and Maximum Likelihood Estimator ('MoM' / 'MLE'). The function takes an array of Z-scores and a NumPy matrix of variant LD to perform fine-mapping. A boolean option “est_tausq” enables the estimation of infinitesimal effects; if it is disabled, fine-mapping behaves like the original SuSiE method. We use SuSiE-inf only in combination with pan-UK Biobank LD matrices for three ancestries: NFE, CSA and AFR.
Different pipeline strategies have been defined for different sources based on a combination of the above methodologies:
Distance and LD-based clumping is applied to all available GCSS and GCCA. P-value significance is defined at a <= 1e-8 threshold, and distance-based clumping is performed using a 500,000 bp radius. LD clumping is performed on the same window using the LD dataset from gnomAD v2.1.1. PICS fine-mapping is then performed using the same LD information, and credible sets are defined based on 95% inclusion probability. In the case of multiple-ancestry studies, we use Non-Finnish European (NFE) if it is in the list of ancestries, or the major ancestry otherwise. If the study's ancestry is not available among the LD index ancestries, no fine-mapping results are produced. If the lead variant is not present in the LD index, the credible set contains only that variant and is flagged with the label Variant not found in LD reference. All resulting credible set objects are additionally flagged with the label Study locus fine-mapped without in-sample LD reference.
GWAS-significant study-loci are identified using a p-value threshold <= 1e-8 and the locus-breaker method. SuSiE-inf fine-mapping is applied to all study-loci meeting the following criteria:
Major ancestry for the study is NFE, AFR, CSA or EAS. If the major ancestry was EAS, we used CSA instead
The study type is "gwas"
The study has no analysis flags except "metabolite"
Study has no quality control flags
The locus didn't overlap with the MHC region. We didn't exclude X or Y chromosomes if they were present in the GWAS
The number of variants in the locus after overlapping with the LD matrix was in the range [100, 15,000]
For all eligible study loci, the SuSiE-inf method was applied without estimation of infinitesimal effects (equivalent to the classical SuSiE method) using pan-UKBB LD matrices. The resulting 95% CSs are filtered based on CS log(BF) <= 2, minimum R2 purity <= 0.25 and lead variant p-value >= 1e-5. Additionally, the pairwise r^2 between lead variants within the locus is calculated, removing the less significant of the CSs if r2 >= 0.8. As the locus-breaker procedure can create overlapping loci, we can obtain redundant credible sets within the same study. Thus, if two CSs from different loci within a study had the same lead variant, we removed one of them, leaving the CS with the largest CS log(BF). All resulting credible sets from this pipeline are flagged with the label Study locus fine-mapped without in-sample LD reference.
The UKBB-PPP summary statistics are clustered into loci using the locus-breaker method with a p-value significance threshold <= 1.7e-11. The resulting study-loci are fine-mapped using the SuSiE-inf method with Pan-UKBB LD matrices for the EUR population. The resulting 95% CSs are filtered similarly to the GCSS SuSiE pipeline described above. All resulting StudyLocus objects are additionally flagged with the label Study locus fine-mapped without in-sample LD reference.
A credible set is a set of genetic variants near a genetic association signal that is predicted, with a specific probability, to include the causal variant for that signal. The results of the fine-mapping analysis determine this, assigning each variant in the region a posterior probability of being causal when considering the observed statistics and the population structure. The variants covering the top 95% likelihood of containing the causal variant define the credible sets in the Platform.
A credible set results from statistical analysis on a specific locus in a study. As a consequence, all credible sets are defined as:
Study in which the association is reported
Lead variant - Variant with the highest posterior probability in the credible set
Fine-mapping method and statistics
The Platform contains every credible set resulting from fine-mapping all our sources after applying certain exclusion criteria.
Credible set identifier
Unlike other entities in the Open Targets Platform, credible sets are identified by an alphanumeric string of characters with no semantic meaning. The identifier is derived from a combination of fields that define a credible set's uniqueness, and it will remain the same as long as the credible set metadata hasn't changed.
Credible sets fulfilling any of the next rules are excluded from the Platform:
The lead variant is within the MHC region (chr6:25726063-33400556)
The credible set is not in a valid study
There is another fine-mapped SuSiE credible set from the same region and study
Being a GWAS catalog fine-mapped top-hit, there is a GWAS catalog fine-mapped credible set from summary statistics for the same region and study
The lead variant is reported on a chromosome outside the accepted set (1-22, X, Y, XY, MT)
The sum of PIPs in the credible set is not within the [0.95, 1] range
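A sketch of the simpler, self-contained exclusion checks (MHC overlap, chromosome validity and PIP sum), assuming GRCh38 lead-variant coordinates as given above; the study-level and redundancy rules need additional context and are not shown:

```python
MHC_REGION = ("6", 25_726_063, 33_400_556)   # GRCh38, as stated above
VALID_CHROMOSOMES = {str(c) for c in range(1, 23)} | {"X", "Y", "XY", "MT"}

def passes_basic_exclusion_rules(lead_chrom: str, lead_pos: int, pip_sum: float) -> bool:
    """True if the credible set survives the simple, locus-level exclusion rules."""
    in_mhc = lead_chrom == MHC_REGION[0] and MHC_REGION[1] <= lead_pos <= MHC_REGION[2]
    valid_chrom = lead_chrom in VALID_CHROMOSOMES
    pip_ok = 0.95 <= pip_sum <= 1.0
    return (not in_mhc) and valid_chrom and pip_ok
```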
Credible sets are categorised based on the fine-mapping confidence, derived from the available association data, the fine-mapping framework and the availability of linkage disequilibrium (LD) information for the specific study/population. The categories in descending order of confidence are:
SuSiE or SuSiE-inf fine-mapped credible set with in-sample LD
SuSiE or SuSiE-inf fine-mapped credible set with out-of-sample LD
PICS fine-mapped credible set extracted from summary statistics
PICS fine-mapped credible set based on reported curated association
Unknown confidence
The confidence values are symbolised by 0 to 4 stars on the UI. It is important to note that this confidence does not reflect the strength of the association or the effect size.
All variants in the credible set are annotated with:
P-value. Unconditioned p-value from the study when available.
Beta. Corresponds to SuSiE mu (SuSiE fine-mapped GWAS Catalog, UKBB-PPP and FinnGen) or beta (PICS fine-mapped GWAS Catalog and SuSiE fine-mapped eQTL Catalog)
Standard error. Only available for PICS-fine-mapped credible sets.
LD (r^2). Linkage-disequilibrium information. Only available for PICS-fine-mapped credible sets.
Posterior probability. Posterior inclusion probability (PIP) of variant being causal after fine-mapping.
log(BF). The logarithm of the Bayes factor. Only available for SuSiE credible sets.
Predicted consequence. The most severe consequence across all overlapping canonical transcripts, as reported by Ensembl VEP.
Machine-learning prioritisation of likely causal genes based on available features. L2G integrates multiple features to predict the most likely causal gene in the neighbourhood of the observed association. All predictions for protein-coding genes with a score above 0.05 are displayed. See the Locus-to-Gene section for a description of the methodology.
Using the SHAP (SHapley Additive exPlanations) library, we have extracted feature importance values for all L2G predictions. Shapley values provide a principled, game theory-based approach to explain the contribution of individual features or groups of features, revealing how each group influences the final L2G score. These contributions are approximated to be additive, meaning the sum of the Shapley values for all feature groups equals the total L2G score, or comes reasonably close to it.
We have aggregated the Shapley values into the main feature groups to understand their relative importance.
The base value represents the baseline before any feature-specific information is considered, and is therefore equivalent for all genes and credible sets.
This approach helps to identify which types of evidence (e.g. distance, colocalisation or functional impact) are most influential for a given locus-gene association. In addition, we can visualise the individual contribution of each feature within the group. As the features within the group are highly correlated, the individual values are less interpretable than the group contribution.
Additivity of SHAP values
Because the SHAP analysis is an approximation, on some occasions the sum of all Shapley values might not exactly equal the L2G score.
A full description of all features is available here.
Source: Open Targets
Credible sets are compared against other credible sets to find overlapping signals. Two credible sets overlap when they share at least one variant. The Platform contains all the overlaps between all GWAS vs all GWAS and all GWAS vs all molQTL studies.
For the overlapping pairs of credible sets, estimates for two co-localisation methods are computed based on the type of credible set:
Source: Open Targets
A directionality assessment is included in the colocalisation analysis to help the user interpret the relationship between the two overlapping credible sets.
For every overlapping variant in a pair of overlapping credible sets, the sign of the ratio between the two beta estimates is calculated (+1 or -1), indicating the individual variant directionality. The average of these signs across all overlapping variants is then used to assess directionality: when the average approximates to +1, the two credible sets are assessed to share the same directionality; if the average approximates to -1, the two credible sets are interpreted to have opposite directionality. In any other case, the assessment is declared inconclusive (N/A).
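A minimal sketch of this directionality assessment for a pair of overlapping credible sets, assuming aligned beta estimates for the shared variants. The ±0.5 cut-off for "approximates to ±1" is illustrative; the Platform's exact threshold is not stated here:

```python
import numpy as np

def directionality(betas_left, betas_right, tolerance: float = 0.5) -> str:
    """Assess whether two overlapping credible sets share effect direction."""
    signs = np.sign(np.array(betas_left) * np.array(betas_right))  # +1 same sign, -1 opposite
    mean_sign = signs.mean()
    if mean_sign > tolerance:
        return "same direction"
    if mean_sign < -tolerance:
        return "opposite direction"
    return "N/A"                                                   # inconclusive
```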
Source: Open Targets
Pairs of credible sets overlapping at least one variant are subject to additional analysis to infer the likelihood of them sharing the same causal variant. The Platform compares all GWAS vs all GWAS credible sets and all GWAS vs all molQTL credible sets. For overlapping credible sets, co-localisation metrics are estimated using the following methods:
H0: No association with either trait
H1: Association with trait 1, not with trait 2
H2: Association with trait 2, not with trait 1
H3: Association with trait 1 and trait 2, two independent SNPs
H4: Association with trait 1 and trait 2, one shared SNP
To reduce potential false positives in small overlaps, if the overlap contains fewer than 5 SNPs, we require the maximum product of PIPs across the overlapping SNP pairs to be greater than 0.01; otherwise, the overlap is excluded from the COLOC analysis.
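A sketch of this small-overlap filter, assuming PIP vectors aligned over the shared variants of both credible sets:

```python
def eligible_for_coloc(pips_left, pips_right,
                       min_snps: int = 5, min_pip_product: float = 0.01) -> bool:
    """Exclude small overlaps unless at least one shared SNP pair has a convincing PIP product."""
    if len(pips_left) >= min_snps:
        return True                                    # large overlaps are always analysed
    max_product = max(l * r for l, r in zip(pips_left, pips_right))
    return max_product > min_pip_product
```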
Credible set-based colocalisation
All co-localisations presented in the Platform are based on credible sets. No estimates are based on full-locus co-localisation.
Gentropy is an open-source Python package to facilitate the interpretation and analysis of GWAS and functional genomic studies for target identification. The Platform leverages Gentropy to perform post-GWAS analysis and derive the evidence and datasets for web portal visualisation.
Gentropy provides a set of data models, data ingestion methods and statistical analyses organised in discrete steps to maximise re-usability. Gentropy is designed with scalability in mind, making it suitable both for small dedicated analyses and for the high-performance orchestrated tasks required to generate the data lake that populates the Open Targets Platform.
Every event or set of events pinpointing a target as a potential causal gene or protein for a disease represents the unit of information, most often referred to as evidence. Within the Open Targets Platform, a series of pipelines ensure information is retrieved from its sources and standardised in a way that can be immediately applied to answer drug development queries.
All evidence is mapped to the reference target entity identifier (Ensembl gene) and disease or phenotype identifier (experimental factor ontology, EFO), as well as other reference controlled vocabularies and ontologies when appropriate. Evidence is also reviewed to minimise the presence of duplicates within the same data source.
Data sources are also grouped into bigger categories abstracting the type of evidence they predominantly capture. In the platform, these categories are usually referred to as data types, as opposed to the individual resource data referred to as data sources.
The Open Targets Platform provides a scoring framework for each data source to contextualise the relative importance of each piece of evidence. This score will be more relevant when understanding the association scoring in later sections.
The GWAS associations data source aggregates target-disease relationships supported by significant genome-wide associations (GWAS) in the context of other functional genomics data.
Datatype: Genetic associations
Gene burden data comprises gene–phenotype relationships observed in gene-level association tests using rare variant collapsing analyses. The Platform integrates burden tests carried out by several sources:
REGENERON (Backman et al., 2021), a whole-exome sequencing analysis of individuals from the UK Biobank.
AstraZeneca PheWAS Portal (Wang et al., 2021), a whole-exome sequencing analysis of individuals from the UK Biobank.
Genebass (Karczewski et al., 2022): Gene-based Association Summary Statistics (Genebass), a whole-exome sequencing analysis of individuals from the UK Biobank.
The results of whole-exome and whole-genome sequencing analysis based on the SPARK cohort bring evidence of novel targets implicated in autism spectrum disorder (Zhou et al., 2022).
The SCHEMA consortium (Singh et al., 2022), a whole-exome sequencing analysis of individuals with schizophrenia.
The Epi25 collaborative (Epi25 Collaborative, 2019), a whole-exome sequencing analysis of individuals with epilepsy.
The Autism Sequencing Consortium (Satterstrom et al., 2020), a whole-exome sequencing analysis of individuals with autism spectrum disorder.
The results of an Open Targets project (Bomba et al., 2022), a whole-exome sequencing analysis of individuals from the INTERVAL cohort testing for associations between rare coding variants and blood metabolites.
The results of a pan-ancestry whole-exome sequencing analysis identify relevant genes associated with fat distribution (Akbari et al., 2022).
The results of whole-exome and whole-genome sequencing analyses of Parkinson's disease promoted by the AMP-PD initiative and other collaborators (Makarious et al., 2022).
The results of gene-based analyses of rare variants and circulating metabolic biomarkers relevant to cardiovascular disease (Riveros-McKay et al., 2020).
The results of rare coding variant analyses from whole exome sequencing of Black South African men to identify genes significantly associated with prostate cancer (Soh et al., 2023)
These associations are the result of collapsing rare variants in a gene into a single burden statistic and regressing the phenotype on that statistic to test for the combined effects of all rare variants in the gene. The different collapsing methods inform about the filters used to select the set of qualifying variants, mostly based on their pathogenicity and frequency in the population.
Datatype: Genetic associations
Evidence scoring: Scaled p-value from 0.25 (p = 1e-7) to 1 (p < 1e-17).
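One way to read this scaling (an assumption; the exact function used by the pipeline is not spelled out here) is a linear interpolation in -log10(p) space, clipped to the [0.25, 1] range:

```python
import numpy as np

def scaled_pvalue_score(pval: float,
                        p_low: float = 1e-7, score_low: float = 0.25,
                        p_high: float = 1e-17, score_high: float = 1.0) -> float:
    """Map a burden-test p-value to an evidence score between 0.25 and 1 (assumed linear in -log10 p)."""
    logp = -np.log10(pval)
    lo, hi = -np.log10(p_low), -np.log10(p_high)
    score = score_low + (score_high - score_low) * (logp - lo) / (hi - lo)
    return float(np.clip(score, score_low, score_high))
```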
Direction of Effect assessment:
ClinVar is an NIH public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. The ClinVar data source in the Open Targets Platform captures the subset of ClinVar that refers to germline variants (as opposed to somatic variants). Each evidence in the platform aims to capture an individual RCV record in ClinVar.
Information on variants is covered extensively for both single point and structural variants. When available, genomic coordinates are reported with RS numbers, or by following the CHROM_POS_REF_ALT and HGVS notations.
Datatype: Genetic associations
Evidence scoring: ClinVar evidence is scored in a 2-step process. In Step 1, a score is assigned to every piece of evidence based on the clinical significance:
In Step 2, the score is modulated based on the ClinVar review status:
Direction of Effect assessment:
The Genomics England PanelApp is a knowledge base that combines crowdsourced expertise with curation to provide gene–disease relationships. Virtual gene panels related to human disorders are reviewed by experts within the clinical and scientific community to support the interpretation of genomes within the 100,000 Genomes Project. Within a panel, genes are rated based on the level of evidence supporting the association with the phenotypes identified by the panel. Genes are then classified according to a traffic light system with red/stop, amber/pause, and green/go classifications. To receive a green rating (diagnostic-grade) on a version 1+ panel, the gene requires "evidence from 3 or more unrelated families or from 2-3 unrelated families where there is strong additional functional data" and "genes that do not meet these criteria are rated as Amber (borderline) or Red (low level of evidence)."
Data type: Genetic associations
Evidence scoring: Based on Genomics England gene rating:
G2P evidence in the Platform is the result of any target-disease curation by any of the expert panels.
Data type: Genetic associations
Evidence scoring:
Direction of Effect assessment:
The Universal Protein Resource (UniProt) provides a large compendium of sequence and functional information at the protein level. As part of their functional annotation effort, UniProt curators also annotate proteins with publications supporting their involvement in pathogenic processes.
All publications supporting a given target disease relationship are aggregated into one single Platform evidence.
Data type: Genetic associations
Evidence scoring:
The Universal Protein Resource (UniProt) also curates variants known to alter protein function in disease, supported by publications. Curated mutations are predominantly protein coding or located in regulatory regions clearly associated with the causal protein.
All publications supporting a given variant in connection with a disease constitute individual evidence. All supporting publications are aggregated within the same evidence.
Data type: Genetic associations
Evidence scoring:
Orphanet is an international network that offers a range of resources to improve the understanding of rare disorders of genetic origin. These resources include an inventory of rare disease and gene associations, classification of the gene–disease relationship, information on the kind of mutation, and supporting publication references.
Data type: Genetic associations
Evidence scoring:
Direction of Effect assessment:
The Clinical Genome Resource (ClinGen) Gene–Disease Validity Curation aims to evaluate the strength of evidence supporting or refuting a claim that variation in a particular gene causes a particular disease. ClinGen provides a framework of guidelines to assess clinical validity in a semi-quantitative manner, allowing curators to classify the validity of a given gene–disease pair.
All gene–disease pairs mapped to EFO constitute individual evidence in the Platform.
Data type: Genetic associations
Evidence scoring:
EMBL-EBI's ChEMBL is a manually curated database of bioactive molecules with drug-like properties, either approved for marketing by the U.S. Food and Drug Administration (FDA) or clinical candidates. ChEMBL also captures information regarding the drug molecule indications, as well as their curated pharmacological target.
In the Platform, ChEMBL evidence represents any target–disease relationship that can be explained by an approved or clinical candidate drug, targeting the gene product and indicated for the disease. Independent studies are treated as individual evidence.
To provide additional context, we integrate a machine learning-based analysis of the reasons why a clinical trial has ended earlier than scheduled. This sorts the stop reasons into a set of 17 classes which include negative, neutral, and positive reasons. This information is available when hovering on the tooltip of the Source column.
The 17 classes are: Another Study, Business or Administrative, Negative, Study Design, Invalid Reason, Ethical Reason, Insufficient Data, Insufficient Enrolment, Study Staff Moved, Endpoint Met, Regulatory, Logistics or Resources, Safety and Side Effects, No Context, Success, Interim Analysis, and Covid 19.
Data type: Drugs
Evidence scoring: ChEMBL evidence is scored in a 2-step process. In Step 1, a score is assigned to every piece of evidence based on the clinical precedence:
In Step 2, for those clinical trials that have stopped early, the score is down-weighted based on the classification of the reason to stop. In this way, less importance is attributed to evidence of studies that have been stopped due to negative outcomes or safety concerns:
Direction of Effect assessment:
The Reactome database manually curates and identifies reaction pathways that are affected by a disease. Reactome annotation includes information regarding the causal target–disease link either being a protein coding mutation or an altered expression.
In the Platform, any mutation or altered expression event affecting a different reaction is captured in a different target–disease evidence.
Data type: Pathways & systems biology
Evidence scoring: All manually curated evidence in Reactome has a score of 1.
One of the most powerful approaches to uncover gene function is the experimental perturbation of genes followed by the observation of related phenotypes. The perturbation of gene function in human cells has been greatly facilitated by developments in CRISPR technology.
We have linked cell types to diseases, meaning these diseases are often characterised by abnormal phenotypes in these cell types — hence the association. If knocking out a gene causes significant perturbation in the cell type, it might indicate a potential targeting strategy in the disease.
Data Type: Pathways & systems biology
Evidence Scoring: The Platform uses a linearised version of CRISPRbrain's assessment of statistical significance to assign a score, including hits from both the upper and lower ends of the distribution.
Project Score is a Wellcome Sanger Institute resource that aims to identify dependencies in cancer cell lines to guide precision medicine. The project combines gene fitness effects derived from whole-genome CRISPR-Cas9 synthetic lethality screenings with tractability data, genomic biomarkers and various target annotation enabling a systematic prioritisation of potential targets. The resulting inferences are then mapped from the cancer cell lines in which the experiment is performed to their corresponding tumours.
In the Platform, any Project Score prioritised target with a priority score of at least 36.0 is included as independent evidence; however, pan-cancer dependencies are excluded from the integration.
Data type: Pathways & systems biology
Evidence scoring: Project Score priority score divided by 100
In the Platform, each pathway significantly enriched in tumour-occurring mutations constitutes an individual piece of evidence.
Data type: Pathways & systems biology
Evidence scoring: Scaled enrichment p-value from 0.5 (p = 1e-4) to 1 (p<1e-14).
The Platform also provides information about key driver genes for specific diseases that have been curated from Systems Biology analysis. These publications present different disease gene signatures as potential key drivers or key regulators causing disease.
Data type: Pathways & systems biology
Evidence scoring: Scoring depends on whether the original data contains or not a score:
p-values and rank-based scores are normalised to the 0.5 to 1 range
If there is no score, a fixed value of 0.5 is used
In the Platform, a PROGENy evidence is defined as any significantly regulated sample-level pathway activities inferred from matched normal vs. tumour samples.
Data type: Pathways & systems biology
Evidence scoring: Scaled p-value from 0.5 (p = 1e-4) to 1 (p<1e-14).
The EMBL-EBI Expression Atlas provides a differential expression pipeline aiming to identify genes that are differentially expressed in disease vs control samples. Only contrasts from studies with enough replicates and minimum quality criteria are included in the processing.
To consider a gene significantly regulated in a given contrast, all of the following criteria are required:
Absolute log2 fold change > 1
Adjusted p-value <= 0.05
Maximum number of significant gene probes per contrast = 1,000
In the Platform, each contrast from independent studies capturing differentially regulated genes constitutes independent evidence.
Data type: RNA expression
Evidence scoring: ExpressionAtlas scoring is the result of the product of:
Scaled p-value from 0 (p = 1) to 1 (p<1e-10)
Absolute log2 fold change divided by 10
Percentile rank divided by 100
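A sketch of this composite score; the product structure follows the description above, while the linear rescaling of -log10(p) between p = 1 and p <= 1e-10 is an assumption about how the p-value term is computed:

```python
import numpy as np

def expression_atlas_score(pval: float, log2_fold_change: float, percentile_rank: float) -> float:
    """Product of a scaled p-value, |log2 fold change| / 10 and percentile rank / 100."""
    scaled_p = np.clip(-np.log10(pval) / 10, 0, 1)   # 0 at p = 1, 1 at p <= 1e-10 (assumed linear)
    scaled_fc = min(abs(log2_fold_change) / 10, 1)
    scaled_rank = min(percentile_rank / 100, 1)
    return float(scaled_p * scaled_fc * scaled_rank)
```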
In the Platform, CGC evidence is aggregated at the target–disease level to provide a summary of all curated evidence supporting the involvement of a target with a particular cancer type.
Data type: Somatic mutations
In the Platform, independent target–disease evidence are defined as any significant driver gene detected in any individual cohort. Information regarding the individual driver methods is also provided within each evidence.
Data type: Somatic mutations
ClinVar is an NIH public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. The ClinVar (somatic) data source in the Open Targets Platform captures the subset of ClinVar that refers to somatic variants (as opposed to germline variants).
Information on variants is covered extensively for both single point and structural variants. When available, genomic coordinates are reported with RS numbers, or by following the CHROM_POS_REF_ALT and HGVS notations.
Each evidence in the Platform aims to capture an individual RCV record in ClinVar.
Datatype: Somatic mutations
Evidence scoring: ClinVar evidence is scored in a 2-step process. In Step 1, a score is assigned to every piece of evidence based on the clinical significance:
In Step 2, the score is modulated based on the ClinVar review status:
Direction of Effect assessment:
EMBL-EBI's Europe PMC enables access to a worldwide collection of life science publications and preprints from trusted sources. The Europe PMC data source aims to identify target–disease co-occurrences in the literature and provide an assessment of the confidence of the relationship. This pipeline uses deep-learning based Named Entity Recognition (NER) to identify genes/proteins and diseases mentioned in the text, which are then normalised to the target or disease/phenotype entities in the Platform. All co-occurrences of both types of entities in the same sentence are considered evidence.
In the Platform, a piece of Europe PMC evidence is the result of aggregating all co-occurrences of the same target and disease within the same publication.
Data type: Text mining
The genotype–phenotype associations made available by the International Mouse Phenotypes Consortium (IMPC) are used to identify models of human disease based on phenotypic similarity scores.
The Wellcome Sanger Institute PhenoDigm is an algorithm aimed at capturing the similarity between a knockout mouse and the clinical manifestations (phenotype) of a human disease. The premise is that if a gene knock-out causes an equivalent phenotype in mouse, the human counterpart is likely to be related to the cause of the disease.
It uses a semantic approach to map between clinical features observed in humans and mouse phenotype annotations. The phenotypic effects in mice are then mapped to phenotypes associated with human diseases. The matches are identified and a similarity score between a mouse model and a human disease is computed.
Data type: Animal models
Direction of Effect assessment:
One of the aims of the Cancer Genome Interpreter is to identify how variations in the tumour genome may influence its response to anti-cancer therapies. The Cancer Biomarkers database features biomarkers of drug sensitivity, resistance, and toxicity for drugs targeting specific targets in cancer, curated by clinical and scientific experts in precision oncology, and classified by cancer type.
Data type: Pathways & systems biology
Evidence scoring: All manually curated evidence in Cancer Biomarkers has a score of 1
If both overlapping credible sets are annotated with the logarithm of the Bayes factor (log(BF)), the Platform pipelines estimate the probability of sharing the causal variant using the COLOC method. COLOC uses Bayesian statistics to estimate posterior probabilities for the following hypotheses:
If at least one credible set in the overlap has no information about variant log(BF), colocalisation analysis is performed using eCAVIAR. eCAVIAR is a heuristic algorithm that uses the SNPs' PIPs to estimate the colocalisation posterior probability (CLPP). CLPP is computed as the sum over the products of the variant fine-mapping probabilities between the two overlapping credible sets.
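A minimal sketch of the CLPP computation for two overlapping credible sets, assuming PIPs keyed by variant identifier:

```python
def clpp(pips_left: dict, pips_right: dict) -> float:
    """Colocalisation posterior probability: sum of PIP products over shared variants."""
    shared = pips_left.keys() & pips_right.keys()
    return sum(pips_left[v] * pips_right[v] for v in shared)
```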
The web interface, available at https://platform.opentargets.org, constitutes the first point of access for most Platform users. The site provides a unified search box connected to a series of tools allowing users to query different therapeutic hypotheses.
When possible, users can download the displayed information, but we invite users to visit the section when more complex queries are under consideration.
See more in the Gentropy documentation.
The evidence in this data source results from a comprehensive statistical genetics analysis described in the Open Targets post-GWAS analysis pipelines section. The aim of this analysis is to identify GWAS-significant signals across a broad collection of GWAS studies covering binary and quantitative traits. To address linkage disequilibrium, all significant signals are fine-mapped, and the resulting credible sets are colocalised against molQTL studies. All GWAS and functional genomic features are leveraged by the Locus-to-Gene (L2G) machine-learning method aimed at prioritising likely causal genes in the region.
The GWAS association evidence is defined as any credible set for a GWAS trait associated with a gene with a Locus-to-Gene (L2G) score > 0.05. The feature contributions for the L2G predictions are also explained by SHAP analysis, helping with the interpretation of the observed features. All credible sets can also be further interrogated on their own page, including an interpretation of the directionality in the context of colocalising molQTL studies.
Evidence scoring: L2G score, filtered to use scores above 0.05
The FinnGen (R12) gene-based burden test results from collapsing loss-of-function variants, based on genotyping data from the Finnish population.
Source:
References:
Source:
References:
The Open Targets Platform includes "green" and "amber" genes from version 1+ panels along with their phenotypes, provided the latter can be mapped to a disease or phenotype ontology. As we standardise our evidence to EFO, some of the phenotypes cannot be mapped and included in the Platform; please visit the Genomics England PanelApp for the full set.
Source:
References:
The data in Gene2Phenotype (G2P) is produced and curated from the literature by different sets of panels formed by consultant clinical geneticists. The G2P data is designed to facilitate the development, validation, curation, and distribution of large-scale, evidence-based datasets for use in diagnostic variant filtering. Each G2P entry associates an allelic requirement and a mutational consequence at a defined locus with a disease entity. A confidence level and evidence link are assigned to each entry. This confidence level follows the terminology described by for describing gene–disease validity.
Source:
References:
Source:
References:
Source:
References:
Source:
References:
Source:
References:
Source:
References:
Source:
References:
CRISPRbrain is a database of functional genomics screens in differentiated human brain cell types. We have prioritised genome-wide screens (healthy vs KO) for integration in the Platform to generate target–disease evidence.
Source:
Reference:
Source: CRISPR
References:
SLAPenrich (Sample-population Level Analysis of Pathway enrichments) is a novel statistical framework for the identification of significantly mutated pathways, at the sample population level, in large cohorts of cancer patients. SLAPenrich is based on a Poisson binomial model that takes into account the length of blocks of exons in genes within each pathway, and the background mutation rate of the analysed cohort of patients. SLAPenrich enrichment analysis is based on EMBL-EBI Reactome pathways and mutation data from The Cancer Genome Atlas (TCGA) cohort.
Source:
References:
References:
PROGENy (Pathway RespOnsive GENes) is a linear regression model that calculates pathway activity estimates based on consensus transcriptomic gene signatures obtained from perturbation experiments. PROGENy provides a framework to systematically compare pathway activities between normal and primary samples from The Cancer Genome Atlas (TCGA).
Source:
References:
Source:
References:
Cancer Gene Census (CGC) is part of the Wellcome Sanger Institute Catalogue of Somatic Mutations in Cancer (COSMIC). CGC is an effort to catalogue genes which contain mutations that have been causally implicated in cancer. The exhaustive curation of the CGC covers individual studies as well as pan-cancer sequencing efforts, including The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC), among others.
Evidence scoring: Scoring is based on the
Source:
References:
IntOGen provides a framework to identify potential cancer driver genes using large-scale mutational data from sequenced tumour samples. By harmonising tumour sequencing data from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) and other comprehensive efforts, IntOGen aims to provide a consensus assessment of cancer driver genes. Several state-of-the-art driver methodologies covering different approaches (e.g. dN/dS, Hotspots) are included to finally produce a consensus q-value for each driver gene in every tumour.
Evidence scoring: Scaled from 0.25 (q = 0.1) to 1 (q < 1e-10)
Source:
References:
Source:
References:
Evidence scoring: Score based on weighted document sections, sentence locations, and title for full-text articles and abstracts. The aggregated scores of each gene/disease co-occurrence in the publication are further normalised between 0 and 1.
Source:
References:
Evidence scoring: The evidence score indicates the degree of concordance between the mouse and disease phenotypes.
Source:
References:
Source:
References:
association not found: 0
benign: 0
not provided: 0
likely benign: 0
conflicting data from submitters: 0.3
conflicting interpretations of pathogenicity: 0.3
low penetrance: 0.3
other: 0.3
uncertain risk allele: 0.3
uncertain significance: 0.3
established risk allele: 0.5
risk factor: 0.5
affects: 0.5
likely pathogenic: 0.7
association: 0.9
confers sensitivity: 0.9
drug response: 0.9
protective: 0.9
pathogenic: 0.9
no assertion provided: +0
no assertion criteria provided: +0
no assertion for the individual variant: +0
criteria provided, single submitter: +0.02
criteria provided, conflicting interpretations: +0.02
criteria provided, multiple submitters, no conflicts: +0.05
reviewed by expert panel: +0.07
practice guideline: +0.1
LoF variants
Pathogenic/Likely pathogenic = Risk
Protective = Protective
Amber: 0.5
Green: 1
Limited: 0.01
Moderate: 0.5
Strong: 1
Both RD and IF: 1
Definitive: 1
LoF and GoF variants
Assumption of Risk
UniProt confidence / Evidence score
Medium: 0.5
High: 1
UniProt confidence / Evidence score
Medium: 0.5
High: 1
Not yet assessed: 0.5
Assessed: 1
LoF and GoF variants
Assumption of Risk
ClinGen classification / Evidence score
No reported evidence: 0.01
Refuted: 0.01
Disputed: 0.01
Limited: 0.01
Moderate: 0.5
Strong: 1
Definitive: 1
Phase I (Early): 0.05
Phase I: 0.1
Phase II: 0.2
Phase III: 0.7
Phase IV (only for approved indications): 1
Negative: 0.5
Safety or side effects: 0.5
Activators = GoF
Inhibitors = LoF
Assumption of Protective
Only 1 mutated sample: -0.25
Gene mutated more frequently in particular disease compared to other diseases: +0.25
Mutations in gene occur more frequently than in other genes of similar length in the same disease: +0.25
association not found: 0
benign: 0
not provided: 0
likely benign: 0
conflicting interpretations of pathogenicity: 0.3
other: 0.3
uncertain significance: 0.3
risk factor: 0.5
affects: 0.5
likely pathogenic: 0.7
association: 0.9
drug response: 0.9
protective: 0.9
pathogenic: 0.9
| Review status | Score modifier |
| --- | --- |
| no assertion provided | +0 |
| no assertion criteria provided | +0 |
| no assertion for the individual variant | +0 |
| criteria provided, single submitter | +0.02 |
| criteria provided, conflicting interpretations | +0.02 |
| criteria provided, multiple submitters, no conflicts | +0.05 |
| reviewed by expert panel | +0.07 |
| practice guideline | +0.1 |
LoF variants
Pathogenic/Likely pathogenic = Risk
Protective = Protective
Assumption of all variants LoF
Assumption of Risk
To support more complex and systematic queries, we provide all datasets as data downloads.
A list of all datasets is available in the Platform Data Downloads page.
All Platform datasets are available as a distributed collection of data. This means that for each dataset there is a directory containing a set of partitioned files. Currently, we produce our datasets in Parquet format, which allows us to expose nested information in a machine-readable way. Next, we describe how to download, access and query this information in a step-by-step guide.
Archive datasets, as well as input files and other secondary products are also made available in the FTP server and Google Cloud Platform.
Below is a walkthrough on how to download the disease dataset from the 25.03 release in Parquet format using different approaches.
We recommend using lftp as a command line client, and when using tools like wget, curl, etc., use https:// rather than ftp://
rsync is a command line tool for efficiently transferring and synchronising files between local and remote file systems.
wget is a command line tool that retrieves content from web servers and is widely available on Unix systems.
Users with a Google Cloud Platform account can download the datasets through the Google Cloud Console or using the gsutil command-line tool.
If you are using a non-Linux or non-Unix machine (e.g. Windows), you can access our FTP service using an FTP client like FileZilla or the Windows ftp command. For more information, including tips and workarounds, see the Community Windows ftp thread.
To read the information available in the partitioned datasets, there is no need to manipulate or concatenate files. Datasets can be read directly using the dataset path.
The following scripts provide a proof-of-concept example using the ClinVar evidence provided by the European Variation Archive. They show how to:
Read a dataset
Explore the schema of the dataset
Select a subset of information (columns)
Display the information
First of all, the dataset needs to be downloaded as described in the previous section. For simplicity, only the EVA evidence is downloaded, but all evidence can be downloaded at once using the same approach.
The next scripts make use of Apache Spark (PySpark or sparklyr) to read and query the dataset using modern functional programming approaches. These packages need to be installed in their respective environments.
The next query only displays 6 fields of the ClinVar evidence, but other non-null values are available. The schema is the best way to explore what is available and query the most relevant information. All Platform evidence shares the same schema, so there will be a long list of fields that might not be informative for ClinVar but will be relevant when querying other data sources.
Dealing with nested information can sometimes be tedious. The Platform aims to minimise the nestedness of the data; however, some level of structure is sometimes required. Spark provides a series of functions to deal with complex nested information. The scripts below provide an example of how the clinicalSignificances array is flattened using the explode function.
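A minimal PySpark sketch of these steps is shown below. The local path is an assumption (it should point at the downloaded EVA evidence partition), and the selected field names are taken from the evidence schema, so adjust them as needed for other sources.

```python
# Minimal PySpark sketch: read the partitioned Parquet dataset, inspect the
# schema, select a few columns and flatten the clinicalSignificances array.
# The path below is an assumption; point it at the downloaded EVA evidence.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode

spark = SparkSession.builder.getOrCreate()

# Read the partitioned dataset directly from its directory
evidence = spark.read.parquet("evidence/sourceId=eva")

# Explore the schema to see which fields are available
evidence.printSchema()

# Select a subset of fields and flatten the nested clinicalSignificances array
(
    evidence.select(
        col("targetId"),
        col("diseaseId"),
        col("variantRsId"),
        col("studyId"),
        explode(col("clinicalSignificances")).alias("clinicalSignificance"),
        col("confidence"),
    )
    .show(10, truncate=False)
)
```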
Once loaded into Python or R, the user can decide to continue using Spark, write the output to a file, or use alternative libraries to process the information (e.g. pandas, tidyverse, etc.).
For more information on how to access and work with our data downloads and example scripts based on actual use cases and research questions, check out the Open Targets Community.
To support systematic identification and prioritisation of drug targets, we are committed to making our data open and available in a variety of formats to support academic research purposes and commercial activities. For more information, see our Licence documentation.
For queries about a single entity or target-disease association, we recommend that you use:
Our intuitive web interface with data tables and data visualisations that can be downloaded/exported in multiple formats, including TSV and JSON
Our robust GraphQL API with endpoints that can be accessed with the programming language of your choice and an interactive playground where you can try out sample queries
For more complex, systematic queries, we recommend that you use:
Our comprehensive set of data downloads available via FTP, Google Cloud, and Microsoft Azure Open Datasets.
Our Google BigQuery instance that supports SQL-like queries and allows you to export data to your own Google Cloud Storage bucket. This data is available as a BigQuery Public Dataset.
If you use our data in your research or commercial product, please cite our latest publication.
We have designed and developed a new export functionality for the Associations On the Fly/Target Prioritisation pages, allowing users to download:
Entire dataset view (default status)
Customised dataset view, including changes to custom controls, a subset of data types (aggregations), and/or data from pinned targets only
TSV and JSON formats
Learn about the target prioritisation view
The new target prioritisation page can be accessed by clicking onto the Target prioritisation factors tab from the Associations on the Fly page when searching for targets associated with a disease or phenotype.
The view focuses on displaying target-specific properties in a disease-agnostic way. These properties have been aggregated into four main sections (Precedence, Tractability, Doability, Safety) and individually scored by the Open Targets team.
A "traffic light" system visually summarises target prioritisation with the aim of facilitating target recommendations. Using a colour scale, green indicates potentially positive attributes and red indicates potentially negative attributes, helping users assess targets for further prioritisation or deprioritisation, respectively.
Definition: Gene is targeted by available drugs in any clinical phase for any indication.
Source of Data: Platform Known Drugs widget (ChEMBL)
Scoring: Maximum clinical trial phase the target has been reported for, independently of the disease. Phases range from 0 to IV (corresponding to values of 0, 0.25, 0.5, 0.75 and 1 in the tool scores).
Definition: Target is annotated to be located in the cell membrane.
Source of Data: Platform Subcellular location widget [HPA (Human Protein Atlas) and UniProt]
Scoring:
1 = Protein target is located (at least) in the cell or plasma membrane.
0 = Protein target is not located in the cell membrane but some location information is accessible.
NA = No location information available.
Definition: Target is secreted or predicted to be secreted.
Source of Data: Platform Subcellular location widget [HPA (Human Protein Atlas) and UniProt]
Scoring:
1 = Protein target is (at least) secreted or predicted to be secreted.
0 = Not secreted but with location information.
NA = No location information available.
Note: When contradictions between HPA (Human Protein Atlas) and UniProt exist (i.e. target is secreted according to HPA but in membrane according to UniProt), the information from HPA is taken.
Definition: Target binds at least one High-Quality Ligand according to ChEMBL tractability bucket.
Source of Data: Platform tractability widget (Open Targets tractability)
Scoring:
1 = Target has a high-quality ligand reported.
0 = Target does not have high-quality ligand reported.
NA = No information available.
Definition: Target has been co-crystallised with a small molecule, reported in the Protein Data Bank.
Source of Data: Platform tractability widget (Open Targets tractability)
Scoring:
1 = Target has a small molecule reported.
0 = Target does not have a small molecule reported.
NA = No information available.
Definition: Target has a DrugEBIlity score equal to or above 0.7, which is predictive of harbouring a high-quality pocket.
Source of Data: Platform tractability widget (Open Targets tractability)
Scoring:
1 = Target contains a high-quality predicted pocket.
0 = Target does not have a high-quality predicted pocket.
NA= No information available.
Models, tools and/or reagents that allow target assessment in preclinical settings to enable exploration of a given target
Definition: Mouse orthologs maximum identity percentage. A mouse harbouring an ortholog for the target of interest could be useful for in vivo assaying.
Source of Data: Platform comparative genomics widget (Ensembl Compara)
Scoring:
Targets with at least one mouse ortholog sharing at least 80% identity with the target are scored linearly from 0 to 1.
1 = There is at least one gene in mice with 100% sequence identity to the target.
0 = There are no genes in mice with at least 80% sequence identity to the target.
NA = No ortholog information.
Note: Here we consider mouse orthologs and display the "query percentage identity" (percentage of the human target sequence that matches to the mouse gene) when there is an 80% identity or more. In the cases of targets with more than one ortholog, we take the one with the maximum query % ID.
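As a sketch of the linear scaling described above, assuming a simple linear interpolation between 80% identity (score 0) and 100% identity (score 1):

```python
# Hedged sketch: linear scaling of the maximum mouse-ortholog query identity
# (assumed interpolation: 80% -> 0, 100% -> 1; below 80% -> 0; no ortholog -> NA).
def mouse_ortholog_score(max_query_identity_pct: float | None) -> float | None:
    if max_query_identity_pct is None:
        return None  # NA: no ortholog information
    if max_query_identity_pct < 80:
        return 0.0
    return (max_query_identity_pct - 80) / 20

print(mouse_ortholog_score(92.5))  # 0.625
```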
Definition: Target has high quality chemical probes.
Chemical probes are small molecules acting as chemical modulators, binding reversibly to the target.
Source of Data: Platform Chemical probes widget (Probes & Drugs)
Scoring:
1 = Target has high-quality chemical probes.
0 = Target does not have high-quality chemical probes.
NA = No information available.
Definition: Genes that are important for human physiology are depleted of deleterious variants. The Genome Aggregation Database (gnomAD) has developed a continuous measurement of intolerance to loss-of-function (LoF) variants per gene, based on an observed/expected LoF variant analysis. As recommended by gnomAD and implemented in the Open Targets Platform, we use the rank of genes by their loss-of-function observed/expected upper bound fraction (LOEUF) metric.
Source of Data: Platform genetic constraint widget (gnomAD)
Scoring:
A score from -1 to 1 is given to genes depending on their LOEUF metric rank, with -1 being the least tolerant to LoF variation and 1 the most tolerant.
Definition: The international Mouse Genome Informatics database contains reported phenotypes observed when a gene is knocked out in the mouse model. These phenotypes are categorised into multiple phenotype classes using an organ/system classification. We retrieve this information (available in our Platform) and score the phenotype classes according to their severity (from 0 to -1). After aggregating all phenotypes with their scores according to the phenotype class they belong to, we use the harmonic sum to build a continuous score, which is normalised from 0 to -1.
Source of Data: Platform mouse phenotypes widget (Mouse Phenotypes, fed from MGI, a reference database for mouse knockouts)
Scoring:
Below 0 to -1 = Knocking out the target in mice produced multiple and/or severe phenotypes, with a normalised score above the first quartile.
0 = Either the target has non-severe phenotypes reported or is in the first quartile of the normalised score.
NA = No information available.
Note: Below you can find how we scored the mouse phenotype classes (-1 being the "most severe" and 0 "non-relevant" phenotypes).
| MP ID | Phenotype class | Score |
| --- | --- | --- |
| MP:0005370 | liver/biliary system phenotype | -1 |
| MP:0005385 | cardiovascular system phenotype | -1 |
| MP:0010768 | mortality/aging | -1 |
| MP:0003631 | nervous system phenotype | -0.75 |
| MP:0005388 | respiratory system phenotype | -0.75 |
| MP:0005367 | renal/urinary system phenotype | -0.75 |
| MP:0005376 | homeostasis/metabolism phenotype | -0.75 |
| MP:0005386 | behavior/neurological phenotype | -0.75 |
| MP:0005381 | digestive/alimentary phenotype | -0.5 |
| MP:0005379 | endocrine/exocrine gland phenotype | -0.5 |
| MP:0005382 | craniofacial phenotype | -0.5 |
| MP:0005377 | hearing/vestibular/ear phenotype | -0.5 |
| MP:0005384 | cellular phenotype | -0.5 |
| MP:0005380 | embryo phenotype | -0.5 |
| MP:0005394 | taste/olfaction phenotype | -0.5 |
| MP:0002006 | neoplasm | -0.5 |
| MP:0005375 | adipose tissue phenotype | -0.5 |
| MP:0005389 | reproductive system phenotype | -0.5 |
| MP:0005397 | hematopoietic system phenotype | -0.5 |
| MP:0005387 | immune system phenotype | -0.5 |
| MP:0005391 | vision/eye phenotype | -0.5 |
| MP:0005390 | skeleton phenotype | -0.5 |
| MP:0005369 | muscle phenotype | -0.25 |
| MP:0001186 | pigmentation phenotype | -0.25 |
| MP:0005378 | growth/size/body region phenotype | -0.25 |
| MP:0005371 | limbs/digits/tail phenotype | -0.25 |
| MP:0010771 | integument phenotype | -0.25 |
| MP:0002873 | normal phenotype | 0 |
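The harmonic-sum aggregation described in the definition above can be sketched as follows. The exact weighting and normalisation used in the Platform may differ, so this is illustrative only: per-class severity scores are combined with a harmonic sum and scaled against the theoretical maximum to land in the [0, -1] range.

```python
# Hedged sketch of a harmonic-sum aggregation of per-class severity scores
# (the exact weighting and normalisation used in the Platform may differ).
def harmonic_sum(scores: list[float]) -> float:
    ordered = sorted(scores, key=abs, reverse=True)
    return sum(s / (i + 1) ** 2 for i, s in enumerate(ordered))

def normalised_mouse_ko_score(class_scores: list[float]) -> float:
    """Aggregate per-phenotype-class scores (0 to -1) into one score in [0, -1]."""
    if not class_scores:
        return 0.0
    max_severity = harmonic_sum([-1.0] * len(class_scores))  # theoretical maximum
    return harmonic_sum(class_scores) / abs(max_severity)

# e.g. cardiovascular (-1), immune (-0.5) and pigmentation (-0.25) phenotypes
print(normalised_mouse_ko_score([-1, -0.5, -0.25]))
```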
Definition: The second-generation map of cancer dependencies (Pacini et al., 2024) increased the number of cancer cell lines analysed (930 CRISPR-Cas9 genome-wide knock-out screens, targeting almost 18,000 genes), spanning 27 cancer types, and used curated patient genomic data to identify cancer-type-specific and pan-cancer gene dependencies integrated with multi-omic markers.
Candidate anti-cancer therapeutic targets were characterised using a prioritisation criteria based on:
- Fitness Score. Strength of the effect on cellular fitness upon target depletion.
- Presence of dependency marker.
- Evidence linking the dependency and marker.
After applying a priority score based on approved drug targets, the authors nominated 370 targets for 27 cancer types; 302 were cancer-type specific, while 196 were pan-cancer. This list of genes is used to label a target for gene essentiality.
Source of Data: A comprehensive clinically informed map of dependencies in cancer cells and framework for target prioritization. Pacini et al., 2024, Cancer Cell 42, 301–316. Supplementary Table 6. Gene essentiality widget (Cancer DepMap).
Scoring:
-1 = Target reported as essential.
0 = Target not reported as essential.
NA = No information available.
Definition: Target is associated with curated adverse events.
Source of Data: Safety liability data from Platform safety widget (Open Targets Safety) and Open Targets downstream analysis of toxicity datasets from PharmGKB.
Scoring:
-1 = The target has at least one adverse event.
NA = No information available.
Definition: Target is classified as an oncogene and/or tumour suppressor gene.
Source of Data: Platform Cancer Hallmarks widget (COSMIC)
Scoring:
We use the attribute information from the cancer hallmarks in the target profile. Here, targets considered "cancer driver genes" are flagged as tumour suppressor, oncogene, or both.
-1 = Target is catalogued as driver gene (tumour suppressor, oncogene or both).
NA = No information available.
Definition: Paralogue maximum identity percentage.
Source of Data: Platform comparative genomics widget (Ensembl Compara)
Scoring:
Targets with at least one human paralogue sharing at least 60% identity with the target are scored linearly from below 0 to -1.
0 = Targets whose paralogues share less than 60% identity with the target.
NA = No information available about paralogues for that target.
Definition: HPA assessment on tissue-specific target expression.
Source of Data: Platform baseline expression widget (ExpressionAtlas, HPA and GTEx). We used the assessment for every target from the RNA expression data from the public version of Human Protein Atlas (proteinAtlasTissue)
Scoring:
| HPA assessment | Score |
| --- | --- |
| Tissue enriched (>=4 fold higher mRNA in a given tissue compared to any other) | 1 |
| Group enriched (>=4 fold higher average mRNA in 2-5 tissues compared to any other) | 0.75 |
| Tissue enhanced (>=4 fold higher mRNA level in a given tissue compared to the average of all other tissues) | 0.5 |
| Low tissue specificity | -1 |
| Not detected | NA |
Definition: HPA assessment on any detectable baseline expression for the target across tissues.
Source of Data: Platform baseline expression widget (Expression Atlas, HPA and GTEx). We used the assessment for every target from the RNA expression data from the public version of Human Protein Atlas.
Scoring:
| HPA assessment | Score |
| --- | --- |
| Detected in single (detected in a single tissue) | 1 |
| Detected in some (detected in more than one but less than 1/3 of tissues) | 0.5 |
| Detected in many (detected in at least 1/3 but not all tissues) | 0 |
| Detected in all | -1 |
| Not detected | NA |
To support more complex queries and advanced informatics workflows that use Google Cloud services, the Open Targets Platform data is also available as a Google Cloud public dataset via our Google BigQuery instance — open-targets-prod.
Google BigQuery is a data warehouse that enables researchers to run super-fast, asynchronous SQL queries using Google's cloud infrastructure. After running your query, you can either export into various formats or copy into a Google Cloud bucket for further downstream analyses.
Open Targets Platform data is publicly accessible as a Google Cloud public dataset. Users only pay for the queries they perform on the data, and through this program, the first 1 TB per month is free.
Open Targets has uploaded all of our data to Google BigQuery. You can run queries via the BigQuery console, command-line tools, or client libraries.
For more information on BigQuery, please review the BigQuery documentation.
Below is a sample query that uses our association_overall_direct dataset to return a list of targets associated with psoriasis (EFO_0000676) and the overall association score.
Similarly, you can use our drug_molecule dataset and pass a list of drug trade names to find relevant information:
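A corresponding hedged sketch for the trade-name lookup is shown below. The table path and field names (e.g. tradeNames, isApproved) are assumptions based on the downloadable molecule schema, and the trade names in the WHERE clause are purely illustrative.

```python
# Hedged sketch: find drugs by trade name using the drug_molecule dataset.
# Table path and field names are assumptions based on the molecule schema.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT id, name, drugType, isApproved, tradeName
    FROM `open-targets-prod.platform.drug_molecule`,
         UNNEST(tradeNames) AS tradeName
    WHERE tradeName IN ('Keytruda', 'Humira', 'Ozempic')
"""
for row in client.query(sql).result():
    print(row.id, row.name, row.tradeName)
```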
For more information on how to use BigQuery to access Platform data and example queries based on actual use cases and research questions, check out the Open Targets Community and our Google Cloud dataset homepage.
Do you have questions for us? Ask on Open Targets Community!
Why can't I find a variant that I search for?
In the Open Targets Platform, a variant refers to any human variation that is associated with a disease, trait or phenotype that has been reported in at least one of our variant-to-phenotype sources. This accounts for only 1% of the total variants reported in gnomAD (6.5M vs 700M). See here for more information.
Why can't I find a study I search for?
Each study undergoes quality control and validation procedures before ingestion into the Platform.
We remove studies that had unsupported study types, invalid target ID or invalid biosample ID (for molQTLs), and invalid ontology trait mapping. See here for study inclusion criteria.
For GWAS Catalog studies with summary statistics, we perform summary statistics quality control; studies that fail this check are excluded.
For GWAS Catalog curated associations (top hits), we tightened the p-value significance threshold compared to OTG 22.10 (1e-8 instead of 5e-8). This may also lead to differences in the study list.
What is a Credible Set?
A credible set is the set of variants near a genetic association signal that have a 95% probability of containing the true causal variant(s) for that signal.
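As an illustration, a 95% credible set can be built by ranking variants by posterior inclusion probability (PIP) and accumulating until 95% coverage is reached. This is a simplified sketch of standard fine-mapping output, not the Platform's exact implementation.

```python
# Illustrative sketch: build a 95% credible set from posterior inclusion
# probabilities (PIPs) by taking the top-ranked variants until the
# cumulative probability reaches the coverage threshold.
def credible_set(pips: dict[str, float], coverage: float = 0.95) -> list[str]:
    selected, cumulative = [], 0.0
    for variant, pip in sorted(pips.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(variant)
        cumulative += pip
        if cumulative >= coverage:
            break
    return selected

# Hypothetical PIPs for four variants at one association signal
print(credible_set({"rs1": 0.60, "rs2": 0.30, "rs3": 0.08, "rs4": 0.02}))
# -> ['rs1', 'rs2', 'rs3']
```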
What are the quality control criteria for fine mapping?
Credible sets (CSs) are filtered out based on the following criteria:
The lead variant maps within the MHC region (chr6:25726063-33400556).
The lead variant has an invalid chromosome code (i.e. not one of 1-22, X, Y, XY, MT).
The credible set's study is not on the list of valid studies.
It is a GWAS Catalog summary statistics PICS credible set and a valid SuSiE credible set exists for the same region and study.
It is a GWAS Catalog curated association PICS credible set and a valid GWAS Catalog summary statistics PICS credible set exists for the same region and study.
The sum of PIPs in the credible set is not within the [0.95, 1] range.
The credible set did not pass the study-specific p-value threshold (e.g. p-value <= 1e-8 for GWAS Catalog curated associations).
Why is the V2G score not available anymore?
The variant-to-gene (V2G) score was developed as an aggregated score for the assignment of the variant to the gene in the region using different sources, such as association with molQTLs and VEP.
Although intuitive, V2G does not reflect the complexity of modern data.
We have therefore introduced a new score, the Locus-to-Gene (L2G) score, based on a machine learning algorithm that assesses the assignment of a credible set (not only a single variant) to a gene and no longer uses the old V2G pipeline. It is a more accurate and sophisticated way of assigning genes to associated GWAS loci. In some cases, L2G can be interpreted as V2G, namely when the credible set contains only one variant. In the new Platform datasets, the L2G score has completely superseded V2G.
Full information about variant assignment to genes based on distance or VEP is available on all variant pages in the corresponding widgets. There is also information about whether the variant is part of the molQTL credible set, which can also be used to assign this variant to the gene of interest (available in the molQTL widget).
Why are some L2G associations I found in OT Genetics 22.10 not found in the L2G predictions in OT Platform 25.03?
There could be several reasons for this:
We don't have the matching study anymore (due to quality control or validation).
We don't have the corresponding credible set anymore due to a different fine-mapping approach or p-value filtering thresholds (e.g. for GWAS Catalog curated associations we now use p-value < 1e-8 instead of p-value < 5e-8 in OTG 22.10).
The L2G model is different and can lead to different L2G estimates even when the same study and variant are present. We don't display L2G assignments when L2G < 0.05.
Where can I find studies or credible sets excluded from the Platform?
To ensure high-quality outputs, the data processing pipelines perform a number of validation steps across different datasets. For data provenance, the excluded parts of the datasets are also made available, both on FTP (ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/{release}/excluded) and Google Cloud (gs://open-targets-data-releases/{release}/excluded/). In this folder you will find the excluded credible sets (credible_set), evidence (evidence), interactions (interaction), studies (study) and target validation (target_validation) datasets. These datasets also provide context about the reason for exclusion.
Overview of the Open Targets Locus-to-Gene algorithm
Based on genetic and functional genomics traits, the Locus-to-Gene (L2G) machine-learning algorithm ranks the most likely causative genes at each GWAS credible set. The likelihood that a gene is causal for a particular GWAS locus is measured by the L2G score, ranging from 0 to 1.
The L2G model included in the Platform is based on the original method published by Mountjoy et al. Nature Genetics (2021), but contains several enhancements that can result in different performance.
📖 Gentropy
The predictive features used by the L2G algorithm are designed to capture various genetic and genomic contexts that influence the likelihood of a gene being causal at a given GWAS credible set.
Features are divided into four categories:
| Feature | Description | Range |
| --- | --- | --- |
| distanceTssMean | Average distance between all variants in the credible set and the TSS of a gene's canonical transcript. The distance of each variant is weighted by its posterior probability | (0, 1) |
| distanceTssMeanNeighbourhood | Ratio between distanceTssMean for a gene and the maximum distanceTssMean for any given gene in the vicinity | (0, 1) |
| distanceSentinelTss | Distance between the sentinel variant and the TSS of a gene's canonical transcript | (0, 1) |
| distanceSentinelTssNeighbourhood | Ratio between distanceSentinelTss for a gene and the maximum distanceSentinelTss for any given gene in the vicinity | (0, 1) |
| distanceSentinelFootprint | Distance between the sentinel variant and a gene's footprint | (0, 1) |
| distanceSentinelFootprintNeighbourhood | Ratio between distanceSentinelFootprint for a gene and the maximum distanceSentinelFootprint for any given gene in the vicinity | (0, 1) |
| distanceFootprintMean | Average distance between all variants in the credible set and a gene's footprint. The distance of each variant is weighted by its posterior probability | (0, 1) |
| distanceFootprintMeanNeighbourhood | Ratio between distanceFootprintMean for a gene and the maximum distanceFootprintMean for any given gene in the vicinity | (0, 1) |
| eQtlColocClppMaximum | Maximum CLPP across all eQTL studies for a gene | (0, 1) |
| pQtlColocClppMaximum | Maximum CLPP across all pQTL studies for a gene | (0, 1) |
| sQtlColocClppMaximum | Maximum CLPP across all sQTL and tuQTL studies for a gene | (0, 1) |
| eQtlColocH4Maximum | Maximum H4 across all eQTL studies for a gene | (0, 1) |
| pQtlColocH4Maximum | Maximum H4 across all pQTL studies for a gene | (0, 1) |
| sQtlColocH4Maximum | Maximum H4 across all sQTL and tuQTL studies for a gene | (0, 1) |
| eQtlColocClppMaximumNeighbourhood | Ratio between eQtlColocClppMaximum for a gene and the maximum eQtlColocClppMaximum for any protein-coding gene in the vicinity | (0, 1) |
| pQtlColocClppMaximumNeighbourhood | Ratio between pQtlColocClppMaximum for a gene and the maximum pQtlColocClppMaximum for any protein-coding gene in the vicinity | (0, 1) |
| sQtlColocClppMaximumNeighbourhood | Ratio between sQtlColocClppMaximum for a gene and the maximum sQtlColocClppMaximum for any protein-coding gene in the vicinity | (0, 1) |
| eQtlColocH4MaximumNeighbourhood | Ratio between eQtlColocH4Maximum for a gene and the maximum eQtlColocH4Maximum for any protein-coding gene in the vicinity | (0, 1) |
| pQtlColocH4MaximumNeighbourhood | Ratio between pQtlColocH4Maximum for a gene and the maximum pQtlColocH4Maximum for any protein-coding gene in the vicinity | (0, 1) |
| sQtlColocH4MaximumNeighbourhood | Ratio between sQtlColocH4Maximum for a gene and the maximum sQtlColocH4Maximum for any protein-coding gene in the vicinity | (0, 1) |
| vepMaximum | Maximum VEP score across all variants in the credible set | (0, 1) |
| vepMaximumNeighbourhood | Ratio between vepMaximum for a gene and the maximum vepMaximum for any protein-coding gene in the vicinity | (0, 1) |
| vepMean | Average VEP score between all variants in the credible set and a gene's footprint. The score of each variant is weighted by its posterior probability | (0, 1) |
| vepMeanNeighbourhood | Ratio between vepMean for a gene and the maximum vepMean for any protein-coding gene in the vicinity | (0, 1) |
| geneCount500kb | Number of genes 250kb up- and downstream of the sentinel variant of a credible set | (0, 60000) |
| proteinGeneCount500kb | Number of protein-coding genes 250kb up- and downstream of the sentinel variant of a credible set | (0, 60000) |
| credibleSetConfidence | Degree of confidence we assign to the credible set definition based on the fine-mapping methodology: 1 when fine-mapped with SuSiE using in-sample LD; 0.75 when fine-mapped with SuSiE using out-of-sample LD; 0.5 when fine-mapped with PICS and the locus is based on the analysis of summary statistics; 0.25 when fine-mapped with PICS and the locus was reported as a top hit according to the GWAS Catalog | (0, 1) |
Neighbourhood features
While some features are computed independently for each gene, others reflect the relative context of a given gene compared to the other genes in the neighbourhood (+/- 500,000 bp).
For a more detailed description of how each feature is computed, see the L2G Feature documentation.
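A minimal sketch of how such a neighbourhood feature could be computed is shown below; a plain ratio against the local maximum is assumed, so refer to the feature documentation for the exact definition.

```python
# Hedged sketch: a "neighbourhood" feature as the ratio between a gene's
# feature value and the maximum value among genes within +/- 500 kb.
def neighbourhood_feature(local_values: dict[str, float], gene: str) -> float:
    local_max = max(local_values.values())
    return local_values[gene] / local_max if local_max else 0.0

# Hypothetical eQtlColocClppMaximum values for genes around a credible set
print(neighbourhood_feature({"GENE_A": 0.2, "GENE_B": 0.8, "GENE_C": 0.1}, "GENE_A"))  # 0.25
```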
The L2G model is trained on prior knowledge of gene-trait associations collated from different sources, which are then linked to representative credible sets supporting each association. The following steps describe the methodology used to compose the training set:
The effector gene list (EGL) represents the set of biologically validated gene-trait associations. This list is derived from several sources:
Manually curated gold standards (“medium” and “high” confidence) from OTG 22.10.
Gene-indication pairs for the pharmacological target in all phase III or IV clinical trials according to the latest ChEMBL release.
Gene-disease or phenotype mappings with evidence score ≥ 0.95 from ClinVar, UniProt, Gene2Phenotype, Genomics England PanelApp and ClinGen from the latest Open Targets Platform release.
To ensure the uniqueness of gene-EFO pairs, the combined list was de-duplicated.
For each gene-trait pair from the effector gene list, positive gene-credible set pairs were extracted using the following criteria:
Only protein-coding genes.
Removed any pair with a distanceSentinelTss feature value less than 0.1.
Only include gene-trait pairs supported by at least two credible sets in different studies sharing the same lead variant. We added this criterion to reduce the chance of false positives among the credible sets.
Among all positive credible set-gene pairs, we removed duplications based on functional genomics features.
Removed credible sets involved in more than two positives.
All other protein-coding genes in the window are classified as negatives as long as they don't have a strong functional interaction (STRINGdb score > 0.8) with any positive genes for the trait.
📖 Gentropy
The Locus-to-Gene model is trained for every data release using the training set and feature matrix described above. Similarly to Mountjoy et al., the model is trained using a gradient-boosting algorithm implemented with the scikit-learn library, with nested cross-validation and hyperparameter tuning.
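The sketch below illustrates this training strategy (gradient boosting with nested cross-validation) on a hypothetical feature matrix. The file name, label column and hyperparameter grid are all assumptions; the actual implementation lives in the gentropy package.

```python
# Illustrative sketch of gradient boosting with nested cross-validation.
# The feature matrix file and the goldStandardSet label column are hypothetical.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

df = pd.read_parquet("l2g_feature_matrix.parquet")   # one row per (credible set, gene) pair
X = df.drop(columns=["goldStandardSet"])
y = df["goldStandardSet"]

# Inner loop: hyperparameter tuning; outer loop: unbiased performance estimate
param_grid = {"n_estimators": [100, 300], "max_depth": [3, 5]}
inner_cv = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5)
outer_scores = cross_val_score(inner_cv, X, y, cv=5, scoring="average_precision")
print("nested CV average precision:", outer_scores.mean())

# Refit on the full training set with the best hyperparameters
model = inner_cv.fit(X, y).best_estimator_
```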
The L2G model will be updated in each release to include new features and/or fixes. You should expect some variation in the prediction scores with each release.
📖 Gentropy
The trained L2G model is applied to every GWAS credible set in the Open Targets Platform. All predictions below 0.05 are filtered out, and feature contributions are added using SHAP analysis as described here. These results appear as target-disease evidence, as well as L2G annotation for every credible set in the Platform.
The GWAS Associations evidence set is constructed based on GWAS credible sets linking to a protein-coding gene with an L2G score higher than 0.05.
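A self-contained sketch of computing SHAP feature contributions for a gradient-boosting model is shown below; it uses synthetic data in place of the real L2G feature matrix, so treat it purely as an illustration of the technique.

```python
# Hedged sketch: per-feature SHAP contributions for a gradient-boosting model,
# using synthetic data in place of the real L2G feature matrix.
import numpy as np
import shap
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((200, 5))                   # 200 (credible set, gene) pairs, 5 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # synthetic labels

model = GradientBoostingClassifier().fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)     # one contribution per feature and prediction
print(shap_values.shape)                   # (200, 5)
```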
Mountjoy, E., Schmidt, E.M., Carmona, M. et al. An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci. Nat Genet 53, 1527–1533 (2021)
The Open Targets Platform GraphQL — available at http://api.platform.opentargets.org — is our new API that allows for language-agnostic access to our data, along with other key benefits:
You can construct a query that returns only the fields that you need
You can build queries that traverse the data graph through resolvable entities, reducing the need for multiple queries
You can access the GraphQL API playground with built-in documentation and schema showing required and optional parameters
You can view the schema that shows the available fields for each object along with a description and data type attribute
You only have to use POST requests with a simple query string and variables object
Our GraphQL API supports queries for a single target, disease/phenotype, drug, or target-disease association. For more systematic queries (e.g. for multiple targets), please use our data downloads or our Google BigQuery instance.
The base URL endpoint for our new GraphQL API is:
You can then access relevant data from the following endpoints:
/target: contains annotation information for targets, including tractability assessments, mouse phenotype models, and baseline expression; also contains data on diseases and phenotypes associated with the given target
/disease: contains annotation information for diseases and phenotypes including ontology, known drugs, and clinical signs and symptoms; also contains data on targets associated with the given disease or phenotype
/drug: contains annotation information for compounds and drugs including mechanisms of action, indications, and pharmacovigilance data
/variant: contains annotation information for variants including population allele frequencies, variant effect, transcript consequences, and credible sets associated with complex traits containing the variant.
/studies: contains annotation information for studies including trait or phenotype, publication, cohort information and list of credible sets associated with the study.
/credibleSet: contains annotation information for credible sets including the complete sets of variants in the credible set, gene assignment based on our L2G predictions and colocalisation metrics.
/search: contains index of all entities contained within the Platform
Below is an example GraphQL query for AR (ENSG00000169083) that will return Genetic Constraint and Tractability data.
Run this query in our GraphQL API playground
Using GraphQL's query strings and variables object constructs, you can also access the data using a programming language that supports HTTP POST requests. While this is a valid approach, we discourage users from repeatedly querying the GraphQL API one entity at a time. Instead, our comprehensive datasets available for download provide a simpler and more performant strategy to achieve the same result.
Below is an example script using the same AR query above, but written for Python and R:
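A hedged Python sketch of such a script is shown below. The endpoint URL and field names are assumed from the public schema, so check them in the GraphQL API playground before relying on them.

```python
# Hedged sketch: query genetic constraint and tractability for AR
# (ENSG00000169083) via the GraphQL API. Field names are assumed from the
# public schema; verify them in the API playground.
import requests

query = """
query targetAnnotation($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    approvedSymbol
    geneticConstraint { constraintType upperBin6 }
    tractability { modality label value }
  }
}
"""
variables = {"ensemblId": "ENSG00000169083"}

response = requests.post(
    "https://api.platform.opentargets.org/api/v4/graphql",
    json={"query": query, "variables": variables},
)
response.raise_for_status()
print(response.json()["data"]["target"])
```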
For more information on how to use the GraphQL API and example queries based on actual use cases and research questions, check out the Open Targets Community.
Navigate the Associations view in the Open Targets Platform, from which you can access the Evidence pages
Evidence pages aim to summarise all the available evidence for a given target-disease pair. Evidence displayed in the page includes indirect evidence, so the user can interrogate evidence annotated with descendants of the disease or phenotype of interest.
The structure of the page follows the same structure as the profile pages including descriptions, summary widgets, and detail widgets.
To navigate to an evidence page from the associations page, use the kebab menu icon next to the name of a target or disease/phenotype.
The Open Targets data pipeline is a complex process orchestrated in Apache Airflow, and it is divided into data acquisition, transformation, and data output.
The data pipeline is composed of multiple elements:
Data and evidence generation processes
Input stage
Transformation stage and ETL processes
Output stage
Gentropy-specific processes
Orchestration
curation — Open Targets curation repository
evidence_datasource_parsers — internal pipelines used to generate evidence
json_schema — evidence object schema used for evidence and association scoring
OnToma — Python module to map disease or phenotype terms to EFO
gentropy — Open Targets' genomics toolkit
See here for more info on the Gentropy pipelines.
orchestration — Open Targets data pipelines orchestrator
See detailed orchestration documentation here.
The Platform ETL ("extract, transform, and load") and the Genetics ETL were previously separate processes, but they are now merged into a single pipeline. This means that the data produced for both the Genetics ETL and the Platform are released at the same time. Herein, we refer to this joint pipeline as the "unified pipeline".
The orchestration occurs on Google Airflow, using Google Cloud as the cloud resource provider. The orchestration logic is organised into steps; combinations of steps form directed acyclic graphs (DAGs).
The unified pipeline uses many static assets, such as Open Targets-related data and data needed to run the Genetics ETL.
otter — Open Targets' Task ExecutoR i.e. scripts that process and prepare data for our ETL pipelines
platform-etl-backend: ETL pipelines to generate associations, evidence, and entity indices
platform-etl-openfda-faers: ETL pipeline to process Open FDA adverse events data
platform-etl-literature: ETL pipeline to generate similar entities and publications
platform-output-support: scripts for infrastructure tasks and generating a Platform release
If you have further questions, please get in touch with us on the Open Targets Community.
Overview of the technical infrastructure that supports the Open Targets Platform
The Open Targets Platform infrastructure stack is composed of three layers: data, backend and frontend.
Data layer
OpenSearch — Contains the bulk of the data
Clickhouse — Contains data related to the associations view
Backend layer
API — The main backend application, providing the GraphQL engine
OpenAI-API — The summarising engine for the literature section
Frontend layer
Web — A React SPA web application
Currently this infrastructure is hosted on Google Cloud, using a load-balanced, globally distributed and highly scalable deployment based on Terraform.
As a consortium committed to developing open-source, freely available tools that support systematic drug target identification and prioritisation, we actively encourage and accept open source contributions to our various repositories.
Summary of release highlights for the Open Targets Platform
19 March 2025
New features
Variant, study, and credible set information is now available in the Open Targets Platform. This unites the Open Targets Platform and Open Targets Genetics into a single interface for human genetic and target discovery information.
Interpret gene-disease evidence from both common and rare variation in one resource, and in multiple ancestries.
The Platform now has three additional entities:
Please note: the Platform only integrates variants associated with a disease, trait, or phenotype
Colocalisation is now based on credible set overlaps
Data updates
A substantial increase in literature evidence due to improvements in entity disambiguation.
Updated gene burden data through FinnGen R12.
New NHS Genomic Medicine Service panels from GEL PanelApp and a new hearing loss panel to the Gene2Phenotype evidence set.
Rewritten UniProt pipeline with new associations from UniProt variants.
Updated data from Probes&Drugs and DepMap.
New data from Reactome, ChEMBL, Europe PMC, COSMIC and EVA (through ClinVar).
Product features
An updated data downloads page with a more detailed description of each file. (Schemas have been temporarily removed and will be brought back in a subsequent release.)
Various improvements to our web interface:
Users can search the UI using variant and study IDs.
Improved searching, filtering and sorting of entities on our associations pages. In particular, there are now separate sections for uploaded entity lists and pinned entities, and you can remove individual filters from the view.
The Platform and the Target Prioritisation view have a more accessible colour scheme, and users can select which columns to display in the UI.
A graphical Comparative Genomics view in Target Prioritisation.
Preview on hover: Ability to view details of an entity without navigating to it.
Product enhancements and bug fixes
Filtered out Phase IV clinical trial evidence that lacks regulatory approval for the specific indication from our target-disease association data.
The backend infrastructure has been upgraded to Scala 3.
Improved and restructured documentation.
Previous releases (up to 24.09):
https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/24.09/output/etl/parquet/associationByOverallDirect/
25.03 release:
https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/association_by_datasource_direct/
78,766 targets
28,513 diseases and phenotypes
18,081 drugs and compounds
28,168,992 evidence strings
10,162,821 target-disease associations
6,493,882 variants
18 September 2024
New features
Data updates
New safety liabilities, from Brennan et al. (2024), have been added for several targets which are routinely used by pharmaceutical companies.
New Gene Burden evidence from the LoF burden analyses in FinnGen’s latest public release (R11).
New data from Reactome, Europe PMC, COSMIC and EVA (through ClinVar).
Product features
Improved visualisation for Gene essentiality data from Cancer DepMap.
Updated frontend table design and functionality leading to better searching/filtering and sorting functionality for most tables in the UI.
The classic associations view has been deprecated from the Platform.
Product enhancements and bug fixes
OpenAI model in the literature summarisation tool was updated to GPT-4o-mini.
Changes to the variant field in the cancer biomarker evidence.
Aggregated the granularity of the description of phenotypes inside cohortPhenotypes to resolve duplication in ChEMBL evidence.
Bug fixes: pharmacogenetics schema, molecule dataset.
63,121 targets
28,327 diseases and phenotypes
18,041 drugs and compounds
17,853,184 evidence strings
8,155,988 target-disease associations
19 June 2024
New features
Uploading a custom list of targets or diseases to obtain a tailored associations view allowing users to view selected target-disease evidence for the specific entities.
Data updates
Integration of ChEMBL 34 which now contains data from the European Medicines Agency (EMA) increasing drug-indication coverage.
Updates to the clinical trials and tractability data.
New gene burden evidence for schizophrenia from the SCHEMA consortium and ancestry-specific evidence for prostate cancer.
A new Gene2Phenotype panel with musculo-skeletal implications.
New data from Reactome, COSMIC and EVA (through ClinVar), Probes and Drugs and GEL PanelApp increasing our coverage of data.
Product features
Ability to handle pharmacogenetic evidence involving drug combinations.
Product enhancements and bug fixes
Exclusion of splice QTLs from the direction-of-effect assessment, and selection of the beta from the evidence with the lowest p-value instead of the largest effect size.
Improvements in the AotF GQL Playground Component.
Dropping the isHumanApplicable field from target safety.
Bug fixes to the ClinVar (somatic) widget loading state.
Resolving an error in phenotype mapping in GEL PanelApp, resolving incorrect tractability precedence and removing categorical burden tests from Genebass data based on community feedback.
63,226 targets
28,198 diseases and phenotypes
18,041 drugs and compounds
17,703,456 evidence strings
8,079,215 target-disease associations
20 March 2024
New features
Implementation of direction of effect assessment for eight different sources of target-disease association evidence.
Filtering bibliography data based on a publication date.
Data updates
Integration of the latest dataset from Project Score.
New data from Reactome, COSMIC and EVA (through ClinVar) increasing our coverage of data.
Product features
Inclusion of star alleles from PharmGKB and a new Direct Drug Target column to the pharmacogenetics widget.
Use of pharmacogenetics data to inform adverse drug response as an additional source of information on target safety.
New dedicated GraphQL API query playground for Associations-on-the-Fly and target prioritisation view.
Product enhancements and bug fixes
Redesigned context menu with a new navigation and pinning behaviour.
Updated protvista-uniprot viewer library to v2.11.1.
Bug fixes to download file schema and search issues.
63,226 targets
25,817 diseases and phenotypes
17,111 drugs and compounds
17,317,290 evidence strings
7,802,260 target-disease associations
30 November 2023
New features
A new widget in the Platform adding pharmacogenetics data from PharmGKB.
Data updates
New data available on Baseline RNA and protein expression data for targets via the API and FTP.
New data from Reactome and EVA (through ClinVar) increasing our coverage of data.
Product features
Users can easily export the entire associations table and the target prioritisation table in json or tsv format.
Updated search bar design with search suggestions.
Product enhancements and bug fixes
Transition to OpenSearch from Elasticsearch.
Bug fixes in the widgets in the Association On The Fly view and styling issues.
62,733 targets
25,246 diseases and phenotypes
17,095 drugs and compounds
16,710,896 evidence strings
7,994,180 target-disease associations
21 September 2023
New features
OpenAI Literature Summarisation tool - For data features that link to publications, users can ask for a natural language summary of the target-disease evidence presented in the publication using LangChain and OpenAI’s GPT3.5 Turbo model.
Data updates
Increase in Europe PMC literature evidence by 9.9% to 10,355,423
New data from ChEMBL, COSMIC and EVA (through ClinVar) increasing our coverage of data
Product features
Easy access to the schema of the files available for download in the Open Targets Platform
Product enhancements and bug fixes
Open Targets Platform user interface migration to Material UI v5
Refactoring of the sections in the frontend codebase - Components and sections moved into the packages/sections and packages/ui
62,733 targets
25,209 diseases and phenotypes
17,096 drugs and compounds
16,232,046 evidence strings
7,922,844 target-disease associations
26 June 2023
New features
Data updates
Updated data from ChEMBL, including adverse event drug warning data and more granular information on clinical phases
New data from IntOGen, Europe PMC, and EVA (through ClinVar) increasing our coverage of data
Product features
Product enhancements and bug fixes
Fixes - homology widget, fixes to the data
More meaningful 404 error message
Fixed bugs in the API Playground
62,685 targets
24,713 diseases and phenotypes
13,210 drugs and compounds
15,117,741 evidence strings
7,835,247 target-disease associations
22 February 2023
In addition to regular updates from our data providers, we have a number of new features in this release:
New evidence for target-disease associations
Additional data for metabolic biomarkers added to our Gene Burden widget
QTL-based direction of effect included in evidence from Open Targets Genetics
Improved target annotation data
Integration of Target safety evidence from AOPWiki
New data from Probes and Drugs’ 04.2022 release
Literature updates
Preprints and patents now included in our bibliography
Development updates
Redesigned search
Provenance metadata
62,678 targets
24,713 diseases and phenotypes
12,854 drugs and compounds
10,446,771 evidence strings
6,656,559 target-disease associations
24 November 2022
In addition to continuous updates from our data providers, we have introduced the following new features:
Gene burden data for Parkinson’s disease
Updated classifications for clinical trial stop reasons
Variant functional consequences, available to browse in the Gene2Phenotype and Orphanet widgets
Other improvements and bug fixes
62,678 targets
22,274 diseases and phenotypes
12,854 drugs and compounds
14,611,717 evidence strings
6,960,486 target-disease associations
29 September 2022
New data, in particular:
Open Targets Genetics
Genomics England PanelApp
Gene burden
Probes and drugs
New data integrity file, in line with FAIR principles
61,888 targets
20,931 diseases and phenotypes
12,854 drugs and compounds
14,229,684 evidence strings
7,003,171 target-disease associations
24 June 2022
New data: five additional gene burden analyses from Genebass
New feature: new visualisation of subcellular locations of targets now available to users
New ontology term: “medical procedure”
61,524 targets
23,074 diseases and phenotypes
12,854 drugs and compounds
14,455,104 evidence strings
7,247,865 target-disease associations
28 April 2022
New datasource: gene burden analyses from Regeneron and AstraZeneca
Integration of structural variants from ClinVar
Additional information from DailyMed drug label text-mining
NLP classification of why clinical trials stopped
New data: Gene2phenotype cardiac panel
61,524 targets
18,520 diseases and phenotypes
12,854 drugs and compounds
13,829,174 evidence strings
7,541,360 target-disease associations
28 February 2022
Gene2Phenotype terminology updated in line with the Gene Curation Coalition (GenCC)
Data updates from a range of providers including Open Targets Genetics and ChEMBL
61,524 targets
18,468 diseases and phenotypes
12,594 drugs and compounds
10,880,832 evidence strings
7,980,448 target-disease associations
29 November 2021
New Cancer Biomarkers evidence data from the Cancer Genome Interpreter
Updated genetic association evidence from Open Targets Genetics
Embedded GraphQL API playground for each data table and query
60,636 targets
18,706 diseases and phenotypes
12,594 drugs and compounds
10,481,189 evidence strings
7,787,231 target-disease associations
30 September 2021
Integration of Genetic Constraint data from gnomAD and new Chemical Probes data from Probes & Drugs database
Data updates from EFO, ChEMBL, and Mouse Genome Informatics
Other improvements and bug fixes
60,636 targets
18,663 diseases and phenotypes
12,594 drugs and compounds
11,071,233 evidence strings
7,927,820 target-disease associations
30 June 2021
Updated Open Targets Genetics Portal evidence, which included the integration of FinnGen biobank data (R5) and new GWAS Catalog studies
Integration of gene-disease data from Orphanet
Improvements to the user interface (e.g. datatype chips on evidence page)
Bug fixes (e.g. users can download Known Drugs table, association scores in datasets match values returned by API)
60,606 targets
18,507 diseases and phenotypes
13,185 drugs and compounds
13,267,236 evidence strings
9,216,710 target-disease associations
All data and results of queries must remain confidential and must not be shared publicly. Please note that data from OTAR projects is pre-publication, being actively worked on by projects teams and therefore subject to change through further analysis — our release notes contain details of any known issues with data sets.
All pre-publication data incorporated into the PPP will be publicly released by the project teams at the time of publication.
Terms of use for the Open Targets Platform - updated June 2021
These Terms of Use reflect Open Targets' objective to develop, implement and rapidly disseminate to the wider scientific community, new informatics tools, experimental methods, platforms and associated data related to target validation. They impose no additional constraints on the use of the contributed data than those provided by the data owner.
Open Targets expects attribution (e.g. in publications, services or products) for any of its online services, databases or software in accordance with good scientific practice. The expected attribution will be indicated on the appropriate web page.
Any feedback provided to Open Targets on the Open Targets Platform will be treated as non-confidential unless the individual or organisation providing the feedback states otherwise.
Open Targets is not liable to you, or to third parties claiming through you, for any loss or damage.
Personal data will only be released in exceptional circumstances when required by law or judicial or regulatory order.
We reserve the right to update these Terms of Use at any time. When alterations are inevitable, we will attempt to give reasonable notice of any changes by placing a notice on our website, but you may wish to check each time you use the website. The date of the most recent revision will appear on this page. If you do not agree to these changes, please do not continue to use our online services. We will also make available an archived copy of the previous Terms of Use for comparison.
Any questions or comments concerning these Terms of Use can be addressed to: The Operations Director, Open Targets, Wellcome Genome Campus, Hinxton CB10 1SD
Users of the Open Targets Platform agree not to attempt to use any Open Targets computers, files or networks apart from through the service interfaces provided.
Open Targets will make all reasonable effort to maintain continuity of the Open Targets Platform and provide adequate warning of any changes or discontinuities. However, Open Targets accepts no responsibility for the consequences of any temporary or permanent discontinuity in service.
Any attempt to use the Open Targets Platform to a level that prevents, or looks likely to prevent, Open Targets providing services to others, will result in the use being blocked.
Software that can be run from the Open Targets webpages may be used by any individual for any purpose unless specific exceptions are stated on the web page.
Open Targets does not accept responsibility for the consequences of any breach of the confidentiality of the Open Targets site by third parties.
The online data services and databases of Open Targets are generated in part from data contributed by the community who remain the data owners.
Open Targets itself places no additional restrictions on the use or redistribution of the data available via its online services other than those provided by the original data owners.
Open Targets does not guarantee the accuracy of any provided data, generated database, software or online service nor the suitability of databases, software and online services for any purpose.
The original data may be subject to rights claimed by third parties, including but not limited to, patent, copyright, other intellectual property rights, biodiversity-related access and benefit-sharing rights. It is the responsibility of users of the Open Targets Platform to ensure that their exploitation of the data does not infringe any of the rights of such third parties.
Profile pages aim to describe the entity and provide relevant annotation that might become informative at different stages of the drug development process.
At the top of each profile page, we provide a description for the entity, a series of cross-references to other databases, as well as synonyms provided by some of the upstream data sources.
Next, a list of summary widgets displays the availability of data for each category; summary widgets in grey indicate a lack of information for the given category.
Further down the page, detail widgets provide the full extent of information available in the Platform about a particular summary widget. A description explains the nature of the displayed information, as well as the source of the data.
Open Targets Platform currently has the following entity pages:
The Open Targets Platform is one of Open Targets' flagship informatics products, and the team that maintains it is committed to building open source tools and supporting open access research.
If you are interested in knowing more about our data sources and their licensing status, please see this list below:
Buniello, A. et al. (2025). . Nucleic Acids Research.
Ochoa, D. et al. (2023). Nucleic Acids Research.
Ochoa, D. et al. (2021). . Nucleic Acids Research.
Carvalho-Silva, D. et al. (2019). . Nucleic Acids Research.
Koscielny, G. et al. (2017). . Nucleic Acids Research.
Ghoussaini, M., et al. (2021) . Nucleic Acids Research.
Mountjoy, E., et al. (2021) . Nature Genetics.
— GraphQL API
— OpenAI API router
— Open Targets web applications
— Open Targets Infrastructure definition, scripts in charge of setting up the data layer with the relevant disk images and spinning up the rest of the infrastructure.
Please review and check out the to get started.
If you have further questions, please get in touch with us on the .
: functional context for 6.5M rare and common variants
: GWAS and molQTL studies
: 2.6 million credible sets derived from various sources
A new machine learning model which prioritises likely causal genes at each GWAS locus using functional genomics features. The Platform also uses SHAP values as part of the L2G predictions to illustrate the relative contribution of each feature.
A new scalable and reproducible genetic analysis pipeline, available as a Python package for post-GWAS analysis: .
- Open Targets' Task ExecutoR i.e. scripts that process and prepare data for our ETL pipelines.
The data download paths have changed, as only the Parquet data format is now available. There are also minor changes to dataset names (snake_case and singular forms). Details .
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Users now have the ability to apply target-specific or disease/phenotype-specific filters to the target-disease association and target prioritisation pages. Read more details about the feature in our .
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Updating the gene burden results from AstraZeneca’s PheWAS portal .
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Integration of the 23Q4 version of DepMap ().
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
'Target Prioritisation' view: A new view for assessment of the target features considered when prioritising (or deprioritising) targets for drug discovery. Watch a detailed video .
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
‘Associations on the Fly’: a revamp of the Open Targets Platform association page with new facets and additional built-in functionalities, such as viewing data directly in the associations table, controlling the weights of contributing evidence, filtering by data source and data type (OR filters), and pinning rows. Watch a detailed video .
Updated Molecular Interactions data source to version 12.0
Expanded the definition of a drug to include all probes as reported by , since chemical probes are useful from a target's doability perspective.
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Addition of a CRISPR Screens widget featuring data from
Introduction of a Cancer DepMap widget showcasing gene essentiality data from the
Missense variants in the OT Genetics, UniProt variants and ClinVar widgets now link to , a new tool to interpret the functional consequences of human missense variants
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Check out the for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Integration of new PROTAC tractability data from
Check out our for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
Check out our for more information on the new features and datasets introduced in this release.
Visit the for more data metrics for this release, including a per datasource breakdown of evidence strings.
For release notes for previous releases, check out the .
The Partner Preview Platform (PPP) is provided exclusively to Open Targets consortium members.
For more information about the PPP, please read the
Please review our updated .
If you use our code and/or our data, please cite .
The Open Targets Platform data is marked with CC0 1.0. This dedicates the data to the public domain, allowing downstream users to consume the data without restriction.
The Platform also conforms to the EBI long-term data preservation policy.
Please contact us on if you have questions about our licence or using our data and codebases.
The codebases that power the Platform - including our pipelines, GraphQL API, and React UI - are all open source and licensed under the .
You can find all of our code repositories on GitHub at .
Beta: < 0 = Protective, > 0 = Risk
Odds ratio: < 1 = Protective, > 1 = Risk
Licences across our data sources include CC0 1.0, CC-BY 4.0, CC BY 3.0, CC BY-SA 3.0, Apache 2.0, MIT, the EMBL-EBI terms of use, publication-dependent licences (CC-BY, CC-BY-NC, CC0, or open access), and agreements permitting commercial use for Open Targets. For some sources, summary statistics are available under CC0 1.0 or can be requested from the provider.
Computational pipeline generating lists of publications based on similar entities
The Platform Bibliography tool aims to provide context on the scientific literature relevant to the target, disease or phenotype, or drug entities of interest. The Platform aims to surface not only the publications that reference the entities of interest, but also the other entities that frequently co-occur with them in the literature. Open Targets, in collaboration with Europe PMC, has developed a pipeline that maximises literature information extraction by combining Named Entity Recognition (NER), ontological normalisation, and data analysis.
On our entity annotation pages, users also have the option of filtering the bibliography data to a specific timeframe.
Europe PMC (https://europepmc.org) is an open science database that facilitates access to a comprehensive collection of life science publications, preprints, and patents. Europe PMC provides researchers with freely available data through its website, APIs, and bulk downloads.
In conjunction with the Europe PMC corpus, the pipeline uses the information that defines the Platform entities to build a dictionary of terms and synonyms associated with each entity.
A machine learning model, trained using BioBERT and a combination of curated datasets, is applied to the Europe PMC corpus of abstracts and full-text articles. The model tags every gene/protein (GP), disease (DS), or drug (DR) entity in every publication. The resulting set provides metadata on the publications, the sections where the matches are found, and the sentences in which more than one type of entity co-occurs.
To ground the tagged text to the Platform entities, a dictionary-based approach is applied to the tagged mentions using a series of standard natural language processing steps (e.g. stemming and stop-word removal). As a result, a fraction of the tagged sentences are grounded to Platform entities and annotated with their respective identifiers.
The result of the normalisation is also used to build the Platform literature evidence, covered in a different section.
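To make the grounding step concrete, the sketch below shows a minimal, hypothetical dictionary-based normalisation in Python; the synonym dictionary, normalisation rules, and identifiers are toy examples and much simpler than the production pipeline.

```python
# A minimal, hypothetical sketch of the dictionary-based grounding step:
# normalise a tagged mention and look it up in a synonym dictionary built
# from the Platform entity annotation (dictionary, rules and IDs are toy examples).
import re

STOP_WORDS = {"the", "a", "an", "of", "human"}

SYNONYMS = {
    "tyk2": "ENSG00000105397",
    "tyrosine kinase 2": "ENSG00000105397",
    "inflammatory bowel disease": "EFO_0003767",
}

def normalise(mention):
    # Lower-case, keep alphanumeric tokens, drop stop-words; the production
    # pipeline also applies stemming and other standard NLP steps.
    tokens = re.findall(r"[a-z0-9]+", mention.lower())
    return " ".join(token for token in tokens if token not in STOP_WORDS)

def ground(mention):
    # Return the Platform identifier if the mention can be grounded, else None.
    return SYNONYMS.get(normalise(mention))

print(ground("the human TYK2"))              # ENSG00000105397
print(ground("Inflammatory Bowel Disease"))  # EFO_0003767
print(ground("an unknown mention"))          # None
```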
To define a strategy for finding similar entities based on the literature, the resulting matched entities are used to train a Word2Vec model; the hyper-parameter settings used to train the model follow the suggestions from the benchmark conducted by Chamberlain et al. (2020).
Using this model, the user can query which other entities are most similar to the selected one, based on the entire literature corpus. Several entities can also be selected together, and the combination of their vectors proposes additional entities similar to all of the selected entities.
The Bibliography section uses this algorithm to navigate the universe of publications. As the user keeps selecting entities, the universe is narrowed to the intersection of all publications that mention all of the selected entities.
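The snippet below sketches this idea with gensim's Word2Vec, treating each publication as a "sentence" of grounded entity identifiers; the toy corpus, identifiers, and hyper-parameters are illustrative and are not the settings used in production.

```python
# A minimal sketch, not the production pipeline: train Word2Vec on toy
# per-publication lists of grounded entity identifiers and query for
# similar entities (identifiers and hyper-parameters are illustrative).
from gensim.models import Word2Vec

# Each "sentence" is the list of entity IDs matched in one publication.
publications = [
    ["ENSG00000105397", "EFO_0003767", "CHEMBL0000001"],
    ["ENSG00000105397", "EFO_0003767"],
    ["ENSG00000169194", "EFO_0000274", "CHEMBL0000002"],
]
model = Word2Vec(sentences=publications, vector_size=32, window=5, min_count=1, epochs=20)

# Entities similar to a single selection...
print(model.wv.most_similar("ENSG00000105397", topn=3))
# ...or to a combination of selections, mirroring the Bibliography view.
print(model.wv.most_similar(positive=["ENSG00000105397", "EFO_0003767"], topn=3))
```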
For data features that link to publications, the Platform now provides the option for users to ask for a natural language summary of the target-disease evidence presented in the publication (when the full-text article is available and free to re-use).
Using LangChain, we ask OpenAI’s GPT-4o mini model to summarise the relevant portions of the text, which we then ask it to condense with the following prompt: “Can you provide a concise summary about the relationship between [target] and [disease] according to this study?” The resulting text is presented to the user (see screenshots with example).
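A minimal sketch of this summarisation step is shown below, assuming the langchain-openai integration; the function name is illustrative, and chunking of long full texts, retries, and error handling are omitted, so this is not the production implementation.

```python
# A minimal sketch of the summarisation step, assuming the langchain-openai
# integration; chunking of long full texts, retries and error handling are
# omitted, and the function name is illustrative.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

prompt = ChatPromptTemplate.from_template(
    "Can you provide a concise summary about the relationship between "
    "{target} and {disease} according to this study?\n\n{article_text}"
)

def summarise_publication(target, disease, article_text):
    # Pipe the prompt into the model (LangChain expression language) and
    # return the plain-text summary shown to the user.
    chain = prompt | llm
    return chain.invoke(
        {"target": target, "disease": disease, "article_text": article_text}
    ).content
```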
We hope this feature will help the user better understand the available bibliography evidence, and it may actually highlight cases in which the publication does not in fact provide evidence for the target-disease relationship in question.
You can find details on the OpenAI terms of use here.
Rosonovski S, Levchenko M, Bhatnagar R, et al. Europe PMC in 2023. Nucleic Acids Research. 2024 Jan;52(D1):D1668-D1676. DOI: 10.1093/nar/gkad1085. PMID: 37994696; PMCID: PMC10767826.
Benjamin P. Chamberlain, Emanuele Rossi, Dan Shiebler, Suvash Sedhain, and Michael M. Bronstein. 2020. Tuning Word2vec for Large Scale Recommendation Systems. In Fourteenth ACM Conference on Recommender Systems (RecSys '20). Association for Computing Machinery, New York, NY, USA, 732–737. doi: 10.1145/3383313.3418486
Learn about the updated associations view
“Associations on the Fly” is a revamp of the Open Targets Platform association page with new facets and additional built-in functionalities. This view replaced the classic associations page.
In the Associations on the Fly page, a target or disease/phenotype is fixed and a prioritised list of alternative entities is displayed. A more detailed explanation of associations is available in the Target-Disease associations section.
Rapid comparison of evidence for different associations
User control over the weighting of contributing evidence from each data source
Ability to include and filter by specific data sources (Note: this is an OR filter)
Searching and applying filters by various target, disease or phenotype categories
Ability to 'pin' a list of targets to create a customised list
Upload a list of entities of interest and export the results
The data source weights are presented as an advanced option in the new user interface. They have been designed to allow users to dynamically modify the relative importance of the different data sources from the defaults set by Open Targets. The view automatically recomputes the association scores based on the new user-defined weights, giving this view its name: Associations on the Fly.
The weights can be adjusted to modify the preset Open Targets data source weightings and their effect on the overall association score “on the fly”. This interaction ultimately enables more tailored therapeutic hypothesis formulation.
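As an illustration of the recomputation, the sketch below assumes the harmonic-sum aggregation described in the target-disease associations documentation; the function and the example scores and weights are hypothetical, not the Platform's implementation.

```python
# A hypothetical sketch of recomputing the overall association score from
# per-datasource scores and user-defined weights, assuming the harmonic-sum
# aggregation described in the target-disease associations documentation.
def weighted_association_score(datasource_scores, weights):
    """Both arguments are dicts keyed by datasource name; weights default to 1."""
    weighted = sorted(
        (score * weights.get(source, 1.0) for source, score in datasource_scores.items()),
        reverse=True,
    )
    # Harmonic sum: the i-th largest contribution is damped by 1/i^2, then the
    # total is normalised by the theoretical maximum so the score stays in [0, 1].
    harmonic = sum(score / (i ** 2) for i, score in enumerate(weighted, start=1))
    max_harmonic = sum(1.0 / (i ** 2) for i in range(1, len(weighted) + 1))
    return harmonic / max_harmonic if max_harmonic else 0.0

scores = {"chembl": 0.9, "ot_genetics": 0.7, "europepmc": 0.4}
print(weighted_association_score(scores, {}))                  # default weights
print(weighted_association_score(scores, {"europepmc": 0.5}))  # down-weight literature
```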
Where evidence is available for a data source, clicking on the button will reveal the detail widget for that data source. Evidence displayed in the widget includes indirect evidence, so the user can interrogate evidence annotated with descendants of the disease or phenotype of interest.
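To illustrate what counts as indirect evidence, the toy example below includes evidence annotated to ontology descendants of the selected disease; the identifiers, descendant map, and evidence records are illustrative only.

```python
# Hypothetical illustration of indirect evidence: records annotated to any
# ontology descendant of the selected disease also count for that disease.
DESCENDANTS = {
    # selected disease -> itself plus its descendants (toy ontology)
    "EFO_0000540": {"EFO_0000540", "EFO_0003767", "EFO_0000729"},
}
EVIDENCE = [
    {"disease": "EFO_0003767", "score": 0.8},
    {"disease": "EFO_0000729", "score": 0.5},
]

def evidence_for(disease_id):
    # Direct evidence plus evidence on descendants (i.e. indirect evidence).
    wanted = DESCENDANTS.get(disease_id, {disease_id})
    return [record for record in EVIDENCE if record["disease"] in wanted]

print(evidence_for("EFO_0000540"))  # both records are returned
```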
The Platform includes redesigned functionality that allows filtering on the Associations on the Fly and Target Prioritisation pages:
On a disease or phenotype page: users can search and apply target-specific filters that filter the association and prioritisation pages by a particular target or a target category (category details are in the table below). The default is All Categories, but you can also select a specific category from the drop-down menu and view filter suggestions for it.
| Category | Description | Examples |
| --- | --- | --- |
| Names | Name of the target (`approvedName`) | interleukin 13, tyrosine kinase 2 |
| Symbol | Target symbol (`approvedSymbol`) | IL13, TYK2 |
| ChEMBL Target Class | Class of drug target from the ChEMBL database | Enzyme, Kinase, Surface antigen |
| GO:BP | Gene Ontology: Biological Process; the larger processes, or ‘biological programs’, accomplished by multiple molecular activities | DNA repair, Intracellular signal transduction |
| GO:CC | Gene Ontology: Cellular Component; a location, relative to cellular compartments and structures, occupied by a macromolecular machine | Cytoskeleton, Clathrin complex |
| GO:MF | Gene Ontology: Molecular Function; molecular-level activities performed by gene products | Oxidoreductase activity, Transporter activator activity |
| Reactome | Pathways from the Reactome database | Circadian clock, Interleukin-6 signaling |
| Subcellular Location | Subcellular location from UniProt and HPA | Cell membrane, Cytoplasm |
| Target ID (ENSG) | Ensembl gene IDs of the target, beginning with ENSG (`id`) | ENSG00000169194, ENSG00000105397 |
| Tractability Antibody |  | UniProt loc high conf, Human Protein Atlas loc |
| Tractability Other Modalities |  | Approved Drug |
| Tractability PROTAC |  | UniProt Ubiquitination, Half-life Data |
| Tractability Small Molecule |  | Structure with Ligand, High-Quality Pocket |
| All Categories | Search and apply all of the above filters | IL13, ENSG00000105397 |
On a target page: users can search and apply disease- or phenotype-specific filters, allowing them to filter the association page by a particular disease/phenotype or a disease/phenotype category [Disease, Therapeutic Area (e.g. infectious disease, endocrine system disease)].
Whenever you select a specific category, a few suggestions from the selected category are shown by default.
When multiple filters from different categories are selected, they are applied using an AND operator. When multiple filters within the same category are selected, they are applied with an OR operator.
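As a small illustration of this logic (a hypothetical sketch; the Platform applies it in its interface and API, and the data structures below are not the real ones):

```python
# Hypothetical sketch of combining facet filters: OR within a category,
# AND across categories.
def passes_filters(entity_facets, selected_filters):
    """entity_facets / selected_filters: dicts mapping category -> set of values."""
    return all(
        bool(entity_facets.get(category, set()) & values)  # OR within a category
        for category, values in selected_filters.items()   # AND across categories
    )

target = {"ChEMBL Target Class": {"Kinase"}, "Subcellular Location": {"Cytoplasm"}}
filters = {"ChEMBL Target Class": {"Kinase", "Enzyme"}, "Subcellular Location": {"Cytoplasm"}}
print(passes_filters(target, filters))  # True
```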
Check out this tutorial video to know more about this feature:
Users can now upload a custom list of targets, diseases, or phenotypes of interest to obtain a tailored association/target prioritisation view.
The feature is accessible from the "upload" icon on the Associations on the Fly page. This gives users the option of uploading a file containing a custom list of targets, diseases, or phenotypes. The file should have one entity per row. Several file formats are allowed: .txt*, .csv/.tsv/.xlsx**, and .json. An example of each file format is also provided in the feature.
The Platform then suggests potential matches between the entities in the uploaded list and those in the Platform; the matches are provided through their Platform entity IDs. Users can also select the specific results they wish to be displayed in the final view. Clicking the 'Pin hits' tab prompts the Associations on the Fly page to build a custom view with the entities from the uploaded list.
Watch our video describing the feature:
* For .txt file format, please create your input list using a text editor.
** For .csv/.tsv/.xlsx file formats, please ensure that the file has a header called `id`.
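For illustration, a minimal .csv upload listing the two example targets from the table above could look like this (hypothetical file contents):

```csv
id
ENSG00000169194
ENSG00000105397
```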
We have designed and developed an export functionality for the Associations on the Fly/Target Prioritisation pages, allowing users to download:
Entire dataset view (default status)
Customised dataset view, including custom control changes, a subset of data types (aggregations), and/or data from pinned targets only
TSV and JSON formats
Antibody: with data on an accessible epitope for antibody-based therapy
Other modalities: with data on a compound in clinical trials with a modality other than small molecule or antibody
PROTAC: with data on using Proteolysis Targeting Chimeras (PROTACs)
Small molecule: with data on a binding site suitable for small molecule binding