Target - disease evidence
Every event or set of events that pinpoints a target as a potential causal gene or protein for a disease represents a unit of information, most often referred to as evidence. Within Open Targets, a series of pipelines ensures that information is retrieved from its sources and standardised so that it can be immediately applied to answer drug development queries.
All evidence is mapped to the reference target entity identifier (Ensembl gene) and disease or phenotype identifier (Experimental Factor Ontology, EFO), as well as other reference controlled vocabularies and ontologies when appropriate. Evidence is also reviewed to minimise the presence of duplicates within the same data source.
Data sources are also grouped into broader categories that abstract the type of evidence they predominantly capture. In the Platform, these categories are referred to as data types, whereas the individual resources are referred to as data sources.
To contextualise the relative importance of each piece of evidence, the Open Targets Platform provides a scoring framework for each data source. This score becomes more relevant when interpreting the association scoring described in later sections.
Evidence data sources
Open Targets Genetics
Open Targets Genetics focuses on the identification of trait-causal genes from significant loci in genome-wide association studies (GWAS).
Whereas GWAS identifies significantly associated alleles (lead variants), these variants are not necessarily the causal (or the only causal) ones. Moreover, the causal genes are not necessarily the closest to the lead variant. For these reasons, identifying target-disease associations based on GWAS data is extremely challenging. Open Targets Genetics tackles this and other challenges by applying cutting-edge statistical genetics methodologies to large-scale human genetics data. In addition, Open Targets Genetics uses a machine learning method to identify the most likely causal genes by integrating and summarising the effect of tag variants based on genetic and functional genomic data. This method is referred to as the Locus2Gene model.
In the Platform, Genetics portal evidence is defined as any GWAS-significant lead variant (p-value < 1e-8) identified in a study, together with a predicted causal gene for the given trait with a Locus2Gene (L2G) score greater than 0.05.
When available, L2G predictions incorporate a new column ‘QTL effect’ containing a Sequence Ontology term that represents whether the observed allele is expected to cause an increased or decreased abundance of the gene product. When multiple variants with opposite effects are available, only the strongest effect is considered.
Datatype: Genetic associations
Evidence scoring: Locus2Gene (L2G) score, filtered to use scores above 0.05
Direction of Effect assessment:
LoF and GoF from Variant Functional Consequence variants/Consequence from QTL
Source: Open Targets Genetics
Reference: Ghoussaini M, et al. 2021
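The two thresholds in the definition above can be read as a simple filter. The following is a minimal sketch under that reading; the `LocusGenePrediction` record, its field names, and the placeholder identifiers are hypothetical and are not the Platform's actual schema.

```python
from dataclasses import dataclass

P_VALUE_THRESHOLD = 1e-8  # GWAS significance cut-off quoted in the text
L2G_THRESHOLD = 0.05      # minimum Locus2Gene score quoted in the text


@dataclass
class LocusGenePrediction:
    """Hypothetical record: one lead variant paired with one predicted causal gene."""
    study_id: str
    lead_variant: str
    gene_id: str
    p_value: float
    l2g_score: float


def to_evidence(predictions):
    """Keep predictions that pass both thresholds; the evidence score is the L2G score."""
    return [
        {
            "target": p.gene_id,
            "study": p.study_id,
            "variant": p.lead_variant,
            "score": p.l2g_score,
        }
        for p in predictions
        if p.p_value < P_VALUE_THRESHOLD and p.l2g_score > L2G_THRESHOLD
    ]


# Placeholder identifiers, for illustration only.
example = [
    LocusGenePrediction("STUDY_1", "VAR_1", "GENE_A", 5e-9, 0.8),   # kept
    LocusGenePrediction("STUDY_1", "VAR_2", "GENE_B", 5e-9, 0.01),  # L2G too low
    LocusGenePrediction("STUDY_2", "VAR_3", "GENE_C", 1e-6, 0.9),   # not significant
]
print(to_evidence(example))
```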
ClinVar
ClinVar is an NIH public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. The ClinVar data source in the Open Targets Platform captures the subset of ClinVar that refers to germline variants (as opposed to somatic variants). Each piece of evidence in the Platform aims to capture an individual RCV record in ClinVar.
Information on variants is covered extensively for both single point and structural variants. When available, genomic coordinates are reported with RS numbers, or by following the CHROM_POS_REF_ALT and HGVS notations.
Data type: Genetic associations
Evidence scoring: ClinVar evidence is scored in a 2-step process. In step 1, a score is assigned to every piece of evidence based on the clinical significance:
| Clinical significance | Score |
| --- | --- |
| association not found | 0 |
| benign | 0 |
| not provided | 0 |
| likely benign | 0 |
| conflicting data from submitters | 0.3 |
| conflicting interpretations of pathogenicity | 0.3 |
| low penetrance | 0.3 |
| other | 0.3 |
| uncertain risk allele | 0.3 |
| uncertain significance | 0.3 |
| established risk allele | 0.5 |
| risk factor | 0.5 |
| affects | 0.5 |
| likely pathogenic | 0.7 |
| association | 0.9 |
| confers sensitivity | 0.9 |
| drug response | 0.9 |
| protective | 0.9 |
| pathogenic | 0.9 |
In Step 2, the score is modulated based on the ClinVar review status:
| Review status | Score modifier |
| --- | --- |
| no assertion provided | +0 |
| no assertion criteria provided | +0 |
| no assertion for the individual variant | +0 |
| criteria provided, single submitter | +0.02 |
| criteria provided, conflicting interpretations | +0.02 |
| criteria provided, multiple submitters, no conflicts | +0.05 |
| reviewed by expert panel | +0.07 |
| practice guideline | +0.1 |
Direction of Effect assessment:
LoF variants
Source: ClinVar (via European Variation Archive)
References: Cezard T. et al, 2021; Shen A. et al, 2024; Landrum, M. et al, 2014; Landrum, M. et al, 2020
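The two-step ClinVar scoring described in this section can be sketched as a pair of lookups. This is an illustration only: it assumes the step 2 modifier is added to the step 1 score and that the result is capped at 1, which is the reading suggested by the "+" values; the function name `clinvar_score` is hypothetical.

```python
# Step 1: score by clinical significance (values from the text above).
CLINICAL_SIGNIFICANCE_SCORE = {
    "association not found": 0.0,
    "benign": 0.0,
    "not provided": 0.0,
    "likely benign": 0.0,
    "conflicting data from submitters": 0.3,
    "conflicting interpretations of pathogenicity": 0.3,
    "low penetrance": 0.3,
    "other": 0.3,
    "uncertain risk allele": 0.3,
    "uncertain significance": 0.3,
    "established risk allele": 0.5,
    "risk factor": 0.5,
    "affects": 0.5,
    "likely pathogenic": 0.7,
    "association": 0.9,
    "confers sensitivity": 0.9,
    "drug response": 0.9,
    "protective": 0.9,
    "pathogenic": 0.9,
}

# Step 2: modifier by review status (values from the text above).
REVIEW_STATUS_MODIFIER = {
    "no assertion provided": 0.0,
    "no assertion criteria provided": 0.0,
    "no assertion for the individual variant": 0.0,
    "criteria provided, single submitter": 0.02,
    "criteria provided, conflicting interpretations": 0.02,
    "criteria provided, multiple submitters, no conflicts": 0.05,
    "reviewed by expert panel": 0.07,
    "practice guideline": 0.1,
}


def clinvar_score(clinical_significance, review_status):
    """Return base score plus modifier (assumed additive), capped at 1."""
    base = CLINICAL_SIGNIFICANCE_SCORE[clinical_significance.lower()]
    modifier = REVIEW_STATUS_MODIFIER[review_status.lower()]
    # Rounding avoids floating-point noise such as 0.7200000000000001.
    return round(min(1.0, base + modifier), 2)


print(clinvar_score("likely pathogenic", "criteria provided, single submitter"))
```

Under this reading, only a pathogenic-level record backed by a practice guideline reaches the maximum score.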
Gene Burden
Gene burden data comprises gene–phenotype relationships observed in gene-level association tests using rare variant collapsing analyses. The Platform integrates burden tests carried out by several sources:
REGENERON (Backman et al., 2021), a whole-exome sequencing analysis of individuals from the UK Biobank.
AstraZeneca PheWAS Portal (Wang et al., 2021), a whole-exome sequencing analysis of individuals from the UK Biobank.
Genebass (Gene-based Association Summary Statistics; Karczewski et al., 2022), a whole-exome sequencing analysis of individuals from the UK Biobank.
The results of whole-exome and whole-genome sequencing analysis based on the SPARK cohort bring evidence of novel targets implicated in autism spectrum disorder (Zhou et al., 2022).
The SCHEMA consortium (Singh et al., 2022), a whole-exome sequencing analysis of individuals with schizophrenia.
The Epi25 collaborative (Epi25 Collaborative, 2019), a whole-exome sequencing analysis of individuals with epilepsy.
The Autism Sequencing Consortium (Satterstrom et al., 2020), a whole-exome sequencing analysis of individuals with autism spectrum disorder.
The results of an Open Targets project (Bomba et al., 2022), a whole-exome sequencing analysis of individuals from the INTERVAL cohort testing for associations between rare coding variants and blood metabolites.
The results of a pan-ancestry whole-exome sequencing analysis identify relevant genes associated with fat distribution (Akbari et al., 2022).
The results of whole-exome and whole-genome sequencing analyses of Parkinson's disease, promoted by the AMP-PD initiative and other collaborators (Makarious et al., 2022).
The results of gene-based analyses of rare variants and circulating metabolic biomarkers relevant to cardiovascular disease (Riveros-McKay et al., 2020).
The results of rare coding variant analyses from whole-exome sequencing of Black South African men to identify genes significantly associated with prostate cancer (Soh et al., 2023).
The FinnGen (R11) gene-based burden test results from collapsing loss of function variants, based on genotyping data from the Finnish population. More information can be found in their documentation.
These associations are the result of collapsing rare variants in a gene into a single burden statistic and regressing the phenotype on that statistic to test for the combined effects of all rare variants in the gene. The different collapsing methods describe the filters used to select the set of qualifying variants, mostly based on their pathogenicity and frequency in the population.
Data type: Genetic associations
Evidence scoring: Scaled p-value from 0.25 (p = 1e-7) to 1 (p < 1e-17).
Direction of Effect assessment:
Assumption of all variants LoF
Source: AstraZeneca PheWAS Portal, Genebass
References: Wang, Q. et al, 2021; Backman, J.D. et al, 2021; Karczewski, K.K. et al, 2022; Zhou, X. et al, 2022; Singh et al, 2022; Epi25 Collaborative, 2019; Satterstrom et al, 2020; Bomba et al, 2022; Akbari, P. et al, 2022; Makarious et al, 2022; Riveros-McKay et al, 2020; Soh et al, 2023
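The scaled p-value used for gene burden scoring can be illustrated with a short sketch. The text gives only the two endpoints (0.25 at p = 1e-7 and 1 at p < 1e-17); the interpolation below assumes the scaling is linear in -log10(p) and clamps values outside the range, which is an assumption rather than a documented detail. The function name `scaled_pvalue_score` is hypothetical.

```python
import math


def scaled_pvalue_score(p_value, p_start=1e-7, p_end=1e-17,
                        score_min=0.25, score_max=1.0):
    """Map a p-value to [score_min, score_max], assuming linearity in -log10(p)."""
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in (0, 1]")
    x = -math.log10(p_value)
    low = -math.log10(p_start)   # 7 for the default endpoint
    high = -math.log10(p_end)    # 17 for the default endpoint
    fraction = (x - low) / (high - low)
    fraction = max(0.0, min(1.0, fraction))  # clamp outside the stated range
    return score_min + fraction * (score_max - score_min)


for p in (1e-7, 1e-12, 1e-17, 1e-30):
    print(p, scaled_pvalue_score(p))
```

The same helper, with different endpoints passed as arguments, would cover the other data sources in this document that describe a scaled p-value or q-value, under the same linearity assumption.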
Genomics England PanelApp
The Genomics England PanelApp is a knowledge base that combines crowdsourced expertise with curation to provide gene-disease relationships. Virtual gene panels related to human disorders are reviewed by experts within the clinical and scientific community to support the interpretation of genomes within the 100,000 Genomes Project. Within a panel, genes are rated based on the level of evidence supporting the association with the phenotypes identified by the panel. Genes are then classified according to a traffic light system with red/stop, amber/pause, and green/go classifications. To receive a green rating (diagnostic-grade) on a version 1+ panel, the gene requires "evidence from 3 or more unrelated families or from 2 - 3 unrelated families where there is strong additional functional data" and "genes that do not meet these criteria are rated as Amber (borderline) or Red (low level of evidence)."
The Open Targets Platform includes "green" and "amber" genes from version 1+ panels along with their phenotypes, provided the latter can be mapped to a disease or phenotype ontology. As we standardise our evidence to EFO, some of the phenotypes cannot be mapped and are therefore not included in our platform - please visit the Genomics England PanelApp website for the full set.
Data type: Genetic associations
Evidence scoring: Based on Genomics England gene rating:
| Gene rating | Evidence score |
| --- | --- |
| Amber | 0.5 |
| Green | 1 |
Source: Genomics England PanelApp
References: Martin, A. et al, 2019
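The inclusion rules and scoring for PanelApp described above can be sketched as a filter followed by a lookup. The entry fields (`gene`, `rating`, `panel_version`, `efo_id`) and the placeholder identifiers are hypothetical, and treating "version 1+" as "major version of at least 1" is an assumption.

```python
RATING_SCORE = {"green": 1.0, "amber": 0.5}  # red genes are not included


def major_version(version):
    """Extract the major component from a version string such as '1.23'."""
    return int(str(version).split(".")[0])


def panelapp_evidence(entries):
    """Keep green/amber genes from version 1+ panels whose phenotype maps to EFO."""
    evidence = []
    for entry in entries:
        score = RATING_SCORE.get(entry["rating"].lower())
        if score is None:
            continue  # red or unknown rating
        if major_version(entry["panel_version"]) < 1:
            continue  # panel not yet at version 1+
        if not entry.get("efo_id"):
            continue  # phenotype could not be mapped to EFO
        evidence.append(
            {"target": entry["gene"], "disease": entry["efo_id"], "score": score}
        )
    return evidence


# Placeholder entries, for illustration only.
entries = [
    {"gene": "GENE_A", "rating": "Green", "panel_version": "1.2", "efo_id": "EFO_X"},
    {"gene": "GENE_B", "rating": "Red", "panel_version": "1.2", "efo_id": "EFO_X"},
    {"gene": "GENE_C", "rating": "Amber", "panel_version": "0.9", "efo_id": "EFO_X"},
    {"gene": "GENE_D", "rating": "Green", "panel_version": "2.0", "efo_id": None},
    {"gene": "GENE_E", "rating": "Amber", "panel_version": "1.0", "efo_id": "EFO_Y"},
]
print(panelapp_evidence(entries))
```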
Gene2Phenotype
The data in Gene2Phenotype (G2P) is produced and curated from the literature by different sets of panels formed by consultant clinical geneticists. The G2P data is designed to facilitate the development, validation, curation, and distribution of large-scale, evidence-based datasets for use in diagnostic variant filtering. Each G2P entry associates an allelic requirement and a mutational consequence at a defined locus with a disease entity. A confidence level and evidence link are assigned to each entry. This confidence level follows the terminology described by GenCC for describing gene-disease validity.
G2P evidence in the Platform is the result of any target-disease curation by any of the expert panels.
Data type: Genetic associations
Evidence scoring:
| G2P confidence | Evidence score |
| --- | --- |
| Limited | 0.01 |
| Moderate | 0.5 |
| Strong | 1 |
| Both RD and IF | 1 |
| Definitive | 1 |
Direction of Effect assessment:
LoF and GoF variants
Assumption of Risk
Source: Gene2Phenotype
References: Thormann, A. et al, 2019
UniProt literature
The Universal Protein Resource (UniProt) provides a large compendium of sequence and functional information at the protein level. As part of their functional annotation effort, UniProt curators also annotate proteins with publications supporting their involvement in pathogenic processes.
All publications supporting a given target-disease relationship are aggregated into a single piece of Platform evidence.
Data type: Genetic associations
Evidence scoring:
| UniProt confidence | Evidence score |
| --- | --- |
| Medium | 0.5 |
| High | 1 |
Source: UniProt
References: The UniProt Consortium, 2021
UniProt variants
The Universal Protein Resource (UniProt) also curates variants, supported by publications, that are known to alter protein function in disease. Curated mutations are predominantly protein coding or in regulatory regions clearly associated with the causal protein.
Each variant reported in connection with a disease constitutes an individual piece of evidence, and all publications supporting it are aggregated within that evidence.
Data type: Genetic associations
Evidence scoring:
| UniProt confidence | Evidence score |
| --- | --- |
| Medium | 0.5 |
| High | 1 |
Source: UniProt
References: The UniProt Consortium, 2021
ClinGen
The Clinical Genome Resource (ClinGen) Gene-Disease Validity Curation aims to evaluate the strength of evidence supporting or refuting a claim that variation in a particular gene causes a particular disease. ClinGen provides a framework of guidelines to assess clinical validity in a semi-quantitative manner, allowing curators to classify the validity of a given gene-disease pair.
All gene-disease pairs mapped to EFO constitute individual evidence in the Platform.
Data type: Genetic associations
Evidence scoring:
| ClinGen classification | Evidence score |
| --- | --- |
| No reported evidence | 0.01 |
| Refuted | 0.01 |
| Disputed | 0.01 |
| Limited | 0.01 |
| Moderate | 0.5 |
| Strong | 1 |
| Definitive | 1 |
Source: ClinGen Gene-Disease Validity
References: Strande, N. et al., 2017
Orphanet
Orphanet is an international network that offers a range of resources to improve the understanding of rare disorders of genetic origin. These resources include an inventory of rare disease and gene associations, classification of the gene-disease relationship, information on the kind of mutation, and supporting publication references.
Data type: Genetic associations
Evidence scoring:
| Assessment status | Evidence score |
| --- | --- |
| Not yet assessed | 0.5 |
| Assessed | 1 |
Direction of Effect assessment:
LoF and GoF variants
Assumption of Risk
Source: Orphanet Genes Associated with Rare Diseases
References: Orphanet; Orphadata
ChEMBL
The EMBL-EBI ChEMBL is a manually curated database of bioactive molecules with drug-like properties, either approved for marketing by the U.S. Food and Drug Administration (FDA) or clinical candidates. ChEMBL also captures information regarding the indications of each drug molecule, as well as its curated pharmacological target.
In the Platform, ChEMBL evidence represents any target-disease relationship that can be explained by an approved or clinical candidate drug, targeting the gene product and indicated for the disease. Independent studies are treated as individual evidence.
To provide additional context, we integrate a machine learning-based analysis of the reasons why a clinical trial has ended earlier than scheduled. This sorts the stop reasons into a set of 17 classes which include negative, neutral, and positive reasons. This information is available when hovering on the tooltip of the Source column.
The 17 classes are: Another Study, Business or Administrative, Negative, Study Design, Invalid Reason, Ethical Reason, Insufficient Data, Insufficient Enrolment, Study Staff Moved, Endpoint Met, Regulatory, Logistics or Resources, Safety and Side Effects, No Context, Success, Interim Analysis and Covid 19.
Data type: Drugs
Evidence scoring: ChEMBL evidence is scored in a 2-step process. In step 1, a score is assigned to every piece of evidence based on the clinical precedence:
| Clinical phase | Score |
| --- | --- |
| Phase I (Early) | 0.05 |
| Phase I | 0.1 |
| Phase II | 0.2 |
| Phase III | 0.7 |
| Phase IV | 1 |
In Step 2, for those clinical trials that have stopped early, the score is down-weighted based on the classification of the reason to stop. In this way, less importance is attributed to evidence of studies that have been stopped due to negative outcomes or safety concerns:
| Reason to stop | Weight |
| --- | --- |
| Negative | 0.5 |
| Safety or side effects | 0.5 |
Direction of Effect assessment:
Assumption of Protective
Source: ChEMBL
References: Mendez, D. et al, 2019
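The two-step ChEMBL scoring described in this section can be sketched as follows. The sketch assumes that the 0.5 value is a multiplicative weight applied once to the phase score when a trial stopped early for a negative or safety-related reason; the text does not spell out the arithmetic. The function name `chembl_score` is hypothetical, and both spellings of the safety class that appear in this document are accepted.

```python
# Step 1: score by clinical phase (values from the text above).
PHASE_SCORE = {
    "phase i (early)": 0.05,
    "phase i": 0.1,
    "phase ii": 0.2,
    "phase iii": 0.7,
    "phase iv": 1.0,
}

# Step 2: stop-reason classes that trigger down-weighting.
DOWN_WEIGHTED_REASONS = {
    "negative",
    "safety or side effects",
    "safety and side effects",
}
DOWN_WEIGHT = 0.5  # assumed to be a multiplicative factor


def chembl_score(phase, stop_reasons=()):
    """Return the phase score, halved if any stop reason is negative or safety-related."""
    score = PHASE_SCORE[phase.lower()]
    if any(reason.lower() in DOWN_WEIGHTED_REASONS for reason in stop_reasons):
        score *= DOWN_WEIGHT
    return score


print(chembl_score("Phase III"))
print(chembl_score("Phase III", ["Negative"]))
print(chembl_score("Phase II", ["Insufficient Enrolment"]))
```

Stop reasons outside the two down-weighted classes, such as insufficient enrolment, leave the phase score unchanged in this sketch.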
Reactome
The Reactome database manually curates and identifies reaction pathways that are affected by a disease. Reactome annotation includes information regarding the causal target - disease link either being a protein coding mutation or an altered expression.
In the Platform, any mutation or altered expression event affecting a different reaction is captured in a different target - disease evidence.
Data type: Pathways & systems biology
Evidence scoring: All manually curated evidence in Reactome has a score of 1.
Source: Reactome
References: Jassal, B. et al, 2020
CRISPR screens
One of the most powerful approaches to uncover gene function is the experimental perturbation of genes followed by the observation of related phenotypes. The perturbation of gene function in human cells has been greatly facilitated by developments in CRISPR technology.
CRISPRbrain is a database for functional genomics screens in differentiated human brain cell types. We have prioritised genome-wide CRISPRi/a/KO screens (healthy vs KO) for integration in the Platform to generate target-disease evidence.
We have linked cell types to diseases that are often characterised by abnormal phenotypes in these cell types - hence the association. If knocking out a gene causes significant perturbation in the cell type, it might indicate a potential targeting strategy in the disease.
Data type: Pathways & systems biology
Evidence scoring: The Platform uses a linearised form of CRISPRbrain's assessment of statistical significance to assign a score, including hits from both the upper and lower ends of the distribution.
Source: CRISPRbrain
Reference: Tian, R et al, 2021
Project Score
Project Score is a Wellcome Sanger Institute resource that aims to identify dependencies in cancer cell lines to guide precision medicine. The project combines gene fitness effects derived from whole-genome CRISPR-Cas9 synthetic-lethality screens with tractability data, genomic biomarkers and various target annotations, enabling a systematic prioritisation of potential targets. The resulting inferences are then mapped from the cancer cell lines in which the experiment is performed to their corresponding tumors.
In the Platform, any Project Score prioritised target with a priority score of at least 36.0 is included as independent evidence; however, pan-cancer dependencies are excluded from the integration.
Data type: Pathways & systems biology
Evidence scoring: Project Score priority score divided by 100.
Source: CRISPR (via Project Score)
References: Pacini et al, 2024
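The Project Score inclusion rule and scoring can be sketched in a few lines. The record fields (`gene`, `disease`, `priority_score`, `pan_cancer`) and placeholder identifiers are hypothetical, and reading the threshold as "greater than or equal to 36" is an assumption.

```python
PRIORITY_THRESHOLD = 36.0  # cut-off quoted in the text


def project_score_evidence(records):
    """Keep cancer-type-specific targets at or above the threshold; score = priority / 100."""
    return [
        {
            "target": r["gene"],
            "disease": r["disease"],
            "score": r["priority_score"] / 100,
        }
        for r in records
        if r["priority_score"] >= PRIORITY_THRESHOLD and not r["pan_cancer"]
    ]


# Placeholder records, for illustration only.
records = [
    {"gene": "GENE_A", "disease": "CANCER_X", "priority_score": 62.0, "pan_cancer": False},
    {"gene": "GENE_B", "disease": "CANCER_X", "priority_score": 20.0, "pan_cancer": False},
    {"gene": "GENE_C", "disease": "PAN_CANCER", "priority_score": 80.0, "pan_cancer": True},
]
print(project_score_evidence(records))
```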
SLAPenrich
SLAPenrich (Sample-population Level Analysis of Pathway enrichments) is a novel statistical framework for the identification of significantly mutated pathways, at the sample population level, in large cohorts of cancer patients. SLAPenrich is based on a Poisson binomial model that takes into account the length of blocks of exons in genes within each pathway, and the background mutation rate of the analysed cohort of patients. SLAPenrich enrichment analysis is based on EMBL-EBI Reactome pathways and mutation data from The Cancer Genome Atlas (TCGA) cohort.
In the Platform, each pathway significantly enriched in tumor-occurring mutations constitutes an individual piece of evidence.
Data type: Pathways & systems biology
Evidence scoring: Scaled enrichment p-value from 0.5 (p = 1e-4) to 1 (p<1e-14).
Source: SLAPenrich
References: Iorio, F. et al, 2018
Gene signatures
The Platform also provides information about key driver genes for specific diseases that have been curated from Systems Biology analysis. These publications present different disease gene signatures as potential key drivers or key regulators causing disease.
Data type: Pathways & systems biology
Evidence scoring: Scoring depends on whether or not the original data contains a score:
p-values and rank-based scores are normalised to the 0.5 - 1 range
If there is no score a fixed value of 0.5 is used
References: Peters, L. A. et al, 2017; Huan, T. et al, 2013; Zhang, B. et al, 2013; Mostafavi, S. et al, 2018
PROGENy
PROGENy (Pathway RespOnsive GENes) is a linear regression model that calculates pathway activity estimates based on consensus transcriptomic gene signatures obtained from perturbation experiments. PROGENy (Schubert et al) provides a framework to systematically compare pathway activities between normal and primary samples from The Cancer Genome Atlas (TCGA).
In the Platform, PROGENy evidence is defined as any significantly regulated sample-level pathway activity inferred from matched normal vs. tumor samples.
Data type: Pathways & systems biology
Evidence scoring: Scaled p-value from 0.5 (p = 1e-4) to 1 (p<1e-14).
Source: PROGENy
References: Schubert, M. et al, 2018
Expression Atlas
The EMBL-EBI Expression Atlas provides a differential expression pipeline aiming to identify genes that are differentially expressed in disease vs control samples. Only contrasts from studies with enough replicates and minimum quality criteria are included in the processing.
For a gene to be considered significantly regulated in a given contrast, all of the following rules must be met:
Absolute log2 fold change > 1
Adjusted p-value <= 0.05
Maximum significant genes probes per contrast = 1000
In the Platform, each contrast from independent studies capturing differentially regulated genes constitutes independent evidence.
Data type: RNA expression
Evidence scoring: Expression Atlas scoring is the product of:
Scaled p-value from 0 (p = 1) to 1 (p<1e-10)
Absolute log2 fold change divided by 10
Percentile rank divided by 100
Source: Expression Atlas
References: Papatheodorou, I. et al, 2020
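The Expression Atlas significance rules and the three-factor score can be sketched as below. Several details are assumptions: the p-value is taken to be the adjusted p-value, its scaling is taken to be linear in -log10(p), and no cap is applied to the fold-change factor. The limit of 1000 significant genes per contrast is not modelled, and the function names are hypothetical.

```python
import math


def is_significant(log2_fold_change, adjusted_p_value):
    """Apply the two per-gene rules listed in the text."""
    return abs(log2_fold_change) > 1 and adjusted_p_value <= 0.05


def scaled_pvalue(p_value):
    """Scale from 0 (p = 1) to 1 (p <= 1e-10), assuming linearity in -log10(p)."""
    if p_value <= 0:
        return 1.0  # avoid log of zero; treat as maximally significant
    return max(0.0, min(1.0, -math.log10(p_value) / 10))


def expression_atlas_score(log2_fold_change, adjusted_p_value, percentile_rank):
    """Product of the three factors described in the text."""
    return (
        scaled_pvalue(adjusted_p_value)
        * (abs(log2_fold_change) / 10)
        * (percentile_rank / 100)
    )


print(is_significant(2.0, 1e-5))
print(expression_atlas_score(2.0, 1e-5, 50))
```

Because the score is a product, a weak value in any one factor pulls the whole score down, so only genes with strong significance, large fold change, and high rank approach the upper end of the range.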
Cancer Gene Census
Cancer Gene Census (CGC) is part of the Wellcome Sanger Institute Catalogue of Somatic Mutations in Cancer (COSMIC). CGC is an effort to catalogue genes which contain mutations that have been causally implicated in cancer. The exhaustive curation of the CGC covers individual studies as well as pan-cancer sequencing efforts, including The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) among others.
In the Platform, CGC evidence is aggregated at the target - disease level to provide a summary of all curated evidence supporting the involvement of a target with a particular cancer type.
Data type: Somatic mutations
Evidence scoring: Scoring is based on the Cancer Gene Census tier system. All evidence has a base score of 0.5. Whereas the score of tier 2 genes is always 0.5, tier 1 scores can be modulated as follows:
| Score adjustment | Condition |
| --- | --- |
| -0.25 | Only 1 mutated sample |
| +0.25 | Gene mutated more frequently in particular disease compared to other diseases |
| +0.25 | Mutations in gene occur more frequently than in other genes of similar length in the same disease |
Source: Cancer Gene Census
References: Sondka, Z. et al, 2018
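The tier-based Cancer Gene Census scoring can be sketched as a base score with optional adjustments for tier 1 genes. The function and parameter names are hypothetical, and the sketch assumes the adjustments are simply summed.

```python
BASE_SCORE = 0.5  # base score for all evidence, as stated in the text


def cgc_score(tier, single_mutated_sample=False,
              enriched_in_disease=False,
              more_frequent_than_similar_genes=False):
    """Return the evidence score; only tier 1 genes receive adjustments."""
    if tier == 2:
        return BASE_SCORE
    if tier != 1:
        raise ValueError("tier must be 1 or 2")
    score = BASE_SCORE
    if single_mutated_sample:
        score -= 0.25  # only 1 mutated sample
    if enriched_in_disease:
        score += 0.25  # mutated more often in this disease than in others
    if more_frequent_than_similar_genes:
        score += 0.25  # mutated more often than genes of similar length
    return score


print(cgc_score(1, enriched_in_disease=True, more_frequent_than_similar_genes=True))
print(cgc_score(1, single_mutated_sample=True))
print(cgc_score(2, enriched_in_disease=True))
```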
IntOGen
IntOGen provides a framework to identify potential cancer driver genes using large-scale mutational data from sequenced tumor samples. By harmonising tumor sequencing data from the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes (PCAWG) and other comprehensive efforts, IntOGen aims to provide a consensus assessment of cancer driver genes. Several state-of-the-art driver methodologies aiming to cover different approaches (e.g. dN/dS, Hotspots, etc.) are included to finally produce a consensus q-value for each driver gene in every tumor.
In the Platform, independent target - disease evidence is defined as any significant driver gene detected in any individual cohort. Information regarding the individual driver methods is also provided within each piece of evidence.
Data type: Somatic mutations
Evidence scoring: Scaled combined q-values from 0.25 (q = 0.1) to 1 (q < 1e-10).
Source: intOGen
References: Martínez-Jiménez, F. et al, 2020
ClinVar (somatic)
ClinVar is an NIH public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. The ClinVar (somatic) data source in the Open Targets Platform captures the subset of ClinVar that refers to somatic variants (as opposed to germline variants).
Information on variants is covered extensively for both single point and structural variants. When available, genomic coordinates are reported with RS numbers, or by following the CHROM_POS_REF_ALT and HGVS notations.
Each piece of evidence in the Platform aims to capture an individual RCV record in ClinVar.
Data type: Somatic mutations
Evidence scoring: ClinVar evidence is scored in a 2-step process. In step 1, a score is assigned to every piece of evidence based on the clinical significance:
| Clinical significance | Score |
| --- | --- |
| association not found | 0 |
| benign | 0 |
| not provided | 0 |
| likely benign | 0 |
| conflicting interpretations of pathogenicity | 0.3 |
| other | 0.3 |
| uncertain significance | 0.3 |
| risk factor | 0.5 |
| affects | 0.5 |
| likely pathogenic | 0.7 |
| association | 0.9 |
| drug response | 0.9 |
| protective | 0.9 |
| pathogenic | 0.9 |
In Step 2, the score is modulated based on the ClinVar review status:
| Review status | Score modifier |
| --- | --- |
| no assertion provided | +0 |
| no assertion criteria provided | +0 |
| no assertion for the individual variant | +0 |
| criteria provided, single submitter | +0.02 |
| criteria provided, conflicting interpretations | +0.02 |
| criteria provided, multiple submitters, no conflicts | +0.05 |
| reviewed by expert panel | +0.07 |
| practice guideline | +0.1 |
Direction of Effect assessment:
LoF variants
Source: ClinVar (via European Variation Archive)
References: Cezard T. et al, 2021; Shen A. et al, 2024; Landrum, M. et al, 2014; Landrum, M. et al, 2020
Europe PMC
The EMBL-EBI Europe PubMed Central (Europe PMC) enables access to a worldwide collection of life science publications and preprints from trusted sources. The Europe PMC data source aims to identify target - disease co-occurrences in the literature and provide an assessment of the confidence of the relationship. This pipeline uses deep-learning-based Named Entity Recognition (NER) to identify genes/proteins and diseases mentioned in the text, and then normalises them to the target or disease/phenotype entities in the Platform. All co-occurrences of both types of entities in the same sentence are considered evidence.
In the Platform, a piece of Europe PMC evidence is the result of aggregating all co-occurrences of the same target and disease within the same publication.
Data type: Text mining
Evidence scoring: Score based on weighted document sections, sentence locations, and title for full text articles and abstracts as described in Kafkas et al., 2017. The aggregated scores of each gene/disease co-occurrence in the publication are further normalised between 0 and 1.
Source: Europe PMC
References: The Europe PMC Consortium, 2015; Kafkas et al., 2017
IMPC
The genotype-phenotype associations made available by the International Mouse Phenotyping Consortium (IMPC) are used to identify models of human disease based on phenotypic similarity scores.
The Wellcome Sanger Institute PhenoDigm is an algorithm aimed at capturing the similarity between a knockout mouse and the clinical manifestations (phenotype) of a human disease. The premise is that if a gene knockout causes an equivalent phenotype in mouse, the human counterpart is likely to be related to the cause of the disease.
It uses a semantic approach to map between clinical features observed in humans and mouse phenotype annotations. The phenotypic effects in mice are then mapped to phenotypes associated with human diseases. The matches are identified and a similarity score between a mouse model and a human disease is computed.
Data type: Animal models
Evidence scoring: The evidence score indicates the degree of concordance between the mouse and disease phenotypes, as described by Smedley et al 2013.
Direction of Effect assessment:
Assumption of all variants LoF
Assumption of Risk
Source: IMPC
References: Smedley, D. et al, 2013
Cancer Biomarkers
One of the aims of the Cancer Genome Interpreter is to identify how variations in the tumour genome may influence its response to anti-cancer therapies. The Cancer Biomarkers database features biomarkers of drug sensitivity, resistance, and toxicity for drugs targeting specific targets in cancer, curated by clinical and scientific experts in precision oncology, and classified by cancer type.
Data type: Pathways & systems biology
Evidence scoring: All manually curated evidence in Cancer Biomarkers has a score of 1.
Source: Cancer Genome Interpreter
References: Tamborero, D. et al, 2018