Enhancer-to-Gene (ENCODE rE2G)

Enhancer-to-gene scores are ingested directly from the ENCODE consortium, which uses a mixture of epigenetic datasets to connect genomic regions (defined by chromosome, start and end) to putative target genes. These regions are likely regulatory elements that affect the transcriptional activity of their annotated genes.

The input datasets are based on publicly available resources from the ENCODE project. These include:

  • histone modification ChIP-seq

  • open chromatin DNase-seq

  • ATAC-seq

  • 3D chromatin conformation determined by Hi-C

These inputs are normalised and used as features in a machine learning approach described in the original rE2G publication. The final output connects potential regulatory regions to their target genes, with a score from 0 to 1 that indicates the confidence of each assignment.

rE2G in the platform

Enhancer-to-gene scores from rE2G are ingested into the Open Targets ecosystem through the orchestration and can be:

  • browsed through the variant page, where the regulatory regions overlapping a given variant of interest are displayed in the enhancer-to-gene widget.

  • browsed through the credible set page, which shows the overlapping rE2G scores for the lead variant of the credible set.

  • viewed in the form of e2gMean and e2gNeighbourhoodMean features in the L2G predictions.
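
The overlap lookup behind the variant and credible set views can be illustrated with a minimal sketch. This is not the platform's implementation: the `Interval` record, the `overlapping_intervals` helper, the half-open coordinate convention and the gene identifiers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical, simplified region-to-gene link (not the gentropy schema)."""
    chromosome: str
    start: int   # assumed 0-based, inclusive
    end: int     # assumed exclusive
    gene_id: str
    score: float


def overlapping_intervals(intervals, chromosome, position):
    """Return the links whose region contains the given variant position."""
    return [
        iv for iv in intervals
        if iv.chromosome == chromosome and iv.start <= position < iv.end
    ]


# Toy data with placeholder gene identifiers.
intervals = [
    Interval("1", 1000, 1500, "GENE_A", 0.82),
    Interval("1", 1400, 2000, "GENE_B", 0.65),
    Interval("2", 1000, 1500, "GENE_C", 0.91),
]

hits = overlapping_intervals(intervals, "1", 1450)
for iv in hits:
    print(iv.gene_id, iv.score)
```

A linear scan is enough to show the idea. At genome scale an interval index or a distributed range join would be the more realistic choice.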

Applied transformations

The latest version of the dataset is downloaded using the ENCODE API and transformed to the gentropy Interval format. The mapping of the original biosamples provided by the source was based on manual curation. We then apply post-transformation quality controls and a flagging system that includes:

  • Validation of gene identifiers from input against latest Ensembl version

  • Validation of biosample identifiers against the latest Biosample dataset

  • Stringent score filtering

We have implemented a stringent filter of 0.6 on the rE2G score to reduce computational cost and redundancy. The threshold was selected based on an analysis of rE2G overlaps with eQTL credible sets.
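
The three checks listed above could be sketched roughly as follows. This is an illustrative sketch, not the gentropy code: the record layout, the flag names and the `qc_flags` helper are hypothetical, and it assumes that scores equal to or above 0.6 pass the filter, which the text does not state explicitly.

```python
SCORE_THRESHOLD = 0.6  # threshold stated in the text


def qc_flags(record, known_genes, known_biosamples, threshold=SCORE_THRESHOLD):
    """Return a list of hypothetical QC flags for one region-to-gene record."""
    flags = []
    if record["gene_id"] not in known_genes:
        flags.append("UNKNOWN_GENE")
    if record["biosample_id"] not in known_biosamples:
        flags.append("UNKNOWN_BIOSAMPLE")
    # Assumption: a score exactly at the threshold is kept.
    if record["score"] < threshold:
        flags.append("SCORE_BELOW_THRESHOLD")
    return flags


# Toy reference sets and records with placeholder identifiers.
known_genes = {"GENE_A", "GENE_B"}
known_biosamples = {"BIOSAMPLE_1"}
records = [
    {"gene_id": "GENE_A", "biosample_id": "BIOSAMPLE_1", "score": 0.82},
    {"gene_id": "GENE_X", "biosample_id": "BIOSAMPLE_1", "score": 0.75},
    {"gene_id": "GENE_A", "biosample_id": "BIOSAMPLE_9", "score": 0.40},
]

flagged = [(r, qc_flags(r, known_genes, known_biosamples)) for r in records]
passing = [r for r, flags in flagged if not flags]
```

Keeping the flags alongside each record, rather than dropping rows immediately, makes it possible to report why an entry was excluded.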

Effect of rE2G score filtering on gene prioritisation for eQTL credible sets. True positives (TP) are defined as the eQTL target gene. Sensitivity (orange) is TP recall, and FDR (blue) is 1 − precision among retained credible set-gene pairs. Points are thresholds labelled by percentiles (Px) of the rE2G score distribution. Moderate filtering removes a large share of raw rE2G entries while retaining most TP assignments. FDR changes little until very aggressive cutoffs, indicating that many non-TP, but potentially interesting, gene links persist.
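
The two metrics in the figure can be reproduced on toy data with a short sketch. It follows the definitions in the caption, but the function name, the input layout and the example scores are hypothetical, and it assumes that sensitivity is measured against all true positives present before filtering.

```python
def sensitivity_and_fdr(pairs, threshold):
    """Compute (sensitivity, FDR) for pairs retained at a score threshold.

    pairs: list of (score, is_true_positive) credible set-gene pairs.
    """
    total_tp = sum(1 for _, is_tp in pairs if is_tp)
    retained = [(score, is_tp) for score, is_tp in pairs if score >= threshold]
    retained_tp = sum(1 for _, is_tp in retained if is_tp)

    # Sensitivity: fraction of all true positives that survive the filter.
    sensitivity = retained_tp / total_tp if total_tp else float("nan")
    # FDR: 1 - precision among the retained pairs.
    fdr = 1 - retained_tp / len(retained) if retained else float("nan")
    return sensitivity, fdr


# Toy pairs: (rE2G score, whether the gene is the eQTL target gene).
pairs = [
    (0.9, True), (0.8, False), (0.7, True),
    (0.5, False), (0.3, True), (0.2, False),
]

for threshold in (0.0, 0.6):
    sens, fdr = sensitivity_and_fdr(pairs, threshold)
    print(f"threshold={threshold}: sensitivity={sens:.2f}, FDR={fdr:.2f}")
```

Sweeping the threshold over percentiles of the score distribution, as in the figure, would trace out the two curves.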
