Enhancer-to-gene

Enhancer-to-gene scores are ingested directly from the E2G consortium, which uses a mixture of epigenetic datasets to connect genomic regions (defined by chromosome, start and end) to their putative target genes. These regions are likely regulatory elements that affect the transcriptional activity of the genes they are connected to.

The input datasets are based on publicly available resources from the ENCODE project. These include histone modification ChIP-seq, open chromatin DNase-seq and ATAC-seq, and 3D chromatin conformation determined by Hi-C. The inputs are normalised and used as features in a machine learning approach described in the original E2G publication. The final output connects potential regulatory regions to their target genes, along with a score from 0 to 1 that indicates the confidence of a given assignment.

Enhancer-to-gene scores from E2G are ingested into the Open Targets ecosystem and can currently be browsed through the variant page, where the regulatory regions overlapping a given variant of interest are displayed in the enhancer-to-gene widget. To reduce computational costs and redundancy, we apply a stringent score threshold of 0.6 to the E2G dataset. This threshold was chosen based on an analysis of the overlap between E2G regions and eQTL credible sets.
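The filtering and variant lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the Open Targets implementation: the `E2GRecord` class, its field names and both helper functions are hypothetical, the comparison is assumed to be inclusive (score of at least 0.6), region coordinates are assumed to be closed intervals, and the example records are invented toy values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class E2GRecord:
    """One region-to-gene link (hypothetical schema for illustration)."""
    chromosome: str
    start: int
    end: int
    gene_id: str
    score: float  # confidence of the assignment, from 0 to 1


# Threshold stated in the text; whether the real filter is >= or > is an assumption.
SCORE_THRESHOLD = 0.6


def filter_by_score(records, threshold=SCORE_THRESHOLD):
    """Keep only links whose score is at least the threshold."""
    return [r for r in records if r.score >= threshold]


def regions_overlapping_variant(records, chromosome, position):
    """Return links whose region contains the variant position (closed interval assumed)."""
    return [
        r for r in records
        if r.chromosome == chromosome and r.start <= position <= r.end
    ]


# Toy records with made-up coordinates, gene identifiers and scores.
toy_records = [
    E2GRecord("1", 1000, 1500, "GENE_A", 0.82),
    E2GRecord("1", 1200, 1800, "GENE_B", 0.45),
    E2GRecord("2", 1000, 1500, "GENE_C", 0.91),
]

retained = filter_by_score(toy_records)
hits = regions_overlapping_variant(retained, "1", 1300)
print([h.gene_id for h in hits])
```

Filtering before the overlap lookup mirrors the order described in the text: low-scoring links are dropped once at ingestion, so every later lookup works on the smaller dataset.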

Effect of E2G score filtering on gene prioritisation for eQTL credible sets. A true positive (TP) is a link to the eQTL target gene. Sensitivity (orange) is the recall of TPs, and FDR (blue) is 1 − precision among the retained credible set–gene pairs. Points are thresholds labelled by percentiles (Px) of the E2G score distribution. Moderate filtering removes a large share of the raw E2G entries while retaining most TP assignments. FDR changes little until very aggressive cutoffs are applied, indicating that many non-TP (but potentially interesting) gene links persist.
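The two metrics in the figure can be computed as in the sketch below. It is an illustration of the definitions given in the caption rather than the code used for the analysis: the function names, the `(score, is_true_positive)` input layout, the inclusive comparison and the toy values are all assumptions.

```python
import statistics


def sensitivity_and_fdr(pairs, threshold):
    """Compute (sensitivity, FDR) for credible set-gene pairs kept at a threshold.

    pairs: iterable of (score, is_true_positive) tuples (hypothetical layout).
    Sensitivity is the fraction of all TPs that are retained; FDR is
    1 - precision among the retained pairs.
    """
    pairs = list(pairs)
    total_tp = sum(1 for _, is_tp in pairs if is_tp)
    retained = [(score, is_tp) for score, is_tp in pairs if score >= threshold]
    retained_tp = sum(1 for _, is_tp in retained if is_tp)
    sensitivity = retained_tp / total_tp if total_tp else float("nan")
    fdr = 1 - retained_tp / len(retained) if retained else float("nan")
    return sensitivity, fdr


def percentile_threshold(scores, percentile):
    """Score at a given integer percentile (1 to 99) of the score distribution."""
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    return cut_points[percentile - 1]


# Toy pairs with made-up scores and labels.
toy_pairs = [(0.9, True), (0.8, False), (0.7, True), (0.4, True), (0.2, False)]

sens, fdr = sensitivity_and_fdr(toy_pairs, threshold=0.6)
p50 = percentile_threshold([score for score, _ in toy_pairs], 50)
print(sens, fdr, p50)
```

Sweeping `percentile_threshold` over a range of percentiles and calling `sensitivity_and_fdr` at each one would give the kind of curve the figure describes.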

The enhancer-to-gene widget is also available on the credible set page, where it shows the E2G scores for regions overlapping the lead variant of the credible set. Future work will integrate the E2G dataset into the Locus-to-Gene (L2G) pipeline as additional features for gene prioritisation.
