
Data pipeline

The Open Targets data pipeline is a complex process orchestrated in Apache Airflow. It is divided into three stages: data acquisition, transformation, and data output.


Introduction

The data pipeline is composed of multiple elements:

  1. Data and evidence generation processes

  2. Input stage

  3. Transformation stage and ETL processes

  4. Output stage

  5. Gentropy-specific processes

  6. Orchestration
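The elements listed above depend on one another, and the Orchestration section of this page describes how steps are combined into directed acyclic graphs (DAGs). The sketch below illustrates that idea only; it is not how the orchestration repository defines its DAGs and it does not use Airflow. The step names and their dependencies are hypothetical, chosen to mirror the list above, and the ordering relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical step names and dependencies, for illustration only.
# Each key is a step; its value is the set of steps that must finish first.
STEPS = {
    "generate_evidence": set(),
    "acquire_inputs": {"generate_evidence"},
    "run_gentropy": {"acquire_inputs"},
    "run_etl": {"acquire_inputs", "run_gentropy"},
    "write_outputs": {"run_etl"},
}


def execution_order(steps):
    """Return one valid execution order for a DAG of steps.

    Raises graphlib.CycleError if the dependencies contain a cycle,
    i.e. if the graph is not actually acyclic.
    """
    return list(TopologicalSorter(steps).static_order())


if __name__ == "__main__":
    for position, step in enumerate(execution_order(STEPS), start=1):
        print(f"{position}. {step}")
```

With these assumed dependencies there is only one valid order, so the script lists the steps from `generate_evidence` through `write_outputs`. A real orchestrator such as Airflow adds scheduling, retries, and parallel execution on top of this dependency ordering.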

GitHub repositories

Data and evidence

  • curation: Open Targets curation repository

  • evidence_datasource_parsers: internal pipelines used to generate evidence

  • json_schema: evidence object schema used for evidence and association scoring

  • OnToma: Python module to map disease or phenotype terms to EFO

Gentropy

  • gentropy: Open Targets' genomics toolkit

See here for more info on the Gentropy pipelines.

Orchestration

  • orchestration: Open Targets data pipelines orchestrator

The orchestration runs on Apache Airflow, using Google Cloud as the cloud resource provider. The orchestration logic is built from individual steps, which are combined into directed acyclic graphs (DAGs).

See detailed orchestration documentation here.

Unified pipeline

The Platform ETL (“extract, transform, and load”) and the Genetics ETL used to be separate processes, but they are now merged into a single pipeline. This means that the data produced for both the Genetics ETL and the Platform are released at the same time. Herein, we refer to this joint pipeline as the "unified pipeline".

The unified pipeline uses many static assets, such as Open Targets related data and the data needed to run the Genetics ETL.

  • otter: Open Targets' Task ExecutoR, i.e. scripts that process and prepare data for our ETL pipelines

  • platform-etl-backend: ETL pipelines to generate associations, evidence, and entity indices

  • platform-etl-openfda-faers: ETL pipeline to process Open FDA adverse events data

  • platform-etl-literature: ETL pipeline to generate similar entities and publications

  • platform-output-support: scripts for infrastructure tasks and generating a Platform release

If you have further questions, please get in touch with us on the Open Targets Community.
Figure: Schematic overview of Open Targets pipelines