Download datasets


To support more complex and systematic queries, we provide all datasets as data downloads.

A list of all datasets is available on the Platform Data Downloads page.

All Platform datasets are available as a distributed collection of data, meaning that each dataset is a directory containing a set of partitioned files. We currently produce our datasets in Parquet format, which allows us to expose nested information in a machine-readable way.
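For illustration, a downloaded dataset directory can be inspected before reading; a minimal sketch in Python (the local path and file names are hypothetical):

from pathlib import Path

# each dataset is a directory containing partitioned Parquet files
for part in sorted(Path("disease").glob("*.parquet")):
    print(part.name)  # e.g. part-00000-<...>.snappy.parquet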

Archive datasets, as well as input files and other secondary products, are also made available on the FTP server and in Google Cloud Platform.

Below, we describe how to download, access and query this information in a step-by-step guide.

Download

Below is a walkthrough on how to download the disease dataset from the 25.03 release in Parquet format using different approaches.

We recommend lftp as a command-line client; when using tools like wget, curl, etc., use https:// rather than ftp://.

Using rsync

rsync is a command-line tool for efficiently transferring and synchronising files between local and remote systems.

rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/disease .

Using wget

wget is a command-line tool that retrieves content from web servers and is widely available on Unix systems.

wget --recursive --no-parent --no-host-directories --cut-dirs 8 \
https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/disease

Using Google Cloud Platform (paywalled after 1TB)

Users with a Google Cloud Platform account can download the datasets through the Google Cloud Console or using the gsutil command-line tool.

gsutil -m cp -r gs://open-targets-data-releases/25.03/output/disease .

Other ways to access data

If you are using a non-Linux or non-Unix machine (e.g. Windows), you can access our FTP service using an FTP client such as FileZilla or the Windows ftp command. For more information, including tips and workarounds, see the Community Windows ftp thread.

Accessing and querying datasets

To read the information available in the partitioned datasets, there is no need to manipulate or concatenate files. Datasets can be read directly using the dataset path.

The following scripts use Apache Spark (PySpark in Python, sparklyr in R) to read and query the dataset with modern functional programming approaches; these packages need to be installed in their respective environments. As a proof of concept, the scripts use the ClinVar evidence provided by the European Variation Archive and show how to:

  • Read a dataset

  • Explore the schema of the dataset

  • Select a subset of information (columns)

  • Display the information

First of all, the dataset needs to be downloaded as described in the previous section. For simplicity, only the EVA evidence is downloaded here, but all evidence can be downloaded at once using the same approach.

gsutil -m cp -r gs://open-targets-data-releases/25.03/output/evidence/sourceId=eva .

The next query displays only six fields of the ClinVar evidence, but other non-null values are available. The schema is the best way to explore what is available and to query the most relevant information. All Platform evidence datasets share the same schema, so there will be a long list of fields that might not be informative for ClinVar but will be relevant when querying other data sources.
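Once the dataset is loaded as evd (see the scripts below), one quick way to see which fields are actually populated for a given source is to count non-null values per column; a minimal sketch:

import pyspark.sql.functions as F

# count non-null values per column; sparsely populated fields
# are unlikely to be informative for this particular source
evd.select([F.count(c).alias(c) for c in evd.columns]).show(vertical=True)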

Dealing with nested information can sometimes be tedious. The Platform aims to minimise the nestedness of the data; however, some level of structure is sometimes required. Spark provides a series of functions for dealing with complex nested information, and the scripts below show how the clinicalSignificances array is flattened using the explode function.

Once loaded into Python or R, the user can decide to continue using Spark, write the output to a file or use alternative libraries to process the information (e.g. pandas, tidyverse, etc.).

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# path to ClinVar (EVA) evidence dataset 
# directory stored on your local machine
evidencePath = "local directory path - e.g. /User/downloads/sourceId=eva"

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

# read evidence dataset
evd = spark.read.parquet(evidencePath)

# Browse the evidence schema
evd.printSchema()

# select fields of interest
evdSelect = (evd
 .select("targetId",
         "diseaseId",
         "variantRsId",
         "studyId",
         F.explode("clinicalSignificances").alias("cs"),
         "confidence")
 )
evdSelect.show()

# +---------------+--------------+-----------+------------+--------------------+--------------------+
# |       targetId|     diseaseId|variantRsId|     studyId|                  cs|          confidence|
# +---------------+--------------+-----------+------------+--------------------+--------------------+
# |ENSG00000153201|Orphanet_88619|rs773278648|RCV001042548|uncertain signifi...|criteria provided...|
# |ENSG00000115718|  Orphanet_745|       null|RCV001134697|uncertain signifi...|criteria provided...|
# |ENSG00000107147|    HP_0001250|rs539139475|RCV000720408|       likely benign|criteria provided...|
# |ENSG00000175426|Orphanet_71528|rs142567487|RCV000292648|uncertain signifi...|criteria provided...|
# |ENSG00000169174|   EFO_0004911|rs563024336|RCV000375546|uncertain signifi...|criteria provided...|
# |ENSG00000140521|  Orphanet_298|rs376306906|RCV000763992|uncertain signifi...|criteria provided...|
# |ENSG00000134982|   EFO_0005842| rs74627407|RCV000073743|               other|no assertion crit...|
# |ENSG00000187498| MONDO_0008289|rs146288748|RCV001111533|uncertain signifi...|criteria provided...|
# |ENSG00000116688|Orphanet_64749|rs119103265|RCV000857104|uncertain signifi...|no assertion crit...|
# |ENSG00000133812|Orphanet_99956|rs562275980|RCV000367609|uncertain signifi...|criteria provided...|
# +---------------+--------------+-----------+------------+--------------------+--------------------+
# only showing top 10 rows

# Convert to a Pandas Dataframe
evdSelect.toPandas()
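As noted above, instead of collecting the results you can also write them back to disk with Spark; a minimal sketch (the output path is illustrative):

# write the selected columns out as a new Parquet dataset
evdSelect.write.mode("overwrite").parquet("eva_selected")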
The same workflow in R uses sparklyr:

library(dplyr)
library(sparklyr)
library(sparklyr.nested)

## path to ClinVar (EVA) evidence dataset 
## directory stored on your local machine
evidencePath <- "local directory path - e.g. /User/downloads/sourceId=eva"

## establish connection
sc <- spark_connect(master = "local")

## read evidence dataset
evd <- spark_read_parquet(sc,
                          path = evidencePath)
## Browse the evidence schema
columns <- evd %>%
  sdf_schema() %>%
  lapply(function(x) do.call(tibble, x)) %>%
  bind_rows()

## select fields of interest
evdSelect <- evd %>%
  select(targetId,
         diseaseId,
         variantRsId,
         studyId,
         clinicalSignificances,
         confidence) %>%
  sdf_explode(clinicalSignificances)

##  # Source: spark<?> [?? x 6]
##    targetId   diseaseId   variantRsId studyId  clinicalSignific… confidence     
##    <chr>      <chr>       <chr>       <chr>    <chr>             <chr>          
##  1 ENSG00000… Orphanet_8… rs773278648 RCV0010… uncertain signif… criteria provi…
##  2 ENSG00000… Orphanet_7… NA          RCV0011… uncertain signif… criteria provi…
##  3 ENSG00000… HP_0001250  rs539139475 RCV0007… likely benign     criteria provi…
##  4 ENSG00000… Orphanet_7… rs142567487 RCV0002… uncertain signif… criteria provi…
##  5 ENSG00000… EFO_0004911 rs563024336 RCV0003… uncertain signif… criteria provi…
##  6 ENSG00000… Orphanet_2… rs376306906 RCV0007… uncertain signif… criteria provi…
##  7 ENSG00000… EFO_0005842 rs74627407  RCV0000… other             no assertion c…
##  8 ENSG00000… MONDO_0008… rs146288748 RCV0011… uncertain signif… criteria provi…
##  9 ENSG00000… Orphanet_6… rs119103265 RCV0008… uncertain signif… no assertion c…
## 10 ENSG00000… Orphanet_9… rs562275980 RCV0003… uncertain signif… criteria provi…
## # … with more rows

# Convert to a dplyr tibble
evdSelect %>%
  collect()

File formats

The Open Targets data generation pipeline produces outputs only in the Parquet file format; it no longer produces JSON outputs. Parquet was chosen for its favourable features (a short reading example follows the list):

  • built-in schema and data typing

  • size-efficiency when compressed

  • efficient reading

  • the wide availability of interfaces with most dataframe libraries
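Because of this wide support, Spark is not required to read the datasets. A minimal sketch using pandas with the pyarrow engine (both assumed installed; the path is illustrative):

import pandas as pd

# pandas (via pyarrow) reads all part files in the dataset directory at once
diseases = pd.read_parquet("disease/", engine="pyarrow")
print(diseases.shape)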

From release 25.03 onwards, the data download paths have changed, since only the Parquet file format is now produced; there are also minor changes to dataset names (now snake_case and singular). More details can be found here. For example:

  • Previous releases (up to 24.09): https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/24.09/output/etl/parquet/associationByOverallDirect/

  • 25.03 release (and thereafter): https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/association_by_datasource_direct/

If you are new to Parquet and switching over from JSON, the change should be simple and your pipeline should be faster at reading the data. There are examples of Parquet file readers in popular dataframe libraries, including R, Spark, Polars, and pandas. These readers are typically built on the Apache Arrow library, which itself has APIs in many languages should you need them.

If you don’t wish to read the data into dataframes and instead want newline-delimited JSON, Open Targets has an in-house tool, p2j (Python), for converting Parquet to newline-delimited JSON.
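If p2j is not an option, the same conversion can also be sketched with pandas (paths are illustrative):

import pandas as pd

# read the Parquet dataset and write it back out as newline-delimited JSON
df = pd.read_parquet("disease/")
df.to_json("disease.json", orient="records", lines=True)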

Tutorials and how-to guides

For more information on how to access and work with our data downloads, including example scripts based on actual use cases and research questions, check out the Open Targets Community.
