Download datasets

To support more complex and systematic queries, we provide all datasets as data downloads.

A list of all datasets is available on the Platform Data Downloads page.

All Platform datasets are available as a distributed collection of data. This means that each dataset is a directory containing a set of partitioned files. Currently, we produce our datasets in Parquet. This format allows us to expose nested information in a machine-readable way.

Archive datasets, as well as input files and other secondary products, are also available on the FTP server and on Google Cloud Platform.

Below, we describe how to download, access and query this information in a step-by-step guide.

Download

Below is a walkthrough on how to download the disease dataset from the 25.03 release in Parquet format using different approaches.

If you use a command-line FTP client, we recommend lftp. With tools such as wget or curl, use https:// URLs rather than ftp:// URLs.

Using rsync

rsync is a command-line tool for efficiently transferring and synchronising files between two locations, for example a remote server and your local machine.

rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/disease .

Using wget

wget is a command-line tool that retrieves content from web servers and is widely available on Unix systems.

wget --recursive --no-parent --no-host-directories --cut-dirs 8 \
https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/disease/

Using Google Cloud Platform (paywalled after 1TB)

Users with a Google Cloud Platform account can download the datasets through the Google Cloud Console or with the gsutil command-line tool.

gsutil -m cp -r gs://open-targets-data-releases/25.03/output/disease .

Other ways to access data

If you are using a non-Linux or non-Unix machine (e.g. Windows), you can access our FTP service using an FTP client like FileZilla or the Windows ftp command. For more information, including tips and workarounds, see the Community Windows ftp thread.

Accessing and querying datasets

To read the information available in the partitioned datasets, there is no need to manipulate or concatenate files. Datasets can be read directly using the dataset path.
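Before reading a dataset, it can help to confirm what the downloaded directory actually contains. The following is a minimal sketch using only the Python standard library. It assumes the part files end in `.parquet` and may sit in partition subdirectories (such as `sourceId=eva`); the function name `summarise_dataset` is made up for this illustration.

```python
from pathlib import Path


def summarise_dataset(dataset_dir):
    """Return (number of Parquet part files, total size in bytes) under dataset_dir."""
    root = Path(dataset_dir)
    # rglob also descends into partition subdirectories such as sourceId=eva
    parts = sorted(p for p in root.rglob("*.parquet") if p.is_file())
    total_bytes = sum(p.stat().st_size for p in parts)
    return len(parts), total_bytes
```

For example, `summarise_dataset("disease")` returns the number of part files and their combined size for a dataset downloaded into a local `disease` directory. The directory itself, not any individual part file, is the path to pass to a Parquet reader.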

The following scripts provide a proof-of-concept example using the ClinVar evidence provided by the European Variation Archive (EVA). They show how to:

  • Read a dataset

  • Explore the schema of the dataset

  • Select a subset of information (columns)

  • Display the information

First, download the dataset as described in the previous section. For simplicity, only the EVA evidence is downloaded here, but all evidence can be downloaded at once using the same approach.

gsutil -m cp -r gs://open-targets-data-releases/25.03/output/evidence/sourceId=eva .

The next scripts make use of Apache Spark (PySpark or Sparklyr) to read and query the dataset using modern functional programming approaches. These packages need to be installed in their respective environments.

The following query displays only six fields of the ClinVar evidence, but other non-null fields are available. The schema is the best way to explore what is available and to find the most relevant information. All Platform evidence shares the same schema, so the list includes many fields that are not informative for ClinVar but are relevant when querying other data sources.

Dealing with nested information can sometimes be tedious. The Platform aims to minimise the nesting of the data, but some level of structure is sometimes required. Spark provides a series of functions for working with complex nested information. The scripts show how the clinicalSignificances array is flattened using the explode function.

Once loaded into Python or R, the user can decide to continue using Spark, write the output to a file or use alternative libraries to process the information (e.g. pandas, tidyverse, etc.).

from pyspark import SparkConf
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# path to ClinVar (EVA) evidence dataset 
# directory stored on your local machine
evidencePath = "local directory path - e.g. /User/downloads/sourceId=eva"

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

# read evidence dataset
evd = spark.read.parquet(evidencePath)

# Browse the evidence schema
evd.printSchema()

# select fields of interest
evdSelect = (
    evd
    .select(
        "targetId",
        "diseaseId",
        "variantRsId",
        "studyId",
        # explode produces one row per element of the array
        F.explode("clinicalSignificances").alias("cs"),
        "confidence",
    )
)
evdSelect.show()

# +---------------+--------------+-----------+------------+--------------------+--------------------+
# |       targetId|     diseaseId|variantRsId|     studyId|                  cs|          confidence|
# +---------------+--------------+-----------+------------+--------------------+--------------------+
# |ENSG00000153201|Orphanet_88619|rs773278648|RCV001042548|uncertain signifi...|criteria provided...|
# |ENSG00000115718|  Orphanet_745|       null|RCV001134697|uncertain signifi...|criteria provided...|
# |ENSG00000107147|    HP_0001250|rs539139475|RCV000720408|       likely benign|criteria provided...|
# |ENSG00000175426|Orphanet_71528|rs142567487|RCV000292648|uncertain signifi...|criteria provided...|
# |ENSG00000169174|   EFO_0004911|rs563024336|RCV000375546|uncertain signifi...|criteria provided...|
# |ENSG00000140521|  Orphanet_298|rs376306906|RCV000763992|uncertain signifi...|criteria provided...|
# |ENSG00000134982|   EFO_0005842| rs74627407|RCV000073743|               other|no assertion crit...|
# |ENSG00000187498| MONDO_0008289|rs146288748|RCV001111533|uncertain signifi...|criteria provided...|
# |ENSG00000116688|Orphanet_64749|rs119103265|RCV000857104|uncertain signifi...|no assertion crit...|
# |ENSG00000133812|Orphanet_99956|rs562275980|RCV000367609|uncertain signifi...|criteria provided...|
# +---------------+--------------+-----------+------------+--------------------+--------------------+
# only showing top 10 rows

# Convert to a pandas DataFrame
evdSelect.toPandas()
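To make the explode step in the script above more concrete, here is a plain-Python sketch of what it does to each row. It is an illustration only, not Spark code: the helper `explode_rows` and the sample records are made up. Like Spark's explode, it emits one output row per array element and emits nothing for rows whose array is missing or empty.

```python
def explode_rows(rows, array_field, alias):
    """Yield one flat row per element of each row's array_field.

    Rows whose array is missing, None, or empty produce no output rows.
    """
    for row in rows:
        values = row.get(array_field) or []
        for value in values:
            # Copy every other field and add the single array element
            flat = {k: v for k, v in row.items() if k != array_field}
            flat[alias] = value
            yield flat


# Made-up records shaped like a subset of the selected evidence fields
records = [
    {"targetId": "ENSG_A", "clinicalSignificances": ["benign", "likely benign"]},
    {"targetId": "ENSG_B", "clinicalSignificances": []},
    {"targetId": "ENSG_C", "clinicalSignificances": ["pathogenic"]},
]

for flat_row in explode_rows(records, "clinicalSignificances", "cs"):
    print(flat_row)
```

The three input records yield three flat rows: two for the first record, none for the second, and one for the third.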

File formats

The Open Targets data generation pipeline produces outputs only in Parquet format; it no longer produces JSON. Parquet was chosen for favourable features such as:

  • built-in schema and data typing

  • size-efficiency when compressed

  • efficient reading

  • the wide availability of interfaces with most dataframe libraries

If you are new to Parquet and switching over from JSON, the change should be simple and your pipeline should read the data faster. Popular dataframe libraries such as those in R, Spark, Polars and pandas all provide Parquet readers. Typically the reader is built on the Apache Arrow library, which itself has APIs in many languages should you need them.

If you don’t wish to read data into dataframes and instead want to read the data as newline-delimited JSON, Open Targets provides an in-house tool, p2j (Python), for converting Parquet to newline-delimited JSON.
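Newline-delimited JSON stores one JSON object per line, so it can be streamed record by record. The sketch below reads such a file with the Python standard library only. The function name `read_ndjson` is illustrative, and it assumes a UTF-8 file that you have already converted yourself.

```python
import json


def read_ndjson(path):
    """Yield one dict per non-empty line of a newline-delimited JSON file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines, e.g. a trailing newline
                yield json.loads(line)
```

Because the function is a generator, records are parsed one at a time rather than loaded into memory together, for example `for record in read_ndjson("disease.json"): ...`, where `disease.json` is a hypothetical converted file.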

From release 25.03 onwards, the data download paths have changed because only the Parquet format is available. Dataset names have also changed slightly (they are now snake_case and singular). More details can be found here.

  • Previous releases (up to 24.09):

    • https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/24.09/output/etl/parquet/associationByOverallDirect/

  • 25.03 release (and thereafter):

    • https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/association_by_datasource_direct/
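If your scripts need to work across releases, the two layouts above can be captured in a small helper. This is a sketch under stated assumptions: the helper names are invented, the switch is assumed to apply to every release from 25.03 onwards, every earlier release is assumed to use the `output/etl/parquet/` layout shown for 24.09, and the dataset name must already be spelled the way the chosen release names it.

```python
BASE_URL = "https://ftp.ebi.ac.uk/pub/databases/opentargets/platform"


def release_key(release):
    """Turn a release string such as '25.03' into a sortable (year, month) tuple."""
    year, month = release.split(".")
    return int(year), int(month)


def dataset_url(release, dataset):
    """Build the download URL of a dataset directory for a given release.

    `dataset` is used as given: camelCase names up to 24.09,
    snake_case singular names from 25.03.
    """
    if release_key(release) >= (25, 3):
        # Layout from 25.03: datasets sit directly under output/
        return f"{BASE_URL}/{release}/output/{dataset}/"
    # Assumed layout for earlier releases, as shown for 24.09
    return f"{BASE_URL}/{release}/output/etl/parquet/{dataset}/"
```

The helper does not translate dataset names between the two conventions, because the renaming is not purely mechanical.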

Tutorials and how-to guides

For more information on how to access and work with our data downloads and example scripts based on actual use cases and research questions, check out the Open Targets Community.
