Download datasets
To support more complex and systematic queries, we provide all datasets as data downloads.
A list of all datasets is available on the data downloads page.
All Platform datasets are available as a distributed collection of data: for each dataset, there is a directory containing a set of partitioned files. Currently, we produce our datasets in Parquet format, which allows us to expose nested information in a machine-readable way.
Archive datasets, as well as input files and other secondary products, are also made available on the FTP service and in Google Cloud Storage.
Below, we describe how to download, access and query this information in a step-by-step guide.
Below is a walkthrough on how to download the disease dataset from the 25.03 release in Parquet format using different approaches.
We recommend lftp as a command-line client; when using tools like wget, curl, etc., use https:// rather than ftp://.
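For example, a minimal lftp mirror over HTTPS might look like this; the release path is an assumption based on the 25.03 layout described below:

```bash
# Mirror the disease dataset into ./disease over HTTPS
# (release path is an assumption based on the 25.03 layout)
lftp -c "open https://ftp.ebi.ac.uk; mirror /pub/databases/opentargets/platform/25.03/output/disease disease"
```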
rsync is a command-line tool for efficiently transferring and synchronising files between a computer and an external hard drive.
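A sketch of an rsync download, assuming the EBI rsync endpoint and the 25.03 release path:

```bash
# Recursively copy the disease dataset from the EBI rsync server
# (endpoint and path are assumptions based on the EBI service and 25.03 layout)
rsync -rpltvz rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/disease .
```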
wget is a command-line tool that retrieves content from web servers and is widely available on Unix systems.
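A sketch assuming the HTTPS mirror of the EBI FTP service; --cut-dirs strips the six leading path components so the files land in ./disease:

```bash
# Recursively fetch the disease dataset over HTTPS
wget --recursive --no-parent --no-host-directories --cut-dirs=6 \
  https://ftp.ebi.ac.uk/pub/databases/opentargets/platform/25.03/output/disease/
```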
Users with a Google Cloud Platform account can download the datasets through the Google Cloud Console or using the gsutil command-line tool.
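A sketch assuming the public open-targets-data-releases bucket and the 25.03 path layout:

```bash
# Copy the disease dataset from Google Cloud Storage; -m parallelises the transfer
gsutil -m cp -r gs://open-targets-data-releases/25.03/output/disease .
```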
To read the information available in the partitioned datasets, there is no need to manipulate or concatenate files. Datasets can be read directly using the dataset path.
The following scripts provide a proof-of-concept example using the ClinVar evidence provided by the European Variation Archive (EVA). They show how to:
Read a dataset
Explore the schema of the dataset
Select a subset of information (columns)
Display the information
First of all the dataset needs to be downloaded as described in the previous section. For simplicity, only EVA evidence is downloaded, but all evidence can be downloaded at once using the same approach.
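A minimal PySpark sketch of these four steps is shown below; the local path and the selected column names are assumptions, so use the printed schema to confirm the fields you need:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the dataset: point Spark at the downloaded directory of
# partitioned Parquet files (hypothetical local path)
evd = spark.read.parquet("evidence/sourceId=eva")

# Explore the schema of the dataset
evd.printSchema()

# Select a subset of columns (names are assumptions based on the
# shared evidence schema) and display the information
evd.select("targetId", "diseaseId", "variantRsId", "studyId", "confidence").show()
```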
The example query displays only six fields of the ClinVar evidence, but other non-null values are available. The schema is the best way to explore what's available and to query the most relevant information. All Platform evidence datasets share the same schema, so there will be a long list of fields that might not be informative for ClinVar but will be relevant when querying other data sources.
Dealing with nested information can sometimes be tedious. The Platform aims to minimise the nestedness of the data, but some level of structure is sometimes required. Spark provides a series of functions for dealing with complex nested structures. The scripts provide an example of how the clinicalSignificances array is flattened using the explode function.
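Continuing the sketch above, a hedged example of flattening the array (again assuming the column names):

```python
from pyspark.sql.functions import col, explode

# explode() produces one output row per element of the array
flat = evd.select(
    "targetId",
    "diseaseId",
    explode(col("clinicalSignificances")).alias("clinicalSignificance"),
)
flat.show()
```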
Once loaded into Python or R, the user can decide to continue using Spark, write the output to a file, or use alternative libraries to process the information (e.g. pandas, the tidyverse, etc.).
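For instance, continuing the sketch above, the result can be handed to pandas or written back to disk:

```python
# Convert the (small, filtered) Spark DataFrame to a pandas DataFrame...
pdf = flat.toPandas()

# ...or persist it as Parquet for later use (hypothetical output path)
flat.write.parquet("clinvar_clinical_significances")
```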
The Open Targets data generation pipeline produces outputs only in the Parquet file format; it no longer produces JSON outputs. Parquet was chosen for its favourable features:
built-in schema and data typing
size-efficiency when compressed
efficient reading
the wide availability of interfaces with most dataframe libraries
If you are using a non-Linux or non-Unix machine (e.g. Windows), you can access our FTP service using an FTP client like FileZilla or the Windows ftp command. For more information, including tips and workarounds, see the Open Targets Community.
The next scripts make use of Apache Spark (PySpark or sparklyr) to read and query the dataset using modern functional programming approaches. These packages need to be installed in their respective environments.
If you are new to Parquet and switching over from JSON, the change should be simple and your pipeline should be faster at reading the data. Most popular data frame libraries provide Parquet file readers. Typically the reader is built on the Apache Arrow library, which itself has APIs in many languages should you need them.
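As a minimal illustration with pandas (the local path is hypothetical), reading a partitioned Parquet directory is a one-liner:

```python
import pandas as pd

# pandas (via its pyarrow engine) reads a partitioned Parquet directory directly
df = pd.read_parquet("disease")  # hypothetical local download path
print(df.head())
```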
If you don’t wish to read the data into dataframes and instead want to read it as JSON (newline-delimited), Open Targets has its own utility for converting Parquet to newline-delimited JSON.
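If you prefer to roll your own conversion instead, a minimal pandas sketch (not the Open Targets utility) could be:

```python
import pandas as pd

# Read the partitioned Parquet dataset and write newline-delimited JSON
df = pd.read_parquet("disease")  # hypothetical local path
df.to_json("disease.jsonl", orient="records", lines=True)
```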
As of the 25.03 release, the data download paths have changed, since only the Parquet file format is now available. There are also minor changes to dataset names (now snake_case and singular). More details can be found in the release notes.
For more information on how to access and work with the data downloads, and for example scripts based on actual use cases and research questions, check out the Open Targets Community.