All Platform datasets are available as distributed collections of data: each dataset is a directory containing a set of partitioned files. Datasets are currently produced in either Parquet or JSON format. Both formats allow us to expose nested information in a machine-readable way. The step-by-step guide below describes how to download, access and query this information.
Archived datasets, as well as input files and other secondary products, are also made available on the FTP server and Google Cloud Platform.
Download
Below is a walkthrough on how to download the diseases dataset from the 21.04 release in Parquet format using different approaches.
We recommend lftp as a command-line client. When using tools such as wget or curl, use https:// rather than ftp://.
Using rsync
rsync is a command-line tool for efficiently transferring and synchronising files between two locations, such as a remote server and a local machine.
If you are using a non-Linux or non-Unix machine (e.g. Windows), you can access our FTP service using an FTP client like FileZilla or the Windows ftp command. For more information, including tips and workarounds, see the Community Windows ftp thread.
Accessing and querying datasets
To read the information available in the partitioned datasets, there is no need to manipulate or concatenate files. Datasets can be read directly using the dataset path.
The following scripts provide a proof-of-concept example using the ClinVar evidence provided by the European Variation Archive (EVA). They show how to:
Read a dataset
Explore the schema of the dataset
Select a subset of information (columns)
Display the information
First of all, the dataset needs to be downloaded as described in the previous section. For simplicity, only the EVA evidence is downloaded here, but all evidence can be downloaded at once using the same approach.
The next scripts use Apache Spark (PySpark or sparklyr) to read and query the dataset using modern functional programming approaches. These packages need to be installed in their respective environments.
The next query displays only six fields of the ClinVar evidence, but other non-null values are available. The schema is the best way to explore what is available and to query the most relevant information. All Platform evidence shares the same schema, so there will be a long list of fields that might not be informative for ClinVar but will be relevant when querying other data sources.
Dealing with nested information can sometimes be tedious. The Platform aims to minimise the nestedness of the data; however, some level of structure is sometimes required. Spark provides a series of functions to deal with complex nested information. The scripts include an example of how the clinicalSignificances array is flattened using the explode function.
Once loaded into Python or R, the user can decide to continue using Spark, write the output to a file or use alternative libraries to process the information (e.g. pandas, tidyverse, etc.).
```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# path to the ClinVar (EVA) evidence dataset
# directory stored on your local machine
evidencePath = "local directory path - e.g. /User/downloads/sourceId=eva"

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

# read evidence dataset
evd = spark.read.parquet(evidencePath)

# browse the evidence schema
evd.printSchema()

# select fields of interest
evdSelect = evd.select(
    "targetId",
    "diseaseId",
    "variantRsId",
    "studyId",
    F.explode("clinicalSignificances").alias("cs"),
    "confidence"
)
evdSelect.show()

# +---------------+--------------+-----------+------------+--------------------+--------------------+
# |       targetId|     diseaseId|variantRsId|     studyId|                  cs|          confidence|
# +---------------+--------------+-----------+------------+--------------------+--------------------+
# |ENSG00000153201|Orphanet_88619|rs773278648|RCV001042548|uncertain signifi...|criteria provided...|
# |ENSG00000115718|  Orphanet_745|       null|RCV001134697|uncertain signifi...|criteria provided...|
# |ENSG00000107147|    HP_0001250|rs539139475|RCV000720408|       likely benign|criteria provided...|
# |ENSG00000175426|Orphanet_71528|rs142567487|RCV000292648|uncertain signifi...|criteria provided...|
# |ENSG00000169174|   EFO_0004911|rs563024336|RCV000375546|uncertain signifi...|criteria provided...|
# |ENSG00000140521|  Orphanet_298|rs376306906|RCV000763992|uncertain signifi...|criteria provided...|
# |ENSG00000134982|   EFO_0005842| rs74627407|RCV000073743|               other|no assertion crit...|
# |ENSG00000187498| MONDO_0008289|rs146288748|RCV001111533|uncertain signifi...|criteria provided...|
# |ENSG00000116688|Orphanet_64749|rs119103265|RCV000857104|uncertain signifi...|no assertion crit...|
# |ENSG00000133812|Orphanet_99956|rs562275980|RCV000367609|uncertain signifi...|criteria provided...|
# +---------------+--------------+-----------+------------+--------------------+--------------------+
# only showing top 10 rows
```
```python
# convert to a Pandas DataFrame
evdSelect.toPandas()
```
For more information on how to access and work with our data downloads and example scripts based on actual use cases and research questions, check out the Open Targets Community.