Quickstart Guide
We will use the Genome in a Bottle (GIAB) Chinese Trio for a step-by-step demonstration of the whole process: downloading VCF files, annotating them, creating a database, starting a cluster with Docker Compose, and finally querying the database. All commands and parameters are suitable for running on a Linux laptop.
0. Prerequisites
Your environment should meet the following requirements before starting:
- Scala 2.12: to run the dnaerys-ctl tool. The simplest way to install the required Scala version and to manage Scala environments is via Coursier. Once cs is installed, run: cs install scala:2.12.15
- Apache Spark 3.5: required for the ETL process. Download and unpack Spark 3.5.* and you are all set.
- Docker/Docker Compose: to run the cluster.
- Optional: VEP for variant annotation and gRPCurl for interacting with the API via the command line.
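Before moving on, it can help to confirm that the tools listed above are reachable from the shell. The following is a minimal sketch, not part of Dnaerys: missing_tools is a hypothetical helper name, and it assumes the tools are expected on PATH (spark-submit in particular may live only inside the unpacked Spark directory, in which case call it by its full path instead).

```shell
# Print each named command that cannot be found on PATH.
# Returns 1 if at least one command is missing, 0 otherwise.
missing_tools() {
  local rc=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Required tools (assumes spark-submit was added to PATH after unpacking Spark):
#   missing_tools cs scala spark-submit docker
# Optional tools:
#   missing_tools grpcurl
```

An empty output with exit status 0 means every named command was found.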
1. VCF
You can use any VCF file from GiaB's latest release (GRCh38) for the following steps. We will use a VCF (merged.vcf.gz) with the Chinese Trio (HG005, HG006, HG007), downloaded from the UCSC archive.
2. Annotations
Enriching VCFs with annotations is optional but highly recommended. You will need a locally installed Variant Effect Predictor (VEP). Follow any preferred VEP installation method from the Ensembl page.
The example below runs VEP in offline mode and adds CADD and AlphaMissense annotations, which require downloading the corresponding annotation files. Both choices are optional.
With everything in place, the annotation command looks similar to:
/path/to/vep \
-i /path/to/merged.vcf.gz \
--vcf \
--compress_output gzip \
-o /path/to/merged.vep.vcf.gz \
--species homo_sapiens \
--offline \
--cache \
--dir_cache /path/to/vep/cache \
--force_overwrite \
--assembly GRCh38 \
--gencode_primary \
--terms SO \
--variant_class \
--no_stats \
--hgvs \
--regulatory \
--canonical \
--biotype \
--sift b \
--polyphen b \
--af_gnomade \
--af_gnomadg \
--plugin AlphaMissense,file=/path/to/AlphaMissense_hg38.tsv.gz \
--plugin CADD,snv=/path/to/CADD/whole_genome_SNVs.tsv.gz,indels=/path/to/CADD/gnomad.genomes.r4.0.indel.tsv.gz \
--fasta /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa \
--fork 8 \
--buffer_size 10000
Notes:
- all --plugin and --fasta flags are optional; each requires downloading large data files
- adjust fork and buffer_size according to the input VCF file and the hardware spec; you can set fork = number of cores, and leave buffer_size as is if unsure
- the use of AlphaMissense, CADD, PolyPhen, and SIFT annotations is subject to their respective end-user licenses
- with the given parameters, the annotation process takes about 120 minutes
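The note about fork can be scripted. The sketch below is only an illustration: vep_parallel_flags is a hypothetical helper, it sets fork to the detected number of cores as suggested above, and it keeps buffer_size at the value used in the example command.

```shell
# Print VEP parallelism flags sized to this machine.
# --fork is set to the detected core count (1 if detection fails);
# --buffer_size keeps the value from the example command.
vep_parallel_flags() {
  local cores
  cores="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
  echo "--fork ${cores} --buffer_size 10000"
}

# Usage: leave the substitution unquoted so the flags split into separate words,
# and use it in place of the fixed --fork / --buffer_size lines:
#   /path/to/vep -i /path/to/merged.vcf.gz ... $(vep_parallel_flags)
```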
3. Data Ingestion (ETL)
The ETL transforms raw or annotated VCFs into the internal Dnaerys binary format.
With Spark 3.5:
$ spark-submit \
--master local[*] \
--conf spark.local.dir=/tmppath \
--conf spark.driver.maxResultSize=16g \
--driver-memory=18g \
--packages=io.projectglow:glow-spark3_2.12:2.0.0 \
--class org.dnaerys.etl \
/path/dnaerys-ctl-1.17.5.jar \
--path /path/to/merged.vep.vcf.gz \
--path2save /path/to/dnaerys_dataset \
--sinfo /path/to/samples.csv \
--notes "VEP annotated" \
--cohort ChineseTrio \
--rings 1 \
--grch38 \
--vep \
--canonical \
--gnomadg "gnomADg_AF" \
--gnomade "gnomADe_AF" \
--clinsig "CLIN_SIG" \
--cadd \
--alphamissense \
--polyphen \
--sift
Notes:
- samples.csv should list all sample names with their gender, in the order they are listed in the VCF, i.e. sample_name,gender per line. For the GIAB Chinese Trio we have:
HG005,male
HG006,male
HG007,female
- adjust maxResultSize, driver-memory and kryoserializer.buffer.max according to the VCF file size and the local hardware spec if required
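Since samples.csv has to follow the sample order of the VCF, it may be worth checking this before running spark-submit. The following is a minimal sketch under stated assumptions: check_samples is a hypothetical helper (not part of dnaerys-ctl), samples.csv is assumed to have no header row, and the VCF may be plain or gzip/bgzip compressed. It compares the first CSV column with the sample columns of the #CHROM header line.

```shell
# Compare sample names in samples.csv with the VCF header, in order.
# Usage: check_samples /path/to/samples.csv /path/to/merged.vep.vcf.gz
check_samples() {
  local csv_names vcf_names
  # First comma-separated field of every line in samples.csv
  csv_names="$(cut -d, -f1 "$1")"
  # Columns 10 and onward of the #CHROM line hold the sample IDs;
  # gzip -cdf decompresses .gz input and passes plain text through unchanged
  vcf_names="$(gzip -cdf "$2" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }')"
  if [ "$csv_names" = "$vcf_names" ]; then
    echo "OK: sample order matches the VCF header"
  else
    echo "MISMATCH: samples.csv does not match the VCF header" >&2
    return 1
  fi
}
```

A mismatch returns a non-zero status, so the check can gate the ETL step, e.g. check_samples samples.csv merged.vep.vcf.gz && spark-submit ...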
4. License Activation
Before the cluster can serve the data, we need to accept the Community License for the specific dataset path:
Show license terms:
$ scala -cp /path/dnaerys-ctl-1.17.5.jar org.dnaerys.license --show
Accept and bind to dataset:
$ scala -cp /path/dnaerys-ctl-1.17.5.jar org.dnaerys.license --accept --path /path/to/dnaerys_dataset
5. Local Cluster Deployment
Start the cluster with Docker Compose:
$ docker compose up
docker-compose.yml:
networks:
  cluster-network:
services:
  node0:
    networks:
      - cluster-network
    image: dnaerys/dnaerys-cluster:latest
    ports:
      - '8001:8000'
      - '8080:8080'
      - '8081:8081'
    volumes:
      - /path/to/dnaerys_dataset:/dnaerys/dataset
    shm_size: '1gb'
    environment:
      DEPLOYMENT_MODE: docker
      REQUIRED_CONTACT_POINT_NR: 0
      CLUSTER_HOST: node0
      CLUSTER_SEED_HOST: node0
      JAVA_OPTS: '--add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.time=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED'
      RING_ID: ring0
6. Querying the Database
Once the cluster is up and running on localhost, we can query the dataset with gRPCurl. Download dnaerys_1.17.4.proto first; the commands below expect it in the current directory.
Pathogenic variants in TP53:
grpcurl \
-plaintext \
-proto dnaerys_1.17.4.proto \
-d '{"chr":"17", "start":"7661779", "end":"7687546", "hom":"true", "het":"true", "ann": {"clinsgn":"PATHOGENIC"}, "assembly":"GRCh38"}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInRegion
High-impact heterozygous variants in TP53 transcripts:
grpcurl \
-plaintext \
-proto dnaerys_1.17.4.proto \
-d '{"chr":"17", "start":"7661779", "end":"7687546", "het":"true", "ann": {"feature_type":["TRANSCRIPT"], "impact":["HIGH"]}, "assembly":"GRCh38"}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInRegion
De Novo variants in a trio (AlphaMissense Likely Pathogenic):
grpcurl \
-plaintext \
-proto dnaerys_1.17.4.proto \
-d '{"parent1":"HG006", "parent2":"HG007", "proband":"HG005", "chr":"1", "start":"1", "end":"248956422", "ann": {"am_class":"AM_LIKELY_PATHOGENIC"}}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectDeNovo
7. Quality Control (QC)
Dnaerys includes built-in algorithms for cohort-wide integrity checks and kinship analysis.
Reported vs. observed sex mismatch check:
grpcurl \
-plaintext \
-proto dnaerys_1.17.4.proto \
-d '{"cohort_name":"ChineseTrio"}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SexMismatchCheck
Kinship analysis (twins and duplicates):
grpcurl \
-plaintext \
-proto dnaerys_1.17.4.proto \
-d '{"cohort_name":"ChineseTrio", "degree":"TWINS_MONOZYGOTIC"}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/Kinship
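All queries in sections 6 and 7 share the same grpcurl options and differ only in the method name and the JSON request. A small wrapper can cut the repetition. This is a sketch, not part of Dnaerys: dnaerys_call and the DNAERYS_PROTO and DNAERYS_ADDR variables are hypothetical names introduced here, with defaults taken from the commands above.

```shell
# Wrap the grpcurl invocation used throughout this guide.
# Usage: dnaerys_call <Method> '<JSON request>'
# DNAERYS_PROTO and DNAERYS_ADDR are optional overrides (hypothetical names).
dnaerys_call() {
  grpcurl \
    -plaintext \
    -proto "${DNAERYS_PROTO:-dnaerys_1.17.4.proto}" \
    -d "$2" \
    "${DNAERYS_ADDR:-127.0.0.1:8001}" \
    "org.dnaerys.cluster.grpc.DnaerysService/$1"
}

# Examples, equivalent to the QC commands above:
#   dnaerys_call SexMismatchCheck '{"cohort_name":"ChineseTrio"}'
#   dnaerys_call Kinship '{"cohort_name":"ChineseTrio", "degree":"TWINS_MONOZYGOTIC"}'
```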