Quickstart Guide


We use the Genome in a Bottle (GiaB) Chinese Trio for a step-by-step demonstration of the whole process: downloading VCF files, annotating them, creating a database, starting a cluster with Docker Compose, and finally querying the database. All commands and parameters are suitable for running on a Linux laptop.


0. Prerequisites

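Before starting, your environment should provide the tools used in the steps below: Apache Spark 3.5 (spark-submit), Scala, Docker with Compose, gRPCurl and, for the optional annotation step, Ensembl VEP.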
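
A quick way to see which of these tools are already on the PATH is sketched below. The tool names are taken from the commands used later in this guide; check_tools is a hypothetical helper, not part of Dnaerys, and vep may live outside the PATH if you installed it under a custom directory.

```shell
# Report which of the given commands are missing from PATH.
# Returns 0 when all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok       $tool"
        else
            echo "missing  $tool"
            missing=1
        fi
    done
    return "$missing"
}

# vep is only needed for the optional annotation step.
check_tools spark-submit scala docker grpcurl vep \
    || echo "Install the missing tools before continuing."
```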


1. VCF

You can use any VCF file from GiaB's latest release (GRCh38) for the following steps. We use a single VCF (merged.vcf.gz) containing the Chinese Trio (HG005, HG006, HG007), downloaded from the UCSC archive.
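
Before moving on, it can help to confirm that the file really contains the three trio samples. The sketch below reads the sample names from the #CHROM header line; vcf_samples is a hypothetical helper and the path is a placeholder. For the trio file it should list HG005, HG006 and HG007, in whatever order the file stores them.

```shell
# Print the sample names from the #CHROM header line of a gzipped VCF,
# one per line. Columns 1-9 are fixed VCF fields; samples start at column 10.
vcf_samples() {
    gzip -dc "$1" \
        | awk -F '\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# Usage (placeholder path):
#   vcf_samples /path/to/merged.vcf.gz
```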


2. Annotations

Annotating VCFs is optional but highly recommended. It requires a locally installed Variant Effect Predictor (VEP); follow any preferred installation method from the Ensembl page.

The example below runs VEP in offline mode (optional). It also adds CADD and AlphaMissense annotations (optional), which require downloading the corresponding annotation files.

With everything in place, the annotation command looks similar to:

/path/to/vep \
    -i /path/to/merged.vcf.gz \
    --vcf \
    --compress_output gzip \
    -o /path/to/merged.vep.vcf.gz \
    --species homo_sapiens \
    --offline \
    --cache \
    --dir_cache /path/to/vep/cache \
    --force_overwrite \
    --assembly GRCh38 \
    --gencode_primary \
    --terms SO \
    --variant_class \
    --no_stats \
    --hgvs \
    --regulatory \
    --canonical \
    --biotype \
    --sift b \
    --polyphen b \
    --af_gnomade \
    --af_gnomadg \
    --plugin AlphaMissense,file=/path/to/AlphaMissense_hg38.tsv.gz \
    --plugin CADD,snv=/path/to/CADD/whole_genome_SNVs.tsv.gz,indels=/path/to/CADD/gnomad.genomes.r4.0.indel.tsv.gz \
    --fasta /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --fork 8 \
    --buffer_size 10000
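
With --vcf, VEP records the layout of its annotations in the CSQ header line of the output file (the part after "Format:"). The sketch below lists those field names and checks for the ones the ETL step refers to by name (gnomADg_AF, gnomADe_AF, CLIN_SIG). csq_fields and has_csq_field are hypothetical helpers, and the path is a placeholder.

```shell
# Print the CSQ sub-field names declared in the VCF header, one per line.
csq_fields() {
    gzip -dc "$1" | awk '
        /^##INFO=<ID=CSQ,/ {
            sub(/.*Format: /, "")        # drop everything up to the field list
            sub(/">.*$/, "")             # drop the closing quote and bracket
            n = split($0, fields, "|")
            for (i = 1; i <= n; i++) print fields[i]
            exit
        }'
}

# Return 0 if the named CSQ sub-field is declared in the header.
has_csq_field() {
    csq_fields "$1" | grep -Fqx "$2"
}

# Usage (placeholder path):
#   for f in gnomADg_AF gnomADe_AF CLIN_SIG; do
#       has_csq_field /path/to/merged.vep.vcf.gz "$f" || echo "CSQ field missing: $f"
#   done
```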

Notes:


3. Data Ingestion (ETL)

The ETL transforms raw or annotated VCFs into the internal Dnaerys binary format.

With Spark 3.5:

$ spark-submit \
    --master 'local[*]' \
    --conf spark.local.dir=/tmppath \
    --conf spark.driver.maxResultSize=16g \
    --driver-memory=18g \
    --packages=io.projectglow:glow-spark3_2.12:2.0.0 \
    --class org.dnaerys.etl \
    /path/dnaerys-ctl-1.17.5.jar \
    --path /path/to/merged.vep.vcf.gz \
    --path2save /path/to/dnaerys_dataset \
    --sinfo /path/to/samples.csv \
    --notes "VEP annotated" \
    --cohort ChineseTrio \
    --rings 1 \
    --grch38 \
    --vep \
    --canonical \
    --gnomadg "gnomADg_AF" \
    --gnomade "gnomADe_AF" \
    --clinsig "CLIN_SIG" \
    --cadd \
    --alphamissense \
    --polyphen \
    --sift

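Notes:

samples.csv, passed via --sinfo, lists one sample per line together with its reported sex. For the trio: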

HG005,male
HG006,male
HG007,female
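
The three lines above can be written and sanity-checked before running the ETL. This is a minimal sketch that assumes the two-column layout seen in the example (sample name, then sex) and accepts only the two sex values shown there; relax the check if the ETL supports other values. Both function names are hypothetical.

```shell
# Write the sample sheet for the trio to the given path.
write_trio_samples() {
    cat > "$1" <<'EOF'
HG005,male
HG006,male
HG007,female
EOF
}

# Return 0 if every line has exactly two comma-separated fields and the
# second is "male" or "female"; otherwise print the offending lines.
validate_samples() {
    awk -F ',' '
        NF != 2 || ($2 != "male" && $2 != "female") {
            print "bad line " NR ": " $0
            bad = 1
        }
        END { exit bad }
    ' "$1"
}

# Usage (placeholder path):
#   write_trio_samples /path/to/samples.csv && validate_samples /path/to/samples.csv
```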

4. License Activation

Before the cluster can serve the data, we need to accept the Community License and bind it to the specific dataset path.

Show license terms:

$ scala -cp /path/dnaerys-ctl-1.17.5.jar org.dnaerys.license --show

Accept and bind to dataset:

$ scala -cp /path/dnaerys-ctl-1.17.5.jar org.dnaerys.license --accept --path /path/to/dnaerys_dataset

5. Local Cluster Deployment

Start the cluster with Docker Compose:

$ docker compose up

docker-compose.yml:

networks:
  cluster-network:

services:

  node0:
    networks:
      - cluster-network
    image: dnaerys/dnaerys-cluster:latest
    ports:
      - '8001:8000'
      - '8080:8080'
      - '8081:8081'
    volumes:
      - /path/to/dnaerys_dataset:/dnaerys/dataset
    shm_size: '1gb'
    environment:
      DEPLOYMENT_MODE: docker
      REQUIRED_CONTACT_POINT_NR: 0
      CLUSTER_HOST: node0
      CLUSTER_SEED_HOST: node0
      JAVA_OPTS: '--add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.time=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED'
      RING_ID: ring0
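
When the cluster is started in the background (docker compose up -d), it is handy to wait until the gRPC port mapped above (8001 on the host) accepts connections before sending queries. The sketch below relies on bash's /dev/tcp pseudo-device, so it needs bash; wait_for_port is a hypothetical helper. An open port only shows that the node is listening, not that the dataset has finished loading.

```shell
# Wait until a TCP port accepts connections, polling once per second.
# Arguments: host, port, maximum number of attempts.
wait_for_port() {
    host=$1
    port=$2
    tries=$3
    while [ "$tries" -gt 0 ]; do
        # The subshell keeps a failed connection from affecting the caller.
        if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Usage:
#   docker compose up -d
#   wait_for_port 127.0.0.1 8001 60 && echo "gRPC port is open"
```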

6. Querying the Database

Once the cluster is up and running on localhost, we can query the dataset with gRPCurl. Download dnaerys_1.17.4.proto first; the commands below load it from the current directory.

Pathogenic variants in TP53:

grpcurl \
  -plaintext \
  -proto dnaerys_1.17.4.proto \
  -d '{"chr":"17", "start":"7661779", "end":"7687546", "hom":"true", "het":"true", "ann": {"clinsgn":"PATHOGENIC"}, "assembly":"GRCh38"}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInRegion

High-impact heterozygous variants in TP53 transcripts:

grpcurl \
  -plaintext \
  -proto dnaerys_1.17.4.proto \
  -d '{"chr":"17", "start":"7661779", "end":"7687546", "het":"true", "ann": {"feature_type":["TRANSCRIPT"], "impact":["HIGH"]}, "assembly":"GRCh38"}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInRegion
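
Hand-written JSON inside shell quotes is easy to get wrong, so the two region queries above can be wrapped in small functions. This is a sketch: the field names, endpoint and proto file are copied from the commands above, while region_payload and select_variants_in_region are hypothetical helper names. The helper always requests both hom and het genotypes; adjust it if you need only one.

```shell
# Build the JSON request body for SelectVariantsInRegion.
# Arguments: chromosome, start, end, annotation filter (a JSON object).
region_payload() {
    printf '{"chr":"%s", "start":"%s", "end":"%s", "hom":"true", "het":"true", "ann": %s, "assembly":"GRCh38"}\n' \
        "$1" "$2" "$3" "$4"
}

# Send the request to the local cluster (same endpoint as above).
select_variants_in_region() {
    grpcurl \
        -plaintext \
        -proto dnaerys_1.17.4.proto \
        -d "$(region_payload "$@")" \
        127.0.0.1:8001 \
        org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInRegion
}

# Usage, equivalent to the pathogenic TP53 query:
#   select_variants_in_region 17 7661779 7687546 '{"clinsgn":"PATHOGENIC"}'
```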

De novo variants in a trio (AlphaMissense Likely Pathogenic):

grpcurl \
  -plaintext \
  -proto dnaerys_1.17.4.proto \
  -d '{"parent1":"HG006", "parent2":"HG007", "proband":"HG005", "chr":"1", "start":"1", "end":"248956422", "ann": {"am_class":"AM_LIKELY_PATHOGENIC"}}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectDeNovo
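
The query above covers chromosome 1 only; its "end" value is the GRCh38 length of that chromosome. To repeat it for every autosome, the lengths can be taken from a FASTA index instead of being typed by hand. The sketch below assumes a .fai file exists next to the reference FASTA used for annotation (tab-separated, name in column 1 and length in column 2) and that the same annotation filter is wanted everywhere; denovo_payloads is a hypothetical helper.

```shell
# Emit one SelectDeNovo request body per autosome (1-22).
# Arguments: path to .fai, parent1, parent2, proband.
# A leading "chr" is stripped because the queries above use bare names.
denovo_payloads() {
    awk -F '\t' -v p1="$2" -v p2="$3" -v pb="$4" '
        {
            name = $1
            sub(/^chr/, "", name)
            if (name ~ /^([1-9]|1[0-9]|2[0-2])$/)
                printf "{\"parent1\":\"%s\", \"parent2\":\"%s\", \"proband\":\"%s\", \"chr\":\"%s\", \"start\":\"1\", \"end\":\"%s\", \"ann\": {\"am_class\":\"AM_LIKELY_PATHOGENIC\"}}\n", p1, p2, pb, name, $2
        }
    ' "$1"
}

# Usage (placeholder path):
#   denovo_payloads /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai HG006 HG007 HG005 |
#   while IFS= read -r body; do
#       grpcurl -plaintext -proto dnaerys_1.17.4.proto -d "$body" \
#           127.0.0.1:8001 org.dnaerys.cluster.grpc.DnaerysService/SelectDeNovo </dev/null
#   done
```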

7. Quality Control (QC)

Dnaerys includes built-in algorithms for cohort-wide integrity checks and kinship analysis.

Reported vs. observed sex mismatch check:

grpcurl \
  -plaintext \
  -proto dnaerys_1.17.4.proto \
  -d '{"cohort_name":"ChineseTrio"}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SexMismatchCheck

Kinship analysis (twins and duplicate samples):

grpcurl \
  -plaintext \
  -proto dnaerys_1.17.4.proto \
  -d '{"cohort_name":"ChineseTrio", "degree":"TWINS_MONOZYGOTIC"}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/Kinship