Quick Start

We will use Genome in a Bottle's Chinese Trio for a step-by-step demonstration of the whole process. We'll go through downloading VCF files, annotating them, creating a database, starting cluster with Docker Compose and finally querying the database. All commands and parameters are suitable for running on a laptop with Linux.

0. Prerequisites

Your environment should meet the following requirements before starting:

dnaerys-ctl: Download the latest.
Java 17: to run dnaerys-ctl tool. Download.
Apache Spark 3.5: to run ETL. Download and unpack Spark 3.5.* and you are all set.
Docker/Docker Compose: to run the cluster.
VEP: for variant annotation (technically optional, but almost essential). Download.

1. VCF

Any VCF file from GiaB's latest release (GRCh38) will be a fit for the following steps. We'll use a VCF (merged.vcf.gz) with Chinese Trio (HG005,HG006,HG007) downloaded from UCSC archive.

2. Annotations

Enriching VCFs with annotations (optional, albeit highly recommended). You'd need a locally installed Variant Effect Predictor. Follow any preferred method for VEP installation from the Ensembl page.

In the example below we go with offline mode (optional). We demonstrate how to use CADD and AlphaMissense annotations, which require downloading of the corresponding annotation files (optional).

With everything in place, the annotation command looks similar to:

/path/to/vep \
    -i /path/to/merged.vcf.gz \
    --vcf \
    --compress_output gzip \
    -o /path/to/merged.vep.vcf.gz \
    --species homo_sapiens \
    --offline \
    --cache \
    --dir_cache /path/to/vep/cache \
    --force_overwrite \
    --assembly GRCh38 \
    --gencode_primary \
    --terms SO \
    --variant_class \
    --no_stats \
    --hgvs \
    --regulatory \
    --canonical \
    --biotype \
    --sift b \
    --polyphen b \
    --af_gnomade \
    --af_gnomadg \
    --plugin AlphaMissense,file=/path/to/AlphaMissense_hg38.tsv.gz \
    --plugin CADD,snv=/path/to/CADD/whole_genome_SNVs.tsv.gz,indels=/path/to/CADD/gnomad.genomes.r4.0.indel.tsv.gz \
    --fasta /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa \
    --fork 8 \
    --buffer_size 10000

Notes:

all --plugin and --fasta flags are optional and require downloading significantly large files
adjust fork and buffer_size according to the input VCF file and the hardware spec; you can set fork = number of cores, and leave buffer_size as is if unsure
the use of AlphaMissense, CADD, PolyPhen, and SIFT annotations is subject to their respective end-user licenses
with given parameters, the annotation process takes about 120 minutes

3. Data Ingestion (ETL)

The ETL transforms raw or annotated VCFs into internal Dnaerys binary format.

With Spark 3.5:

$ spark-submit \
    --master local[*] \
    --driver-memory=32g \
    --conf spark.local.dir=/tmppath \
    --conf spark.driver.maxResultSize=31g \
    --conf spark.cleaner.ttl.BROADCAST_VARS=-1 \
    --conf spark.kryoserializer.buffer.max=2047m \
    --conf spark.kryo.referenceTracking=false \
    --conf spark.network.sharedByteBufAllocators=false \
    --conf spark.executor.heartbeatInterval=180000 \
    --conf spark.network.timeout=240000 \
    --conf spark.network.timeoutInterval=240000 \
    --packages=io.projectglow:glow-spark3_2.12:2.0.0 \
    --class org.dnaerys.etl \
    /path/dnaerys-ctl-1.18.0.jar \
    --path /path/to/merged.vep.vcf.gz \
    --path2save /path/to/output/dnaerys_dataset \
    --sinfo /path/to/samples.csv \
    --notes "VEP annotated" \
    --cohort ChineseTrio \
    --rings 1 \
    --grch38 \
    --vep \
    --canonical \
    --gnomadg "gnomADg_AF" \
    --gnomade "gnomADe_AF" \
    --clinsig "CLIN_SIG" \
    --cadd \
    --alphamissense \
    --polyphen \
    --sift

Notes:

samples.csv should list all sample names with their gender, in order they are listed in VCF, sample_name,gender per line. For GIAB Chinese Trio:

HG005,male
HG006,male
HG007,female

adjust maxResultSize, driver-memory and kryoserializer.buffer.max according to the VCF file size and the local hardware spec if required
ETL process takes about 5 minutes (8 cores)

4. License Activation

Before the cluster can serve the data, we need to accept the Community License for the specific dataset path:

Show license terms:

$ java \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.lang.invoke=ALL-UNNAMED \
  --add-opens java.base/java.lang.reflect=ALL-UNNAMED \
  --add-opens java.base/java.io=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  --add-opens java.base/java.util=ALL-UNNAMED \
  --add-opens java.base/java.util.concurrent=ALL-UNNAMED \
  --add-opens java.base/sun.nio.ch=ALL-UNNAMED \
  -cp /path/dnaerys-ctl-1.18.0.jar \
  org.dnaerys.license \
  --show

Accept and bind to dataset:

$ java \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.lang.invoke=ALL-UNNAMED \
  --add-opens java.base/java.lang.reflect=ALL-UNNAMED \
  --add-opens java.base/java.io=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  --add-opens java.base/java.util=ALL-UNNAMED \
  --add-opens java.base/java.util.concurrent=ALL-UNNAMED \
  --add-opens java.base/sun.nio.ch=ALL-UNNAMED \
  -cp /path/dnaerys-ctl-1.18.0.jar \
  org.dnaerys.license \
  --accept \
  --path /path/to/dnaerys_dataset

Quick dataset info check:

$ java \
  --add-opens java.base/java.lang=ALL-UNNAMED \
  --add-opens java.base/java.lang.invoke=ALL-UNNAMED \
  --add-opens java.base/java.lang.reflect=ALL-UNNAMED \
  --add-opens java.base/java.io=ALL-UNNAMED \
  --add-opens java.base/java.nio=ALL-UNNAMED \
  --add-opens java.base/java.util=ALL-UNNAMED \
  --add-opens java.base/java.util.concurrent=ALL-UNNAMED \
  --add-opens java.base/sun.nio.ch=ALL-UNNAMED \
  -cp /path/dnaerys-ctl-1.18.0.jar \
  org.dnaerys.audit \
  --path /path/to/dnaerys_dataset \
  --info

5. Local Cluster Deployment

Start the cluster with Docker Compose:

$ docker compose up

docker-compose.yml:

networks:
  cluster-network:

services:

  node0:
    networks:
      - cluster-network
    image: dnaerys/dnaerys-cluster:latest
    ports:
      - '8001:8000'
    volumes:
      - /path/to/dnaerys_dataset:/dnaerys/dataset
    shm_size: '1gb'
    environment:
      DEPLOYMENT_MODE: docker
      REQUIRED_CONTACT_POINT_NR: 0
      CLUSTER_HOST: node0
      CLUSTER_SEED_HOST: node0
      JAVA_OPTS: '--add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.time=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED'
      RING_ID: ring0

6. Querying the Database

Once the cluster is up & running on localhost, we can query the dataset from python via Python Client Library or from CLI with gRPCurl. You need dnaerys_1.20.0.proto file available locally for gRPCurl.

Three possible ways to answer the question - "Are there any noteworthy variants in BRCA1, BRCA2 or TP53 genes in my database ?"

Clinically reported - The most conservative and clinically meaningful definition. Only variants that have been explicitly reported in ClinVar as disease-associated:

python

from dnaerys import DnaerysClient, Region, AnnotationFilter, ClinSignificance

with DnaerysClient("127.0.0.1:8001", tls=False) as client:
    ann = AnnotationFilter(
        clin_significance=[
            ClinSignificance.PATHOGENIC,
            ClinSignificance.LIKELY_PATHOGENIC,
        ]
    )
    regions = [
        Region("chr17", 43044292, 43170245),  # BRCA1
        Region("chr13", 32315508, 32400268),  # BRCA2
        Region("chr17", 7661779, 7687546),    # TP53
    ]

    def gene_name(v):
        if v.chr.value == 13:
            return "BRCA2"
        if v.start >= 43044292:
            return "BRCA1"
        return "TP53"

    for v in client.select_variants(regions=regions, annotations=ann):
        print(f"{gene_name(v)} {v.ref}>{v.alt} at {v.start}, AF={v.af:.6f}")

gRPCurl

grpcurl \
  -plaintext \
  -proto dnaerys_1.20.0.proto \
  -d '{"chr":["17","13","17"], "start":["43044292","32315508","7661779"], "end":["43170245","32400268","7687546"],
       "hom":"true", "het":"true",
       "ann": {"clinsgn":["PATHOGENIC","LIKELY_PATHOGENIC"]},
       "assembly":"GRCh38"}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInMultiRegions

High functional impact (protein-truncating) - variants that disrupt protein function regardless of ClinVar status:

python

from dnaerys import (
    DnaerysClient, Region, AnnotationFilter,
    Impact, Consequence, FeatureType,
)

with DnaerysClient("127.0.0.1:8001", tls=False) as client:
    ann = AnnotationFilter(
        feature_type=[FeatureType.TRANSCRIPT],
        impact=[Impact.HIGH],
        consequence=[
            Consequence.FRAMESHIFT_VARIANT,
            Consequence.STOP_GAINED,
            Consequence.STOP_LOST,
            Consequence.SPLICE_ACCEPTOR_VARIANT,
            Consequence.SPLICE_DONOR_VARIANT,
            Consequence.START_LOST,
        ],
    )
    regions = [
        Region("chr17", 43044292, 43170245),  # BRCA1
        Region("chr13", 32315508, 32400268),  # BRCA2
        Region("chr17", 7661779, 7687546),    # TP53
    ]

    def gene_name(v):
        if v.chr.value == 13:
            return "BRCA2"
        if v.start >= 43044292:
            return "BRCA1"
        return "TP53"

    for v in client.select_variants(regions=regions, annotations=ann):
        print(f"{gene_name(v)} {v.ref}>{v.alt} at {v.start}, AF={v.af:.6f}")

gRPCurl

grpcurl \
  -plaintext \
  -proto dnaerys_1.20.0.proto \
  -d '{"chr":["17","13","17"], "start":["43044292","32315508","7661779"], "end":["43170245","32400268","7687546"],
       "hom":"true", "het":"true",
       "ann": {"feature_type":["TRANSCRIPT"], "impact":["HIGH"],
               "consequence":["FRAMESHIFT_VARIANT","STOP_GAINED", "STOP_LOST",
                              "SPLICE_ACCEPTOR_VARIANT","SPLICE_DONOR_VARIANT",
                              "START_LOST"]},
       "assembly":"GRCh38"}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInMultiRegions

Rare damaging missense variants - predicted as likely pathogenic by AlphaMissense and rare in the general population:

python

from dnaerys import (
    DnaerysClient, Region, AnnotationFilter, AlphaMissense,
)

with DnaerysClient("127.0.0.1:8001", tls=False) as client:
    ann = AnnotationFilter(
        am_class=[AlphaMissense.AM_LIKELY_PATHOGENIC],
        gnomad_exomes_af_lt=0.001,
    )
    regions = [
        Region("chr17", 43044292, 43170245),  # BRCA1
        Region("chr13", 32315508, 32400268),  # BRCA2
        Region("chr17", 7661779, 7687546),    # TP53
    ]

    def gene_name(v):
        if v.chr.value == 13:
            return "BRCA2"
        if v.start >= 43044292:
            return "BRCA1"
        return "TP53"

    for v in client.select_variants(regions=regions, annotations=ann):
        print(f"{gene_name(v)} {v.ref}>{v.alt} at {v.start}, "
              f"AM={v.am_score:.3f}, gnomADe={v.gnomad_exomes_af:.6f}")

gRPCurl

grpcurl \
  -plaintext \
  -proto dnaerys_1.20.0.proto \
  -d '{"chr":["17","13","17"], "start":["43044292","32315508","7661779"], "end":["43170245","32400268","7687546"],
       "hom":"true", "het":"true",
       "ann": {"am_class":["AM_LIKELY_PATHOGENIC"],
               "gnomad_exomes_af_lt":"0.001"},
       "assembly":"GRCh38"}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInMultiRegions

And since we have a trio, we can also look at de novo, heterozygous dominant, and homozygous recessive variants. For de novo in chr1:

python

from dnaerys import DnaerysClient, Region

with DnaerysClient("127.0.0.1:8001", tls=False) as client:
    for v in client.select_de_novo(
        proband="HG005",
        parent1="HG006",
        parent2="HG007",
        region=Region("chr1", 1, 248956422),
    ):
        print(v)

gRPCurl

grpcurl \
  -plaintext \
  -proto dnaerys_1.20.0.proto \
  -d '{"parent1":"HG006", "parent2":"HG007", "proband":"HG005", "chr":"1", "start":"1", "end":"248956422", "assembly":"GRCh38"}' \
  127.0.0.1:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectDeNovo