Quick Start
We will use Genome in a Bottle's Chinese Trio for a step-by-step demonstration of the whole process. We'll go through downloading VCF files, annotating them, creating a database, starting cluster with Docker Compose and finally querying the database. All commands and parameters are suitable for running on a laptop with Linux.
0. Prerequisites
Your environment should meet the following requirements before starting:
- dnaerys-ctl: Download the latest.
- Java 17: to run dnaerys-ctl tool. Download.
- Apache Spark 3.5: to run ETL. Download and unpack Spark 3.5.* and you are all set.
- Docker/Docker Compose: to run the cluster.
- VEP: for variant annotation (technically optional, but almost essential). Download.
1. VCF
Any VCF file from GiaB's latest release (GRCh38) will be a fit for the following steps. We'll use a VCF (merged.vcf.gz) with Chinese Trio (HG005,HG006,HG007) downloaded from UCSC archive.
2. Annotations
Enriching VCFs with annotations (optional, albeit highly recommended). You'd need a locally installed Variant Effect Predictor. Follow any preferred method for VEP installation from the Ensembl page.
In the example below we go with offline mode (optional). We demonstrate how to use CADD and AlphaMissense annotations, which require downloading of the corresponding annotation files (optional).
With everything in place, the annotation command looks similar to:
/path/to/vep \
-i /path/to/merged.vcf.gz \
--vcf \
--compress_output gzip \
-o /path/to/merged.vep.vcf.gz \
--species homo_sapiens \
--offline \
--cache \
--dir_cache /path/to/vep/cache \
--force_overwrite \
--assembly GRCh38 \
--gencode_primary \
--terms SO \
--variant_class \
--no_stats \
--hgvs \
--regulatory \
--canonical \
--biotype \
--sift b \
--polyphen b \
--af_gnomade \
--af_gnomadg \
--plugin AlphaMissense,file=/path/to/AlphaMissense_hg38.tsv.gz \
--plugin CADD,snv=/path/to/CADD/whole_genome_SNVs.tsv.gz,indels=/path/to/CADD/gnomad.genomes.r4.0.indel.tsv.gz \
--fasta /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa \
--fork 8 \
--buffer_size 10000
Notes:
- all --plugin and --fasta flags are optional and require downloading significantly large files
- adjust fork and buffer_size according to the input VCF file and the hardware spec; you can set fork = number of cores, and leave buffer_size as is if unsure
- the use of AlphaMissense, CADD, PolyPhen, and SIFT annotations is subject to their respective end-user licenses
- with given parameters, the annotation process takes about 120 minutes
3. Data Ingestion (ETL)
The ETL transforms raw or annotated VCFs into internal Dnaerys binary format.
With Spark 3.5:
$ spark-submit \
--master local[*] \
--driver-memory=32g \
--conf spark.local.dir=/tmppath \
--conf spark.driver.maxResultSize=31g \
--conf spark.cleaner.ttl.BROADCAST_VARS=-1 \
--conf spark.kryoserializer.buffer.max=2047m \
--conf spark.kryo.referenceTracking=false \
--conf spark.network.sharedByteBufAllocators=false \
--conf spark.executor.heartbeatInterval=180000 \
--conf spark.network.timeout=240000 \
--conf spark.network.timeoutInterval=240000 \
--packages=io.projectglow:glow-spark3_2.12:2.0.0 \
--class org.dnaerys.etl \
/path/dnaerys-ctl-1.18.0.jar \
--path /path/to/merged.vep.vcf.gz \
--path2save /path/to/output/dnaerys_dataset \
--sinfo /path/to/samples.csv \
--notes "VEP annotated" \
--cohort ChineseTrio \
--rings 1 \
--grch38 \
--vep \
--canonical \
--gnomadg "gnomADg_AF" \
--gnomade "gnomADe_AF" \
--clinsig "CLIN_SIG" \
--cadd \
--alphamissense \
--polyphen \
--sift
Notes:
- samples.csv should list all sample names with their gender, in order they are listed in VCF,
sample_name,genderper line. For GIAB Chinese Trio:
HG005,male
HG006,male
HG007,female
- adjust
maxResultSize,driver-memoryandkryoserializer.buffer.maxaccording to the VCF file size and the local hardware spec if required - ETL process takes about 5 minutes (8 cores)
4. License Activation
Before the cluster can serve the data, we need to accept the Community License for the specific dataset path:
Show license terms:
$ java \
--add-opens java.base/java.lang=ALL-UNNAMED \
--add-opens java.base/java.lang.invoke=ALL-UNNAMED \
--add-opens java.base/java.lang.reflect=ALL-UNNAMED \
--add-opens java.base/java.io=ALL-UNNAMED \
--add-opens java.base/java.nio=ALL-UNNAMED \
--add-opens java.base/java.util=ALL-UNNAMED \
--add-opens java.base/java.util.concurrent=ALL-UNNAMED \
--add-opens java.base/sun.nio.ch=ALL-UNNAMED \
-cp /path/dnaerys-ctl-1.18.0.jar \
org.dnaerys.license \
--show
Accept and bind to dataset:
$ java \
--add-opens java.base/java.lang=ALL-UNNAMED \
--add-opens java.base/java.lang.invoke=ALL-UNNAMED \
--add-opens java.base/java.lang.reflect=ALL-UNNAMED \
--add-opens java.base/java.io=ALL-UNNAMED \
--add-opens java.base/java.nio=ALL-UNNAMED \
--add-opens java.base/java.util=ALL-UNNAMED \
--add-opens java.base/java.util.concurrent=ALL-UNNAMED \
--add-opens java.base/sun.nio.ch=ALL-UNNAMED \
-cp /path/dnaerys-ctl-1.18.0.jar \
org.dnaerys.license \
--accept \
--path /path/to/dnaerys_dataset
Quick dataset info check:
$ java \
--add-opens java.base/java.lang=ALL-UNNAMED \
--add-opens java.base/java.lang.invoke=ALL-UNNAMED \
--add-opens java.base/java.lang.reflect=ALL-UNNAMED \
--add-opens java.base/java.io=ALL-UNNAMED \
--add-opens java.base/java.nio=ALL-UNNAMED \
--add-opens java.base/java.util=ALL-UNNAMED \
--add-opens java.base/java.util.concurrent=ALL-UNNAMED \
--add-opens java.base/sun.nio.ch=ALL-UNNAMED \
-cp /path/dnaerys-ctl-1.18.0.jar \
org.dnaerys.audit \
--path /path/to/dnaerys_dataset \
--info
5. Local Cluster Deployment
Start the cluster with Docker Compose:
$ docker compose up
docker-compose.yml:
networks:
cluster-network:
services:
node0:
networks:
- cluster-network
image: dnaerys/dnaerys-cluster:latest
ports:
- '8001:8000'
volumes:
- /path/to/dnaerys_dataset:/dnaerys/dataset
shm_size: '1gb'
environment:
DEPLOYMENT_MODE: docker
REQUIRED_CONTACT_POINT_NR: 0
CLUSTER_HOST: node0
CLUSTER_SEED_HOST: node0
JAVA_OPTS: '--add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.time=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED'
RING_ID: ring0
6. Querying the Database
Once the cluster is up & running on localhost, we can query the dataset from python via Python Client Library or from CLI with gRPCurl. You need dnaerys_1.18.0.proto file available locally for gRPCurl.
Three possible ways to answer the question - "Are there any noteworthy variants in BRCA1, BRCA2 or TP53 genes in my database ?"
- Clinically reported - The most conservative and clinically meaningful definition. Only variants that have been explicitly reported in ClinVar as disease-associated:
python
from dnaerys import DnaerysClient, Region, AnnotationFilter, ClinSignificance
with DnaerysClient("127.0.0.1:8001", tls=False) as client:
ann = AnnotationFilter(
clin_significance=[
ClinSignificance.PATHOGENIC,
ClinSignificance.LIKELY_PATHOGENIC,
]
)
regions = [
Region("chr17", 43044292, 43170245), # BRCA1
Region("chr13", 32315508, 32400268), # BRCA2
Region("chr17", 7661779, 7687546), # TP53
]
def gene_name(v):
if v.chr.value == 13:
return "BRCA2"
if v.start >= 43044292:
return "BRCA1"
return "TP53"
for v in client.select_variants(regions=regions, annotations=ann):
print(f"{gene_name(v)} {v.ref}>{v.alt} at {v.start}, AF={v.af:.6f}")
gRPCurl
grpcurl \
-plaintext \
-proto dnaerys_1.18.0.proto \
-d '{"chr":["17","13","17"], "start":["43044292","32315508","7661779"], "end":["43170245","32400268","7687546"],
"hom":"true", "het":"true",
"ann": {"clinsgn":["PATHOGENIC","LIKELY_PATHOGENIC"]},
"assembly":"GRCh38"}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInMultiRegions
- High functional impact (protein-truncating) - variants that disrupt protein function regardless of ClinVar status:
python
from dnaerys import (
DnaerysClient, Region, AnnotationFilter,
Impact, Consequence, FeatureType,
)
with DnaerysClient("127.0.0.1:8001", tls=False) as client:
ann = AnnotationFilter(
feature_type=[FeatureType.TRANSCRIPT],
impact=[Impact.HIGH],
consequence=[
Consequence.FRAMESHIFT_VARIANT,
Consequence.STOP_GAINED,
Consequence.STOP_LOST,
Consequence.SPLICE_ACCEPTOR_VARIANT,
Consequence.SPLICE_DONOR_VARIANT,
Consequence.START_LOST,
],
)
regions = [
Region("chr17", 43044292, 43170245), # BRCA1
Region("chr13", 32315508, 32400268), # BRCA2
Region("chr17", 7661779, 7687546), # TP53
]
def gene_name(v):
if v.chr.value == 13:
return "BRCA2"
if v.start >= 43044292:
return "BRCA1"
return "TP53"
for v in client.select_variants(regions=regions, annotations=ann):
print(f"{gene_name(v)} {v.ref}>{v.alt} at {v.start}, AF={v.af:.6f}")
gRPCurl
grpcurl \
-plaintext \
-proto dnaerys_1.18.0.proto \
-d '{"chr":["17","13","17"], "start":["43044292","32315508","7661779"], "end":["43170245","32400268","7687546"],
"hom":"true", "het":"true",
"ann": {"feature_type":["TRANSCRIPT"], "impact":["HIGH"],
"consequence":["FRAMESHIFT_VARIANT","STOP_GAINED", "STOP_LOST",
"SPLICE_ACCEPTOR_VARIANT","SPLICE_DONOR_VARIANT",
"START_LOST"]},
"assembly":"GRCh38"}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInMultiRegions
- Rare damaging missense variants - predicted as likely pathogenic by AlphaMissense and rare in the general population:
python
from dnaerys import (
DnaerysClient, Region, AnnotationFilter, AlphaMissense,
)
with DnaerysClient("127.0.0.1:8001", tls=False) as client:
ann = AnnotationFilter(
am_class=[AlphaMissense.AM_LIKELY_PATHOGENIC],
gnomad_exomes_af_lt=0.001,
)
regions = [
Region("chr17", 43044292, 43170245), # BRCA1
Region("chr13", 32315508, 32400268), # BRCA2
Region("chr17", 7661779, 7687546), # TP53
]
def gene_name(v):
if v.chr.value == 13:
return "BRCA2"
if v.start >= 43044292:
return "BRCA1"
return "TP53"
for v in client.select_variants(regions=regions, annotations=ann):
print(f"{gene_name(v)} {v.ref}>{v.alt} at {v.start}, "
f"AM={v.am_score:.3f}, gnomADe={v.gnomad_exomes_af:.6f}")
gRPCurl
grpcurl \
-plaintext \
-proto dnaerys_1.18.0.proto \
-d '{"chr":["17","13","17"], "start":["43044292","32315508","7661779"], "end":["43170245","32400268","7687546"],
"hom":"true", "het":"true",
"ann": {"am_class":["AM_LIKELY_PATHOGENIC"],
"gnomad_exomes_af_lt":"0.001"},
"assembly":"GRCh38"}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectVariantsInMultiRegions
And since we have a trio, we can also look at de novo, heterozygous dominant, and homozygous recessive variants. For de novo in chr1:
python
from dnaerys import DnaerysClient, Region
with DnaerysClient("127.0.0.1:8001", tls=False) as client:
for v in client.select_de_novo(
proband="HG005",
parent1="HG006",
parent2="HG007",
region=Region("chr1", 1, 248956422),
):
print(v)
gRPCurl
grpcurl \
-plaintext \
-proto dnaerys_1.18.0.proto \
-d '{"parent1":"HG006", "parent2":"HG007", "proband":"HG005", "chr":"1", "start":"1", "end":"248956422", "assembly":"GRCh38"}' \
127.0.0.1:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectDeNovo