Documentation

Command Reference

This reference covers all existing tools available in dnaerys-ctl and their usage from initial ingestion to long-term dataset maintenance and optimization.

Prerequisites

Before beginning, ensure the environment is configured with the following stack:

Scala 2.12.x: We recommend Coursier for managing Scala environments. Once it's installed, Scala can be installed with cs install scala:2.12.20.
alternatively, you can use Java 17 instead of scala: simply change scala to java
since release 1.18, dnaerys-ctl is compiled into Java 17 classes, so pass these options to all CLI commands (temporary artifacts, Spark 3.5 handles them automatically) --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED --add-opens java.base/java.lang.reflect=ALL-UNNAMED --add-opens java.base/java.io=ALL-UNNAMED --add-opens java.base/java.nio=ALL-UNNAMED --add-opens java.base/java.util=ALL-UNNAMED --add-opens java.base/java.util.concurrent=ALL-UNNAMED
Apache Spark 3.5.x: Required for ETL operations.
Glow: Dnaerys relies on Project Glow for VCF/BGEN parsing. It is automatically resolved during runtime and is the only reason we are still on Scala 2.12.

The ETL process transforms multi-sample VCF, BGEN, or Delta files into Dnaerys binary format. It runs as a Spark job and handles normalization, minimal representation, and functional and clinical annotations processing. To include annotations into Dnaerys datasets, the input files should already be annotated.

ETL options:

   // required
  --path  <arg>         source; path to multi-sample VCF/BGEN or Delta files to process
  --path2save  <arg>    final destination
  --sinfo  <arg>        path to csv file with sample names and genders
  --cohort  <arg>       cohort name, arbitrary string

  // # nodes
  --rings  <arg>        # of nodes in cluster

  // source format options, default is VCF/BGEN
  --delta               source is in Delta - https://delta.io

  // transformations
  --normalize           equals to Glow's 'normalize_variants'
  --reference  <arg>    path to reference genome (required for normalization only)
  --minrep              convert variants to minimal representation, equals to GATK' LeftAlignAndTrimVariants

  // annotations
  --vep                 extract VEP annotations
  --gnomad  <arg>       extract gnomAD AF annotations with provided prefix (default prefix in VEP is gnomAD_AF if --af_gnomad was used in VEP)
  --cadd                extract CADD_RAW/CADD_PHRED annotations
  --clinsig  <arg>      extract ClinVar Clinical significance annotations with provided prefix (default prefix in VEP is CLIN_SIG if --clin_sig_allele was used in VEP)
  --skipnotannotated    ignore unannotated variants when annotations are expected

  // filters
  --biallelic-only      load biallelic variants only
  --rto                 create runtime optimized cluster; default=false

  // dataset info
  --grch37              whether input files aligned to GRCh37; default=false
  --grch38              whether input files aligned to GRCh38; default=true
  --notes  <arg>        dataset notes

  // runtime options
  --sqlshuffle  <arg>   spark.sql.sqlshuffle.partitions; default=200
  --dump                dump variants to local tsv file for each partition
  --seq                 write rings on disk sequentially; default=false

  // help options & info
  -h, --help            Show help message
  -v, --version         Show version of this program

example:

$ spark-submit \
    --master local[*] \
    --conf spark.local.dir=/tmppath \
    --conf spark.driver.maxResultSize=XX \
    --driver-memory=XX \
    --packages=io.projectglow:glow-spark3_2.12:2.0.0 \
    --class org.dnaerys.etl \
    /path/dnaerys-ctl-1.17.2.jar \
    --path /path/to/input_dir/with/vcfs \
    --path2save /path/to/output/dnaerys_dataset \
    --sinfo  /path/to/samples.csv \
    --notes "VEP annotated dataset" \
    --cohort <some name> \
    --rings 8 \
    --grch38 \
    --vep \
    --gnomadg "gnomADg_AF" \
    --gnomade "gnomADe_AF" \
    --clinsig "CLIN_SIG" \
    --cadd \
    --alphamissense \
    --polyphen \
    --sift

Notes

Dnaerys uses native Glow's parser
Dnaerys supports Cloud Object Stores to the extent Spark supports it
--minrep similar to minRep in Hail. Limitations discussed here.
--normalize equal to bcftools norm and vt normalize and is performed by Glow. See details here and here.
--sinfo path to csv file with <sample_name,gender>. Single sample per line. Gender is male or female and samples names should be the same as in VCFs and in the same order. If gender is not known, see impute gender.
--delta expects data in Glow's schema saved in Delta. Dependency on delta should be mentioned when ETL job is submitted, e.g. --packages io.projectglow:glow-spark3_2.12:1.2.1,io.delta:delta-core_2.12:1.0.1

Dataset Info

Use the audit tool to inspect the health, versioning, history, and internal state of a dataset.

Audit options:

      --info           Show dataset details and serialization version
      --journal        View the immutable event log for the dataset
  -p, --path  <arg>    Path to the dataset directory
      --show-offsets   Display partition offsets
      --show-samples   List all sample names in the dataset

List all options:

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.audit --help

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.audit --path /path/to/dataset --info

License

Before a dataset can be served by the cluster, the license must be accepted and bound to the dataset path.

List all options:

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.license --help

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.license --accept --path /path/to/dataset

Merge clusters

MergeClusters merges two clusters with arbitrary datasets and number of nodes in each.

Cohorts from both datasets form a new dataset and cohorts composition is preserved. Variants from all cohorts are merged into a new dataset. Variants which exist only in a single dataset are treated as missing (aka 'no call') in another dataset.

Statistics (AC, AF, #het, #hom, etc) is recalculated taking into account zygosity/gender. Merging can be applied sequentially to resulted datasets, providing means to construct combined cohorts from smaller ones, keeping internal composition of original cohorts.

Algorithmically it takes a single path through all variants in each cluster and requires as much RAM as combination of two single partitions, one from each cluster (except a corner case when most of the variants in one of the dataset are located after variants from the last partition in another dataset).

cross references

In a common case, when a single cohort is partitioned during ETL, its parts can be processed separately and can be merged with the --same-cohort option. If variants in these partitions are also sorted and do not intersect in genomic coordinates, datasets can be combined via Combine clusters.
Merged cluster has the same number of nodes as the first cluster. Clusters can be rebalanced via Rebalance cluster if required.
When different datasets are merged, new multiallelic variants might emerge. They can be detected and handled after merge by Detect multiallelics.

List all options:

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.MergeClusters --help

$ scala \
    -J-Xmx128g \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.MergeClusters \
    --path1 /path/to/first_dataset \
    --path2 /path/to/second_dataset \
    --path2save /path/to/merged_dataset

Combine clusters

To combine different parts of the same cohort which went through ETL process separately.

Combine clusters copies partitions from both datasets to a final destination and updates annotation and metadata. Fast method as it does not read and merge variant data.

Hardware requirements are low even for the large datasets.

Prerequisite for this method is that partitions must represent different parts of the same cohort, i.e. variants in any partition should not intersect with variants in any other partition in global chromosomal order.

E.g. if we take a large initial cohort partitioned by genomic coordinates and process (ETL'ed) all partitions separately, we can combine resulted datasets in any order with each other, and then combine results of previous combine operations in any order with each other.

List all options:

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.CombineClusters --help

Rebalance cluster

After series of merges a cluster can become unbalanced. RebalanceCluster redistributes variants across the cluster nodes evenly.

List all options:

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.RebalanceCluster --help

Rebuild cluster

Running RebalanceCluster with desired number of nodes rebuilds the cluster.

Synthetic datasets

Synthesize generates synthetic dataset based on AF from given VCFs with synthetic GTs distributed according to HWE. Useful for performance / volume testing. Runs as a Spark job.

Synthetic dataset options:

  // required
  --path  <arg>          path to VCF/BGEN files with variants (samples data is ignored if exists)
  --path2save  <arg>     final destination
  --samples  <arg>       # of samples in synthetic cohort

  // # nodes
  --rings  <arg>         # of nodes in cluster

  // options
  --nocall-rate  <arg>   fraction of missing GTs, equals to (1 - Call Rate); default = 0.01
  --ann-rate  <arg>      fraction of each individual annotation *value* in dataset, default = 0.001
  --cohort  <arg>        cohort name; default = 'synthetic'
  --rto                  generate runtime optimized cluster; default=false

  // dataset info
  --notes  <arg>        dataset notes

  // runtime options
  --sqlshuffle  <arg>   spark.sql.sqlshuffle.partitions; default=200
  --seq                 write rings on disk sequentially; default=false

  // help options & info
  -h, --help            Show help message
  -v, --version         Show version of this program

$ spark-submit \
    --master local[*] \
    --packages=io.projectglow:glow-spark3_2.12:2.0.0 \
    --class org.dnaerys.utils.Synthesize \
    /path/dnaerys-ctl.jar \
    --path /path/to/vcfs \
    --path2save /path/to/output_dataset \
    --samples 'm' \
    --rings 'n' \
    --nocall-rate 0.005 \
    --rto \
    --cohort "cohort name" \
    --notes "dataset notes"

Runtime optimization

Experiments on real and synthetic data demonstrate that missing (no-call) GTs take the lion's share of the space in cohorts with 100K+ samples (while on much smaller cohorts, missing GTs are minor or comparable in size with heterozygous GTs).

RuntimeOptimize transfers dataset to runtime optimized mode by eliminating missing GTs. On large cohorts it can reduce requirements for RAM in cluster by more than a half. It keeps original dataset unmodified.

Missing GTs info is still required for off-line operations such as merging cohorts or removing samples, where it's essential to distinguish between missing and ref GTs, as well as for certain queries (which return virtual cohorts stats) and analytical queries (such as Kinship), so do not remove missing GTs info if you need any of those.

List all options:

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.RuntimeOptimize --help

Maintenance Utilities

Maintenance utilities support various maintenance tasks such as renaming cohorts, combining cohorts, renaming samples, removing samples, updating dataset notes and reference assembly, updating samples gender.

List all options:

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.Maintenance --help

Remove samples

Sometimes it's necessary to remove samples from existing dataset, whether due to QC reasons, relatedness, consent or other factors.

Removing samples takes a csv file with sample names to be removed and produces a new dataset without given samples. All variant related statistics is updated as if removed samples have never been in dataset. Algorithmically it takes a single path through all variants and requires about as much RAM as a single node.

Impute gender

Gender meta data can be updated if it was incorrect or missed during ETL. In principle, if dataset contains sufficient variants in X, the initial ETL can be performed with any gender data and later gender can be inferred by API call SexMismatchCheck.

--impute-females / --impute-males take a csv file with sample names to be updated. It triggers recalculation of variant statistics for variants in X, hence generates a new updated dataset.

Reference assembly update

--set-grch37 / --set-grch38 allow to update reference assembly meta information. This operation does not perform lift over of any variants, but it triggers recalculation of variant statistics (such as AF) for variants in PARs, hence generates a new updated dataset.

Combine cohorts

Sometimes datasets are processed in stages and get updated with additional samples. These updates can be processed (ETL'ed) separately. MergeClusters merges two clusters into a single one, preserving original cohort composition. --combine-cohort-with-next combines two cohorts inside a single cluster, so they logically become a single cohort.

Detect multiallelics

Multiallelic variants are automatically detected and handled during ETL when they are in the same rows in input files. In most cases that's sufficient and no other handling is required. Sometimes, multiallelic variants are split across multiple rows in input files. Other times, datasets are merged and new multiallelic variants emerge. --detect-multiallelic is for finding and handling these multiallelic variants.

Journal

Journaling system keeps track of major events in a life of a dataset. All commands that modify dataset's data, metadata or cluster layout are logged in journal and become an immutable part of dataset's life story. Journal's records can be examined via org.dnaerys.audit --journal

Importing from Hail

With the help of an intermediate step of saving data in Delta format, it is possible to import a dataset from Hail MatrixTable. Glow's Hail Interoperation documents the steps for converting Hail's dataset to Glow's schema and saving it in Delta. From there the dataset can be imported using the --delta flag in Dnaerys ETL process. Beware that VEP annotations will be lost if Glow's schema does not include the INFO_CSQ column.

Implementation notes

Ploidy: Dnaerys assumes diploid genotypes. Non-diploid entries in input files are ignored.
Sex Chromosomes: Heterozygous loci in males outside of PAR regions are treated as 1/2 alleles for statistical accuracy.
SVs: Structural Variants (SVs) are not currently supported.

Cloud Object Stores

Dnaerys natively supports S3, Azure Blob, and GCS through Spark's cloud integration.

Example AWS S3 Ingest:

export AWS_ACCESS_KEY="***"
export AWS_SECRET_KEY="***"

time ./spark-submit \
    --master local[*] \
    --packages=io.projectglow:glow-spark3_2.12:2.0.0,org.apache.spark:spark-hadoop-cloud_2.12:3.5.7 \
    --class org.dnaerys.etl \
    /path/dnaerys-ctl.jar \
    --path s3a://<bucket_name>/path/to/vcfs \
    --path2save /path/to/output_dataset \
    --sinfo /path/to/meta.csv \
    --cohort "cohort name" \
    --rings 4 \
    --grch38

Versioning and Compatibility

Versioning follows major.minor.maintenance pattern.
Major release reflects a major platform development or technology stack change.
Minor release reflects new functionality.
- Might introduce incompatible changes in data format and/or API. New functionality often means new fields in serialised data, which leads to incompatibility.
- Compatibility of data format is tracked by serialized data format version, which can be checked with --help and --info options on existing datasets and binaries. Increment in minor release doesn't always mean incompatible changes. Increment in data format version always does.
- Minor release might also include bug fixes which introduce incompatible changes.
Maintenance release reflects fixes and other updates. Always backward compatible in data formats and API.