Prerequisites
- Scala 2.12.*
  - NB. Coursier provides a simple way to install Scala and manage Scala environments and artifacts. Once cs is installed, run e.g. cs install scala:2.12.15. Coursier can also install an appropriate JVM if none is found.
- Apache Spark 3.1.* & 3.2.* (ETL only)
  - ETL depends on Glow, which is downloaded at runtime
  - NB. Glow 1.2.* works with Spark 3.1.* & 3.2.* on Hadoop 2.7 & 3.2
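For example, a minimal local Scala setup with Coursier (assuming the cs launcher is already on the PATH; the version matches the prerequisite above) might look like:
$ cs install scala:2.12.15
$ scala -version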
Dataset's info
Dataset audit options
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.audit --help
Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd
*** Community License ***
Serialized data format version: 18
Audit options:
--info show dataset details
--journal major events in a life of dataset
-p, --path <arg> path to dataset
--show-offsets show offsets
--show-samples show samples names
-h, --help Show help message
-v, --version Show version of this program
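For example, to show dataset details (the dataset path is a placeholder):
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.audit --path /path/to/dataset --info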
Importing / ETL
Importing data from multi-sample VCF/BGEN or Delta files.
ETL options:
// required
--path <arg> source; path to multi-sample VCF/BGEN or Delta files to process
--path2save <arg> final destination
--sinfo <arg> path to csv file with sample names and genders
--cohort <arg> cohort name, arbitrary string
// # nodes
--rings <arg> # of nodes in cluster
// source format options, default is VCF/BGEN
--delta source is in Delta - https://delta.io
// transformations
--normalize equals to Glow's 'normalize_variants'
--reference <arg> path to reference genome (required for normalization only)
--minrep convert variants to minimal representation, equals to GATK's LeftAlignAndTrimVariants
// annotations
--vep extract VEP annotations
--gnomad <arg> extract gnomAD AF annotations with provided prefix (default prefix in VEP is gnomAD_AF if --af_gnomad was used in VEP)
--cadd extract CADD_RAW/CADD_PHRED annotations
--clinsig <arg> extract ClinVar Clinical significance annotations with provided prefix (default prefix in VEP is CLIN_SIG if --clin_sig_allele was used in VEP)
--skipnotannotated ignore unannotated variants when annotations are expected
// filters
--biallelic-only load biallelic variants only
--rto create runtime optimized cluster; default=false
// dataset info
--grch37 whether input files aligned to GRCh37; default=false
--grch38 whether input files aligned to GRCh38; default=true
--notes <arg> dataset notes
// runtime options
--sqlshuffle <arg> spark.sql.shuffle.partitions; default=200
--dump dump variants to local tsv file for each partition
--seq write rings on disk sequentially; default=false
// help options & info
-h, --help Show help message
-v, --version Show version of this program
e.g.
$ spark-submit \
--master local[*] \
--conf spark.local.dir=/tmp \
--conf spark.driver.maxResultSize=8g \
--driver-memory=8g \
--packages=io.projectglow:glow-spark3_2.12:1.2.1 \
--class org.dnaerys.etl \
/path/dnaerys-ctl.jar \
--path /path/to/vcfs \
--path2save /path/to/output_dataset \
--sinfo /path/to/meta.csv \
--dump \
--rings 8 \
--vep \
--clinsig clinvar_CLNSIG \
--gnomad gnomAD_AF \
--skipnotannotated \
--cohort "cohort name" \
--notes "dataset notes"
Notes
- Dnaerys uses Glow's native parser
- Dnaerys supports Cloud Object Stores to the extent Spark supports them
- --minrep is similar to minRep in Hail. Limitations are discussed here.
- --normalize is equivalent to bcftools norm and vt normalize and is performed by Glow. See details here and here.
- --sinfo is a path to a csv file with <sample_name,gender>, one sample per line. Gender is male or female; sample names must match the names in the VCFs and appear in the same order. If gender is not known, see Impute gender. An example file is shown after this list.
- --delta expects data in Glow's schema saved in Delta. The Delta dependency must be added when the ETL job is submitted, e.g. --packages io.projectglow:glow-spark3_2.12:1.2.1,io.delta:delta-core_2.12:1.0.1
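For example, a meta.csv passed via --sinfo could look like the following (sample names are placeholders and must match the samples in the VCFs):
sample_001,male
sample_002,female
sample_003,female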
License
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.license --help
Dnaerys v. 1.14.1 (c) 2022 Dnaerys Pty Ltd
*** Community License ***
Serialized data format version: 18
License options:
--accept accept the License
--path <arg> path to dataset
--show show the License
-h, --help Show help message
-v, --version Show version of this program
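For example, to display and then accept the License on a dataset (the path is a placeholder):
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.license --path /path/to/dataset --show
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.license --path /path/to/dataset --accept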
Merge clusters
MergeClusters merges two clusters with arbitrary datasets and number of nodes in each.
Cohorts from both datasets form a new dataset and the cohort composition is preserved. Variants from all cohorts are merged into the new dataset. Variants which exist in only one dataset are treated as missed (aka 'no call') in the other dataset.
Statistics (AC, AF, #het, #hom, etc.) are recalculated taking zygosity/gender into account. Merging can be applied sequentially to the resulting datasets, providing a means to construct combined cohorts from smaller ones while keeping the internal composition of the original cohorts.
Algorithmically it takes a single pass through all variants in each cluster and requires as much RAM as a combination of two single partitions, one from each cluster (except for a corner case when most of the variants in one dataset are located after the variants from the last partition in the other dataset).
Cross references
- In a common case, when a single cohort is partitioned during ETL, its parts can be processed separately and then merged with the --same-cohort option. If variants in these partitions are also sorted and do not intersect in genomic coordinates, the datasets can be combined via Combine clusters.
- The merged cluster has the same number of nodes as the first cluster. Clusters can be rebalanced via Rebalance cluster if required.
- When different datasets are merged, new multiallelic variants might emerge. They can be detected and handled after the merge via Detect multiallelics.
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.MergeClusters --help
Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd
*** Community License ***
Serialized data format version: 18
Merge clusters:
--debug debug output; default=false
--dump dump to tsv files; default=false
--notes <arg> dataset notes
--path1 <arg> 1st dataset to merge
--path2 <arg> 2nd dataset to merge
--path2save <arg> final destination
--same-cohort same cohort merge
-h, --help Show help message
-v, --version Show version of this program
e.g.:
$ scala \
-J-Xmx128g \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.MergeClusters \
--path1 /path/to/first_dataset \
--path2 /path/to/second_dataset \
--path2save /path/to/merged_dataset
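When the two datasets are partitions of the same cohort (see Cross references above), the same invocation can be run with the --same-cohort flag; a sketch with placeholder paths:
$ scala \
-J-Xmx128g \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.MergeClusters \
--path1 /path/to/first_part \
--path2 /path/to/second_part \
--path2save /path/to/merged_dataset \
--same-cohort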
Combine clusters
CombineClusters combines different parts of the same cohort which went through the ETL process separately.
It copies partitions from both datasets to a final destination and updates annotations and metadata. It is a fast method as it does not read or merge variant data.
Hardware requirements are low even for large datasets.
The prerequisite for this method is that the partitions must represent different parts of the same cohort, i.e. variants in any partition should not intersect with variants in any other partition in global chromosomal order.
E.g. if we take a large initial cohort partitioned by genomic coordinates and process (ETL) all partitions separately, we can combine the resulting datasets in any order with each other, and then combine the results of previous combine operations in any order with each other.
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.CombineClusters --help
Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd
*** Community License ***
Serialized data format version: 18
Combine clusters:
--notes <arg> dataset notes
--path1 <arg> 1st dataset to combine
--path2 <arg> 2nd dataset to combine
--path2save <arg> final destination
-h, --help Show help message
-v, --version Show version of this program
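For example (paths are placeholders):
$ scala \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.CombineClusters \
--path1 /path/to/first_part \
--path2 /path/to/second_part \
--path2save /path/to/combined_dataset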
Rebalance cluster
After a series of merges a cluster can become unbalanced. RebalanceCluster redistributes variants evenly across the cluster nodes.
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.RebalanceCluster --help
Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd
*** Community License ***
Serialized data format version: 18
Rebalance cluster:
--debug debug output; default=false
--dump dump to tsv files; default=false
--path <arg> dataset to rebalance
--path2save <arg> final destination
--rings <arg> # of nodes in rebalanced cluster; default=current
-h, --help Show help message
-v, --version Show version of this program
Rebuild cluster
Running RebalanceCluster with desired number of nodes rebuilds the cluster.
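For example, rebalancing a dataset onto a different number of nodes might look like (paths and the node count are placeholders):
$ scala \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.RebalanceCluster \
--path /path/to/dataset \
--path2save /path/to/rebalanced_dataset \
--rings 16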
Synthetic datasets
Synthesize generates a synthetic dataset based on AFs from given VCFs, with synthetic GTs distributed according to HWE (i.e. for an alternate allele frequency q, hom-ref/het/hom-alt genotypes are drawn with probabilities (1-q)^2, 2q(1-q) and q^2). Useful for performance / volume testing.
Synthetic dataset options:
// required
--path <arg> path to VCF/BGEN files with variants (samples data is ignored if exists)
--path2save <arg> final destination
--samples <arg> # of samples in synthetic cohort
// # nodes
--rings <arg> # of nodes in cluster
// options
--nocall-rate <arg> fraction of missed GTs, equals to (1 - Call Rate); default = 0.01
--ann-rate <arg> fraction of each individual annotation *value* in dataset, default = 0.001
--cohort <arg> cohort name; default = 'synthetic'
--rto generate runtime optimized cluster; default=false
// dataset info
--notes <arg> dataset notes
// runtime options
--sqlshuffle <arg> spark.sql.shuffle.partitions; default=200
--seq write rings on disk sequentially; default=false
// help options & info
-h, --help Show help message
-v, --version Show version of this program
e.g.:
$ spark-submit \
--master local[*] \
--packages=io.projectglow:glow-spark3_2.12:1.2.1 \
--class org.dnaerys.utils.Synthesize \
/path/dnaerys-ctl.jar \
--path /path/to/vcfs \
--path2save /path/to/output_dataset \
--samples 'm' \
--rings 'n' \
--nocall-rate 0.005 \
--rto \
--cohort "cohort name" \
--notes "dataset notes"
Runtime optimization
Experiments on real and synthetic data demonstrate that missed (no-call) GTs take the lion's share of the space in cohorts with 100K+ samples (while in much smaller cohorts, missed GTs are minor or comparable in size to heterozygous GTs).
RuntimeOptimize transfers a dataset to runtime optimized mode by eliminating missed GTs. On large cohorts it can reduce the cluster's RAM requirements by more than half. The original dataset is kept unmodified.
Information about missed GTs is still required for off-line operations such as merging cohorts or removing samples, where it is essential to distinguish between missed and ref GTs, as well as for certain queries (which return virtual cohort stats) and analytical queries (such as Kinship), so do not remove missed GT info if you need any of those.
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.RuntimeOptimize --help
Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd
*** Community License ***
Serialized data format version: 18
Runtime Optimize cluster:
--path <arg> dataset to optimize
--path2save <arg> final destination
-h, --help Show help message
-v, --version Show version of this program
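For example (paths are placeholders):
$ scala \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.RuntimeOptimize \
--path /path/to/dataset \
--path2save /path/to/optimized_dataset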
Misc maintenance
Maintenance utilities support various maintenance tasks such as renaming cohorts, combining cohorts, renaming samples, removing samples, updating dataset notes and reference assembly, and updating sample gender.
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.Maintenance --help
Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd
*** Community License ***
Serialized data format version: 18
Maintenance options:
--cohort <arg> cohort name
--combine-cohort-with-next combine --cohort with the next cohort in dataset
--detect-multiallelic detect multiallelic variants in dataset
--impute-females <arg> path to csv file with sample names to set as females
--impute-males <arg> path to csv file with sample names to set as males
-p, --path <arg> path to dataset
--path2save <arg> final destination for new dataset
--remove-samples <arg> path to csv file with sample names to remove from dataset
--rename-cohort-to <arg> set a new cohort name for --cohort
--rename-samples-to <arg> path to csv file with new sample names for --cohort
--set-grch37 sets the reference assembly to GRCh37
--set-grch38 sets the reference assembly to GRCh38
--update-notes <arg> new dataset notes
-h, --help Show help message
-v, --version Show version of this program
Remove samples
Sometimes it is necessary to remove samples from an existing dataset, whether due to QC reasons, relatedness, consent or other factors.
Removing samples takes a csv file with the sample names to be removed and produces a new dataset without those samples. All variant-related statistics are updated as if the removed samples had never been in the dataset. Algorithmically it takes a single pass through all variants and requires about as much RAM as a single node.
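A sketch of a removal run via the Maintenance utility (paths are placeholders):
$ scala \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.Maintenance \
--path /path/to/dataset \
--remove-samples /path/to/samples_to_remove.csv \
--path2save /path/to/new_dataset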
Impute gender
Gender metadata can be updated if it was incorrect or missing during ETL.
In principle, if the dataset contains sufficient variants on X, the initial ETL can be performed with any gender data and gender can later be inferred via the SexMismatchCheck API call.
--impute-females / --impute-males take a csv file with the sample names to be updated.
This triggers recalculation of variant statistics for variants on X and hence generates a new updated dataset.
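A sketch of such an update via the Maintenance utility (paths are placeholders):
$ scala \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.Maintenance \
--path /path/to/dataset \
--impute-females /path/to/female_samples.csv \
--path2save /path/to/new_dataset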
Reference assembly update
--set-grch37 / --set-grch38 allow updating the reference assembly meta information.
This operation does not perform a liftover of any variants, but it triggers recalculation of variant statistics (such as AF) for variants in PARs and hence generates a new updated dataset.
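For example, marking a dataset as aligned to GRCh38 (paths are placeholders):
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.Maintenance --path /path/to/dataset --set-grch38 --path2save /path/to/new_dataset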
Combine cohorts
Sometimes datasets are processed in stages and get updated with additional samples.
These updates can be processed (ETL'ed) separately.
MergeClusters merges two clusters into a single one, preserving the original cohort composition.
--combine-cohort-with-next combines two cohorts inside a single cluster, so they logically become a single cohort.
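A sketch of combining a cohort with the next one in the dataset (the cohort name and paths are placeholders):
$ scala \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.Maintenance \
--path /path/to/dataset \
--cohort "cohort A" \
--combine-cohort-with-next \
--path2save /path/to/new_dataset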
Detect multiallelics
Multiallelic variants are automatically detected and handled during ETL when they are in the same rows in the input files. In most cases that is sufficient and no other handling is required.
Sometimes multiallelic variants are split across multiple rows in input files; other times, datasets are merged and new multiallelic variants emerge.
--detect-multiallelic finds and handles these multiallelic variants.
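For example (paths are placeholders):
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.Maintenance --path /path/to/dataset --detect-multiallelic --path2save /path/to/new_dataset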
Journal
The journaling system keeps track of major events in the life of a dataset.
All commands that modify a dataset's data, metadata or cluster layout are logged in the journal and become an immutable part of the dataset's life story. Journal records can be examined via org.dnaerys.audit --journal.
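For example (the dataset path is a placeholder):
$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.audit --path /path/to/dataset --journal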
Importing from Hail
With the help of an intermediate step of saving data in Delta format, it is possible to import a dataset from a Hail MatrixTable. Glow's Hail Interoperation documentation covers the steps for converting a Hail dataset to Glow's schema and saving it in Delta. From there the dataset can be imported using the --delta flag in the Dnaerys ETL process.
Beware that VEP annotations will be lost if Glow's schema does not include the INFO_CSQ column.
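A sketch of such an import (paths are placeholders; note the extra Delta package, as described in the ETL notes above):
$ spark-submit \
--master local[*] \
--packages=io.projectglow:glow-spark3_2.12:1.2.1,io.delta:delta-core_2.12:1.0.1 \
--class org.dnaerys.etl \
/path/dnaerys-ctl.jar \
--path /path/to/delta_table \
--delta \
--path2save /path/to/output_dataset \
--sinfo /path/to/meta.csv \
--rings 4 \
--cohort "cohort name"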
Implementation notes
- There is some degree of uncertainty in how to count heterozygous loci on sex chromosomes in males, since true ones exist only in PARs. To resolve this, they are counted as 1/2 of het alleles in data outside PARs.
- Dnaerys expects all genotypes to be diploid, as in normal GATK output, and calculates statistics based on this assumption. All non-diploid GTs in input files are ignored.
- Genes usually have multiple transcripts (around 10+ on average), each of which is annotated separately. Annotations for all transcripts are preserved and assigned to a single allele.
- SVs are currently not supported.
Cloud Object Stores
Dnaerys supports Cloud Object Stores in ETL to the extent Spark supports them. Please read about configuration on the dedicated Spark page - Integration with Cloud Infrastructures.
Dnaerys also supports Cloud Object Stores as datasets source in Kubernetes clusters.
Below is an example of reading VCFs from AWS S3 in ETL with Spark 3.2.1 and Glow 1.2.1:
export AWS_ACCESS_KEY="***"
export AWS_SECRET_KEY="***"
time ./spark-submit \
--master local[*] \
--conf spark.local.dir=/tmp \
--conf spark.driver.maxResultSize=8g \
--driver-memory=8g \
--packages=io.projectglow:glow-spark3_2.12:1.2.1,org.apache.spark:spark-hadoop-cloud_2.12:3.2.1 \
--class org.dnaerys.etl \
/path/dnaerys-ctl.jar \
--path s3a://<bucket_name>/path/to/vcfs \
--path2save /path/to/output_dataset \
--sinfo /path/to/meta.csv \
--cohort "cohort name" \
--notes "dataset notes" \
--rings 4 \
--grch37
Versioning and Compatibility
- Versioning follows the major.minor.maintenance pattern.
- A major release reflects a major platform development or technology stack change.
- A minor release reflects new functionality.
  - It might introduce incompatible changes in data format and/or API. New functionality often means new fields in serialized data, which leads to incompatibility.
  - Compatibility of the data format is tracked by the serialized data format version, which can be checked with the --help and --info options on existing datasets and binaries. An increment in the minor version doesn't always mean incompatible changes; an increment in the data format version always does.
  - A minor release might also include bug fixes which introduce incompatible changes.
- A maintenance release reflects fixes and other updates. It is always backward compatible in data formats and API.