Prerequisites


Dataset info

Dataset audit options

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.audit --help

 Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd 
 *** Community License ***

 Serialized data format version: 18 
 Audit options:

      --info           show dataset details
      --journal        major events in the life of a dataset
  -p, --path  <arg>    path to dataset
      --show-offsets   show offsets
      --show-samples   show samples names
  -h, --help           Show help message
  -v, --version        Show version of this program
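
For example, to show dataset details (the path is a placeholder):

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.audit --info --path /path/to/dataset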

Importing / ETL

ETL imports data from multi-sample VCF/BGEN or Delta files.

 ETL options:

  // required
  --path  <arg>         source; path to multi-sample VCF/BGEN or Delta files to process
  --path2save  <arg>    final destination
  --sinfo  <arg>        path to csv file with sample names and genders
  --cohort  <arg>       cohort name, arbitrary string

  // # nodes
  --rings  <arg>        # of nodes in cluster

  // source format options, default is VCF/BGEN
  --delta               source is in Delta - https://delta.io

  // transformations
  --normalize           equivalent to Glow's 'normalize_variants'
  --reference  <arg>    path to reference genome (required for normalization only)
  --minrep              convert variants to minimal representation, equivalent to GATK's LeftAlignAndTrimVariants

  // annotations
  --vep                 extract VEP annotations
  --gnomad  <arg>       extract gnomAD AF annotations with provided prefix (default prefix in VEP is gnomAD_AF if --af_gnomad was used in VEP)
  --cadd                extract CADD_RAW/CADD_PHRED annotations
  --clinsig  <arg>      extract ClinVar Clinical significance annotations with provided prefix (default prefix in VEP is CLIN_SIG if --clin_sig_allele was used in VEP)
  --skipnotannotated    ignore unannotated variants when annotations are expected

  // filters
  --biallelic-only      load biallelic variants only
  --rto                 create runtime optimized cluster; default=false

  // dataset info
  --grch37              whether input files aligned to GRCh37; default=false
  --grch38              whether input files aligned to GRCh38; default=true
  --notes  <arg>        dataset notes

  // runtime options
  --sqlshuffle  <arg>   spark.sql.shuffle.partitions; default=200
  --dump                dump variants to local tsv file for each partition
  --seq                 write rings on disk sequentially; default=false

  // help options & info
  -h, --help            Show help message
  -v, --version         Show version of this program

e.g.:

$ spark-submit \
    --master local[*] \
    --conf spark.local.dir=/tmp \
    --conf spark.driver.maxResultSize=8g \
    --driver-memory=8g \
    --packages=io.projectglow:glow-spark3_2.12:1.2.1 \
    --class org.dnaerys.etl \
    /path/dnaerys-ctl.jar \
    --path /path/to/vcfs \
    --path2save /path/to/output_dataset \
    --sinfo /path/to/meta.csv \
    --dump \
    --rings 8 \
    --vep \
    --clinsig clinvar_CLNSIG \
    --gnomad gnomAD_AF \
    --skipnotannotated \
    --cohort "cohort name" \
    --notes "dataset notes"

Notes


License

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.license --help

 Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd 
 *** Community License ***

 Serialized data format version: 18 
 License options:

      --accept        accept the License
      --path  <arg>   path to dataset
      --show          show the License
  -h, --help          Show help message
  -v, --version       Show version of this program
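
For example, to display the License for a dataset (the path is a placeholder):

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.license --show --path /path/to/dataset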

Merge clusters

MergeClusters merges two clusters, each with an arbitrary dataset and number of nodes.

Cohorts from both datasets form a new dataset, and cohort composition is preserved. Variants from all cohorts are merged into the new dataset. Variants which exist in only one dataset are treated as missed (aka 'no call') in the other dataset.

Statistics (AC, AF, #het, #hom, etc.) are recalculated, taking zygosity and gender into account. Merging can be applied sequentially to the resulting datasets, providing a means to construct combined cohorts from smaller ones while keeping the internal composition of the original cohorts.

Algorithmically, merging takes a single pass through all variants in each cluster and requires about as much RAM as two single partitions combined, one from each cluster (except for a corner case when most of the variants in one dataset are located after the variants from the last partition of the other dataset).

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.MergeClusters --help

 Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd 
 *** Community License ***

 Serialized data format version: 18 
 Merge clusters:

      --debug              debug output; default=false
      --dump               dump to tsv files; default=false
      --notes  <arg>       dataset notes
      --path1  <arg>       1st dataset to merge
      --path2  <arg>       2nd dataset to merge
      --path2save  <arg>   final destination
      --same-cohort        same cohort merge
  -h, --help               Show help message
  -v, --version            Show version of this program

e.g.:

$ scala \
    -J-Xmx128g \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.MergeClusters \
    --path1 /path/to/first_dataset \
    --path2 /path/to/second_dataset \
    --path2save /path/to/merged_dataset

Combine clusters

CombineClusters combines different parts of the same cohort that went through the ETL process separately.

CombineClusters copies partitions from both datasets to the final destination and updates annotations and metadata. It is a fast method, as it does not read or merge variant data.

Hardware requirements are low even for large datasets.

The prerequisite for this method is that the partitions represent different parts of the same cohort, i.e. variants in any partition must not intersect with variants in any other partition in global chromosomal order.

E.g., if we take a large initial cohort partitioned by genomic coordinates and ETL all partitions separately, we can combine the resulting datasets with each other in any order, and then combine the results of previous combine operations with each other in any order. An example invocation is shown after the options below.

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.CombineClusters --help

 Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd 
 *** Community License ***

 Serialized data format version: 18 
 Combine clusters:

      --notes  <arg>       dataset notes
      --path1  <arg>       1st dataset to combine
      --path2  <arg>       2nd dataset to combine
      --path2save  <arg>   final destination
  -h, --help               Show help message
  -v, --version            Show version of this program
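
e.g. (a sketch following the MergeClusters pattern above; paths are placeholders):

$ scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.CombineClusters \
    --path1 /path/to/first_part \
    --path2 /path/to/second_part \
    --path2save /path/to/combined_dataset

Since CombineClusters only copies partitions, it should not need the large heap that MergeClusters may require.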

Rebalance cluster

After a series of merges, a cluster can become unbalanced. RebalanceCluster redistributes variants evenly across the cluster nodes.

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.RebalanceCluster --help

 Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd 
 *** Community License ***

 Serialized data format version: 18 
 Rebalance cluster:

      --debug              debug output; default=false
      --dump               dump to tsv files; default=false
      --path  <arg>        dataset to rebalance
      --path2save  <arg>   final destination
      --rings  <arg>       # of nodes in rebalanced cluster; default=current
  -h, --help               Show help message
  -v, --version            Show version of this program
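
e.g. (a sketch; paths and the ring count are placeholders):

$ scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.RebalanceCluster \
    --path /path/to/dataset \
    --path2save /path/to/rebalanced_dataset \
    --rings 16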

Rebuild cluster

Running RebalanceCluster with the desired number of nodes (--rings) rebuilds the cluster.


Synthetic datasets

Synthesize generates a synthetic dataset based on the AF from given VCFs, with synthetic GTs distributed according to HWE (Hardy-Weinberg equilibrium). It is useful for performance / volume testing.
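
For intuition, a minimal Scala sketch of HWE-based genotype sampling (an illustration of the distribution only, not Dnaerys' actual implementation; all names are hypothetical):

import scala.util.Random

// For alt allele frequency p, HWE gives genotype probabilities:
// hom-ref (1-p)^2, het 2p(1-p), hom-alt p^2; a nocallRate fraction
// of GTs is replaced with no-calls ("./."), as --nocall-rate does.
def sampleGt(p: Double, nocallRate: Double, rng: Random): String = {
  if (rng.nextDouble() < nocallRate) "./."
  else {
    val homRef = (1 - p) * (1 - p)
    val het = 2 * p * (1 - p)
    val r = rng.nextDouble()
    if (r < homRef) "0/0" else if (r < homRef + het) "0/1" else "1/1"
  }
}

// e.g. sampleGt(0.01, 0.01, new Random(42)) returns "0/0" ~97% of the time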

 Synthetic dataset options:

  // required
  --path  <arg>          path to VCF/BGEN files with variants (sample data is ignored if present)
  --path2save  <arg>     final destination
  --samples  <arg>       # of samples in synthetic cohort

  // # nodes
  --rings  <arg>         # of nodes in cluster

  // options
  --nocall-rate  <arg>   fraction of missed GTs, equal to (1 - Call Rate); default = 0.01
  --ann-rate  <arg>      fraction of each individual annotation *value* in dataset, default = 0.001
  --cohort  <arg>        cohort name; default = 'synthetic'
  --rto                  generate runtime optimized cluster; default=false

  // dataset info
  --notes  <arg>        dataset notes

  // runtime options
  --sqlshuffle  <arg>   spark.sql.shuffle.partitions; default=200
  --seq                 write rings on disk sequentially; default=false

  // help options & info
  -h, --help            Show help message
  -v, --version         Show version of this program

e.g.:

$ spark-submit \
    --master local[*] \
    --packages=io.projectglow:glow-spark3_2.12:1.2.1 \
    --class org.dnaerys.utils.Synthesize \
    /path/dnaerys-ctl.jar \
    --path /path/to/vcfs \
    --path2save /path/to/output_dataset \
    --samples 'm' \
    --rings 'n' \
    --nocall-rate 0.005 \
    --rto \
    --cohort "cohort name" \
    --notes "dataset notes"

Runtime optimization

Experiments on real and synthetic data demonstrate that missed (no-call) GTs take the lion's share of the space in cohorts with 100K+ samples (while in much smaller cohorts, missed GTs are minor or comparable in size to heterozygous GTs).

RuntimeOptimize transfers a dataset to runtime-optimized mode by eliminating missed GTs. On large cohorts it can reduce the cluster's RAM requirements by more than half. The original dataset is kept unmodified.

Missed GT info is still required for off-line operations such as merging cohorts or removing samples, where it is essential to distinguish between missed and ref GTs, as well as for certain queries (those which return virtual cohort stats) and analytical queries (such as Kinship). Do not remove missed GT info if you need any of those.

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.RuntimeOptimize --help

 Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd 
 *** Community License ***

 Serialized data format version: 18 
 Runtime Optimize cluster:

      --path  <arg>        dataset to optimize
      --path2save  <arg>   final destination
  -h, --help               Show help message
  -v, --version            Show version of this program
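
e.g. (a sketch; paths are placeholders):

$ scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.RuntimeOptimize \
    --path /path/to/dataset \
    --path2save /path/to/optimized_dataset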

Misc maintenance

Maintenance utilities support various tasks such as renaming cohorts, combining cohorts, renaming samples, removing samples, updating dataset notes and reference assembly, and updating sample gender.

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.utils.Maintenance --help

 Dnaerys v. 1.14.2 (c) 2022 Dnaerys Pty Ltd 
 *** Community License ***

 Serialized data format version: 18 
 Maintenance options:

      --cohort  <arg>              cohort name
      --combine-cohort-with-next   combine --cohort with the next cohort in dataset
      --detect-multiallelic        detect multiallelic variants in dataset
      --impute-females  <arg>      path to csv file with sample names to set as females
      --impute-males  <arg>        path to csv file with sample names to set as males
  -p, --path  <arg>                path to dataset
      --path2save  <arg>           final destination for new dataset
      --remove-samples  <arg>      path to csv file with sample names to remove from dataset
      --rename-cohort-to  <arg>    set a new cohort name for --cohort
      --rename-samples-to  <arg>   path to csv file with new sample names for --cohort
      --set-grch37                 sets the reference assembly to GRCh37
      --set-grch38                 sets the reference assembly to GRCh38
      --update-notes  <arg>        new dataset notes
  -h, --help                       Show help message
  -v, --version                    Show version of this program

Remove samples

Sometimes it is necessary to remove samples from an existing dataset, whether due to QC reasons, relatedness, consent or other factors.

Removing samples takes a csv file with the sample names to be removed and produces a new dataset without the given samples. All variant-related statistics are updated as if the removed samples had never been in the dataset. Algorithmically it takes a single pass through all variants and requires about as much RAM as a single node.
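
e.g. (a sketch using the Maintenance options listed above; paths are placeholders):

$ scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.Maintenance \
    --path /path/to/dataset \
    --path2save /path/to/new_dataset \
    --remove-samples /path/to/samples_to_remove.csv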


Impute gender

Gender metadata can be updated if it was incorrect or missing during ETL. In principle, if the dataset contains sufficient variants in X, the initial ETL can be performed with any gender data, and gender can later be inferred via the SexMismatchCheck API call.

--impute-females / --impute-males take a csv file with the sample names to be updated. This triggers recalculation of variant statistics for variants in X and hence generates a new, updated dataset.
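
e.g. (a sketch; paths are placeholders):

$ scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.Maintenance \
    --path /path/to/dataset \
    --path2save /path/to/new_dataset \
    --impute-males /path/to/male_samples.csv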


Reference assembly update

--set-grch37 / --set-grch38 allow updating the reference assembly meta information. This operation does not lift over any variants, but it triggers recalculation of variant statistics (such as AF) for variants in PARs and hence generates a new, updated dataset.


Combine cohorts

Sometimes datasets are processed in stages and get updated with additional samples. These updates can be processed (ETL'ed) separately. MergeClusters merges two clusters into a single one, preserving the original cohort composition. --combine-cohort-with-next combines two cohorts inside a single cluster, so that they logically become a single cohort, as sketched below.
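
e.g. (a sketch; the cohort name and paths are placeholders):

$ scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.Maintenance \
    --path /path/to/dataset \
    --path2save /path/to/new_dataset \
    --cohort "first cohort" \
    --combine-cohort-with-next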


Detect multiallelics

Multiallelic variants are automatically detected and handled during ETL when they appear in the same row in the input files. In most cases that is sufficient and no further handling is required. Sometimes, however, multiallelic variants are split across multiple rows in the input files; other times, datasets are merged and new multiallelic variants emerge. --detect-multiallelic finds and handles these multiallelic variants.


Journal

The journaling system keeps track of major events in the life of a dataset. All commands that modify a dataset's data, metadata or cluster layout are logged in the journal and become an immutable part of the dataset's history. Journal records can be examined via org.dnaerys.audit --journal.
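
For example (the path is a placeholder):

$ scala -cp /path/dnaerys-ctl.jar org.dnaerys.audit --journal --path /path/to/dataset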


Importing from Hail

With an intermediate step of saving data in Delta format, it is possible to import a dataset from a Hail MatrixTable. Glow's Hail Interoperation documentation describes the steps for converting a Hail dataset to Glow's schema and saving it in Delta. From there the dataset can be imported using the --delta flag in the Dnaerys ETL process. Beware that VEP annotations will be lost if Glow's schema does not include the INFO_CSQ column.
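
A sketch of the final import step, i.e. a regular ETL run with --delta added (paths are placeholders):

$ spark-submit \
    --master local[*] \
    --packages=io.projectglow:glow-spark3_2.12:1.2.1 \
    --class org.dnaerys.etl \
    /path/dnaerys-ctl.jar \
    --path /path/to/delta_table \
    --path2save /path/to/output_dataset \
    --sinfo /path/to/meta.csv \
    --delta \
    --rings 4 \
    --cohort "cohort name"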


Implementation notes


Cloud Object Stores

Dnaerys supports Cloud Object Stores in ETL to the extent Spark supports them. Please read about configuration on the dedicated Spark page - Integration with Cloud Infrastructures.

Dnaerys also supports Cloud Object Stores as datasets source in Kubernetes clusters.

Below is an example of reading VCFs from AWS S3 in ETL with Spark 3.2.1 and Glow 1.2.1:

export AWS_ACCESS_KEY="***"
export AWS_SECRET_KEY="***"

time ./spark-submit \
    --master local[*] \
    --conf spark.local.dir=/tmp \
    --conf spark.driver.maxResultSize=8g \
    --driver-memory=8g \
    --packages=io.projectglow:glow-spark3_2.12:1.2.1,org.apache.spark:spark-hadoop-cloud_2.12:3.2.1 \
    --class org.dnaerys.etl \
    /path/dnaerys-ctl.jar \
    --path s3a://<bucket_name>/path/to/vcfs \
    --path2save /path/to/output_dataset \
    --sinfo /path/to/meta.csv \
    --cohort "cohort name" \
    --notes "dataset notes" \
    --rings 4 \
    --grch37

Versioning and Compatibility