What
Dnaerys is a state-of-the-art, distributed, horizontally scalable, in-memory, low-latency genome variant store - a specialized analytical database that stores genetic variation and executes algorithms from the field of genetics.
- It provides millisecond-range response times on datasets with hundreds of thousands of WGS samples across the cluster
Why
On a small scale, genomic data can be managed well in almost any modern DBMS.
On a large scale, with datasets spanning hundreds of thousands of WGS samples, DBMS response times become a bottleneck for downstream applications.
Thesis
General-purpose DBMSs are suboptimal for implementing low-latency, scalable variant stores.
From the perspective of query plan optimization and code generation, specialized systems can avoid multiple stages of query execution that are required in general-purpose systems:
- generating the optimal execution plan for an arbitrary query is known to be an NP-complete problem, meaning no efficient exact algorithm for it is known, so query optimizers have to employ approximate solutions and heuristics
- purpose-built systems, on the other hand, can implement optimal query execution plans directly, without runtime query optimizers, runtime code generation, or plan interpretation, since the data model and queries are often known a priori
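As a minimal sketch of the latter point (illustrative Java, not Dnaerys source; the class, field names and query shape are invented here), a purpose-built path for one known query - counting variants in a genomic region whose alternate allele frequency is below a threshold - can be written directly against the storage layout, with no optimizer, plan interpreter or code generation on the hot path:

```java
/** Hypothetical hand-specialized query path: the "plan" is fixed at compile time. */
final class RegionAfQuery {
    private final int[] positions;   // variant positions, sorted ascending (columnar)
    private final double[] altFreq;  // alternate allele frequency per variant (columnar)

    RegionAfQuery(int[] positions, double[] altFreq) {
        this.positions = positions;
        this.altFreq = altFreq;
    }

    /** Optimal plan for this query shape, written in advance:
     *  binary-search the region start, then one tight predicated scan. */
    int countRareVariants(int start, int end, double maxAf) {
        int count = 0;
        for (int i = lowerBound(positions, start); i < positions.length && positions[i] <= end; i++) {
            if (altFreq[i] < maxAf) count++;
        }
        return count;
    }

    // First index whose position is >= key.
    private static int lowerBound(int[] a, int key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }
}
```

A general-purpose engine has to arrive at an equivalent loop through parsing, optimization and plan interpretation or code generation at runtime; in a purpose-built system those stages simply do not exist.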
From the perspective of the storage model, in the realm of relational DBMSs:
- Row-oriented databases (N-ary storage model, NSM) are notoriously unfit for storing genomic data in memory: for relatively large datasets, the amount of required RAM becomes prohibitively expensive.
- Column-oriented databases (decomposition/hybrid storage models, DSM/PAX) are significantly more efficient at encoding & compression than NSM systems; nevertheless, the amount of main memory required for large genomic datasets remains beyond the limit of practically feasible systems in most implementations.
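For intuition about the memory argument, here is an illustrative sketch (not Dnaerys' actual encoding) of a biallelic genotype column packed at 2 bits per sample:

```java
/** Hypothetical 2-bit genotype packing for one biallelic variant column:
 *  0 = hom-ref, 1 = het, 2 = hom-alt, 3 = missing; four genotypes per byte. */
final class PackedGenotypeColumn {
    private final byte[] packed;

    PackedGenotypeColumn(byte[] genotypes) { // input: one byte per sample, values 0..3
        packed = new byte[(genotypes.length + 3) / 4];
        for (int i = 0; i < genotypes.length; i++) {
            packed[i >> 2] |= (genotypes[i] & 0b11) << ((i & 3) << 1);
        }
    }

    /** Genotype of the given sample (0..3). */
    int get(int sample) {
        return (packed[sample >> 2] >>> ((sample & 3) << 1)) & 0b11;
    }
}
```

At this density, genotypes for a 500,000-sample cohort take ~122 KiB per variant, versus ~488 KiB at one byte per genotype, before any further compression - the kind of constant factor that decides whether an in-memory cluster at WGS scale is feasible.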
From the perspective of primary data allocation:
- Disk-based systems carry a huge performance cost compared with systems designed for in-memory operation.
- Design reasons: managing internal buffer pools brings significant overhead. This is simple to see by placing all database files on RAM disks and thus eliminating disk I/O overheads: main-memory designed systems still outperform their disk-based DBMS counterparts.
- Access time reasons: even the fastest persistent local storage is slower than RAM by ~3 orders of magnitude (DRAM access takes on the order of 100 ns, while NVMe read latency is in the tens to hundreds of microseconds).
- Cloud-based data warehouse and data lakehouse solutions with storage layers on object stores (Snowflake, Redshift, Databricks, Athena) are at a disadvantage against in-memory systems with respect to latency and throughput.
- the CPU and latency overheads of accessing data over the network are huge (including serialization/deserialization & TLS overheads)
- with object stores, data is (mostly) pulled to the query, as opposed to pushing the query to the data
- Even shared-nothing distributed OLAP DBMSs (e.g. Kudu) on fast persistent local storage (NVMe) are at a disadvantage against main-memory based systems.
-
Common solutions
- Common solutions for horizontally scalable variant stores rely on general-purpose disk-based technologies, e.g.:
- HBase for Genomics England and Lifebit, BigQuery (Dremel) for Broad Institute of MIT and Harvard (Genomic Variant Store) and Google Life Sciences, Amazon Athena on S3 Data Lake for Amazon Omics, Elasticsearch for Seqr (Broad), Kudu for Garvan MRI, and so on
- in other cases, variant store implementations rely on specialised, purpose-built disk-based solutions, e.g.:
- Hail (Broad), GenomicsDB (Intel & Broad), TileDB-VCF (TileDB Inc & Helix), Genomic Variant Store (Broad; custom file format + BigQuery, so it's a mix), BC Platforms, et cetera
- disk-oriented design places them at a 3+ orders of magnitude disadvantage against in-memory purpose-built systems
What is it good for?
- GWAS
- real-time interactive applications
- machine learning
- analytics
- precision medicine
Design Principles
- distributed: a homogeneous distributed database management system - identical software runs on all database nodes, all nodes have equal roles, and any node can be used for querying
- in-memory: each node stores all of its data in RAM
- horizontally scalable: query latency stays within the same constraints as data volume & the number of nodes grow
- shared-nothing architecture
- OLAP
- column-oriented
- materialization query execution model
- bottom-to-top query execution with push-based task assignment
- PA/EC system in the PACELC space
- Available in the presence of a network partition
- sub-clusters continue to serve their respective clients, marking responses as potentially incomplete for the queries affected by the partition
- Consistency over Latency in the absence of a network partition
- can operate as a PC/EC system if required
- split-brain resolver is a configuration option
- resilient
- no single point of failure: every node in the cluster has an identical role and can service any request; data is partitioned across all nodes
- graceful degradation: in the event of node failures, the cluster continues to serve requests with the data available on the remaining nodes, with responses marked as potentially incomplete for the queries affected by the failures
- φ accrual failure detector for detecting unreachable nodes - flexible and configurable to adapt to a variety of environments (with a wide range of network latencies between nodes and their deviations) and requirements (how fast recovery needs to happen); see the sketch after this list
- push-pull gossip protocol for cluster state awareness (see the sketch after this list)
- data corruption detection: all data passes through several layers of validation and integrity checks, including non-cryptographic hashes (see the sketch after this list)
- message-based cluster architecture
- actor-based & MT-based parallelization
- code specialization
- no code generation/compilation at runtime (no need, as the data model and queries are known a priori, hence query implementation code is written in advance)
- low-level h/w specialization is carried out by the JVM (SIMD, NUMA)
- high-level specialization is carried out by the API implementation
- cloud-scale oriented
- runs on Kubernetes clusters
- runs on infrastructure from laptops to enterprise and cloud
- designed with a focus on Observability and Ops
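The φ accrual failure detector mentioned above is a published algorithm (Hayashibara et al., 2004); the sketch below is a textbook-style Java rendition, not Dnaerys' implementation - the window size, the normal-distribution model and the CDF approximation are illustrative choices:

```java
import java.util.ArrayDeque;

/** Sketch of a phi accrual failure detector: instead of a boolean "up/down",
 *  it emits a continuous suspicion level phi derived from observed heartbeat
 *  inter-arrival times, so thresholds can be tuned per environment. */
final class PhiAccrualDetector {
    private final ArrayDeque<Double> intervals = new ArrayDeque<>();
    private final int windowSize;               // number of past intervals to keep
    private double lastHeartbeatMs = Double.NaN;

    PhiAccrualDetector(int windowSize) { this.windowSize = windowSize; }

    /** Record a heartbeat arrival. */
    void heartbeat(double nowMs) {
        if (!Double.isNaN(lastHeartbeatMs)) {
            intervals.addLast(nowMs - lastHeartbeatMs);
            if (intervals.size() > windowSize) intervals.removeFirst();
        }
        lastHeartbeatMs = nowMs;
    }

    /** phi = -log10(P(heartbeat is merely late)), assuming inter-arrival times
     *  are approximately normally distributed. Higher phi = stronger suspicion. */
    double phi(double nowMs) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
        double var = intervals.stream().mapToDouble(d -> (d - mean) * (d - mean)).average().orElse(0.0);
        double std = Math.max(Math.sqrt(var), 1e-3);    // floor avoids division by zero
        double z = (nowMs - lastHeartbeatMs - mean) / std;
        double pLater = 1.0 - normalCdf(z);             // chance the heartbeat still arrives
        return -Math.log10(Math.max(pLater, 1e-18));
    }

    // Logistic approximation of the standard normal CDF (adequate for a sketch).
    private static double normalCdf(double z) {
        return 1.0 / (1.0 + Math.exp(-z * (1.5976 + 0.070566 * z * z)));
    }
}
```

With a threshold of φ = 8, for example, a node is suspected only once the probability that its heartbeat is simply late drops below 10^-8; lowering the threshold trades false positives for faster detection, which is exactly the configurability referred to above.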
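The push-pull gossip item can be pictured as two nodes reconciling versioned cluster state in both directions within a single round. The schematic below (names and structure are assumptions, not Dnaerys internals) merges whole entries for brevity; a real protocol would exchange digests first so only missing entries travel over the network:

```java
import java.util.HashMap;
import java.util.Map;

/** Schematic push-pull gossip: after a round, both peers hold the newest
 *  version of every key either of them has seen. */
final class GossipState {
    record Versioned(long version, String value) {}

    final Map<String, Versioned> entries = new HashMap<>();

    /** One push-pull round between two nodes: newer entries flow both ways. */
    static void exchange(GossipState a, GossipState b) {
        mergeNewer(a, b); // push: a's newer entries reach b
        mergeNewer(b, a); // pull: b's newer entries come back to a
    }

    private static void mergeNewer(GossipState from, GossipState to) {
        from.entries.forEach((key, v) -> {
            Versioned current = to.entries.get(key);
            if (current == null || current.version() < v.version()) to.entries.put(key, v);
        });
    }
}
```

Repeated against randomly chosen peers, such rounds spread a state change through an n-node cluster in O(log n) rounds with high probability.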
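For the data corruption item, the usual pattern is to compute a cheap non-cryptographic checksum when a block is written and verify it on every read. A minimal sketch with CRC32C from the JDK (the actual hash functions and framing in Dnaerys may differ):

```java
import java.util.zip.CRC32C;

/** Minimal block-integrity check using a non-cryptographic hash. */
final class BlockChecksums {
    /** Checksum computed once when the block is written. */
    static long checksum(byte[] block) {
        CRC32C crc = new CRC32C();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    /** Called on read; fails fast if the block no longer matches its checksum. */
    static void verify(byte[] block, long storedChecksum) {
        if (checksum(block) != storedChecksum) {
            throw new IllegalStateException("data corruption detected");
        }
    }
}
```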
How to start
- it's simple to start playing with it using just Docker on your local laptop
- it's available with a free Community license
Notable Partners
- National Computational Infrastructure
- Australian National University
- Garvan Institute of Medical Research
Origins
The story goes that it takes its origin from a mythical creature that lived across Essos and Westeros, with a mix of triple-stranded left-handed Dragon DNA and right-handed Valyrian DNA. All we know for sure is that it is currently developed and supported by Dnaerys Pty Ltd, Australia.