Benchmarking Notes
- Take these notes with a grain of salt. Run your own tests and draw your own conclusions.
- Every response carries execution-time information in elapsed_ms and elapsed_db_ms.
- elapsed_ms arrives in the last message of a response stream and covers all time from receiving the gRPC request to sending the last message of the response. This includes all internal communication overheads, but excludes Protocol Buffers serialization/deserialization and communication overheads to and from the client, which are not visible to the database. Those overheads can be derived from the end-to-end time measured on the client side.
- elapsed_db_ms reflects processing time in the database engine without internal communication overheads. The difference between elapsed_ms and elapsed_db_ms gives the cost of the cluster's internal communications, including transferring results to the response node. Each message in a response stream from replying nodes carries elapsed_db_ms.
- End-to-end latency and throughput on the client side for a given query/cluster/dataset/testbed can be evaluated with JMH tests.
- 100+ integration tests, which cover the entire gRPC API, provide more examples of client code that can be used for additional tests.
- To evaluate throughput under a massive number of parallel requests, consider distributing requests evenly across the nodes in the cluster. Load balancing is handled automatically in Kubernetes deployments.
- Any node can respond to any query, regardless of data distribution in the cluster, but every response is routed back to the client through the node of contact, which puts additional load on that node.
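The relationship between the two timing fields and the client-side end-to-end time can be sketched as simple arithmetic. This is a minimal illustration, not client code from the project: the `Timing` class, the `breakdown` helper and the sample numbers are made up for the example, and using the largest elapsed_db_ms seen across stream messages is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Timing:
    """Hypothetical container for one request's timings, all in milliseconds."""
    end_to_end_ms: float   # measured by the client around the whole call
    elapsed_ms: float      # taken from the last message of the response stream
    elapsed_db_ms: float   # assumption: largest value seen across stream messages


def breakdown(t: Timing) -> dict:
    """Split the end-to-end time into the three parts described above."""
    return {
        "db_engine_ms": t.elapsed_db_ms,
        # cost of the cluster's internal communications
        "cluster_internal_ms": t.elapsed_ms - t.elapsed_db_ms,
        # Protocol Buffers (de)serialization plus client <-> node transfer
        "client_side_ms": t.end_to_end_ms - t.elapsed_ms,
    }


# Illustrative numbers only, not measurements.
sample = Timing(end_to_end_ms=3.2, elapsed_ms=1.5, elapsed_db_ms=0.4)
parts = breakdown(sample)
for name, value in parts.items():
    print(f"{name}: {value:.2f}")
```

The three parts always sum back to the client-side end-to-end time, which makes the split easy to sanity-check on real measurements.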
Benchmarking Setup
- Dnaerys 1.8.0 (JRE16), broadcast mode=false (default)
- JMH version: 1.32
- all JMH settings = defaults
- running: java -jar benchmarks.jar, java version "1.8.0_131"
- benchmarks sources
Hardware & OS
- 2890 WGS & gnomAD dataset tests were performed on a single physical node with resources:
  - 28 cores with HT (total 56 logical cores)
  - Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  - RAM: 503GiB
  - Ubuntu 18.04.1 LTS
2890 WGS dataset
- 2 890 WGS samples
- ~86×10⁶ unique variants
- ~2890×86×10⁶ = 2.48×10¹¹ total number of all genotypes
Results for cluster with 4 (s/w) nodes running on a single physical node
Benchmark Mode Cnt Score Error Units // Returned results/op
// Variant selection
SelectVariants.hetP53orBRCA2 avgt 25 3.228 ± 0.169 ms/op // 200 heterozygous alleles
SelectVariants.homP53orBRCA2 avgt 25 3.497 ± 0.138 ms/op // 200 homozygous alleles
SelectVariants.homP53orBRCA2Unlim avgt 25 4.742 ± 0.041 ms/op // 456 homozygous alleles
SelectVariants.snpInRegion avgt 25 2.290 ± 0.092 ms/op // 1 allele
SelectVariants.mitoLiverPanel avgt 25 21.989 ± 0.388 ms/op // 6396 alleles
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples avgt 25 4.252 ± 0.055 ms/op // 61 heterozygous alleles
SelectVariants.homP53InSamples avgt 25 3.295 ± 0.071 ms/op // 18 homozygous alleles
// Sample selection: Het
SelectSamples.withHetP53Variants avgt 25 2.304 ± 0.062 ms/op // 100 heterozygous alleles
SelectSamples.withHetP53VariantsAsMulti avgt 25 2.400 ± 0.060 ms/op // 100 heterozygous alleles
SelectSamples.withHetP53orBRCA2Variants avgt 25 2.546 ± 0.129 ms/op // 200 heterozygous alleles
// Sample selection: Hom
SelectSamples.withHomP53Variants avgt 25 2.451 ± 0.059 ms/op // 100 homozygous alleles
SelectSamples.withHomP53VariantsAsMulti avgt 25 2.486 ± 0.053 ms/op // 100 homozygous alleles
SelectSamples.withHomP53orBRCA2Variants avgt 25 2.636 ± 0.146 ms/op // 200 homozygous alleles
// Sample selection: Hom + Het
SelectSamples.withP53Variants avgt 25 2.377 ± 0.058 ms/op // 100 alleles
SelectSamples.withP53VariantsAsMulti avgt 25 2.417 ± 0.049 ms/op // 100 alleles
SelectSamples.withP53orBRCA2Variants avgt 25 2.592 ± 0.142 ms/op // 200 alleles
// Variant selection
SelectVariants.hetP53orBRCA2 thrpt 25 0.311 ± 0.017 ops/ms // 200 heterozygous alleles
SelectVariants.homP53orBRCA2 thrpt 25 0.292 ± 0.017 ops/ms // 200 homozygous alleles
SelectVariants.homP53orBRCA2Unlim thrpt 25 0.206 ± 0.006 ops/ms // 456 homozygous alleles
SelectVariants.snpInRegion thrpt 25 0.441 ± 0.019 ops/ms // 1 allele
SelectVariants.mitoLiverPanel thrpt 25 0.046 ± 0.001 ops/ms // 6396 alleles
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples thrpt 25 0.234 ± 0.002 ops/ms // 61 heterozygous alleles
SelectVariants.homP53InSamples thrpt 25 0.304 ± 0.007 ops/ms // 18 homozygous alleles
// Sample selection: Het
SelectSamples.withHetP53Variants thrpt 25 0.433 ± 0.036 ops/ms // 100 heterozygous alleles
SelectSamples.withHetP53VariantsAsMulti thrpt 25 0.423 ± 0.031 ops/ms // 100 heterozygous alleles
SelectSamples.withHetP53orBRCA2Variants thrpt 25 0.393 ± 0.030 ops/ms // 200 heterozygous alleles
// Sample selection: Hom
SelectSamples.withHomP53Variants thrpt 25 0.416 ± 0.036 ops/ms // 100 homozygous alleles
SelectSamples.withHomP53VariantsAsMulti thrpt 25 0.409 ± 0.031 ops/ms // 100 homozygous alleles
SelectSamples.withHomP53orBRCA2Variants thrpt 25 0.379 ± 0.028 ops/ms // 200 homozygous alleles
// Sample selection: Hom + Het
SelectSamples.withP53Variants thrpt 25 0.431 ± 0.030 ops/ms // 100 alleles
SelectSamples.withP53VariantsAsMulti thrpt 25 0.410 ± 0.021 ops/ms // 100 alleles
SelectSamples.withP53orBRCA2Variants thrpt 25 0.384 ± 0.024 ops/ms // 200 alleles
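The avgt and thrpt halves of the table can be cross-checked against each other. The sketch below assumes the benchmarks run with JMH's default of a single benchmark thread issuing requests one after another, in which case throughput should be close to the reciprocal of average time. The three rows are copied from the table above; the variable names are chosen for this example.

```python
# (avgt in ms/op, thrpt in ops/ms) copied from the 4-node results above
rows = {
    "SelectVariants.hetP53orBRCA2": (3.228, 0.311),
    "SelectVariants.snpInRegion": (2.290, 0.441),
    "SelectVariants.mitoLiverPanel": (21.989, 0.046),
}

relative_gap = {}
for name, (avg_ms_per_op, measured_ops_per_ms) in rows.items():
    # With one sequential client thread, ops/ms is roughly 1 / (ms/op).
    implied_ops_per_ms = 1.0 / avg_ms_per_op
    relative_gap[name] = (
        abs(implied_ops_per_ms - measured_ops_per_ms) / measured_ops_per_ms
    )
    print(
        f"{name}: implied {implied_ops_per_ms:.3f} ops/ms, "
        f"measured {measured_ops_per_ms:.3f} ops/ms, "
        f"gap {relative_gap[name]:.1%}"
    )
```

If the benchmarks were run with more client threads, the two modes would no longer be reciprocal and the throughput figures would need to be read per thread count.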
Results for cluster with 16 (s/w) nodes running on a single physical node
Benchmark Mode Cnt Score Error Units // Returned results/op
// Variant selection
SelectVariants.hetP53orBRCA2 avgt 25 3.286 ± 0.179 ms/op // 200 heterozygous alleles
SelectVariants.homP53orBRCA2 avgt 25 3.360 ± 0.130 ms/op // 200 homozygous alleles
SelectVariants.homP53orBRCA2Unlim avgt 25 4.592 ± 0.041 ms/op // 456 homozygous alleles
SelectVariants.snpInRegion avgt 25 2.329 ± 0.082 ms/op // 1 allele
SelectVariants.mitoLiverPanel avgt 25 18.671 ± 0.083 ms/op // 6396 alleles
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples avgt 25 4.418 ± 0.038 ms/op // 61 heterozygous alleles
SelectVariants.homP53InSamples avgt 25 3.204 ± 0.059 ms/op // 18 homozygous alleles
// Sample selection: Het
SelectSamples.withHetP53Variants avgt 25 2.350 ± 0.060 ms/op // 100 heterozygous alleles
SelectSamples.withHetP53VariantsAsMulti avgt 25 2.398 ± 0.072 ms/op // 100 heterozygous alleles
SelectSamples.withHetP53orBRCA2Variants avgt 25 2.563 ± 0.133 ms/op // 200 heterozygous alleles
// Sample selection: Hom
SelectSamples.withHomP53Variants avgt 25 2.448 ± 0.060 ms/op // 100 homozygous alleles
SelectSamples.withHomP53VariantsAsMulti avgt 25 2.504 ± 0.061 ms/op // 100 homozygous alleles
SelectSamples.withHomP53orBRCA2Variants avgt 25 2.579 ± 0.132 ms/op // 200 homozygous alleles
// Sample selection: Hom + Het
SelectSamples.withP53Variants avgt 25 2.388 ± 0.050 ms/op // 100 alleles
SelectSamples.withP53VariantsAsMulti avgt 25 2.411 ± 0.074 ms/op // 100 alleles
SelectSamples.withP53orBRCA2Variants avgt 25 2.499 ± 0.183 ms/op // 200 alleles
// Variant selection
SelectVariants.hetP53orBRCA2 thrpt 25 0.304 ± 0.021 ops/ms // 200 heterozygous alleles
SelectVariants.homP53orBRCA2 thrpt 25 0.296 ± 0.021 ops/ms // 200 homozygous alleles
SelectVariants.homP53orBRCA2Unlim thrpt 25 0.216 ± 0.003 ops/ms // 456 homozygous alleles
SelectVariants.snpInRegion thrpt 25 0.430 ± 0.017 ops/ms // 1 allele
SelectVariants.mitoLiverPanel thrpt 25 0.054 ± 0.001 ops/ms // 6396 alleles
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples thrpt 25 0.225 ± 0.003 ops/ms // 61 heterozygous alleles
SelectVariants.homP53InSamples thrpt 25 0.312 ± 0.006 ops/ms // 18 homozygous alleles
// Sample selection: Het
SelectSamples.withHetP53Variants thrpt 25 0.423 ± 0.037 ops/ms // 100 heterozygous alleles
SelectSamples.withHetP53VariantsAsMulti thrpt 25 0.420 ± 0.029 ops/ms // 100 heterozygous alleles
SelectSamples.withHetP53orBRCA2Variants thrpt 25 0.390 ± 0.030 ops/ms // 200 heterozygous alleles
// Sample selection: Hom
SelectSamples.withHomP53Variants thrpt 25 0.418 ± 0.039 ops/ms // 100 homozygous alleles
SelectSamples.withHomP53VariantsAsMulti thrpt 25 0.414 ± 0.033 ops/ms // 100 homozygous alleles
SelectSamples.withHomP53orBRCA2Variants thrpt 25 0.390 ± 0.036 ops/ms // 200 homozygous alleles
// Sample selection: Hom + Het
SelectSamples.withP53Variants thrpt 25 0.433 ± 0.032 ops/ms // 100 alleles
SelectSamples.withP53VariantsAsMulti thrpt 25 0.420 ± 0.030 ops/ms // 100 alleles
SelectSamples.withP53orBRCA2Variants thrpt 25 0.404 ± 0.028 ops/ms // 200 alleles
Traffic Notes
- limit is a free parameter in queries
- Variant selection
  - hetP53orBRCA2: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
  - homP53orBRCA2: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
  - homP53orBRCA2Unlim: results unlimited = returns all 456 homozygous alleles in p53 or BRCA2
  - mitoLiverPanel: results unlimited = returns 6 396 alleles in 11 genes
- Variant selection in "virtual cohort"
  - hetP53InSamples: returns 61 heterozygous alleles which exist in 2 samples in p53
  - homP53InSamples: returns 18 homozygous alleles which exist in 2 samples in p53
- Sample selection: Het
  - withHetP53Variants: results are limited to 100/node (= 100 total, p53 is on a single node)
  - withHetP53VariantsAsMulti: same as above, but via SelectSamplesInMultiRegions call, which returns results as a stream
  - withHetP53orBRCA2Variants: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
- Sample selection: Hom
  - withHomP53Variants: results are limited to 100/node (= 100 total, p53 is on a single node)
  - withHomP53VariantsAsMulti: same as above, but via SelectSamplesInMultiRegions call, which returns results as a stream
  - withHomP53orBRCA2Variants: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
- Sample selection: Hom + Het
  - withP53Variants: results are limited to 100/node (= 100 total, p53 is on a single node)
  - withP53VariantsAsMulti: same as above, but via SelectSamplesInMultiRegions call, which returns results as a stream
  - withP53orBRCA2Variants: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
gnomAD v3.1 dataset
- gnomAD v3.1
- 76 156 WGS samples
- ~759×10⁶ unique variants
- 415,071,703,278 homozygous alt + heterozygous genotypes, modelled by Hardy-Weinberg equilibrium (HWE) with a 0.005 no-call rate
- 76 156 × 759×10⁶ ≈ 57.8×10¹² total number of genotypes (including hom ref + missing)
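The dataset size figures above follow from simple multiplication, which also shows how small the non-reference share of the genotype matrix is. A short sketch, using only the numbers listed above (the variant count is approximate, so the results are too):

```python
samples = 76_156
unique_variants = 759e6            # approximate count quoted above
non_ref_genotypes = 415_071_703_278  # homozygous alt + heterozygous

# Every sample has one genotype per variant site.
total_genotypes = samples * unique_variants

# Share of genotypes that are not hom ref or missing.
non_ref_fraction = non_ref_genotypes / total_genotypes

print(f"total genotypes ≈ {total_genotypes:.3e}")
print(f"non-reference share ≈ {non_ref_fraction:.2%}")
```

Under these numbers, fewer than 1 in 100 genotypes is homozygous alt or heterozygous; the rest are hom ref or missing.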
Results for cluster with 4 (s/w) nodes running on a single physical node
Benchmark Mode Cnt Score Error Units // Returned results/op
// Variant selection
SelectVariants.hetP53orBRCA2 avgt 25 3.143 ± 0.498 ms/op // 200 heterozygous alleles
SelectVariants.homP53orBRCA2 avgt 25 3.497 ± 0.238 ms/op // 200 homozygous alleles
SelectVariants.homP53orBRCA2Unlim avgt 25 13.378 ± 0.237 ms/op // 1215 homozygous alleles
SelectVariants.snpInRegion avgt 25 2.278 ± 0.275 ms/op // 1 allele
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples avgt 25 11.511 ± 0.163 ms/op // 51 heterozygous alleles
SelectVariants.homP53InSamples avgt 25 5.330 ± 0.082 ms/op // 14 homozygous alleles
// Sample selection: Het
SelectSamples.withHetP53Variants avgt 25 2.416 ± 0.132 ms/op // 100 heterozygous alleles
SelectSamples.withHetP53VariantsAsMulti avgt 25 2.463 ± 0.154 ms/op // 100 heterozygous alleles
SelectSamples.withHetP53orBRCA2Variants avgt 25 2.528 ± 0.170 ms/op // 200 heterozygous alleles
// Sample selection: Hom
SelectSamples.withHomP53Variants avgt 25 2.383 ± 0.075 ms/op // 100 homozygous alleles
SelectSamples.withHomP53VariantsAsMulti avgt 25 2.430 ± 0.054 ms/op // 100 homozygous alleles
SelectSamples.withHomP53orBRCA2Variants avgt 25 2.657 ± 0.151 ms/op // 200 homozygous alleles
// Sample selection: Hom + Het
SelectSamples.withP53Variants avgt 25 2.442 ± 0.078 ms/op // 100 alleles
SelectSamples.withP53VariantsAsMulti avgt 25 2.537 ± 0.085 ms/op // 100 alleles
SelectSamples.withP53orBRCA2Variants avgt 25 2.465 ± 0.095 ms/op // 200 alleles
// Variant selection
SelectVariants.hetP53orBRCA2 thrpt 25 0.329 ± 0.045 ops/ms // 200 heterozygous alleles
SelectVariants.homP53orBRCA2 thrpt 25 0.277 ± 0.017 ops/ms // 200 homozygous alleles
SelectVariants.homP53orBRCA2Unlim thrpt 25 0.074 ± 0.002 ops/ms // 1215 homozygous alleles
SelectVariants.snpInRegion thrpt 25 0.449 ± 0.054 ops/ms // 1 allele
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples thrpt 25 0.087 ± 0.001 ops/ms // 51 heterozygous alleles
SelectVariants.homP53InSamples thrpt 25 0.188 ± 0.002 ops/ms // 14 homozygous alleles
// Sample selection: Het
SelectSamples.withHetP53Variants thrpt 25 0.412 ± 0.040 ops/ms // 100 heterozygous alleles
SelectSamples.withHetP53VariantsAsMulti thrpt 25 0.411 ± 0.033 ops/ms // 100 heterozygous alleles
SelectSamples.withHetP53orBRCA2Variants thrpt 25 0.389 ± 0.031 ops/ms // 200 heterozygous alleles
// Sample selection: Hom
SelectSamples.withHomP53Variants thrpt 25 0.419 ± 0.035 ops/ms // 100 homozygous alleles
SelectSamples.withHomP53VariantsAsMulti thrpt 25 0.416 ± 0.030 ops/ms // 100 homozygous alleles
SelectSamples.withHomP53orBRCA2Variants thrpt 25 0.374 ± 0.025 ops/ms // 200 homozygous alleles
// Sample selection: Hom + Het
SelectSamples.withP53Variants thrpt 25 0.411 ± 0.028 ops/ms // 100 alleles
SelectSamples.withP53VariantsAsMulti thrpt 25 0.397 ± 0.021 ops/ms // 100 alleles
SelectSamples.withP53orBRCA2Variants thrpt 25 0.402 ± 0.023 ops/ms // 200 alleles
Traffic Notes
- limit is a free parameter in queries
- Variant selection
  - hetP53orBRCA2: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
  - homP53orBRCA2: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
  - homP53orBRCA2Unlim: results unlimited = returns all 1 215 homozygous alleles in p53 or BRCA2
- Variant selection in "virtual cohort"
  - hetP53InSamples: returns 51 heterozygous alleles which exist in 2 samples in p53
  - homP53InSamples: returns 14 homozygous alleles which exist in 2 samples in p53
- Sample selection: Het
  - withHetP53Variants: results are limited to 100/node (= 100 total, p53 is on a single node)
  - withHetP53VariantsAsMulti: same as above, but via SelectSamplesInMultiRegions call, which returns results as a stream
  - withHetP53orBRCA2Variants: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
- Sample selection: Hom
  - withHomP53Variants: results are limited to 100/node (= 100 total, p53 is on a single node)
  - withHomP53VariantsAsMulti: same as above, but via SelectSamplesInMultiRegions call, which returns results as a stream
  - withHomP53orBRCA2Variants: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
- Sample selection: Hom + Het
  - withP53Variants: results are limited to 100/node (= 100 total, p53 is on a single node)
  - withP53VariantsAsMulti: same as above, but via SelectSamplesInMultiRegions call, which returns results as a stream
  - withP53orBRCA2Variants: results are limited to 100/node (= 200 total, p53 & BRCA2 are on different nodes)
51 WGS 1KG dataset
Dataset with 51 WGS samples from 1000 Genomes Project. It gives a glimpse of what to expect on a commodity laptop.
- Dnaerys 1.11.1 (JRE17), broadcast mode=false (default)
- JMH version: 1.32, default settings
- Hardware: Lenovo ThinkPad T14s, i7-10510U CPU @ 1.80GHz, 4 cores/8HT
- OS: Ubuntu 21.10
- Dataset
  - 51 WGS samples
  - 19 725 105 unique variants
  - ~51×19.7×10⁶ ≈ 1×10⁹ genotypes
Results for cluster with 4 (s/w) nodes running on a single physical node (laptop)
Benchmark Mode Cnt Score Error Units // Returned results/op
// Variant selection
SelectVariants.hetP53orBRCA2Unlim avgt 25 3.446 ± 0.111 ms/op // 613 heterozygous alleles
SelectVariants.homP53orBRCA2Unlim avgt 25 3.017 ± 0.096 ms/op // 263 homozygous alleles
SelectVariants.snpInRegion avgt 25 2.212 ± 0.067 ms/op // 1 allele
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples avgt 25 2.663 ± 0.082 ms/op // 54 heterozygous alleles
SelectVariants.homP53InSamples avgt 25 2.632 ± 0.077 ms/op // 38 homozygous alleles
// Sample selection
SelectSamples.withHetP53orBRCA2Variants avgt 25 2.872 ± 0.086 ms/op // 51 samples
SelectSamples.withHomP53orBRCA2Variants avgt 25 2.813 ± 0.078 ms/op // 51 samples
// Variant selection
SelectVariants.hetP53orBRCA2Unlim thrpt 25 0.287 ± 0.010 ops/ms // 613 heterozygous alleles
SelectVariants.homP53orBRCA2Unlim thrpt 25 0.334 ± 0.012 ops/ms // 263 homozygous alleles
SelectVariants.snpInRegion thrpt 25 0.453 ± 0.016 ops/ms // 1 allele
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples thrpt 25 0.376 ± 0.016 ops/ms // 54 heterozygous alleles
SelectVariants.homP53InSamples thrpt 25 0.380 ± 0.015 ops/ms // 38 homozygous alleles
// Sample selection
SelectSamples.withHetP53orBRCA2Variants thrpt 25 0.349 ± 0.014 ops/ms // 51 samples
SelectSamples.withHomP53orBRCA2Variants thrpt 25 0.351 ± 0.013 ops/ms // 51 samples
Traffic Notes
- limit is not set - all results returned
- Variant selection
  - hetP53orBRCA2Unlim: unlimited = returns all 613 heterozygous alleles in p53 or BRCA2
  - homP53orBRCA2Unlim: unlimited = returns all 263 homozygous alleles in p53 or BRCA2
- Variant selection in "virtual cohort"
  - hetP53InSamples: returns 54 heterozygous alleles which exist in 2 samples in p53
  - homP53InSamples: returns 38 homozygous alleles which exist in 2 samples in p53
- Sample selection: Het
  - withHetP53orBRCA2Variants: unlimited = returns all 51 samples
- Sample selection: Hom
  - withHomP53orBRCA2Variants: unlimited = returns all 51 samples
Load scalability
Same tests, dataset and environment as for the 51 WGS dataset, with more than double the load [*]. Two JMH benchmarks query the same cluster node in the same phase and at the same time, one running outside the cluster node's container and the other inside it.
For the benchmark outside the container, throughput decreased by 10% to ~26% across the tests (this benchmark accounts for roughly half of the system's total throughput under load; the other half served the benchmark inside the container). Latency increased by 15% to 29%. In each comparison the smaller number is taken as the base for the percentage.
Benchmark results outside container, cluster with 4 (s/w) nodes
Benchmark Mode Cnt Score Error Units
// Variant selection
SelectVariants.hetP53orBRCA2Unlim avgt 25 4.446 ± 0.090 ms/op
SelectVariants.homP53orBRCA2Unlim avgt 25 3.787 ± 0.146 ms/op
SelectVariants.snpInRegion avgt 25 2.563 ± 0.133 ms/op
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples avgt 25 3.247 ± 0.156 ms/op
SelectVariants.homP53InSamples avgt 25 3.146 ± 0.197 ms/op
// Sample selection
SelectSamples.withHetP53orBRCA2Variants avgt 25 3.690 ± 0.163 ms/op
SelectSamples.withHomP53orBRCA2Variants avgt 25 3.626 ± 0.170 ms/op
// Variant selection
SelectVariants.hetP53orBRCA2Unlim thrpt 25 0.228 ± 0.003 ops/ms
SelectVariants.homP53orBRCA2Unlim thrpt 25 0.278 ± 0.013 ops/ms
SelectVariants.snpInRegion thrpt 25 0.413 ± 0.029 ops/ms
// Variant selection in "virtual cohort"
SelectVariants.hetP53InSamples thrpt 25 0.332 ± 0.018 ops/ms
SelectVariants.homP53InSamples thrpt 25 0.339 ± 0.020 ops/ms
// Sample selection
SelectSamples.withHetP53orBRCA2Variants thrpt 25 0.273 ± 0.010 ops/ms
SelectSamples.withHomP53orBRCA2Variants thrpt 25 0.285 ± 0.017 ops/ms
[*] Workload on the cluster more than doubles as the benchmark from inside the container generates more load than the one from outside.
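The percentage changes quoted in this section can be recomputed from the two result tables. The sketch below does so for two of the benchmarks only (the choice of rows is arbitrary and made for this example), using the stated convention that the smaller number is the base:

```python
# (51 WGS baseline, under double load) pairs copied from the result tables
latency_ms = {
    "hetP53orBRCA2Unlim": (3.446, 4.446),
    "snpInRegion": (2.212, 2.563),
}
throughput_ops_per_ms = {
    "hetP53orBRCA2Unlim": (0.287, 0.228),
    "snpInRegion": (0.453, 0.413),
}


def pct_change(a: float, b: float) -> float:
    """Absolute difference relative to the smaller of the two values, in percent."""
    small, large = sorted((a, b))
    return (large - small) / small * 100.0


latency_increase = {k: pct_change(*v) for k, v in latency_ms.items()}
throughput_decrease = {k: pct_change(*v) for k, v in throughput_ops_per_ms.items()}

for name in latency_ms:
    print(
        f"{name}: latency +{latency_increase[name]:.1f}%, "
        f"throughput -{throughput_decrease[name]:.1f}%"
    )
```

Because the smaller value is always the denominator, a throughput drop computed this way is larger than it would be if measured against the unloaded baseline.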
Alternative solutions
Apache Spark
We tested Spark queries on datasets fully allocated in memory, mainly for illustrative purposes. Using Spark as an in-memory OLAP database is not a proper use case for Spark. Also, holding a whole genomic dataset in memory in data frame format is not feasible in real-world applications.
We ran simple benchmark tests counting the number of variants in gene panels on the 51 WGS samples dataset on the testbed. We used StorageLevel.MEMORY_ONLY and made sure the whole dataset fits in memory.
Across hundreds of tests for each panel:
- Mito 11 panel: 0.827 seconds avg, 75 msec/gene avg
- Adult 104 panel: 7.899 seconds avg, 76 msec/gene avg
- Mito 374 panel: 28.697 seconds avg, 76 msec/gene avg
- The memory footprint with 51 WGS samples exceeded 200 GB during the test runs. The minimum amount sufficient to hold the dataset in memory in data frame format was around 112 GB; we provided more to avoid full GCs during tens of sequential tests in a single run.
- Spark 3.1.2 / Delta 1.0.0 / Glow 1.0.1 / local mode
Unrolling queries
To find the limits of what the Spark query optimisers can do, we unrolled the Mito11 query. This is not a feasible approach for real-world applications, but it gave us results that should be close to the best possible performance. Across hundreds of tests, the majority of the best results were around 250 ms.
RAM disk
The above Spark tests were run with the whole dataset held in RAM in deserialized data frame format, which is an unrealistic scenario for real-world applications. To test in more realistic settings, we put the dataset in parquet format (430 MB) on a RAM disk (type tmpfs) and ran the tests without persisting in memory on the same single node. Across 32 tests:
- Mito 11 panel: 3.399 seconds avg, 309 msec/gene avg
- Adult 104 panel: 32.670 seconds avg, 314 msec/gene avg
- Mito 374 panel: 117.195 seconds avg, 313 msec/gene avg
- roughly 4 times slower than with the deserialized data frame format held in memory
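The per-gene averages and the slowdown factor follow directly from the panel timings for the 51 WGS samples dataset. A small sketch that recomputes them from the numbers quoted in this section (results agree with the quoted per-gene figures to within rounding):

```python
# panel name: (genes, in-memory seconds avg, RAM-disk seconds avg)
panels = {
    "Mito 11": (11, 0.827, 3.399),
    "Adult 104": (104, 7.899, 32.670),
    "Mito 374": (374, 28.697, 117.195),
}

per_gene_ms = {}  # panel -> (in-memory ms/gene, RAM-disk ms/gene)
slowdown = {}     # panel -> RAM-disk time / in-memory time

for name, (genes, mem_s, disk_s) in panels.items():
    per_gene_ms[name] = (mem_s / genes * 1000.0, disk_s / genes * 1000.0)
    slowdown[name] = disk_s / mem_s
    mem_ms, disk_ms = per_gene_ms[name]
    print(
        f"{name}: {mem_ms:.1f} ms/gene in memory, "
        f"{disk_ms:.1f} ms/gene on RAM disk, "
        f"slowdown x{slowdown[name]:.2f}"
    )
```

The per-gene cost stays nearly constant across panel sizes in both setups, which suggests the query time scales linearly with the number of genes.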
The same tests for the 2890 samples dataset (22 GB in parquet format on RAM disk):
- Mito 11 panel: 64.392 seconds avg, 5.853 seconds/gene avg
- Adult 104 panel: 609.738 seconds avg, 5.862 seconds/gene avg
Dnaerys
The same query for mitochondrial liver disease genes (Mito11 panel with 11 nuclear genes) in Dnaerys:
- < 20 milliseconds for the whole 2 890 samples dataset
- < 100 milliseconds for the whole gnomAD dataset (76 156 samples and 759×10⁶ variants)
- most of that time is serialization and communication overhead
- that is roughly a 3000-fold difference compared with Spark on RAM disk for the 2 890 samples dataset
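The size of the gap for the 2 890 samples dataset can be checked with one division. The sketch treats the "< 20 milliseconds" figure as an upper bound on the Dnaerys time, so the result is a lower bound on the ratio:

```python
spark_ram_disk_s = 64.392      # Mito 11 panel, 2890 samples, Spark on RAM disk
dnaerys_upper_bound_s = 0.020  # "< 20 milliseconds" quoted above

# Dividing by an upper bound on the faster time gives a lower bound on the ratio.
min_ratio = spark_ram_disk_s / dnaerys_upper_bound_s

print(f"Spark on RAM disk / Dnaerys >= {min_ratio:.0f}x")
```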
Google Genomics
A query similar to the gene panel queries appears to take from tens of seconds to minutes in Google Cloud Life Sciences.