Footprints and Scaling
Scaling from local development environments to production clusters.
51 WGS dataset
- Dataset: 51 WGS samples from 1000 Genomes Project
- 19,725,105 unique variants
- 51 × 19,725,105 ≈ 1.0×10⁹ genotypes
- Memory, disk, and startup time footprints:
- On disk: 1.1 GiB
- RAM footprint: ~1.7 GiB
- Startup time on single node (laptop): t = 4.6 sec
- Footprints vary slightly based on the number of nodes.
3,202 WGS dataset
- Dataset: 1KGP 30x on GRCh38
- 3,202 WGS samples
- 138,044,723 unique variants
- 19,348,414 multiallelic variants
- 138,044,723 × 3,202 ≈ 4.42×10¹¹ genotypes
- Annotations:
- VEP (impact, biotype, feature type, variant class, consequences), ClinVar, gnomADe + gnomADg 4.1, HGVSp, AlphaMissense Score & Class
- GENCODE Basic set
- annotation composition
- Memory and disk footprints:
- On disk: 42 GiB
- RAM footprint (k8s cluster with 8 pods): 63 GiB
- Footprints vary slightly based on the number of nodes.
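For capacity planning, the cluster-wide RAM figure above can be turned into a rough per-pod estimate. The sketch below assumes the 63 GiB footprint is spread evenly across the 8 pods, which the source does not state; the function name `ram_per_pod_gib` is hypothetical, and real pods need headroom on top of this value.

```python
def ram_per_pod_gib(total_ram_gib: float, pods: int) -> float:
    """Rough per-pod RAM, ASSUMING the total footprint is split evenly across pods."""
    if pods < 1:
        raise ValueError("pods must be a positive integer")
    return total_ram_gib / pods


# Figures from the 3,202 WGS dataset: 63 GiB across a k8s cluster with 8 pods.
per_pod = ram_per_pod_gib(63, 8)
print(f"Estimated RAM per pod: {per_pod:.3f} GiB")
```

Because footprints vary slightly with the number of nodes, treat the result as a lower bound when setting pod memory requests.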
76,156 WGS gnomAD dataset
- Dataset: gnomAD v3.1
- 76,156 WGS samples
- ~759×10⁶ unique variants
- ~4.15×10¹¹ (415,071,703,278) non-reference (hom + het) genotypes, modelled under Hardy-Weinberg equilibrium (HWE) with a no-call rate of 0.005
- ~5.78×10¹³ total genotypes, including homozygous-reference and missing (no-call) genotypes: 76,156 × 759×10⁶ = 57,802,404×10⁶
- Memory, disk, and startup time footprints:
- On disk: 312 GiB
- RAM footprint: 450 GiB
- Startup time on single node (R630 testbed): t = 260 sec
- Startup time on N nodes = t / N
- Footprints vary slightly based on the number of nodes.
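The startup-time rule above (t / N on N nodes) can be sketched as a small helper. This is a minimal illustration of the stated formula only: the function name `startup_time_sec` is hypothetical, and the node counts in the loop are example values rather than measured configurations.

```python
def startup_time_sec(t_single_node_sec: float, nodes: int) -> float:
    """Estimated startup time on `nodes` nodes, using the t / N rule."""
    if nodes < 1:
        raise ValueError("nodes must be a positive integer")
    return t_single_node_sec / nodes


# t = 260 sec is the single-node (R630 testbed) figure for the gnomAD dataset.
for n in (1, 2, 4, 8):
    print(f"{n} node(s): ~{startup_time_sec(260, n):.1f} sec")
```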
Resource Comparison Table
| Cohort Size | Variants | Total Genotypes | Disk (GiB) | RAM (GiB) |
|---|---|---|---|---|
| 51 WGS | 19.7M | 1×10⁹ | 1.1 | 1.7 |
| 3,202 WGS | 138M | 4.42×10¹¹ | 42 | 63 |
| 76,156 WGS | 759M | 5.78×10¹³ | 312 | 450 |
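The table can be cross-checked with a short script that recomputes total genotypes as samples × variants and derives two convenience ratios: on-disk bytes per genotype and RAM relative to disk. The derived ratios are not figures from the benchmarks themselves; they are illustrative arithmetic on the table values, and the names `COHORTS` and `summarize` are hypothetical. The gnomAD variant count uses the rounded 759×10⁶ value.

```python
GIB = 2**30  # bytes per GiB

# (label, samples, unique variants, disk GiB, RAM GiB), taken from the table above.
COHORTS = [
    ("51 WGS", 51, 19_725_105, 1.1, 1.7),
    ("3,202 WGS", 3_202, 138_044_723, 42, 63),
    ("76,156 WGS", 76_156, 759_000_000, 312, 450),  # variant count is approximate
]


def summarize(samples: int, variants: int, disk_gib: float, ram_gib: float) -> dict:
    """Recompute total genotypes and derive simple footprint ratios."""
    genotypes = samples * variants
    return {
        "genotypes": genotypes,
        "disk_bytes_per_genotype": disk_gib * GIB / genotypes,
        "ram_to_disk_ratio": ram_gib / disk_gib,
    }


for label, samples, variants, disk_gib, ram_gib in COHORTS:
    s = summarize(samples, variants, disk_gib, ram_gib)
    print(
        f"{label}: {s['genotypes']:.3e} genotypes, "
        f"{s['disk_bytes_per_genotype']:.4f} disk bytes/genotype, "
        f"RAM/disk = {s['ram_to_disk_ratio']:.2f}"
    )
```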
Testbed Hardware Reference
Unless specified otherwise, benchmarks are conducted on the R630 Testbed:
- Host: Ubuntu 18.04.1 LTS
- CPU: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz, 28 cores with HT (56 logical cores)
- RAM: 503 GiB