Analytics

Kinship

Kinship coefficient φ is a measure of biological relatedness. In Dnaerys it's calculated according to "Robust relationship inference in genome-wide association studies", Manichaikul, Ani et al.

**Relatedness for pairs of samples **

Relatedness is an important factor in population genetics and often we need to exclude related samples from datasets. Sometimes relatedness is an indication of a human error, e.g. sample duplications, which could indicate a sample mixup either at collection site or in a sequencing lab.

Dnaerys can calculate relatedness for all pairs in a dataset, or part of it.

Search for identical twins (equals to sample duplications) in all possible pairs in a cohort

./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"cohort_name":<cohort_name>, "degree":"TWINS_MONOZYGOTIC"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/Kinship

Search for all related pairs up to 3rd degree

./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"cohort_name":<cohort_name>, "degree":"THIRD_DEGREE"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/Kinship

Kinship search

Another useful application of relatedness is finding samples in a dataset which genetically relate to a sample of interest up to a certain kinship degree. It must be a common use case when we have large collections of samples in datasets and we want to find out if an external sample of interest exists in these datasets, or if any samples genetically related to sample of interest present in datasets. Sample of interest is provided as a vcf in request.

Polygenic Risk Scores

Polygenic risk scores are implemented by the same algorithm and perform the same calculations for basic scores as --score in plink.

Note

"More precisely, if G is the full genotype/dosage matrix (rows = alleles, columns = samples) and a is a scoring-system vector with one coefficient per allele, --score computes the vector-matrix product a^TG, and then divides by the number of variants when reporting score-averages." -- plink

Results:

scores_sum contains summary of scores in sample, i.e. with effect alleles with het or hom genotypes only. This equals to sum mode in plink's --score with no-mean-imputation option.
From this, one can calculate weighted sums in each sample
- average over number of het/hom alleles (divide by hethom_cardinality)
- average over number of het/hom/ref alleles (divide by hethom_cardinality + ref_cardinality)
imputed_sum contains imputed part, i.e. summary of imputed scores for missing alleles in sample. I.e. it's a sum of ('SNP scores' * 'estimated gt weights') with 'estimated gt weights' = ('gnomAD AF' * 2) This equals to default mode in plink's '--score' when combined with 'scores_sum'
- this can be averaged over missing genotypes in sample in variants in this PRS (mis_cardinality)
For datasets in optimized mode (which do not keep info about ref/missing genotypes, see Runtime optimization), ref_cardinality, mis_cardinality and imputed_sum do not contain cardinality and sum.
Variants not presented in datasets do not contribute to scores. They are still counted in prsCardinality as reflection of PRS itself.

Step by step

Trio's genotypes

#chr_name  chr_position  effect_allele   HG002     HG003     HG004
2          178085137     A               1/1       1/1       1/1
2          178633945     C               0/0       0/1       0/0
3          66434643      C               0/0       0/1       0/0
5          180670251     G               0/1       0/1       0/1
5          180687212     T               0/1       0/1       0/0

prs tsv (or csv) file:

$ cat /path/prs.tsv
#chr_name  chr_position  effect_allele  reference_allele  effect_weight
2          178085137     A              G                 0.0708
2          178633945     C              A                 0.0778
3          66434643      C              T                 0.0512
5          180670251     G              C                 0.0594
5          180687212     T              C                 0.0873

load:

scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.PRS \
    --path <dataset_path> \
    --load \
    --prs  /path/prs.tsv \
    --name "Atrial fibrillation"  \
    --desc "PGS000XXX"

info:

scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.PRS \
    --path <dataset_path> \
    --info

querying:

grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"prs_name":"Atrial fibrillation", "samples":["HG002", "HG003", "HG004"]}' \
  localhost:8001 \
org.dnaerys.cluster.grpc.DnaerysService/Prs

result:

{
  "prsName": "Atrial fibrillation",
  "sampleScores": [
    {
      "sample": "HG003",
      "scoresSum": 0.4173,
      "hethomCardinality": 5
    },
    {
      "sample": "HG002",
      "scoresSum": 0.28829998,
      "hethomCardinality": 3,
      "refCardinality": 2
    },
    {
      "sample": "HG004",
      "scoresSum": 0.201,
      "hethomCardinality": 2,
      "refCardinality": 3
    }
  ],
  "prsCardinality": 97,
  "elapsedMs": "8"
}

Sex mismatch

"Reported vs observed" sex mismatch check is based on F-statistics for deviation from expected number of homozygous alleles performed on biallelic SNVs on X chromosome outside PAR. The same method is used in Hail 0.2 and Plink, and parameters have similar meaning

float aaf_threshold = 3; // [0,1] consider only alleles with 'aaf_threshold < aaf < 1 - aaf_threshold'; default = 0
float female_threshold = 4; // samples called females if F < female_threshold; default = 0.7; ignored in FstatX calls
float male_threshold = 5;   // samples called males if F > male_threshold; default = 0.7; ignored in FstatX calls
bool include_par = 6; // PAR are excluded by default, this option includes them when true
bool seq = 7; // calculate sequentially in a single thread; by default calculations are performed in MT
              // and occupy all cores available on the nodes with X data

NB (i) Plink 1.07 tries to use "Nei's unbiased estimator" multiplier (2N/(2N-1)) for adjusting expected dosage of homozygous alleles when calculating inbreeding coefficients (for all cases, not just for sex check on X), but due to using integer division in implementation it never affected results beyond N = 1. This multiplier is not used in Dnaerys and Hail.

NB (ii) Keep it in mind that floating point arithmetic is not associative. Dnaerys computes stats in multiple threads in parallel, calculating different parts of the stats in different threads and small deviations in results might exist. I.e. F-stat results might not be identical for identical queries when calculated in MT, though deviations are negligible. This is not effect of a data race, but result of arithmetic being not associative. For sequential calculations use seq, which computes stats in a single thread.

Select by Inheritance

Selecting variants following Mendelian inheritance patterns within a specified trio (Parent1, Parent2, Proband).

./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"parent1":"HG003", "parent2":"HG004", "proband":"HG002", "skip":"1", "limit":"2"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectDeNovo

./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"affected_parent":"HG003", "unaffected_parent":"HG004", "affected_child":"HG002", "skip":"1", "limit":"2"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectHetDominant

./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"unaffected_parent1":"HG003", "unaffected_parent2":"HG004", "affected_child":"HG002", "skip":"1", "limit":"2"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectHomRecessive

Population Statistics (HWE and IBC)

Calculating population metrics on the fly for biallelic SNVs:

Inbreeding coefficient F_IT: Calculated for autosomal chromosomes and PAR regions.
Hardy-Weinberg Equilibrium (HWE): Performs a Chi-squared test for deviation. On X chromosome outside PAR calculated according to "Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome" J Graffelman, BS Weir
Accessing Data: These metrics are returned as part of the AllelesWithStatsResponse. Use the TopNHWE endpoint to identify the variants with the most significant p-values across the entire dataset.