Kinship

Kinship coefficient φ is a measure of biological relatedness. In Dnaerys it's calculated according to "Robust relationship inference in genome-wide association studies", Manichaikul, Ani et al.

** Calculating relatedness for pairs of samples**

Relatedness is an important factor in population genetics and often we need to exclude related samples from datasets. Sometimes relatedness is an indication of a human error, e.g. sample duplications, which could indicate a sample mixup either at collection site or in a sequencing lab.

Dnaerys can calculate relatedness for all pairs in a dataset, or part of it.

./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"cohort_name":<cohort_name>, "degree":"TWINS_MONOZYGOTIC"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/Kinship
./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"cohort_name":<cohort_name>, "degree":"THIRD_DEGREE"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/Kinship

Another useful application of relatedness is finding samples in a dataset which genetically relate to a sample of interest up to a certain kinship degree. It must be a common use case when we have large collections of samples in datasets and we want to find out if an external sample of interest exists in these datasets, or if any samples genetically related to sample of interest present in datasets. Sample of interest is provided as a vcf in request.


Polygenic risk scores

Polygenic risk scores are implemented by the same algorithm and perform the same calculations for basic scores as --score in plink.

Note

"More precisely, if G is the full genotype/dosage matrix (rows = alleles, columns = samples) and a is a scoring-system vector with one coefficient per allele, --score computes the vector-matrix product aTG, and then divides by the number of variants when reporting score-averages." -- plink

Results:


Step by step

Trio's genotypes

#chr_name  chr_position  effect_allele   HG002     HG003     HG004
2          178085137     A               1/1       1/1       1/1
2          178633945     C               0/0       0/1       0/0
3          66434643      C               0/0       0/1       0/0
5          180670251     G               0/1       0/1       0/1
5          180687212     T               0/1       0/1       0/0

prs tsv (or csv) file:

$ cat /path/prs.tsv
#chr_name  chr_position  effect_allele  reference_allele  effect_weight
2          178085137     A              G                 0.0708
2          178633945     C              A                 0.0778
3          66434643      C              T                 0.0512
5          180670251     G              C                 0.0594
5          180687212     T              C                 0.0873

load:

scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.PRS \
    --path <dataset_path> \
    --load \
    --prs  /path/prs.tsv \
    --name "Atrial fibrillation"  \
    --desc "PGS000XXX"

info:

scala \
    -cp /path/dnaerys-ctl.jar \
    org.dnaerys.utils.PRS \
    --path <dataset_path> \
    --info

querying:

grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"prs_name":"Atrial fibrillation", "samples":["HG002", "HG003", "HG004"]}' \
  localhost:8001 \
org.dnaerys.cluster.grpc.DnaerysService/Prs

result:

{
  "prsName": "Atrial fibrillation",
  "sampleScores": [
    {
      "sample": "HG003",
      "scoresSum": 0.4173,
      "hethomCardinality": 5
    },
    {
      "sample": "HG002",
      "scoresSum": 0.28829998,
      "hethomCardinality": 3,
      "refCardinality": 2
    },
    {
      "sample": "HG004",
      "scoresSum": 0.201,
      "hethomCardinality": 2,
      "refCardinality": 3
    }
  ],
  "prsCardinality": 97,
  "elapsedMs": "8"
}

Sex mismatch

"Reported vs observed" sex mismatch check is based on F-statistics for deviation from expected number of homozygous alleles performed on biallelic SNVs on X chromosome outside PAR. The same method is used in Hail 0.2 and Plink, and parameters have similar meaning

float aaf_threshold = 3; // [0,1] consider only alleles with 'aaf_threshold < aaf < 1 - aaf_threshold'; default = 0
float female_threshold = 4; // samples called females if F < female_threshold; default = 0.7; ignored in FstatX calls
float male_threshold = 5;   // samples called males if F > male_threshold; default = 0.7; ignored in FstatX calls
bool include_par = 6; // PAR are excluded by default, this option includes them when true
bool seq = 7; // calculate sequentially in a single thread; by default calculations are performed in MT
              // and occupy all cores available on the nodes with X data

NB (i) Plink 1.07 tries to use "Nei's unbiased estimator" multiplier (2N/(2N-1)) for adjusting expected dosage of homozygous alleles when calculating inbreeding coefficients (for all cases, not just for sex check on X), but due to using integer division in implementation it never affected results beyond N = 1. This multiplier is not used in Dnaerys and Hail.

NB (ii) Keep it in mind that floating point arithmetic is not associative. Dnaerys computes stats in multiple threads in parallel, calculating different parts of the stats in different threads and small deviations in results might exist. I.e. F-stat results might not be identical for identical queries when calculated in MT, though deviations are negligible. This is not effect of a data race, but result of arithmetic being not associative. For sequential calculations use seq, which computes stats in a single thread.


Select by inheritance

Selecting variants by inheritance models

./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"parent1":"HG003", "parent2":"HG004", "proband":"HG002", "skip":"1", "limit":"2"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectDeNovo
./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"affected_parent":"HG003", "unaffected_parent":"HG004", "affected_child":"HG002", "skip":"1", "limit":"2"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectHetDominant
./grpcurl \
  -plaintext \
  -proto dnaerys.proto \
  -d '{"unaffected_parent1":"HG003", "unaffected_parent2":"HG004", "affected_child":"HG002", "skip":"1", "limit":"2"}' \
  localhost:8001 \
  org.dnaerys.cluster.grpc.DnaerysService/SelectHomRecessive

Inbreeding coefficient

Inbreeding coefficient FIT for biallelic SNVs in autosomal chromosomes and PAR.

Results returned as part of AllelesWithStatsResponse.


HWE

Chi-squared test for deviation from HWE for biallelic SNVs on autosomal and X chromosomes. On X chromosome outside PAR calculated according to "Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome" J Graffelman, BS Weir

Results returned as part of AllelesWithStatsResponse for Select*WithStats requests. TopNHWE selects top N variants with the most significant p-value across the whole dataset.