Kinship
Kinship coefficient φ is a measure of biological relatedness. In Dnaerys it's calculated according to "Robust relationship inference in genome-wide association studies", Manichaikul, Ani et al.
** Calculating relatedness for pairs of samples**
Relatedness is an important factor in population genetics and often we need to exclude related samples from datasets. Sometimes relatedness is an indication of a human error, e.g. sample duplications, which could indicate a sample mixup either at collection site or in a sequencing lab.
Dnaerys can calculate relatedness for all pairs in a dataset, or part of it.
- Search for identical twins (equals to sample duplications) in all possible pairs in a cohort
./grpcurl \
-plaintext \
-proto dnaerys.proto \
-d '{"cohort_name":<cohort_name>, "degree":"TWINS_MONOZYGOTIC"}' \
localhost:8001 \
org.dnaerys.cluster.grpc.DnaerysService/Kinship
- Search for all related pairs up to 3rd degree
./grpcurl \
-plaintext \
-proto dnaerys.proto \
-d '{"cohort_name":<cohort_name>, "degree":"THIRD_DEGREE"}' \
localhost:8001 \
org.dnaerys.cluster.grpc.DnaerysService/Kinship
Kinship search
Another useful application of relatedness is finding samples in a dataset which genetically relate to a sample of interest up to a certain kinship degree. It must be a common use case when we have large collections of samples in datasets and we want to find out if an external sample of interest exists in these datasets, or if any samples genetically related to sample of interest present in datasets. Sample of interest is provided as a vcf in request.
Polygenic risk scores
Polygenic risk scores are implemented by the same algorithm and perform the same calculations for
basic scores as --score
in plink.
Note
"More precisely, if G is the full genotype/dosage matrix (rows = alleles, columns = samples) and a is a scoring-system vector with one coefficient per allele, --score computes the vector-matrix product aTG, and then divides by the number of variants when reporting score-averages." -- plink
Results:
-
scores_sum
contains summary of scores in sample, i.e. with effect alleles with het or hom genotypes only. This equals tosum
mode in plink's--score
withno-mean-imputation
option. -
From this, one can calculate weighted sums in each sample
- average over number of het/hom alleles (divide by
hethom_cardinality
) - average over number of het/hom/ref alleles (divide by
hethom_cardinality
+ref_cardinality
)
- average over number of het/hom alleles (divide by
-
imputed_sum
contains imputed part, i.e. summary of imputed scores for missing alleles in sample. I.e. it's a sum of ('SNP scores' * 'estimated gt weights') with 'estimated gt weights' = ('gnomAD AF' * 2) This equals to default mode in plink's '--score' when combined with 'scores_sum'- this can be averaged over missing genotypes in sample in variants in this PRS (
mis_cardinality
)
- this can be averaged over missing genotypes in sample in variants in this PRS (
-
For datasets in optimized mode (which do not keep info about ref/missing genotypes, see Runtime optimization),
ref_cardinality
,mis_cardinality
andimputed_sum
do not contain cardinality and sum. -
Variants not presented in datasets do not contribute to scores. They are still counted in
prsCardinality
as reflection of PRS itself.
Step by step
Trio's genotypes
#chr_name chr_position effect_allele HG002 HG003 HG004
2 178085137 A 1/1 1/1 1/1
2 178633945 C 0/0 0/1 0/0
3 66434643 C 0/0 0/1 0/0
5 180670251 G 0/1 0/1 0/1
5 180687212 T 0/1 0/1 0/0
prs tsv (or csv) file:
$ cat /path/prs.tsv
#chr_name chr_position effect_allele reference_allele effect_weight
2 178085137 A G 0.0708
2 178633945 C A 0.0778
3 66434643 C T 0.0512
5 180670251 G C 0.0594
5 180687212 T C 0.0873
load:
scala \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.PRS \
--path <dataset_path> \
--load \
--prs /path/prs.tsv \
--name "Atrial fibrillation" \
--desc "PGS000XXX"
info:
scala \
-cp /path/dnaerys-ctl.jar \
org.dnaerys.utils.PRS \
--path <dataset_path> \
--info
querying:
grpcurl \
-plaintext \
-proto dnaerys.proto \
-d '{"prs_name":"Atrial fibrillation", "samples":["HG002", "HG003", "HG004"]}' \
localhost:8001 \
org.dnaerys.cluster.grpc.DnaerysService/Prs
result:
{
"prsName": "Atrial fibrillation",
"sampleScores": [
{
"sample": "HG003",
"scoresSum": 0.4173,
"hethomCardinality": 5
},
{
"sample": "HG002",
"scoresSum": 0.28829998,
"hethomCardinality": 3,
"refCardinality": 2
},
{
"sample": "HG004",
"scoresSum": 0.201,
"hethomCardinality": 2,
"refCardinality": 3
}
],
"prsCardinality": 97,
"elapsedMs": "8"
}
Sex mismatch
"Reported vs observed" sex mismatch check is based on F-statistics for deviation from expected number of homozygous alleles performed on biallelic SNVs on X chromosome outside PAR. The same method is used in Hail 0.2 and Plink, and parameters have similar meaning
float aaf_threshold = 3; // [0,1] consider only alleles with 'aaf_threshold < aaf < 1 - aaf_threshold'; default = 0
float female_threshold = 4; // samples called females if F < female_threshold; default = 0.7; ignored in FstatX calls
float male_threshold = 5; // samples called males if F > male_threshold; default = 0.7; ignored in FstatX calls
bool include_par = 6; // PAR are excluded by default, this option includes them when true
bool seq = 7; // calculate sequentially in a single thread; by default calculations are performed in MT
// and occupy all cores available on the nodes with X data
NB (i) Plink 1.07 tries to use "Nei's unbiased estimator" multiplier (2N/(2N-1)) for adjusting expected dosage of homozygous alleles when calculating inbreeding coefficients (for all cases, not just for sex check on X), but due to using integer division in implementation it never affected results beyond N = 1. This multiplier is not used in Dnaerys and Hail.
NB (ii)
Keep it in mind that floating point arithmetic
is not associative.
Dnaerys computes stats in multiple threads in parallel, calculating different parts
of the stats in different threads and small deviations in results might exist.
I.e. F-stat results might not be identical for identical queries when calculated in MT,
though deviations are negligible. This is not effect of a
data race,
but result of arithmetic being not associative.
For sequential calculations use seq
, which computes stats in a single thread.
Select by inheritance
Selecting variants by inheritance models
./grpcurl \
-plaintext \
-proto dnaerys.proto \
-d '{"parent1":"HG003", "parent2":"HG004", "proband":"HG002", "skip":"1", "limit":"2"}' \
localhost:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectDeNovo
./grpcurl \
-plaintext \
-proto dnaerys.proto \
-d '{"affected_parent":"HG003", "unaffected_parent":"HG004", "affected_child":"HG002", "skip":"1", "limit":"2"}' \
localhost:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectHetDominant
./grpcurl \
-plaintext \
-proto dnaerys.proto \
-d '{"unaffected_parent1":"HG003", "unaffected_parent2":"HG004", "affected_child":"HG002", "skip":"1", "limit":"2"}' \
localhost:8001 \
org.dnaerys.cluster.grpc.DnaerysService/SelectHomRecessive
Inbreeding coefficient
Inbreeding coefficient FIT for biallelic SNVs in autosomal chromosomes and PAR.
Results returned as part of AllelesWithStatsResponse
.
HWE
Chi-squared test for deviation from HWE for biallelic SNVs on autosomal and X chromosomes. On X chromosome outside PAR calculated according to "Testing for Hardy-Weinberg equilibrium at biallelic genetic markers on the X chromosome" J Graffelman, BS Weir
Results returned as part of AllelesWithStatsResponse
for Select*WithStats
requests.
TopNHWE
selects top N variants with the most significant p-value across the whole dataset.