@@ -15,11 +15,11 @@ kernelspec:
15
15
:::
16
16
17
17
18
- (sec_tutorial )=
18
+ (sec_usage )=
19
19
20
- # Tutorial
20
+ # Usage
21
21
22
- (sec_tutorial_toy_example )=
22
+ (sec_usage_toy_example )=
23
23
24
24
## Toy example
25
25
@@ -61,7 +61,7 @@ for sample in range(ds['call_genotype'].shape[1]):
61
61
We wish to infer a genealogy that could have given rise to this data set. To run _ tsinfer_
62
62
we wrap the .vcz file in a ` tsinfer.VariantData ` object. This requires an
63
63
* ancestral allele* to be specified for each site; there are
64
- many methods for calculating there : details are outside the scope of this manual, but we
64
+ many methods for calculating these : details are outside the scope of this manual, but we
65
65
have started a [ discussion topic] ( https://github.com/tskit-dev/tsinfer/discussions/523 )
66
66
on this issue to provide some recommendations.
67
67
@@ -83,19 +83,29 @@ ancestral_alleles[-1] = "."
83
83
vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_alleles)
84
84
```
85
85
86
- The ` .VariantData ` object is a lightweight wrapper for the data from the 3 diploid samples
87
- in the .vcz file. We'll use the object to infer a tree sequence from the variant data.
88
- Howeve, note that some sites are not used for genealogical inference. This includes non-variable
89
- (fixed) sites, singleton sites, and sites where the ancestral allele is unknown: in this example,
90
- these are seen at site IDs 4, 5 and 7 respectively. In addition,
91
- multiallelic sites, with more than 2 alleles, are not used for inference (but see
92
- [ here] ( https://github.com/tskit-dev/tsinfer/issues/670 ) for a workaround).
86
+ The ` VariantData ` object is a lightweight wrapper around the .vcz file.
87
+ We'll use it to infer a tree sequence on the basis of the sites that vary between the
88
+ different samples. However, note that certain sites are not used by _ tsinfer_ for inferring
89
+ the genealogy (although they are still encoded in the final tree sequence), These are:
93
90
94
- Additionally, during the inference step, extra sites can be flagged as not for use in
95
- inferring the genealogy, for example if they are deemed unreliable (this is done
96
- via the ` exclude_positions ` parameter). Note, however, that even if a site is not used
97
- for genealogical inference, its genetic variation can still be encoded in the final
98
- tree sequence.
91
+ * Non-variable (fixed) sites, e.g. site 4 above
92
+ * Singleton sites, where only one genome has the derived allele e.g. site 5 above
93
+ * Sites where the ancestral allele is unknown, e.g. demonstrated by site 7 above
94
+ * Multialleleic sites, with more than 2 alleles (but see
95
+ [ here] ( https://github.com/tskit-dev/tsinfer/issues/670 ) for a workaround)
96
+
97
+ Additionally, during the inference step, additional sites can be flagged as not for use in
98
+ inference, for example if they are deemed unreliable (this is done
99
+ via the ` exclude_positions ` parameter).
100
+
101
+ ### Masks
102
+
103
+ Sites which are not used for inference will still be included in the final tree sequence, with
104
+ mutations at those sites being placed onto branches by parsimony. However, it is also possible
105
+ to completely exclude sites and samples from the final tree sequence, by specifing a ` site_mask `
106
+ and/or a ` sample_mask ` when creating the ` VariantData ` object. Such sites or samples will be
107
+ completely omitted from both inference and the final tree sequence. This can be useful, for
108
+ example, to reduce the amount of computation required for an inference.
99
109
100
110
### Topology inference
101
111
@@ -186,7 +196,7 @@ algorithm is only intended to infer the genetic relationships between the sample
186
196
(i.e. the * topology* of the tree sequence).
187
197
188
198
189
- (sec_tutorial_simulation_example )=
199
+ (sec_usage_simulation_example )=
190
200
191
201
## Simulation example
192
202
@@ -416,15 +426,15 @@ Other than the sample node IDs, it is meaningless to compare node numbers in the
416
426
source and inferred tree sequences.
417
427
:::
418
428
419
- (sec_tutorial_data_example )=
429
+ (sec_usage_data_example )=
420
430
421
431
## Data example
422
432
423
433
Inputting real data for inference is similar in principle to the examples above.
424
434
All that is required is a .vcz file, which can be created using
425
435
[ vcf2zarr] ( https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html ) as above
426
436
427
- (sec_tutorial_read_vcf )=
437
+ (sec_usage_read_vcf )=
428
438
429
439
### Reading a VCF
430
440
@@ -440,7 +450,9 @@ vcf_location = "_static/P_dom_chr24_phased.vcf.gz"
440
450
!python -m bio2zarr vcf2zarr convert --force {vcf_location} sparrows.vcz
441
451
```
442
452
443
- This creates the ` sparrows.vcz ` datastore, which we open using ` tsinfer.VariantData ` :
453
+ This creates the ` sparrows.vcz ` datastore, which we open using ` tsinfer.VariantData ` .
454
+ The original VCF had ancestral alleles specified in the ` AA ` INFO field, so we can
455
+ simply provide the string ` "variant_AA" ` as the ancestral_allele parameter.
444
456
445
457
``` {code-cell} ipython3
446
458
# Do the inference: this VCF has ancestral alleles in the AA field
@@ -552,7 +564,7 @@ discrete groups on the tree, but be part of a larger mixing population. Note, ho
552
564
that this is only one of thousands of trees, and may not be typical of the genome as a
553
565
whole. Additionally, most data sets will have far more samples than this example, so
554
566
trees visualized in this way are likely to be huge and difficult to understand. As in
555
- the {ref}` simulation example <sec_tutorial_simulation_example > ` above, one possibility
567
+ the {ref}` simulation example <sec_usage_simulation_example > ` above, one possibility
556
568
is to {meth}` ~tskit.TreeSequence.simplify ` the tree sequence to a limited number of
557
569
samples, but it is likely that most studies will
558
570
instead rely on various statistical summaries of the trees. Storing genetic data as a
0 commit comments