Commit e84f7c0

hyanwong authored and mergify[bot] committed
Change name from "Tutorial" to "Usage"
This allows us to reserve the word "tutorial" for more specific inference tutorials, for example, on the tutorials site. It's also more accurate: people are more likely to jump straight to e.g. the VCF usage section rather than work their way through the whole page.
1 parent 1738a7f commit e84f7c0

3 files changed: +36 -24 lines changed


docs/_toc.yml

Lines changed: 2 additions & 2 deletions
@@ -8,9 +8,9 @@ parts:
 - caption: Installation
   chapters:
   - file: installation
-- caption: Tutorial
+- caption: Usage
   chapters:
-  - file: tutorial
+  - file: usage
 - caption: Inference
   chapters:
   - file: inference

docs/inference.md

Lines changed: 1 addition & 1 deletion
@@ -141,7 +141,7 @@ still not as efficently as it is possible to analyse an equivalent tree sequence
 Rather than require the user to understand the internal structure of this
 file format, we provide a simple {ref}`Python API <sec_api_file_formats>`
 to allow the user to efficiently construct it from their own data.
-An example of how to use this API is given in the {ref}`sec_tutorial`.
+An example of how to use this API is given in the {ref}`sec_usage` documentation.

 We do not provide an automatic means of importing data from VCF (or any
 other format) intentionally, as we believe that this would be extremely difficult to do.

docs/tutorial.md renamed to docs/usage.md

Lines changed: 33 additions & 21 deletions
@@ -15,11 +15,11 @@ kernelspec:
 :::


-(sec_tutorial)=
+(sec_usage)=

-# Tutorial
+# Usage

-(sec_tutorial_toy_example)=
+(sec_usage_toy_example)=

 ## Toy example

@@ -61,7 +61,7 @@ for sample in range(ds['call_genotype'].shape[1]):
 We wish to infer a genealogy that could have given rise to this data set. To run _tsinfer_
 we wrap the .vcz file in a `tsinfer.VariantData` object. This requires an
 *ancestral allele* to be specified for each site; there are
-many methods for calculating there: details are outside the scope of this manual, but we
+many methods for calculating these: details are outside the scope of this manual, but we
 have started a [discussion topic](https://github.com/tskit-dev/tsinfer/discussions/523)
 on this issue to provide some recommendations.

@@ -83,19 +83,29 @@ ancestral_alleles[-1] = "."
 vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_alleles)
 ```

-The `.VariantData` object is a lightweight wrapper for the data from the 3 diploid samples
-in the .vcz file. We'll use the object to infer a tree sequence from the variant data.
-Howeve, note that some sites are not used for genealogical inference. This includes non-variable
-(fixed) sites, singleton sites, and sites where the ancestral allele is unknown: in this example,
-these are seen at site IDs 4, 5 and 7 respectively. In addition,
-multiallelic sites, with more than 2 alleles, are not used for inference (but see
-[here](https://github.com/tskit-dev/tsinfer/issues/670) for a workaround).
+The `VariantData` object is a lightweight wrapper around the .vcz file.
+We'll use it to infer a tree sequence on the basis of the sites that vary between the
+different samples. However, note that certain sites are not used by _tsinfer_ for inferring
+the genealogy (although they are still encoded in the final tree sequence). These are:

-Additionally, during the inference step, extra sites can be flagged as not for use in
-inferring the genealogy, for example if they are deemed unreliable (this is done
-via the `exclude_positions` parameter). Note, however, that even if a site is not used
-for genealogical inference, its genetic variation can still be encoded in the final
-tree sequence.
+* Non-variable (fixed) sites, e.g. site 4 above
+* Singleton sites, where only one genome has the derived allele e.g. site 5 above
+* Sites where the ancestral allele is unknown, e.g. demonstrated by site 7 above
+* Multiallelic sites, with more than 2 alleles (but see
+[here](https://github.com/tskit-dev/tsinfer/issues/670) for a workaround)
+
+Additionally, during the inference step, additional sites can be flagged as not for use in
+inference, for example if they are deemed unreliable (this is done
+via the `exclude_positions` parameter).
+
+### Masks
+
+Sites which are not used for inference will still be included in the final tree sequence, with
+mutations at those sites being placed onto branches by parsimony. However, it is also possible
+to completely exclude sites and samples from the final tree sequence, by specifying a `site_mask`
+and/or a `sample_mask` when creating the `VariantData` object. Such sites or samples will be
+completely omitted from both inference and the final tree sequence. This can be useful, for
+example, to reduce the amount of computation required for an inference.

 ### Topology inference

@@ -186,7 +196,7 @@ algorithm is only intended to infer the genetic relationships between the sample
 (i.e. the *topology* of the tree sequence).


-(sec_tutorial_simulation_example)=
+(sec_usage_simulation_example)=

 ## Simulation example

@@ -416,15 +426,15 @@ Other than the sample node IDs, it is meaningless to compare node numbers in the
 source and inferred tree sequences.
 :::

-(sec_tutorial_data_example)=
+(sec_usage_data_example)=

 ## Data example

 Inputting real data for inference is similar in principle to the examples above.
 All that is required is a .vcz file, which can be created using
 [vcf2zarr](https://sgkit-dev.github.io/bio2zarr/vcf2zarr/overview.html) as above

-(sec_tutorial_read_vcf)=
+(sec_usage_read_vcf)=

 ### Reading a VCF

@@ -440,7 +450,9 @@ vcf_location = "_static/P_dom_chr24_phased.vcf.gz"
 !python -m bio2zarr vcf2zarr convert --force {vcf_location} sparrows.vcz
 ```

-This creates the `sparrows.vcz` datastore, which we open using `tsinfer.VariantData`:
+This creates the `sparrows.vcz` datastore, which we open using `tsinfer.VariantData`.
+The original VCF had ancestral alleles specified in the `AA` INFO field, so we can
+simply provide the string `"variant_AA"` as the ancestral_allele parameter.

 ```{code-cell} ipython3
 # Do the inference: this VCF has ancestral alleles in the AA field
@@ -552,7 +564,7 @@ discrete groups on the tree, but be part of a larger mixing population. Note, ho
 that this is only one of thousands of trees, and may not be typical of the genome as a
 whole. Additionally, most data sets will have far more samples than this example, so
 trees visualized in this way are likely to be huge and difficult to understand. As in
-the {ref}`simulation example <sec_tutorial_simulation_example>` above, one possibility
+the {ref}`simulation example <sec_usage_simulation_example>` above, one possibility
 is to {meth}`~tskit.TreeSequence.simplify` the tree sequence to a limited number of
 samples, but it is likely that most studies will
 instead rely on various statistical summaries of the trees. Storing genetic data as a