docs/usage.md (65 additions & 50 deletions)
@@ -32,34 +32,43 @@ document. However, for the moment we'll just use a pre-generated dataset:
 
 ```{code-cell} ipython3
 import zarr
-ds = zarr.open("_static/example_data.vcz")
+vcf_zarr = zarr.open("_static/example_data.vcz")
 ```
 
 This is what the genotypes stored in that datafile look like:
 
 ```{code-cell}
 :"tags": ["remove-input"]
 import numpy as np
-assert all(len(np.unique(a)) == len(a) for a in ds['variant_allele'])
-assert any([np.sum(g) == 1 for g in ds['call_genotype']]) # at least one singleton
-assert any([np.sum(g) == 0 for g in ds['call_genotype']]) # at least one non-variable
-
-alleles = ds['variant_allele'][:].astype(str)
-sites = np.arange(ds['call_genotype'].shape[0])
-print(" " * 22, "Site:", " ".join(str(x) for x in range(8)), "\n")
-for sample in range(ds['call_genotype'].shape[1]):
-    for genome in range(ds['call_genotype'].shape[2]):
-        genotypes = ds['call_genotype'][:,sample, genome]
-        print(
-            f"Diploid sample {sample} (genome {genome}):",
-            " ".join(alleles[sites, genotypes])
-        )
+G = vcf_zarr['call_genotype'][:] # read full genotype matrix into memory
+positions = vcf_zarr['variant_position'][:]
+alleles = vcf_zarr['variant_allele'][:]
+
+assert any([np.sum(g) == 1 for g in G]) # at least one singleton
+assert any([np.sum(g) == 0 for g in G]) # at least one non-variable
+assert all(len(np.unique(a)) == len(a) for a in vcf_zarr['variant_allele'][:])
+
+num_sites, num_samples, ploidy = G.shape
+print("Diploid sample id:", " ".join(f" {i} " for i in range(num_samples)))
+print("Genome for sample: ", " ".join(f" {p} {' ' * p} " for _ in range(num_samples) for p in range(ploidy)))
+print("-" * 54)
+for site_id, pos in enumerate(positions):
+    print(f" position {pos}:", end=" ")
+    for sample_id in range(num_samples):
+        genotypes = G[site_id, sample_id, :]
+        site_alleles = alleles[site_id].astype(str)
+        print(" ".join(f"{a:<4}" for a in site_alleles[genotypes.flatten()]), end=" ")
+    print()
 ```
 
+:::{note}
+The last site, at position 95, is an indel (insertion or deletion). Indels can be used as long as the indel does not overlap with other variants, only 2 alleles exist, and the ancestral state is known.
+:::
+
 ### VariantData and ancestral alleles
 
 We wish to infer a genealogy that could have given rise to this data set. To run _tsinfer_
-we wrap the .vcz file in a `tsinfer.VariantData` object. This requires an
+we wrap the `.vcz` file in a {class}`tsinfer.VariantData` object. This requires an
 *ancestral state* to be specified for each site; there are
 many methods for calculating these: details are outside the scope of this manual, but we
 have started a [discussion topic](https://github.com/tskit-dev/tsinfer/discussions/523)
@@ -71,8 +80,10 @@ in the `variant_AA` field of the .vcz file. It's also possible to provide a nump
 of ancestral alleles, of the same length as the number of selected variants. If you have a string
 of the ancestral states (e.g. from FASTA) the {meth}`add_ancestral_state_array`
 method can be used to convert and save this to the VCF Zarr dataset (under the name
-`ancestral_state`). Note that the positions passed to the method should be
-zero-based, if you have one-based positions you should prepend an "X" to the string.
+`ancestral_state`). Note that this method assumes that the string uses zero-based
+indexing, so that the first letter corresponds to a site at position 0. If,
+as is typical, the first letter in the string denotes the ancestral state of
+the site at position 1 in the .vcz file, you should prepend an "X" to the string.
 Alleles that are not in the list of alleles for their respective site are
 treated as unknown and not used for inference (with a warning given).
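
Note on the zero-based indexing wording in the second hunk: the effect of prepending an "X" can be illustrated with plain Python string indexing. This is only a sketch of the off-by-one reasoning and does not call tsinfer. The helper name `ancestral_alleles_at`, the example string and the positions are all hypothetical.

```python
def ancestral_alleles_at(ancestral_str, positions, one_based=True):
    """Return the letter of ancestral_str that corresponds to each position.

    Hypothetical helper for illustration only; it is not part of tsinfer.
    """
    if one_based:
        # Pad with a placeholder so that ancestral_str[pos] is the state of
        # the site at one-based position pos (index 0 is never a real site).
        ancestral_str = "X" + ancestral_str
    return [ancestral_str[int(p)] for p in positions]


# Hypothetical FASTA-like string whose first letter describes position 1.
fasta_like = "ACGTTGCA"
positions = [1, 4, 8]  # hypothetical one-based variant positions

padded = ancestral_alleles_at(fasta_like, positions, one_based=True)
print(padded)
```

Without the padding, every lookup would be shifted by one base, and the lookup for the final position (8) would fall off the end of the eight-letter string.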