Skip to content

Commit 4098b56

Browse files
hyanwongmergify[bot]
authored andcommitted
Add an indel to the simple example, and change positions
1 parent 34fbe79 commit 4098b56

File tree

7 files changed

+69
-61
lines changed

7 files changed

+69
-61
lines changed
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,2 @@
11
>chr1
2-
GGCTCAGA
2+
nnnnnnnnnnnnnnnnGnnnnnnnnnnnnnnnnnnnnnnnnnnnGnnnnnCnnnnTnnnnnnnnnnnnnnnCnnnAnnnnnnnnnGnnnnnnnnnAnnnn
Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1 @@
1-
chr1 8 6 8 8
1+
chr1 100 6 100 100

docs/_static/example_data.vcz/variant_allele/.zarray

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -10,8 +10,8 @@
1010
"id": "blosc",
1111
"shuffle": 1
1212
},
13-
"dtype": "|S1",
14-
"fill_value": null,
13+
"dtype": "|S4",
14+
"fill_value": "",
1515
"filters": null,
1616
"order": "C",
1717
"shape": [

docs/_static/example_data.vcz/variant_allele/.zattrs

Lines changed: 0 additions & 7 deletions
This file was deleted.
48 Bytes
Binary file not shown.
0 Bytes
Binary file not shown.

docs/usage.md

Lines changed: 65 additions & 50 deletions
Original file line numberDiff line numberDiff line change
@@ -32,34 +32,43 @@ document. However, for the moment we'll just use a pre-generated dataset:
3232

3333
```{code-cell} ipython3
3434
import zarr
35-
ds = zarr.open("_static/example_data.vcz")
35+
vcf_zarr = zarr.open("_static/example_data.vcz")
3636
```
3737

3838
This is what the genotypes stored in that datafile look like:
3939

4040
```{code-cell}
4141
:"tags": ["remove-input"]
4242
import numpy as np
43-
assert all(len(np.unique(a)) == len(a) for a in ds['variant_allele'])
44-
assert any([np.sum(g) == 1 for g in ds['call_genotype']]) # at least one singleton
45-
assert any([np.sum(g) == 0 for g in ds['call_genotype']]) # at least one non-variable
46-
47-
alleles = ds['variant_allele'][:].astype(str)
48-
sites = np.arange(ds['call_genotype'].shape[0])
49-
print(" " * 22, "Site:", " ".join(str(x) for x in range(8)), "\n")
50-
for sample in range(ds['call_genotype'].shape[1]):
51-
for genome in range(ds['call_genotype'].shape[2]):
52-
genotypes = ds['call_genotype'][:,sample, genome]
53-
print(
54-
f"Diploid sample {sample} (genome {genome}):",
55-
" ".join(alleles[sites, genotypes])
56-
)
43+
G = vcf_zarr['call_genotype'][:] # read full genotype matrix into memory
44+
positions = vcf_zarr['variant_position'][:]
45+
alleles = vcf_zarr['variant_allele'][:]
46+
47+
assert any([np.sum(g) == 1 for g in G]) # at least one singleton
48+
assert any([np.sum(g) == 0 for g in G]) # at least one non-variable
49+
assert all(len(np.unique(a)) == len(a) for a in vcf_zarr['variant_allele'][:])
50+
51+
num_sites, num_samples, ploidy = G.shape
52+
print("Diploid sample id:", " ".join(f" {i} " for i in range(num_samples)))
53+
print("Genome for sample: ", " ".join(f" {p} {' ' * p} " for _ in range(num_samples) for p in range(ploidy)))
54+
print("-" * 54)
55+
for site_id, pos in enumerate(positions):
56+
print(f" position {pos}:", end=" ")
57+
for sample_id in range(num_samples):
58+
genotypes = G[site_id, sample_id, :]
59+
site_alleles = alleles[site_id].astype(str)
60+
print(" ".join(f"{a:<4}" for a in site_alleles[genotypes.flatten()]), end=" ")
61+
print()
5762
```
5863

64+
:::{note}
65+
The last site, at position 95, is an indel (insertion or deletion). Indels can be used as long as the indel does not overlap with other variants, only 2 alleles exist, and the ancestral state is known.
66+
:::
67+
5968
### VariantData and ancestral alleles
6069

6170
We wish to infer a genealogy that could have given rise to this data set. To run _tsinfer_
62-
we wrap the .vcz file in a `tsinfer.VariantData` object. This requires an
71+
we wrap the `.vcz` file in a {class}`tsinfer.VariantData` object. This requires an
6372
*ancestral state* to be specified for each site; there are
6473
many methods for calculating these: details are outside the scope of this manual, but we
6574
have started a [discussion topic](https://github.com/tskit-dev/tsinfer/discussions/523)
@@ -71,8 +80,10 @@ in the `variant_AA` field of the .vcz file. It's also possible to provide a nump
7180
of ancestral alleles, of the same length as the number of selected variants. If you have a string
7281
of the ancestral states (e.g. from FASTA) the {meth}`add_ancestral_state_array`
7382
method can be used to convert and save this to the VCF Zarr dataset (under the name
74-
`ancestral_state`). Note that the positions passed to the method should be
75-
zero-based, if you have one-based positions you should prepend an "X" to the string.
83+
`ancestral_state`). Note that this method assumes that the string uses zero-based
84+
indexing, so that the first letter corresponds to a site at position 0. If,
85+
as is typical, the first letter in the string denotes the ancestral state of
86+
the site at position 1 in the .vcz file, you should prepend an "X" to the string.
7687
Alleles that are not in the list of alleles for their respective site are
7788
treated as unknown and not used for inference (with a warning given).
7889

@@ -84,18 +95,16 @@ import pyfaidx
8495
vcf_zarr = zarr.open("_static/example_data.vcz")
8596
8697
reader = pyfaidx.Fasta("_static/example_ancestral_state.fa")
87-
ancestral_string = str(reader["chr1"])
98+
ancestral_str = str(reader["chr1"])
8899
# Our positions are zero-based, if they were one-based we would
89100
# prepend an X here.
90-
# e.g. ancestral_string = "X" + ancestral_string
101+
# e.g. ancestral_str = "X" + ancestral_str
91102
92-
# Set the last site to unknown, for demonstration purposes
93-
ancestral_string = ancestral_string[:-1] + "N"
103+
# Set the ancestral state at the last known position to "N", for demonstration
104+
last_pos = vcf_zarr['variant_position'][-1]
105+
ancestral_str = ancestral_str[:last_pos] + "N" + ancestral_str[(last_pos + 1):]
94106
95-
tsinfer.add_ancestral_state_array(
96-
vcf_zarr,
97-
ancestral_string,
98-
)
107+
tsinfer.add_ancestral_state_array(vcf_zarr, ancestral_str)
99108
vdata = tsinfer.VariantData("_static/example_data.vcz", ancestral_state="ancestral_state")
100109
```
101110

@@ -104,9 +113,11 @@ We'll use it to infer a tree sequence on the basis of the sites that vary betwee
104113
different samples. However, note that certain sites are not used by _tsinfer_ for inferring
105114
the genealogy (although they are still encoded in the final tree sequence), These are:
106115

107-
* Non-variable (fixed) sites, e.g. site 4 above
108-
* Singleton sites, where only one genome has the derived allele e.g. site 5 above
109-
* Sites where the ancestral allele is unknown, e.g. demonstrated by site 7 above
116+
* Non-variable (fixed) sites, e.g. the site at position 71 above
117+
* Singleton sites, where only one genome has the derived allele
118+
e.g. the site at position 75 above
119+
* Sites where the ancestral allele is unknown, e.g. demonstrated above when setting the
120+
ancestral state of the site at position 95 to `N`.
110121
* Multialleleic sites, with more than 2 alleles (but see
111122
[here](https://github.com/tskit-dev/tsinfer/issues/670) for a workaround)
112123

@@ -136,10 +147,10 @@ import numpy as np
136147
137148
# mask out any sites not associated with the contig named "chr1"
138149
# (for demonstration: all sites in this .vcz file are from "chr1" anyway)
139-
chr1_index = np.where(ds.contig_id[:] == "chr1")[0]
140-
site_mask = ds.variant_contig[:] != chr1_index
141-
# also mask out any sites with a position >= 6
142-
site_mask[ds.variant_position[:] >= 6] = True
150+
chr1_index = np.where(vcf_zarr.contig_id[:] == "chr1")[0]
151+
site_mask = vcf_zarr.variant_contig[:] != chr1_index
152+
# also mask out any sites with a position >= 80
153+
site_mask[vcf_zarr.variant_position[:] >= 80] = True
143154
144155
smaller_vdata = tsinfer.VariantData(
145156
"_static/example_data.vcz",
@@ -165,21 +176,25 @@ print(f"Inferred a genetic genealogy for {inferred_ts.num_samples} (haploid) gen
165176

166177
And that's it: we now have a fully functional {class}`tskit.TreeSequence`
167178
object that we can interrogate in the usual ways. For example, we can look
168-
at the {meth}`aligned haplotypes<tskit.TreeSequence.alignments>` in the tree sequence:
179+
at the {meth}`variants<tskit.TreeSequence.variants>` in the tree sequence:
169180

170181
```{code-cell} ipython3
171-
print("Sample\tInferred sequence")
172-
for sample_id, seq in zip(
173-
inferred_ts.samples(),
174-
inferred_ts.alignments(missing_data_character="."),
175-
):
176-
print(sample_id, seq, sep="\t")
182+
print("TS sample", "\t".join(str(s) for s in inferred_ts.samples()), sep="\t")
183+
print("TS individual", "\t".join(str(inferred_ts.node(s).individual) for s in inferred_ts.samples()), sep="\t")
184+
print("-" * 60)
185+
print("Pos (anc state)")
186+
for v in inferred_ts.variants():
187+
print(
188+
f"{int(v.site.position)} ({v.site.ancestral_state})",
189+
"\t".join(f"{a:<4}" for a in np.array(v.alleles)[v.genotypes]),
190+
sep="\t")
177191
```
178192

179-
You will notice that the sample sequences generated by this tree sequence are identical
180-
to the input haplotype data: apart from the imputation of
193+
Comparing the genotypes of each individual in the inferred tree sequence
194+
against the equivalent diploid samples in the .vcz file, you can see that they
195+
are identical. Apart from the imputation of
181196
{ref}`missing data<sec_inference_data_requirements>`, _tsinfer_ is guaranteed to
182-
losslessly encode any given input data, regardless of the inferred topology. You can
197+
losslessly encode any genetic variation data, regardless of the inferred topology. You can
183198
check this programatically if you want:
184199

185200
```{code-cell} ipython3
@@ -193,8 +208,8 @@ for v_orig, v_inferred in zip(vdata.variants(), inferred_ts.variants()):
193208
print("** Genotypes in original dataset and inferred tree sequence are the same **")
194209
```
195210

196-
We can examine the inferred genetic genealogy,
197-
in the form of {ref}`local trees<tutorials:sec_what_is_local_trees>`. _Tsinfer_ has also
211+
We can examine the inferred genetic genealogy, in the form of
212+
{ref}`local trees<tutorials:sec_what_is_local_trees>`. _Tsinfer_ has
198213
placed mutations on the genealogy to explain the observed genetic variation:
199214

200215
```{code-cell} ipython3
@@ -210,8 +225,8 @@ mut_labels = {
210225
node_labels = {u: u for u in inferred_ts.samples()}
211226
212227
inferred_ts.draw_svg(
213-
size=(600, 200),
214-
canvas_size=(620, 200),
228+
size=(600, 300),
229+
canvas_size=(640, 300),
215230
node_labels=node_labels,
216231
mutation_labels=mut_labels,
217232
y_axis=True
@@ -528,7 +543,7 @@ import numpy as np
528543
import tskit
529544
import zarr
530545
531-
ds = zarr.load("sparrows.vcz")
546+
vcf_zarr = zarr.load("sparrows.vcz")
532547
533548
populations = ("Norway", "France")
534549
# save the population data in json format
@@ -541,9 +556,9 @@ metadata = [
541556
zarr.save("sparrows.vcz/populations_metadata", metadata)
542557
543558
# Now assign each diploid sample to a population
544-
num_individuals = ds["sample_id"].shape[0]
559+
num_individuals = vcf_zarr["sample_id"].shape[0]
545560
individuals_population = np.full(num_individuals, tskit.NULL, dtype=np.int32)
546-
for i, name in enumerate(ds["sample_id"]):
561+
for i, name in enumerate(vcf_zarr["sample_id"]):
547562
if name.startswith("FR"):
548563
individuals_population[i] = populations.index("France")
549564
else:

0 commit comments

Comments
 (0)