Commit ab665ad

zoyahavtfx-copybara authored and committed
Adds common_transformations.md with some how-to's for common use cases.
PiperOrigin-RevId: 489941067
1 parent 83585a4 commit ab665ad

File tree

1 file changed (+62, -0 lines)


docs/common_transformations.md

Lines changed: 62 additions & 0 deletions
# Common Transformations

[TOC]

In this document we describe how to do common transformations with tf.transform.

We assume you have already constructed the Beam pipeline along the lines of the
examples, and only describe what needs to be added to `preprocessing_fn` and
possibly the model.

## Using String/Categorical data

The following `preprocessing_fn` will compute a vocabulary over the values of
feature `x` with tokens in descending frequency order, convert feature `x`
values to their index in the vocabulary, and finally perform a one-hot encoding
for the output.

This is common, for example, in use cases where the label feature is a
categorical string. The resulting one-hot encoding is ready for training.

Note: this example produces `x_out` as a potentially large dense tensor. This is
fine as long as the transformed data doesn't get materialized, and this is the
format expected in training. Otherwise, a more efficient representation would be
a `tf.SparseTensor`, in which case only a single index and value (1) is used to
represent each instance.

```python
def preprocessing_fn(inputs):
  integerized = tft.compute_and_apply_vocabulary(
      inputs['x'],
      num_oov_buckets=1,
      vocab_filename='x_vocab')
  one_hot_encoded = tf.one_hot(
      integerized,
      depth=tf.cast(tft.experimental.get_vocabulary_size_by_name('x_vocab') + 1,
                    tf.int32),
      on_value=1.0,
      off_value=0.0)
  return {
      'x_out': one_hot_encoded,
  }
```
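
To make the effect concrete, here is a plain-Python sketch of the same idea,
without tf.transform: build a frequency-ordered vocabulary, map tokens to their
index with a single OOV bucket, and one-hot encode. The names `build_vocab` and
`one_hot` are illustrative helpers for this sketch, not part of the tft API.

```python
from collections import Counter


def build_vocab(tokens):
  # Tokens in descending frequency order (ties broken alphabetically),
  # mirroring the ordering tft.compute_and_apply_vocabulary uses.
  counts = Counter(tokens)
  return sorted(counts, key=lambda t: (-counts[t], t))


def one_hot(token, vocab):
  # Depth is vocabulary size + 1 OOV bucket, matching depth=vocab_size + 1
  # in the tf.one_hot call above.
  depth = len(vocab) + 1
  index = vocab.index(token) if token in vocab else len(vocab)
  return [1.0 if i == index else 0.0 for i in range(depth)]


vocab = build_vocab(['b', 'a', 'b', 'c', 'b', 'a'])
# vocab == ['b', 'a', 'c']
print(one_hot('a', vocab))       # [0.0, 1.0, 0.0, 0.0]
print(one_hot('unseen', vocab))  # OOV lands in the last bucket
```

Note that unseen tokens map to the final position, which is why the one-hot
depth must account for the OOV bucket.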

## Mean imputation for missing data

In this example, feature `x` is an optional feature, represented as a
`tf.SparseTensor` in the `preprocessing_fn`. In order to convert it to a dense
tensor, we compute its mean, and set the mean to be the default value when it
is missing from an instance.

The resulting dense tensor will have the shape `[None, 1]`; `None` represents
the batch dimension, and the second dimension is the number of values that `x`
can have per instance. In this case it's 1.

```python
def preprocessing_fn(inputs):
  return {
      'x_out': tft.sparse_tensor_to_dense_with_shape(
          inputs['x'], default_value=tft.mean(inputs['x']), shape=[None, 1])
  }
```
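
The imputation logic can be sketched in plain Python, assuming each instance
holds at most one value for `x` and `None` marks a missing value. `impute_mean`
is an illustrative name for this sketch, not a tft function; in the real
pipeline, `tft.mean` computes the mean over the whole dataset in a full pass.

```python
def impute_mean(batch):
  # Mean over the values that are present, then fill missing slots with it,
  # producing the dense [batch_size, 1] shape described above.
  present = [v for v in batch if v is not None]
  mean = sum(present) / len(present)
  return [[v if v is not None else mean] for v in batch]


print(impute_mean([1.0, None, 3.0]))  # [[1.0], [2.0], [3.0]]
```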
