# Common Transformations

[TOC]

In this document we describe how to do common transformations with tf.transform.

We assume you have already constructed the Beam pipeline along the lines of the
examples, and only describe what needs to be added to the `preprocessing_fn` and
possibly to the model.

## Using String/Categorical data

The following `preprocessing_fn` will compute a vocabulary over the values of
feature `x` with tokens in descending frequency order, convert feature `x`
values to their index in the vocabulary, and finally perform a one-hot encoding
for the output.

This is common, for example, in use cases where the label feature is a
categorical string. The resulting one-hot encoding is ready for training.

Note: this example produces `x_out` as a potentially large dense tensor. This is
fine as long as the transformed data doesn't get materialized, and this is the
format expected in training. Otherwise, a more efficient representation would be
a `tf.SparseTensor`, in which case only a single index and the value 1 are used
to represent each instance.

```python
def preprocessing_fn(inputs):
  integerized = tft.compute_and_apply_vocabulary(
      inputs['x'],
      num_oov_buckets=1,
      vocab_filename='x_vocab')
  one_hot_encoded = tf.one_hot(
      integerized,
      depth=tf.cast(tft.experimental.get_vocabulary_size_by_name('x_vocab') + 1,
                    tf.int32),
      on_value=1.0,
      off_value=0.0)
  return {
      'x_out': one_hot_encoded,
  }
```
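To make the vocabulary-then-one-hot semantics concrete, here is a minimal
pure-Python sketch that loosely mirrors the behavior above. The helper names
(`build_vocab`, `integerize`, `one_hot`) are hypothetical illustrations, not
part of the tf.transform API, and the toy data is invented for the example:

```python
from collections import Counter

def build_vocab(values):
  # Tokens ordered by descending frequency, mirroring the default
  # ordering used when computing a vocabulary.
  counts = Counter(values)
  return [token for token, _ in counts.most_common()]

def integerize(value, vocab):
  # Unknown tokens fall into a single out-of-vocabulary bucket at index
  # len(vocab), analogous to num_oov_buckets=1 above.
  try:
    return vocab.index(value)
  except ValueError:
    return len(vocab)

def one_hot(index, depth):
  # Dense one-hot row, like tf.one_hot with on_value=1.0, off_value=0.0.
  return [1.0 if i == index else 0.0 for i in range(depth)]

values = ['cat', 'dog', 'cat', 'bird', 'cat', 'dog']
vocab = build_vocab(values)   # ['cat', 'dog', 'bird']
depth = len(vocab) + 1        # +1 for the OOV bucket
encoded = one_hot(integerize('dog', vocab), depth)  # [0.0, 1.0, 0.0, 0.0]
```

Note how the one-hot depth is the vocabulary size plus one, which is why the
`preprocessing_fn` above adds 1 to `get_vocabulary_size_by_name('x_vocab')`.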

## Mean imputation for missing data

In this example, feature `x` is an optional feature, represented as a
`tf.SparseTensor` in the `preprocessing_fn`. In order to convert it to a dense
tensor, we compute its mean and set the mean to be the default value when it is
missing from an instance.

The resulting dense tensor will have the shape `[None, 1]`; `None` represents
the batch dimension, and the second dimension is the number of values that `x`
can have per instance. In this case it's 1.
| 55 | + |
| 56 | +```python |
| 57 | +def preprocessing_fn(inputs): |
| 58 | + return { |
| 59 | + 'x_out': tft.sparse_tensor_to_dense_with_shape( |
| 60 | + inputs['x'], default_value=tft.mean(x), shape=[None, 1]) |
| 61 | + } |
| 62 | +``` |