Commit 0b6ee61

Add ORC reader tutorial (#1465)
* Add ORC reader tutorial
* clean up notebook
* address comments
* address comments
* address comments
* address comment: remove outputs and add desc for dataset
* fix lint
* fix lint: Prefer second person instead of first person.
* address comments
* fix typo
1 parent 77da6bc commit 0b6ee61

2 files changed: +335 -1 lines changed

docs/tutorials/_toc.yaml

+2 -1
@@ -36,4 +36,5 @@ toc:
36 36
path: /io/tutorials/elasticsearch
37 37
- title: "Avro"
38 38
path: /io/tutorials/avro
39-
39+
- title: "ORC"
40+
path: /io/tutorials/orc

docs/tutorials/orc.ipynb

+333
@@ -0,0 +1,333 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {
6+
"id": "Tce3stUlHN0L"
7+
},
8+
"source": [
9+
"##### Copyright 2021 The TensorFlow Authors."
10+
]
11+
},
12+
{
13+
"cell_type": "code",
14+
"execution_count": 1,
15+
"metadata": {
16+
"cellView": "form",
17+
"id": "tuOe1ymfHZPu"
18+
},
19+
"outputs": [],
20+
"source": [
21+
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
22+
"# you may not use this file except in compliance with the License.\n",
23+
"# You may obtain a copy of the License at\n",
24+
"#\n",
25+
"# https://www.apache.org/licenses/LICENSE-2.0\n",
26+
"#\n",
27+
"# Unless required by applicable law or agreed to in writing, software\n",
28+
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
29+
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
30+
"# See the License for the specific language governing permissions and\n",
31+
"# limitations under the License."
32+
]
33+
},
34+
{
35+
"cell_type": "markdown",
36+
"metadata": {
37+
"id": "qFdPvlXBOdUN"
38+
},
39+
"source": [
40+
"# Apache ORC Reader"
41+
]
42+
},
43+
{
44+
"cell_type": "markdown",
45+
"metadata": {
46+
"id": "MfBg1C5NB3X0"
47+
},
48+
"source": [
49+
"<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
50+
" <td>\n",
51+
" <a target=\"_blank\" href=\"https://www.tensorflow.org/io/tutorials/orc\"><img src=\"https://www.tensorflow.org/images/tf_logo_32px.png\" />View on TensorFlow.org</a>\n",
52+
" </td>\n",
53+
" <td>\n",
54+
" <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/io/blob/master/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/colab_logo_32px.png\" />Run in Google Colab</a>\n",
55+
" </td>\n",
56+
" <td>\n",
57+
" <a target=\"_blank\" href=\"https://github.com/tensorflow/io/blob/master/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/GitHub-Mark-32px.png\" />View on GitHub</a>\n",
58+
" </td>\n",
59+
" <td>\n",
60+
" <a href=\"https://storage.googleapis.com/tensorflow_docs/io/docs/tutorials/orc.ipynb\"><img src=\"https://www.tensorflow.org/images/download_logo_32px.png\" />Download notebook</a>\n",
61+
" </td>\n",
62+
"</table>"
63+
]
64+
},
65+
{
66+
"cell_type": "markdown",
67+
"metadata": {
68+
"id": "xHxb-dlhMIzW"
69+
},
70+
"source": [
71+
"## Overview\n",
72+
"\n",
73+
"Apache ORC is a popular columnar storage format. The tensorflow-io package provides a default implementation for reading [Apache ORC](https://orc.apache.org/) files."
74+
]
75+
},
76+
{
77+
"cell_type": "markdown",
78+
"metadata": {
79+
"id": "MUXex9ctTuDB"
80+
},
81+
"source": [
82+
"## Setup"
83+
]
84+
},
85+
{
86+
"cell_type": "markdown",
87+
"metadata": {
88+
"id": "1Eh-iCRVBm0p"
89+
},
90+
"source": [
91+
"Install the required packages, then restart the runtime.\n"
92+
]
93+
},
94+
{
95+
"cell_type": "code",
96+
"execution_count": 2,
97+
"metadata": {
98+
"id": "g7cxbf1-skn6"
99+
},
100+
"outputs": [],
101+
"source": [
102+
"!pip install tensorflow-io"
103+
]
104+
},
105+
{
106+
"cell_type": "code",
107+
"execution_count": 3,
108+
"metadata": {
109+
"id": "IqR2PQG4ZaZ0"
110+
},
111+
"outputs": [],
112+
"source": [
113+
"import tensorflow as tf\n",
114+
"import tensorflow_io as tfio"
115+
]
116+
},
117+
{
118+
"cell_type": "markdown",
119+
"metadata": {
120+
"id": "EyHfC3nEzseN"
121+
},
122+
"source": [
123+
"### Download a sample dataset file in ORC format"
124+
]
125+
},
126+
{
127+
"cell_type": "markdown",
128+
"metadata": {
129+
"id": "ZjEeF6Fva8UO"
130+
},
131+
"source": [
132+
"The dataset you will use here is the [Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris) from UCI. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. Each row has 4 feature columns, (1) sepal length, (2) sepal width, (3) petal length, and (4) petal width, followed by a final column that contains the class label."
133+
]
134+
},
135+
{
136+
"cell_type": "code",
137+
"execution_count": 4,
138+
"metadata": {
139+
"id": "zaiXjZiXzrHs"
140+
},
141+
"outputs": [],
142+
"source": [
143+
"!curl -OL https://github.com/tensorflow/io/raw/master/tests/test_orc/iris.orc\n",
144+
"!ls -l iris.orc"
145+
]
146+
},
147+
{
148+
"cell_type": "markdown",
149+
"metadata": {
150+
"id": "7DG9JTJ0-bzg"
151+
},
152+
"source": [
153+
"## Create a dataset from the file"
154+
]
155+
},
156+
{
157+
"cell_type": "code",
158+
"execution_count": 35,
159+
"metadata": {
160+
"id": "ppFAjXAYsj-z"
161+
},
162+
"outputs": [],
163+
"source": [
164+
"dataset = tfio.IODataset.from_orc(\"iris.orc\", capacity=15).batch(1)"
165+
]
166+
},
167+
{
168+
"cell_type": "markdown",
169+
"metadata": {
170+
"id": "4xPr3f4LVdeN"
171+
},
172+
"source": [
173+
"Examine the dataset:"
174+
]
175+
},
176+
{
177+
"cell_type": "code",
178+
"execution_count": 42,
179+
"metadata": {
180+
"id": "9B1QUKG70Lzs"
181+
},
182+
"outputs": [],
183+
"source": [
184+
"for item in dataset.take(1):\n",
185+
"    print(item)\n"
186+
]
187+
},
188+
{
189+
"cell_type": "markdown",
190+
"metadata": {
191+
"id": "03qncHJPVNK3"
192+
},
193+
"source": [
194+
"Next, walk through an end-to-end example of training a tf.keras model on an ORC dataset built from the iris data."
195+
]
196+
},
197+
{
198+
"cell_type": "markdown",
199+
"metadata": {
200+
"id": "tDkpKRMVcPfb"
201+
},
202+
"source": [
203+
"### Data preprocessing"
204+
]
205+
},
206+
{
207+
"cell_type": "markdown",
208+
"metadata": {
209+
"id": "nDgkfWFRVjKz"
210+
},
211+
"source": [
212+
"Configure which columns are features and which column is the label:"
213+
]
214+
},
215+
{
216+
"cell_type": "code",
217+
"execution_count": 47,
218+
"metadata": {
219+
"id": "R1OYAybz07dr"
220+
},
221+
"outputs": [],
222+
"source": [
223+
"feature_cols = [\"sepal_length\", \"sepal_width\", \"petal_length\", \"petal_width\"]\n",
224+
"label_cols = [\"species\"]\n",
225+
"\n",
226+
"# select feature columns\n",
227+
"feature_dataset = tfio.IODataset.from_orc(\"iris.orc\", columns=feature_cols)\n",
228+
"# select label columns\n",
229+
"label_dataset = tfio.IODataset.from_orc(\"iris.orc\", columns=label_cols)"
230+
]
231+
},
232+
{
233+
"cell_type": "markdown",
234+
"metadata": {
235+
"id": "GSYMP48vVvV0"
236+
},
237+
"source": [
238+
"A lookup table that maps species names to integer class IDs for model training:"
239+
]
240+
},
241+
{
242+
"cell_type": "code",
243+
"execution_count": 48,
244+
"metadata": {
245+
"id": "TQvuE7OgVs1q"
246+
},
247+
"outputs": [],
248+
"source": [
249+
"vocab_init = tf.lookup.KeyValueTensorInitializer(\n",
250+
"    keys=tf.constant([\"virginica\", \"versicolor\", \"setosa\"]),\n",
251+
"    values=tf.constant([0, 1, 2], dtype=tf.int64))\n",
252+
"vocab_table = tf.lookup.StaticVocabularyTable(\n",
253+
"    vocab_init,\n",
254+
"    num_oov_buckets=4)"
255+
]
256+
},
257+
{
258+
"cell_type": "code",
259+
"execution_count": 49,
260+
"metadata": {
261+
"id": "lpf0w41iWAZ4"
262+
},
263+
"outputs": [],
264+
"source": [
265+
"label_dataset = label_dataset.map(vocab_table.lookup)\n",
266+
"dataset = tf.data.Dataset.zip((feature_dataset, label_dataset))\n",
267+
"dataset = dataset.batch(1)\n",
268+
"\n",
269+
"def pack_features_vector(features, labels):\n",
270+
"    \"\"\"Pack the features into a single array.\"\"\"\n",
271+
"    features = tf.stack(list(features), axis=1)\n",
272+
"    return features, labels\n",
273+
"\n",
274+
"dataset = dataset.map(pack_features_vector)"
275+
]
276+
},
277+
{
278+
"cell_type": "markdown",
279+
"metadata": {
280+
"id": "R1Tyf3AodC2Y"
281+
},
282+
"source": [
283+
"## Build, compile and train the model"
284+
]
285+
},
286+
{
287+
"cell_type": "markdown",
288+
"metadata": {
289+
"id": "oVB9Q0B-WDn4"
290+
},
291+
"source": [
292+
"Finally, you are ready to build and train the model. You will build a 3-layer Keras model to predict the class of the iris plant from the dataset you just processed."
293+
]
294+
},
295+
{
296+
"cell_type": "code",
297+
"execution_count": 50,
298+
"metadata": {
299+
"id": "tToy0FoOWG-9"
300+
},
301+
"outputs": [],
302+
"source": [
303+
"model = tf.keras.Sequential(\n",
304+
"    [\n",
305+
"        tf.keras.layers.Dense(\n",
306+
"            10, activation=tf.nn.relu, input_shape=(4,)\n",
307+
"        ),\n",
308+
"        tf.keras.layers.Dense(10, activation=tf.nn.relu),\n",
309+
"        tf.keras.layers.Dense(3),\n",
310+
"    ]\n",
311+
")\n",
312+
"\n",
313+
"model.compile(optimizer=\"adam\", loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), metrics=[\"accuracy\"])\n",
314+
"model.fit(dataset, epochs=5)"
315+
]
316+
}
317+
],
318+
"metadata": {
319+
"colab": {
320+
"collapsed_sections": [
321+
"Tce3stUlHN0L"
322+
],
323+
"name": "orc.ipynb",
324+
"toc_visible": true
325+
},
326+
"kernelspec": {
327+
"display_name": "Python 3",
328+
"name": "python3"
329+
}
330+
},
331+
"nbformat": 4,
332+
"nbformat_minor": 0
333+
}
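
Note on the label lookup in the notebook above: `tf.lookup.StaticVocabularyTable` returns the configured ID for a known key and assigns any other key to one of `num_oov_buckets` extra IDs that follow the vocabulary. The sketch below mimics that behavior in plain Python for illustration only. The helper name `lookup_label` and the CRC32 hash are assumptions made for this example; TensorFlow uses its own fingerprint hash, so the exact bucket chosen for an unknown key will differ.

```python
import zlib

# Same key/value pairs as the notebook's KeyValueTensorInitializer.
VOCAB = {"virginica": 0, "versicolor": 1, "setosa": 2}
NUM_OOV_BUCKETS = 4  # matches num_oov_buckets=4 in the notebook


def lookup_label(species: str) -> int:
    """Return the class ID for a species name (hypothetical helper).

    Known names map to their vocabulary ID. Unknown names are hashed into
    one of NUM_OOV_BUCKETS IDs that follow the vocabulary. CRC32 is a
    stand-in hash for illustration, not the hash TensorFlow uses.
    """
    if species in VOCAB:
        return VOCAB[species]
    bucket = zlib.crc32(species.encode("utf-8")) % NUM_OOV_BUCKETS
    return len(VOCAB) + bucket


known_ids = [lookup_label(name) for name in ("virginica", "versicolor", "setosa")]
unknown_id = lookup_label("not-an-iris")
print(known_ids)   # [0, 1, 2]
print(unknown_id)  # some ID in the range 3..6
```

Assuming the file contains only the three species names listed in the vocabulary, every label falls in 0..2, which lines up with the three output units of the final `Dense(3)` layer. An out-of-vocabulary ID of 3 or higher would fall outside that range.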

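Note on `pack_features_vector` in the notebook above: it turns a tuple of per-column batches into a matrix with one row per example, which is the shape the first `Dense` layer expects (`input_shape=(4,)`). As a rough plain-Python analogue, stacking along axis 1 amounts to transposing columns into rows. This is not TensorFlow code; `pack_columns` is a hypothetical name and the sample values are made up for illustration.

```python
from typing import List, Sequence


def pack_columns(columns: Sequence[Sequence[float]]) -> List[List[float]]:
    """Transpose per-column batches into per-example feature rows.

    `columns` holds one sequence per feature, each of length batch_size.
    The result has batch_size rows with one value per feature, mirroring
    what stacking the column tensors along axis 1 produces.
    """
    lengths = {len(col) for col in columns}
    if len(lengths) > 1:
        raise ValueError("all feature columns must have the same batch size")
    return [list(row) for row in zip(*columns)]


# Four feature columns with batch size 1, as in the notebook's batch(1).
# The numbers are illustrative, not read from iris.orc.
batch_of_one = pack_columns([[5.1], [3.5], [1.4], [0.2]])
print(batch_of_one)  # [[5.1, 3.5, 1.4, 0.2]]

# The same helper with a batch of two examples.
batch_of_two = pack_columns([[5.1, 4.9], [3.5, 3.0], [1.4, 1.4], [0.2, 0.2]])
print(batch_of_two)  # [[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2]]
```

With four feature columns, each packed row has four values regardless of batch size, so the same packing step should keep working if `batch(1)` is changed to a larger batch.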