
Commit 5eec4c7

Adding pipelines samples
1 parent aa0c55e commit 5eec4c7

8 files changed: +1043 −0 lines changed

pipelines/README.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# OCI Data Science ML Pipelines

This folder contains samples for using OCI Data Science ML Pipelines.

Machine learning pipelines are a crucial component of the modern data science workflow. They automate the process of building, training, and deploying machine learning models, freeing data scientists to focus on tasks such as data exploration and model evaluation.

At a high level, a machine learning pipeline consists of several steps, each performing a specific task, that together complete a workflow. For example, the first step might be data preprocessing, where raw data is cleaned and transformed into a format that can be fed into a machine learning algorithm. The next step might be model training, where the algorithm learns the patterns and relationships in the processed data. Steps can be executed in sequence or in parallel, shortening the time to complete the workflow.

One of the key advantages of machine learning pipelines is the ability to easily repeat and reproduce the entire workflow. This is important for ensuring the reliability and reproducibility of the results, and it makes it easier to experiment with different algorithms and parameters to find the best model for a given problem.
Using pipelines, you can:

- Create an ML pipeline by defining the workflow of its steps (see the sketch after this list).
- Write reusable code for each pipeline step, or use existing ML Jobs as steps.
- Execute the pipeline and set parameters for each run.
- Monitor the execution of the pipeline and review the logs output by the steps.
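As an illustration of the first point, here is a minimal sketch of defining and running a two-step pipeline with the ADS SDK's `ads.pipeline` module. The compute shape, conda slug, script names, and OCIDs are placeholder assumptions; check the ADS documentation for the exact builder methods available in your version.

```python
# A minimal sketch, assuming the ads.pipeline builder API.
# All names below (scripts, shape, conda slug, OCIDs) are placeholders.
from ads.pipeline import Pipeline, PipelineStep, CustomScriptStep
from ads.jobs import ScriptRuntime

infra = (
    CustomScriptStep()
    .with_shape_name("VM.Standard2.1")   # placeholder compute shape
    .with_block_storage_size(50)
)

preprocess = (
    PipelineStep("preprocess")
    .with_infrastructure(infra)
    .with_runtime(
        ScriptRuntime()
        .with_source("preprocess.py")    # placeholder step script
        .with_service_conda("generalml_p38_cpu_v1")
    )
)

train = (
    PipelineStep("train")
    .with_infrastructure(infra)
    .with_runtime(
        ScriptRuntime()
        .with_source("train.py")         # placeholder step script
        .with_service_conda("generalml_p38_cpu_v1")
    )
)

pipeline = (
    Pipeline("sample-pipeline")
    .with_compartment_id("ocid1.compartment.oc1..example")       # placeholder OCID
    .with_project_id("ocid1.datascienceproject.oc1..example")    # placeholder OCID
    .with_step_details([preprocess, train])
    .with_dag(["preprocess >> train"])   # run the steps in sequence
)

pipeline.create()               # create the pipeline resource
pipeline_run = pipeline.run()   # start a pipeline run
```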
## Available Samples

### Simple pipeline with data sharing between steps

[simple pipeline](samples/simple)

This is a very simple sample with three consecutive steps, each passing data to the next step for additional processing.

### Employee attrition sample

[employee attrition](samples/employee-attrition)

This is a full-featured pipeline with data processing, parallel training of models, evaluation of the models, and deployment of the best one to a real-time Model Deployment.
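The parallel-then-select topology of this sample can be expressed as a pipeline DAG. A sketch, assuming the `with_dag` expression syntax of `ads.pipeline` and hypothetical step names (the sample's actual step names may differ):

```python
# Placeholder step names; parentheses group steps that run in parallel.
pipeline = (
    Pipeline("employee-attrition")
    .with_step_details([process_data, train_logreg, train_xgb, evaluate, deploy])
    .with_dag([
        "process_data >> (train_logreg, train_xgb)",  # train models in parallel
        "(train_logreg, train_xgb) >> evaluate",      # compare the trained models
        "evaluate >> deploy",                         # deploy the best model
    ])
)
```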
Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
import os

import pandas as pd
import ads
from ads.common.auth import default_signer

DATAFILE_FILENAME_PREFIX = "pipeline_data_"
DATAFILE_ENV_NAME = "DATA_LOCATION"
DATAFILE_FILENAME_EXT = ".csv"
PIPELINE_RUN_OCID_ENV_NAME = "PIPELINE_RUN_OCID"


class MLPipelineDataHelper:
    """
    Helper functions for passing data between pipeline steps.

    The functions use a temporary file on OCI Object Storage to set/get data between steps in the pipeline.
    The functions expect the environment variable DATA_LOCATION to contain the OCI Object Storage location to use. Here is an example of what this could look like (don't forget the slash / at the end!):
    os.environ["DATA_LOCATION"] = "oci://{bucket_name}@{namespace}/"

    The functions use the PIPELINE_RUN_OCID environment variable in the temporary file name to make it unique to the pipeline run.

    Dependencies:
    ocifs: pip install ocifs
    """

    @staticmethod
    def set_pipeline_param(param_name, param_value):
        """
        Set a parameter. param_name is the key, and param_value is the value.
        For simple small data, like strings, numbers, and even small sets/dataframes/dictionaries, you can use the value as is (pass by value).
        For larger data structures, write the data to a file and use param_value as a reference to the file.
        """
        datafile_loc = os.environ.get(DATAFILE_ENV_NAME)
        ads.set_auth(auth="resource_principal")
        if datafile_loc is not None:
            datafile_fullpath = datafile_loc + DATAFILE_FILENAME_PREFIX + os.environ[PIPELINE_RUN_OCID_ENV_NAME] + DATAFILE_FILENAME_EXT
            try:
                # Load the existing parameters, if the data file already exists.
                ref_data_dfrm = pd.read_csv(datafile_fullpath, header=None, storage_options=default_signer())
                ref_data_dict = dict(ref_data_dfrm.to_dict('split')['data'])
            except FileNotFoundError:
                print("pipeline data file not found. Creating " + datafile_fullpath)
                ref_data_dict = dict()

            ref_data_dict[param_name] = param_value
            output_df = pd.DataFrame.from_dict(ref_data_dict, orient='index')
            output_df.to_csv(datafile_fullpath, header=False, storage_options=default_signer())
            print("Added " + param_name + " = " + str(ref_data_dict[param_name]))
            return

        print("Error: DATA_LOCATION environment variable is not defined")

    @staticmethod
    def get_pipeline_param(param_name):
        """
        Retrieve a previously set parameter by its name.
        """
        datafile_loc = os.environ.get(DATAFILE_ENV_NAME)
        ads.set_auth(auth="resource_principal")
        if datafile_loc is not None:
            datafile_fullpath = datafile_loc + DATAFILE_FILENAME_PREFIX + os.environ[PIPELINE_RUN_OCID_ENV_NAME] + DATAFILE_FILENAME_EXT
            try:
                ref_data_dfrm = pd.read_csv(datafile_fullpath, header=None, storage_options=default_signer())
                ref_data_dict = dict(ref_data_dfrm.to_dict('split')['data'])
                return ref_data_dict[param_name]
            except FileNotFoundError:
                print("pipeline data file not found")
                return None

        print("Error: DATA_LOCATION environment variable is not defined")
        return None

    @staticmethod
    def cleanup_pipeline_params():
        """
        Delete the temporary file from object storage. Call this function before the end of your pipeline.
        """
        import ocifs

        fs = ocifs.OCIFileSystem()
        try:
            datafile_loc = os.environ.get(DATAFILE_ENV_NAME)
            if datafile_loc is not None:
                datafile_fullpath = datafile_loc + DATAFILE_FILENAME_PREFIX + os.environ[PIPELINE_RUN_OCID_ENV_NAME] + DATAFILE_FILENAME_EXT
                fs.rm(datafile_fullpath)
                print("Cleanup completed")
        except Exception:
            # The data file or the PIPELINE_RUN_OCID variable may not exist.
            print("Nothing to cleanup")
