
Commit 0774732

Documentation on polaris-ai added inside polaris-ai folder. Documents link in main README.md
1 parent 27b7c0d commit 0774732

6 files changed, +200 −0 lines changed


docs/README.md

Lines changed: 5 additions & 0 deletions
@@ -12,4 +12,9 @@ This is the main documentation of the project.

* [Generic Elasticity Strategies](./features/generic-elasticity-strategies.md)
* [Generic SLOs](./features/generic-slos.md)
* [Predictions](./features/predictions.md)
* [Polaris-ai](./polaris-ai/polaris-ai-main.md)
* [Predictive monitoring](./polaris-ai/polaris-ai-predictive-monitoring.md)
* [LSTM](./polaris-ai/polaris-ai-lstm.md)
* [Transformer](./polaris-ai/polaris-ai-transformer.md)
* [Profiling](./polaris-ai/polaris-ai-profiling.md)
* [Documentation of interfaces, classes, etc.](./typedoc)

docs/polaris-ai/polaris-ai-lstm.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@

This folder contains the files needed to run and train the LSTM model.

To re-train the LSTM, run the script `gcd_single-job_multivariate_prediction.py [epochs neurons batch_size] --exp-name [exp]`.

To reproduce the exact same model, the command is: `gcd_single-job_multivariate_prediction.py 400 50 72 --exp_name exp_01`.
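The script's argument handling is not shown here; below is a minimal, purely illustrative sketch of how a CLI with the documented positional arguments and `--exp-name` flag could be wired up with `argparse`. It is not the repository's actual script, and the commented-out helpers `build_model` and `train` are hypothetical.

```Python
# Hypothetical sketch of the CLI described above; the real script may differ.
import argparse

def main():
    parser = argparse.ArgumentParser(description="Train the multivariate LSTM on Google cluster data.")
    parser.add_argument("epochs", type=int, help="number of training epochs, e.g. 400")
    parser.add_argument("neurons", type=int, help="number of LSTM units, e.g. 50")
    parser.add_argument("batch_size", type=int, help="training batch size, e.g. 72")
    parser.add_argument("--exp-name", dest="exp_name", default="exp_01",
                        help="experiment name used to tag the saved model and results")
    args = parser.parse_args()

    # Placeholder calls: the actual training code lives in the repository script.
    # model = build_model(neurons=args.neurons)
    # train(model, epochs=args.epochs, batch_size=args.batch_size, exp_name=args.exp_name)
    print(f"Training for {args.epochs} epochs, {args.neurons} neurons, batch size {args.batch_size} ({args.exp_name})")

if __name__ == "__main__":
    main()
```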

docs/polaris-ai/polaris-ai-main.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@

# polaris-ai

This repository contains the AI-related tools developed within the framework of the polaris-slo-cloud project.

Its main purpose is to provide a set of AI-enabled tools that ease and automate the management of SLO-aware clouds. These tools aim to enable better, more business-oriented management of deployments by providing control over high-level SLOs and by creating workload profiles based on metadata.

The final aim is to recommend or automate resource profiling, as well as to predict and perform autoscaling actions on a deployment, ensuring its optimal use without violating any SLO.

The architecture for the AI technologies of the polaris-slo-cloud project is presented below:

![polaris-ai architecture](https://raw.githubusercontent.com/vikcas/figures/main/Polaris-ai_architecture_scheme.png)

In this scheme, the white and grey boxes represent the input data. The blue boxes are the AI technologies that will be researched or used. The purple circles represent the tools that this project will develop; they perform the actions shown in green, ultimately yielding a complete cloud management system.

Currently, this repository contains tools for three key steps: tools to process [cloud workload data](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/data_extraction), tools for [metadata-based profiling](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/profiling), and tools to create and test deep learning models that predict high-level SLOs such as Efficiency, which can be found in the [high-level monitoring](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring) folder.

The [`data_extraction` folder](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/data_extraction) contains the scripts to pre-process the input data. So far, we use the [Google cluster data (2011)](https://research.google/tools/datasets/cluster-workload-traces/) as our primary source.

The core sections of this repository are the [metadata-based profiling](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/profiling) and the predictive [high-level monitoring](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring). Using the predictive high-level monitoring requires workload-specific data, which is only available once the workload is running. To solve this bootstrapping issue and offer personalized, adaptive management to new workloads, we have developed a metadata-based profiling approach that uses static, a priori data about a workload to determine its requirements. The following figure shows this paradigm:

![polaris-ai overview](https://raw.githubusercontent.com/vikcas/figures/main/Polaris%20AI%20overview.png)

The [metadata-based profiling](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/profiling) folder includes the means to generate workload profiles based on metadata, specifically the metadata present in the [Google cluster data (2011)](https://research.google/tools/datasets/cluster-workload-traces/). It also contains the means to generate workload profiles based on their use of low-level cloud resources, which provides a ground truth for evaluating the metadata-based profiling.

The [`high-level monitoring` folder](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring) includes the means for generating and testing models. Specifically, it currently contains the code to develop [LSTM](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/lstm_approach) and [transformer](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/transformer_approach) neural networks. These are ready to predict, in a multi-step fashion, high-level SLOs such as Efficiency, defined as the ratio between used and requested resources.
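Since Efficiency is defined above as the ratio between used and requested resources, a minimal illustrative computation is sketched below. The column names are hypothetical and not taken from the repository; the sketch only shows the arithmetic behind the SLO.

```Python
import pandas as pd

# Hypothetical per-interval resource columns; the actual traces may use different names.
df = pd.DataFrame({
    "cpu_used":      [0.20, 0.35, 0.30],
    "cpu_requested": [0.50, 0.50, 0.50],
    "mem_used":      [0.10, 0.12, 0.15],
    "mem_requested": [0.25, 0.25, 0.25],
})

# Efficiency as the ratio of used to requested resources, averaged over CPU and memory.
df["Efficiency"] = ((df["cpu_used"] / df["cpu_requested"]) +
                    (df["mem_used"] / df["mem_requested"])) / 2
print(df["Efficiency"])
```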
docs/polaris-ai/polaris-ai-predictive-monitoring.md

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@

# High-level predictive monitoring

This repository contains the code to generate LSTM and transformer models for high-level monitoring prediction, or to execute an already trained one.

The data folder contains pre-filtered and pre-processed data from [Google Cluster Data - 2011-2](https://github.com/google/cluster-data/blob/master/ClusterData2011_2.md).

The [models](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/high-level_monitoring/models/lstm_batch72_neurons50_epochs400_do0) folder contains the pre-trained LSTMs. To test one, you can run `python gcd_test_model.py 6318371744`, where the numeric argument is the ID of the job to consider. The Jupyter notebook [`test_gcd-model_predictions.ipynb`](https://github.com/polaris-slo-cloud/polaris-ai/blob/main/predictive_monitoring/high-level_monitoring/lstm_approach/test_gcd-model_predictions.ipynb) offers different ways to test new data.

The [`lstm_approach`](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/high-level_monitoring/lstm_approach) folder contains the code to run the LSTM model. To re-train the LSTM, run the script `gcd_single-job_multivariate_prediction.py [epochs neurons batch_size] --exp-name [exp]`. To reproduce the exact same model, the command is: `gcd_single-job_multivariate_prediction.py 400 50 72 --exp_name exp_01`.

The [`transformer_approach`](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/high-level_monitoring/transformer_approach) folder contains all the components related to the transformer model, as well as a detailed readme file. The model used for resource prediction can be found in the [`model` folder](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/high-level_monitoring/transformer_approach/models). A simplified Python script, [`example.py`](https://github.com/polaris-slo-cloud/polaris-ai/blob/main/predictive_monitoring/high-level_monitoring/transformer_approach/example.py), is also provided to test the model and learn how to use it.
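To illustrate what predicting "in a multi-step fashion" on a workload trace means here, the sketch below turns a univariate series into sliding-window input/target pairs. It is not the repository's own data pipeline; the window sizes and the toy trace are arbitrary examples.

```Python
import numpy as np

def make_windows(series, seq_len=12, prediction_step=3):
    """Split a 1-D series into (input window, multi-step target) pairs."""
    inputs, targets = [], []
    for start in range(len(series) - seq_len - prediction_step + 1):
        inputs.append(series[start:start + seq_len])
        targets.append(series[start + seq_len:start + seq_len + prediction_step])
    return np.array(inputs), np.array(targets)

# Toy efficiency trace; the real experiments use the pre-processed Google cluster data.
trace = np.sin(np.linspace(0, 6 * np.pi, 200)) * 0.5 + 0.5
X, y = make_windows(trace)
print(X.shape, y.shape)  # (186, 12) (186, 3)
```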
docs/polaris-ai/polaris-ai-profiling.md

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@

# Workload profiling

This code performs the steps to generate precise and representative profile groups for workloads, based on what we call static metadata.

Static metadata is all the information about a workload that doesn't change at runtime. Ideally, when users deploy applications on the platform, they provide information about the type of workload they intend to submit, details about the operating system, and the applications' priorities.

This data is essential for uncovering behavioral and design patterns of the users and their workloads. However, it is only relevant insofar as it characterizes specific workload execution schemes.

## Approach overview

Given the premises above, and starting from the assumption that the system initially has no profile labels, our approach follows these steps:

1. First, we apply unsupervised learning techniques to find similarities in workload execution (see the sketch after this list). Here, we look at a few key features, namely:
   - CPU
   - Memory
   - Disk
   - Level of parallelization
   - Runtime length
2. Once the algorithm has extracted relevant groups, we derive information from the static metadata, linking workloads to their static features and deducing patterns.
3. Finally, we create models of each of these profiles so that the system can automatically assign new workloads to a group.
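As a rough illustration of step 1, the following sketch clusters workloads on the five features listed above using k-means from scikit-learn. The feature values, the choice of algorithm, and the cluster count are illustrative assumptions, not the repository's actual profiling code.

```Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy workload measurements: CPU, memory, disk, level of parallelization, runtime length (s).
workloads = np.array([
    [0.8, 0.6, 0.1, 16, 3600],
    [0.7, 0.5, 0.2, 8,  4000],
    [0.1, 0.2, 0.9, 1,  120],
    [0.2, 0.1, 0.8, 2,  200],
    [0.5, 0.9, 0.3, 32, 86400],
])

# Scale features so that runtime length does not dominate the distance metric.
features = StandardScaler().fit_transform(workloads)

# Group workloads into candidate profiles (cluster count chosen arbitrarily here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)
```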
docs/polaris-ai/polaris-ai-transformer.md

Lines changed: 139 additions & 0 deletions

@@ -0,0 +1,139 @@

Data series forecasting using a transformer.

Follow `example.py` in this folder to see a very simple usage of the model.
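The snippets below omit their import statements; a plausible set of imports they rely on is sketched here. The module paths for the project's own helpers (`prepare_data`, `init_transformer`, `LoadGoogleDataset`) are assumptions and may differ in the repository, so they are left commented out.

```Python
# Standard and third-party imports used by the snippets below.
import json
from csv import writer

import numpy as np
import torch
from torch.utils.data import DataLoader

# Project-specific helpers; the exact module names are assumptions.
# from preprocess import prepare_data
# from transformer import init_transformer
# from dataset import LoadGoogleDataset
```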
## Prepare the data for the model

```Python
# Set paths to data
data_path = "../data/task-usage_job-ID-3418339_total.csv"
results_path = "..."
results_file = "...csv"

# Prepare the dataset
df, scaler = prepare_data(data_path)
```

## Set up a device

```Python
# Consider using cuda if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

## Load and prepare the model

```Python
# Load the model configuration
with open("models/multistep/config.json") as jfile:
    config = json.load(jfile)

# Initialize the model
model = init_transformer(config, device)

# Load the trained weights
model_state, _ = torch.load("models/model_data", map_location=device)
model.load_state_dict(model_state)

# Set the model to evaluation mode
model.eval()
```

## Select the loss function

```Python
# This model has been trained with MSE, but other loss functions can be considered.
loss_f = torch.nn.MSELoss()
```

## Convert the dataset to a PyTorch DataLoader

```Python
# Notice that the first argument is "test". Using "train" or "validation" gives access
# to other parts of the data, but those splits feed the model with different data
# structures, not the sliding window used for the test split.
test_dataset = LoadGoogleDataset("test", seq_len=config["seq_len"],
                                 prediction_step=config["prediction_step"],
                                 data_frame=df)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)
```

## Run the test/forecast loop

```Python
loss_progress = list()

# For multi-step predictions, keep one list of outputs/targets per forecast step.
if config["prediction_step"] > 1:
    outputs = dict()
    targets = dict()
    for ii in range(config["prediction_step"]):
        outputs[str(ii)] = list()
        targets[str(ii)] = list()
else:
    outputs = list()
    targets = list()

for x_enc, x_dec, target in test_loader:
    with torch.no_grad():
        # Send data to the device and prepare dimensions
        x_enc, x_dec, target = x_enc.to(device), x_dec.to(device), target.to(device)
        x_dec = x_dec.unsqueeze(-1)

        # Forecast
        out = model.forward(x_enc.float(), x_dec.float(), training=False)

        # Compute the loss
        loss = loss_f(out.double(), target.double())

        # Store results and target values
        if config["prediction_step"] > 1:
            for ii in range(config["prediction_step"]):
                outputs[str(ii)].append(out.squeeze().cpu().detach().tolist()[ii])
                targets[str(ii)].append(target.squeeze().cpu().detach().tolist()[ii])
        else:
            outputs.append(out.squeeze().cpu().detach().tolist())
            targets.append(target.squeeze().cpu().detach().tolist())

        # Keep the loss values in a list
        loss_progress.append(loss.cpu().detach().tolist())
```
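The loop above collects the per-window losses in `loss_progress` but does not aggregate them. A small optional addition (not part of `example.py`) could summarize them, continuing from the variables defined above:

```Python
# Optional: summarize the per-window test losses collected above.
mean_loss = float(np.mean(loss_progress))
print(f"Mean test loss over {len(loss_progress)} windows: {mean_loss:.6f}")
```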
## Re-scale data

```Python
# Re-scale outputs
l_df = len(df["Efficiency"])
df_computed = df

values = dict()

if config["prediction_step"] > 1:
    eff_out = dict()
    tgt_out = dict()
    for ii in range(config["prediction_step"]):
        real_eff = np.zeros(len(df["Efficiency"]))
        real_eff[l_df - len(outputs[str(ii)]):] = outputs[str(ii)]
        df_computed["Efficiency"] = real_eff
        df_unscaled = scaler.inverse_transform(df_computed)
        eff_out[str(ii)] = df_unscaled[l_df - len(outputs[str(ii)]):, -1]

        real_eff = np.zeros(len(df["Efficiency"]))
        real_eff[l_df - len(outputs[str(ii)]):] = targets[str(ii)]
        df_computed["Efficiency"] = real_eff
        df_unscaled = scaler.inverse_transform(df_computed)
        tgt_out[str(ii)] = df_unscaled[l_df - len(outputs[str(ii)]):, -1]

        values["eff_" + str(ii)] = eff_out[str(ii)].tolist()
        values["tgt_" + str(ii)] = tgt_out[str(ii)].tolist()

else:
    real_eff = np.zeros(len(df["Efficiency"]))
    real_eff[l_df - len(outputs):] = outputs
    df_computed["Efficiency"] = real_eff
    df_unscaled = scaler.inverse_transform(df_computed)
    eff_out = df_unscaled[l_df - len(outputs):, -1]

    real_eff = np.zeros(len(df["Efficiency"]))
    real_eff[l_df - len(outputs):] = targets
    df_computed["Efficiency"] = real_eff
    df_unscaled = scaler.inverse_transform(df_computed)
    tgt_out = df_unscaled[l_df - len(outputs):, -1]

    values["eff"] = eff_out.tolist()
    values["tgt"] = tgt_out.tolist()
```

## Save data

```Python
with open(results_path + results_file, 'w') as f:
    dict_writer = writer(f)
    dict_writer.writerow(values.keys())
    dict_writer.writerows(zip(*values.values()))
```
Transformer model adapted from:

Wu, N., Green, B., Ben, X., & O’Banion, S. (2020). Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. ArXiv. http://arxiv.org/abs/2001.08317

![Transformer model](https://raw.githubusercontent.com/vikcas/figures/main/transformer_model.png?token=ACPGCP7MYUVZO66AIOAFLZDAX6FYU)
