
Commit 0774732

Documentation on polaris-ai added inside polaris-ai folder. Documents link in main README.md
1 parent 27b7c0d commit 0774732

6 files changed, +200 −0 lines changed


docs/README.md

Lines changed: 5 additions & 0 deletions
@@ -12,4 +12,9 @@ This is the main documentation of the project.

* [Generic Elasticity Strategies](./features/generic-elasticity-strategies.md)
* [Generic SLOs](./features/generic-slos.md)
* [Predictions](./features/predictions.md)
* [Polaris-ai](./polaris-ai/polaris-ai-main.md)
* [Predictive monitoring](./polaris-ai/polaris-ai-predictive-monitoring.md)
* [LSTM](./polaris-ai/polaris-ai-lstm.md)
* [Transformer](./polaris-ai/polaris-ai-transformer.md)
* [Profiling](./polaris-ai/polaris-ai-profiling.md)
* [Documentation of interfaces, classes, etc.](./typedoc)

docs/polaris-ai/polaris-ai-lstm.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@

This folder contains the files needed to run and train the LSTM model.

To re-train the LSTM, run the script `gcd_single-job_multivariate_prediction.py [epochs neurons batch_size] --exp-name [exp]`.

To reproduce the exact same model, the command is: `gcd_single-job_multivariate_prediction.py 400 50 72 --exp_name exp_01`.
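The script's argument handling is not shown here; below is a minimal, purely illustrative sketch of how a CLI with the documented positional arguments and `--exp-name` flag could be wired up with `argparse`. It is not the repository's actual script, and the commented-out helpers `build_model` and `train` are hypothetical.

```Python
# Hypothetical sketch of the CLI described above; the real script may differ.
import argparse

def main():
    parser = argparse.ArgumentParser(description="Train the multivariate LSTM on Google cluster data.")
    parser.add_argument("epochs", type=int, help="number of training epochs, e.g. 400")
    parser.add_argument("neurons", type=int, help="number of LSTM units, e.g. 50")
    parser.add_argument("batch_size", type=int, help="training batch size, e.g. 72")
    parser.add_argument("--exp-name", dest="exp_name", default="exp_01",
                        help="experiment name used to tag the saved model and results")
    args = parser.parse_args()

    # Placeholder calls: the actual training code lives in the repository script.
    # model = build_model(neurons=args.neurons)
    # train(model, epochs=args.epochs, batch_size=args.batch_size, exp_name=args.exp_name)
    print(f"Training for {args.epochs} epochs, {args.neurons} neurons, batch size {args.batch_size} ({args.exp_name})")

if __name__ == "__main__":
    main()
```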

docs/polaris-ai/polaris-ai-main.md

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@

# polaris-ai

This repository contains the AI-related tools developed within the framework of the polaris-slo-cloud project.

Its main purpose is to provide a set of AI-enabled tools that ease and automate the management of SLO-aware clouds. These tools aim to enable better, more business-oriented management of deployments by providing control over high-level SLOs and by creating workload profiles based on metadata.

The final aim is to recommend or automate resource profiling, as well as to predict and perform autoscaling actions on a deployment, ensuring its optimal use without violating any SLO.

The architecture for the AI technologies of the polaris-slo-cloud project is presented below:

![polaris-ai architecture](https://raw.githubusercontent.com/vikcas/figures/main/Polaris-ai_architecture_scheme.png)

In this scheme, the white and grey boxes represent the input data. The blue boxes are the AI technologies that will be researched or used. The purple circles represent the tools that this project will develop; they perform the actions shown in green, ultimately yielding a complete cloud management system.

Currently, this repository contains tools for three key steps: tools to process [cloud workload data](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/data_extraction), tools for [metadata-based profiling](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/profiling), and tools to create and test deep learning models that predict high-level SLOs such as Efficiency, which can be found in the [high-level monitoring](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring) folder.

The [`data_extraction` folder](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/data_extraction) contains the scripts to pre-process the input data. So far, we use the [Google cluster data (2011)](https://research.google/tools/datasets/cluster-workload-traces/) as our primary source.

The core sections of this repository are the [metadata-based profiling](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/profiling) and the predictive [high-level monitoring](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring). Using the predictive high-level monitoring requires workload-specific data, which is only available once the workload is running. To solve this bootstrapping issue and offer personalized, adaptive management to new workloads, we have developed a metadata-based profiling approach that uses static, a priori data about a workload to determine its requirements. The following figure shows this paradigm:

![polaris-ai overview](https://raw.githubusercontent.com/vikcas/figures/main/Polaris%20AI%20overview.png)

The [metadata-based profiling](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/profiling) folder includes the means to generate workload profiles based on metadata, specifically the metadata present in the [Google cluster data (2011)](https://research.google/tools/datasets/cluster-workload-traces/). It also contains the means to generate workload profiles based on their use of low-level cloud resources, which provides a ground truth for evaluating the metadata-based profiling.

The [`high-level monitoring` folder](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring) includes the means for generating and testing models. Specifically, it currently contains the code to develop [LSTM](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/lstm_approach) and [transformer](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/transformer_approach) neural networks. These are ready to predict, in a multi-step fashion, high-level SLOs such as Efficiency, defined as the ratio between used and requested resources.
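Since Efficiency is defined above as the ratio between used and requested resources, a minimal illustrative computation is sketched below. The column names are hypothetical and not taken from the repository; the sketch only shows the arithmetic behind the SLO.

```Python
import pandas as pd

# Hypothetical per-interval resource columns; the actual traces may use different names.
df = pd.DataFrame({
    "cpu_used":      [0.20, 0.35, 0.30],
    "cpu_requested": [0.50, 0.50, 0.50],
    "mem_used":      [0.10, 0.12, 0.15],
    "mem_requested": [0.25, 0.25, 0.25],
})

# Efficiency as the ratio of used to requested resources, averaged over CPU and memory.
df["Efficiency"] = ((df["cpu_used"] / df["cpu_requested"]) +
                    (df["mem_used"] / df["mem_requested"])) / 2
print(df["Efficiency"])
```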
docs/polaris-ai/polaris-ai-predictive-monitoring.md

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@

# High-level predictive monitoring

This repository contains the code to generate LSTM and transformer models for high-level monitoring prediction, or to execute an already trained one.

The data folder contains pre-filtered and pre-processed data from [Google Cluster Data - 2011-2](https://github.com/google/cluster-data/blob/master/ClusterData2011_2.md).

The [models](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/high-level_monitoring/models/lstm_batch72_neurons50_epochs400_do0) folder contains the pre-trained LSTMs. To test one, you can run `python gcd_test_model.py 6318371744`, where the numeric argument is the ID of the job to consider. The Jupyter notebook [`test_gcd-model_predictions.ipynb`](https://github.com/polaris-slo-cloud/polaris-ai/blob/main/predictive_monitoring/high-level_monitoring/lstm_approach/test_gcd-model_predictions.ipynb) offers different ways to test new data.

The [`lstm_approach`](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/high-level_monitoring/lstm_approach) folder contains the code to run the LSTM model. To re-train the LSTM, run the script `gcd_single-job_multivariate_prediction.py [epochs neurons batch_size] --exp-name [exp]`. To reproduce the exact same model, the command is: `gcd_single-job_multivariate_prediction.py 400 50 72 --exp_name exp_01`.

The [`transformer_approach`](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/high-level_monitoring/transformer_approach) folder contains all the components related to the transformer model, as well as a detailed readme file. The model used for resource prediction can be found in the [`model` folder](https://github.com/polaris-slo-cloud/polaris-ai/tree/main/predictive_monitoring/high-level_monitoring/transformer_approach/models). A simplified Python script, [`example.py`](https://github.com/polaris-slo-cloud/polaris-ai/blob/main/predictive_monitoring/high-level_monitoring/transformer_approach/example.py), is also provided to test the model and learn how to use it.
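To illustrate what predicting "in a multi-step fashion" on a workload trace means here, the sketch below turns a univariate series into sliding-window input/target pairs. It is not the repository's own data pipeline; the window sizes and the toy trace are arbitrary examples.

```Python
import numpy as np

def make_windows(series, seq_len=12, prediction_step=3):
    """Split a 1-D series into (input window, multi-step target) pairs."""
    inputs, targets = [], []
    for start in range(len(series) - seq_len - prediction_step + 1):
        inputs.append(series[start:start + seq_len])
        targets.append(series[start + seq_len:start + seq_len + prediction_step])
    return np.array(inputs), np.array(targets)

# Toy efficiency trace; the real experiments use the pre-processed Google cluster data.
trace = np.sin(np.linspace(0, 6 * np.pi, 200)) * 0.5 + 0.5
X, y = make_windows(trace)
print(X.shape, y.shape)  # (186, 12) (186, 3)
```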
docs/polaris-ai/polaris-ai-profiling.md

Lines changed: 17 additions & 0 deletions

@@ -0,0 +1,17 @@

# Workload profiling

This code performs the steps to generate precise and representative profile groups for workloads, based on what we call static metadata.

Static metadata is all the information about a workload that doesn't change at runtime. Ideally, when users deploy applications on the platform, they provide information about the type of workload they intend to submit, details about the operating system, and the applications' priorities.

This data is essential for uncovering behavioral and design patterns of the users and their workloads. However, it is only relevant insofar as it characterizes specific workload execution schemes.

## Approach overview

Given the premises above, and starting from the assumption that the system initially has no profile labels, our approach follows these steps:

1. First, we apply unsupervised learning techniques to find similarities in workload execution (see the sketch after this list). Here, we look at a few key features, namely:
   - CPU
   - Memory
   - Disk
   - Level of parallelization
   - Runtime length
2. Once the algorithm has extracted relevant groups, we derive information from the static metadata, linking workloads to their static features and deducing patterns.
3. Finally, we create models of each of these profiles so that the system can automatically assign new workloads to a group.
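As a rough illustration of step 1, the following sketch clusters workloads on the five features listed above using k-means from scikit-learn. The feature values, the choice of algorithm, and the cluster count are illustrative assumptions, not the repository's actual profiling code.

```Python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy workload measurements: CPU, memory, disk, level of parallelization, runtime length (s).
workloads = np.array([
    [0.8, 0.6, 0.1, 16, 3600],
    [0.7, 0.5, 0.2, 8,  4000],
    [0.1, 0.2, 0.9, 1,  120],
    [0.2, 0.1, 0.8, 2,  200],
    [0.5, 0.9, 0.3, 32, 86400],
])

# Scale features so that runtime length does not dominate the distance metric.
features = StandardScaler().fit_transform(workloads)

# Group workloads into candidate profiles (cluster count chosen arbitrarily here).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
print(kmeans.labels_)
```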
docs/polaris-ai/polaris-ai-transformer.md

Lines changed: 139 additions & 0 deletions

@@ -0,0 +1,139 @@

Data series forecasting using a transformer.

Follow `example.py` in this folder to see a very simple usage of the model.
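The snippets below omit their import statements; a plausible set of imports they rely on is sketched here. The module paths for the project's own helpers (`prepare_data`, `init_transformer`, `LoadGoogleDataset`) are assumptions and may differ in the repository, so they are left commented out.

```Python
# Standard and third-party imports used by the snippets below.
import json
from csv import writer

import numpy as np
import torch
from torch.utils.data import DataLoader

# Project-specific helpers; the exact module names are assumptions.
# from preprocess import prepare_data
# from transformer import init_transformer
# from dataset import LoadGoogleDataset
```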
## Prepare the data for the model

```Python
# Set paths to data
data_path = "../data/task-usage_job-ID-3418339_total.csv"
results_path = "..."
results_file = "...csv"

# Prepare the dataset
df, scaler = prepare_data(data_path)
```

## Set up a device

```Python
# Consider using cuda if available.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
```

## Load and prepare the model

```Python
# Load the model configuration
with open("models/multistep/config.json") as jfile:
    config = json.load(jfile)

# Initialize the model
model = init_transformer(config, device)

# Load the trained weights
model_state, _ = torch.load("models/model_data", map_location=device)
model.load_state_dict(model_state)

# Set the model to evaluation mode
model.eval()
```

## Select the loss function

```Python
# This model has been trained with MSE, but other loss functions can be considered.
loss_f = torch.nn.MSELoss()
```

## Convert the dataset to a PyTorch DataLoader

```Python
# Notice that the first argument is "test". Using "train" or "validation" gives access
# to other parts of the data, but those splits feed the model with different data
# structures, not the sliding window used for the test split.
test_dataset = LoadGoogleDataset("test", seq_len=config["seq_len"],
                                 prediction_step=config["prediction_step"],
                                 data_frame=df)
test_loader = DataLoader(test_dataset, batch_size=1, shuffle=False)
```

## Run the test/forecast loop

```Python
loss_progress = list()

# For multi-step predictions, keep one list of outputs/targets per forecast step.
if config["prediction_step"] > 1:
    outputs = dict()
    targets = dict()
    for ii in range(config["prediction_step"]):
        outputs[str(ii)] = list()
        targets[str(ii)] = list()
else:
    outputs = list()
    targets = list()

for x_enc, x_dec, target in test_loader:
    with torch.no_grad():
        # Send data to the device and prepare dimensions
        x_enc, x_dec, target = x_enc.to(device), x_dec.to(device), target.to(device)
        x_dec = x_dec.unsqueeze(-1)

        # Forecast
        out = model.forward(x_enc.float(), x_dec.float(), training=False)

        # Compute the loss
        loss = loss_f(out.double(), target.double())

        # Store results and target values
        if config["prediction_step"] > 1:
            for ii in range(config["prediction_step"]):
                outputs[str(ii)].append(out.squeeze().cpu().detach().tolist()[ii])
                targets[str(ii)].append(target.squeeze().cpu().detach().tolist()[ii])
        else:
            outputs.append(out.squeeze().cpu().detach().tolist())
            targets.append(target.squeeze().cpu().detach().tolist())

        # Keep the loss values in a list
        loss_progress.append(loss.cpu().detach().tolist())
```
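The loop above collects the per-window losses in `loss_progress` but does not aggregate them. A small optional addition (not part of `example.py`) could summarize them, continuing from the variables defined above:

```Python
# Optional: summarize the per-window test losses collected above.
mean_loss = float(np.mean(loss_progress))
print(f"Mean test loss over {len(loss_progress)} windows: {mean_loss:.6f}")
```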
## Re-scale data

```Python
# Re-scale outputs
l_df = len(df["Efficiency"])
df_computed = df

values = dict()

if config["prediction_step"] > 1:
    eff_out = dict()
    tgt_out = dict()
    for ii in range(config["prediction_step"]):
        real_eff = np.zeros(len(df["Efficiency"]))
        real_eff[l_df - len(outputs[str(ii)]):] = outputs[str(ii)]
        df_computed["Efficiency"] = real_eff
        df_unscaled = scaler.inverse_transform(df_computed)
        eff_out[str(ii)] = df_unscaled[l_df - len(outputs[str(ii)]):, -1]

        real_eff = np.zeros(len(df["Efficiency"]))
        real_eff[l_df - len(outputs[str(ii)]):] = targets[str(ii)]
        df_computed["Efficiency"] = real_eff
        df_unscaled = scaler.inverse_transform(df_computed)
        tgt_out[str(ii)] = df_unscaled[l_df - len(outputs[str(ii)]):, -1]

        values["eff_" + str(ii)] = eff_out[str(ii)].tolist()
        values["tgt_" + str(ii)] = tgt_out[str(ii)].tolist()

else:
    real_eff = np.zeros(len(df["Efficiency"]))
    real_eff[l_df - len(outputs):] = outputs
    df_computed["Efficiency"] = real_eff
    df_unscaled = scaler.inverse_transform(df_computed)
    eff_out = df_unscaled[l_df - len(outputs):, -1]

    real_eff = np.zeros(len(df["Efficiency"]))
    real_eff[l_df - len(outputs):] = targets
    df_computed["Efficiency"] = real_eff
    df_unscaled = scaler.inverse_transform(df_computed)
    tgt_out = df_unscaled[l_df - len(outputs):, -1]

    values["eff"] = eff_out.tolist()
    values["tgt"] = tgt_out.tolist()
```

## Save data

```Python
with open(results_path + results_file, 'w') as f:
    dict_writer = writer(f)
    dict_writer.writerow(values.keys())
    dict_writer.writerows(zip(*values.values()))
```
Transformer model adapted from:

Wu, N., Green, B., Ben, X., & O’Banion, S. (2020). Deep Transformer Models for Time Series Forecasting: The Influenza Prevalence Case. ArXiv. http://arxiv.org/abs/2001.08317

![Transformer model](https://raw.githubusercontent.com/vikcas/figures/main/transformer_model.png?token=ACPGCP7MYUVZO66AIOAFLZDAX6FYU)
