
Commit cc36c0c

Merge pull request #27 from DistributedScience/auto_monitor
create auto-monitor
2 parents 102fd32 + 0ae8e87 commit cc36c0c

File tree

5 files changed (+273 −43 lines):

* config.py
* documentation/DS-documentation/step_1_configuration.md
* documentation/DS-documentation/step_4_monitor.md
* lambda_function.py
* run.py

config.py (+3)

@@ -33,6 +33,9 @@
 # LOG GROUP INFORMATION:
 LOG_GROUP_NAME = APP_NAME

+# MONITORING
+AUTO_MONITOR = 'True'
+
 # CLOUDWATCH DASHBOARD CREATION
 CREATE_DASHBOARD = 'True' # Create a dashboard in Cloudwatch for run
 CLEAN_DASHBOARD = 'True' # Automatically remove dashboard at end of run with Monitor

documentation/DS-documentation/step_1_configuration.md (+5)

@@ -64,6 +64,11 @@ This queue will be automatically made if it doesn't exist already.

 ***

+### MONITORING
+* **AUTO_MONITOR:** Whether or not to have Auto-Monitor automatically monitor your jobs.
+
+***
+
 ### CLOUDWATCH DASHBOARD CREATION

 * **CREATE_DASHBOARD:** Should DS create a Cloudwatch Dashboard that plots run metrics?

documentation/DS-documentation/step_4_monitor.md (+39 −43)

@@ -2,72 +2,68 @@

 Your workflow is now submitted.
 Distributed-Something will keep an eye on a few things for you at this point without you having to do anything else.
-
 * Each instance is labeled with your APP_NAME, so that you can easily find your instances if you want to look at the instance metrics on the Running Instances section of the [EC2 web interface](https://console.aws.amazon.com/ec2/v2/home) to monitor performance.
-
 * You can also look at the whole-cluster CPU and memory usage statistics related to your APP_NAME in the [ECS web interface](https://console.aws.amazon.com/ecs/home).
-
 * Each instance will have an alarm placed on it so that if CPU usage dips below 1% for 15 consecutive minutes (almost always the result of a crashed machine), the instance will be automatically terminated and a new one will take its place.
-
 * Each individual job processed will create a log of the CellProfiler output, and each Docker container will create a log showing CPU, memory, and disk usage.

-If you choose to run the monitor script, Distributed-Something can be even more helpful.
-The monitor can be run by entering `python run.py monitor files/APP_NAMESpotFleetRequestId.json`.
+If you choose to run the Monitor script, Distributed-Something can be even more helpful.

-(**Note:** You should run the monitor inside [Screen](https://www.gnu.org/software/screen/), [tmux](https://tmux.github.io/), or another comparable service to keep a network disconnection from killing your monitor; this is particularly critical the longer your run takes.)
+## Running Monitor

-***
+### Manually running Monitor
+Monitor can be run by entering `python run.py monitor files/APP_NAMESpotFleetRequestId.json`.
+While the optimal time to initiate Monitor is as soon as you have triggered a run (since it downscales infrastructure as the run progresses), you can run Monitor at any point in time and it will clean up whatever infrastructure remains.

-## Monitor file
+**Note:** You should run the monitor inside [Screen](https://www.gnu.org/software/screen/), [tmux](https://tmux.github.io/), or another comparable service to keep a network disconnection from killing your monitor; this is particularly critical the longer your run takes.

-The JSON monitor file containing all the information Distributed-Something needs will have been automatically created when you sent the instructions to start your cluster in the [previous step](step_3_start_cluster).
-The file itself is quite simple and contains the following information:
-
-```
-{"MONITOR_FLEET_ID" : "sfr-9999ef99-99fc-9d9d-9999-9999999e99ab",
-"MONITOR_APP_NAME" : "2021_12_13_Project_Analysis",
-"MONITOR_ECS_CLUSTER" : "default",
-"MONITOR_QUEUE_NAME" : "2021_12_13_Project_AnalysisQueue",
-"MONITOR_BUCKET_NAME" : "bucket-name",
-"MONITOR_LOG_GROUP_NAME" : "2021_12_13_Project_Analysis",
-"MONITOR_START_TIME" : "1649187798951"}
-```
-
-For any DS run where you have run [`startCluster`](step_3_start_cluster) more than once, the most recent values will overwrite the older values in the monitor file.
-Therefore, if you have started multiple spot fleets (which you might do in different subnets if you are having trouble getting enough capacity in your spot fleet, for example), monitor will only clean up the latest request unless you manually edit the `MONITOR_FLEET_ID` to match the spot fleet you have kept.
-
-***
+### Using Auto-Monitor
+Instead of manually triggering Monitor, you can have a version of Monitor automatically initiate after you [start your cluster](step_3_start_cluster.md) by setting `AUTO_MONITOR = 'True'` in your [config file](step_1_configuration.md).
+Auto-Monitor is an AWS Lambda function that is triggered by alarms placed on the SQS queue.
+Read more about the [SQS Queue](SQS_QUEUE_information.md) to better understand the alarm metrics.

 ## Monitor functions

 ### While your analysis is running
+* Scales down the spot fleet request to match the number of remaining jobs WITHOUT force terminating them.
+This happens every 10 minutes with manual Monitor, or whenever there are no Visible Messages in your queue with Auto-Monitor.
+* Deletes the alarms for any instances that have been terminated in the last 24 hours (because of spot prices rising above your maximum bid, machine crashes, etc).
+This happens every hour with manual Monitor, or whenever there are no Visible Messages in your queue with Auto-Monitor.

-* Checks your queue once per minute to see how many jobs are currently processing and how many remain to be processed.
-
-* Once per hour, it deletes the alarms for any instances that have been terminated in the last 24 hours (because of spot prices rising above your maximum bid, machine crashes, etc).
-
-### When the number of jobs in your queue goes to 0
-
+### When your queue is totally empty (there are no Visible or Not Visible messages)
 * Downscales the ECS service associated with your APP_NAME.
-
 * Deletes all the alarms associated with your spot fleet (both the currently running and the previously terminated instances).
-
 * Shuts down your spot fleet to keep you from incurring charges after your analysis is over.
-
 * Gets rid of the queue, service, and task definition created for this analysis.
-
 * Exports all the logs from your analysis onto your S3 bucket.
+* Deletes your Cloudwatch Dashboard if you created it and set `CLEAN_DASHBOARD = 'True'` in your [config file](step_1_configuration.md).

-* Deletes your Cloudwatch Dashboard if you created it and set CLEAN_DASHBOARD to True.
+## Cheapest mode
+
+If you are manually triggering Monitor, you can run the monitor in an optional "cheapest" mode, which will downscale the number of requested machines (but not RUNNING machines) to one machine 15 minutes after the monitor is engaged.
+You can engage cheapest mode by adding `True` as a final configurable parameter when starting the monitor, i.e. `python run.py monitor files/APP_NAMESpotFleetRequestId.json True`
+
+Cheapest mode is cheapest because it will remove all but 1 machine as soon as that machine crashes and/or runs out of jobs to do; this can save you money, particularly in multi-CPU Dockers running long jobs.
+This mode is optional because running this way involves some inherent risks.
+If machines stall out due to processing errors, they will not be replaced, meaning your job will take longer overall.
+Additionally, if there is limited capacity for your requested configuration when you first start (e.g. you want 200 machines but AWS says it can currently only allocate you 50), more machines will not be added if and when they become available in cheapest mode as they would in normal mode.

 ***

-## Cheapest mode
+## Monitor file

-You can run the monitor in an optional "cheapest" mode, which will downscale the number of requested machines (but not RUNNING machines) to one 15 minutes after the monitor is engaged.
-You can engage cheapest mode by adding `True` as a final configurable parameter when starting the monitor, aka `python run.py monitor files/APP_NAMESpotFleetRequestId.json True`
+The JSON monitor file containing all the information Distributed-Something needs will have been automatically created when you sent the instructions to start your cluster in the [previous step](step_3_start_cluster).
+The file itself is quite simple and contains the following information:

-Cheapest mode is cheapest because it will remove all but 1 machine as soon as that machine crashes and/or runs out of jobs to do; this can save you money, particularly in multi-CPU Dockers running long jobs.
+```
+{"MONITOR_FLEET_ID" : "sfr-9999ef99-99fc-9d9d-9999-9999999e99ab",
+"MONITOR_APP_NAME" : "2021_12_13_Project_Analysis",
+"MONITOR_ECS_CLUSTER" : "default",
+"MONITOR_QUEUE_NAME" : "2021_12_13_Project_AnalysisQueue",
+"MONITOR_BUCKET_NAME" : "bucket-name",
+"MONITOR_LOG_GROUP_NAME" : "2021_12_13_Project_Analysis",
+"MONITOR_START_TIME" : "1649187798951"}
+```

-This mode is optional because running this way involves some inherent risks- if machines stall out due to processing errors, they will not be replaced, meaning your job will take overall longer.
-Additionally, if there is limited capacity for your requested configuration when you first start (e.g. you want 200 machines but AWS says it can currently only allocate you 50), more machines will not be added if and when they become available in cheapest mode as they would in normal mode.
+For any DS run where you have run [`startCluster`](step_3_start_cluster) more than once, the most recent values will overwrite the older values in the monitor file.
+Therefore, if you have started multiple spot fleets (which you might do in different subnets if you are having trouble getting enough capacity in your spot fleet, for example), Monitor will only clean up the latest request unless you manually edit the `MONITOR_FLEET_ID` to match the spot fleet you have kept.
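The cheapest-mode behavior described in the new documentation (downscaling the *requested* capacity to a single machine without terminating running instances) is not part of this commit; a minimal sketch of what that kind of downscale call looks like, with an illustrative fleet ID taken from the monitor file's `MONITOR_FLEET_ID`, might be:

```
import boto3

ec2 = boto3.client("ec2")

# Illustrative spot fleet request ID; in practice this comes from the
# MONITOR_FLEET_ID field of the monitor file.
fleet_id = "sfr-9999ef99-99fc-9d9d-9999-9999999e99ab"

# Request a single machine, but do not force-terminate instances that are
# still working on jobs ("noTermination").
ec2.modify_spot_fleet_request(
    SpotFleetRequestId=fleet_id,
    TargetCapacity=1,
    ExcessCapacityTerminationPolicy="noTermination",
)
```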

lambda_function.py (new file, +187)

import boto3
import datetime
import time
import botocore
import json

s3 = boto3.client("s3")
ecs = boto3.client("ecs")
ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")
sqs = boto3.client("sqs")
logs = boto3.client("logs")  # CloudWatch Logs client, needed for describe_export_tasks

bucket = "BUCKET_NAME"  # placeholder: edit to match your AWS_BUCKET


def killdeadAlarms(fleetId, monitorapp, project):
    # Delete alarms for any instances terminated today or yesterday
    checkdates = [
        datetime.datetime.now().strftime("%Y-%m-%d"),
        (datetime.datetime.now() - datetime.timedelta(days=1)).strftime("%Y-%m-%d"),
    ]
    todel = []
    for eachdate in checkdates:
        datedead = ec2.describe_spot_fleet_request_history(
            SpotFleetRequestId=fleetId, StartTime=eachdate
        )
        for eachevent in datedead["HistoryRecords"]:
            if eachevent["EventType"] == "instanceChange":
                if eachevent["EventInformation"]["EventSubType"] == "terminated":
                    todel.append(eachevent["EventInformation"]["InstanceId"])
    todel = [f"{project}_{x}" for x in todel]
    cloudwatch.delete_alarms(AlarmNames=todel)
    print("Old alarms deleted")


def seeIfLogExportIsDone(logExportId):
    # Poll the CloudWatch Logs export task until it is no longer pending or running
    while True:
        result = logs.describe_export_tasks(taskId=logExportId)
        if result["exportTasks"][0]["status"]["code"] != "PENDING":
            if result["exportTasks"][0]["status"]["code"] != "RUNNING":
                print(result["exportTasks"][0]["status"]["code"])
                break
        time.sleep(30)


def downscaleSpotFleet(queue, spotFleetID):
    # Shrink the fleet to the number of in-flight (not visible) messages,
    # without force-terminating instances that are still working
    response = sqs.get_queue_url(QueueName=queue)
    queueUrl = response["QueueUrl"]
    response = sqs.get_queue_attributes(
        QueueUrl=queueUrl,
        AttributeNames=[
            "ApproximateNumberOfMessages",
            "ApproximateNumberOfMessagesNotVisible",
        ],
    )
    visible = int(response["Attributes"]["ApproximateNumberOfMessages"])
    nonvisible = int(response["Attributes"]["ApproximateNumberOfMessagesNotVisible"])
    status = ec2.describe_spot_fleet_instances(SpotFleetRequestId=spotFleetID)
    if nonvisible < len(status["ActiveInstances"]):
        result = ec2.modify_spot_fleet_request(
            ExcessCapacityTerminationPolicy="noTermination",
            TargetCapacity=str(nonvisible),
            SpotFleetRequestId=spotFleetID,
        )


def lambda_handler(event, lambda_context):
    # Triggered any time SQS queue ApproximateNumberOfMessagesVisible = 0
    # OR ApproximateNumberOfMessagesNotVisible = 0
    messagestring = event["Records"][0]["Sns"]["Message"]
    messagedict = json.loads(messagestring)
    queueId = messagedict["Trigger"]["Dimensions"][0]["value"]
    project = queueId.rsplit("_", 1)[0]

    # Download monitor file
    monitor_file_name = f"{queueId.split('Queue')[0]}SpotFleetRequestId.json"
    monitor_local_name = f"/tmp/{monitor_file_name}"
    monitor_on_bucket_name = f"monitors/{monitor_file_name}"

    with open(monitor_local_name, "wb") as f:
        try:
            s3.download_fileobj(bucket, monitor_on_bucket_name, f)
        except botocore.exceptions.ClientError as error:
            print("Error retrieving monitor file.")
            return
    with open(monitor_local_name, "r") as input:
        monitorInfo = json.load(input)

    monitorcluster = monitorInfo["MONITOR_ECS_CLUSTER"]
    monitorapp = monitorInfo["MONITOR_APP_NAME"]
    fleetId = monitorInfo["MONITOR_FLEET_ID"]
    loggroupId = monitorInfo["MONITOR_LOG_GROUP_NAME"]
    starttime = monitorInfo["MONITOR_START_TIME"]
    CLEAN_DASHBOARD = monitorInfo["CLEAN_DASHBOARD"]
    print(f"Monitor triggered for {monitorcluster} {monitorapp} {fleetId} {loggroupId}")

    # If no visible messages, downscale machines
    if "ApproximateNumberOfMessagesVisible" in event["Records"][0]["Sns"]["Message"]:
        print("No visible messages. Tidying as we go.")
        killdeadAlarms(fleetId, monitorapp, project)
        downscaleSpotFleet(queueId, fleetId)

    # If no messages in progress, cleanup
    if "ApproximateNumberOfMessagesNotVisible" in event["Records"][0]["Sns"]["Message"]:
        print("No messages in progress. Cleaning up.")
        ecs.update_service(
            cluster=monitorcluster,
            service=f"{monitorapp}Service",
            desiredCount=0,
        )
        print("Service has been downscaled")

        # Delete the alarms from active machines and machines that have died.
        active_dictionary = ec2.describe_spot_fleet_instances(
            SpotFleetRequestId=fleetId
        )
        active_instances = []
        for instance in active_dictionary["ActiveInstances"]:
            active_instances.append(instance["InstanceId"])
        cloudwatch.delete_alarms(AlarmNames=active_instances)
        killdeadAlarms(fleetId, monitorapp, project)

        # Read spot fleet id and terminate all EC2 instances
        ec2.cancel_spot_fleet_requests(
            SpotFleetRequestIds=[fleetId], TerminateInstances=True
        )
        print("Fleet shut down.")

        # Remove SQS queue, ECS Task Definition, ECS Service
        ECS_TASK_NAME = monitorapp + "Task"
        ECS_SERVICE_NAME = monitorapp + "Service"

        print("Deleting existing queue.")
        queueoutput = sqs.list_queues(QueueNamePrefix=queueId)
        try:
            if len(queueoutput["QueueUrls"]) == 1:
                queueUrl = queueoutput["QueueUrls"][0]
            else:  # In case we have "AnalysisQueue" and "AnalysisQueue1" and only want to delete the first of those
                for eachUrl in queueoutput["QueueUrls"]:
                    if eachUrl.split("/")[-1] == queueId:
                        queueUrl = eachUrl
            sqs.delete_queue(QueueUrl=queueUrl)
        except KeyError:
            print("Can't find queue to delete.")

        print("Deleting service")
        try:
            ecs.delete_service(cluster=monitorcluster, service=ECS_SERVICE_NAME)
        except Exception:
            print("Couldn't delete service.")

        print("De-registering task")
        taskArns = ecs.list_task_definitions()
        for eachtask in taskArns["taskDefinitionArns"]:
            fulltaskname = eachtask.split("/")[-1]
            if ECS_TASK_NAME in fulltaskname:  # only de-register this app's task definitions
                ecs.deregister_task_definition(taskDefinition=fulltaskname)

        print("Removing cluster if it's not the default and not otherwise in use")
        if monitorcluster != "default":
            result = ecs.describe_clusters(clusters=[monitorcluster])
            if (
                sum(
                    [
                        result["clusters"][0]["pendingTasksCount"],
                        result["clusters"][0]["runningTasksCount"],
                        result["clusters"][0]["activeServicesCount"],
                    ]
                )
                == 0
            ):
                ecs.delete_cluster(cluster=monitorcluster)

        # Remove alarms that triggered monitor
        print("Removing alarms that triggered Monitor")
        cloudwatch.delete_alarms(
            AlarmNames=[
                f"ApproximateNumberOfMessagesVisibleisZero_{monitorapp}",
                f"ApproximateNumberOfMessagesNotVisibleisZero_{monitorapp}",
            ]
        )

        # Remove Cloudwatch dashboard if created and cleanup desired
        if CLEAN_DASHBOARD.lower() == "true":
            dashboard_list = cloudwatch.list_dashboards()
            for entry in dashboard_list["DashboardEntries"]:
                if monitorapp in entry["DashboardName"]:
                    cloudwatch.delete_dashboards(
                        DashboardNames=[entry["DashboardName"]]
                    )
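To see the shape of the event `lambda_handler` expects, here is a minimal local smoke-test sketch. It builds a fake SNS record wrapping a CloudWatch alarm payload (the app, queue, and alarm names are illustrative, not from this commit) and calls the handler directly; it assumes AWS credentials are configured, `bucket` has been edited to a real bucket, and the matching monitor file already sits under `monitors/` on that bucket.

```
import json
import lambda_function  # the module above

# Illustrative alarm payload; only the fields the handler reads are included.
fake_alarm = {
    "AlarmName": "ApproximateNumberOfMessagesVisibleisZero_2021_12_13_Project_Analysis",
    "Trigger": {
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"name": "QueueName", "value": "2021_12_13_Project_AnalysisQueue"}],
    },
}

# SNS delivers the alarm JSON as a string in Records[0].Sns.Message.
fake_event = {"Records": [{"Sns": {"Message": json.dumps(fake_alarm)}}]}

# With this metric name the handler takes the "tidy as we go" branch
# (killdeadAlarms + downscaleSpotFleet) rather than the full cleanup branch.
lambda_function.lambda_handler(fake_event, None)
```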

run.py (+39)

@@ -11,6 +11,7 @@

 CREATE_DASHBOARD = 'False'
 CLEAN_DASHBOARD = 'False'
+AUTO_MONITOR = 'False'

 from config import *

@@ -591,8 +592,16 @@ def startCluster():
     createMonitor.write('"MONITOR_BUCKET_NAME" : "'+AWS_BUCKET+'",\n')
     createMonitor.write('"MONITOR_LOG_GROUP_NAME" : "'+LOG_GROUP_NAME+'",\n')
-    createMonitor.write('"MONITOR_START_TIME" : "'+ starttime+'"}\n')
+    createMonitor.write('"MONITOR_START_TIME" : "'+ starttime+'",\n')
+    createMonitor.write('"CLEAN_DASHBOARD" : "'+ CLEAN_DASHBOARD+'"}\n')
     createMonitor.close()

+    # Upload monitor file to S3 so it can be read by Auto-Monitor lambda function
+    if AUTO_MONITOR.lower()=='true':
+        s3 = boto3.client("s3")
+        json_on_bucket_name = f'monitors/{APP_NAME}SpotFleetRequestId.json' # match path set in lambda function
+        with open(monitor_file_name, "rb") as a:
+            s3.put_object(Body=a, Bucket=AWS_BUCKET, Key=json_on_bucket_name)
+
     # Step 4: Create a log group for this app and date if one does not already exist
     logclient=boto3.client('logs')
     loggroupinfo=logclient.describe_log_groups(logGroupNamePrefix=LOG_GROUP_NAME)
@@ -642,6 +651,36 @@ def startCluster():
        print ("Creating CloudWatch dashboard for run metrics")
        create_dashboard(requestInfo)

+    if AUTO_MONITOR.lower()=='true':
+        # Create alarms that will trigger Monitor based on SQS queue metrics
+        cloudwatch = boto3.client("cloudwatch")
+        metricnames = [
+            "ApproximateNumberOfMessagesNotVisible",
+            "ApproximateNumberOfMessagesVisible",
+        ]
+        sns = boto3.client("sns")
+        MonitorARN = sns.create_topic(Name="Monitor")['TopicArn'] # returns ARN since topic already exists
+        for metric in metricnames:
+            response = cloudwatch.put_metric_alarm(
+                AlarmName=f'{metric}isZero_{APP_NAME}',
+                ActionsEnabled=True,
+                OKActions=[],
+                AlarmActions=[MonitorARN],
+                InsufficientDataActions=[],
+                MetricName=metric,
+                Namespace="AWS/SQS",
+                Statistic="Average",
+                Dimensions=[
+                    {"Name": "QueueName", "Value": f'{APP_NAME}Queue'}
+                ],
+                Period=300,
+                EvaluationPeriods=1,
+                DatapointsToAlarm=1,
+                Threshold=0,
+                ComparisonOperator="LessThanOrEqualToThreshold",
+                TreatMissingData="missing",
+            )
+
 #################################
 # SERVICE 4: MONITOR JOB
 #################################
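The alarms created above publish to an SNS topic named `Monitor`, but the one-time step of subscribing the Auto-Monitor Lambda to that topic is not shown in this commit. A hedged sketch of that wiring (the function ARN, region, and account ID below are placeholder assumptions) could look like:

```
import boto3

sns = boto3.client("sns")
lambda_client = boto3.client("lambda")

# Assumed: the code in lambda_function.py has been deployed as a Lambda
# function; replace this ARN with the real one.
function_arn = "arn:aws:lambda:us-east-1:111111111111:function:Monitor"

# create_topic is idempotent and returns the ARN of the existing topic.
topic_arn = sns.create_topic(Name="Monitor")["TopicArn"]

# Allow SNS to invoke the function, then subscribe it to the topic.
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="MonitorSNSInvoke",
    Action="lambda:InvokeFunction",
    Principal="sns.amazonaws.com",
    SourceArn=topic_arn,
)
sns.subscribe(TopicArn=topic_arn, Protocol="lambda", Endpoint=function_arn)
```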
