-
Notifications
You must be signed in to change notification settings - Fork 13
switch to ported slurm docker cluster #1297
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
☂️ Python Coverage
Overall Coverage
New FilesNo new covered files... Modified FilesNo covered modified files...
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great stuff :) I only left a couple of smaller comments.
run: uv python install ${{ matrix.python-version }} | ||
- name: Install dependencies (without docker) | ||
run: uv sync --all-extras --frozen | ||
- name: "Run Kubernetes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- name: "Run Kubernetes" | |
- name: "Run Dask" |
for name in "slurmctld" "c1" "c2"; do | ||
docker exec -w /cluster_tools "$name" bash -c "uv sync --frozen" | ||
done | ||
- name: "Run Tests (test_all, test_slurm)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mention that slurm_change_config
is not executed?
|
||
- name: Install dependencies (without docker) | ||
if: ${{ matrix.executors == 'kubernetes' || matrix.executors == 'dask' }} | ||
- name: "Run Tests (test_all, test_slurm)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this only runs slurm_change_config
right?
- name: Check typing | ||
if: ${{ matrix.executors == 'multiprocessing' && matrix.python-version == '3.11' }} | ||
if: ${{ matrix.python-version == '3.11' }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we adapt this so that these checks are run against the newest python version?
needs: changes | ||
if: ${{ needs.changes.outputs.cluster_tools == 'true' }} | ||
runs-on: ubuntu-latest | ||
timeout-minutes: 30 | ||
strategy: | ||
max-parallel: 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is never set now. is this on purpose?
echo "MaxArraySize=2" >> ./dockered-slurm/slurm.conf | ||
sed "s/JobAcctGatherFrequency=30/JobAcctGatherFrequency=1/g" ./dockered-slurm/slurm.conf > ./dockered-slurm/slurm.conf.bak | ||
cp ./dockered-slurm/slurm.conf.bak ./dockered-slurm/slurm.conf |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: I find the name *.bak
weird, because it's not really a backup but instead a temporary file. in-place modification with sed doesn't work for some reason? I'd name it *.tmp
then. also I'd mv
instead of cp
because there is no need for the tmp file afterwards.
for _ in range(10): | ||
stdout, _, exit_code = call( | ||
f"sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K -j {job_id_with_index}" | ||
) | ||
|
||
if exit_code != 0: | ||
break | ||
|
||
if len(stdout.splitlines()) <= 1: | ||
time.sleep(0.2) | ||
continue | ||
|
||
# Parse stdout into a key-value object | ||
memory_limit_investigation = self._investigate_memory_consumption(stdout) | ||
if memory_limit_investigation: | ||
return memory_limit_investigation | ||
|
||
# Parse stdout into a key-value object | ||
properties = parse_key_value_pairs(stdout, "\n", ":") | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- this doesn't do retries if sacct fails? would it do any harm to do so?
- I find the trailing break in this for-loop unintuitive to read. my suggestion would be this:
max_retries = 10
for _ in range(max_retries):
stdout, _, exit_code = call(
f"sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K -j {job_id_with_index}"
)
if exit_code != 0 or len(stdout.splitlines()) <= 1:
# Sleep and retry in the next loop
time.sleep(0.2)
else:
# Parse stdout into a key-value object
memory_limit_investigation = self._investigate_memory_consumption(stdout)
if memory_limit_investigation:
return memory_limit_investigation
break
return None | ||
|
||
if percentage < 100: | ||
stdout_lines = stdout.splitlines() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a comment that mentions the expected format for the stdout input? this makes it easier to follow the parsing logic.
for line in stdout_lines[1:]: | ||
params = line.split("|") | ||
try: | ||
max_rss = max(max_rss, int(params[2][:-1])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[:-1]
strips away the unit which is always expected to be one character long? maybe a regex for numbers would make this more robust? 🤔
except Exception: | ||
pass | ||
try: | ||
req_mem = max(max_rss, int(params[3][:-1])) | ||
except Exception: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in which cases are these exceptions expected? i.e., what kind of stdouts lines make this necessary? maybe we can guard against these a bit more specific than a generic try-catch.
Description:
TODO
scalableminds/slurm-docker-cluster
(Requires: switch to ported slurm docker cluster #1297 to be merged)Issues: