Skip to content

switch to ported slurm docker cluster #1297

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 38 commits into
base: master
Choose a base branch
from

Conversation

robert-oleynik
Copy link

@robert-oleynik robert-oleynik commented Apr 30, 2025

Description:

TODO

Issues:

  • fixes CI build failures for Docker Slurm Cluster image

@robert-oleynik robert-oleynik self-assigned this Apr 30, 2025
Copy link

github-actions bot commented Apr 30, 2025

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines Covered Coverage Threshold Status
8635 7212 84% 80% 🟢

New Files

No new covered files...

Modified Files

No covered modified files...

updated for commit: 5b36743 by action🐍

@robert-oleynik robert-oleynik marked this pull request as draft April 30, 2025 11:37
@robert-oleynik robert-oleynik marked this pull request as ready for review May 22, 2025 12:46
Copy link
Member

@philippotto philippotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great stuff :) I only left a couple of smaller comments.

run: uv python install ${{ matrix.python-version }}
- name: Install dependencies (without docker)
run: uv sync --all-extras --frozen
- name: "Run Kubernetes"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: "Run Kubernetes"
- name: "Run Dask"

for name in "slurmctld" "c1" "c2"; do
docker exec -w /cluster_tools "$name" bash -c "uv sync --frozen"
done
- name: "Run Tests (test_all, test_slurm)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mention that slurm_change_config is not executed?


- name: Install dependencies (without docker)
if: ${{ matrix.executors == 'kubernetes' || matrix.executors == 'dask' }}
- name: "Run Tests (test_all, test_slurm)"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only runs slurm_change_config right?

- name: Check typing
if: ${{ matrix.executors == 'multiprocessing' && matrix.python-version == '3.11' }}
if: ${{ matrix.python-version == '3.11' }}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we adapt this so that these checks are run against the newest python version?

needs: changes
if: ${{ needs.changes.outputs.cluster_tools == 'true' }}
runs-on: ubuntu-latest
timeout-minutes: 30
strategy:
max-parallel: 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is never set now. is this on purpose?

Comment on lines +77 to +79
echo "MaxArraySize=2" >> ./dockered-slurm/slurm.conf
sed "s/JobAcctGatherFrequency=30/JobAcctGatherFrequency=1/g" ./dockered-slurm/slurm.conf > ./dockered-slurm/slurm.conf.bak
cp ./dockered-slurm/slurm.conf.bak ./dockered-slurm/slurm.conf
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I find the name *.bak weird, because it's not really a backup but instead a temporary file. in-place modification with sed doesn't work for some reason? I'd name it *.tmp then. also I'd mv instead of cp because there is no need for the tmp file afterwards.

Comment on lines +439 to +456
for _ in range(10):
stdout, _, exit_code = call(
f"sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K -j {job_id_with_index}"
)

if exit_code != 0:
break

if len(stdout.splitlines()) <= 1:
time.sleep(0.2)
continue

# Parse stdout into a key-value object
memory_limit_investigation = self._investigate_memory_consumption(stdout)
if memory_limit_investigation:
return memory_limit_investigation

# Parse stdout into a key-value object
properties = parse_key_value_pairs(stdout, "\n", ":")
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. this doesn't do retries if sacct fails? would it do any harm to do so?
  2. I find the trailing break in this for-loop unintuitive to read. my suggestion would be this:
        max_retries = 10
        for _ in range(max_retries):
            stdout, _, exit_code = call(
                f"sacct -P --format=JobID,State,MaxRSS,ReqMem --unit K -j {job_id_with_index}"
            )

            if exit_code != 0 or len(stdout.splitlines()) <= 1:
                # Sleep and retry in the next loop
                time.sleep(0.2)
            else:
                # Parse stdout into a key-value object
                memory_limit_investigation = self._investigate_memory_consumption(stdout)
                if memory_limit_investigation:
                    return memory_limit_investigation
                break

return None

if percentage < 100:
stdout_lines = stdout.splitlines()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add a comment that mentions the expected format for the stdout input? this makes it easier to follow the parsing logic.

for line in stdout_lines[1:]:
params = line.split("|")
try:
max_rss = max(max_rss, int(params[2][:-1]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[:-1] strips away the unit which is always expected to be one character long? maybe a regex for numbers would make this more robust? 🤔

Comment on lines +491 to +496
except Exception:
pass
try:
req_mem = max(max_rss, int(params[3][:-1]))
except Exception:
pass
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in which cases are these exceptions expected? i.e., what kind of stdouts lines make this necessary? maybe we can guard against these a bit more specific than a generic try-catch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants