Allow PCUI to support multiple Pcluster versions #418

Merged: 25 commits, Jun 16, 2025

Conversation

hgreebe
Contributor

@hgreebe hgreebe commented May 2, 2025

Description

Allow PCUI to manage clusters of different ParallelCluster versions

Changes

  • BREAKING CHANGE: With this change we are removing support for ParallelCluster versions <= 3.5.0. This is an approved product decision. The technical reason is that supporting those older versions would have required an increase in complexity that we considered not worth it.
  • PCUI template accepts a comma separated list of versions to support
  • An API stack for each pcluster version is launched
  • API handler manages the mapping of pcluster version to api invoke url
  • Each API request contains a version parameter in the url
  • Updating the stack to remove or add support for different pcluster versions is supported
  • During cluster create the first page is now a version page that includes a dropdown menu to select a version
  • The official Images page includes a dropdown menu to select which version of images to show

How Has This Been Tested?

  • Manually tested the following:
    • Created a PCUI stack with the version parameter set to multiple versions
    • Validated that you can see info and modify clusters of the multiple supported versions
    • Validated creating clusters of multiple versions
    • Validated updating clusters of multiple versions
    • Validated viewing the official images of different versions
    • Validated that buttons such as Shell, DCV, and Stop Fleet work only for the supported versions
    • Validated that clusters of unsupported versions are not editable
    • Validated updating the stack by removing and adding supported versions
  • Modified unit tests to account for support of multiple versions

In order to increase the likelihood of your contribution being accepted, please make sure you have read both the Contributing Guidelines and the Project Guidelines

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@hgreebe hgreebe changed the title from "Develop" to "Allow PCUI to support multiple Pcluster versions" May 2, 2025
@hgreebe hgreebe marked this pull request as ready for review May 2, 2025 15:54
@gmarciani
Collaborator

Since this PR changes the frontend, could you please attach screenshot(s) that show the most relevant changes?
For example, I'm thinking of the new wizard page for the version.

@@ -32,13 +32,14 @@
 USER_POOL_ID = os.getenv("USER_POOL_ID")
 AUTH_PATH = os.getenv("AUTH_PATH")
 API_BASE_URL = os.getenv("API_BASE_URL")
-API_VERSION = os.getenv("API_VERSION", "3.1.0")
+API_VERSION = sorted(os.getenv("API_VERSION", "3.1.0").split(","), key=lambda x: [-int(n) for n in x.split('.')])
+DEFAULT_API_VERSION = API_VERSION[0]
Collaborator

Here we are setting the latest PC version as the default one.
The user should be able to control the default version, but with this solution they cannot.

Example: a user wants to try a new version of PC while keeping the default version set to the one they consider the most stable.

Contributor Author

Because the home page, where all of the clusters are displayed, uses the list clusters API of the default version.
Currently, if the supported API version is newer than the version of an unsupported cluster, you can still see some of that cluster's information, with a warning that you cannot edit the cluster. But if the supported API version is older than the cluster's unsupported version, you cannot see any cluster information and you get an error message saying that the version is not supported.

The reason for sorting and making the largest version the default is that, for example, if the supported versions are 3.13.0 and 3.11.0, you will be able to see the information for a 3.12.0 cluster regardless of the order in which you list the supported versions.

Collaborator

Got it, thanks for the explanation. Please put a comment explaining why the default version must be the most recent one.
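
A minimal sketch of the parsing discussed in this thread, with the requested explanatory comment. It is not the merged code: the function name `parse_supported_versions` is hypothetical, and the whitespace trimming and de-duplication reflect the reviewer's suggestions rather than anything confirmed in the diff.

```python
def parse_supported_versions(raw: str) -> list[str]:
    """Return the unique versions in a comma-separated string, newest first."""
    # Trim blanks around each entry and drop empty entries and duplicates.
    versions = {v.strip() for v in raw.split(",") if v.strip()}
    # Compare numerically so that 3.13.0 sorts above 3.9.0.
    return sorted(versions, key=lambda v: [int(n) for n in v.split(".")], reverse=True)


# The default must be the most recent supported version: the home page lists
# clusters through the default version's API, and (per the thread above) a newer
# API can still show a cluster of an older unsupported version, while an older
# API cannot show a cluster of a newer one.
supported = parse_supported_versions("3.11.0, 3.13.0,3.11.0")
default_version = supported[0]
print(supported, default_version)
```

Sorting on integer components rather than on the raw strings matters here, because a plain string sort would place "3.9.0" above "3.13.0".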

@@ -32,13 +32,14 @@
 USER_POOL_ID = os.getenv("USER_POOL_ID")
 AUTH_PATH = os.getenv("AUTH_PATH")
 API_BASE_URL = os.getenv("API_BASE_URL")
-API_VERSION = os.getenv("API_VERSION", "3.1.0")
+API_VERSION = sorted(os.getenv("API_VERSION", "3.1.0").split(","), key=lambda x: [-int(n) for n in x.split('.')])
Collaborator

What are the implications of this sorting?
Is it used only to determine the default version in the line below?

Also, I suggest trimming blank spaces from each version string.

Contributor Author

I explained the sorting in the comment above

 API_USER_ROLE = os.getenv("API_USER_ROLE")
 OIDC_PROVIDER = os.getenv("OIDC_PROVIDER")
 CLIENT_ID = os.getenv("CLIENT_ID")
 CLIENT_SECRET = os.getenv("CLIENT_SECRET")
 SECRET_ID = os.getenv("SECRET_ID")
-SITE_URL = os.getenv("SITE_URL", API_BASE_URL)
+SITE_URL = os.getenv("SITE_URL")
Collaborator

Shouldn't this be the API URL of the default version?

     else:
-        info_resp = sigv4_request("GET", API_BASE_URL, url)
+        info_resp = sigv4_request("GET", get_base_url(request.args.get("version")), url)
Collaborator

nit: avoid duplication. Read the target version once from the request.

@@ -735,14 +744,19 @@ def _get_params(_request):
params.pop("path")
return params

def get_base_url(v):
Collaborator

nit: v -> version

@hgreebe
Contributor Author

hgreebe commented May 2, 2025

[Two screenshots attached: "Screenshot 2025-05-02 at 13 43 30" and "Screenshot 2025-05-02 at 13 43 47"]

@@ -62,6 +63,14 @@
if not JWKS_URL:
JWKS_URL = os.getenv("JWKS_URL",
f"https://cognito-idp.{REGION}.amazonaws.com/{USER_POOL_ID}/" ".well-known/jwks.json")
API_BASE_URL_MAPPING = {}

if API_BASE_URL:
Collaborator

Could you please wrap this logic into a dedicated function and unit test it?
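
One way the requested function could look, as a sketch only. It assumes `API_BASE_URL` carries a comma-separated list of `version=url` pairs, which is the format built by the `result` string elsewhere in this PR; the function name `parse_api_base_url_mapping` and the example URLs are hypothetical.

```python
def parse_api_base_url_mapping(raw: str) -> dict[str, str]:
    """Parse 'version=url,version=url' into a {version: url} dictionary."""
    mapping = {}
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma or an empty value
        # Split on the first '=' only, so a URL containing '=' stays intact.
        version, separator, url = entry.partition("=")
        if not separator or not version.strip() or not url.strip():
            raise ValueError(f"Malformed version/URL entry: {entry!r}")
        mapping[version.strip()] = url.strip()
    return mapping


example = "3.13.0=https://abc.example.com/prod,3.11.0=https://def.example.com/prod,"
print(parse_api_base_url_mapping(example))
```

Keeping the parsing in a pure function like this makes it straightforward to unit test without setting environment variables.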

"version": {
"label": "Cluster Version",
"title": "Version",
"placeholder": "Select your cluster version",
Collaborator

nit: it is unnecessary to repeat "Select your" in a placeholder. It could simply be "Cluster version".

@@ -484,7 +493,7 @@ def get_dcv_session():


 def get_custom_image_config():
-    image_info = sigv4_request("GET", API_BASE_URL, f"/v3/images/custom/{request.args.get('image_id')}").json()
+    image_info = sigv4_request("GET", get_base_url(request.args.get("version")), f"/v3/images/custom/{request.args.get('image_id')}").json()
Collaborator

It's good practice to define the request parameter name version as a constant.

ARG_VERSION = "version"

request.args.get(ARG_VERSION)

We did not follow this best practice elsewhere (e.g., region), but we can start applying it to new code.

@@ -233,9 +242,9 @@ def ec2_action():
 def get_cluster_config_text(cluster_name, region=None):
     url = f"/v3/clusters/{cluster_name}"
     if region:
-        info_resp = sigv4_request("GET", API_BASE_URL, url, params={"region": region})
+        info_resp = sigv4_request("GET", get_base_url(request.args.get("version")), url, params={"region": region})
Collaborator

What if the request is missing the version parameter?

Contributor Author

the get_base_url function checks if the version has been passed in
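
A sketch of how such a lookup might behave. The body of `get_base_url` is not shown in this conversation, so everything below is an assumption: the fallback to the default version for a missing or unknown version, the example URLs, and the plain dictionary standing in for `request.args`.

```python
ARG_VERSION = "version"  # request parameter name, as suggested in the review

# Hypothetical values; in the PR these are derived from environment variables.
DEFAULT_API_VERSION = "3.13.0"
API_BASE_URL_MAPPING = {
    "3.13.0": "https://abc.example.com/prod",
    "3.11.0": "https://def.example.com/prod",
}


def get_base_url(version=None):
    """Return the API base URL for a version, or the default version's URL."""
    # Assumption: a missing or unsupported version falls back to the default API.
    if version and version in API_BASE_URL_MAPPING:
        return API_BASE_URL_MAPPING[version]
    return API_BASE_URL_MAPPING[DEFAULT_API_VERSION]


# Stand-in for Flask's request.args: read the version once, then reuse it.
request_args = {"region": "us-east-1"}  # no version supplied
version = request_args.get(ARG_VERSION)
print(get_base_url(version))
```

Reading the version once into a local variable also addresses the earlier nit about duplicating `request.args.get("version")` in both branches.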

@@ -34,7 +35,7 @@ Parameters:
   Version:
     Description: Version of AWS ParallelCluster to deploy.
     Type: String
-    AllowedPattern: "^([0-9]+)\\.([0-9]+)\\.([0-9]+)$"
+    AllowedPattern: "^([0-9]+)\\.([0-9]+)\\.([0-9]+)(,([0-9]+)\\.([0-9]+)\\.([0-9]+))*$"
Collaborator

Is there a limit to the maximum number of versions that a user can specify?

Collaborator

Also, what if a user specifies the same version twice?

Contributor Author

If the user specifies the same version twice, the API stack for that version is launched only once, because the API handler converts the list of versions into a set.

Contributor Author

There is no maximum.
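
To make the behavior of the new pattern concrete, here is a small check written with Python's `re` module. CloudFormation applies its own regex engine to `AllowedPattern`, so this is only an approximation for a pattern this simple; the helper names are hypothetical, and the set conversion mirrors the de-duplication described above.

```python
import re

# The new AllowedPattern from the diff, with the YAML escaping removed.
VERSION_LIST_PATTERN = re.compile(
    r"^([0-9]+)\.([0-9]+)\.([0-9]+)(,([0-9]+)\.([0-9]+)\.([0-9]+))*$"
)


def is_valid_version_list(value: str) -> bool:
    """True if value is one or more x.y.z versions separated by commas."""
    return VERSION_LIST_PATTERN.fullmatch(value) is not None


def unique_versions(value: str) -> set[str]:
    """Validate the list, then collapse repeated versions into a set."""
    if not is_valid_version_list(value):
        raise ValueError(f"Invalid version list: {value!r}")
    return set(value.split(","))


print(is_valid_version_list("3.13.0,3.11.0"))
print(is_valid_version_list("3.13.0, 3.11.0"))
print(unique_versions("3.11.0,3.13.0,3.11.0"))
```

The pattern accepts any number of entries, rejects spaces after commas, and rejects a trailing comma.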

@@ -161,17 +162,6 @@ Conditions:
- !Not [!Equals [!Ref SNSRole, ""]]
UseNewCognito:
!Not [ Condition: UseExistingCognito]
UseNonDockerizedPCAPI:
!Not [ Condition: UseDockerizedPCAPI]
UseDockerizedPCAPI: !And
Collaborator

What if a user specifies a version <= 3.5.0?
This is a breaking change unless you handle <= 3.5.0 in another way (it does not seem so).
Can we avoid this change? Why are we forced to introduce it?

Fn::ForEach::ParallelClusterApi:
- ApiVersion
- !Split [",", !Ref Version]
- ParallelClusterApi&{ApiVersion}:
Collaborator

Does CFN implicitly transform the dotted version into something else?
I'm asking because I assume you cannot use dots within a resource name.

Contributor Author

When you reference the variable with the & symbol instead of the $ symbol (&{ApiVersion} rather than ${ApiVersion}), CloudFormation automatically removes non-alphanumeric characters such as the dots.
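
The effect on the resource's logical ID can be imitated in a few lines of Python. This is an illustration of the described behavior, not CloudFormation's implementation, and the helper name `logical_id` is hypothetical.

```python
import re


def logical_id(prefix: str, version: str) -> str:
    """Imitate &{ApiVersion}: keep only alphanumeric characters of the version."""
    return prefix + re.sub(r"[^A-Za-z0-9]", "", version)


for version in "3.13.0,3.11.0".split(","):
    print(version, "->", logical_id("ParallelClusterApi", version))
```

One consequence of dropping the dots is that two distinct version strings could collapse to the same logical ID in this sketch (for example "3.1.10" and "3.11.0" both give "3110"), so the supported version list would need to avoid such pairs.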

- ParallelClusterApi&{ApiVersion}:
Type: AWS::CloudFormation::Stack
Properties:
Parameters:
Collaborator

Why did you remove ImageBuilderSubnetId and ImageBuilderVpcId?

Contributor Author

Because if we only support PC versions > 3.5.0, then all of them will use the NonDockerizedPCAPI, which has those two variables set to !Ref AWS::NoValue.

reason = "Failed {}: {}".format(event["RequestType"], e)

Timeout: 300
MemorySize: 128
Collaborator

This seems like a very low memory size. What's the maximum memory usage observed across 10 executions? Whatever it is, I suggest doubling it. The impact on costs would be very low because this function is executed only on stack create/update/delete, but the potential impact of a failure is severe.

Contributor Author

Memory usage was 86. I will increase it.

Action:
- cloudformation:ListStacks
- cloudformation:DescribeStacks
Resource: '*'
Collaborator

@gmarciani gmarciani May 6, 2025

Can we scope these permissions down? I think we can set the resource to the current stack name, right?

Collaborator

If we cannot scope this down using stack name, what about scoping this down using tags as per https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-requesttag?

Collaborator

Why leave * as the resource? Isn't it possible to specify the stack ARN with a wildcard only on the stack name?

-Resource: !Sub
-  - arn:${AWS::Partition}:execute-api:${AWS::Region}:${AWS::AccountId}:${PCApiGateway}/*/*
-  - { PCApiGateway: !Select [2, !Split ['/', !Select [0, !Split ['.', !GetAtt [ ParallelClusterApi, Outputs.ParallelClusterApiInvokeUrl ]]]]] }
+Resource: !Sub "arn:${AWS::Partition}:execute-api:${AWS::Region}:${AWS::AccountId}:*/*/*"
Collaborator

This is a security vulnerability. We must scope this policy down.
If injecting APIG ARNs does not work because of CFN limitations, then we can do it by setting conditions on tags: https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_condition-keys.html#condition-keys-resourcetag

Collaborator

This is a working solution to restrict the permissions using tags.

  1. Add the tag aws-parallelcluster-ui:stack-id, set to the PCUI stack ID, to the PCAPI nested stacks.
  2. Add a condition on that tag to the policy.

Example:
  ParallelClusterApi:
    Type: AWS::CloudFormation::Stack
    Properties:
      ...
      Tags:
        - Key: 'aws-parallelcluster-ui:stack-id'
          Value: !Ref AWS::StackId


  ParallelClusterApiGatewayInvoke:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      ...
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Action:
              - execute-api:Invoke
            Effect: Allow
            Resource: !Sub "arn:${AWS::Partition}:execute-api:${AWS::Region}:${AWS::AccountId}:*/*/*"
            Condition:
              StringEquals:
                "aws:ResourceTag/aws-parallelcluster-ui:stack-id": !Ref 'AWS::StackId'
        

result = f"{result}{version}={output['OutputValue']},"

parsed_url = urlparse(output['OutputValue']).hostname.split('.')[0]
api_gateway_arns.append(f"arn:{get_partition(os.environ['AWS_REGION'])}:execute-api:{os.environ['AWS_REGION']}:{account_id}:{parsed_url}/*/*")
Collaborator

To address a path traversal vulnerability, we recently made a change that must be reflected here (#422).

return (
<Select
expandToViewport={true}
selectedOption={{label: selectedVersion, value: selectedVersion}}
Collaborator

Thanks for adding version selection for images as well. For clusters we added tests to cover this change. What about images?

gateway = urlparse(output['OutputValue']).hostname.split('.')[0]
stage = output['OutputValue'].split('/')[3]
api_gateway_arns.append(f"arn:{get_partition(os.environ['AWS_REGION'])}:execute-api:{os.environ['AWS_REGION']}:{account_id}:{gateway}/{stage}/*")
print(f"API arn: {parsed_url}")
Collaborator

What is parsed_url? What about adding the ARN determined on line 325 to the next print and getting rid of this one?

Contributor Author

Sorry, that was leftover from testing.

The arns are listed in the log as the output so I will remove the print.
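
The ARN construction in this hunk can be isolated into a small helper, sketched below. The function name `execute_api_arn` and the example URL are hypothetical, and the partition is passed in directly instead of calling the PR's `get_partition` helper, whose body is not shown here.

```python
from urllib.parse import urlparse


def execute_api_arn(invoke_url: str, partition: str, region: str, account_id: str) -> str:
    """Build an execute-api ARN scoped to one API Gateway ID and stage."""
    parsed = urlparse(invoke_url)
    # Host looks like <gateway-id>.execute-api.<region>.amazonaws.com
    gateway_id = parsed.hostname.split(".")[0]
    # The first path segment of the invoke URL is the stage name.
    stage = parsed.path.strip("/").split("/")[0]
    if not gateway_id or not stage:
        raise ValueError(f"Cannot derive gateway ID and stage from {invoke_url!r}")
    return f"arn:{partition}:execute-api:{region}:{account_id}:{gateway_id}/{stage}/*"


print(execute_api_arn(
    "https://abc123.execute-api.us-east-1.amazonaws.com/prod",
    partition="aws",
    region="us-east-1",
    account_id="123456789012",
))
```

Scoping the ARN to a specific gateway and stage is narrower than the `*/*/*` resource flagged earlier in the review.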

for output in stack_response['Stacks'][0]['Outputs']:
if output['OutputKey'] == 'ParallelClusterApiInvokeUrl':
# Construct the result string
result = f"{result}{version}={output['OutputValue']},"
Collaborator

With this concatenation you will end up with a trailing comma at the end of the result. Have you verified that this does not have any side effects? What about getting rid of it?

Contributor Author

It has no side effects, but I will remove the trailing comma.

for output in stack_response['Stacks'][0]['Outputs']:
if output['OutputKey'] == 'ParallelClusterApiInvokeUrl':
# Construct the result string
result = f"{result}{version}={output['OutputValue']},"
Collaborator

codestyle: output['OutputValue'] is repeated 4 times. What about improving readability by assigning it to a more self-explanatory variable, api_url, instead?
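
Both review points on this hunk (the trailing comma and the repeated lookup) can be handled together, as in the sketch below. The function name and the shape of its input, a dictionary from version to that stack's `Outputs` list, are assumptions made for illustration.

```python
def build_version_url_string(outputs_by_version: dict[str, list[dict]]) -> str:
    """Return 'version=url,version=url' with no trailing comma."""
    pairs = []
    for version, outputs in outputs_by_version.items():
        for output in outputs:
            if output["OutputKey"] == "ParallelClusterApiInvokeUrl":
                api_url = output["OutputValue"]  # named once, reused below
                pairs.append(f"{version}={api_url}")
    # str.join places separators only between items.
    return ",".join(pairs)


example_outputs = {
    "3.13.0": [{"OutputKey": "ParallelClusterApiInvokeUrl",
                "OutputValue": "https://abc.example.com/prod"}],
    "3.11.0": [{"OutputKey": "OtherOutput", "OutputValue": "ignored"},
               {"OutputKey": "ParallelClusterApiInvokeUrl",
                "OutputValue": "https://def.example.com/prod"}],
}
print(build_version_url_string(example_outputs))
```

Collecting the pairs in a list and joining at the end avoids any need to strip a trailing separator afterwards.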

reason = "Failed {}: {}".format(event["RequestType"], e)

Timeout: 300
MemorySize: 256
Collaborator

I don't remember if we already discussed this topic. However, what is the maximum amount of memory consumed that you observed? Is 256 enough?

Contributor Author

Memory usage was 86, so I think 256 should be more than enough.

Collaborator

@gmarciani gmarciani left a comment

This PR contains an approved breaking change: it drops support for PC <= 3.5.0. Let's highlight this in the PR description.

- !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stack/*"
Condition:
StringEquals:
"aws:ResourceTag/parallelcluster:api-id": !Ref ApiGatewayRestApi
Collaborator

Where is this tag set?

Contributor Author

Line 217 of parallelcluster-ui.yaml

TemplateURL: !Sub https://${AWS::Region}-aws-parallelcluster.s3.${AWS::Region}.amazonaws.com/parallelcluster/${ApiVersion}/api/parallelcluster-api.yaml
TimeoutInMinutes: 30
Tags:
- Key: 'parallelcluster:api-id'
Collaborator

The tag name is misleading, as it seems to refer to the ParallelCluster API ID, whereas the ID we are injecting here is the ID of the ParallelCluster UI API. Can you please rename the tag to parallelcluster-ui:api-id?

@hgreebe hgreebe merged commit eec445d into aws:main Jun 16, 2025
2 checks passed