Device Plugin and Node Labeller support for gpu partitions #117


Open
wants to merge 6 commits into master

Conversation

sriram-30
Contributor

@sriram-30 sriram-30 commented Apr 2, 2025

This PR adds support for GPU partitions in the Device Plugin and Node Labeller.
It also adds a new topology-aware scheduler for the allocation of GPU resources from Kubernetes.

@sriram-30 sriram-30 marked this pull request as ready for review April 2, 2025 10:56
@sriram-30 sriram-30 changed the title Device Plugin and Node Labeller changes for gpu partitions Device Plugin and Node Labeller support for gpu partitions Apr 2, 2025
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch 3 times, most recently from 5cfb040 to f00cedb Compare April 8, 2025 05:26
@@ -235,17 +218,8 @@ func (p *AMDGPUPlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin
partitionType := device["computePartition"].(string) + "_" + device["memoryPartition"].(string)
resourceTypeDevs[partitionType] = append(resourceTypeDevs[partitionType], dev)

numas, err := hw.GetNUMANodes(id)
numas := []int64{int64(device["numaNode"].(int))}
Collaborator

Why not enhance the hwloc library to support GPU partitions? The idea of a software library is to reduce duplicated code with common functionality, thereby improving software quality. I am sure getting the NUMA node for a partitioned GPU is needed outside of our narrow k8s device plugin context?

Contributor Author
@sriram-30 sriram-30 Apr 9, 2025

Sorry, enhancing an external open-source C library is not within the scope of this development activity.
Also, the way numaNode is fetched in this PR is very much in line with how the existing device plugin code reads other GPU information. Just as other properties are already read from the files under /sys/module/amdgpu/drivers/pci:amdgpu/ in the device plugin code, we simply make use of the available numanode file and reuse the original GPU's property for its partitions.
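
For illustration only, a minimal sketch of that approach (the helper name and parsing are assumptions, not the PR's code; the sysfs attribute is assumed to be the standard per-device numa_node file next to the other attributes):

package amdgpu

import (
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

// readNumaNode is a hypothetical helper: it reads the numa_node attribute that
// sysfs exposes alongside the other per-device files and parses it as an integer.
func readNumaNode(devicePath string) (int64, error) {
    raw, err := os.ReadFile(filepath.Join(devicePath, "numa_node"))
    if err != nil {
        return -1, err
    }
    return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}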

Collaborator

I am not saying you should be doing it yourself, but don't we have another team working on enhancing hwloc? Have you reached out?

return &AMDGPUPlugin{
Heartbeat: l.Heartbeat,
Resource: resourceLastName,
options := []AMDGPUPluginOption{
Collaborator

what is the purpose of this extra abstraction?

Contributor Author

Added this abstraction to decouple the Allocator dependency during creation of the AMDGPUPlugin object. We can associate a different allocation policy in the future if we want to. It will also make it easier to add more UTs in the future. Please let me know your thoughts.
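
As a rough sketch of the pattern being described (type and field names here are assumptions, not the PR's exact code), the functional-options style keeps the allocator choice out of the constructor:

package amdgpu

// Allocator is a stand-in for whatever allocation policy the plugin delegates to.
type Allocator interface {
    Allocate(available []string, required int) []string
}

// AMDGPUPlugin is reduced to two fields for the sake of the sketch.
type AMDGPUPlugin struct {
    resource  string
    allocator Allocator
}

// AMDGPUPluginOption mutates a plugin while it is being constructed.
type AMDGPUPluginOption func(*AMDGPUPlugin)

// WithAllocator injects an allocation policy without hard-wiring it into the
// constructor, which also makes it easy to substitute a fake allocator in unit tests.
func WithAllocator(a Allocator) AMDGPUPluginOption {
    return func(p *AMDGPUPlugin) { p.allocator = a }
}

// NewAMDGPUPlugin applies whatever options the caller passes in.
func NewAMDGPUPlugin(resource string, opts ...AMDGPUPluginOption) *AMDGPUPlugin {
    p := &AMDGPUPlugin{resource: resource}
    for _, opt := range opts {
        opt(p)
    }
    return p
}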

func setupTopoFolders() {
    topofolders := []string{"../../../testdata/topo-mi210-xgmi-pcie.tgz", "../../../testdata/topo-mi300-cpx.tgz"}
    for _, dir := range topofolders {
        cmd := exec.Command("tar", "-xvf", dir, "-C", "../../../testdata")
Collaborator

This is not the right way to do this. How big is the test data uncompressed? How much of it do you really need?

Sampling the content of the tgz, you seem to have dumped the sysfs incorrectly (every file is null-padded to the size of a page, 4 kB).

Contributor Author

@y2kenny-amd We have removed the files that we don't need for our logic and compressed them again. These two topo folders contain around 4200 files and occupy around 34 MB. They contain the properties files from which we read connectivity information between every pair of GPUs. One of the topology directories has 64 CPX partitions, and the file count increased because of that.
Please let us know your thoughts on this, and we can make changes accordingly. We can either keep the full folders or keep the compressed tars.

Collaborator

The number of files is not an issue; you can leave the unused ones in. The issue is that all the files are null-padded to 4 kB in size. Please find a way to dump the sysfs without the zero padding, or strip the zero padding.

Contributor Author

@y2kenny-amd we stripped the zero padding and pushed the tars again. PTAL

Collaborator

With the size of the unpadded files, I think it's OK to just have the folders like the other test cases and not use the tarball workaround (even though the MI300 one is sizeable).

Contributor Author

Sure @y2kenny-amd, we pushed the folders instead of the tarball. PTAL

@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch 18 times, most recently from 4ff4e6b to 22a76d5 Compare April 9, 2025 10:46
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch 3 times, most recently from 9322a6e to a56c360 Compare April 12, 2025 02:29
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch from a56c360 to 02928fa Compare April 16, 2025 12:51
}
glog.Infof("Device counts: %v", deviceCountMap)
return devices, deviceCountMap
Collaborator

Looking at the way the deviceCountMap is used, this can be split out into a separate function that takes the output of GetAMDGPUs as input (since computePartition and memoryPartition are returned). This avoids the drastic interface changes.

This commit adds support for extra fields in the amdgpu map, which will
help store info about partitions and their original GPUs:

Terms to understand before viewing the changes in this commit:

Homogeneous Node:
   If all GPUs in a node are following the same compute and memory
   partition style, the node is considered homogeneous

Heterogeneous Node:
   If the GPUs on a node have different compute and memory partition
   styles, the node is considered heterogeneous (put simply, if a node
   is not homogeneous)

Explanation of Changes made in this commit:

-> For GPU partition directories found under
   /sys/devices/platform/amd_xcp*,
   we keep track of their original GPU's ID in the existing "device"
   map so that we know under which plugin to push each device later
   (a devID field is added to the map so that the real GPU's ID can be
   associated with all of its partitions)

Calculation of the real GPU from a partitioned GPU when partitioning is
supported on the node:

card 1,9,17,25,… are always real GPU IDs
card 2,3,4,5,6,7,8 are partitioned GPUs whose real GPU ID is 1
card 10,11,12,13,14,15,16 are partitioned GPUs whose real GPU ID is 9
and so on….
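
As a small sketch of that card-numbering rule (assuming eight card entries per real GPU, i.e. the real card plus its seven partition cards, as in the example above; the real code derives this from sysfs rather than hard-coding it):

package amdgpu

// originalGPUCard maps a card index to the card index of its real (unpartitioned)
// GPU under the numbering above: cards 1..8 map to 1, 9..16 map to 9, and so on.
func originalGPUCard(card, cardsPerGPU int) int {
    return ((card-1)/cardsPerGPU)*cardsPerGPU + 1
}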

Implementation Info:

Device Plugin uses these files to determine the partition type of the
GPUs:
      /sys/module/amdgpu/drivers/pci:amdgpu/<>/current_memory_partition
      /sys/module/amdgpu/drivers/pci:amdgpu/<>/current_compute_partition

-> The partition type for each GPU is stored. For the partitioned GPUs,
   the device plugin associates the same partition types as the original
   GPU. To do that, the devID field stored for the original GPU is used
   to fetch the compute and memory partition from the original GPU and
   store them as the partition type for the partitioned GPUs

   Devices Map struct has additional fields to identify the partition
   for each GPU

   Example of how the map looks:
      Devices map: map[0000:cc:00.0:map[card:0 computePartition: "spx"
     devID:0 memoryPartition: "nps1", nodeId:2 numaNode:1 renderD:128]]
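
For illustration, a sketch of how a partitionType string like the one in the example could be built from the two sysfs files listed above (the helper name and the lower-casing are assumptions, not the PR's exact code):

package amdgpu

import (
    "os"
    "path/filepath"
    "strings"
)

// currentPartitionType reads the two sysfs files named above and joins them into
// a partitionType such as "spx_nps1".
func currentPartitionType(gpuPath string) (string, error) {
    compute, err := os.ReadFile(filepath.Join(gpuPath, "current_compute_partition"))
    if err != nil {
        return "", err
    }
    memory, err := os.ReadFile(filepath.Join(gpuPath, "current_memory_partition"))
    if err != nil {
        return "", err
    }
    return strings.ToLower(strings.TrimSpace(string(compute))) + "_" +
        strings.ToLower(strings.TrimSpace(string(memory))), nil
}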

-> We maintain a map of how many GPUs are under each partitionType
   Example of how the map looks:
      Device counts: map[cpx_nps1:8 spx_nps1:4]

-> Added some helper functions to determine whether the node is
   homogeneous and whether the GPUs on the node support partitioning.
   These will be invoked by the device plugin and node labeller code:

   The helper functions check for the presence of
   `available_compute_partition` and `available_memory_partition` files
   at the path of the first GPU of the node.
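
A minimal sketch of that check, assuming a hypothetical helper name and the sysfs layout quoted above:

package amdgpu

import (
    "os"
    "path/filepath"
)

// partitioningSupported reports whether the given GPU's device directory exposes
// both available_*_partition files, which is the check described above.
func partitioningSupported(firstGPUPath string) bool {
    for _, f := range []string{"available_compute_partition", "available_memory_partition"} {
        if _, err := os.Stat(filepath.Join(firstGPUPath, f)); err != nil {
            return false
        }
    }
    return true
}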
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch from 02928fa to 78a5485 Compare April 17, 2025 04:38
This commit contains changes made on the plugin side to invoke
previously defined helper functions to start the appropriate number of
plugins

Terms to understand before viewing the changes in this commit:

Homogeneous Node:
   If all GPUs in a node are following the same compute and memory
   partition style, the node is considered homogeneous

Heterogeneous Node:
   If the GPUs on a node have different compute and memory partition
   styles, the node is considered heterogeneous (put simply, if a node
   is not homogeneous)

From the count of GPUs for each partitionType, the device plugin
determines whether the node is homogeneous or heterogeneous.

-> If the node is homogeneous, the ListAndWatch function is the same as
   it was before. One plugin is started for the resource last name "gpu"
-> If the node is heterogeneous, ListAndWatch sends the devices to their
   respective resource type depending on their partitionType. Multiple
   plugins are started for the different resource types present and are
   identified using resourceLastName (cpx_nps1, spx_nps1, etc.)

Example of how the map looks:
   Device counts: map[cpx_nps1:8 spx_nps1:4]
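
As a sketch of that decision (the map shape matches the example above; the function name is an assumption):

package amdgpu

// isHomogeneous treats a node as homogeneous exactly when a single partitionType
// is present in the per-type device counts, e.g. map[cpx_nps1:8 spx_nps1:4] is
// heterogeneous while map[spx_nps1:8] is homogeneous.
func isHomogeneous(deviceCountMap map[string]int) bool {
    return len(deviceCountMap) == 1
}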

NOTE:

import "strings" was missed in this previous PR:
ROCm#116
This is causing "undefined: strings" compilation error.
Added that import here as it was interfering with being able to compile
my changes
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch 5 times, most recently from 970fceb to fff133d Compare April 17, 2025 05:31
This commit contains changes on the node labeller side to invoke the
previously defined helper functions to label the node according to
whether the node is homogeneous or heterogeneous and the partition
types present on the node

Terms to understand before viewing the changes in this commit:

Homogeneous Node:
   If all GPUs in a node are following the same compute and memory
   partition style, the node is considered homogeneous

Heterogeneous Node:
   If the GPUs on a node have different compute and memory partition
   styles, the node is considered heterogeneous (put simply, if a node
   is not homogeneous)

-> If the node is homogeneous, Node Labeller labels the node with
   `amd.com/compute-memory-partition` and sets it to whichever
   `partitionType` is currently present on all GPUs of the node.

   It uses the `devicesCountMap` to determine which `partitionType` to
   use for the label

-> If the node is heterogeneous, no label will be added by Node
   Labeller, since the resource names reported by the device plugin
   already denote the types of partition present on the node.

-> If the node has GPUs of model MI300 or above, the node is labelled
   with
   amd.com/compute-partition-supported="true"
   amd.com/memory-partition-supported="true"

-> If not, these labels will be set to "false"

   To achieve this, the presence of the `available_compute_partition`
   and `available_memory_partition` files is checked at the path of the
   first GPU of the node.
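
A rough sketch of those labelling rules (the label keys come from the text above; the function name and the shape of the count map are assumptions):

package amdgpu

import "strconv"

// partitionLabels always sets the two *-partition-supported labels, and only adds
// amd.com/compute-memory-partition when the node is homogeneous (a single entry
// in deviceCountMap).
func partitionLabels(deviceCountMap map[string]int, partitionSupported bool) map[string]string {
    labels := map[string]string{
        "amd.com/compute-partition-supported": strconv.FormatBool(partitionSupported),
        "amd.com/memory-partition-supported":  strconv.FormatBool(partitionSupported),
    }
    if len(deviceCountMap) == 1 {
        for partitionType := range deviceCountMap {
            labels["amd.com/compute-memory-partition"] = partitionType
        }
    }
    return labels
}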
-> An Allocator package is introduced to calculate the best possible
   subset of GPUs to allocate when resources are requested from the
   device plugin

-> Device Plugin takes care of passing on the GPU's nodeId as part of
   the device struct, which is used to fetch link types between that
   GPU and all other GPUs.

-> Device Plugin also passes on the GPU's numaNode, which is used in
   the scoring logic

-> Kubelet informs the Allocator how many resources are being requested
   and which devices of that resource type are available

-> The Allocator generates all unique subsets of size n

-> For every unique pair (x,y) in each subset, increment
   score += score(x,y)

-> Pick the subset with the lowest score as the preferred allocation

Logic for scoring each pair of devices

-> If x.devId == y.devId, score += 10; else, score += 20

   Explanation: If the devIds are the same, the two devices are
   partitions of the same GPU, which should be considered better than
   being partitions of different GPUs

-> If link_type(x,y) == "xgmi", score += 10
   If link_type(x,y) == "pcie", score += 40
   If link_type(x,y) is anything else, score += 50

Explanation: XGMI connection is faster than PCIE

Allocator fetches link_type for all links for a node `x` from the value
present in:

/sys/class/kfd/kfd/topology/nodes/x/io_links/<int>/properties
/sys/class/kfd/kfd/topology/nodes/x/p2p_links/<int>/properties

-> If x.numaNode == y.numaNode, score += 10; else, score += 20

Explanation: GPUs being connected to the same numa node is better than
             being connected to different numa nodes

The values for the scores are picked so that the below order of
precedence is followed for the preferred allocation; some values are
higher than others because those properties of a pair have a bigger
effect on how "good" the pair of GPUs is.

1.	XGMI + same gpu + same numa node
2.	XGMI + different gpu + same numa node
3.	XGMI + different gpu + different numa node
4.	PCIE + same gpu + same numa node
5.	PCIE + different gpu + same numa node
6.	PCIE + different gpu + different numa node
7.	Any different link type other than XGMI and PCIE
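
Putting the scoring rules above together, a minimal sketch (the device struct, its fields and the function name are assumptions; the real allocator reads the link types from the kfd topology files listed earlier):

package allocator

// device carries the per-GPU information described above.
type device struct {
    devID    int   // ID of the original (unpartitioned) GPU
    numaNode int64 // NUMA node the GPU is attached to
}

// scorePair applies the pairwise rules above; a lower score is a better pairing.
func scorePair(x, y device, linkType string) int {
    score := 0
    if x.devID == y.devID {
        score += 10 // partitions of the same physical GPU
    } else {
        score += 20
    }
    switch linkType {
    case "xgmi":
        score += 10
    case "pcie":
        score += 40
    default:
        score += 50
    }
    if x.numaNode == y.numaNode {
        score += 10
    } else {
        score += 20
    }
    return score
}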

To avoid calculating a large number of subsets and their scores,
partitioned GPUs use their original GPU's properties, and we avoid
making extra subsets between partitions of the same GPU, since all
partitions of a GPU will have the same score with all partitions of
another GPU.
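
Continuing the sketch above (same hypothetical package, reusing device and scorePair), the preferred allocation can be picked by brute force over all subsets of size n; the production code additionally collapses partitions of the same GPU as described in the previous paragraph:

// bestSubset enumerates every subset of size n, sums scorePair over all unique
// pairs in the subset, and returns the lowest-scoring subset.
func bestSubset(devs []device, n int, linkType func(x, y device) string) []device {
    var best []device
    bestScore := -1
    var recurse func(start int, chosen []device)
    recurse = func(start int, chosen []device) {
        if len(chosen) == n {
            score := 0
            for i := 0; i < n; i++ {
                for j := i + 1; j < n; j++ {
                    score += scorePair(chosen[i], chosen[j], linkType(chosen[i], chosen[j]))
                }
            }
            if bestScore < 0 || score < bestScore {
                bestScore = score
                best = append([]device(nil), chosen...)
            }
            return
        }
        for i := start; i <= len(devs)-(n-len(chosen)); i++ {
            recurse(i+1, append(chosen, devs[i]))
        }
    }
    recurse(0, nil)
    return best
}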
The resource_naming_strategy is a new flag which can be passed to the
device plugin daemonset. The supported values for the flag are
"single" and "mixed"

Terms to understand before viewing the changes in this commit:

Homogeneous Node:
   If all GPUs in a node are following the same compute and memory
   partition style, the node is considered homogeneous

Heterogeneous Node:
   If the GPUs on a node have different compute and memory partition
   styles, the node is considered heterogeneous (put simply, if a node
   is not homogeneous)

Behaviour of Resource Naming Strategy in different node types:

Homogeneous Node:

-> If the node is homogeneous and the resource naming strategy is
   "single", one plugin is started using the DevicePluginManager with
   the last name "gpu".

   If the node is homogeneous and the resource naming strategy is
   "mixed", one plugin is started using the DevicePluginManager with
   the last name set to the partition style present on the node.

-> The ListAndWatch function remains almost the same as it was before.
   It reports resources under a single resource name (the name will
   either be "gpu" or the partition style present on the node, e.g.
   cpx_nps1, depending on the strategy)

Heterogeneous:

-> If the node is heterogeneous and the resource naming strategy is
   "mixed", we invoke the DevicePluginManager to start multiple plugins
   for the different partitionTypes under the names "spx-nps1",
   "cpx-nps1", etc. We use the devicesCount map to start plugins for
   the partitionTypes that are present in the map

-> ListAndWatch sends the devices to the plugin for their respective
   resource type depending on their partitionType. Each device has
   computePartition and memoryPartition fields in its object, as shown
   before, which are used to identify which plugin to report the
   resource under (amd.com/spx-nps1, amd.com/cpx-nps1, etc.)

Note:
-> If the node is heterogeneous, the "single" strategy is not supported,
   as reporting multiple resource types under a single resource name
   would not accurately reflect how many GPUs of each type there
   actually are

-> For nodes where partitioning is not supported (MI200), the resources
   will be reported under "amd.com/gpu" irrespective of the strategy

-> If the flag is not set by the user, the default value is "single".
   This is to maintain backwards compatibility with the resource name
   used before the strategy was introduced (amd.com/gpu)
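
A condensed sketch of those rules (the function name is an assumption; the underscore-to-hyphen mapping of partition types matches the amd.com/cpx-nps1 examples above):

package amdgpu

import (
    "fmt"
    "strings"
)

// resourceLastNames returns the resource last names ("gpu", "cpx-nps1", ...) that
// plugins should be started for, given the per-partitionType counts, the
// resource_naming_strategy flag and whether the node supports partitioning.
func resourceLastNames(deviceCountMap map[string]int, strategy string, partitionSupported bool) ([]string, error) {
    if strategy != "single" && strategy != "mixed" {
        return nil, fmt.Errorf("unknown resource_naming_strategy %q", strategy)
    }
    // Nodes without partition support (e.g. MI200) always report amd.com/gpu.
    if !partitionSupported || len(deviceCountMap) == 0 {
        return []string{"gpu"}, nil
    }
    homogeneous := len(deviceCountMap) == 1
    if !homogeneous && strategy == "single" {
        return nil, fmt.Errorf(`"single" strategy is not supported on a heterogeneous node`)
    }
    if homogeneous && strategy == "single" {
        return []string{"gpu"}, nil
    }
    // "mixed": one resource last name per partition type present on the node,
    // e.g. cpx_nps1 -> cpx-nps1.
    names := make([]string, 0, len(deviceCountMap))
    for partitionType := range deviceCountMap {
        names = append(names, strings.ReplaceAll(partitionType, "_", "-"))
    }
    return names, nil
}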
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch from fff133d to 955c8a8 Compare April 17, 2025 05:35