Device Plugin and Node Labeller support for gpu partitions #117


Open
wants to merge 6 commits into master

Conversation

sriram-30
Contributor

@sriram-30 sriram-30 commented Apr 2, 2025

This PR adds support for GPU partitions in the Device Plugin and Node Labeller.
It also adds a new topology-aware scheduler for the allocation of GPU resources from Kubernetes.

@sriram-30 sriram-30 marked this pull request as ready for review April 2, 2025 10:56
@sriram-30 sriram-30 changed the title Device Plugin and Node Labeller changes for gpu partitions Device Plugin and Node Labeller support for gpu partitions Apr 2, 2025
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch 3 times, most recently from 5cfb040 to f00cedb Compare April 8, 2025 05:26
@@ -235,17 +218,8 @@ func (p *AMDGPUPlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin
partitionType := device["computePartition"].(string) + "_" + device["memoryPartition"].(string)
resourceTypeDevs[partitionType] = append(resourceTypeDevs[partitionType], dev)

numas, err := hw.GetNUMANodes(id)
numas := []int64{int64(device["numaNode"].(int))}
Collaborator

Why not enhance the hwloc library to support GPU partitions? The idea of a software library is to reduce duplicated code with common functionality, thereby improving software quality. I am sure getting the NUMA node for a partitioned GPU is needed outside of our narrow k8s device plugin context?

Contributor Author
@sriram-30 sriram-30 Apr 9, 2025

Sorry, enhancing an external open-source C library is not within the scope of this development activity.
Also, the way numaNode is fetched in this PR is very much in line with how the existing device plugin code reads other GPU information. Just as other properties are already read from the files under /sys/module/amdgpu/drivers/pci:amdgpu/ in the device plugin code, we simply make use of the available numanode file and reuse the original GPU's property for its partitions.
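
For illustration only, a minimal sketch of that approach (the helper name and parsing are assumptions, not the PR's code; the sysfs attribute is assumed to be the standard per-device numa_node file next to the other attributes):

package amdgpu

import (
    "os"
    "path/filepath"
    "strconv"
    "strings"
)

// readNumaNode is a hypothetical helper: it reads the numa_node attribute that
// sysfs exposes alongside the other per-device files and parses it as an integer.
func readNumaNode(devicePath string) (int64, error) {
    raw, err := os.ReadFile(filepath.Join(devicePath, "numa_node"))
    if err != nil {
        return -1, err
    }
    return strconv.ParseInt(strings.TrimSpace(string(raw)), 10, 64)
}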

Collaborator

I am not saying you should be doing it yourself, but don't we have another team working on enhancing hwloc? Have you reached out?

return &AMDGPUPlugin{
Heartbeat: l.Heartbeat,
Resource: resourceLastName,
options := []AMDGPUPluginOption{
Collaborator

what is the purpose of this extra abstraction?

Contributor Author

Added this abstraction to decouple the Allocator dependency during creation of the AMDGPUPlugin object. We can associate a different allocation policy in the future if we want to. It will also make it easier to add more UTs in the future. Please let me know your thoughts.
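
As a rough sketch of the pattern being described (type and field names here are assumptions, not the PR's exact code), the functional-options style keeps the allocator choice out of the constructor:

package amdgpu

// Allocator is a stand-in for whatever allocation policy the plugin delegates to.
type Allocator interface {
    Allocate(available []string, required int) []string
}

// AMDGPUPlugin is reduced to two fields for the sake of the sketch.
type AMDGPUPlugin struct {
    resource  string
    allocator Allocator
}

// AMDGPUPluginOption mutates a plugin while it is being constructed.
type AMDGPUPluginOption func(*AMDGPUPlugin)

// WithAllocator injects an allocation policy without hard-wiring it into the
// constructor, which also makes it easy to substitute a fake allocator in unit tests.
func WithAllocator(a Allocator) AMDGPUPluginOption {
    return func(p *AMDGPUPlugin) { p.allocator = a }
}

// NewAMDGPUPlugin applies whatever options the caller passes in.
func NewAMDGPUPlugin(resource string, opts ...AMDGPUPluginOption) *AMDGPUPlugin {
    p := &AMDGPUPlugin{resource: resource}
    for _, opt := range opts {
        opt(p)
    }
    return p
}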

func setupTopoFolders() {
    topofolders := []string{"../../../testdata/topo-mi210-xgmi-pcie.tgz", "../../../testdata/topo-mi300-cpx.tgz"}
    for _, dir := range topofolders {
        cmd := exec.Command("tar", "-xvf", dir, "-C", "../../../testdata")
Collaborator

This is not the right way to do this. How big is the test data uncompressed? How much of it do you really need?

Sampling the content of the tgz, you seem to have dumped the sysfs incorrectly (every file is null-padded to the size of a page, 4 kB).

Contributor Author

@y2kenny-amd We have removed the files that we don't need for our logic and compressed them again. These two topo folders contain around 4200 files and occupy around 34 MB. They contain the properties files from which we read connectivity information between every pair of GPUs. One of the topology directories has 64 CPX partitions, and the file count increased because of that.
Please let us know your thoughts on this, and we can make changes accordingly. We can either keep the full folders or keep the compressed tars.

Collaborator

The number of files is not an issue; you can leave the unused ones in. The issue is that all the files are null-padded to 4 kB in size. Please find a way to dump the sysfs without the zero padding, or strip the zero padding.

Contributor Author

@y2kenny-amd we stripped the zero padding and pushed the tars again. PTAL

Collaborator

With the size of the unpadded files, I think it's OK to just have the folders like the other test cases and not use the tarball workaround (even though the MI300 one is sizeable).

Contributor Author

Sure @y2kenny-amd, we pushed the folders instead of the tarball. PTAL

@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch 18 times, most recently from 4ff4e6b to 22a76d5 Compare April 9, 2025 10:46
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch 3 times, most recently from 9322a6e to a56c360 Compare April 12, 2025 02:29
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch from a56c360 to 02928fa Compare April 16, 2025 12:51
}
glog.Infof("Device counts: %v", deviceCountMap)
return devices, deviceCountMap
Collaborator

Looking at the way the deviceCountMap is used, this can be split out into a separate function that takes the output of GetAMDGPUs as input (since computePartition and memoryPartition are returned). This avoids the drastic interface changes.

This commit adds support for extra fields in the amdgpu map, which will
help store info about partitions and their original GPUs:

Terms to understand before viewing the changes in this commit:

Homogeneous Node:
   If all GPUs in a node are following the same compute and memory
   partition style, the node is considered homogeneous

Heterogeneous Node:
   If the GPUs on a node have different compute and memory partition
   styles, the node is considered heterogeneous (put simply, if a node
   is not homogeneous)

Explanation of Changes made in this commit:

-> For GPU partition directories found under
   /sys/devices/platform/amd_xcp*,
   we keep track of their original GPU's ID in the existing "device"
   map so that we know under which plugin to push each device later
   (a devID field is added to the map so that the real GPU's ID can be
   associated with all of its partitions)

Calculation of the real GPU from a partitioned GPU when partitioning is
supported on the node:

card 1,9,17,25,… are always real GPU IDs
card 2,3,4,5,6,7,8 are partitioned GPUs whose real GPU ID is 1
card 10,11,12,13,14,15,16 are partitioned GPUs whose real GPU ID is 9
and so on….
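
As a small sketch of that card-numbering rule (assuming eight card entries per real GPU, i.e. the real card plus its seven partition cards, as in the example above; the real code derives this from sysfs rather than hard-coding it):

package amdgpu

// originalGPUCard maps a card index to the card index of its real (unpartitioned)
// GPU under the numbering above: cards 1..8 map to 1, 9..16 map to 9, and so on.
func originalGPUCard(card, cardsPerGPU int) int {
    return ((card-1)/cardsPerGPU)*cardsPerGPU + 1
}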

Implementation Info:

Device Plugin uses these files to determine the partition type of the
GPUs:
      /sys/module/amdgpu/drivers/pci:amdgpu/<>/current_memory_partition
      /sys/module/amdgpu/drivers/pci:amdgpu/<>/current_compute_partition

-> The partition type for each GPU is stored. For the partitioned GPUs,
   the device plugin associates the same partition types as the original
   GPU. To do that, the devID field stored for the original GPU is used
   to fetch the compute and memory partition from the original GPU and
   store them as the partition type for the partitioned GPUs

   Devices Map struct has additional fields to identify the partition
   for each GPU

   Example of how the map looks:
      Devices map: map[0000:cc:00.0:map[card:0 computePartition: "spx"
     devID:0 memoryPartition: "nps1", nodeId:2 numaNode:1 renderD:128]]
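
For illustration, a sketch of how a partitionType string like the one in the example could be built from the two sysfs files listed above (the helper name and the lower-casing are assumptions, not the PR's exact code):

package amdgpu

import (
    "os"
    "path/filepath"
    "strings"
)

// currentPartitionType reads the two sysfs files named above and joins them into
// a partitionType such as "spx_nps1".
func currentPartitionType(gpuPath string) (string, error) {
    compute, err := os.ReadFile(filepath.Join(gpuPath, "current_compute_partition"))
    if err != nil {
        return "", err
    }
    memory, err := os.ReadFile(filepath.Join(gpuPath, "current_memory_partition"))
    if err != nil {
        return "", err
    }
    return strings.ToLower(strings.TrimSpace(string(compute))) + "_" +
        strings.ToLower(strings.TrimSpace(string(memory))), nil
}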

-> We maintain a map of how many GPUs are under each partitionType
   Example of how the map looks:
      Device counts: map[cpx_nps1:8 spx_nps1:4]

-> Added some helper functions to determine whether the node is
   homogeneous and whether the GPUs on the node support partitioning.
   These will be invoked by the device plugin and node labeller code:

   The helper functions check for the presence of
   `available_compute_partition` and `available_memory_partition` files
   at the path of the first GPU of the node.
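
A minimal sketch of that check, assuming a hypothetical helper name and the sysfs layout quoted above:

package amdgpu

import (
    "os"
    "path/filepath"
)

// partitioningSupported reports whether the given GPU's device directory exposes
// both available_*_partition files, which is the check described above.
func partitioningSupported(firstGPUPath string) bool {
    for _, f := range []string{"available_compute_partition", "available_memory_partition"} {
        if _, err := os.Stat(filepath.Join(firstGPUPath, f)); err != nil {
            return false
        }
    }
    return true
}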
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch from 02928fa to 78a5485 Compare April 17, 2025 04:38
This commit contains changes made on the plugin side to invoke
previously defined helper functions to start the appropriate number of
plugins

Terms to understand before viewing the changes in this commit:

Homogeneous Node:
   If all GPUs in a node are following the same compute and memory
   partition style, the node is considered homogeneous

Heterogeneous Node:
   If the GPUs on a node have different compute and memory partition
   styles, the node is considered heterogeneous (put simply, if a node
   is not homogeneous)

From the count of GPUs for each partitionType, the device plugin
determines whether the node is homogeneous or heterogeneous.

-> If the node is homogeneous, the ListAndWatch function is the same as
   it was before. One plugin is started for the resource last name "gpu"
-> If the node is heterogeneous, ListAndWatch sends the devices to their
   respective resource type depending on their partitionType. Multiple
   plugins are started for the different resource types present and are
   identified using resourceLastName (cpx_nps1, spx_nps1, etc.)

Example of how the map looks:
   Device counts: map[cpx_nps1:8 spx_nps1:4]
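
As a sketch of that decision (the map shape matches the example above; the function name is an assumption):

package amdgpu

// isHomogeneous treats a node as homogeneous exactly when a single partitionType
// is present in the per-type device counts, e.g. map[cpx_nps1:8 spx_nps1:4] is
// heterogeneous while map[spx_nps1:8] is homogeneous.
func isHomogeneous(deviceCountMap map[string]int) bool {
    return len(deviceCountMap) == 1
}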

NOTE:

import "strings" was missed in this previous PR:
ROCm#116
This is causing "undefined: strings" compilation error.
Added that import here as it was interfering with being able to compile
my changes
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch 5 times, most recently from 970fceb to fff133d Compare April 17, 2025 05:31
This commit contains changes on the node labeller side to invoke the
previously defined helper functions to label the node according to
whether the node is homogeneous or heterogeneous and the partition
types present on the node

Terms to understand before viewing the changes in this commit:

Homogeneous Node:
   If all GPUs in a node are following the same compute and memory
   partition style, the node is considered homogeneous

Heterogeneous Node:
   If the GPUs on a node have different compute and memory partition
   styles, the node is considered heterogeneous (put simply, if a node
   is not homogeneous)

-> If the node is homogeneous, Node Labeller labels the node with
   `amd.com/compute-memory-partition` and sets it to whichever
   `partitionType` is currently present on all GPUs of the node.

   It uses the `devicesCountMap` to determine which `partitionType` to
   use for the label

-> If the node is heterogeneous, no label will be added by Node
   Labeller, since the resource names reported by the device plugin
   already denote the types of partition present on the node.

-> If the node has GPUs of model MI300 or above, the node is labelled
   with
   amd.com/compute-partition-supported="true"
   amd.com/memory-partition-supported="true"

-> If not, these labels will be set to "false"

   To achieve this, the presence of the `available_compute_partition`
   and `available_memory_partition` files is checked at the path of the
   first GPU of the node.
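
A rough sketch of those labelling rules (the label keys come from the text above; the function name and the shape of the count map are assumptions):

package amdgpu

import "strconv"

// partitionLabels always sets the two *-partition-supported labels, and only adds
// amd.com/compute-memory-partition when the node is homogeneous (a single entry
// in deviceCountMap).
func partitionLabels(deviceCountMap map[string]int, partitionSupported bool) map[string]string {
    labels := map[string]string{
        "amd.com/compute-partition-supported": strconv.FormatBool(partitionSupported),
        "amd.com/memory-partition-supported":  strconv.FormatBool(partitionSupported),
    }
    if len(deviceCountMap) == 1 {
        for partitionType := range deviceCountMap {
            labels["amd.com/compute-memory-partition"] = partitionType
        }
    }
    return labels
}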
-> An Allocator package is introduced to calculate the best possible
   subset of GPUs to allocate when resources are requested from the
   device plugin

-> Device Plugin takes care of passing on the GPU's nodeId as part of
   the device struct, which is used to fetch link types between that
   GPU and all other GPUs.

-> Device Plugin also passes on the GPU's numaNode, which is used in
   the scoring logic

-> Kubelet informs the Allocator how many resources are being requested
   and which devices of that resource type are available

-> The Allocator generates all unique subsets of size n

-> For every unique pair (x,y) in each subset, increment
   score += score(x,y)

-> Pick the subset with the lowest score as the preferred allocation

Logic for scoring each pair of devices

-> If x.devId == y.devId, score += 10; else, score += 20

   Explanation: If the devIds are the same, the two devices are
   partitions of the same GPU, which should be considered better than
   being partitions of different GPUs

-> If link_type(x,y) == "xgmi", score += 10
   If link_type(x,y) == "pcie", score += 40
   If link_type(x,y) is anything else, score += 50

Explanation: XGMI connection is faster than PCIE

Allocator fetches link_type for all links for a node `x` from the value
present in:

/sys/class/kfd/kfd/topology/nodes/x/io_links/<int>/properties
/sys/class/kfd/kfd/topology/nodes/x/p2p_links/<int>/properties

-> If x.numaNode == y.numaNode, score += 10; else, score += 20

Explanation: GPUs being connected to the same numa node is better than
             being connected to different numa nodes

The values for the scores are picked so that the below order of
precedence is followed for the preferred allocation; some values are
higher than others because those properties of a pair have a bigger
effect on how "good" the pair of GPUs is.

1.	XGMI + same gpu + same numa node
2.	XGMI + different gpu + same numa node
3.	XGMI + different gpu + different numa node
4.	PCIE + same gpu + same numa node
5.	PCIE + different gpu + same numa node
6.	PCIE + different gpu + different numa node
7.	Any different link type other than XGMI and PCIE
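
Putting the scoring rules above together, a minimal sketch (the device struct, its fields and the function name are assumptions; the real allocator reads the link types from the kfd topology files listed earlier):

package allocator

// device carries the per-GPU information described above.
type device struct {
    devID    int   // ID of the original (unpartitioned) GPU
    numaNode int64 // NUMA node the GPU is attached to
}

// scorePair applies the pairwise rules above; a lower score is a better pairing.
func scorePair(x, y device, linkType string) int {
    score := 0
    if x.devID == y.devID {
        score += 10 // partitions of the same physical GPU
    } else {
        score += 20
    }
    switch linkType {
    case "xgmi":
        score += 10
    case "pcie":
        score += 40
    default:
        score += 50
    }
    if x.numaNode == y.numaNode {
        score += 10
    } else {
        score += 20
    }
    return score
}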

To avoid calculating a large number of subsets and their scores,
partitioned GPUs use their original GPU's properties, and we avoid
making extra subsets between partitions of the same GPU, since all
partitions of a GPU will have the same score with all partitions of
another GPU.
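
Continuing the sketch above (same hypothetical package, reusing device and scorePair), the preferred allocation can be picked by brute force over all subsets of size n; the production code additionally collapses partitions of the same GPU as described in the previous paragraph:

// bestSubset enumerates every subset of size n, sums scorePair over all unique
// pairs in the subset, and returns the lowest-scoring subset.
func bestSubset(devs []device, n int, linkType func(x, y device) string) []device {
    var best []device
    bestScore := -1
    var recurse func(start int, chosen []device)
    recurse = func(start int, chosen []device) {
        if len(chosen) == n {
            score := 0
            for i := 0; i < n; i++ {
                for j := i + 1; j < n; j++ {
                    score += scorePair(chosen[i], chosen[j], linkType(chosen[i], chosen[j]))
                }
            }
            if bestScore < 0 || score < bestScore {
                bestScore = score
                best = append([]device(nil), chosen...)
            }
            return
        }
        for i := start; i <= len(devs)-(n-len(chosen)); i++ {
            recurse(i+1, append(chosen, devs[i]))
        }
    }
    recurse(0, nil)
    return best
}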
The resource_naming_strategy is a new flag which can be passed to the
device plugin daemonset. The supported values for the flag are
"single" and "mixed"

Terms to understand before viewing the changes in this commit:

Homogeneous Node:
   If all GPUs in a node are following the same compute and memory
   partition style, the node is considered homogeneous

Heterogeneous Node:
   If the GPUs on a node have different compute and memory partition
   styles, the node is considered heterogeneous (put simply, if a node
   is not homogeneous)

Behaviour of Resource Naming Strategy in different node types:

Homogeneous Node:

-> If the node is homogeneous and the resource naming strategy is
   "single", one plugin is started using the DevicePluginManager with
   the last name "gpu".

   If the node is homogeneous and the resource naming strategy is
   "mixed", one plugin is started using the DevicePluginManager with
   the last name set to the partition style present on the node.

-> The ListAndWatch function remains almost the same as it was before.
   It reports resources under a single resource name (the name will
   either be "gpu" or the partition style present on the node, e.g.
   cpx_nps1, depending on the strategy)

Heterogeneous:

-> If the node is heterogeneous and the resource naming strategy is
   "mixed", we invoke the DevicePluginManager to start multiple plugins
   for the different partitionTypes under the names "spx-nps1",
   "cpx-nps1", etc. We use the devicesCount map to start plugins for
   the partitionTypes that are present in the map

-> ListAndWatch sends the devices to the plugin for their respective
   resource type depending on their partitionType. Each device has
   computePartition and memoryPartition fields in its object, as shown
   before, which are used to identify which plugin to report the
   resource under (amd.com/spx-nps1, amd.com/cpx-nps1, etc.)

Note:
-> If the node is heterogeneous, the "single" strategy is not supported,
   as reporting multiple resource types under a single resource name
   would not accurately reflect how many GPUs of each type there
   actually are

-> For nodes where partitioning is not supported (MI200), the resources
   will be reported under "amd.com/gpu" irrespective of the strategy

-> If the flag is not set by the user, the default value is "single".
   This is to maintain backwards compatibility with the resource name
   used before the strategy was introduced (amd.com/gpu)
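
A condensed sketch of those rules (the function name is an assumption; the underscore-to-hyphen mapping of partition types matches the amd.com/cpx-nps1 examples above):

package amdgpu

import (
    "fmt"
    "strings"
)

// resourceLastNames returns the resource last names ("gpu", "cpx-nps1", ...) that
// plugins should be started for, given the per-partitionType counts, the
// resource_naming_strategy flag and whether the node supports partitioning.
func resourceLastNames(deviceCountMap map[string]int, strategy string, partitionSupported bool) ([]string, error) {
    if strategy != "single" && strategy != "mixed" {
        return nil, fmt.Errorf("unknown resource_naming_strategy %q", strategy)
    }
    // Nodes without partition support (e.g. MI200) always report amd.com/gpu.
    if !partitionSupported || len(deviceCountMap) == 0 {
        return []string{"gpu"}, nil
    }
    homogeneous := len(deviceCountMap) == 1
    if !homogeneous && strategy == "single" {
        return nil, fmt.Errorf(`"single" strategy is not supported on a heterogeneous node`)
    }
    if homogeneous && strategy == "single" {
        return []string{"gpu"}, nil
    }
    // "mixed": one resource last name per partition type present on the node,
    // e.g. cpx_nps1 -> cpx-nps1.
    names := make([]string, 0, len(deviceCountMap))
    for partitionType := range deviceCountMap {
        names = append(names, strings.ReplaceAll(partitionType, "_", "-"))
    }
    return names, nil
}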
@sriram-30 sriram-30 force-pushed the DevicePluginChanges branch from fff133d to 955c8a8 Compare April 17, 2025 05:35