Device Plugin and Node Labeller support for gpu partitions #117
Conversation
Force-pushed from 5cfb040 to f00cedb
@@ -235,17 +218,8 @@ func (p *AMDGPUPlugin) ListAndWatch(e *pluginapi.Empty, s pluginapi.DevicePlugin
 	partitionType := device["computePartition"].(string) + "_" + device["memoryPartition"].(string)
 	resourceTypeDevs[partitionType] = append(resourceTypeDevs[partitionType], dev)

-	numas, err := hw.GetNUMANodes(id)
+	numas := []int64{int64(device["numaNode"].(int))}
Why not enhance the hwloc library to support GPU partitions? The idea of a software library is to reduce duplicated code with common functionality, thereby improving software quality. I am sure getting the NUMA node for a partitioned GPU is needed outside of our narrow k8s device plugin context.
Sorry, enhancing an external open source C library is not within the scope of this development activity.
Also, the method of fetching the numaNode in this PR is very much in line with how the existing device plugin code reads other GPU information. Similar to other properties that are already being read from the files under /sys/module/amdgpu/drivers/pci:amdgpu/, the device plugin simply makes use of the available numanode file and uses the original GPU's property for its partitions.
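For context, a minimal sketch of that approach, assuming a hypothetical helper name and that the standard PCI `numa_node` sysfs file is what gets read; the PR's actual code stores the value into the device map as `numaNode`:

```go
package amdgpu

import (
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// getNumaNode reads the NUMA node of a GPU from its sysfs entry.
// bdf is the PCI address, e.g. "0000:cc:00.0". The helper name and
// exact path are illustrative, not the PR's exact code.
func getNumaNode(bdf string) (int, error) {
	path := filepath.Join("/sys/module/amdgpu/drivers/pci:amdgpu", bdf, "numa_node")
	b, err := os.ReadFile(path)
	if err != nil {
		return -1, err
	}
	// The file contains a single integer; -1 means no NUMA affinity.
	return strconv.Atoi(strings.TrimSpace(string(b)))
}
```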
I am not saying you should be doing it yourself, but don't we have another team working on enhancing hwloc? Have you reached out?
return &AMDGPUPlugin{
	Heartbeat: l.Heartbeat,
	Resource:  resourceLastName,
options := []AMDGPUPluginOption{
what is the purpose of this extra abstraction?
Added this abstraction to decouple the Allocator dependency during creation of the AMDGPUPlugin object. We can associate a different allocation policy in the future if we want to. It will also make it easier to add more UTs later. Please let me know your thoughts.
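For readers unfamiliar with the pattern, a minimal sketch of the functional-options abstraction being discussed; the `Allocator` interface, `WithAllocator` option, and the simplified `AMDGPUPlugin` struct below are stand-ins for whatever the PR actually defines:

```go
// Simplified stand-ins for types defined elsewhere in the plugin package.
type Allocator interface {
	GetPreferredAllocation(available []string, size int) ([]string, error)
}

type AMDGPUPlugin struct {
	allocator Allocator
}

// AMDGPUPluginOption configures an AMDGPUPlugin at construction time.
type AMDGPUPluginOption func(*AMDGPUPlugin)

// WithAllocator injects an allocation policy, decoupling plugin creation
// from a concrete Allocator implementation (hypothetical option name).
func WithAllocator(a Allocator) AMDGPUPluginOption {
	return func(p *AMDGPUPlugin) { p.allocator = a }
}

// NewAMDGPUPlugin applies the supplied options to a new plugin instance.
func NewAMDGPUPlugin(opts ...AMDGPUPluginOption) *AMDGPUPlugin {
	p := &AMDGPUPlugin{}
	for _, opt := range opts {
		opt(p)
	}
	return p
}
```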
func setupTopoFolders() {
	topofolders := []string{"../../../testdata/topo-mi210-xgmi-pcie.tgz", "../../../testdata/topo-mi300-cpx.tgz"}
	for _, dir := range topofolders {
		cmd := exec.Command("tar", "-xvf", dir, "-C", "../../../testdata")
This is not the right way to do this. How big is the test data uncompressed? How much of it do you really need?
Sampling the content of the tgz, you seem to have dumped the sysfs incorrectly (every file is null-padded to the size of a page, 4 kB).
@y2kenny-amd We have removed the files that we don't need for our logic and compressed them again. These two topo folders contain around 4,200 files and occupy around 34 MB. They contain properties files from which we read the connectivity information between every pair of GPUs. One of the topology directories has 64 CPX partitions, and the file count increased because of that.
Please let us know your thoughts, and we can make changes accordingly. We can either keep the full folders or keep the compressed tars.
The number of files is not an issue; you can leave the unused ones in them. The issue is that all the files are null-padded to 4 kB in size. Please find a way to dump the sysfs without the zero padding, or strip the zero padding.
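One possible way to strip the padding after copying the sysfs tree, sketched here with a hypothetical helper name; any equivalent approach (or a different dump method) would do:

```go
package main

import (
	"bytes"
	"io/fs"
	"os"
	"path/filepath"
)

// stripNulPadding walks a copied sysfs tree and truncates each file at the
// trailing run of NUL bytes left over from a page-sized dump.
func stripNulPadding(root string) error {
	return filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil || d.IsDir() {
			return err
		}
		b, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		trimmed := bytes.TrimRight(b, "\x00")
		if len(trimmed) == len(b) {
			return nil // nothing to strip
		}
		return os.WriteFile(path, trimmed, 0o644)
	})
}
```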
@y2kenny-amd we stripped the zero padding and pushed the tars again. PTAL
With the size of the unpadded files, I think it's OK to just have the folders like the other test cases and not use the tarball workaround (even though the MI300 one is sizeable).
Sure @y2kenny-amd, we pushed the folders instead of the tarballs. PTAL
Force-pushed from 4ff4e6b to 22a76d5
Force-pushed from 9322a6e to a56c360
Force-pushed from a56c360 to 02928fa
internal/pkg/amdgpu/amdgpu.go (outdated)
}
glog.Infof("Device counts: %v", deviceCountMap)
return devices, deviceCountMap
Looking at the way the deviceCountMap is used, this can be split out into a separate function that takes the output of GetAMDGPUs as input (since computePartition and memoryPartition are returned). This avoids the drastic interface changes.
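A rough sketch of the split this suggests, with hypothetical names; it derives the per-partition counts from the GetAMDGPUs-style device map so the existing return signature could stay as it is:

```go
// countDevicesPerPartition builds the "Device counts" map
// (e.g. map[cpx_nps1:8 spx_nps1:4]) from the devices map returned by
// GetAMDGPUs. The function name and map value type are illustrative.
func countDevicesPerPartition(devices map[string]map[string]interface{}) map[string]int {
	counts := map[string]int{}
	for _, dev := range devices {
		cp, ok1 := dev["computePartition"].(string)
		mp, ok2 := dev["memoryPartition"].(string)
		if !ok1 || !ok2 {
			continue // device does not expose partition information
		}
		counts[cp+"_"+mp]++
	}
	return counts
}
```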
This commit adds support for extra fields in the amdgpu map which store info about partitions and their original GPUs.

Terms to understand before viewing the changes in this commit:
Homogeneous node: if all GPUs in a node follow the same compute and memory partition style, the node is considered homogeneous.
Heterogeneous node: if the GPUs on a node have different compute and memory partition styles, the node is considered heterogeneous (put simply, if the node is not homogeneous).

Explanation of changes made in this commit:
-> For GPU partition directories found under /sys/devices/platform/amd_xcp*, we keep track of their original GPU's ID in the existing "device" map so that we know under which plugin to push each device later (a devID field is added to the map so that the real GPU's ID can be associated with all of its partitions).
-> Calculation of the real GPU from a partition GPU when partitioning is supported on the node: cards 1, 9, 17, 25, … are always real GPU IDs; cards 2, 3, 4, 5, 6, 7, 8 are partitioned GPUs whose real GPU ID is 1; cards 10, 11, 12, 13, 14, 15, 16 are partitioned GPUs whose real GPU ID is 9; and so on.
-> Implementation info: the device plugin uses these files to determine the partition type of each GPU:
/sys/module/amdgpu/drivers/pci:amdgpu/<>/current_memory_partition
/sys/module/amdgpu/drivers/pci:amdgpu/<>/current_compute_partition
-> The partition type for each GPU is stored. For the partitioned GPUs, the device plugin associates the same partition types as the original GPU. To do that, the devID field stored for the original GPU is used to fetch the compute and memory partition from the original GPU and store it as the partition type for the partitioned GPUs.
-> The devices map struct has additional fields to identify the partition of each GPU. Example of how the map looks:
Devices map: map[0000:cc:00.0:map[card:0 computePartition:"spx" devID:0 memoryPartition:"nps1" nodeId:2 numaNode:1 renderD:128]]
-> We maintain a map of how many GPUs are under each partitionType. Example of how the map looks:
Device counts: map[cpx_nps1:8 spx_nps1:4]
-> Added some helper functions to determine whether a node is homogeneous and whether the GPUs on the node support partitioning. These will be invoked by the device plugin and node labeller code. The helper functions check for the presence of the `available_compute_partition` and `available_memory_partition` files at the path of the first GPU of the node.
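A minimal sketch of the partition-support helper described above, with an illustrative function name and path argument; it only checks for the two availability files at a GPU's sysfs path:

```go
package amdgpu

import (
	"os"
	"path/filepath"
)

// isPartitionSupported reports whether a GPU exposes the partition
// availability files; gpuPath would be the sysfs directory of the node's
// first GPU. Name and signature are illustrative, not the PR's exact code.
func isPartitionSupported(gpuPath string) bool {
	for _, f := range []string{"available_compute_partition", "available_memory_partition"} {
		if _, err := os.Stat(filepath.Join(gpuPath, f)); err != nil {
			return false
		}
	}
	return true
}
```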
Force-pushed from 02928fa to 78a5485
This commit contains the changes made on the plugin side to invoke the previously defined helper functions and start the appropriate number of plugins.

Terms to understand before viewing the changes in this commit:
Homogeneous node: if all GPUs in a node follow the same compute and memory partition style, the node is considered homogeneous.
Heterogeneous node: if the GPUs on a node have different compute and memory partition styles, the node is considered heterogeneous (put simply, if the node is not homogeneous).

From the count of GPUs for each partitionType, the device plugin determines whether the node is homogeneous or heterogeneous.
-> If the node is homogeneous, the ListAndWatch function is the same as it was before: one plugin is started for the resource last name "gpu".
-> If the node is heterogeneous, ListAndWatch sends the devices to their respective resource types depending on their partitionType. Multiple plugins are started for the different resource types present, identified using resourceLastName (cpx_nps1, spx_nps1, etc.).
Example of how the map looks: Device counts: map[cpx_nps1:8 spx_nps1:4]

NOTE: the import of "strings" was missed in the previous PR ROCm#116, causing an "undefined: strings" compilation error. Added that import here, as it was preventing my changes from compiling.
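A condensed sketch of the heterogeneous-node bucketing described above, so each plugin instance only reports its own devices; the helper name and map layout are illustrative, and the import path for the device plugin API may differ from the repo's:

```go
package main

import (
	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

// groupDevicesByPartitionType buckets plugin devices by partition type so
// that each plugin instance (identified by resourceLastName such as
// "cpx_nps1") reports only its own bucket in ListAndWatch.
func groupDevicesByPartitionType(devInfo map[string]map[string]interface{}) map[string][]*pluginapi.Device {
	grouped := map[string][]*pluginapi.Device{}
	for id, info := range devInfo {
		cp, ok1 := info["computePartition"].(string)
		mp, ok2 := info["memoryPartition"].(string)
		if !ok1 || !ok2 {
			continue // device without partition info
		}
		key := cp + "_" + mp
		grouped[key] = append(grouped[key], &pluginapi.Device{ID: id, Health: pluginapi.Healthy})
	}
	return grouped
}
```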
Force-pushed from 970fceb to fff133d
This commit contains the changes on the node labeller side to invoke the previously defined helper functions and label the node according to whether it is homogeneous or heterogeneous and which partition types are present on it.

Terms to understand before viewing the changes in this commit:
Homogeneous node: if all GPUs in a node follow the same compute and memory partition style, the node is considered homogeneous.
Heterogeneous node: if the GPUs on a node have different compute and memory partition styles, the node is considered heterogeneous (put simply, if the node is not homogeneous).

-> If the node is homogeneous, the node labeller labels the node with `amd.com/compute-memory-partition` and sets it to whichever `partitionType` is currently present on all GPUs of the node. It uses the `devicesCountMap` to determine which `partitionType` to use for the label.
-> If the node is heterogeneous, no label is added by the node labeller, since the resource names reported by the device plugin already denote the types of partition present on the node.
-> If the node has GPUs of model MI-300 or above, the node is labelled with amd.com/compute-partition-supported="true" and amd.com/memory-partition-supported="true". If not, these labels are set to "false". To achieve this, the presence of the `available_compute_partition` and `available_memory_partition` files is checked at the path of the first GPU of the node.
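A rough sketch of that labelling decision, using hypothetical helper and parameter names; only the label keys and values come from the commit message above:

```go
// partitionLabels derives the node labels described above from the
// device-count map and the partition-support check. Names are illustrative,
// not the labeller's exact code.
func partitionLabels(deviceCounts map[string]int, partitionSupported, homogeneous bool) map[string]string {
	labels := map[string]string{
		"amd.com/compute-partition-supported": "false",
		"amd.com/memory-partition-supported":  "false",
	}
	if partitionSupported {
		labels["amd.com/compute-partition-supported"] = "true"
		labels["amd.com/memory-partition-supported"] = "true"
	}
	// Only a homogeneous node gets the compute-memory-partition label;
	// on heterogeneous nodes the per-partition resource names already
	// convey the same information.
	if homogeneous {
		for partitionType := range deviceCounts {
			labels["amd.com/compute-memory-partition"] = partitionType
		}
	}
	return labels
}
```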
-> An Allocator package is introduced to calculate the best possible subset of GPUs to allocate when resources are requested by the device plugin.
-> The device plugin takes care of passing on the GPU's nodeId as part of the device struct, which is used to fetch the link types between that GPU and all other GPUs.
-> The device plugin also passes on the GPU's numaNode, which is used in the scoring logic.
-> Kubelet informs the Allocator how many resources are being requested and what the available devices of that resource type are.
-> The Allocator generates all unique combinations of subsets of size n.
-> For every unique pair (x, y) in each subset, increment score += score(x, y).
-> The subset with the lowest score is picked as the preferred allocation.

Logic for scoring each pair of devices:
-> If x.devId == y.devId, score += 10; otherwise score += 20.
Explanation: if the devIds are the same, the two devices are partitions of the same GPU, which should be considered better than being partitions of different GPUs.
-> If link_type(x, y) == "xgmi", score += 10. If link_type(x, y) == "pcie", score += 40. If link_type(x, y) is anything else, score += 50.
Explanation: an XGMI connection is faster than PCIe. The Allocator fetches the link_type for all links of a node `x` from the values present in:
/sys/class/kfd/kfd/topology/nodes/x/io_links/<int>/properties
/sys/class/kfd/kfd/topology/nodes/x/p2p_links/<int>/properties
-> If x.numaNode == y.numaNode, score += 10; otherwise score += 20.
Explanation: GPUs connected to the same NUMA node are better than GPUs connected to different NUMA nodes.

The score values are picked so that the order of precedence below is followed for the preferred allocation; some values are higher than others because those properties of a pair have a bigger effect on how "good" the pair of GPUs is.
1. XGMI + same GPU + same NUMA node
2. XGMI + different GPU + same NUMA node
3. XGMI + different GPU + different NUMA node
4. PCIe + same GPU + same NUMA node
5. PCIe + different GPU + same NUMA node
6. PCIe + different GPU + different NUMA node
7. Any link type other than XGMI and PCIe

To avoid calculating a large number of subsets and their scores, partitioned GPUs use their original GPU's properties, and we avoid building extra subsets between partitions of the same GPU, since all partitions of one GPU have the same score with all partitions of another GPU.
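A minimal sketch of the pairwise scoring rule described above; the device struct and the way the link type is passed in are stand-ins for the Allocator package's actual types:

```go
// device is a simplified stand-in for the allocator's per-device info.
type device struct {
	devID    int
	numaNode int
}

// pairScore implements the scoring table from the commit message:
// lower is better, so the subset with the lowest total score is preferred.
func pairScore(x, y device, linkType string) int {
	score := 0
	if x.devID == y.devID {
		score += 10 // partitions of the same physical GPU
	} else {
		score += 20
	}
	switch linkType {
	case "xgmi":
		score += 10
	case "pcie":
		score += 40
	default:
		score += 50
	}
	if x.numaNode == y.numaNode {
		score += 10
	} else {
		score += 20
	}
	return score
}
```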
resource_naming_strategy is a new flag which can be passed to the device plugin daemonset. The supported values for the flag are "single" and "mixed".

Terms to understand before viewing the changes in this commit:
Homogeneous node: if all GPUs in a node follow the same compute and memory partition style, the node is considered homogeneous.
Heterogeneous node: if the GPUs on a node have different compute and memory partition styles, the node is considered heterogeneous (put simply, if the node is not homogeneous).

Behaviour of the resource naming strategy for different node types:

Homogeneous node:
-> If the resource naming strategy is "single", one plugin is started using the DevicePluginManager with the last name "gpu". If the strategy is "mixed", one plugin is started using the DevicePluginManager with the last name set to the partition style present on the node.
-> The ListAndWatch function remains almost the same as it was before. It reports resources under a single resource name (either "gpu" or the partition style present on the node, e.g. cpx_nps1, depending on the strategy).

Heterogeneous node:
-> If the resource naming strategy is "mixed", we invoke the DevicePluginManager to start multiple plugins for the different partitionTypes under the names "spx-nps1", "cpx-nps1", etc. We use the devicesCount map to start plugins for the partitionTypes that are present in the map.
-> ListAndWatch sends the devices to the plugin for their respective resource type depending on their partitionType. Each device has computePartition and memoryPartition fields in its object, as shown before, which are used to identify which plugin to report the resource under (amd.com/spx-nps1, amd.com/cpx-nps1, etc.).

Notes:
-> If the node is heterogeneous, the "single" strategy is not supported, as reporting multiple resource types under a single resource name would not accurately reflect how many GPUs of each type there are.
-> For nodes where partitioning is not supported (MI200), the resources are reported under "amd.com/gpu" irrespective of the strategy.
-> If the flag is not set by the user, the default value is "single", to maintain backwards compatibility with the resource name used before the strategy was introduced (amd.com/gpu).
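An illustrative sketch of how the strategy and node type could map to the resource last names that plugins are started for; function and variable names are hypothetical and the PR's actual code may differ:

```go
// resourceLastNames returns the resource last names to start plugins for,
// given the naming strategy and the per-partition device counts.
func resourceLastNames(strategy string, partitionSupported bool, counts map[string]int) []string {
	// Nodes without partition support always report amd.com/gpu.
	if !partitionSupported {
		return []string{"gpu"}
	}
	homogeneous := len(counts) == 1
	if strategy == "single" && homogeneous {
		return []string{"gpu"}
	}
	// Heterogeneous + "single" is rejected elsewhere; here we assume "mixed":
	// one plugin per partition type present, e.g. "cpx_nps1", "spx_nps1".
	names := make([]string, 0, len(counts))
	for partitionType := range counts {
		names = append(names, partitionType)
	}
	return names
}
```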
Force-pushed from fff133d to 955c8a8
This PR adds support for GPU partitions in the Device Plugin and Node Labeller.
It also adds a new topology-aware scheduler for the allocation of GPU resources from k8s.