[Issue]: Product-name information missing for MI300 GPUs in AKS #112

Open
lalitmsft opened this issue Mar 12, 2025 · 1 comment


lalitmsft commented Mar 12, 2025

Problem Description

We are using AMD MI300X GPUs in AKS. We use the node-labeller DaemonSet (amdgpu-labeller) to discover the product name; however, on these nodes the product-name label is not set.

Args passed to the container:

    - -vram
    - -cu-count
    - -simd-count
    - -device-id
    - -family
    - -product-name
    - -driver-version
    - -driver-src-version
    - -firmware

SKU of the node: Standard_ND96isr_MI300X_v5
Labels added by the AMD labeller on the node:

    amd.com/gpu.cu-count: "304"
    amd.com/gpu.device-id: 74b5
    amd.com/gpu.driver-src-version: 0A82C3819E55873715679EA
    amd.com/gpu.driver-version: 6.8.5
    amd.com/gpu.family: AI
    amd.com/gpu.simd-count: "1216"
    amd.com/gpu.vram: 191G
    beta.amd.com/gpu.cu-count: "304"
    beta.amd.com/gpu.cu-count.304: "8"
    beta.amd.com/gpu.device-id: 74b5
    beta.amd.com/gpu.device-id.74b5: "8"
    beta.amd.com/gpu.family: AI
    beta.amd.com/gpu.family.AI: "8"
    beta.amd.com/gpu.firmware.CE.feat.0: "8"
    beta.amd.com/gpu.firmware.CE.fw.0: "8"
    beta.amd.com/gpu.firmware.MC.feat.0: "8"
    beta.amd.com/gpu.firmware.MC.fw.0: "8"
    beta.amd.com/gpu.firmware.ME.feat.0: "8"
    beta.amd.com/gpu.firmware.ME.fw.0: "8"
    beta.amd.com/gpu.firmware.MEC.feat.49: "8"
    beta.amd.com/gpu.firmware.MEC.fw.160: "8"
    beta.amd.com/gpu.firmware.PFP.feat.0: "8"
    beta.amd.com/gpu.firmware.PFP.fw.0: "8"
    beta.amd.com/gpu.firmware.RLC.feat.1: "8"
    beta.amd.com/gpu.firmware.RLC.fw.64: "8"
    beta.amd.com/gpu.firmware.SDMA0.feat.44: "8"
    beta.amd.com/gpu.firmware.SDMA0.fw.21: "8"
    beta.amd.com/gpu.firmware.SMC.feat.0: "8"
    beta.amd.com/gpu.firmware.SMC.fw.5600514: "8"
    beta.amd.com/gpu.firmware.UVD.feat.0: "8"
    beta.amd.com/gpu.firmware.UVD.fw.0: "8"
    beta.amd.com/gpu.firmware.VCE.feat.0: "8"
    beta.amd.com/gpu.firmware.VCE.fw.0: "8"
    beta.amd.com/gpu.simd-count: "1216"
    beta.amd.com/gpu.simd-count.1216: "8"
    beta.amd.com/gpu.vram: 191G
    beta.amd.com/gpu.vram.191G: "8"

For comparison, these are the labels on a different Azure SKU, Standard_nd96asr_mi200_v4:

    amd.com/gpu.cu-count: "110"
    amd.com/gpu.device-id: 740c
    amd.com/gpu.driver-src-version: 0A82C3819E55873715679EA
    amd.com/gpu.driver-version: 6.8.5
    amd.com/gpu.family: AI
    amd.com/gpu.product-name: AMD_INSTINCT_MI250X_MCM_OAM_AC_MBA_MSFT
    amd.com/gpu.simd-count: "440"
    amd.com/gpu.vram: 64G
    beta.amd.com/gpu.cu-count: "110"
    beta.amd.com/gpu.cu-count.110: "16"
    beta.amd.com/gpu.device-id: 740c
    beta.amd.com/gpu.device-id.740c: "16"
    beta.amd.com/gpu.family: AI
    beta.amd.com/gpu.family.AI: "16"
    beta.amd.com/gpu.firmware.CE.feat.0: "16"
    beta.amd.com/gpu.firmware.CE.fw.0: "16"
    beta.amd.com/gpu.firmware.MC.feat.0: "16"
    beta.amd.com/gpu.firmware.MC.fw.0: "16"
    beta.amd.com/gpu.firmware.ME.feat.0: "16"
    beta.amd.com/gpu.firmware.ME.fw.0: "16"
    beta.amd.com/gpu.firmware.MEC.feat.46: "16"
    beta.amd.com/gpu.firmware.MEC.fw.83: "16"
    beta.amd.com/gpu.firmware.PFP.feat.0: "16"
    beta.amd.com/gpu.firmware.PFP.fw.0: "16"
    beta.amd.com/gpu.firmware.RLC.feat.1: "16"
    beta.amd.com/gpu.firmware.RLC.fw.17: "16"
    beta.amd.com/gpu.firmware.SDMA0.feat.44: "16"
    beta.amd.com/gpu.firmware.SDMA0.fw.8: "16"
    beta.amd.com/gpu.firmware.SMC.feat.0: "16"
    beta.amd.com/gpu.firmware.SMC.fw.4471808: "16"
    beta.amd.com/gpu.firmware.UVD.feat.0: "16"
    beta.amd.com/gpu.firmware.UVD.fw.0: "16"
    beta.amd.com/gpu.firmware.VCE.feat.0: "16"
    beta.amd.com/gpu.firmware.VCE.fw.0: "16"
    beta.amd.com/gpu.product-name: AMD_INSTINCT_MI250X_MCM_OAM_AC_MBA_MSFT
    beta.amd.com/gpu.product-name.AMD_INSTINCT_MI250X_MCM_OAM_AC_MBA_MSFT: "16"
    beta.amd.com/gpu.simd-count: "440"
    beta.amd.com/gpu.simd-count.440: "16"
    beta.amd.com/gpu.vram: 64G
    beta.amd.com/gpu.vram.64G: "16"

When I run rocm-smi --showproductname, it works on both GPUs (MI200 and MI300X).
Output on MI300:

GPU[0]          : Card Series:          Aqua Vanjaram [Instinct MI300X VF]
GPU[0]          : Card Model:           0x74b5
GPU[0]          : Card Vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             M3000100
GPU[0]          : Subsystem ID:         0x74a1
GPU[0]          : Device Rev:           0x00
GPU[0]          : Node ID:              2
GPU[0]          : GUID:                 65402
GPU[0]          : GFX Version:          gfx942

Output on MI200:

GPU[0]          : Card Series:          AMD INSTINCT MI250X (MCM) OAM AC MBA MSFT
GPU[0]          : Card Model:           0x740c
GPU[0]          : Card Vendor:          Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0]          : Card SKU:             D65205
GPU[0]          : Subsystem ID:         0x0b0c
GPU[0]          : Device Rev:           0x01
GPU[0]          : Node ID:              4
GPU[0]          : GUID:                 10595
GPU[0]          : GFX Version:          gfx9010

The expectation is that product-name should be available on all GPUs, both MI200 and MI300.

Operating System

NAME="Ubuntu" VERSION="22.04.5 LTS (Jammy Jellyfish)"

CPU

model name : Intel(R) Xeon(R) Platinum 8480C

GPU

Name: Intel(R) Xeon(R) Platinum 8480C
Marketing Name: Intel(R) Xeon(R) Platinum 8480C
Name: Intel(R) Xeon(R) Platinum 8480C
Marketing Name: Intel(R) Xeon(R) Platinum 8480C
Name: gfx942
Marketing Name: AMD Instinct MI300X VF
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Name: gfx942
Marketing Name: AMD Instinct MI300X VF
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Name: gfx942
Marketing Name: AMD Instinct MI300X VF
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Name: gfx942
Marketing Name: AMD Instinct MI300X VF
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Name: gfx942
Marketing Name: AMD Instinct MI300X VF
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Name: gfx942
Marketing Name: AMD Instinct MI300X VF
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Name: gfx942
Marketing Name: AMD Instinct MI300X VF
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
Name: gfx942
Marketing Name: AMD Instinct MI300X VF
Name: amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-

ROCm Version

6.2.60204-1

ROCm Component

No response

Steps to Reproduce

No response

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

No response

Additional Information

No response


lalitmsft commented Mar 12, 2025

Looking further at the code:

	"product-name": func(gpus map[string]map[string]int) map[string]string {
		counts := map[string]int{}
		replacer := strings.NewReplacer(" ", "_", "(", "", ")", "")

		for _, v := range gpus {
			prodnamePath := fmt.Sprintf("/sys/class/drm/card%d/device/product_name", v["card"])
			b, err := ioutil.ReadFile(prodnamePath)
			if err != nil {
				log.Error(err, prodnamePath)
				continue
			}
			prodName := replacer.Replace(strings.TrimSpace(string(b)))
			if prodName == "" {
				continue
			}
			counts[prodName]++
		}

		return createLabels("product-name", counts)
	},

I see it looks for the product_name file under a specific path, /sys/class/drm/card0/device.
On MI300 I cannot see the product_name file.
Output of ls /sys/class/drm/card0/device on MI300:

aer_dev_correctable       current_memory_partition  gpu_metrics     mem_busy_percent         numa_node                          pp_dpm_mclk        remove        run_cleaner_shader          xgmi_hive_info
aer_dev_fatal             d3cold_allowed            hwmon           mem_info_gtt_total       pcie_replay_count                  pp_dpm_sclk        rescan        subsystem                   xgmi_num_hops
aer_dev_nonfatal          device                    i2c-0           mem_info_gtt_used        pm_metrics                         pp_dpm_socclk      reset         subsystem_device            xgmi_num_links
ari_enabled               dma_mask_bits             i2c-1           mem_info_preempt_used    pm_policy                          pp_dpm_vclk        reset_method  subsystem_vendor            xgmi_physical_id
board_info                driver                    ip_discovery    mem_info_vis_vram_total  power                              pp_features        resource      thermal_throttling_logging  xgmi_port_num
broken_parity_status      driver_override           irq             mem_info_vis_vram_used   power_dpm_force_performance_level  pp_force_state     resource0     uevent
class                     drm                       link            mem_info_vram_total      power_dpm_state                    pp_num_states      resource0_wc  unique_id
config                    enable                    local_cpulist   mem_info_vram_used       power_state                        pp_od_clk_voltage  resource2     vbios_version
consistent_dma_mask_bits  enforce_isolation         local_cpus      modalias                 pp_cur_state                       pp_table           resource2_wc  vendor
current_link_speed        fw_version                max_link_speed  msi_bus                  pp_dpm_dclk                        ras                resource5     xgmi_device_id
current_link_width        gpu_busy_percent          max_link_width  msi_irqs                 pp_dpm_fclk                        reg_state          revision      xgmi_error

However, it is present on MI200.

aer_dev_correctable       current_link_speed  enable             link                mem_info_preempt_used    numa_node                          pp_dpm_fclk        product_name    resource0_wc        subsystem_vendor            xgmi_num_hops
aer_dev_fatal             current_link_width  enforce_isolation  local_cpulist       mem_info_vis_vram_total  pcie_bw                            pp_dpm_mclk        product_number  resource2           thermal_throttling_logging  xgmi_num_links
aer_dev_nonfatal          d3cold_allowed      fru_id             local_cpus          mem_info_vis_vram_used   pcie_replay_count                  pp_dpm_sclk        ras             resource2_wc        uevent                      xgmi_physical_id
ari_enabled               device              fw_version         manufacturer        mem_info_vram_total      pm_policy                          pp_dpm_socclk      remove          resource5           unique_id
board_info                df_cntr_avail       gpu_busy_percent   max_link_speed      mem_info_vram_used       power                              pp_features        rescan          revision            vbios_version
broken_parity_status      dma_mask_bits       gpu_metrics        max_link_width      mem_info_vram_vendor     power_dpm_force_performance_level  pp_force_state     reset           run_cleaner_shader  vendor
class                     driver              hwmon              mem_busy_percent    modalias                 power_dpm_state                    pp_num_states      reset_method    serial_number       xgmi_device_id
config                    driver_override     i2c-0              mem_info_gtt_total  msi_bus                  power_state                        pp_od_clk_voltage  resource        subsystem           xgmi_error
consistent_dma_mask_bits  drm                 irq                mem_info_gtt_used   msi_irqs                 pp_cur_state                       pp_table           resource0       subsystem_device    xgmi_hive_info

Is there some other path on MI300 where the product_name file is present?
