
List:       freeipmi-devel
Subject:    [bug #64792] Bad IPMI DCMI response from Huawei and Xfusion BMCs
From:       Ole Holm Nielsen <INVALID.NOREPLY () gnu ! org>
Date:       2023-10-19 8:15:02
Message-ID: 20231019-081500.sv348981.37160 () savannah ! gnu ! org

URL:
  <https://savannah.gnu.org/bugs/?64792>

                 Summary: Bad IPMI DCMI response from Huawei and Xfusion BMCs
                   Group: GNU FreeIPMI
               Submitter: oleholmnielsen
               Submitted: Thu 19 Oct 2023 08:15:00 AM UTC
                Category: None
                Severity: 3 - Normal
                Priority: 5 - Normal
              Item Group: None
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any
        Operating System: None


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Thu 19 Oct 2023 08:15:00 AM UTC By: Ole Holm Nielsen <oleholmnielsen>
We have successfully integrated the development version of FreeIPMI (1.7.0)
into our Linux cluster, which uses the Slurm resource manager.  My test is
described in https://bugs.schedmd.com/show_bug.cgi?id=17639#c55 and I have
documented the FreeIPMI setup on my Slurm Wiki page
https://wiki.fysik.dtu.dk/Niflheim_system/Slurm_configuration/#freeipmi-issues

Now we would like to deploy Slurm including the FreeIPMI power monitoring, but
we have discovered a snag:

We have 196 older Huawei XH620 V3 nodes (Intel Broadwell) whose BMCs don't
seem to support the IPMI DCMI extensions.  A colleague at another university
has the same problem with brand new Xfusion FusionOne HPC 1288H V6 servers
(Intel IceLake, essentially rebranded Huawei servers), even though the BMC in
those servers is documented to support DCMI 1.5!

On the Huawei and Xfusion nodes we get this error message:

$ ipmi-dcmi --get-system-power-statistics
ipmi_cmd_dcmi_get_power_reading: command invalid or unsupported

Because of this error, Slurm logs (spams) the following message in slurmd.log
every minute: "error: _get_dcmi_power_reading: get DCMI power reading failed"

I've tried to find out how to query the Huawei BMC with IPMI DCMI but I only
get error messages:

$ ipmi-dcmi --get-dcmi-capability-info
ipmi_cmd_dcmi_get_dcmi_capability_info_supported_dcmi_capabilities: bad
completion code

I also tried each of the WORKAROUNDS listed in the ipmi-dcmi manual page, but
every one of them returns the same error.

The debug option gives some details:

$ ipmi-dcmi --get-dcmi-capability-info --debug
=====================================================
Group Extension - Get DCMI Capability Info Request
=====================================================
[               1h] = cmd[ 8b]
[              DCh] = group_extension_identification[ 8b]
[               1h] = parameter_selector[ 8b]
=====================================================
Group Extension - Get DCMI Capability Info Response
=====================================================
[               1h] = cmd[ 8b]
[              D6h] = comp_code[ 8b]
ipmi_cmd_dcmi_get_dcmi_capability_info_supported_dcmi_capabilities: bad
completion code
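
For reference, the D6h value in the comp_code field above can be looked up in
the generic completion code table of the IPMI 2.0 specification.  The
following is only a minimal Python sketch: it carries a small subset of that
table, and the helper name describe_completion_code is made up for
illustration and is not part of FreeIPMI.

```python
# Subset of the generic IPMI completion codes (IPMI 2.0 specification).
# The helper name and the choice of subset are illustrative assumptions.
COMPLETION_CODES = {
    0x00: "Command completed normally",
    0xC1: "Invalid command",
    0xC9: "Parameter out of range",
    0xCC: "Invalid data field in request",
    0xD4: "Insufficient privilege level",
    0xD5: "Command or request parameter(s) not supported in present state",
    0xD6: "Parameter is illegal because command sub-function has been "
          "disabled or is unavailable",
    0xFF: "Unspecified error",
}


def describe_completion_code(code: int) -> str:
    """Return a short description for an IPMI completion code byte."""
    return COMPLETION_CODES.get(
        code, f"Unknown or command-specific code {code:02X}h"
    )


if __name__ == "__main__":
    # comp_code taken from the debug dump above
    print(f"D6h: {describe_completion_code(0xD6)}")
```

If this reading is right, the BMC recognises the DCMI command but reports the
requested sub-function as disabled or unavailable, which is different from
rejecting the command as unknown (C1h).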

The non-DCMI commands seem to be working correctly.  For example, I can read
the system power:

$ ipmi-sensors -t Power_Unit
ID  | Name         | Type       | Reading    | Units | Event
22  | Power        | Power Unit | 296.00     | W     | 'OK'
(lines deleted)
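
Since the sensor reading works, the power value could in principle be taken
from this table instead of from DCMI.  The sketch below is only an
illustration of parsing the output shown above.  The function name
parse_power_watts is hypothetical, the sensor name "Power" is taken from this
one sample and may differ on other BMCs, and in real use the text would come
from running ipmi-sensors rather than from a string constant.

```python
def parse_power_watts(sensors_output: str, sensor_name: str = "Power") -> float:
    """Extract a reading in watts from ipmi-sensors table output.

    Assumes the six pipe-separated columns seen in the sample:
    ID, Name, Type, Reading, Units, Event.
    """
    for line in sensors_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        # Skip the header, unrelated sensors, and malformed lines.
        if len(fields) < 6 or fields[1] != sensor_name:
            continue
        if fields[4] != "W":
            raise ValueError(f"unexpected units {fields[4]!r}")
        # float() raises ValueError if the reading is not numeric (e.g. N/A).
        return float(fields[3])
    raise LookupError(f"sensor {sensor_name!r} not found")


# Sample copied from the ipmi-sensors output above.
SAMPLE = """\
ID  | Name         | Type       | Reading    | Units | Event
22  | Power        | Power Unit | 296.00     | W     | 'OK'
"""

if __name__ == "__main__":
    print(parse_power_watts(SAMPLE))
```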

Question: Would a WORKAROUND be feasible to implement for Huawei and Xfusion
servers?  If so, how can we help by providing debugging information?

Or is there some other way to get the DCMI extensions to work?

Thanks a lot,
Ole







    _______________________________________________________
File Attachments:


-------------------------------------------------------
Date: Thu 19 Oct 2023 08:15:00 AM UTC  Name: bmc-info.log  Size: 2KiB
By: oleholmnielsen
Output from bmc-info
<http://savannah.gnu.org/bugs/download.php?file_id=55257>

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?64792>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/

