This document provides basic troubleshooting information for Bare Metal Machine (BMM) resources that report a Degraded status in the BMM detailed status message.
Prerequisites
- Access to the Azure portal or Azure CLI
- Permissions to view and manage Bare Metal Machine resources
- For diagnostic commands: SSH access via BareMetalMachineKeySet (see Manage emergency access to a Bare Metal Machine)
Symptoms
Bare Metal Machines (BMMs) in a Degraded state exhibit the following symptoms:
- The Detailed status message includes one or more Degraded messages as shown in the following table.
- The BMM is automatically cordoned once the resource is continuously degraded for more than 15 minutes (for Compute nodes only).
- The BMM will then remain cordoned for 2 hours after the underlying conditions resolve, after which it will be automatically uncordoned.
- Control and Management nodes can be reported as Degraded, but aren't automatically cordoned.
| Detailed status message | Details and mitigation |
|---|---|
| `Degraded: NIC failed` | An expected Mellanox NIC on the host is failed or missing. See Degraded: NIC Failed. |
| `Degraded: port down` | The physical link is down on one or more Mellanox interfaces. See Degraded: port down. |
| `Degraded: LACP status is down` | LACP is down while the physical links are up. See Degraded: LACP status is down. |
| `Degraded: port flapping` | Two or more physical link state changes within the previous 15 minutes. See Degraded: port flapping. |
Degraded status messages and associated automatic cordoning behavior are present in Azure Operator Nexus version 2502.1 and higher.
Troubleshooting
To check for any Bare Metal Machines (BMMs) that are currently degraded, run `az networkcloud baremetalmachine list -g <ResourceGroup_Name> -o table`.
This command shows the current status of all BMMs in the specified resource group. Any active Degraded conditions are visible in the detailed status message.
To see the current cordoning status, include a `--query` parameter that specifies `cordonStatus`, as shown in the following example.
This command can help to identify any compute nodes which are still automatically cordoned due to recently resolved Degraded conditions.
```azurecli
az networkcloud baremetalmachine list \
  -g <ResourceGroup_Name> \
  --output table \
  --query "[].{name:name,powerState:powerState,provisioningState:provisioningState,readyState:readyState,cordonStatus:cordonStatus,detailedStatus:detailedStatus,detailedStatusMessage:detailedStatusMessage}"
```
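If you prefer to post-process the command output programmatically, a short script can filter the JSON output (from `--output json` instead of `--output table`) down to only the degraded machines. This is an illustrative sketch, not part of the product tooling; the `find_degraded` helper and the sample data are hypothetical, but the field names match the `--query` projection above.

```python
def find_degraded(bmm_list):
    """Return (name, message) pairs for BMMs whose detailed status
    message contains one or more 'Degraded:' markers."""
    return [
        (bmm["name"], bmm["detailedStatusMessage"])
        for bmm in bmm_list
        if "Degraded:" in bmm.get("detailedStatusMessage", "")
    ]

# Sample data shaped like the --query projection above.
machines = [
    {"name": "rack1compute01",
     "detailedStatusMessage": "The OS is provisioned to the machine. Degraded: LACP status is down"},
    {"name": "rack1compute03",
     "detailedStatusMessage": "The OS is provisioned to the machine."},
]

print(find_degraded(machines))  # only rack1compute01 is reported
```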
Example Azure CLI output
This example shows a deployment with two currently degraded BMMs (compute01 and compute04), and two cordoned BMMs (compute02 and compute04).
Not all degraded BMMs are cordoned yet, and not all of the healthy BMMs are uncordoned yet, because of the fixed delays before automatic cordoning and uncordoning take effect.
```
Name              PowerState    ProvisioningState    ReadyState    CordonStatus    DetailedStatus    DetailedStatusMessage
----------------  ------------  -------------------  ------------  --------------  ----------------  ---------------------------------------------------------------------------------------
rack1management1  On            Succeeded            True          Uncordoned      Provisioned       The OS is provisioned to the machine.
rack1compute01    On            Succeeded            True          Uncordoned      Provisioned       The OS is provisioned to the machine. Degraded: LACP status is down
rack1compute02    On            Succeeded            True          Cordoned        Provisioned       The OS is provisioned to the machine.
rack1compute03    On            Succeeded            True          Uncordoned      Provisioned       The OS is provisioned to the machine.
rack1compute04    On            Succeeded            True          Cordoned        Provisioned       The OS is provisioned to the machine. Degraded: port flapping Degraded: port down
```
Additional information about recent degraded conditions and automatic cordoning is available in the following fields on the `bmm` Kubernetes resource:

- `degradedStartTime` and `degradedEndTime` show the start and end time of the most recent degraded state.
- `conditions` shows the status of any individual conditions that are contributing to a degraded state.
- `cordonStatus` indicates whether the node is currently cordoned or uncordoned.
- `annotations` shows which conditions triggered the current cordon, if automatically cordoned:
  - `platform.afo-nc.microsoft.com/lacp-down-cordon`
  - `platform.afo-nc.microsoft.com/port-down-cordon`
  - `platform.afo-nc.microsoft.com/port-flap-cordon`

If the user manually cordoned the BMM, the following annotation is also present:

- `platform.afo-nc.microsoft.com/customer-cordon`
The Activity Logs for the BMM resource in the Azure portal can also provide more information about any recent user-initiated cordon requests.
- The `annotations` metadata on the `bmm` Kubernetes resource shows which condition triggered the cordon.
- The `conditions` status on the `bmm` Kubernetes object shows the current status and timestamp for any triggering conditions.
To view these `bmm` Kubernetes resource fields, use the Azure CLI `run-read-command`, as shown in the following example.
```azurecli
az networkcloud baremetalmachine run-read-command \
  -g <ResourceGroup_Name> \
  -n rack2management2 \
  --limit-time-seconds 60 \
  --commands "[{command:'kubectl get',arguments:[-n,nc-system,bmm,rack2compute08,-o,json]}]" \
  --output-directory .
```
- Replace `<ResourceGroup_Name>` with the name of the resource group containing the BMM resources.
- Replace `rack2management2` with the name of a BMM resource for a healthy Kubernetes control plane node, from which to execute the `kubectl get` command.
- Replace `rack2compute08` with the name of the degraded or cordoned BMM to inspect.
For more information about the run-read-command feature and available diagnostic commands, see Troubleshoot Bare-Metal Machines by Using the run-read Command.
Example run-read-command output (kubectl get bmm):
This example shows an automatically cordoned BMM with two active Degraded conditions.
```json
{
  "metadata": {
    "annotations": {
      "platform.afo-nc.microsoft.com/port-down-cordon": "true",
      "platform.afo-nc.microsoft.com/port-flap-cordon": "true"
    }
  },
  "status": {
    "conditions": [
      {
        "lastTransitionTime": "2025-03-04T02:47:59Z",
        "status": "True",
        "type": "BmmInExpectedLACPState"
      },
      {
        "lastTransitionTime": "2025-03-04T03:27:00Z",
        "message": "Physical link(s) down: 4b_p1",
        "reason": "PortDown",
        "status": "False",
        "type": "BmmNetworkLinksUp"
      },
      {
        "lastTransitionTime": "2025-03-04T03:49:00Z",
        "message": "Port flapping in the last 15 mins: 4b_p1 (2 times)",
        "reason": "PortFlappingDetected",
        "status": "False",
        "type": "BmmNetworkLinksStable"
      }
    ],
    "cordonStatus": "Cordoned",
    "degradedStartTime": "2025-03-04T03:27:00Z",
    "detailedStatus": "Provisioned",
    "detailedStatusMessage": "The OS is provisioned to the machine. Degraded: port flapping Degraded: port down"
  }
}
```
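To pull the relevant fields out of this JSON without reading it by eye, a few lines of scripting suffice. The following sketch is illustrative only; the `failing_conditions` and `cordon_triggers` helpers are hypothetical names, and the sample dictionary is an abbreviated version of the example output above.

```python
# Abbreviated sample shaped like the kubectl get bmm output above.
bmm = {
    "metadata": {"annotations": {
        "platform.afo-nc.microsoft.com/port-down-cordon": "true",
        "platform.afo-nc.microsoft.com/port-flap-cordon": "true",
    }},
    "status": {
        "conditions": [
            {"status": "True", "type": "BmmInExpectedLACPState"},
            {"status": "False", "type": "BmmNetworkLinksUp", "reason": "PortDown"},
            {"status": "False", "type": "BmmNetworkLinksStable", "reason": "PortFlappingDetected"},
        ],
        "cordonStatus": "Cordoned",
    },
}

def failing_conditions(bmm):
    """Condition types currently contributing to the degraded state."""
    return [c["type"] for c in bmm["status"]["conditions"] if c["status"] == "False"]

def cordon_triggers(bmm):
    """Annotations recording which conditions triggered the automatic cordon."""
    return [k for k in bmm["metadata"]["annotations"] if k.endswith("-cordon")]

print(failing_conditions(bmm))  # ['BmmNetworkLinksUp', 'BmmNetworkLinksStable']
print(cordon_triggers(bmm))
```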
Automatic Cordoning
If an uncordoned Compute BMM remains in a Degraded state for more than 15 minutes, the node is automatically cordoned.
- An automatically cordoned node will remain cordoned for 2 hours after the underlying conditions are resolved, after which it will be automatically uncordoned.
- To uncordon a BMM manually, use the `az networkcloud baremetalmachine uncordon` command or execute the Uncordon action from the Azure portal.
- Manually uncordoning a BMM that still has an active degraded condition isn't allowed. In this case, the Uncordon request is rejected with an error message similar to the following:

```
action rejected: baremetalmachine 'rack1compute01' currently degraded since 2025-02-26 05:26:09 +0000 UTC
```
Note: only BMMs used for Compute are automatically cordoned. Control and Management nodes aren't automatically cordoned.
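The cordon and uncordon timing can be worked out from the `degradedStartTime` and `degradedEndTime` timestamps on the `bmm` resource. The following sketch computes when the automatic actions are expected to occur; the helper names are hypothetical, but the 15-minute and 2-hour delays match the behavior described above.

```python
from datetime import datetime, timedelta, timezone

CORDON_DELAY = timedelta(minutes=15)   # continuous degradation before auto-cordon
UNCORDON_DELAY = timedelta(hours=2)    # healthy time before auto-uncordon

def parse_ts(ts):
    """Parse the RFC 3339 timestamps used on the bmm resource."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

def expected_auto_cordon(degraded_start_time):
    """Earliest time a compute BMM is auto-cordoned if it stays degraded."""
    return parse_ts(degraded_start_time) + CORDON_DELAY

def expected_auto_uncordon(degraded_end_time):
    """Time an auto-cordoned BMM is uncordoned after the condition resolves."""
    return parse_ts(degraded_end_time) + UNCORDON_DELAY

print(expected_auto_cordon("2025-03-04T03:27:00Z"))    # 2025-03-04 03:42:00+00:00
print(expected_auto_uncordon("2025-03-04T05:00:00Z"))  # 2025-03-04 07:00:00+00:00
```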
For more information about investigating the root cause of an automatic cordon, see Troubleshooting.
For more information about manually cordoning and uncordoning BMMs, see Make a Bare Metal Machine unschedulable (cordon) and Make a Bare Metal Machine schedulable (uncordon).
Degraded: NIC Failed
This message indicates that one of the expected Mellanox Network Interface Cards (NICs) on the underlying compute host has failed or is missing. It typically indicates a hardware failure on the NIC, or that the card isn't correctly seated in the host.
To troubleshoot this issue:
- To identify the nonoperational NIC, check the Ethernet link status indicators on the underlying compute host.
- Check that the NIC is correctly installed and seated.
- Sign in to the Baseboard Management Controller (BMC) to check the hardware status of the NIC.
- Review detailed hardware logs by generating a Dell TSR (Technical Support Report), as described in the Dell Knowledge Base article Export a SupportAssist Collection Using an iDRAC.
- Review the most recent time of failure reported by the Bare Metal Machine `conditions`, as described in the Troubleshooting section.
- Power cycle the host by executing a Restart action on the Bare Metal Machine resource, and check whether the condition clears.
Example conditions output for NIC failed
```json
"conditions": [
  {
    "lastTransitionTime": "2025-05-21T16:49:29Z",
    "message": "Expected 2 devices in oam-bond, found 1: 98_pf0vf0_vf",
    "reason": "OamDevicesUnhealthy",
    "status": "False",
    "type": "BmmNicsHealthy"
  },
],
```
Degraded: port down
This message in the BMM Detailed status message field indicates that the physical link is down on one or more of the Mellanox interfaces on the underlying compute host. This scenario can indicate a cabling, switch port configuration, or hardware failure.
To troubleshoot this issue:
- Review the `conditions` status of the Kubernetes `bmm` object, as described in the Troubleshooting section. This information should identify the affected port and approximate time of the issue.
- Check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port.
- Check for any recent deployment or infrastructure changes that coincide with the time of failure.
Example conditions output for port down
```json
"conditions": [
  {
    "lastTransitionTime": "2025-03-04T03:27:00Z",
    "message": "Physical link(s) down: 4b_p1",
    "reason": "PortDown",
    "status": "False",
    "type": "BmmNetworkLinksUp"
  },
],
```
Degraded: LACP status is down
This message in the BMM Detailed status message field indicates a Link Aggregation Control Protocol (LACP) failure on the underlying compute host while the physical links are up. This scenario can indicate a cabling or Top Of Rack (TOR) switch configuration issue.
To troubleshoot this issue:
- Review the `conditions` status of the Kubernetes `bmm` object, as described in the Troubleshooting section. This information should identify the affected port and approximate time of the issue.
- Check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port.
- Check whether any other BMMs are also reporting port or LACP issues, which might help to identify any potential mis-cabling or wider issue with the TOR switch or network configuration.
- Check for any recent deployment or infrastructure changes that coincide with the time of failure.
- For more information about diagnosing and fixing LACP issues, see Troubleshoot LACP Bonding.
Warning
In version 2502.1, there's a known issue where `LACP status is down` can be incorrectly reported in addition to a `port is not functioning as expected` message during a port down scenario.
This issue can happen when a BMM is restarted or reimaged while the physical port is down.
In this case, the LACP warning can be safely ignored if the physical port is also down. This issue is fixed in version 2503.1.
Example conditions output for unexpected LACP state
```json
"conditions": [
  {
    "lastTransitionTime": "2025-01-31T12:24:27Z",
    "message": "Error: LACP status for interface 4b_p0 is down, LACP status for interface 4b_p1 is down",
    "reason": "LACP status is down",
    "severity": "Error",
    "status": "False",
    "type": "BmmInExpectedLACPState"
  },
],
```
Degraded: port flapping
This message in the BMM Detailed status message field indicates that one or more of the Mellanox Ethernet ports is experiencing port flapping, defined as two or more changes in the physical link state within the previous 15 minutes. This behavior can indicate a cabling, switch, or hardware issue, or a possible network configuration problem.
To troubleshoot this issue:
- Identify the affected port and approximate time of the issue by reviewing the BMM `conditions`, as described in the Troubleshooting section.
- Check the `degradedStartTime` timestamp on the `bmm` object (if different) for more context about the overall timeline.
- Check the Ethernet cabling and Top Of Rack (TOR) switch for the specified port.
- Check for any other BMMs that are also reporting port flapping or link failures, for information about the scope of the issue or any common cause.
- Check for any recent deployment or infrastructure changes that coincide with the time of failure.
Example conditions output for port flapping
```json
"conditions": [
  {
    "lastTransitionTime": "2025-03-04T03:49:00Z",
    "message": "Port flapping in the last 15 mins: 4b_p1 (2 times)",
    "reason": "PortFlappingDetected",
    "status": "False",
    "type": "BmmNetworkLinksStable"
  },
],
```
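The flap-detection rule described above (two or more link state changes within a 15-minute window) can be expressed in a few lines. This is a sketch of the detection logic for illustration, not the platform's implementation; the `is_flapping` helper and sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

FLAP_WINDOW = timedelta(minutes=15)
FLAP_THRESHOLD = 2  # two or more link state changes within the window

def is_flapping(transition_times, now):
    """Return True if the port flapped within the detection window.

    transition_times: datetimes of physical link state changes.
    """
    recent = [t for t in transition_times if now - t <= FLAP_WINDOW]
    return len(recent) >= FLAP_THRESHOLD

now = datetime(2025, 3, 4, 3, 49)
events = [
    datetime(2025, 3, 4, 3, 40),  # within the last 15 minutes
    datetime(2025, 3, 4, 3, 47),  # within the last 15 minutes
    datetime(2025, 3, 4, 2, 10),  # too old to count
]
print(is_flapping(events, now))  # True: two changes in the last 15 minutes
```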
Related content
- Troubleshoot Warning status messages
- BMM lifecycle management
- Best practices for Bare Metal Machine operations
- If you still have questions, contact Azure support
- For more information about support plans, see Azure support plans