VSX Health Check: Warning: System diagnostics failed on the following tests: BMAC VMAC verify

Tommy_Forrest · ‎2023-03-02

Hi everyone, I wanted to write up my experience with this system diagnostic message. There's not a lot out there on Google or in the SKs about this issue.

This error started presenting itself recently on a Maestro/VSX cluster made up of four 28600 SGMs running R81 JHT68.

When we ran asg diag verify, we noticed that test 21 - BMAC VMAC verify failed.

The test recommends running mac_verifier -x to get more data. When we did this, we found 3 interfaces that were complaining.

They looked something like this:

Device:kern, Interface: wrpj320, Member XX on SGM: 1_01 - has zero BMAC address
Device:kern, Interface: wrpj320, Member XX on SGM: 1_02 - has zero BMAC address
Device: vs0, Interface: wrpj320, Member XX on SGM: 1_01 - has zero BMAC address
Device: vs0, Interface: wrpj320, Member XX on SGM: 1_02 - has zero BMAC address

This would go on and on as it cycled through the Member count (and when we had all 4 SGMs in the cluster it would repeat for them as well).
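Since the output repeats for every member and SGM, it can help to collapse it into one line per interface. Below is a minimal Python sketch that does this. The regular expression is modelled only on the sample lines quoted above, so treat the line format as an assumption; real mac_verifier output may differ. The function name summarize_zero_bmac is made up for illustration.

```python
import re
from collections import defaultdict

# Pattern modelled on the sample lines in this post (an assumption);
# actual mac_verifier output may be formatted differently.
LINE_RE = re.compile(
    r"Device:\s*(?P<device>[^,\s]+),\s*"
    r"Interface:\s*(?P<interface>[^,\s]+),\s*"
    r"Member\s+(?P<member>\S+)\s+on\s+SGM:\s*(?P<sgm>\S+)\s*-\s*"
    r"has zero BMAC address"
)


def summarize_zero_bmac(text):
    """Map each interface to the sorted list of SGMs reporting a zero BMAC."""
    found = defaultdict(set)
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            found[match.group("interface")].add(match.group("sgm"))
    return {iface: sorted(sgms) for iface, sgms in found.items()}


# Sample input copied from the lines shown in this post.
SAMPLE = """\
Device:kern, Interface: wrpj320, Member XX on SGM: 1_01 - has zero BMAC address
Device:kern, Interface: wrpj320, Member XX on SGM: 1_02 - has zero BMAC address
Device: vs0, Interface: wrpj320, Member XX on SGM: 1_01 - has zero BMAC address
Device: vs0, Interface: wrpj320, Member XX on SGM: 1_02 - has zero BMAC address
"""

if __name__ == "__main__":
    for iface, sgms in summarize_zero_bmac(SAMPLE).items():
        print(f"{iface}: zero BMAC reported on {', '.join(sgms)}")
```

Running the same summary on saved output before and after each change gives a quick way to see which SGMs have dropped out of the report.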

Aside from this issue, we did not observe any issues with the cluster operation.

At first (working with Diamond TAC) we thought it was just an angry SGM. Since this cluster isn't production, we started yanking SGMs from the SG and adding them back.

But that did not resolve the issue.

Long story short, we discovered the interfaces in the active db config on the SGMs. We believe they were orphaned. Earlier in this cluster's life we had built VSes and vswitches, and later removed them to unify the SGIDs with how VSes and vswitches were assigned on the production cluster.

Working with TAC we used db get/set commands on the SGMs to remove the errant interfaces. I'm not going to post the exact commands we used because, well, you shouldn't be using said commands (especially in VSX) without a lot of adult supervision.

To start, we decided to delete 1 of the 3 errant entries (and would later follow up on the other two entries). The proper order of things:

Delete the entry on the SMO, save your db changes.

Delete the entry on the next SGM, save your changes and reboot.

Once the SGM comes back, verify your changes took effect and you didn't break anything.

Rinse and repeat for any other SGMs in the cluster.

Finally, reboot the SMO.
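For anyone who wants the ordering as a checklist, here is a small Python sketch that just prints the steps above in sequence for a given set of SGMs. It does not touch the database or run any Check Point command; the actual db commands are deliberately left out. The function name plan_cleanup_order and the SGM names in the example are placeholders for illustration only.

```python
def plan_cleanup_order(smo, other_sgms):
    """Return the ordered steps for removing one orphaned entry cluster-wide.

    smo        -- name of the SMO (placeholder, e.g. "1_01")
    other_sgms -- names of the remaining SGMs, in the order you will work on them
    """
    # The SMO is changed first, but it is rebooted last.
    steps = [f"{smo}: delete the entry and save the db changes (no reboot yet)"]
    for sgm in other_sgms:
        steps.append(f"{sgm}: delete the entry, save the changes and reboot")
        steps.append(f"{sgm}: once it is back, verify the change and check nothing broke")
    steps.append(f"{smo}: reboot")
    return steps


if __name__ == "__main__":
    # Example SGM names are assumptions, not taken from a real cluster.
    for number, step in enumerate(plan_cleanup_order("1_01", ["1_02", "1_03", "1_04"]), 1):
        print(f"{number}. {step}")
```

The point of the sketch is only the sequencing: the SMO's entry is removed first, every other SGM is changed and rebooted one at a time, and the SMO reboot comes at the very end.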

In the end "mac_verifier -x" showed us the changes as we were making them. After we'd completed all the changes to the 2nd SGM, we no longer saw 1_02 showing up in the verifier. Once the SMO rebooted the command returned no data. asg diag verify also got happy.

I believe at some point an SK article is going to be written up on this subject. When it gets published, I'll update this posting.

Lari_Luoma · ‎2023-03-03

Thanks Tommy for explaining this. Bottom line is that you should always open an SR with TAC and not modify the database manually on your own. 🙂

genisis__ · ‎2023-03-04

Totally agree. In fact, this sounds similar to a VSX issue I had in the past with orphaned interfaces. Needless to say, the steering wheel was firmly handed over to R&D.

the_rock · ‎2023-03-04

Great post, thanks for sharing 🙌

VSX Health Check: Warning: System diagnostics failed on the following tests: BMAC VMAC verify