Solved: Policy Install Failed - Problem With The Commit Function

Daniel_Taney · ‎2019-01-03

I have a feeling this one is going to require a call to TAC, but does anyone have any experience troubleshooting it? I have a VSLS VSX Cluster containing 3 Virtual Switches and 5 Virtual Systems, and one VS fails policy installation with this error. All other VSs install policy just fine. The VSX Cluster is R77.30, managed by an R80.10 SMS.

The strange thing is that it happens once the Policy Install progress hits 99%. It was my understanding (based on this very comprehensive and helpful writeup) that the Policy Install procedure was all but completed once the progress bar hit 99%?

When I look at vsx stat -v, it appears that VSX thinks the policy installed. The "Installed at" time matches the time the Policy Install failed.

I am able to verify the Policy in SmartConsole without any errors, but I'm not sure where to begin troubleshooting, since the Gateway appears to think the policy installed successfully while the management server does not.

Thanks!

R80 CCSA / CCSE

Kaspars_Zibarts · ‎2019-01-03

How many cluster members do you have, and did you verify that the policy installed on all of them? I would start with the cpd.elg logs.

Daniel_Taney · ‎2019-01-04

There are 2 members in the cluster. I didn't think to check both cluster members; out of habit I was checking only the one that the VS is active on. However, it does appear that the policy is not installing on the other cluster member. The install dates differ between the two.
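For anyone who wants to automate this comparison, here is a minimal Python sketch. It assumes the "Installed at" values have already been copied out of vsx stat -v on each member, and that they look like "4Jan2019 10:15:42"; both the timestamp layout and the names (find_stale_installs, member1, VS1) are illustrative assumptions, not something taken from the actual output in this thread.

```python
from datetime import datetime

# Assumed timestamp layout, e.g. "4Jan2019 10:15:42". Adjust to match your output.
TIME_FORMAT = "%d%b%Y %H:%M:%S"


def find_stale_installs(member_times, tolerance_seconds=300):
    """Return the VS names whose install times disagree across cluster members.

    member_times maps a member name to {vs_name: "Installed at" string}.
    A VS is reported if it is missing on any member, or if its install
    times differ by more than tolerance_seconds.
    """
    stale = []
    all_vs = sorted({vs for times in member_times.values() for vs in times})
    for vs in all_vs:
        parsed = [
            datetime.strptime(times[vs], TIME_FORMAT)
            for times in member_times.values()
            if vs in times
        ]
        if len(parsed) < len(member_times):
            stale.append(vs)  # not reported by at least one member
            continue
        spread = (max(parsed) - min(parsed)).total_seconds()
        if spread > tolerance_seconds:
            stale.append(vs)
    return stale
```

The five-minute tolerance is arbitrary; it only exists so that members that finish a few seconds apart are not flagged.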

I'll start digging through the cpd.elg logs on this Gateway and see if anything interesting and relevant pops up. Thanks for the suggestion!

Daniel_Taney · ‎2019-01-04

The plot thickens... it looks like there is a positive confirmation the policy installed on the Cluster Member that shows the current policy install date:

However, it seems to just enter/exit "addon end_handler" without showing any confirmation that the policy install succeeded (or failed) on the Gateway that doesn't show the current policy install date.

I was hoping for something a little more definitive in the logs pointing to a reason. But there is definitely a difference between the two cluster members.

I wonder if the cpd process just needs restarting? Maybe it's time for a cpstop/start in a maintenance window?

Kaspars_Zibarts · ‎2019-01-04

There are a couple of SKs about cpd debug that would show you more messages in the log. For example:

How to debug CPD daemon

Additionally, check fwm.elg. But yes, if you have it as an option, reboot the gateway that exhibits the problem. Also check which jumbo hotfix you are on and whether there's a newer version that might have fixes for cpd, policy installation, or VSX.

Daniel_Taney · ‎2019-01-04

Just to be clear, the fwm.elg log only exists on the SMS side, correct?

Thanks for the SK on debugging CPD. I got about 140,000+ lines of output when I ran it while pushing policy. I'm thinking this may be the point where it is more beneficial to engage TAC because without any guidance of what I'm looking for, it seems like searching for a needle in a haystack.
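One way to make a dump of that size more manageable before handing it to TAC is a first-pass keyword filter. The Python sketch below is only that: the keywords are guesses rather than known cpd debug messages, and the function name interesting_lines is made up for illustration.

```python
import re

# Hypothetical search terms. TAC or the relevant SK may point to better ones.
KEYWORDS = ("error", "fail")


def interesting_lines(lines, keywords=KEYWORDS):
    """Return (line_number, text) pairs for lines containing any keyword.

    Matching is case-insensitive and line numbers start at 1, so the hits
    can be located again in the full debug file.
    """
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if pattern.search(line)
    ]
```

An open file object can be passed directly as lines, so the whole debug output never has to be held in memory as a single string.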

If nothing else, I can arm TAC with a lot of information when opening the SR to hopefully move things along quickly!

Kaspars_Zibarts · ‎2019-01-04

BTW, what does the vsx stat -v output look like on that gateway? Is SIC established to the failing VS?

Daniel_Taney · ‎2019-01-04

Everything looks OK to my eyes:

Vladimir · ‎2019-01-03

Please check the status of free RAM on the policy installation target.
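As a rough illustration of such a check, the Python sketch below parses text in the /proc/meminfo format and estimates how much memory is free or easily reclaimable. It counts MemFree plus Buffers plus Cached because older kernels do not report MemAvailable; that approximation, and the function names, are assumptions made for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kilobytes} dictionary."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def reclaimable_fraction(meminfo):
    """Approximate the share of RAM that is free or used only as cache."""
    free_kb = (
        meminfo["MemFree"]
        + meminfo.get("Buffers", 0)
        + meminfo.get("Cached", 0)
    )
    return free_kb / meminfo["MemTotal"]
```

On a Linux host the input would come from reading /proc/meminfo; what fraction counts as "too low" for a policy push is not something this thread establishes.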

Daniel_Taney · ‎2019-01-04

Thanks for the suggestion! I think the memory looks pretty good:

JozkoMrkvicka · ‎2019-01-03

Check the free HDD space on the management server and/or the VSX gateways.

Kind regards,
Jozko Mrkvicka

Daniel_Taney · ‎2019-01-04

This seemed like a great place to start, but disk space looks pretty good. Both VSX cluster members are using the same amount of disk space. The SMS should have plenty, too!
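For a repeatable version of this check, something like the following Python sketch could be run on any host where Python 3 is available. The 10% threshold and the function names are arbitrary assumptions for illustration, not a Check Point recommendation.

```python
import shutil


def free_fraction(path):
    """Return the fraction of the filesystem holding path that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def low_on_space(paths, min_free=0.10):
    """Return the paths whose filesystem has less than min_free space left."""
    return [path for path in paths if free_fraction(path) < min_free]
```

Calling low_on_space(["/", "/var/log"]) returns only the paths that need attention, so an empty list means every checked filesystem is above the threshold.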

_Val_ · ‎2019-01-06

Please open an SR with TAC if you have not done so already.

Daniel_Taney · ‎2019-01-07

Yes, I opened an SR on Friday. TAC supplied a policy debug script. I expect we’ll make some good progress today once I start working through that process.

Thanks,

Dan

Daniel_Taney · ‎2019-01-22

This one is still making the rounds with TAC. We were provided a vs_reconfigure BASH script to run against the VS to rebuild it. While the script seemed to run successfully, and the GW was able to pull policy from the SMS, we are still unable to push policy to it.

Now, we do get a SIC error despite the SIC status still showing as Trust in the output of vsx stat -v.

Strangely, I am able to modify the route table and push the VS config successfully through SmartConsole. I have a feeling we will be resetting SIC on the individual VS, but it seems strange that everything seems to work and communicate up to a certain point.

Very strange... I'll do my best to update with whatever the resolution ends up being as this seems to be a unique one!

Kaspars_Zibarts · ‎2019-01-22

I would probably do a full rebuild of the box. First run reset_gw on the firewall (this way you keep all the basic non-VSX config), and then vsx_util reconfigure on the management server. Something seems very "stuck" there if TAC has not been able to resolve it so far. I'm not sure how many times you have done it, but it's not as complicated or dangerous as it sounds. I would avoid resetting SIC on an individual VS; I have never had full success with it. Something always didn't work correctly at the end.

Daniel_Taney · ‎2019-01-22

I wasn't aware reset_gw was an option. Would you just run this from vs0 to basically blow away all the VSX config from the Gateway?

By "keep all basic non-vsx config", I'm taking that to mean the underlying GAIA config and OS remain. So, this isn't a rebuild in the sense of a total reinstall of GAIA + HFAs on the GW? I'm familiar with the "vsx_util reconfigure" process and am pretty comfortable with that.

I was thinking along these lines, but I didn't realize there was a way to remove just the VSX configs! It would save a lot of headache of having to reimage the appliance and put everything back in place. I can mention it to the folks at TAC working with me. Is there an SK explaining this anywhere?

Thanks for the input, this could be very helpful!

-Dan

Kaspars_Zibarts · ‎2019-01-22

Yep! It has saved me on a number of occasions, and it is a very popular command in my lab, where I rebuild gateways constantly to test stuff.

Just save your show configuration output as plain text first, just in case, of course.

Daniel_Taney · ‎2019-01-22

Thanks again for this suggestion. I think this is the way to go. Now, I just need to get this squeezed into a maintenance window!

Kaspars_Zibarts · ‎2019-01-22

If you have two VMs available, I would always suggest lab testing first, just to make sure. And don't forget the snapshot.

Daniel_Taney · ‎2019-01-23

Precisely how I planned on spending my day today!

Daniel_Taney · ‎2019-01-24

Just to close the loop here in case anyone else should encounter this problem, the final solution was to perform a SIC reset on the individual VS as outlined in sk34098.

Kaspars Zibarts's suggestion of reset_gw would also have worked, since that procedure performs a full SIC reset as part of the vsx_util reconfigure process.

In the end, it came down to TAC feeling very certain the individual SIC reset would resolve it and my ability to try the SIC reset during the normal course of troubleshooting vs. waiting for a maintenance window to do the reset_gw.

Thanks to all who contributed their suggestions and help here! If nothing else, I learned a handful of other troubleshooting steps and commands through this thread that I otherwise wouldn't have!
