Bob_Zimmerman
Authority

Finding differences between cluster members

I recently ran into a situation where one member of one of my clusters was aware of 15 VIPs and the other member was only aware of 13. It was annoying to find and to fix, so I created a small script to help me confirm whether my environment had any other instances of the problem. I thought it might be useful for others.

The script uses the management API to get a list of all CMAs on an MDS. If it's not on an MDS, it creates a fake CMA to represent the SmartCenter. It then iterates through the CMAs and looks for normal (non-VSX) clusters. Once it has the list of clusters, it dumps each of them to get a list of the members. It then uses cprid_util to connect to each member and run a command or set of commands, and it saves the output to files in /tmp on the MDS or SmartCenter.

When it is done running the commands on a cluster, it prints the name of that cluster and whether it found any differences.

NOTE: This script has no support for VSX right now, and no support for clusters with more than two members.

 

#!/usr/bin/env bash
######################################################################
### For non-VSX clusters
######################################################################
portNumber=443
unset cmaList
# mdsenv is a shell function defined by CP.sh, so source it first (needed under cron).
. /etc/profile.d/CP.sh

# List every CMA hosted on this MDS as compact JSON rows: ["<name>","<address>"]
cmaList=$(mgmt_cli --port "${portNumber}" -f json -r true show domains limit 500 details-level full \
    | jq -c '.objects[]|{name:.name,server:.servers[]|{host:."multi-domain-server",ipAddress:."ipv4-address"}}' \
    | grep -F "$(hostname)" \
    | jq -c '[.name,.server.ipAddress]')
# Not an MDS: fake a single CMA that represents the SmartCenter.
if [ ${#cmaList} -eq 0 ]; then cmaList="[\"$(hostname)\",\"\"]"; fi

for cmaRow in $cmaList; do
    cmaName=$(echo "${cmaRow}" | jq -r '.[0]')
    cmaAddress=$(echo "${cmaRow}" | jq -r '.[1]')
    mdsenv "${cmaAddress}" 2>/dev/null

    # Collect every non-VSX cluster in this CMA, one JSON row per member.
    firewallList=$(mgmt_cli --port "${portNumber}" -f json -d "${cmaAddress}" -r true show gateways-and-servers limit 500 details-level full \
        | jq -c '.objects[]|.' \
        | grep CpmiGatewayCluster \
        | jq -c '.uid' \
        | xargs -L 1 mgmt_cli --port "${portNumber}" -f json -d "${cmaAddress}" -r true show object details-level full uid \
        | jq -c '.object|{clusterName:.name,member:."cluster-members"[]} | {clusterName:.clusterName,memberName:.member.name,address:.member."ip-address"}')
    clusterList=$(echo "$firewallList" | jq -r ".clusterName" | sort -u)

    for clusterName in $clusterList; do
        # Match the whole clusterName field so "fw" does not also match "fw2".
        for firewallLine in $(echo "$firewallList" | grep -F "\"clusterName\":\"${clusterName}\""); do
            memberName=$(echo "${firewallLine}" | jq -r '.memberName')
            firewall=$(echo "${firewallLine}" | jq -r '.address')
            # The command(s) to run on each member go between the hash rows.
            cprid_util -verbose -server "${firewall}" rexec -rcmd sh -c '
############################################################
cphaprob -a if | sort
############################################################
' > /tmp/"$clusterName"-"$memberName".output
        done
        echo "========================================"
        echo -n "$clusterName:"
        # diff exits 0 only when both member outputs are identical.
        diffOut=$(diff /tmp/"$clusterName"-*)
        if [ $? -eq 0 ]; then
            echo " NO DIFFERENCES"
        else
            echo ""
            echo "$diffOut"
        fi
        /bin/rm /tmp/"$clusterName"-*
    done
done;echo "========================================"

 

The command you want to run on each cluster member goes between the hash rows, where I have "cphaprob -a if | sort". It can be multiple lines. Note that a lot of things on Check Point firewalls (such as clish configs) don't have stable ordering between devices, so it's generally a good idea to sort the output as in my example.

There are also a lot of expected differences between cluster members. For example, each member has its own hostname, and its own unique IP addresses on each interface.
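To illustrate filtering out expected differences before the diff, here is a minimal sketch. The function name is hypothetical, and the three patterns are simply borrowed from the 'grep -v' lines in the VSX version later in this thread. They are examples, not a complete list, and would need adjusting for whatever command you compare.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the script above): remove lines that are
# expected to differ between members, then sort what is left.
# The patterns are examples only; adjust them to the command being compared.
normalize_member_output() {
    grep -v \
        -e "set hostname" \
        -e "ipv4-address" \
        -e "password-hash" \
    | sort
}

# Example with made-up configuration lines:
printf '%s\n' \
    "set ntp active on" \
    "set hostname fw-a" \
    "set dns primary 192.0.2.53" \
    | normalize_member_output
```

Because the inner command runs on the cluster member, a function defined on the management server is not visible there. In practice the same 'grep -v ... | sort' pipeline would be pasted between the hash rows.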

19 Replies
Sorin_Gogean
Advisor

Hello @Bob_Zimmerman ,

 

Can you elaborate a bit on the "one member of one of my clusters was aware of 15 VIPs and the other member was only aware of 13" ?

Did you mean that one node had 15 interfaces while the other had 13? Otherwise I'm a bit lost 😕

 

Thank you,

Bob_Zimmerman
Authority

At the OS level, both members knew about all the same interfaces. They both had unique IPs and could ping each other on every interface. Policy successfully pushed to both. The cluster object was fine and had the right interface names, IPs, and everything for both members.

'cphaprob -a if' on one of them said it had 15 cluster interfaces. 'cphaprob -a if' on the other said it had 13 cluster interfaces. The two missing interfaces showed VIPs on one and didn't show in the list at all on the other. When member 2 was active, it didn't try to claim the VIPs on those two interfaces.

In years in the TAC, I had never seen that happen unless the cluster object's interface was only given a member interface on one member. In the end, we deleted the broken interfaces from the cluster object, published, created them again, published, then pushed. That solved the problem for that cluster, but we have a lot of other clusters. Something weird on one could potentially happen on others, so I was tasked with checking all of the clusters to find similar inconsistencies. Thus, this script.

It can potentially be used to find inconsistencies in dynamic routing configuration, proxy ARP configuration, and other such things which should normally be identical on both members.

Sorin_Gogean
Advisor

Thank you @Bob_Zimmerman for the explanation. So as I understand it, "'cphaprob -a if' on one of them said it had 15 cluster interfaces. 'cphaprob -a if' on the other said it had 13 cluster interfaces. The two missing interfaces showed VIPs on one and didn't show in the list at all on the other. When member 2 was active, it didn't try to claim the VIPs on those two interfaces" - that would be NORMAL behavior, since the 2nd member was not aware of the 2 missing network interfaces.
We haven't had that situation in our environment, hence my question.

 

Indeed, the script would make it easy to check for that on cluster members. I have had mismatches in routes and other things, so I'll give it a try and come back 🙂.

 

Thank you, 

Bob_Zimmerman
Authority

We found out about the problem when we had to run on the member which was missing the VIPs (an adjacent switch was undergoing some maintenance). Suddenly we had an outage. Took a while to track it down to the missing VIPs. Before seeing this, I would have confidently said that we didn't have this situation anywhere, because the VIPs are pushed with the policy. All of our configuration was correct, and the management told us it was applied successfully to both members, therefore all the members know about all the VIPs. I would have been wrong.

At least it was a failure which showed up in the diagnostic commands like 'cphaprob -a if'.

 

Interestingly enough, this also led me to one of the most unhelpful error messages I've ever seen:

Note: For more information on bond interfaces, use the command:
      cphaprob show_bond [<bond_name>]

It doesn't have any indication of why you might want more information on your bond interfaces. Zero indication that anything is wrong at all. I opened a ticket with the TAC asking why this showed up on some of my clusters but not others. Turns out that only shows up when at least one bond is unhealthy (typically when it has a down member interface).

JozkoMrkvicka
Mentor

The main question is why the member with the missing VIPs was not reporting an "interface problem" and going to the Down state. The number of interfaces (and VIPs) must match on both members. If it doesn't, one member (or both) must be aware of the issue and cannot be in an Active-Standby state.

What were the version and Jumbo on both members?

Anyway, nice script 🙂 I do the same also for normal cluster config diffs, but using old style method - Excel with EXACT function 😄 

Kind regards,
Jozko Mrkvicka
Bob_Zimmerman
Authority

Cluster members can only report a problem on interfaces they're aware of. It wasn't a situation where the member knew about 15 interfaces and only 13 had VIPs. The clustering mechanism only knew about 13 interfaces at all. As far as it was concerned, everything it knew about was perfectly healthy.

Pretty sure the management was R81.10 jumbo 66 when the interfaces were added, and pretty sure the firewalls were R81.10 jumbo 66. It was an easy fix once we noticed the problem, but there was absolutely no indication there was a problem to notice outside of the 'cphaprob -a if' output until we were running on the member with the missing VIPs.

@Alex-: We didn't actually push when the interface was deleted. We just deleted it and added it again to force the management to recompute what should go where, as you mentioned. Once it was back, we pushed. The active member with the extra VIPs kept them, and there was zero traffic disruption.

Alex-
Advisor

Now that you mention it, this environment (SMS + GWs) was also R81.10 T66 and still is, as I have yet to secure a maintenance window to upgrade. Correct, the standby member was always fine with 13 VIPs until that recompute was done.

Didn't lose any traffic either, as the Active member always had that particular VIP active.

Alex-
Advisor

Great script. I had this issue once on appliances with R81.10, 15 on Active, 14 on Standby, happily ACTIVE/STANDBY without any issues in cphaprob list. The correct interface settings were verified on both the appliances and topology.

It's a production network, so this subnet couldn't be erased and recreated without impact. I believe it was solved by modifying something on the standby definition in the topology (like the subnet mask) to force a recompute, installing the policy, then putting back the correct parameter. It has been OK since.

Bob_Zimmerman
Authority

And now I have a version for VSX:

 

#!/usr/bin/env bash
######################################################################
### For VSX
######################################################################
portNumber=443
unset cmaList
# mdsenv is a shell function defined by CP.sh, so source it first (needed under cron).
. /etc/profile.d/CP.sh

# List every CMA hosted on this MDS as compact JSON rows: ["<name>","<address>"]
cmaList=$(mgmt_cli --port "${portNumber}" -f json -r true show domains limit 500 details-level full \
    | jq -c '.objects[]|{name:.name,server:.servers[]|{host:."multi-domain-server",ipAddress:."ipv4-address"}}' \
    | grep -F "$(hostname)" \
    | jq -c '[.name,.server.ipAddress]')
# Not an MDS: fake a single CMA that represents the SmartCenter.
if [ ${#cmaList} -eq 0 ]; then cmaList="[\"$(hostname)\",\"\"]"; fi

for cmaRow in $cmaList; do
    cmaName=$(echo "${cmaRow}" | jq -r '.[0]')
    cmaAddress=$(echo "${cmaRow}" | jq -r '.[1]')
    mdsenv "${cmaAddress}" 2>/dev/null

    # Collect every VSX cluster in this CMA, one JSON row per member UID.
    firewallListUuids=$(mgmt_cli --port "${portNumber}" -f json -d "${cmaAddress}" -r true show gateways-and-servers limit 500 details-level full \
        | jq -c '.objects[]|.' \
        | grep CpmiVsxClusterNetobj \
        | jq -c '.uid' \
        | xargs -L 1 mgmt_cli --port "${portNumber}" -f json -d "${cmaAddress}" -r true show generic-object uid \
        | jq -c '{clusterName:.name,member:."clusterMembers"[]}')

    # Build a sed script that replaces each member UID with its name and address.
    echo "" > sedScript
    for line in $(echo $firewallListUuids | tr ' ' '\n'); do
        memberUuid=$(echo $line | jq .member)
        member=$(echo "$memberUuid" | xargs mgmt_cli --port "${portNumber}" -f json -d "${cmaAddress}" -r true show object details-level full uid | jq -c '.object|{name:.name,address:."ipv4-address"}')
        echo "s#${memberUuid}#${member}#" >> sedScript
    done
    firewallList=$(echo $firewallListUuids | sed -f sedScript | jq -c '{clusterName:.clusterName,memberName:.member.name,address:.member.address}')
    clusterList=$(echo "$firewallList" | jq -r ".clusterName" | sort -u)

    for clusterName in $clusterList; do
        # Match the whole clusterName field so "fw" does not also match "fw2".
        for firewallLine in $(echo "$firewallList" | grep -F "\"clusterName\":\"${clusterName}\""); do
            memberName=$(echo "${firewallLine}" | jq -r '.memberName')
            firewall=$(echo "${firewallLine}" | jq -r '.address')
            # The inner script writes to a file on the member instead of stdout,
            # because cprid_util can only return a limited amount of output.
            cprid_util -verbose -server "${firewall}" rexec -rcmd sh -c '
########################################
echo "" > /tmp/vsxDiff.output
# Find every VSID (network namespaces or VRFs, depending on the version).
vsids=$(ip netns list 2>/dev/null | cut -d" " -f3 | cut -d")" -f1 | sort -n;ls /proc/vrf/ 2>/dev/null | sort -n)
for vsid in $vsids;do
echo -n "set virtual-system " > /tmp/script.clish
echo $vsid >> /tmp/script.clish
echo "show configuration" >> /tmp/script.clish
# Dump the configuration, drop lines expected to differ, and sort the rest.
clish -if /tmp/script.clish \
| sed -E "s/^Processing .+?\r//g" \
| grep -v "ipv4-address" \
| grep -v "set hostname" \
| grep -v "password-hash" \
| grep -v " Configuration of " \
| grep -v " Exported by admin on " \
| sort \
>> /tmp/vsxDiff.output
echo "-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-" \
>> /tmp/vsxDiff.output
done
########################################
'
            # Pull the result file from the member to the management server.
            cprid_util -server "${firewall}" getfile -remote_file /tmp/vsxDiff.output -local_file /tmp/"$clusterName"-"$memberName".output
        done
        echo "========================================"
        echo -n "$clusterName:"
        # diff exits 0 only when both member outputs are identical.
        diffOut=$(diff /tmp/"$clusterName"-*)
        if [ $? -eq 0 ]; then
            echo " NO DIFFERENCES"
        else
            echo ""
            echo "$diffOut"
        fi
        /bin/rm /tmp/"$clusterName"-*
    done
done;echo "========================================"

 

The inner script to be run on the members is very different. cprid_util can only get a certain amount of output back to the management server, and VSX boxes tend to have a lot of configuration which can overrun this buffer. To work around this, I instead have the inner script write to a file on the VSX cluster member. After it's done, the management server uses cprid_util to pull that file to /tmp/<cluster name>-<member name>.output, then it diffs /tmp/<cluster name>*.

This particular inner script iterates through all the VSIDs on the system and shows their configuration sorted. I use a set of 'grep -v' statements to remove the stuff which is expected to differ between the members such as the hostname, password hashes, and so on. It puts a little banner row between the output from each VS to make them easier to read if the diff alone doesn't show you enough.

This still only supports clusters with two members.

The script does not clean up the output file from the VSX members, though it does clean up the management-side files used for the diff.
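On the two-member limit: the single 'diff /tmp/<cluster name>-*' call only works when the glob expands to exactly two files. One possible way around that, sketched below with a hypothetical function name, is to compare every member's file against the first one. This is an untested idea for the management-side step only and has not been run against a real three-member cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare every member output file against the first one.
# Prints a diff for each file that differs; returns 0 only if all files match.
compare_member_outputs() {
    local reference="$1"
    shift
    local status=0
    local other
    local diffOut
    for other in "$@"; do
        if ! diffOut=$(diff "$reference" "$other"); then
            echo "--- ${reference} vs ${other}"
            echo "$diffOut"
            status=1
        fi
    done
    return "$status"
}
```

In either script, the call would take the place of the single diff, for example: compare_member_outputs /tmp/"$clusterName"-*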

JozkoMrkvicka
Mentor

If you can combine both VSX and non-VSX, and maybe also automatically detect which one is to be checked, I can imagine this being a perfect SmartConsole Extension 😉

In addition to that, I am wondering why such a thing is not built in by Check Point. Is it so hard to sync config between all cluster members automatically? Let this feature be optional for every admin, disabled by default. If the admin wants it, he/she just enables config sync, and all config is synced across all members (even if config is added/modified/deleted from any cluster member, not just from the active member).

Some competitors already have such a feature and it is great. You are always sure the config is 100% identical on all members.

Kind regards,
Jozko Mrkvicka
Chris_Atkinson
Employee

GAiA cloning groups exist today but there are some limitations.

CCSM R77/R80/ELITE
JozkoMrkvicka
Mentor

Using a non-personalized account such as "cadmin" is a no-go from a security/audit point of view. Every configuration change has to be done with a real user account.

Kind regards,
Jozko Mrkvicka
Bob_Zimmerman
Authority

It wouldn't be exceptionally hard to make one consistent script. Normal cluster objects and VSX cluster objects have different types. One API call to dump all of both types of cluster and sort them. Iterate through the list. Once you've picked a cluster, dump the members of the cluster using the appropriate method, and run the appropriate cprid_util command. Move on to the next cluster.

The headache with this would be the different commands for VSX and non-VSX clusters. I guess one command could be written to detect on-member whether it's VSX or not. That would get ugly fast, though.
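As a rough sketch of the management-side dispatch described above: the two type strings are the ones the existing scripts already grep for, while the function and handler names are hypothetical placeholders for "the non-VSX logic" and "the VSX logic".

```shell
#!/usr/bin/env bash
# Map a cluster object type to the handler that should process it.
# The two type strings are the ones the scripts in this thread grep for;
# the handler names are hypothetical placeholders.
cluster_handler_for_type() {
    case "$1" in
        CpmiGatewayCluster)   echo "check_standard_cluster" ;;
        CpmiVsxClusterNetobj) echo "check_vsx_cluster" ;;
        *)                    return 1 ;;
    esac
}

# Example: decide what to do for a few object types
# ("SomeOtherType" is a made-up value standing in for anything else).
for objectType in CpmiGatewayCluster CpmiVsxClusterNetobj SomeOtherType; do
    if handler=$(cluster_handler_for_type "$objectType"); then
        echo "${objectType} -> ${handler}"
    else
        echo "${objectType} -> skipped (not a cluster type handled here)"
    fi
done
```

This only covers choosing the code path on the management side. It does not address the harder part mentioned above, which is the different on-member commands.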

I need to take a break from thinking about this particular chunk of code, but might come back to it in a few weeks.

JozkoMrkvicka
Mentor

Additional crazy idea - a SmartConsole Extension where, for a selected object (cluster / VSX cluster / VS object), you can check and (if confirmed) correct the discrepancies in configuration.

Extended crazy idea - every XY hours/days, check all configs between all configured clusters/VSs and report the findings via mail, if there are any.

I wish the day had more than 24 hours so I had some spare time to work on the ideas stuck in my head 😄

Kind regards,
Jozko Mrkvicka
Bob_Zimmerman
Authority

These scripts were built to be in files on the management and to run via cron. I have a parent script with a variable for my mail relay and another for the destination email addresses. Each time it's run, it runs the non-VSX script, then the VSX script, then sends the results. I don't yet schedule it every day because I have a lot of differences to deal with. Once I've cut that down a bit, I do plan to have it email me (and my team) the detected differences.
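For anyone wanting a starting point, the parent script is roughly shaped like the sketch below. Everything in it is a placeholder: the variable values, the function name, and the script paths are assumptions, and the mail step is deliberately left as a comment because the right command depends on what is available in your environment.

```shell
#!/usr/bin/env bash
# Hypothetical parent script. All names, paths, and addresses are placeholders.
mailRelay="192.0.2.25"
mailTo="fw-team@example.com"

# Run each script given as an argument and collect all output in one report.
build_report() {
    local script
    for script in "$@"; do
        echo "##### ${script}"
        bash "$script" 2>&1
        echo ""
    done
}

# Example wiring (commented out because the paths and the mail command
# depend on your environment):
# report=$(build_report /path/to/cluster-diff.sh /path/to/vsx-cluster-diff.sh)
# echo "$report" | <your mail command, using "$mailRelay" and "$mailTo">
```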

I actually started on the VSX version first, because I kept running into situations where an interface was added to OSPF on one VS on one member, but not on the other member. It's a HUGE PAIN to get clish output from VSs other than 0 now because clish is so, so incredibly bad. Hit limits there, so I got it working for non-VSX clusters first.

JozkoMrkvicka
Mentor

Did you try testing some of the similar (older) scripts already available, to see whether they also stopped working with clish on the latest versions?

vsxexport.sh - Export

VS_Conf_Collector.sh from sk180485

On the other hand, I want to always avoid checking clish commands from the script. Everything should be stored in some files, like routedX.conf or from Gaia OS database files in /config/db/ folder while using dbget command.

Kind regards,
Jozko Mrkvicka
Bob_Zimmerman
Authority

Provider-1 is much weirder internally than I remembered. Specifically, 'mdsenv' isn't a binary, a shell script, an alias, or any of the normal things I would expect. It is instead a function injected into the BASH environment by /etc/profile.d/CP.sh. If you want to run these via cron, for example, you need to include the line '. /etc/profile.d/CP.sh' early. I've edited both to include it as the third functional line.
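A small guard along these lines can make the cron failure mode obvious. This is a sketch: the helper name is hypothetical, and it relies on the English wording of the shell's 'type' output, which is an assumption about the locale.

```shell
#!/usr/bin/env bash
# Load the Check Point environment if it is present (it is on a management server).
if [ -r /etc/profile.d/CP.sh ]; then
    . /etc/profile.d/CP.sh
fi

# Hypothetical helper: succeed only if the given name is a shell function.
# It inspects the text printed by `type`, which says "is a function" in bash.
is_shell_function() {
    case "$(type "$1" 2>/dev/null)" in
        *"is a function"*|*"is a shell function"*) return 0 ;;
        *) return 1 ;;
    esac
}

if is_shell_function mdsenv; then
    echo "mdsenv is available"
else
    echo "mdsenv is not defined; was /etc/profile.d/CP.sh sourced?" >&2
fi
```

Run from an interactive login shell on the management server, this should report that mdsenv is available. Run from cron without the source line, it should print the warning instead.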

PhoneBoy
Admin

The fact you're still using that name means you've been doing this a while 🙂

the_rock
Legend

I vote for CP to "adopt" that name...Provider-1, that's pretty much what I always called it too 🙂
