Hello Checkmates,
Recently we started the process of migrating to Maestro in our DCs.
For that we decided to go with 2 x MHO-140s and either 3 x SG9300 or 3 x SG9400, depending on the size of the DC.
Racked everything in February, and since then we're battling with some weird things.
Starting with the connections and setup: we have a single site with 2 MHOs, and each SG appliance is connected to both.
Ports 1 and 2 of each MHO are connected to our ACI for SG management (VLAN 168), and the uplinks are on ports 5, 7, 8, 11 and 12. For the SGs (downlinks) we use ports 25 (changed to downlink!), 27 and 29.
Now, for the installation process, we re-imaged the appliances with R82 T777 (back in February) and configured the Management, LOM and other standard settings.
We built an SG (VSNext) with only one appliance, installed JHF Take 60 (again, back in February) and added it to Management, a brand-new VM with R82.
We added a 2nd node to the SG, and after everything synchronized we created a VS. We added that new VS to Management and all was fine.
When we wanted to push an updated policy to the newly created SG (VS0), we got several failures due to different SIC or communication issues (as per the examples below). And the FUN begins 🙂 .
From the investigation we started with the Support engineer and our Professional Services guys, we noticed that, for whatever reason, when we run SIC verification from Management, 2 or 3 times out of 5 we get different SIC errors like:
But when we check on the SG directly, we can see that SIC is OK.
That error shows only when we have 2 or more nodes in the Security Group, either with VSNext or without.
We also applied the certificate hotfix for JHF Take 60 with no change, and we also tried JHF Take 73 without success; the same SIC or policy push errors were seen.
We've verified that the SIC certificate is valid and that Management knows it, and we re-did SIC at least 10 times with no change. As long as we have a 2nd member in the Security Group, the SIC error starts to show while validating or pushing policies.
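For reference, a minimal sketch of the checks we mean (standard Check Point CLI; adjust names for your environment):

# On the SGM, in expert mode: show the gateway-side SIC trust state
cp_conf sic state
# On the Management Server: list the valid SIC certificates the internal CA knows about
cpca_client lscert -stat Valid -kind SIC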
Last Saturday we re-imaged all the appliances in the Singapore DC to R82 T779 and applied JHF Take 91; same behavior.
While waiting for the Support engineer to check with the BU what can be wrong, I want to ask whether any of you who work with Maestro have seen this behavior, and if so, how you handled it.
Thank you,
PS: in order to exclude the dedicated Management interface as a possible cause, we shut down 3 ports from the MHO side, so right now we are working with a single interface for the management of the SGs. The same SIC errors are seen...
@Lari_Luoma , @Anatoly Any comments?
How is your magg bond configured, and does it match what's on the switches? Are the switches doing any sort of MAC learning from outbound packets? If so this would need to be disabled.
Either way, whether we go with a normal Security Group or with VSNext, the management interface is configured active/backup.
On VSNext it's a MAGG underneath, while on VS0 it's a WRP interface.
On ACI side we have all ports set as access, no MAC filtering.
[Global] ALVS-SGFW022-s01-01:0> show interface magg1
Statistics:
SD-WAN: Not Configured
1_02:
Statistics:
SD-WAN: Not Configured
[Global] ALVS-SGFW022-s01-01:0>

[Global] ALVS-SGFW022-s01-01:0> show interfaces all
Statistics:
SD-WAN: Not Configured
Leading to Virtual Switch: mgmt-switch (ID 500)
Thank you,
Good day!
Do you see any regular packet drop when you ping from the Management Server to the SG management IP? Could it be a general connectivity issue? I see 6.5% RX errors on the magg interfaces, which doesn't look good.
If we see significant packet loss from the Management Server to the SG, then it is expected that SIC verification and policy push fail. We need to narrow down the problem area from the bottom up.
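A minimal sketch of that bottom-up check, using the IP and interface names from this thread:

# From the Management Server: look for loss toward the SG management IP
ping -c 100 10.18.169.41
# On each SGM, in expert mode: check the error counters on the magg bond
ifconfig magg1 | grep -i error
netstat -i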
hello @Gennady ,
No, we do not have packet loss between Management and the SGs.
And even if we had packet loss, how could we explain that for a VS created on top of the Security Group we get 5 out of 5 successful SIC validations and no errors when pushing the policy?
Thank you,
In addition,
Did you have a chance to capture traffic on the SG (on all SGMs at the same time) at the moment when you verify SIC from the Management Server?
The fact that the problem appears only when you add more than 1 SGM to the SG points to some distribution problem. It should not affect the magg interface unless packets from the Management Server in fact arrive on some data interface, or the SMO role is flapping (which is very unlikely).
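A sketch of such a capture (SIC runs over TCP port 18191; g_tcpdump, where available on Scalable Platforms, runs the same capture on every SGM at once):

# capture SIC traffic on all SGMs simultaneously
g_tcpdump -nni magg1 host 10.18.169.41 and tcp port 18191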
Yes, we did packet captures on Management and on both SG members and shared them with Support.
Indeed it points to a distribution problem, but the distribution setup is for the data path and not for management, per my understanding.
The SMO role was not flapping, as we checked, and it's always the first member that we used to build the SG.
Still, while doing tcpdumps on both members as we were checking SIC, we saw that almost every time SIC failed, the other member was showing traffic at some point. Why it is shifting, I can't say.
Thank you,
MAC learning would explain the shifting. If you do the tcpdumps with the MACs shown (-e) do you see the inbound MACs change when SIC stops working and you see traffic on the other SGM?
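For example, a sketch with the interface and IP from this thread:

# -e prints the link-layer header, so the source MAC of every inbound packet is visible
tcpdump -nnei magg1 host 10.18.169.41 and tcp port 18191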
We'll run the captures again, watch the MACs, and come back.
Still, I repeat: if we are using a single port between the MHOs and ACI for management, then we have a single inbound MAC, so?
Thank you,
No, it's a MAC per SGM, not per MHO/port. If it works stably with one SGM, then it's not MAC flapping. I'm not sure whether on VSNext each VS also has its own MAC address there, but that might need checking too.
I have the same understanding, and in order to exclude VSNext, we wiped it and created a standard Security Group as well.
And with that we see the same SIC behavior.
Ty,
Not MAC filtering, MAC learning. If the ACI is learning MAC addresses from outbound packets, it's going to be constantly changing the MAC table for the magg interfaces, and packets are going to get lost, leading to SIC failures. On the SG management port, each SGM uses its own MAC address, so MAC learning can break things.
As far as I know, we don't do any MAC learning/filtering on the ACI side.
But if we did, we should see alerts on the ACI side for MAC flapping on the leaf ports where we have the management connected, those first 2 ports from each MHO.
And since we have only one port active right now, there is no MAC flapping, unless the Management IP (10.18.169.41 in the Singapore case) jumps between the 2 members for whatever reason...
I'll double-check and come back.
Thank you,
As promised, we checked the ACI side for MAC flapping and we don't have any alerts for the ports that we use for SG Management.
I do have Maestro configured. Unfortunately, it is on R81.20 and it is connected to regular Cisco Nexus. That is why I cannot be more helpful for your problem investigation.
However, please, take a look at this SK. It may give you a lead.
sk168181 - Communication problems with ClusterXL clusters connected to Cisco ACI
From a general Maestro standpoint: magg interfaces are excluded from distribution, and this should not be a problem unless management traffic is somehow sent/received via a data interface. Usually that happens because of some routing mistake. I am sure that it was already checked.
If you would like to look into SMO state in more details, then you can try this zdebug command:
fw ctl zdebug -T -d 'SMO,smo' -m cluster + conf
It is very lightweight and shows changes in the SMO role.
"-d" puts a filter in the kernel to match the string "SMO" or the string "smo"; otherwise the cluster module with the conf flag returns too much data.
ty @Gennady , I will check the SMO state and come back.
As for the ACI and the SK you provided, that makes sense, but we should see alerts on the ACI side if an IP changes its MAC.
As you can see, we have the same MAC for the SG IP:
ty,
PS: I ran the zdebug on both nodes while checking SIC and getting failures, and I did not get any messages, so during the SIC check the SMO was not changing, I guess...
If there are no messages in zdebug during the problem replication, then we can rule out an SMO flap.
An example of SMO role change is below:
Just to clarify for myself, what would cause an SMO change,
other than the shutdown/reload of the active one?
And can I trigger it via the CLI, just to confirm?
Thank you,
Good day!
The easy way to trigger SMO role change is by brief shutdown of downlink port from MHO side.
Use "orch_stat" to find logical port number for downlink interface which leads to SMO. Then you can use tor_util on both MHO at the same time (you can try just one MHO at first for the test):
"tor_util set_port_admin_state <logical_port_number> down && sleep 1 && tor_util set_port_admin_state <logical_port_number> up"
Example:
"tor_util set_port_admin_state 65 down && sleep 1 && tor_util set_port_admin_state 65 up"
The command above will disable the downlink port for 1 second and then bring it back up. This results in the SMO role changing from 1_01 to 1_02 for ~35 seconds, and then it goes back from 1_02 to 1_01. You will see both events in zdebug. The zdebug command should be executed on 1_02, because you will break SSH to 1_01.
Be careful if the test is in production, because an SMO role flap disrupts all SMO local connections:
Hi,
I am not an ACI expert and not sure how many ACI features you are using, but I know ACI is somewhat different from a traditional switched network. Is there a possibility to connect the MAGG to a traditional (non-ACI) switch to see what happens?
Maybe you can rule out ACI.
I don't think it is relevant here, but did you configure a Primary Interface in the MAGG bond? I have seen strange things when creating an Active/Backup bond without a Primary Interface.
Martijn
hello @Martijn ,
We don't have anything fancy on this VLAN (VLAN 168, where we have the SG management assigned).
As for the MAGG creation, I did not do anything in the CLI (per the documentation you set the interface preference in the CLI); we did everything in Gaia.
I just set the primary interface on magg1 to eth1-Mgmt1, and the SIC error is still there 😫.
[Global] ALVS-SGFW022-s01-01:0> show bonding groups
1_02:
[Global] ALVS-SGFW022-s01-01:0>
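For the record, the bond state can also be checked from expert mode; a sketch, assuming the magg bond is exposed under /proc/net/bonding like a regular Linux bond:

# show which slave is configured as primary and which is currently active
grep -iE 'primary|currently active' /proc/net/bonding/magg1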
Thank you,
Please have a check through this SK and see if it helps: https://support.checkpoint.com/results/sk/sk168181
Hello mates,
Following the SK that I was pointed to, sk168181 (thank you @Gennady and @emmap ), it states that what we're facing is due to "Dataplane Endpoint Learning".
Symptoms
Cause
Cisco ACI has several proprietary features which cause problems with Check Point clusters.
Check Point Clustering
Most switches learn information about hosts connected to the network by listening to ARP requests and replies. Check Point clustering relies on this behavior to ensure traffic is always sent to the Active cluster member by using GARP.
Key points about ClusterXL Clusters:
Key points about Chassis and Maestro Security Groups:
Cisco Endpoint Learning Features
Cisco ACI does not behave like most switches, as explained above.
Solution
There are several possible solutions. Note: Each of the following is an independent option. They are not all required.
So, we checked our DC ACI and could confirm what the SK states.
Based on the information we're seeing in the Faults/Events, I was able to confirm that the management issue could be caused by "Dataplane Endpoint Learning",
as it shows that we indeed had 2 MACs pointing to the same IP address, 10.4.169.181: 0012.C110.00B5 and 0012.C120.00B5.
To confirm this, we also tested by temporarily moving the management ports outside ACI, and as soon as the DC guys helped us move the connections, we could validate SIC 5-7 times in a row. Previously we could not get 3 out of 5 validations.
As we can't keep the management outside of ACI, we discussed with the colleagues responsible for the DC ACI disabling "IP Data-plane Learning" for the specific IPs, to confirm whether that fixes our problem. Just a few minutes after the specific IPs were set to not do "IP Data-plane Learning", we could validate the SIC communication the same way we did while outside ACI.
In conclusion, if you have ACI implemented in your data centers, or wherever you install Maestro, make sure that the Security Group management is outside ACI if that is possible; if you can't, then add the management IPs specifically and disable "IP Data-plane Learning"!
The confusing part was that we would expect to see alerts on ACI, as with any other MAC or IP flapping, but that was not the case. On a Maestro Security Group, it seems that even though one SG appliance is selected as "primary" for management, traffic shows up with the MAC of each node in certain cases.
[Expert@ALVA-SGFW01-s01-01:0]# ifconfig | grep 00:12
wrp0      Link encap:Ethernet  HWaddr 00:12:C1:10:00:B4
[Expert@ALVA-SGFW01-s01-01:0]#

[Expert@ALVA-SGFW01-s01-02:0]# ifconfig | grep 00:12
wrp0      Link encap:Ethernet  HWaddr 00:12:C1:20:00:B4
[Expert@ALVA-SGFW01-s01-02:0]#

[Expert@ALVA-SGFW01-s01-03:0]# ifconfig | grep 00:12
wrp0      Link encap:Ethernet  HWaddr 00:12:C1:30:00:B4
[Expert@ALVA-SGFW01-s01-03:0]#

[Expert@ALVA-SGFW01-s01-03:0]# asg stat -i tasks
[Expert@ALVA-SGFW01-s01-03:0]#
Thank you everyone for the support; I learned something new in the last months.
Now we're moving forward with planning the migration from the 2 x 15K clusters to vFWs on Maestro.
Thank you and have a great week,